[Deprecated][SQL] Add support for collation for StringTrim type of functions/expressions #45749
davidm-db wants to merge 8 commits into
Also, please add tests to |
@mihailom-db I'm adding tests to |
@MaxGekk, @cloud-fan could someone please take a look at this PR? Tagging the rest of the Belgrade collation crew in case anyone else would like to review as well: @dbatomic, @nikolamand-db, @stefankandic, @uros-db
    return collatedTrimLeft(trimString, collationId);
  }

  private UTF8String lowercaseTrimLeft(UTF8String trimString) {
Instead of implementing lowercaseTrimLeft and collatedTrimLeft separately (these functions look very similar to me), I think we could make use of new StringSearch(pattern, target) (for UTF8_BINARY_LCASE, call .toLowerCase() on both params and pass no collationId).
For more context, please take a look at: https://github.com/apache/spark/pull/45704/files#r1538624688
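A minimal sketch of the deduplication idea, under stated assumptions: it works on plain `java.lang.String` rather than Spark's `UTF8String`, and it swaps in a per-code-point folding function instead of ICU's `StringSearch`, so it only illustrates how one trim routine can serve both the binary and the lowercase variants. The class name `TrimLeftSketch` and the `fold` parameter are hypothetical and are not part of this PR.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical illustration only; not Spark code.
final class TrimLeftSketch {

  // Drops leading code points of src whose folded form equals the folded
  // form of any code point in trimChars.
  public static String trimLeft(String src, String trimChars, IntUnaryOperator fold) {
    int[] trimSet = trimChars.codePoints().map(fold).toArray();
    int idx = 0;
    while (idx < src.length()) {
      int cp = src.codePointAt(idx);
      if (!contains(trimSet, fold.applyAsInt(cp))) {
        break;
      }
      idx += Character.charCount(cp);
    }
    return src.substring(idx);
  }

  private static boolean contains(int[] set, int cp) {
    for (int candidate : set) {
      if (candidate == cp) {
        return true;
      }
    }
    return false;
  }

  // Binary variant: code points are compared as-is.
  public static String binaryTrimLeft(String src, String trimChars) {
    return trimLeft(src, trimChars, IntUnaryOperator.identity());
  }

  // Lowercase variant: code points are compared after simple lowercasing.
  public static String lowercaseTrimLeft(String src, String trimChars) {
    return trimLeft(src, trimChars, Character::toLowerCase);
  }
}
```

Note that `Character.toLowerCase(int)` applies simple one-to-one case mapping, so it can differ from `String.toLowerCase()` for characters with special casing rules; a real implementation would have to decide which behavior the collation requires.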
if (evals.length == 1) {
  ev.copy(code = code"""
    |${srcString.code}
    |boolean ${ev.isNull} = false;
    |UTF8String ${ev.value} = null;
    |if (${srcString.isNull}) {
    |  ${ev.isNull} = true;
    |} else {
    |  ${ev.value} = ${srcString.value}.$trimMethod($collationId);
    |}""".stripMargin)
} else {
  val trimString = evals(1)
  ev.copy(code = code"""
    |${srcString.code}
    |boolean ${ev.isNull} = false;
    |UTF8String ${ev.value} = null;
    |if (${srcString.isNull}) {
    |  ${ev.isNull} = true;
    |} else {
    |  ${trimString.code}
    |  if (${trimString.isNull}) {
    |    ${ev.isNull} = true;
    |  } else {
    |    ${ev.value} =
    |      ${srcString.value}.$trimMethod(${trimString.value}, $collationId);
    |  }
    |}""".stripMargin)
}
I think this code is very similar to the one above; consider rewriting it a bit more neatly in a way that incorporates $collationId.
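One way the two branches could be folded together is sketched below. It is written in plain Java with string concatenation rather than Scala's `code"..."` interpolator, purely to show the shape of the refactoring: the shared prologue is emitted once, and only the argument list and the optional null check on the trim string differ. The class `TrimCodegenSketch`, the `Operand` holder (a stand-in for the `code`, `isNull` and `value` parts of an evaluated expression) and the `genTrim` signature are all hypothetical.

```java
// Hypothetical illustration only; not Spark code.
final class TrimCodegenSketch {

  // Stand-in for the three pieces of an evaluated child expression.
  public static final class Operand {
    final String code;
    final String isNull;
    final String value;

    public Operand(String code, String isNull, String value) {
      this.code = code;
      this.isNull = isNull;
      this.value = value;
    }
  }

  // Builds the generated snippet; trimOrNull == null selects the one-argument form.
  public static String genTrim(Operand src, Operand trimOrNull, String resultIsNull,
      String resultValue, String trimMethod, int collationId) {
    // Only the argument list differs between the two call shapes.
    String args = trimOrNull == null
        ? String.valueOf(collationId)
        : trimOrNull.value + ", " + collationId;
    String call = resultValue + " = " + src.value + "." + trimMethod + "(" + args + ");";

    // The two-argument form additionally evaluates and null-checks the trim string.
    String body = trimOrNull == null
        ? "  " + call + "\n"
        : "  " + trimOrNull.code + "\n"
            + "  if (" + trimOrNull.isNull + ") {\n"
            + "    " + resultIsNull + " = true;\n"
            + "  } else {\n"
            + "    " + call + "\n"
            + "  }\n";

    // Shared prologue and source null check, emitted once for both forms.
    return src.code + "\n"
        + "boolean " + resultIsNull + " = false;\n"
        + "UTF8String " + resultValue + " = null;\n"
        + "if (" + src.isNull + ") {\n"
        + "  " + resultIsNull + " = true;\n"
        + "} else {\n"
        + body
        + "}";
  }
}
```

In the Scala source the same idea would presumably be expressed by computing the differing fragment first and interpolating it into a single `code"""..."""` template.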
      trimByteIdx -= stringCharLen[numChars - 1];
      numChars--;
    }
    else {
nit: in JVM languages, else is almost always on the same line as the closing brace of its if
    return copyUTF8String(trimByteIdx, numBytes - 1);
  }

  private UTF8String collatedTrimLeft(UTF8String trimString, int collationId) {
@uros-db I see a number of PRs from different people adding similar functionality, so I'm wondering whether you could start a discussion to standardize the names of these kinds of methods, because I've seen multiple variants.
my vote would be for collationAware...
collationAwareFunctionName sounds good to me, we can discuss consolidating the naming for all functions
Heads up: we’ve done some major code restructuring in #45978, so please sync these changes before moving on. @davidm-db, you’ll likely need to rewrite the code in this PR, so please follow the guidelines outlined in https://issues.apache.org/jira/browse/SPARK-47410.
Please add these expressions to CollationTypeCasts rules.
…unctions/expressions (for UTF8_BINARY & LCASE)

Recreating [original PR](#45749) because code has been reorganized in [this PR](#45978).

### What changes were proposed in this pull request?
This PR is created to add support for collations to StringTrim family of functions/expressions, specifically:
- `StringTrim`
- `StringTrimBoth`
- `StringTrimLeft`
- `StringTrimRight`

Changes:
- `CollationSupport.java`
  - Add new `StringTrim`, `StringTrimLeft` and `StringTrimRight` classes with corresponding logic.
  - `CollationAwareUTF8String` - add new `trim`, `trimLeft` and `trimRight` methods that actually implement trim logic.
- `UTF8String.java` - expose some of the methods publicly.
- `stringExpressions.scala`
  - Change input types.
  - Change eval and code gen logic.
- `CollationTypeCasts.scala` - add `StringTrim*` expressions to `CollationTypeCasts` rules.

### Why are the changes needed?
We are incrementally adding collation support to a built-in string functions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes:
- User should now be able to use non-default collations in string trim functions.

### How was this patch tested?
Already existing tests + new unit/e2e tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46206 from davidm-db/string-trim-functions.

Authored-by: David Milicevic <david.milicevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Left to do/discussion topics (to be removed by the PR completion)
utf8_binary_lcase) collation type seems a bit grueling in the long run. This PR handles it the same way as in every other place, since I think it is out of the scope of this PR, but should we maybe create a work item to generalize this collation type better (if we don't have it already)?
What changes were proposed in this pull request?
This PR is created to add support for collations to the StringTrim family of functions/expressions, specifically:
- StringTrim
- StringTrimBoth
- StringTrimLeft
- StringTrimRight
Changes:
- stringExpressions.scala
- UTF8String-doEval and doGenCode.
- UTF8String.java
- trimLeft and trimRight).
Other minor changes in this PR:
- CollationFactory.getStringSearch() to a more descriptive ones.
Why are the changes needed?
We are incrementally adding collation support to built-in string functions in Spark.
Does this PR introduce any user-facing change?
Yes:
- DATATYPE_MISMATCH.COLLATION_MISMATCH exception if collations of two function parameters do not match.
How was this patch tested?
Already existing tests + new unit/e2e tests.
Was this patch authored or co-authored using generative AI tooling?
No.