[SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE)#46206
davidm-db wants to merge 20 commits into
I know that we're trying to push all collation-aware changes away from UTF8String and into CollationSupport.CollationAwareUTF8String,
but we should be careful about changing access modifiers here. Adding @cloud-fan to take a look and advise on the preferred approach:
- modify access in UTF8String in order to allow collation-aware implementation in CollationSupport
- fall back to putting everything in UTF8String, instead of using CollationSupport.CollationAwareUTF8String
Well, other methods like numBytes and getBytes are already exposed publicly. Those seem just as "important" and "risky" as the methods I changed. The methods I changed also don't modify anything: they are either helpers or they access information that is already exposed through numBytes or getBytes.
That was my reasoning, but let's wait for @cloud-fan to provide his thoughts.
I agree, these don't appear to be too dangerous, and I also think Milan needed some of these particular methods in his PRs as well (#45704), so it looks like we do need these to be public
uros-db left a comment
great improvements across the board, even though there is a considerable amount of code in this PR, now everything is clearly separated, and more optimized too
we will however need a small fix for the ICU implementation, but otherwise looks great
uros-db left a comment
based on our offline discussion, ICU implementation proved to be a bit more complicated than expected, with lots of hard-to-catch edge cases
these are already pretty big changes from your side, so please remove the ICU implementation from this PR and we will update the ticket accordingly (only binary & lowercase support in this effort)
thanks David!
uros-db left a comment
lgtm, thanks David for pushing this through!
@cloud-fan ready for review, also note: we separated out CollationAwareUTF8String.java in this PR
uros-db left a comment
let's resolve conflicts and revive this
@cloud-fan some tests are stale here, but overall I think the changes in this PR are good
uros-db left a comment
I think we're ready to merge this
thanks, merging to master!
Recreating original PR because code has been reorganized in this PR.
What changes were proposed in this pull request?
This PR adds collation support to the StringTrim family of functions/expressions, specifically:
- StringTrim
- StringTrimBoth
- StringTrimLeft
- StringTrimRight
Changes:
- CollationSupport.java - StringTrim, StringTrimLeft and StringTrimRight classes with corresponding logic.
- CollationAwareUTF8String - add new trim, trimLeft and trimRight methods that actually implement trim logic.
- UTF8String.java - expose some of the methods publicly.
- stringExpressions.scala
- CollationTypeCasts.scala - add StringTrim* expressions to CollationTypeCasts rules.
Why are the changes needed?
We are incrementally adding collation support to built-in string functions in Spark.
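To illustrate what collation-aware trimming means for the two collations covered here, below is a minimal sketch on plain Java strings. It is not the code from this PR: the class name LcaseTrimSketch, the boolean lowercase flag and the helper inTrimSet are hypothetical, and it assumes that simple per-code-point lowercasing is a good enough stand-in for LCASE matching. The actual implementation works on UTF8String and may treat edge cases differently.

```java
// Hypothetical illustration only; not Spark's CollationAwareUTF8String.
final class LcaseTrimSketch {

    // Returns true if code point cp is in trimChars.
    // lowercase == false models UTF8_BINARY (exact match);
    // lowercase == true models LCASE (assumption: compare lowercased code points).
    static boolean inTrimSet(int cp, String trimChars, boolean lowercase) {
        int key = lowercase ? Character.toLowerCase(cp) : cp;
        return trimChars.codePoints()
                .map(c -> lowercase ? Character.toLowerCase(c) : c)
                .anyMatch(c -> c == key);
    }

    // Drops leading code points that belong to the trim set.
    static String trimLeft(String src, String trimChars, boolean lowercase) {
        int start = 0;
        while (start < src.length()) {
            int cp = src.codePointAt(start);
            if (!inTrimSet(cp, trimChars, lowercase)) {
                break;
            }
            start += Character.charCount(cp);
        }
        return src.substring(start);
    }

    // Drops trailing code points that belong to the trim set.
    static String trimRight(String src, String trimChars, boolean lowercase) {
        int end = src.length();
        while (end > 0) {
            int cp = src.codePointBefore(end);
            if (!inTrimSet(cp, trimChars, lowercase)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return src.substring(0, end);
    }

    // Trims both ends by composing the two one-sided trims.
    static String trim(String src, String trimChars, boolean lowercase) {
        return trimRight(trimLeft(src, trimChars, lowercase), trimChars, lowercase);
    }
}
```

With this sketch, trimming "x" from "xXaXx" leaves "XaX" under the binary comparison and "a" under the lowercase comparison, which is the kind of difference the collation argument is meant to control.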
Does this PR introduce any user-facing change?
Yes:
How was this patch tested?
Already existing tests + new unit/e2e tests.
Was this patch authored or co-authored using generative AI tooling?
No.