[SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE) by davidm-db · Pull Request #46206 · apache/spark

davidm-db · 2024-04-24T16:21:09Z

Recreating original PR because code has been reorganized in this PR.

What changes were proposed in this pull request?

This PR adds collation support to the StringTrim family of functions/expressions, specifically:

StringTrim
StringTrimBoth
StringTrimLeft
StringTrimRight

Changes:

CollationSupport.java
- Add new StringTrim, StringTrimLeft and StringTrimRight classes with corresponding logic.
- CollationAwareUTF8String - add new trim, trimLeft and trimRight methods that actually implement trim logic.
UTF8String.java
- Expose some of the methods publicly.
stringExpressions.scala
- Change input types.
- Change eval and code gen logic.
CollationTypeCasts.scala
- Add StringTrim* expressions to CollationTypeCasts rules.
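
For readers who want a feel for the intended semantics, here is a minimal sketch of binary vs. lowercase-aware trimming. It is not the code from this PR: the class name TrimSketch, the boolean lowercase flag, and the use of java.lang.String with per-code-point lowercasing are all assumptions made for illustration. The real implementation works on UTF8String and may treat case mapping differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the CollationAwareUTF8String code from this PR.
final class TrimSketch {
    private TrimSketch() {}

    // Normalize a code point for comparison: lowercase it for the
    // lowercase collation, leave it untouched for binary comparison.
    private static int key(int codePoint, boolean lowercase) {
        return lowercase ? Character.toLowerCase(codePoint) : codePoint;
    }

    // Build the set of (normalized) code points that should be trimmed.
    private static Set<Integer> trimSet(String trimChars, boolean lowercase) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < trimChars.length(); ) {
            int cp = trimChars.codePointAt(i);
            set.add(key(cp, lowercase));
            i += Character.charCount(cp);
        }
        return set;
    }

    // Remove leading code points that match the trim set.
    static String trimLeft(String src, String trimChars, boolean lowercase) {
        Set<Integer> set = trimSet(trimChars, lowercase);
        int start = 0;
        while (start < src.length()) {
            int cp = src.codePointAt(start);
            if (!set.contains(key(cp, lowercase))) break;
            start += Character.charCount(cp);
        }
        return src.substring(start);
    }

    // Remove trailing code points that match the trim set.
    static String trimRight(String src, String trimChars, boolean lowercase) {
        Set<Integer> set = trimSet(trimChars, lowercase);
        int end = src.length();
        while (end > 0) {
            int cp = src.codePointBefore(end);
            if (!set.contains(key(cp, lowercase))) break;
            end -= Character.charCount(cp);
        }
        return src.substring(0, end);
    }

    // Trim both ends; the result keeps the original casing of the source.
    static String trim(String src, String trimChars, boolean lowercase) {
        return trimRight(trimLeft(src, trimChars, lowercase), trimChars, lowercase);
    }
}
```

Under these assumptions, trimming "x" from "xXHixX" with binary comparison strips nothing beyond exact matches at the edges, while the lowercase-aware variant also strips "X" and returns "Hi". Only the comparison is case-insensitive; the returned string is always a slice of the original input.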

Why are the changes needed?

We are incrementally adding collation support to built-in string functions in Spark.

Does this PR introduce any user-facing change?

Yes:

Users should now be able to use non-default collations in string trim functions.

How was this patch tested?

Already existing tests + new unit/e2e tests.

Was this patch authored or co-authored using generative AI tooling?

No.

uros-db · 2024-04-25T06:01:47Z

I know that we're trying to push all collation-aware changes away from UTF8String and into CollationSupport.CollationAwareUTF8String, but we should be careful about changing access modifiers here. Adding @cloud-fan to take a look and advise on what's the preferred approach:

- modify access in UTF8String in order to allow a collation-aware implementation in CollationSupport
- fall back to putting everything in UTF8String, instead of using CollationSupport.CollationAwareUTF8String

Well, other methods like numBytes and getBytes are already exposed publicly. These methods seem as "important" and "risky" as the methods I changed. The methods I changed also don't modify anything; they are either helpers or accessors for information that is already exposed through numBytes or getBytes.

That was my reasoning, but let's wait for @cloud-fan to provide his thoughts.

I agree, these don't appear to be too dangerous, and I also think Milan needed some of these particular methods in his PRs as well (#45704), so it looks like we do need these to be public

uros-db

some changes needed

uros-db

great improvements across the board, even though there is a considerable amount of code in this PR, now everything is clearly separated, and more optimized too

we will however need a small fix for the ICU implementation, but otherwise looks great

uros-db

based on our offline discussion, ICU implementation proved to be a bit more complicated than expected, with lots of hard-to-catch edge cases

these are already pretty big changes from your side, so please remove the ICU implementation from this PR and we will update the ticket accordingly (only binary & lowercase support in this effort)

thanks David!

uros-db

we also need tests for implicit collation in CollationStringExpressions.scala

btw the ticket has been updated to reflect that the scope is now binary & lowercase collations only

uros-db

lgtm, thanks David for pushing this through!

@cloud-fan ready for review, also note: we separated out CollationAwareUTF8String.java in this PR

uros-db

let's resolve conflicts and revive this

uros-db · 2024-05-08T09:46:37Z

@cloud-fan some tests are stale here, but overall I think the changes in this PR are good
however, changes are indeed large and merge conflicts arise daily - could you please review this?

… comments

uros-db

I think we're ready to merge this

cloud-fan · 2024-05-09T15:04:57Z

thanks, merging to master!

github-actions Bot added the SQL label Apr 24, 2024

mihailomilosevic2001 reviewed Apr 25, 2024

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala Outdated