Skip to content

[SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE)#46206

Closed
davidm-db wants to merge 20 commits into
apache:masterfrom
davidm-db:string-trim-functions
Closed

[SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE)#46206
davidm-db wants to merge 20 commits into
apache:masterfrom
davidm-db:string-trim-functions

Conversation

@davidm-db
Copy link
Copy Markdown
Contributor

Recreating original PR because code has been reorganized in this PR.

What changes were proposed in this pull request?

This PR is created to add support for collations to StringTrim family of functions/expressions, specifically:

  • StringTrim
  • StringTrimBoth
  • StringTrimLeft
  • StringTrimRight

Changes:

  • CollationSupport.java
    • Add new StringTrim, StringTrimLeft and StringTrimRight classes with corresponding logic.
    • CollationAwareUTF8String - add new trim, trimLeft and trimRight methods that actually implement trim logic.
  • UTF8String.java - expose some of the methods publicly.
  • stringExpressions.scala
    • Change input types.
    • Change eval and code gen logic.
  • CollationTypeCasts.scala - add StringTrim* expressions to CollationTypeCasts rules.

Why are the changes needed?

We are incrementally adding collation support to a built-in string functions in Spark.

Does this PR introduce any user-facing change?

Yes:

  • User should now be able to use non-default collations in string trim functions.

How was this patch tested?

Already existing tests + new unit/e2e tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions Bot added the SQL label Apr 24, 2024
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that we're trying to push all collation aware changes away from UTF8String, and into CollationSupport.CollationAwareUTF8String

but we should be careful about changing access modifiers here, adding @cloud-fan to take a look and advise on what's the preferred approach:

  1. modify access in UTF8String in order to allow collation-aware implementation in CollationSupport
  2. fall back to putting everything in UTF8String, instead of using CollationSupport.CollationAwareUTF8String

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, other methods like numBytes and getBytes are already exposed publicly. These methods seem as "important" and "risky" as the methods I changed. The methods I changed also don't modify anything, they are either helpers or they access some information that is already exposed through numBytes or getBytes.

That was my reasoning, but let's wait for @cloud-fan to provide his thoughts.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, these don't appear to be too dangerous, and I also think Milan needed some of these particular methods in his PRs as well (#45704), so it looks like we do need these to be public

Copy link
Copy Markdown
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some changes needed

@davidm-db davidm-db force-pushed the string-trim-functions branch from 700b737 to ce73910 Compare April 25, 2024 17:07
Copy link
Copy Markdown
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great improvements across the board, even though there is a considerable amount of code in this PR, now everything is clearly separated, and more optimized too

we will however need a small fix for the ICU implementation, but otherwise looks great

@davidm-db davidm-db force-pushed the string-trim-functions branch from 1e08324 to 28c3cfc Compare April 26, 2024 10:55
Copy link
Copy Markdown
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on our offline discussion, ICU implementation proved to be a bit more complicated than expected, with lots of hard-to-catch edge cases

these are already pretty big changes from your side, so please remove the ICU implementation from this PR and we will update the ticket accordingly (only binary & lowercase support in this effort)

thanks David!

Copy link
Copy Markdown
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also need tests for implicit collation in CollationStringExpressions.scala

btw the ticket is now updated to reflect that the scope is now binary & lowercase collations only

@davidm-db davidm-db force-pushed the string-trim-functions branch from 983810c to 0e4d060 Compare April 30, 2024 11:32
Copy link
Copy Markdown
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks David for pushing this through!

@cloud-fan ready for review, also note: we separated out CollationAwareUTF8String.java in this PR

Copy link
Copy Markdown
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's resolve conflicts and revive this

@uros-db
Copy link
Copy Markdown
Contributor

uros-db commented May 8, 2024

@cloud-fan some tests are stale here, but overall I think the changes in this PR are good
however, changes are indeed large and merge conflicts arise daily - could you please review this?

@davidm-db davidm-db force-pushed the string-trim-functions branch from 0bfbf24 to fff7d87 Compare May 8, 2024 10:12
@davidm-db davidm-db changed the title [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions (for UTF8_BINARY & LCASE) May 8, 2024
Copy link
Copy Markdown
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're ready to merge this

@cloud-fan
Copy link
Copy Markdown
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 21333f8 May 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants