
[SPARK-47410][SQL] Refactor UTF8String and CollationFactory#45978

Closed
uros-db wants to merge 14 commits into apache:master from uros-db:SPARK-47410



@uros-db
Contributor

@uros-db uros-db commented Apr 10, 2024

What changes were proposed in this pull request?

This PR introduces comprehensive support for collation-aware expressions in Spark, focusing on improving code structure, clarity, and testing coverage for various expressions (including: Contains, StartsWith, EndsWith).

Why are the changes needed?

These changes improve the maintainability and readability of collation-related code in Spark expressions. By restructuring and centralizing collation support, we simplify the addition of new collation-aware operations and ensure consistent testing across different collation types.
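As a rough illustration of what "centralizing" can mean here, the sketch below routes three string predicates through one collation-keying helper, so a new operation only needs to call that helper. It is not the code from this PR: the class `CollationDispatchSketch`, the `Collation` enum, and the `key` helper are hypothetical names, and it works on plain `java.lang.String` rather than Spark's `UTF8String`.

```java
import java.util.Locale;

// Hypothetical sketch: none of these names are Spark's actual API.
final class CollationDispatchSketch {
    // Two illustrative collations. The names echo a binary and a lowercase
    // collation, but this enum is invented for the example.
    enum Collation { BINARY, LOWERCASE }

    // The single place where a collation is applied to an operand.
    static String key(String s, Collation c) {
        return c == Collation.LOWERCASE ? s.toLowerCase(Locale.ROOT) : s;
    }

    // Each predicate delegates to key(), so adding a collation touches one method.
    static boolean contains(String l, String r, Collation c) {
        return key(l, c).contains(key(r, c));
    }

    static boolean startsWith(String l, String r, Collation c) {
        return key(l, c).startsWith(key(r, c));
    }

    static boolean endsWith(String l, String r, Collation c) {
        return key(l, c).endsWith(key(r, c));
    }

    public static void main(String[] args) {
        // Same operands, different collations.
        System.out.println(contains("Spark SQL", "sql", Collation.BINARY));
        System.out.println(contains("Spark SQL", "sql", Collation.LOWERCASE));
    }
}
```

Keeping the collation decision in one helper is the design choice being illustrated; a real implementation would likely avoid allocating lowercased copies on every call.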

Does this PR introduce any user-facing change?

No, this PR is focused on internal refactoring and testing enhancements for collation-aware expression support.

How was this patch tested?

Unit tests in CollationSupportSuite.java
E2E tests in CollationStringExpressionsSuite.scala

Was this patch authored or co-authored using generative AI tooling?

Yes.

@github-actions github-actions Bot added the SQL label Apr 10, 2024
@uros-db uros-db changed the title [DRAFT][SPARK-47410][SQL] refactor UTF8String and CollationFactory [SPARK-47410][SQL] refactor UTF8String and CollationFactory Apr 10, 2024
@uros-db uros-db marked this pull request as ready for review April 10, 2024 12:35
@uros-db uros-db requested a review from dbatomic April 10, 2024 12:35
@HyukjinKwon HyukjinKwon changed the title [SPARK-47410][SQL] refactor UTF8String and CollationFactory [SPARK-47410][SQL] Refactor UTF8String and CollationFactory Apr 11, 2024
@HyukjinKwon
Member

Can you please fill the PR description?

@uros-db uros-db requested a review from dbatomic April 11, 2024 04:48
@uros-db
Contributor Author

uros-db commented Apr 11, 2024

updated PR description, stand by for some more small changes before merging

@dbatomic
Contributor

dbatomic commented Apr 11, 2024

Just wanted to thank you for doing this. IMO, things are much cleaner than they used to be.
Also, @mihailom-db , @nikolamand-db , @stefankandic, @miland-db and @stevomitric as FYI.

@uros-db
Contributor Author

uros-db commented Apr 11, 2024

@cloud-fan all checks look good, ready to merge

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 3103627 Apr 11, 2024
cloud-fan pushed a commit that referenced this pull request May 9, 2024
…unctions/expressions (for UTF8_BINARY & LCASE)

Recreating [original PR](#45749) because code has been reorganized in [this PR](#45978).

### What changes were proposed in this pull request?
This PR is created to add support for collations to StringTrim family of functions/expressions, specifically:
- `StringTrim`
- `StringTrimBoth`
- `StringTrimLeft`
- `StringTrimRight`

Changes:
- `CollationSupport.java`
  - Add new `StringTrim`, `StringTrimLeft` and `StringTrimRight` classes with corresponding logic.
  - `CollationAwareUTF8String` - add new `trim`, `trimLeft` and `trimRight` methods that actually implement trim logic.
- `UTF8String.java` - expose some of the methods publicly.
- `stringExpressions.scala`
  - Change input types.
  - Change eval and code gen logic.
- `CollationTypeCasts.scala` - add `StringTrim*` expressions to `CollationTypeCasts` rules.
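
For readers unfamiliar with what a collation-aware trim involves, here is a minimal sketch of the idea, assuming a case-insensitive comparison of individual code points. It is not the implementation from this commit: the class `CollationTrimSketch`, the `ignoreCase` flag, and the `matches` helper are hypothetical, and it operates on `java.lang.String` instead of `UTF8String`.

```java
// Hypothetical sketch: not Spark's CollationAwareUTF8String.
final class CollationTrimSketch {
    // True if the code point belongs to the trim set, optionally ignoring case.
    static boolean matches(int cp, String trimChars, boolean ignoreCase) {
        if (!ignoreCase) {
            return trimChars.indexOf(cp) >= 0;
        }
        int lowered = Character.toLowerCase(cp);
        return trimChars.codePoints()
                .anyMatch(t -> Character.toLowerCase(t) == lowered);
    }

    // Drop leading code points that belong to the trim set.
    static String trimLeft(String src, String trimChars, boolean ignoreCase) {
        int start = 0;
        while (start < src.length()) {
            int cp = src.codePointAt(start);
            if (!matches(cp, trimChars, ignoreCase)) break;
            start += Character.charCount(cp);
        }
        return src.substring(start);
    }

    // Drop trailing code points that belong to the trim set.
    static String trimRight(String src, String trimChars, boolean ignoreCase) {
        int end = src.length();
        while (end > 0) {
            int cp = src.codePointBefore(end);
            if (!matches(cp, trimChars, ignoreCase)) break;
            end -= Character.charCount(cp);
        }
        return src.substring(0, end);
    }

    // Trim both ends by composing the two one-sided trims.
    static String trim(String src, String trimChars, boolean ignoreCase) {
        return trimRight(trimLeft(src, trimChars, ignoreCase), trimChars, ignoreCase);
    }

    public static void main(String[] args) {
        System.out.println(trim("xXhelloXx", "x", true));
        System.out.println(trim("xXhelloXx", "x", false));
    }
}
```

With `ignoreCase` set, both `x` and `X` are stripped from either end; without it, only exact matches are removed. How the actual lowercase collation maps characters may differ from `Character.toLowerCase`.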

### Why are the changes needed?
We are incrementally adding collation support to built-in string functions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes:
- Users should now be able to use non-default collations in string trim functions.

### How was this patch tested?
Already existing tests + new unit/e2e tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46206 from davidm-db/string-trim-functions.

Authored-by: David Milicevic <david.milicevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

6 participants