Skip to content

[Deprecated][SQL] Add support for collation for StringTrim type of functions/expressions#45749

Closed
davidm-db wants to merge 8 commits into
apache:masterfrom
davidm-db:SPARK-47409
Closed

[Deprecated][SQL] Add support for collation for StringTrim type of functions/expressions#45749
davidm-db wants to merge 8 commits into
apache:masterfrom
davidm-db:SPARK-47409

Conversation

@davidm-db
Copy link
Copy Markdown
Contributor

@davidm-db davidm-db commented Mar 28, 2024

Left to do/discussion topics (to be removed by the PR completion)

  • Handling of a lowercase (utf8_binary_lcase) collation type seems a bit grueling in the long run. This PR handles it the same way as in every other place, since I think it is out of the scope of this PR, but should we maybe create a work item to generalize this collation type better (if we don't have it already)?

What changes were proposed in this pull request?

This PR is created to add support for collations to StringTrim family of functions/expressions, specifically:

  • StringTrim
  • StringTrimBoth
  • StringTrimLeft
  • StringTrimRight

Changes:

  • stringExpressions.scala
    • Check input types to be of the same collation.
    • Route to proper endpoints in UTF8String - doEval and doGenCode.
  • UTF8String.java
    • Implement logic for all trim functions (logic should be well commented-out in the code and mostly matches already existing implementations of trimLeft and trimRight).

Other minor changes in this PR:

  • Change parameter names in CollationFactory.getStringSearch() to a more descriptive ones.

Why are the changes needed?

We are incrementally adding collation support to a built-in string functions in Spark.

Does this PR introduce any user-facing change?

Yes:

  • User should now be able to use non-default collations in string trim functions.
  • User will receive DATATYPE_MISMATCH.COLLATION_MISMATCH exception if collations of two function parameters do not match.

How was this patch tested?

Already existing tests + new unit/e2e tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions Bot added the SQL label Mar 28, 2024
Comment thread common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java Outdated
@mihailomilosevic2001
Copy link
Copy Markdown
Contributor

Also, please add tests to CollationStringExpressionSuite

@davidm-db
Copy link
Copy Markdown
Contributor Author

davidm-db commented Mar 28, 2024

@mihailom-db I'm adding tests to CollationStringExpressionSuite - I just pushed the changes so they can be reviewed while I'm adding the tests :)

@davidm-db
Copy link
Copy Markdown
Contributor Author

davidm-db commented Mar 29, 2024

@MaxGekk, @cloud-fan could someone please take a look at this PR?

Tagging the rest of Belgrade collation crew if anyone else would like to review additionally: @dbatomic, @nikolamand-db, @stefankandic, @uros-db

return collatedTrimLeft(trimString, collationId);
}

private UTF8String lowercaseTrimLeft(UTF8String trimString) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of implementing lowercaseTrimLeft and collatedTrimLeft separately (these functions look very similar to me), I think we could make use of new StringSearch(pattern, target) (with .toLowerCase() for both params, and no collationId for UTF8_BINARY_LCASE)

For more context, please take a look at: https://github.com/apache/spark/pull/45704/files#r1538624688

Comment on lines +1104 to +1131
if (evals.length == 1) {
ev.copy(code = code"""
|${srcString.code}
|boolean ${ev.isNull} = false;
|UTF8String ${ev.value} = null;
|if (${srcString.isNull}) {
| ${ev.isNull} = true;
|} else {
| ${ev.value} = ${srcString.value}.$trimMethod($collationId);
|}""".stripMargin)
} else {
val trimString = evals(1)
ev.copy(code = code"""
|${srcString.code}
|boolean ${ev.isNull} = false;
|UTF8String ${ev.value} = null;
|if (${srcString.isNull}) {
| ${ev.isNull} = true;
|} else {
| ${trimString.code}
| if (${trimString.isNull}) {
| ${ev.isNull} = true;
| } else {
| ${ev.value} =
| ${srcString.value}.$trimMethod(${trimString.value}, $collationId);
| }
|}""".stripMargin)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code is very similar to the one above, consider rewriting it a bit more neatly in a way that incorporates $collationId

trimByteIdx -= stringCharLen[numChars - 1];
numChars--;
}
else {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in jvm languages else is almost always on the same line as the closing brace of it's if

return copyUTF8String(trimByteIdx, numBytes - 1);
}

private UTF8String collatedTrimLeft(UTF8String trimString, int collationId) {
Copy link
Copy Markdown
Contributor

@stefankandic stefankandic Apr 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@uros-db I see a number of PRs from different people adding similar functionality so just wondering if you can start a discussion to standardize the names of these kinds of methods because I've seen multiple variants

my vote would be for collationAware...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collationAwareFunctionName sounds good to me, we can discuss consolidating the naming for all functions

@uros-db
Copy link
Copy Markdown
Contributor

uros-db commented Apr 11, 2024

heads up: we’ve done some major code restructuring in #45978, so please sync these changes before moving on

@davidm-db you’ll likely need to rewrite the code in this PR, so please follow the guidelines outlined in https://issues.apache.org/jira/browse/SPARK-47410

@mihailomilosevic2001
Copy link
Copy Markdown
Contributor

Please add these expressions to CollationTypeCasts rules.

Copy link
Copy Markdown
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please change the PR title so we can remove it from the corresponding ticket in favor of your new PR

@davidm-db davidm-db changed the title [SPARK-47409][SQL] Add support for collation for StringTrim type of functions/expressions [Deprecated][SQL] Add support for collation for StringTrim type of functions/expressions Apr 30, 2024
cloud-fan pushed a commit that referenced this pull request May 9, 2024
…unctions/expressions (for UTF8_BINARY & LCASE)

Recreating [original PR](#45749) because code has been reorganized in [this PR](#45978).

### What changes were proposed in this pull request?
This PR is created to add support for collations to StringTrim family of functions/expressions, specifically:
- `StringTrim`
- `StringTrimBoth`
- `StringTrimLeft`
- `StringTrimRight`

Changes:
- `CollationSupport.java`
  - Add new `StringTrim`, `StringTrimLeft` and `StringTrimRight` classes with corresponding logic.
  - `CollationAwareUTF8String` - add new `trim`, `trimLeft` and `trimRight` methods that actually implement trim logic.
- `UTF8String.java` - expose some of the methods publicly.
- `stringExpressions.scala`
  - Change input types.
  - Change eval and code gen logic.
- `CollationTypeCasts.scala` - add `StringTrim*` expressions to `CollationTypeCasts` rules.

### Why are the changes needed?
We are incrementally adding collation support to a built-in string functions in Spark.

### Does this PR introduce _any_ user-facing change?
Yes:
- User should now be able to use non-default collations in string trim functions.

### How was this patch tested?
Already existing tests + new unit/e2e tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #46206 from davidm-db/string-trim-functions.

Authored-by: David Milicevic <david.milicevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants