[SPARK-47566][SQL] Support SubstringIndex function to work with collated strings #45725
miland-db wants to merge 45 commits into
dongjoon-hyun
left a comment
Thank you for making a PR.
Just one preliminary question: is there any chance of a performance regression after this PR, @miland-db?
So far the computational complexity of this function was
I hope this helps @dongjoon-hyun
MaxGekk
left a comment
There is the test suite UTF8StringWithCollationSuite. Could you add or move tests there for the changes in UTF8String + collation?
I am testing functions from In this PR: #45615 I have added tests to
@stefankandic can you also review this change please?
# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
uros-db
left a comment
just flagging this PR will likely need a fix for the ICU implementation
# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java # common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala # sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java # common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
  }
  public static UTF8String execLowercase(final UTF8String string, final UTF8String delimiter,
      final int count) {
    return CollationAwareUTF8String.lowercaseSubStringIndex(string, delimiter, count);
The class CollationAwareUTF8String is getting bigger. Shall we move it to an individual file?
Maybe in the next PR. We will consider this option
thanks, merging to master!
What changes were proposed in this pull request?
Extend built-in string functions to support non-binary, non-lowercase collation for: substring_index.
Why are the changes needed?
Update collation support for built-in string functions in Spark.
Does this PR introduce any user-facing change?
Yes, users should now be able to use COLLATE within arguments for built-in string function SUBSTRING_INDEX in Spark SQL queries, using non-binary collations such as UNICODE_CI.
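As a rough illustration of the behavior change (not the code in this PR), the plain-Java sketch below models SUBSTRING_INDEX with an optional case-insensitive delimiter match. The class name, method names, and the ignoreCase flag are hypothetical. String.regionMatches with ignoreCase is only an approximation of a case-insensitive collation; it does not reproduce the ICU rules behind UNICODE_CI.

```java
// Hypothetical, Spark-free model of SUBSTRING_INDEX semantics.
final class SubstringIndexSketch {

    // First index >= from where delim matches, or -1 if there is none.
    static int indexOf(String s, String delim, int from, boolean ignoreCase) {
        for (int i = Math.max(from, 0); i + delim.length() <= s.length(); i++) {
            if (s.regionMatches(ignoreCase, i, delim, 0, delim.length())) {
                return i;
            }
        }
        return -1;
    }

    // Last index <= from where delim matches, or -1 if there is none.
    static int lastIndexOf(String s, String delim, int from, boolean ignoreCase) {
        for (int i = Math.min(from, s.length() - delim.length()); i >= 0; i--) {
            if (s.regionMatches(ignoreCase, i, delim, 0, delim.length())) {
                return i;
            }
        }
        return -1;
    }

    // count > 0: text before the count-th delimiter, counting from the left.
    // count < 0: text after the |count|-th delimiter, counting from the right.
    // count == 0 or an empty delimiter: empty string.
    // Fewer than |count| delimiters: the whole input.
    static String substringIndex(String s, String delim, int count, boolean ignoreCase) {
        if (count == 0 || delim.isEmpty()) {
            return "";
        }
        if (count > 0) {
            int idx = -1;
            for (int remaining = count; remaining > 0; remaining--) {
                idx = indexOf(s, delim, idx + 1, ignoreCase);
                if (idx < 0) {
                    return s;
                }
            }
            return s.substring(0, idx);
        }
        int idx = s.length() - delim.length() + 1;
        for (int remaining = -count; remaining > 0; remaining--) {
            idx = lastIndexOf(s, delim, idx - 1, ignoreCase);
            if (idx < 0) {
                return s;
            }
        }
        return s.substring(idx + delim.length());
    }
}
```

In this model, substringIndex("wwwXapacheXorg", "x", 2, true) returns "wwwXapache", whereas with ignoreCase set to false the delimiter is never found and the whole input comes back unchanged.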
How was this patch tested?
Unit tests for queries using SubstringIndex (CollationStringExpressionsSuite.scala).
Was this patch authored or co-authored using generative AI tooling?
No
To consider:
There is no check for a collation match between string and delimiter; it will be introduced with Implicit Casting.
We can remove the original public UTF8String subStringIndex(UTF8String delim, int count) method, and get the existing behavior using subStringIndex(delim, count, 0).