[SPARK-47476][SQL] Support REPLACE function to work with collated strings#45704
[SPARK-47476][SQL] Support REPLACE function to work with collated strings#45704miland-db wants to merge 43 commits into
Conversation
|
Hi, @miland-db and @cloud-fan . I saw a series of |
|
SQL tag is sufficient, but I don't mind people adding more grouping if the number of PRs is large enough. |
Thank you, @cloud-fan . Then, let's not use this. I don't think this is a permanent grouping. |
MaxGekk
left a comment
There was a problem hiding this comment.
Let's improve PR's title since it is too generic.
|
heads up: we’ve done some major code restructuring in #45978, so please sync these changes before moving on @miland-db you’ll likely need to rewrite the code in this PR, so please follow the guidelines outlined in https://issues.apache.org/jira/browse/SPARK-47410 |
# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java # sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
uros-db
left a comment
There was a problem hiding this comment.
just flagging this PR will need a fix for the ICU implementation
(you already added some tests for this that are failing)
# Conflicts: # common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java # common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
uros-db
left a comment
There was a problem hiding this comment.
lgtm, @cloud-fan ready for review
|
the Spark Connect test failure is flaky and unrelated here, I'm merging it to master, thanks! |
What changes were proposed in this pull request?
Extend built-in string functions to support non-binary, non-lowercase collation for: replace.
Why are the changes needed?
Update collation support for built-in string functions in Spark.
Does this PR introduce any user-facing change?
Yes, users should now be able to use COLLATE within arguments for built-in string function REPLACE in Spark SQL queries, using non-binary collations such as UNICODE_CI.
How was this patch tested?
Unit tests for queries using StringReplace (
CollationStringExpressionsSuite.scala).Was this patch authored or co-authored using generative AI tooling?
No
Algorithm explanation
searchstring in thesourcesource. We need to convert this position to position in bytes so we can perform replace operation correctly.lowercaseReplace). It is done by performing matching on lowercase strings (source & search) and using that information to do operations on the originalsourcestring. String building is performed in the same way as for other non-binary collations.Similar logic can be found in existing
int find(UTF8String str, int start)&int indexOf(UTF8String v, int start)methods.