
[SPARK-47476][SQL] Support REPLACE function to work with collated strings#45704

Closed
miland-db wants to merge 43 commits into apache:master from miland-db:miland-db/string-replace

@miland-db
Contributor

@miland-db miland-db commented Mar 25, 2024

What changes were proposed in this pull request?

Extend built-in string functions to support non-binary, non-lowercase collation for: replace.

Why are the changes needed?

Update collation support for built-in string functions in Spark.

Does this PR introduce any user-facing change?

Yes, users should now be able to use COLLATE within arguments for built-in string function REPLACE in Spark SQL queries, using non-binary collations such as UNICODE_CI.

How was this patch tested?

Unit tests for queries using StringReplace (CollationStringExpressionsSuite.scala).

Was this patch authored or co-authored using generative AI tooling?

No

Algorithm explanation

  • StringSearch.next() returns the position of the first character of the search string in the source string. We need to convert this character position to a byte position so we can perform the replace operation correctly.
  • For the UTF8_BINARY_LCASE collation there is no corresponding collator, so we have to implement custom logic (lowercaseReplace). It matches on lowercased copies of the source and search strings and uses the match positions to operate on the original source string. String building is performed in the same way as for other non-binary collations.

Similar logic can be found in the existing int find(UTF8String str, int start) and int indexOf(UTF8String v, int start) methods.
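
For readers who want to see the two ideas above in isolation, here is a minimal sketch using only the Java standard library. It is not the code from this PR: the class name CollatedReplaceSketch and the helper charPosToBytePos are hypothetical, java.lang.String stands in for UTF8String, and String.indexOf stands in for the collator-driven StringSearch.next(). It also assumes that lowercasing does not change the length of the source string, which holds for ASCII but not for every Unicode character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical illustration only; not the implementation merged in this PR.
final class CollatedReplaceSketch {

    // Converts a char (UTF-16 code unit) index into a UTF-8 byte offset by
    // encoding the prefix. Assumes charPos does not split a surrogate pair.
    static int charPosToBytePos(String source, int charPos) {
        return source.substring(0, charPos).getBytes(StandardCharsets.UTF_8).length;
    }

    // Case-insensitive replace: find matches on lowercased copies, then copy
    // the unmatched pieces from the ORIGINAL source so its casing is preserved.
    static String lowercaseReplace(String source, String search, String replace) {
        if (search.isEmpty()) {
            return source; // this sketch leaves the source unchanged for an empty search
        }
        String lowerSource = source.toLowerCase(Locale.ROOT);
        String lowerSearch = search.toLowerCase(Locale.ROOT);
        // Guard for the stated assumption: positions in the lowercased copy
        // are only reused if lowercasing kept the length unchanged.
        if (lowerSource.length() != source.length()) {
            throw new IllegalArgumentException("lowercasing changed the string length");
        }
        StringBuilder result = new StringBuilder();
        int start = 0;
        int match = lowerSource.indexOf(lowerSearch, start);
        while (match >= 0) {
            result.append(source, start, match); // untouched text before the match
            result.append(replace);
            start = match + lowerSearch.length();
            match = lowerSource.indexOf(lowerSearch, start);
        }
        result.append(source, start, source.length()); // tail after the last match
        return result.toString();
    }
}
```

With these definitions, lowercaseReplace("Hello WORLD hello", "hello", "bye") yields "bye WORLD bye", and charPosToBytePos("a\u00f1b", 2) yields 3 because U+00F1 takes two bytes in UTF-8. A real implementation on UTF8String would work on byte offsets directly rather than re-encoding the prefix for every match.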

@github-actions github-actions Bot added the SQL label Mar 25, 2024
@miland-db
Contributor Author

@mihailom-db @uros-db @dbatomic

@dongjoon-hyun
Member

Hi, @miland-db and @cloud-fan. I saw a series of [COLLATION] tags. Are you going to make [COLLATION] the official community module tag in PR titles? For me, SQL seems to be enough for a SQL module PR.

$ git log --oneline | grep COLLATION
8762e256d16 [SPARK-47296][SQL][COLLATION] Fail unsupported functions for non-binary collations
456d246badb [SPARK-47248][SQL][COLLATION] Improved string function support: contains
d5f35ec97fc [SPARK-46835][SQL][COLLATIONS] Join support for non-binary collations
6534a3398ae [SPARK-47102][SQL] Add the `COLLATION_ENABLED` config flag
ca7c60b4998 [SPARK-47268][SQL][COLLATIONS] Support for repartition with collations
479954cf73a [SPARK-47131][SQL][COLLATION] String function support: contains, startswith, endswith
71767cfe376 [SPARK-46834][SQL][COLLATIONS] Support for aggregates

@cloud-fan
Contributor

SQL tag is sufficient, but I don't mind people adding more grouping if the number of PRs is large enough.

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-47476][SQL][COLLATION] String function support: replace" to "[SPARK-47476][SQL] String function support: replace" Mar 25, 2024
@dongjoon-hyun
Member

> SQL tag is sufficient, but I don't mind people adding more grouping if the number of PRs is large enough.

Thank you, @cloud-fan . Then, let's not use this. I don't think this is a permanent grouping.

Member

@MaxGekk MaxGekk left a comment

Let's improve PR's title since it is too generic.

@miland-db miland-db changed the title from "[SPARK-47476][SQL] String function support: replace" to "[SPARK-47476][SQL] Support REPLACE function to work with collated strings" Mar 25, 2024
Comment thread common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java Outdated
@miland-db miland-db requested a review from cloud-fan March 29, 2024 16:55
Comment thread common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java Outdated
@uros-db
Contributor

uros-db commented Apr 11, 2024

heads up: we’ve done some major code restructuring in #45978, so please sync these changes before moving on

@miland-db you’ll likely need to rewrite the code in this PR, so please follow the guidelines outlined in https://issues.apache.org/jira/browse/SPARK-47410

# Conflicts:
#	common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
#	sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/CollationStringExpressionsSuite.scala
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts:
#	common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
Contributor

@uros-db uros-db left a comment

just flagging this PR will need a fix for the ICU implementation
(you already added some tests for this that are failing)

# Conflicts:
#	common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationSupport.java
#	common/unsafe/src/test/java/org/apache/spark/unsafe/types/CollationSupportSuite.java
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CollationTypeCasts.scala
Contributor

@uros-db uros-db left a comment

lgtm, @cloud-fan ready for review

@cloud-fan
Contributor

the Spark Connect test failure is flaky and unrelated here, I'm merging it to master, thanks!

@cloud-fan cloud-fan closed this in 07b84dd Apr 26, 2024
6 participants