[SQL] Add collations support to split regex expression#45856
[SQL] Add collations support to split regex expression#45856nikolamand-db wants to merge 6 commits into
Conversation
There was a problem hiding this comment.
what does it mean for "regex" to be of type StringTypeAnyCollation, as it doesn't seem to me that collation is respected/needed for this parameter to begin with? for example, consider: [,] with UNICODE_CI collation
There was a problem hiding this comment.
+1, why would we allow any collation for the regex string?
There was a problem hiding this comment.
This is to support session-level default collation. If the user changes it and passes regex string literal, it will be interpreted as collated string. We don't want to throw exception in such cases.
|
heads up: we’ve done some major code restructuring in #45978, so please sync these changes before moving on @nikolamand-db you’ll likely need to rewrite the code in this PR, so please follow the guidelines outlined in https://issues.apache.org/jira/browse/SPARK-47410 |
27d0e43 to
f684c4b
Compare
uros-db
left a comment
There was a problem hiding this comment.
LGTM @cloud-fan please review
| if (collation.supportsBinaryEquality) { | ||
| return execBinary(string, regex, limit); | ||
| } else { | ||
| return execLowercase(string, regex, limit); |
There was a problem hiding this comment.
there is no execICU for this operation?
There was a problem hiding this comment.
If it's not supported yet, we should add assert(collation.supportsLowercaseEquality)
There was a problem hiding this comment.
I think final lazy val collationId: Int = str.dataType.asInstanceOf[StringType].collationId for StringTypeBinaryLcase won't allow that to happen. But an additional assert in the else branch couldn't hurt too
| StringSplitTestCase("ABC", "[b]", "UTF8_BINARY_LCASE", Seq("A", "C")), | ||
| StringSplitTestCase("AAA", "[a]", "UTF8_BINARY_LCASE", Seq("", "", "", "")), | ||
| StringSplitTestCase("AAA", "[b]", "UTF8_BINARY_LCASE", Seq("AAA")), | ||
| StringSplitTestCase("aAbB", "[ab]", "UTF8_BINARY_LCASE", Seq("", "", "", "", "")), |
There was a problem hiding this comment.
please write unit tests for all corner cases, instead of end-to-end tests
There was a problem hiding this comment.
Rewritten tests, please check.
uros-db
left a comment
There was a problem hiding this comment.
I don't like where this PR is going for regexp expressions, the way we do these appears very different compared to StringExpressions...
let's pause this PR for now while we figure it out
|
Closing as we have new approach for all regex functions #46077. |
What changes were proposed in this pull request?
With these changes, split expression should support any collation with binary equality and
UTF8_BINARY_LCASE. Other collations are derived from external ICU implementation and do not support split operation with regex splitters, for these we'll throwTypeCheckFailure.Why are the changes needed?
To introduce collations support for split expression.
Does this PR introduce any user-facing change?
Yes, user should now successfully split collated strings.
How was this patch tested?
Modified existing test in
CollationRegexpExpressionsSuite.Was this patch authored or co-authored using generative AI tooling?
No.