[SPARK-47296][SQL][COLLATION] Fail unsupported functions for non-binary collations #45422
uros-db wants to merge 29 commits into
|
Without updating |
|
@cloud-fan yes, that is a problem... should we settle only on on a more important note, even if we were to update while type coercion is a separate effort, and will probably cover other parts of the codebase, what do we think about implementing this for now? @dbatomic
|
|
I don't think it's safe to only handle expressions in I'd prefer only updating functions that support collation to have more fine-grained collation check, which shouldn't be many right now. |
|
@cloud-fan that makes a lot of sense, now new case classes should handle this:
|
| * equality and hashing). | ||
| */ | ||
| def isBinaryCollation: Boolean = CollationFactory.fetchCollation(collationId).isBinaryCollation | ||
| def isLowercaseCollation: Boolean = collationId == CollationFactory.LOWERCASE_COLLATION_ID |
Can you remove even this guy and push the check into StringTypeBinaryLcase?
I am not sure this is possible. StringTypeBinaryLcase does not extend StringType, and the point of this function for now is to call it on a StringType object in acceptsType to check whether we should let the function proceed with that input.
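To make the point above concrete, here is a minimal sketch of how an abstract expected type could gate inputs through acceptsType. It uses simplified stand-in types rather than Spark's real classes. The collation ids (0 for the default binary collation, 1 for lowercase) and the rule that only id 0 counts as binary are assumptions made for illustration.

```scala
object CollationGateSketch {
  // Simplified stand-ins; these are NOT the real org.apache.spark.sql.types classes.
  sealed trait DataType
  case object IntegerType extends DataType

  // Assumption: collationId 0 is the default binary collation, 1 is lowercase.
  final case class StringType(collationId: Int) extends DataType {
    def isBinaryCollation: Boolean = collationId == 0
    def isLowercaseCollation: Boolean = collationId == 1
  }

  // The "expected input type" a function declares for one of its arguments.
  trait AbstractDataType {
    def acceptsType(other: DataType): Boolean
  }

  // Accepts strings whose collation is binary or lowercase, and nothing else.
  // It does not extend StringType, so the check has to live on StringType itself.
  case object StringTypeBinaryLcase extends AbstractDataType {
    override def acceptsType(other: DataType): Boolean = other match {
      case st: StringType => st.isBinaryCollation || st.isLowercaseCollation
      case _ => false
    }
  }

  // Accepts a string with any collation.
  case object StringTypeAnyCollation extends AbstractDataType {
    override def acceptsType(other: DataType): Boolean = other.isInstanceOf[StringType]
  }
}
```

With this shape, a function that declares StringTypeBinaryLcase rejects StringType(3) during type checking, while one that declares StringTypeAnyCollation lets it through.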
|
LGTM. As a follow-up, we should revisit the error messages. IMO it is weird to expose a message with the "string_any_collation" type to customers, but I think we can do that as a follow-up.
| case (st: StringType, _: StringTypeCollated) => st | ||
| // Cast any atomic type to string. | ||
| case (any: AtomicType, StringType) if any != StringType => StringType | ||
| case (any: AtomicType, _: StringTypeCollated) if any != StringType => StringType |
The code can be more readable if we call StringTypeCollated#defaultConcreteType, which is StringType(0)
Is this what you meant?
|
|
||
| // If a function expects a StringType, no StringType instance should be implicitly cast to | ||
| // StringType with a collation that's not accepted (aka. lockdown unsupported collations). | ||
| case (StringType, StringType) => None |
Isn't this case covered by the first case match, case _ if expectedType.acceptsType(inType) => Some(inType)?
I think this should be case (_: StringType, StringType) ...
|
|
||
| // "canANSIStoreAssign" doesn't account for targets extending StringTypeCollated, but | ||
| // ANSIStoreAssign is generally expected to return "true" for (AtomicType, StringType) | ||
| case (_: AtomicType, _: StringTypeCollated) => Some(StringType) |
is this correct? StringType does not satisfy StringTypeCollated
Yes it is. canANSIStoreAssign has a rule for casting AtomicType to StringType, but since StringTypeCollated extends only AbstractDataType and not StringType, this cast rule will not be picked up. That said, I would say this rule has to be improved to check all canANSIStoreAssign rules.
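The reasoning in this reply can be sketched with stand-in types. This is not Spark's actual coercion code: canStoreAssign is a hypothetical, heavily simplified stand-in for canANSIStoreAssign, and the type hierarchy is reduced to the minimum needed to show why the abstract target needs its own case.

```scala
object StoreAssignSketch {
  // Simplified hierarchy (an assumption for illustration, not Spark's real classes).
  sealed trait AbstractDataType
  sealed trait DataType extends AbstractDataType
  case object IntegerType extends DataType
  case object StringType extends DataType

  // Abstract "collated string" expectation; deliberately NOT a subtype of DataType,
  // so a check that only understands concrete targets never sees it.
  case object StringTypeCollated extends AbstractDataType

  // Hypothetical stand-in for a store-assign check over concrete types only.
  def canStoreAssign(from: DataType, to: DataType): Boolean = (from, to) match {
    case (_, StringType) => true // any type may be stored as a string
    case (a, b) => a == b
  }

  def implicitCast(in: DataType, expected: AbstractDataType): Option[DataType] =
    (in, expected) match {
      // Explicit rule for the abstract target: cast non-strings to a concrete StringType.
      case (_, StringTypeCollated) if in != StringType => Some(StringType)
      // Concrete targets go through the store-assign check.
      case (_, target: DataType) if in != target && canStoreAssign(in, target) => Some(target)
      // Everything else: no implicit cast.
      case _ => None
    }
}
```

Without the first case, implicitCast(IntegerType, StringTypeCollated) would fall through to the default and return None, even though the equivalent cast to a concrete StringType is allowed.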
| case (StringType, datetime: DatetimeType) => datetime | ||
| case (StringType, AnyTimestampType) => AnyTimestampType.defaultConcreteType | ||
| case (StringType, BinaryType) => BinaryType | ||
| case (st: StringType, StringType) => st |
reading the code around here, I think null means no implicit cast.
| // If a function expects a StringType, no StringType instance should be implicitly cast to | ||
| // StringType with a collation that's not accepted (aka. lockdown unsupported collations). | ||
| case (StringType, StringType) => None | ||
| case (StringType, _: StringTypeCollated) => None |
This case should be put before case (StringType, a: AtomicType) =>, otherwise it's useless
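The ordering concern can be illustrated with a small sketch in which the rules are modelled as an ordered list of partial functions and the first matching rule wins, just as the first matching case wins in a match expression. The types and rule names are hypothetical stand-ins, not Spark's coercion rules.

```scala
object CaseOrderSketch {
  // Simplified stand-ins (assumption): StringType is itself an AtomicType.
  sealed trait DataType
  sealed trait AtomicType extends DataType
  case object StringType extends AtomicType
  case object IntegerType extends AtomicType

  type Rule = PartialFunction[(DataType, DataType), Option[DataType]]

  // Specific rule: never implicitly cast a string to another string type.
  val noStringToString: Rule = { case (StringType, StringType) => None }

  // General rule: a string may be cast to any atomic type.
  val stringToAtomic: Rule = { case (StringType, a: AtomicType) => Some(a) }

  // Apply the first rule that matches, mirroring top-to-bottom case evaluation.
  def firstMatch(rules: Seq[Rule])(in: DataType, expected: DataType): Option[DataType] = {
    val key = (in, expected)
    rules.find(_.isDefinedAt(key)).flatMap(rule => rule(key))
  }
}
```

With Seq(noStringToString, stringToAtomic) the string-to-string pair yields no cast. With the order reversed, the general rule captures the pair first and the specific rule never fires, which is the "useless" case the comment describes.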
| case (StringType, AnyTimestampType) => AnyTimestampType.defaultConcreteType | ||
| case (StringType, BinaryType) => BinaryType | ||
| case (st: StringType, StringType) => st | ||
| case (st: StringType, _: StringTypeCollated) => st |
I think this will be covered by the last default case match?
Yes and no. The following two lines would have made a cast, but I changed them so that they don't.
Could this be an onboarding task? |
|
thanks, merging to master! |
…ypeCollated

### What changes were proposed in this pull request?
Renaming simpleString in StringTypeAnyCollation. This PR should only be merged after #45383 is merged.

### Why are the changes needed?
[SPARK-47296](#45422) introduced a change to fail all unsupported functions. Because of this change, the expected inputTypes in ExpectsInputTypes had to be changed. That in turn introduced a user-facing change: error messages now print "STRING_ANY_COLLATION" in places where they printed "STRING" before. Concretely, if we get an Int input where StringTypeAnyCollation was expected, we throw this faulty message to users.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Existing tests were changed back to "STRING" notation.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #45694 from mihailom-db/SPARK-47504.

Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Why are the changes needed?
Currently, all StringType arguments passed to built-in string functions in Spark SQL get treated as binary strings. This behaviour is incorrect for almost all collationIds except the default (0), and we should instead warn the user if they try to use an unsupported collation for the given function. Over time, we should implement the appropriate support for these (function, collation) pairs, but until then we should have a way to fail unsupported statements in query analysis.
Does this PR introduce any user-facing change?
Yes, users will now get appropriate errors when they try to use an unsupported collation with a given string function.
How was this patch tested?
Tests in CollationSuite to check if these functions work for binary collations and throw exceptions for others.
Was this patch authored or co-authored using generative AI tooling?
Yes.