[SPARK-47541][SQL] Collated strings in complex types supporting operations reverse, array_join, concat, map#45693
[SPARK-47541][SQL] Collated strings in complex types supporting operations reverse, array_join, concat, map#45693nikolamand-db wants to merge 9 commits into
Conversation
| array: Expression, | ||
| delimiter: Expression, | ||
| nullReplacement: Option[Expression]) extends Expression with ExpectsInputTypes { | ||
| nullReplacement: Option[Expression]) extends ImplicitCastInputTypes with ExpectsInputTypes { |
There was a problem hiding this comment.
You extend both Parent class and its Child. Remove ExpectsInputTypes.
| private val commonType: DataType = { | ||
| val elementType = array.dataType.asInstanceOf[ArrayType].elementType | ||
| val s = nullReplacement match { | ||
| case Some(replacement) => Seq(elementType, delimiter.dataType, replacement.dataType) | ||
| case _ => Seq(elementType, delimiter.dataType) | ||
| } | ||
| TypeCoercion.findWiderCommonType(s).getOrElse(StringType) | ||
| } | ||
|
|
||
| override def inputTypes: Seq[AbstractDataType] = if (nullReplacement.isDefined) { | ||
| Seq(ArrayType(StringType), StringType, StringType) | ||
| Seq(ArrayType(commonType), commonType, commonType) | ||
| } else { | ||
| Seq(ArrayType(StringType), StringType) | ||
| Seq(ArrayType(commonType), commonType) | ||
| } |
There was a problem hiding this comment.
This seems a bit too much hustle to get expected input types. I would say these have to not depend on the parameters. Why not use StringTypeAnyCollation?
There was a problem hiding this comment.
One more thing is. This what you do with findWiderCommonType is not correct. If delimiter has explicit collation, we would like to have that collation. In your case you would collect two different collations, where arrayType.elementType would go first and findWiderCommonType practically does not do anything with delimiter's type, but would keep the first element type.
| } | ||
|
|
||
| override def dataType: DataType = StringType | ||
| override def dataType: DataType = commonType |
There was a problem hiding this comment.
Why not use some type of pass through where we would say dataType = array.asInstanceOf[ArrayType].elementType? Implicit casting should solve the problem of collations.
| override def inputTypes: Seq[AbstractDataType] = { | ||
| val arrayType = array.dataType.asInstanceOf[ArrayType].elementType match { | ||
| case _: StringType => array.dataType | ||
| case _ => ArrayType(StringType) | ||
| } | ||
| if (nullReplacement.isDefined) { | ||
| Seq(arrayType, StringTypeAnyCollation, StringTypeAnyCollation) | ||
| } else { | ||
| Seq(arrayType, StringTypeAnyCollation) | ||
| } |
There was a problem hiding this comment.
This looks better, but I am still concerned that we are setting inputTypes according to inputs that we get. If I understand the logic correctly, it should be independent. I would rather leave this as ArrayType only and then override checkInputTypes to do this check you performed.
There was a problem hiding this comment.
@mihailom-db There are already examples of such checks performed in inputTypes, here are few:
I see a potential issue in implicit conversions if I replace it with
ArrayType without StringType information. Is there some kind of explicit conversion which will override this?
There was a problem hiding this comment.
ArrayJoin should be fine with only ArrayType. This is because TypeCoercion rules in FunctionArgumentConversion cast this argument to ArrayType(StringType) and catalyst rules will loop over this and cast it to proper collation if necessary.
| // Binary type data is `byte[]`, which can't use `==` to check equality. | ||
| case _: AtomicType | _: CalendarIntervalType | _: NullType | ||
| if !keyType.isInstanceOf[BinaryType] => new java.util.HashMap[Any, Int]() | ||
| if !keyType.isInstanceOf[BinaryType] && (!keyType.isInstanceOf[StringType] || |
There was a problem hiding this comment.
Let's wait for Collation Refactory PR for this. This should be stringType.isBinaryComparable?
Also, can you push StringType in a separate case. Relying on order evaluation of || is not a good thing (I don't know if scala standard enforces this). E.g. you can end up with keyType.asInstanceOf[StringType].isBinaryCollation evaluating first which will throw if this is not a StringType.
mihailomilosevic2001
left a comment
There was a problem hiding this comment.
Seems good to me in combination with #45383
|
|
||
| override def inputTypes: Seq[AbstractDataType] = if (nullReplacement.isDefined) { | ||
| Seq(ArrayType(StringType), StringType, StringType) | ||
| Seq(ArrayType, StringTypeAnyCollation, StringTypeAnyCollation) |
There was a problem hiding this comment.
what if the input is array of int? Before it will fail with type check, but now it seems different.
There was a problem hiding this comment.
We probably need to override checkInputDataTypes to check the array element.
There was a problem hiding this comment.
@cloud-fan There's a special case in FunctionArgumentConversion which implicitly casts array parameter to array of strings #21620.
There was a problem hiding this comment.
So if I'm not mistaken this behavior was allowed before and it still works as expected.
cloud-fan
left a comment
There was a problem hiding this comment.
LGTM except for one issue
|
thanks, merging to master! |
### What changes were proposed in this pull request? Addition of abstract arraytype which accepts StringTypeCollated as elementType. Changes in this PR #45693 work for ArrayJoin, but will not work in general for other functions. This PR introduces a change to give an interface for all functions. Merge only after #45693. ### Why are the changes needed? This is needed in order to enable functions to use collated arrays. ### Does this PR introduce _any_ user-facing change? Yes, collation functions will work. ### How was this patch tested? Test for array_join added to `CollationSuite` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45891 from mihailom-db/SPARK-47736. Authored-by: Mihailo Milosevic <mihailo.milosevic@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
There was a problem hiding this comment.
Hi, @nikolamand-db , @dbatomic and @cloud-fan .
Unfortunately, this broke master branch ANSI CI.
Please check the following and recover the ANSI CI environment.
$ SPARK_ANSI_SQL_MODE=true build/sbt "sql/testOnly *CollationSuite -- -z complex"
...
[info] - Support operations on complex types containing collated strings *** FAILED *** (144 milliseconds)
[info] java.lang.RuntimeException: Max iterations (100) reached for batch Resolution, please set 'spark.sql.analyzer.maxIterations' to a larger value.
|
Since there are many Anyway, thank you all for working on this feature. |
|
I have a fix for this, will create a followup soon. |
What changes were proposed in this pull request?
Add proper support for complex types containing collated strings in operations reverse, array_join, concat, map (create). Examples:
Why are the changes needed?
To enable listed complex types operations support for collated strings.
Does this PR introduce any user-facing change?
Yes, results of listed complex types operations will return expected results to users.
How was this patch tested?
Added checks to collations suite.
Was this patch authored or co-authored using generative AI tooling?
No.