Conversation
Co-authored-by: comphead <comphead@users.noreply.github.com>
Co-authored-by: comphead <comphead@users.noreply.github.com>
| fn union_generic_lists<OffsetSize: OffsetSizeTrait>( | ||
| l: &GenericListArray<OffsetSize>, | ||
| r: &GenericListArray<OffsetSize>, | ||
| ) -> Result<GenericListArray<OffsetSize>, DataFusionError> { |
There was a problem hiding this comment.
I prefer Result<GenericListArray<OffsetSize>>
| (_, DataType::Null) => Ok(array1.clone()), | ||
| (DataType::List(_), DataType::List(_)) => { | ||
| check_datatypes("array_union", &[&array1, &array2])?; | ||
| let list1 = array1.as_list::<i32>(); |
There was a problem hiding this comment.
I prefer datafusion::common::cast as_list_array and as_large_list_array
There was a problem hiding this comment.
datafusion::common::cast and arrow::array::cast are doing the similar thing. The difference is that common::cast return datafusion error, while arrow::array::cast return Arror error. To me, return datafusion error in datafusion project make much more senses for me. Also, mixing these two casting is quite messy, we can have arrow casting inside common::cast and call common::cast for other crate.
There was a problem hiding this comment.
I just found that arrow::array::cast does not return Result<>
There was a problem hiding this comment.
never mind, I don't have strong opinion on which to use yet.
There was a problem hiding this comment.
Why don't we directly use as_list_array here?
There was a problem hiding this comment.
For the offset i32 branch you mean @Weijun-H ? I like the fact that the code has the same structure for LargeList and List
There was a problem hiding this comment.
Yes, but I think as_list_array can avoid panic because it returns the Return type.
| (_, DataType::Null) => Ok(array1.clone()), | ||
| (DataType::List(_), DataType::List(_)) => { | ||
| check_datatypes("array_union", &[&array1, &array2])?; | ||
| let list1 = array1.as_list::<i32>(); |
There was a problem hiding this comment.
datafusion::common::cast and arrow::array::cast are doing the similar thing. The difference is that common::cast return datafusion error, while arrow::array::cast return Arror error. To me, return datafusion error in datafusion project make much more senses for me. Also, mixing these two casting is quite messy, we can have arrow casting inside common::cast and call common::cast for other crate.
9237fb4 to
4c1bcef
Compare
comphead
left a comment
There was a problem hiding this comment.
Thanks @edmondop for keep improving the PR
Please also test
- different datatypes, fixed len(int, float, etc) / variable(strings, inner arrays)
- nulls
- empty arrays
we can put tests for every type and figure out if this needs to be fixed within this PR or we can create a followup PR
Will do! Are the other cases for empty arrays that aren't there @comphead ? |
Now I can see would be very nice to see |
|
Adding these two tests break the sqltests, but it doesn't seem to be caused by my code. I checked in the backtrace and there is nothing that refers The stacktrace is the following |
array union should have the same data type which including the dimension of array. Union with 2d and 1d is not allowed |
|
|
||
| # array_union scalar function #6 | ||
| query ? | ||
| select array_union([], []); |
There was a problem hiding this comment.
checked in spark
scala> spark.sql("select array_union(array(), array())").show(false)
+-----------------------------+
|array_union(array(), array())|
+-----------------------------+
|[] |
+-----------------------------+
There was a problem hiding this comment.
I am even surprised my code work here tbh, do I need to add an additional branch to pattern matching where both array have null data type and return an empty array?
There was a problem hiding this comment.
I am even surprised my code work here tbh, do I need to add an additional branch to pattern matching where both array have null data type and return an empty array?
I prefer return empty array in this case
There was a problem hiding this comment.
I am even surprised my code work here tbh, do I need to add an additional branch to pattern matching where both array have null data type and return an empty array?
I don't think you should check the input types (they should still be lists, they'll just have an offset buffer of [0, 1]
Maybe you just need to handle the case specially in array_union
|
|
||
| # array_union scalar function #10 | ||
| query ? | ||
| select array_union(null, null); |
There was a problem hiding this comment.
scala> spark.sql("select array_union(array(null), array(null))").show(false)
+-------------------------------------+
|array_union(array(NULL), array(NULL))|
+-------------------------------------+
|[NULL] |
+-------------------------------------+
There was a problem hiding this comment.
This is more interesting, because I am not sure null has offsets in the ListArray, I suppose my code just get into the union function, but it exits immediately with empty results, because the offsets are empty...How would you go about it?
There was a problem hiding this comment.
scala> spark.sql("select array_union(array(null), array(null))").show(false) +-------------------------------------+ |array_union(array(NULL), array(NULL))| +-------------------------------------+ |[NULL] | +-------------------------------------+
array(null) is not the same as null.
array(null) is like array_union scalar function #8
There was a problem hiding this comment.
I tried to add an extra branch in my match case like so:
(DataType::Null, DataType::Null) => {
let data = ArrayData::new_empty(&DataType::Null);
Ok(Arc::new(FixedSizeListArray::from(data)))
} but that also produce a NULL response rather than an empty arary, do you have any guidance @comphead ? It's unclear to me how to get an [] and also a [NULL]
| [hello, datafusion] | ||
|
|
||
|
|
||
|
|
alamb
left a comment
There was a problem hiding this comment.
Thank you @edmondop -- this is a really nice (first 🤯 ) contribution and a great example of a team effort.
I suggest we merge this PR as is, and file a ticket to fix the null handling as a follow on PR. My rationale is that this is a new function, so it isn't introducing a regression and we can iteratively improve it over time
| let r_values = converter.convert_columns(&[r_values])?; | ||
|
|
||
| // Might be worth adding an upstream OffsetBufferBuilder | ||
| let mut offsets = Vec::<OffsetSize>::with_capacity(l.len() + 1); |
There was a problem hiding this comment.
This is a really neat implementation @edmondop.
|
|
||
| # array_union scalar function #6 | ||
| query ? | ||
| select array_union([], []); |
There was a problem hiding this comment.
I am even surprised my code work here tbh, do I need to add an additional branch to pattern matching where both array have null data type and return an empty array?
I don't think you should check the input types (they should still be lists, they'll just have an offset buffer of [0, 1]
Maybe you just need to handle the case specially in array_union
|
Thanks again everyone who contributed to this PR |
Which issue does this PR close?
Closes #6981
Rationale for this change
Support additional array_union SQL keyword
Are these changes tested?
Additional slt tests have been added, obviously they do not pass until duplication issue is solved :)