Move MAKE_MAP to ExprPlanner #11452
Conversation
query ?
SELECT MAKE_MAP('POST', 41, 'HEAD', 'ab', 'PATCH', 30);
----
{POST: 41, HEAD: ab, PATCH: 30}
I expected the query would fail because similar behavior isn't allowed in other databases (e.g. DuckDB). However, it seems make_array will coerce the values to find a suitable common type for them. In this case, all of them are converted to utf8.
> select make_array(1,'a',3);
+-----------------------------------------+
| make_array(Int64(1),Utf8("a"),Int64(3)) |
+-----------------------------------------+
| [1, a, 3] |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.004 seconds.
> select arrow_typeof(make_array(1,'a',3));
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Utf8("a"),Int64(3))) |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
I think if DataFusion allows this type of coercion for make_array, we can allow it for make_map too.
I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
I see. Maybe I can create a scalar function make_array_strict that won't implement the coerce_types method of ScalarUDFImpl, but whose other implementation is the same as make_array's.
WDYT?
I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
Instead, can we pass a boolean arg should_coercion with a default value of false to control this behaviour?
I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
Instead, can we pass a boolean arg should_coercion with a default value of false to control this behaviour?
The coercion logic doesn't simply work like an if-else statement. make_array_inner doesn't care about coercion; coercion happens in the type_coercion pass in the analyzer.
Agreed. That's why I planned to implement another scalar function for it.
I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
I did more tests on DuckDB's behavior and found something interesting: DuckDB also tries to coerce types when building arrays or maps.
I arranged some notes on the behaviors:
How DuckDB builds a map
It seems that they also transform the input into two lists and call the map function, just like my first design using make_array.
D select map {1:102, 2:20};
┌───────────────────────────────────────────────────────────┐
│ main.map(main.list_value(1, 2), main.list_value(102, 20)) │
│ map(integer, integer) │
├───────────────────────────────────────────────────────────┤
│ {1=102, 2=20} │
└───────────────────────────────────────────────────────────┘
How DuckDB and DataFusion coerce array types
DuckDB
- Array constructed from INT32 and a numeric string: DuckDB will make it INTEGER[].
D select array[1,2,'3'];
┌────────────────────┐
│ (ARRAY[1, 2, '3']) │
│ int32[] │
├────────────────────┤
│ [1, 2, 3] │
└────────────────────┘
D select typeof(array[1,2,'3']);
┌────────────────────────────┐
│ typeof((ARRAY[1, 2, '3'])) │
│ varchar │
├────────────────────────────┤
│ INTEGER[] │
└────────────────────────────┘
- Array constructed from INT32 and a non-numeric string: DuckDB can't construct the array.
D select array[1,2,'a'];
Conversion Error: Could not convert the string 'a' to INT32
LINE 1: select array[1,2,'a'];
DataFusion
- Array constructed from INT32 and a numeric string: DataFusion will make it a Utf8 array.
> select [1,2,'1'];
+-----------------------------------------+
| make_array(Int64(1),Int64(2),Utf8("1")) |
+-----------------------------------------+
| [1, 2, 1] |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select arrow_typeof([1,2,'1']);
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Int64(2),Utf8("1"))) |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
- Array constructed from INT32 and a non-numeric string: DataFusion will make it a Utf8 array.
> select [1,2,'a'];
+-----------------------------------------+
| make_array(Int64(1),Int64(2),Utf8("a")) |
+-----------------------------------------+
| [1, 2, a] |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select arrow_typeof([1,2,'a']);
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Int64(2),Utf8("a"))) |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
The behavior of type coercion between INT32 and String is really different.
How DuckDB coerces map types
- INT32 value and numeric string value: we can see the value '20' has been converted to 20.
D select map {1:10, 2:'20'};
┌────────────────────────────────────────────────────────────┐
│ main.map(main.list_value(1, 2), main.list_value(10, '20')) │
│ map(integer, integer) │
├────────────────────────────────────────────────────────────┤
│ {1=10, 2=20} │
└────────────────────────────────────────────────────────────┘
- INT32 value and non-numeric string value: this is what I tried the first time, and why I thought it shouldn't be allowed.
D select map {1:10, 2:'abc'};
Conversion Error: Could not convert string 'abc' to INT32
LINE 1: select map {1:10, 2:'abc'};
^
Conclusion
Referring to these behaviors, I think we can just go back to using make_array to implement this. Because the type coercion behavior is different, our make_map can allow map {1:10, 2:'a'} while DuckDB can't. That makes sense to me.
@jayzhan211 WDYT?
Alright, so the behaviour actually depends on the array itself.
I think we can use make_array in this case.
But if we want to introduce a nicer dataframe API map(keys: Vec<Expr>, values: Vec<Expr>), I think we still need to pass Vec<Expr> instead of the result of make_array. However, we can introduce that in another PR.
The current API expects map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])]).
A slightly better API is map(vec![lit("a"), lit("b")], vec![lit(1), lit(2)]).
Alright, so the behaviour actually depends on the array itself. I think we can use make_array in this case.
Ok, I'll roll back to make_array first.
The current API expects map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])]); a slightly better API is map(vec![lit("a"), lit("b")], vec![lit(1), lit(2)]).
I'm not very familiar with the dataframe implementation. Out of curiosity, does the dataframe API also use the map UDF? I think the UDF is a logical-layer function, but we don't have a corresponding logical expression for vec! other than make_array.
The dataframe API is used for building Exprs.
map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])]) is actually like Expr::ScalarFunction(map_udf(), args: ...).
The idea is something like:
fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let args: Vec<Expr> = concat(keys, values);
    Expr::ScalarFunction(map_udf(), args)
}

let keys = make_array(keys);
let values = make_array(values);
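The argument layout of the proposed map(keys, values) dataframe API can be checked with a std-only sketch. Here `String` stands in for `Expr`, and `map_args` is a hypothetical helper name; the real implementation would build an `Expr::ScalarFunction` instead of returning the flat list:

```rust
// Hypothetical sketch: `String` stands in for `datafusion_expr::Expr`.
fn map_args(keys: Vec<String>, values: Vec<String>) -> Vec<String> {
    // Lay out arguments as [k1, .., kn, v1, .., vn], which is what the
    // proposed `map(keys, values)` API would pass to the UDF call.
    keys.into_iter().chain(values).collect()
}
```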
I want to invoke make_array to do the aggregation. That's why I put the implementation in functions-array.
Ideally I think this should be implemented in functions inside core.
Do we have any downside to adding functions-array as a dependency of functions?
Maybe we could move make_array into a functions core feature?
Do we have any downside to adding functions-array as a dependency of functions?
Then you need to import the unnecessary array function crate if you only care about functions.
I guess we can reuse make_array_inner if we move make_array to functions crate.
The alternative is to keep the code here in functions-array
Yes, I think moving make_array to functions is a good idea. It would be beneficial for many scenarios.
Hmm, okay. After some research, I believe it's not easy to move make_array to functions. It's tied to methods in utils.rs and macro.rs, and moving all the required methods to functions could make the codebase chaotic. For now, I prefer to keep them in functions-array. We can do it in another PR.
make_map,
"Returns a map created from the given keys and values pairs. This function isn't efficient for large maps. Use the `map` function instead.",
args,
I'm not sure where we can put this doc. Maybe we can do it as part of #11435.
Agreed. We can document this function in https://datafusion.apache.org/user-guide/sql/scalar_functions.html
return exec_err!("make_map requires an even number of arguments");
}

let (keys, values): (Vec<_>, Vec<_>) = args
It is possible to avoid the clone:
Suggested change (replacing the current `let (keys, values): (Vec<_>, Vec<_>) = args` line):
let (keys, values): (Vec<_>, Vec<_>) = args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
let keys = make_array(keys.into_iter().map(|(_, expr)| expr).collect());
let values = make_array(values.into_iter().map(|(_, expr)| expr).collect());
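The partition-by-index trick in this suggestion can be verified in isolation with plain std types. In this sketch `String` stands in for `Expr`, and `split_interleaved` is a hypothetical name; the real code would then wrap each half in make_array:

```rust
// Split interleaved [k1, v1, k2, v2, ...] into keys and values without
// cloning, by partitioning on the element index. `String` stands in for
// `Expr` here.
fn split_interleaved(args: Vec<String>) -> (Vec<String>, Vec<String>) {
    let (keys, values): (Vec<_>, Vec<_>) =
        args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
    (
        keys.into_iter().map(|(_, e)| e).collect(),
        values.into_iter().map(|(_, e)| e).collect(),
    )
}
```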
}

#[derive(Debug)]
pub struct MakeArrayStrict {
Can we just add a function that converts keys and values to a list of exprs instead of introducing another UDF?
This function is a public function that could be used in datafusion-cli or other projects. Since we are just converting keys to an array, we only need an internal private function for this.
I think the high-level idea is that for
SELECT MAKE_MAP('POST', 41, 'PAST', 33, 'PATCH', 30)
we arrange the args into ['POST', 'PAST', 'PATCH'] and [41, 33, 30], and call
MAP(['POST', 'PAST', 'PATCH'], [41, 33, 30])
I just noticed that we can't directly pass these two arrays to MapFunc 😕
I think we could figure out how to build this with the dataframe API map(keys, values).
The current function is like:
pub fn map($($arg: datafusion_expr::Expr),*) -> datafusion_expr::Expr {
    super::$FUNC().call(vec![$($arg),*])
}
Expected:
pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    ...
}
For this PR, we can just call make_array_inner instead of make_array_strict; we can deal with the rest in another PR.
I still think we should find a way to avoid make_array_strict 🤔
We can change MapFunc first and let it take its arguments as a single Vec<Expr>: the first half is keys, the other half is values.
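The first-half/second-half argument convention can be sketched with std only. `String` stands in for `Expr`, and `split_halves` is a hypothetical helper, not MapFunc's real signature:

```rust
// Split one flat argument list where the first half is keys and the
// second half is values. `String` stands in for `Expr` in this sketch.
fn split_halves(mut args: Vec<String>) -> (Vec<String>, Vec<String>) {
    // `split_off` keeps the first half in `args` and returns the rest,
    // so no element is cloned.
    let values = args.split_off(args.len() / 2);
    (args, values)
}
```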
I played around to make sure the suggestion makes sense: #11526
Thanks! I will check it tonight.
I have some concerns about it. If we make MapFunc accept one array, it would be used like
SELECT map([1,2,3,'a','b','c'])
After planning, the input array would be ['1','2','3','a','b','c'] because of the type coercion for array elements. I think that behavior is wrong. If we change the signature of MapFunc, we might need another implementation to solve it.
Thanks @goldmedal . I will file an issue about the
Thanks @jayzhan211 and @dharanad for reviewing
* move make_map to ExprPlanner
* add benchmark for make_map
* remove todo comment
* update lock
* refactor plan_make_map
* implement make_array_strict for type checking strictly
* fix planner provider
* roll back to `make_array`
* update lock
Which issue does this PR close?
Partially solves #11434
Rationale for this change
The benchmark result:
It's much faster than the previous implementation in #11361. Although the benchmark doesn't invoke the function itself, it covers the bottleneck of the original scalar function: aggregating the keys and values.
Thanks to @jayzhan211 for the nice suggestion.
What changes are included in this PR?
Remove the scalar function make_map, and plan it in ExprPlanner instead.
Are these changes tested?
yes
Are there any user-facing changes?
no