Move MAKE_MAP to ExprPlanner #11452
Conversation
query ?
SELECT MAKE_MAP('POST', 41, 'HEAD', 'ab', 'PATCH', 30);
----
{POST: 41, HEAD: ab, PATCH: 30}
I expected the query would fail because similar behavior isn't allowed in other databases (e.g. DuckDB). However, it seems make_array will coerce the values to find a suitable common type for them. In this case, all of them are converted to utf8.
> select make_array(1,'a',3);
+-----------------------------------------+
| make_array(Int64(1),Utf8("a"),Int64(3)) |
+-----------------------------------------+
| [1, a, 3] |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.004 seconds.
> select arrow_typeof(make_array(1,'a',3));
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Utf8("a"),Int64(3))) |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.
I think if DataFusion allows this type of coercion for make_array, we can allow it for make_map too.
I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
I see. Maybe I can create a scalar function make_array_strict that won't implement the coerce_types method of ScalarUDFImpl, but whose other implementation is the same as make_array's.
WDYT?
I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
Instead, can we pass a boolean arg should_coercion with a default value of false to control this behaviour?
I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
Instead, can we pass a boolean arg should_coercion with a default value of false to control this behaviour?
The coercion logic doesn't simply work like an if-else statement. make_array_inner doesn't care about coercion; coercion happens in the type_coercion pass in the analyzer.
Agreed. That's why I planned to implement another scalar function for it.
I think we need another make_array that does not apply coercion. I prefer to align the behaviour with other systems unless there is a good reason not to.
I did more tests on DuckDB's behavior and found something interesting: DuckDB also tries to coerce types when building arrays or maps.
I arranged some notes on the behaviors:
How DuckDB builds a map
It seems that they also transform the input into two lists and call the map function, just like my first design using make_array.
D select map {1:102, 2:20};
┌───────────────────────────────────────────────────────────┐
│ main.map(main.list_value(1, 2), main.list_value(102, 20)) │
│ map(integer, integer) │
├───────────────────────────────────────────────────────────┤
│ {1=102, 2=20} │
└───────────────────────────────────────────────────────────┘
How DuckDB and DataFusion coerce array types
DuckDB
- Array constructed from INT32 and a numeric string: DuckDB will make it INTEGER[].
D select array[1,2,'3'];
┌────────────────────┐
│ (ARRAY[1, 2, '3']) │
│ int32[] │
├────────────────────┤
│ [1, 2, 3] │
└────────────────────┘
D select typeof(array[1,2,'3']);
┌────────────────────────────┐
│ typeof((ARRAY[1, 2, '3'])) │
│ varchar │
├────────────────────────────┤
│ INTEGER[] │
└────────────────────────────┘
- Array constructed from INT32 and a non-numeric string: DuckDB can't construct the array.
D select array[1,2,'a'];
Conversion Error: Could not convert the string 'a' to INT32
LINE 1: select array[1,2,'a'];
DataFusion
- Array constructed from INT32 and a numeric string: DataFusion will make it a Utf8 array.
> select [1,2,'1'];
+-----------------------------------------+
| make_array(Int64(1),Int64(2),Utf8("1")) |
+-----------------------------------------+
| [1, 2, 1] |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select arrow_typeof([1,2,'1']);
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Int64(2),Utf8("1"))) |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
- Array constructed from INT32 and a non-numeric string: DataFusion will make it a Utf8 array.
> select [1,2,'a'];
+-----------------------------------------+
| make_array(Int64(1),Int64(2),Utf8("a")) |
+-----------------------------------------+
| [1, 2, a] |
+-----------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> select arrow_typeof([1,2,'a']);
+-----------------------------------------------------------------------------------------------------------------+
| arrow_typeof(make_array(Int64(1),Int64(2),Utf8("a"))) |
+-----------------------------------------------------------------------------------------------------------------+
| List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) |
+-----------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
The behavior of type coercion between INT32 and String is really different.
How DuckDB coerces map types
- INT32 value and numeric string value: we can see the value '20' has been converted to 20.
D select map {1:10, 2:'20'};
┌────────────────────────────────────────────────────────────┐
│ main.map(main.list_value(1, 2), main.list_value(10, '20')) │
│ map(integer, integer) │
├────────────────────────────────────────────────────────────┤
│ {1=10, 2=20} │
└────────────────────────────────────────────────────────────┘
- INT32 value and non-numeric string value: this is what I tried the first time, and why I thought it shouldn't be allowed.
D select map {1:10, 2:'abc'};
Conversion Error: Could not convert string 'abc' to INT32
LINE 1: select map {1:10, 2:'abc'};
^
Conclusion
Referring to these behaviors, I think we can just go back to using make_array to implement this. Because the type coercion behavior is different, our make_map can allow map {1:10, 2:'a'} while DuckDB can't. That makes sense to me.
@jayzhan211 WDYT?
Alright, so the behaviour actually depends on the array itself.
I think we can use make_array in this case.
But if we want to introduce a nicer dataframe API map(keys: Vec<Expr>, values: Vec<Expr>), I think we still need to pass Vec<Expr> instead of the result of make_array. However, we can introduce that in another PR.
The current API expects map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])]).
A slightly better API is map(vec![lit("a"), lit("b")], vec![lit(1), lit(2)]).
Alright, so the behaviour actually depends on the array itself. I think we can use make_array in this case.
Ok, I'll roll back to make_array first.
The current API expects map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])]); a slightly better API is map(vec![lit("a"), lit("b")], vec![lit(1), lit(2)]).
I'm not very familiar with the dataframe implementation. Out of curiosity, does the dataframe API also use the map UDF? I think the UDF is a logical-layer function, but we don't have a corresponding logical expression for vec! other than make_array.
The dataframe API is used for building Exprs.
map(vec![make_array(vec![lit("a"), lit("b")]), make_array(vec![lit("1"), lit("2")])]) is actually like Expr::ScalarFunction(map_udf(), args: ...).
The idea is something like:
fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    let args: Vec<Expr> = concat(keys, values);
    Expr::ScalarFunction(map_udf(), args)
}

let keys = make_array(keys);
let values = make_array(values);
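The argument layout of the proposed map(keys, values) dataframe API can be checked with a std-only sketch. Here `String` stands in for `Expr`, and `map_args` is a hypothetical helper name; the real implementation would build an `Expr::ScalarFunction` instead of returning the flat list:

```rust
// Hypothetical sketch: `String` stands in for `datafusion_expr::Expr`.
fn map_args(keys: Vec<String>, values: Vec<String>) -> Vec<String> {
    // Lay out arguments as [k1, .., kn, v1, .., vn], which is what the
    // proposed `map(keys, values)` API would pass to the UDF call.
    keys.into_iter().chain(values).collect()
}
```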
I want to invoke make_array to do the aggregation. That's why I put the implementation in functions-array.
Ideally I think this should be implemented in functions inside core.
Do we have any downside to adding functions-array as a dependency of functions?
Maybe we could move make_array into a functions core feature?
Do we have any downside to adding functions-array as a dependency of functions?
Then you need to import the unnecessary array function crate if you only care about functions.
I guess we can reuse make_array_inner if we move make_array to functions crate.
The alternative is to keep the code here in functions-array
Yes, I think moving make_array to functions is a good idea. It would be beneficial for many scenarios.
Hmm, okay. After some research, I believe it's not easy to move make_array to functions. It's tied to methods in utils.rs and macro.rs, and moving all the required methods to functions could make the codebase chaotic. For now, I prefer to keep them in functions-array. We can do it in another PR.
make_map,
"Returns a map created from the given keys and values pairs. This function isn't efficient for large maps. Use the `map` function instead.",
args,
I'm not sure where we can put this doc. Maybe we can do it as part of #11435.
Agreed. We can document this function in https://datafusion.apache.org/user-guide/sql/scalar_functions.html
return exec_err!("make_map requires an even number of arguments");
}

let (keys, values): (Vec<_>, Vec<_>) = args
It is possible to avoid the clone:
Suggested change (replacing the current `let (keys, values): (Vec<_>, Vec<_>) = args` line):
let (keys, values): (Vec<_>, Vec<_>) = args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
let keys = make_array(keys.into_iter().map(|(_, expr)| expr).collect());
let values = make_array(values.into_iter().map(|(_, expr)| expr).collect());
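The partition-by-index trick in this suggestion can be verified in isolation with plain std types. In this sketch `String` stands in for `Expr`, and `split_interleaved` is a hypothetical name; the real code would then wrap each half in make_array:

```rust
// Split interleaved [k1, v1, k2, v2, ...] into keys and values without
// cloning, by partitioning on the element index. `String` stands in for
// `Expr` here.
fn split_interleaved(args: Vec<String>) -> (Vec<String>, Vec<String>) {
    let (keys, values): (Vec<_>, Vec<_>) =
        args.into_iter().enumerate().partition(|(i, _)| i % 2 == 0);
    (
        keys.into_iter().map(|(_, e)| e).collect(),
        values.into_iter().map(|(_, e)| e).collect(),
    )
}
```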
}

#[derive(Debug)]
pub struct MakeArrayStrict {
Can we just add a function that converts keys and values to a list of exprs instead of introducing another UDF?
This function is a public function that could be used in datafusion-cli or other projects. Since we are just converting keys to an array, we only need an internal private function for this.
I think the high-level idea is that for
SELECT MAKE_MAP('POST', 41, 'PAST', 33, 'PATCH', 30)
we arrange the args into ['POST', 'PAST', 'PATCH'] and [41, 33, 30], and call
MAP(['POST', 'PAST', 'PATCH'], [41, 33, 30])
I just noticed that we can't directly pass these two arrays to MapFunc 😕
I think we could figure out how to build this with the dataframe API map(keys, values).
The current function is like:
pub fn map($($arg: datafusion_expr::Expr),*) -> datafusion_expr::Expr {
    super::$FUNC().call(vec![$($arg),*])
}
Expected:
pub fn map(keys: Vec<Expr>, values: Vec<Expr>) -> Expr {
    ...
}
For this PR, we can just call make_array_inner instead of make_array_strict; we can deal with the rest in another PR.
I still think we should find a way to avoid make_array_strict 🤔
We can change MapFunc first and let it take its arguments as a single Vec<Expr>: the first half is keys, the other half is values.
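The first-half/second-half argument convention can be sketched with std only. `String` stands in for `Expr`, and `split_halves` is a hypothetical helper, not MapFunc's real signature:

```rust
// Split one flat argument list where the first half is keys and the
// second half is values. `String` stands in for `Expr` in this sketch.
fn split_halves(mut args: Vec<String>) -> (Vec<String>, Vec<String>) {
    // `split_off` keeps the first half in `args` and returns the rest,
    // so no element is cloned.
    let values = args.split_off(args.len() / 2);
    (args, values)
}
```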
I played around to make sure the suggestion makes sense: #11526
Thanks! I will check it tonight.
I have some concerns about it. If we make MapFunc accept one array, it would be used like
SELECT map([1,2,3,'a','b','c'])
After planning, the input array would be ['1','2','3','a','b','c'] because of the type coercion for array elements. I think that behavior is wrong. If we change the signature of MapFunc, we might need another implementation to solve it.
Thanks @goldmedal . I will file an issue about the
Thanks @jayzhan211 and @dharanad for reviewing
* move make_map to ExprPlanner
* add benchmark for make_map
* remove todo comment
* update lock
* refactor plan_make_map
* implement make_array_strict for type checking strictly
* fix planner provider
* roll back to `make_array`
* update lock
Which issue does this PR close?
Partially solves #11434
Rationale for this change
The benchmark result:
It's much faster than the previous implementation in #11361. Although the benchmark doesn't invoke the function itself, it covers the bottleneck of the original scalar function: aggregating the keys and values.
Thanks to @jayzhan211 for the nice suggestion.
What changes are included in this PR?
Remove the scalar function make_map, and plan it in ExprPlanner instead.
Are these changes tested?
yes
Are there any user-facing changes?
no