Upgrade to DataFusion 13 (784f10bb) / Arrow 25.0.0 by mildbyte · Pull Request #176 · splitgraph/seafowl

mildbyte · 2022-10-27T13:09:44Z

Fixes #173

- Upgrade to a DataFusion revision that's newer than the 13.0.0 release (might be close to the 14 RC) because it picks up Arrow 25. Arrow's schema-to-JSON serialization code was removed between Arrow 22 and 24 (Move JSON Test Format To integration-testing, apache/arrow-rs#2724) and was brought back in a different crate in Arrow 25.
  - We might need to move to a different way of serializing table schemas altogether, but this keeps compatibility with the current Seafowl catalog structure. (I've checked this manually by running against an existing DB, since we don't have tests for this kind of data migration.)
- DataFusion no longer coerces types in binary physical expressions (remove type coercion in the binary physical expr, apache/datafusion#3396). It now does this in the logical query optimizer, using the TypeCoercion optimizer rule. This means we had to make the user-defined Update / Delete nodes visible to the optimizer (otherwise something like DELETE FROM some_table WHERE some_float_value > 42 errors out because 42 isn't a float).
  - Done by making them return the correct expressions() and inputs() (the input is a placeholder TableScan node that we don't use downstream) and by reinitializing the nodes in from_template() (also stripping aliases like the Filter node does: https://github.com/apache/arrow-datafusion/blob/c50573939d21de40e591c04915d41f7c46a51d0d/datafusion/expr/src/utils.rs#L384-L428)
- Port the DataFusion changes that support file compression types in CREATE EXTERNAL TABLE (we had to copy-paste this and the parser code, so we don't pick up upstream changes automatically).
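
To illustrate what a type coercion rewrite does to a predicate like `some_float_value > 42`, here is a minimal sketch on a toy expression tree. The `DataType`, `Expr`, `type_of`, `cast_to` and `coerce` names are hypothetical and are not DataFusion's API; the real `TypeCoercion` rule handles many more cases.

```rust
// Toy model only: these types are hypothetical and are not DataFusion's API.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Float64,
}

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column { name: String, data_type: DataType },
    IntLiteral(i64),
    Cast { expr: Box<Expr>, to: DataType },
    Gt(Box<Expr>, Box<Expr>),
}

/// Type of an expression in this toy model (comparisons are left untyped).
fn type_of(expr: &Expr) -> Option<DataType> {
    match expr {
        Expr::Column { data_type, .. } => Some(data_type.clone()),
        Expr::IntLiteral(_) => Some(DataType::Int64),
        Expr::Cast { to, .. } => Some(to.clone()),
        Expr::Gt(_, _) => None,
    }
}

/// Wrap `expr` in a cast unless it already has the target type.
fn cast_to(expr: Expr, to: &DataType) -> Expr {
    if type_of(&expr).as_ref() == Some(to) {
        expr
    } else {
        Expr::Cast { expr: Box::new(expr), to: to.clone() }
    }
}

/// Rewrite rule: if the two sides of a comparison differ in type,
/// widen both sides to Float64 (the only widening this toy model knows).
fn coerce(expr: Expr) -> Expr {
    match expr {
        Expr::Gt(left, right) => {
            let (left, right) = (coerce(*left), coerce(*right));
            match (type_of(&left), type_of(&right)) {
                (Some(l), Some(r)) if l != r => Expr::Gt(
                    Box::new(cast_to(left, &DataType::Float64)),
                    Box::new(cast_to(right, &DataType::Float64)),
                ),
                _ => Expr::Gt(Box::new(left), Box::new(right)),
            }
        }
        other => other,
    }
}
```

In this sketch the integer literal ends up wrapped in a cast to `Float64`, while the float column is left untouched. A rule like this can only act on expressions the optimizer is able to see, which is why the Update / Delete nodes have to expose theirs.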

The actual 13.0.0 DF release uses Arrow 24.0.0, but we need to pick up 25.0.0, since it brings back the Arrow Schema/Field-to-JSON serialization code (albeit in a different crate for integration tests). See apache/arrow-rs#2868 and apache/arrow-rs#2724.

Make the `Update`/`Delete` nodes expose `inputs` and `expressions` in order to let the DF query optimizer work on the `WHERE ...` / `SET col = expr` expressions. This is slightly hacky:
- as an "input", we return a `TableScan` node that we don't use after that (this is just so that the optimizer knows the input schema for all the expressions)
- return the expressions used by the node and add code to pack/unpack them into a list

The point of this is to let DataFusion run the `TypeCoercion` optimization, without which something like `WHERE float_col > 42` will raise an error (as after DF 13 these type coercions got removed from other places and moved into optimizations).

(NB this doesn't work yet, we still get type coercion errors)
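
The pack/unpack step can be sketched as follows. This is a simplified illustration, not Seafowl's code: the `Update` struct, its field names and the use of `String` as a stand-in for an expression are all assumptions, and the method names only mirror the `expressions` / `from_template` hooks mentioned in this PR.

```rust
// Hypothetical stand-in: the real nodes hold DataFusion expressions, not strings.
type Expr = String;

#[derive(Debug, Clone, PartialEq)]
struct Update {
    /// `SET column = expr` pairs, in statement order.
    assignments: Vec<(String, Expr)>,
    /// Optional `WHERE` predicate.
    selection: Option<Expr>,
}

impl Update {
    /// Pack: all `SET` expressions first, then the predicate (if any) last.
    fn expressions(&self) -> Vec<Expr> {
        let mut exprs: Vec<Expr> =
            self.assignments.iter().map(|(_, e)| e.clone()).collect();
        exprs.extend(self.selection.clone());
        exprs
    }

    /// Unpack: rebuild the node from the (possibly rewritten) expression list,
    /// relying on the ordering produced by `expressions`.
    fn from_template(&self, exprs: &[Expr]) -> Update {
        let n = self.assignments.len();
        assert!(
            exprs.len() == n || exprs.len() == n + 1,
            "unexpected expression count"
        );
        let assignments = self
            .assignments
            .iter()
            .zip(exprs)
            .map(|((column, _), e)| (column.clone(), e.clone()))
            .collect();
        // Anything past the SET expressions is the predicate.
        Update { assignments, selection: exprs.get(n).cloned() }
    }
}
```

Keeping the predicate last means the number of `SET` assignments is enough to split the flat list back into its two parts.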

These expressions are similar to what DataFusion uses in the `Filter` node, and not stripping aliases from them seems to break partition pruning (perhaps it stops at the `Alias` node and doesn't prune anything; I didn't investigate in depth). Copy the `ExprRewriter` visitor from https://github.com/apache/arrow-datafusion/blob/c50573939d21de40e591c04915d41f7c46a51d0d/datafusion/expr/src/utils.rs#L384-L428 and adapt it to remove aliases from all expressions that the query optimizer gives back to `Update`/`Delete` nodes.
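
As a rough illustration of alias stripping, here is a recursive rewrite over a toy expression tree. The `Expr` enum and `strip_aliases` function are hypothetical; the actual change adapts DataFusion's `ExprRewriter` visitor from the link above instead.

```rust
// Toy expression tree; hypothetical, not DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Alias(Box<Expr>, String),
    Gt(Box<Expr>, Box<Expr>),
}

/// Remove every `Alias` wrapper, at any depth, keeping the wrapped expression.
fn strip_aliases(expr: Expr) -> Expr {
    match expr {
        // Drop the alias and keep rewriting what it wrapped.
        Expr::Alias(inner, _) => strip_aliases(*inner),
        // Recurse into both sides of a comparison.
        Expr::Gt(left, right) => Expr::Gt(
            Box::new(strip_aliases(*left)),
            Box::new(strip_aliases(*right)),
        ),
        // Columns and literals have no children to rewrite.
        leaf => leaf,
    }
}
```

After this rewrite, a consumer that only matches on columns, literals and comparisons no longer runs into an `Alias` node partway down the tree.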

mildbyte added 12 commits October 26, 2022 12:34

Remove hashbrown

b2d78ef

hashbrown is now the default HashMap implementation in the Rust standard library and DF's planner uses the standard HashMap as well, so we can use std::collections::HashMap everywhere.

Port DataFusion changes for file compression types

94dc4d4

Fix DataFusion deprecations and symbol moves

efb6187

Fix some expected output change tests

2635b05

Arrow file hash changes and minor changes in the query plan output

Add new df_settings table to expected output

38c0568

Run the query optimizer for UPDATE/DELETE

0c6ece6

(normally it's run only by DataFusion's `create_physical_plan`, but we don't run that, so we have to execute it manually to get auto type coercion working)

Simplify expressions() for Delete

2a27fc2

Add more verbose plan output to Update/Delete

b4cfc90

Include `SET` expressions and the predicate if it exists to aid debugging.

Assert the query plan in update/delete tests

7877070

Make sure the constants are correctly cast and let us detect changes to the optimizer faster with new DF updates.

mildbyte requested a review from gruuya October 27, 2022 13:09

gruuya approved these changes Oct 27, 2022

mildbyte merged commit a280251 into main Oct 27, 2022

mildbyte deleted the upgrade/datafusion-13 branch October 27, 2022 14:37
