Fix predicate pushdown for custom SchemaAdapters #15263
Conversation
```rust
#[tokio::test]
async fn test_pushdown_with_missing_column_in_file() {
```
Replacing the unit test with a more end-to-end test that shows that things work as expected.
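For illustration, a minimal sketch of what such an end-to-end test could look like (this is a sketch, not the test added in this PR; the `tempfile` crate, the helper names, and the exact assertions are assumptions): the file on disk only contains `string_col`, while the registered table declares an extra `missing_col`, and a predicate on the missing column gets pushed into the Parquet decoder.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use datafusion::error::Result;
use datafusion::prelude::{ParquetReadOptions, SessionConfig, SessionContext};
use parquet::arrow::ArrowWriter;

#[tokio::test]
async fn pushdown_with_column_missing_from_file() -> Result<()> {
    // Write a Parquet file whose schema is narrower than the table schema
    let file_schema = Arc::new(Schema::new(vec![Field::new(
        "string_col",
        DataType::Utf8,
        true,
    )]));
    let batch = RecordBatch::try_new(
        Arc::clone(&file_schema),
        vec![Arc::new(StringArray::from(vec!["a", "b"])) as ArrayRef],
    )?;
    let tmp = tempfile::NamedTempFile::with_suffix(".parquet")?;
    let mut writer = ArrowWriter::try_new(tmp.reopen()?, file_schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // The table schema declares an extra column that is absent from the file
    let table_schema = Schema::new(vec![
        Field::new("string_col", DataType::Utf8, true),
        Field::new("missing_col", DataType::Int64, true),
    ]);

    // Enable filter pushdown so the predicate is evaluated inside the decoder
    let mut config = SessionConfig::new();
    config.options_mut().execution.parquet.pushdown_filters = true;
    let ctx = SessionContext::new_with_config(config);
    ctx.register_parquet(
        "t",
        tmp.path().to_str().unwrap(),
        ParquetReadOptions::default().schema(&table_schema),
    )
    .await?;

    // The pushed-down predicate references the missing column; it must evaluate
    // against NULLs rather than erroring, so every row passes `IS NULL`
    let batches = ctx
        .sql("SELECT string_col FROM t WHERE missing_col IS NULL")
        .await?
        .collect()
        .await?;
    assert_eq!(batches.iter().map(|b| b.num_rows()).sum::<usize>(), 2);
    Ok(())
}
```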
```rust
// ArrowPredicate::evaluate is passed columns in the order they appear in the file
// If the predicate has multiple columns, we therefore must project the columns based
// on the order they appear in the file
let projection = match candidate.projection.len() {
    0 | 1 => vec![],
    2.. => remap_projection(&candidate.projection),
};
```
I think this is no longer necessary and is handled by the SchemaAdapter. Might be nice to have a test to point to, to confirm.
```diff
 fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
-    let batch = self.schema_mapping.map_partial_batch(batch)?;
+    let batch = self.schema_mapping.map_batch(batch)?;
```
Here is where we ditch `map_partial_batch` in favor of `map_batch`.
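For context, what `map_batch` accomplishes can be hand-rolled roughly like this (a simplified illustration, not the actual `SchemaMapper` code): columns come back in table-schema order, and columns missing from the file are filled with NULLs, which is also why the manual remapping below becomes redundant.

```rust
use std::sync::Arc;

use arrow::array::{new_null_array, ArrayRef};
use arrow::datatypes::SchemaRef;
use arrow::error::Result as ArrowResult;
use arrow::record_batch::RecordBatch;

/// Simplified illustration of what a SchemaMapper's `map_batch` achieves:
/// reorder the file's columns into table-schema order and substitute all-NULL
/// arrays for columns that are missing from the file.
fn map_batch_sketch(batch: &RecordBatch, table_schema: SchemaRef) -> ArrowResult<RecordBatch> {
    let file_schema = batch.schema();
    let columns: Vec<ArrayRef> = table_schema
        .fields()
        .iter()
        .map(|field| match file_schema.index_of(field.name()) {
            // Present in the file: take the column (output order follows the table schema)
            Ok(idx) => Arc::clone(batch.column(idx)),
            // Absent from the file: an all-NULL array of the table-declared type
            Err(_) => new_null_array(field.data_type(), batch.num_rows()),
        })
        .collect();
    RecordBatch::try_new(table_schema, columns)
}
```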
```rust
/// Computes the projection required to go from the file's schema order to the projected
/// order expected by this filter
///
/// Effectively this computes the rank of each element in `src`
fn remap_projection(src: &[usize]) -> Vec<usize> {
```
I believe this is taken care of by SchemaAdapter now 😄. Again it would be nice to be able to point at a (maybe existing) test to confirm. Maybe I need to try removing this on main and confirming which tests break.
Okay, I can confirm that it fails on main if I replace `remap_projection` with a no-op. So I think this change is 👍🏻
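For reference, here is a reconstructed sketch of the rank computation being deleted (it may differ in detail from the actual body): with `src = [3, 0, 5]`, the file yields those columns in ascending index order `(0, 3, 5)`, so restoring the predicate's expected order means taking each element's rank, `[1, 0, 2]`.

```rust
// Reconstructed sketch of the removed rank computation (may differ from the
// actual body). For src = [3, 0, 5] it returns [1, 0, 2]: indexing the batch
// (whose columns arrive in ascending file order 0, 3, 5) with those ranks
// restores the order the predicate expects (3, 0, 5).
fn remap_projection(src: &[usize]) -> Vec<usize> {
    let len = src.len();
    // Sort the indices of `src` by their value, then invert the permutation
    // to obtain each element's rank
    let mut sorted_indexes: Vec<usize> = (0..len).collect();
    sorted_indexes.sort_unstable_by_key(|&i| src[i]);
    let mut projection = vec![0; len];
    for (rank, index) in sorted_indexes.into_iter().enumerate() {
        projection[index] = rank;
    }
    projection
}
```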
```rust
let file_schema = Arc::new(file_schema.clone());
let table_schema = Arc::new(table_schema.clone());
```
We could change the signature of `build_row_filter` since the caller might have an Arc'd version already, but since it's `pub` that would introduce more breaking changes, and the clone seemed cheap enough. Open to doing that though.
I think you can avoid cloning the schema with a pretty simple change. Here is a proposal:
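(The actual proposal is not preserved in this transcript; what follows is a hypothetical sketch of one way to avoid the clones, assuming the function can accept `SchemaRef`s by reference so that "cloning" becomes an `Arc` refcount bump. The trade-off noted above still applies: since `build_row_filter` is `pub`, changing its signature is a breaking change.)

```rust
use std::sync::Arc;

use arrow::datatypes::SchemaRef;

// Hypothetical signature for illustration only: accepting &SchemaRef lets the
// function share the schemas instead of deep-copying them
fn build_row_filter_sketch(file_schema: &SchemaRef, table_schema: &SchemaRef) {
    // Arc::clone is a cheap reference-count increment, not a copy of the Schema
    let file_schema: SchemaRef = Arc::clone(file_schema);
    let table_schema: SchemaRef = Arc::clone(table_schema);
    // ... build the RowFilter against the shared schemas ...
    let _ = (file_schema, table_schema);
}
```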
```rust
// If a column exists in the table schema but not the file schema it should be rewritten to a null expression
#[test]
fn test_filter_candidate_builder_rewrite_missing_column() {
```
See the newly added e2e test.
| .expect("creating filter predicate"); | ||
|
|
||
| let mut parquet_reader = parquet_reader_builder | ||
| .with_projection(row_filter.projection().clone()) |
Moved down because it needs access to the row filter's projection
```diff
 let table_schema = get_basic_table_schema();

-let file_schema = Schema::new(vec![Field::new("str_col", DataType::Utf8, true)]);
+let file_schema =
+    Schema::new(vec![Field::new("string_col", DataType::Utf8, true)]);
```
There is no `str_col` in the data returned by `get_basic_table_schema()`, but there is `string_col`.
cc @jeffreyssmith2nd as the original author of
alamb
left a comment
Thank you @adriangb -- in my mind, it is the mark of a great engineer to fix bugs by deleting code
I think the only thing this PR needs is a few more tests (I specified what they are below). I do think pydantic#9 is worth considering too though.
FYI @itsjunetime who worked on #12135 and @jeffreyssmith2nd who worked on #10716
```rust
/// After visiting all children, rewrite column references to nulls if
/// they are not in the file schema.
/// We do this because they won't be relevant if they're not in the file schema, since that's
/// the only thing we're dealing with here as this is only used for the parquet pushdown during
/// scanning
fn f_up(
    &mut self,
    expr: Arc<dyn PhysicalExpr>,
) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {
```
I agree adding an API for stats on the new column would be 💯
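For context, a simplified sketch of the rewrite being discussed (paraphrased, not the exact implementation): a column reference that is absent from the file schema is replaced with a typed NULL literal, since after schema adaptation that column would be all-NULL anyway. A statistics API for such columns, as suggested above, could let the rewrite produce something smarter than a plain NULL.

```rust
use std::sync::Arc;

use arrow::datatypes::Schema;
use datafusion_common::tree_node::Transformed;
use datafusion_common::{Result, ScalarValue};
use datafusion_physical_expr::expressions::{Column, Literal};
use datafusion_physical_expr::PhysicalExpr;

// Simplified sketch of the rewrite performed in `f_up` (paraphrased): replace
// a column reference that is missing from the file schema with a NULL literal
// of the column's table-declared type
fn rewrite_missing_column(
    expr: Arc<dyn PhysicalExpr>,
    file_schema: &Schema,
    table_schema: &Schema,
) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {
    if let Some(column) = expr.as_any().downcast_ref::<Column>() {
        if file_schema.field_with_name(column.name()).is_err() {
            // Build a NULL ScalarValue of the table-declared type
            let field = table_schema.field_with_name(column.name())?;
            let null = ScalarValue::try_from(field.data_type())?;
            return Ok(Transformed::yes(Arc::new(Literal::new(null))));
        }
    }
    Ok(Transformed::no(expr))
}
```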
I do think in general we need to be correct first, then fast.
As someone once told me, "if you don't constrain it (the compiler) to be correct, I'll make it as fast as you want!"
alamb
left a comment
Thank you @adriangb -- I think this one is ready to go except for the datafusion-testing pin
Without fixing the pin, I think the extended tests are going to fail on main
For example, I think running `INCLUDE_SQLITE=true nice cargo test --profile release-nonlto --test sqllogictests` will error.
Here is a PR to revert the change
```rust
let file_schema = Arc::new(Schema::new(vec
```

See #15220 (comment).