Optimize Hybrid Scan for deleted files #215
Conversation
@rapoth @imback82 @pirz @apoorvedave1 Should we increase the threshold for the maximum number of deleted files ("spark.hyperspace.index.hybridscan.delete.maxNumDeletedFiles")? Also, please review this change if you'd like to deliver this PR in this release. Thanks!
That is a good point @sezruby - I would vote for being conservative and keeping this value small (i.e. 10) for now to avoid potential performance degradation. We can bump it up later once we see more evidence that it performs as expected with higher values for this threshold. However, @sezruby, as you are conducting the benchmarks for this part, you should really make the final call :).
LGTM, thanks @sezruby!
What is the context for this pull request?
What changes were proposed in this pull request?
Apply Catalyst's OptimizeIn rule to the injected filter-not-in plan for Hybrid Scan with deleted files (refer to #171).
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L237
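As a rough illustration of the In to InSet conversion on a plain DataFrame (not the Hyperspace code path itself), the sketch below filters out rows belonging to a set of "deleted" files and prints the plans. The column name fileId and the deleted IDs are hypothetical. The Spark setting spark.sql.optimizer.inSetConversionThreshold is lowered to 1 so that a two-element list already exceeds it. Under these assumptions the optimized plan should show an InSet-based predicate rather than In, though the exact plan text varies by Spark version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object InSetConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("inset-conversion-sketch")
      .getOrCreate()

    // Lower the threshold so that a short literal list already exceeds it.
    spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "1")

    // Hypothetical stand-in for index data: one row per source file ID.
    val indexRows = spark.range(0, 5).toDF("fileId")

    // Hypothetical IDs of source files that were deleted after index creation.
    val deletedFileIds = Seq(1L, 3L)

    // Filter-not-in: keep only rows that do not come from deleted files.
    val remaining = indexRows.filter(!col("fileId").isin(deletedFileIds: _*))

    // Print parsed, analyzed, optimized, and physical plans for inspection.
    remaining.explain(true)
    remaining.show()

    spark.stop()
  }
}
```

Setting the threshold back to its default of 10 and rerunning with the same two IDs should leave the predicate as In, which is a quick way to compare the two plan shapes.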
Does this PR introduce any user-facing change?
Yes. As a performance optimization, "InSet" is used instead of "In" when the number of deleted files exceeds the spark.sql.optimizer.inSetConversionThreshold config (default value: 10).
Changed plan example (with the threshold value=1):
How was this patch tested?
Unit test