Skip to content
This repository was archived by the owner on Jun 14, 2024. It is now read-only.

Optimize Hybrid Scan for deleted files#215

Merged
imback82 merged 4 commits into
microsoft:masterfrom
sezruby:delete_inset
Oct 17, 2020
Merged

Optimize Hybrid Scan for deleted files#215
imback82 merged 4 commits into
microsoft:masterfrom
sezruby:delete_inset

Conversation

@sezruby

@sezruby sezruby commented Oct 16, 2020

Copy link
Copy Markdown
Contributor

What is the context for this pull request?

What changes were proposed in this pull request?

Apply catalyst's OptimizeIn rule to the injected filter-not-in plan for Hybrid Scan with delete files. (refer #171)

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala#L237

Does this PR introduce any user-facing change?

Yes, performance optimization is done by using "InSet" instead of "In", in case deleted files are larger than spark.sql.optimizer.inSetConversionThreshold config, default value: 10.

Changed plan example (with the threshold value=1):

  +- Project [clicks#684, Date#680]
      +- Filter ((isnotnull(clicks#684) && (clicks#684 <= 4000)) && (clicks#684 >= 2000))
         +- Project [Date#680, clicks#684]
            +- Filter NOT _data_file_name#712 INSET (file:/C:/Users/eunsong/repo/hyperspace2/src/test/resources/data/sampleparquet4/part-00003-6fb13b2d-abcc-46e2-81ba-42b3a933acec-c000.snappy.parquet,file:/C:/Users/eunsong/repo/hyperspace2/src/test/resources/data/sampleparquet4/part-00002-6fb13b2d-abcc-46e2-81ba-42b3a933acec-c000.snappy.parquet)
               +- Relation[Date#680,clicks#684,_data_file_name#712] parquet

How was this patch tested?

Unit test

@sezruby sezruby self-assigned this Oct 16, 2020
@sezruby sezruby added the enhancement New feature or request label Oct 16, 2020
@sezruby

sezruby commented Oct 16, 2020

Copy link
Copy Markdown
Contributor Author

@rapoth @imback82 @pirz @apoorvedave1 Should we increase the max number of deleted files threshould ("spark.hyperspace.index.hybridscan.delete.maxNumDeletedFiles")
with this change? - I think this might requires some performance validation but we could increase it little more (like 30 or 50?) as the spark.sql.optimizer.inSetConversionThreshold is 10 by default.

And please review this change in case you'd like to deliver this PR in this release. Thanks!

@sezruby sezruby changed the title Optimize Hybrid Scan with deleted files Optimize Hybrid Scan for deleted files Oct 16, 2020
@pirz

pirz commented Oct 16, 2020

Copy link
Copy Markdown
Contributor

@rapoth @imback82 @pirz @apoorvedave1 Should we increase the max number of deleted files threshould ("spark.hyperspace.index.hybridscan.delete.maxNumDeletedFiles")
with this change? - I think this might requires some performance validation but we could increase it little more (like 30 or 50?) as the spark.sql.optimizer.inSetConversionThreshold is 10 by default.

And please review this change in case you'd like to deliver this PR in this release. Thanks!

That is a good point @sezruby - I would vote conservative for keeping this value small (i.e. 10) for now to avoid potential performance degrades. We can bump it up later once we see more evidence of it performing as expected with higher values for this threshold. However, @sezruby, as you are conducting the benchmarks for this part, you should really make the final call :).

Comment thread src/main/scala/com/microsoft/hyperspace/index/rules/RuleUtils.scala Outdated
Comment thread src/test/scala/com/microsoft/hyperspace/index/HybridScanTest.scala Outdated
Comment thread src/test/scala/com/microsoft/hyperspace/index/HybridScanTest.scala Outdated
Comment thread src/test/scala/com/microsoft/hyperspace/index/HybridScanTest.scala Outdated
Comment thread src/test/scala/com/microsoft/hyperspace/index/HybridScanTest.scala Outdated
@pirz

pirz commented Oct 17, 2020

Copy link
Copy Markdown
Contributor

LGTM, Thanks @sezruby !

@imback82 imback82 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @sezruby!

@imback82 imback82 merged commit e80cffc into microsoft:master Oct 17, 2020
@sezruby sezruby deleted the delete_inset branch October 19, 2020 05:39
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants