support page skipping when using vectorized Parquet reader #15211
Open
lurnagao-dahua wants to merge 2 commits into apache:main
Contributor
Author
Hi team!
Member
@lurnagao-dahua Do you have any benchmark numbers?
Contributor
@wypoon I recall you also had a PR for this before. It didn't get merged then; would you mind sharing why, and maybe taking a look at this one?
Member
I believe it's #10399, which @lurnagao-dahua has also commented on.
Contributor
Author
Benchmark and benchmark result:
Contributor
Author
Hi, I have added a simple benchmark, and the results indicate that it can improve performance. Could you please review it when you have time?
Parquet Column Index is a feature introduced in Parquet 1.11 that allows very efficient filtering at the page level (some benchmark numbers can be found here), especially when data is sorted. The feature is largely implemented in parquet-mr (via classes such as ColumnIndex and ColumnIndexFilter).
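For reviewers less familiar with the idea, here is a minimal sketch of page-level filtering on min/max statistics. It does not use the parquet-mr API: `PageSkipSketch`, `PageStats`, and `rowsToRead` are hypothetical names made up for illustration, and the predicate is assumed to be a simple equality on a `long` column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PageSkipSketch {
    /** Hypothetical per-page statistics, standing in for what a column index records. */
    static final class PageStats {
        final long min;
        final long max;
        final long firstRow; // row-group position of the page's first row
        final long rowCount;

        PageStats(long min, long max, long firstRow, long rowCount) {
            this.min = min;
            this.max = max;
            this.firstRow = firstRow;
            this.rowCount = rowCount;
        }
    }

    /**
     * Returns inclusive row ranges {from, to} for pages whose [min, max]
     * could contain target. Adjacent ranges are merged; other pages are skipped.
     */
    static List<long[]> rowsToRead(List<PageStats> pages, long target) {
        List<long[]> ranges = new ArrayList<>();
        for (PageStats page : pages) {
            if (target < page.min || target > page.max) {
                continue; // no row in this page can match, so skip it
            }
            long from = page.firstRow;
            long to = page.firstRow + page.rowCount - 1;
            int last = ranges.size() - 1;
            if (last >= 0 && ranges.get(last)[1] + 1 == from) {
                ranges.get(last)[1] = to; // extend the previous range
            } else {
                ranges.add(new long[] {from, to});
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Sorted data: each page covers a disjoint value interval.
        List<PageStats> pages = Arrays.asList(
            new PageStats(0, 99, 0, 1000),
            new PageStats(100, 199, 1000, 1000),
            new PageStats(200, 299, 2000, 1000));
        for (long[] range : rowsToRead(pages, 150)) {
            System.out.println(Arrays.toString(range)); // prints [1000, 1999]
        }
    }
}
```

With sorted data the value intervals of the pages do not overlap, so an equality predicate usually selects a single page. This is why the benefit is largest in the sorted case.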
The implementation of this feature was discussed in 193.
The implementation of the vectorized case is based on the implementation in Spark's Parquet reader (see spark-32753), which is in Spark 3.2.
In addition, PositionVectorReader supports position deletion based on readOrderToRowGroupPosMap (ParquetReadState.java#L67).
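To illustrate why such a map is needed: once pages are skipped, the n-th row returned by the reader is no longer the n-th row of the row group, so position deletes need the original position back. The sketch below is not the code in this PR. `ReadOrderPositionSketch`, `readOrderToRowGroupPos`, and `filePosition` are hypothetical names, and it assumes that the surviving rows are given as inclusive row ranges and that the file position is the row group's starting position plus the position within the row group.

```java
import java.util.Arrays;

final class ReadOrderPositionSketch {
    /**
     * Expands inclusive row ranges {from, to} (row-group positions that survive
     * page skipping) into an array indexed by read order.
     */
    static long[] readOrderToRowGroupPos(long[][] ranges) {
        int total = 0;
        for (long[] range : ranges) {
            total += (int) (range[1] - range[0] + 1);
        }
        long[] positions = new long[total];
        int readOrder = 0;
        for (long[] range : ranges) {
            for (long pos = range[0]; pos <= range[1]; pos++) {
                positions[readOrder++] = pos;
            }
        }
        return positions;
    }

    /** File position of the row returned at readOrder (assumed: row group start + offset). */
    static long filePosition(long[] positions, long rowGroupStart, int readOrder) {
        return rowGroupStart + positions[readOrder];
    }

    public static void main(String[] args) {
        // Rows 3..6 of the row group were skipped by page filtering.
        long[][] ranges = {{0, 2}, {7, 9}};
        long[] positions = readOrderToRowGroupPos(ranges);
        System.out.println(Arrays.toString(positions));         // prints [0, 1, 2, 7, 8, 9]
        System.out.println(filePosition(positions, 1000L, 3));  // prints 1007
    }
}
```

A position delete recorded for file position 1007 would then be applied to the fourth row the reader returns, not to the row at read index 7.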
I look forward to a review from anyone interested in this PR, and I welcome anyone willing to co-author it with me so we can improve it together.