Parquet: Use native getRowIndexOffset support instead of calculating it #11520
flyrain merged 8 commits into apache:main
@szehon-ho @flyrain can you please review?
  List<TripleIterator<?>> columns();

  /**
   * @deprecated since 1.6.0, will be removed in 1.7.0; use setPageSource(PageReadStore) instead.
Suggested change:
-  * @deprecated since 1.6.0, will be removed in 1.7.0; use setPageSource(PageReadStore) instead.
+  * @deprecated since 1.8.0, will be removed in 1.9.0 or 2.0.0; use setPageSource(PageReadStore) instead.
In my change, I put down 1.9.0. If there is no 1.9.0, and the methods are removed in 2.0.0 instead, I don't think that would be a problem. On the other hand, we don't want to give the impression that the methods could still exist in a 1.9.0 and might be removed in 2.0.0 instead.
   * @param pages row group information for all the columns
   * @param metadata map of {@link ColumnPath} -> {@link ColumnChunkMetaData} for the row group
   * @param rowPosition the row group's row offset in the parquet file
   * @deprecated since 1.6.0, will be removed in 1.7.0; use setRowGroupInfo(PageReadStore,
Suggested change:
-  * @deprecated since 1.6.0, will be removed in 1.7.0; use setRowGroupInfo(PageReadStore,
+  * @deprecated since 1.8.0, will be removed in 1.9.0 or 2.0.0; use setRowGroupInfo(PageReadStore,
@Fokko thanks for reviewing!
  public void setRowGroupInfo(
      PageReadStore pageStore, Map<ColumnPath, ColumnChunkMetaData> metaData) {
    super.setRowGroupInfo(pageStore, metaData);
    this.rowStartPosInBatch = pageStore.getRowIndexOffset().orElse(0L);
If pageStore.getRowIndexOffset() is empty, does that mean getRowIndexOffset() would otherwise return a negative value? Should we throw an exception instead of defaulting to 0?
That is a good question.
As I understand it, the PageReadStore implementation (ColumnChunkPageReadStore) is normally constructed with the rowIndexOffset, but if the offset is not available then it is constructed with -1 for the rowIndexOffset. PageReadStore::getRowIndexOffset() will not return a negative value; it will return Optional.empty() in that case.
I suppose we can throw an IllegalArgumentException in such a situation instead of setting rowStartPosInBatch to 0.
@flyrain do you have an opinion on this?
Is there someone who knows Parquet well who can confirm that in normal operation, PageReadStore::getRowIndexOffset() should not return Optional.empty()?
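To make the two options in this thread concrete, here is a minimal sketch contrasting `orElse(0L)` with `orElseThrow(...)`. It does not use Parquet itself: `PageSource` is a hypothetical stand-in for the one `PageReadStore` method discussed here, and the exception message is an assumption, not the text used in the PR.

```java
import java.util.Optional;

public class RowIndexOffsetSketch {
  // Hypothetical stand-in for the single PageReadStore method discussed above.
  public interface PageSource {
    Optional<Long> getRowIndexOffset();
  }

  // Lenient option: a missing offset is silently treated as row 0.
  public static long offsetOrZero(PageSource source) {
    return source.getRowIndexOffset().orElse(0L);
  }

  // Strict option: a missing offset is reported instead of being defaulted.
  // The message text is illustrative only.
  public static long offsetOrThrow(PageSource source) {
    return source
        .getRowIndexOffset()
        .orElseThrow(
            () -> new IllegalArgumentException("page source has no row index offset"));
  }

  public static void main(String[] args) {
    PageSource withOffset = () -> Optional.of(1000L);
    PageSource withoutOffset = Optional::empty;

    System.out.println(offsetOrZero(withOffset));
    System.out.println(offsetOrZero(withoutOffset));
    try {
      offsetOrThrow(withoutOffset);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

With the lenient variant, a row group whose offset is unavailable would be indistinguishable from the first row group in the file, which is the concern behind the question above.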
@huaxingao @Fokko I have updated the PR; please review again.
            .getRowIndexOffset()
            .orElseThrow(
                () ->
                    new IllegalArgumentException(
nit: Is there a better Exception than IllegalArgumentException? Is IllegalStateException a bit better?
I can see why you might consider IllegalStateException. I do think IllegalArgumentException is appropriate, because the PageReadStore is an argument to the method being called, and the problem is with the PageReadStore. I think IllegalStateException is typically used to indicate an internal inconsistency in the module.
Consider if we use Guava's Preconditions to check a condition here. The condition would be that source.getRowIndexOffset().isPresent(). The checkArgument methods throw IllegalArgumentException and "Ensures the truth of an expression involving one or more parameters to the calling method." The checkState methods throw IllegalStateException and "Ensures the truth of an expression involving the state of the calling instance, but not involving any parameters to the calling method." (my emphasis)
Of course, that just expresses the opinions of the authors of Guava. There are others who might argue that IllegalStateException is appropriate here, or neither IllegalStateException nor IllegalArgumentException. It really doesn't matter too much. I can just throw RuntimeException if you do not agree with IllegalArgumentException.
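The distinction drawn from the Guava documentation can be sketched as follows. To keep the example dependency-free, `checkArgument` and `checkState` below are local stand-ins that mirror the behavior of Guava's `Preconditions` methods of the same names; the `Reader` class and its fields are hypothetical and are not Iceberg code.

```java
import java.util.Optional;

public class PreconditionSketch {
  // Stand-in mirroring Guava's Preconditions.checkArgument:
  // validates something the caller passed in.
  public static void checkArgument(boolean expression, String message) {
    if (!expression) {
      throw new IllegalArgumentException(message);
    }
  }

  // Stand-in mirroring Guava's Preconditions.checkState:
  // validates the state of the instance being called.
  public static void checkState(boolean expression, String message) {
    if (!expression) {
      throw new IllegalStateException(message);
    }
  }

  // Hypothetical reader, used only to show where each check applies.
  public static class Reader {
    private boolean closed = false;
    private long rowStart = -1L;

    public void setPageSource(Optional<Long> rowIndexOffset) {
      // Problem with this instance, independent of the parameter.
      checkState(!closed, "reader is closed");
      // Problem with the parameter supplied by the caller.
      checkArgument(rowIndexOffset.isPresent(), "page source has no row index offset");
      this.rowStart = rowIndexOffset.get();
    }

    public void close() {
      closed = true;
    }

    public long rowStart() {
      return rowStart;
    }
  }

  public static void main(String[] args) {
    Reader reader = new Reader();
    reader.setPageSource(Optional.of(500L));
    System.out.println(reader.rowStart());
  }
}
```

Under this reading, a missing row index offset is a defect in the argument, which is the case for IllegalArgumentException made above.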
@Fokko can you help merge this if you have no further feedback?
Hi @Fokko, do you have any further feedback?
@Fokko @flyrain Should we merge this PR? I am waiting for it to be merged so I can clean up my temporary code
I will merge it if there is no new comment by EOD.
szehon-ho left a comment
Very small comment, lgtm otherwise
  List<TripleIterator<?>> columns();

  /**
   * @deprecated since 1.8.0, will be removed in 1.9.0; use setPageSource(PageReadStore) instead.
Nit: typically we use a link?
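For illustration, the linked form of the deprecation note might look like the sketch below. The interface, the `PageStore` type, and the delegation from the old overload to the new one are all assumptions made for this example; the thread does not show how the PR actually wired the two methods together.

```java
public class DeprecationLinkSketch {
  // Hypothetical stand-in for Parquet's PageReadStore.
  public interface PageStore {}

  // Hypothetical reader interface with an old and a new overload.
  public interface ColumnReader {
    /**
     * @deprecated since 1.8.0, will be removed in 1.9.0; use {@link #setPageSource(PageStore)}
     *     instead.
     */
    @Deprecated
    default void setPageSource(PageStore source, long rowPosition) {
      // Assumed behavior for the sketch: ignore rowPosition and delegate.
      setPageSource(source);
    }

    void setPageSource(PageStore source);
  }

  public static void main(String[] args) {
    ColumnReader reader = source -> System.out.println("new overload called");
    reader.setPageSource(new PageStore() {}, 42L);
  }
}
```

Using `{@link ...}` lets Javadoc tooling resolve the replacement method, so the reference is checked and clickable rather than plain text.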
Thanks @wypoon for working on it. Thanks @huaxingao @Fokko @szehon-ho for the review.
There are two Iceberg PRs that "broke" NesQuEIT:
* apache/iceberg#11478 caused `testRewriteManifests` to fail due to the changed outcome of the `rewrite_manifests` procedure
* apache/iceberg#11520 caused a class-path issue w/ Scala 2.13
Workaround for apache/iceberg#11520 that caused a class-path issue w/ Scala 2.13