ARROW-8494: [C++][Parquet] Full support for reading mixed list and structs#8177
emkornfield wants to merge 23 commits into apache:master
Conversation
cpp/src/parquet/level_conversion.cc
Outdated
We should consider doing the sum of the previous element here. Originally I did not because I thought at some point getting raw lengths would make it easier to handle chunked arrays in reader.cc, but I think that case is esoteric enough that removing the need to touch this data twice will be better.
python/pyarrow/tests/test_parquet.py
Outdated
This was meant for my other PR; I will revert it.
pitrou
left a comment
Ok, thanks a lot for this PR. I think I am understanding the implementation (I skipped parquet/arrow/reader.cc for now, though). Some of the implementation details are still confusing me a bit. In any case, here are some comments.
Hmm... is the comment pointing to some particular detail? It seems a bit cryptic.
sorry. removed.
For the record, is rep_level useful in this test?
Yes. I added a comment about this in level_conversion.cc.
// It is simpler to rely on rep_level here until PARQUET-1899 is done and the code
// is deleted in a follow-up release.
Once this is cleaned up it is not required.
Please make validity_output a uint8_t or a uint8_t[1]. We don't want to encourage endianness issues (I realize this wouldn't happen here because we don't actually test the value of validity_output?).
Hmm... is this supposed to be EXPECT_EQ? I'm curious why/how this line works.
That is a good question; maybe there is an implicit EQ. Fixed.
I was a bit miffed here. Can lengths be renamed offsets?
Yes, it should be. Sorry about that; I changed my mind on the semantics late in this PR and didn't rename.
FTR, I think that if BMI isn't available, you can still use a batch size of 5 or 6 bits and use a fast lookup table for ExtractBits (rather than the probably slow emulation code).
I would need to think about this algorithm a little bit more; my expectation is that we should still be seeing runs of 0s or 1s in most cases. As noted before, if this emulation doesn't work well on an AMD box we can revert to the scalar version.
Yeah, we can think about that for another PR anyway. Will try to run benchmarks.
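For reference, the PEXT-style bit extraction that BMI2 provides in hardware can be emulated in scalar code roughly as follows. This is a sketch of the fallback idea being discussed, not the implementation in this PR:

```cpp
#include <cassert>
#include <cstdint>

// Scalar emulation of BMI2 PEXT: gather the bits of `value` selected by
// `mask` into the low-order bits of the result.
uint64_t ExtractBitsScalar(uint64_t value, uint64_t mask) {
  uint64_t result = 0;
  int out_pos = 0;
  for (; mask != 0; mask &= mask - 1) {  // clear lowest set bit each pass
    uint64_t bit = mask & (~mask + 1);   // isolate lowest set bit
    if (value & bit) result |= uint64_t(1) << out_pos;
    ++out_pos;
  }
  return result;
}
```

A 5- or 6-bit lookup table, as suggested above, would replace the inner loop with a handful of table probes per word.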
Are there any benchmarks worth running here?
Please let me know if there is more confusion and I will attempt to add clarifying comments. I think I addressed all your comments except for some in level_conversion_test.cc; I'll address those tomorrow (I assume there will be more comments in reader.cc as well).
parquet-level-conversion-benchmark
Comment why MinGW is left out?
cpp/src/parquet/level_conversion.h
Outdated
Looks like this include is not used after all?
cpp/src/parquet/level_conversion.cc
Outdated
"platform"
But the comment is mistaken: ARM is little-endian most of the time (technically it supports both, but Linux runs it in little-endian mode AFAIK).
Also, I don't understand why DefLevelsToBitmapScalar is preferred here but DefLevelsToBitmapSimd is preferred below? Don't the same arguments apply?
Added a specific case for little endian.
I added a comment below, but when there is no repeated parent, all platforms should have good SIMD options for converting to bitmap.
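To illustrate the no-repeated-parent case discussed here: each definition level is compared against the required definition level, and the results are packed into a little-endian validity bitmap. A simplified scalar sketch (hypothetical names; the real implementation lives in level_conversion.cc):

```cpp
#include <cassert>
#include <cstdint>

// Pack (def_level >= required_def_level) comparisons into a little-endian
// validity bitmap, one bit per level, LSB first within each byte.
void DefLevelsToBitmapSketch(const int16_t* def_levels, int64_t num_levels,
                             int16_t required_def_level, uint8_t* bitmap) {
  for (int64_t i = 0; i < num_levels; ++i) {
    if (def_levels[i] >= required_def_level) {
      bitmap[i / 8] |= static_cast<uint8_t>(1) << (i % 8);
    }
  }
}
```

Because the comparison for each level is independent when there is no repeated parent, this loop vectorizes well on essentially every platform.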
pitrou
left a comment
Some comments on reader.cc now.
cpp/src/parquet/arrow/reader.cc
Outdated
I was thinking there was some issue with list arrays that always required two elements. I couldn't find the issue though.
cpp/src/parquet/arrow/reader.cc
Outdated
Why do you need to write some values past the offsets end?
changed to resize.
cpp/src/parquet/arrow/reader.cc
Outdated
I'm not sure what this means, shouldn't you know up front the number of values? Do you mean the file was truncated before the row group end (is that supported)?
Or is number_of_slots just an upper bound?
Removed. It is an upper bound; renamed it and removed the comment.
cpp/src/parquet/arrow/reader.cc
Outdated
You mean "of rep levels"? You could have arbitrarily nested structs with a lot of def levels?
rephrased a little bit. There is always an equal number of repetition and definition levels for any particular leaf.
cpp/src/parquet/arrow/reader.cc
Outdated
Isn't this redundant with your if condition above? Also, why does length need to be filled out explicitly below?
(doesn't def_rep_level_child_->GetDefLevels do it?).
Yes I think it is, removed.
cpp/src/parquet/arrow/reader.cc
Outdated
You're dereferencing a null pointer (see if condition above).
yeah, removed. we shouldn't need this.
cpp/src/parquet/arrow/reader.cc
Outdated
Why? Shouldn't you resize the buffer instead?
I didn't think about using resizable buffers. Changed all of the places that were filled to use them.
Impressive!
No changes on
No, I don't think so, since we were already using SIMD for non-nested types.
@pitrou unfortunately, I was missing an "info.rep_level = 1;" in the benchmark, so it is likely not as impressive on AMD. Would you mind running again? (Working on addressing the rest of the feedback.)
emkornfield
left a comment
Still need to refactor level_conversion_test, but I think I forgot to respond to the last review here.
OK, I removed it and placed -mbmi2 specifically only for the one parquet file. I think this should be safe because it is guarded via runtime dispatch.
I might not have been clear, but MSVC doesn't have any way of distinguishing these things, so if we ever turn on AVX2 by default we have the same issue.
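The runtime-dispatch guard mentioned above can be sketched as a function pointer resolved once at first use. `CpuHasBmi2` below is a stand-in stub for a real CPU-feature check (the actual code uses Arrow's CPU detection utilities), and the BMI2 branch would return a `_pext_u64`-based implementation:

```cpp
#include <cassert>
#include <cstdint>

// Scalar fallback implementation of bit extraction (PEXT).
uint64_t ExtractBitsScalarImpl(uint64_t value, uint64_t mask) {
  uint64_t result = 0;
  int out_pos = 0;
  for (; mask != 0; mask &= mask - 1) {
    uint64_t bit = mask & (~mask + 1);  // lowest set bit of mask
    if (value & bit) result |= uint64_t(1) << out_pos;
    ++out_pos;
  }
  return result;
}

// Hypothetical stand-in for a real runtime CPU-feature check.
bool CpuHasBmi2() { return false; }

using ExtractBitsFn = uint64_t (*)(uint64_t, uint64_t);

ExtractBitsFn ResolveExtractBits() {
  if (CpuHasBmi2()) {
    // In the real code this would return a _pext_u64-based version.
    return ExtractBitsScalarImpl;  // placeholder for the BMI2 path
  }
  return ExtractBitsScalarImpl;
}

// Dispatch is resolved once; callers never branch on CPU features again,
// so compiling the BMI2 path with -mbmi2 is safe on non-BMI2 machines.
uint64_t ExtractBits(uint64_t value, uint64_t mask) {
  static const ExtractBitsFn fn = ResolveExtractBits();
  return fn(value, mask);
}
```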
cpp/src/parquet/arrow/reader.cc
Outdated
I moved this above to avoid recursively calling things multiple times, but I think we should be validating at least for structs (and not validating fully), since rep/def level information could be inconsistent within them. It felt easier to call validate (and not too expensive) than writing custom logic for this.
I'm open to removing them, but it feels like there should be a contract here that someplace in this code for an underlying library we validate consistency.
cpp/src/parquet/arrow/reader.cc
Outdated
Yes, I thought this was causing problems with type inference at some point.
@pitrou I think I responded to all review comments at this point; apologies if I missed something. level_conversion_test.cc is now refactored to a point where I think the duplicate code adds to test understandability, but there is still some redundancy. Also, please see my note about the level_conversion benchmark having a bug in it on your prior run.
I think the macOS failure is fixed by #8196 but the Appveyor failure looks legit:
👍 we've reduced the failures to Flight (i.e. surely unrelated) issues.
I'll take a look again on Monday, if that's ok with you.
SGTM. If I have time I might get one or two CLs out based on this one, but I can rebase afterwards.
Running
Things are a bit more balanced if the scalar version is used:
Just for the record, apart from FixedSizeList, is there anything remaining for full nested Parquet -> Arrow reading?
Add export to RunBasedExtract
add PARQUET_EXPORT to GreaterThanBitmap
This reverts commit d19ebc02fec16bc363ba610833ee76d8e9b02668.
Other than that, I see a ~20% improvement on
I have no remaining concern over the code other than the AVX2 / BMI2 split. Congratulations for this PR, this is really a huge improvement! That said, I seem to get a test error on the Python side (pasted below). Let's see if it reproduces on CI. (Beware: I rebased your branch on git master.)
@pitrou thank you for the thoughtful review. Let me know if you still have issues with the AVX2/BMI2 split after I added the comment (perhaps I didn't revert some compilation change), or if my analysis is wrong. I think the BMI2 check will be difficult/impossible at compile time for Windows, so I'm not sure if it is worth the effort on Linux. I also removed the failing test for parquet (which I should have removed in a prior PR; it's strange it showed up again).
We need to support LargeList and Map at the schema inference level, which should be a smaller change (I'm working on a PR). There are a few other JIRAs still open about benchmarking and randomized testing. Past that, there are some open JIRAs about performance improvements:
There is also an unrelated bug on the write side (#8219) which I asked @wesm to review (it is based on some changes in this PR).
Do we also need ad hoc nested tests as a separate JIRA / PR? Randomized testing is nice to find corner cases, but it's always easier to diagnose hand-written test cases :-)
Also, we only have one-level nesting benchmarks for now; I suppose we should add a bit more (two-level nesting may be enough).
I think the internal tests you wrote have pretty good coverage. After #8219 is merged I was planning on making some of the one-way tests (the ones I wrote for write and the ones you wrote for read) fully round-trip. If you think there are gaps, by all means we should add tests.
My main concern is getting some data that reflects real workloads. @jorisvandenbossche it sounded like you had geo data with multiple levels of nesting; I wonder if there is a canonical dataset we could make use of for benchmarking.
Also: ARROW-9810 (generalize rep/def level conversion to list lengths/bitmaps)
This adds helper methods for reconstructing all necessary metadata
for arrow types. For now this doesn't handle null_slot_usage (i.e.
children of FixedSizeList); it throws exceptions when nulls are
encountered in this case. It uses these for generic reconstruction.
The unit tests demonstrate how to use the helper methods in combination
with LevelInfo (generated from parquet/arrow/schema.h) to reconstruct
the metadata. The arrow reader.cc is now rewritten to use these methods.
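As a rough illustration of the LevelInfo-driven reconstruction described above (the struct and field names here are a hypothetical simplification of the real one in the parquet headers): a definition level below the repeated ancestor's level means the slot was swallowed by an empty or null ancestor list, a level at or above the leaf's definition level means a present value, and anything in between is a null:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplification of LevelInfo: just enough schema metadata
// to interpret one definition level for a single leaf column.
struct LevelInfoSketch {
  int16_t def_level;                    // level at which the leaf value is defined
  int16_t repeated_ancestor_def_level;  // below this, an ancestor list was empty/null
};

enum class SlotKind { kValue, kNull, kExcluded };

// Classify what one definition level means for the leaf described by `info`.
SlotKind ClassifyDefLevel(const LevelInfoSketch& info, int16_t def) {
  if (def < info.repeated_ancestor_def_level) return SlotKind::kExcluded;
  if (def >= info.def_level) return SlotKind::kValue;
  return SlotKind::kNull;
}
```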