ARROW-8494: [C++][Parquet] Full support for reading mixed list and structs#8177
emkornfield wants to merge 23 commits into apache:master
Conversation
cpp/src/parquet/level_conversion.cc
Outdated
We should consider doing the sum of the previous element here. Originally I did not because I thought at some point getting raw lengths would make it easier to handle chunked arrays in reader.cc, but I think that case is esoteric enough that removing the need to touch this data twice will be better.
python/pyarrow/tests/test_parquet.py
Outdated
This was meant for my other PR; I will revert it.
pitrou
left a comment
Ok, thanks a lot for this PR. I think I am understanding the implementation (I skipped parquet/arrow/reader.cc for now, though). Some of the implementation details are still confusing me a bit. In any case, here are some comments.
Hmm... is the comment pointing to some particular detail? It seems a bit cryptic.
sorry. removed.
For the record, is rep_level useful in this test?
Yes. I added a comment about this in level_conversion.cc.
// It is simpler to rely on rep_level here until PARQUET-1899 is done and the code
// is deleted in a follow-up release.
Once this is cleaned up it is not required.
Please make validity_output a uint8_t or a uint8_t[1]. We don't want to encourage endianness issues (I realize this wouldn't happen here because we don't actually test the value of validity_output?).
Hmm... is this supposed to be EXPECT_EQ? I'm curious why/how this line works.
That is a good question; maybe there is an implicit EQ. Fixed.
I was a bit miffed here. Can lengths be renamed offsets?
Yes, it should be. Sorry about that; I changed my mind on the semantics late in this PR and didn't rename.
FTR, I think that if BMI isn't available, you can still use a batch size of 5 or 6 bits and use a fast lookup table for ExtractBits (rather than the probably slow emulation code).
I would need to think about this algorithm a little bit more; my expectation is that we should still be seeing runs of 0s or 1s in most cases. As noted before, if this emulation doesn't work well on an AMD box we can revert to the scalar version.
Yeah, we can think about that for another PR anyway. Will try to run benchmarks.
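For reference, the PEXT-style bit extraction that BMI2 provides in hardware can be emulated in scalar code roughly as follows. This is a sketch of the fallback idea being discussed, not the implementation in this PR:

```cpp
#include <cassert>
#include <cstdint>

// Scalar emulation of BMI2 PEXT: gather the bits of `value` selected by
// `mask` into the low-order bits of the result.
uint64_t ExtractBitsScalar(uint64_t value, uint64_t mask) {
  uint64_t result = 0;
  int out_pos = 0;
  for (; mask != 0; mask &= mask - 1) {  // clear lowest set bit each pass
    uint64_t bit = mask & (~mask + 1);   // isolate lowest set bit
    if (value & bit) result |= uint64_t(1) << out_pos;
    ++out_pos;
  }
  return result;
}
```

A 5- or 6-bit lookup table, as suggested above, would replace the inner loop with a handful of table probes per word.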
Are there any benchmarks worth running here?
Please let me know if there is more confusion and I will attempt to add clarifying comments. I think I addressed all your comments except for some in level_conversion_test.cc; I'll address those tomorrow (I assume there will be more comments in reader.cc as well).
parquet-level-conversion-benchmark
Comment why MinGW is left out?
cpp/src/parquet/level_conversion.h
Outdated
Looks like this include is not used after all?
cpp/src/parquet/level_conversion.cc
Outdated
"platform"
But the comment is mistaken: ARM is little-endian most of the time (technically it supports both, but Linux runs it in little-endian mode AFAIK).
Also, I don't understand why DefLevelsToBitmapScalar is preferred here but DefLevelsToBitmapSimd is preferred below? Don't the same arguments apply?
Added a specific case for little endian.
I added a comment below, but when there is no repeated parent, all platforms should have good SIMD options for converting to bitmap.
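To illustrate the no-repeated-parent case discussed here: each definition level is compared against the required definition level, and the results are packed into a little-endian validity bitmap. A simplified scalar sketch (hypothetical names; the real implementation lives in level_conversion.cc):

```cpp
#include <cassert>
#include <cstdint>

// Pack (def_level >= required_def_level) comparisons into a little-endian
// validity bitmap, one bit per level, LSB first within each byte.
void DefLevelsToBitmapSketch(const int16_t* def_levels, int64_t num_levels,
                             int16_t required_def_level, uint8_t* bitmap) {
  for (int64_t i = 0; i < num_levels; ++i) {
    if (def_levels[i] >= required_def_level) {
      bitmap[i / 8] |= static_cast<uint8_t>(1) << (i % 8);
    }
  }
}
```

Because the comparison for each level is independent when there is no repeated parent, this loop vectorizes well on essentially every platform.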
pitrou
left a comment
Some comments on reader.cc now.
cpp/src/parquet/arrow/reader.cc
Outdated
I was thinking there was some issue with list arrays that always required two elements. I couldn't find the issue though.
cpp/src/parquet/arrow/reader.cc
Outdated
Why do you need to write some values past the offsets end?
changed to resize.
cpp/src/parquet/arrow/reader.cc
Outdated
I'm not sure what this means, shouldn't you know up front the number of values? Do you mean the file was truncated before the row group end (is that supported)?
Or is number_of_slots just an upper bound?
Removed. It is an upper bound; renamed it and removed the comment.
cpp/src/parquet/arrow/reader.cc
Outdated
You mean "of rep levels"? You could have arbitrarily nested structs with a lot of def levels?
rephrased a little bit. There is always an equal number of repetition and definition levels for any particular leaf.
cpp/src/parquet/arrow/reader.cc
Outdated
Isn't this redundant with your if condition above? Also, why does length need to be filled out explicitly below?
(doesn't def_rep_level_child_->GetDefLevels do it?).
Yes I think it is, removed.
cpp/src/parquet/arrow/reader.cc
Outdated
You're dereferencing a null pointer (see if condition above).
yeah, removed. we shouldn't need this.
cpp/src/parquet/arrow/reader.cc
Outdated
Why? Shouldn't you resize the buffer instead?
I didn't think about using resizable buffers. Changed all of the places that were filled to use them.
Impressive!
No changes on
No, I don't think so, since we were already using SIMD for non-nested types.
@pitrou unfortunately, I was missing an "info.rep_level = 1;" in the benchmark, so it is likely not as impressive on AMD. Would you mind running again? (Working on addressing the rest of the feedback.)
emkornfield
left a comment
Still need to refactor level_conversion_test, but I think I forgot to respond to the last review here.
OK, I removed it and placed -mbmi2 specifically only for the one parquet file. I think this should be safe because it is guarded via runtime dispatch.
I might not have been clear, but MSVC doesn't have any way of distinguishing these things, so if we ever turn on AVX2 by default we have the same issue.
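The runtime-dispatch guard mentioned above can be sketched as a function pointer resolved once at first use. `CpuHasBmi2` below is a stand-in stub for a real CPU-feature check (the actual code uses Arrow's CPU detection utilities), and the BMI2 branch would return a `_pext_u64`-based implementation:

```cpp
#include <cassert>
#include <cstdint>

// Scalar fallback implementation of bit extraction (PEXT).
uint64_t ExtractBitsScalarImpl(uint64_t value, uint64_t mask) {
  uint64_t result = 0;
  int out_pos = 0;
  for (; mask != 0; mask &= mask - 1) {
    uint64_t bit = mask & (~mask + 1);  // lowest set bit of mask
    if (value & bit) result |= uint64_t(1) << out_pos;
    ++out_pos;
  }
  return result;
}

// Hypothetical stand-in for a real runtime CPU-feature check.
bool CpuHasBmi2() { return false; }

using ExtractBitsFn = uint64_t (*)(uint64_t, uint64_t);

ExtractBitsFn ResolveExtractBits() {
  if (CpuHasBmi2()) {
    // In the real code this would return a _pext_u64-based version.
    return ExtractBitsScalarImpl;  // placeholder for the BMI2 path
  }
  return ExtractBitsScalarImpl;
}

// Dispatch is resolved once; callers never branch on CPU features again,
// so compiling the BMI2 path with -mbmi2 is safe on non-BMI2 machines.
uint64_t ExtractBits(uint64_t value, uint64_t mask) {
  static const ExtractBitsFn fn = ResolveExtractBits();
  return fn(value, mask);
}
```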
cpp/src/parquet/arrow/reader.cc
Outdated
I moved this above to avoid recursively calling things multiple times, but I think we should be validating at least for structs (and not validating fully), since rep/def level information could be inconsistent within them. It felt easier to call validate (and not too expensive) than writing custom logic for this.
I'm open to removing them, but it feels like there should be a contract here that someplace in this code for an underlying library we validate consistency.
cpp/src/parquet/arrow/reader.cc
Outdated
Yes, I thought this was causing problems with type inference at some point.
@pitrou I think I responded to all review comments at this point; apologies if I missed something. level_conversion_test.cc is now refactored to a point where I think the duplicate code adds to test understandability, but there is still some redundancy. Also, please see my note about the level_conversion benchmark having a bug in it on your prior run.
I think the macOS failure is fixed by #8196 but the Appveyor failure looks legit:
👍 we've reduced the failures to Flight (i.e. surely unrelated) issues.
I'll take a look again on Monday, if that's ok with you.
SGTM. If I have time I might get one or two CLs out based on this one, but I can rebase afterwards.
Running
Things are a bit more balanced if the scalar version is used:
Just for the record, apart from FixedSizeList, is there anything remaining for full nested Parquet -> Arrow reading?
Add export to RunBasedExtract
add PARQUET_EXPORT to GreaterThanBitmap
This reverts commit d19ebc02fec16bc363ba610833ee76d8e9b02668.
Other than that, I see a ~20% improvement on
I have no remaining concern over the code other than the AVX2 / BMI2 split. Congratulations for this PR, this is really a huge improvement! That said, I seem to get a test error on the Python side (pasted below). Let's see if it reproduces on CI. (Beware: I rebased your branch on git master.)
@pitrou thank you for the thoughtful review. Let me know if you still have issues with the AVX2/BMI2 split after I added the comment (perhaps I didn't revert some compilation change), or if my analysis is wrong. I think the BMI2 check will be difficult/impossible at compile time for Windows, so I'm not sure if it is worth the effort on Linux. I also removed the failing test for parquet (which I should have removed in a prior PR; it's strange it showed up again).
We need to support LargeList and Map at the schema inference level, which should be a smaller change (I'm working on a PR). There are a few other JIRAs still open about benchmarking and randomized testing. Past that, there are some open JIRAs about performance improvements:
There is also an unrelated bug on the write side (#8219) which I asked @wesm to review (it is based on some changes in this PR).
Do we also need ad hoc nested tests as a separate JIRA / PR? Randomized testing is nice to find corner cases, but it's always easier to diagnose hand-written test cases :-)
Also, we only have one-level nesting benchmarks for now; I suppose we should add a bit more (two-level nesting may be enough).
I think the internal tests you wrote have pretty good coverage. After #8219 is merged I was planning on making some of the one-way tests (the ones I wrote for write and the ones you wrote for read) fully round-trip. If you think there are gaps, by all means we should add tests.
My main concern is getting some data that reflects real workloads. @jorisvandenbossche it sounded like you had geo data with multiple levels of nesting; I wonder if there is a canonical dataset we could make use of for benchmarking.
Also: ARROW-9810 (generalize rep/def level conversion to list lengths/bitmaps)
This adds helper methods for reconstructing all necessary metadata
for arrow types. For now this doesn't handle null_slot_usage (i.e.
children of FixedSizeList); it throws exceptions when nulls are
encountered in this case. It uses these for generic reconstruction.
The unit tests demonstrate how to use the helper methods in combination
with LevelInfo (generated from parquet/arrow/schema.h) to reconstruct
the metadata. The arrow reader.cc is now rewritten to use these methods.
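As a rough illustration of the LevelInfo-driven reconstruction described above (the struct and field names here are a hypothetical simplification of the real one in the parquet headers): a definition level below the repeated ancestor's level means the slot was swallowed by an empty or null ancestor list, a level at or above the leaf's definition level means a present value, and anything in between is a null:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplification of LevelInfo: just enough schema metadata
// to interpret one definition level for a single leaf column.
struct LevelInfoSketch {
  int16_t def_level;                    // level at which the leaf value is defined
  int16_t repeated_ancestor_def_level;  // below this, an ancestor list was empty/null
};

enum class SlotKind { kValue, kNull, kExcluded };

// Classify what one definition level means for the leaf described by `info`.
SlotKind ClassifyDefLevel(const LevelInfoSketch& info, int16_t def) {
  if (def < info.repeated_ancestor_def_level) return SlotKind::kExcluded;
  if (def >= info.def_level) return SlotKind::kValue;
  return SlotKind::kNull;
}
```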