GH-43695: [C++][Parquet] flatbuffers metadata integration #48431
Jiayi-Wang-db wants to merge 17 commits into apache:main
Conversation
… instead of relative ones. Implement Geo types. Add Float16 and Variant. Pack statistics better
```cpp
auto To(const format::ColumnMetaData& cm) {
  if (!cm.encoding_stats.empty()) {
    for (auto&& e : cm.encoding_stats) {
      if (e.page_type != format::PageType::DATA_PAGE &&
          e.page_type != format::PageType::DATA_PAGE_V2)
        continue;
      if (e.encoding != format::Encoding::PLAIN_DICTIONARY &&
          e.encoding != format::Encoding::RLE_DICTIONARY) {
        return false;
      }
    }
    return true;
  }
  bool has_plain_dictionary_encoding = false;
  bool has_non_dictionary_encoding = false;
  for (auto encoding : cm.encodings) {
    switch (encoding) {
      case format::Encoding::PLAIN_DICTIONARY:
        // PLAIN_DICTIONARY encoding was present, which means at least one page
        // was dictionary encoded and v1.0 encodings are used.
        has_plain_dictionary_encoding = true;
        break;
      case format::Encoding::RLE:
      case format::Encoding::BIT_PACKED:
        // Other than for boolean values, RLE and BIT_PACKED are only used for
        // repetition or definition levels. Additionally, booleans are not
        // dictionary encoded, hence it is safe to disregard the case where some
        // boolean data pages are dictionary encoded and some boolean pages are
        // RLE/BIT_PACKED encoded.
        break;
      default:
        has_non_dictionary_encoding = true;
        break;
    }
  }
  if (has_plain_dictionary_encoding) {
    // Return true if there are no encodings other than dictionary or
    // repetition/definition levels.
    return !has_non_dictionary_encoding;
  }

  // If PLAIN_DICTIONARY wasn't present, then either the column is not
  // dictionary-encoded, or the 2.0 encoding, RLE_DICTIONARY, was used.
  // For 2.0, this cannot determine whether a page fell back to non-dictionary
  // encoding without page encoding stats.
  return false;
}
```
This is not the same logic as parquet::IsColumnChunkFullyDictionaryEncoded, but it is the same as parquet-mr's DictionaryFilter::hasNonDictionaryPages.
Need advice on what the difference is and which approach to follow.
Could you summarize the difference?
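For illustration, here is a minimal standalone sketch of the encodings-list fallback from the diff above, with a hypothetical stand-in enum instead of the real Thrift-generated `format::Encoding` so it can be exercised on its own. The function name is made up for this example; it is not an API from the PR.

```cpp
#include <vector>

// Hypothetical stand-in for the Thrift-generated encoding enum.
enum class Encoding { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY, RLE, BIT_PACKED };

// Encodings-list heuristic: only conclusive for v1.0 files, where dictionary
// pages advertise PLAIN_DICTIONARY in the column chunk's encodings list.
bool FullyDictionaryEncodedV1(const std::vector<Encoding>& encodings) {
  bool has_plain_dictionary = false;
  bool has_non_dictionary = false;
  for (Encoding e : encodings) {
    switch (e) {
      case Encoding::PLAIN_DICTIONARY:
        has_plain_dictionary = true;
        break;
      case Encoding::RLE:
      case Encoding::BIT_PACKED:
        // Used for repetition/definition levels (or booleans); ignore.
        break;
      default:
        has_non_dictionary = true;
        break;
    }
  }
  // Without PLAIN_DICTIONARY we cannot distinguish RLE_DICTIONARY-without-
  // fallback from a fallback to plain pages, so answer conservatively.
  return has_plain_dictionary && !has_non_dictionary;
}
```

This shows why the encodings-list path must return `false` for v2.0 dictionary columns: the list alone cannot prove no page fell back, which is exactly what page encoding stats provide.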
Great to see things moving here!
### Rationale for this change

Add link to flatbuf footer ticket with proposal.

### What changes are included in this PR?

### Do these changes have PoC implementations?

apache/arrow#48431
cpp/src/parquet/metadata3.h (outdated)
```cpp
// Returns the size of the flatbuffer if found (and writes to out_flatbuffer),
// returns 0 if no flatbuffer extension is present, or returns the required
// buffer size if the input buffer is too small.
::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf,
                                          std::string* out_flatbuffer);
```
Since FileMetaData::Make takes uint32_t as metadata_len, it might make sense to return it here?
Suggested change:

```cpp
::arrow::Result<uint32_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf,
                                            std::string* out_flatbuffer);
```
```cpp
#include "arrow/result.h"
#include "flatbuffers/flatbuffers.h"
#include "generated/parquet3_generated.h"
#include "generated/parquet_types.h"
```
Is metadata3.h meant to be public? If so, this will make the generated Thrift header public as well. Perhaps we could introduce MakeFromFlatbuffer in metadata.h/cc instead so we can use it in file_reader.cc:457.
```cpp
static std::shared_ptr<FileMetaData> MakeFromFlatbuffer(
    const uint8_t* flatbuffer_data,
    size_t flatbuffer_size,
    uint32_t metadata_len,
    const ReaderProperties& properties = default_reader_properties());
```
Some effort is made to not make Thrift structs public; I think we should take the same approach with FlatBuffers.
FYI @emkornfield @prtkgaur if you want to take a look
```cpp
format3::GetFileMetaData(flatbuffer_data.data());
auto thrift_metadata =
    std::make_unique<format::FileMetaData>(FromFlatbuffer(fb_metadata));
file_metadata_ = FileMetaData::Make(
```
FileMetaData is already a wrapper around Thrift; is there a reason we don't have a different implementation that is built purely from the flatbuffer FileMetaData?
```
@@ -0,0 +1,224 @@
namespace parquet.format3;
```
I left comments on the PR for the FBS file in parquet-format; we should resync after those are addressed.
```cpp
void set_footer_read_size(size_t size) { footer_read_size_ = size; }
size_t footer_read_size() const { return footer_read_size_; }

// If enabled, try to read the metadata3 footer from the file.
```
Suggested change:

```cpp
// If enabled, try to read the flatbuffer metadata footer from the file as an
// extension (i.e. a PAR1 file).
```
```cpp
// If it fails, fall back to Thrift footer decoding.
bool read_metadata3() const { return read_metadata3_; }
void set_read_metadata3(bool read_metadata3) { read_metadata3_ = read_metadata3; }
```
I guess we need to finalize a PAR2 or PAR3 footer to be able to write this out without the extension. I think that can be follow-up work, but it would be nice to do it as part of the FBS work to ensure we can eventually move away from Thrift.
```cpp
// If enabled, try to read the metadata3 footer from the file.
// If it fails, fall back to Thrift footer decoding.
bool read_metadata3() const { return read_metadata3_; }
```
Suggested change:

```cpp
bool read_flatbuffer_metadata_if_present() const { return read_metadata3_; }
```
```cpp
bool page_checksum_verification_ = false;
// Used with a RecordReader.
bool read_dense_for_nullable_ = false;
bool read_metadata3_ = false;
```
I think this should default to true? Otherwise I worry about readers not getting the benefit.
```cpp
LZ4_RAW = 7,
};

auto GetNumChildren(
```
Style nit: is auto needed here? Generally we wouldn't use it unless it's needed for templating, etc.
```cpp
auto GetName(const std::vector<format::SchemaElement>& s, size_t i) { return s[i].name; }

class ColumnMap {

  BuildParents(s);
}

size_t ToSchema(size_t cc_idx) const { return colchunk2schema_[cc_idx]; }
```
```cpp
std::vector<uint32_t> parents_;
};

struct MinMax {

uint8_t* const p = reinterpret_cast<uint8_t*>(out.data()) + n + 1;

// Compute and store checksums and lengths
uint32_t crc32 = ::arrow::internal::crc32(
    0, reinterpret_cast<const uint8_t*>(out.data()), n + 1);
```
Is this format documented? (I might have missed it in the parquet-format pull request.)
```cpp
} while (true);
}

inline uint32_t CountLeadingZeros32(uint32_t v) {

return out;
}

inline uint8_t* WriteULEB64(uint64_t v, uint8_t* out) {
```
We should already have something like this for DELTA_BINARY_PACKED, which uses ULEB as well; could you look there?
```cpp
// The extension itself is as follows:
//
// +-------------------+------------+--------------------------------------+----------------+---------+--------------------------------+------+
// | compress(flatbuf) | compressor | crc(compress(flatbuf) .. compressor) | compressed_len | raw_len | crc(compressed_len .. raw_len) | UUID |
// +-------------------+------------+--------------------------------------+----------------+---------+--------------------------------+------+
```
This should be documented in the parquet-format PR.
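A hedged sketch of how a reader might peel the fixed-size tail (compressed_len, raw_len, their CRC, UUID) off the end of the extension bytes. The field widths here, 4-byte little-endian lengths and CRC plus a 16-byte UUID, are assumptions for illustration only; the actual widths are whatever the parquet-format PR specifies.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed field widths (illustrative only): u32 lengths/CRC, 16-byte UUID.
struct ExtensionTrailer {
  uint32_t compressed_len;
  uint32_t raw_len;
  uint32_t len_crc;
  uint8_t uuid[16];
};

// Read the fixed-size tail back-to-front from the end of the extension bytes.
// Note: memcpy into uint32_t assumes a little-endian host.
bool ReadTrailer(const std::vector<uint8_t>& ext, ExtensionTrailer* t) {
  const size_t kTail = 4 + 4 + 4 + 16;
  if (ext.size() < kTail) return false;  // too short to hold the trailer
  const uint8_t* p = ext.data() + ext.size() - kTail;
  std::memcpy(&t->compressed_len, p, 4);
  p += 4;
  std::memcpy(&t->raw_len, p, 4);
  p += 4;
  std::memcpy(&t->len_crc, p, 4);
  p += 4;
  std::memcpy(t->uuid, p, 16);
  return true;
}
```

Reading back-to-front like this is what lets a reader locate `compress(flatbuf)` without parsing the Thrift footer first: the trailer is at a fixed offset from the end.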
### Rationale for this change

Integrate flatbuffers metadata into the Thrift footer.

The detailed design and experiment doc:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?usp=sharing

### What changes are included in this PR?

- Definition of the FlatBuffer footer and the generated FlatBuffer file
- To/FromFlatBuffer functions to convert between the FlatBuffer and Thrift footers
- Append/Extract FlatBuffer to/from the extension field of the Thrift footer
- Use of the append/extract operations based on reader/writer flags

### Are these changes tested?

Yes, with newly added tests.

### Are there any user-facing changes?

Yes, users can write and read the FlatBuffer footer to speed up footer parsing.