
GH-43695: [C++][Parquet] flatbuffers metadata integration #48431

Open
Jiayi-Wang-db wants to merge 17 commits into apache:main from Jiayi-Wang-db:flatbuf3

Conversation

Jiayi-Wang-db commented Dec 10, 2025

Rationale for this change

Integrate FlatBuffers metadata into the Thrift footer.
The detailed design and experiment doc:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?usp=sharing

What changes are included in this PR?

  • Definition of the FlatBuffer footer and the generated FlatBuffer file

  • To/FromFlatBuffer functions to convert between FlatBuffer and Thrift footer

  • Append/Extract FlatBuffer to/from the extension field of the Thrift footer

  • Use append/extract operations based on reader/writer flags

Are these changes tested?

Yes, with newly added tests.

Are there any user-facing changes?

Yes, users can write and read the FlatBuffer footer to speed up footer parsing.

Comment on lines +874 to +919
auto To(const format::ColumnMetaData& cm) {
  if (!cm.encoding_stats.empty()) {
    for (auto&& e : cm.encoding_stats) {
      if (e.page_type != format::PageType::DATA_PAGE &&
          e.page_type != format::PageType::DATA_PAGE_V2)
        continue;
      if (e.encoding != format::Encoding::PLAIN_DICTIONARY &&
          e.encoding != format::Encoding::RLE_DICTIONARY) {
        return false;
      }
    }
    return true;
  }
  bool has_plain_dictionary_encoding = false;
  bool has_non_dictionary_encoding = false;
  for (auto encoding : cm.encodings) {
    switch (encoding) {
      case format::Encoding::PLAIN_DICTIONARY:
        // PLAIN_DICTIONARY encoding was present, which means at
        // least one page was dictionary encoded and v1.0 encodings are used.
        has_plain_dictionary_encoding = true;
        break;
      case format::Encoding::RLE:
      case format::Encoding::BIT_PACKED:
        // Other than for boolean values, RLE and BIT_PACKED are only used for
        // repetition or definition levels. Additionally booleans are not dictionary
        // encoded hence it is safe to disregard the case where some boolean data pages
        // are dictionary encoded and some boolean pages are RLE/BIT_PACKED encoded.
        break;
      default:
        has_non_dictionary_encoding = true;
        break;
    }
  }
  if (has_plain_dictionary_encoding) {
    // Return true, if there are no encodings other than dictionary or
    // repetition/definition levels.
    return !has_non_dictionary_encoding;
  }

  // If PLAIN_DICTIONARY wasn't present, then either the column is not
  // dictionary-encoded, or the 2.0 encoding, RLE_DICTIONARY, was used.
  // For 2.0, this cannot determine whether a page fell back to non-dictionary encoding
  // without page encoding stats.
  return false;
}
Author

This is not the same logic as parquet::IsColumnChunkFullyDictionaryEncoded, but it is the same as parquet-mr's DictionaryFilter::HasNonDictionaryPages.
I need advice on what the difference is and which approach to follow.

Contributor

Could you summarize the difference?
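
One way to ground the comparison is to pin down what the `cm.encodings` fallback path in the snippet above returns for a few inputs. The following is a minimal standalone sketch of that path only. `EncodingSketch` and `AllDataPagesDictEncodedFromEncodings` are hypothetical stand-ins introduced here for illustration; they are not the Thrift-generated `format::Encoding` or any function in the PR.

```cpp
#include <vector>

// Hypothetical stand-in for format::Encoding. Enumerator names follow the
// snippet under review; the numeric values are arbitrary.
enum class EncodingSketch {
  PLAIN,
  PLAIN_DICTIONARY,
  RLE,
  BIT_PACKED,
  RLE_DICTIONARY,
  DELTA_BINARY_PACKED
};

// Mirrors only the fallback branch of the snippet, i.e. the case where no
// page encoding stats are available and only the encodings list can be used.
inline bool AllDataPagesDictEncodedFromEncodings(
    const std::vector<EncodingSketch>& encodings) {
  bool has_plain_dictionary_encoding = false;
  bool has_non_dictionary_encoding = false;
  for (EncodingSketch encoding : encodings) {
    switch (encoding) {
      case EncodingSketch::PLAIN_DICTIONARY:
        has_plain_dictionary_encoding = true;
        break;
      case EncodingSketch::RLE:
      case EncodingSketch::BIT_PACKED:
        // Treated as level encodings and ignored, as in the snippet.
        break;
      default:
        // Note: RLE_DICTIONARY also lands here, exactly as in the snippet.
        has_non_dictionary_encoding = true;
        break;
    }
  }
  // Without PLAIN_DICTIONARY the snippet conservatively answers false.
  return has_plain_dictionary_encoding && !has_non_dictionary_encoding;
}
```

Under this sketch, `{PLAIN_DICTIONARY, RLE}` yields true, while `{RLE_DICTIONARY, RLE}` yields false because the 2.0 case cannot be decided without page encoding stats.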

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Dec 11, 2025
Member

rok commented Dec 11, 2025

Great to see things moving here!

alkis added a commit to alkis/parquet-format that referenced this pull request Dec 12, 2025
### Rationale for this change

Add link to flatbuf footer ticket with proposal.

### What changes are included in this PR?


### Do these changes have PoC implementations?

apache/arrow#48431
Fokko pushed a commit to apache/parquet-format that referenced this pull request Dec 12, 2025
### Rationale for this change

Add link to flatbuf footer ticket with proposal.

### What changes are included in this PR?


### Do these changes have PoC implementations?

apache/arrow#48431
// Returns the size of the flatbuffer if found (and writes to out_flatbuffer),
// returns 0 if no flatbuffer extension is present, or returns the required
// buffer size if the input buffer is too small.
::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer);
Member

Since FileMetaData::Make takes metadata_len as a uint32_t, it might make sense to return uint32_t here?

Suggested change
- ::arrow::Result<size_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer);
+ ::arrow::Result<uint32_t> ExtractFlatbuffer(std::shared_ptr<Buffer> buf, std::string* out_flatbuffer);
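
If the size is still computed internally as a `size_t`, switching the return type to `uint32_t` implies a narrowing step somewhere. A minimal sketch of a checked narrowing is below; `NarrowToUint32` is a hypothetical helper named here for illustration, and real Arrow code would more likely report the overflow through its own status type than through `std::optional`.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper: converts a size_t to uint32_t, or returns nullopt
// when the value does not fit instead of silently truncating it.
inline std::optional<uint32_t> NarrowToUint32(size_t n) {
  if (n > std::numeric_limits<uint32_t>::max()) {
    return std::nullopt;
  }
  return static_cast<uint32_t>(n);
}
```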

#include "arrow/result.h"
#include "flatbuffers/flatbuffers.h"
#include "generated/parquet3_generated.h"
#include "generated/parquet_types.h"
Member

rok Jan 20, 2026

Is metadata3.h meant to be public? If so, this will make the generated Thrift header public as well. Perhaps we could introduce MakeFromFlatbuffer in metadata.h/cc instead, so we can use it in file_reader.cc:457.

  static std::shared_ptr<FileMetaData> MakeFromFlatbuffer(
      const uint8_t* flatbuffer_data,
      size_t flatbuffer_size,
      uint32_t metadata_len,
      const ReaderProperties& properties = default_reader_properties());

Contributor

Some effort is made to not make Thrift structs public; I think we should take the same approach with FlatBuffers.

Contributor

alkis commented Jan 21, 2026

FYI @emkornfield @prtkgaur if you want to take a look

format3::GetFileMetaData(flatbuffer_data.data());
auto thrift_metadata =
std::make_unique<format::FileMetaData>(FromFlatbuffer(fb_metadata));
file_metadata_ = FileMetaData::Make(
Contributor

FileMetadata is already a wrapper around Thrift. Is there a reason we don't have a different implementation that is made purely from the FileMetadata?

@@ -0,0 +1,224 @@
namespace parquet.format3;
Contributor

I left comments on the PR for the FBS file in parquet-format; we should resync after those are addressed.

void set_footer_read_size(size_t size) { footer_read_size_ = size; }
size_t footer_read_size() const { return footer_read_size_; }

// If enabled, try to read the metadata3 footer from the file.
Contributor

Suggested change
- // If enabled, try to read the metadata3 footer from the file.
+ // If enabled, try to read the flatbuffer metadata footer from the file as an extension (i.e. a PAR1 file).

// If it fails, fall back to Thrift footer decoding.
bool read_metadata3() const { return read_metadata3_; }
void set_read_metadata3(bool read_metadata3) { read_metadata3_ = read_metadata3; }

Contributor

I guess we need to finalize the PAR2 or PAR3 footer to be able to write this out without the extension. I think that can be follow-up work, but it would be nice to do it as part of the FBS work to ensure we can eventually move away from Thrift.
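
The flag's documented behavior (try the FlatBuffer footer when enabled, otherwise or on failure decode the Thrift footer) can be sketched in isolation as follows. This is an illustration under stated assumptions, not the PR's reader code: `FooterSketch`, `ReadFooterSketch`, and the two callbacks are hypothetical stand-ins for the real footer type and decoders.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical stand-in for the decoded footer; it only records which
// decoding path produced it.
struct FooterSketch {
  std::string source;  // "flatbuffer" or "thrift"
};

// Tries the FlatBuffer path only when the reader flag is set. If the flag is
// off, or the FlatBuffer attempt yields nothing, falls back to Thrift.
inline FooterSketch ReadFooterSketch(
    bool read_flatbuffer_if_present,
    const std::function<std::optional<FooterSketch>()>& try_flatbuffer,
    const std::function<FooterSketch()>& decode_thrift) {
  if (read_flatbuffer_if_present) {
    if (std::optional<FooterSketch> footer = try_flatbuffer()) {
      return *footer;
    }
  }
  return decode_thrift();
}
```

With this shape, a file that carries no FlatBuffer extension behaves the same whether the flag is on or off, apart from the cost of the failed attempt.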


// If enabled, try to read the metadata3 footer from the file.
// If it fails, fall back to Thrift footer decoding.
bool read_metadata3() const { return read_metadata3_; }
Contributor

Suggested change
- bool read_metadata3() const { return read_metadata3_; }
+ bool read_flatbuffer_metadata_if_present() const { return read_metadata3_; }

bool page_checksum_verification_ = false;
// Used with a RecordReader.
bool read_dense_for_nullable_ = false;
bool read_metadata3_ = false;
Contributor

I think this should default to true? Otherwise I worry that readers won't get the benefit.

LZ4_RAW = 7,
};

auto GetNumChildren(
Contributor

Style nit: is auto needed here? Generally we wouldn't use it unless it was needed for templating, etc.


auto GetName(const std::vector<format::SchemaElement>& s, size_t i) { return s[i].name; }

class ColumnMap {
Contributor

please add docs.

BuildParents(s);
}

size_t ToSchema(size_t cc_idx) const { return colchunk2schema_[cc_idx]; }
Contributor

Docs.

std::vector<uint32_t> parents_;
};

struct MinMax {
Contributor

docs.

uint8_t* const p = reinterpret_cast<uint8_t*>(out.data()) + n + 1;

// Compute and store checksums and lengths
uint32_t crc32 = ::arrow::internal::crc32(0, reinterpret_cast<const uint8_t*>(out.data()), n + 1);
Contributor

Is this format documented? (I might have missed it in the parquet-format pull request.)

} while (true);
}

inline uint32_t CountLeadingZeros32(uint32_t v) {
Contributor

return out;
}

inline uint8_t* WriteULEB64(uint64_t v, uint8_t* out) {
Contributor

We should have something like this for delta binary packed, which uses ULEB128 as well. Could you look there?
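
For readers unfamiliar with the encoding being discussed, here is a minimal sketch of unsigned LEB128 (ULEB128) writing and reading for 64-bit values. `WriteULEB64Sketch` and `ReadULEB64Sketch` are hypothetical names for illustration only; this is neither the PR's `WriteULEB64` body nor the existing Arrow helper the reviewer is pointing to.

```cpp
#include <cstddef>
#include <cstdint>

// Writes `v` as ULEB128 starting at `out` and returns one past the last byte
// written. The caller must provide at least 10 bytes, the maximum needed for
// a 64-bit value.
inline uint8_t* WriteULEB64Sketch(uint64_t v, uint8_t* out) {
  while (v >= 0x80) {
    // Emit the low 7 bits with the continuation bit set.
    *out++ = static_cast<uint8_t>((v & 0x7F) | 0x80);
    v >>= 7;
  }
  *out++ = static_cast<uint8_t>(v);  // Final byte, continuation bit clear.
  return out;
}

// Reads a ULEB128 value from [p, end). On success stores it in `*v` and
// returns one past the last byte consumed. Returns nullptr if the input is
// truncated or runs longer than 10 bytes.
inline const uint8_t* ReadULEB64Sketch(const uint8_t* p, const uint8_t* end,
                                       uint64_t* v) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64 && p != end; shift += 7) {
    const uint8_t byte = *p++;
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *v = result;
      return p;
    }
  }
  return nullptr;
}
```

As a worked case, the value 300 encodes to the two bytes `0xAC 0x02`: the low seven bits (`0x2C`) with the continuation bit set, followed by the remaining bits (`0x02`).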

// The extension itself is as follows:
//
// +-------------------+------------+--------------------------------------+----------------+---------+--------------------------------+------+
// | compress(flatbuf) | compressor | crc(compress(flatbuf) .. compressor) | compressed_len | raw_len | crc(compressed_len .. raw_len) | UUID |
Contributor

This should be documented in the parquet-format PR.


4 participants