GH-531: Add parquet flatbuf schema by alkis · Pull Request #544 · apache/parquet-format

alkis · 2025-12-12T07:58:13Z

Rationale for this change

Improve wide table support.

What changes are included in this PR?

Add parquet flatbuf schema.

Do these changes have PoC implementations?

apache/arrow#48431

emkornfield · 2026-01-09T18:19:39Z

src/main/flatbuf/parquet3.fbs

@@ -0,0 +1,224 @@
+namespace parquet.format3;


Let's just name this parquet.format for now?

emkornfield · 2026-01-09T18:20:45Z

src/main/flatbuf/parquet3.fbs

+// 1. Statistics are stored in integral types if their size is fixed, otherwise prefix + suffix
+// 2. ColumnMetaData.encoding_stats are removed, they are replaced with
+//    ColumnMetaData.is_fully_dict_encoded.
+// 3. RowGroups are limited to 2GB in size, so we can use int for sizes.


I think this item and the one below are out of date (we are using long now), so we can keep things absolute?

emkornfield · 2026-01-09T18:21:08Z

src/main/flatbuf/parquet3.fbs

@@ -0,0 +1,224 @@
+namespace parquet.format3;
+
+// Optimization notes


Can we expand this comment to be explicit about the relationship between this FBS and parquet.thrift?

emkornfield · 2026-01-09T18:22:00Z

src/main/flatbuf/parquet3.fbs

+// Note: Match the thrift enum values so that we can cast between them.
+enum Encoding : byte {
+  PLAIN = 0,
+  // GROUP_VAR_INT = 1,


Call out commented-out entries as deprecated to make it clear why they are commented out?
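
To illustrate why the note in the diff asks for matching enum values, here is a minimal sketch. `ThriftEncoding`, `FlatbufEncoding`, and `to_flatbuf` are hypothetical stand-ins, not generated code; only `PLAIN = 0` and the commented-out `GROUP_VAR_INT = 1` come from the quoted hunk, while `PLAIN_DICTIONARY = 2` and `RLE = 3` are taken from parquet.thrift.

```python
from enum import IntEnum


class ThriftEncoding(IntEnum):
    """Hypothetical stand-in for a subset of Encoding in parquet.thrift."""
    PLAIN = 0
    PLAIN_DICTIONARY = 2
    RLE = 3


class FlatbufEncoding(IntEnum):
    """Hypothetical stand-in for a subset of Encoding in parquet3.fbs."""
    PLAIN = 0
    # GROUP_VAR_INT = 1 is left out on purpose; the numeric slot stays unused.
    PLAIN_DICTIONARY = 2
    RLE = 3


def to_flatbuf(enc: ThriftEncoding) -> FlatbufEncoding:
    # The cast is only valid because the numeric values match one to one.
    # A value with no counterpart raises ValueError.
    return FlatbufEncoding(int(enc))


converted = to_flatbuf(ThriftEncoding.RLE)
```

Keeping the deprecated slot unused, and saying so in a comment, means a later entry cannot accidentally reuse value 1 and break the cast.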

emkornfield · 2026-01-09T18:22:09Z

src/main/flatbuf/parquet3.fbs

+  GZIP = 2,
+  LZO = 3,
+  BROTLI = 4,
+  // LZ4 = 5,


Same comment on deprecation.

emkornfield · 2026-01-09T18:22:49Z

src/main/flatbuf/parquet3.fbs

+  scale: int;
+}
+enum TimeUnit : byte {
+  MS = 0,


Can we please make these match parquet.thrift for names (Millisecond, Microsecond, Nanosecond)?

emkornfield · 2026-01-09T18:23:30Z

src/main/flatbuf/parquet3.fbs

+// Logical types.
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+table Empty {}


I think we want detailed docs (same level as parquet.thrift if we intend this to be the new footer)?

emkornfield · 2026-01-09T18:24:00Z

src/main/flatbuf/parquet3.fbs

+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+table Empty {}
+table DecimalOpts {


Suggested change

-table DecimalOpts {
+table DecimalOptions {

Should we spell out type names to make it easier on the reader?

emkornfield · 2026-01-09T18:24:48Z

src/main/flatbuf/parquet3.fbs

+  // - BYTE_ARRAY:
+  //   prefix: the longest common prefix of min/max
+  //   lo8+hi8 zero padded 16 bytes (big-endian) of the suffix
+  //   len: the length for the suffix of the value after removing the prefix. If > 16 then the


Suggested change

-// len: the length for the suffix of the value after removing the prefix. If > 16 then the
+// min_len/max_len: the length for the suffix of the value after removing the prefix. If > 16 then the

emkornfield · 2026-01-09T18:26:00Z

src/main/flatbuf/parquet3.fbs

+  //   prefix: the longest common prefix of min/max
+  //   lo8+hi8 zero padded 16 bytes (big-endian) of the suffix
+  //   len: the length for the suffix of the value after removing the prefix. If > 16 then the
+  //        value is inexact


Suggested change

-// value is inexact
+// value is inexact (it is exact otherwise).
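
A minimal sketch of the BYTE_ARRAY scheme described in the comment above, under stated assumptions: the zero-padded 16-byte suffix is read as two big-endian 64-bit integers, with `hi8` taken from the first 8 bytes and `lo8` from the last 8. That split is my reading of "lo8+hi8 ... (big-endian)", not something the quoted hunk confirms. The function names and the returned dictionary are hypothetical. The sketch only truncates long suffixes; how an inexact max should be bounded is not covered by the quoted comment.

```python
import struct


def common_prefix(a: bytes, b: bytes) -> bytes:
    """Longest common prefix of two byte strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def pack_suffix(value: bytes, prefix_len: int):
    """Return (hi8, lo8, length) for the part of value after the prefix."""
    suffix = value[prefix_len:]
    # Keep at most 16 bytes and zero-pad on the right.
    padded = suffix[:16].ljust(16, b"\x00")
    # Assumed split: hi8 is the first 8 bytes, lo8 the last 8, both big-endian.
    hi8, lo8 = struct.unpack(">QQ", padded)
    return hi8, lo8, len(suffix)


def encode_min_max(min_value: bytes, max_value: bytes) -> dict:
    """Encode min/max as a shared prefix plus packed suffixes."""
    prefix = common_prefix(min_value, max_value)
    min_hi8, min_lo8, min_len = pack_suffix(min_value, len(prefix))
    max_hi8, max_lo8, max_len = pack_suffix(max_value, len(prefix))
    return {
        "prefix": prefix,
        "min_hi8": min_hi8, "min_lo8": min_lo8, "min_len": min_len,
        "max_hi8": max_hi8, "max_lo8": max_lo8, "max_len": max_len,
    }


stats = encode_min_max(b"apple", b"apricot")
```

For `b"apple"` and `b"apricot"` the prefix is `b"ap"`, and the suffixes `b"ple"` and `b"ricot"` have lengths 3 and 5, so both values are exact. Because the suffix bytes are packed big-endian, comparing `(hi8, lo8)` pairs as integers gives the same order as comparing the padded suffixes byte by byte.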

emkornfield · 2026-01-09T18:28:28Z

src/main/flatbuf/parquet3.fbs

+  // - BOOLEAN: none
+  // - INT32/FLOAT: lo4 (little-endian)
+  // - INT64/DOUBLE: lo8 (little-endian)
+  // - INT96: lo4+lo8 (little-endian)


For composite values, I think this is complicated enough that providing concrete examples would be helpful for implementors?
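
As a starting point for such examples, here is a minimal sketch for the single-field cases only (INT32/FLOAT in lo4, INT64/DOUBLE in lo8). It assumes lo4 and lo8 hold the raw little-endian bit pattern of the value as an unsigned integer; the field types are not shown in the quoted hunk, and the helper names are hypothetical. INT96 is left out because the hunk does not say how the 12 bytes are split between lo4 and lo8.

```python
import struct


def int32_to_lo4(value: int) -> int:
    # Reinterpret the 4 little-endian bytes of a signed int32 as unsigned.
    return struct.unpack("<I", struct.pack("<i", value))[0]


def float_to_lo4(value: float) -> int:
    # Reinterpret the IEEE 754 single-precision bits as an unsigned 32-bit int.
    return struct.unpack("<I", struct.pack("<f", value))[0]


def int64_to_lo8(value: int) -> int:
    # Reinterpret the 8 little-endian bytes of a signed int64 as unsigned.
    return struct.unpack("<Q", struct.pack("<q", value))[0]


def double_to_lo8(value: float) -> int:
    # Reinterpret the IEEE 754 double-precision bits as an unsigned 64-bit int.
    return struct.unpack("<Q", struct.pack("<d", value))[0]


def lo4_to_int32(lo4: int) -> int:
    # Inverse of int32_to_lo4, for readers of the footer.
    return struct.unpack("<i", struct.pack("<I", lo4))[0]


examples = {
    "int32 -1": hex(int32_to_lo4(-1)),
    "float 1.0": hex(float_to_lo4(1.0)),
    "int64 -2": hex(int64_to_lo8(-2)),
    "double 1.0": hex(double_to_lo8(1.0)),
}
```

Note that under this assumption the stored integers do not sort the same way as the original signed or floating-point values, so a reader would convert back before comparing.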

emkornfield · 2026-01-09T18:28:53Z

src/main/flatbuf/parquet3.fbs

+  DATA_PAGE_V2 = 3,
+}
+
+table KV {


nit

Suggested change

-table KV {
+table KeyValue {

Let's keep the name consistent if possible?

emkornfield · 2026-01-09T18:29:51Z

src/main/flatbuf/parquet3.fbs

+  codec: CompressionCodec;
+  num_values: long = null;  // only present if not equal to rg.num_rows
+  total_uncompressed_size: long;
+  total_compressed_size: long;


It would be nice to keep the total unencoded size here, which I think is generally useful? But I suppose it can be added later.
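
For readers unfamiliar with the `= null` default on `num_values` in this hunk, a minimal sketch of how a reader might resolve it. It assumes an absent field means "same as the row group's row count", as the inline comment suggests; `effective_num_values` and its parameters are hypothetical names.

```python
from typing import Optional


def effective_num_values(column_num_values: Optional[int],
                         row_group_num_rows: int) -> int:
    # The field is only written when it differs from the row group's row
    # count, so an absent (None) value falls back to that count.
    if column_num_values is None:
        return row_group_num_rows
    return column_num_values


flat_column = effective_num_values(None, 1000)
nested_column = effective_num_values(2500, 1000)
```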

emkornfield · 2026-01-09T18:30:26Z

src/main/flatbuf/parquet3.fbs

+  dictionary_page_offset: long = null;
+  statistics: Statistics;
+  is_fully_dict_encoded: bool;
+  bloom_filter_offset: long = null;


Should this be made a struct/value to make the bloom filter info more self-contained?

emkornfield · 2026-01-09T18:30:54Z

src/main/flatbuf/parquet3.fbs

+  row_groups: [RowGroup];
+  kv: [KV];
+  created_by: string;
+  // column_orders: [ColumnOrder];  // moved to SchemaElement


Remove this row for now?

emkornfield

I think we also need to add an apache header here, and CI to make sure this compiles?

Add parquet flatbuf schema

f951a6d

This was referenced Jan 7, 2026

Parquet metadata as flatbuffers apache/arrow-rs#9041

Open

parquet: use flatbuffers to store metadata (WIP) apache/arrow-rs#9042

Draft

emkornfield requested changes Jan 9, 2026

2 participants