Conversation
| @@ -0,0 +1,224 @@ | |||
| namespace parquet.format3; | |||
There was a problem hiding this comment.
lets just name this parquet.format for now?
| // 1. Statistics are stored in integral types if their size is fixed, otherwise prefix + suffix | ||
| // 2. ColumnMetaData.encoding_stats are removed, they are replaced with | ||
| // ColumnMetaData.is_fully_dict_encoded. | ||
| // 3. RowGroups are limited to 2GB in size, so we can use int for sizes. |
There was a problem hiding this comment.
I think this and the item below are out of date (we are using long now) and can keep things absolute?
| @@ -0,0 +1,224 @@ | |||
| namespace parquet.format3; | |||
|
|
|||
| // Optimization notes | |||
There was a problem hiding this comment.
Can we expand this comment to be explicit about the relationship between this FBS and parquet.thrift.
| // Note: Match the thrift enum values so that we can cast between them. | ||
| enum Encoding : byte { | ||
| PLAIN = 0, | ||
| // GROUP_VAR_INT = 1, |
There was a problem hiding this comment.
Call out commented out entries as deprecated to make it clear why they are commented out?
| GZIP = 2, | ||
| LZO = 3, | ||
| BROTLI = 4, | ||
| // LZ4 = 5, |
There was a problem hiding this comment.
same comment on deprecation.
| scale: int; | ||
| } | ||
| enum TimeUnit : byte { | ||
| MS = 0, |
There was a problem hiding this comment.
Can we please make these match parquet.thrift for names (Millisecond, Microsecond, Nanosecond)?
| // Logical types. | ||
| /////////////////////////////////////////////////////////////////////////////////////////////////// | ||
|
|
||
| table Empty {} |
There was a problem hiding this comment.
I think we want detailed docs (same level as parquet.thrift if we intend this to be the new footer)?
| /////////////////////////////////////////////////////////////////////////////////////////////////// | ||
|
|
||
| table Empty {} | ||
| table DecimalOpts { |
There was a problem hiding this comment.
| table DecimalOpts { | |
| table DecimalOptions { |
Should be spell out type names to make it easer on reader?
| // - BYTE_ARRAY: | ||
| // prefix: the longest common prefix of min/max | ||
| // lo8+hi8 zero padded 16 bytes (big-endian) of the suffix | ||
| // len: the length for the suffix of the value after removing the prefix. If > 16 then the |
There was a problem hiding this comment.
| // len: the length for the suffix of the value after removing the prefix. If > 16 then the | |
| // min_len/max_len: the length for the suffix of the value after removing the prefix. If > 16 then the |
| // prefix: the longest common prefix of min/max | ||
| // lo8+hi8 zero padded 16 bytes (big-endian) of the suffix | ||
| // len: the length for the suffix of the value after removing the prefix. If > 16 then the | ||
| // value is inexact |
There was a problem hiding this comment.
| // value is inexact | |
| // value is inexact (it is exact otherwise). |
| // - BOOLEAN: none | ||
| // - INT32/FLOAT: lo4 (little-endian) | ||
| // - INT64/DOUBLE: lo8 (little-endian) | ||
| // - INT96: lo4+lo8 (little-endian) |
There was a problem hiding this comment.
for composite values, I think this is complicated enough that providing concrete examples would be belpful for implementors?
| DATA_PAGE_V2 = 3, | ||
| } | ||
|
|
||
| table KV { |
There was a problem hiding this comment.
nit
| table KV { | |
| table KeyValue { |
Lets keep name consistent if possible?
| codec: CompressionCodec; | ||
| num_values: long = null; // only present if not equal to rg.num_rows | ||
| total_uncompressed_size: long; | ||
| total_compressed_size: long; |
There was a problem hiding this comment.
It would be nice to keep total unencoded size here which I think is generally useful? But I suppose it can be added after?
| dictionary_page_offset: long = null; | ||
| statistics: Statistics; | ||
| is_fully_dict_encoded: bool; | ||
| bloom_filter_offset: long = null; |
There was a problem hiding this comment.
Should we this be made a struct/value to make the bloom filter info more self contained?
| row_groups: [RowGroup]; | ||
| kv: [KV]; | ||
| created_by: string; | ||
| // column_orders: [ColumnOrder]; // moved to SchemaElement |
There was a problem hiding this comment.
remove this row for now?
emkornfield
left a comment
There was a problem hiding this comment.
I think we also need to add an apache header here, and CI to make sure this compiles?
Rationale for this change
Improve wide table support.
What changes are included in this PR?
Add parquet flatbuf schema.
Do these changes have PoC implementations?
apache/arrow#48431