feat(parquet): add schema projection to parquet#159
lidavidm
left a comment
Hmm, not sure what that lint failure is talking about
Me neither :/
I'm guessing one of the GTest or GMock macros expands to something weird.
zhjwpku
left a comment
LGTM, thanks for working on this.
@Fokko @zeroshade Could you help review this? Thanks!
    }
    break;
  case TypeId::kTime:
    if (arrow_type->id() == ::arrow::Type::TIME64) {
Should we also check for ::arrow::TimeUnit::MICRO here?
Good catch! I have added an exhaustive test case to make sure I don't miss any primitive type.
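To illustrate the check being discussed, here is a minimal self-contained sketch. It does not use Arrow itself: `SketchTypeId`, `SketchTimeUnit`, `SketchTimeType`, and `IsCompatibleWithIcebergTime` are hypothetical stand-ins for `::arrow::Type`, `::arrow::TimeUnit`, and the time type's unit accessor, and the PR's actual code may look different. The point is only that the type id alone is not enough, because a 64-bit time can also carry nanoseconds, while Iceberg's `time` is microsecond precision.

```cpp
// Hypothetical stand-ins for ::arrow::Type::type and ::arrow::TimeUnit::type.
// They exist only so this sketch compiles without Arrow.
enum class SketchTypeId { kTime32, kTime64, kTimestamp };
enum class SketchTimeUnit { kSecond, kMilli, kMicro, kNano };

// Stand-in for the (id, unit) pair that a real Arrow time type exposes.
struct SketchTimeType {
  SketchTypeId id;
  SketchTimeUnit unit;
};

// Accept only a 64-bit time with microsecond unit; checking the id alone
// would also let nanosecond time columns through.
constexpr bool IsCompatibleWithIcebergTime(const SketchTimeType& type) {
  return type.id == SketchTypeId::kTime64 && type.unit == SketchTimeUnit::kMicro;
}
```

With real Arrow types, the same idea would be a cast to the 64-bit time type followed by a comparison of its unit against `::arrow::TimeUnit::MICRO`.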
    }
    break;
  case TypeId::kDecimal:
    if (arrow_type->id() == ::arrow::Type::DECIMAL128) {
    if (arrow_type->id() == ::arrow::Type::FIXED_SIZE_BINARY) {
      const auto& fixed_binary =
          internal::checked_cast<const ::arrow::FixedSizeBinaryType&>(*arrow_type);
      if (fixed_binary.byte_width() == 16) {
        return {};
      }
    }
We should probably also allow https://github.com/apache/arrow/blob/main/cpp/src/arrow/extension/uuid.h#L35
You can validate via arrow_type->id() == ::arrow::Type::EXTENSION and the extension_name() == "arrow.uuid"
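The suggestion above can be sketched as follows. This is a self-contained illustration with hypothetical stand-in types (`SketchUuidTypeId`, `SketchUuidCandidate`, `IsCompatibleWithIcebergUuid`), not Arrow's real classes and not the PR's code. It accepts either a 16-byte fixed-size binary or an extension type whose name is "arrow.uuid".

```cpp
#include <string>

// Hypothetical stand-in for the relevant ::arrow::Type::type values.
enum class SketchUuidTypeId { kFixedSizeBinary, kExtension, kString };

// Stand-in for the pieces of an Arrow type this check would inspect.
struct SketchUuidCandidate {
  SketchUuidTypeId id;
  int byte_width;              // meaningful only for kFixedSizeBinary
  std::string extension_name;  // meaningful only for kExtension
};

// A UUID column may appear as raw 16-byte fixed-size binary or as the
// canonical UUID extension type; anything else is rejected.
bool IsCompatibleWithIcebergUuid(const SketchUuidCandidate& type) {
  if (type.id == SketchUuidTypeId::kFixedSizeBinary) {
    return type.byte_width == 16;
  }
  if (type.id == SketchUuidTypeId::kExtension) {
    return type.extension_name == "arrow.uuid";
  }
  return false;
}
```

Keeping the two branches separate makes it easy to add further checks on the extension's storage type later, if that turns out to be needed.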
  case TypeId::kString:
    if (arrow_type->id() == ::arrow::Type::STRING) {
      return {};
    }
    break;
  case TypeId::kBinary:
    if (arrow_type->id() == ::arrow::Type::BINARY) {
      return {};
    }
    break;
What about LargeString, LargeBinary, StringView and BinaryView?
I don't think parquet-cpp supports these types yet.
// TODO(gangwu): support v3 unknown type
Status ValidateParquetSchemaEvolution(
    const Type& expected_type, const ::parquet::arrow::SchemaField& parquet_field) {
  const auto& arrow_type = parquet_field.field->type();
I forget offhand whether this will return a DictionaryType for dictionary-encoded columns. If it does, you need to check for DictionaryType and then switch on its value type.
No, it won't. Reading dictionary-encoded data is supported via an option when creating a RecordReader: https://github.com/apache/arrow/blob/2dd3ccda6437f79aa34641bd3197dd7392ae4aec/cpp/src/parquet/column_reader.h#L266
    }
    break;
  case TypeId::kList:
    if (arrow_type->id() == ::arrow::Type::LIST) {
What about LargeList and ListView?
ListView is not supported by parquet-cpp yet. I think we should just support simple list and binary type variants in the early versions of iceberg-cpp. Once parquet-cpp has full support, we can leverage them later.
Xuanwo
left a comment
Thank you for working on this, let's move!
