
feat: Add ability to read and process complex/nested data types from a parquet file in Parquet Data Source: enums, repeated enums, message, repeated message, repeated primitives #100

@Meghajit

Description


Dagger has been processing real-time Kafka streams for years. With parquet file processing, we aim to add the capability to perform Dagger operations over historical data, making Dagger a complete solution for data processing, from historical to real-time.

As part of this feature, we want to extend #99 and add the capability to read repeated types (repeated primitives, repeated enums and repeated simple groups) and some complex types (enums and nested simple groups).

All the current features of Dagger, such as transformers and UDFs, continue to work on the data. Downstream components need not know what kind of source produced this data.

ACCEPTANCE CRITERIA

| GIVEN | WHEN | THEN |
|---|---|---|
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more enum fields. | Dagger should process the data from the local parquet file and then exit gracefully. The enum fields should be added into a Flink row. A suitable default value should be used when data is not present in the parquet file but present in the schema. |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more repeated enum fields (array of enums). | Dagger should process the data from the local parquet file and then exit gracefully. Each list of enums should be added into a Flink row as a list. A suitable default value should be used when data is not present in the parquet file but present in the schema. |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more nested simple groups (i.e., a simple group within another simple group). | Dagger should process the data from the local parquet file and then exit gracefully. The nested simple groups should be parsed into nested Flink rows. A suitable default value should be used when data is not present in the parquet file but present in the schema. |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more fields of type repeated simple groups (i.e., array of simple groups). | Dagger should process the data from the local parquet file and then exit gracefully. The repeated simple groups should be parsed into an array of Flink rows. A suitable default value should be used when data is not present in the parquet file but present in the schema. |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more fields of type repeated primitives (i.e., array of parquet primitive types like int64, boolean, etc.). | Dagger should process the data from the local parquet file and then exit gracefully. Repeated primitive types should be parsed into an array of Flink rows. A suitable default value should be used when data is not present in the parquet file but present in the schema. |
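To make the criteria above concrete, here is a minimal sketch of the conversion they describe. It is not Dagger's implementation and uses neither the Parquet nor the Flink API. A `Map` stands in for a Parquet simple group and an `Object[]` stands in for a Flink row. The class `NestedGroupSketch`, the `Field` and `Kind` types, and the chosen defaults (`0`, `false`, a per-field enum default, an empty array for a missing repeated field, an all-defaults row for a missing nested group) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only: Map = Parquet simple group, Object[] = Flink row.
final class NestedGroupSketch {

    enum Kind { INT64, BOOLEAN, ENUM, GROUP }

    // One schema field. `children` is used only for GROUP, `enumDefault` only for ENUM.
    static final class Field {
        final String name;
        final Kind kind;
        final boolean repeated;
        final List<Field> children;
        final String enumDefault;

        private Field(String name, Kind kind, boolean repeated, List<Field> children, String enumDefault) {
            this.name = name;
            this.kind = kind;
            this.repeated = repeated;
            this.children = children;
            this.enumDefault = enumDefault;
        }

        static Field primitive(String name, Kind kind, boolean repeated) {
            return new Field(name, kind, repeated, Collections.<Field>emptyList(), null);
        }

        static Field enumOf(String name, boolean repeated, String enumDefault) {
            return new Field(name, Kind.ENUM, repeated, Collections.<Field>emptyList(), enumDefault);
        }

        static Field group(String name, boolean repeated, List<Field> children) {
            return new Field(name, Kind.GROUP, repeated, children, null);
        }
    }

    // Converts one group into a row: one slot per schema field, in schema order.
    // A null group (data absent from the file) yields a row filled with defaults.
    static Object[] toRow(List<Field> schema, Map<String, Object> group) {
        Object[] row = new Object[schema.size()];
        for (int i = 0; i < schema.size(); i++) {
            Field field = schema.get(i);
            Object value = group == null ? null : group.get(field.name);
            row[i] = field.repeated ? convertRepeated(field, value) : convertSingle(field, value);
        }
        return row;
    }

    // Repeated fields become arrays; a missing repeated field becomes an empty array.
    private static Object[] convertRepeated(Field field, Object value) {
        List<?> items = value == null ? Collections.emptyList() : (List<?>) value;
        List<Object> converted = new ArrayList<>();
        for (Object item : items) {
            converted.add(convertSingle(field, item));
        }
        return converted.toArray();
    }

    // Single values: fall back to an assumed default when the value is missing.
    @SuppressWarnings("unchecked")
    private static Object convertSingle(Field field, Object value) {
        switch (field.kind) {
            case INT64:   return value == null ? Long.valueOf(0L) : value;
            case BOOLEAN: return value == null ? Boolean.FALSE : value;
            case ENUM:    return value == null ? field.enumDefault : value;
            case GROUP:   return toRow(field.children, (Map<String, Object>) value);
            default:      throw new IllegalArgumentException("Unsupported kind: " + field.kind);
        }
    }

    public static void main(String[] args) {
        List<Field> inner = Collections.singletonList(Field.primitive("id", Kind.INT64, false));
        List<Field> schema = Arrays.asList(
                Field.enumOf("status", false, "UNKNOWN"),      // enum
                Field.enumOf("tags", true, "UNKNOWN"),         // repeated enum
                Field.group("driver", false, inner),           // nested simple group
                Field.group("stops", true, inner),             // repeated simple group
                Field.primitive("timestamps", Kind.INT64, true)); // repeated primitive

        Map<String, Object> group = new HashMap<>();
        group.put("tags", Arrays.asList("A", "B"));
        group.put("timestamps", Arrays.asList(1L, 2L));
        // "status", "driver" and "stops" are in the schema but absent from the data.
        System.out.println(Arrays.deepToString(toRow(schema, group)));
    }
}
```

In this sketch the schema, not the data, drives the shape of the row, which is why a field that is present in the schema but absent from the file still gets a slot with a default value.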

Out of scope

  • Struct

  • Repeated Struct

  • Maps

  • Timestamp of type SimpleGroup
