feat: Handle complex/nested data types from a parquet file in Parquet Data Source: Struct, Repeated Struct, Maps and Timestamp of type SimpleGroup #137

@Meghajit

Description

Dagger has been processing real-time Kafka streams for years. Now, with parquet file processing, we aim to add the capability of performing Dagger operations over historical data, making Dagger a complete solution for data processing, from historical to real-time.

As part of this feature, we want to extend #99 and add the capability to also read **maps** as well as **timestamps in the format of nested simple groups** (seconds + nanos) from the parquet file.
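The nested simple-group timestamp mentioned above carries two components, seconds and nanos, much like protobuf's `Timestamp`. A minimal sketch of how such a pair could be combined, with a fallback default when the field is present in the schema but absent in the file (all names here are illustrative, not Dagger's actual classes):

```java
import java.time.Instant;

// Illustrative sketch only: combining a (seconds, nanos) simple group into an
// Instant, with a default when the value is missing from the parquet file.
public class TimestampSimpleGroupSketch {
    // Assumed default when the field exists in the schema but not in the data.
    static final Instant DEFAULT = Instant.EPOCH; // 1970-01-01T00:00:00Z

    static Instant fromSecondsAndNanos(Long seconds, Integer nanos) {
        if (seconds == null) {
            return DEFAULT; // field absent in the file: fall back to default
        }
        return Instant.ofEpochSecond(seconds, nanos == null ? 0 : nanos);
    }

    public static void main(String[] args) {
        System.out.println(fromSecondsAndNanos(1645000000L, 500));
        System.out.println(fromSecondsAndNanos(null, null)); // default value
    }
}
```

This is only meant to show the seconds + nanos combination; the actual Dagger code materializes the pair into a Flink row rather than an `Instant`.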

All the current features of Dagger, such as transformers and UDFs, continue to work on the data. From the perspective of downstream components, they need not know what kind of source produced this data.

ACCEPTANCE CRITERIA:

| GIVEN | WHEN | THEN |
| --- | --- | --- |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more fields of type struct | Dagger should process the data from the local parquet file and then exit gracefully. All the struct fields should have their value set to null in the Flink row. (NO PROCESSING CAN BE DONE FOR STRUCTS AS OF NOW) |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more fields of type repeated struct | Dagger should process the data from the local parquet file and then exit gracefully. All the repeated struct fields should have their value set to null in the Flink row. (NO PROCESSING CAN BE DONE FOR REPEATED STRUCTS AS OF NOW) |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more fields of type MAP | Dagger should process the data from the local parquet file and then exit gracefully. Map types should get parsed into an array of key-value Flink rows. A suitable default value should be used when data is not present in the parquet file but is present in the schema |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more timestamp fields in the form of a simple group of seconds + nanos. (This differs from issue #99, where timestamp parsing was supported only in int64/long format) | Dagger should process the data from the local parquet file and then exit gracefully. The timestamp should get parsed into a Flink row of seconds and nanos. A suitable default value should be used when data is not present in the parquet file but is present in the schema |
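The map criterion above calls for flattening a map field into an array of key-value Flink rows. A minimal sketch of that shape, using `Object[]` as a stand-in for Flink's `Row` (real code would read the repeated `key_value` group of parquet's MAP logical type; all names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: turning a map into an array of (key, value) rows,
// with an empty array as the assumed default when the field is absent.
public class MapToRowsSketch {
    static Object[][] toKeyValueRows(Map<String, String> map) {
        if (map == null) {
            return new Object[0][]; // field in schema but not in data: empty array
        }
        List<Object[]> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : map.entrySet()) {
            rows.add(new Object[]{e.getKey(), e.getValue()}); // one row per entry
        }
        return rows.toArray(new Object[0][]);
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("country", "ID");
        labels.put("tier", "gold");
        for (Object[] row : toKeyValueRows(labels)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```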

Out of scope:

  • Enums, repeated enums

  • Message, repeated message

  • Repeated primitives
