feat: Handle complex/nested data types from a parquet file in Parquet Data Source: Struct, Repeated Struct, Maps and Timestamp of type SimpleGroup #137

@Meghajit

Description

Dagger has been processing real-time Kafka streams for years. Now, with parquet file processing, we aim to add the capability of performing Dagger operations over historical data, making Dagger a complete solution for data processing, from historical to real-time.

As part of this feature, we want to extend #99 and add the capability to also read **maps** as well as **timestamps in the format of nested simple groups** (seconds + nanos) from the parquet file.
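The nested simple-group timestamp mentioned above carries two components, seconds and nanos, much like protobuf's `Timestamp`. A minimal sketch of how such a pair could be combined, with a fallback default when the field is present in the schema but absent in the file (all names here are illustrative, not Dagger's actual classes):

```java
import java.time.Instant;

// Illustrative sketch only: combining a (seconds, nanos) simple group into an
// Instant, with a default when the value is missing from the parquet file.
public class TimestampSimpleGroupSketch {
    // Assumed default when the field exists in the schema but not in the data.
    static final Instant DEFAULT = Instant.EPOCH; // 1970-01-01T00:00:00Z

    static Instant fromSecondsAndNanos(Long seconds, Integer nanos) {
        if (seconds == null) {
            return DEFAULT; // field absent in the file: fall back to default
        }
        return Instant.ofEpochSecond(seconds, nanos == null ? 0 : nanos);
    }

    public static void main(String[] args) {
        System.out.println(fromSecondsAndNanos(1645000000L, 500));
        System.out.println(fromSecondsAndNanos(null, null)); // default value
    }
}
```

This is only meant to show the seconds + nanos combination; the actual Dagger code materializes the pair into a Flink row rather than an `Instant`.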

All the current features of Dagger, such as transformers and UDFs, continue to work on the data. From the perspective of downstream components, they need not know what kind of source produced this data.

ACCEPTANCE CRITERIA:

| GIVEN | WHEN | THEN |
| --- | --- | --- |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more fields of type struct | Dagger should process the data from the local parquet file and then exit gracefully. All the struct fields should have their value set to null in the Flink row. (NO PROCESSING CAN BE DONE FOR STRUCTS AS OF NOW) |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more fields of type repeated struct | Dagger should process the data from the local parquet file and then exit gracefully. All the repeated struct fields should have their value set to null in the Flink row. (NO PROCESSING CAN BE DONE FOR REPEATED STRUCTS AS OF NOW) |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more fields of type MAP | Dagger should process the data from the local parquet file and then exit gracefully. Map types should get parsed into an array of key-value Flink rows. A suitable default value should be used when data is not present in the parquet file but is present in the schema |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. The parquet file has a parent simple group which contains one or more timestamp fields in the form of a simple group of seconds + nanos. (This differs from issue #99, where timestamp parsing was supported only in int64/long format) | Dagger should process the data from the local parquet file and then exit gracefully. The timestamp should get parsed into a Flink row of seconds and nanos. A suitable default value should be used when data is not present in the parquet file but is present in the schema |
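The map criterion above calls for flattening a map field into an array of key-value Flink rows. A minimal sketch of that shape, using `Object[]` as a stand-in for Flink's `Row` (real code would read the repeated `key_value` group of parquet's MAP logical type; all names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: turning a map into an array of (key, value) rows,
// with an empty array as the assumed default when the field is absent.
public class MapToRowsSketch {
    static Object[][] toKeyValueRows(Map<String, String> map) {
        if (map == null) {
            return new Object[0][]; // field in schema but not in data: empty array
        }
        List<Object[]> rows = new ArrayList<>();
        for (Map.Entry<String, String> e : map.entrySet()) {
            rows.add(new Object[]{e.getKey(), e.getValue()}); // one row per entry
        }
        return rows.toArray(new Object[0][]);
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("country", "ID");
        labels.put("tier", "gold");
        for (Object[] row : toKeyValueRows(labels)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```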

Out of scope:

  • Enums, repeated enums

  • Message, repeated message

  • Repeated primitives
