Description
Dagger has been processing real-time Kafka streams for years now. With parquet file processing, we aim to add the capability of performing Dagger operations over historical data, making Dagger a complete solution for data processing, from historical to real-time.
As part of this feature, we want to extend #99 and add the capability to read **maps** as well as **timestamps in the format of nested simple groups** (seconds + nanos) from parquet files.
All current Dagger features, such as transformers and UDFs, continue to work on this data. From the perspective of downstream components, they need not know what kind of source produced the data.
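To illustrate the second capability: a timestamp stored as a nested simple group is a pair of fields (seconds since epoch, plus nanos) rather than a single int64, so the reader has to fold the two fields back into one timestamp value. The sketch below shows that folding in plain Java; the `TimestampGroup` record is a stand-in for the parquet `SimpleGroup` the reader would actually receive, and its field names are assumptions, not Dagger API.

```java
import java.time.Instant;

public class TimestampGroupDemo {
    // Hypothetical stand-in for the nested simple group (seconds + nanos)
    // handed back by the parquet reader; field names are illustrative only.
    record TimestampGroup(long seconds, int nanos) {}

    // Fold the two-field group into a single Instant, the way a
    // deserializer might combine the pair into one timestamp value.
    static Instant toInstant(TimestampGroup g) {
        return Instant.ofEpochSecond(g.seconds(), g.nanos());
    }

    public static void main(String[] args) {
        TimestampGroup g = new TimestampGroup(1640995200L, 500_000_000);
        System.out.println(toInstant(g)); // 2022-01-01T00:00:00.500Z
    }
}
```

In Dagger itself the result would be materialized as a Flink row of (seconds, nanos) rather than an `Instant`, as described in the acceptance criteria below; the conversion logic is the same.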
ACCEPTANCE CRITERIA:
| GIVEN | WHEN | THEN |
|---|---|---|
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. Parquet file has a parent simple group which contains one or more fields of type struct | Dagger should process the data from the local parquet file instead and then exit gracefully. All struct fields should have their value set to null in the Flink row. (NO PROCESSING CAN BE DONE FOR STRUCTS AS OF NOW) |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. Parquet file has a parent simple group which contains one or more fields of type repeated struct | Dagger should process the data from the local parquet file instead and then exit gracefully. All repeated struct fields should have their value set to null in the Flink row. (NO PROCESSING CAN BE DONE FOR REPEATED STRUCTS AS OF NOW) |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. Parquet file has a parent simple group which contains one or more fields of type MAP | Dagger should process the data from the local parquet file instead and then exit gracefully. Map types should get parsed into an array of key-value Flink rows. A suitable default value should be used when data is not present in the parquet file but present in the schema. |
| Dagger job is created | Data source is selected as parquet. One or more parquet files are provided as input. Parquet file has a parent simple group which contains one or more timestamp fields in the form of a simple group of seconds + nanos. (This is different from issue #99, where timestamp parsing was supported only in int64/long format) | Dagger should process the data from the local parquet file instead and then exit gracefully. The timestamp should get parsed into a Flink row of seconds and nanos. A suitable default value should be used when data is not present in the parquet file but present in the schema. |
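The map criterion above says map types should be parsed into an array of key-value Flink rows. The sketch below shows that flattening in plain Java, with a Flink row approximated as an `Object[]`; in Dagger this would be `org.apache.flink.types.Row`, and the method name is illustrative, not part of any existing API.

```java
import java.util.Arrays;
import java.util.Map;

public class MapAsRowsDemo {
    // Flatten a map field into an array of two-field (key, value) "rows".
    // Object[] stands in for org.apache.flink.types.Row here.
    static Object[][] mapToKeyValueRows(Map<String, ?> map) {
        Object[][] rows = new Object[map.size()][];
        int i = 0;
        for (Map.Entry<String, ?> e : map.entrySet()) {
            rows[i++] = new Object[] { e.getKey(), e.getValue() };
        }
        return rows;
    }

    public static void main(String[] args) {
        // Single-entry map keeps the printed order deterministic.
        Object[][] rows = mapToKeyValueRows(Map.of("country", "ID"));
        System.out.println(Arrays.deepToString(rows)); // [[country, ID]]
    }
}
```

An empty map would simply yield a zero-length array, which lines up with the "suitable default value" requirement when the field is absent from the file but present in the schema.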
Out of scope:
- Enums, repeated enums
- Message, repeated message
- Repeated primitives