
feat: Create a Parquet Data Source with ability to read and process primitive data types from a parquet file #99

@Meghajit

Description

Dagger has been processing real-time Kafka streams for years. With parquet file processing, we now aim to add the capability of performing Dagger operations over historical data, making Dagger a complete solution for data processing, from historical to real-time.

As part of this feature, we want to add a DataSource in Dagger which can read data from a parquet file and send records downstream as Flink Rows. This issue only targets the reading of simple primitive types such as INT, FLOAT, or STRING from the parquet file. Reading of nested fields is not in scope for this issue and will be covered by #100.
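A minimal sketch of what this could look like, using the parquet-hadoop example `Group` API to pull primitive values into a Flink `Row`. The class name `PrimitiveRowParser` is hypothetical, not an existing Dagger class, and the type mapping shown is an assumption about how the parser might behave:

```java
import org.apache.flink.types.Row;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType;

// Hypothetical parser: copies Parquet primitive fields into Flink Row fields.
// Complex/nested fields are skipped (left null), matching the scope of this issue.
public class PrimitiveRowParser {

    public Row parse(Group group, MessageType schema) {
        Row row = new Row(schema.getFieldCount());
        for (int i = 0; i < schema.getFieldCount(); i++) {
            // Nested types are deferred to #100; absent values stay null.
            if (!schema.getType(i).isPrimitive() || group.getFieldRepetitionCount(i) == 0) {
                continue;
            }
            PrimitiveType.PrimitiveTypeName typeName =
                    schema.getType(i).asPrimitiveType().getPrimitiveTypeName();
            switch (typeName) {
                case INT32:   row.setField(i, group.getInteger(i, 0)); break;
                case INT64:   row.setField(i, group.getLong(i, 0));    break;
                case FLOAT:   row.setField(i, group.getFloat(i, 0));   break;
                case DOUBLE:  row.setField(i, group.getDouble(i, 0));  break;
                case BOOLEAN: row.setField(i, group.getBoolean(i, 0)); break;
                case BINARY:  row.setField(i, group.getString(i, 0));  break;
                default:      break; // INT96, FIXED_LEN_BYTE_ARRAY: skipped here
            }
        }
        return row;
    }
}
```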

All the current features of Dagger, such as transformers and UDFs, should continue to work on the data generated by the Parquet Data Source. In fact, from the perspective of downstream components, they need not know what kind of source produced the data.

Tasks to be done:

  1. Create a Parquet Data Source and expose its configurations. Data sources should be switchable in Dagger.

  2. Create a Parquet Reader which reads parquet files using row groups and columns (a sketch of such a reader follows this list).

  3. Create a parser for converting Parquet primitive types into Flink Row types.

  4. Process the parquet files in chronological order.
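As a rough illustration of items 2 and 4, the sketch below iterates a parquet file row group by row group with the low-level parquet-hadoop `ParquetFileReader`, and orders date/hour-partitioned paths chronologically. The class and helper names (`ParquetRowGroupReader`, `inChronologicalOrder`, `emit`), the `dt=YYYY-MM-DD/hr=HH` layout, and the reuse of the hypothetical `PrimitiveRowParser` above are all assumptions, not the final Dagger design:

```java
import java.io.IOException;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

public class ParquetRowGroupReader {

    // Task 4: for a layout like .../dt=2022-01-01/hr=09/part-00000.parquet,
    // a plain lexicographic sort of the full paths yields chronological order.
    static List<Path> inChronologicalOrder(List<Path> files) {
        return files.stream()
                .sorted(Comparator.comparing(Path::toString))
                .collect(Collectors.toList());
    }

    // Task 2: read one file row group by row group, record by record.
    static void readFile(Path file, PrimitiveRowParser parser) throws IOException {
        Configuration conf = new Configuration();
        try (ParquetFileReader reader =
                     ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
            PageReadStore rowGroup;
            while ((rowGroup = reader.readNextRowGroup()) != null) {
                RecordReader<Group> records =
                        columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
                for (long i = 0; i < rowGroup.getRowCount(); i++) {
                    emit(parser.parse(records.read(), schema));
                }
            }
        }
    }

    // Placeholder: in Dagger this would hand the Row to the Flink source context.
    static void emit(org.apache.flink.types.Row row) {
        System.out.println(row);
    }
}
```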

Not in Scope

  1. Checkpointing and state persistence: when a parquet-source Dagger is restarted, it should resume processing from the last checkpoint.
  2. Behaviour on corrupt files.

Acceptance Criteria

| GIVEN | WHEN | THEN |
| --- | --- | --- |
| Dagger job is created. Data source is selected as parquet. | A single local parquet file is provided as input. | Dagger should process the data from the local parquet file and then exit gracefully. Only primitive types will be parsed into their equivalent Java types; other complex types can be empty. Any int64 timestamp fields in the parquet data should be parsed to seconds + nanos. |
| Dagger job is created. Data source is selected as parquet. | Multiple date-partitioned local folder paths containing multiple parquet files are provided as input. | Dagger should process all the files in chronological order of dates and then exit gracefully. Only primitive types will be parsed into their equivalent Java types; other complex types can be empty. Any int64 timestamp fields in the parquet data should be parsed to seconds + nanos. |
| Dagger job is created. Data source is selected as parquet. | Multiple hour-partitioned local folder paths containing multiple parquet files are provided as input. | Dagger should process all the files in chronological order of hours and then exit gracefully. Only primitive types will be parsed into their equivalent Java types; other complex types can be empty. Any int64 timestamp fields in the parquet data should be parsed to seconds + nanos. |
| Dagger job is created. Data source is selected as parquet. | Multiple date/hour-partitioned folder paths containing 0 files are provided as input. | Dagger should stop gracefully. |
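For the int64 timestamp requirement in the THEN column, a sketch of the seconds + nanos split, assuming the column stores epoch milliseconds; `TimestampSplitter` is a hypothetical name. An epoch-micros column would instead be split via `Instant.ofEpochSecond(micros / 1_000_000, (micros % 1_000_000) * 1_000)`:

```java
import java.time.Instant;

public final class TimestampSplitter {

    // Splits an int64 epoch-millis value into { seconds, nanos },
    // e.g. 1640995200123L -> { 1640995200, 123000000 }.
    static long[] toSecondsAndNanos(long epochMillis) {
        Instant instant = Instant.ofEpochMilli(epochMillis);
        return new long[] { instant.getEpochSecond(), instant.getNano() };
    }
}
```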
