
feat: Create a Parquet Data Source with ability to read and process primitive data types from a parquet file #99

@Meghajit

Description

Dagger has been processing real-time Kafka streams for years. With parquet file processing, we now aim to add the capability of performing Dagger operations over historical data, making Dagger a complete solution for data processing, from historical to real-time.

As part of this feature, we want to add a DataSource in Dagger which can read data from a parquet file and send records downstream as Flink Rows. This issue only targets the reading of simple primitive types such as INT, FLOAT, or STRING from the parquet file. Reading of nested fields is not in scope for this issue and will be covered by #100.
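A minimal sketch of what this could look like, using the parquet-hadoop example `Group` API to pull primitive values into a Flink `Row`. The class name `PrimitiveRowParser` is hypothetical, not an existing Dagger class, and the type mapping shown is an assumption about how the parser might behave:

```java
import org.apache.flink.types.Row;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType;

// Hypothetical parser: copies Parquet primitive fields into Flink Row fields.
// Complex/nested fields are skipped (left null), matching the scope of this issue.
public class PrimitiveRowParser {

    public Row parse(Group group, MessageType schema) {
        Row row = new Row(schema.getFieldCount());
        for (int i = 0; i < schema.getFieldCount(); i++) {
            // Nested types are deferred to #100; absent values stay null.
            if (!schema.getType(i).isPrimitive() || group.getFieldRepetitionCount(i) == 0) {
                continue;
            }
            PrimitiveType.PrimitiveTypeName typeName =
                    schema.getType(i).asPrimitiveType().getPrimitiveTypeName();
            switch (typeName) {
                case INT32:   row.setField(i, group.getInteger(i, 0)); break;
                case INT64:   row.setField(i, group.getLong(i, 0));    break;
                case FLOAT:   row.setField(i, group.getFloat(i, 0));   break;
                case DOUBLE:  row.setField(i, group.getDouble(i, 0));  break;
                case BOOLEAN: row.setField(i, group.getBoolean(i, 0)); break;
                case BINARY:  row.setField(i, group.getString(i, 0));  break;
                default:      break; // INT96, FIXED_LEN_BYTE_ARRAY: skipped here
            }
        }
        return row;
    }
}
```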

All the current features of Dagger, such as transformers and UDFs, should continue to work on the data generated by the Parquet Data Source. In fact, from the perspective of downstream components, they need not know what kind of source produced the data.

Tasks to be done:

  1. Create a Parquet Data Source and expose its configurations. Data sources should be switchable in Dagger.

  2. Create a Parquet Reader which reads parquet files using row groups and columns (a sketch of such a reader follows this list).

  3. Create a parser for converting Parquet primitive types into Flink Row types.

  4. Process the parquet files in chronological order.
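As a rough illustration of items 2 and 4, the sketch below iterates a parquet file row group by row group with the low-level parquet-hadoop `ParquetFileReader`, and orders date/hour-partitioned paths chronologically. The class and helper names (`ParquetRowGroupReader`, `inChronologicalOrder`, `emit`), the `dt=YYYY-MM-DD/hr=HH` layout, and the reuse of the hypothetical `PrimitiveRowParser` above are all assumptions, not the final Dagger design:

```java
import java.io.IOException;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

public class ParquetRowGroupReader {

    // Task 4: for a layout like .../dt=2022-01-01/hr=09/part-00000.parquet,
    // a plain lexicographic sort of the full paths yields chronological order.
    static List<Path> inChronologicalOrder(List<Path> files) {
        return files.stream()
                .sorted(Comparator.comparing(Path::toString))
                .collect(Collectors.toList());
    }

    // Task 2: read one file row group by row group, record by record.
    static void readFile(Path file, PrimitiveRowParser parser) throws IOException {
        Configuration conf = new Configuration();
        try (ParquetFileReader reader =
                     ParquetFileReader.open(HadoopInputFile.fromPath(file, conf))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
            PageReadStore rowGroup;
            while ((rowGroup = reader.readNextRowGroup()) != null) {
                RecordReader<Group> records =
                        columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
                for (long i = 0; i < rowGroup.getRowCount(); i++) {
                    emit(parser.parse(records.read(), schema));
                }
            }
        }
    }

    // Placeholder: in Dagger this would hand the Row to the Flink source context.
    static void emit(org.apache.flink.types.Row row) {
        System.out.println(row);
    }
}
```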

Not in Scope

  1. Checkpointing and state persistence: when a parquet-source Dagger is restarted, it should resume processing from the last checkpoint.
  2. Behaviour on corrupt files.

Acceptance Criteria

| GIVEN | WHEN | THEN |
| --- | --- | --- |
| Dagger job is created. Data source is selected as parquet. | A single local parquet file is provided as input. | Dagger should process the data from the local parquet file and then exit gracefully. Only primitive types will be parsed into their equivalent Java types; other complex types can be empty. Any int64 timestamp fields in the parquet data should be parsed to seconds + nanos. |
| Dagger job is created. Data source is selected as parquet. | Multiple date-partitioned local folder paths containing multiple parquet files are provided as input. | Dagger should process all the files in chronological order of dates and then exit gracefully. Only primitive types will be parsed into their equivalent Java types; other complex types can be empty. Any int64 timestamp fields in the parquet data should be parsed to seconds + nanos. |
| Dagger job is created. Data source is selected as parquet. | Multiple hour-partitioned local folder paths containing multiple parquet files are provided as input. | Dagger should process all the files in chronological order of hours and then exit gracefully. Only primitive types will be parsed into their equivalent Java types; other complex types can be empty. Any int64 timestamp fields in the parquet data should be parsed to seconds + nanos. |
| Dagger job is created. Data source is selected as parquet. | Multiple date/hour-partitioned folder paths containing 0 files are provided as input. | Dagger should stop gracefully. |
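For the int64 timestamp requirement in the THEN column, a sketch of the seconds + nanos split, assuming the column stores epoch milliseconds; `TimestampSplitter` is a hypothetical name. An epoch-micros column would instead be split via `Instant.ofEpochSecond(micros / 1_000_000, (micros % 1_000_000) * 1_000)`:

```java
import java.time.Instant;

public final class TimestampSplitter {

    // Splits an int64 epoch-millis value into { seconds, nanos },
    // e.g. 1640995200123L -> { 1640995200, 123000000 }.
    static long[] toSecondsAndNanos(long epochMillis) {
        Instant instant = Instant.ofEpochMilli(epochMillis);
        return new long[] { instant.getEpochSecond(), instant.getNano() };
    }
}
```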
