@Meghajit
Member

PR for #150

- added split assigners to assign splits based on
timestamp in url and based on index in filepaths array

[raystack#99]
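
For illustration, the two ordering strategies described in this commit might look roughly like the sketch below. It is not the code from this PR: the class name, the method names, and the assumed `dt=YYYY-MM-DD/hr=HH` path layout are all hypothetical, chosen only to show how splits could be ordered by a timestamp embedded in the URL versus by their position in the file paths array.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a class from the Dagger code base.
class ChronologicalPathOrdering {
    // Assumed path layout: .../dt=2022-05-06/hr=11/...; the hr= segment is optional.
    private static final Pattern PARTITION =
            Pattern.compile("dt=(\\d{4}-\\d{2}-\\d{2})(?:/hr=(\\d{2}))?");

    // Extracts the partition timestamp (UTC) encoded in a file path.
    static Instant timestampOf(String path) {
        Matcher m = PARTITION.matcher(path);
        if (!m.find()) {
            throw new IllegalArgumentException("No dt= partition in path: " + path);
        }
        LocalDate date = LocalDate.parse(m.group(1));
        int hour = m.group(2) == null ? 0 : Integer.parseInt(m.group(2));
        return date.atTime(hour, 0).toInstant(ZoneOffset.UTC);
    }

    // Strategy 1: earliest timestamp in the URL first.
    static List<String> orderByTimestamp(List<String> paths) {
        List<String> sorted = new ArrayList<>(paths);
        sorted.sort(Comparator.comparing(ChronologicalPathOrdering::timestampOf));
        return sorted;
    }

    // Strategy 2: keep the order in which the paths were configured.
    static List<String> orderByIndex(List<String> paths) {
        return new ArrayList<>(paths);
    }
}
```

Because `List.sort` is stable, paths that share a timestamp keep their configured relative order under the first strategy.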
- add methods to get FileSplitAssigner and FileRecordFormat based
on configs
- pass StencilClientOrchestrator to SourceFactory as well when
creating the source

[raystack#99]
- this is required for parsing the parquet SimpleGroup data
structure into Java objects.

[raystack#99]
- implement parsers for int32, int64 and boolean
parquet data types

[raystack#99]
- remove abstract method serializer from the interface
as it is not required

[raystack#99]
- return DaggerDeserializationException instead of
ClassCastException when logical type is incorrect

[raystack#99]
- return DaggerDeserializationException instead of
ClassCastException when logical type is incorrect

[raystack#99]
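
A minimal sketch of the idea behind these two commits, assuming the mismatch surfaces as a `ClassCastException` when a value is read with the wrong type. `DeserializationException` here is a local stand-in for `DaggerDeserializationException`, and `LogicalTypeGuard.readInt64` is a hypothetical helper, not a method from this PR.

```java
// Stand-in for DaggerDeserializationException; defined locally so the sketch compiles on its own.
class DeserializationException extends RuntimeException {
    DeserializationException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical helper showing the wrap-and-rethrow pattern.
class LogicalTypeGuard {
    // Reads a value expected to be an int64; a type mismatch is reported as a
    // domain-specific exception that names the field, not as a raw ClassCastException.
    static Long readInt64(Object raw, String fieldName) {
        try {
            return (Long) raw;
        } catch (ClassCastException e) {
            throw new DeserializationException(
                    "Field '" + fieldName + "' does not hold an int64 value", e);
        }
    }
}
```

Keeping the original exception as the cause preserves the stack trace for debugging while callers only need to handle one exception type.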
- change the class to a usual class instead of
a factory class

[raystack#99]
- ParquetDataTypeParser.getValueOrDefault() now returns
the default value only if the deserialized value is null.

[raystack#99]
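
The behaviour described above can be sketched as follows. The signature is an assumption for illustration (the real `ParquetDataTypeParser.getValueOrDefault()` may differ); the point is only that the default is used when the deserialized value is null, and not when it is merely "falsy", such as `0` or `false`.

```java
import java.util.function.Supplier;

// Hypothetical stand-alone version; the signature is assumed, not taken from the PR.
class ValueOrDefault {
    static Object getValueOrDefault(Supplier<Object> valueSupplier, Object defaultValue) {
        Object value = valueSupplier.get();
        // Only a null result falls back to the default; 0, false and "" are returned as-is.
        return value == null ? defaultValue : value;
    }
}
```

For example, `getValueOrDefault(() -> 0, 42)` yields `0`, whereas `getValueOrDefault(() -> null, 42)` yields `42`.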
Meghajit added 14 commits May 6, 2022 11:08
- add validation methods to check if SimpleGroup map
schema follows Apache Parquet LogicalTypes spec or legacy one
- official spec
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
- add some tests

[raystack#137]
- add tests
- refactor implementation of the original method into smaller
modular methods

[raystack#137]
- remove proto keyword
- update usages
- this addresses review comments
raystack#138 (comment)
and raystack#138 (comment)

[raystack#138]
…00-parquet-complex-and-repeated-datatype-deserialization
- replace transformFromKafka with transformFromProto
- fixes for review comment
raystack#140 (comment)

[raystack#140]
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
…00-parquet-complex-and-repeated-datatype-deserialization
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
Meghajit added 7 commits May 17, 2022 13:22
- replace `KafkaTransform` keyword

[raystack#100]
…' into feat/issue#100-parquet-complex-and-repeated-datatype-deserialization
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
…' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
…ization' into feat/issue#150-handle-invalid-parquet-data-source-configs
…' into feat/issue#150-handle-invalid-parquet-data-source-configs

# Conflicts:
#	dagger-core/src/main/java/io/odpf/dagger/core/source/config/StreamConfig.java
#	dagger-core/src/main/java/io/odpf/dagger/core/source/config/adapter/FileDateRangeAdaptor.java
#	dagger-core/src/main/java/io/odpf/dagger/core/source/parquet/splitassigner/ChronologyOrderedSplitAssigner.java
#	dagger-core/src/test/java/io/odpf/dagger/core/source/config/StreamConfigTest.java
#	dagger-core/src/test/java/io/odpf/dagger/core/source/config/adapter/FileDateRangeAdaptorTest.java
#	dagger-core/src/test/java/io/odpf/dagger/core/source/parquet/ParquetFileSourceTest.java
#	dagger-core/src/test/java/io/odpf/dagger/core/source/parquet/splitassigner/ChronologyOrderedSplitAssignerTest.java
@prakharmathur82 prakharmathur82 merged commit 1cf812a into raystack:dagger-parquet-file-processing May 30, 2022
Development

Successfully merging this pull request may close these issues.

Handle missing configs, extra whitespaces or incorrect config values for Parquet Source
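
As a rough sketch of the three cases named in the issue title (missing configs, extra whitespace, incorrect values), validation could look something like the following. The class, the `ReadOrder` enum, its constants, and any config key passed in are hypothetical and are not Dagger's actual config names.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical validator; names are illustrative only.
class ParquetConfigValidation {
    // Assumed set of allowed values for a read-order config.
    enum ReadOrder { TIMESTAMP_IN_URL, INDEX_IN_ARRAY }

    // Fails fast on a missing or blank config, and strips surrounding whitespace.
    static String requireNonBlank(Map<String, String> config, String key) {
        String raw = config.get(key);
        if (raw == null || raw.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required config: " + key);
        }
        return raw.trim();
    }

    // Parses the trimmed value, reporting the allowed values when it is not recognised.
    static ReadOrder readOrder(Map<String, String> config, String key) {
        String value = requireNonBlank(config, key);
        try {
            return ReadOrder.valueOf(value);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                    "Invalid value '" + value + "' for " + key
                            + "; expected one of " + Arrays.toString(ReadOrder.values()), e);
        }
    }
}
```

Trimming before parsing means a value such as `"  INDEX_IN_ARRAY  "` is accepted, while an absent key or an unknown value produces an error message that names the offending config.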

2 participants