-
Notifications
You must be signed in to change notification settings - Fork 42
feat: Handle complex/nested data types from a parquet file in Parquet Data Source: Struct, Repeated Struct, Maps and Timestamp of type SimpleGroup #148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Handle complex/nested data types from a parquet file in Parquet Data Source: Struct, Repeated Struct, Maps and Timestamp of type SimpleGroup #148
Conversation
- added split assigners to assign splits based on timestamp in url and based on index in filepaths array [raystack#99]
- add methods to get FileSplitAssigner and FileRecordFormat based on configs - pass StencilClientOrchestrator to SourceFactory as well when creating the source [raystack#99]
- this is required for parsing the parquet SimpleGroup data structure into Java objects. [raystack#99]
- implement parsers for int32, int64 and boolean parquet data types [raystack#99]
- remove abstract method serializer from the interface as it is not required [raystack#99]
- return DaggerDeserializationException instead of ClassCastException when logical type is incorrect [raystack#99]
- return DaggerDeserializationException instead of ClassCastException when logical type is incorrect [raystack#99]
- change the class to a usual class instead of a factory class [raystack#99]
- ParquetDataTypeParser.getValueOrDefault() now returns the default value only if the deserialized value is null. [raystack#99]
…andler - add tests [raystack#100]
- fixes review comment raystack#138 (comment) [raystack#139]
- returning a long[] causes failure during deserialization due to class cast exception. This is because the type information for LongHandler array is an object array of LONG and not a primitive array [raystack#100]
…alization - adding a type helps remove boilerplate code for type casting [raystack#100]
…00-parquet-complex-and-repeated-datatype-deserialization
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
- add validation methods to check if SimpleGroup map schema follows Apache Parquet LogicalTypes spec or legacy one - official spec https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1 - add some tests [raystack#137]
- add tests - refactor implementation of the original method into smaller modular methods [raystack#137]
…lizer - remove unnecessary test [raystack#137]
- remove proto keyword - update usages - this fixes for review comment raystack#138 (comment) and raystack#138 (comment) [raystack#138]
…00-parquet-complex-and-repeated-datatype-deserialization
- replace transformFromKafka with transformFromProto - fixes for review comment raystack#140 (comment) [raystack#140]
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
…files-and-packages
…00-parquet-complex-and-repeated-datatype-deserialization
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
- replace `KafkaTransform` keyword [raystack#100]
...c/main/java/io/odpf/dagger/common/serde/parquet/deserialization/SimpleGroupDeserializer.java
Show resolved
Hide resolved
| * @param fieldName The name of the map field | ||
| * @return true, if the map structure follows the spec and false otherwise. | ||
| */ | ||
| public static boolean checkIsLegacySimpleGroupMap(SimpleGroup simpleGroup, String fieldName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks more like a method in mapHandler rather than a static method here. Any particular rationale for not keeping it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are mainly 2 reasons:
- SRP: Even though it uses map, this function looks and behaves like a validator. It returns a boolean, yes or no answer. I am giving this class a responsibility of validating simple groups against a field, that's it. Whether it's a map field, enum field, string field or what kind of validation doesn't quite matter. Maybe in the future if we start supporting simple group structs, we can also put validators here on whether it is a struct that is valid.
- Modularity and Ease of Understanding: If you check this function carefully, it does a quite many
ifbased checks and hence this needs to be tested for different corner cases. The unit tests are verbose. Here it is a public function and can be tested as it is, but when placed in MapHandler, this would ideally become a private function and hence the entire unit would need to be tested through the public interface oftransformFromParquet. This method already has a lot of big tests right now and moving this validator function to MapHandler would result in a lot of hard to understand tests. The MapHandler class is already quite big and I don't want to make it bigger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, let's keep this for now
| * @param fieldName The name of the map field | ||
| * @return true, if the map structure follows the spec and false otherwise. | ||
| */ | ||
| public static boolean checkIsStandardSimpleGroupMap(SimpleGroup simpleGroup, String fieldName) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks more like a method in mapHandler rather than a static method here. Any particular rationale for not keeping it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, we can keep this for now
| @Override | ||
| public Object parseRepeatedSimpleGroupField(SimpleGroup simpleGroup) { | ||
| String fieldName = fieldDescriptor.getName(); | ||
| ArrayList<Long> deserializedValues = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep variable naming consistent with other classes? longArray?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed via commit a28474c
| public Object transformFromParquet(SimpleGroup simpleGroup) { | ||
| return null; | ||
| String defaultEnumValue = fieldDescriptor.getEnumType().findValueByNumber(0).getName(); | ||
| List<String> deserializedEnumArray = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enumArray? For consistency with other similar classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed via commit a28474c
…' into feat/issue#100-parquet-complex-and-repeated-datatype-deserialization
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
Solves for #137