Skip to content

Conversation

@Meghajit
Copy link
Member

@Meghajit Meghajit commented May 5, 2022

Solves for #137

- added split assigners to assign splits based on
timestamp in url and based on index in filepaths array

[raystack#99]
- add methods to get FileSplitAssigner and FileRecordFormat based
on configs
- pass StencilClientOrchestrator to SourceFactory as well when
creating the source

[raystack#99]
- this is required for parsing the parquet SimpleGroup data
structure into Java objects.

[raystack#99]
- implement parsers for int32, int64 and boolean
parquet data types

[raystack#99]
- remove abstract method serializer from the interface
as it is not required

[raystack#99]
- return DaggerDeserializationException instead of
ClassCastException when logical type is incorrect

[raystack#99]
- return DaggerDeserializationException instead of
ClassCastException when logical type is incorrect

[raystack#99]
- change the class to a usual class instead of
a factory class

[raystack#99]
- ParquetDataTypeParser.getValueOrDefault() now returns
the default value only if the deserialized value is null.

[raystack#99]
- returning a long[] causes failure during deserialization due to
class cast exception. This is because the type information for LongHandler
array is an object array of LONG and not a primitive array

[raystack#100]
…alization

- adding a type helps remove boilerplate code for type casting

[raystack#100]
…00-parquet-complex-and-repeated-datatype-deserialization
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
@Meghajit Meghajit changed the title [WIP] feat: Handle complex/nested data types from a parquet file in Parquet Data Source: Struct, Repeated Struct, Maps and Timestamp of type SimpleGroup feat: Handle complex/nested data types from a parquet file in Parquet Data Source: Struct, Repeated Struct, Maps and Timestamp of type SimpleGroup May 10, 2022
@Meghajit Meghajit marked this pull request as ready for review May 10, 2022 05:33
Meghajit added 8 commits May 11, 2022 17:48
- remove proto keyword
- update usages
- this fixes for review comment
raystack#138 (comment)
and raystack#138 (comment)

[raystack#138]
…00-parquet-complex-and-repeated-datatype-deserialization
- replace transformFromKafka with transformFromProto
- fixes for review comment
raystack#140 (comment)

[raystack#140]
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
…00-parquet-complex-and-repeated-datatype-deserialization
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
- replace `KafkaTransform` keyword

[raystack#100]
* @param fieldName The name of the map field
* @return true, if the map structure follows the spec and false otherwise.
*/
public static boolean checkIsLegacySimpleGroupMap(SimpleGroup simpleGroup, String fieldName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks more like a method in mapHandler rather than a static method here. Any particular rationale for not keeping it there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are mainly 2 reasons:

  1. SRP: Even though it uses map, this function looks and behaves like a validator. It returns a boolean, yes or no answer. I am giving this class a responsibility of validating simple groups against a field, that's it. Whether it's a map field, enum field, string field or what kind of validation doesn't quite matter. Maybe in the future if we start supporting simple group structs, we can also put validators here on whether it is a struct that is valid.
  2. Modularity and Ease of Understanding: If you check this function carefully, it does a quite many if based checks and hence this needs to be tested for different corner cases. The unit tests are verbose. Here it is a public function and can be tested as it is, but when placed in MapHandler, this would ideally become a private function and hence the entire unit would need to be tested through the public interface of transformFromParquet. This method already has a lot of big tests right now and moving this validator function to MapHandler would result in a lot of hard to understand tests. The MapHandler class is already quite big and I don't want to make it bigger.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, let's keep this for now

* @param fieldName The name of the map field
* @return true, if the map structure follows the spec and false otherwise.
*/
public static boolean checkIsStandardSimpleGroupMap(SimpleGroup simpleGroup, String fieldName) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks more like a method in mapHandler rather than a static method here. Any particular rationale for not keeping it there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, we can keep this for now

@Override
public Object parseRepeatedSimpleGroupField(SimpleGroup simpleGroup) {
String fieldName = fieldDescriptor.getName();
ArrayList<Long> deserializedValues = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep variable naming consistent with other classes? longArray?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed via commit a28474c

public Object transformFromParquet(SimpleGroup simpleGroup) {
return null;
String defaultEnumValue = fieldDescriptor.getEnumType().findValueByNumber(0).getName();
List<String> deserializedEnumArray = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enumArray? For consistency with other similar classes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed via commit a28474c

Meghajit added 3 commits May 20, 2022 17:49
…' into feat/issue#100-parquet-complex-and-repeated-datatype-deserialization
…serialization' into feat/issue#137-parquet-map-and-group-timestamp-deserialization
@gauravsinghania gauravsinghania merged commit 28912c9 into raystack:dagger-parquet-file-processing May 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Handle complex/nested data types from a parquet file in Parquet Data Source: Struct, Repeated Struct, Maps and Timestamp of type SimpleGroup

2 participants