[SPARK-25746][SQL] Refactoring ExpressionEncoder to get rid of flat flag by viirya · Pull Request #22749 · apache/spark

viirya · 2018-10-16T15:30:10Z

What changes were proposed in this pull request?

This was inspired while implementing #21732. Currently, ScalaReflection needs to consider how ExpressionEncoder uses the generated serializers and deserializers, and ExpressionEncoder has a weird flat flag. After discussion with @cloud-fan, it seems better to refactor ExpressionEncoder. This should make SPARK-24762 easier to do.

To summarize the proposed changes:

serializerFor and deserializerFor return expressions for serializing/deserializing an input expression for a given type. They are private and should not be called directly.
serializerForType and deserializerForType return an expression for serializing/deserializing an object of type T to/from its Spark SQL representation. They assume the input object/Spark SQL representation is located at ordinal 0 of a row.

In other words, serializerForType and deserializerForType return expressions for atomically serializing/deserializing a JVM object to/from a Spark SQL value.

A serializer returned by serializerForType will serialize an object at row(0) to the corresponding Spark SQL representation, e.g. primitive type, array, map, struct.

A deserializer returned by deserializerForType will deserialize an input field at row(0) to an object of the given type.

The construction of ExpressionEncoder takes a pair of serializer and deserializer for type T. It uses them to create the serializer and deserializer for T <-> row serialization. Now ExpressionEncoder doesn't need to remember whether the serializer is flat or not. When we need to construct a new ExpressionEncoder based on existing ones, we only need to change the input location in the atomic serializer and deserializer.

How was this patch tested?

Existing tests.

SparkQA · 2018-10-16T16:13:44Z

Test build #97459 has finished for PR 22749 at commit d755e84.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
throw new RuntimeException(s"class $clsName has unexpected serializer: $objSerializer")

SparkQA · 2018-10-16T16:20:08Z

Test build #97460 has finished for PR 22749 at commit 84f3ce0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
throw new RuntimeException(s"class $clsName has unexpected serializer: $objSerializer")

SparkQA · 2018-10-17T03:56:58Z

Test build #97479 has finished for PR 22749 at commit 6a6fa45.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-17T07:05:02Z

Test build #97480 has finished for PR 22749 at commit 25a6162.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-10-17T07:12:00Z

retest this please.

SparkQA · 2018-10-17T10:09:31Z

Test build #97485 has finished for PR 22749 at commit 25a6162.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-18T13:00:28Z

-    // We convert the not-serializable TypeTag into StructType and ClassTag.
    val mirror = ScalaReflection.mirror
-    val tpe = typeTag[T].in(mirror).tpe
+    val tpe = ScalaReflection.localTypeOf[T]


why change it from typeTag[T].in(mirror).tpe?

localTypeOf is actually doing the same thing. I think it is better to use ScalaReflection for such thing.

localTypeOf has a dealias at the end.

I think it should be fine, but let me revert this change first.

cloud-fan · 2018-10-18T13:01:12Z

   * name/positional binding is preserved.
   */
  def tuple(encoders: Seq[ExpressionEncoder[_]]): ExpressionEncoder[_] = {
+    if (encoders.length > 22) {


can we do it in a separated PR with a test?

cloud-fan · 2018-10-18T13:06:04Z


-      val newSerializer = enc.serializer.map(_.transformUp {
+      val newSerializer = enc.objSerializer.transformUp {
        case b: BoundReference if b == originalInputObject => newInputObject


Since there is only one distinct BoundReference, we can just write case b: BoundReference => newInputObject

yes, right.
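For readers following the thread, the rebinding discussed here can be sketched on a toy expression tree. This is a minimal, self-contained illustration and not Catalyst code: `Expr`, `GetField`, `Upper`, and `rebind` are hypothetical stand-ins, and only the shape of the `transformUp` rewrite mirrors the diff above. It assumes, as the comment says, that the serializer contains a single distinct `BoundReference`.

```scala
object RebindSketch {
  // Toy expression tree; these are NOT Spark's Catalyst classes.
  sealed trait Expr {
    def children: Seq[Expr]
    def withChildren(newChildren: Seq[Expr]): Expr

    // Bottom-up rewrite: rewrite the children first, then try the rule on the rebuilt node.
    def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
      val rebuilt = withChildren(children.map(_.transformUp(rule)))
      if (rule.isDefinedAt(rebuilt)) rule(rebuilt) else rebuilt
    }
  }

  final case class BoundReference(ordinal: Int) extends Expr {
    def children: Seq[Expr] = Nil
    def withChildren(newChildren: Seq[Expr]): Expr = this
  }

  final case class GetField(child: Expr, name: String) extends Expr {
    def children: Seq[Expr] = Seq(child)
    def withChildren(newChildren: Seq[Expr]): Expr = copy(child = newChildren.head)
  }

  final case class Upper(child: Expr) extends Expr {
    def children: Seq[Expr] = Seq(child)
    def withChildren(newChildren: Seq[Expr]): Expr = copy(child = newChildren.head)
  }

  // Point a serializer at a new input location. Because the serializer is assumed
  // to hold only one distinct BoundReference, matching on the type is enough.
  def rebind(objSerializer: Expr, newInput: Expr): Expr =
    objSerializer.transformUp { case _: BoundReference => newInput }
}
```

In this sketch, rebinding `Upper(GetField(BoundReference(0), "name"))` to the hypothetical input `GetField(BoundReference(0), "_2")` leaves the rest of the tree untouched and only swaps the leaf.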

cloud-fan · 2018-10-18T13:07:39Z

@@ -103,75 +88,61 @@ object ExpressionEncoder {
   * name/positional binding is preserved.
   */
  def tuple(encoders: Seq[ExpressionEncoder[_]]): ExpressionEncoder[_] = {


cool, this method is simplified a lot with the new abstraction.

cloud-fan · 2018-10-18T13:14:50Z

+          AssertNotNull(r, Seq("top level Product or row object"))
+      }
+      nullSafeSerializer match {
+        case If(_, _, s: CreateNamedStruct) => s


let's also make sure the if condition is IsNull, which better explains why we strip it(it can't be null)

cloud-fan · 2018-10-18T13:16:40Z


-  if (flat) require(serializer.size == 1)
+  /**
+   * A set of expressions, one for each top-level field that can be used to


set -> sequence

cloud-fan · 2018-10-18T13:21:37Z

+  // The schema after converting `T` to a Spark SQL row. This schema is dependent on the given
+  // serialier.
+  val schema: StructType = StructType(serializer.map { s =>
+    StructField(s.name, s.dataType, s.nullable)


can we call dataType before serializer is analyzed?

nvm, serializer don't need analysis

cloud-fan · 2018-10-18T13:26:47Z

I like this idea! waiting for tests pass

SparkQA · 2018-10-18T16:13:28Z

Test build #97539 has finished for PR 22749 at commit 85a9122.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-24T06:25:23Z

hmm, it still has conflict...

viirya · 2018-10-24T06:26:51Z

Let me rebase again.

SparkQA · 2018-10-24T07:05:02Z

Test build #97964 has finished for PR 22749 at commit ed4f4c9.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-10-24T07:14:33Z

retest this please.

cloud-fan · 2018-10-24T07:37:01Z

+
+    // The input object to `ExpressionEncoder` is located at first column of an row.
+    val inputObject = BoundReference(0, dataTypeFor(tpe),
+      nullable = !cls.isPrimitive)


we just check isPrimitive of the given cls, can we check tpe directly?

Yes, we can check tpe.typeSymbol.asClass.isPrimitive instead.

good, then we don't need cls as a parameter.
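As a small illustration of the check suggested above, the following sketch uses Scala runtime reflection directly. It assumes `scala-reflect` is on the classpath; `isPrimitiveType` is a hypothetical helper name, not something from the patch.

```scala
import scala.reflect.runtime.universe._

object PrimitiveCheckSketch {
  // True when T's class symbol is one of Scala's primitive value classes
  // (Int, Long, Double, Boolean, and so on). Hypothetical helper for illustration.
  def isPrimitiveType[T: TypeTag]: Boolean =
    typeOf[T].typeSymbol.asClass.isPrimitive
}
```

With this, a `nullable` flag could be derived as `!isPrimitiveType[T]` without passing a `Class[_]` around, which is the point made in the thread.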

cloud-fan · 2018-10-24T07:42:09Z


-  /** Helper for extracting internal fields from a case class. */
+  /**
+   * Returns an expression for serializing the value of an input expression into Spark SQL


do we really need to duplicate the doc in this private method?

I did simplify a lot of it.

cloud-fan · 2018-10-24T07:49:20Z

-    val serializer = serializerFor(AssertNotNull(inputObject, Seq("top level row object")), schema)
-    val deserializer = deserializerFor(schema)
+    val serializer = serializerFor(inputObject, schema)
+    val deserializer = deserializerFor(GetColumnByOrdinal(0, serializer.dataType), schema)


in ScalaReflection, we create GetColumnByOrdinal in deserializeFor, shall we follow it here?

Ok. Sounds better.

Ah, we need to access serializer.dataType here. So if we want to create GetColumnByOrdinal in deserializeFor, we need to pass this data type too. What do you think?

ah i see, then let's leave it.

cloud-fan · 2018-10-24T08:08:29Z

    // side, in cases like outer-join.
    val left = {
-      val combined = if (this.exprEnc.flat) {
+      val combined = if (!this.exprEnc.objSerializer.dataType.isInstanceOf[StructType]) {


shall we create a method in ExpressionEncoder for this check?
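One possible shape for such a helper, sketched on toy types: the data types and `ToyEncoder` below are hypothetical stand-ins rather than Spark classes, and the method name `isSerializedAsStruct` is an assumption chosen for illustration. The idea is only that the old `flat` flag can be derived from the serializer's data type.

```scala
object StructCheckSketch {
  // Toy data types; NOT Spark's org.apache.spark.sql.types.
  sealed trait DataType
  case object IntegerType extends DataType
  case object StringType extends DataType
  final case class StructType(fieldNames: Seq[String]) extends DataType

  // Toy encoder that only remembers the data type produced by its object serializer.
  final case class ToyEncoder(objSerializerType: DataType) {
    // Replaces a stored `flat` flag: derived from the type instead of remembered.
    def isSerializedAsStruct: Boolean = objSerializerType.isInstanceOf[StructType]
  }

  // Mirrors the call site in the diff: non-struct encoders need special handling.
  def needsWrapping(enc: ToyEncoder): Boolean = !enc.isSerializedAsStruct
}
```

Call sites would then read `if (!enc.isSerializedAsStruct)` instead of repeating the `isInstanceOf[StructType]` test.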

SparkQA · 2018-10-24T10:53:32Z

Test build #97967 has finished for PR 22749 at commit ed4f4c9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-24T10:57:23Z

+   * A sequence of expressions, one for each top-level field that can be used to
+   * extract the values from a raw object into an [[InternalRow]]:
+   * 1. If `serializer` encodes a raw object to a struct, we directly use the `serializer`.
+   * 2. For other cases, we create a struct to wrap the `serializer`.


Let's make these 2 comments more precise

1. If `serializer` encodes a raw object to a struct, strip the outer if-IsNull and get the CreateNamedStruct.
2. For other cases, wrap the single serializer with CreateNamedStruct.
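The two cases can be sketched as a pattern match over a toy expression tree. This is an illustration under stated assumptions and not the patch itself: the case classes are simplified stand-ins for Catalyst expressions, `toRowSerializer` is a hypothetical name, and the single-field name `"value"` is assumed here for the wrapped case.

```scala
object TopLevelSerializerSketch {
  // Toy expressions; NOT Spark's Catalyst classes.
  sealed trait Expr
  final case class Input(ordinal: Int) extends Expr
  final case class GetField(child: Expr, name: String) extends Expr
  final case class IsNull(child: Expr) extends Expr
  case object NullLiteral extends Expr
  final case class If(cond: Expr, whenTrue: Expr, whenFalse: Expr) extends Expr
  final case class CreateNamedStruct(fields: Seq[(String, Expr)]) extends Expr

  def toRowSerializer(objSerializer: Expr): CreateNamedStruct = objSerializer match {
    // Case 1: a struct serializer guarded by a null check. The top-level object
    // is asserted non-null elsewhere, so the guard can be stripped.
    case If(_: IsNull, _, s: CreateNamedStruct) => s
    // Case 2: anything else is wrapped as a single-field struct
    // (the field name "value" is an assumption of this sketch).
    case other => CreateNamedStruct(Seq("value" -> other))
  }
}
```

Matching on `IsNull` in the condition, rather than on any `If`, documents why stripping is safe: the guard only exists to handle a null input object.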

cloud-fan · 2018-10-24T11:05:47Z

-    assert(numberOfCheckedArguments(deserializerFor[(java.lang.Double, Int)]) == 1)
-    assert(numberOfCheckedArguments(deserializerFor[(java.lang.Integer, java.lang.Integer)]) == 0)
+    assert(numberOfCheckedArguments(
+      deserializerForType(ScalaReflection.localTypeOf[(Double, Double)])) == 2)


shall we create a deserializerFor method in this test suite to save some code diff?

Sounds good.

cloud-fan · 2018-10-24T11:07:36Z

LGTM except 2 minor comments

SparkQA · 2018-10-24T12:19:15Z

Test build #97969 has finished for PR 22749 at commit 8cb710b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-24T13:19:10Z

  test("SPARK-22442: Generate correct field names for special characters") {
-    val serializer = serializerFor[SpecialCharAsFieldData](BoundReference(
-      0, ObjectType(classOf[SpecialCharAsFieldData]), nullable = false))
+    val serializer = serializerForType(ScalaReflection.localTypeOf[SpecialCharAsFieldData])


can we replace all the serializerForType with serializerFor in this suite?

Do you mean to create a method serializerFor in this suite? Or replace serializerForType with ScalaReflection.serializerFor?

like deserializerFor in this suite, let's also create a serializerFor

SparkQA · 2018-10-24T14:52:21Z

Test build #97971 has finished for PR 22749 at commit 078a071.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-25T03:18:01Z

Test build #97991 has finished for PR 22749 at commit c00d5e4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-25T11:18:10Z

+      }
+      nullSafeSerializer match {
+        case If(_: IsNull, _, s: CreateNamedStruct) => s
+        case s: CreateNamedStruct => s


when will we hit this?

oh, good catch! I think this is redundant pattern.

this is minor, we can update it in another PR. We don't need to wait for another jenkins QA round.

Ok. Sounds good to me.

cloud-fan · 2018-10-25T11:28:47Z

thanks, merging to master!

viirya · 2018-10-25T11:29:36Z

Thanks @cloud-fan

## What changes were proposed in this pull request?

This was inspired while implementing apache#21732. Currently, `ScalaReflection` needs to consider how `ExpressionEncoder` uses the generated serializers and deserializers, and `ExpressionEncoder` has a weird `flat` flag. After discussion with cloud-fan, it seems better to refactor `ExpressionEncoder`. This should make SPARK-24762 easier to do.

To summarize the proposed changes:

1. `serializerFor` and `deserializerFor` return expressions for serializing/deserializing an input expression for a given type. They are private and should not be called directly.

2. `serializerForType` and `deserializerForType` return an expression for serializing/deserializing an object of type T to/from its Spark SQL representation. They assume the input object/Spark SQL representation is located at ordinal 0 of a row.

   In other words, `serializerForType` and `deserializerForType` return expressions for atomically serializing/deserializing a JVM object to/from a Spark SQL value.

   A serializer returned by `serializerForType` will serialize an object at `row(0)` to the corresponding Spark SQL representation, e.g. primitive type, array, map, struct.

   A deserializer returned by `deserializerForType` will deserialize an input field at `row(0)` to an object of the given type.

3. The construction of `ExpressionEncoder` takes a pair of serializer and deserializer for type `T`. It uses them to create the serializer and deserializer for T <-> row serialization. Now `ExpressionEncoder` doesn't need to remember whether the serializer is flat or not. When we need to construct a new `ExpressionEncoder` based on existing ones, we only need to change the input location in the atomic serializer and deserializer.

## How was this patch tested?

Existing tests.

Closes apache#22749 from viirya/SPARK-24762-refactor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

## What changes were proposed in this pull request?

A follow-up of apache#22749.

When we construct the new serializer in `ExpressionEncoder.tuple`, we don't need to add an `if(isnull ...)` check for each field. The fields are either simple expressions that propagate null correctly (e.g. `GetStructField(GetColumnByOrdinal(0, schema), index)`) or complex expressions that already have the isnull check.

## How was this patch tested?

Existing tests.

Closes apache#22898 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

viirya added 10 commits July 9, 2018 03:42

Aggregator should be able to use Option of Product encoder.

e1b5dee

Enable top-level Option of Product encoders.

80506f4

Remove topLevel parameter.

ed3d5cb

Merge remote-tracking branch 'upstream/master' into SPARK-24762

9fc3f61

Remove useless change.

5f95bd0

Add more tests.

a4f0405

Add test.

c1f798f

Merge remote-tracking branch 'upstream/master' into SPARK-24762

80e11d2

Improve code comments.

0f029b0

Refactoring ExpressionEncoder.

84f3ce0

viirya force-pushed the SPARK-24762-refactor branch from d755e84 to 84f3ce0 October 16, 2018 15:34

Fix Malformed class name.

6a6fa45

Fix error message.

25a6162

cloud-fan reviewed Oct 18, 2018


viirya added 2 commits October 18, 2018 15:58

Fix test.

295ecde

Merge remote-tracking branch 'upstream/master' into SPARK-24762-refactor

85a9122

Merge remote-tracking branch 'upstream/master' into SPARK-24762-refactor

ed4f4c9

cloud-fan reviewed Oct 24, 2018


Address comments.

8cb710b

cloud-fan reviewed Oct 24, 2018


Make comment more precise.

682fa4b

cloud-fan reviewed Oct 24, 2018


Simplify test change.

078a071

cloud-fan reviewed Oct 24, 2018


Address comment.

c00d5e4

cloud-fan reviewed Oct 25, 2018


asfgit closed this in cb5ea20 Oct 25, 2018

This was referenced Oct 25, 2018

[SPARK-25817][SQL] Dataset encoder should support combination of map and product type #22812

Closed

[SPARK-25746][SQL][followup] do not add unnecessary If expression #22898

Closed

viirya deleted the SPARK-24762-refactor branch December 27, 2023 18:22
