[SPARK-19471][SQL]AggregationIterator does not initialize the generated result projection before using it by DonnyZone · Pull Request #18920 · apache/spark

DonnyZone · 2017-08-11T13:07:24Z

What changes were proposed in this pull request?

We recently hit the NullPointerException described in the following issue in our production environment:
https://issues.apache.org/jira/browse/SPARK-19471

This issue can be reproduced by the following examples:
import org.apache.spark.sql.functions._

val df = spark.createDataFrame(Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4))).toDF("x", "y")

// HashAggregate, with SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key = false
df.groupBy("x").agg(rand(), sum("y")).show()

// ObjectHashAggregate, with SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key = false
df.groupBy("x").agg(rand(), collect_list("y")).show()

// SortAggregate, with SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key = false and SQLConf.USE_OBJECT_HASH_AGG.key = false
df.groupBy("x").agg(rand(), collect_list("y")).show()

This PR is based on PR-16820 (#16820) and adds test cases for all aggregation paths. We would like to push that work forward.

When AggregationIterator generates the result projection, it does not call the initialize method of the Projection class. This causes a NullPointerException at runtime when the projection involves nondeterministic expressions.
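The failure mode can be illustrated with a minimal sketch that does not depend on Spark. `ToyRand` and `ToyProjection` are hypothetical stand-ins, not Spark classes; the sketch only assumes that a nondeterministic expression creates its per-partition state inside `initialize`, so evaluating it earlier dereferences a null field.

```scala
import scala.util.{Random, Try}

// Hypothetical stand-in for a nondeterministic expression such as rand().
// Its generator exists only after initialize() has been called.
final class ToyRand(seed: Long) {
  private var rng: Random = _ // stays null until initialize() runs

  def initialize(partitionIndex: Int): Unit = {
    rng = new Random(seed + partitionIndex)
  }

  // Throws NullPointerException if initialize() was never called.
  def eval(): Double = rng.nextDouble()
}

// Hypothetical stand-in for a result projection wrapping that expression.
final class ToyProjection(expr: ToyRand) {
  def initialize(partitionIndex: Int): Unit = expr.initialize(partitionIndex)
  def apply(groupingKey: String): (String, Double) = (groupingKey, expr.eval())
}

object UninitializedProjectionDemo {
  def main(args: Array[String]): Unit = {
    val projection = new ToyProjection(new ToyRand(42L))

    // Used before initialize(): the call fails.
    val before = Try(projection("1"))
    println(s"before initialize, succeeded: ${before.isSuccess}")

    // Used after initialize(): the call succeeds.
    projection.initialize(0)
    val after = Try(projection("1"))
    println(s"after initialize, succeeded: ${after.isSuccess}")
  }
}
```

The fix in this PR corresponds to the second half of `main`: initialize the projection with the partition index before it is applied to any row.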

How was this patch tested?

unit test
verified in production environment

DonnyZone · 2017-08-11T13:10:45Z

Jenkins, test this please

DonnyZone · 2017-08-11T13:14:50Z

@hvanhovell, @yangw1234, @gatorsmile

gatorsmile · 2017-08-13T06:30:43Z

ok to test

SparkQA · 2017-08-13T07:04:49Z

Test build #80582 has finished for PR 18920 at commit b932d2f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

DonnyZone · 2017-08-13T12:11:34Z

Jenkins, retest this please.

SparkQA · 2017-08-13T15:21:46Z

Test build #80591 has finished for PR 18920 at commit bb29b8f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-13T17:15:54Z

+        (SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, wholeStage.toString),
+        (SQLConf.USE_OBJECT_HASH_AGG.key, useObjectHashAgg.toString)) {
+        val df = Seq(("1", 1), ("1", 2), ("2", 3), ("2", 4)).toDF("x", "y")
+        // HashAggregate


We need to check/compare the plans to ensure they are HashAggregate, ObjectHashAggregate and SortAggregate.

DonnyZone · 2017-08-14T00:35:37Z

updated

DonnyZone · 2017-08-14T00:36:37Z

Jenkins, retest this please.

SparkQA · 2017-08-14T03:14:25Z

Test build #80599 has finished for PR 18920 at commit 5239ebb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-14T06:00:41Z

+          assert(hashAggPlan.find(p =>
+            p.isInstanceOf[WholeStageCodegenExec] &&
+              p.asInstanceOf[WholeStageCodegenExec].child
+                .isInstanceOf[HashAggregateExec]).isDefined)


assert(hashAggPlan.find { case WholeStageCodegenExec(_: HashAggregateExec) => true case _ => false }.isDefined)
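As a rough illustration of why the pattern-matching form is equivalent to the cast-based one, here is a self-contained sketch. The `Toy*` case classes are hypothetical stand-ins for plan nodes, not Spark's `WholeStageCodegenExec` or `HashAggregateExec`, and `find` here is a simple pre-order search written for this example.

```scala
// Hypothetical stand-ins for physical plan nodes; these are NOT Spark's classes.
sealed trait ToyPlan {
  def children: Seq[ToyPlan]

  // Pre-order search: returns the first node satisfying the predicate.
  def find(f: ToyPlan => Boolean): Option[ToyPlan] =
    if (f(this)) Some(this)
    else children.foldLeft(Option.empty[ToyPlan])((acc, c) => acc.orElse(c.find(f)))
}

case class ToyScan() extends ToyPlan {
  def children: Seq[ToyPlan] = Nil
}
case class ToyHashAggregate(child: ToyPlan) extends ToyPlan {
  def children: Seq[ToyPlan] = Seq(child)
}
case class ToyWholeStageCodegen(child: ToyPlan) extends ToyPlan {
  def children: Seq[ToyPlan] = Seq(child)
}

object PlanMatchDemo {
  // Cast-based check, mirroring the style of the original assertion.
  def viaCasts(plan: ToyPlan): Boolean = plan.find(p =>
    p.isInstanceOf[ToyWholeStageCodegen] &&
      p.asInstanceOf[ToyWholeStageCodegen].child.isInstanceOf[ToyHashAggregate]).isDefined

  // Pattern-based check, mirroring the style of the suggested assertion.
  def viaPattern(plan: ToyPlan): Boolean = plan.find {
    case ToyWholeStageCodegen(_: ToyHashAggregate) => true
    case _ => false
  }.isDefined

  def main(args: Array[String]): Unit = {
    val wrapped = ToyWholeStageCodegen(ToyHashAggregate(ToyScan()))
    val bare = ToyHashAggregate(ToyScan())
    println(s"wrapped: casts=${viaCasts(wrapped)} pattern=${viaPattern(wrapped)}")
    println(s"bare:    casts=${viaCasts(bare)} pattern=${viaPattern(bare)}")
  }
}
```

Both checks agree on every input; the pattern form simply states the expected shape of the plan in one expression.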

gatorsmile · 2017-08-14T06:01:03Z

+    Seq(
+      monotonically_increasing_id(), spark_partition_id(),
+      rand(Random.nextLong()), randn(Random.nextLong())
+    ).foreach(assertNoExceptions(_))


-> ).foreach(assertNoExceptions)

gatorsmile · 2017-08-14T06:14:32Z

            allImperativeAggregateFunctions(i).eval(currentBuffer))
          i += 1
        }
+        resultProjection.initialize(partIndex)


Move it to line 221

gatorsmile · 2017-08-14T06:14:45Z

          typedImperativeAggregates(i).serializeAggregateBufferInPlace(currentBuffer)
          i += 1
        }
+        resultProjection.initialize(partIndex)


Move it to line 240

gatorsmile · 2017-08-14T06:14:55Z

      // Grouping-only: we only output values based on grouping expressions.
      val resultProjection = UnsafeProjection.create(resultExpressions, groupingAttributes)
      (currentGroupingKey: UnsafeRow, currentBuffer: InternalRow) => {
+        resultProjection.initialize(partIndex)


Move it to line 261
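The thread does not spell out the reason for these three moves, and the file lines 221, 240 and 261 are not reproduced here. One plausible reading is that `initialize` should run once, next to where the projection is created, rather than inside the per-row closure. The sketch below uses a hypothetical `ToyRandProjection` (not Spark's `UnsafeProjection`) to show what the two placements would do under that assumption.

```scala
import scala.util.Random

// Hypothetical stand-in for a projection over a nondeterministic expression.
final class ToyRandProjection(seed: Long) {
  private var rng: Random = _

  def initialize(partitionIndex: Int): Unit = {
    rng = new Random(seed + partitionIndex)
  }

  def apply(key: String): (String, Double) = (key, rng.nextDouble())
}

object InitializePlacementDemo {
  type Output = String => (String, Double)

  // initialize() inside the closure: it runs for every output row,
  // so the generator is reset to the same state each time.
  def perRow(partIndex: Int): Output = {
    val projection = new ToyRandProjection(7L)
    key => {
      projection.initialize(partIndex)
      projection(key)
    }
  }

  // initialize() next to the projection's creation: it runs once,
  // and the closure only applies the already-initialized projection.
  def oncePerPartition(partIndex: Int): Output = {
    val projection = new ToyRandProjection(7L)
    projection.initialize(partIndex)
    key => projection(key)
  }

  def main(args: Array[String]): Unit = {
    val a = perRow(0)
    val b = oncePerPartition(0)
    println(s"per row:            ${a("1")} ${a("2")}")
    println(s"once per partition: ${b("1")} ${b("2")}")
  }
}
```

In this toy model the per-row placement yields the same pseudo-random value for every group, while the once-per-partition placement advances the generator from row to row and avoids repeated setup work.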

gatorsmile · 2017-08-14T06:18:53Z

+        hashAggDF.collect()
+
+        // ObjectHashAggregate and SortAggregate test cases
+        val objHashOrSort_AggDF = df.groupBy("x").agg(c, collect_list("y"))


objHashOrSort_AggDF -> objHashAggOrSortAggDf

gatorsmile · 2017-08-14T06:19:52Z

+
+        // ObjectHashAggregate and SortAggregate test cases
+        val objHashOrSort_AggDF = df.groupBy("x").agg(c, collect_list("y"))
+        val objHashOrSort_Plan = objHashOrSort_AggDF.queryExecution.executedPlan


objHashOrSort_Plan -> objHashAggOrSortAggPlan

DonnyZone · 2017-08-14T07:01:27Z

Updated, thanks for reviewing.

SparkQA · 2017-08-14T08:42:38Z

Test build #80615 has finished for PR 18920 at commit f55b161.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

DonnyZone · 2017-08-14T10:06:27Z

retest please

SparkQA · 2017-08-14T13:56:45Z

Test build #80625 has finished for PR 18920 at commit d58ffaa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-14T16:36:58Z

LGTM

gatorsmile · 2017-08-14T16:39:09Z

Thanks! Merged to master.

gatorsmile · 2017-08-14T16:52:37Z

    ).foreach(assertValuesDoNotChangeAfterCoalesceOrUnion(_))
  }

+  private def assertNoExceptions(c: Column): Unit = {


Could you submit a follow-up PR to move this test case to DataFrameAggregateSuite? Thanks!

DonnyZone · 2017-08-14T23:22:54Z

Sure, I will do it later.

…ted result projection before using it

## What changes were proposed in this pull request?

This is a follow-up PR that moves the test case in PR-18920 (#18920) to DataFrameAggregateSuite.

## How was this patch tested?

unit test

Author: donnyzone <wellfengzhu@gmail.com>

Closes #18946 from DonnyZone/branch-19471-followingPR.

b932d2f · spark-19471
bb29b8f · add comment for param
5239ebb · check the plan for unit test
f55b161 · code refractor
d58ffaa · test comment

asfgit closed this in fbc2692 Aug 14, 2017

DonnyZone mentioned this pull request Aug 15, 2017

[SPARK-19471][SQL]AggregationIterator does not initialize the generated result projection before using it #18946

Closed
