[SPARK-35794][SQL] Allow custom plugin for AQE cost evaluator by c21 · Pull Request #32944 · apache/spark

c21 · 2021-06-17T08:31:32Z

What changes were proposed in this pull request?

AQE currently uses a cost evaluator to decide whether to adopt the new plan after re-planning. The evaluator in use today is SimpleCostEvaluator, which makes the decision based on the number of shuffles in the query plan. This is not a perfect cost evaluator, and different production environments might want to use different custom evaluators. E.g., sometimes we might still want to do a skew join even though it might introduce an extra shuffle (trading resources for better latency); sometimes we might want to take sorts into consideration for the cost as well. Taking our own setting as an example, we use a custom remote shuffle service (Cosco), and the cost model is more complicated. So we want to make the cost evaluator pluggable, so that developers can implement their own CostEvaluator subclass and plug it in dynamically based on configuration.

The approach is to introduce a new config, spark.sql.adaptive.customCostEvaluatorClass, that specifies the class name of a CostEvaluator subclass, and to add CostEvaluator.instantiate to instantiate the cost evaluator class in AdaptiveSparkPlanExec.costEvaluator.
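To make the proposal concrete, here is a minimal, dependency-free sketch of the idea in Scala. Everything in it is a hypothetical stand-in rather than Spark's actual code: the plan nodes (`Scan`, `Shuffle`, `Join`), the `ShuffleCount` cost, and the `registry` lookup are assumptions made for illustration. In particular, the real change loads the evaluator by class name, which is replaced here by a map so that the sketch runs without Spark on the classpath.

```scala
object CostSketch {
  // Toy stand-ins for a physical plan; these are NOT Spark's classes.
  sealed trait Plan { def children: Seq[Plan] }
  case class Scan(table: String) extends Plan { val children: Seq[Plan] = Nil }
  case class Shuffle(child: Plan) extends Plan { val children: Seq[Plan] = Seq(child) }
  case class Join(left: Plan, right: Plan) extends Plan {
    val children: Seq[Plan] = Seq(left, right)
  }

  // A cost only needs to be comparable with another cost.
  case class ShuffleCount(value: Long) extends Ordered[ShuffleCount] {
    def compare(that: ShuffleCount): Int = value.compare(that.value)
  }

  // The pluggable interface: a single method from plan to cost.
  trait CostEvaluator { def evaluateCost(plan: Plan): ShuffleCount }

  // Default behaviour as described above: count the shuffles in the plan.
  object SimpleCostEvaluator extends CostEvaluator {
    def evaluateCost(plan: Plan): ShuffleCount = {
      def count(p: Plan): Long =
        (p match { case _: Shuffle => 1L; case _ => 0L }) + p.children.map(count).sum
      ShuffleCount(count(plan))
    }
  }

  // Hypothetical registry standing in for loading a class by its name.
  val registry: Map[String, CostEvaluator] = Map("simple" -> SimpleCostEvaluator)

  // If no custom evaluator is configured, fall back to the built-in one.
  def chooseEvaluator(configured: Option[String]): CostEvaluator =
    configured.map(registry).getOrElse(SimpleCostEvaluator)
}
```

With this sketch, `CostSketch.chooseEvaluator(None)` returns the shuffle-counting default, and a plan that joins two shuffled scans is assigned a cost of two shuffles.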

Why are the changes needed?

Make AQE cost evaluation more flexible.

Does this PR introduce any user-facing change?

No, but an internal config, spark.sql.adaptive.customCostEvaluatorClass, is introduced to allow custom implementations of CostEvaluator.

How was this patch tested?

Added a unit test in AdaptiveQueryExecSuite.scala.

c21 · 2021-06-17T08:33:19Z

cc @cloud-fan could you help take a look when you have time? Thanks.

cloud-fan · 2021-06-17T09:29:25Z

does it work well with #32816 ?

c21 · 2021-06-17T09:36:14Z

> does it work well with #32816 ?

@cloud-fan - I think so. If we decide to merge this first, then #32816 doesn't need the extra config spark.sql.adaptive.forceEnableSkewJoin. Developers/users can set spark.sql.adaptive.costEvaluatorClass to SkewJoinAwareCostEvaluator and it should work. cc @ulysses-you FYI, thanks.

SparkQA · 2021-06-17T10:00:57Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44440/

SparkQA · 2021-06-17T10:35:20Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44440/

ulysses-you · 2021-06-17T12:48:47Z

@c21 thank you for pinging me.

I'm not sure it's worth making the cost evaluator a plugin. You mentioned sort (I think it's local sort, isn't it?); can you provide a real use case for it?

SparkQA · 2021-06-17T13:31:39Z

Test build #139911 has finished for PR 32944 at commit 6670938.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-06-17T13:40:19Z

> Developers/users can set spark.sql.adaptive.costEvaluatorClass to SkewJoinAwareCostEvaluator and it should work

I don't think it's that simple. If force-skew-join-handling is enabled, Spark must use SkewJoinAwareCostEvaluator, not a user-specified one.

c21 · 2021-06-17T23:02:04Z

> You mentioned sort (I think it's local sort, isn't it?); can you provide a real use case for it?

@ulysses-you - e.g.

SortAggregate
- SortMergeJoin
  - Sort(Shuffle(Scan))
  - Sort(Shuffle(Scan))

AQE might change it to

SortAggregate
- Sort
  - ShuffledHashJoin
    - Shuffle(Scan)
    - Shuffle(Scan)

With our Cosco remote shuffle service, we have already implemented sorted shuffle (Sort(Shuffle), where the Sort and the Shuffle are done together on the shuffle service side), which is more efficient than doing the Sort separately in Spark. So a Sort(Shuffle) is more efficient than a separate Shuffle and Sort in our case. This influences our AQE decision, so we need a custom cost evaluator. As we can see, a separate cost evaluator is needed for forcing skew join, and we might need more in the future. Another aspect is that Cosco hasn't been open-sourced yet, so we want a clean interface for custom cost evaluators instead of always maintaining a forked change on our side.
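The two plans above can be compared with a small, self-contained Scala sketch. The `Node` type and both cost functions are hypothetical illustrations, not Cosco's or Spark's actual cost model: the sort-aware variant simply assumes that a Sort sitting directly on a Shuffle is free (fused into the shuffle service) and that any other Sort costs one extra unit.

```scala
object SortedShuffleSketch {
  // Generic toy plan node: an operator name plus children (not Spark's SparkPlan).
  case class Node(op: String, children: Node*)

  // Original plan: sort-merge join over two sorted shuffles.
  val smjPlan: Node =
    Node("SortAggregate",
      Node("SortMergeJoin",
        Node("Sort", Node("Shuffle", Node("Scan"))),
        Node("Sort", Node("Shuffle", Node("Scan")))))

  // Re-optimized plan: shuffled hash join followed by one separate sort.
  val shjPlan: Node =
    Node("SortAggregate",
      Node("Sort",
        Node("ShuffledHashJoin",
          Node("Shuffle", Node("Scan")),
          Node("Shuffle", Node("Scan")))))

  // Shuffle-count cost, mirroring the shuffle-based decision described in the PR.
  def shuffleCost(n: Node): Long =
    (if (n.op == "Shuffle") 1L else 0L) + n.children.map(shuffleCost).sum

  // Hypothetical sort-aware cost: a Sort directly on top of a Shuffle is treated
  // as fused and free; every other Sort adds one unit on top of the shuffles.
  def sortAwareCost(n: Node): Long = {
    val fused = n.op == "Sort" && n.children.headOption.exists(_.op == "Shuffle")
    val own = n.op match {
      case "Shuffle" => 1L
      case "Sort" if !fused => 1L
      case _ => 0L
    }
    own + n.children.map(sortAwareCost).sum
  }
}
```

Under these assumptions, counting shuffles alone cannot tell the two plans apart (both contain two shuffles), while the sort-aware cost ranks the sort-merge join plan as cheaper because its sorts are fused into the shuffles.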

c21 · 2021-06-17T23:04:43Z

> I don't think it's that simple. If force-skew-join-handling is enabled, Spark must use SkewJoinAwareCostEvaluator, not a user-specified one.

@cloud-fan - from my reading of #32816, it looks like the only logic controlled by the new config spark.sql.adaptive.forceEnableSkewJoin is choosing a different cost evaluator, SkewJoinAwareCostEvaluator. My idea is to not introduce the new config; instead, we can just set spark.sql.adaptive.costEvaluatorClass to SkewJoinAwareCostEvaluator to force skew join handling.

ulysses-you · 2021-06-18T01:48:58Z

@c21 thanks for the explanation; the example of SortAggregate(SMJ) to SortAggregate(SHJ) seems useful. But regarding usage, I agree with @cloud-fan: the boolean config forceEnableSkewJoin is necessary and easier for users. A class name is a bit of a hack for users who want to optimize skew joins anyway.

c21 · 2021-06-18T01:53:10Z

> the boolean config forceEnableSkewJoin is necessary and easier for users. A class name is a bit of a hack for users who want to optimize skew joins anyway.

@ulysses-you - sure, I agree that a boolean config is more intuitive and easier to use. If we do need the boolean config, we can add special logic in AdaptiveSparkPlanExec.costEvaluator to use SkewJoinAwareCostEvaluator when spark.sql.adaptive.forceEnableSkewJoin is true, regardless of what the user sets for spark.sql.adaptive.costEvaluatorClass. It's just a matter of priority between the configs.
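The priority rule suggested in this comment can be sketched as a small, self-contained Scala function. The `Choice` type and the `select` helper are hypothetical names introduced only for illustration; this reflects the proposal as discussed at this point in the thread, not necessarily what was merged.

```scala
object EvaluatorPrioritySketch {
  // Hypothetical outcome of the selection; not a Spark type.
  sealed trait Choice
  case object SkewJoinAware extends Choice
  case class Custom(className: String) extends Choice
  case object BuiltIn extends Choice

  // The boolean flag wins over the class-name config; with neither set,
  // the built-in evaluator is used.
  def select(forceSkewJoin: Boolean, customClass: Option[String]): Choice =
    if (forceSkewJoin) SkewJoinAware
    else customClass match {
      case Some(name) => Custom(name)
      case None => BuiltIn
    }
}
```

For example, `select(true, Some("com.example.MyEvaluator"))` ignores the class name and picks the skew-join-aware evaluator, which is the behaviour the comment describes.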

cloud-fan · 2021-07-01T18:35:35Z

We can make it an optional conf: spark.sql.adaptive.customCostEvaluatorClass. If not set, we use the builtin impl.

@cloud-fan - sure, updated.

cloud-fan · 2021-07-01T18:43:08Z

We can use the standard API in Spark: Utils.loadExtensions

@cloud-fan - good call, updated.

cloud-fan · 2021-07-01T18:44:16Z

This can still be an object, if we follow https://github.com/apache/spark/pull/32944/files#r662513062

@cloud-fan - yeah, updated.

SparkQA · 2021-07-02T00:35:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45060/

SparkQA · 2021-07-02T01:08:12Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45060/

cloud-fan · 2021-07-02T03:15:43Z

nit: we don't have to create a method here if it's only called once

@cloud-fan - sure, updated.

cloud-fan · 2021-07-02T03:18:55Z

does this custom cost evaluator change the query plan? It seems to be the same with the builtin cost evaluator.

@cloud-fan - this evaluator does not change the plan; it behaves the same as the builtin evaluator for this query. Do we want to come up with a different one here? I think this just validates that the custom evaluator works.

SGTM, let's leave it then

SparkQA · 2021-07-02T04:01:30Z

Test build #140547 has finished for PR 32944 at commit 404fe35.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
.doc("The custom cost evaluator class to be used for adaptive execution. If not being set," +

cloud-fan · 2021-07-02T08:09:47Z

@c21 can you fix the code conflicts?

c21 · 2021-07-02T08:15:51Z

@cloud-fan - thanks, just rebased to latest master.

SparkQA · 2021-07-02T08:45:30Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45079/

HyukjinKwon · 2021-07-02T09:10:34Z

@c21, can you at least mark CostEvaluator with the @Unstable API tag? Also, please add a note that it is subject to being moved or changed in the near future.

SparkQA · 2021-07-02T09:18:37Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45079/

SparkQA · 2021-07-02T09:28:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45082/

SparkQA · 2021-07-02T10:02:34Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45082/

SparkQA · 2021-07-02T12:21:07Z

Test build #140567 has finished for PR 32944 at commit e202aa8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-07-02T13:04:53Z

Test build #140570 has finished for PR 32944 at commit c5ed8e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

c21 · 2021-07-04T06:13:27Z

@HyukjinKwon - updated per discussion, and this is ready for review again, thanks.

SparkQA · 2021-07-04T07:37:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45131/

SparkQA · 2021-07-04T08:09:32Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45131/

SparkQA · 2021-07-04T11:17:36Z

Test build #140618 has finished for PR 32944 at commit ac5c121.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-07-05T00:31:06Z

+    buildConf("spark.sql.adaptive.customCostEvaluatorClass")
+      .doc("The custom cost evaluator class to be used for adaptive execution. If not being set," +
+        " Spark will use its own SimpleCostEvaluator by default.")
+      .version("3.2.0")


The only thing is that the version has to be 3.3.0, since we've cut the branch now. Since this PR is unlikely to affect anything in the main code, I am okay with merging to 3.2.0 as well, though. I will leave it to @cloud-fan and you.

3.2 is the first version that enables AQE by default, and this seems to be a useful extension. Let's include it in 3.2.

cloud-fan · 2021-07-05T09:06:37Z

thanks, merging to master/3.2!

### What changes were proposed in this pull request?

Current AQE has cost evaluator to decide whether to use new plan after replanning. The current used evaluator is `SimpleCostEvaluator` to make decision based on number of shuffle in the query plan. This is not perfect cost evaluator, and different production environments might want to use different custom evaluators. E.g., sometimes we might want to still do skew join even though it might introduce extra shuffle (trade off resource for better latency), sometimes we might want to take sort into consideration for cost as well. Take our own setting as an example, we are using a custom remote shuffle service (Cosco), and the cost model is more complicated. So We want to make the cost evaluator to be pluggable, and developers can implement their own `CostEvaluator` subclass and plug in dynamically based on configuration.

The approach is to introduce a new config to allow define sub-class name of `CostEvaluator` - `spark.sql.adaptive.customCostEvaluatorClass`. And add `CostEvaluator.instantiate` to instantiate the cost evaluator class in `AdaptiveSparkPlanExec.costEvaluator`.

### Why are the changes needed?

Make AQE cost evaluation more flexible.

### Does this PR introduce _any_ user-facing change?

No but an internal config is introduced - `spark.sql.adaptive.customCostEvaluatorClass` to allow custom implementation of `CostEvaluator`.

### How was this patch tested?

Added unit test in `AdaptiveQueryExecSuite.scala`.

Closes #32944 from c21/aqe-cost.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 044dddf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

c21 · 2021-07-06T04:55:25Z

Thank you @cloud-fan and @HyukjinKwon for review!

github-actions Bot added the SQL label Jun 17, 2021

cloud-fan reviewed Jul 1, 2021

c21 force-pushed the aqe-cost branch from 6670938 to 404fe35 July 1, 2021 23:40

cloud-fan reviewed Jul 2, 2021

c21 added 4 commits July 2, 2021 01:15

494b8bc Allow custom plugin for AQE cost evaluator
f1ee7ec Fix newline at end of file
9fd1bbe Address all comments and rebase to latest master
c5ed8e7 Address comment to remove adaptiveCustomCostEvaluatorClass

c21 force-pushed the aqe-cost branch from e202aa8 to c5ed8e7 July 2, 2021 08:15

cloud-fan approved these changes Jul 2, 2021

HyukjinKwon reviewed Jul 2, 2021
Comment thread sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala

HyukjinKwon reviewed Jul 2, 2021
Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated)

ac5c121 Address all comments

HyukjinKwon approved these changes Jul 5, 2021

HyukjinKwon reviewed Jul 5, 2021

HyukjinKwon approved these changes Jul 5, 2021

cloud-fan closed this in 044dddf Jul 5, 2021

c21 deleted the aqe-cost branch July 6, 2021 04:55

roryqi mentioned this pull request Mar 18, 2025
[#1750] feat(remote merge): Support Spark. apache/uniffle#2405 (Open)
