[SPARK-37328][SQL] Fix bug that OptimizeSkewedJoin may not work after it was moved from queryStageOptimizerRules to queryStagePreparationRules. #34602
Can one of the admins verify this patch?
ping @cloud-fan and @ulysses-you, could you take a look at this problem?
…ld because ensureRequirements with requiredDistribution may bring extra shuffle if we just return child of exchange.
I haven't changed that. The problem is since We should apply
can we remove the check
And it should work not only in cases where just 2 tables are joined; many complex combinations need to be considered, such as multiple table joins in the same stage.
Hi @cloud-fan, The
Why can it not work in such a case? If there are multiple table joins in the same stage, the plan should be: So we can still optimize SHJ1 by applying transformUp to this plan if we allow introducing an extra shuffle. It seems to me that the check
@ulysses-you per my understanding, it's mostly about reducing complexity. Also cc @zhengruifeng https://github.com/apache/spark/pull/33893/files since your PR is about generalizing the skew join rule
Yes, it will work in cases where multiple tables are joined in the same stage. But I don't think it's the best way to optimize MultipleSkewedJoin, since extra shuffles will be introduced. In the worst case, N SHJs will introduce (N-1) shuffles.
@advancedxy Sorry for the late reply and thanks for pinging me. I did a quick test with #33893. Unfortunately, #33893 failed to handle the case, since the whole plan including test code: related log: update: I try to change
This should be a regression, but a simple change As to #33893, I also updated it to support this case. It supports multiple joins with union/agg/win nodes in a single stage, and has been used on our production system for 3 months; you may give it a try.
I think we can simply remove the check. Thanks for providing the test! I created a PR to remove the check and your test passed: #34974
What changes were proposed in this pull request?
Fix the issue that OptimizeSkewedJoin may not work.
Since OptimizeSkewedJoin was moved from queryStageOptimizerRules to queryStagePreparationRules, the place where OptimizeSkewedJoin is applied has moved from newQueryStage() to reOptimize(). The plan that OptimizeSkewedJoin is applied to changed from the plan of the new stage that is about to be submitted to the whole Spark plan. In cases where the skewed join is not in the last stage, OptimizeSkewedJoin may not work because the number of collected shuffleStages is more than 2.
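To make the failure mode concrete, here is a minimal toy model in Python. It is not Spark code: `Node`, `collect_shuffle_stages` and `passes_stage_count_check` are hypothetical names, and the "exactly two shuffle stages" guard is an assumption taken from the description above. It only sketches why the count differs between the plan of a single new stage and the whole plan.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a physical plan node (not a Spark class)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    is_shuffle_stage: bool = False


def collect_shuffle_stages(plan: Node) -> List[Node]:
    """Collect shuffle-stage nodes, without descending below a stage."""
    if plan.is_shuffle_stage:
        return [plan]
    stages: List[Node] = []
    for child in plan.children:
        stages.extend(collect_shuffle_stages(child))
    return stages


def passes_stage_count_check(plan: Node) -> bool:
    """Assumed guard: the rule proceeds only with exactly two stages."""
    return len(collect_shuffle_stages(plan)) == 2


# Three already-materialized shuffle stages (leaves of the toy plan).
stage_a = Node("stage A", is_shuffle_stage=True)
stage_b = Node("stage B", is_shuffle_stage=True)
stage_c = Node("stage C", is_shuffle_stage=True)

# The skewed join reads A and B; its output is shuffled again
# and joined with C, so the skewed join is not in the last stage.
skewed_join = Node("skewed join", [stage_a, stage_b])
new_stage_plan = Node("exchange", [skewed_join])
whole_plan = Node("final join", [new_stage_plan, stage_c])

print(passes_stage_count_check(new_stage_plan))  # per-stage view
print(passes_stage_count_check(whole_plan))      # whole-plan view
```

In this sketch the per-stage view sees two shuffle stages while the whole-plan view sees three, so a guard of this shape would skip the skewed join once the rule runs against the whole plan.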
Why are the changes needed?
Bug fix.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New test.