[SPARK-36638][SQL] Generalize OptimizeSkewedJoin #33893
added test("General Skew Join: 3-table join")

added test("General Skew Join: 3-table join UNION 2-table join")

added test("General Skew Join: 5-table join")

friendly ping @cloud-fan

friendly ping @JkSelf @cloud-fan @yaooqinn @ulysses-you. Could you please take a look in your spare time? Thanks!
When developing this method, I used some tests like #34108 to check correctness. They should be helpful for reviewing.
The skew test case newly added in #34908 can be optimized by this PR without an extra shuffle:
[Images: Master vs. this PR]
retest this please
cc @maryannxue
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.





What changes were proposed in this pull request?
This PR aims to generalize OptimizeSkewedJoin to support all patterns that can be handled by the current split-duplicate strategy:
1. Find the splittable shuffle query stages by the semantics of internal nodes.
2. For each splittable shuffle query stage, check whether skew partitions exist; if so, split them into specs.
3. Handle combinatorial explosion: for each skew partition, check whether the number of combinations is too large; if so, re-split the stages to keep a reasonable number of combinations. For example, suppose that for partition 0, stages A/B/C are split into 100/100/100 specs, respectively. That yields 1M combinations, which is too large and will cause a performance regression.
4. Attach the new specs to the shuffle query stages.
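
To make steps 2 and 3 more concrete, here is a minimal Python sketch. It is not the code in this PR. The skew test (a partition is skewed when its size exceeds both a factor times the median and an absolute threshold) follows the criterion that the existing rule exposes through `spark.sql.adaptive.skewJoin.skewedPartitionFactor` and `spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes`. The names `find_skewed`, `cap_combinations`, and `max_combinations`, as well as the "halve the largest split count" policy, are assumptions made for illustration; the actual re-split policy and the name of the new config are not stated in this description.

```python
from math import prod
from statistics import median


def find_skewed(sizes, factor, threshold):
    """Return the indices of partitions treated as skewed.

    A partition counts as skewed when its size is larger than
    factor * median(sizes) AND larger than the absolute threshold.
    """
    med = median(sizes)
    return [i for i, s in enumerate(sizes) if s > factor * med and s > threshold]


def cap_combinations(splits, max_combinations):
    """Shrink per-stage split counts until their product fits the cap.

    `splits` maps a stage name to the number of specs that stage was split
    into for ONE skewed partition. The number of join combinations for that
    partition is the product of these counts.

    Hypothetical policy (an assumption, not the PR's algorithm):
    repeatedly halve the largest split count, never going below 1.
    """
    if max_combinations < 1:
        raise ValueError("max_combinations must be at least 1")
    result = dict(splits)
    while prod(result.values()) > max_combinations:
        # The product exceeds the cap (>= 1), so some count is > 1 here.
        stage = max(result, key=result.get)
        result[stage] = max(1, result[stage] // 2)
    return result


if __name__ == "__main__":
    # Step 2: partition 3 is far above the median and above the threshold.
    print(find_skewed([10, 12, 11, 500, 9], factor=5, threshold=100))

    # Step 3: the 100/100/100 example from the description.
    before = {"A": 100, "B": 100, "C": 100}
    print(prod(before.values()))  # 100 * 100 * 100 = 1000000 combinations
    after = cap_combinations(before, max_combinations=10_000)
    print(after, prod(after.values()))
```

Halving the largest count first keeps the splits roughly balanced across stages. A real implementation would also have to re-derive the partition specs for the reduced counts, which this sketch leaves out.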
Why are the changes needed?
To generalize OptimizeSkewedJoin.
Does this PR introduce any user-facing change?
Yes, one additional config is added.
How was this patch tested?
Existing test suites, newly added test suites, and some cases on our production system.