Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ up to ~4000x faster)#18875
Conversation
…hes on vector of RecordBatches. Add benchmarks, update tests.
|
|
|
I have a bug somewhere the extended tests demonstrate. I'll try to track it down next week. |
# Conflicts: # Cargo.lock
|
I think I sorted out the corner case failures by refactoring a bit. I basically removed direct member access to |
BatchCoaleser in Sort Merge Join, new benchmarksBatchCoaleser in Sort Merge Join, new benchmarks (TPC-H Q21 SMJ 1000x faster)
|
🤖 |
|
I started the following on this branch I think that will effectively test the merge join performance of main with this branch |
|
This is what I get on my amd ryzen 9 machine: |
BatchCoalescer, new benchmarks (TPC-H Q21 SMJ ~1000x faster)BatchCoalescer, new benchmarks (TPC-H Q21 SMJ up to ~1000x faster)
|
FWIW the benchmarks are still running because Q21 took over an hour to run 🤯 |
|
some of the |
I might remove some. They were mostly to help me understand control flow as I was learning the SMJ state machine: I'd try to codify my understanding with |
|
🤖: Benchmark completed Details
|
My goodness. |
|
Small batches are evil, sorry for delay, I wanted to check the PR with TPCDS but because of recent regression #19075 cannot merge it right now |
|
could you please align with main, I just merged a PR that fixed bug in SMJ and updated fuzz tests |
BatchCoalescer, new benchmarks (TPC-H Q21 SMJ up to ~1000x faster)BatchCoalescer, new benchmarks (TPC-H Q21 SMJ up to ~4000x faster)
# Conflicts: # benchmarks/bench.sh # benchmarks/src/bin/dfbench.rs
|
🚀 |
|
TPCDS benches, sorry for being late |
Which issue does this PR close?
Rationale for this change
DataFusion Comet often uses Sort Merge Joins because DataFusion does not have a larger-than-memory Hash Join operator. Performance on TPC-H Q21 is quite bad when run through native, and instead Comet falls back to Spark by default. If you force Comet to use DataFusion's SMJ operator, performance is:
Profiling showed most of the time spent in
concat_batchesof single-digit rows:What changes are included in this PR?
Use a
BatchCoalescerboth internally and to buffer final output. There was also some redundant concatenation of batches for filtered joins. One made the biggest difference, but I switched to two to be consistent. Here are Comet results with the changes based on 50.3 (which is where Comet is):TPC-H SF1 benchmark results are below (
PREFER_HASH_JOIN=false ./bench.sh run tpch). I tried to run SF10 TPC-H but it seemed like it was going to take hours on my machine. It ran successfully on this PR.Are these changes tested?
Existing Sort Merge Join unit tests, added a new benchmark.
Are there any user-facing changes?
There should not be.