Remove waits from blocking threads reading spill files. #15654
alamb merged 1 commit into apache:main from
Conversation
Can you please add a test? This solves a deadlock. Can you please add the following test to make sure that the read spill does not block:

@rluvaton I added the test, although with 1 blocking thread and not 8. Shouldn't be an issue IMO.
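For illustration (the test code itself is elided above, and everything named here is hypothetical rather than the PR's actual test): a minimal, self-contained sketch of the failure mode, where a single blocking thread is monopolized by a long-lived "push"-style reader that blocks on sending, so a second reader never runs and a consumer that needs output from both never makes progress.

use std::time::Duration;
use tokio::sync::mpsc;

fn main() {
    // One worker thread and, crucially, only ONE blocking thread.
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(1)
        .max_blocking_threads(1)
        .enable_all()
        .build()
        .unwrap();

    rt.block_on(async {
        let (tx1, mut rx1) = mpsc::channel::<u64>(1);
        let (tx2, mut rx2) = mpsc::channel::<u64>(1);

        // "Push" style: one long-lived blocking task per reader, each blocking
        // on `blocking_send` until the consumer drains its channel.
        tokio::task::spawn_blocking(move || {
            for i in 0.. {
                if tx1.blocking_send(i).is_err() {
                    break;
                }
            }
        });
        tokio::task::spawn_blocking(move || {
            for i in 0.. {
                if tx2.blocking_send(i).is_err() {
                    break;
                }
            }
        });

        // A merge-like consumer needs a value from BOTH readers, but the second
        // reader never gets a blocking thread, so this times out (a deadlock).
        let both = async { (rx1.recv().await, rx2.recv().await) };
        let got_both = tokio::time::timeout(Duration::from_secs(1), both).await;
        println!("made progress: {}", got_both.is_ok()); // prints "made progress: false"
    });
}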
@alamb and @andygrove can you please review? It looks fine to me.

FYI @2010YOUY01

Does anyone know if we have benchmarks for sorting / spilling I could run to verify the impact of this PR on their behavior? I took a brief look but didn't find any.

I think you can tweak the TPC benchmark to have less memory so it will spill.

I can test with Comet today.

@Kontinuation fyi

If I understand this PR correctly, the

@Kontinuation indeed! Will be interesting to see if that's an actual bottleneck.
) -> std::task::Poll<Option<Result<RecordBatch>>> {
    match &mut self.state {
        SpillReaderStreamState::Uninitialized(_) => {
            // Temporarily replace with `Done` to be able to pass the file to the task.
Another pattern for this that could avoid the unreachable might be to change the original match to something like:
// temporarily mark as done:
let state = std::mem::replace(&mut self.state, SpillReaderStreamState::Done);
// Now you can match with an owned state
match state {
    ...
}
The problem with this is that it becomes easier to accidentally leave that `Done` state behind; for example, the `futures::ready!` macro returns early in the `Pending` case, so we would not be able to use it, and the pattern is prone to errors.
That is fair -- I don't have a strong preference about which pattern to use, I was just mentioning an alternate pattern as a possibility
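To make the concern concrete, here is a deliberately buggy, self-contained toy (hypothetical types, not the PR's code), using a plain `Poll::Pending` where `futures::ready!` would bail out: once the state has been moved out, the early exit skips restoring it, so the next poll sees only the placeholder.

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

enum State {
    Counting(u32),
    Done,
}

struct Counter {
    state: State,
}

impl Future for Counter {
    type Output = u32;

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u32> {
        // Move the state out, temporarily marking it as `Done`.
        let state = std::mem::replace(&mut self.state, State::Done);
        match state {
            State::Counting(n) if n < 3 => {
                // BUG: we yield `Pending` here (as `futures::ready!` would do)
                // without restoring `self.state`, so every later poll sees the
                // placeholder `Done` instead of `Counting(n)`.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
            State::Counting(n) => Poll::Ready(n),
            State::Done => unreachable!("state was left as `Done` by an early exit"),
        }
    }
}

fn main() {
    // The second poll panics at the `unreachable!`, which is the error-proneness
    // being described: the move-out pattern plus early exits loses the state.
    let n = futures::executor::block_on(Counter { state: State::Counting(0) });
    println!("{n}");
}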
I also filed a ticket to track adding a spilling benchmark:

I created a PR in Comet to use DataFusion from this PR: apache/datafusion-comet#1629. I did not have time to run benchmarks today but hope to tomorrow.
I tried a simple benchmark:

Result on my MacBook: Not quite sure why, I'm trying to understand how those IO interfaces work.

@2010YOUY01 I checked your benchmark locally on my Linux machine (Ryzen 7945HX), 3 times on each version. I do worry that the benchmark might not measure the IO bottleneck accurately due to the OS caching the spill files.

I tested this PR with Comet. Here are the most relevant configs for Comet related to this testing:
With the Comet main branch, TPC-H q4 never completes due to the deadlock. With the changes in this PR, the query completes with good performance.
@andygrove any chance you could check Comet's performance with this alternative implementation: https://github.com/ashdnazg/datafusion/tree/pull-batch-2 ?

Yes, I'll do that now.

I don't think Comet testing is going to help with this. Here are timings for q4 with this PR and the alternate, for 5 runs of q4. In both cases there are tasks failing and restarting due to lack of memory. (Timing tables for "This PR" and "Alternate" omitted.)

Thank you @andygrove!

Rebased

I'll plan to merge this tomorrow unless anyone else would like more time to review.
2010YOUY01 left a comment

Really appreciate the nice fix!
/// Stream that reads spill files from disk where each batch is read in a spawned blocking task
/// It will read one batch at a time and will not do any buffering, to buffer data use [`crate::common::spawn_buffered`]
struct SpillReaderStream {
Suggested change:
/// A simpler solution would be spawning a long-running blocking task for each
/// file read (instead of each batch). This approach does not work because when
/// the number of concurrent reads exceeds the Tokio thread pool limit,
/// deadlocks can occur and block progress.
struct SpillReaderStream {
I recommend adding a 'why' comment here.
🚀

The extended test takes a longer time and couldn't finish in 6 hours after this change: https://github.com/apache/datafusion/actions/runs/14419458859/job/40440288212

@jayzhan211 💩

I found some memory limit validation tests get stuck. From the log: this test. And I am not able to reproduce this issue on my MacBook, it can progress and finish all the tests 🤦🏼

I do reproduce it here on Ubuntu: when I run the test through the runner it takes much more time (or hangs entirely) than without it. Just to see what happens, I tried to run the test in release mode; it finished very quickly in both cases.

Seems to be contention with the PR here: #15702
Which issue does this PR close?
Fixes apache#15323.

Rationale for this change
The previous design of reading spill files was a `push` design, spawning long-lived blocking tasks which repeatedly read records, send them, and wait until they are received. This design had an issue where progress wasn't guaranteed (i.e., there was a deadlock) if there were more spill files than the blocking thread pool in tokio, all of which were waited for together.
To solve this, the design is changed to a `pull` design, where blocking tasks are spawned for every read, removing the waiting on the IO threads and guaranteeing progress.
While there might be added overhead for repeatedly calling `spawn_blocking`, it's probably insignificant compared to the IO cost of reading from the disk.
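To illustrate the `pull` idea in isolation, here is a minimal sketch under assumed, hypothetical types (a toy `BlockingReader` trait and an async `next` helper, not the actual poll-based `SpillReaderStream`): each time the consumer asks for the next item, a short-lived blocking task is spawned for that single read, so no blocking thread is ever parked waiting on a channel.

use tokio::task::{spawn_blocking, JoinHandle};

// A toy stand-in for the blocking IPC file reader.
trait BlockingReader: Send + 'static {
    type Item: Send + 'static;
    fn read_next(&mut self) -> Option<Self::Item>;
}

enum ReaderState<R: BlockingReader> {
    Idle(R),
    Reading(JoinHandle<(R, Option<R::Item>)>),
    Done,
}

// Pull one item: spawn a short-lived blocking task for this single read, await
// it, then hand the reader back to the state so the next call can reuse it.
async fn next<R: BlockingReader>(state: &mut ReaderState<R>) -> Option<R::Item> {
    loop {
        match state {
            ReaderState::Idle(_) => {
                // Temporarily mark as `Done` so the reader can be moved into the task.
                let ReaderState::Idle(mut reader) =
                    std::mem::replace(state, ReaderState::Done)
                else {
                    unreachable!()
                };
                *state = ReaderState::Reading(spawn_blocking(move || {
                    let item = reader.read_next();
                    (reader, item)
                }));
            }
            ReaderState::Reading(handle) => {
                // The blocking thread is already released by the time this resolves.
                let (reader, item) = handle.await.expect("read task panicked");
                *state = if item.is_some() {
                    ReaderState::Idle(reader)
                } else {
                    ReaderState::Done
                };
                return item;
            }
            ReaderState::Done => return None,
        }
    }
}

// Tiny demo reader that "reads" from an in-memory list.
struct VecReader(std::vec::IntoIter<String>);

impl BlockingReader for VecReader {
    type Item = String;
    fn read_next(&mut self) -> Option<String> {
        self.0.next()
    }
}

#[tokio::main]
async fn main() {
    let batches = vec!["batch-0".to_string(), "batch-1".to_string()];
    let mut state = ReaderState::Idle(VecReader(batches.into_iter()));
    while let Some(batch) = next(&mut state).await {
        println!("{batch}");
    }
}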
Are these changes tested?
Added a test which causes a deadlock in `main` but passes with this fix.

Are there any user-facing changes?
No.
No.