The current implementation of static_thread_pool may end up using only a subset of the available threads even when there are tasks scheduled. The work-stealing mechanism only runs on threads that are currently running and would otherwise transition to idle; if a thread is already idle, nothing wakes it up even when other threads have non-empty queues (EDIT: sibling threads can also be notified in between tasks, but that requires waiting for the current task to complete).
The result is that the scheduler is not work-conserving; this can cause both correctness issues and pathological performance.
For an example of incorrectness, the following code should terminate but instead deadlocks:
```cpp
static_thread_pool pool(2);
auto sched = pool.get_scheduler();
auto parent = [&] {
  // parent is occupying one thread, but the pool should have more
  // available for this to complete.
  sync_wait(on(sched, just()));
};
sync_wait(on(sched, just() | then(parent)));
std::printf("Done\n");
```

This happens because the child task is scheduled from a scheduler thread, so it goes directly into that thread's local queue without waking up any of the other threads. Note that the same code does not deadlock with libunifex, which uses a more primitive scheduler (a single locked queue) that is, however, work-conserving.
For an example of pathological performance, the following code spawns N long-running tasks on a pool of N threads, so they should all run in parallel:
```cpp
const size_t num_threads = 8;
static_thread_pool pool(num_threads);
auto sched = pool.get_scheduler();
async_scope scope;
auto parent = [&] {
  std::printf("Parent\n");
  auto child = [] {
    std::printf("Child\n");
    // Simulate a long-running (possibly CPU-intensive) task with a
    // sleep.
    std::this_thread::sleep_for(std::chrono::seconds(1));
  };
  for (size_t i = 0; i < num_threads; ++i) {
    scope.spawn(on(sched, just() | then(child)));
  }
};
sync_wait(on(sched, just() | then(parent)));
sync_wait(scope.on_empty());
std::printf("Done\n");
```

Instead, they run sequentially, so the program takes 8 seconds instead of the expected 1 second (libunifex correctly takes 1 second).
Generally, work-stealing schedulers have some mechanism to wake up siblings when a thread's queue is non-empty. However, it is tricky to do this efficiently, because it requires frequent cross-thread synchronization; done naively, the local thread and a sibling will compete for each newly scheduled task. So this may not be easy to fix without affecting performance.
Full repro: https://gist.github.com/ot/64712433d33dab75559731f84dba5304
On an unrelated note, the remote-queue management seems inefficient: if I understand correctly, each remote thread creates a queue for each thread in the scheduler. So if we have two schedulers with N threads each scheduling tasks into each other, each scheduler will have N^2 remote queues, which seems expensive to poll when collecting remote tasks.