
static_thread_pool is not work-conserving #1305

@ot

Description


The current implementation of static_thread_pool may use only a subset of the available threads even when there are tasks scheduled. The work-stealing mechanism runs only on threads that are currently running and would otherwise transition to idle; a thread that is already idle is never woken up, even if other threads have non-empty queues. (EDIT: sibling threads can also be notified in between tasks, but that requires waiting for the current task to complete.)

The result is that the scheduler is not work-conserving; this can cause both correctness issues and pathological performance.

For an example of incorrectness, the following code should terminate, but instead it deadlocks:

  static_thread_pool pool(2);
  auto sched = pool.get_scheduler();

  auto parent = [&] {
    // parent is occupying one thread, but the pool should have more
    // available for this to complete.
    sync_wait(on(sched, just()));
  };
  sync_wait(on(sched, just() | then(parent)));
  std::printf("Done\n");

This happens because the child task is scheduled from a scheduler thread, so it goes directly into that thread's local queue without waking up any of the other threads. Note that the same code does not deadlock with libunifex, whose scheduler is more primitive (a single locked queue) but work-conserving.

For an example of pathological performance, the following code spawns N long-running tasks on a pool of N threads, so they should all run in parallel:

  const size_t num_threads = 8;
  static_thread_pool pool(num_threads);
  auto sched = pool.get_scheduler();

  async_scope scope;

  auto parent = [&] {
    std::printf("Parent\n");
    auto child = [] {
      std::printf("Child\n");
      // Simulate a long-running (possibly CPU-intensive) task with a
      // sleep.
      std::this_thread::sleep_for(std::chrono::seconds(1));
    };

    for (size_t i = 0; i < num_threads; ++i) {
      scope.spawn(on(sched, just() | then(child)));
    }
  };

  sync_wait(on(sched, just() | then(parent)));
  sync_wait(scope.on_empty());
  std::printf("Done\n");

Instead, they run sequentially, so the program takes 8 seconds instead of the expected 1 second (libunifex correctly takes 1 second).

Generally, work-stealing schedulers have some mechanism to wake up siblings when a thread's queue is non-empty. However, doing this efficiently is tricky, because it requires frequent cross-thread synchronization; done naively, the local thread and a sibling will compete for every newly scheduled task. So it may not be easy to fix without hurting performance.

Full repro: https://gist.github.com/ot/64712433d33dab75559731f84dba5304

On an unrelated note, the remote queue management seems inefficient: if I understand correctly, each remote thread creates a queue for each thread in the scheduler. So if we have two schedulers with N threads scheduling tasks into each other, each will have N^2 remote queues, which seems expensive to poll when collecting remote tasks.
