
static_thread_pool is not work-conserving #1305

@ot

Description


The current implementation of static_thread_pool may use only a subset of the available threads even when there are tasks scheduled. The work-stealing mechanism runs only on threads that are currently running and would otherwise transition to idle; a thread that is already idle is never woken up, even if other threads have non-empty queues. (EDIT: sibling threads can also be notified in between tasks, but that requires waiting for the current task to complete.)

The result is that the scheduler is not work-conserving; this can cause both correctness issues and pathological performance.

For an example of incorrectness, the following code should terminate, but instead it deadlocks:

  static_thread_pool pool(2);
  auto sched = pool.get_scheduler();

  auto parent = [&] {
    // parent is occupying one thread, but the pool should have more
    // available for this to complete.
    sync_wait(on(sched, just()));
  };
  sync_wait(on(sched, just() | then(parent)));
  std::printf("Done\n");

This happens because the child task is scheduled from a scheduler thread, so it goes directly into that thread's local queue without waking up any of the other threads. Note that the same code does not deadlock with libunifex, whose scheduler is more primitive (a single locked queue) but work-conserving.

For an example of pathological performance, the following code spawns N long-running tasks on a pool of N threads, so they should all run in parallel:

  const size_t num_threads = 8;
  static_thread_pool pool(num_threads);
  auto sched = pool.get_scheduler();

  async_scope scope;

  auto parent = [&] {
    std::printf("Parent\n");
    auto child = [] {
      std::printf("Child\n");
      // Simulate a long-running (possibly CPU-intensive) task with a
      // sleep.
      std::this_thread::sleep_for(std::chrono::seconds(1));
    };

    for (size_t i = 0; i < num_threads; ++i) {
      scope.spawn(on(sched, just() | then(child)));
    }
  };

  sync_wait(on(sched, just() | then(parent)));
  sync_wait(scope.on_empty());
  std::printf("Done\n");

Instead, they run sequentially, so the program takes 8 seconds instead of the expected 1 second (libunifex correctly takes 1 second).

Generally, work-stealing schedulers have some mechanism to wake up siblings when a thread's queue is non-empty. However, doing this efficiently is tricky, because it requires frequent cross-thread synchronization; done naively, the local thread and a sibling will compete for every newly scheduled task. So it may not be easy to fix without hurting performance.

Full repro: https://gist.github.com/ot/64712433d33dab75559731f84dba5304

On an unrelated note, the remote queue management seems inefficient: if I understand correctly, each remote thread creates a queue for each thread in the scheduler. So if we have two schedulers with N threads scheduling tasks into each other, each will have N^2 remote queues, which seems expensive to poll when collecting remote tasks.
