
Bounded worker pool for multi-upstream polling #15

Merged
ch4r10t33r merged 4 commits into main from feat/bounded-upstream-poll-workers on May 13, 2026

Conversation

@ch4r10t33r
Collaborator

Summary

Multi-upstream mode previously spawned one detached OS thread per upstream on every poll tick and busy-waited on a global deadline. That does not scale past a handful of nodes (thread churn, many parallel ~16MB SSZ downloads, fragile cleanup).

Changes

  • Replace the per-upstream thread + detach model with a bounded worker pool (default 16 workers, max 256) that drains a shared atomic work queue (see the sketch after this list).
  • The dispatcher joins each worker (Thread.join()) so a poll round ends deterministically; each worker keeps its own std.http.Client (unchanged safety model).
  • Add poll_max_concurrency, settable via --poll-concurrency <n> or LEANPOINT_POLL_CONCURRENCY (0 falls back to the default).
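
A minimal sketch of the pool shape, assuming a recent Zig standard library (std.atomic.Value, lowercase memory orderings); names like pollTick and pollOne are illustrative, not the poller's actual API:

```zig
const std = @import("std");

// Illustrative sketch, not the actual poller code: workers pop upstream
// indices from a shared atomic cursor until the queue drains, and the
// dispatcher joins them all so a poll round ends deterministically.
fn pollTick(allocator: std.mem.Allocator, upstreams: []const []const u8, concurrency: usize) !void {
    var cursor = std.atomic.Value(usize).init(0);

    const Worker = struct {
        fn run(next: *std.atomic.Value(usize), targets: []const []const u8, alloc: std.mem.Allocator) void {
            // Each worker keeps its own std.http.Client, as in the PR.
            var client: std.http.Client = .{ .allocator = alloc };
            defer client.deinit();
            while (true) {
                const i = next.fetchAdd(1, .monotonic);
                if (i >= targets.len) return;
                // pollOne(&client, targets[i]) would issue the slot/SSZ requests here.
                _ = targets[i];
            }
        }
    };

    const n_workers = @min(concurrency, upstreams.len);
    const threads = try allocator.alloc(std.Thread, n_workers);
    defer allocator.free(threads);

    var started: usize = 0;
    while (started < n_workers) : (started += 1) {
        // Partial spawn-failure handling is elided here; see the fix below.
        threads[started] = try std.Thread.spawn(.{}, Worker.run, .{ &cursor, upstreams, allocator });
    }
    for (threads[0..started]) |t| t.join(); // deterministic end of the round
}
```

The atomic cursor keeps the work queue lock-free: whichever worker finishes a request first simply claims the next upstream index.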

Tuning

With concurrency < N, worst-case wall time per tick grows roughly with ceil(N / workers) × per-request latency (still bounded by existing socket timeouts in lean_api). Raise concurrency on fast networks or large devnets if needed.
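
For example (illustrative numbers, not measurements from this PR): with N = 64 upstreams, 16 workers, and roughly 2 s per request, the worst-case tick costs about ceil(64 / 16) × 2 s = 8 s.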

Testing

  • zig build / zig build test

Replace one detached OS thread per upstream with a fixed-size worker
pool draining a shared queue, joined via Thread.join(), capping parallel
SSZ downloads and thread churn when many nodes are configured.

Add poll_max_concurrency (default 16): --poll-concurrency and
LEANPOINT_POLL_CONCURRENCY; values above 256 are clamped in the poller.

If spawning the Nth worker failed, we joined already-running workers but
returned null before Step 2, discarding completed HTTP poll results and
leaving all upstreams at initial error_count/last_error.

Break out of the spawn loop instead, join all started workers once, then
run Step 2 when at least one worker started. Only return null when no
worker could be spawned (results buffer uninitialized).
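
A sketch of the fixed dispatch shape under those rules; dispatchRound, workerFn, and the Step 2 hand-off are illustrative placeholders, not the poller's real identifiers:

```zig
const std = @import("std");

// Illustrative shape of the fix: break on a failed spawn, join whatever
// did start exactly once, and only give up when no worker started at all.
fn dispatchRound(threads: []std.Thread, ctx: anytype, comptime workerFn: anytype) ?usize {
    var started: usize = 0;
    while (started < threads.len) : (started += 1) {
        threads[started] = std.Thread.spawn(.{}, workerFn, .{ ctx, started }) catch break; // was: join + return null
    }
    for (threads[0..started]) |t| t.join(); // join started workers once
    if (started == 0) return null; // results buffer never written
    return started; // caller runs Step 2 over the results that did complete
}
```

Returning the count of started workers (rather than null) lets Step 2 aggregate whatever HTTP results the successfully spawned workers produced.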
The previous worker-pool change used Thread.join() to wait for workers,
but std.http.Client.open does a synchronous connect() and lean_api's
SO_RCVTIMEO/SO_SNDTIMEO only bound read/write — not connect. A single
blackholed peer hangs a worker on connect forever, so join() blocks
forever and the poll loop never advances (live dashboard stuck on
healthy=false, error_count=0, last_error=null for every node).

Restore the original detach-on-deadline shape and layer bounded
concurrency on top:

- Refcounted PollCtx (spawner + worker hold a ref each); slow workers
  safely outlive the dispatcher.
- SlotState (mutex + condvar + in_flight counter) caps parallel polls
  and signals the dispatcher when a slot frees up or all workers drain
  (see the sketch after this list).
- Per-tick deadline = ceil(N/cap) * timeout_ms + headroom; workers not
  done by then are abandoned (still detached, ctx survives).
- Worker uses release store on done; dispatcher acquires before reading
  any output field. SSZ ownership transfer for the consensus winner
  unchanged.
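
A sketch of the slot-capping primitive from the second bullet, assuming Zig's std.Thread.Mutex and std.Thread.Condition; the acquire/release helpers are illustrative, not necessarily the actual SlotState methods:

```zig
const std = @import("std");

// Sketch of SlotState as described above: a mutex + condition variable
// around an in_flight counter. The dispatcher blocks in acquire() while
// `cap` polls are running; every worker calls release() when it finishes,
// which also wakes a dispatcher waiting for the pool to drain.
const SlotState = struct {
    mutex: std.Thread.Mutex = .{},
    cond: std.Thread.Condition = .{},
    in_flight: usize = 0,
    cap: usize,

    fn acquire(self: *SlotState) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        while (self.in_flight >= self.cap) self.cond.wait(&self.mutex);
        self.in_flight += 1;
    }

    fn release(self: *SlotState) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        self.in_flight -= 1;
        self.cond.broadcast(); // a slot freed up, or the last worker drained
    }
};
```

A condition variable (rather than a plain semaphore) lets the same primitive signal both "a slot freed up" and "all workers drained", which is what the dispatcher needs at the end of a tick.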
After deploying to a 64-node cluster the dispatcher was hitting its
deadline at upstream 36/64. Two causes:

1. Worker slot occupancy was up to ~2× timeout_ms (fetchSlots up to
   timeout_ms, then aggregator + head_slot at timeout_ms/2 each), but
   the deadline assumed timeout_ms per worker.
2. Default cap of 16 forced 4 batches for the 64-node prod cluster,
   compounding the under-budgeting.

Changes:
- Default poll_max_concurrency 16 → 64. With one batch in prod, polls
  finish within stale_after_ms; SSZ memory peak is no worse than the
  pre-bounding code.
- Aux call timeout: timeout_ms/2 → timeout_ms/4 (min 1s). Healthy
  workers free their slot in ~1.5× timeout_ms.
- Deadline formula: per_worker_max_ms = 2 × timeout_ms; round_ms =
  batches × per_worker_max_ms + timeout_ms (drain); see the sketch
  below. Dispatch will not starve when a batch is full of slow peers.
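
A sketch of the revised budget arithmetic; the constants come from the bullets above, while the helper names are illustrative:

```zig
// Illustrative helpers, not the poller's real functions.
fn auxTimeoutMs(timeout_ms: u64) u64 {
    return @max(timeout_ms / 4, 1000); // aggregator + head_slot calls, min 1s
}

fn roundDeadlineMs(n_upstreams: u64, cap: u64, timeout_ms: u64) u64 {
    // cap is assumed > 0 here (a configured 0 falls back to the default earlier).
    const batches = (n_upstreams + cap - 1) / cap; // ceil(N / cap)
    const per_worker_max_ms = 2 * timeout_ms; // fetchSlots + both aux calls, with slack
    return batches * per_worker_max_ms + timeout_ms; // plus one timeout of drain headroom
}
```

With the new default cap of 64, a 64-node cluster is a single batch, so the round deadline works out to 3 × timeout_ms.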
ch4r10t33r merged commit 2217b32 into main on May 13, 2026
6 checks passed
