Bounded worker pool for multi-upstream polling #15
Merged
Replace one detached OS thread per upstream with a fixed-size worker queue and Thread.join(), capping parallel SSZ downloads and thread churn when many nodes are configured. Add poll_max_concurrency (default 16), settable via --poll-concurrency or LEANPOINT_POLL_CONCURRENCY; values above 256 are clamped in the poller.
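A minimal sketch of how the setting might be resolved; effectivePollConcurrency and its parameters are hypothetical names, and only the flag, env var, default, zero-fallback, and 256 clamp come from this change:

```zig
// Sketch (names assumed): resolve the effective concurrency cap from the CLI
// flag / env var value, fall back to the configured default, and clamp values
// above 256 as the poller does.
fn effectivePollConcurrency(requested: ?u32, default_cap: u32) u32 {
    const hard_cap: u32 = 256;
    const value = requested orelse default_cap;
    const nonzero = if (value == 0) default_cap else value; // 0 falls back to the default
    return @min(nonzero, hard_cap);
}
```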
Previously, if spawning the Nth worker failed, we joined the already-running workers but returned null before Step 2, discarding completed HTTP poll results and leaving every upstream at its initial error_count/last_error. Now we break out of the spawn loop instead, join all started workers once, and run Step 2 whenever at least one worker started; null is returned only when no worker could be spawned (the results buffer is then uninitialized).
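A rough sketch of the corrected dispatch shape; PollCtx, Worker, pollWorker, and pollOnce are hypothetical stand-ins, and only the break / join-started / return-null logic is from this change:

```zig
const std = @import("std");

// Hypothetical stand-ins for the real poller types.
const PollCtx = struct { upstream_index: usize };
const Worker = struct { thread: std.Thread = undefined };

fn pollWorker(ctx: *PollCtx) void {
    // Placeholder for the real HTTP poll against upstream ctx.upstream_index.
    _ = ctx;
}

// Stop spawning on the first failure, join only the workers that actually
// started, and run Step 2 (result merging) whenever at least one worker ran.
// Return null only when nothing started (results buffer never initialized).
fn pollOnce(workers: []Worker, contexts: []PollCtx) ?usize {
    var started: usize = 0;
    for (workers, 0..) |*w, i| {
        w.thread = std.Thread.spawn(.{}, pollWorker, .{&contexts[i]}) catch break;
        started += 1;
    }
    if (started == 0) return null;
    for (workers[0..started]) |*w| w.thread.join();
    // Step 2 would merge poll results here.
    return started;
}
```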
The previous worker-pool change used Thread.join() to wait for workers, but std.http.Client.open does a synchronous connect() and lean_api's SO_RCVTIMEO/SO_SNDTIMEO only bound read/write, not connect. A single blackholed peer hangs a worker on connect forever, so join() blocks forever and the poll loop never advances (live dashboard stuck on healthy=false, error_count=0, last_error=null for every node). Restore the original detach-on-deadline shape and layer bounded concurrency on top:
- Refcounted PollCtx (spawner and worker each hold a ref); slow workers safely outlive the dispatcher.
- SlotState (mutex + condvar + in_flight counter) caps parallel polls and signals the dispatcher when a slot frees up or all workers drain (see the sketch after this list).
- Per-tick deadline = ceil(N/cap) * timeout_ms + headroom; workers not done by then are abandoned (still detached, ctx survives).
- Worker uses a release store on done; dispatcher acquires before reading any output field.

SSZ ownership transfer for the consensus winner is unchanged.
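A compact sketch of the SlotState idea, assuming a recent Zig std (std.Thread.Mutex/Condition) and hypothetical field and method names; only the mutex + condvar + in_flight design and the deadline-bounded wait are from this change:

```zig
const std = @import("std");

const SlotState = struct {
    mutex: std.Thread.Mutex = .{},
    cond: std.Thread.Condition = .{},
    in_flight: u32 = 0,
    cap: u32,

    // Dispatcher side: block until a slot frees up, or give up at deadline_ms
    // (milliseconds, same clock as std.time.milliTimestamp).
    fn acquire(self: *SlotState, deadline_ms: i64) bool {
        self.mutex.lock();
        defer self.mutex.unlock();
        while (self.in_flight >= self.cap) {
            const now = std.time.milliTimestamp();
            if (now >= deadline_ms) return false;
            const wait_ns: u64 = @intCast((deadline_ms - now) * std.time.ns_per_ms);
            self.cond.timedWait(&self.mutex, wait_ns) catch return false; // error.Timeout
        }
        self.in_flight += 1;
        return true;
    }

    // Worker side: free the slot and wake the dispatcher; when in_flight hits
    // zero the dispatcher also learns that all workers have drained.
    fn release(self: *SlotState) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        self.in_flight -= 1;
        self.cond.broadcast();
    }
};
```

In this shape the dispatcher simply stops waiting once the per-tick deadline passes and abandons workers that are still running, matching the detach-on-deadline design described above.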
After deploying to a 64-node cluster the dispatcher was hitting its deadline at upstream 36/64. Two causes:
1. Worker slot occupancy was up to ~2× timeout_ms (fetchSlots up to timeout_ms, then aggregator + head_slot at timeout_ms/2 each), but the deadline assumed timeout_ms per worker.
2. The default cap of 16 forced 4 batches for the 64-node prod cluster, compounding the under-budgeting.

Changes:
- Default poll_max_concurrency 16 → 64. With one batch in prod, polls finish within stale_after_ms; the SSZ memory peak is no worse than the pre-bounding code.
- Aux call timeout: timeout_ms/2 → timeout_ms/4 (min 1s). Healthy workers free their slot in ~1.5× timeout_ms.
- Deadline formula: per_worker_max_ms = 2 × timeout_ms; round_ms = batches × per_worker_max_ms + timeout_ms (drain). Dispatch will not starve when a batch is full of slow peers (see the sketch after this list).
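A small sketch of the revised deadline arithmetic; roundDeadlineMs is a hypothetical name, while the formula itself is the one stated above:

```zig
const std = @import("std");

// per_worker_max_ms budgets 2x timeout_ms per slot (healthy workers free it
// in ~1.5x); one extra timeout_ms of drain headroom is added at the end.
fn roundDeadlineMs(n_upstreams: u64, cap: u64, timeout_ms: u64) u64 {
    const batches = std.math.divCeil(u64, n_upstreams, cap) catch unreachable; // cap > 0
    const per_worker_max_ms = 2 * timeout_ms;
    return batches * per_worker_max_ms + timeout_ms;
}
```

For illustration, 64 upstreams at cap 64 with an assumed timeout_ms of 5000 give one batch and a 15,000 ms round budget; at the old cap of 16 the same cluster would need 4 batches and 45,000 ms.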
Summary
Multi-upstream mode previously spawned one detached OS thread per upstream on every poll tick and busy-waited on a global deadline. That does not scale past a handful of nodes (thread churn, many parallel ~16MB SSZ downloads, fragile cleanup).
Changes
- Fixed-size worker pool joined with Thread.join() for a deterministic poll round; each worker keeps its own std.http.Client (unchanged safety model).
- New setting poll_max_concurrency: --poll-concurrency <n> and LEANPOINT_POLL_CONCURRENCY (0 falls back to the default).

Tuning
With concurrency < N, worst-case wall time per tick grows roughly with ceil(N / workers) × per-request latency (still bounded by the existing socket timeouts in lean_api). Raise concurrency on fast networks or large devnets if needed.
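For example, 64 upstreams polled at the default cap of 16 take ceil(64/16) = 4 batches per tick (roughly 4× the per-request latency), while a cap of 64 covers the whole cluster in a single batch.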
Testing

zig build / zig build test