Bounded worker pool for multi-upstream polling #15
Merged
Replace one detached OS thread per upstream with a fixed-size worker queue and Thread.join(), capping parallel SSZ downloads and thread churn when many nodes are configured. Add poll_max_concurrency (default 16), settable via --poll-concurrency or LEANPOINT_POLL_CONCURRENCY; values above 256 are clamped in the poller.
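A minimal sketch of how the setting might be resolved; effectivePollConcurrency and its parameters are hypothetical names, and only the flag, env var, default, zero-fallback, and 256 clamp come from this change:

```zig
// Sketch (names assumed): resolve the effective concurrency cap from the CLI
// flag / env var value, fall back to the configured default, and clamp values
// above 256 as the poller does.
fn effectivePollConcurrency(requested: ?u32, default_cap: u32) u32 {
    const hard_cap: u32 = 256;
    const value = requested orelse default_cap;
    const nonzero = if (value == 0) default_cap else value; // 0 falls back to the default
    return @min(nonzero, hard_cap);
}
```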
Previously, if spawning the Nth worker failed, we joined the already-running workers but returned null before Step 2, discarding completed HTTP poll results and leaving every upstream at its initial error_count/last_error. Now we break out of the spawn loop instead, join all started workers once, and run Step 2 whenever at least one worker started; null is returned only when no worker could be spawned (the results buffer is then uninitialized).
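A rough sketch of the corrected dispatch shape; PollCtx, Worker, pollWorker, and pollOnce are hypothetical stand-ins, and only the break / join-started / return-null logic is from this change:

```zig
const std = @import("std");

// Hypothetical stand-ins for the real poller types.
const PollCtx = struct { upstream_index: usize };
const Worker = struct { thread: std.Thread = undefined };

fn pollWorker(ctx: *PollCtx) void {
    // Placeholder for the real HTTP poll against upstream ctx.upstream_index.
    _ = ctx;
}

// Stop spawning on the first failure, join only the workers that actually
// started, and run Step 2 (result merging) whenever at least one worker ran.
// Return null only when nothing started (results buffer never initialized).
fn pollOnce(workers: []Worker, contexts: []PollCtx) ?usize {
    var started: usize = 0;
    for (workers, 0..) |*w, i| {
        w.thread = std.Thread.spawn(.{}, pollWorker, .{&contexts[i]}) catch break;
        started += 1;
    }
    if (started == 0) return null;
    for (workers[0..started]) |*w| w.thread.join();
    // Step 2 would merge poll results here.
    return started;
}
```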
The previous worker-pool change used Thread.join() to wait for workers, but std.http.Client.open does a synchronous connect() and lean_api's SO_RCVTIMEO/SO_SNDTIMEO only bound read/write, not connect. A single blackholed peer hangs a worker on connect forever, so join() blocks forever and the poll loop never advances (live dashboard stuck on healthy=false, error_count=0, last_error=null for every node). Restore the original detach-on-deadline shape and layer bounded concurrency on top:
- Refcounted PollCtx (spawner and worker each hold a ref); slow workers safely outlive the dispatcher.
- SlotState (mutex + condvar + in_flight counter) caps parallel polls and signals the dispatcher when a slot frees up or all workers drain (see the sketch after this list).
- Per-tick deadline = ceil(N/cap) * timeout_ms + headroom; workers not done by then are abandoned (still detached, ctx survives).
- Worker uses a release store on done; dispatcher acquires before reading any output field.

SSZ ownership transfer for the consensus winner is unchanged.
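A compact sketch of the SlotState idea, assuming a recent Zig std (std.Thread.Mutex/Condition) and hypothetical field and method names; only the mutex + condvar + in_flight design and the deadline-bounded wait are from this change:

```zig
const std = @import("std");

const SlotState = struct {
    mutex: std.Thread.Mutex = .{},
    cond: std.Thread.Condition = .{},
    in_flight: u32 = 0,
    cap: u32,

    // Dispatcher side: block until a slot frees up, or give up at deadline_ms
    // (milliseconds, same clock as std.time.milliTimestamp).
    fn acquire(self: *SlotState, deadline_ms: i64) bool {
        self.mutex.lock();
        defer self.mutex.unlock();
        while (self.in_flight >= self.cap) {
            const now = std.time.milliTimestamp();
            if (now >= deadline_ms) return false;
            const wait_ns: u64 = @intCast((deadline_ms - now) * std.time.ns_per_ms);
            self.cond.timedWait(&self.mutex, wait_ns) catch return false; // error.Timeout
        }
        self.in_flight += 1;
        return true;
    }

    // Worker side: free the slot and wake the dispatcher; when in_flight hits
    // zero the dispatcher also learns that all workers have drained.
    fn release(self: *SlotState) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        self.in_flight -= 1;
        self.cond.broadcast();
    }
};
```

In this shape the dispatcher simply stops waiting once the per-tick deadline passes and abandons workers that are still running, matching the detach-on-deadline design described above.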
After deploying to a 64-node cluster the dispatcher was hitting its deadline at upstream 36/64. Two causes:
1. Worker slot occupancy was up to ~2× timeout_ms (fetchSlots up to timeout_ms, then aggregator + head_slot at timeout_ms/2 each), but the deadline assumed timeout_ms per worker.
2. The default cap of 16 forced 4 batches for the 64-node prod cluster, compounding the under-budgeting.

Changes:
- Default poll_max_concurrency 16 → 64. With one batch in prod, polls finish within stale_after_ms; the SSZ memory peak is no worse than the pre-bounding code.
- Aux call timeout: timeout_ms/2 → timeout_ms/4 (min 1s). Healthy workers free their slot in ~1.5× timeout_ms.
- Deadline formula: per_worker_max_ms = 2 × timeout_ms; round_ms = batches × per_worker_max_ms + timeout_ms (drain). Dispatch will not starve when a batch is full of slow peers (see the sketch after this list).
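A small sketch of the revised deadline arithmetic; roundDeadlineMs is a hypothetical name, while the formula itself is the one stated above:

```zig
const std = @import("std");

// per_worker_max_ms budgets 2x timeout_ms per slot (healthy workers free it
// in ~1.5x); one extra timeout_ms of drain headroom is added at the end.
fn roundDeadlineMs(n_upstreams: u64, cap: u64, timeout_ms: u64) u64 {
    const batches = std.math.divCeil(u64, n_upstreams, cap) catch unreachable; // cap > 0
    const per_worker_max_ms = 2 * timeout_ms;
    return batches * per_worker_max_ms + timeout_ms;
}
```

For illustration, 64 upstreams at cap 64 with an assumed timeout_ms of 5000 give one batch and a 15,000 ms round budget; at the old cap of 16 the same cluster would need 4 batches and 45,000 ms.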
Summary
Multi-upstream mode previously spawned one detached OS thread per upstream on every poll tick and busy-waited on a global deadline. That does not scale past a handful of nodes (thread churn, many parallel ~16MB SSZ downloads, fragile cleanup).
Changes
- Fixed-size worker pool joined with Thread.join() for a deterministic poll round; each worker keeps its own std.http.Client (unchanged safety model).
- New setting poll_max_concurrency: --poll-concurrency <n> and LEANPOINT_POLL_CONCURRENCY (0 falls back to the default).

Tuning
With concurrency < N, worst-case wall time per tick grows roughly with ceil(N / workers) × per-request latency (still bounded by the existing socket timeouts in lean_api). Raise concurrency on fast networks or large devnets if needed.
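For example, 64 upstreams polled at the default cap of 16 take ceil(64/16) = 4 batches per tick (roughly 4× the per-request latency), while a cap of 64 covers the whole cluster in a single batch.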
Testing

zig build / zig build test