fix: pin Farm client to HTTP/1.1 and cap concurrent uploads#10456
fix: pin Farm client to HTTP/1.1 and cap concurrent uploads#10456basvandijk wants to merge 2 commits into
Conversation
Farm's web server (warp-tls) advertises HTTP/2
SETTINGS_MAX_CONCURRENT_STREAMS=64 and tears down the whole connection
(RST_STREAM + GOAWAY "exceeds max concurrent") if a burst of multiplexed
streams exceeds that limit before the client has processed the server's
SETTINGS frame.
The blocking reqwest client used by the test driver multiplexes all parallel
requests over a single HTTP/2 connection. When a test starts many VMs at once
(e.g. the 120-node xnet_slo_120_subnets_staging_test), the request burst
outruns the SETTINGS frame on high-RTT paths and Farm kills the connection,
failing all in-flight requests at once (SendError { kind: Disconnected } /
REFUSED_STREAM), which surfaces as:
Starting VM failed with: Retried too many times: sending a request to Farm
and the test fails in setup with zero VMs started. Low-RTT environments (CI
colocated with Farm) usually win the race, which is why this mostly bites
when running tests from dev machines.
Fix:
- Force HTTP/1.1 on the Farm client (one TCP connection per request, no
stream-concurrency limit to violate).
- Cap concurrent file uploads at 32 with a small blocking semaphore held
across the retry loop; under HTTP/1.1 this bounds parallel TCP connections
and keeps the code safe if the HTTP/1.1 pin is ever removed.
Verified by running xnet_slo_120_subnets_staging_test from a dev container:
setup now starts all 120 VMs reliably (previously 0 started).
There was a problem hiding this comment.
Pull request overview
This PR mitigates Farm connection teardown failures during large parallel VM setups by preventing HTTP/2 stream multiplexing against Farm’s strict concurrent-stream handling and by bounding upload concurrency in the test driver.
Changes:
- Force the Farm
reqwest::blockingclient to use HTTP/1.1 to avoid HTTP/2 multiplexed bursts exceeding Farm’sSETTINGS_MAX_CONCURRENT_STREAMS. - Add a small blocking counting semaphore shared across
Farmclones to cap concurrentupload_filecalls at 32, holding the permit across the full retry loop.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
nmattia
left a comment
There was a problem hiding this comment.
Nice! couple questions to help me understand better
| /// Cap on concurrent `upload_file` calls per `Farm` instance (clones share the | ||
| /// HTTP client and this semaphore). Farm's web server advertises HTTP/2 | ||
| /// SETTINGS_MAX_CONCURRENT_STREAMS=64 and tears down the whole connection if a | ||
| /// burst of multiplexed streams exceeds that limit, so stay well below it, | ||
| /// leaving headroom for other requests sharing the client (e.g. the keepalive | ||
| /// task). Under HTTP/1.1 this bounds the number of parallel TCP connections | ||
| /// instead. |
There was a problem hiding this comment.
I'm a bit confused by this comment. I see below that we now enforce HTTP1.1, so what do we mean by "..., so stay well below [the HTTP2 max streams]"?
|
|
||
| /// Blocks until a permit is free. The permit is held until the returned | ||
| /// guard is dropped (including on panic). | ||
| fn acquire(&self) -> UploadPermit<'_> { |
There was a problem hiding this comment.
is this ever called within tokio? this will be blocking the whole thread, correct?
Problem
Farm's web server (warp-tls) advertises HTTP/2
SETTINGS_MAX_CONCURRENT_STREAMS=64and tears down the whole connection (RST_STREAM+GOAWAY \"exceeds max concurrent\") when a burst of multiplexed streams exceeds that limit — strict behavior vs. RFC 9113 §5.1.2, which prescribes rejecting only the offending stream.The blocking reqwest client used by the test driver multiplexes all parallel requests over a single HTTP/2 connection. When a test starts many VMs at once (e.g. the 120-node
xnet_slo_120_subnets_staging_test), the request burst outruns the server's SETTINGS frame on high-RTT paths and Farm kills the connection, failing all in-flight requests at once:→ setup fails with zero VMs started. Low-RTT environments usually win the race, which is why this mostly bites when running system tests from devenvs.
Diagnosis detail: all failures land in the same millisecond (one shared connection dying), and
ssshows only a single connection to Farm during a 120-VM burst.Fix
upload_filecalls at 32 with a small blocking semaphore (RAII permit held across the whole retry loop). Under HTTP/1.1 this bounds parallel TCP connections; it also keeps the code safe should the HTTP/1.1 pin ever be removed (32 < 64 leaves headroom for other requests sharing the client).Verification
Ran
//rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test(120 single-node subnets) from a dev container: