Skip to content

fix: pin Farm client to HTTP/1.1 and cap concurrent uploads#10456

Open
basvandijk wants to merge 2 commits into
masterfrom
basvandijk/farm-http1-uploads
Open

fix: pin Farm client to HTTP/1.1 and cap concurrent uploads#10456
basvandijk wants to merge 2 commits into
masterfrom
basvandijk/farm-http1-uploads

Conversation

@basvandijk

@basvandijk basvandijk commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Problem

Farm's web server (warp-tls) advertises HTTP/2 SETTINGS_MAX_CONCURRENT_STREAMS=64 and tears down the whole connection (RST_STREAM + GOAWAY \"exceeds max concurrent\") when a burst of multiplexed streams exceeds that limit — strict behavior vs. RFC 9113 §5.1.2, which prescribes rejecting only the offending stream.

The blocking reqwest client used by the test driver multiplexes all parallel requests over a single HTTP/2 connection. When a test starts many VMs at once (e.g. the 120-node xnet_slo_120_subnets_staging_test), the request burst outruns the server's SETTINGS frame on high-RTT paths and Farm kills the connection, failing all in-flight requests at once:

ERRO ... sending a request to Farm failed: reqwest::Error { kind: Request, ...,
source: hyper_util::client::legacy::Error(SendRequest, hyper::Error(Http2,
Error { kind: GoAway(b\"exceeds max concurrent\", REFUSED_STREAM, Remote) })) }
WARN ... Starting VM failed with: Retried too many times: sending a request to Farm

→ setup fails with zero VMs started. Low-RTT environments usually win the race, which is why this mostly bites when running system tests from devenvs.

Diagnosis detail: all failures land in the same millisecond (one shared connection dying), and ss shows only a single connection to Farm during a 120-VM burst.

Fix

  • Force HTTP/1.1 on the Farm client: one TCP connection per concurrent request, no stream-concurrency limit to violate.
  • Cap concurrent upload_file calls at 32 with a small blocking semaphore (RAII permit held across the whole retry loop). Under HTTP/1.1 this bounds parallel TCP connections; it also keeps the code safe should the HTTP/1.1 pin ever be removed (32 < 64 leaves headroom for other requests sharing the client).

Verification

Ran //rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test (120 single-node subnets) from a dev container:

Farm's web server (warp-tls) advertises HTTP/2
SETTINGS_MAX_CONCURRENT_STREAMS=64 and tears down the whole connection
(RST_STREAM + GOAWAY "exceeds max concurrent") if a burst of multiplexed
streams exceeds that limit before the client has processed the server's
SETTINGS frame.

The blocking reqwest client used by the test driver multiplexes all parallel
requests over a single HTTP/2 connection. When a test starts many VMs at once
(e.g. the 120-node xnet_slo_120_subnets_staging_test), the request burst
outruns the SETTINGS frame on high-RTT paths and Farm kills the connection,
failing all in-flight requests at once (SendError { kind: Disconnected } /
REFUSED_STREAM), which surfaces as:

    Starting VM failed with: Retried too many times: sending a request to Farm

and the test fails in setup with zero VMs started. Low-RTT environments (CI
colocated with Farm) usually win the race, which is why this mostly bites
when running tests from dev machines.

Fix:
- Force HTTP/1.1 on the Farm client (one TCP connection per request, no
  stream-concurrency limit to violate).
- Cap concurrent file uploads at 32 with a small blocking semaphore held
  across the retry loop; under HTTP/1.1 this bounds parallel TCP connections
  and keeps the code safe if the HTTP/1.1 pin is ever removed.

Verified by running xnet_slo_120_subnets_staging_test from a dev container:
setup now starts all 120 VMs reliably (previously 0 started).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR mitigates Farm connection teardown failures during large parallel VM setups by preventing HTTP/2 stream multiplexing against Farm’s strict concurrent-stream handling and by bounding upload concurrency in the test driver.

Changes:

  • Force the Farm reqwest::blocking client to use HTTP/1.1 to avoid HTTP/2 multiplexed bursts exceeding Farm’s SETTINGS_MAX_CONCURRENT_STREAMS.
  • Add a small blocking counting semaphore shared across Farm clones to cap concurrent upload_file calls at 32, holding the permit across the full retry loop.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@basvandijk basvandijk added the CI_ALL_BAZEL_TARGETS Runs all bazel targets label Jun 13, 2026
@basvandijk basvandijk marked this pull request as ready for review June 14, 2026 12:42
@basvandijk basvandijk requested a review from a team as a code owner June 14, 2026 12:42
@github-actions github-actions Bot added the @idx label Jun 14, 2026

@nmattia nmattia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! couple questions to help me understand better

Comment on lines +47 to +53
/// Cap on concurrent `upload_file` calls per `Farm` instance (clones share the
/// HTTP client and this semaphore). Farm's web server advertises HTTP/2
/// SETTINGS_MAX_CONCURRENT_STREAMS=64 and tears down the whole connection if a
/// burst of multiplexed streams exceeds that limit, so stay well below it,
/// leaving headroom for other requests sharing the client (e.g. the keepalive
/// task). Under HTTP/1.1 this bounds the number of parallel TCP connections
/// instead.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by this comment. I see below that we now enforce HTTP1.1, so what do we mean by "..., so stay well below [the HTTP2 max streams]"?


/// Blocks until a permit is free. The permit is held until the returned
/// guard is dropped (including on panic).
fn acquire(&self) -> UploadPermit<'_> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this ever called within tokio? this will be blocking the whole thread, correct?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants