fix: pin Farm client to HTTP/1.1 and cap concurrent uploads by basvandijk · Pull Request #10456 · dfinity/ic

basvandijk · 2026-06-12T20:18:50Z

Problem

Farm's web server (warp-tls) advertises HTTP/2 SETTINGS_MAX_CONCURRENT_STREAMS=64 and tears down the whole connection (RST_STREAM + GOAWAY \"exceeds max concurrent\") when a burst of multiplexed streams exceeds that limit — strict behavior vs. RFC 9113 §5.1.2, which prescribes rejecting only the offending stream.

The blocking reqwest client used by the test driver multiplexes all parallel requests over a single HTTP/2 connection. When a test starts many VMs at once (e.g. the 120-node xnet_slo_120_subnets_staging_test), the request burst outruns the server's SETTINGS frame on high-RTT paths and Farm kills the connection, failing all in-flight requests at once:

ERRO ... sending a request to Farm failed: reqwest::Error { kind: Request, ...,
source: hyper_util::client::legacy::Error(SendRequest, hyper::Error(Http2,
Error { kind: GoAway(b\"exceeds max concurrent\", REFUSED_STREAM, Remote) })) }
WARN ... Starting VM failed with: Retried too many times: sending a request to Farm

→ setup fails with zero VMs started. Low-RTT environments usually win the race, which is why this mostly bites when running system tests from devenvs.

Diagnosis detail: all failures land in the same millisecond (one shared connection dying), and ss shows only a single connection to Farm during a 120-VM burst.

Fix

Force HTTP/1.1 on the Farm client: one TCP connection per concurrent request, no stream-concurrency limit to violate.
Cap concurrent upload_file calls at 32 with a small blocking semaphore (RAII permit held across the whole retry loop). Under HTTP/1.1 this bounds parallel TCP connections; it also keeps the code safe should the HTTP/1.1 pin ever be removed (32 < 64 leaves headroom for other requests sharing the client).

Verification

Ran //rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test (120 single-node subnets) from a dev container:

Without this change: setup fails — all 120 VM-start requests die with the GoAway error above, 0 VMs started.
With this change: setup starts all 120 VMs reliably and the test runs to completion (passed in combination with fix: evict dead pooled XNet connections via HTTP/2 keep-alive pings #10455).

Farm's web server (warp-tls) advertises HTTP/2 SETTINGS_MAX_CONCURRENT_STREAMS=64 and tears down the whole connection (RST_STREAM + GOAWAY "exceeds max concurrent") if a burst of multiplexed streams exceeds that limit before the client has processed the server's SETTINGS frame. The blocking reqwest client used by the test driver multiplexes all parallel requests over a single HTTP/2 connection. When a test starts many VMs at once (e.g. the 120-node xnet_slo_120_subnets_staging_test), the request burst outruns the SETTINGS frame on high-RTT paths and Farm kills the connection, failing all in-flight requests at once (SendError { kind: Disconnected } / REFUSED_STREAM), which surfaces as: Starting VM failed with: Retried too many times: sending a request to Farm and the test fails in setup with zero VMs started. Low-RTT environments (CI colocated with Farm) usually win the race, which is why this mostly bites when running tests from dev machines. Fix: - Force HTTP/1.1 on the Farm client (one TCP connection per request, no stream-concurrency limit to violate). - Cap concurrent file uploads at 32 with a small blocking semaphore held across the retry loop; under HTTP/1.1 this bounds parallel TCP connections and keeps the code safe if the HTTP/1.1 pin is ever removed. Verified by running xnet_slo_120_subnets_staging_test from a dev container: setup now starts all 120 VMs reliably (previously 0 started).

Copilot

Pull request overview

This PR mitigates Farm connection teardown failures during large parallel VM setups by preventing HTTP/2 stream multiplexing against Farm’s strict concurrent-stream handling and by bounding upload concurrency in the test driver.

Changes:

Force the Farm reqwest::blocking client to use HTTP/1.1 to avoid HTTP/2 multiplexed bursts exceeding Farm’s SETTINGS_MAX_CONCURRENT_STREAMS.
Add a small blocking counting semaphore shared across Farm clones to cap concurrent upload_file calls at 32, holding the permit across the full retry loop.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

nmattia

Nice! couple questions to help me understand better

nmattia · 2026-06-15T09:04:17Z

+/// Cap on concurrent `upload_file` calls per `Farm` instance (clones share the
+/// HTTP client and this semaphore). Farm's web server advertises HTTP/2
+/// SETTINGS_MAX_CONCURRENT_STREAMS=64 and tears down the whole connection if a
+/// burst of multiplexed streams exceeds that limit, so stay well below it,
+/// leaving headroom for other requests sharing the client (e.g. the keepalive
+/// task). Under HTTP/1.1 this bounds the number of parallel TCP connections
+/// instead.


I'm a bit confused by this comment. I see below that we now enforce HTTP1.1, so what do we mean by "..., so stay well below [the HTTP2 max streams]"?

nmattia · 2026-06-15T09:08:20Z

+
+    /// Blocks until a permit is free. The permit is held until the returned
+    /// guard is dropped (including on panic).
+    fn acquire(&self) -> UploadPermit<'_> {


is this ever called within tokio? this will be blocking the whole thread, correct?

github-actions Bot added the fix label Jun 12, 2026

basvandijk requested a review from Copilot June 12, 2026 20:19

Copilot started reviewing on behalf of basvandijk June 12, 2026 20:19 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

basvandijk added the CI_ALL_BAZEL_TARGETS Runs all bazel targets label Jun 13, 2026

Merge branch 'master' into basvandijk/farm-http1-uploads

1a704a5

basvandijk marked this pull request as ready for review June 14, 2026 12:42

basvandijk requested a review from a team as a code owner June 14, 2026 12:42

github-actions Bot added the @idx label Jun 14, 2026

nmattia reviewed Jun 15, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: pin Farm client to HTTP/1.1 and cap concurrent uploads#10456

fix: pin Farm client to HTTP/1.1 and cap concurrent uploads#10456
basvandijk wants to merge 2 commits into
masterfrom
basvandijk/farm-http1-uploads

basvandijk commented Jun 12, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

nmattia left a comment

Uh oh!

nmattia Jun 15, 2026

Uh oh!

nmattia Jun 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

basvandijk commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Verification

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

nmattia left a comment

Choose a reason for hiding this comment

Uh oh!

nmattia Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

nmattia Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

basvandijk commented Jun 12, 2026 •

edited

Loading