fix(sprout-agent): retry on pre-response transport errors #700
Merged
The LLM POST retry loop in `crates/sprout-agent/src/llm.rs` only retried
`is_timeout()` and `is_connect()` reqwest errors. That left out a real
class of transient failures: TLS handshake aborts, sockets dropped or
reset mid-send, h2 GOAWAY/RST_STREAM, and hyper protocol errors. reqwest
classifies all of these under `is_request()`, with the exact message
`error sending request for url (...)` — the same string that has been
escaping the agent and surfacing in ACP as
Agent reported error: llm: transport: error sending request for url ...
Broaden the predicate to also match `is_request()`. These are safe to
replay because the body is serialized once before the loop and the LLM
POST has no observable side effects until a response is received.
Also add `tracing::warn!` on each retry so this is visible in production
logs rather than silent.
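A minimal sketch of the broadened predicate and retry loop, not the actual code in `llm.rs`. It uses only the standard library: `TransportKind` and `TransportError` are hypothetical stand-ins for the reqwest error classes named above, `eprintln!` stands in for `tracing::warn!`, the `MAX_RETRIES` value is assumed, and backoff is omitted for brevity.

```rust
use std::fmt;

/// Hypothetical stand-in for the reqwest error classes discussed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TransportKind {
    Timeout, // models `is_timeout()`
    Connect, // models `is_connect()`
    Request, // models `is_request()`: failed while sending, before any response
    Decode,  // a non-transient failure that is never retried
}

#[derive(Debug)]
struct TransportError(TransportKind);

impl fmt::Display for TransportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "transport error: {:?}", self.0)
    }
}

/// Broadened predicate: request-class failures are retried in addition to
/// timeouts and connect failures.
fn is_retryable(err: &TransportError) -> bool {
    matches!(
        err.0,
        TransportKind::Timeout | TransportKind::Connect | TransportKind::Request
    )
}

/// Assumed value for illustration; the real constant lives in `llm.rs`.
const MAX_RETRIES: u32 = 3;

/// The caller serializes `body` once, so every retry replays identical bytes.
fn post_with_retry<F>(body: &[u8], mut send: F) -> Result<String, TransportError>
where
    F: FnMut(&[u8]) -> Result<String, TransportError>,
{
    let mut attempt = 0;
    loop {
        match send(body) {
            Ok(response) => return Ok(response),
            Err(err) if is_retryable(&err) && attempt < MAX_RETRIES => {
                attempt += 1;
                // Stand-in for `tracing::warn!`, so each retry is visible.
                eprintln!("warn: retrying LLM POST ({}/{}): {}", attempt, MAX_RETRIES, err);
            }
            Err(err) => return Err(err),
        }
    }
}
```

Calling `post_with_retry` with a `send` closure that fails once with `TransportKind::Request` and then succeeds returns the successful response after two calls. A `TransportKind::Decode` failure is returned to the caller immediately.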
Tests: a regression covers the case where a server accepts the
connection and drops it without writing any response bytes — a
request-class failure with no `is_connect()` or `is_timeout()` signal.
Verified the test fails against the old predicate with the exact
"error sending request for url" message.
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
This was referenced May 28, 2026
What
Broaden the LLM POST retry predicate in `crates/sprout-agent/src/llm.rs` from `is_timeout() || is_connect()` to also include reqwest's `is_request()`. Add a `tracing::warn!` on each retry so this isn't silent in production.
Why
The agent has been bubbling these to ACP instead of retrying them: `Agent reported error: llm: transport: error sending request for url ...`
That exact `"error sending request for url (...)"` string is reqwest's `Kind::Request` `Display`. The current predicate misses it. In practice, `is_request()` covers the real-world transient failures we want to retry: TLS handshake aborts, sockets dropped or reset mid-send, h2 GOAWAY/RST_STREAM, hyper protocol errors. These are pre-response failures, and the LLM POST is idempotent: the body is serialized once before the loop and there's no observable side effect until a response is received, so replaying is safe.
Status-code retries (5xx, 429), auth bail-out (401/403), backoff, and `MAX_RETRIES` are unchanged.
Test
New `post_retries_on_dropped_connection_before_response` in `llm.rs`: a TCP listener that accepts and drops the first connection without writing any response bytes, then serves a normal JSON body on the second connection. Asserts the post succeeds and the server saw ≥2 connection attempts.
I verified the test reproduces the bug: with the old `is_timeout() || is_connect()` predicate it fails with literally the `error sending request for url (...)` message. That's the same message Tyler reported from Anthropic.
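The shape of that regression can be sketched with the standard library alone. This is an illustration, not the body of `post_retries_on_dropped_connection_before_response`: the helper names `spawn_flaky_server`, `post_once`, and `post_with_retry` are hypothetical, a raw `TcpStream` stands in for the reqwest client, and `eprintln!` stands in for `tracing::warn!`.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Starts a server that drops its first connection without writing a byte,
/// then answers the second with a small JSON body. Returns the bound address
/// and a counter of accepted connections.
fn spawn_flaky_server() -> io::Result<(SocketAddr, Arc<AtomicUsize>)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let seen = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&seen);
    thread::spawn(move || {
        for stream in listener.incoming().take(2) {
            let mut stream = match stream {
                Ok(s) => s,
                Err(_) => continue,
            };
            if counter.fetch_add(1, Ordering::SeqCst) == 0 {
                drop(stream); // first connection: close with no response bytes
                continue;
            }
            // Second connection: read the (small) request, then reply.
            let mut buf = [0u8; 1024];
            let _ = stream.read(&mut buf);
            let body = r#"{"ok":true}"#;
            let reply = format!(
                "HTTP/1.1 200 OK\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
                body.len(),
                body
            );
            let _ = stream.write_all(reply.as_bytes());
        }
    });
    Ok((addr, seen))
}

/// Sends one POST and fails if the peer closes before any response bytes.
fn post_once(addr: SocketAddr, body: &str) -> io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    let request = format!(
        "POST / HTTP/1.1\r\nHost: localhost\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        body.len(),
        body
    );
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    if response.is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "connection closed before any response bytes",
        ));
    }
    Ok(response)
}

/// Replays the same body on every pre-response failure, up to `max_attempts`.
fn post_with_retry(addr: SocketAddr, body: &str, max_attempts: usize) -> io::Result<String> {
    let mut last_err = io::Error::new(io::ErrorKind::Other, "no attempts made");
    for attempt in 1..=max_attempts {
        match post_once(addr, body) {
            Ok(response) => return Ok(response),
            Err(err) => {
                eprintln!("warn: attempt {} failed before a response: {}", attempt, err);
                last_err = err;
            }
        }
    }
    Err(last_err)
}
```

A test built on these helpers would start the server, call `post_with_retry` with a few attempts allowed, and assert both that the response carries the JSON body and that the connection counter is at least 2.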
Local: `cargo test -p sprout-agent` passes (17 integration + 44 lib tests); `cargo clippy -p sprout-agent --all-targets -- -D warnings` clean; `cargo fmt -p sprout-agent --check` clean.
Credit
Mari (GPT-5.5) led research and proposed the predicate. Sami (Claude Opus) verified all the load-bearing claims against the tree and sharpened the safety argument (body-serialize is outside the loop, so broadening to `is_request()` doesn't risk retrying a malformed body). I confirmed both, wrote the diff, and added the regression test.