fix(sprout-agent): retry on pre-response transport errors #700

Merged
tlongwell-block merged 1 commit into main from dawn/llm-retry-broaden
May 22, 2026
@tlongwell-block

Collaborator

What

Broaden the LLM POST retry predicate in crates/sprout-agent/src/llm.rs from is_timeout() || is_connect() to also include reqwest's is_request(). Add a tracing::warn! on each retry so this isn't silent in production.

Why

The agent has been surfacing pre-response transport errors to ACP instead of retrying them:

Agent reported error: {"code":-32000,"message":"llm: transport: error sending request for url (https://api.anthropic.com/v1/messages)"}

That exact "error sending request for url (...)" string is reqwest's Kind::Request Display. The current predicate misses it. In practice, is_request() covers the real-world transient failures we want to retry: TLS handshake aborts, sockets dropped or reset mid-send, h2 GOAWAY/RST_STREAM, hyper protocol errors. These are pre-response failures, and the LLM POST is idempotent — the body is serialized once before the loop and there's no observable side effect until a response is received — so replaying is safe.

Status-code retries (5xx, 429), auth bail-out (401/403), backoff, and MAX_RETRIES are unchanged.
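
To make the predicate change and the safety argument above concrete, here is a minimal, std-only sketch. It is not the code in `llm.rs`: `TransportError` is a hypothetical stand-in for the `is_timeout()`, `is_connect()`, and `is_request()` flags on `reqwest::Error`, `post_with_retry` and the two predicate functions are illustrative names, and the `MAX_RETRIES` value and backoff shape are assumptions, since the PR leaves both unchanged without stating them. Status-code handling is omitted.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the classification flags on `reqwest::Error`.
/// The flags are modeled as independent booleans because a real error may
/// report more than one of them.
#[derive(Debug, Clone, Copy, Default)]
struct TransportError {
    timeout: bool,
    connect: bool,
    request: bool,
}

/// Predicate before the change: timeouts and connect failures only.
fn retryable_old(e: &TransportError) -> bool {
    e.timeout || e.connect
}

/// Predicate after the change: request-class failures are retried too.
fn retryable_new(e: &TransportError) -> bool {
    e.timeout || e.connect || e.request
}

/// Assumed value for illustration; the PR does not state the real one.
const MAX_RETRIES: u32 = 3;

/// Retry loop sketch. `send` is a hypothetical transport callback.
fn post_with_retry<T>(
    payload: &str,
    retryable: fn(&TransportError) -> bool,
    mut send: impl FnMut(&[u8]) -> Result<T, TransportError>,
) -> Result<T, TransportError> {
    // Serialize once, before the loop, so every attempt replays identical bytes.
    let body: Vec<u8> = payload.as_bytes().to_vec();
    let mut attempt: u32 = 0;
    loop {
        match send(&body) {
            Ok(resp) => return Ok(resp),
            Err(e) if attempt < MAX_RETRIES && retryable(&e) => {
                attempt += 1;
                // Stand-in for the tracing::warn! the PR adds on each retry.
                eprintln!("warn: llm POST transport error {e:?}; retry {attempt}/{MAX_RETRIES}");
                // Assumed backoff shape: doubles on each attempt.
                sleep(Duration::from_millis(5 << attempt));
            }
            Err(e) => return Err(e),
        }
    }
}
```

With `retryable_new`, a callback that fails once with a request-class error and then succeeds returns `Ok` after two calls. With `retryable_old`, the same first error is returned to the caller immediately, which corresponds to the behavior described above.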

Test

New post_retries_on_dropped_connection_before_response in llm.rs: a TCP listener that accepts and drops the first connection without writing any response bytes, then serves a normal JSON body on the second connection. Asserts the post succeeds and the server saw ≥2 connection attempts.
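
The shape of that test server can be sketched with the standard library alone. This is an illustration, not the test in `llm.rs`: it swaps the agent's reqwest-based POST for a raw `TcpStream` client so the block stays self-contained, and `spawn_flaky_server`, `post_once`, and `post_with_retries` are hypothetical names.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Starts a listener that drops the first connection without writing any
/// bytes, then answers later connections with a small JSON body.
/// Returns the bound address and a counter of accepted connections.
fn spawn_flaky_server() -> io::Result<(SocketAddr, Arc<AtomicUsize>)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;
    let attempts = Arc::new(AtomicUsize::new(0));
    let seen = Arc::clone(&attempts);
    thread::spawn(move || {
        for conn in listener.incoming() {
            let Ok(mut stream) = conn else { continue };
            if seen.fetch_add(1, Ordering::SeqCst) == 0 {
                drop(stream); // first connection: close with no response bytes
                continue;
            }
            // Read the (small) request, then reply and close.
            let mut buf = [0u8; 1024];
            let _ = stream.read(&mut buf);
            let body = r#"{"ok":true}"#;
            let resp = format!(
                "HTTP/1.1 200 OK\r\ncontent-type: application/json\r\ncontent-length: {}\r\nconnection: close\r\n\r\n{}",
                body.len(),
                body
            );
            let _ = stream.write_all(resp.as_bytes());
        }
    });
    Ok((addr, attempts))
}

/// One raw HTTP POST. A connection closed before any response bytes arrive
/// is reported as an error, whether it shows up as a reset or a clean EOF.
fn post_once(addr: SocketAddr) -> io::Result<String> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(
        b"POST /v1/x HTTP/1.1\r\nhost: localhost\r\ncontent-length: 2\r\nconnection: close\r\n\r\n{}",
    )?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    if response.is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            "connection closed before any response bytes",
        ));
    }
    Ok(response)
}

/// Retries `post_once` up to `max_attempts` times, returning the last error.
fn post_with_retries(addr: SocketAddr, max_attempts: usize) -> io::Result<String> {
    let mut last_err = io::Error::new(io::ErrorKind::Other, "no attempts made");
    for _ in 0..max_attempts {
        match post_once(addr) {
            Ok(resp) => return Ok(resp),
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

A caller would start the server, call `post_with_retries(addr, 3)`, and then check both that the response carries the JSON body and that the counter reached at least 2, mirroring the two assertions described above.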

I verified that the test reproduces the bug: with the old is_timeout() || is_connect() predicate it fails with exactly this error:

Llm("transport: error sending request for url (http://127.0.0.1:xxxxx/v1/x)")

That's the same message Tyler reported from Anthropic.

Local: `cargo test -p sprout-agent` passes (17 integration + 44 lib tests); `cargo clippy -p sprout-agent --all-targets -- -D warnings` is clean; `cargo fmt -p sprout-agent --check` is clean.

Credit

Mari (GPT-5.5) led research and proposed the predicate. Sami (Claude Opus) verified all the load-bearing claims against the tree and sharpened the safety argument (body-serialize is outside the loop, so broadening to is_request() doesn't risk retrying a malformed body). I confirmed both, wrote the diff, and added the regression test.

The LLM POST retry loop in `crates/sprout-agent/src/llm.rs` only retried
`is_timeout()` and `is_connect()` reqwest errors. That left out a real
class of transient failures: TLS handshake aborts, sockets dropped or
reset mid-send, h2 GOAWAY/RST_STREAM, and hyper protocol errors. reqwest
classifies all of these under `is_request()`, with the exact message
`error sending request for url (...)` — the same string that has been
escaping the agent and surfacing in ACP as

    Agent reported error: llm: transport: error sending request for url ...

Broaden the predicate to also match `is_request()`. These are safe to
replay because the body is serialized once before the loop and the LLM
POST has no observable side effects until a response is received.

Also add `tracing::warn!` on each retry so this is visible in production
logs rather than silent.

Tests: a regression covers the case where a server accepts the
connection and drops it without writing any response bytes — a
request-class failure with no `is_connect()` or `is_timeout()` signal.
Verified the test fails against the old predicate with the exact
"error sending request for url" message.

Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>
@tlongwell-block tlongwell-block requested a review from a team as a code owner May 21, 2026 17:31
@tlongwell-block tlongwell-block merged commit 7b33c47 into main May 22, 2026
15 checks passed
@tlongwell-block tlongwell-block deleted the dawn/llm-retry-broaden branch May 22, 2026 14:35