Skip to content

feat(agent): SSE streaming for LLM completions with per-provider accumulators#944

Closed
wpfleger96 wants to merge 6 commits into
wpfleger/timeout-resilience-basefrom
wpfleger/timeout-resilience-streaming
Closed

feat(agent): SSE streaming for LLM completions with per-provider accumulators#944
wpfleger96 wants to merge 6 commits into
wpfleger/timeout-resilience-basefrom
wpfleger/timeout-resilience-streaming

Conversation

@wpfleger96

@wpfleger96 wpfleger96 commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Summary

Replace buffered stream: false LLM requests with Server-Sent Events streaming. Each text delta emits an agent_message_chunk session update as it arrives, providing a natural keepalive that resets the ACP idle clock without relying solely on the 30s ticker.

Stack: #935 (Phase 1: base) → this PR (Phase 2: streaming)

Design Decisions

  • Two reqwest::Client instances: http (with global timeout for summarize()) and http_stream (no global timeout — first-byte and inter-chunk timeouts enforced via tokio::time::timeout)
  • Two-phase timeout: llm_timeout (120s) until first content delta, then stream_chunk_timeout (30s) between subsequent chunks
  • Keepalive via content deltas: each text delta emits agent_message_chunk session update, resetting ACP idle clock; 30s ticker preserved as fallback for reasoning models
  • MAX_LLM_RESPONSE_BYTES caps accumulated semantic content (text + tool arg strings), not raw SSE wire bytes

Per-Provider Accumulators

  • Anthropic: index-keyed HashMap<usize, (String, String)> for parallel tool-call blocks via content_block index
  • OpenAI Chat: Vec-indexed accumulation by tool_calls[].index
  • OpenAI Responses API: routes on JSON type field inside data: payload — response.output_text.delta for text, response.function_call_arguments.delta for tool args, response.output_item.done for item completion

SSE Parser

Custom inline parser (~70 lines) handling:

  • : comment lines (discarded)
  • Multi-line data: fields (joined with \n)
  • id: and retry: fields (ignored)
  • [DONE] sentinel detection

Changes

crates/sprout-agent/src/llm.rs

  • Add http_stream client field (no global timeout)
  • post_stream() function with SSE line parser
  • Per-provider accumulator structs (Anthropic, OpenAI Chat, Responses)
  • Two-phase timeout: llm_timeout for pre-content, stream_chunk_timeout for post-content
  • MAX_LLM_RESPONSE_BYTES enforcement on semantic content

crates/sprout-agent/src/config.rs

  • Add stream_chunk_timeout field
  • Parse from SPROUT_AGENT_LLM_STREAM_CHUNK_TIMEOUT_SECS env var (default 30s)

crates/sprout-agent/src/agent.rs

  • Pass WireSender + session_id into complete() for delta emission
  • Keep ticker (G) as fallback for reasoning models

Tests

  • SSE parser edge cases (comments, multi-line data, id/retry fields)
  • Per-provider accumulation (Anthropic parallel tool calls, OpenAI indexed)
  • Two-phase timeout behavior
  • MAX_LLM_RESPONSE_BYTES enforcement during streaming

Acceptance Criteria

  • cargo test passes (existing + new streaming tests)
  • cargo clippy clean
  • Anthropic streaming: text chunks emit as agent_message_chunk, tool calls accumulate via index-keyed HashMap
  • OpenAI Chat streaming: same with Vec-indexed accumulation
  • OpenAI Responses streaming: same with correct event taxonomy
  • summarize() still works via http client (non-streaming path preserved)
  • MAX_LLM_RESPONSE_BYTES enforced during streaming (semantic content)
  • Keepalive ticker still fires if no delta arrives within timeout window
  • Two-phase timeout: llm_timeout pre-content, stream_chunk_timeout post-content

@wpfleger96

Copy link
Copy Markdown
Collaborator Author

Pushed Thufir's two review fixes (HEAD now edbe0e40):

Fix 1 — connect_timeout restored to 10s. crates/sprout-agent/src/llm.rs:52.connect_timeout(std::time::Duration::from_secs(10)), independent of cfg.llm_timeout. A dead endpoint fails fast at the handshake instead of waiting 120s (and being retried). The .timeout() removal from the original PR stays.

Fix 2 — mid-body read errors retry like stalls. The chunk().await Err arm now backs off and resends when attempts remain, hard-failing only on the final attempt — mirroring the stall arm. reqwest's chunk() wraps a reset/truncated body as a decode error (Kind::Decode, not Kind::Body), so is_retryable_transport_error gained is_decode(). JSON-parse failures take the separate serde_json path on the fully-buffered body and are unaffected.

Test: post_retries_on_read_error_mid_body — real TCP, server sends chunked headers + a partial chunk then resets the socket on attempt 1, serves a complete body on retry. Asserts success and >=2 connect attempts.

cargo test -p sprout-agent (82 lib tests, +1) and cargo clippy --all-targets both clean.

Note: branch was force-updated from a dedicated worktree (duncan/944-review-fixes) to sidestep an edit race with a concurrent SSE-task instance in the shared streaming worktree; nothing from that instance was committed.

@wpfleger96 wpfleger96 force-pushed the wpfleger/timeout-resilience-streaming branch from edbe0e4 to f268646 Compare June 10, 2026 15:13
@wpfleger96 wpfleger96 changed the title feat(agent): streaming-aware timeout handling for LLM responses feat(agent): SSE streaming for LLM completions with per-provider accumulators Jun 10, 2026
npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 3 commits June 10, 2026 13:45
…mulators

Replace buffered LLM requests with Server-Sent Events streaming. Each
text delta emits an agent_message_chunk session update as it arrives,
providing a natural keepalive that resets the ACP idle clock without
relying solely on the 30s ticker.

Design decisions:
- Two reqwest::Client instances: `http` (with global timeout for
  summarize) and `http_stream` (no global timeout, enforces first-byte
  and inter-chunk timeouts via tokio::time::timeout)
- Two-phase timeout: llm_timeout (120s) until first content delta,
  then stream_chunk_timeout (30s) between subsequent events
- Anthropic: index-keyed HashMap<usize, (String, String)> for parallel
  tool-call accumulation via content_block index
- OpenAI Chat: Vec-indexed accumulation by tool_calls[].index
- Responses API: routes on JSON `type` field inside data payload, uses
  response.output_text.delta and response.function_call_arguments.delta
- MAX_LLM_RESPONSE_BYTES caps accumulated semantic content (text + tool
  args), not raw SSE wire bytes
- SSE parser handles : comments, multi-line data:, id:/retry: fields
- Keepalive ticker (G) preserved as fallback for reasoning models that
  pause before producing content

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
output_item.added may arrive after function_call_arguments.delta when
the Responses API reorders events. The delta handler creates the HashMap
entry with an empty name, and the previous or_insert_with no-oped on
existing entries. Use and_modify to backfill the name regardless of
insertion order.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…am retry, ordered tool calls

The SSE reader split events only on "\n\n", so providers or proxies
emitting CRLF or bare-CR line terminators broke event framing. It also
let SseReader.buf grow without bound when a stream never produced an
event boundary. Streaming requests had no retry on the initial send, and
the Responses-API tool-call accumulator used a HashMap, yielding
non-deterministic tool-call order.

- Normalize CR and CRLF to LF on chunk append (SSE spec equivalence)
- Cap SseReader.buf at MAX_LLM_RESPONSE_BYTES, erroring before any
  unbounded growth
- Add send_stream_with_retry() applying the buffered path's transport/
  5xx/429 backoff to the initial request only, stopping before any chunk
  is emitted to avoid duplicate output
- Replace the HashMap tool-call accumulator with a first-seen-ordered Vec
  to match the buffered parser's ordering

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 force-pushed the wpfleger/timeout-resilience-streaming branch from f1599f2 to 06378f4 Compare June 10, 2026 17:48
npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 3 commits June 10, 2026 14:01
The buffered-body inter-chunk timeout was named llm_stream_chunk_timeout,
confusingly similar to stream_chunk_timeout (the SSE inter-chunk timeout).
Operators tuning "stream chunk timeout" would set the wrong env var.
Rename to llm_body_chunk_timeout / SPROUT_AGENT_LLM_BODY_CHUNK_TIMEOUT_SECS
to clearly distinguish the buffered path from the streaming path.

Extract the openai_to_sse_events test helper (duplicated in fake_llm.rs,
golden_transcripts.rs, and regressions.rs) into tests/common/mod.rs.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
… retry

Cover three high-priority gaps identified in the PR #944 test plan:

Gap 3 (dual-timeout switchover): Verify first_byte_timeout governs
pre-content and stream_chunk_timeout governs post-content. Three tests
exercise the transition using sub-second timeouts against a delayed-write
TCP server.

Gap 1 (split-boundary framing): Verify SseReader handles event boundaries
split across TCP chunks, data: prefix split mid-keyword, and trailing data
without a final \n\n boundary before EOF.

Gap 6 (send_stream_with_retry): Verify 5xx triggers retry and succeeds on
second attempt, all-503 exhausts retries with clear error, and 401 surfaces
LlmAuth immediately without retrying.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…alidation tests

StreamEmitter::test_channel() helper: returns a live mpsc receiver so
tests can assert which agent_message_chunk messages were emitted, unlike
noop() which drops the receiver.

Chunk-emission (2 tests): verify one chunk per text delta with correct
content, and that empty/absent text emits nothing.

Auto→Responses upgrade (2 tests): verify that Auto mode retries on
/responses when /chat/completions returns a 400 with an
is_responses_required body, and that the sticky auto_upgraded flag
causes subsequent calls to skip /chat/completions entirely. Both tests
assert which paths the fake server saw — the discriminating assertion
per Thufir's review.

Config validation (2 tests): stream_chunk_timeout=0 and
llm_body_chunk_timeout=0 both rejected with the documented error message.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 closed this Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant