feat(agent): SSE streaming for LLM completions with per-provider accumulators by wpfleger96 · Pull Request #944 · block/buzz

wpfleger96 · 2026-06-10T14:22:03Z

Summary

Replace buffered stream: false LLM requests with Server-Sent Events streaming. Each text delta emits an agent_message_chunk session update as it arrives, providing a natural keepalive that resets the ACP idle clock without relying solely on the 30s ticker.

Stack: #935 (Phase 1: base) → this PR (Phase 2: streaming)

Design Decisions

Two reqwest::Client instances: http (with global timeout for summarize()) and http_stream (no global timeout — first-byte and inter-chunk timeouts enforced via tokio::time::timeout)
Two-phase timeout: llm_timeout (120s) until first content delta, then stream_chunk_timeout (30s) between subsequent chunks
Keepalive via content deltas: each text delta emits agent_message_chunk session update, resetting ACP idle clock; 30s ticker preserved as fallback for reasoning models
MAX_LLM_RESPONSE_BYTES caps accumulated semantic content (text + tool arg strings), not raw SSE wire bytes

Per-Provider Accumulators

Anthropic: index-keyed HashMap<usize, (String, String)> for parallel tool-call blocks via content_block index
OpenAI Chat: Vec-indexed accumulation by tool_calls[].index
OpenAI Responses API: routes on JSON type field inside data: payload — response.output_text.delta for text, response.function_call_arguments.delta for tool args, response.output_item.done for item completion

SSE Parser

Custom inline parser (~70 lines) handling:

: comment lines (discarded)
Multi-line data: fields (joined with \n)
id: and retry: fields (ignored)
[DONE] sentinel detection

Changes

`crates/sprout-agent/src/llm.rs`

Add http_stream client field (no global timeout)
post_stream() function with SSE line parser
Per-provider accumulator structs (Anthropic, OpenAI Chat, Responses)
Two-phase timeout: llm_timeout for pre-content, stream_chunk_timeout for post-content
MAX_LLM_RESPONSE_BYTES enforcement on semantic content

`crates/sprout-agent/src/config.rs`

Add stream_chunk_timeout field
Parse from SPROUT_AGENT_LLM_STREAM_CHUNK_TIMEOUT_SECS env var (default 30s)

`crates/sprout-agent/src/agent.rs`

Pass WireSender + session_id into complete() for delta emission
Keep ticker (G) as fallback for reasoning models

Tests

SSE parser edge cases (comments, multi-line data, id/retry fields)
Per-provider accumulation (Anthropic parallel tool calls, OpenAI indexed)
Two-phase timeout behavior
MAX_LLM_RESPONSE_BYTES enforcement during streaming

Acceptance Criteria

cargo test passes (existing + new streaming tests)
cargo clippy clean
Anthropic streaming: text chunks emit as agent_message_chunk, tool calls accumulate via index-keyed HashMap
OpenAI Chat streaming: same with Vec-indexed accumulation
OpenAI Responses streaming: same with correct event taxonomy
summarize() still works via http client (non-streaming path preserved)
MAX_LLM_RESPONSE_BYTES enforced during streaming (semantic content)
Keepalive ticker still fires if no delta arrives within timeout window
Two-phase timeout: llm_timeout pre-content, stream_chunk_timeout post-content

wpfleger96 · 2026-06-10T14:46:44Z

Pushed Thufir's two review fixes (HEAD now edbe0e40):

Fix 1 — connect_timeout restored to 10s. crates/sprout-agent/src/llm.rs:52 — .connect_timeout(std::time::Duration::from_secs(10)), independent of cfg.llm_timeout. A dead endpoint fails fast at the handshake instead of waiting 120s (and being retried). The .timeout() removal from the original PR stays.

Fix 2 — mid-body read errors retry like stalls. The chunk().await Err arm now backs off and resends when attempts remain, hard-failing only on the final attempt — mirroring the stall arm. reqwest's chunk() wraps a reset/truncated body as a decode error (Kind::Decode, not Kind::Body), so is_retryable_transport_error gained is_decode(). JSON-parse failures take the separate serde_json path on the fully-buffered body and are unaffected.

Test: post_retries_on_read_error_mid_body — real TCP, server sends chunked headers + a partial chunk then resets the socket on attempt 1, serves a complete body on retry. Asserts success and >=2 connect attempts.

cargo test -p sprout-agent (82 lib tests, +1) and cargo clippy --all-targets both clean.

Note: branch was force-updated from a dedicated worktree (duncan/944-review-fixes) to sidestep an edit race with a concurrent SSE-task instance in the shared streaming worktree; nothing from that instance was committed.

…mulators Replace buffered LLM requests with Server-Sent Events streaming. Each text delta emits an agent_message_chunk session update as it arrives, providing a natural keepalive that resets the ACP idle clock without relying solely on the 30s ticker. Design decisions: - Two reqwest::Client instances: `http` (with global timeout for summarize) and `http_stream` (no global timeout, enforces first-byte and inter-chunk timeouts via tokio::time::timeout) - Two-phase timeout: llm_timeout (120s) until first content delta, then stream_chunk_timeout (30s) between subsequent events - Anthropic: index-keyed HashMap<usize, (String, String)> for parallel tool-call accumulation via content_block index - OpenAI Chat: Vec-indexed accumulation by tool_calls[].index - Responses API: routes on JSON `type` field inside data payload, uses response.output_text.delta and response.function_call_arguments.delta - MAX_LLM_RESPONSE_BYTES caps accumulated semantic content (text + tool args), not raw SSE wire bytes - SSE parser handles : comments, multi-line data:, id:/retry: fields - Keepalive ticker (G) preserved as fallback for reasoning models that pause before producing content Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

output_item.added may arrive after function_call_arguments.delta when the Responses API reorders events. The delta handler creates the HashMap entry with an empty name, and the previous or_insert_with no-oped on existing entries. Use and_modify to backfill the name regardless of insertion order. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

…am retry, ordered tool calls The SSE reader split events only on "\n\n", so providers or proxies emitting CRLF or bare-CR line terminators broke event framing. It also let SseReader.buf grow without bound when a stream never produced an event boundary. Streaming requests had no retry on the initial send, and the Responses-API tool-call accumulator used a HashMap, yielding non-deterministic tool-call order. - Normalize CR and CRLF to LF on chunk append (SSE spec equivalence) - Cap SseReader.buf at MAX_LLM_RESPONSE_BYTES, erroring before any unbounded growth - Add send_stream_with_retry() applying the buffered path's transport/ 5xx/429 backoff to the initial request only, stopping before any chunk is emitted to avoid duplicate output - Replace the HashMap tool-call accumulator with a first-seen-ordered Vec to match the buffered parser's ordering Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

The buffered-body inter-chunk timeout was named llm_stream_chunk_timeout, confusingly similar to stream_chunk_timeout (the SSE inter-chunk timeout). Operators tuning "stream chunk timeout" would set the wrong env var. Rename to llm_body_chunk_timeout / SPROUT_AGENT_LLM_BODY_CHUNK_TIMEOUT_SECS to clearly distinguish the buffered path from the streaming path. Extract the openai_to_sse_events test helper (duplicated in fake_llm.rs, golden_transcripts.rs, and regressions.rs) into tests/common/mod.rs. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

… retry Cover three high-priority gaps identified in the PR #944 test plan: Gap 3 (dual-timeout switchover): Verify first_byte_timeout governs pre-content and stream_chunk_timeout governs post-content. Three tests exercise the transition using sub-second timeouts against a delayed-write TCP server. Gap 1 (split-boundary framing): Verify SseReader handles event boundaries split across TCP chunks, data: prefix split mid-keyword, and trailing data without a final \n\n boundary before EOF. Gap 6 (send_stream_with_retry): Verify 5xx triggers retry and succeeds on second attempt, all-503 exhausts retries with clear error, and 401 surfaces LlmAuth immediately without retrying. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

…alidation tests StreamEmitter::test_channel() helper: returns a live mpsc receiver so tests can assert which agent_message_chunk messages were emitted, unlike noop() which drops the receiver. Chunk-emission (2 tests): verify one chunk per text delta with correct content, and that empty/absent text emits nothing. Auto→Responses upgrade (2 tests): verify that Auto mode retries on /responses when /chat/completions returns a 400 with an is_responses_required body, and that the sticky auto_upgraded flag causes subsequent calls to skip /chat/completions entirely. Both tests assert which paths the fake server saw — the discriminating assertion per Thufir's review. Config validation (2 tests): stream_chunk_timeout=0 and llm_body_chunk_timeout=0 both rejected with the documented error message. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

wpfleger96 force-pushed the wpfleger/timeout-resilience-streaming branch from edbe0e4 to f268646 Compare June 10, 2026 15:13

wpfleger96 changed the title ~~feat(agent): streaming-aware timeout handling for LLM responses~~ feat(agent): SSE streaming for LLM completions with per-provider accumulators Jun 10, 2026

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 3 commits June 10, 2026 13:45

wpfleger96 force-pushed the wpfleger/timeout-resilience-streaming branch from f1599f2 to 06378f4 Compare June 10, 2026 17:48

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 3 commits June 10, 2026 14:01

wpfleger96 closed this Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(agent): SSE streaming for LLM completions with per-provider accumulators#944

feat(agent): SSE streaming for LLM completions with per-provider accumulators#944
wpfleger96 wants to merge 6 commits into
wpfleger/timeout-resilience-basefrom
wpfleger/timeout-resilience-streaming

wpfleger96 commented Jun 10, 2026 •

edited

Loading

Uh oh!

wpfleger96 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wpfleger96 commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design Decisions

Per-Provider Accumulators

SSE Parser

Changes

crates/sprout-agent/src/llm.rs

crates/sprout-agent/src/config.rs

crates/sprout-agent/src/agent.rs

Tests

Acceptance Criteria

Uh oh!

wpfleger96 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wpfleger96 commented Jun 10, 2026 •

edited

Loading

`crates/sprout-agent/src/llm.rs`

`crates/sprout-agent/src/config.rs`

`crates/sprout-agent/src/agent.rs`