feat(acp): agent timeout resilience — idle margin, tool-call reset, death notices, keepalive #935

Merged
wpfleger96 merged 6 commits into main from wpfleger/timeout-resilience-base
Jun 10, 2026
wpfleger96 commented Jun 10, 2026

Summary

Addresses spurious agent session kills caused by the narrow 20s margin between the 600s max shell timeout and the 620s idle timeout. Legitimate long-running tool calls (git push with pre-push hooks, large compilations) could exhaust that margin and trigger the idle timeout.

Changes

C: Raise idle timeout default (620 → 900s)

DEFAULT_IDLE_TIMEOUT_SECS in crates/sprout-acp/src/config.rs raised from 620 to 900, giving 300s of breathing room above the 600s max shell timeout.
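
The margin arithmetic can be sketched as follows. Only `DEFAULT_IDLE_TIMEOUT_SECS` and the numeric values come from this PR; `MAX_SHELL_TIMEOUT_SECS` and `idle_margin_secs` are hypothetical names used for illustration and may not match what `config.rs` actually contains.

```rust
/// Maximum shell tool timeout in seconds (600s per the PR description).
/// The constant name is hypothetical.
pub const MAX_SHELL_TIMEOUT_SECS: u64 = 600;

/// Idle timeout default after this change (previously 620).
pub const DEFAULT_IDLE_TIMEOUT_SECS: u64 = 900;

/// Headroom, in seconds, between the longest permitted shell call and the
/// point at which the idle timeout would kill the session.
/// Hypothetical helper; saturates at zero instead of underflowing.
pub const fn idle_margin_secs(idle_timeout_secs: u64, max_shell_timeout_secs: u64) -> u64 {
    idle_timeout_secs.saturating_sub(max_shell_timeout_secs)
}
```

With the old default the helper yields 20; with the new default it yields 300.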

D: Tool-call-aware idle reset

handle_session_update() in crates/sprout-acp/src/acp.rs now returns a bool indicating whether a tool call started. The idle-timeout read loop explicitly resets the idle clock and logs when a tool call begins, making the intent visible in traces.
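
A minimal sketch of the pattern, not the actual code in `acp.rs`: the `SessionUpdate` enum, the `IdleClock` type, and `process_update` are hypothetical stand-ins. It assumes that every valid message already counts as activity, so the explicit reset on a tool call is a second, deliberately visible safeguard.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified model of a session update.
#[derive(Debug, Clone, PartialEq)]
pub enum SessionUpdate {
    ToolCall { name: String },
    Keepalive,
    Other,
}

/// Returns true when the update marks the start of a tool call.
pub fn handle_session_update(update: &SessionUpdate) -> bool {
    matches!(update, SessionUpdate::ToolCall { .. })
}

/// Hypothetical idle clock tracking the time of the last observed activity.
pub struct IdleClock {
    last_activity: Instant,
    timeout: Duration,
}

impl IdleClock {
    pub fn new(timeout: Duration) -> Self {
        Self { last_activity: Instant::now(), timeout }
    }

    pub fn reset(&mut self) {
        self.last_activity = Instant::now();
    }

    pub fn is_expired(&self) -> bool {
        self.last_activity.elapsed() >= self.timeout
    }
}

/// One iteration of the read loop for an already-parsed update.
pub fn process_update(clock: &mut IdleClock, update: &SessionUpdate) -> bool {
    // Assumption: any valid message counts as activity.
    clock.reset();
    let tool_call_started = handle_session_update(update);
    if tool_call_started {
        // Explicit reset plus a log line so the intent shows up in traces.
        clock.reset();
        eprintln!("tool call started; idle clock reset");
    }
    tool_call_started
}
```

Returning the bool keeps the update handler free of timing concerns; the read loop alone owns the clock.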

E: Death notices

crates/sprout-acp/src/relay.rs gains build_death_notice() which constructs a KIND_JOB_ERROR (kind:43006) diagnostic event. handle_prompt_result() in crates/sprout-acp/src/lib.rs emits a turn_error observer event and stores a KIND_JOB_ERROR relay event when a session ends due to idle timeout, unexpected agent exit, or transport error. These appear in the per-agent activity panel (profile → View activity log) — not in the channel message stream. Death notices are threaded into the original conversation via NIP-10 e tags when thread context is available.
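
As a rough illustration of the event shape, the sketch below builds a death notice with plain standard-library types. The `DeathNotice` struct, the `DeathReason` enum, the content strings, and this signature of `build_death_notice` are assumptions made for the example; the real function in `relay.rs` presumably builds a signed Nostr event. The `e` tag follows the NIP-10 marked form `["e", <event-id>, <relay-url>, <marker>]` and uses the `"root"` marker that a later commit in this PR settles on.

```rust
/// Event kind used for death notices (kind:43006 per the PR description).
pub const KIND_JOB_ERROR: u32 = 43006;

/// Hypothetical unsigned event; the real code likely uses a Nostr event type.
#[derive(Debug, Clone, PartialEq)]
pub struct DeathNotice {
    pub kind: u32,
    pub content: String,
    pub tags: Vec<Vec<String>>,
}

/// Hypothetical classification of why the session ended.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DeathReason {
    IdleTimeout,
    AgentExit,
    TransportError,
}

/// Builds a diagnostic event, threading it into the conversation only when
/// a thread root id is available.
pub fn build_death_notice(reason: DeathReason, thread_root_id: Option<&str>) -> DeathNotice {
    // Illustrative wording only; the real messages are not shown in the PR.
    let content = match reason {
        DeathReason::IdleTimeout => "agent session ended: idle timeout",
        DeathReason::AgentExit => "agent session ended: unexpected agent exit",
        DeathReason::TransportError => "agent session ended: transport error",
    }
    .to_string();

    let mut tags = Vec::new();
    if let Some(root) = thread_root_id {
        // NIP-10 marked e tag; the relay URL hint is left empty here.
        tags.push(vec![
            "e".to_string(),
            root.to_string(),
            String::new(),
            "root".to_string(),
        ]);
    }

    DeathNotice { kind: KIND_JOB_ERROR, content, tags }
}
```

Without thread context the notice carries no `e` tag, so it is not threaded into any conversation.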

G: LLM keepalive ticker

crates/sprout-agent/src/agent.rs adds a 30s interval ticker to the tokio::select! around the LLM complete() call. While waiting on the provider, it emits a lightweight keepalive session update that the ACP harness sees as valid JSON activity, resetting the idle clock.
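
The sketch below shows the same idea with the standard library only, so it is an analogue rather than the actual implementation: a blocking `recv_timeout` loop stands in for the `tokio::select!` over the completion future and the interval ticker. `wait_with_keepalive` and its parameters are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Waits for a completion result on `completion`, calling `emit_keepalive`
/// each time `interval` elapses without a result. Returns `None` if the
/// sending side disconnects before delivering a value.
pub fn wait_with_keepalive<T>(
    completion: &Receiver<T>,
    interval: Duration,
    mut emit_keepalive: impl FnMut(),
) -> Option<T> {
    loop {
        match completion.recv_timeout(interval) {
            // The provider answered: stop ticking and hand back the result.
            Ok(value) => return Some(value),
            // Still waiting: emit a keepalive so the harness sees activity.
            Err(RecvTimeoutError::Timeout) => emit_keepalive(),
            // The producer went away without a result.
            Err(RecvTimeoutError::Disconnected) => return None,
        }
    }
}
```

In the PR the interval is 30s, well below the 900s idle timeout, so a slow completion produces many keepalives before the idle clock could expire.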

…eath notices, keepalive

The 20s margin between the 600s max shell timeout and the 620s idle
timeout caused spurious session kills during legitimate long-running
tool calls. This changeset addresses the problem from four angles:

- Raise DEFAULT_IDLE_TIMEOUT_SECS from 620 to 900 (300s margin)
- Reset idle clock explicitly on tool_call session updates with
  observability logging
- Post a visible channel message (death notice) when a session ends
  due to idle timeout or unexpected agent exit
- Emit a keepalive session update every 30s while waiting on the LLM
  provider response, preventing idle timeout during slow completions

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
wpfleger96 requested a review from a team as a code owner June 10, 2026 00:22
…ages, keepalive arm

Address review findings from Thufir on PR #935:

- Post death notice on transport-error respawns (Io, WriteTimeout,
  Timeout, Protocol) — same user-facing silence as idle timeout
- Thread death notices into the original conversation via e-tag reply
  so users know which task died in busy channels
- Add explicit "keepalive" match arm in handle_session_update for
  documentation clarity
- Update idle-reset comment to clarify belt-and-suspenders intent

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
npub16v54tttfqacx9ycvc3k0ut0npj564ahcuajzy6qjvh57ntmsf4uq4806j2 and others added 4 commits June 10, 2026 13:38
Death notices referenced the thread root with NIP-10 marker "reply"
instead of "root", causing incorrect threading in clients. Tag-parse
errors incorrectly mapped to RelayError::AuthFailed.

- Change e-tag marker from "reply" to "root" in build_death_notice
- Add RelayError::EventBuild variant for tag-parse failures
- Add publish_death_notice() method on HarnessRelay to consolidate
  the build+publish+warn pattern used by both timeout and transport
  error paths
- Replace inline match blocks in handle_prompt_result with method call

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Lock the behavioral guarantees introduced by the timeout resilience
work: keepalive idle reset, tool_call idle reset, death-notice NIP-10
threading marker ("root" not "reply"), and the DEFAULT_IDLE_TIMEOUT_SECS
constant + idle < max_turn guard.

All tests use the existing spawn_script + sub-second idle idiom in
acp.rs, pure unit assertions in relay.rs and config.rs. No production
logic changed.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
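
The "idle < max_turn" guard mentioned in the test commit above could look roughly like this. `validate_timeouts` and its error text are hypothetical, and the PR does not state the max-turn value, so it is passed in as a parameter.

```rust
/// Rejects configurations where the idle timeout is not strictly shorter
/// than the maximum turn duration. Hypothetical helper for illustration.
pub fn validate_timeouts(idle_timeout_secs: u64, max_turn_secs: u64) -> Result<(), String> {
    if idle_timeout_secs < max_turn_secs {
        Ok(())
    } else {
        Err(format!(
            "idle timeout ({idle_timeout_secs}s) must be less than max turn duration ({max_turn_secs}s)"
        ))
    }
}
```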
The tool description said only "timeout_ms capped at 600000" without
stating the default. Models guessed low values (e.g. 30s for git push),
causing unnecessary timeouts. Now states the 120s default and recommends
300s+ for long-running commands.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Reinforces the tool description with per-parameter schema docs so
models see the default (120s) and guidance (300s+ for long ops) in
both the tool-level and parameter-level descriptions.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
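
The default and cap described in the two tool-description commits above amount to the following resolution rule. This is a sketch under assumptions: the constant and function names are hypothetical, and only the 120s default and the 600000 ms cap come from the commit messages.

```rust
/// Default shell timeout when the model omits `timeout_ms` (120s).
pub const DEFAULT_TIMEOUT_MS: u64 = 120_000;

/// Upper bound on `timeout_ms` (600000 ms, i.e. 600s).
pub const MAX_TIMEOUT_MS: u64 = 600_000;

/// Applies the default when no value is given and clamps to the cap.
pub fn resolve_timeout_ms(requested: Option<u64>) -> u64 {
    requested.unwrap_or(DEFAULT_TIMEOUT_MS).min(MAX_TIMEOUT_MS)
}
```

A model that guesses 30s for a `git push` still gets exactly 30s under this rule, which is why the description now recommends 300s+ for long-running commands.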
wpfleger96 marked this pull request as ready for review June 10, 2026 18:58
wpfleger96 merged commit 60c8c50 into main Jun 10, 2026
14 checks passed
wpfleger96 deleted the wpfleger/timeout-resilience-base branch June 10, 2026 20:54