feat(acp): agent timeout resilience — idle margin, tool-call reset, death notices, keepalive#935
Merged
Merged
Conversation
…eath notices, keepalive The 20s margin between the 600s max shell timeout and the 620s idle timeout caused spurious session kills during legitimate long-running tool calls. This changeset addresses the problem from four angles: - Raise DEFAULT_IDLE_TIMEOUT_SECS from 620 to 900 (300s margin) - Reset idle clock explicitly on tool_call session updates with observability logging - Post a visible channel message (death notice) when a session ends due to idle timeout or unexpected agent exit - Emit a keepalive session update every 30s while waiting on the LLM provider response, preventing idle timeout during slow completions Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…ages, keepalive arm Address review findings from Thufir on PR #935: - Post death notice on transport-error respawns (Io, WriteTimeout, Timeout, Protocol) — same user-facing silence as idle timeout - Thread death notices into the original conversation via e-tag reply so users know which task died in busy channels - Add explicit "keepalive" match arm in handle_session_update for documentation clarity - Update idle-reset comment to clarify belt-and-suspenders intent Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
9 tasks
Death notices referenced the thread root with NIP-10 marker "reply" instead of "root", causing incorrect threading in clients. Tag-parse errors incorrectly mapped to RelayError::AuthFailed. - Change e-tag marker from "reply" to "root" in build_death_notice - Add RelayError::EventBuild variant for tag-parse failures - Add publish_death_notice() method on HarnessRelay to consolidate the build+publish+warn pattern used by both timeout and transport error paths - Replace inline match blocks in handle_prompt_result with method call Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Lock the behavioral guarantees introduced by the timeout resilience
work: keepalive idle reset, tool_call idle reset, death-notice NIP-10
threading marker ("root" not "reply"), and the DEFAULT_IDLE_TIMEOUT_SECS
constant + idle < max_turn guard.
All tests use the existing spawn_script + sub-second idle idiom in
acp.rs, pure unit assertions in relay.rs and config.rs. No production
logic changed.
Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The tool description said only "timeout_ms capped at 600000" without stating the default. Models guessed low values (e.g. 30s for git push), causing unnecessary timeouts. Now states the 120s default and recommends 300s+ for long-running commands. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Reinforces the tool description with per-parameter schema docs so models see the default (120s) and guidance (300s+ for long ops) in both the tool-level and parameter-level descriptions. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Addresses spurious agent session kills caused by the 20s margin between the 600s max shell timeout and the 620s idle timeout. Legitimate long-running tool calls (git push with pre-push hooks, large compilation) would exhaust the margin and trigger idle timeout.
Changes
C: Raise idle timeout default (620 → 900s)
DEFAULT_IDLE_TIMEOUT_SECSincrates/sprout-acp/src/config.rsraised from 620 to 900, giving 300s of breathing room above the 600s max shell timeout.D: Tool-call-aware idle reset
handle_session_update()incrates/sprout-acp/src/acp.rsnow returns a bool indicating whether a tool call started. The idle-timeout read loop explicitly resets the idle clock and logs when a tool call begins, making the intent visible in traces.E: Death notices
crates/sprout-acp/src/relay.rsgainsbuild_death_notice()which constructs aKIND_JOB_ERROR(kind:43006) diagnostic event.handle_prompt_result()incrates/sprout-acp/src/lib.rsemits aturn_errorobserver event and stores aKIND_JOB_ERRORrelay event when a session ends due to idle timeout, unexpected agent exit, or transport error. These appear in the per-agent activity panel (profile → View activity log) — not in the channel message stream. Death notices are threaded into the original conversation via NIP-10etags when thread context is available.G: LLM keepalive ticker
crates/sprout-agent/src/agent.rsadds a 30s interval ticker to thetokio::select!around the LLMcomplete()call. While waiting on the provider, it emits a lightweightkeepalivesession update that the ACP harness sees as valid JSON activity, resetting the idle clock.