fix(desktop): reap orphaned agent processes across instances#954
Merged
Conversation
0afa85a to
bfbec21
Compare
Ctrl+C on `just dev`/`just staging` tears down the Tauri app before its in-process system sweep can finish, so agent workers spawned in their own process groups survive as orphans and accumulate across sessions. Add an EXIT trap that reaps this instance's agents via the PID-file receipts the desktop already writes (one PGID per agent under `<app-data>/agents/agent-pids/`). The app-data dir is keyed by the bundle identifier, so the cleanup is scoped exactly to the instance that ran the recipe — the main checkout never reaps a worktree's agents, or vice versa. PID files are matched instead of the `SPROUT_MANAGED_AGENT` env marker because macOS `pkill -f` matches only argv, not the environment, so an env-marker match would silently reap nothing. Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co> Signed-off-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
When a Sprout desktop process crashes or is killed (e.g. Ctrl+C on `tauri dev`), its agents survive because they run in separate process groups. Previously, only the *same* instance ID's next boot would reclaim them — which may never happen if the developer switches worktrees or to the release DMG. Add `reap_dead_instance_agents` which scans all user processes for `SPROUT_MANAGED_AGENT=*`, groups by instance ID, and for each foreign instance checks whether a Sprout desktop binary is still alive. If no desktop owns those agents, they are reaped. The desktop-alive check matches `sprout-desktop`/`Sprout` processes and confirms the instance identifier appears in their KERN_PROCARGS2 buffer (argv + environ), satisfying the constraint that orphaned agents cannot vouch for each other. Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co> Signed-off-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
Spawn a background tokio task during Tauri setup that runs both `sweep_system_agent_processes` and `reap_dead_instance_agents` every 60 seconds, skipping PIDs in the active managed_agent_processes map. This closes the gap where a `just staging` Ctrl+C leak only gets collected by the *next boot* of a same-instance-id process — which may never happen if the developer switches worktrees. The periodic sweep ensures orphans are reaped within 60s regardless of which instance detects them. Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co> Signed-off-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co> Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Extract the duplicated BSDInfo struct, proc_listallpids/proc_pidinfo extern declarations, and PROC_PIDTBSDINFO constant to module level — previously copy-pasted 4 times with two different layouts. All four call sites now share a single definition. Change sweep_system_agent_processes_with_grace to return the current orphan HashSet so the caller in lib.rs can reuse it directly instead of scanning the entire process table a second time per tick. Document the permissive Linux PPID failure mode: when /proc/<pid>/stat is unreadable, the process is treated as orphaned (safe because the two-tick grace in the periodic path mitigates transient failures). Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
e8e8ab2 to
9141695
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Orphaned
sprout-agentprocesses accumulate across dev sessions because:tauri devCtrl+C kills the Rust app before its system sweep completes, so agent workers spawned in their own process groups survive as orphansOn a dev machine this manifests as ~100+ zombie
sprout-agentprocesses after a few sessions, consuming memory and CPU.Fix
Three-layer defense, one commit per layer:
1. Justfile EXIT trap (
scripts/cleanup-instance-agents.sh)Adds an EXIT trap to
just devandjust stagingthat reaps this instance's agents via the PID-file receipts the desktop already writes (<app-data>/agents/agent-pids/<pubkey>.pid, each containing the agent's PGID). Kills by process group so the entire agent subtree is reached.Uses PID files instead of
pkill -f SPROUT_MANAGED_AGENT=...because macOSpkill -fmatches only argv, not the environment — an env-marker match silently reaps nothing.Scoping is exact: the app-data dir is keyed by bundle identifier, so a worktree's trap never touches the main checkout's agents or vice versa.
2. Dead-instance reaping on boot (
runtime.rs)Extends
sweep_system_agent_processesto detect agents belonging to instance IDs whose desktop/Tauri binary is no longer running:SPROUT_MANAGED_AGENT=*env markerssigterm_then_sigkillpathThe desktop liveness check (
desktop_is_alive_for_instance) specifically matches the Tauri/desktop binary name (sprout-desktop,Sprout) and confirms the identifier in itsKERN_PROCARGS2buffer (macOS) or/proc/<pid>/cmdline(Linux). This prevents orphaned agents from vouching for each other.Boundary-anchored matching: The identifier match uses
buffer_contains_identifier()which requires the trailing byte to be a non-identifier character (not[A-Za-z0-9._-]). This preventsxyz.block.sprout.appfrom false-matching insidexyz.block.sprout.app.dev— the exact prefix-collision scenario that motivated the fix.proc_listallpidsrobustness: The PID enumeration loops until the buffer is confirmed large enough, preventing silent under-scanning during fork storms (the exact domain this code operates in).3. Periodic 60s sweep (
lib.rs)Spawns a background
tokio::spawntask that runs the full system sweep (including dead-instance reaping) every 60 seconds.Two-tick grace: Same-instance orphans are only reaped after being seen orphaned on two consecutive sweeps (120s worst case). This prevents killing a legitimately-starting agent that spawned between the skip-list snapshot and the process scan — a race that recurs every 60s and widens during burst spawns.
Off the async pool: The blocking syscall work (
sysctl/proc_listallpids/proc_pidinfo+ the 200msthread::sleepinsigterm_then_sigkill) runs insidespawn_blockingso it never stalls a tokio worker thread.Closes the window where a Ctrl+C leak from a different instance ID only gets collected by the next boot of that same instance — which may never happen if the developer switches workflows.
Post-fix expected state
~100 processes under normal DMG usage is expected (4 agents × 25 at
DEFAULT_AGENT_PARALLELISM=24). That's a separate discussion — this PR addresses the unbounded accumulation from dead instances.