
fix(desktop): reap orphaned agent processes across instances #954

Merged
wpfleger96 merged 4 commits into main from wpfleger/fix-orphaned-agent-processes
Jun 10, 2026


@wpfleger96 wpfleger96 commented Jun 10, 2026

Collaborator

Problem

Orphaned sprout-agent processes accumulate across dev sessions because:

  1. tauri dev Ctrl+C kills the Rust app before its system sweep completes, so agent workers spawned in their own process groups survive as orphans
  2. Instance-ID scoping from #808 (fix(desktop): scope agent sweep to the owning app instance) means a restarted instance only sweeps agents matching its own identifier, so agents from a dead worktree or crashed instance are never collected
  3. No periodic sweep runs during the app's lifetime, so leaks from mid-session crashes persist until the next boot of the same instance

On a dev machine this manifests as 100 or more orphaned sprout-agent processes after a few sessions, all still running and consuming memory and CPU.

Fix

Three-layer defense, one commit per layer:

1. Justfile EXIT trap (scripts/cleanup-instance-agents.sh)

Adds an EXIT trap to just dev and just staging that reaps this instance's agents via the PID-file receipts the desktop already writes (<app-data>/agents/agent-pids/<pubkey>.pid, each containing the agent's PGID). Kills by process group so the entire agent subtree is reached.

Uses PID files instead of pkill -f SPROUT_MANAGED_AGENT=... because macOS pkill -f matches only argv, not the environment — an env-marker match silently reaps nothing.

Scoping is exact: the app-data dir is keyed by bundle identifier, so a worktree's trap never touches the main checkout's agents or vice versa.
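
The cleanup script itself is shell, but the receipt handling can be sketched in Rust for illustration. The helper names `parse_pgid` and `group_kill_target` are hypothetical, and the file format (a single integer PGID per file) is taken from the description above. The guard rejects values of 1 or below because `kill(2)` treats a target of `0` as "the caller's own process group" and `-1` as "every process the caller may signal".

```rust
/// Parse the PGID stored in a PID-file receipt.
/// Returns None unless the contents are a plain integer greater than 1,
/// so a corrupt or empty file can never widen the kill target.
fn parse_pgid(contents: &str) -> Option<i32> {
    let pgid: i32 = contents.trim().parse().ok()?;
    if pgid > 1 {
        Some(pgid)
    } else {
        None
    }
}

/// The first argument to pass to kill(2) so that the signal is delivered
/// to every member of the process group rather than to a single PID.
fn group_kill_target(pgid: i32) -> i32 {
    -pgid
}
```

For a receipt containing `4242`, the trap would signal target `-4242`, which reaches the agent and every child that stayed in its process group.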

2. Dead-instance reaping on boot (runtime.rs)

Extends sweep_system_agent_processes to detect agents belonging to instance IDs whose desktop/Tauri binary is no longer running:

  • Scans all user processes for SPROUT_MANAGED_AGENT=* env markers
  • Groups by instance ID extracted from the marker value
  • For each foreign instance: checks if a Sprout desktop binary (not any agent) with that identifier is still alive
  • If no desktop is alive → reaps all that instance's agents via the existing sigterm_then_sigkill path
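
The grouping and selection steps above can be sketched as pure functions. This is a minimal illustration, not the code in runtime.rs: it assumes the marker value is the instance ID itself, takes the process environment as already-collected strings, and stands in for the liveness check with a caller-supplied closure. All function names here are hypothetical.

```rust
use std::collections::BTreeMap;

const MARKER_PREFIX: &str = "SPROUT_MANAGED_AGENT=";

/// Extract the instance ID from one environment entry, if it is the marker.
/// Assumes the marker value is the instance ID itself (hypothetical format).
fn instance_id_from_env_entry(entry: &str) -> Option<&str> {
    let value = entry.strip_prefix(MARKER_PREFIX)?;
    if value.is_empty() {
        None
    } else {
        Some(value)
    }
}

/// Group agent PIDs by the instance ID found in their environment.
/// Processes without the marker are ignored.
fn group_agents_by_instance(procs: &[(i32, Vec<&str>)]) -> BTreeMap<String, Vec<i32>> {
    let mut groups: BTreeMap<String, Vec<i32>> = BTreeMap::new();
    for (pid, env) in procs {
        if let Some(id) = env.iter().find_map(|e| instance_id_from_env_entry(*e)) {
            groups.entry(id.to_string()).or_default().push(*pid);
        }
    }
    groups
}

/// Select the foreign instances whose agents should be reaped: not our own
/// instance, and no live desktop binary vouching for them.
fn instances_to_reap<'a>(
    groups: &'a BTreeMap<String, Vec<i32>>,
    own_instance: &str,
    desktop_alive: impl Fn(&str) -> bool,
) -> Vec<&'a str> {
    groups
        .keys()
        .map(String::as_str)
        .filter(|id| *id != own_instance && !desktop_alive(*id))
        .collect()
}
```

Keeping the liveness check behind a closure makes the selection logic testable without touching the real process table.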

The desktop liveness check (desktop_is_alive_for_instance) specifically matches the Tauri/desktop binary name (sprout-desktop, Sprout) and confirms the identifier in its KERN_PROCARGS2 buffer (macOS) or /proc/<pid>/cmdline (Linux). This prevents orphaned agents from vouching for each other.

Boundary-anchored matching: The identifier match uses buffer_contains_identifier() which requires the trailing byte to be a non-identifier character (not [A-Za-z0-9._-]). This prevents xyz.block.sprout.app from false-matching inside xyz.block.sprout.app.dev — the exact prefix-collision scenario that motivated the fix.
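
A minimal sketch of the trailing-boundary rule, written independently of the real `buffer_contains_identifier()`. The name `contains_identifier_anchored` is hypothetical, and treating end-of-buffer as a valid boundary is an assumption, since the description only specifies the trailing-byte rule.

```rust
/// Bytes that may appear inside an identifier: [A-Za-z0-9._-].
fn is_identifier_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || matches!(b, b'.' | b'_' | b'-')
}

/// True if `needle` occurs in `haystack` and the byte immediately after the
/// match is not an identifier byte. End of buffer counts as a boundary
/// (assumption). Later occurrences are still checked if an earlier one fails.
fn contains_identifier_anchored(haystack: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() || needle.len() > haystack.len() {
        return false;
    }
    haystack
        .windows(needle.len())
        .enumerate()
        .any(|(start, window)| {
            window == needle
                && haystack
                    .get(start + needle.len())
                    .map_or(true, |&next| !is_identifier_byte(next))
        })
}
```

With this rule, `xyz.block.sprout.app` matches a buffer where it is followed by a NUL or a space, but not one where it is followed by `.dev`.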

proc_listallpids robustness: The PID enumeration loops until the buffer is confirmed large enough, preventing silent under-scanning during fork storms (the exact domain this code operates in).
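
The grow-and-retry pattern can be sketched without any FFI by passing the enumeration call in as a closure. This is an illustration under assumptions: `list_all_pids` and `fill` are hypothetical names, and `fill` stands in for a wrapper around the real syscall that writes PIDs into the buffer and returns how many it wrote.

```rust
/// Enumerate PIDs with a caller-supplied `fill` function that writes into the
/// buffer and returns the number of entries written. A completely full buffer
/// cannot be distinguished from a truncated result, so in that case the
/// buffer is doubled and the call is repeated.
fn list_all_pids(
    mut fill: impl FnMut(&mut [i32]) -> usize,
    initial_capacity: usize,
) -> Vec<i32> {
    let mut buf = vec![0i32; initial_capacity.max(1)];
    loop {
        let written = fill(&mut buf);
        if written < buf.len() {
            // Spare room left over: the list is known to be complete.
            buf.truncate(written);
            return buf;
        }
        // Buffer came back full: grow and retry.
        let new_len = buf.len() * 2;
        buf.resize(new_len, 0);
    }
}
```

Requiring at least one spare slot before accepting the result is what guards against silently dropping PIDs when the process count grows between calls.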

3. Periodic 60s sweep (lib.rs)

Spawns a background tokio::spawn task that runs the full system sweep (including dead-instance reaping) every 60 seconds.

Two-tick grace: Same-instance orphans are only reaped after being seen orphaned on two consecutive sweeps (120s worst case). This prevents killing a legitimately-starting agent that spawned between the skip-list snapshot and the process scan — a race that recurs every 60s and widens during burst spawns.
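
The grace rule reduces to a set intersection between consecutive sweeps. The sketch below is an assumption about the shape of the logic, not the code in `sweep_system_agent_processes_with_grace`; the name `orphans_to_reap` is hypothetical. The caller is expected to keep the current orphan set and pass it as `previous` on the next tick.

```rust
use std::collections::HashSet;

/// Decide which PIDs to reap on this tick.
/// `previous` holds the PIDs seen orphaned on the last sweep, `current` the
/// PIDs seen orphaned now. Only PIDs present in both sets are reaped; a PID
/// seen for the first time waits one more tick.
fn orphans_to_reap(previous: &HashSet<i32>, current: &HashSet<i32>) -> Vec<i32> {
    let mut reap: Vec<i32> = current.intersection(previous).copied().collect();
    // Sort for a deterministic order; HashSet iteration order is unspecified.
    reap.sort_unstable();
    reap
}
```

An agent that was merely mid-spawn during one sweep is registered by the next one, drops out of `current`, and is never reaped.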

Off the async pool: The blocking syscall work (sysctl/proc_listallpids/proc_pidinfo + the 200ms thread::sleep in sigterm_then_sigkill) runs inside spawn_blocking so it never stalls a tokio worker thread.

The periodic sweep closes the window in which a Ctrl+C leak from a different instance ID is only collected by the next boot of that same instance, which may never happen if the developer switches workflows.

Post-fix expected state

Roughly 100 processes under normal DMG usage are expected (4 agents × 25 processes each at DEFAULT_AGENT_PARALLELISM=24). That is a separate discussion; this PR addresses only the unbounded accumulation from dead instances.

@wpfleger96 wpfleger96 requested a review from a team as a code owner June 10, 2026 19:28
@wpfleger96 wpfleger96 force-pushed the wpfleger/fix-orphaned-agent-processes branch 3 times, most recently from 0afa85a to bfbec21 Compare June 10, 2026 20:49
npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 4 commits June 10, 2026 17:30
Ctrl+C on `just dev`/`just staging` tears down the Tauri app before its
in-process system sweep can finish, so agent workers spawned in their own
process groups survive as orphans and accumulate across sessions.

Add an EXIT trap that reaps this instance's agents via the PID-file receipts
the desktop already writes (one PGID per agent under
`<app-data>/agents/agent-pids/`). The app-data dir is keyed by the bundle
identifier, so the cleanup is scoped exactly to the instance that ran the
recipe — the main checkout never reaps a worktree's agents, or vice versa.

PID files are matched instead of the `SPROUT_MANAGED_AGENT` env marker because
macOS `pkill -f` matches only argv, not the environment, so an env-marker match
would silently reap nothing.

Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
When a Sprout desktop process crashes or is killed (e.g. Ctrl+C on
`tauri dev`), its agents survive because they run in separate process
groups. Previously, only the *same* instance ID's next boot would
reclaim them — which may never happen if the developer switches
worktrees or to the release DMG.

Add `reap_dead_instance_agents` which scans all user processes for
`SPROUT_MANAGED_AGENT=*`, groups by instance ID, and for each foreign
instance checks whether a Sprout desktop binary is still alive. If no
desktop owns those agents, they are reaped.

The desktop-alive check matches `sprout-desktop`/`Sprout` processes
and confirms the instance identifier appears in their KERN_PROCARGS2
buffer (argv + environ), satisfying the constraint that orphaned agents
cannot vouch for each other.

Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
Spawn a background tokio task during Tauri setup that runs both
`sweep_system_agent_processes` and `reap_dead_instance_agents` every
60 seconds, skipping PIDs in the active managed_agent_processes map.

This closes the gap where a `just staging` Ctrl+C leak only gets
collected by the *next boot* of a same-instance-id process — which may
never happen if the developer switches worktrees. The periodic sweep
ensures orphans are reaped within 60s regardless of which instance
detects them.

Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Extract the duplicated BSDInfo struct, proc_listallpids/proc_pidinfo
extern declarations, and PROC_PIDTBSDINFO constant to module level —
previously copy-pasted 4 times with two different layouts. All four
call sites now share a single definition.

Change sweep_system_agent_processes_with_grace to return the current
orphan HashSet so the caller in lib.rs can reuse it directly instead
of scanning the entire process table a second time per tick.

Document the permissive Linux PPID failure mode: when /proc/<pid>/stat
is unreadable, the process is treated as orphaned (safe because the
two-tick grace in the periodic path mitigates transient failures).

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 force-pushed the wpfleger/fix-orphaned-agent-processes branch from e8e8ab2 to 9141695 Compare June 10, 2026 21:30
@wpfleger96 wpfleger96 merged commit 53e3f09 into main Jun 10, 2026
16 checks passed
@wpfleger96 wpfleger96 deleted the wpfleger/fix-orphaned-agent-processes branch June 10, 2026 21:38