fix(desktop): reap orphaned agent processes across instances by wpfleger96 · Pull Request #954 · block/buzz

wpfleger96 · 2026-06-10T19:28:56Z

Problem

Orphaned sprout-agent processes accumulate across dev sessions because:

tauri dev Ctrl+C kills the Rust app before its system sweep completes, so agent workers spawned in their own process groups survive as orphans
Instance-id scoping from fix(desktop): scope agent sweep to the owning app instance #808 means a restarted instance only sweeps agents matching its own identifier — agents from a dead worktree or crashed instance are never collected
No periodic sweep runs during the app's lifetime, so leaks from mid-session crashes persist until the next reboot of the same instance

On a dev machine this manifests as ~100+ zombie sprout-agent processes after a few sessions, consuming memory and CPU.

Fix

Three-layer defense, one commit per layer:

1. Justfile EXIT trap (`scripts/cleanup-instance-agents.sh`)

Adds an EXIT trap to just dev and just staging that reaps this instance's agents via the PID-file receipts the desktop already writes (<app-data>/agents/agent-pids/<pubkey>.pid, each containing the agent's PGID). Kills by process group so the entire agent subtree is reached.

Uses PID files instead of pkill -f SPROUT_MANAGED_AGENT=... because macOS pkill -f matches only argv, not the environment — an env-marker match silently reaps nothing.

Scoping is exact: the app-data dir is keyed by bundle identifier, so a worktree's trap never touches the main checkout's agents or vice versa.

2. Dead-instance reaping on boot (`runtime.rs`)

Extends sweep_system_agent_processes to detect agents belonging to instance IDs whose desktop/Tauri binary is no longer running:

Scans all user processes for SPROUT_MANAGED_AGENT=* env markers
Groups by instance ID extracted from the marker value
For each foreign instance: checks if a Sprout desktop binary (not any agent) with that identifier is still alive
If no desktop is alive → reaps all that instance's agents via the existing sigterm_then_sigkill path

The desktop liveness check (desktop_is_alive_for_instance) specifically matches the Tauri/desktop binary name (sprout-desktop, Sprout) and confirms the identifier in its KERN_PROCARGS2 buffer (macOS) or /proc/<pid>/cmdline (Linux). This prevents orphaned agents from vouching for each other.

Boundary-anchored matching: The identifier match uses buffer_contains_identifier() which requires the trailing byte to be a non-identifier character (not [A-Za-z0-9._-]). This prevents xyz.block.sprout.app from false-matching inside xyz.block.sprout.app.dev — the exact prefix-collision scenario that motivated the fix.

proc_listallpids robustness: The PID enumeration loops until the buffer is confirmed large enough, preventing silent under-scanning during fork storms (the exact domain this code operates in).

3. Periodic 60s sweep (`lib.rs`)

Spawns a background tokio::spawn task that runs the full system sweep (including dead-instance reaping) every 60 seconds.

Two-tick grace: Same-instance orphans are only reaped after being seen orphaned on two consecutive sweeps (120s worst case). This prevents killing a legitimately-starting agent that spawned between the skip-list snapshot and the process scan — a race that recurs every 60s and widens during burst spawns.

Off the async pool: The blocking syscall work (sysctl/proc_listallpids/proc_pidinfo + the 200ms thread::sleep in sigterm_then_sigkill) runs inside spawn_blocking so it never stalls a tokio worker thread.

Closes the window where a Ctrl+C leak from a different instance ID only gets collected by the next boot of that same instance — which may never happen if the developer switches workflows.

Post-fix expected state

~100 processes under normal DMG usage is expected (4 agents × 25 at DEFAULT_AGENT_PARALLELISM=24). That's a separate discussion — this PR addresses the unbounded accumulation from dead instances.

Ctrl+C on `just dev`/`just staging` tears down the Tauri app before its in-process system sweep can finish, so agent workers spawned in their own process groups survive as orphans and accumulate across sessions. Add an EXIT trap that reaps this instance's agents via the PID-file receipts the desktop already writes (one PGID per agent under `<app-data>/agents/agent-pids/`). The app-data dir is keyed by the bundle identifier, so the cleanup is scoped exactly to the instance that ran the recipe — the main checkout never reaps a worktree's agents, or vice versa. PID files are matched instead of the `SPROUT_MANAGED_AGENT` env marker because macOS `pkill -f` matches only argv, not the environment, so an env-marker match would silently reap nothing. Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co> Signed-off-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>

When a Sprout desktop process crashes or is killed (e.g. Ctrl+C on `tauri dev`), its agents survive because they run in separate process groups. Previously, only the *same* instance ID's next boot would reclaim them — which may never happen if the developer switches worktrees or to the release DMG. Add `reap_dead_instance_agents` which scans all user processes for `SPROUT_MANAGED_AGENT=*`, groups by instance ID, and for each foreign instance checks whether a Sprout desktop binary is still alive. If no desktop owns those agents, they are reaped. The desktop-alive check matches `sprout-desktop`/`Sprout` processes and confirms the instance identifier appears in their KERN_PROCARGS2 buffer (argv + environ), satisfying the constraint that orphaned agents cannot vouch for each other. Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co> Signed-off-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>

Spawn a background tokio task during Tauri setup that runs both `sweep_system_agent_processes` and `reap_dead_instance_agents` every 60 seconds, skipping PIDs in the active managed_agent_processes map. This closes the gap where a `just staging` Ctrl+C leak only gets collected by the *next boot* of a same-instance-id process — which may never happen if the developer switches worktrees. The periodic sweep ensures orphans are reaped within 60s regardless of which instance detects them. Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co> Signed-off-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co> Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

Extract the duplicated BSDInfo struct, proc_listallpids/proc_pidinfo extern declarations, and PROC_PIDTBSDINFO constant to module level — previously copy-pasted 4 times with two different layouts. All four call sites now share a single definition. Change sweep_system_agent_processes_with_grace to return the current orphan HashSet so the caller in lib.rs can reuse it directly instead of scanning the entire process table a second time per tick. Document the permissive Linux PPID failure mode: when /proc/<pid>/stat is unreadable, the process is treated as orphaned (safe because the two-tick grace in the periodic path mitigates transient failures). Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

wpfleger96 requested a review from a team as a code owner June 10, 2026 19:28

wpfleger96 force-pushed the wpfleger/fix-orphaned-agent-processes branch 3 times, most recently from 0afa85a to bfbec21 Compare June 10, 2026 20:49

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 4 commits June 10, 2026 17:30

wpfleger96 force-pushed the wpfleger/fix-orphaned-agent-processes branch from e8e8ab2 to 9141695 Compare June 10, 2026 21:30

wpfleger96 merged commit 53e3f09 into main Jun 10, 2026
16 checks passed

wpfleger96 deleted the wpfleger/fix-orphaned-agent-processes branch June 10, 2026 21:38

wesbillman mentioned this pull request Jun 11, 2026

chore(release): release version 0.3.17 #976

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(desktop): reap orphaned agent processes across instances#954

fix(desktop): reap orphaned agent processes across instances#954
wpfleger96 merged 4 commits into
mainfrom
wpfleger/fix-orphaned-agent-processes

wpfleger96 commented Jun 10, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wpfleger96 commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

1. Justfile EXIT trap (scripts/cleanup-instance-agents.sh)

2. Dead-instance reaping on boot (runtime.rs)

3. Periodic 60s sweep (lib.rs)

Post-fix expected state

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wpfleger96 commented Jun 10, 2026 •

edited

Loading

1. Justfile EXIT trap (`scripts/cleanup-instance-agents.sh`)

2. Dead-instance reaping on boot (`runtime.rs`)

3. Periodic 60s sweep (`lib.rs`)