fix(desktop): keep active-turn badges through transient relay drops#1120

Merged
wpfleger96 merged 3 commits into main from duncan/active-turn-badge-resilience
Jun 18, 2026

Conversation

@wpfleger96

Collaborator

Problem

Active-turn timer badges on the Agents menu all vanish at once on a transient relay drop (flaky VPN). When the relay drops, liveness frames stop arriving for every agent simultaneously, and the 5s pruneExpired tick deletes every turn whose lastActivityAt is older than REMOVE_AFTER_MS (25s) in one sweep — wiping all badges together. A prior fix addressed frame recovery, not this prune layer.

Fix

All changes are in activeAgentTurnsStore.ts.

  • Pause the prune when every tracked turn is simultaneously stale — the "all at once" local signature of a relay drop. pruneExpired early-returns when the max lastActivityAt across all agents' turns is older than FRAME_GAP_PAUSE_MS (20s, below the 25s prune bound). Gating on the max per-turn lastActivityAt rather than a global frame clock is what prevents over-pausing: a single live sibling turn keeps the max fresh, so a genuine multi-agent crash still prunes the dead turn at 25s — no regression. No relay-connection coupling; wake-on-resume is automatic once frames flow again.
  • Resurrect a pruned badge on a recovered liveness/acp frame. A turn_liveness/acp_* frame for a turn no longer in the live map recreates it. This runs only for frames that pass the existing per-agent watermark, so replayed stale frames cannot revive a turn.
  • Gate resurrection on a per-turn terminal tombstone. terminalAtByAgent records when a turn terminally ended (turn_completed/turn_error/agent_panic), and resurrection revives a turn only when the recovered frame is strictly newer than that terminal, so a completed turn is never revived. The map is bounded per agent and cleared in resetActiveAgentTurnsStore.
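
The resurrection rule in the last two bullets can be sketched as follows. This is a minimal illustration, not the code in activeAgentTurnsStore.ts: only terminalAtByAgent, recordTerminal, and the frame and event names come from this PR. The map shapes, the `inner` helper, `onLivenessFrame`, and the cap value `MAX_TOMBSTONES_PER_AGENT = 16` are assumptions made for the example, and the per-agent watermark check is assumed to have already passed.

```typescript
// Hypothetical shapes: only terminalAtByAgent, recordTerminal, and the frame
// and event names come from the PR text. Everything else is illustrative.
const MAX_TOMBSTONES_PER_AGENT = 16; // assumed cap; the PR only says "bounded per agent"

// agentId -> (turnId -> lastActivityAt in ms)
const liveTurnsByAgent = new Map<string, Map<string, number>>();
// agentId -> (turnId -> time the turn terminally ended, in ms)
const terminalAtByAgent = new Map<string, Map<string, number>>();

// Returns the per-agent inner map, creating it on first use.
function inner(
  outer: Map<string, Map<string, number>>,
  agentId: string,
): Map<string, number> {
  let m = outer.get(agentId);
  if (!m) {
    m = new Map<string, number>();
    outer.set(agentId, m);
  }
  return m;
}

// Called for turn_completed / turn_error / agent_panic.
function recordTerminal(agentId: string, turnId: string, at: number): void {
  const live = liveTurnsByAgent.get(agentId);
  if (live) live.delete(turnId);

  const tombstones = inner(terminalAtByAgent, agentId);
  tombstones.delete(turnId); // re-insert so the entry counts as newest
  tombstones.set(turnId, at);
  if (tombstones.size > MAX_TOMBSTONES_PER_AGENT) {
    // A Map iterates in insertion order, so the first key is the oldest.
    const oldest = tombstones.keys().next().value;
    if (oldest !== undefined) tombstones.delete(oldest);
  }
}

// Called for a turn_liveness / acp_* frame that already passed the watermark.
// Returns true when the turn is live (created or refreshed) afterwards.
function onLivenessFrame(agentId: string, turnId: string, frameAt: number): boolean {
  const tombstones = terminalAtByAgent.get(agentId);
  const terminalAt = tombstones ? tombstones.get(turnId) : undefined;
  // Revive only when the frame is strictly newer than the terminal.
  if (terminalAt !== undefined && frameAt <= terminalAt) return false;
  inner(liveTurnsByAgent, agentId).set(turnId, frameAt);
  return true;
}
```

In this sketch a frame with the same timestamp as the terminal is rejected, which is the "strictly newer" comparison the bullet describes. Once a tombstone is evicted by the cap, the same frame is accepted.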

Accepted residual

A lone turn kill -9'd under a healthy relay (it was the only active turn) keeps its badge until its next frame instead of clearing at 25s. This is intrinsic to local-only sensing — that case is locally indistinguishable from a drop — and the badge self-heals the instant any frame arrives. Signed off as the chosen tradeoff to keep badges visible through transient drops.
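
The tradeoff follows from how the gate is computed. Below is a minimal sketch of the pause gate, assuming the store can be reduced to a flat map from turn key to lastActivityAt (a simplification for the example, not the real store shape). The constants and the names pruneExpired and shouldPausePrune come from this PR. The function signatures and the returned list of removed keys are assumptions.

```typescript
const REMOVE_AFTER_MS = 25_000; // prune bound from the PR description
const FRAME_GAP_PAUSE_MS = 20_000; // pause threshold from the PR description

// Hypothetical flattened store: turn key -> lastActivityAt in ms.
type Turns = Map<string, number>;

// True when even the freshest turn is stale, the local signature of a drop.
function shouldPausePrune(turns: Turns, now: number): boolean {
  if (turns.size === 0) return false;
  let newest = -Infinity;
  turns.forEach((at) => {
    newest = Math.max(newest, at);
  });
  return now - newest > FRAME_GAP_PAUSE_MS;
}

// Removes turns older than REMOVE_AFTER_MS unless the gate is closed.
// Returns the removed keys so callers (and tests) can inspect the sweep.
function pruneExpired(turns: Turns, now: number): string[] {
  if (shouldPausePrune(turns, now)) return [];
  const removed: string[] = [];
  turns.forEach((at, key) => {
    if (now - at > REMOVE_AFTER_MS) {
      turns.delete(key);
      removed.push(key);
    }
  });
  return removed;
}

// Relay drop: every turn went silent at t=0, so the sweep is paused.
const dropped: Turns = new Map([["duncan", 0], ["paul", 0]]);
pruneExpired(dropped, 30_000); // removes nothing

// Live sibling: "paul" is fresh, so the dead "duncan" turn is still pruned.
const sibling: Turns = new Map([["duncan", 0], ["paul", 28_000]]);
pruneExpired(sibling, 30_000); // removes "duncan"

// Accepted residual: a lone dead turn looks identical to a drop and lingers.
const lone: Turns = new Map([["duncan", 0]]);
pruneExpired(lone, 30_000); // removes nothing
```

The first and third cases are indistinguishable to the gate because both have a stale maximum, which is why the lone-turn case cannot be fixed without a signal from outside the store.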

Notes for review

  • The badge render is exercised E2E-only; this PR covers the store logic with the existing node:test unit suite (mock.timers).
  • Two prior backstop tests asserted the old contract that a lone silent turn prunes at 25s; they now assert the new contract (dead turn prunes when a live sibling keeps the max fresh; lone silent turn lingers as the documented residual).

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 3 commits June 18, 2026 16:05
A flaky relay (VPN drop) stops liveness frames for every agent at once,
and the 5s pruneExpired tick then wiped all badges together — the
"all at once" disappearance Will reported. pruneExpired now pauses when
EVERY tracked turn is simultaneously stale (max lastActivityAt older than
FRAME_GAP_PAUSE_MS), the drop's local signature. A live sibling turn keeps
the max fresh, so a genuine multi-agent crash still prunes the dead turn
at 25s — no regression. The residual: a lone kill -9'd turn under a
healthy relay lingers until its next frame, intrinsic to local-only
sensing. A recovered liveness frame for an already-pruned turn resurrects
its badge, gated by a per-turn terminal tombstone so a completed turn is
never revived.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The two tests labeled bound-proving asserted the strict-newer terminal comparison, not the map-size cap in recordTerminal, leaving the eviction line with zero coverage. Drive 18 completions sharing one timestamp with rising seq so an equal-timestamp probe clears the per-agent watermark on the seq tiebreak yet reaches the tombstone check; the evicted oldest entry resurrects while a survivor still blocks, proving the cap fires and evicts oldest-by-insertion.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…ss gap

The badge render is E2E-only, so the unit suite could not prove that
shouldPausePrune keeps per-channel Working timers visible through a
flaky-VPN relay drop. This spec installs page.clock before navigation,
seeds two agents working across channels, then fastForwards 30s with no
further frames — firing several real 5s prune ticks past both the 20s
pause and 25s remove thresholds — and asserts the badges persist.
fastForward (not setFixedTime) is required so the prune interval
genuinely runs; setFixedTime would leave badges present vacuously.
Registers the new spec in the smoke project testMatch allowlist.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96

Collaborator Author

Active-turn badge resilience — E2E proof

The badge render is E2E-only, so these screenshots prove the fix on the actual rendered timer badge — not just the unit suite. The spec installs page.clock before navigation, seeds two agents working across channels, then fastForwards 30s with no further frames (firing several real 5s prune ticks past both the 20s pause and 25s remove thresholds) and asserts the badges persist. fastForward (not setFixedTime) is load-bearing: it genuinely fires the prune interval, so shouldPausePrune is actually exercised rather than vacuously skipped.
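
To illustrate why a clock that only changes the reported time would pass vacuously, here is a toy model. It is not Playwright's clock and not the spec in this PR: `ToyClock`, `advanceAndFire`, `setNowOnly`, and `runScenario` are hypothetical names, and the prune logic is a simplified stand-in using the 5s, 20s, and 25s values quoted above.

```typescript
const PRUNE_INTERVAL_MS = 5_000;
const FRAME_GAP_PAUSE_MS = 20_000;
const REMOVE_AFTER_MS = 25_000;

interface Timer {
  every: number;
  next: number;
  fn: () => void;
}

// Toy fake clock with two ways of moving time.
class ToyClock {
  now = 0;
  private timers: Timer[] = [];

  setInterval(fn: () => void, every: number): void {
    this.timers.push({ every, next: this.now + every, fn });
  }

  // Moves time forward and runs every interval callback that comes due.
  advanceAndFire(ms: number): void {
    const end = this.now + ms;
    while (true) {
      const due = this.timers
        .filter((t) => t.next <= end)
        .sort((a, b) => a.next - b.next)[0];
      if (!due) break;
      this.now = due.next;
      due.next += due.every;
      due.fn();
    }
    this.now = end;
  }

  // Changes the reported time only; no callback runs.
  setNowOnly(ms: number): void {
    this.now = ms;
  }
}

// Seeds two silent turns at t=0, moves the clock, returns surviving badges.
function runScenario(gated: boolean, move: (clock: ToyClock) => void): number {
  const clock = new ToyClock();
  const turns = new Map<string, number>([["duncan", 0], ["paul", 0]]);
  clock.setInterval(() => {
    let newest = -Infinity;
    turns.forEach((at) => {
      newest = Math.max(newest, at);
    });
    if (gated && clock.now - newest > FRAME_GAP_PAUSE_MS) return; // paused
    turns.forEach((at, key) => {
      if (clock.now - at > REMOVE_AFTER_MS) turns.delete(key);
    });
  }, PRUNE_INTERVAL_MS);
  move(clock);
  return turns.size;
}

// Ticks fire, no gate: both badges are wiped (pre-fix behaviour).
const preFix = runScenario(false, (c) => c.advanceAndFire(30_000)); // 0
// Ticks fire, gate in place: both badges survive.
const postFix = runScenario(true, (c) => c.advanceAndFire(30_000)); // 2
// No ticks fire: badges survive even without the gate, proving nothing.
const vacuous = runScenario(false, (c) => c.setNowOnly(30_000)); // 2
```

Only the first two scenarios distinguish pre-fix from post-fix code. The third returns the same count with or without the gate, which is the vacuous pass described above.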

Healthy multi-agent state (before the drop)

[Screenshot: 01-badges-before-gap]

Two agents working across channels, all per-channel Working in #… · <elapsed> timer badges present — Duncan in #general, Paul in #engineering and #general. This is the state Will sees with correct timers before a flaky-VPN relay drop.

Badges survive the all-at-once liveness gap (the resilience proof)

[Screenshot: 02-badges-survive-gap]

Same view after advancing the clock 30s with no frames. All three badges are still present and their elapsed counters advanced to · 30s. Under the pre-fix code every badge would have vanished at the first prune tick past 25s — this is the exact regression #1120 fixes. The · 30s advance independently confirms the prune ticks fired and the render updated, so shouldPausePrune is what kept the badges alive, not the absence of a tick.

wpfleger96 pushed a commit that referenced this pull request Jun 18, 2026
@wpfleger96 wpfleger96 merged commit cf122fc into main Jun 18, 2026
21 of 24 checks passed
@wpfleger96 wpfleger96 deleted the duncan/active-turn-badge-resilience branch June 18, 2026 21:17