fix(desktop): keep active-turn badges through transient relay drops #1120
A flaky relay (VPN drop) stops liveness frames for every agent at once, and the 5s pruneExpired tick then wiped all badges together — the "all at once" disappearance Will reported. pruneExpired now pauses when EVERY tracked turn is simultaneously stale (max lastActivityAt older than FRAME_GAP_PAUSE_MS), the drop's local signature. A live sibling turn keeps the max fresh, so a genuine multi-agent crash still prunes the dead turn at 25s — no regression. The residual: a lone kill -9'd turn under a healthy relay lingers until its next frame, intrinsic to local-only sensing. A recovered liveness frame for an already-pruned turn resurrects its badge, gated by a per-turn terminal tombstone so a completed turn is never revived. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The two tests labeled bound-proving asserted the strict-newer terminal comparison, not the map-size cap in recordTerminal, leaving the eviction line with zero coverage. Drive 18 completions sharing one timestamp with rising seq so an equal-timestamp probe clears the per-agent watermark on the seq tiebreak yet reaches the tombstone check; the evicted oldest entry resurrects while a survivor still blocks, proving the cap fires and evicts oldest-by-insertion. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…ss gap

The badge render is E2E-only, so the unit suite could not prove that shouldPausePrune keeps per-channel Working timers visible through a flaky-VPN relay drop. This spec installs page.clock before navigation, seeds two agents working across channels, then fastForwards 30s with no further frames (firing several real 5s prune ticks past both the 20s pause and 25s remove thresholds) and asserts the badges persist. fastForward (not setFixedTime) is required so the prune interval genuinely runs; setFixedTime would leave badges present vacuously. Registers the new spec in the smoke project testMatch allowlist. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Active-turn badge resilience: E2E proof

The badge render is E2E-only, so these screenshots prove the fix on the actual rendered timer badge, not just the unit suite. The spec installs …

Healthy multi-agent state (before the drop)
Two agents working across channels, all per-channel …
[screenshot]

Badges survive the all-at-once liveness gap (the resilience proof)
Same view after advancing the clock 30s with no frames. All three badges are still present and their elapsed counters advanced to …
[screenshot]


Problem
Active-turn timer badges on the Agents menu all vanish at once on a transient relay drop (flaky VPN). When the relay drops, liveness frames stop arriving for every agent simultaneously, and the 5s pruneExpired tick deletes every turn whose lastActivityAt is older than REMOVE_AFTER_MS (25s) in one sweep, wiping all badges together. A prior fix addressed frame recovery, not this prune layer.

Fix
All changes are in activeAgentTurnsStore.ts.

- pruneExpired early-returns when the max lastActivityAt across all agents' turns is older than FRAME_GAP_PAUSE_MS (20s, below the 25s prune bound). Gating on the max per-turn lastActivityAt rather than a global frame clock is what prevents over-pausing: a single live sibling turn keeps the max fresh, so a genuine multi-agent crash still prunes the dead turn at 25s, with no regression. No relay-connection coupling; wake-on-resume is automatic once frames flow again.
- Resurrection on an acp frame. A turn_liveness/acp_* frame for a turn no longer in the live map recreates it. This runs only for frames that pass the existing per-agent watermark, so replayed stale frames cannot revive a turn.
- A terminal tombstone (terminalAtByAgent) records when a turn terminally ended (turn_completed/turn_error/agent_panic), and resurrection revives a turn only when the recovered frame is strictly newer than that terminal, so a completed turn is never revived. The map is bounded per agent and cleared in resetActiveAgentTurnsStore.

Accepted residual
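To make this tradeoff concrete, here is a minimal TypeScript sketch of the pause gate described under Fix. It is not the code from activeAgentTurnsStore.ts: the Turn shape, the function signatures, and the returned list of removed ids are assumptions for illustration. Only the two constant values and the names pruneExpired and shouldPausePrune come from this PR.

```typescript
// Values taken from the PR text; everything else here is a hypothetical shape.
const FRAME_GAP_PAUSE_MS = 20000;
const REMOVE_AFTER_MS = 25000;

interface Turn {
  lastActivityAt: number; // ms timestamp of the most recent frame for this turn
}

// True when even the freshest tracked turn is stale, which is the local
// signature of a relay drop (frames stopped for every agent at once).
function shouldPausePrune(turns: Map<string, Turn>, now: number): boolean {
  if (turns.size === 0) return false;
  let maxLastActivityAt = -Infinity;
  turns.forEach((turn) => {
    maxLastActivityAt = Math.max(maxLastActivityAt, turn.lastActivityAt);
  });
  return now - maxLastActivityAt > FRAME_GAP_PAUSE_MS;
}

// One prune tick. Returns the ids removed on this tick (assumed return type).
function pruneExpired(turns: Map<string, Turn>, now: number): string[] {
  if (shouldPausePrune(turns, now)) return []; // pause: keep every badge
  const removed: string[] = [];
  turns.forEach((turn, id) => {
    if (now - turn.lastActivityAt > REMOVE_AFTER_MS) {
      turns.delete(id);
      removed.push(id);
    }
  });
  return removed;
}

// Relay drop: both turns are stale, so the tick pauses and removes nothing.
const dropCase = new Map<string, Turn>([
  ["a", { lastActivityAt: 0 }],
  ["b", { lastActivityAt: 1000 }],
]);
pruneExpired(dropCase, 30000); // returns []

// Crash with a live sibling: "b" keeps the max fresh, so dead "a" is pruned.
const crashCase = new Map<string, Turn>([
  ["a", { lastActivityAt: 0 }],
  ["b", { lastActivityAt: 28000 }],
]);
pruneExpired(crashCase, 30000); // returns ["a"]

// The residual: a lone dead turn looks exactly like a drop, so it lingers.
const loneCase = new Map<string, Turn>([["a", { lastActivityAt: 0 }]]);
pruneExpired(loneCase, 30000); // returns []
```

The first and third cases are indistinguishable to the gate, which is the residual this section accepts.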
A lone turn kill -9'd under a healthy relay (it was the only active turn) keeps its badge until its next frame instead of clearing at 25s. This is intrinsic to local-only sensing: that case is locally indistinguishable from a drop, and the badge self-heals the instant any frame arrives. Signed off as the chosen tradeoff to keep badges visible through transient drops.

Notes for review
node:test unit suite (mock.timers).
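
For reviewers, the resurrection rule and the bounded tombstone map described under Fix can be sketched roughly as follows. This is an illustrative sketch, not the store's code: the cap value MAX_TOMBSTONES_PER_AGENT, the nested-map shape of terminalAtByAgent, the liveTurns map, and the onLivenessFrame name are all assumptions. The per-agent watermark check that runs before this step is omitted.

```typescript
// Assumed cap for illustration; the PR text does not state the real value.
const MAX_TOMBSTONES_PER_AGENT = 16;

// agentId -> (turnId -> terminal timestamp in ms).
// The name is from the PR; the nested-map shape is an assumption.
const terminalAtByAgent = new Map<string, Map<string, number>>();
// Hypothetical live map keyed by "agentId:turnId".
const liveTurns = new Map<string, { lastActivityAt: number }>();

// Called on turn_completed / turn_error / agent_panic.
function recordTerminal(agentId: string, turnId: string, at: number): void {
  liveTurns.delete(`${agentId}:${turnId}`);
  let tombstones = terminalAtByAgent.get(agentId);
  if (tombstones === undefined) {
    tombstones = new Map<string, number>();
    terminalAtByAgent.set(agentId, tombstones);
  }
  tombstones.set(turnId, at);
  // A Map iterates in insertion order, so its first key is the oldest entry.
  if (tombstones.size > MAX_TOMBSTONES_PER_AGENT) {
    const oldest = tombstones.keys().next().value;
    if (oldest !== undefined) tombstones.delete(oldest);
  }
}

// Called for a turn_liveness / acp_* frame that already passed the watermark.
// Returns true when the turn is live afterwards (refreshed or resurrected).
function onLivenessFrame(agentId: string, turnId: string, at: number): boolean {
  const terminalAt = terminalAtByAgent.get(agentId)?.get(turnId);
  // Revive only when the frame is strictly newer than the recorded terminal.
  if (terminalAt !== undefined && at <= terminalAt) return false;
  liveTurns.set(`${agentId}:${turnId}`, { lastActivityAt: at });
  return true;
}
```

With the assumed cap of 16, recording 18 terminals at one timestamp evicts the two oldest tombstones, so an equal-timestamp frame revives the first turn while a surviving tombstone still blocks the last one.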