fix(presence): clear on disconnect, fix heartbeat/TTL, drop broken REST path #877

Merged: wpfleger96 merged 3 commits into main from wpfleger/presence-reliability on Jun 5, 2026

wpfleger96 commented Jun 5, 2026

Three bugs caused users to appear offline or away while they were clearly active:

Bug 1 — No presence clear on disconnect. When a WebSocket connection drops (crash, network loss, app quit without sending "offline"), the relay never removed the user's Redis presence key. The user stayed "online" until the 90-second TTL expired, then jumped straight to "offline" with no "away" transition. Fix: handle_connection() cleanup now calls clear_presence(), but only when the connection was authenticated and no other connections remain for that pubkey. This guard keeps CLI one-shot publishes and multi-device sessions from clearing presence that another connection still holds.
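
The disconnect guard can be sketched as follows. This is a simplified TypeScript model, not the relay's Rust code: `ConnectionRegistry`, `onDisconnect`, and the in-memory `presence` map are hypothetical stand-ins for `connection_ids_for_pubkey`, the `handle_connection()` cleanup, and the Redis presence key.

```typescript
type ConnId = string;
type Pubkey = string;

// Hypothetical in-memory model of the relay's connection tracking.
class ConnectionRegistry {
  private byPubkey = new Map<Pubkey, Set<ConnId>>();

  add(pubkey: Pubkey, conn: ConnId): void {
    const conns = this.byPubkey.get(pubkey) ?? new Set<ConnId>();
    conns.add(conn);
    this.byPubkey.set(pubkey, conns);
  }

  remove(pubkey: Pubkey, conn: ConnId): void {
    const conns = this.byPubkey.get(pubkey);
    if (!conns) return;
    conns.delete(conn);
    if (conns.size === 0) this.byPubkey.delete(pubkey);
  }

  // Stand-in for connection_ids_for_pubkey.
  connectionIdsFor(pubkey: Pubkey): ConnId[] {
    const conns = this.byPubkey.get(pubkey);
    return conns ? Array.from(conns) : [];
  }
}

// Stand-in for the Redis presence keys.
const presence = new Map<Pubkey, string>();

// Cleanup for a closing connection. `pubkey` is undefined when the
// connection never authenticated. Returns true if presence was cleared.
function onDisconnect(
  registry: ConnectionRegistry,
  conn: ConnId,
  pubkey: Pubkey | undefined,
): boolean {
  if (pubkey === undefined) return false; // unauthenticated: nothing to clear
  registry.remove(pubkey, conn);
  if (registry.connectionIdsFor(pubkey).length > 0) return false; // another connection remains
  presence.delete(pubkey);
  return true;
}
```

In this model, closing one of two connections for the same pubkey leaves the presence entry in place; only the last disconnect removes it.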

Bug 2 — Heartbeat/TTL mismatch. The Redis TTL comment says "3x the 30s heartbeat interval", but the desktop heartbeat was 60 seconds — leaving only 30 seconds of slack. One missed heartbeat (brief WS reconnect, backgrounded tab, network hiccup) and the key expires before the next beat fires. Fix: halve PRESENCE_HEARTBEAT_INTERVAL_MS from 60_000 to 30_000 to match the TTL's design assumption.
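
The arithmetic behind this mismatch can be checked with a short sketch. The 90-second TTL and the two heartbeat intervals come from the description above; the helper names `slackMs` and `missedBeatsTolerated` are illustrative only, and the model assumes a beat landing exactly at the TTL arrives too late.

```typescript
const PRESENCE_TTL_MS = 90_000; // Redis TTL quoted in the description

// Time left on the key at the moment the next heartbeat is due.
function slackMs(heartbeatMs: number, ttlMs: number = PRESENCE_TTL_MS): number {
  return ttlMs - heartbeatMs;
}

// Consecutive heartbeats that can be missed while a later beat still lands
// strictly before the key expires (assumption: a beat at exactly the TTL
// is treated as too late).
function missedBeatsTolerated(heartbeatMs: number, ttlMs: number = PRESENCE_TTL_MS): number {
  const beatsBeforeExpiry = Math.ceil(ttlMs / heartbeatMs) - 1;
  return Math.max(0, beatsBeforeExpiry - 1);
}
```

Under these assumptions, `missedBeatsTolerated(60_000)` is 0 with 30 seconds of slack, while `missedBeatsTolerated(30_000)` is 1 with 60 seconds of slack.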

Bug 3 — Silent broken REST fallback. useSetPresenceMutation caught WS failures and fell back to the Tauri set_presence command, which submitted kind:20001 over POST /events. The relay rejects that. The error was swallowed by .catch(() => {}), so WS failures silently lost presence updates. Fix: remove the fallback entirely — the 30s heartbeat loop provides natural retry.
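
A minimal sketch of the post-fix behavior, assuming a single WS send with no fallback: a failed send is reported rather than swallowed, and the next heartbeat simply tries again. `SendPresence`, `TickResult`, and `heartbeatTick` are hypothetical names, and the sender is modeled as a synchronous function to keep the example short.

```typescript
type PresenceStatus = "online" | "away" | "offline";

// Hypothetical WS sender: throws when the socket is unavailable.
type SendPresence = (status: PresenceStatus) => void;

interface TickResult {
  sent: boolean;
  error?: string;
}

// One heartbeat tick. There is no REST fallback: a failure is surfaced to
// the caller, and the next scheduled tick acts as the retry.
function heartbeatTick(send: SendPresence, status: PresenceStatus): TickResult {
  try {
    send(status);
    return { sent: true };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { sent: false, error: message };
  }
}
```

Calling `heartbeatTick` once per heartbeat interval means a transient WS failure costs at most one interval before presence is refreshed.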

Also removes the reconcile_mcp_commands_in_file startup migration that cleared stale "sprout-mcp-server" values from managed-agents.json. It has been running on every launch since before #850; all active users are clean and the binary no longer ships.

Bonus: handlePresenceEvent in the WS subscription now extracts the real user pubkey from the p tag on relay-synthesized presence events (previously read event.pubkey, which is the relay's key for synthesized events).
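
The p-tag lookup might look roughly like this. `NostrEventLike` and `presenceSubject` are hypothetical names, and falling back to `event.pubkey` when no `p` tag is present is an assumption about how client-sent presence events are handled, not something the PR states.

```typescript
// Minimal event shape needed for the lookup (assumed, not the app's type).
interface NostrEventLike {
  pubkey: string;
  tags: string[][];
}

// Returns the pubkey the presence event is about: the first non-empty `p`
// tag value if one exists, otherwise the event's own pubkey (assumed fallback).
function presenceSubject(event: NostrEventLike): string {
  for (const tag of event.tags) {
    const [name, value] = tag;
    if (name === "p" && value) return value;
  }
  return event.pubkey;
}
```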

  • crates/sprout-relay/src/connection.rs — clear presence on disconnect, guarded by remaining-connections check via connection_ids_for_pubkey
  • desktop/src/features/presence/hooks.ts — halve heartbeat, remove REST fallback, fix p-tag handling
  • desktop/src-tauri/src/commands/profile.rs + events.rs + models.rs — remove dead set_presence command + build_presence builder + SetPresenceResponse type
  • desktop/src-tauri/src/migration.rs + lib.rs — remove reconcile_mcp_commands_in_file + startup call + 9 tests
  • desktop/src/shared/api/tauri.ts + types.ts — remove dead setPresence() function + SetPresenceResult type
  • desktop/src/testing/e2eBridge.ts — add kind:20001 mock WS handler, remove dead set_presence Tauri mock

… path

Three distinct bugs caused users to appear offline while still active:

1. No `clear_presence` on WS disconnect — crashed clients stayed "online"
   in Redis until the 90s TTL expired, then jumped straight to offline.
   Fix: clear presence in `handle_connection` cleanup when the connection
   was authenticated.

2. Heartbeat/TTL mismatch — the Redis TTL comment says "3x the 30s
   heartbeat" but the desktop heartbeat was 60s, leaving only 30s of
   slack. One missed heartbeat = user goes offline. Fix: halve the
   heartbeat from 60s to 30s to match the TTL's design assumption.

3. Silent broken REST fallback — `useSetPresenceMutation` caught WS
   failures and fell back to the Tauri `set_presence` command, which
   used `POST /events`. The relay rejects kind:20001 over HTTP, so
   the fallback always silently failed. Fix: remove the fallback; the
   30s heartbeat loop provides natural retry behavior.

Also removes `reconcile_mcp_commands_in_file` — the migration that
cleared stale `"sprout-mcp-server"` values from managed-agents.json.
It has been running on every launch since before #850 tightened it;
all active users are already clean and the binary no longer ships.

Bonus: fix WS subscription handler to extract the real user pubkey
from the `p` tag on relay-synthesized presence events (previously
used `event.pubkey` which is the relay's key for synthesized events).

wpfleger96 requested a review from a team as a code owner June 5, 2026 17:25

Disconnect cleanup now checks whether the user has other active
connections before clearing Redis presence. Without this guard, CLI
`set-presence online` (short-lived WS) would immediately delete its own
presence on disconnect, and multi-device users would lose presence when
closing one device.

Also fixes the E2E mock bridge: kind:20001 events from
`relayClient.sendPresence()` now get a dedicated handler that updates
the mock presence map and fans out to global subscribers, matching real
relay behavior. Removes the dead `set_presence` Tauri command mock
(`handleSetPresence`, `RawSetPresenceResponse`, dispatcher case).
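
The mock handler described above could be sketched like this. `MockPresenceBridge`, `MockEvent`, and `handleEvent` are hypothetical names rather than the contents of `e2eBridge.ts`, and carrying the status in `content` is an assumption made only for illustration.

```typescript
const KIND_PRESENCE = 20001;

// Assumed event shape; the status living in `content` is a guess.
interface MockEvent {
  kind: number;
  pubkey: string;
  content: string;
}

type Subscriber = (event: MockEvent) => void;

class MockPresenceBridge {
  readonly presence = new Map<string, string>();
  private subscribers: Subscriber[] = [];

  subscribe(fn: Subscriber): void {
    this.subscribers.push(fn);
  }

  // Handles kind:20001 only: updates the mock presence map, then fans the
  // event out to every global subscriber. Returns false for other kinds.
  handleEvent(event: MockEvent): boolean {
    if (event.kind !== KIND_PRESENCE) return false;
    this.presence.set(event.pubkey, event.content);
    for (const fn of this.subscribers) fn(event);
    return true;
  }
}
```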

wpfleger96 enabled auto-merge (squash) June 5, 2026 20:27
wpfleger96 merged commit 5d7c748 into main Jun 5, 2026
16 checks passed
wpfleger96 deleted the wpfleger/presence-reliability branch June 5, 2026 20:28
michaelneale added a commit that referenced this pull request Jun 6, 2026
* origin/main:
  chore(release): release version 0.3.12 (#886)
  Show hover cards for inline message emoji (#885)
  Fix monotonic read-state merges (#884)
  Refine sidebar behavior and borders (#869)
  fix(presence): clear on disconnect, fix heartbeat/TTL, drop broken REST path (#877)
  fix(cli): publish ephemeral events over WebSocket via sprout-ws-client (#876)
  docs(sprout-acp): add communication discipline rules to base prompt + deprecate --mention flag (#883)
  Polish thread summaries and reactions (#881)
  feat(cli): add emoji export and import subcommands (#882)
  Polish message row hover states (#880)
  Improve emoji naming and custom emoji UX (#878)
  docs: add ecosystem section to CONTRIBUTING.md, fix stale release info (#873)
  fix(relay): wire custom filter fields through HTTP bridge (#864)
  chore: deprecate sprout-mcp — fill CLI gaps, remove crate and all references (#850)
  Fix custom emoji status in profile popover (#874)
  fix(agent): gate handoff on provider token usage, not byte estimate (#821)
  docs: add VISION_MESH.md — the compute-commons vision (#867)
  fix(desktop): simplify profile popover header (#853)
  fix(desktop): remove thread comment hover outline (#861)
  feat(desktop): always show channel section search/add buttons (#856)

# Conflicts:
#	crates/sprout-cli/src/client.rs
#	desktop/src/app/AppShell.tsx
#	justfile
tellaho pushed a commit that referenced this pull request Jun 8, 2026
…ST path (#877)

Signed-off-by: Taylor Ho <taylorkmho@gmail.com>
wpfleger96 pushed a commit that referenced this pull request Jun 8, 2026
Re-enable the generic `reconcile_provider_mcp_commands` migration that
was accidentally removed from the startup sequence in PR #877. This
reconciles `mcp_command` values in managed-agents.json against the
discovery table on every launch — fixing stale "sprout-mcp-server"
references and any future drift without a dedicated one-off function.

Extended to cover both `app_data_dir()` and `canonical_dev_data_dir`
so worktree instances are also healed.

Additionally, harden the spawn site in runtime.rs: if `mcp_command`
references a binary that cannot be resolved, log a warning and continue
spawning without MCP rather than hard-failing. This prevents this entire
class of breakage permanently, regardless of whether reconciliation ran.

Fixes the user-reported issue where agents created before v0.3.12 fail
to spawn because "sprout-mcp-server" no longer exists.
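
The spawn-site hardening can be illustrated with a small sketch. It is written in TypeScript for consistency with the other examples, whereas the real change lives in runtime.rs; `AgentRecord`, `SpawnPlan`, `Resolver`, and `planSpawn` are hypothetical names.

```typescript
interface AgentRecord {
  name: string;
  mcp_command?: string;
}

interface SpawnPlan {
  mcpBinary: string | null; // null means: spawn without MCP
  warnings: string[];
}

// Hypothetical resolver: returns a binary path, or null if it cannot be found.
type Resolver = (command: string) => string | null;

// Decide how to spawn an agent. An unresolvable mcp_command produces a
// warning and a plan without MCP instead of a hard failure.
function planSpawn(agent: AgentRecord, resolve: Resolver): SpawnPlan {
  const warnings: string[] = [];
  if (!agent.mcp_command) return { mcpBinary: null, warnings };
  const path = resolve(agent.mcp_command);
  if (path === null) {
    warnings.push(
      `mcp_command "${agent.mcp_command}" could not be resolved; spawning ${agent.name} without MCP`,
    );
    return { mcpBinary: null, warnings };
  }
  return { mcpBinary: path, warnings };
}
```

With this shape, a stale "sprout-mcp-server" entry degrades to a warning even if reconciliation never ran.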
wesbillman added a commit that referenced this pull request Jun 8, 2026
Re-enable the generic `reconcile_provider_mcp_commands` migration that
was accidentally removed from the startup sequence in PR #877. This
reconciles `mcp_command` values in managed-agents.json against the
discovery table on every launch — fixing stale "sprout-mcp-server"
references and any future drift without a dedicated one-off function.

Extended to cover both `app_data_dir()` and `canonical_dev_data_dir`
so worktree instances are also healed.

Additionally, harden the spawn site in runtime.rs: if `mcp_command`
references a binary that cannot be resolved, log a warning and continue
spawning without MCP rather than hard-failing. This prevents this entire
class of breakage permanently, regardless of whether reconciliation ran.

Fixes the user-reported issue where agents created before v0.3.12 fail
to spawn because "sprout-mcp-server" no longer exists.

Co-authored-by: Brain <21994759fc7a6fa6b965551d35cfd7897d262f2495467f2d78694ddcfa6a5c7e@sprout-oss.stage.blox.sqprod.co>
Co-authored-by: Duncan <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: Wes <wesbillman@users.noreply.github.com>