Replace LiveKit with WebSocket Opus audio relay #326

Merged
tlongwell-block merged 6 commits into main from opus on Apr 15, 2026
Conversation

@tlongwell-block tlongwell-block commented Apr 15, 2026

Collaborator

Summary

Replace the external LiveKit WebRTC SFU with a WebSocket-based Opus audio relay embedded in sprout-relay. All codec work happens in Rust — the WebView becomes a dumb mic capture + UI layer. No new services, no new ports, no new network dependencies.

34 files changed, +1464 / −1742 (net −278 lines)

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                          sprout-relay                               │
│  ┌──────────────────┐    ┌────────────────────────────────────┐     │
│  │  Nostr WS (/)    │    │  Audio WS (/huddle/{id}/audio)     │     │
│  │  kind:9 text     │    │  binary: opaque Opus frames        │     │
│  │  kind:481xx sig  │    │  text:   control JSON              │     │
│  └──────────────────┘    └──────────┬─────────────────────────┘     │
│                                     │                               │
│                          ┌──────────▼──────────┐                    │
│                          │   AudioRoomManager   │                   │
│                          │  DashMap<Uuid,Room>  │                   │
│                          │  fan-out: try_send   │                   │
│                          └──────────────────────┘                   │
└─────────────────────────────────────────────────────────────────────┘
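The fan-out box in the diagram can be sketched as follows. This is a minimal illustration, not the relay's code: it uses `std::sync::mpsc::sync_channel` in place of whatever async channel the relay actually uses, and `FanOutRoom`, its fields, and the "skip the sender" rule are assumptions made for the example. The point it shows is that `try_send` never blocks, so one slow peer loses frames instead of stalling everyone else.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical, simplified stand-in for a relay audio room.
pub struct FanOutRoom {
    // One bounded audio queue per peer, keyed by peer index.
    peers: HashMap<u8, SyncSender<Vec<u8>>>,
}

impl FanOutRoom {
    pub fn new() -> Self {
        FanOutRoom { peers: HashMap::new() }
    }

    /// Registers a peer and returns the receiving end of its bounded queue.
    /// `capacity` should be at least 1 (0 would make a rendezvous channel).
    pub fn add_peer(&mut self, index: u8, capacity: usize) -> Receiver<Vec<u8>> {
        let (tx, rx) = sync_channel(capacity);
        self.peers.insert(index, tx);
        rx
    }

    /// Offers `packet` to every peer except `from` without blocking.
    /// Returns how many peers did not receive it (queue full or gone).
    pub fn fan_out(&self, from: u8, packet: &[u8]) -> usize {
        let mut dropped = 0;
        for (&index, tx) in &self.peers {
            if index == from {
                continue; // assumption: a peer does not hear its own audio
            }
            match tx.try_send(packet.to_vec()) {
                Ok(()) => {}
                Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => {
                    dropped += 1;
                }
            }
        }
        dropped
    }
}
```

Dropping on a full queue is a reasonable trade for live audio, where a late frame is worth little; the same policy would be harmful for state-bearing control messages, which is why the PR routes those through a separate channel.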

What changed

Relay — new audio/ module

  • room.rs: Room, AudioPeer, AdmissionGuard (mutex-synchronized peer admission + ended flag + index recycling), dual-channel fan-out (audio_tx + ctrl_tx), MAX_PEERS_PER_ROOM (25) defense-in-depth cap, remove_peer_and_check_ended for atomic last-peer detection
  • handler.rs — WS handler with NIP-42 keypair auth, dual ctrl/data send channels with priority drain, heartbeat, lifecycle event emission (48101/48102), 8 KB text frame size limit, relay-side auto-end (archive + kind:48103 when last human disconnects)
  • mod.rs — re-exports
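The "priority drain" mentioned for handler.rs can be illustrated with a small synchronous sketch. The real handler is presumably async; here `std::sync::mpsc` receivers stand in for its channels, and `Outgoing` and `next_outgoing` are hypothetical names. The idea shown is only the ordering rule: the control queue is always polled before the audio queue.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Hypothetical representation of the next WebSocket message to write.
#[derive(Debug, PartialEq)]
pub enum Outgoing {
    Control(String), // text frame: control JSON
    Audio(Vec<u8>),  // binary frame: opaque Opus data
}

/// Picks the next message to send, always preferring control over audio.
/// Returns `None` when both queues are currently empty or closed.
pub fn next_outgoing(
    ctrl_rx: &Receiver<String>,
    data_rx: &Receiver<Vec<u8>>,
) -> Option<Outgoing> {
    match ctrl_rx.try_recv() {
        Ok(text) => return Some(Outgoing::Control(text)),
        // Nothing pending on the control side: fall through to audio.
        Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
    }
    data_rx.try_recv().ok().map(Outgoing::Audio)
}
```

With this ordering, a backlog of audio frames cannot delay a queued joined/left message or a heartbeat, which is the starvation problem listed under the development bugs.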

Relay — wiring & deletions

  • New route: GET /huddle/{channel_id}/audio (WebSocket upgrade)
  • AppState: added audio_rooms: Arc<AudioRoomManager>, removed huddle_service and livekit_url
  • Deleted: sprout-huddle crate, api/huddles.rs, api/webhooks.rs, LiveKit env var handling

Desktop Rust backend

  • relay_api.rs: connect_audio_relay() performs synchronous WS handshake (connect + NIP-42 auth + wait for joined) with 5s timeout before returning — auth failures propagate, no silent degradation. Background pipeline does Opus encode/decode + rodio playback. Emits huddle-audio-disconnected Tauri event on unexpected WS drop (suppressed on intentional teardown via CancellationToken guard).
  • state.rs: removed livekit_* fields, added audio_ws_cancel + audio_relay_pcm_tx
  • mod.rs: simplified commands — no token fetch, no 48101/48102 emission (relay owns these). Audio relay failure is fatal (not degraded mode). Both start_huddle and join_huddle rollback state on post_connect_setup failure.
  • events.rs: removed livekit_room param from build_huddle_started, deleted participant join/leave builders
  • pipeline.rs: audio relay connection failure propagates to caller

Desktop frontend

  • Deleted livekit.ts and livekit-client npm dependency
  • HuddleContext.tsx: getUserMedia directly (48 kHz enforced), no LiveKit connection object, mic stream cleanup on failure, stable disconnectMedia callback (reads track from ref, not state) to prevent React effect cascade from self-ending huddles during startup. Listens for huddle-audio-disconnected event to auto-leave on unexpected relay WS drop.
  • HuddleIndicator.tsx: extract participant pubkey from p-tag (relay-signed 48101/48102 events)
  • HuddleBar.tsx: removed livekit_room, isReconnecting
  • worklet.js: buffer 4800→960 samples (100ms→20ms) for Opus frame alignment, 48 kHz assumption documented
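The worklet buffer change follows from simple arithmetic at 48 kHz: 4800 samples is 100 ms and 960 samples is 20 ms, a standard Opus frame duration. The sketch below checks that arithmetic and shows one way PCM could be cut into whole 20 ms frames; the function names are hypothetical and this is not the code in worklet.js or the Rust pipeline.

```rust
pub const SAMPLE_RATE_HZ: u32 = 48_000;
pub const FRAME_SAMPLES: usize = 960; // 20 ms of mono audio at 48 kHz

/// Duration in milliseconds of a mono buffer holding `samples` samples.
pub fn buffer_duration_ms(samples: u32) -> u32 {
    samples * 1000 / SAMPLE_RATE_HZ
}

/// Splits PCM into complete frames of `frame_len` samples and returns the
/// leftover tail that is too short to form a frame.
/// `frame_len` must be non-zero (`chunks_exact` panics on zero).
pub fn split_frames(pcm: &[f32], frame_len: usize) -> (Vec<&[f32]>, &[f32]) {
    let chunks = pcm.chunks_exact(frame_len);
    let rest = chunks.remainder();
    (chunks.collect(), rest)
}
```

Delivering exactly 960 samples per callback means each buffer maps to one encoder call, with no re-chunking and no extra 80 ms of capture latency.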

Key design decisions

  • NIP-42 keypair auth — reuses existing challenge/response pattern, no tokens
  • Dual per-peer channels — audio frames and control messages (joined/left) use separate queues so control is never starved by audio backpressure
  • Peer index recycling: IndexPool freelist prevents the 255-join exhaustion bug
  • Client-side active speakers — peer_index→pubkey map seeded from initial joined message, maintained via control messages, emitted as Tauri event every 500ms
  • Synchronous handshake: connect_audio_relay() completes WS connect + auth before returning; failures are fatal
  • Stable React callbacks: disconnectMedia uses a ref for the audio track instead of state, preventing the unmount-cleanup effect from re-firing mid-startup
  • Auto-disconnect on relay drop — pipeline emits Tauri event on unexpected WS close; frontend auto-leaves. CancellationToken guard suppresses the event during intentional teardown (teardown_huddle cancels the token before dropping the PCM sender to prevent a race).
  • Relay-side auto-end — when the last audio WS peer disconnects, the relay archives the ephemeral channel and emits kind:48103 (huddle ended). Bots never connect to the audio WS, so every peer is a human — an empty room means zero humans remain. Concurrency is handled by AdmissionGuard, a shared mutex that makes add_peer and remove_peer_and_check_ended mutually exclusive (10 codex review passes to get the synchronization right).
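The peer index recycling decision can be sketched as a freelist-backed allocator. Only the name IndexPool and the idea of a recycling freelist come from the PR; the fields and method names below are assumptions. A plain incrementing `u8` counter runs out after 256 joins even if the room never holds more than a few peers, whereas reusing released indices bounds usage by concurrent peers rather than by total joins.

```rust
/// Sketch of a 1-byte peer index allocator with a recycling freelist.
pub struct IndexPool {
    next: u16,     // next never-used index; u16 so that 256 is representable
    free: Vec<u8>, // indices released by departed peers
}

impl IndexPool {
    pub fn new() -> Self {
        IndexPool { next: 0, free: Vec::new() }
    }

    /// Returns a recycled index if one exists, otherwise a fresh one.
    /// Returns `None` only when all 256 indices are in use at once.
    pub fn alloc(&mut self) -> Option<u8> {
        if let Some(index) = self.free.pop() {
            return Some(index);
        }
        if self.next <= u8::MAX as u16 {
            let index = self.next as u8;
            self.next += 1;
            Some(index)
        } else {
            None
        }
    }

    /// Returns an index to the pool. The caller must release each index
    /// at most once; this sketch does not guard against double release.
    pub fn release(&mut self, index: u8) {
        self.free.push(index);
    }
}
```

With the 25-peer room cap described above, a pool like this would never approach exhaustion regardless of how many times peers join and leave.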

Testing

  • All 1100+ existing workspace unit tests pass
  • Crossfired across 25 review passes total:
    • Initial development: codex CLI × 6, opus model × 1 (14 bugs found and fixed)
    • Post-PR crossfire: codex o3, gpt-5.4, claude-4-opus (3 independent reviewers, 15 issues found)
    • Fix review: codex gpt-5.4 × 4 passes (7→4→6→9/10), each pass catching a real regression in the fixes themselves
    • Auto-end feature: codex gpt-5.4 × 10 passes (6→4→5→3→3→4→3→4→7→9/10), progressively closing concurrency races
  • Live-tested: huddle start, audio relay connect, peer join/leave, teardown, auto-end on hard exit

Bugs found and fixed during development

  1. Wire protocol mismatch (client double-prefixed peer_index)
  2. Single WS send queue starving heartbeat pings → dual ctrl/data channels
  3. Peer index wraps at 255 → IndexPool with recycling freelist
  4. Initial joined peers not seeded into active speaker map
  5. Task leak (surviving task not aborted after select)
  6. Per-packet rodio Player creation → persistent player
  7. text.contains("leave") hack → proper JSON parse
  8. HuddleIndicator using event.pubkey for relay-signed 48101/48102 → p-tag extraction
  9. Mic stream leak on failed start/join → try/catch with stream cleanup
  10. Room per-peer queue shared for audio+control → split audio_tx/ctrl_tx
  11. connect_audio_relay returned before handshake → synchronous connect+auth
  12. Axum 0.8 route param syntax (:channel_id → {channel_id})
  13. post_connect_setup failure swallowed as non-fatal → audio relay failure is now fatal
  14. React effect cascade: disconnectMedia dep on localAudioTrack state caused unmount-cleanup to re-fire mid-startup → stable ref-based callback with [] deps

Post-crossfire hardening (3 reviewers × 4 fix-review passes)

  1. start_huddle/join_huddle committed state before post_connect_setup with no rollback → added rollback (archive ephemeral channel for creators, reset to Idle for joiners)
  2. No auto-disconnect on unexpected relay WS drop → pipeline emits huddle-audio-disconnected Tauri event; frontend listens and auto-leaves
  3. Control channel capacity too small (4 slots) for state-bearing joined/left messages → increased to 32 with warn! on drop
  4. Client WS handshake had no timeout for challenge/joined → 5s timeout added
  5. Opus encoder errors silently swallowed (unwrap_or(0)) → logged via eprintln!
  6. Auth event serialization used unwrap_or_default() → hard error
  7. No text frame size limit on auth/control messages → 8 KB cap added
  8. Tag::parse used .expect() in production path → match/warn!/return
  9. MAX_PEERS_PER_ROOM defense-in-depth cap (25) added
  10. Heartbeat missed-pong counter off-by-one readability → fetch_add + 1 pattern
  11. getUserMedia didn't enforce 48 kHz → sampleRate: 48000 added
  12. Dead stream parameter in cleanupFailedStart → removed
  13. Bytes::copy_from_slice in recv_loop → Bytes::from (zero-copy)
  14. TODO(security) comment on ensure_membership strengthened (unverified parent_channel_id)
  15. Disconnect event fired on intentional teardown too → is_cancelled() guard
  16. leaveHuddleRef2 referenced before leaveHuddle defined (TDZ) → reuse existing ref
  17. teardown_huddle dropped PCM sender before cancelling token (race) → cancel first, then drop

Auto-end feature (10 codex review passes)

  1. Hard exit (Ctrl+C, crash) left huddle "active" for up to 1 hour → relay-side auto-end archives ephemeral channel + emits 48103 when last audio peer disconnects
  2. ensure_membership checked is_member before archived_at → reordered so archived channels are rejected first
  3. Post-get_or_create DB check was missing → added fail-closed archived_at re-check to close cross-boundary race
  4. remove_peer and mark_ended were separate lock acquisitions → combined into atomic remove_peer_and_check_ended under one AdmissionGuard mutex
  5. add_peer ended check was outside the lock → moved inside, held across ended-check + alloc + insert
  6. Concurrent disconnects could both win auto-end → !g.ended guard ensures only first empty+!ended transition wins
  7. Archive failure left room in ended state → clear_ended() rollback, TTL reaper handles cleanup

What is NOT built (per plan §7)

Video, server-side recording/mixing, TURN/NAT traversal, jitter buffer, per-participant volume, WS reconnect/resume, FEC — all deferred per spec.

@tlongwell-block tlongwell-block force-pushed the opus branch 3 times, most recently from cb218b9 to f2e9e56 on April 15, 2026 15:43
Remove the external LiveKit WebRTC SFU dependency and replace it with a
simple WebSocket-based Opus audio relay embedded in sprout-relay. All
codec work happens in Rust — the WebView is a dumb mic capture + UI layer.

Relay (new audio/ module):
- AudioRoomManager with DashMap<Uuid, Room> for room lifecycle
- WS endpoint at /huddle/:channel_id/audio with NIP-42 keypair auth
- Binary fan-out: client sends raw Opus, relay prepends 1-byte peer_index
- Dual per-peer channels (audio_tx + ctrl_tx) so control messages
  (joined/left) are never starved by audio backpressure
- Dual WS send channels (data + ctrl) with priority drain, matching
  the existing connection.rs pattern
- Heartbeat via ping/pong (30s interval, 3 missed -> disconnect)
- Peer index recycling via IndexPool freelist (no 255-wrap bug)
- Relay emits kind:48101/48102 lifecycle events (single source of truth)
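The binary framing described above (raw Opus from the client, a 1-byte peer_index prepended by the relay) can be sketched in a few lines. The function names are hypothetical; the layout is the only part taken from the commit message.

```rust
/// Relay side: builds the packet sent to listeners by putting the
/// sender's peer index in front of the opaque Opus payload.
pub fn prepend_peer_index(peer_index: u8, opus: &[u8]) -> Vec<u8> {
    let mut packet = Vec::with_capacity(opus.len() + 1);
    packet.push(peer_index);
    packet.extend_from_slice(opus);
    packet
}

/// Client side: splits a received packet into (peer_index, opus payload).
/// Returns `None` for an empty packet, which carries no index at all.
pub fn split_peer_index(packet: &[u8]) -> Option<(u8, &[u8])> {
    packet.split_first().map(|(&index, opus)| (index, opus))
}
```

Because only the relay adds the prefix, a client that also prefixes its outgoing frames would shift every payload by one byte, which is the kind of mismatch listed as the first development bug.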

Desktop Rust backend:
- connect_audio_relay() performs synchronous WS handshake (connect +
  NIP-42 auth + wait for joined) before returning — auth failures
  propagate to the caller, no silent degradation
- Opus encode (48kHz mono, 32kbps, DTX) from PCM via push_audio_pcm
- Per-peer Opus decode with persistent rodio Player for playback
- Client-side active speaker tracking: peer_index->pubkey map seeded
  from initial joined message, maintained via control messages,
  emitted as huddle-active-speakers Tauri event every 500ms
- Surviving task explicitly aborted after select to prevent leaks

Desktop frontend:
- Delete livekit.ts and livekit-client npm dependency
- HuddleContext: getUserMedia directly, no LiveKit connection object
- HuddleIndicator: extract participant from p-tag (relay-signed events)
- HuddleBar: remove livekit_room, isReconnecting
- worklet.js: 4800->960 sample buffer (100ms->20ms) for Opus frame alignment
- Mic stream cleanup on failed start/join via try/catch

Deleted:
- sprout-huddle crate (token.rs, webhook.rs, session.rs, error.rs, lib.rs)
- api/huddles.rs (LiveKit token endpoint)
- api/webhooks.rs (LiveKit webhook handler)
- livekit.ts (frontend LiveKit connection)
- livekit-client npm dependency

34 files changed, ~1450 insertions, ~1735 deletions
* origin/main:
  [codex] Add mobile chat timeline rendering and hardening (#327)
  fix: pre-create index on partitioned table to unblock dev setup (#325)
  fix(composer): prevent phantom newline on Enter by moving submit into Tiptap extension (#324)

# Conflicts:
#	deny.toml
Fixes all issues found across 3 independent reviews (codex o3, gpt-5.4,
claude-4-opus) of the LiveKit→WebSocket Opus relay replacement.

Critical:
- C1: start_huddle/join_huddle now rollback state on post_connect_setup
  failure (archive ephemeral channel for creators, reset to Idle for joiners)
- C2: Audio relay pipeline emits 'huddle-audio-disconnected' Tauri event
  on unexpected WS drop; frontend listens and auto-leaves. Event is
  suppressed on intentional teardown via CancellationToken guard.
  teardown_huddle cancels the token BEFORE dropping pcm_tx to prevent
  a race where the send task exits via sender-drop before is_cancelled()
  is true.
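The ordering in C2 can be demonstrated with a small threaded sketch. It is not the desktop code: an `AtomicBool` stands in for the CancellationToken, a std channel stands in for the PCM sender, and the thread's return value stands in for "would emit huddle-audio-disconnected". All names are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Spawns a stand-in send task. The task's result is true when it would
/// emit the disconnect event, i.e. it exited without being cancelled.
pub fn spawn_send_task(cancelled: Arc<AtomicBool>) -> (Sender<Vec<i16>>, JoinHandle<bool>) {
    let (pcm_tx, pcm_rx) = channel::<Vec<i16>>();
    let handle = thread::spawn(move || {
        // The loop ends once every sender has been dropped.
        for _frame in pcm_rx {
            // A real task would encode the frame and write it to the socket.
        }
        // Treat the exit as unexpected only if no teardown was requested.
        !cancelled.load(Ordering::SeqCst)
    });
    (pcm_tx, handle)
}

/// Intentional teardown: cancel FIRST, then drop the sender, so the task
/// is guaranteed to observe the flag when its loop ends.
pub fn teardown(cancelled: &AtomicBool, pcm_tx: Sender<Vec<i16>>) {
    cancelled.store(true, Ordering::SeqCst);
    drop(pcm_tx);
}
```

If the two lines in `teardown` were swapped, the task could finish its loop and read the flag before it was set, and an intentional leave would be reported as an unexpected disconnect.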

Important:
- I3: Control channel capacity 4→32 slots with warn! on drop
- I4: Client WS handshake now has 5s timeout for challenge and joined
- I5: Strengthened TODO(security) comment on ensure_membership
- I6: Opus encoder errors logged instead of silently swallowed

Minor:
- M7: 8KB text frame size limit for auth/control messages
- M8: Removed dead 'stream' parameter from cleanupFailedStart
- M9: Bytes::copy_from_slice→Bytes::from (zero-copy)
- M10: Tag::parse uses match/warn!/return (no .expect() in prod)
- M11: MAX_PEERS_PER_ROOM (25) defense-in-depth cap
- M14: Heartbeat missed-pong counter readability (fetch_add+1)
- M15: 48kHz documented in worklet.js, enforced in getUserMedia

Reviewed across 4 codex CLI passes (7→4→6→9/10).
When the last audio WS peer disconnects, the relay now automatically
archives the ephemeral channel and emits kind:48103 (huddle ended).
Since bots never connect to the audio WS, every peer is a human —
an empty room means zero humans remain.

Concurrency design (10 codex review passes, 6→4→5→3→3→4→3→4→7→9/10):

Room-level AdmissionGuard mutex synchronizes peer admission with the
end-of-huddle transition. add_peer holds the lock across ended-check +
index-alloc + peer-insert. remove_peer_and_check_ended holds the same
lock across index-release + is_empty + ended=true. These are mutually
exclusive — no concurrent add_peer can succeed after the room is marked
ended, and no concurrent remover can double-trigger auto-end (!g.ended
guard ensures only the first empty+!ended transition wins).

Three layers of defense against stale joins:
1. ensure_membership checks archived_at before is_member
2. Post-get_or_create DB check (fail-closed) catches cross-boundary race
3. AdmissionGuard mutex catches same-room race

Archive failure rolls back the ended flag via clear_ended — the huddle
stays alive and the 1-hour TTL reaper handles cleanup later.
@tlongwell-block tlongwell-block merged commit 811c8f3 into main Apr 15, 2026
10 checks passed
@tlongwell-block tlongwell-block deleted the opus branch April 15, 2026 19:02
tlongwell-block added a commit that referenced this pull request May 24, 2026
The docs had drifted from the code in both directions — overstating
removed features (LiveKit) and understating shipped ones (git hosting,
mobile, huddles). This corrects the docs to match reality and removes
hardcoded counts that go stale.

- Huddles: rewrite around the in-relay WebSocket Opus path (LiveKit was
  removed in #326). Drop the phantom `sprout-huddle` crate from every
  crate map (README, AGENTS, CONTRIBUTING, ARCHITECTURE) and remove the
  dead LIVEKIT_* vars from .env.example.
- Status tables: git hosting (smart HTTP + NIP-34) and the Flutter mobile
  client are real and shipping/in-progress — promote them from "designed/
  planned." Huddles → built (recording/tracks still planned).
- Remove hardcoded counts ("44 tools", "44-command CLI", "17 crates",
  "~72K LOC", per-crate LOC markers, the LOC Summary appendix). They
  drift; point at the source of truth instead.
- Fix repository URL sprout-rs/sprout → block/sprout (incl. relay NIP-11
  software field).
- AGENTS crate map: complete it (was missing 8 crates), group it, add
  web/ and mobile/.
- README: reframe sprout-mcp as "being phased out in favor of the CLI"
  (matches the deprecation direction) rather than "legacy/optional."
- CONTRIBUTING: `cargo test -p sprout-test-client` → add `--ignored`
  (the E2E tests are #[ignore]d; without it the command runs nothing).
- kind.rs: drop the dead RESEARCH/ doc pointer; this module is the source.
- ARCHITECTURE: correct stale "Known Limitations" (huddle is wired; the
  send_dm/set_channel_topic actions fail at runtime, not silently).

Docs-only plus comment/string fixes; no logic changes. Verified
`cargo check` passes on the four crates with edited source.

Signed-off-by: tlongwell-block <109685178+tlongwell-block@users.noreply.github.com>