
Add microphone test panel to audio settings #668

Merged
bdart merged 31 commits into main from feat/audio-input-test-panel on May 12, 2026

Conversation

bdart (Contributor) commented May 11, 2026

This pull request enhances the audio input selection and testing experience in the user preferences dialog, introduces a microphone test panel, and improves the dictation UI for clearer feedback during audio capture. It also updates related tests and underlying audio handling logic to support these features.

User Preferences Dialog & Audio Input Selection:

  • Replaces the audio input device <select> with a custom DropdownMenu for better usability and accessibility, including improved labeling and selection handling.
  • Adds a new MicTestPanel component, allowing users to test their selected microphone directly within the preferences dialog, with real-time waveform and device feedback.
  • Updates tests to reflect the new dropdown menu and microphone selection logic.

Dictation and Audio UI Improvements:

  • Updates the dictation button to visually indicate when audio is being captured but not yet transmitted ("buffering"), including new labels and dimmed waveform bars for clearer user feedback.
  • Adds an isCapturingAudio prop to the ChatInput component and adjusts dictation state handling for improved accuracy in UI state.

Testing and Audio Infrastructure:

  • Refactors audio input and dictation test mocks to use AudioWorkletNode instead of deprecated ScriptProcessorNode, aligning with modern browser APIs and improving test reliability.
  • Mocks the audio worklet module in tests to prevent loading actual worker code.

General Codebase Improvements:

  • Adds missing imports and minor code cleanups for clarity and maintainability.

These changes collectively provide a more intuitive and robust experience for users configuring and testing their microphone, while also bringing the codebase up to date with current web audio best practices.

bdart added 30 commits May 11, 2026 18:00
- Share selected audio input via Zustand store so the dialog and
  long-lived recorders see writes immediately (migrates from legacy
  localStorage key).
- New useAudioInputLevelPreview hook (analyser + RAF, no transcription)
  reusing the recorder's bar extractor.
- New MicTestPanel reuses the Waveform primitive and surfaces the
  active track label + deviceId-mismatch warning.
- Swap the bare <select> for the project-standard DropdownMenu and
  drop the redundant "Audio input" sub-label to match the Appearance
  tab pattern.
Build the AudioContext / analyser / ScriptProcessor pipeline right
after getUserMedia and pre-buffer samples in preSessionSamplesRef
until the live session is ready, then drain in order. Pass the
context's actual sample rate into startLiveDictationSession so
resampling uses the right ratio from the first sample. Clear the
pre-buffer in stopRecordingVisualizer and the unmount path.
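The buffer-then-drain idea above can be sketched as a small queue that forwards directly once a sink is attached. This is a minimal illustration, not the hook's actual internals; the names `PreBufferedPipeline` and `SampleSink` are invented for the example.

```typescript
type SampleSink = (samples: Float32Array) => void;

class PreBufferedPipeline {
  private preSessionSamples: Float32Array[] = [];
  private sink: SampleSink | null = null;

  // Called from the audio callback from the moment the graph is wired.
  push(samples: Float32Array): void {
    if (this.sink) {
      this.sink(samples); // session live: forward directly
    } else {
      this.preSessionSamples.push(samples); // session pending: buffer
    }
  }

  // Called once the live session resolves: drain in arrival order, then go direct.
  attachSink(sink: SampleSink): void {
    for (const frame of this.preSessionSamples) sink(frame);
    this.preSessionSamples = [];
    this.sink = sink;
  }

  // Called on stop/unmount so a cancelled startup drops its buffer.
  reset(): void {
    this.preSessionSamples = [];
    this.sink = null;
  }
}
```

The key property is ordering: everything captured before the handshake completes is transmitted ahead of anything captured after it.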

Track the in-flight WebSocket in pendingSocketRef so stopDictation
can cancel a startup mid-handshake; toggleDictation now treats
isDictationStarting as cancellable. ChatInput renders WaveformButton
during isDictationStarting too, dimmed via a new isBuffering prop on
WaveformButton to signal "capturing but not yet transmitting".
Locks down the early-speech race fix: emit samples after the audio
pipeline is built but before session_state arrives, then verify they
are transmitted as part of the first PCM chunk once the session
resolves. Catches any future regression that re-orders the pipeline
construction below the socket handshake or skips the pre-buffer drain.
ScriptProcessorNode runs its callback on the main thread, which lets
heavy renders, GC pauses, or long tasks starve the audio path. Replace
it with an AudioWorkletNode whose processor lives in a small ESM
worklet file and runs on the audio rendering thread. The processor
batches render-quantum samples into 4096-sample frames and posts each
frame to the main thread via MessagePort, matching the legacy chunk
cadence so downstream resampling and PCM transport stay unchanged.

The worklet is bundled via `?worker&url` so Vite emits it as a real
JS asset rather than the broken `data:video/mp2t` URL produced by
`?url` on a .ts file. Drops the now-unneeded zero-gain processorSink
that the old ScriptProcessor required to stay live. Updates the
recorder test mocks to MockAudioWorkletNode + MockMessagePort.
isDictationStarting flips at the click, but the worklet only enters
the audio graph ~250-300 ms later, after await getUserMedia and
await audioWorklet.addModule resolve. Showing the waveform on
isDictationStarting let users start speaking into a mic the pipeline
hadn't tapped yet.

Add isCapturingAudio to the dictation hook, flipped true right after
source.connect(processor) and false in stopRecordingVisualizer.
ChatInput renders the WaveformButton only when isCapturingAudio or
isDictating, and keeps the spinner during the pre-capture window —
so the bars now mean "go ahead, captured even if not yet transmitted"
rather than "we clicked the button".
isCapturingAudio still flipped too early: source.connect(processor)
returns instantly, but the OS / mic stack may take hundreds of
milliseconds (Bluetooth, Safari, external mics) before samples flow.
The user saw bars, started speaking, and the analyser stayed flat
until samples finally arrived — losing the first words.

Move the isCapturingAudio flip into the worklet's port.onmessage
handler, gated by a per-session firstFrameSeen flag. That message
only fires after the worklet has buffered 4096 real samples from the
source, so bars-visible now means audio is genuinely flowing.

Also resume() the AudioContext after creation: if the click's user
activation has lapsed across the getUserMedia + addModule awaits,
the new context starts suspended and no audio reaches any node.
resume() is a no-op when the context is already running.

Adds the matching resume() to MockAudioContext in the test file.
Last fix flipped "capturing" on the worklet's first port.onmessage,
but that message was firing on a frame of silence. Per W3C Web Audio
§1.32 and Mozilla bug 1629478, MediaStreamAudioSourceNode emits
zero-filled render quanta during OS / driver warm-up — up to ~70
quanta on Firefox, hundreds of milliseconds on Bluetooth / Safari /
external mics. So bars appeared, user spoke, and real samples landed
later.

Match the OpenAI wavtools / LiveKit / hark idiom: gate worklet posts
on a foundAudio flag that flips on the first non-zero sample. The
worklet drops zero-only frames entirely and starts posting at the
first energy-bearing buffer, sliced from the first non-zero sample.
The main thread's existing firstFrameSeen flip is now triggered by
real audio, so the bars only appear when speech can actually be
captured.

Also await audioContext.resume() instead of fire-and-forget; Mozilla
measured 7x fewer empty quanta with an explicit awaited resume.
Re-checks isMountedRef after the new await to keep the unmount
teardown symmetric.

track.muted / onunmute were considered but rejected — Firefox keeps
muted=false for mic tracks (bug 1739163), WebKit drives muted from
AudioUnit suspension not first-sample arrival, and no production
audio library (LiveKit, wavtools, hark, vad-web, web-audio-analyser)
uses them as the readiness signal.
The previous worklet trim (drop frames until first non-zero) was the
wavtools push-to-talk pattern, applied incorrectly as a session-wide
silence stripper. OpenAI's stack tolerates that only because the
Realtime API's prefix_padding_ms (default 300 ms) adds the padding
back server-side; our server has no equivalent. Without leading
silence, the VAD has no calibration window and clips the first word.

Worklet now always batches and posts every render quantum — including
the zero-filled samples MediaStreamAudioSourceNode emits during OS
warm-up. Non-zero detection moved to the main-thread port.onmessage
handler, where it still drives isCapturingAudio so the bars only
appear when real audio is flowing — the UX cue the user wanted is
preserved, but the data on the wire now matches what production STT
clients (Deepgram, AssemblyAI, OpenAI Realtime reference clients)
send: full stream from getUserMedia onward.

Evidence: OpenAI prefix_padding_ms default 300 ms, Silero VAD
speech_pad_ms, sherpa-onnx #3035 pre-speech padding fix, DeepSpeech
#2443 (5 ms silence prepend fixes first word), Wispr Flow "Fix
Missing First Words" guide. Every production VAD documents the same
mitigation.
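The final arrangement, posting every frame while the main thread gates only the UI signal, can be sketched like this. `hasAudibleSample` and `createFrameHandler` are illustrative names under the assumptions above, not the hook's real functions.

```typescript
function hasAudibleSample(frame: Float32Array): boolean {
  for (let i = 0; i < frame.length; i++) {
    if (frame[i] !== 0) return true;
  }
  return false;
}

// Per-session gate: fires onFirstAudio exactly once, while every frame,
// warm-up zeros included, is still forwarded to the transport unmodified.
function createFrameHandler(
  sendToTransport: (frame: Float32Array) => void,
  onFirstAudio: () => void,
) {
  let firstFrameSeen = false;
  return (frame: Float32Array) => {
    sendToTransport(frame); // full stream from getUserMedia onward
    if (!firstFrameSeen && hasAudibleSample(frame)) {
      firstFrameSeen = true;
      onFirstAudio(); // drives isCapturingAudio and the waveform bars
    }
  };
}
```

The separation is the point: the wire format keeps the leading silence the VAD needs, and the bars still only appear once real audio is flowing.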
Cold first dictation got hundreds of milliseconds of OS warm-up
zeros for free, doubling as the VAD calibration window. Warm second
dictation found the OS audio device still hot — real samples arrived
almost instantly — and the wire format started with speech, letting
the server VAD clip the first word.

Seed preSessionSamplesRef with PRE_SPEECH_SILENCE_PRIMER_MS (300 ms)
of zero samples at the AudioContext rate immediately after the audio
pipeline is wired, so every session ships the same calibration
window to the server regardless of OS device state. Matches the
prefix_padding_ms = 300 default in the OpenAI Realtime VAD guide;
Silero VAD speech_pad_ms and sherpa-onnx pre-speech padding land in
the same range. Cost ~10 KB of PCM per session; user-visible spinner
and bars behavior unchanged.
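The primer itself is just a zero-filled buffer sized from the context rate. A minimal sketch, using the 300 ms constant from the commit; the function name is invented here.

```typescript
const PRE_SPEECH_SILENCE_PRIMER_MS = 300;

// Returns PRE_SPEECH_SILENCE_PRIMER_MS worth of silence at the given rate,
// to be seeded into the pre-session sample buffer right after the audio
// pipeline is wired. Float32Array is zero-filled on construction.
function createSilencePrimer(sampleRateHz: number): Float32Array {
  const sampleCount = Math.round(
    (sampleRateHz * PRE_SPEECH_SILENCE_PRIMER_MS) / 1000,
  );
  return new Float32Array(sampleCount);
}
```

At the 16 kHz canonical rate this is 4800 samples, about 9.4 KB once encoded as 16-bit PCM, consistent with the "~10 KB per session" cost quoted above.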
Consolidate the duplicate "Audio input is not supported in this
browser." onto the existing "Audio recording is not supported..."
key already translated in every locale.

Translate the 13 genuinely new strings introduced by the mic test
panel, dictation early-capture UX, and preview-hook error paths:

- preferences.dialog.audio.test.{description,heading,listening,
  mismatch,start,stop}
- "Audio analysis is not supported in this browser."
- "Cancel starting dictation"
- "Capturing audio — waiting for transcription to start"
- "Could not start the microphone test."
- "Microphone permission denied. Allow access to test the microphone."
- "The microphone is already in use by another application or recording."
- "The selected microphone is not available. Refresh the device list
  and try again."

`test.heading` and `test.start` are kept as separate keys (despite
identical English "Test microphone") so translators can pick a
nominal form for the section heading and an imperative form for the
button label where the language warrants it; today they're identical
across de/es/fr/pl.
On a fully-warm OS audio device the worklet's first non-zero frame
can arrive within tens of milliseconds of source.connect(processor),
fast enough that the spinner feels like a blink and the user starts
speaking before they've finished the priming breath.

Belt-and-braces alongside the existing PRE_SPEECH_SILENCE_PRIMER_MS:
record the time we wire the audio graph, and when the first non-zero
frame arrives, defer the isCapturingAudio flip until at least
MIN_AUDIO_CAPTURE_DELAY_MS (150 ms) has elapsed. A new
capturingFlipTimerRef tracks the deferred flip and is cleared in
stopRecordingVisualizer so a mid-flight cancel doesn't leave a stale
timer firing after teardown. The flipCapturingAudio closure also
re-checks audioProcessorRef.current === processor before firing so a
late timeout can't promote a session that's already been torn down.
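The deferral amounts to a small timing computation: how long to wait, after the first non-zero frame, so the spinner has been visible for at least the minimum window. A sketch under the constants named above; `remainingCaptureDelayMs` is an illustrative helper, not the hook's code.

```typescript
const MIN_AUDIO_CAPTURE_DELAY_MS = 150;

// Given when the audio graph was wired and when the first non-zero frame
// arrived, return how much longer to defer the isCapturingAudio flip.
function remainingCaptureDelayMs(
  graphWiredAtMs: number,
  firstFrameAtMs: number,
): number {
  const elapsed = firstFrameAtMs - graphWiredAtMs;
  return Math.max(0, MIN_AUDIO_CAPTURE_DELAY_MS - elapsed);
}
```

On a warm device (first frame 20 ms after wiring) the flip is deferred another 130 ms; on a cold one (400 ms of warm-up) it fires immediately, since the user has already waited longer than the minimum.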
We never consumed the encoded MediaRecorder output — no
ondataavailable listener was ever set. Its only purpose was to give
us an onstop event and a state === "recording" flag. Both are
trivially replaceable with our own callable and the existing
liveSessionRef.

Extract the former mediaRecorder.onstop body into a completeDictation
useCallback that runs the final-flush → finish → await-completed →
cleanup chain. stopDictation now calls it directly when
liveSessionRef.current is set (precise "we are recording now" signal,
set the same synchronous tick as setIsDictating(true) — more
reliable than the React state across the same tick).

Eliminates a documented class of head/tail glitch bugs (Mozilla
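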
1539186, w3c/mediacapture-record #178) and matches the production
STT consensus: AssemblyAI, OpenAI Realtime, LiveKit transport, Vapi,
Deepgram browser samples all use AudioWorklet alone, never alongside
MediaRecorder on the same getUserMedia stream.

Drops MockMediaRecorder and its assertions from the recorder test.
Every dictation used to create a brand-new AudioContext and close it
on stop. Each session paid the audio-thread spin-up + worklet
addModule fetch/parse cost, even when the OS audio device was warm —
which is the most likely cause of the warm-second-session first-word
loss: getUserMedia resolved in ~10 ms but the AudioContext itself was
cold, and the worklet's first frames took tens of milliseconds to
arrive while the user was already speaking.

Add ensureAudioContextReady: reuses the existing AudioContext when
present, only constructs a fresh one when the previous was closed,
calls audioWorklet.addModule once per context lifetime (tracked via
workletModuleLoadedRef), and resumes the context (no-op when
running). stopRecordingVisualizer now suspends rather than closes,
keeping the audio thread warm between sessions. close + null happen
only on hook unmount.

source / analyser / AudioWorkletNode are still created fresh per
session so leftover per-session state cannot leak. The expensive
parts — audio thread, worklet registration — persist; the per-session
graph wiring is what gets rebuilt.

Adds a suspend mock to MockAudioContext to keep the recorder tests
passing.
When AudioContext.sampleRate differs from the input MediaStreamTrack's
sample rate, createMediaStreamSource inserts a Web Audio internal
resampler with documented startup latency — one more contributor to
the warm-second-session first-sample gap.

Read trackSettings.sampleRate after getUserMedia and pass it to
ensureAudioContextReady, which forwards it to the AudioContext
constructor on first creation. Falls back to the default rate if the
browser rejects the requested value. Persisted contexts keep their
original rate even if a subsequent track reports a different one —
that tradeoff is right: the warm audio thread is a bigger win than
avoiding the resampler when the device changes, which is rare.

The worklet's PCM resampler still handles the 16 kHz canonical
conversion regardless.
Drops the bespoke lazyLocalStorage PersistStorage in favour of
createJSONStorage, but keeps a tiny wrapper object whose methods
re-resolve the localStorage global on every call. createJSONStorage
caches its factory's return value once at create time, so handing it
localStorage directly would freeze a reference that vi.stubGlobal
can't replace in tests. createJSONStorage still owns the JSON
wrapping, which is the part of the original wrapper that was
re-inventing zustand's wheel.
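The shim looks roughly like this: a wrapper following the StateStorage shape that zustand's createJSONStorage expects, whose methods re-resolve the global on every call so a test stub installed later is still seen. The object name is invented; the commented store wiring is a sketch of the usage, not the PR's exact code.

```typescript
const lateBoundLocalStorage = {
  getItem: (name: string): string | null =>
    globalThis.localStorage.getItem(name),
  setItem: (name: string, value: string): void =>
    globalThis.localStorage.setItem(name, value),
  removeItem: (name: string): void =>
    globalThis.localStorage.removeItem(name),
};

// Usage in the store definition (zustand imports assumed):
//   persist(creator, {
//     name: "audio-input",
//     storage: createJSONStorage(() => lateBoundLocalStorage),
//   })
```

Handing `localStorage` itself to the factory would freeze the reference at store-creation time; the wrapper keeps the binding late while createJSONStorage still owns all the JSON serialization.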
Verified empirically that Vite 6 inlines TS files referenced via
new URL(..., import.meta.url) as data:video/mp2t;base64,... URLs in
production builds (the .ts MIME being MPEG Transport Stream video,
not TypeScript), which the browser refuses to load via
audioWorklet.addModule. The ?worker&url query is the working pattern
for our TS worklet; expand the comment to explain why the
platform-canonical Vite recipe doesn't apply here, so a future
reviewer doesn't try the same swap again.
The audio-input dropdown trigger in UserPreferencesDialog rendered an
inline <svg> chevron — out of step with every other chevron in the
app, which goes through iconoir-react via the icons module wrapper
(NavArrowRight → ChevronRightIcon). Re-export NavArrowDown as
ChevronDownIcon and use it. Same rotation transform on dropdown open,
sized to match the rest of the trigger.
Replaces the manual `isMountedRef = useRef(true)` + mount/unmount
effect pattern in both audio hooks with react-use's useMountedState,
which returns a stable getter that's true while the component is
mounted. Drops the `eslint-disable-next-line @typescript-eslint/no-unnecessary-condition`
comment that was defeating TS narrowing of the
ref across awaits — the function call isn't subject to that narrowing
in the first place. The mount-effect that flipped the ref disappears
entirely. Adds isMounted to the deps arrays of the few useCallbacks /
useEffects that read it.
Previously the dropdown only listed devices known at mount time;
plugging in a USB / Bluetooth mic mid-session left it stale until
the user manually clicked "Refresh devices". Subscribe to
mediaDevices.devicechange and re-enumerate on the event. The manual
refresh button still has a job: devicechange does NOT fire when
labels become available after a permission grant (because the
device list itself didn't change), only on add/remove.

react-use ships useMediaDevices which does the same subscription
internally, but it doesn't expose a refresh callable — needed for
the permission-grant-then-reload case — so we keep the existing
imperative refresh and add only the missing reactivity.
The analyser RAF tick fires at ~60 Hz, so the bar array was being
written to React state 60 times per second. Halve consumer
re-renders with useThrottledCallback at 33 ms, no perceptible UX
change — the analyser still samples at full RAF rate, only the
React updates are throttled. Cancel any pending throttled update in
stopRecordingVisualizer before resetting bars to the idle pattern,
so a late tick can't briefly flash old levels.

The reset / idle / fallback codepaths still call the raw
setDictationBars so they apply immediately.
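The throttling shape can be sketched with an injectable clock (the PR uses a useThrottledCallback hook; this standalone trailing-coalesce version is illustrative only). Calls within the interval keep only the latest value, `tick` flushes it once allowed, and `cancel` drops any pending value so a late update can't flash stale levels.

```typescript
function createThrottledSetter<T>(
  apply: (value: T) => void,
  intervalMs: number,
  now: () => number = Date.now,
) {
  let lastAppliedAt = -Infinity;
  let pending: { value: T } | null = null;

  return {
    // Called at full RAF rate (~60 Hz); applies at most once per intervalMs.
    set(value: T): void {
      const t = now();
      if (t - lastAppliedAt >= intervalMs) {
        lastAppliedAt = t;
        pending = null;
        apply(value);
      } else {
        pending = { value }; // coalesce: keep only the most recent value
      }
    },
    // Flush the coalesced value on a later tick, if the interval has elapsed.
    tick(): void {
      if (pending) this.set(pending.value);
    },
    // Clear any pending update before resetting bars to the idle pattern.
    cancel(): void {
      pending = null;
    },
  };
}
```

At 33 ms the React state sees at most ~30 updates/s while the analyser keeps sampling at the full RAF rate.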
MediaRecorder was removed from the dictation flow in ca41cba; the
function's name has been lying about what it does ever since. It
stops the MediaStream tracks AND calls stopRecordingVisualizer (which
disconnects the worklet/analyser/source and suspends the
AudioContext) — i.e. it tears down the entire audio capture graph.
The new name says so.

Scope: dictation hook only. useAudioTranscriptionRecorder still uses
MediaRecorder, so its same-named function keeps its old name there.
Pure code, no React, no DOM event listeners — directly unit-testable
in isolation. Moves out of useAudioDictationRecorder:

- CANONICAL_AUDIO_SAMPLE_RATE_HZ / WAV_HEADER_BYTES / BYTES_PER_SAMPLE
- AUDIO_BARS_COUNT
- AudioDictationDiagnostics type
- writeAscii (internal)
- createCanonicalWavBytesFromPcm
- resampleMonoFloat32ToPcm16
- mediaTrackSettingsToDiagnostics
- getAudioLevelBarsFromTimeDomainData

The recorder, preview hook, and dictation test re-import from the
new module. ~120 lines removed from the recorder file with zero
behavior change.
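As an example of the "pure, no React, no DOM" shape these helpers have, here is an illustrative linear-interpolation version of the Float32-to-PCM16 resampler. The real resampleMonoFloat32ToPcm16 may use a different interpolation; this sketch only shows why the module is directly unit-testable.

```typescript
function resampleMonoFloat32ToPcm16(
  input: Float32Array,
  inputRateHz: number,
  outputRateHz: number,
): Int16Array {
  const outputLength = Math.max(
    1,
    Math.round((input.length * outputRateHz) / inputRateHz),
  );
  const output = new Int16Array(outputLength);
  const ratio = (input.length - 1) / Math.max(1, outputLength - 1);
  for (let i = 0; i < outputLength; i++) {
    const pos = i * ratio;
    const lo = Math.floor(pos);
    const hi = Math.min(lo + 1, input.length - 1);
    // Linear interpolation between the two nearest input samples,
    // clamped to [-1, 1] before scaling into the int16 range.
    const sample = input[lo] + (input[hi] - input[lo]) * (pos - lo);
    const clamped = Math.max(-1, Math.min(1, sample));
    output[i] = Math.round(clamped * 32767);
  }
  return output;
}
```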
Moves out of useAudioDictationRecorder:

- AUDIO_DICTATION_SOCKET_OPEN_TIMEOUT_MS / FRAME_TIMEOUT_MS
- AudioDictationSocketFrame discriminated-union type
- createAudioDictationWebSocketUrl
- sendAudioDictationControlFrame
- waitForSocketOpen
- waitForAudioDictationFrame

These are protocol-only concerns (URL construction, JSON-frame
encoding, listener-based promise helpers) with no React, no audio
pipeline, and no hook state. Lives in a sibling
audio-dictation-protocol module. Another ~110 lines out of the
recorder. Zero behavior change.
Replaces the three independent booleans isDictating /
isDictationStarting / isDictationCompleting with a useReducer over a
DictationSessionStatus tagged union:

  idle | starting | dictating | completing

The reducer encodes the four valid transitions explicitly (start,
session_ready, user_stop, complete) plus an abort that resets to idle
for error paths. Impossible combinations like "starting && completing"
are now statically excluded by the type — previously the code
defended against them with runtime checks and a long-comment "trust
liveSessionRef.current, not React state" workaround.

The hook's public return is unchanged: the three booleans are derived
from sessionStatus. Callers (ChatInput, etc.) continue to work
without modification.

`isCapturingAudio` stays as its own useState — it's an orthogonal
audio-graph signal (flips on the worklet's first non-zero frame),
not part of the session lifecycle.

`startInFlightRef` stays too — it's the synchronous re-entry guard
that catches the gap between dispatch and the next render's
state-read; the reducer doesn't replace it.
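The tagged union and its four valid transitions can be sketched directly from the commit; the status and action names are the ones listed above, while the exact reducer shape is illustrative.

```typescript
type DictationSessionStatus = "idle" | "starting" | "dictating" | "completing";

type DictationAction =
  | { type: "start" }
  | { type: "session_ready" }
  | { type: "user_stop" }
  | { type: "complete" }
  | { type: "abort" }; // error paths: reset to idle from any state

function dictationSessionReducer(
  status: DictationSessionStatus,
  action: DictationAction,
): DictationSessionStatus {
  switch (action.type) {
    case "start":
      return status === "idle" ? "starting" : status;
    case "session_ready":
      return status === "starting" ? "dictating" : status;
    case "user_stop":
      return status === "dictating" ? "completing" : status;
    case "complete":
      return status === "completing" ? "idle" : status;
    case "abort":
      return "idle";
  }
}

// The hook's public booleans stay derived, so callers don't change:
const toBooleans = (s: DictationSessionStatus) => ({
  isDictationStarting: s === "starting",
  isDictating: s === "dictating",
  isDictationCompleting: s === "completing",
});
```

Invalid transitions simply no-op, which is what the old runtime checks were approximating; states like "starting && completing" are now unrepresentable.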
Threads an AbortSignal through waitForSocketOpen and
waitForAudioDictationFrame so they reject promptly on abort and
remove their listeners cleanly. The recorder mints an
AbortController in startDictation, stores it in sessionAbortRef, and
passes the signal into startLiveDictationSession. stopDictation's
cancel-during-starting branch now calls controller.abort() instead
of reaching into the pending socket directly; the abort listener
inside the startup helper closes the WebSocket as belt-and-braces
so the protocol close-handlers also fire.

The unmount cleanup aborts the in-flight controller (which closes
the pending socket as a side effect). The startDictation catch
block now distinguishes AbortError from real failures so a user
cancel doesn't surface as a "could not start" error message.

AbortRejection helper centralizes the DOMException construction
with an eslint-disable comment for the programmatic
name/message strings (which match what AbortSignal/fetch throw).
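The abortable listener-based helper pattern looks roughly like this. `SocketLike` models the subset of WebSocket the example needs; the real waitForSocketOpen in the protocol module may differ in details such as timeout handling, which is omitted here.

```typescript
interface SocketLike {
  addEventListener(type: string, listener: () => void): void;
  removeEventListener(type: string, listener: () => void): void;
}

function waitForSocketOpen(
  socket: SocketLike,
  signal: AbortSignal,
): Promise<void> {
  return new Promise((resolve, reject) => {
    const cleanup = () => {
      socket.removeEventListener("open", onOpen);
      signal.removeEventListener("abort", onAbort);
    };
    const onOpen = () => { cleanup(); resolve(); };
    const onAbort = () => {
      cleanup();
      // Matches the name/message shape AbortSignal-aware APIs throw.
      reject(new DOMException("Dictation start aborted", "AbortError"));
    };
    if (signal.aborted) { onAbort(); return; } // already-aborted fast path
    socket.addEventListener("open", onOpen);
    signal.addEventListener("abort", onAbort);
  });
}
```

Both outcomes run the same cleanup, so neither the socket nor the signal retains a listener after settlement, and the caller can distinguish a user cancel from a real failure by `error.name === "AbortError"`.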
Adds temporary diagnostic logging to pin down the "loading appears then dictation doesn't start" report.
Logs:
- toggleDictation entry, with current isDictating/isDictationStarting
- startDictation entry, with all guards + isMounted state
- startDictation guard early-returns (with reason)
- startDictation entered start phase, with sessionAbort.signal.aborted
- After getUserMedia: isMounted, signalAborted
- After ensureAudioContextReady: contextState, isMounted, signalAborted
- startDictation success → session_ready
- startDictation CAUGHT error: name, message, signalAborted, isMounted, stack
- stopDictation entry, with hasLiveSession / hasSessionAbort / stack
- stopDictation branch taken: user_stop / in-flight abort / fall-through
- Unmount-effect SETUP and CLEANUP

Reproduce the bug, copy the console output, and the path will be
obvious. Revert this commit once root-caused.
Whenever a fresh enumeration produces a non-empty device list and
the persisted selectedAudioInputDeviceId isn't in it, drop the
selection. This catches the common case where the user has a
Bluetooth / USB mic stored as their preferred input, disconnects it,
then tries to dictate — getUserMedia would otherwise throw
OverconstrainedError (which has empty `.message` on most browsers,
so the user sees the spinner appear and disappear with no visible
error).

The cleanup hooks the existing enumeration paths: initial load,
devicechange listener (commit 6c8f0a9), and the manual refresh
button. Guarded on audioInputDevices.length > 0 so a pre-permission
enumeration (some browsers return an empty list) doesn't wipe a
still-valid selection.

Diagnosed via diagnostic logging that surfaced the
OverconstrainedError as the actual error being silently caught.
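The cleanup rule reduces to a pure reconciliation function: given a fresh enumeration and the persisted selection, decide what to keep. Names here are illustrative; the guard on an empty list is the pre-permission case described above.

```typescript
interface AudioInputDevice { deviceId: string; label: string }

function reconcileSelectedDeviceId(
  devices: AudioInputDevice[],
  selectedDeviceId: string | null,
): string | null {
  if (selectedDeviceId === null) return null;
  if (devices.length === 0) return selectedDeviceId; // pre-permission: keep
  const stillPresent = devices.some((d) => d.deviceId === selectedDeviceId);
  return stillPresent ? selectedDeviceId : null; // stale: drop the selection
}
```

Running this on every enumeration path (initial load, devicechange, manual refresh) means a disconnected Bluetooth mic can no longer leave a stored deviceId that getUserMedia will reject.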
When `getUserMedia` rejects because the persisted deviceId is no
longer available (Bluetooth disconnect mid-flight, USB unplug,
browser deviceId rotation between enumeration and the actual call),
catch OverconstrainedError, retry once with the system-default mic,
and clear the stale stored selection so the dropdown reflects
reality and subsequent sessions don't keep tripping the same error.

Belt-and-braces alongside the auto-clear-on-enumerate cleanup in
useAudioInputDevicePreference: the enumeration cleanup catches the
common case before the user clicks; this catches the
enumeration-and-then-immediately-disconnected race window.

Preview hook (MicTestPanel) intentionally doesn't fall back — the
whole point of that hook is to test the selected device, so it
surfaces the existing "selected mic not available, refresh" message
instead of silently testing a different mic.
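The one-shot fallback can be sketched with injected dependencies: `getMedia` stands in for navigator.mediaDevices.getUserMedia and `onStaleSelection` for the store write that clears the persisted deviceId, both assumptions made so the retry logic is testable in isolation.

```typescript
async function getMicStreamWithFallback<T>(
  getMedia: (constraints: { audio: { deviceId?: { exact: string } } }) => Promise<T>,
  selectedDeviceId: string | null,
  onStaleSelection: () => void,
): Promise<T> {
  const constraints = selectedDeviceId
    ? { audio: { deviceId: { exact: selectedDeviceId } } }
    : { audio: {} };
  try {
    return await getMedia(constraints);
  } catch (error) {
    const isOverconstrained =
      error instanceof Error && error.name === "OverconstrainedError";
    if (!isOverconstrained || !selectedDeviceId) throw error; // real failures propagate
    onStaleSelection();             // clear the stale stored selection
    return getMedia({ audio: {} }); // retry once with the system default
  }
}
```

Only OverconstrainedError with an exact-deviceId constraint triggers the retry; permission denials and in-use errors still surface to the user, and a second failure is not retried.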
Stale-deviceId / OverconstrainedError root cause confirmed and fixed
in 2d7be12 + 83aa240. The temporary [dictation]-prefixed
console.log statements added in f440c79 (plus one carried into
83aa240) are no longer needed; drop them. No behavior change.
Two kinds of drift, no new strings, no translations needed:

1. `#:` reference comments now point at audio-dictation-protocol.ts
   for the socket-related messages that moved there in 50841b1.
2. "Could not complete audio dictation." is marked obsolete (#~) —
   it was the MediaRecorder.onerror message, removed in ca41cba.
   The completion path uses "Failed to complete audio dictation."
   now (also exists in the catalog, with the same translations).
Whitespace and line-break drift that accumulated across the recent
refactors. No behavior change.
@bdart bdart merged commit 3d5f196 into main May 12, 2026
50 of 51 checks passed
@bdart bdart deleted the feat/audio-input-test-panel branch May 12, 2026 15:21