Add microphone test panel to audio settings #668
Merged
Conversation
- Share selected audio input via Zustand store so the dialog and long-lived recorders see writes immediately (migrates from legacy localStorage key).
- New useAudioInputLevelPreview hook (analyser + RAF, no transcription) reusing the recorder's bar extractor.
- New MicTestPanel reuses the Waveform primitive and surfaces the active track label + deviceId-mismatch warning.
- Swap the bare <select> for the project-standard DropdownMenu and drop the redundant "Audio input" sub-label to match the Appearance tab pattern.
Build the AudioContext / analyser / ScriptProcessor pipeline right after getUserMedia and pre-buffer samples in preSessionSamplesRef until the live session is ready, then drain in order. Pass the context's actual sample rate into startLiveDictationSession so resampling uses the right ratio from the first sample. Clear the pre-buffer in stopRecordingVisualizer and the unmount path. Track the in-flight WebSocket in pendingSocketRef so stopDictation can cancel a startup mid-handshake; toggleDictation now treats isDictationStarting as cancellable. ChatInput renders WaveformButton during isDictationStarting too, dimmed via a new isBuffering prop on WaveformButton to signal "capturing but not yet transmitting".
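A minimal sketch of the pre-buffer/drain shape described above (the class and sink callback are illustrative; the hook keeps this state in refs such as preSessionSamplesRef rather than a class):

```typescript
type SampleSink = (samples: Float32Array) => void;

class PreBufferingPipeline {
  private preSessionSamples: Float32Array[] = [];
  private sink: SampleSink | null = null;

  // Called from the audio pipeline for every captured chunk.
  push(samples: Float32Array): void {
    if (this.sink) {
      this.sink(samples);
      return;
    }
    this.preSessionSamples.push(samples);
  }

  // Called once the live session is ready: drain the buffered chunks
  // in arrival order, then stream directly from here on.
  attach(sink: SampleSink): void {
    for (const chunk of this.preSessionSamples) sink(chunk);
    this.preSessionSamples = [];
    this.sink = sink;
  }

  // Called from the stop/unmount path so a cancelled start does not
  // leak stale audio into the next session.
  clear(): void {
    this.preSessionSamples = [];
    this.sink = null;
  }
}
```

The ordering guarantee is the whole point: samples captured during the handshake land in the first PCM chunk, ahead of anything captured after the session resolves.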
Locks down the early-speech race fix: emit samples after the audio pipeline is built but before session_state arrives, then verify they are transmitted as part of the first PCM chunk once the session resolves. Catches any future regression that re-orders the pipeline construction below the socket handshake or skips the pre-buffer drain.
ScriptProcessorNode runs its callback on the main thread, which lets heavy renders, GC pauses, or long tasks starve the audio path. Replace it with an AudioWorkletNode whose processor lives in a small ESM worklet file and runs on the audio rendering thread. The processor batches render-quantum samples into 4096-sample frames and posts each frame to the main thread via MessagePort, matching the legacy chunk cadence so downstream resampling and PCM transport stay unchanged. The worklet is bundled via `?worker&url` so Vite emits it as a real JS asset rather than the broken `data:video/mp2t` URL produced by `?url` on a .ts file. Drops the now-unneeded zero-gain processorSink that the old ScriptProcessor required to stay live. Updates the recorder test mocks to MockAudioWorkletNode + MockMessagePort.
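The batching core of such a processor can be sketched as a pure accumulator (the class name and FRAME_SIZE constant are illustrative; inside the real worklet each completed frame would be posted over the MessagePort instead of returned):

```typescript
const FRAME_SIZE = 4096; // matches the legacy ScriptProcessor chunk size

class QuantumBatcher {
  private frame = new Float32Array(FRAME_SIZE);
  private filled = 0;

  // Accumulate one 128-sample render quantum; return any frames that
  // completed as a result. In the worklet this runs on the audio
  // rendering thread, immune to main-thread stalls.
  push(quantum: Float32Array): Float32Array[] {
    const completed: Float32Array[] = [];
    let offset = 0;
    while (offset < quantum.length) {
      const take = Math.min(FRAME_SIZE - this.filled, quantum.length - offset);
      this.frame.set(quantum.subarray(offset, offset + take), this.filled);
      this.filled += take;
      offset += take;
      if (this.filled === FRAME_SIZE) {
        completed.push(this.frame);
        this.frame = new Float32Array(FRAME_SIZE);
        this.filled = 0;
      }
    }
    return completed;
  }
}
```

Because 4096 is a multiple of the 128-sample render quantum, exactly 32 quanta fill one frame and the downstream cadence is unchanged.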
isDictationStarting flips at the click, but the worklet only enters the audio graph ~250-300 ms later, after await getUserMedia and await audioWorklet.addModule resolve. Showing the waveform on isDictationStarting let users start speaking into a mic the pipeline hadn't tapped yet. Add isCapturingAudio to the dictation hook, flipped true right after source.connect(processor) and false in stopRecordingVisualizer. ChatInput renders the WaveformButton only when isCapturingAudio or isDictating, and keeps the spinner during the pre-capture window — so the bars now mean "go ahead, captured even if not yet transmitted" rather than "we clicked the button".
isCapturingAudio still flipped too early: source.connect(processor) returns instantly, but the OS / mic stack may take hundreds of milliseconds (Bluetooth, Safari, external mics) before samples flow. The user saw bars, started speaking, and the analyser stayed flat until samples finally arrived — losing the first words. Move the isCapturingAudio flip into the worklet's port.onmessage handler, gated by a per-session firstFrameSeen flag. That message only fires after the worklet has buffered 4096 real samples from the source, so bars-visible now means audio is genuinely flowing. Also resume() the AudioContext after creation: if the click's user activation has lapsed across the getUserMedia + addModule awaits, the new context starts suspended and no audio reaches any node. resume() is a no-op when the context is already running. Adds the matching resume() to MockAudioContext in the test file.
Last fix flipped "capturing" on the worklet's first port.onmessage, but that message was firing on a frame of silence. Per W3C Web Audio §1.32 and Mozilla bug 1629478, MediaStreamAudioSourceNode emits zero-filled render quanta during OS / driver warm-up — up to ~70 quanta on Firefox, hundreds of milliseconds on Bluetooth / Safari / external mics. So bars appeared, user spoke, and real samples landed later. Match the OpenAI wavtools / LiveKit / hark idiom: gate worklet posts on a foundAudio flag that flips on the first non-zero sample. The worklet drops zero-only frames entirely and starts posting at the first energy-bearing buffer, sliced from the first non-zero sample. The main thread's existing firstFrameSeen flip is now triggered by real audio, so the bars only appear when speech can actually be captured. Also await audioContext.resume() instead of fire-and-forget; Mozilla measured 7x fewer empty quanta with an explicit awaited resume. Re-checks isMountedRef after the new await to keep the unmount teardown symmetric. track.muted / onunmute were considered but rejected — Firefox keeps muted=false for mic tracks (bug 1739163), WebKit drives muted from AudioUnit suspension not first-sample arrival, and no production audio library (LiveKit, wavtools, hark, vad-web, web-audio-analyser) uses them as the readiness signal.
The previous worklet trim (drop frames until first non-zero) was the wavtools push-to-talk pattern, applied incorrectly as a session-wide silence stripper. OpenAI's stack tolerates that only because the Realtime API's prefix_padding_ms (default 300 ms) adds the padding back server-side; our server has no equivalent. Without leading silence, the VAD has no calibration window and clips the first word. Worklet now always batches and posts every render quantum — including the zero-filled samples MediaStreamAudioSourceNode emits during OS warm-up. Non-zero detection moved to the main-thread port.onmessage handler, where it still drives isCapturingAudio so the bars only appear when real audio is flowing — the UX cue the user wanted is preserved, but the data on the wire now matches what production STT clients (Deepgram, AssemblyAI, OpenAI Realtime reference clients) send: full stream from getUserMedia onward. Evidence: OpenAI prefix_padding_ms default 300 ms, Silero VAD speech_pad_ms, sherpa-onnx #3035 pre-speech padding fix, DeepSpeech #2443 (5 ms silence prepend fixes first word), Wispr Flow "Fix Missing First Words" guide. Every production VAD documents the same mitigation.
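The main-thread handler shape after this move, sketched with assumed names (createFrameHandler is illustrative; the firstFrameSeen flag matches the description above):

```typescript
function createFrameHandler(
  sendToTransport: (frame: Float32Array) => void,
  onFirstAudio: () => void, // flips isCapturingAudio in the hook
) {
  let firstFrameSeen = false;
  return (frame: Float32Array) => {
    // Every frame goes on the wire untouched, warm-up zeros included,
    // so the server VAD keeps its calibration window.
    sendToTransport(frame);
    // The UI cue still waits for real energy.
    if (!firstFrameSeen && frame.some((sample) => sample !== 0)) {
      firstFrameSeen = true;
      onFirstAudio();
    }
  };
}
```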
Cold first dictation got hundreds of milliseconds of OS warm-up zeros for free, doubling as the VAD calibration window. Warm second dictation found the OS audio device still hot — real samples arrived almost instantly — and the wire format started with speech, letting the server VAD clip the first word. Seed preSessionSamplesRef with PRE_SPEECH_SILENCE_PRIMER_MS (300 ms) of zero samples at the AudioContext rate immediately after the audio pipeline is wired, so every session ships the same calibration window to the server regardless of OS device state. Matches the prefix_padding_ms = 300 default in the OpenAI Realtime VAD guide; Silero VAD speech_pad_ms and sherpa-onnx pre-speech padding land in the same range. Cost ~10 KB of PCM per session; user-visible spinner and bars behavior unchanged.
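The primer is just zero-filled PCM at the context rate; a sketch (the function name is assumed, the constant comes from the description):

```typescript
const PRE_SPEECH_SILENCE_PRIMER_MS = 300;

// Build the calibration window seeded into preSessionSamplesRef right
// after the audio pipeline is wired. Float32Array is zero-filled by
// construction, so this is silence at the context's sample rate.
function createSilencePrimer(sampleRateHz: number): Float32Array {
  const sampleCount = Math.round(
    (sampleRateHz * PRE_SPEECH_SILENCE_PRIMER_MS) / 1000,
  );
  return new Float32Array(sampleCount);
}
```

At a 48 kHz context rate this is 14,400 samples; after the 16 kHz downsample it is 4,800 16-bit samples, roughly 9.6 KB of PCM, matching the ~10 KB cost noted above.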
Consolidate the duplicate "Audio input is not supported in this
browser." onto the existing "Audio recording is not supported..."
key already translated in every locale.
Translate the 13 genuinely new strings introduced by the mic test
panel, dictation early-capture UX, and preview-hook error paths:
- preferences.dialog.audio.test.{description,heading,listening,mismatch,start,stop}
- "Audio analysis is not supported in this browser."
- "Cancel starting dictation"
- "Capturing audio — waiting for transcription to start"
- "Could not start the microphone test."
- "Microphone permission denied. Allow access to test the microphone."
- "The microphone is already in use by another application or recording."
- "The selected microphone is not available. Refresh the device list
and try again."
`test.heading` and `test.start` are kept as separate keys (despite
identical English "Test microphone") so translators can pick a
nominal form for the section heading and an imperative form for the
button label where the language warrants it; today they're identical
across de/es/fr/pl.
On a fully-warm OS audio device the worklet's first non-zero frame can arrive within tens of milliseconds of source.connect(processor), fast enough that the spinner feels like a blink and the user starts speaking before they've finished the priming breath. Belt-and-braces alongside the existing PRE_SPEECH_SILENCE_PRIMER_MS: record the time we wire the audio graph, and when the first non-zero frame arrives, defer the isCapturingAudio flip until at least MIN_AUDIO_CAPTURE_DELAY_MS (150 ms) has elapsed. A new capturingFlipTimerRef tracks the deferred flip and is cleared in stopRecordingVisualizer so a mid-flight cancel doesn't leave a stale timer firing after teardown. The flipCapturingAudio closure also re-checks audioProcessorRef.current === processor before firing so a late timeout can't promote a session that's already been torn down.
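The deferred flip can be sketched as a pure function (names follow the description; the injected timer is for illustration — the hook uses setTimeout and stores the handle in capturingFlipTimerRef):

```typescript
const MIN_AUDIO_CAPTURE_DELAY_MS = 150;

// graphWiredAtMs: when the audio graph was connected.
// nowMs: when the first non-zero frame arrived.
// Returns a timer handle when the flip was deferred, null when it ran
// immediately; the caller clears the handle in stopRecordingVisualizer.
function scheduleCapturingFlip(
  graphWiredAtMs: number,
  nowMs: number,
  flip: () => void,
  setTimer: (cb: () => void, delayMs: number) => unknown,
): unknown | null {
  const elapsed = nowMs - graphWiredAtMs;
  if (elapsed >= MIN_AUDIO_CAPTURE_DELAY_MS) {
    flip(); // cold device: the warm-up already covered the minimum
    return null;
  }
  // Warm device: defer so the spinner is visible long enough to read.
  return setTimer(flip, MIN_AUDIO_CAPTURE_DELAY_MS - elapsed);
}
```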
We never consumed the encoded MediaRecorder output — no ondataavailable listener was ever set. Its only purpose was to give us an onstop event and a state === "recording" flag. Both are trivially replaceable with our own callable and the existing liveSessionRef. Extract the former mediaRecorder.onstop body into a completeDictation useCallback that runs the final-flush → finish → await-completed → cleanup chain. stopDictation now calls it directly when liveSessionRef.current is set (a precise "we are recording now" signal, set in the same synchronous tick as setIsDictating(true), and therefore more reliable than reading React state within that tick). Eliminates a documented class of head/tail glitch bugs (Mozilla 1539186, w3c/mediacapture-record #178) and matches the production STT consensus: AssemblyAI, OpenAI Realtime, LiveKit transport, Vapi, and Deepgram browser samples all use AudioWorklet alone, never alongside MediaRecorder on the same getUserMedia stream. Drops MockMediaRecorder and its assertions from the recorder test.
Every dictation used to create a brand-new AudioContext and close it on stop. Each session paid the audio-thread spin-up + worklet addModule fetch/parse cost, even when the OS audio device was warm — which is the most likely cause of the warm-second-session first-word loss: getUserMedia resolved in ~10 ms but the AudioContext itself was cold, and the worklet's first frames took tens of milliseconds to arrive while the user was already speaking. Add ensureAudioContextReady: reuses the existing AudioContext when present, only constructs a fresh one when the previous was closed, calls audioWorklet.addModule once per context lifetime (tracked via workletModuleLoadedRef), and resumes the context (no-op when running). stopRecordingVisualizer now suspends rather than closes, keeping the audio thread warm between sessions. close + null happen only on hook unmount. source / analyser / AudioWorkletNode are still created fresh per session so leftover per-session state cannot leak. The expensive parts — audio thread, worklet registration — persist; the per-session graph wiring is what gets rebuilt. Adds a suspend mock to MockAudioContext to keep the recorder tests passing.
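The reuse logic, sketched against a trimmed interface (the real helper works with AudioContext and ctx.audioWorklet.addModule directly; the ref shapes mirror workletModuleLoadedRef from the description):

```typescript
interface WorkletContext {
  state: "suspended" | "running" | "closed";
  resume(): Promise<void>;
  addModule(url: string): Promise<void>; // stands in for ctx.audioWorklet.addModule
}

async function ensureAudioContextReady(
  contextRef: { current: WorkletContext | null },
  workletModuleLoadedRef: { current: boolean },
  create: () => WorkletContext,
  workletUrl: string,
): Promise<WorkletContext> {
  // Reuse the warm context; only rebuild after an explicit close.
  if (!contextRef.current || contextRef.current.state === "closed") {
    contextRef.current = create();
    workletModuleLoadedRef.current = false;
  }
  // addModule once per context lifetime.
  if (!workletModuleLoadedRef.current) {
    await contextRef.current.addModule(workletUrl);
    workletModuleLoadedRef.current = true;
  }
  // No-op when the context is already running.
  await contextRef.current.resume();
  return contextRef.current;
}
```

With stopRecordingVisualizer suspending instead of closing, the second call reuses the same context and skips the addModule fetch/parse entirely.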
When AudioContext.sampleRate differs from the input MediaStreamTrack's sample rate, createMediaStreamSource inserts a Web Audio internal resampler with documented startup latency — one more contributor to the warm-second-session first-sample gap. Read trackSettings.sampleRate after getUserMedia and pass it to ensureAudioContextReady, which forwards it to the AudioContext constructor on first creation. Falls back to the default rate if the browser rejects the requested value. Persisted contexts keep their original rate even if a subsequent track reports a different one — that tradeoff is right: the warm audio thread is a bigger win than avoiding the resampler when the device changes, which is rare. The worklet's PCM resampler still handles the 16 kHz canonical conversion regardless.
Drops the bespoke lazyLocalStorage PersistStorage in favour of createJSONStorage, but keeps a tiny wrapper object whose methods re-resolve the localStorage global on every call. createJSONStorage caches its factory's return value once at create time, so handing it localStorage directly would freeze a reference that vi.stubGlobal can't replace in tests. createJSONStorage still owns the JSON wrapping, which is the part of the original wrapper that was re-inventing zustand's wheel.
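A sketch of the wrapper (the shape is assumed; the point is that each method resolves the global at call time, not once at store creation):

```typescript
type WebStorageLike = {
  getItem(name: string): string | null;
  setItem(name: string, value: string): void;
  removeItem(name: string): void;
};

// Re-resolve globalThis.localStorage on every call so a test can stub
// the global after the store already exists.
const resolveStorage = (): WebStorageLike =>
  (globalThis as unknown as { localStorage: WebStorageLike }).localStorage;

const lazyStateStorage: WebStorageLike = {
  getItem: (name) => resolveStorage().getItem(name),
  setItem: (name, value) => resolveStorage().setItem(name, value),
  removeItem: (name) => resolveStorage().removeItem(name),
};
```

This would be handed to zustand's persist middleware as createJSONStorage(() => lazyStateStorage): createJSONStorage caches the factory's return value once, but the wrapper itself stays lazy, and createJSONStorage keeps owning the JSON wrapping.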
Verified empirically that Vite 6 inlines TS files referenced via new URL(..., import.meta.url) as data:video/mp2t;base64,... URLs in production builds (the .ts MIME being MPEG Transport Stream video, not TypeScript), which the browser refuses to load via audioWorklet.addModule. The ?worker&url query is the working pattern for our TS worklet; expand the comment to explain why the platform-canonical Vite recipe doesn't apply here, so a future reviewer doesn't try the same swap again.
The audio-input dropdown trigger in UserPreferencesDialog rendered an inline <svg> chevron — out of step with every other chevron in the app, which goes through iconoir-react via the icons module wrapper (NavArrowRight → ChevronRightIcon). Re-export NavArrowDown as ChevronDownIcon and use it. Same rotation transform on dropdown open, sized to match the rest of the trigger.
Replaces the manual `isMountedRef = useRef(true)` + mount/unmount effect pattern in both audio hooks with react-use's useMountedState, which returns a stable getter that's true while the component is mounted. Drops the eslint-disable-next-line @typescript-eslint/no-unnecessary-condition comment that was defeating TS narrowing of the ref across awaits — the function call isn't subject to that narrowing in the first place. The mount-effect that flipped the ref disappears entirely. Adds isMounted to the deps arrays of the few useCallbacks / useEffects that read it.
Previously the dropdown only listed devices known at mount time; plugging in a USB / Bluetooth mic mid-session left it stale until the user manually clicked "Refresh devices". Subscribe to mediaDevices.devicechange and re-enumerate on the event. The manual refresh button still has a job: devicechange does NOT fire when labels become available after a permission grant (because the device list itself didn't change), only on add/remove. react-use ships useMediaDevices which does the same subscription internally, but it doesn't expose a refresh callable — needed for the permission-grant-then-reload case — so we keep the existing imperative refresh and add only the missing reactivity.
The analyser RAF tick fires at ~60 Hz, so the bar array was being written to React state 60 times per second. Halve consumer re-renders with useThrottledCallback at 33 ms, no perceptible UX change — the analyser still samples at full RAF rate, only the React updates are throttled. Cancel any pending throttled update in stopRecordingVisualizer before resetting bars to the idle pattern, so a late tick can't briefly flash old levels. The reset / idle / fallback codepaths still call the raw setDictationBars so they apply immediately.
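The throttle semantics, sketched with injected timers (the real code uses useThrottledCallback; this only illustrates the 33 ms trailing window and the explicit cancel used before resetting to idle bars):

```typescript
function createThrottled<T>(
  apply: (value: T) => void,
  intervalMs: number,
  setTimer: (cb: () => void, ms: number) => unknown,
  clearTimer: (handle: unknown) => void,
  now: () => number,
) {
  let lastRun = -Infinity;
  let pending: unknown | null = null;
  let latest!: T;
  const run = () => {
    pending = null;
    lastRun = now();
    apply(latest);
  };
  return {
    call(value: T) {
      latest = value; // latest value wins when the trailing tick fires
      if (pending) return;
      const elapsed = now() - lastRun;
      if (elapsed >= intervalMs) run();
      else pending = setTimer(run, intervalMs - elapsed);
    },
    // Clears a queued trailing tick so a late update cannot flash old
    // levels after the bars were reset to the idle pattern.
    cancel() {
      if (pending !== null) {
        clearTimer(pending);
        pending = null;
      }
    },
  };
}
```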
MediaRecorder was removed from the dictation flow in ca41cba; the function's name has been lying about what it does ever since. It stops the MediaStream tracks AND calls stopRecordingVisualizer (which disconnects the worklet/analyser/source and suspends the AudioContext) — i.e. it tears down the entire audio capture graph. The new name says so. Scope: dictation hook only. useAudioTranscriptionRecorder still uses MediaRecorder, so its same-named function keeps its old name there.
Pure code, no React, no DOM event listeners — directly unit-testable in isolation. Moves out of useAudioDictationRecorder:
- CANONICAL_AUDIO_SAMPLE_RATE_HZ / WAV_HEADER_BYTES / BYTES_PER_SAMPLE
- AUDIO_BARS_COUNT
- AudioDictationDiagnostics type
- writeAscii (internal)
- createCanonicalWavBytesFromPcm
- resampleMonoFloat32ToPcm16
- mediaTrackSettingsToDiagnostics
- getAudioLevelBarsFromTimeDomainData
The recorder, preview hook, and dictation test re-import from the new module. ~120 lines removed from the recorder file with zero behavior change.
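As an example of what moves, a hedged sketch of resampleMonoFloat32ToPcm16 using linear interpolation — the extracted module's exact algorithm may differ, this is only the general shape of such a helper:

```typescript
// Convert mono float samples in [-1, 1] at inputRateHz to 16-bit PCM
// at outputRateHz (e.g. the 16 kHz canonical rate).
function resampleMonoFloat32ToPcm16(
  input: Float32Array,
  inputRateHz: number,
  outputRateHz: number,
): Int16Array {
  const outputLength = Math.floor((input.length * outputRateHz) / inputRateHz);
  const output = new Int16Array(outputLength);
  for (let i = 0; i < outputLength; i++) {
    const pos = (i * inputRateHz) / outputRateHz;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    // Linear interpolation between the two nearest input samples.
    const sample = input[i0] + (input[i1] - input[i0]) * (pos - i0);
    const clamped = Math.max(-1, Math.min(1, sample));
    output[i] = Math.round(clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff);
  }
  return output;
}
```

Being a pure Float32Array → Int16Array function with no React or audio-graph dependencies is exactly what makes the extraction worthwhile.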
Moves out of useAudioDictationRecorder:
- AUDIO_DICTATION_SOCKET_OPEN_TIMEOUT_MS / FRAME_TIMEOUT_MS
- AudioDictationSocketFrame discriminated-union type
- createAudioDictationWebSocketUrl
- sendAudioDictationControlFrame
- waitForSocketOpen
- waitForAudioDictationFrame
These are protocol-only concerns (URL construction, JSON-frame encoding, listener-based promise helpers) with no React, no audio pipeline, and no hook state. Lives in a sibling audio-dictation-protocol module. Another ~110 lines out of the recorder. Zero behavior change.
Replaces the three independent booleans isDictating / isDictationStarting / isDictationCompleting with a useReducer over a DictationSessionStatus tagged union: idle | starting | dictating | completing The reducer encodes the four valid transitions explicitly (start, session_ready, user_stop, complete) plus an abort that resets to idle for error paths. Impossible combinations like "starting && completing" are now statically excluded by the type — previously the code defended against them with runtime checks and a long-comment "trust liveSessionRef.current, not React state" workaround. The hook's public return is unchanged: the three booleans are derived from sessionStatus. Callers (ChatInput, etc.) continue to work without modification. `isCapturingAudio` stays as its own useState — it's an orthogonal audio-graph signal (flips on the worklet's first non-zero frame), not part of the session lifecycle. `startInFlightRef` stays too — it's the synchronous re-entry guard that catches the gap between dispatch and the next render's state-read; the reducer doesn't replace it.
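The union and reducer, reconstructed from the description (action payloads assumed empty; invalid transitions are modeled here as no-ops):

```typescript
type DictationSessionStatus = "idle" | "starting" | "dictating" | "completing";

type DictationAction =
  | { type: "start" }
  | { type: "session_ready" }
  | { type: "user_stop" }
  | { type: "complete" }
  | { type: "abort" };

function dictationSessionReducer(
  status: DictationSessionStatus,
  action: DictationAction,
): DictationSessionStatus {
  switch (action.type) {
    case "start":
      return status === "idle" ? "starting" : status;
    case "session_ready":
      return status === "starting" ? "dictating" : status;
    case "user_stop":
      return status === "dictating" ? "completing" : status;
    case "complete":
      return status === "completing" ? "idle" : status;
    case "abort":
      return "idle"; // error paths reset unconditionally
  }
}
```

The hook's public booleans would then be derived, e.g. isDictating = sessionStatus === "dictating", keeping the return shape unchanged for callers.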
Threads an AbortSignal through waitForSocketOpen and waitForAudioDictationFrame so they reject promptly on abort and remove their listeners cleanly. The recorder mints an AbortController in startDictation, stores it in sessionAbortRef, and passes the signal into startLiveDictationSession. stopDictation's cancel-during-starting branch now calls controller.abort() instead of reaching into the pending socket directly; the abort listener inside the startup helper closes the WebSocket as belt-and-braces so the protocol close-handlers also fire. The unmount cleanup aborts the in-flight controller (which closes the pending socket as a side effect). The startDictation catch block now distinguishes AbortError from real failures so a user cancel doesn't surface as a "could not start" error message. AbortRejection helper centralizes the DOMException construction with an eslint-disable comment for the programmatic name/message strings (which match what AbortSignal/fetch throw).
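A hedged sketch of the abort-aware wait (the real waitForSocketOpen also races the open timeout and listens for error/close events; only the abort threading is shown):

```typescript
function waitForOpen(target: EventTarget, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(signal.reason);
      return;
    }
    // Both outcomes remove both listeners, so neither a late open nor
    // a late abort can fire against a settled promise.
    const cleanup = () => {
      target.removeEventListener("open", onOpen);
      signal.removeEventListener("abort", onAbort);
    };
    const onOpen = () => {
      cleanup();
      resolve();
    };
    const onAbort = () => {
      cleanup();
      reject(signal.reason);
    };
    target.addEventListener("open", onOpen);
    signal.addEventListener("abort", onAbort);
  });
}
```

Rejecting with signal.reason means the caller's catch block can distinguish an AbortError (user cancel) from a genuine startup failure, as the commit describes.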
To pin down the "loading appears then dictation doesn't start" report. Logs:
- toggleDictation entry, with current isDictating / isDictationStarting
- startDictation entry, with all guards + isMounted state
- startDictation guard early-returns (with reason)
- startDictation entered start phase, with sessionAbort.signal.aborted
- After getUserMedia: isMounted, signalAborted
- After ensureAudioContextReady: contextState, isMounted, signalAborted
- startDictation success → session_ready
- startDictation CAUGHT error: name, message, signalAborted, isMounted, stack
- stopDictation entry, with hasLiveSession / hasSessionAbort / stack
- stopDictation branch taken: user_stop / in-flight abort / fall-through
- Unmount-effect SETUP and CLEANUP
Reproduce the bug, copy the console output, and the path will be obvious. Revert this commit once root-caused.
Whenever a fresh enumeration produces a non-empty device list and the persisted selectedAudioInputDeviceId isn't in it, drop the selection. This catches the common case where the user has a Bluetooth / USB mic stored as their preferred input, disconnects it, then tries to dictate — getUserMedia would otherwise throw OverconstrainedError (which has empty `.message` on most browsers, so the user sees the spinner appear and disappear with no visible error). The cleanup hooks the existing enumeration paths: initial load, devicechange listener (commit 6c8f0a9), and the manual refresh button. Guarded on audioInputDevices.length > 0 so a pre-permission enumeration (some browsers return an empty list) doesn't wipe a still-valid selection. Diagnosed via diagnostic logging that surfaced the OverconstrainedError as the actual error being silently caught.
When `getUserMedia` rejects because the persisted deviceId is no longer available (Bluetooth disconnect mid-flight, USB unplug, browser deviceId rotation between enumeration and the actual call), catch OverconstrainedError, retry once with the system-default mic, and clear the stale stored selection so the dropdown reflects reality and subsequent sessions don't keep tripping the same error. Belt-and-braces alongside the auto-clear-on-enumerate cleanup in useAudioInputDevicePreference: the enumeration cleanup catches the common case before the user clicks; this catches the enumeration-and-then-immediately-disconnected race window. Preview hook (MicTestPanel) intentionally doesn't fall back — the whole point of that hook is to test the selected device, so it surfaces the existing "selected mic not available, refresh" message instead of silently testing a different mic.
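The retry shape, sketched with an injected getMedia standing in for navigator.mediaDevices.getUserMedia (error detection simplified to the name check):

```typescript
type AudioConstraints = { audio: boolean | { deviceId: { exact: string } } };

async function getMicStreamWithFallback<TStream>(
  getMedia: (constraints: AudioConstraints) => Promise<TStream>,
  selectedDeviceId: string | null,
  clearStoredSelection: () => void,
): Promise<TStream> {
  if (!selectedDeviceId) return getMedia({ audio: true });
  try {
    return await getMedia({ audio: { deviceId: { exact: selectedDeviceId } } });
  } catch (error) {
    // Only the stale-device case falls back; everything else surfaces.
    if ((error as { name?: string }).name !== "OverconstrainedError") throw error;
    clearStoredSelection(); // dropdown reflects reality on next render
    return getMedia({ audio: true }); // retry once with the system default
  }
}
```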
Two kinds of drift, no new strings, no translations needed: 1. `#:` reference comments now point at audio-dictation-protocol.ts for the socket-related messages that moved there in 50841b1. 2. "Could not complete audio dictation." is marked obsolete (#~) — it was the MediaRecorder.onerror message, removed in ca41cba. The completion path uses "Failed to complete audio dictation." now (also exists in the catalog, with the same translations).
Whitespace and line-break drift that accumulated across the recent refactors. No behavior change.
This pull request enhances the audio input selection and testing experience in the user preferences dialog, introduces a microphone test panel, and improves the dictation UI for clearer feedback during audio capture. It also updates related tests and underlying audio handling logic to support these features.
User Preferences Dialog & Audio Input Selection:
- Replaces the bare <select> with a custom DropdownMenu for better usability and accessibility, including improved labeling and selection handling. [1] [2]
- Adds a MicTestPanel component, allowing users to test their selected microphone directly within the preferences dialog, with real-time waveform and device feedback. [1] [2]

Dictation and Audio UI Improvements:
- Adds an isCapturingAudio prop to the ChatInput component and adjusts dictation state handling for improved accuracy in UI state. [1] [2]

Testing and Audio Infrastructure:
- Uses AudioWorkletNode instead of the deprecated ScriptProcessorNode, aligning with modern browser APIs and improving test reliability. [1] [2] [3] [4]

General Codebase Improvements:
These changes collectively provide a more intuitive and robust experience for users configuring and testing their microphone, while also bringing the codebase up to date with current web audio best practices.