fix(huddle): Pocket TTS quality overhaul — reference parity + cross-message pipelining #997
sherpa-onnx's ScaleSilence is not pre/post padding control: it finds every interior silence run >= 0.2s and multiplies its length by the scale. At 0.0 every natural pause — clause breaks, breaths, the gap after a comma — was removed entirely, slamming words together and cutting endings abruptly. The reference Pocket TTS pipeline does not post-process silence at all; 1.0 is the identity and restores parity. Co-authored-by: Tyler Longwell <tlongwell@squareup.com> Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
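The semantics described in this commit can be sketched as follows. This is a hypothetical model, not sherpa-onnx's implementation: the function name, the `threshold` parameter, and the exact rounding are assumptions made only to show why `1.0` is the identity and `0.0` deletes every interior pause.

```rust
/// Hypothetical model of the behavior described above (not sherpa-onnx code):
/// every interior run of near-silent samples lasting at least `min_run_secs`
/// has its length multiplied by `scale`.
pub fn scale_interior_silence(
    samples: &[f32],
    sample_rate: usize,
    threshold: f32,    // assumed: magnitude below which a sample counts as silent
    min_run_secs: f32, // 0.2 in the commit message
    scale: f32,
) -> Vec<f32> {
    let min_run = (min_run_secs * sample_rate as f32).round() as usize;
    let mut out = Vec::with_capacity(samples.len());
    let mut i = 0;
    while i < samples.len() {
        if samples[i].abs() >= threshold {
            out.push(samples[i]);
            i += 1;
            continue;
        }
        // Measure the silent run that starts at `i`.
        let start = i;
        while i < samples.len() && samples[i].abs() < threshold {
            i += 1;
        }
        let run = &samples[start..i];
        // Leading and trailing silence is not "interior" and is left alone.
        let interior = start > 0 && i < samples.len();
        if interior && run.len() >= min_run {
            let keep = (run.len() as f32 * scale).round() as usize;
            if keep <= run.len() {
                // scale <= 1.0: keep only part of the pause (none at 0.0).
                out.extend_from_slice(&run[..keep]);
            } else {
                // scale > 1.0: lengthen the pause with zeros.
                out.extend_from_slice(run);
                out.extend(std::iter::repeat(0.0_f32).take(keep - run.len()));
            }
        } else {
            out.extend_from_slice(run);
        }
    }
    out
}
```

With a scale of `1.0` every run is copied whole, so the output equals the input, which is the parity argument the commit makes.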
…aratus

The FlowLM cold-start smear on short utterances is a real, open upstream bug (kyutai-labs/pocket-tts #91, #70), but upstream's only mitigation is the 8-space pad, which we keep. Our extra layer prepended a sacrificial ". . " phantom utterance and then trimmed the rendered audio back out by scanning for a silence gap with an absolute amplitude threshold (0.02 against raw peaks of ~0.076). That threshold sits in soft-onset territory, so the cure could eat real word starts ('I'm', fricatives, aspiration): a version of the disease it treats. Its calibration also assumed silence_scale = 0.0 audio, invalidated by the previous commit.

Under sentence-per-message delivery (#996) nearly every message is short, so the fragile path fired on almost every sentence instead of rarely.

Keep the justified upstream prep: 8-space pad, capitalization, punctuation termination, and the max_frames=100 runaway cap. Accepted cost: occasional smeared first syllable on short utterances, same as every reference user.

Also deletes examples/prod_probe.rs (existed solely to calibrate the trim) and updates the remaining probes to the production silence_scale.

Co-authored-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
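To make the "soft-onset territory" point concrete, here is a small sketch. Both helper names are hypothetical and the first-crossing scan is a simplification of the deleted trim, which scanned for a silence gap. It only shows where an absolute 0.02 threshold sits relative to a ~0.076 peak.

```rust
/// Index of the first sample at or above an absolute amplitude threshold.
/// Everything before that index would be treated as trimmable "silence".
/// Simplified stand-in for the deleted trim, not its actual logic.
pub fn first_sample_above(samples: &[f32], threshold: f32) -> usize {
    samples
        .iter()
        .position(|s| s.abs() >= threshold)
        .unwrap_or(samples.len())
}

/// How far a threshold sits relative to the peak, in dB (negative = below).
pub fn threshold_relative_db(threshold: f32, peak: f32) -> f32 {
    20.0 * (threshold / peak).log10()
}
```

With the numbers from the commit message, 0.02 / 0.076 is about 0.26 of the peak, roughly 11.6 dB below it, so a gently ramping word onset can stay under the threshold for several samples and be discarded as silence.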
…ent Player

The worker loop had a per-item drain barrier: synthesize every sentence of one queued text item, then poll player.empty() before even receiving the next item. With sentence-per-message delivery (each agent message is roughly one sentence post-#996), every gap between sentences paid drain-poll + preprocess + full synthesis (~200-400ms) of dead air: a structural pause between every pair of spoken sentences.

Restructure: one Player persists for the worker's lifetime, all sentence buffers from all items append to it, and the worker goes straight back to the channel after synthesizing an item. Synthesis of message N+1 now overlaps playback of message N; rodio's queue keeps playback gapless.

Lifecycle contracts preserved:

- tts_active (STT mic gate) is set after the first real append and released only when the channel is quiet AND the player has drained (the idle branch of recv_timeout, no longer a per-item barrier).
- Barge-in cancel: Player::clear() removes queued sources but ALSO pauses the player (rodio 0.22 clear() ends with pause()). With a persistent Player the un-pause is mandatory: clear() is now followed by play(), otherwise every append after an interrupt queues silently forever.
- The macOS CoreAudio priming buffer (lazy device init truncated the first utterance) now primes the persistent Player once at startup.
- NonZero channel/rate construction moved to worker setup with graceful error returns instead of unwrap().

Adds idle-branch lifecycle tests mirroring the production logic: release-after-drain, hold-while-draining, no-op-when-nothing-queued.

Co-authored-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
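The clear-then-play contract can be illustrated with a tiny stand-in. `MockPlayer` is hypothetical and is not rodio's API; it mirrors only the one behavior the commit message describes, namely that `clear()` empties the queue and leaves playback paused.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for an audio queue (not rodio's Player).
#[derive(Default)]
pub struct MockPlayer {
    queue: VecDeque<Vec<f32>>,
    paused: bool,
}

impl MockPlayer {
    pub fn append(&mut self, buf: Vec<f32>) {
        self.queue.push_back(buf);
    }
    /// Mirrors the described behavior: drop queued audio, then pause.
    pub fn clear(&mut self) {
        self.queue.clear();
        self.paused = true;
    }
    pub fn play(&mut self) {
        self.paused = false;
    }
    pub fn empty(&self) -> bool {
        self.queue.is_empty()
    }
    /// Next buffer that would reach the speaker; None while paused or empty.
    pub fn next_audible(&mut self) -> Option<Vec<f32>> {
        if self.paused {
            None
        } else {
            self.queue.pop_front()
        }
    }
}

/// Cancel path for a player that outlives the interrupted item.
pub fn cancel_playback(player: &mut MockPlayer) {
    player.clear();
    player.play(); // without this, later appends sit in a paused queue
}
```

A per-item player never hit this edge because a fresh player started un-paused; a persistent one carries the paused state forward into every later item.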
Per-sentence peak normalization scaled each sentence so its own loudest sample hit -3 dBFS, making loudness a function of that sentence's sharpest transient: a sentence with one loud consonant got less gain than its neighbors, producing audible level jumps between consecutive sentences. The reference Pocket TTS pipeline applies no normalization at all. Replace with a fixed gain (9.3 — measured reference-voice peak ~0.076 lands at the established -3 dBFS loudness) plus the existing clamp to ±1.0 as the safety net for outlier transients. Minimal deviation from reference parity that keeps the approved output level while making loudness text-invariant. Co-authored-by: Tyler Longwell <tlongwell@squareup.com> Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
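The arithmetic behind the 9.3 figure, and the reason per-sentence normalization pumps, can be checked with a short sketch. All function names here are hypothetical; only the numbers (−3 dBFS, peak ~0.076, gain 9.3, ±1.0 clamp) come from the commit message.

```rust
/// Linear amplitude for a level in dBFS (0 dBFS == 1.0).
pub fn dbfs_to_linear(db: f32) -> f32 {
    10f32.powf(db / 20.0)
}

/// Per-sentence peak normalization: the gain depends on this sentence's
/// single loudest sample, so one sharp transient lowers the whole sentence.
pub fn peak_normalize_gain(samples: &[f32], target_dbfs: f32) -> f32 {
    let peak = samples.iter().fold(0.0_f32, |m, s| m.max(s.abs()));
    if peak == 0.0 {
        1.0
    } else {
        dbfs_to_linear(target_dbfs) / peak
    }
}

/// Fixed gain followed by a hard clamp to full scale.
pub fn apply_fixed_gain(samples: &mut [f32], gain: f32) {
    for s in samples.iter_mut() {
        *s = (*s * gain).clamp(-1.0, 1.0);
    }
}
```

−3 dBFS is about 0.708 linear, and 0.708 / 0.076 is about 9.3, which is where the constant comes from. Two sentences with the same body but peaks of 0.076 and 0.152 would receive gains of roughly 9.3 and 4.7 under peak normalization, which is the level jump the commit describes.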
tts.rs crossed the 1000-line desktop file-size budget (it was riding on a TEMP override). Move the test module to a #[path]-included sibling — the established pattern in this crate (commands/workflows_tests.rs, migration_tests.rs, mesh_llm/mod_tests.rs) — and drop the override entry. No code changes; tests are byte-identical modulo dedent. Co-authored-by: Tyler Longwell <tlongwell@squareup.com> Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
I found one blocker. The new persistent-player loop only releases tts_active in the recv-timeout arm:

    Err(mpsc::RecvTimeoutError::Timeout) => {
        if player.empty() && !first_append {
            tts_active.store(false, Ordering::Release);
            first_append = true;
        }
        continue;
    }

That leaves a race between playback drain and the next text item arriving. If item N drains, then item N+1 arrives before the 100 ms timeout fires, tts_active stays true through preprocessing and synthesis of N+1 while nothing is playing.

Minimal shape I’d expect: after receiving an item, run the same drained-player check before synthesis.

Focused verification I ran locally:

    cd desktop/src-tauri
    ../../bin/cargo test huddle::tts --lib

Result: 34 passed. CI is green, and the rest of the diff looks directionally right to me (silence parity, removing sacrificial trim, fixed gain, …).
…ined The persistent-player loop only released tts_active in the recv-timeout arm. If item N's audio drained and item N+1 arrived before the 100 ms timeout fired, the entire preprocessing + synthesis pass for N+1 ran with tts_active stuck true and nothing playing — STT discarded human speech as "echo" during a window where the agent was silent, and the remote-interrupt frame counter (gated on tts_active) was distorted. Run the same drained-player check on item receipt, before synthesis. Pipelining is unaffected: while audio is still draining, player.empty() is false and the gate holds across items. Found in PR #997 review (Max). Co-authored-by: Tyler Longwell <tlongwell@squareup.com> Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
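One way to picture the fix is a single helper called from both the recv-timeout arm and the item-receipt path. This is a sketch under assumptions: the helper name and the `queued` flag (the inverse of `first_append` in the review snippet) are hypothetical, and the real code checks `player.empty()` directly rather than taking a bool.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Releases the STT mic gate once queued audio has fully drained.
/// `queued` is true after real audio has been appended and not yet released.
/// Returns true when the gate was released by this call.
pub fn release_gate_if_drained(
    player_empty: bool,
    queued: &mut bool,
    tts_active: &AtomicBool,
) -> bool {
    if player_empty && *queued {
        tts_active.store(false, Ordering::Release);
        *queued = false;
        true
    } else {
        false
    }
}
```

Calling it on receipt keeps pipelining intact: while audio is still draining, `player_empty` is false and the gate holds across items; once the player has drained, the gate drops before the next synthesis pass starts.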
Blocker addressed in 82d23b8: the drained-player check now also runs on item receipt, after the cancel re-check and before preprocessing/synthesis, exactly the minimal shape suggested. When …
Re-reviewed. The drained-player check now runs after the post-receive cancel check and before preprocessing/synthesis, so if item N has already drained when item N+1 arrives, tts_active is released before synthesis of N+1 starts.

Focused local verification:

    cd desktop/src-tauri
    ../../bin/cargo test huddle::tts --lib

Result: 35 passed. No further code-review blockers from me. Remaining merge gate is the by-ear A/B / barge-in feel check Eva called out.
Barge-in latency was dominated by who calls clear(): only the worker loop did, and it only observes the cancel flag between sentences. While blocked inside synth_chunk (hundreds of ms of ONNX inference for a long sentence), nothing silenced the audio already playing, so an interrupt mid-sentence kept speaking until the synth call returned.

Add a barge-in monitor thread sharing the Player via Arc. Every 10 ms it checks the cancel flag; while set, it clear()+play()s the player and releases tts_active. rodio stops the in-flight source within ~5 ms (every appended source is wrapped in periodic_access(5ms)), so playing audio dies ~15 ms after the flag is set, even mid-synthesis, and the mic un-gates instantly, so speech during the barge-in isn't discarded.

The monitor deliberately does NOT consume the flag: the worker owns consumption (queue drain + lead-in reset). Re-clearing each tick until the worker catches up also covers a sentence appended in the race window after the worker's own checks.

Second guard: re-check cancel after synth_chunk returns, before append. A sentence synthesized during the interruption is dropped, not spoken over the human.

Monitor is joined on worker exit. Spawn failure degrades gracefully to the previous between-sentences behavior.

Co-authored-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
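A minimal sketch of the monitor's lifecycle, assuming a thread-safe stand-in for the player. `SharedPlayer`, the `stop` flag, and the thread name are hypothetical; the commit only says the monitor ticks every 10 ms, does not consume the cancel flag, and is joined on worker exit. This is the unlocked shape from this commit, which the later review in this thread found racy.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical thread-safe stand-in for the shared player.
#[derive(Default)]
pub struct SharedPlayer {
    queued: AtomicUsize,
    paused: AtomicBool,
}

impl SharedPlayer {
    pub fn append(&self) {
        self.queued.fetch_add(1, Ordering::AcqRel);
    }
    pub fn clear(&self) {
        self.queued.store(0, Ordering::Release);
        self.paused.store(true, Ordering::Release);
    }
    pub fn play(&self) {
        self.paused.store(false, Ordering::Release);
    }
    pub fn queued(&self) -> usize {
        self.queued.load(Ordering::Acquire)
    }
    pub fn is_paused(&self) -> bool {
        self.paused.load(Ordering::Acquire)
    }
}

/// Spawns the monitor. An Err lets the caller fall back to the old
/// between-sentences cancel behavior instead of panicking.
pub fn spawn_barge_in_monitor(
    player: Arc<SharedPlayer>,
    cancel: Arc<AtomicBool>,
    tts_active: Arc<AtomicBool>,
    stop: Arc<AtomicBool>, // assumed shutdown signal, set before join()
) -> std::io::Result<thread::JoinHandle<()>> {
    thread::Builder::new()
        .name("barge-in-monitor".into())
        .spawn(move || {
            while !stop.load(Ordering::Acquire) {
                if cancel.load(Ordering::Acquire) {
                    player.clear();
                    player.play();
                    tts_active.store(false, Ordering::Release);
                    // The flag is left set: the worker owns consumption.
                }
                thread::sleep(Duration::from_millis(10));
            }
        })
}
```

Because the flag stays set, the monitor keeps re-clearing on every tick until the worker consumes the cancel, which is what covers a sentence appended in the race window.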
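New commit. Root cause of the lag: only the worker loop called clear(), and it only observes the cancel flag between sentences.

Two changes:

- A barge-in monitor thread shares the Player via Arc and, every 10 ms while the cancel flag is set, calls clear() + play() and releases tts_active.
- The worker re-checks cancel after synth_chunk returns and before append, so a sentence synthesized during the interruption is dropped.

Monitor joins on worker exit; spawn failure degrades to the previous between-sentences behavior. 517 tests green (2 new contract tests for the monitor tick), clippy …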
Re-reviewing. The monitor does:

    if cancel.load(Ordering::Acquire) {
        player.clear();
        player.play();
        tts_active.store(false, Ordering::Release);
    }

Because the cancel load is not synchronized with worker-side cancel consumption / future appends, this can clear audio from a new post-cancel utterance:

1. The monitor loads cancel=true and is preempted.
2. The worker consumes the cancel and appends a fresh post-cancel utterance.
3. The monitor resumes its stale branch and clears the fresh audio.

This is exactly the kind of cross-item underrun barrier / unexpected playback drop we were trying to avoid; it is low-probability but real once two threads can mutate the same persistent Player.

I think the minimal safe shape is to serialize monitor clears with worker player operations and re-check cancel under the lock:

    if cancel.load(Ordering::Acquire) {
        let _guard = player_ops.lock().unwrap();
        if cancel.load(Ordering::Acquire) {
            player.clear();
            player.play();
            tts_active.store(false, Ordering::Release);
        }
    }

with worker appends and cancel consumption taking the same lock.

I did verify the rest of the shape: rodio …
Max's PR #997 review blocker: the monitor's check-then-clear was not ordered against the worker's cancel consumption. Sequence: monitor loads cancel=true → preempted → worker consumes cancel and appends a fresh post-cancel utterance → monitor resumes its stale branch and clears the fresh audio (and drops tts_active while it plays, un-gating echo).

Fix (Max's suggested shape): a player_ops mutex serializes all Player mutations (monitor clear, worker cancel/shutdown clear, worker append), and the monitor re-checks cancel while holding it. Either the clear runs before fresh audio can be appended, or it observes cancel=false and no-ops. Cancel consumption moved under the same lock so the false is visible to the monitor's under-lock re-check.

The worker's post-synth stale-sentence check now also runs under the lock together with its append, closing the symmetric window where the monitor clears between the worker's check passing and the buffer landing.

Lock is uncontended except during an actual barge-in (worker holds it only for appends/clears, never across synth), so the hot path and the ~15 ms flag-to-silence are unchanged. Lock acquisition recovers from poison (data is (), nothing inconsistent to observe), so a panicked peer can't wedge the other thread.

New regression test models the exact stale-branch interleaving; monitor-contract tests updated to the locked shape. 521 tests green.

Co-authored-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
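The locked shape can be modelled without any audio at all. In this sketch the `Shared` struct, its `queued` counter (a stand-in for the player's queue), and every function name are hypothetical; only the `player_ops` mutex, the under-lock re-check, and the poison recovery come from the commit message.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Mutex, MutexGuard};

pub struct Shared {
    pub player_ops: Mutex<()>,
    pub cancel: AtomicBool,
    pub tts_active: AtomicBool,
    pub queued: AtomicUsize, // stand-in for the player's queue length
}

/// Poison-tolerant lock: the guarded data is (), so there is no
/// inconsistent state to observe after a peer panics.
fn lock_ops(m: &Mutex<()>) -> MutexGuard<'_, ()> {
    m.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Under-lock half of the monitor tick. Returns true if it cleared.
pub fn clear_if_still_cancelled(s: &Shared) -> bool {
    let _guard = lock_ops(&s.player_ops);
    if !s.cancel.load(Ordering::Acquire) {
        return false; // stale branch: the worker already consumed the cancel
    }
    s.queued.store(0, Ordering::Release); // models clear() + play()
    s.tts_active.store(false, Ordering::Release);
    true
}

/// Full monitor tick: cheap unlocked check first, then the locked re-check.
pub fn monitor_tick(s: &Shared) -> bool {
    s.cancel.load(Ordering::Acquire) && clear_if_still_cancelled(s)
}

/// Worker cancel path: clear and consume the flag under the same lock.
pub fn worker_consume_cancel(s: &Shared) {
    let _guard = lock_ops(&s.player_ops);
    s.queued.store(0, Ordering::Release);
    s.tts_active.store(false, Ordering::Release);
    s.cancel.store(false, Ordering::Release);
}

/// Worker append: stale-sentence check and append form one critical section.
pub fn worker_append(s: &Shared) -> bool {
    let _guard = lock_ops(&s.player_ops);
    if s.cancel.load(Ordering::Acquire) {
        return false; // synthesized during an interruption: drop it
    }
    s.queued.fetch_add(1, Ordering::AcqRel);
    s.tts_active.store(true, Ordering::Release);
    true
}
```

The stale-branch interleaving from the review then plays out harmlessly: the monitor saw cancel=true earlier, the worker consumes the cancel and appends, and when the monitor finally takes the lock it observes cancel=false and leaves the fresh audio and the mic gate alone.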
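Fixed. What changed:

- A player_ops mutex serializes all Player mutations: monitor clear, worker cancel/shutdown clear, and worker append.
- The monitor re-checks cancel while holding the lock, and cancel consumption moved under the same lock.
- The worker's post-synth stale-sentence check runs under the lock together with its append.

Hot path: unchanged. The worker never holds the lock across synth. New regression test …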
Re-reviewed ef2d476. This fixes my blocker. The player_ops mutex now serializes the monitor clear, worker cancel clear, and worker append paths, and the monitor re-checks cancel while holding that lock. Moving cancel=false under the same lock is the key detail: after the worker consumes a cancel, a stale monitor branch can only observe cancel=false and no-op instead of clearing fresh audio. I also checked the tts_active lifecycle Eva called out: keeping the monitor's tts_active=false store only in the under-lock still-cancel branch is correct. If the branch is stale and fresh audio has been appended, the mic gate should remain active; the worker-owned cancel path still releases it after actual cancel consumption. Focused verification: cd desktop/src-tauri && ../../bin/cargo test huddle::tts --lib → 38 passed. No remaining code-review blockers from me. Remaining gate is the by-ear A/B / barge-in feel check. (Note: GitHub refused a formal approval from this token with "Can not approve your own pull request", so this comment is my sign-off.) |
…mples Tyler's by-ear A/B on PR #997 reported the voice 'a little blown out'. Measured cause: the fixed PLAYBACK_GAIN of 9.3 was calibrated against a single bench utterance whose peak (0.076) turns out to be a >10x outlier. A new probe (examples/pocket_clip_probe) synthesizing 8 varied sentences through the production model shows real Pocket output peaks at 0.4-0.97 (RMS ~= -20 dBFS — already normal speech level), so the gain pushed peaks to 4-8x full scale and the +/-1.0 clamp flat-topped 13-34% of all samples — audible clipping distortion on every utterance. Fix: delete the gain stage entirely. The kyutai reference pipeline applies no output scaling, and the synth output level needs none. The +/-1.0 hard clamp stays (now clamp_to_full_scale) as the safety net for outlier transients. Gain tests replaced with clamp tests, including a bit-exact pass-through check. This supersedes both prior level strategies: per-sentence peak normalization (level pumping) and the fixed gain that replaced it (clipping). No-gain is the reference behavior and text-invariant by construction.
| prompt (truncated) | raw peak | post-gain peak | % samples clipped |
|---|---|---|---|
| Hello, this is a test of the new Pocket TTS… | 0.73 | 6.8 | 16.9% |
| Yep, I can hear you. | 0.63 | 5.9 | 14.1% |
| Absolutely! That sounds fantastic… | 0.57 | 5.3 | 20.1% |
| The quick brown fox… | 0.63 | 5.9 | 19.5% |
| I found three problems in the code… | 0.42 | 3.9 | 13.2% |
| No. | 0.58 | 5.4 | 18.6% |
| Warning! The build failed… | 0.84 | 7.9 | 20.6% |
| Sure, I can walk you through… | 0.66 | 6.2 | 12.7% |
Raw output is already at speech level (peaks 0.4–0.97, RMS ≈ −20 dBFS). The 9.3× gain pushed peaks to 4–8× full scale and the ±1.0 clamp flat-topped 13–34% of all samples.
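The "post-gain peak" and "% samples clipped" columns can be reproduced with a few lines. This is a sketch, not the committed probe: `clamp_to_full_scale` is named in the commit message but its signature here is assumed, and `peak` and `clipped_fraction` are hypothetical helpers.

```rust
/// Hard clamp to full scale; in-range samples pass through unchanged.
/// (Signature assumed; only the name appears in the commit message.)
pub fn clamp_to_full_scale(samples: &mut [f32]) {
    for s in samples.iter_mut() {
        *s = (*s).clamp(-1.0, 1.0);
    }
}

/// Largest absolute sample value.
pub fn peak(samples: &[f32]) -> f32 {
    samples.iter().fold(0.0_f32, |m, s| m.max(s.abs()))
}

/// Fraction of samples that would exceed full scale after `gain`,
/// i.e. the samples the clamp would flat-top.
pub fn clipped_fraction(samples: &[f32], gain: f32) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let clipped = samples
        .iter()
        .filter(|s| (**s * gain).abs() > 1.0)
        .count();
    clipped as f32 / samples.len() as f32
}
```

For the first table row, a raw peak of 0.73 times 9.3 gives about 6.8, matching the post-gain column; with a 9.3× gain any sample above roughly 0.108 in magnitude clips, which is why such a large share of ordinary speech samples was flattened.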
Fix
Delete the gain stage. The kyutai reference pipeline applies no output scaling, and the measurements confirm none is needed. The ±1.0 hard clamp stays (clamp_to_full_scale) as the safety net for outlier transients. This supersedes both prior level strategies — per-sentence normalization (level pumping) and the fixed gain that replaced it (clipping); no-gain is the reference behavior and text-invariant by construction.
The probe is committed so any future gain proposal gets measured against real synth output first.
Barge-in / player_ops code untouched. Full desktop suite + clippy green via pre-push hooks.
…mples Tyler's by-ear A/B on PR #997 reported the voice 'a little blown out'. Measured cause: the fixed PLAYBACK_GAIN of 9.3 was calibrated against a single bench utterance whose peak (0.076) turns out to be a >10x outlier. A new probe (examples/pocket_clip_probe) synthesizing 8 varied sentences through the production model shows real Pocket output peaks at 0.4-0.97 (RMS ~= -20 dBFS — already normal speech level), so the gain pushed peaks to 4-8x full scale and the +/-1.0 clamp flat-topped 13-34% of all samples — audible clipping distortion on every utterance. Fix: delete the gain stage entirely. The kyutai reference pipeline applies no output scaling, and the synth output level needs none. The +/-1.0 hard clamp stays (now clamp_to_full_scale) as the safety net for outlier transients. Gain tests replaced with clamp tests, including a bit-exact pass-through check. This supersedes both prior level strategies: per-sentence peak normalization (level pumping) and the fixed gain that replaced it (clipping). No-gain is the reference behavior and text-invariant by construction. Co-authored-by: Tyler Longwell <tlongwell@squareup.com> Signed-off-by: Tyler Longwell <tlongwell@squareup.com>
Force-pushed: daecbab to 40ac86b
…session-new * origin/main: fix(huddle): Pocket TTS quality overhaul — reference parity + cross-message pipelining (#997) Add manual ACP session rotation command (#932) fix(desktop): heal stale persona_team_dir paths in release builds (#1003) ci(docker): publish public ghcr.io/block/buzz image (native multi-arch) (#986) fix(buzz-agent): cap tool-result text at 50 KiB with middle elision (#952) feat(huddle): sentence-at-a-time voice-mode guidelines for lower TTS latency (#996) Shard desktop Playwright CI jobs (#992) chore(release): release version 0.3.18 (#995) Video Player Improvements (#993) Improve first-run welcome setup (#970) fix(release): use legacy updater key secret (#991) Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com> # Conflicts: # crates/buzz-acp/src/lib.rs # crates/buzz-agent/src/config.rs
…tate * origin/main: Add relay disconnect UX: friendly errors, reconnect, cached identity (#1004) feat(agents): add active turn indicators to Agents Menu (#1005) ci: add fork guards to docker, release, and auto-tag workflows (#1007) docs(nip-rs): add optional thread read context scheme (#1006) fix(huddle): Pocket TTS quality overhaul — reference parity + cross-message pipelining (#997) Add manual ACP session rotation command (#932) fix(desktop): heal stale persona_team_dir paths in release builds (#1003) ci(docker): publish public ghcr.io/block/buzz image (native multi-arch) (#986) fix(buzz-agent): cap tool-result text at 50 KiB with middle elision (#952) feat(huddle): sentence-at-a-time voice-mode guidelines for lower TTS latency (#996) Shard desktop Playwright CI jobs (#992) chore(release): release version 0.3.18 (#995) Video Player Improvements (#993) Improve first-run welcome setup (#970) fix(release): use legacy updater key secret (#991) Replace built-in personas with Fizz (#987)
Summary
Quality overhaul of the huddle Pocket TTS playback path. Organizing principle: reference parity by default — every deviation from kyutai's pipeline must be justified by a measured local problem, not by compensating for damage we introduced ourselves. Two of our four hacks existed to patch damage done by the other two; this PR deletes the damage and the patches together.
Follows #996 (sentence-at-a-time agent replies), which made the existing choppiness structurally worse — see change 3.
Changes
1. silence_scale: 0.0 → 1.0 (stop deleting pauses)

sherpa-onnx's ScaleSilence is not pre/post padding control: it finds every interior silence run ≥ 0.2 s and multiplies its length by the scale. At 0.0, every natural pause (clause breaks, breaths, the gap after a comma) was removed entirely, slamming words together and cutting endings abruptly. The reference pipeline doesn't post-process silence at all; 1.0 is the identity.

2. Delete the sacrificial-prefix + cold-start trim apparatus

The first-phoneme smear is a real upstream bug (kyutai-labs/pocket-tts #91, #70), but our community-copied workaround (synthesize ". . " then amplitude-threshold-trim it off) used an absolute threshold (0.02) against raw un-normalized audio peaking at ~0.076, so soft word onsets sat in the "silence" band and got eaten. The cure was causing a version of the disease. Kyutai's own mitigation is just the 8-space pad, which is kept, along with capitalize, punctuation-terminate, and the max_frames runaway cap. The trim's tuning constants were also calibrated against silence_scale=0.0 audio, so they died with change 1 regardless.

3. Cross-message pipelining: the structural fix

The worker had a per-item drain barrier: synth all sentences of one queued item, poll player.empty(), only then recv the next. With sentence-per-message delivery post-#996, every gap between sentences paid drain-poll + preprocess + full synthesis (~200–400 ms) of dead air. Now one Player persists for the worker's lifetime; the worker returns to the channel immediately after synthesizing an item, so synthesis of message N+1 overlaps playback of message N and rodio keeps it gapless.

Contracts preserved across the restructure:

- tts_active (STT mic gate): set after the first real append, released only when the channel is quiet and the player has drained.
- Player::clear() removes queued sources and pauses the player. With a persistent Player the cancel path must clear() then play(), or every append after an interrupt queues silently forever. (This is the sharp edge of the whole PR.)

4. Fixed gain instead of per-sentence peak normalization

Per-sentence normalization made loudness a function of each sentence's loudest transient → audible level pumping between consecutive sentences. Replaced with a fixed gain (9.3; measured reference-voice peak ~0.076 lands at the established −3 dBFS) plus the existing ±1.0 clamp as safety. Reference applies no normalization; this is the minimal deviation that keeps the approved loudness.

Kept (justified, measured, independent of the smear saga): fade-out (click prevention), 20 ms lead-in cushion (CoreAudio warm-up), INTER_SENTENCE_SILENCE (doubles as the lead-in budget; trim by ear in review if pauses now double up).

Housekeeping

tts.rs crossed the 1000-line desktop budget; tests moved to a #[path]-included tts_tests.rs sibling (the crate's established pattern) and the TEMP size override removed.

Why one PR, not four

1↔2 share calibration (trim constants assume scale-0 audio), 2↔3 interact (trim's output contract fed the lead-in cushioning), and the by-ear verdict is only meaningful with all four landed.
Verification
--all-targets -D warnings clean; fmt clean; file-size check clean.

apply_playback_gain properties (level lands at −3 dBFS, relative loudness preserved, clamp).

Out of scope (backlog)

Barge-in TTS gate, latency instrumentation, sherpa MergeShortSentences, voice-state conditioning across generations (the real cold-start fix; upstream sherpa-onnx contribution).