fix(huddle): Pocket TTS quality overhaul — reference parity + cross-message pipelining by tlongwell-block · Pull Request #997 · block/buzz

tlongwell-block · 2026-06-12T02:31:31Z

Summary

Quality overhaul of the huddle Pocket TTS playback path. Organizing principle: reference parity by default — every deviation from kyutai's pipeline must be justified by a measured local problem, not by compensating for damage we introduced ourselves. Two of our four hacks existed to patch damage done by the other two; this PR deletes the damage and the patches together.

Follows #996 (sentence-at-a-time agent replies), which made the existing choppiness structurally worse — see change 3.

Changes

1. `silence_scale: 0.0 → 1.0` — stop deleting pauses

sherpa-onnx's ScaleSilence is not pre/post padding control: it finds every interior silence run ≥ 0.2 s and multiplies its length by the scale. At 0.0, every natural pause — clause breaks, breaths, the gap after a comma — was removed entirely, slamming words together and cutting endings abruptly. The reference pipeline doesn't post-process silence at all; 1.0 is the identity.
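To make the "1.0 is the identity, 0.0 deletes the pause" point concrete, here is a minimal model of the behavior described above. It is not sherpa-onnx's implementation: the function name, the amplitude `threshold` parameter, and the choice to keep the head of a shortened run are all assumptions made for illustration.

```rust
/// Simplified model of silence scaling (an illustration, NOT sherpa-onnx code):
/// every interior run of near-silent samples lasting at least `min_run_secs`
/// has its length multiplied by `scale`.
pub fn scale_interior_silence(
    samples: &[f32],
    sample_rate: usize,
    threshold: f32,    // assumption: |x| < threshold counts as silence
    min_run_secs: f32, // 0.2 in the description above
    scale: f32,
) -> Vec<f32> {
    let min_run = (min_run_secs * sample_rate as f32).round() as usize;
    let mut out = Vec::with_capacity(samples.len());
    let mut i = 0;
    while i < samples.len() {
        if samples[i].abs() >= threshold {
            out.push(samples[i]);
            i += 1;
            continue;
        }
        // Measure the silent run that starts at `i`.
        let start = i;
        while i < samples.len() && samples[i].abs() < threshold {
            i += 1;
        }
        let run = &samples[start..i];
        // Interior means the run touches neither the start nor the end.
        let interior = start > 0 && i < samples.len();
        if interior && run.len() >= min_run {
            let keep = (run.len() as f32 * scale).round() as usize;
            // Shortening keeps the head of the run; lengthening pads with zeros.
            out.extend(run.iter().take(keep).copied());
            out.extend(std::iter::repeat(0.0).take(keep.saturating_sub(run.len())));
        } else {
            out.extend_from_slice(run);
        }
    }
    out
}
```

With a toy sample rate of 10 Hz (so 0.2 s is two samples), the input `[0.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.5]` comes back unchanged at scale 1.0, and at scale 0.0 the three-sample pause disappears while the one-sample gap, which is shorter than the minimum run, survives.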

2. Delete the sacrificial-prefix + cold-start trim apparatus

The first-phoneme smear is a real upstream bug (kyutai-labs/pocket-tts #91, #70), but our community-copied workaround — synthesize ". . " then amplitude-threshold-trim it off — used an absolute threshold (0.02) against raw un-normalized audio peaking at ~0.076, so soft word onsets sat in the "silence" band and got eaten. The cure was causing a version of the disease. Kyutai's own mitigation is just the 8-space pad — kept, along with capitalize, punctuation-terminate, and the max_frames runaway cap. The trim's tuning constants were also calibrated against silence_scale=0.0 audio, so they died with change 1 regardless.
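A small sketch of why an absolute threshold of 0.02 is risky against audio peaking at ~0.076. The helper names and the onset ramp are hypothetical and are not the deleted trim code; only the two constants come from the description above.

```rust
/// Hypothetical absolute-threshold trim: drop everything before the first
/// sample whose magnitude reaches `threshold`.
pub fn trim_leading_below(samples: &[f32], threshold: f32) -> &[f32] {
    let start = samples
        .iter()
        .position(|s| s.abs() >= threshold)
        .unwrap_or(samples.len());
    &samples[start..]
}

/// Level of `x` relative to `reference`, in decibels.
pub fn relative_db(x: f32, reference: f32) -> f32 {
    20.0 * (x / reference).log10()
}
```

`relative_db(0.02, 0.076)` is about -11.6 dB, so the "silence" band reaches to within roughly 12 dB of the loudest sample. On a made-up onset ramp such as `[0.005, 0.010, 0.015, 0.030, 0.076]`, the first three samples are classified as silence and dropped, which is the soft-onset loss described above.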

3. Cross-message pipelining — the structural fix

The worker had a per-item drain barrier: synth all sentences of one queued item, poll player.empty(), only then recv the next. With sentence-per-message delivery post-#996, every gap between sentences paid drain-poll + preprocess + full synthesis (~200–400 ms) of dead air. Now one Player persists for the worker's lifetime; the worker returns to the channel immediately after synthesizing an item, so synthesis of message N+1 overlaps playback of message N and rodio keeps it gapless.

Contracts preserved across the restructure:

- tts_active (STT mic gate): set after the first real append, released only when the channel is quiet and the player has drained.
- Barge-in cancel: rodio 0.22 Player::clear() removes queued sources and pauses the player, so with a persistent Player the cancel path must clear() then play(), or every append after an interrupt queues silently forever. (This is the sharp edge of the whole PR.)
- macOS CoreAudio priming buffer kept; it now primes the persistent Player once at startup.
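The clear()-then-play() contract is easy to get wrong, so here is a toy model of it. `ModelPlayer` is a stand-in written for this sketch, not rodio's type; it only encodes the behavior stated above (clear empties the queue and leaves the player paused).

```rust
use std::collections::VecDeque;

/// Stand-in for the audio player. It models only the behavior described
/// above for rodio 0.22: `clear()` empties the queue AND pauses.
#[derive(Default)]
pub struct ModelPlayer {
    queue: VecDeque<Vec<f32>>,
    paused: bool,
}

impl ModelPlayer {
    pub fn append(&mut self, buf: Vec<f32>) {
        self.queue.push_back(buf);
    }
    pub fn clear(&mut self) {
        self.queue.clear();
        self.paused = true;
    }
    pub fn play(&mut self) {
        self.paused = false;
    }
    /// Next buffer that would reach the device, or None while paused or empty.
    pub fn pump(&mut self) -> Option<Vec<f32>> {
        if self.paused {
            None
        } else {
            self.queue.pop_front()
        }
    }
}

/// Cancel path for a player that outlives the interrupt.
pub fn cancel_playback(player: &mut ModelPlayer) {
    player.clear();
    player.play(); // un-pause, so appends after the interrupt are audible
}
```

With a per-item player the paused state was thrown away along with the player; with a persistent one, a bare `clear()` leaves every later `append` sitting in a paused queue, which is the "queues silently forever" failure.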

4. Fixed gain instead of per-sentence peak normalization

Per-sentence normalization made loudness a function of each sentence's loudest transient → audible level pumping between consecutive sentences. Replaced with a fixed gain (9.3; measured reference-voice peak ~0.076 lands at the established −3 dBFS) plus the existing ±1.0 clamp as safety. Reference applies no normalization; this is the minimal deviation that keeps the approved loudness.
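The arithmetic behind the 9.3 figure, as a sketch. The function names are hypothetical (the PR's own helper is apply_playback_gain, whose signature is not shown here), and the 0.076 peak is the single measurement quoted above, which later comments in this thread find to be unrepresentative.

```rust
/// Convert a linear peak amplitude (1.0 = full scale) to dBFS.
pub fn peak_dbfs(peak: f32) -> f32 {
    20.0 * peak.log10()
}

/// Hypothetical fixed-gain stage: the same constant for every sentence,
/// followed by a clamp to the ±1.0 range as a safety net.
pub fn apply_fixed_gain(samples: &mut [f32], gain: f32) {
    for s in samples.iter_mut() {
        *s = (*s * gain).clamp(-1.0, 1.0);
    }
}
```

`peak_dbfs(9.3 * 0.076)` is roughly -3.0 dBFS, which is where the constant comes from. Because the gain does not depend on the sentence, two samples that differ by a factor of two before the stage still differ by a factor of two after it, unless the clamp engages.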

Kept (justified, measured, independent of the smear saga): fade-out (click prevention), 20 ms lead-in cushion (CoreAudio warm-up), INTER_SENTENCE_SILENCE (doubles as the lead-in budget — trim by ear in review if pauses now double up).

Housekeeping

tts.rs crossed the 1000-line desktop budget; tests moved to a #[path]-included tts_tests.rs sibling (the crate's established pattern) and the TEMP size override was removed.

Why one PR, not four

1↔2 share calibration (trim constants assume scale-0 audio), 2↔3 interact (trim's output contract fed the lead-in cushioning), and the by-ear verdict is only meaningful with all four landed.

Verification

Full desktop-tauri suite: 514 + 3 passed, 0 failed (verified at the intermediate pipelining-only commit too).
clippy --all-targets -D warnings clean; fmt clean; file-size check clean.
New tests: idle-branch lifecycle (release-after-drain, hold-while-draining, no-op-when-nothing-queued) and apply_playback_gain properties (level lands at −3 dBFS, relative loudness preserved, clamp).
Deleted: trim/sacrificial tests (apparatus removed).
Needs a by-ear A/B against main with short/long/multi-sentence replies before merge.

Out of scope (backlog)

Barge-in TTS gate, latency instrumentation, sherpa MergeShortSentences, voice-state conditioning across generations (the real cold-start fix; upstream sherpa-onnx contribution).

sherpa-onnx's ScaleSilence is not pre/post padding control: it finds every interior silence run >= 0.2s and multiplies its length by the scale. At 0.0 every natural pause — clause breaks, breaths, the gap after a comma — was removed entirely, slamming words together and cutting endings abruptly. The reference Pocket TTS pipeline does not post-process silence at all; 1.0 is the identity and restores parity. Co-authored-by: Tyler Longwell <tlongwell@squareup.com> Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

…aratus

The FlowLM cold-start smear on short utterances is a real, open upstream bug (kyutai-labs/pocket-tts #91, #70), but upstream's only mitigation is the 8-space pad, which we keep. Our extra layer prepended a sacrificial ". . " phantom utterance and then trimmed the rendered audio back out by scanning for a silence gap with an absolute amplitude threshold (0.02 against raw peaks of ~0.076). That threshold sits in soft-onset territory, so the cure could eat real word starts ('I'm', fricatives, aspiration): a version of the disease it treats. Its calibration also assumed silence_scale = 0.0 audio, invalidated by the previous commit.

Under sentence-per-message delivery (#996) nearly every message is short, so the fragile path fired on almost every sentence instead of rarely.

Keep the justified upstream prep: 8-space pad, capitalization, punctuation termination, and the max_frames=100 runaway cap. Accepted cost: occasional smeared first syllable on short utterances, same as every reference user.

Also deletes examples/prod_probe.rs (existed solely to calibrate the trim) and updates the remaining probes to the production silence_scale.

Co-authored-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

…ent Player

The worker loop had a per-item drain barrier: synthesize every sentence of one queued text item, then poll player.empty() before even receiving the next item. With sentence-per-message delivery (each agent message is roughly one sentence post-#996), every gap between sentences paid drain-poll + preprocess + full synthesis (~200-400ms) of dead air: a structural pause between every pair of spoken sentences.

Restructure: one Player persists for the worker's lifetime, all sentence buffers from all items append to it, and the worker goes straight back to the channel after synthesizing an item. Synthesis of message N+1 now overlaps playback of message N; rodio's queue keeps playback gapless.

Lifecycle contracts preserved:

- tts_active (STT mic gate) is set after the first real append and released only when the channel is quiet AND the player has drained. This happens in the idle branch of recv_timeout and is no longer a per-item barrier.
- Barge-in cancel: Player::clear() removes queued sources but ALSO pauses the player (rodio 0.22 clear() ends with pause()). With a persistent Player the un-pause is mandatory: clear() is now followed by play(), otherwise every append after an interrupt queues silently forever.
- The macOS CoreAudio priming buffer (lazy device init truncated the first utterance) now primes the persistent Player once at startup.
- NonZero channel/rate construction moved to worker setup with graceful error returns instead of unwrap().

Adds idle-branch lifecycle tests mirroring the production logic: release-after-drain, hold-while-draining, no-op-when-nothing-queued.

Co-authored-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

Per-sentence peak normalization scaled each sentence so its own loudest sample hit -3 dBFS, making loudness a function of that sentence's sharpest transient: a sentence with one loud consonant got less gain than its neighbors, producing audible level jumps between consecutive sentences. The reference Pocket TTS pipeline applies no normalization at all. Replace with a fixed gain (9.3 — measured reference-voice peak ~0.076 lands at the established -3 dBFS loudness) plus the existing clamp to ±1.0 as the safety net for outlier transients. Minimal deviation from reference parity that keeps the approved output level while making loudness text-invariant. Co-authored-by: Tyler Longwell <tlongwell@squareup.com> Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

tts.rs crossed the 1000-line desktop file-size budget (it was riding on a TEMP override). Move the test module to a #[path]-included sibling — the established pattern in this crate (commands/workflows_tests.rs, migration_tests.rs, mesh_llm/mod_tests.rs) — and drop the override entry. No code changes; tests are byte-identical modulo dedent. Co-authored-by: Tyler Longwell <tlongwell@squareup.com> Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

tlongwell-block · 2026-06-12T13:05:28Z

I found one blocker in the tts_active lifecycle.

The new persistent-player loop only releases tts_active in the recv_timeout timeout arm:

Err(mpsc::RecvTimeoutError::Timeout) => {
    if player.empty() && !first_append {
        tts_active.store(false, Ordering::Release);
        first_append = true;
    }
    continue;
}

That leaves a race between playback drain and the next text item arriving. If item N drains, then item N+1 arrives before the 100 ms timeout fires, recv_timeout returns Ok(t) and we go straight into preprocessing/synthesis with tts_active still true and no audio queued/playing. In other words, the PR can reintroduce the exact bug the comment below is guarding against: STT discards user speech during a synthesis-only window as "echo" even though the agent is silent.

Minimal shape I’d expect: after receiving raw_text (and after the cancel re-check), before preprocessing/synthesis, release/re-arm if player.empty() && !first_append. That preserves cross-item pipelining when audio is still draining, but restores the “active only while audio is queued/playing” invariant when the player has already gone idle before the next item.
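One way to express that shape is a single helper called from both the timeout arm and the receipt path, so the two sites cannot drift apart. This is a sketch, not the PR's code: the helper name and its bool return are made up here, and `first_append` keeps the meaning it has in the snippet above (true once the lead-in is re-armed for the next burst).

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Shared drained-player check (hypothetical helper).
/// Returns true if the mic gate was released on this call.
pub fn release_if_drained(
    player_empty: bool,
    first_append: &mut bool,
    tts_active: &AtomicBool,
) -> bool {
    if player_empty && !*first_append {
        // Nothing is queued or playing, but the gate is still held: release it.
        tts_active.store(false, Ordering::Release);
        // Re-arm so the next append is treated as the start of a new burst.
        *first_append = true;
        return true;
    }
    false
}
```

Called on receipt, this releases the gate when item N has already drained before item N+1 arrives. While audio is still draining, `player_empty` is false and the gate holds, so cross-item pipelining is unaffected.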

Focused verification I ran locally:

cd desktop/src-tauri
../../bin/cargo test huddle::tts --lib

Result: 34 passed. CI is green, and the rest of the diff looks directionally right to me (silence parity, removing sacrificial trim, fixed gain, clear()+play() after cancel). But I’d like this lifecycle hole closed before merge.

…ined The persistent-player loop only released tts_active in the recv-timeout arm. If item N's audio drained and item N+1 arrived before the 100 ms timeout fired, the entire preprocessing + synthesis pass for N+1 ran with tts_active stuck true and nothing playing — STT discarded human speech as "echo" during a window where the agent was silent, and the remote-interrupt frame counter (gated on tts_active) was distorted. Run the same drained-player check on item receipt, before synthesis. Pipelining is unaffected: while audio is still draining, player.empty() is false and the gate holds across items. Found in PR #997 review (Max). Co-authored-by: Tyler Longwell <tlongwell@squareup.com> Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

tlongwell-block · 2026-06-12T13:15:46Z

Blocker addressed in 82d23b8: the drained-player check now also runs on item receipt, after the cancel re-check and before preprocessing/synthesis — exactly the minimal shape suggested. When player.empty() && !first_append, the gate releases and the lead-in re-arms; when audio is still draining, the flag holds across items, so cross-item pipelining is unchanged. Added a regression test (on_receipt_check_releases_mic_gate_before_synthesis) modeling the drain-then-receive race. Full suite 515 green, clippy -D warnings clean.

tlongwell-block · 2026-06-12T13:16:52Z

Re-reviewed 82d23b86: this fixes the blocker I raised.

The drained-player check now runs after the post-receive cancel check and before preprocessing/synthesis, so if item N has already drained when item N+1 arrives, tts_active is released before the synthesis-only window. If audio is still draining, player.empty() remains false, so cross-item pipelining still holds the gate as intended.

Focused local verification:

cd desktop/src-tauri
../../bin/cargo test huddle::tts --lib

Result: 35 passed.

No further code-review blockers from me. Remaining merge gate is the by-ear A/B / barge-in feel check Eva called out.

Barge-in latency was dominated by who calls clear(): only the worker loop did, and it only observes the cancel flag between sentences. While blocked inside synth_chunk (hundreds of ms of ONNX inference for a long sentence), nothing silenced the audio already playing, so an interrupt mid-sentence kept speaking until the synth call returned.

Add a barge-in monitor thread sharing the Player via Arc. Every 10 ms it checks the cancel flag; while set, it clear()+play()s the player and releases tts_active. rodio stops the in-flight source within ~5 ms (every appended source is wrapped in periodic_access(5ms)), so playing audio dies ~15 ms after the flag is set, even mid-synthesis, and the mic un-gates instantly, so speech during the barge-in isn't discarded.

The monitor deliberately does NOT consume the flag: the worker owns consumption (queue drain + lead-in reset). Re-clearing each tick until the worker catches up also covers a sentence appended in the race window after the worker's own checks.

Second guard: re-check cancel after synth_chunk returns, before append, so a sentence synthesized during the interruption is dropped, not spoken over the human.

Monitor is joined on worker exit. Spawn failure degrades gracefully to the previous between-sentences behavior.

Co-authored-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

tlongwell-block · 2026-06-12T13:46:11Z

New commit 82d997f4 — mid-sentence barge-in (Tyler's request from the thread).

Root cause of the lag: only the worker loop called clear(), and it only checks the cancel flag between sentences. While blocked inside synth_chunk (hundreds of ms), nothing silenced audio that was already playing.

Two changes:

Barge-in monitor thread — shares the Player via Arc, polls the cancel flag every 10 ms. While set: clear()+play() and release tts_active. rodio's internal periodic_access(5ms) wrapper stops the in-flight source, so flag-to-silence is ~15 ms even mid-synthesis. The monitor does NOT consume the flag — the worker keeps ownership of queue-drain + lead-in reset; re-clearing each tick until the worker catches up also covers the append race after the worker's own checks.
Stale-synth guard — re-check cancel after synth_chunk returns, before appending, so a sentence synthesized during the interruption is dropped rather than spoken over the human.

Monitor joins on worker exit; spawn failure degrades to the previous between-sentences behavior. 517 tests green (2 new contract tests for the monitor tick), clippy -D warnings clean.
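For readers who want the thread structure spelled out, a minimal sketch of the monitor as described at this commit. `CountingPlayer`, `spawn_monitor`, the `stop` flag, and the thread name are stand-ins invented here; the real code shares a rodio Player. This is the unsynchronized shape, which the review that follows flags as racy.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Stand-in for the shared player: counts clear() calls instead of touching audio.
#[derive(Default)]
pub struct CountingPlayer {
    pub clears: AtomicUsize,
}

impl CountingPlayer {
    pub fn clear(&self) {
        self.clears.fetch_add(1, Ordering::Relaxed);
    }
    pub fn play(&self) {}
}

/// Spawn a monitor that polls `cancel` every 10 ms until `stop` is set.
/// It reads the cancel flag but never resets it: the worker owns consumption.
/// Returning the io::Result lets the caller fall back if the spawn fails.
pub fn spawn_monitor(
    player: Arc<CountingPlayer>,
    cancel: Arc<AtomicBool>,
    tts_active: Arc<AtomicBool>,
    stop: Arc<AtomicBool>,
) -> std::io::Result<JoinHandle<()>> {
    thread::Builder::new()
        .name("tts-barge-in-monitor".into())
        .spawn(move || {
            while !stop.load(Ordering::Acquire) {
                if cancel.load(Ordering::Acquire) {
                    player.clear();
                    player.play();
                    tts_active.store(false, Ordering::Release);
                }
                thread::sleep(Duration::from_millis(10));
            }
        })
}
```

The worker would set `stop` and join the handle on exit. While `cancel` stays set the monitor clears again on every tick, which is what covers a sentence appended in the race window.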

tlongwell-block · 2026-06-12T13:49:57Z

Re-reviewing 82d997f4, I think there is one correctness race in the monitor-thread design that should be fixed before merge.

The monitor does:

if cancel.load(Ordering::Acquire) {
    player.clear();
    player.play();
    tts_active.store(false, Ordering::Release);
}

Because the cancel load is not synchronized with worker-side cancel consumption / future appends, this can clear audio from a new post-cancel utterance:

1. Monitor observes cancel == true.
2. Monitor is preempted before player.clear().
3. Worker reaches handle_cancel_or_shutdown, clears old audio, drains the text queue, stores cancel = false, and continues.
4. A fresh text item is received/synthesized/appended.
5. Monitor resumes from the stale cancel == true branch and calls player.clear()+play(), deleting the fresh intended audio and setting tts_active = false.

This is exactly the kind of cross-item underrun barrier / unexpected playback drop we were trying to avoid; it is low-probability but real once two threads can mutate the same persistent Player.

I think the minimal safe shape is to serialize monitor clears with worker player operations and re-check cancel while holding that synchronization point. For example, an Arc<Mutex<()>> player_ops (or equivalent) that is held around worker append/clear operations, and in the monitor:

if cancel.load(Ordering::Acquire) {
    let _guard = player_ops.lock().unwrap();
    if cancel.load(Ordering::Acquire) {
        player.clear();
        player.play();
        tts_active.store(false, Ordering::Release);
    }
}

with worker appends and handle_cancel_or_shutdown's clear/play using the same guard. Then either the monitor clear happens before the worker appends fresh audio, or it observes cancel == false and does not clear.

I did verify the rest of the shape: rodio Player itself uses atomics/mutexes internally for clear/play/append, and the post-synth stale-audio guard is the right invariant. Focused local test still passes: cd desktop/src-tauri && ../../bin/cargo test huddle::tts --lib → 37 passed.

Max's PR #997 review blocker: the monitor's check-then-clear was not ordered against the worker's cancel consumption. Sequence: monitor loads cancel=true → preempted → worker consumes cancel and appends a fresh post-cancel utterance → monitor resumes its stale branch and clears the fresh audio (and drops tts_active while it plays, un-gating echo).

Fix (Max's suggested shape): a player_ops mutex serializes all Player mutations (monitor clear, worker cancel/shutdown clear, worker append), and the monitor re-checks cancel while holding it. Either the clear runs before fresh audio can be appended, or it observes cancel=false and no-ops. Cancel consumption moved under the same lock so the false is visible to the monitor's under-lock re-check.

The worker's post-synth stale-sentence check now also runs under the lock together with its append, closing the symmetric window where the monitor clears between the worker's check passing and the buffer landing.

Lock is uncontended except during an actual barge-in (worker holds it only for appends/clears, never across synth), so the hot path and the ~15 ms flag-to-silence are unchanged. Lock acquisition recovers from poison (data is (), nothing inconsistent to observe), so a panicked peer can't wedge the other thread.

New regression test models the exact stale-branch interleaving; monitor-contract tests updated to the locked shape. 521 tests green.

Co-authored-by: Tyler Longwell <tlongwell@squareup.com>
Signed-off-by: Tyler Longwell <tlongwell@squareup.com>

tlongwell-block · 2026-06-12T14:05:38Z

Fixed in ef2d4764 — the race was real and your fix shape was right, implemented essentially verbatim.

What changed:

New player_ops: Arc<Mutex<()>> serializes every Player mutation: the monitor's clear, the worker's cancel/shutdown clear in handle_cancel_or_shutdown, and the worker's sentence append.
Monitor re-checks cancel under the lock — the stale branch either runs before fresh audio can exist, or observes cancel == false and no-ops.
Cancel consumption (queue drain + cancel.store(false)) moved inside the same critical section in handle_cancel_or_shutdown, so the false is what the monitor's under-lock re-check sees — without this the re-check wouldn't close the race.
The worker's post-synth stale-sentence check now holds the lock together with its append, closing the symmetric window (monitor clears between the worker's check passing and the buffer landing).
lock_player_ops recovers from poison — the guarded data is (), so there's nothing inconsistent to observe, and a panicked peer can't wedge the other thread on unwrap().

Hot path: unchanged. The worker never holds the lock across synth_chunk, only across appends/clears (microseconds), and the monitor only contends during an actual barge-in. Flag-to-silence stays ~15 ms.
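A single-threaded model of the locked shape, written so the stale-branch interleaving can be replayed step by step. Everything here is illustrative: `QueueModel` stands in for the Player, the worker/monitor function names are invented, and only `lock_player_ops` borrows its name from the list above (its body is a guess at the poison-recovery idiom).

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Mutex, MutexGuard};

/// Stand-in for the persistent player's queue.
#[derive(Default)]
pub struct QueueModel {
    queued: Mutex<Vec<&'static str>>,
}

impl QueueModel {
    fn lock(&self) -> MutexGuard<'_, Vec<&'static str>> {
        self.queued.lock().unwrap_or_else(|p| p.into_inner())
    }
    pub fn append(&self, s: &'static str) {
        self.lock().push(s);
    }
    pub fn clear(&self) {
        self.lock().clear();
    }
    pub fn len(&self) -> usize {
        self.lock().len()
    }
}

/// Poison-tolerant lock: the guarded data is `()`, so a panicked peer
/// leaves nothing inconsistent behind.
pub fn lock_player_ops(ops: &Mutex<()>) -> MutexGuard<'_, ()> {
    ops.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Monitor: the step that runs after its unlocked load saw `cancel == true`.
pub fn monitor_clear_if_still_cancelled(
    ops: &Mutex<()>,
    cancel: &AtomicBool,
    tts_active: &AtomicBool,
    player: &QueueModel,
) -> bool {
    let _guard = lock_player_ops(ops);
    if cancel.load(Ordering::Acquire) {
        player.clear();
        tts_active.store(false, Ordering::Release);
        return true;
    }
    false // stale branch: the worker already consumed the cancel
}

/// Worker: clear and consume the cancel inside one critical section.
pub fn worker_consume_cancel(ops: &Mutex<()>, cancel: &AtomicBool, player: &QueueModel) {
    let _guard = lock_player_ops(ops);
    player.clear();
    cancel.store(false, Ordering::Release);
}

/// Worker: stale-sentence check and append inside one critical section.
/// Returns false if the sentence was dropped.
pub fn worker_append(
    ops: &Mutex<()>,
    cancel: &AtomicBool,
    tts_active: &AtomicBool,
    player: &QueueModel,
    sentence: &'static str,
) -> bool {
    let _guard = lock_player_ops(ops);
    if cancel.load(Ordering::Acquire) {
        return false;
    }
    player.append(sentence);
    tts_active.store(true, Ordering::Release);
    true
}
```

Replaying the race: with `cancel` set, call `worker_consume_cancel`, then `worker_append(.., "fresh")`, then `monitor_clear_if_still_cancelled`. The last call returns false and the queue still holds the fresh sentence, because the `false` store happened under the same lock the monitor re-checks under.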

New regression test monitor_stale_cancel_branch_must_not_clear_fresh_audio models your exact interleaving (stale load → worker consumes + appends under lock → monitor's locked re-check no-ops). Monitor-contract tests updated to the locked shape. 521 tests green, clippy -D warnings clean.

tlongwell-block · 2026-06-12T14:08:14Z

Re-reviewed ef2d476. This fixes my blocker.

The player_ops mutex now serializes the monitor clear, worker cancel clear, and worker append paths, and the monitor re-checks cancel while holding that lock. Moving cancel=false under the same lock is the key detail: after the worker consumes a cancel, a stale monitor branch can only observe cancel=false and no-op instead of clearing fresh audio.

I also checked the tts_active lifecycle Eva called out: keeping the monitor's tts_active=false store only in the under-lock still-cancel branch is correct. If the branch is stale and fresh audio has been appended, the mic gate should remain active; the worker-owned cancel path still releases it after actual cancel consumption.

Focused verification: cd desktop/src-tauri && ../../bin/cargo test huddle::tts --lib → 38 passed.

No remaining code-review blockers from me. Remaining gate is the by-ear A/B / barge-in feel check.

(Note: GitHub refused a formal approval from this token with "Can not approve your own pull request", so this comment is my sign-off.)

…mples

Tyler's by-ear A/B on PR #997 reported the voice 'a little blown out'. Measured cause: the fixed PLAYBACK_GAIN of 9.3 was calibrated against a single bench utterance whose peak (0.076) turns out to be a >10x outlier. A new probe (examples/pocket_clip_probe) synthesizing 8 varied sentences through the production model shows real Pocket output peaks at 0.4-0.97 (RMS ~= -20 dBFS, already normal speech level), so the gain pushed peaks to 4-8x full scale and the +/-1.0 clamp flat-topped 13-34% of all samples: audible clipping distortion on every utterance.

Fix: delete the gain stage entirely. The kyutai reference pipeline applies no output scaling, and the synth output level needs none. The +/-1.0 hard clamp stays (now clamp_to_full_scale) as the safety net for outlier transients. Gain tests replaced with clamp tests, including a bit-exact pass-through check.

This supersedes both prior level strategies: per-sentence peak normalization (level pumping) and the fixed gain that replaced it (clipping). No-gain is the reference behavior and text-invariant by construction.

tlongwell-block · 2026-06-12T16:35:12Z

`daecbab` — remove the 9.3× playback gain (the "blown out" report)

Tyler's by-ear A/B flagged the voice as "a little blown out." Measured cause: the fixed PLAYBACK_GAIN = 9.3 that landed in yesterday's loudness fix was calibrated against a single bench utterance whose peak (0.076) turns out to be a >10× outlier vs. real output.

New probe (examples/pocket_clip_probe) synthesizing 8 varied sentences through the production model dir:

| Prompt (truncated) | Raw peak | Post-gain peak | Samples clipped |
|---|---|---|---|
| Hello, this is a test of the new Pocket TTS… | 0.73 | 6.8 | 16.9% |
| Yep, I can hear you. | 0.63 | 5.9 | 14.1% |
| Absolutely! That sounds fantastic… | 0.57 | 5.3 | 20.1% |
| The quick brown fox… | 0.63 | 5.9 | 19.5% |
| I found three problems in the code… | 0.42 | 3.9 | 13.2% |
| No. | 0.58 | 5.4 | 18.6% |
| Warning! The build failed… | 0.84 | 7.9 | 20.6% |
| Sure, I can walk you through… | 0.66 | 6.2 | 12.7% |

Raw output is already at speech level (peaks 0.4–0.97, RMS ≈ −20 dBFS). The 9.3× gain pushed peaks to 4–8× full scale and the ±1.0 clamp flat-topped 13–34% of all samples.
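The two derived columns can be reproduced from raw samples with a few lines. This is a sketch of the measurement idea only; it is not the code in examples/pocket_clip_probe, and the function names are made up.

```rust
/// Peak magnitude of a buffer (0.0 for an empty one).
pub fn peak(samples: &[f32]) -> f32 {
    samples.iter().fold(0.0_f32, |m, s| m.max(s.abs()))
}

/// Fraction of samples that `gain` pushes past full scale, i.e. the
/// samples a ±1.0 clamp would flat-top.
pub fn clipped_fraction(samples: &[f32], gain: f32) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let clipped = samples
        .iter()
        .filter(|s| (**s * gain).abs() > 1.0)
        .count();
    clipped as f32 / samples.len() as f32
}
```

For the first row, a raw peak of 0.73 times 9.3 gives about 6.8, matching the post-gain column. On a synthetic buffer such as `[0.05, 0.10, 0.20, 0.73]` (values invented for the example), two of the four samples exceed full scale after the gain, and none do at a gain of 1.0.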

Fix

Delete the gain stage. The kyutai reference pipeline applies no output scaling, and the measurements confirm none is needed. The ±1.0 hard clamp stays (clamp_to_full_scale) as the safety net for outlier transients. This supersedes both prior level strategies — per-sentence normalization (level pumping) and the fixed gain that replaced it (clipping); no-gain is the reference behavior and text-invariant by construction.
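What a clamp-only output stage looks like, as a sketch. The name clamp_to_full_scale comes from the commit; the signature and body below are assumptions, not the code that landed.

```rust
/// Assumed shape of a clamp-only output stage: in-range samples pass
/// through untouched, outliers are pinned to the ±1.0 rails.
pub fn clamp_to_full_scale(samples: &mut [f32]) {
    for s in samples.iter_mut() {
        *s = s.clamp(-1.0, 1.0);
    }
}
```

The property worth testing is the pass-through: for any buffer already inside [-1.0, 1.0], comparing `to_bits()` before and after should show no change, so the stage cannot alter normal speech and only acts on outlier transients.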

The probe is committed so any future gain proposal gets measured against real synth output first.

Barge-in / player_ops code untouched. Full desktop suite + clippy green via pre-push hooks.

npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d and others added 5 commits June 11, 2026 21:55

tlongwell-block force-pushed the tts-quality-overhaul branch from daecbab to 40ac86b Compare June 12, 2026 16:39

tlongwell-block merged commit 1243307 into main Jun 12, 2026
23 checks passed

tlongwell-block deleted the tts-quality-overhaul branch June 12, 2026 16:56

wpfleger96 mentioned this pull request Jun 12, 2026

chore(release): release version 0.3.19 #1014

Merged

tlongwell-block merged 9 commits into main from tts-quality-overhaul
