Session handoff 2026-04-30 — CUDA wins shipped, Ascend saga narrowed #13

Open
ymote wants to merge 3 commits into main from session-handoff-2026-04-30

@ymote ymote commented Apr 30, 2026

Summary

  • Three CUDA wallclock wins shipped today: TTS env-flip (+7%), CUDA QIE norm fusion (+5.85%), CUDA TTS parallel top-K + on-device sampling (-11.2% stochastic). All committed to OminiX-CUDA origin/main at 3fa53afd1.
  • Three silent bugs caught and fixed by perf agents: a race in the P1 cudaMemcpyAsync pattern, a race in rep_penalty for duplicate tokens, and a CFG mask not indexed per-batch on CUDA QIE.
  • Ascend QIE saga narrowed from "anywhere in 60-block DiT" to "block-0 attention substep at REAL inputs" via three closed hypothesis classes.
  • #99 conditioner.hpp pad fix landed on ac01 (commit 61c8e2f, build clean, runtime test pending).
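
The rep_penalty race for duplicate tokens can be pictured with a small Python model. This is only an illustration, not the CUDA kernel: the actual fix in OminiX-CUDA is not described here, and deduplicating token ids before the update is just one common way to remove the hazard. The function names and the divide-positive / multiply-negative penalty convention are assumptions.

```python
def apply_penalty_per_occurrence(logits, history, penalty):
    """Hazardous pattern: one update per history entry.

    Run serially, a token that appears N times is penalized N times.
    Run with one thread per entry, the same logit gets concurrent
    read-modify-write updates, so the result depends on timing.
    """
    out = list(logits)
    for tok in history:
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def apply_penalty_once_per_token(logits, history, penalty):
    """Safer pattern: visit each distinct token id exactly once."""
    out = list(logits)
    for tok in set(history):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.0, -1.0, 0.5, 3.0]
history = [3, 0, 3, 3]  # token 3 is a duplicate

per_occurrence = apply_penalty_per_occurrence(logits, history, penalty=2.0)
once_per_token = apply_penalty_once_per_token(logits, history, penalty=2.0)
```

With these inputs the per-occurrence version shrinks the logit of token 3 three times, while the deduplicated version shrinks it once, which is the deterministic behaviour a parallel kernel needs to reproduce.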

Side artifacts: audit reports, perf explorations, F32 oracle findings, dispatch decision-tree docs, codex MCP config (.mcp.json), per-block bisect analysis script.

Bundles for fresh-machine resume (NOT in git, on Mac local at /Users/yuechen/home/):

  • ac03_saga_2026-04-30.bundle (147 MB) — ac03 main HEAD 3daae48
  • ac01_99_pad_fix.bundle (146 MB) — #99 fix
  • qie-saga-5.5.65-snapshot.bundle (146 MB) — prior backup

Test plan

  • SESSION_HANDOFF_2026-04-30.md reads cleanly and has the actual decisions / artifact paths a next agent needs
  • .mcp.json codex MCP config loads in Claude Code (other agents can use codex-as-MCP)
  • Bundle hashes match for the handoff artifacts
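
The bundle-hash check could be scripted roughly as follows. This is a sketch under assumptions: the PR does not record the expected digests or say which hash function was used, so SHA-256 and the `manifest` dictionary are hypothetical, and the demo runs on a throwaway file rather than the real bundles.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so a large bundle is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_bundles(directory, expected):
    """Map each bundle name to 'ok', 'mismatch', or 'missing'."""
    report = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            report[name] = "missing"
        elif sha256_of(path) == want:
            report[name] = "ok"
        else:
            report[name] = "mismatch"
    return report


# Demo on a temporary directory; real use would pass the bundle directory
# and a manifest of digests recorded when the bundles were created.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.bundle").write_bytes(b"hello")
    manifest = {
        "demo.bundle": hashlib.sha256(b"hello").hexdigest(),
        "absent.bundle": "0" * 64,
    }
    report = check_bundles(tmp, manifest)
```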

ymote and others added 3 commits April 30, 2026 16:45
The `tts_qwen3` handler previously accepted any `voice` string and let
the TTS backend silently fall back to a default, so callers whose
requested voice was never registered (e.g. fm_tts on octos mini2
calling with `voice=yangmi` while the server has no custom voices
loaded) would receive `200 OK` + audio in the wrong voice. The LLM
layer could not detect the mismatch and relayed the wrong voice back
to the user.

Changes:

- New `voice_registry` module: collects the set of valid voices from
  the built-in Qwen3-TTS preset speakers plus any custom voices
  declared in `~/.OminiX/models/voices.json` (overridable via
  `OMINIX_VOICES_JSON`). Supports aliases.
- `tts_qwen3` now normalizes the requested voice (preserving the
  existing empty/"default" -> "vivian" fallback) and checks it
  against the registry before scheduling synthesis. Unknown voices
  return `404 Not Found` with a JSON body containing
  `error: "voice_not_found"`, the requested voice, and the list of
  available voices.
- Logs the rejection at INFO with requesting IP (x-forwarded-for,
  x-real-ip, or peer socket).
- `/v1/voices` listing endpoint is untouched; the synthesis engine is
  untouched.
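
The validation flow described above can be sketched in a few lines. This is a Python model for illustration only; the real handler is Rust (salvo). The `voices.json` schema, the response keys other than `error`, the case-sensitive matching, and the one-entry preset list are all assumptions, since none of them is spelled out in this commit message.

```python
import json

DEFAULT_VOICE = "vivian"
PRESETS = ["vivian"]  # stand-in: the full preset speaker list is not given here


class VoiceRegistry:
    def __init__(self, presets, voices_json_text=None):
        self._names = list(presets)
        self._aliases = {}
        try:
            data = json.loads(voices_json_text) if voices_json_text else {}
        except json.JSONDecodeError:
            data = {}  # malformed file: fall back to presets only
        if not isinstance(data, dict):
            data = {}
        # Hypothetical schema: {"voices": [{"name": ..., "aliases": [...]}]}
        for entry in data.get("voices", []):
            self._names.append(entry["name"])
            for alias in entry.get("aliases", []):
                self._aliases[alias] = entry["name"]

    def resolve(self, voice):
        """Return the canonical voice name, or None if unknown (case-sensitive)."""
        if voice in self._names:
            return voice
        return self._aliases.get(voice)

    def available(self):
        return list(self._names)


def normalize(voice):
    """Keep the existing fallback: missing, empty, or "default" means "vivian"."""
    return DEFAULT_VOICE if voice in (None, "", "default") else voice


def check_voice(registry, requested):
    """Return (status, body) before any synthesis is scheduled."""
    resolved = registry.resolve(normalize(requested))
    if resolved is None:
        body = {
            "error": "voice_not_found",
            "voice": requested,
            "available_voices": registry.available(),
        }
        return 404, body
    return 200, {"voice": resolved}


custom = '{"voices": [{"name": "yangmi", "aliases": ["ym"]}]}'
registry = VoiceRegistry(PRESETS, custom)
```

Calling `check_voice(VoiceRegistry(PRESETS), "yangmi")` models the fm_tts case from the description: with no custom voices loaded the request is rejected with a 404 body instead of being synthesized in the default voice.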

Tests (+13):

- 8 unit tests for `VoiceRegistry` covering presets, custom voices,
  aliases, case sensitivity, missing/malformed JSON, and listing
  order.
- 4 handler-level integration tests via salvo's `TestClient`:
  unknown voice -> 404 with contract-shaped body, registered preset
  -> 200, missing voice field -> 200 (default path preserved),
  custom alias -> 200.
- 1 serialization sanity test for `VoiceNotFoundError`.
…ga narrowed

Session-handoff doc (SESSION_HANDOFF_2026-04-30.md) captures:

- Three CUDA wallclock wins (today): #182 +7% TTS env-flip, #187 +5.85%
  CUDA QIE norm fusion, #199 -11.2% CUDA TTS stochastic via parallel top-K
  + on-device sampling chain. All committed to OminiX-CUDA origin/main at
  3fa53afd1.

- Three silent bug fixes caught by perf agents: cudaMemcpyAsync race in P1
  pattern, rep_penalty race in P2/predictor for duplicate tokens, CFG mask
  not indexed per-batch (silent wrong attention with negative prompts on
  CUDA QIE).

- Ascend QIE saga narrowed from "anywhere in 60-block DiT" to "block-0
  attention substep at REAL inputs" via three closed hypothesis classes
  (F16 matmul saturation, sigma/Euler chain, distributed-compounding cast
  noise). Existing dump infrastructure on ac03 supports the next bisect.

- #99 conditioner.hpp pad fix landed on ac01 (commit 61c8e2f, build clean,
  runtime test pending QIE weights).
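
The CFG mask bug in the list above is the kind that produces plausible output while being wrong, so a toy model may help. This Python sketch is an assumption-laden illustration, not the CUDA QIE code: it assumes batch 0 holds the positive prompt, batch 1 holds a shorter negative prompt padded to the same length, and the buggy path reuses the mask of batch 0 for every batch element.

```python
NEG_INF = float("-inf")


def mask_row(scores, keep):
    """Set attention scores of masked-out (padding) positions to -inf."""
    return [s if k else NEG_INF for s, k in zip(scores, keep)]


def apply_mask_per_batch(scores, masks):
    """Correct pattern: batch element b uses masks[b]."""
    return [mask_row(scores[b], masks[b]) for b in range(len(scores))]


def apply_mask_batch0_only(scores, masks):
    """Buggy pattern: every batch element reuses masks[0]."""
    return [mask_row(row, masks[0]) for row in scores]


# Batch 0: positive prompt, 4 real tokens.
# Batch 1: negative prompt, 2 real tokens followed by 2 padding tokens.
masks = [[1, 1, 1, 1], [1, 1, 0, 0]]
scores = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]

correct = apply_mask_per_batch(scores, masks)
buggy = apply_mask_batch0_only(scores, masks)
```

In the buggy result the negative prompt still attends to its padding positions, which raises no error and only shows up as degraded output when a negative prompt is in use.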

Side artifacts also tracked: audit + exploration reports from today's
work (ascend_native_engine_audit, cuda_*_perf_exploration, qie_*_findings,
qie_clamp_disambiguation), Q4_0 dequant + matmul F32 oracle script, and
the per-block bisect analysis script. Plus the .mcp.json codex MCP config.

Bundles for resuming on a fresh machine (NOT in git, on Mac local):
  /Users/yuechen/home/ac03_saga_2026-04-30.bundle  (147 MB, ac03 main HEAD 3daae48)
  /Users/yuechen/home/ac01_99_pad_fix.bundle       (146 MB, #99 fix)
  /Users/yuechen/home/qie-saga-5.5.65-snapshot.bundle (146 MB, prior backup)

For next agent: read SESSION_HANDOFF_2026-04-30.md end-to-end before
dispatching anything. The bisect chain has narrowed dramatically; the
next concrete step is the block-0 substep bisect with codex-corrected
scope (start at 02_img_mod_out + chunks, not just 08_Q/K/V).
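
A block-0 substep bisect of this kind usually reduces to walking the dumps in execution order and reporting the first one that disagrees with the reference. The sketch below is a minimal Python version under stated assumptions: the tensors are tiny synthetic lists, the tolerance is arbitrary, and only the names `02_img_mod_out` and `08_Q/K/V` come from the handoff (their exact spelling as `08_Q`, `08_K`, `08_V` is assumed). Loading the real dumps from ac03 is out of scope here.

```python
import math


def max_abs_diff(ref, got):
    """Largest elementwise error; any NaN or inf counts as infinite error."""
    worst = 0.0
    for r, g in zip(ref, got):
        d = abs(r - g)
        if not math.isfinite(d):
            return math.inf
        worst = max(worst, d)
    return worst


def first_divergence(reference, candidate, order, tol):
    """Return (substep, error) for the first substep beyond tol, else None."""
    for name in order:
        err = max_abs_diff(reference[name], candidate[name])
        if err > tol:
            return name, err
    return None


# Start at the modulation output rather than at Q/K/V, as the handoff advises.
SUBSTEPS = ["02_img_mod_out", "08_Q", "08_K", "08_V"]

reference = {
    "02_img_mod_out": [0.5, -1.0],
    "08_Q": [1.0, 2.0],
    "08_K": [0.0, 1.0],
    "08_V": [3.0, 4.0],
}
candidate = {
    "02_img_mod_out": [0.5, -1.0],
    "08_Q": [1.0, 2.5],
    "08_K": [0.0, float("nan")],
    "08_V": [3.0, 4.0],
}

hit = first_divergence(reference, candidate, SUBSTEPS, tol=1e-2)
```

Reporting only the first divergent substep matters because every later dump inherits the error, so downstream mismatches carry no extra information about the root cause.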
For another agent on a fresh machine without ~/.ssh/config aliases,
SESSION_HANDOFF_2026-04-30.md now includes:
- Hostname, port, user, key path for each of ac01/ac02/ac03 + zgx-5b44/zgx-3675
- Direct ssh command snippets
- ~/.ssh/config aliases to copy
- Smoke test loop to verify all boxes reachable
- Note that keys (.pem for Ascend, ~/.ssh/id_ed25519 for CUDA) are NOT
  in the repo for security; ask user (Yue Chen) if missing on the new
  machine
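
A reachability smoke test over those boxes might look like the sketch below. It is not the loop from SESSION_HANDOFF_2026-04-30.md, which is not reproduced in this PR; it assumes the five names are usable as `~/.ssh/config` aliases and that running `true` on the remote side is an acceptable probe. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options; the timeouts are arbitrary.

```python
import subprocess

HOSTS = ["ac01", "ac02", "ac03", "zgx-5b44", "zgx-3675"]


def ssh_probe_command(host, timeout_s=5):
    """Build a non-interactive ssh command that just runs `true` remotely."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # fail instead of prompting for a password
        "-o", f"ConnectTimeout={timeout_s}",
        host,
        "true",
    ]


def probe(host, runner=subprocess.run):
    """Return True if the host answered, False on any failure or timeout."""
    try:
        result = runner(ssh_probe_command(host), capture_output=True, timeout=15)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def smoke_test(hosts=HOSTS, runner=subprocess.run):
    """Probe every host and return {host: reachable}."""
    return {host: probe(host, runner) for host in hosts}
```

Running `smoke_test()` and printing the hosts whose value is `False` gives a quick list of boxes that still need a key or an alias. The `runner` parameter exists only so the loop can be exercised without touching the network.
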