diff --git a/.gitignore b/.gitignore index 9e49fce..02d59b8 100644 --- a/.gitignore +++ b/.gitignore @@ -3,3 +3,7 @@ *.png ws_chunks/ ws_multi_voice/ +.DS_Store +.claude/ +tmp/ +*.bundle diff --git a/.mcp.json b/.mcp.json new file mode 100644 index 0000000..c92d009 --- /dev/null +++ b/.mcp.json @@ -0,0 +1,8 @@ +{ + "mcpServers": { + "codex": { + "command": "codex", + "args": ["mcp-server"] + } + } +} diff --git a/AGENT_HANDOFF.md b/AGENT_HANDOFF.md new file mode 100644 index 0000000..f3627cd --- /dev/null +++ b/AGENT_HANDOFF.md @@ -0,0 +1,231 @@ +# Agent Handoff: OminiX Makepad Comparison + Ascend TTS Optimization + +## Context + +This session started as a deep code review of Makepad's `libs/` vs OminiX-MLX, then evolved into cross-platform TTS optimization spanning three repos and two hardware platforms. + +## Repos & Locations + +| Repo | Path | Remote | +|------|------|--------| +| OminiX-MLX | `/Users/yuechen/home/OminiX-MLX/` | `oxiglade/mlx-rs` | +| OminiX-API | `/Users/yuechen/home/OminiX-API/` | `OminiX-ai/OminiX-API` | +| OminiX-Ascend | `/Users/yuechen/home/OminiX-Ascend/` | `OminiX-ai/OminiX-Ascend` | +| MFA | `/Users/yuechen/home/OminiX-MLX/universal-metal-flash-attention/` | (submodule) | +| Makepad (reference) | `/tmp/makepad-dev/` | shallow clone, read-only | + +## Ascend Server Access + +```bash +ssh -i ~/home/tensordock/KeyPair-4fbd-yue.pem -p 31984 ma-user@dev-modelarts.cn-southwest-2.huaweicloud.com +``` +- Ascend 910B4, 32GB HBM, CANN 8.3.RC1 +- Code deployed at: `~/work/OminiX-Ascend/` +- GGUF models: `~/work/OminiX-Ascend/tools/qwen_tts/gguf/` (Base) and `gguf_cv/` (CustomVoice) +- Always set env before running: +```bash +export ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest +export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:$ASCEND_TOOLKIT_HOME/runtime/lib64:$LD_LIBRARY_PATH +source $ASCEND_TOOLKIT_HOME/../set_env.sh 2>/dev/null || true +``` + +## AutoDL Server Access (China GPU server, for relaying files) + +```bash +sshpass -p "hWOBhhnRX0lo" ssh -p 13797 root@connect.gda1.seetacloud.com +``` +- Use `source /root/miniconda3/etc/profile.d/conda.sh && conda activate base` for Python +- HuggingFace blocked from both servers. Use local Mac to download, relay via AutoDL. +- AutoDL `/tmp` is FULL (30GB). Use `/root/autodl-tmp/` (82GB free). + +## What Was Shipped + +### OminiX-MLX (all committed, not pushed) + +1. **Memory module** (`mlx-rs-core/src/memory.rs`): `MemoryGuard`, `eval_with_retry`, `memory_snapshot`, `preflight_check`, `clear_cache`, `set_memory_limit`, `set_cache_limit` +2. **MemoryGuard propagated** to 7 crates: mixtral-mlx, qwen3-mlx (3 files), qwen3.5-35B-mlx, gpt-sovits-mlx, qwen-image-mlx. All `unsafe { mlx_clear_cache() }` replaced. +3. **Manual matmul attention → MLX SDPA** in 4 crates: funasr-mlx (encoder+decoder), deepseek-ocr2-mlx (2 files), qwen3-vl-mlx. Measured 1.3-2.2x attention speedup. +4. **Batched QKV projection** in qwen3-tts-mlx: merged q_proj/k_proj/v_proj into single quantized matmul. Clean refactor, speed-neutral. +5. **Flash attention Metal kernel** (`mlx-rs-core/src/metal_kernels.rs`): Online softmax, GQA-native, causal masking. Tests pass. Slower than MLX SDPA for Q=1 decode but ready for ASR/FLUX long prefill. +6. **MFA Metal fix**: Replaced `__asm` AIR intrinsics with barrier-based cooperative copy for macOS 26 compat. Added `MTLCompileOptions` with `.version3_1`. +7. 
**MFA GQA**: Added `num_kv_heads` to FFI, fixed `batchSize` passthrough, auto-select broadcast mode.

### OminiX-API (committed, not pushed)

1. **GPU memory monitoring**: `memory_snapshot()` logged at model load and per-request for image/llm/tts engines.
2. **eval_with_retry**: Denoising loop + VAE decode in image engine (2 retries on OOM).
3. **Preflight check**: Estimated VRAM per model type, warns before loading.
4. **Safe clear_cache**: Replaced `unsafe { mlx_sys::mlx_clear_cache() }` with `memory::clear_cache()`.

### OminiX-Ascend (committed, not pushed)

Commits (8 total on main):
1. `daf1f33` — Language tolower fix for xvec/customvoice
2. `54b6080` — MRoPE 4x positions, remove double tts_pad, llama.cpp CANN path for xvec (30x speedup: 0.43→12.8 fps)
3. `808a5d6` — Standard RoPE works for TTS (MRoPE GGUF produces noise)
4. `c856721` — CustomVoice mode, speaker ID export, Q8/Q5/Q4 quantization benchmarks
5. `5765e99` — Batch reuse optimization (+15%, 13.0 fps)
6. `dabca7d` — cp_groups configurable flag
7. `61c75ff` — cp_layers flag + native CANN engine prototype
8. `12405a5` — CANN engine build findings, cp_layers test results

## Current Performance Numbers

### Apple M3 Max (MLX, 8-bit quantized)
- Qwen3-TTS: **46 fps, ~3.1x realtime** (cannot be improved further — MLX ceiling)

### Ascend 910B4 (llama.cpp + CANN, Q8_0)
- ICL voice clone: **13.0 fps, 1.48x realtime** (best achieved)
- xvec voice clone: **12.8 fps** (speech quality OK but not as good as MLX)
- CustomVoice vivian: **~12 fps** (working with proper CV GGUF)
- Q5_K_M: 3.8 fps (CANN not optimized for Q5)
- Q4_K_M: 4.5 fps (same issue)

### Bottleneck Analysis (Ascend, per frame at 13 fps = 77ms/frame)
- Talker LLM decode (28 layers, NPU): **17ms** — fast, well-optimized
- Code Predictor (5 layers × 14 decodes, NPU via llama.cpp): **46ms** — bottleneck
  - Each `llama_decode()`: ~3ms (2ms launch overhead + 1ms compute)
  - 14 sequential calls: ~42ms NPU + ~4ms CPU (lm_heads, embedding lookups)
- Codec head + sampling + embedding: **14ms** — CPU, already NEON-optimized

## What Failed (Don't Retry)

| Approach | Why it failed |
|----------|--------------|
| Fused RMSNorm+QKV projection kernel (MLX) | MLX lazy eval already batches — 0% gain |
| Fused QK-norm+RoPE kernel (MLX) | Same — MLX handles dispatch batching |
| Flash attention for TTS decode (MLX) | MLX SDPA faster at Q=1 |
| Custom quantized matmul (MLX) | MLX already fused+M3-tuned, <5% theoretical |
| MRoPE GGUF for TTS (Ascend) | Standard RoPE works; MRoPE produces noise |
| CANN graph mode `GGML_CANN_ACL_GRAPH=on` (Ascend) | NPU core crash on TTS embedding inputs |
| GGML CP session on CANN (Ascend) | Segfault — custom CP ops not supported |
| CPU NEON CP path (Ascend) | Too slow for 5-layer transformer |
| cp_groups < 15 (Ascend) | EOS breaks + quality degrades (model not trained for partial codebooks) |
| cp_layers < 5 (Ascend) | Only layers=5 produces audible speech |

## What To Do Next (Priority Order)

### 1. CANN Engine Integration (HIGH — the only path to 20+ fps)

**Status**: 850-line prototype at `tools/qwen_tts/cp_cann_engine.{h,cpp}`. Code is written. Build fails due to CANN library circular dependencies.

**Fix options** (try in order):
1. **dlsym approach**: Load `aclnnMm`, `aclnnRmsNorm`, `aclnnSilu`, `aclCreateTensor` at runtime via `dlopen("libopapi.so")`. Avoids all link-time dependency issues. ~50 lines of function pointer declarations. A sketch of this pattern is in the appendix at the end of this document.
2.
**Compile into ggml-cann.so**: Add `cp_cann_engine.cpp` to `ggml/src/ggml-cann/CMakeLists.txt` instead of `tools/qwen_tts/CMakeLists.txt`. All CANN symbols already resolved there.
3. **Separate shared lib**: Build `libcp_cann.so` with proper CANN link flags, dlopen it from qwen_tts.

**Expected gain**: 14 × (3ms → <1ms) = 28ms saved/frame → ~49ms/frame total → **~20 fps** from this change alone; with the CP path driven close to zero the ceiling is 77 - 46 = 31ms/frame ≈ 32 fps, so **25-30 fps** is the realistic upper band.

**Integration point**: In `talker.cpp` `predict_code_groups()`, add a third path:
```cpp
if (cp_cann_engine_ && cp_cann_engine_->is_ready()) {
    // Native CANN path — bypasses llama.cpp
    cp_cann_engine_->reset_kv_cache();
    cp_cann_engine_->forward_one_token(hidden_states, 0, cp_out.data());
    // ... same autoregressive loop as CPU path (lines 1528-1549) ...
}
```

### 2. xvec Audio Quality (MEDIUM — Issue #1)

xvec voice clone produces speech-range audio but sounds worse than MLX. Needs tensor-by-tensor debugging:
- Compare prefill hidden states layer-by-layer between C++ and MLX
- The `verify_*.py` scripts in `tools/qwen_tts/` can help
- Likely cause: weight precision (Q8_0 GGUF vs f32 MLX, especially codec_head/text_projection)

### 3. Push All Commits (LOW — needs user approval)

None of the repos have been pushed. All changes are local commits:
- OminiX-MLX: `git push` on main
- OminiX-API: `git push` on main
- OminiX-Ascend: `git push` on main

### 4. Open GitHub Issues (some done, some not)

**Already opened:**
- oxiglade/mlx-rs#339 — Flash attention for ASR/FLUX
- oxiglade/mlx-rs#340 — F16 KV cache
- oxiglade/mlx-rs#341 — Speculative TTS decode
- OminiX-ai/OminiX-Ascend#1 — xvec noise debugging
- OminiX-ai/OminiX-Ascend#2 — CV-specific GGUF export

**Should open:**
- OminiX-ai/OminiX-Ascend#3 — Native CANN CP engine (bypass llama.cpp)
- OminiX-ai/OminiX-Ascend#4 — CANN graph mode crash report (for Huawei)

## Key Files to Know

### OminiX-MLX
- `mlx-rs-core/src/memory.rs` — new memory module
- `mlx-rs-core/src/metal_kernels.rs` — flash attention kernel + fused ops
- `qwen3-tts-mlx/src/talker.rs` — batched QKV, generation
- `qwen3-tts-mlx/src/generate.rs` — MemoryGuard integration

### OminiX-Ascend
- `tools/qwen_tts/talker.cpp` — all TTS generation paths (ICL/xvec/customvoice)
- `tools/qwen_tts/talker.h` — TalkerSamplingParams (cp_max_groups, cp_max_layers)
- `tools/qwen_tts/qwen_tts.cpp` — orchestration, model loading
- `tools/qwen_tts/main.cpp` — CLI flags
- `tools/qwen_tts/cp_cann_engine.{h,cpp}` — native CANN prototype (NOT compiled yet)
- `tools/qwen_tts/export_qwen_tts.py` — GGUF export (now with spk_ids_json)
- `tools/qwen_tts/export_talker_llama.py` — llama-format GGUF export

### OminiX-API
- `src/engines/image.rs` — eval_with_retry, preflight, memory logging
- `src/engines/llm.rs` — memory logging
- `src/engines/qwen3_tts.rs` — memory logging

## Benchmarking Commands

### MLX TTS (Mac)
```bash
cd /Users/yuechen/home/OminiX-MLX
./target/release/examples/synthesize -m models/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit -s vivian -l english -o /tmp/test.wav --seed 42 "Test text here."
```

### Ascend TTS (ICL, best quality)
```bash
./build/bin/qwen_tts -m tools/qwen_tts/gguf \
  -t "Test text." \
  -r tools/qwen_tts/data/ref_audios/ellen_ref_24k.wav \
  --ref_text "Reference transcript here." \
  --target_lang English --ref_lang English \
  --talker_model tools/qwen_tts/gguf/qwen_tts_talker_llama_q8_0.gguf \
  --cp_model tools/qwen_tts/gguf/qwen_tts_cp_llama.gguf \
  --n_gpu_layers 29 -n 8 -o output.wav -p
```

### Ascend TTS (CustomVoice)
```bash
./build/bin/qwen_tts -m tools/qwen_tts/gguf_cv \
  --mode customvoice --speaker vivian \
  -t "Test text." --target_lang English \
  --talker_model tools/qwen_tts/gguf_cv/qwen_tts_talker_llama_q8_0.gguf \
  --cp_model tools/qwen_tts/gguf_cv/qwen_tts_cp_llama.gguf \
  --n_gpu_layers 29 -n 8 -o output.wav
```

### Audio Quality Validation
```python
import wave, array

with wave.open('file.wav', 'rb') as w:
    frames = w.readframes(min(w.getnframes(), 24000*5))
s = array.array('h', frames)
rms = (sum(x*x for x in s) / len(s)) ** 0.5
zc = sum(1 for i in range(1, len(s)) if (s[i] >= 0) != (s[i-1] >= 0))
zcr = zc / len(s) * 24000
print(f'RMS={rms:.0f} ZCR={zcr:.0f}Hz {"SPEECH" if 500 < zcr < 1000 else "POOR"}')
# BUT: metrics can't distinguish intelligible speech from garbled speech-like audio.
# Always listen to the file. The user is strict about quality.
```

## Important Lessons Learned

1. **MLX lazy eval is very effective** — don't try to out-optimize it with custom kernels for small ops
2. **Q8_0 is fastest on Ascend** — Q5/Q4 are SLOWER due to unoptimized CANN kernels
3. **Standard RoPE works for TTS** — MRoPE GGUF breaks it (temporal-only positions = standard)
4. **All 5 CP layers needed** — can't reduce without quality collapse
5. **All 15 codec groups needed** — can't skip without EOS breaking + quality loss
6. **llama.cpp IS the bottleneck** on Ascend — 2ms per-launch overhead × 14 CP decodes = 28ms wasted
7. **User is strict about audio quality** — "sounds good" by metrics doesn't mean acceptable. Always play the file.
8. **File transfers to China servers**: Use AutoDL as relay (both in China). `/tmp` on AutoDL is FULL — use `/root/autodl-tmp/`.
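## Appendix: dlsym Sketch (CANN Fix Option 1)

A minimal sketch of the runtime-loading pattern from fix option 1, under stated assumptions: the `CannSymbols` struct and `generic_fn` pointer type are illustrative placeholders, and the real aclnn entry points take workspace-size queries and executor handles, so each pointer must be cast to the exact signature from the CANN headers before it is called.

```cpp
// Illustrative sketch: resolve aclnn symbols at runtime so cp_cann_engine
// never links against CANN's circularly-dependent libraries.
#include <dlfcn.h>
#include <cstdio>

// Placeholder type: cast per symbol to the real aclnn signature before use.
using generic_fn = void (*)();

struct CannSymbols {
    void*      handle          = nullptr;
    generic_fn aclnnMm         = nullptr;
    generic_fn aclnnRmsNorm    = nullptr;
    generic_fn aclnnSilu       = nullptr;
    generic_fn aclCreateTensor = nullptr;

    bool load() {
        // Runtime load sidesteps the link-time circular dependencies entirely.
        handle = dlopen("libopapi.so", RTLD_NOW | RTLD_GLOBAL);
        if (!handle) {
            fprintf(stderr, "[cp_cann] dlopen(libopapi.so): %s\n", dlerror());
            return false;
        }
        aclnnMm         = (generic_fn) dlsym(handle, "aclnnMm");
        aclnnRmsNorm    = (generic_fn) dlsym(handle, "aclnnRmsNorm");
        aclnnSilu       = (generic_fn) dlsym(handle, "aclnnSilu");
        aclCreateTensor = (generic_fn) dlsym(handle, "aclCreateTensor");
        return aclnnMm && aclnnRmsNorm && aclnnSilu && aclCreateTensor;
    }

    ~CannSymbols() {
        if (handle) dlclose(handle);
    }
};
```

If any of these symbols turns out to live outside `libopapi.so`, a second `dlopen` on the providing library covers it; the point is only that no CANN library appears on the link line.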
diff --git a/Cargo.lock b/Cargo.lock index ed43971..84ed627 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -3973,8 +3973,11 @@ checksum = "3b0978458bee0102c6c337040ea0b13c497ff1e31015c49c2cc9a387f813e2c1" dependencies = [ "async-trait", "base64 0.22.1", + "brotli", "bytes 1.11.0", + "encoding_rs", "enumflags2", + "flate2", "form_urlencoded", "futures-channel", "futures-util", @@ -4000,6 +4003,7 @@ dependencies = [ "serde", "serde-xml-rs", "serde_json", + "serde_urlencoded", "sync_wrapper", "tempfile", "thiserror 2.0.18", @@ -4007,6 +4011,8 @@ dependencies = [ "tokio-rustls", "tokio-util", "tracing", + "url", + "zstd", ] [[package]] @@ -5835,6 +5841,34 @@ version = "1.0.16" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "dfcd145825aace48cff44a8844de64bf75feec3080e0aa5cdbde72961ae51a65" +[[package]] +name = "zstd" +version = "0.13.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e91ee311a569c327171651566e07972200e76fcfe2242a4fa446149a3881c08a" +dependencies = [ + "zstd-safe", +] + +[[package]] +name = "zstd-safe" +version = "7.2.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8f49c4d5f0abb602a93fb8736af2a4f4dd9512e36f7f570d66e65ff867ed3b9d" +dependencies = [ + "zstd-sys", +] + +[[package]] +name = "zstd-sys" +version = "2.0.16+zstd.1.5.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "91e19ebc2adc8f83e43039e79776e3fda8ca919132d68a1fed6a5faca2683748" +dependencies = [ + "cc", + "pkg-config", +] + [[package]] name = "zune-core" version = "0.4.12" diff --git a/Cargo.toml b/Cargo.toml index 3bc29d3..34a8045 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -92,6 +92,12 @@ image = "0.25" # HTTP client (for model downloads) reqwest = { version = "0.12", features = ["blocking", "json"] } +[dev-dependencies] +# Enable salvo's TestClient for handler-level integration tests. +salvo = { version = "0.81.0", default-features = false, features = [ + "affix-state", "cors", "server", "http1", "http2", "websocket", "sse", "test" +] } + [[bin]] name = "ominix-api" path = "src/main.rs" diff --git a/QIE_ASCEND_HANDOFF.md b/QIE_ASCEND_HANDOFF.md new file mode 100644 index 0000000..1bd8330 --- /dev/null +++ b/QIE_ASCEND_HANDOFF.md @@ -0,0 +1,308 @@ +# QIE-Edit on Ascend 910B — Handoff for Review + +**Status (2026-04-29 evening, post-§5.5.13c discovery): contract gate WIDE OPEN.** +Both Ascend QIE paths produce NaN at 256² and 1024² in production cadence. The "GREEN" findings from §5.5.59/65/66/67 + audit #188 + §5.5.13 partial close were all under `SD_FIRST_NAN_TRACE=1`, which serializes per-node compute and adds CPU work between dispatches — that timing change masks the production NaN bug. Real production cadence (no tracer) is fully NaN. + +**Maintainer:** Yue Chen (yue.chen@futurewei.com) +**Date:** 2026-04-29 (revised after the §5.5.13c session) + +> **READ THIS FIRST**: the older sections of this handoff describe a saga state that has since been invalidated by direct production-cadence verification at HEAD. Specifically, the "first-NaN at node 4654 RMSNorm" finding was a tracer artifact, and the "switch from CLI to native" audit verdict was based on tracer-mode evidence. The corrected reckoning is in the **"Saga reckoning, 2026-04-29 evening"** section near the bottom. Use that section as the canonical state. The earlier sections are kept for archaeological context only. + +--- + +## TL;DR + +Qwen-Image-Edit-2509 (DiT) runs end-to-end correctly at 256² on Ascend 910B via ggml-cann. 
At 1024² it produces all-NaN output. After 60+ rounds of bisect (§5.5.0–§5.5.65), the actual first-NaN node has now been pinpointed: + +- **Step 0 (first DiT forward pass): fully clean across all 10,213 nodes.** +- **Step 1 (second DiT forward pass): first-NaN at node 4654, op=`RMS_NORM`, shape `[128, 24, 4096, 1]` f32, with clean input.** + +This is `norm_q` or `norm_k` (per-head Q/K RMSNorm) inside `Attention::forward` at roughly DiT block 27. The kernel's own input is clean (0 NaN at the moment of post-compute callback) but its output contains 4096 NaN values — exactly one per token-row. The pattern strongly suggests either an Ascend `aclnnRmsNorm` kernel quirk on the second graph invocation at this exact shape, or a buffer-aliasing issue that the existing `SD_FIRST_NAN_TRACE` callback timing happens to hide. + +The next step (§5.5.66, not started) is to dump the actual input buffer at that node and compare against a CPU reference RMSNorm to decide which. + +--- + +## Repository locations + +| Where | Path | Branch | Notes | +|---|---|---|---| +| **Production (live)** | `ac03:/home/ma-user/work/OminiX-Ascend` | `main` | **54 commits ahead of `origin/main`, NOT pushed**. This is where §5.5.x work lives and the binary that produced the §5.5.65c trace was built here. | +| Mac local mirror | `/Users/yuechen/home/OminiX-Ascend` | — | Separate fork-style branch with the same §5.5.65 work landed independently (commit hashes differ: `b8d36560`, `5e9bd3e1`, `b2dd5560`). Useful for code reading on Mac; **the trace evidence is on ac03**. | +| Upstream remote | `https://github.com/ymote/OminiX-Ascend.git` | `main` | Last upstream commit is `7306b7e5 Fix CANN9 Qwen3.6 frontend shutdown`. The §5.5.x saga work has not been pushed. | + +**ac03 access:** SSH config alias `ac03` (Huawei ModelArts notebook, user `ma-user`). NPU driver: `npu-smi 23.0.6`. Hardware: 910B4 (32 GB HBM). + +--- + +## What's in production code on ac03 (saga commits, newest first) + +``` +a078106 qie(§5.5.65): restore §5.5.53b INPUT flag annotations clobbered by scp +1e2de75 qie(§5.5.65): count true NaN only, not Inf +efd5384 qie(§5.5.65): extend FIRST-NAN trace to F16/BF16 + skip non-scannable +a1ca71d qie(§5.5.65): add SD_FIRST_NAN_TRACE per-node post-compute NaN scan +737e7c3 qie(Q2.4.5.5.60): § doc — d82ba89 verification PARTIAL, 256² still RED +d82ba89 fix(qie-edit): §5.5.59 — protect INPUT leaves from gallocr free + reuse +82e6e1a fix(qie-edit): mark model_out as OUTPUT — partial close on 1024² multi-step NaN +c88ca9b fix(qie-edit): partial graph-allocator alias fix — closes 256² NaN, 1024² still RED +``` + +The `§5.5.x` numbering is internal to `docs/qie_q2_phase4_smoke.md` (the saga journal — read it for full chain of evidence). + +### Files modified across the saga (cumulative) + +| File | What changed | Why | +|---|---|---| +| `ggml/src/ggml-alloc.c` | `+24 / -0` lines. Added INPUT-flag protection so leaf input tensors aren't freed/reused mid-graph by gallocr. Recursive view-chain traversal so views of INPUT tensors are also protected. | §5.5.59 closed 256² NaN; §5.5.61–62 extended coverage. | +| `tools/ominix_diffusion/src/ggml_extend.hpp` | `+294 / -134` lines. Added `SD_FIRST_NAN_TRACE` env-gated per-node post-compute eval_callback that scans every tensor immediately after its op completes. F16/BF16 aware, counts true NaN only (skips ±Inf). | §5.5.65 — needed because `SD_NAN_CHECK` reads at end of graph were corrupted by gallocr slot reuse. | +| `tools/ominix_diffusion/src/qwen_image.hpp` | `+7 / -0` lines. 
Marks `model_out` and other residual-stream tensors as `OUTPUT` so gallocr can't recycle their slots. | §5.5.56–58 partial close; surgical OUTPUT marking. | +| `tools/ominix_diffusion/src/denoiser.hpp` | `+11 / -0` lines. Per-step latent precision instrumentation. | §5.5.50–51 step-count sweep + sampler trace. | +| `docs/qie_q2_phase4_smoke.md` | `+376 / -0` lines. Saga journal — chain of evidence for every §5.5.x. | Documentation only. | + +### Working tree on ac03 (uncommitted) +``` +M ggml/src/ggml-alloc.c # in-progress §5.5.59 tweak +?? ggml/src/ggml-alloc.c.bak_5561 # backup of pre-§5.5.61 +?? tools/probes/qie_block0_cpu_reference/... # probe artifacts (gitignored OK) +?? tools/probes/qie_q45_real_denoise_smoke/... # probe artifacts +?? tools/probes/qie_q45_step4_full_denoise/... # probe artifacts +``` + +--- + +## Reproducing the §5.5.65c trace (the authoritative evidence) + +Binary: `/home/ma-user/work/OminiX-Ascend/build-w1/bin/ominix-diffusion-cli` (built 2026-04-29 07:06). + +Models on ac03: +``` +/home/ma-user/work/qie_weights/Qwen-Image-Edit-2509-Q4_0.gguf +/home/ma-user/work/qie_weights/Qwen2.5-VL-7B-Instruct-Q4_0.gguf +/home/ma-user/work/qie_weights/mmproj-BF16.gguf +/home/ma-user/work/qie_weights/split_files/vae/qwen_image_vae.safetensors +``` + +Run: +```bash +SD_FIRST_NAN_TRACE=1 /home/ma-user/work/OminiX-Ascend/build-w1/bin/ominix-diffusion-cli \ + -m /home/ma-user/work/qie_weights/Qwen-Image-Edit-2509-Q4_0.gguf \ + --qwen2vl /home/ma-user/work/qie_weights/Qwen2.5-VL-7B-Instruct-Q4_0.gguf \ + --qwen2vl-vision /home/ma-user/work/qie_weights/mmproj-BF16.gguf \ + --vae /home/ma-user/work/qie_weights/split_files/vae/qwen_image_vae.safetensors \ + -p "a cat" -W 1024 -H 1024 --steps 3 --seed 42 \ + -o /tmp/out.png 2>&1 | tee run.log +``` + +Existing log of this run: `ac03:/home/ma-user/work/qie_5565c/run.log`. + +### Expected output (what was actually observed) + +``` +[INFO ] qwen2.5vl: tracing 1154 nodes one-at-a-time +[INFO ] qwen2.5vl: no NaN found across 1154 nodes (SD_FIRST_NAN_TRACE) +[INFO ] [NaN CHECK] encoder/cond.c_crossattn: OK (32256 elements, range=[-142.4, 97.6]) +[INFO ] qwen_image: tracing 10213 nodes one-at-a-time +[INFO ] qwen_image: no NaN found across 10213 nodes (SD_FIRST_NAN_TRACE) ← step 0 clean + |================> | 1/3 - 737.86s/it +[INFO ] qwen_image: tracing 10213 nodes one-at-a-time +[ERROR] qwen_image: node 4654/10213 op=RMS_NORM name='node_4654' shape=[128,24,4096,1] type=f32 nans=4096/12582912 +[ERROR] src[0]: op=RESHAPE name=' (view) (reshaped)' shape=[128,24,4096,1] type=f32 nans=0/12582912 +[ERROR] qwen_image: aborting at first-NaN node 4654/10213 (SD_FIRST_NAN_TRACE) +[ERROR] diffusion model compute failed +``` + +**Decoded:** +- 128 = head_dim +- 24 = num_heads +- 4096 = num image tokens at 1024² (1024/16 = 64; 64² = 4096) +- 1 = batch +- The 4096 NaN count = exactly num_tokens, i.e. one NaN per token-row across all heads/positions +- This shape matches the per-head Q-norm or K-norm in attention (`qwen_image.hpp:91-92` declares `RMSNorm(dim_head=128, eps)` for `norm_q`/`norm_k`). With ~170 nodes/block × 60 blocks, node 4654 falls at roughly DiT block 27. + +### Why this trace is trustable (and earlier ones weren't) + +`SD_FIRST_NAN_TRACE` (§5.5.65) registers a per-node eval_callback that fires immediately *after* each op finishes computing. It reads the output AND each `src[i]` at that exact moment, before gallocr has had a chance to reuse any slot for downstream ops. 
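For orientation, the mechanism is roughly the sketch below. This is not the §5.5.65 code (that lives in `tools/ominix_diffusion/src/ggml_extend.hpp` around lines 2080-2260); it is a minimal reconstruction assuming ggml's standard scheduler eval-callback interface, where the callback fires once with `ask=true` before a node runs (return `true` to observe it) and once with `ask=false` right after it has computed. The names `first_nan_ctx` / `first_nan_cb` are illustrative, this version scans F32 only, and the real callback also re-reads each `src[i]` at the same moment.

```cpp
// Minimal first-NaN post-compute scan, assuming ggml's scheduler
// eval-callback interface (ggml-backend.h). Illustrative, not the real code.
#include <cmath>
#include <cstdio>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

struct first_nan_ctx {
    int node_idx = 0;
};

static bool first_nan_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    auto * ctx = (first_nan_ctx *) user_data;
    if (ask) {
        // Observing every node forces one-node subgraphs, which is exactly
        // the per-node serialization this trace mode introduces.
        return true;
    }
    ctx->node_idx++;
    if (t->type != GGML_TYPE_F32) {
        return true; // sketch skips F16/BF16; the real trace converts and scans them too
    }
    // Read the output back immediately after this op computed, before
    // gallocr can hand its slot to any downstream op.
    std::vector<float> host(ggml_nelements(t));
    ggml_backend_tensor_get(t, host.data(), 0, host.size() * sizeof(float));
    size_t nans = 0;
    for (float v : host) {
        nans += std::isnan(v) ? 1 : 0; // true NaN only; Inf passes (until the §5.5.66 Inf scan)
    }
    if (nans > 0) {
        fprintf(stderr, "[FIRST-NAN] node %d op=%s name='%s' nans=%zu/%zu\n",
                ctx->node_idx, ggml_op_name(t->op), t->name, nans, host.size());
        return false; // abort the graph at the first NaN producer
    }
    return true;
}

// Installed once before compute:
//   ggml_backend_sched_set_eval_callback(sched, first_nan_cb, &ctx);
```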
+ +Previously, `SD_NAN_CHECK` (the older mechanism) read tensors at *end of graph* — by which time gallocr has reused freed slots many times. That made every "NaN at end-of-graph" reading unreliable as a producer-localization signal. §5.5.63's "node 14 CONCAT" finding came from this contaminated source and was refuted by §5.5.64a's no-ref control test (the reported producer site never even fired in the no-ref run, yet the NaN signature was identical). + +There is one earlier `qie_5565b/run.log` that reports a different first-NaN inside `qwen2.5vl` at node 28 ADD, with `leaf_121` showing 903/1849 NaN. **Ignore it.** That run was made against a binary built before commit `efd5384`/`1e2de75` (which fixed the trace's true-NaN-vs-Inf counting); the leaf-121 reading is a false positive from an older buggy trace path. + +--- + +## What 60 rounds of bisect have actually established + +The saga isn't a series of failed fixes. Each `§5.5.x` either landed a real fix or ruled out a hypothesis. Net result of the legitimate fixes: + +- **Q4_0 dequant correctness (§5.5.42).** Mirrored ggml-cann's `mul_mat_quant` V2 dispatch byte-for-byte. Before this, mid-block residuals were drifting numerically. +- **F16-saturation in residual stream (§5.5.30, §5.5.45–46).** Widened gated-residual + Q/K/V projection outputs to BF16. Closed step-1 NaN at block 27 in single-step decode and most multi-step cases. +- **AdaLN modulation chunk order (§5.5.21, `ce34b9f`).** CPU reference was swapping scale/shift; engine was correct. Fixed the reference. +- **Graph-allocator buffer alias on second compute (§5.5.59 `d82ba89`).** Added INPUT-flag protection in `ggml-alloc.c` so leaf input tensors (latent, RoPE positions) aren't freed/reused mid-graph. **This closed 256² entirely.** Before it, both 256² and 1024² were RED on multi-step. +- **Recursive view-chain INPUT-flag traversal (§5.5.62).** Same mechanism extended to in-graph CONT/RESHAPE views of INPUT tensors. + +**State after all that:** 256² works at any step count. 1024² works for step 0 but fails at step 1. The remaining failure is the single RMS_NORM node identified above. + +### Hypotheses ruled out (don't reopen) + +- F32 attention accum / softmax precision (§5.5.55) +- F16 residual saturation as the *only* path (§5.5.30 fixed it; new failure mode is structural) +- Sampler-bug class (CacheDiT, layer-bisect — §5.5.51) +- ggml-cann concat shape bug at `[64, 8192]` (§5.5.63 retracted by §5.5.64a) +- Quantization correctness — Q4_0/Q4_1/Q5_K dispatch is bit-accurate (§5.5.31, §5.5.42) +- Weight load — `img_mod_1.weight` etc. are bit-identical to CLI baseline (§5.5.31) +- RoPE precision and shape (§5.5.18 oracle GREEN) +- t_emb chain (§5.5.20 oracle GREEN) +- Audio: not relevant (this is QIE-Edit / image stack) + +--- + +## Open work (§5.5.66 — fix track) + +Tracked as task #179. **Not started.** + +### Important corrections to the §5.5.65c reading (added 2026-04-29 after a 2nd-opinion review) + +Three things in the original framing of this handoff turned out to be wrong or imprecise: + +1. **`src[0] nans=0` ≠ "clean input."** The `SD_FIRST_NAN_TRACE` callback **skips Inf** when counting (`tools/ominix_diffusion/src/ggml_extend.hpp:2152, 2166` — counts true-NaN only). If src[0] contains Inf values, the count still reads 0. 
RMSNorm of a row with Inf produces NaN deterministically: + ``` + mean(x²) = Inf → 1/sqrt(Inf) = 0 → Inf × 0 = NaN + ``` + This mechanism explains the 4096 NaN count *much better* than "per-row variance went to zero," and is consistent with the saga's prior F16-saturation work (§5.5.30, §5.5.45–46) being incomplete for this exact codepath. + +2. **The shape arithmetic in this doc was sloppy.** With `[128, 24, 4096, 1]` and RMSNorm along `dim_head=128`, there are `24 × 4096 = 98,304` rows. So "one NaN per row" is not the literal description. 4096 NaN likely means *one head's worth* of NaN values — pointing at one specific head being the upstream Inf-producer. Re-do the per-head magnitude trace (cf. §5.5.34) at step 1 to find which head saturates. + +3. **It's not `aclnnRmsNorm`.** For this tensor size, ggml-cann uses a **manual decomposition** (`ggml/src/ggml-cann/aclnn_ops.cpp:1180-1212`): `Mul → Mean → add eps → Rsqrt → Mul`. So "Ascend kernel bug" is the wrong frame — the bug, if in this chain, could be in any of those five ops, with broadcast `Mean` and pool-workspace lifetime as the higher-priority suspects. + +### Additional missed coverage class + +`ggml/src/ggml-alloc.c:1007` — the gallocr realloc check looks at node/leaf counts and sizes, not flag semantics. OUTPUT/INPUT flag changes (the §5.5.59 fix mechanism) may not trigger a re-plan, meaning **step 1 reuses step 0's allocation plan even after flag annotations are updated.** This is a concrete coverage gap that could explain why §5.5.59 closed 256² but step-1 1024² still aliases — at 256² the smaller offsets happen to coincide harmlessly, at 1024² they don't. + +### Revised plan for §5.5.66 + +1. **Patch `SD_FIRST_NAN_TRACE` to additionally count Inf and max-finite per tensor**, then re-run the §5.5.65c trace. New decision branches: + - If `src[0]` has nonzero Inf → bug class is *upstream F16-saturation*. Walk back through the producer chain (likely a matmul output or residual still in F16 on this codepath) and find the missed widening. Don't dump RMSNorm input at all; dump the producer. + - If `src[0]` has 0 Inf and 0 NaN with sane max-finite → bug class is the manual RMSNorm decomposition or an allocator-replay issue. Then proceed to step 2. + +2. **Dump correctly.** Don't `ggml_set_output()` a RESHAPE view — the saga already learned views don't pin backing storage (`docs/qie_q2_phase4_smoke.md:3016, 3100`). Use `ggml_cont` to materialize, or pin the producer tensor. Capture finite-class stats *first*, full buffer second. + +3. **CPU reference comparison only after Inf is ruled out.** And if both CPU and NPU produce NaN, that does not prove "input is the real culprit" — it proves the dumped input contained Inf or was captured post-aliasing. Map output-NaN indices back to input-Inf/huge indices to localize. + +4. **Test the realloc-coverage hypothesis.** Force gallocr to fully re-plan between step 0 and step 1 (e.g. clear cached plans, or alter graph structure trivially between steps to invalidate the size+count check). If 1024² then runs clean to step 1, the bug is a reuse-of-stale-plan issue and the §5.5.66 fix is in `ggml-alloc.c:1007` — not in any kernel. + +5. **Cross-check at 768²** (still useful): the same trace should fire at the analogous node, since 768² has 2304 image tokens. If 768² works, the failure is 1024²-specific allocator pressure. + +### What NOT to do +- Do not assume the §5.5.65c reading is "authoritative" without first re-running with Inf counting enabled. 
+- Do not dump a RESHAPE view directly — use `ggml_cont` or pin the producer. +- Do not assume `aclnnRmsNorm` is in play — ggml-cann uses a manual decomposition for this shape. +- Do not assume "step 0 clean" implies kernel correctness — it implies "no NaN," not "no Inf, not pathological." +- Do not reopen broad gallocr instrumentation as a *first* move; first patch the trace, then test the §5.5.66 plan above. +- Do not assume node 14 CONCAT is the producer. §5.5.63 was wrong. + +### 2nd-opinion review + +Full critique: `/tmp/codex_qie_review.md` (independent reviewer, codex 0.125.0, ran 2026-04-29 11:13–11:17, read-only sandbox). It cites specific lines in `ggml_extend.hpp`, `aclnn_ops.cpp`, and `ggml-alloc.c` for each claim above. + +--- + +## Reference: also-relevant files + +- **Saga journal:** `ac03:/home/ma-user/work/OminiX-Ascend/docs/qie_q2_phase4_smoke.md` — full §5.5.x chain. +- **Native engine path** (separate from CLI path used in §5.5.65c): `tools/qwen_image_edit/native/image_diffusion_engine.{cpp,h}` — last cat-PNG output is GREEN at 256², used for harvesting per-block dumps. Note that the §5.5.65c trace ran the **CLI** path, which now matches engine path closely. +- **CLI binary source:** `tools/ominix_diffusion/` (this is what `build-w1/bin/ominix-diffusion-cli` is built from). +- **§5.5.65 instrumentation:** look in `tools/ominix_diffusion/src/ggml_extend.hpp` around lines 2080–2260 for `SD_FIRST_NAN_TRACE` callback + the `ggml_first_nan_check_callback` body. Search for `[FIRST-NAN]`. +- **Working trace logs on ac03:** + - `/home/ma-user/work/qie_5565c/run.log` ← authoritative §5.5.65c result + - `/home/ma-user/work/qie_5565b/run.log` ← ignore (pre-fix binary) + - `/home/ma-user/work/qie_5565/run.log` ← earlier attempt + +--- + +## Honest assessment + +The 60-round count looks bad in isolation. The reason it took this long is that the diagnostic infrastructure (`SD_NAN_CHECK` end-of-graph reads) was lying — every "found the bug" claim was based on contaminated evidence. §5.5.65's per-node post-compute callback is the first reading from this saga that can actually be trusted. + +With the trustable trace in hand, the picture is much narrower than the saga's length suggests: +- Step 0 across 10,213 ops is fully clean → kernels, weights, RoPE, attention, VAE, graph build, quantization are all correct. +- Step 1 fails at exactly one op → the remaining bug is either one specific kernel quirk on second invocation, or one specific aliasing case the gallocr coverage missed. + +Either is fixable in a focused session. The hard part — knowing where to look — is now solved. + +— Yue Chen, 2026-04-29 + +--- + +## Saga reckoning, 2026-04-29 evening (canonical state) + +After dispatching §5.5.13c (#203) to widen `img_mod.1` / `txt_mod.1` matmul outputs to BF16, I discovered the saga is in a much worse state than previous reports indicated. The issues: + +### What was actually verified at HEAD `a91bcfb` + +- **256² CLI in production cadence (no tracer): 16384/16384 NaN latent → all-black PNG (2301 bytes).** +- 1024² CLI: also NaN (§5.5.67 verification trace showed 4096 Inf in `src[0]` before RMSNorm at node 4654). +- Native engine in production cadence: not directly tested in this session. Inferred broken because: + - §5.5.13 (#190) said native produces all-NaN without the c_skip+c_out patch. + - That patch was applied and **then reverted** in #190. + - HEAD (a91bcfb) does not have it. 
+ - The "polka-dot at 256²×20" eye-check from #190 was VAE-decode-only of a saved latent from the patched run, not a fresh native forward. + +### Why the previous "GREEN" findings were wrong + +`SD_FIRST_NAN_TRACE=1` serializes per-node compute (one-node subgraphs with explicit sync between each). This timing change masks the production NaN bug. + +- §5.5.59 fix verification at 256² × n=3 → tracer-mode "GREEN". +- §5.5.65c trace at 1024² × 3 → identified "first-NaN at node 4654 RMSNorm" as a tracer-mode artifact. +- §5.5.66 "smoking gun" (Inf-aware tracer flips step-1 RED→GREEN with no compute change) was actually evidence that **the bug is timing-dependent between dispatches**, NOT evidence of an allocator-replay class bug. The codex critique flagged "Inf × 0 = NaN through RMSNorm" as a *more plausible* mechanism in writing before this finding — that critique was itself missed. +- §5.5.67 gallocr re-plan fix (`f695ab4`) was based on the wrong diagnosis. Verification ran in tracer mode and looked GREEN, but production cadence is still NaN. +- Audit #188 ("Branch B switch native to production, 1.17 s/step NaN=0 finite latent") almost certainly also relied on tracer-mode evidence. + +### What §5.5.13c (#203) showed + +- Agent edited `tools/qwen_image_edit/native/image_diffusion_engine.cpp:3681-3734` to widen `img_mod.1` / `txt_mod.1` matmul output to BF16, plus an in-place BF16→F16 cast back before downstream consumers (mirroring §5.5.45/46 QKV pattern). +- BUT ran `build-w1/bin/ominix-diffusion-cli` for verification. That binary is built from `tools/ominix_diffusion/` (CLI path) — **it does not link the native engine code**. +- So the patch was dead code in the test. +- CLI run produced all-NaN regardless of `QIE_MOD_BF16=1` or unset (confirmed by env-unset re-run). +- The patch is still present on disk (uncommitted) but unverified. + +### What's still valid + +- **CUDA work (different platform):** + - #187 CUDA QIE norm-modulate fusion: **+5.85% wallclock at 1024²×20**, bit-identical PNG. + - #182 CUDA TTS env-flip: **+7% warm 2nd+ requests** (zero code change). +- **Silent race-condition fixes caught by perf agents:** + - #186: `cudaMemcpyAsync(pos_dev, pos_host_pin)` race in P1's pattern (would have failed silently in production multi-step decode). + - #193: `rep_penalty_kernel` race when recent-window has duplicate tokens (affected both predictor and existing P2 talker path). +- **Bug isolation (#201, #202):** + - Per-op F32 references show all weight-free ops match within F16 precision. + - Q4_0 dequant + matmul oracle confirms `img_mod.1` block 0 (Q5_K) and block 1 (Q4_0) and `ff.net.2` block 1 (Q4_1) match F32 oracle at cos=1.000000. The 491 / 6e5 magnitudes are intrinsic to the trained quantized weights. + - The fix-shape inference (BF16 widening) was plausible. The mechanism in #203 was wrong (codebase mismatch). + +### Action items for the NEXT session — do not dispatch without fresh codex review + +1. **Identify which binary actually exercises the native engine** in production. The CLI doesn't. Likely candidates: `test_image_diffusion_cuda_e2e`, `test_image_diffusion_cuda_init`, etc. Map binary → source. This is prerequisite to ANY native engine fix verification. +2. **Build a standalone production-cadence reproducer** for the native engine: invoke `denoise_full()` end-to-end at 256²×20, dump the latent, decode via VAE, check NaN count + visual. +3. **Re-verify EVERYTHING in production cadence** (no `SD_FIRST_NAN_TRACE`). 
The §5.5.59 / §5.5.66 / §5.5.67 GREEN claims are not trustworthy until re-verified at production cadence. +4. **Re-derive audit #188 verdict** with production-cadence evidence. +5. **Strategic question to escalate**: is shipping at 1024² achievable in this saga's timeline, or is descoping (256² only, Diffusers Python wrapper, etc.) the right move? + +### Validation gates raised + +Future audit-style strategic decisions require: +- 1024² PNG eye-check **in production cadence (no tracer)** +- VAE decode of a freshly-generated latent (not a saved one) +- Per-step latent stats reported (mean, std, range, NaN/Inf count) +- Artifact paths cited +- One independent codex reviewer pass +- Explicit "production cadence verified" claim before strategic claims + +### Working directories on ac03 + +- Repo: `/home/ma-user/work/OminiX-Ascend` (user `ma-user`). HEAD `a91bcfb`. 56 commits ahead of unpushed `origin/main`. +- §5.5.13c uncommitted patch: `M tools/qwen_image_edit/native/image_diffusion_engine.cpp` (works in spirit, wrong codebase for verification). +- Logs of broken production runs: + - `/tmp/qie_5513c_off.log` — 256²×20 with env unset, full NaN. +- F32 oracle artifacts: + - `/home/ma-user/work/qie_f32_refs/` — per-op references from #201. + - `/tmp/qie_q4_oracle_run.log` — Q4_0 dequant + matmul oracle from #202. +- Bundle backup: `/Users/yuechen/home/qie-saga-5.5.65-snapshot.bundle` on Mac. diff --git a/SESSION_HANDOFF_2026-04-30.md b/SESSION_HANDOFF_2026-04-30.md new file mode 100644 index 0000000..1a5a0b1 --- /dev/null +++ b/SESSION_HANDOFF_2026-04-30.md @@ -0,0 +1,290 @@ +# Session Handoff — 2026-04-30 + +**Status at end of session:** Three CUDA wallclock wins shipped, three silent bugs fixed, Ascend QIE bug class narrowed to one substep (visual gate not yet closed). + +**For the next agent picking this up:** read this file end-to-end before dispatching anything. The session covers two repos (OminiX-Ascend on Huawei Ascend 910B, OminiX-CUDA on NVIDIA GB10 Blackwell) and five hosts (ac01/ac02/ac03 + zgx-5b44 + zgx-3675). Everything is committed somewhere and reachable; the map is below. + +--- + +## What shipped today (by stream) + +### CUDA — three real wallclock wins + +All committed to `github.com/OminiX-ai/OminiX-CUDA.git` (origin/main, push `aecbecf6a..3fa53afd1`): + +| # | Stream | Win | Default state | +|---|---|---|---| +| #182 | CUDA TTS — `TALKER_USE_CUDA_GRAPHS=1` in warm daemon launcher | **+7%** warm 2nd+ requests | shipped (env baked into `scripts/demos/run_tts.sh`) | +| #187 | CUDA QIE — norm-modulate fusion + allocator unblock | **+5.85%** wallclock at 1024²×20 (baseline 309.5s → 291.4s) | `OMNX_CUDA_QIE_FUSED_NORM=1` default-on | +| #199 | CUDA TTS — parallel top-K sampler + on-device sampling chain | **−11.2%** stochastic warm (6411ms → 5691ms) | `OMNX_TTS_PARALLEL_TOPK=1` + `OMNX_TTS_PREDICTOR_ONDEV=1` + `OMNX_TTS_ONDEV_SAMPLE=1` all default-on | + +Net CUDA improvement: ~13% on QIE, ~18% on TTS (combining #182 + #199 where applicable). + +### Silent bug fixes (production safety, caught by perf agents) + +- **#186 (P1 path)** — `cudaMemcpyAsync(pos_dev, pos_host_pin)` inside chain graph races on multi-step replay. Subsequent host writes can clobber `pos_host_pin` before prior step's H2D actually executes. Fix: device-side `*p+=1` increment kernel as first node of chain graph. Shipped on zgx-3675 main. +- **#193 (P2 path)** — `rep_penalty_kernel` had read-modify-write race when recent-token window contains duplicates. 
Two threads both read pre-update, both divide, both write back. Affected both predictor AND existing P2 talker path (would have shipped silently). Fix: serialize the kernel to single thread (n_recent ≤ 64, negligible cost). Shipped on zgx-3675 main as part of `91c696e1`. +- **#196 (#208 followup)** — Existing `OMINIX_CFG_BATCHED=1` on CUDA produces silently-wrong attention when `s_cond ≠ s_uncond` (any non-empty negative prompt with different token count). The CUDA FA kernel reads mask only at `nb33*(sequence%ne33)`; with q folded to 3-D, `sequence=0` always. Batch-1's (uncond) mask is built but never read. Hidden because most tests use prompts with equal cond/uncond text lengths AND `cfg_build_attention_mask` returns nullptr early when `s_cond==s_uncond`. **Fix is in #208 (queued, not yet dispatched)** — see below. + +### Ascend QIE saga — bug class narrowed dramatically + +The 70+ round saga collapsed from "anywhere in 60-block DiT" down to "block-0 attention substep" today. Three hypothesis classes empirically closed in this session: + +| # | Hypothesis | Closure | +|---|---|---| +| #205 (wave-7) | F16 matmul saturation in residual chain | **CLOSED** — no F16 producer hits saturation; widening top sites moves <0.1% of latent stats | +| Schedule audit | sigma / c_skip / c_out / Euler chain | **CLOSED** — bit-correct vs Diffusers reference (cos=1.0000 stub-equivalent); even canonical c_skip+c_out fix doesn't close visual gate | +| #206 + ac01 disambiguation | Distributed-compounding cumulative cast noise across blocks | **CLOSED** — divergence is **5× at block 0** (cos=0.05, mag_ratio=0.49 vs CLI), not cumulative. Clamp band-aid (`QIE_RESID_CLAMP=60000`) confirmed irrelevant via source-level proof. | + +Real per-op verification done in #201/#202: weight-free ops match within F16 precision, Q4_0/Q5_K dequant + matmul cos=1.000000 vs F32 oracle (so the 491 / 6e5 chunk magnitudes are intrinsic to trained weights, not dispatch defects). + +The remaining work is to identify which substep in block-0's attention forward (one of: `02_img_mod_out` chunk → `06_LN1` → `07_modulated_LN` → `08_Q/K/V` → `09_RMSnorm` → `10_RoPE` → `11_attn_out` → `12_to_out_0`) is the divergence source. The dump infrastructure for this is already in place on ac03. See "Pickup map" below. + +### CFG-batching pad fix (correctness) + +#99 — `tools/ominix_diffusion/src/conditioner.hpp` now properly pads cond/uncond `c_crossattn` to common max_len and propagates per-row valid-lengths into `cfg_build_attention_mask`. Build clean on ac01. Runtime test pending (QIE weights not on ac01). Commit `61c8e2f` on ac01 branch `tmp_ac03_main`. + +--- + +## What's still open + +### #208 — CUDA QIE keep_n_outer fold fix (queued, not dispatched) + +This is the architectural fix that closes #196's silent CFG mask bug and unblocks 1024² CFG batching. + +**Mechanism**: add `keep_n_outer` param to `apply_rope` + `Rope::attention` in `tools/ominix_diffusion/src/rope.hpp`. When true, skip `reshape_3d` collapse at `rope.hpp:640`. Q becomes `[d_head, L, n_head, N=2]` 4-D. Mask becomes `[L_q, L_k, 1, N]` (broadcast across heads). Touches `qwen_image.hpp` callers only. ~150 LOC + per-model regression on FLUX/Z-Image/Wan/MMDiT. + +**Effect**: +1. Closes silent correctness bug today (cfg_batched + variable seq_len). +2. Mask memory drops 24× (48-fold head replication was a CANN broadcast workaround, unused on CUDA). +3. At 1024², mask drops from 12.6 GiB F32 / 6.3 GiB F16 to 565 MiB F32 / 283 MiB F16. +4. 
Unblocks `OMINIX_CFG_BATCHED=1` at 1024², projected 30-40% wallclock reduction at cfg-scale > 1.

**Effort**: 2 days. Day 1: fix + 256² regression test. Day 2: 1024² perf measure + default flip.

**Why queued, not dispatched**: bigger commit than the perf-class agents I'd been running. Multi-day, multi-model regression. Held for explicit user sign-off.

### Ascend QIE saga — block-0 substep bisect

Critical-path next step. After three hypothesis-class closures, the bug is in one substep of block 0's attention forward at REAL inputs. The §5.5.16/17/18/19 oracles all passed at SYNTHETIC inputs but §5.5.34/35 found drift enters at REAL. Codex review specifically asked for the substep bisect to start at `02_img_mod_out` (the AdaLN matmul output that feeds `gate1`/`scale_msa`/`shift_msa`/etc. chunks) — NOT just at `08_Q/K/V` — because per §5.5.30 history the magnitude drop tends to come from modulation, not Q projection.

Existing dumps (engine + CLI matched-input at 256²×2 step):
- `ac03:/tmp/qie_5513f_eng_blocks/block00/*.f32` — engine-side (~3.6 GB across all blocks; block 0 is small)
- `ac03:/tmp/qie_5513f_cli_blocks/block00/qie_cli_blk00_*.f32.bin` — CLI-side
- These are sufficient for tags `13_*_resid1`, `21_*_resid2`, etc. but **upstream tags (02_img_mod_out, modulation chunks, 06_LN1, 07_modulated_LN) are NOT in either dump set yet** — they need new `dump_tensor_f32` hooks added before bisect.

A previous dispatch (#207, twice) was killed in early phases. The codex-corrected scope dispatch for the substep bisect is in `/Users/yuechen/home/OminiX-API/qie_block0_substep_bisect.md` (not yet written by an agent — would be the agent's output if dispatched).

### Other pending tasks

- **#44** QIE-Q2.5 CacheDIT calibration (ac01) — gated on Ascend saga close.
- **Runtime verification of #99** — needs QIE weights on a host other than ac03 (ac03 has them; ac01 doesn't). Easiest: run on ac03 directly.

---

## Real shipped artifacts

### Repos and their states

| Repo | GitHub | Latest pushed | Mac local | Notes |
|---|---|---|---|---|
| **OminiX-API** | `github.com/OminiX-ai/OminiX-API.git` (remote `ominix`) | will be after this commit | `/Users/yuechen/home/OminiX-API` | This handoff doc lives here |
| **OminiX-CUDA** | `github.com/OminiX-ai/OminiX-CUDA.git` (remote `origin`) | **`3fa53afd1`** (today's push) | `/Users/yuechen/home/ominix-cuda` | Both CUDA streams merged into main |
| **OminiX-Ascend** | `github.com/OminiX-ai/OminiX-Ascend.git` (remote `origin`) | `7306b7e5` (older — saga work NOT pushed) | `/Users/yuechen/home/OminiX-Ascend` | Saga commits live on ac03 main, 60+ ahead of origin |

### Ascend saga bundle (the key handoff artifact for next agent)

**`/Users/yuechen/home/ac03_saga_2026-04-30.bundle`** (147 MB) — git bundle of ac03 main at HEAD `3daae48`. Contains the entire session's saga commit chain on top of yesterday's `a078106`.
Today's commits in this bundle (newest first): + +- `3daae48` §5.5.13d — widen `img_ff_up` / `txt_ff_up` matmul output to BF16 (directional null, escalation triggered) +- `5b1e032` §5.5.13c — BF16 widening on `img_mod.1` / `txt_mod.1` matmul output +- `a91bcfb` §5.5.67 — `SD_FIRST_NAN_TRACE_LEGACY` env to revert §5.5.66 Inf scan +- `f695ab4` §5.5.67 — gallocr re-plan on flag changes (closes 1024² step-1 NaN — partial) +- `12816bc` §5.5.66 — doc: Inf-aware tracer flips step 1 RED→GREEN, Case B confirmed +- `e2f3918` §5.5.66 — extend `SD_FIRST_NAN_TRACE` to count Inf + max-finite per tensor + +### Conditioner pad-fix bundle + +**`/Users/yuechen/home/ac01_99_pad_fix.bundle`** (146 MB) — branch `tmp_ac03_main` at HEAD `61c8e2f` (parent `3daae48`). Contains the #99 conditioner.hpp pad fix. + +### To resume on a fresh machine + +```bash +# Clone canonical OminiX-Ascend (origin) +git clone https://github.com/OminiX-ai/OminiX-Ascend.git +cd OminiX-Ascend + +# Pull saga state from bundle +git fetch /path/to/ac03_saga_2026-04-30.bundle main:saga-2026-04-30 +git checkout saga-2026-04-30 + +# (Optional) Pull conditioner pad fix +git fetch /path/to/ac01_99_pad_fix.bundle tmp_ac03_main:99-pad-fix +``` + +For CUDA work just `git clone https://github.com/OminiX-ai/OminiX-CUDA.git` — `main` has all of today's wins. + +--- + +## Hosts and SSH info + +**For another agent on a fresh machine without the existing `~/.ssh/config` aliases**, here is everything needed to connect to each box. Each line gives a complete SSH command you can paste directly. + +### Ascend cluster (Huawei ModelArts notebooks) + +All three Ascend boxes share the same hostname (`dev-modelarts.cn-southwest-2.huaweicloud.com`), user (`ma-user`), and key (`/Users/yuechen/home/tensordock/KeyPair-4fbd-yue.pem`). They differ only by port. The key is a `.pem` file shared across all three boxes. + +| Host | Hostname | Port | User | Key | Role | +|---|---|---|---|---|---| +| **ac01** | `dev-modelarts.cn-southwest-2.huaweicloud.com` | **31984** | `ma-user` | `/Users/yuechen/home/tensordock/KeyPair-4fbd-yue.pem` | Idle. Has the OminiX-Ascend repo at `/home/ma-user/work/OminiX-Ascend-w1`. #99 pad fix lives here on branch `tmp_ac03_main` (commit `61c8e2f`). | +| **ac02** | `dev-modelarts.cn-southwest-2.huaweicloud.com` | **31210** | `ma-user` | (same key as ac01) | Idle. No repo at standard path — bring via bundle if needed. | +| **ac03** | `dev-modelarts.cn-southwest-2.huaweicloud.com` | **30412** | `ma-user` | (same key as ac01) | **Saga + dumps live here.** Repo at `/home/ma-user/work/OminiX-Ascend`. HEAD `3daae48`. Working tree clean. | + +Direct connect commands: + +```bash +ssh -i /Users/yuechen/home/tensordock/KeyPair-4fbd-yue.pem -p 31984 ma-user@dev-modelarts.cn-southwest-2.huaweicloud.com # ac01 +ssh -i /Users/yuechen/home/tensordock/KeyPair-4fbd-yue.pem -p 31210 ma-user@dev-modelarts.cn-southwest-2.huaweicloud.com # ac02 +ssh -i /Users/yuechen/home/tensordock/KeyPair-4fbd-yue.pem -p 30412 ma-user@dev-modelarts.cn-southwest-2.huaweicloud.com # ac03 +``` + +If the next agent doesn't have the `.pem` key locally: it's a Huawei Cloud KeyPair. The user (Yue Chen) can re-issue or share it. The keys are NOT in the repo for security. 
`~/.ssh/config` aliases (for Mac local convenience):

```
Host ac01
  HostName dev-modelarts.cn-southwest-2.huaweicloud.com
  User ma-user
  Port 31984
  IdentityFile /Users/yuechen/home/tensordock/KeyPair-4fbd-yue.pem
  StrictHostKeyChecking accept-new

Host ac02
  HostName dev-modelarts.cn-southwest-2.huaweicloud.com
  User ma-user
  Port 31210
  IdentityFile /Users/yuechen/home/tensordock/KeyPair-4fbd-yue.pem
  StrictHostKeyChecking accept-new

Host ac03
  HostName dev-modelarts.cn-southwest-2.huaweicloud.com
  User ma-user
  Port 30412
  IdentityFile /Users/yuechen/home/tensordock/KeyPair-4fbd-yue.pem
  StrictHostKeyChecking accept-new
```

### NVIDIA GB10 Blackwell cluster

Both CUDA boxes share the same hostname (`163.192.33.32`), user (`user1`), and key (`~/.ssh/id_ed25519`). They differ only by port.

| Host | Hostname | Port | User | Key | Role |
|---|---|---|---|---|---|
| **zgx-5b44** | `163.192.33.32` | **6022** | `user1` | `~/.ssh/id_ed25519` | CUDA QIE box. Repo at `/home/user1/ominix-cuda`. HEAD `86ce667a`. Pushed to OminiX-CUDA main via merge commit `da979b819`. |
| **zgx-3675** | `163.192.33.32` | **6222** | `user1` | `~/.ssh/id_ed25519` | CUDA TTS box. Repo at `/home/user1/ominix-cuda`. HEAD `49e4db56`. Pushed to OminiX-CUDA main via merge commit `3fa53afd1`. |

Direct connect commands:

```bash
ssh -i ~/.ssh/id_ed25519 -p 6022 user1@163.192.33.32  # zgx-5b44 (CUDA QIE)
ssh -i ~/.ssh/id_ed25519 -p 6222 user1@163.192.33.32  # zgx-3675 (CUDA TTS)
```

`~/.ssh/config` aliases:

```
Host zgx-5b44
  HostName 163.192.33.32
  User user1
  Port 6022
  IdentityFile ~/.ssh/id_ed25519

Host zgx-3675
  HostName 163.192.33.32
  User user1
  Port 6222
  IdentityFile ~/.ssh/id_ed25519
```

The `id_ed25519` key is whatever ed25519 key the user uses for these boxes — typically already present on a developer machine via `ssh-keygen`. If absent, ask the user (Yue Chen) for the key. NOT in the repo.

### Verify access

```bash
# Smoke test all five boxes:
for h in ac01 ac02 ac03 zgx-5b44 zgx-3675; do
  echo "=== $h ==="
  ssh -o ConnectTimeout=5 -o BatchMode=yes $h 'hostname; uptime' 2>&1 | head -2
done
```

---

## Critical artifacts on ac03 (do not delete)

These are the dump files that make the substep bisect possible. They are NOT in any git repo — `/tmp` files only.

| Path | What | Size |
|---|---|---|
| `/tmp/qie_5513f_eng_blocks/block??/*.f32` | Engine-side per-block per-substep dumps (256²×2, matched inputs) | ~3.6 GB |
| `/tmp/qie_5513f_cli_blocks/block??/qie_cli_blk??_*.f32.bin` | CLI-side same | ~571 MB |
| `/tmp/qie_q45_inputs/` | Matched inputs (txt_cond, init_latent, sigmas) for both engine + CLI runs | ~3 MB |
| `/home/ma-user/work/qie_5513f/` | Run logs from #206 | small |
| `/home/ma-user/work/qie_f32_refs/` | Per-op F32 references from #201 (the per-op bisect that proved weight-free ops are clean) | small |

If ac03 is reset / rebooted / reimaged, these are GONE. The bundle backups don't include them. They take ~30 min of compute to regenerate (256² × 2 step on engine + CLI with matched inputs and full dump hooks enabled).

---

## Key memory files (Mac-local, persistent across sessions)

These are the auto-memory files Claude reads across sessions.
They have the canonical project state: + +- `/Users/yuechen/.claude/projects/-Users-yuechen-home-OminiX-API/memory/MEMORY.md` — top-level index +- `/Users/yuechen/.claude/projects/.../memory/project_qie_ascend_1024_first_nan.md` — QIE Ascend saga state (corrected this morning) +- `/Users/yuechen/.claude/projects/.../memory/feedback_codex_for_review_and_exploration.md` — workflow rule: codex critique on every exploration / strategic claim before promoting +- `/Users/yuechen/.claude/projects/.../memory/feedback_no_coauthor.md` — never add Co-Authored-By Claude to commits + +--- + +## Lessons learned (for the next agent) + +These came up repeatedly today; future agents should bake them in: + +1. **Codex critique on every exploration / dispatch plan, in parallel.** Five times today codex caught a real issue I would have walked into. Cost: ~5 min wallclock per critique. Value: prevented multi-hour misdirected dispatches. Lock this in as default practice. The pattern is: dispatch the agent + fire `codex exec -s read-only ... out 2> err` in parallel. +2. **Trust measurement, not projection.** Five exploration agents this session over-projected wins by 5-10× (P0 graph capture: projected 15-25%, got 1%; P0 capture-once: projected −1.5-2.0s, got +3% slower; P1 device-pos: projected −30%, got 0%; P1 fused norm+gate: projected 15-20%, got 1% pre-fix; #195 FA-3: projected 6-8%, agent declined cleanly). Each subsequent dispatch had to rebaseline from traces. +3. **PNG eye-check is not the only gate.** Codex critique on §5.5.13c specifically flagged this: latent stats GREEN does NOT mean PNG GREEN. Visual polka-dot can come from RoPE, patchify/unpatchify, VAE, or batching even when latent range passes. Both gates required. +4. **Source-level proof beats compute** when applicable. The ac01 clamp disambiguation was answered in 12 min of source reading instead of 30 min of remote build/run. +5. **Synthetic-input oracles can be misleading.** §5.5.16/17/18/19 all PASSED at synthetic inputs; §5.5.34/35 found "drift enters at REAL inputs". The block-0 substep bisect on real inputs is what matters now. +6. **Tracer-mode results are not production-cadence results.** `SD_FIRST_NAN_TRACE=1` serializes per-node compute and adds CPU work between dispatches. The audit #188 verdict ("native is closer to production than CLI, switch") was based on tracer-mode evidence and was over-confident. Always re-verify in production cadence (no tracer) before strategic decisions. + +--- + +## How to resume the saga (concrete next dispatch) + +The cheapest path to closing the 256² visual gate: + +1. **scp the existing dumps from ac03 to wherever the bisect runs** (Mac local works — pure numpy comparison, no GPU needed). +2. **Add `dump_tensor_f32` hooks** in the engine for upstream tags (`02_img_mod_out`, modulation chunks, `06_LN1`, `07_modulated_LN`) — temporary, env-gated, revert after capture. Rebuild + re-run engine on ac03 (~3 min). +3. **Add matching CLI dumps** in `tools/ominix_diffusion/src/qwen_image.hpp` for the same tags. Rebuild CLI + re-run at 256²×2 (~6 min). +4. **Run the per-substep cos / std_ratio / amax_ratio comparison** numpy script. Find first substep where cos < 0.95. +5. **Apply targeted fix** at the divergent substep. Per saga history, most likely candidates: gate1 modulation chunk-order/transpose at REAL inputs; attn_out projection BF16 widening at REAL inputs; img_mod.1 chunk indexing at real seq_len. + +Total estimated cycle: 4-6 hours of agent time. 
Closes the 256² visual gate if the bug is one of the above. + +--- + +## Final state of the swarm + +All boxes idle, no agents in flight, all wins committed where they landed. + +``` +ac01: HEAD `61c8e2f` on tmp_ac03_main (#99 fix), idle, 0 ominix processes +ac02: no repo at standard path, idle +ac03: HEAD `3daae48` on main (saga state), idle, 0 ominix processes +zgx-5b44: HEAD `86ce667a` on main (CUDA QIE state), pushed via OminiX-CUDA `3fa53afd1`, idle +zgx-3675: HEAD `49e4db56` on main (CUDA TTS state), pushed via OminiX-CUDA `3fa53afd1`, idle +``` + +— Yue Chen (with Claude Opus 4.7), 2026-04-30 diff --git a/ascend_native_engine_audit.md b/ascend_native_engine_audit.md new file mode 100644 index 0000000..dd44519 --- /dev/null +++ b/ascend_native_engine_audit.md @@ -0,0 +1,132 @@ +# Ascend Native QIE-Edit Engine — Readiness Audit + +**Auditor:** Claude (Opus 4.7, 1M context) +**Date:** 2026-04-29 +**Scope:** Read-only audit of `tools/qwen_image_edit/native/image_diffusion_engine.{h,cpp}` to decide if the native engine is production-ready, partial, or dead code. +**Sources read:** +- `/Users/yuechen/home/OminiX-Ascend/tools/qwen_image_edit/native/image_diffusion_engine.h` (870 LOC) +- `/Users/yuechen/home/OminiX-Ascend/tools/qwen_image_edit/native/image_diffusion_engine.cpp` (5,647 LOC) +- `/Users/yuechen/home/OminiX-Ascend/docs/qie_q2_phase4_smoke.md` (3,232 LOC saga) +- ac03 git log + `/tmp/qie_q45_step4*` artefact inventory + ac03 build dir + +--- + +## Verdict at a glance + +**Branch B — engine is REAL and FUNCTIONAL but NOT visually correct yet.** + +- The `init_from_gguf` → 60-block `forward_block_` → `denoise_full` (Phase 4.5 Step 4) → host-side patchify/unpatchify → flow-Euler chain is fully implemented and runs end-to-end on ac03 against the real `Qwen-Image-Edit-2509-Q4_0.gguf`. +- A 20-step run at 32×32 latent (256² eye-check resolution) produces a finite latent with NaN=0, std=4.86, mean=-2.46 — **GREEN on the numerical gate** at ~1.17 s/step (~24 s wall). +- The eye-check PNG is finite + structured but **shows a coherent tile pattern, not a recognizable cat** — the engine's attention path or unpatchify-host code still has a residual numerical/layout bug that the §5.5.x bisect has been narrowing. +- The **public `denoise()` (header line 362, body line 1264)** is still a Phase-2 stub returning false. Production runs go through `denoise_full()` (header line 510, body line 4780) instead — there are TWO entry points and only the latter is wired. + +The "Phase-1 skeleton — bodies stubbed" comment at header lines 6–11 is a **stale TLDR from the original scaffold commit** that was never updated. Codex was right to flag it. The actual code has been through Phases 2, 3, 4, 4.1, 4.2, 4.3, 4.4 (a-d), 4.5 (Steps 1-4), and §5.5 (sub-steps 1-13 of Step 4 hardening). 
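For reference, the "flow-Euler chain" named in the verdict above is the plain flow-matching Euler update (the same step Diffusers' `FlowMatchEulerDiscreteScheduler` performs). The sketch below is a generic illustration under that assumption, not the engine's `scheduler_step_test` body; `flow_euler_step` is an invented name.

```cpp
// Generic flow-matching Euler step (illustrative). The DiT predicts a
// velocity v at the current sigma; one Euler step moves the latent along it.
#include <cstddef>
#include <vector>

void flow_euler_step(std::vector<float>& latent,          // x_t, updated in place
                     const std::vector<float>& velocity,  // model output v
                     float sigma_cur, float sigma_next) {
    const float dt = sigma_next - sigma_cur;  // negative: sigma decreases toward 0
    for (std::size_t i = 0; i < latent.size(); ++i) {
        latent[i] += dt * velocity[i];        // x_next = x + (sigma_next - sigma_cur) * v
    }
}
```

The c_skip/c_out parameterization the saga keeps referring to is the same update factored differently: `denoised = c_skip*x + c_out*v` with `c_skip = 1` and `c_out = -sigma_cur`, followed by `x_next = denoised + sigma_next*v`.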
+ +--- + +## Phase status table + +| Phase | Header marker | Status | Citation | +|---|---|---|---| +| 1 | "Phase-1 skeleton" intro | **STALE** — describes original 2024-Q3 scaffold; never refreshed | h:6–11 | +| 2 | `init_from_gguf` weight upload | **COMPLETE** (Q2.1 Q4-resident path) | h:339, cpp:721–1235 (514 LOC) | +| 2.1 | Q4-resident packed-INT4 + F16 scale tile | **COMPLETE** | cpp file header :2 | +| 3 | `forward_block_` real compute | **COMPLETE** (~15 aclnn ops/block) | h:711, cpp:3037–3837 (800 LOC) | +| 4 | denoise loop wiring (canonical `denoise()`) | **STUB** — returns false with log | h:362, cpp:1264–1285 | +| 4.1 | On-device RoPE | **COMPLETE** (host fallback preserved via `QIE_ROPE_HOST=1`) | h:834, cpp:2620–3036 | +| 4.2 | `forward_all_blocks_test` 60-block stack | **COMPLETE** | h:410, cpp:4182–4267 | +| 4.3 | Euler scheduler + `denoise_loop_test` | **COMPLETE** (cos_sim=1.0 vs CPU ref on synthetic) | h:421/450, cpp:3863–4155 | +| 4.4 | Real-Q4-GGUF + F32 residual stream | **COMPLETE** (Q2.4.4d landed F32 residual fix) | h:614, cpp:2098–2216 | +| 4.5 | `denoise_full` production entry + `init_from_dump` | **COMPLETE** | h:460/510, cpp:4655–5645 (865 LOC) | +| 4.5.4c-d | BF16 plumbing for ff_down + attn-out residual contributors | **COMPLETE** under `QIE_ALL_BF16=1` | h:665, cpp:3037–3837 | +| §5.5 | NaN+visual hardening (Steps 4c→4l, 13 sub-iters) | **PARTIAL** — NaN gate GREEN, visual gate still RED | docs/qie_q2_phase4_smoke.md §5.5.5–§5.5.13 | + +--- + +## Method-by-method classification (.cpp) + +| Method | Body lines | LOC | Classification | Notes | +|---|---|---|---|---| +| `~ImageDiffusionEngine` | 631–719 | 89 | **COMPLETE** | Full device-buffer teardown | +| `init_from_gguf` | 721–1235 | 514 | **COMPLETE** | GGUF parse + Q4-resident upload + RoPE table build + scratch allocation | +| `forward` | 1241–1259 | 18 | **COMPLETE** | Loops `forward_block_` over `cfg_.num_layers` | +| `denoise` (canonical) | 1264–1285 | 22 | **STUB** | Logs "scaffold Phase 2", returns false. Production calls `denoise_full` instead. 
| +| `alloc_dev_` / `ensure_workspace_` | 1290–1335 | 46 | **COMPLETE** | | +| `build_rope_tables_` | 1336–1344 | 8 | **NOOP** (subsumed by init_from_gguf) | Documented; not a real bug | +| `build_time_emb_` | 1345–1374 | 30 | **COMPLETE** | Sinusoidal 256→f16 | +| `dispatch_matmul_` | 1460–1907 | 448 | **COMPLETE** | WQBMMv3 + aclnnMm fallback + BF16 output path | +| `modulate_` | 1909–1985 | 77 | **COMPLETE** | | +| `gated_residual_add_` | 1987–2043 | 57 | **COMPLETE** (F16 path) | +| `gated_residual_add_f32_` | 2098–2216 | 119 | **COMPLETE** (Phase 4.4c F32 accumulator) | +| `gated_residual_add_f32_bf16src_` | 2218–2342 | 125 | **COMPLETE** (Q2.4.5.4c BF16-src) | +| `cast_f32_to_f16_` | 2045–2096 | 52 | **COMPLETE** | +| `layer_norm_` | 2344–2402 | 59 | **COMPLETE** (affine-off) | +| `layer_norm_f32_to_f16_` | 2404–2482 | 79 | **COMPLETE** (Phase 4.4c) | +| `rms_norm_row_f32_to_f16_` | 2484–2551 | 68 | **COMPLETE** (Phase 4.5 Step 4) | +| `rms_norm_head_` | 2553–2618 | 66 | **COMPLETE** | +| `apply_rope_` (dispatcher) | 2620–2644 | 25 | **COMPLETE** | +| `apply_rope_host_` | 2646–2758 | 113 | **COMPLETE** (Phase-3 baseline, retained as fallback) | +| `apply_rope_on_device_` | 2760–2856 | 97 | **COMPLETE** (Phase 4.1) | +| `apply_rope_manual_` | 2858–3035 | 178 | **COMPLETE** (manual 4-Mul + 2-Add fallback) | +| `forward_block_` | 3037–3837 | **800** | **COMPLETE** | 15+ aclnn ops/block, F32 residual, BF16 leak sites, Q/K-RMSNorm + RoPE + FIA-or-MM-softmax-MM, FFN | +| `scheduler_step_` | 3839–3855 | 17 | **NOOP** (canonical body deferred — production uses `scheduler_step_test`) | +| `scheduler_step_test` | 3863–3913 | 51 | **COMPLETE** | +| `denoise_loop_test` | 3943–4155 | 213 | **COMPLETE** (cos_sim=1.0 GREEN on synthetic) | +| `forward_block_test` | 4161–4180 | 20 | **COMPLETE** | +| `forward_all_blocks_test` | 4182–4267 | 86 | **COMPLETE** | +| `mutable_layer_weights` | 4269–4272 | 4 | **COMPLETE** | +| `init_for_smoke` | 4274–4453 | 180 | **COMPLETE** | +| `init_from_dump` | 4655–4778 | 124 | **COMPLETE** | +| `denoise_full` | 4780–5645 | **865** | **COMPLETE** | Production 20-step loop: time_embed→img_in→60 blocks→norm_out→proj_out→host unpatchify→flow-Euler | + +**Stub count:** 2 of 31 (canonical `denoise` + canonical `scheduler_step_`). Both are documented as "use the test/full variants instead". Not blockers. + +--- + +## Smoke test evidence (read-only, from existing logs on ac03) + +**Did NOT run a new smoke** — ac03 is currently idle (only `tail -F` processes), but existing artefacts already prove the engine runs end-to-end: + +- Binary: `/home/ma-user/work/OminiX-Ascend/tools/probes/qie_q45_step4_full_denoise/test_qie_q45_step4_full_denoise` (1.8 MB, mtime 2026-04-26) +- Object: `build-w1/tools/qwen_image_edit/CMakeFiles/qwen_image_edit_native.dir/native/image_diffusion_engine.cpp.o` exists +- Run log `/tmp/qie_q45_step4f_precise.log` tail (most recent §5.5.6 run): + ``` + [smoke45s4] dispatching denoise_full (real Q4_0 weights, real host conditioning, 20-step flow Euler, cfg=1.00)... + [qie_native] denoise_full: W_lat=32 H_lat=32 C_lat=16 B=1 img_tokens=256+256=512 txt_seq=214 joint_dim=3584 n_steps=20 cfg=1.00 + ... 
+ [smoke45s4] denoise_full OK (24770.59 ms) + per-step min=1169.80 median=1173.55 max=1305.24 + out_latent: mean=-2.4559 std=4.8611 min/max=-13.3203/7.6250 NaN=0 inf=0 + VERDICT: GREEN (gate: NaN=0, inf=0, std>0.0010, |min|<20, |max|<20) + ``` +- Output PNG: `/tmp/qie_q45_step4d_allbf16_cat.png` (110 KB, 256×256 RGB) — coherent tile pattern, **not** a recognizable cat. §5.5.6 documents this is the open visual gate. +- Block-0 substep oracle (§5.5.13): `cos=1.000000` for all six QKV projections vs analytical Q5_K oracle — engine math validated bit-for-bit at the projection layer. + +--- + +## Why Branch B (and not A or C) + +**Not Branch A:** the canonical 256² eye-check still emits a tile-pattern PNG, not a cat. `cos≈0.48` at substep 11 (`attn_out`) and substep 24 (`resid2`) vs CPU reference. §5.5.6 calls this "host unpatchify or pe-table layout artifact" but the bisect (§5.5.7–§5.5.13) has been chasing it for ~13 sub-iterations without yet closing. Switching production CLI to native NOW would ship recognizable-image regressions. + +**Not Branch C:** the engine is far too real to be dead code — 5,647 LOC of compute, 514 LOC weight upload that completes in 101 s and is 20% of the unpushed-ahead commits, GREEN NaN gate on real GGUF, and active development under §5.5.x. The §5.5.13 oracle proof (cos=1.0 vs analytical Q5_K) is decisive — the engine produces correct numerics through QKV projection at least. + +**Why Branch B fits:** the engine works, is the obvious endgame, but is not visually correct today. The CLI / ggml-cann path (`§5.5.67` first-NaN trace) is still the only path that has produced a recognizable cat-edit at 1024². Both must coexist until the native engine clears its visual gate. + +--- + +## Recommendation + +**Single recommended next step:** finish the §5.5 visual-gate bisect in the native engine, NOT in the CLI ggml-cann path. The native engine is closer to production than ggml-cann (NaN gate GREEN, only attention/unpatchify drift remains), and the §5.5.x agents are already inside it. Specifically: + +1. **Park §5.5.67** (CLI ggml-cann first-NaN trace) — its work is duplicative once native lands. Keep ggml-cann working as today's fallback but stop investing. +2. **Resume from §5.5.13's "Open" item:** "end-to-end denoise → VAE → PNG to confirm the win condition." The §5.5.13 substep recovery (08 cos 0→0.80, projection oracle cos=1.0) is a strong signal the upstream bug is now in attention or unpatchify, not projection. Run the next bisect there. +3. **Once native cat-PNG is recognizable**, retire `denoise()` as a stub-with-redirect (or wire it to call `denoise_full`), retire CLI ggml-cann path from production-config defaults, and the §5.5.x saga closes. + +**Integration cost when native is GREEN visually:** the CLI side (`tools/ominix_diffusion/cli/main.cpp`) currently drives the ggml-cann graph; the native engine has its own driver `tools/qwen_image_edit/native/main_native.cpp` (152 LOC). Merging them behind a `--native` flag is ~50 LOC of arg parsing + `ImageDiffusionEngine::denoise_full` call site. Not a multi-day task once the visual gate clears. + +--- + +## Open contradictions resolved + +- Header lines 6–11 ("Phase-1 skeleton — bodies stubbed") describe the original scaffold and were not updated. The TRUE current state is at header lines 460–520 (Phase 4.5 Step 4 production `denoise_full`), 614–627 (Phase 4.4c F32 residual), and 665–680 (Q2.4.5.4c BF16 plumbing). 
+- Two `denoise` symbols exist: canonical `bool denoise(...)` is a stub (returns false) and **production code uses `bool denoise_full(...)` instead.** Anyone reading only the canonical entry point would conclude Branch C; anyone reading `denoise_full` and the §5.5 saga would conclude Branch A. The truth is between: Branch B. diff --git a/ascend_qie_native_extension_survey.md b/ascend_qie_native_extension_survey.md new file mode 100644 index 0000000..77e27cc --- /dev/null +++ b/ascend_qie_native_extension_survey.md @@ -0,0 +1,117 @@ +# Ascend QIE Native Engine — Abstraction-Break Survey + +**Status:** Read-only survey. Recommendations are post-§5.5.66 (saga close gates everything below). +**TL;DR:** The native QIE engine at `tools/qwen_image_edit/native/image_diffusion_engine.{cpp,h}` is **already fully native** — no ggml-cann involvement on the forward path. The 30× TTS arc was *getting to native*; QIE is already there. The remaining payoff comes from layering **the same six TTS post-native optimizations** that QIE has not yet adopted: NZ weights, AddRmsNorm fusion, FFNV3, GroupedMatmulV3, aclGraph capture, and TASK_QUEUE_ENABLE=2 pipelining. Realistic ceiling: **~2× over the current native baseline**, not another 30×. + +--- + +## 1. Coverage map + +The native engine compiles to 5,647 LOC of direct aclnn dispatch. **`forward_block_` (cpp:3037) does not call any ggml-cann path.** The 60-block DiT runs entirely on the engine's own scratch buffers: + +| Op family | Native dispatch (current) | Reference: TTS native equivalent | +|---|---|---| +| Q/K/V projections (6 per block) | 6× `aclnnWeightQuantBatchMatmulV3` (Q4) or 6× `aclnnMm` (F16) | TTS uses fused `aclnnGroupedMatmulV3` (3-into-1) | +| Modulation linears (img_mod, txt_mod) | 2× WQBMMv3 / aclnnMm | same pattern | +| LayerNorm1/2 + modulate | F32 `aclnnLayerNorm` → cast → `aclnnMul` + `aclnnAdd` | (no analogue — TTS uses RMSNorm only) | +| Q/K RMSNorm (4 per block) | 4× `aclnnRmsNorm` (F16 in/out, F32 gamma) | TTS uses `aclnnAddRmsNorm` (fused with prior residual add) | +| RoPE | on-device 4×Mul + 2×Add manual (cpp:2858) or `aclnnRotaryPositionEmbedding` | TTS uses `aclnnApplyRotaryPosEmbV2` (fused 2-call) | +| Joint attention | `aclnnFusedInferAttentionScoreV2` BSND seq=4352 | same op | +| Attn output projections | 2× WQBMMv3 / aclnnMm | same | +| FFN per stream (up + GELU + down) | 3 dispatches (`Mm` → `GeluV2` → `Mm`) | TTS uses fused `aclnnFFNV3` (1 call) | +| Gated residual add (F32 accumulator) | `aclnnMul` + `aclnnCast` + `aclnnInplaceAdd` | TTS uses `aclnnInplaceAddRmsNorm` (fused into next-block RMSNorm) | + +**Memory:** scratch buffers are pre-sized at init for the worst-case `max_img_seq + max_txt_seq` (4352) × hidden (3072). Workspace grows lazily. RoPE tables are pre-computed (Q0.5.3). Q4-resident weights (~5.1 GiB) + F16 scales (~0.6 GiB) keep ~18 GiB resident, well under 32 GB HBM. The memory pattern already matches TTS's "workspace tensor reuse" guidance. + +**Streams:** single `compute_stream_ = primary_stream_` (cpp:733-734). No dual-stream overlap. + +**`ggml-cann` involvement on the hot path:** **zero.** GGUF parsing at `init_from_gguf` uses ggml weight-tensor I/O; that is one-time, not per-step. + +--- + +## 2. 
What's NOT yet native (vs TTS) + +Confirmed by symbol-name grep (`aclnnAddRmsNorm`, `aclnnFFNV3`, `aclnnGroupedMatmul`, `aclnnTransMatmulWeight`, `aclgraph`, `TASK_QUEUE_ENABLE`): + +| TTS optimization | TTS impact | QIE state | Source | +|---|---|---|---| +| FRACTAL_NZ weight pre-conversion | +15% (29.7→25.9 was ND→NZ in the writeup, table at qwen_tts_optimization_writeup.md:133-134) | **Not adopted.** No `aclnnTransMatmulWeight` calls. Weights stay ND. | docs/qwen_tts_optimization_writeup.md:127-145 | +| `aclnnFFNV3` (fused up/act/down) | shipped | **Not adopted.** QIE FFN is 3 separate dispatches per stream (cpp:3777-3811). | tools/qwen_tts/cp_cann_engine.cpp:732-921 | +| `aclnnGroupedMatmulV3` for Q/K/V | shipped (A4c Phase 1) | **Not adopted.** QIE does 3 separate matmuls per stream × 2 streams = 6 per block (cpp:3410-3428). | tools/qwen_tts/talker_cann_engine.cpp:660-769 | +| `aclnnInplaceAddRmsNorm` (fuses prior residual add into next RMSNorm) | shipped (Phase A.1) | **Not directly applicable** — QIE uses LayerNorm not RMSNorm at residual sites. But `aclnnAddLayerNorm` (if it exists in CANN 8.5) is the analogue. | tools/qwen_tts/cp_cann_engine.cpp:1495-1526 | +| aclGraph capture | parked in TTS (2.3× *slower* on iterative decode) | **Likely a win for QIE** because QIE is steady-state (60 blocks × 20 steps with constant shapes), not iterative-with-shape-flip. | docs/qwen_tts_optimization_writeup.md:114-128 | +| `TASK_QUEUE_ENABLE=2` | **the final enabler** in TTS — pipelines launch with execute | **Not adopted.** No env-var setup in QIE init. | docs/qwen_tts_optimization_writeup.md §"Final enabler" | + +Other notable potentials surfaced by the source: +- **Transient tensor descriptors:** `forward_block_` constructs 76 `aclCreateTensor` / `aclDestroyTensor` pairs per block per step. TTS engine constructs 137 across the *whole* loop with descriptor reuse. At 60 blocks × 20 steps, QIE creates ~91k tensors per image; TTS-style descriptor pooling would amortize this. +- **Modulation broadcast `aclnnMul`:** in `modulate_` the F16 `[B,H]` shift/scale is applied row-wise to `[B,seq,H]` via repeated mul/add — could collapse into a single fused kernel. + +--- + +## 3. Top 5 abstraction-break candidates (ranked by payoff × tractability) + +Ranking model: payoff scaled to TTS's deltas, divided by effort. All deltas refer to the *current native baseline* (denoise_full wall-clock per step), not the 1-fps ggml-cann starting point. + +### #1 — TASK_QUEUE_ENABLE=2 + aclGraph capture per-block +- **Replaces:** synchronous launch model (host-wait per dispatch) +- **Op count:** ~one-line env set + ~150 LOC of capture/replay machinery copying TTS's `capture_aclgraph_forwards_` (cp_cann_engine.cpp:1672-1677) +- **Expected win:** **medium-large** — TTS reports TASK_QUEUE_ENABLE as the "final enabler"; aclGraph captures 60 blocks × 20 = 1200 replays of identical shape, exactly the workload it was designed for (QIE is steady-state, unlike TTS's iterative decode where graph capture lost). Ballpark +20-40%. +- **Risk:** low numerical (replay is bit-identical), medium correctness (block 0 dump infrastructure breaks under capture; gate behind `QIE_ACLGRAPH=1`) +- **Effort:** ~1 week — symbol surface already loaded in `cp_cann_symbols.h:486-540` + +### #2 — `aclnnFFNV3` for img / txt FFN +- **Replaces:** 6 dispatches per block (3× img + 3× txt: up Mm + GeluV2 + down Mm) → 2 dispatches +- **Caveat:** QIE FFN is GELU-tanh, not SwiGLU. `aclnnFFNV3` accepts `activation="gelu"` per the symbol header. 
Verify on CANN 8.5 sample list. +- **Expected win:** **medium** — TTS's FFN is the largest single sublayer; same is true for QIE (FF=12288 = 4× hidden). Ballpark +10-15%. +- **Risk:** low if CANN 8.5 supports gelu in FFNV3; if not, the alternative is `aclnnFusedActivationLinearForward` or a custom AscendC gelu+matmul kernel (effort jumps 5×). +- **Effort:** ~3 days if FFNV3-gelu works, else 2-3 weeks for AscendC. + +### #3 — `aclnnGroupedMatmulV3` for Q/K/V +- **Replaces:** 3 separate WQBMMv3 / aclnnMm calls per stream per block → 1 grouped dispatch +- **Caveat:** TTS's GroupedMatmulV3 is gated on W8 (INT8) weights. QIE uses Q4_K antiquant via WQBMMv3. Need to check whether GroupedMatmulV3 has a WQBMM-grouped sibling or whether Q/K/V need to be batched at the WQBMM level via stride packing. +- **Expected win:** **small-medium** — TTS reports GMM at ~94-100 μs vs ~98-102 μs for 3× WQBMMv3 (4-8% per attention sublayer; symbols header notes this at line 435). Ballpark +5-8% if portable. +- **Risk:** medium — if no Q4 grouped variant exists, this is a wash on QIE. +- **Effort:** ~1 week including the bias-channel-pack rework. + +### #4 — FRACTAL_NZ weight pre-conversion at init +- **Replaces:** ND-format weight at every WQBMMv3 / aclnnMm call → NZ-format (Ascend's native cube tiling) +- **Expected win:** **small-medium for Q4 path, medium for F16 fallback path** — TTS measured +15% in the F16 path (writeup §"Layer 3"). The Q4 WQBMMv3 path may or may not benefit; CANN docs (`aclnn_weight_quant_batch_matmul_v3.h`) need verification on whether mat2 NZ is consumed. +- **Risk:** low — TTS proved it; metadata mutation is in-place. +- **Effort:** ~3 days — copy `set_use_nz_weights` from `cp_cann_engine.cpp:424-462` and apply at every Linear weight in `init_from_gguf`. +- **Caveat:** this is the lowest-risk win listed but the smallest expected magnitude. + +### #5 — Tensor descriptor pool + AscendC custom modulate kernel +- **Replaces:** 91k transient `aclCreateTensor`/`aclDestroyTensor` per image; the 4-op `modulate_` (Mul + Add + Cast chain) with one fused kernel +- **Expected win:** **small** — descriptor work is already inside a single host thread; CANN's tensor descriptor allocator is fast. Likely +2-5%. +- **Risk:** low for descriptor pool; AscendC kernel is high effort. +- **Effort:** descriptor pool ~3 days; AscendC modulate ~2-3 weeks. **Skip the AscendC half** — won't pay back. + +### Honorable mention — Drop the F32 residual stream once §5.5.66 closes +- The Q2.4.5.4d "QIE_ALL_BF16" path (cpp:3094-3102) was the §5.5 saga's overflow workaround. Once the dtype audit closes (§5.5.66) the residual can return to F16, halving residual bandwidth and removing the F32→F16 cast pair per residual add. Estimated +5-8%, but it's *coupled* to saga close, so it doesn't survey separately. + +--- + +## 4. Recommendation (post-§5.5.66) + +**Attack #1 first: TASK_QUEUE_ENABLE=2 + per-block aclGraph capture.** + +Reasons: +1. Highest expected payoff per LOC. The TTS writeup calls `TASK_QUEUE_ENABLE=2` "the final enabler"; pairing it with aclGraph (which TTS *parked* because TTS's iterative shape-flip kills it but QIE's identical 60-block steady-state matches it) is the unique QIE win. The combination is exactly the "pre-warmed caches for repeated shapes" pattern from the survey prompt. +2. Numerically free. Graph replay is bit-identical to eager dispatch; no precision concerns to bisect against the §5.5 evidence pack. +3. Decoupled from the saga. 
It does not depend on residual-stream dtype, RoPE layout, or any of the §5.5 evidence chain. Can run on any post-§5.5.66 baseline. +4. Symbol surface already in the dlsym shim (`cp_cann_symbols.h:486-540`). +5. Compounds with #2-#4. Once aclGraph captures the per-block sequence, swapping FFNV3 / GroupedMatmul / NZ inside that captured graph is mechanical (capture-replace-recapture). + +Targets: realistic stretch is **~2× current native** with all five layers landed (TASK_QUEUE +25%, FFNV3 +15%, GMM +6%, NZ +12%, descriptor pool +3% — multiplicative ≈ 1.75-2.0×). Not another 30×; the engine's already past the easy gains the TTS arc captured. + +--- + +## File index +- `/Users/yuechen/home/OminiX-Ascend/tools/qwen_image_edit/native/image_diffusion_engine.h` (870 LOC) +- `/Users/yuechen/home/OminiX-Ascend/tools/qwen_image_edit/native/image_diffusion_engine.cpp` (5,647 LOC) +- `/Users/yuechen/home/OminiX-Ascend/tools/qwen_image_edit/native/main_native.cpp` (152 LOC, lock + driver) +- `/Users/yuechen/home/OminiX-Ascend/tools/qwen_tts/cp_cann_symbols.{h,cpp}` (978 LOC, dlsym shim used by both engines) +- `/Users/yuechen/home/OminiX-Ascend/tools/qwen_tts/cp_cann_engine.cpp` (3,674 LOC, reference for NZ + aclGraph + AddRmsNorm patterns) +- `/Users/yuechen/home/OminiX-Ascend/tools/qwen_tts/talker_cann_engine.cpp` (reference for GroupedMatmulV3 and FFNV3 wiring) +- `/Users/yuechen/home/OminiX-Ascend/tools/ominix_diffusion/src/qwen_image.hpp` (847 LOC, CPU reference for parity) +- `/Users/yuechen/home/OminiX-Ascend/docs/qwen_tts_optimization_writeup.md` (the 30× TTS retrospective; baseline-and-target numbers, NZ +15%, W8 +14%) diff --git a/cuda_native_engine_audit.md b/cuda_native_engine_audit.md new file mode 100644 index 0000000..197afa9 --- /dev/null +++ b/cuda_native_engine_audit.md @@ -0,0 +1,143 @@ +# CUDA Native QIE-Edit Engine — Readiness Audit + +**Auditor:** Claude (Opus 4.7, 1M context) +**Date:** 2026-04-29 +**Scope:** Read-only audit of `tools/qwen_image_edit/native/image_diffusion_cuda_engine.{h,cpp}` on `zgx-5b44` to decide if the native CUDA engine is production-ready, partial scaffold, or dead code. +**Sources read on zgx-5b44:** +- `/home/user1/ominix-cuda/tools/qwen_image_edit/native/image_diffusion_cuda_engine.h` (343 LOC) +- `/home/user1/ominix-cuda/tools/qwen_image_edit/native/image_diffusion_cuda_engine.cpp` (1,654 LOC) +- `/home/user1/ominix-cuda/tools/qwen_image_edit/native/cuda_kernels/dit_kernels.{cu,h}` (813 + 188 LOC) +- `/home/user1/ominix-cuda/tools/qwen_image_edit/native/test_image_diffusion_cuda_{init,block,dit,e2e}.cpp` +- `/tmp/qie_3p3c_*.log`, `/tmp/qie_3p3c_*.f32.bin` (prior run artefacts) +- ran `test_image_diffusion_cuda_init` and `test_image_diffusion_cuda_block` at 1024² (PASS, finite, no NaN/Inf) + +--- + +## Verdict at a glance + +**Branch B — engine is REAL and END-TO-END FUNCTIONAL but ~2× SLOWER than the CLI baseline at 1024².** + +- Every method declared in the header has a real implementation in the `.cpp`. + `init_from_gguf` → `compute_t_emb_` → `forward_block` (60 blocks of cuBLAS GEMM + custom RMSNorm + multi-axis NEOX RoPE + naive softmax attention + AdaLN modulate + GELU MLP + gated residual) → `forward_dit` → `denoise` (host-orchestrated 20-step Euler-flow loop with patchify/unpatchify) all run through to a finite F32 output latent. +- Smoke evidence is on disk: `/tmp/qie_3p3c_run.log` shows a successful **1024² × 20-step** end-to-end run on 2026-04-26 producing `qie_3p3c_latent.f32.bin` (NaN=0, Inf=0, std=1.034). 
I re-ran `test_image_diffusion_cuda_init` and `test_image_diffusion_cuda_block` today (2026-04-29): both PASS, finite outputs, init uploads 38.06 GiB of weights cleanly. +- **Performance: 33.8 s/step at 1024²** (per-step log) — **slower** than the ggml-cuda CLI baseline of 15 s/step that the perf project is trying to beat. Native engine is NOT a free win; it is a different point in design space. +- **Quality is unverified.** Final latent magnitudes barely move across 20 steps (max_abs 4.795 → 4.898 ≈ 2 % drift) which suggests near-pass-through behaviour, i.e. the proj_out velocity prediction is close to zero. A VAE decode + visual eye-check has never been done for the 1024² CUDA path. The Phase-3.4d code comment in `denoise()` claims a similar bug was previously fixed for the Ascend twin, but there is no parity log proving the CUDA fix landed. +- **Source files are NOT tracked in git.** `git status` reports `tools/qwen_image_edit/native/` as Untracked. There is no commit on the CUDA repo's `main` for any of the 1,997 LOC of native engine code. The most recent CUDA commit (`9af35f05`) is the in-flight ggml-cuda allocator fix (#187). All "Phase 3.x" commit history in the git log refers to the Ascend twin under `OminiX-Ascend`. + +The header comment at `image_diffusion_cuda_engine.h:33` says "Phase 3.1 skeleton + Phase 3.2 will fill + Phase 3.3 will fill". **That comment is stale.** The actual `.cpp` has Phase-3.1 + 3.2 + 3.3a + 3.3b + 3.3c + 3.3b widening (F32 attention path, §5.5.46 Ascend BF16 analog) + 3.4d (Euler velocity-vs-denoised fix) all landed. + +--- + +## Header status + +**Dispatch-mapping table (h:14–22):** declared mappings vs implementation in cpp: + +| Ascend op | CUDA mapping declared | Actually used in cpp | +|---|---|---| +| `aclnnMm` | `cublasGemmEx` | YES — every projection (mod / qkv / out / mlp_0 / mlp_2 / norm_out / proj_out / img_in / txt_in) is `cublasGemmEx` with F16 weights, F32 accumulate, mostly F32 output | +| `aclnnRmsNorm` / LayerNorm | custom CUDA kernel | YES — `launch_layernorm_noaffine_f32` + `launch_rmsnorm_head_f32_g32` | +| `aclnnApplyRotaryPosEmbV2` | custom CUDA (joint RoPE) | YES — `launch_rope_neox_3axis_f32` + persistent `pe_cos_dev_/pe_sin_dev_` table built at init from temporal=16 + h=56 + w=56 axes | +| `aclnnFusedInferAttentionScoreV2` | cuDNN FMHA / CUTLASS FMHA | **NO** — uses `attn_joint_naive_f32_kernel`, naive O(seq²) softmax. cuDNN handle is created but unused for FMHA | +| `aclnnWeightQuantBatchMatmul` (A8W8) | cuBLAS INT8 / Q8_0 dequant | **NO** — Q8_0/Q4_0 weights are dequantized to F16 at init by `upload_tensor_f16` and stored as plain F16. No on-the-fly INT8/A8W8 GEMM. | +| `aclmdlRI` (ACL Graph) | `cudaGraph` | **NO** — no `cudaGraph_t` capture/replay anywhere. Each step issues 60×~20 launches per block from host. 
| + +**Phase markers in header:** +- h:33 "Phase 3.1 (THIS) lands the engine class skeleton" — stale; everything below is also landed +- h:38 "Phase 3.2 will fill `forward_block`" — DONE in cpp +- h:42 "Phase 3.3 wires the full loop + lm_head/proj_out + VAE decode hand-off" — denoise is DONE; **VAE decode is NOT in the engine** and is delegated to the e2e harness caller +- h:54 "frozen for Phase 3" config struct — present and used + +--- + +## Method-by-method classification + +| Method | Header line | Cpp line | Status | Notes | +|---|---|---|---|---| +| `init_from_gguf` | h:114 | cpp:238 | **COMPLETE** | Opens DiT GGUF via `gguf_init_from_file`, validates `general.architecture=='qwen_image'`, uploads all 60 blocks × ~30 tensors + global head/tail (img_in / txt_in / txt_norm / time_lin1 / time_lin2 / norm_out / proj_out) → 38.06 GiB F16 weights on device, builds 3-axis RoPE pe-table, allocs t_emb scratch. `nonfinite_weight_count_=0` confirmed on real run. LLM/vision/VAE paths accepted but ignored (`(void)llm_path;`) | +| `build_pe_table_` | h:269 | cpp:602 | **COMPLETE** | Calls host `build_qwen_rope_pe_host_3axis` (mirrors Ascend `compute_qwen_rope_pe_host`), uploads to F16 device buffers | +| `ensure_scratch_` | h:268 | cpp:617 | **COMPLETE** | Lazy alloc for img/txt residual/norm/q/k/v/attn/mlp/proj scratch in both F16 and F32 (Phase 3.3b widened path) | +| `compute_t_emb_` | h:270 | cpp:692 | **COMPLETE** | Sinusoidal[256] → time_lin1 GEMM → SiLU → time_lin2 GEMM → SiLU, all on device | +| `forward_block` | h:135 | cpp:777 | **COMPLETE** | Real per-block compute: AdaLN-mod, LayerNorm1, QKV F32-out GEMMs, head-wise RMSNorm, multi-axis RoPE, **naive O(seq²) F32 attention**, output proj, gated-residual add, LayerNorm2+AdaLN, MLP_0+GELU+MLP_2, gated-residual add #2. F32 residual chain (Phase 3.3b widened path mirrors Ascend §5.5.46 BF16 fix). H2D/D2H per call. ~1.1 s wall at 1024² seq_tot=4352 | +| `final_proj` | h:144 | cpp:1127 | **STUB / DEAD** | `std::abort()` with "Phase 3.3 stub" message. **Replaced by inline tail logic in `forward_dit`** — never called by the e2e flow | +| `forward_dit` | h:163 | cpp:1147 | **COMPLETE** | 60-block loop with host F32 ping-pong buffers + AdaLN-final tail (norm_out.linear → split shift/scale → LN → modulate → proj_out → patch_out F16 → host F32). Optional `OMINIX_CUDA_DUMP_BLOCKS=1` per-block diagnostic | +| `denoise` | h:200 | cpp:1382 | **COMPLETE** | One-shot txt_norm + txt_in GEMM hoisted before loop; per-step host patchify → device img_in GEMM → forward_dit → host unpatchify → flow-Euler update `x += (x-denoised)/sigma * dt`. Phase 3.4d comment notes the velocity-vs-denoised semantics fix landed. Ref-latent path declared but only implemented for `ref==null` | + +--- + +## Smoke test inventory + +**Tests in source:** init / block / dit / e2e — all four buildable, all four built: +``` +/home/user1/ominix-cuda/build/bin/test_image_diffusion_cuda_init (built 2026-04-29) +/home/user1/ominix-cuda/build/bin/test_image_diffusion_cuda_block (built 2026-04-29) +/home/user1/ominix-cuda/build/bin/test_image_diffusion_cuda_dit (built 2026-04-29) +/home/user1/ominix-cuda/build/bin/test_image_diffusion_cuda_e2e (built 2026-04-29) +``` + +**Tests run today (2026-04-29):** +- `test_image_diffusion_cuda_init ` → **PASS**. `Phase 3.3a init OK uploaded=38.06 GiB nonfinite=0 pe_total_pos=8537`. load_ms=129647 (most of which is GGUF dequant + H2D). +- `test_image_diffusion_cuda_block ` → **PASS**. 
Single-block 1024² fwd: `wall_ms=1095.8`, `img_out max_abs=6.158e+07 NaN=0 Inf=0`. (Magnitudes are pre-AdaLN-final; Phase-3.3a smoke does not gate them.)
+
+**Prior run artefacts on box (2026-04-26):**
+- `/tmp/qie_3p3c_run.log` — full **1024² × 20-step** e2e run: 676.2 s wall, 33.8 s/step, finite final latent, NaN=0
+- `/tmp/qie_3p3c_256_full.log` — 256² × 20-step e2e: 16.0 s wall, 0.78 s/step, finite, NaN=0
+- `/tmp/qie_3p3c_latent.f32.bin` — final latent F32 [1, 16, 128, 128], 1 MiB
+- **No PNG output anywhere.** VAE decode is delegated to the e2e harness caller and was never wired. Visual correctness unverified.
+
+---
+
+## Performance table (this matters for the verdict)
+
+| Path | 1024²×20 step time | Status |
+|---|---|---|
+| ggml-cuda CLI (current, what perf project is optimizing) | 15.0 s/step | Baseline |
+| torch.compile reference | 5.6 s/step | Target (2.7× gap) |
+| **Native CUDA engine (this audit)** | **33.8 s/step** | **2.25× SLOWER than CLI baseline** |
+
+The naive O(seq²×head_dim) attention kernel at seq_tot=4352 is the obvious culprit. cuDNN FMHA + cudaGraph are declared in the header dispatch table as the targets but **never implemented**. The Phase 3.3a probe path (block-only at seq_tot=4352) takes 1.1 s wall × 60 blocks = 66 s of pure compute, which is consistent with the per-step number once you subtract H2D/D2H+patchify overhead.
+
+---
+
+## Why "Branch B, not Branch A"
+
+Native engine is REAL but it is **not yet a perf win**. Compared to the Ascend twin (which has parity bugs but acceptable speed), the CUDA twin has:
+- ✅ Numerical health (no NaN, end-to-end run completes)
+- ❌ FMHA / FlashAttention (the single biggest CUDA-side perf lever)
+- ❌ Quantized GEMM (A8W8 / Q8_0 stays-quantized — current code dequantizes to F16 at load → 38 GiB resident, no compute speedup)
+- ❌ cudaGraph capture (60-block per-step launch overhead is paid on every step)
+- ❌ VAE decode (engine emits a latent, e2e harness calls a stub)
+- ❌ Quality verification (no PNG; pass-through-looking magnitude trace is suspicious)
+- ❌ Git tracking (entire engine is `Untracked` — one `rm -rf` and it's gone)
+
+To make this Branch A you would need to:
+1. **cuDNN FMHA or CUTLASS FlashAttention.** Expected: replace the 4352² softmax kernel; flash attention keeps the O(N²) compute but stops materializing the N² score matrix in HBM, so attention memory traffic drops to ~O(N). Realistic 5-10× attention speedup; net per-step probably 8-12 s/step. Cost: 3–5 days.
+2. **cudaGraph capture of forward_block.** Topology is stable across blocks (60 identical block shapes, just weight pointer swap). Cost: 1–2 days.
+3. **Resident-quantized Q8_0/Q4_0 GEMM.** Either dequant-at-runtime per tile (smaller HBM footprint, similar compute; see the Q8_0 sketch below) or true INT8 cuBLAS. Cost: 3–5 days.
+4. **VAE decode + PNG eye-check.** Borrow VAE from CLI path, validate the 20-step 1024² output is a recognizable cat-edit. Cost: 1 day.
+5. **CLI flag `--engine native` + weight-handoff plumbing + git commit the source.** Cost: 1 day.
+
+Total: ~10–15 engineer-days to make native a real Branch-A on CUDA. The 2.7× gap to torch.compile would likely be **closed and possibly beaten** if the perf items land — torch.compile uses Triton FlashAttention + CUDA graphs + fused norms, all of which the native path could match natively.
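+
+For item 3, the baseline the audit describes (dequantize everything to F16 at init) is easy to picture. A sketch assuming upstream ggml's standard `block_q8_0` layout (one f16 scale plus 32 int8 quants per 32 weights); the helper and loop are illustrative, not engine code:
+
+```cpp
+// Sketch: ggml's Q8_0 layout and the init-time dequant the audit describes.
+// Layout follows upstream ggml's block_q8_0; names here are illustrative.
+#include <cstdint>
+#include <cstring>
+#include <cstddef>
+
+struct block_q8_0 {
+    uint16_t d;        // per-block scale, IEEE f16 bit pattern
+    int8_t   qs[32];   // 32 signed 8-bit quants
+};                     // 34 bytes per 32 weights = 8.5 bits/weight
+
+static float f16_to_f32(uint16_t h) {
+    uint32_t sign = (uint32_t)(h & 0x8000) << 16;
+    uint32_t exp  = (h >> 10) & 0x1F, man = h & 0x3FF, bits;
+    if      (exp == 0)  bits = sign;                               // flush subnormals (sketch)
+    else if (exp == 31) bits = sign | 0x7F800000u | (man << 13);   // inf/nan
+    else                bits = sign | ((exp + 112u) << 23) | (man << 13);
+    float f; std::memcpy(&f, &bits, sizeof f); return f;
+}
+
+// w[i] = d * qs[i]. The engine currently does the equivalent of this once at
+// load and stores F16 (hence 38.06 GiB resident); a resident-quantized GEMM
+// would instead expand each tile in registers inside the kernel.
+void dequant_q8_0(const block_q8_0* src, float* dst, size_t n_blocks) {
+    for (size_t b = 0; b < n_blocks; ++b) {
+        const float d = f16_to_f32(src[b].d);
+        for (int i = 0; i < 32; ++i) dst[b * 32 + i] = d * (float)src[b].qs[i];
+    }
+}
+```
+
+Keeping weights in `block_q8_0` form would roughly halve the resident footprint relative to F16 (8.5 vs 16 bits/weight), which is the "smaller HBM footprint" trade item 3 refers to.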
+ +--- + +## Recommendation + +**Continue the kernel-fusion / FP8 stack on the CLI path for now, but file a tracking ticket to revive the native engine when CLI hits its ceiling.** + +Concrete next step: **run `test_image_diffusion_cuda_e2e` once with the existing fixtures, pipe the latent through the CLI's VAE decode step, and produce a single 1024² PNG.** This is one shell pipeline (`test_image_diffusion_cuda_e2e ... && python decode_vae.py /tmp/qie_3p3c_latent.f32.bin out.png`) and gives a yes/no verdict on whether the engine produces visually correct output. If YES → native becomes a viable Branch-A target once perf items land. If NO → native is parked behind a documented quality bug and CLI perf work is unambiguously the right place to spend time. + +After that one decode, also `git add tools/qwen_image_edit/native/` and commit so 1,997 LOC of working code stops being a `git clean -fd` accident waiting to happen. + +--- + +## Files / paths referenced + +- `/home/user1/ominix-cuda/tools/qwen_image_edit/native/image_diffusion_cuda_engine.h` +- `/home/user1/ominix-cuda/tools/qwen_image_edit/native/image_diffusion_cuda_engine.cpp` +- `/home/user1/ominix-cuda/tools/qwen_image_edit/native/cuda_kernels/dit_kernels.cu` +- `/home/user1/ominix-cuda/tools/qwen_image_edit/native/cuda_kernels/dit_kernels.h` +- `/home/user1/ominix-cuda/tools/qwen_image_edit/native/test_image_diffusion_cuda_{init,block,dit,e2e}.cpp` +- `/home/user1/ominix-cuda/tools/qwen_image_edit/CMakeLists.txt` +- `/home/user1/ominix-cuda/build/bin/test_image_diffusion_cuda_{init,block,dit,e2e}` (built 2026-04-29) +- `/tmp/qie_3p3c_run.log`, `/tmp/qie_3p3c_256_full.log`, `/tmp/qie_3p3c_256_smoke.log` (prior runs on zgx-5b44) +- `/tmp/qie_3p3c_latent.f32.bin` (most recent 1024² final latent) diff --git a/cuda_qie_perf_exploration.md b/cuda_qie_perf_exploration.md new file mode 100644 index 0000000..293fb25 --- /dev/null +++ b/cuda_qie_perf_exploration.md @@ -0,0 +1,110 @@ +# CUDA QIE-Edit Perf Exploration on GB10 (sm_121, Blackwell) + +**Read-only investigation. No code changed.** Profiled the production +ggml-cuda CLI (`~/ominix-cuda/build/bin/ominix-diffusion-cli`, commit +`ad5ef19c`) on host `zgx-5b44`, with the canonical cat→B/W run at +1024² × 5 steps (Q4_0 DiT, Q8_0 Qwen2.5-VL, fp32 VAE, `--diffusion-fa`). + +## Run summary + +| metric | value | source | +| --- | --- | --- | +| sampling wall time (5 steps × 2 cfg) | **75.58 s** | `/tmp/nsys_qie/run.log:4065` | +| per-step wall (cfg sequential) | **15.12 s** | same | +| denoise GPU window | 80.7 s | sqlite `MAX(end)-MIN(start)` | +| GPU kernel time | **78.24 s (96.9% busy)** | `SUM(end-start)` over `CUPTI_ACTIVITY_KIND_KERNEL` | +| total kernel launches in window | **74,127** (~14,800 / step) | sqlite | +| `cudaLaunchKernel` API time | 57.6 s | nsys api-sum | +| CUDA Graph instantiations | **1** (the Qwen2.5-VL conditioner only — DiT is eager) | api-sum | +| memcpy D2D | 86 GB total / **1.13 s** = ~76 GB/s | mem-sum | +| memcpy H2D | 20.7 GB / 0.39 s | mem-sum | + +**For comparison**, the in-tree PyTorch diffusers reference +(`qie_cuda/src/bench_qie_diffusers.py`, `torch.compile max-autotune`, +`bf16`, same input) runs the same workload at **2.80 s / transformer +call** = 5.60 s / step, vs ggml-cuda at **15.12 s/step** — a **2.7×** +gap. PyTorch is the correct ceiling reference; ggml-cuda is leaving a +lot on the table. 
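+
+The GPU-side rows in the table come straight from the sqlite export. A minimal sketch of the queries, assuming nsys's standard `CUPTI_ACTIVITY_KIND_KERNEL` table with nanosecond `start`/`end` columns (the busy sum ignores kernel overlap, which is fine for this single-stream workload):
+
+```cpp
+// busy.cpp: reproduce the GPU-window / kernel-time numbers from the export.
+// Build: c++ -O2 busy.cpp -lsqlite3 -o busy
+// Usage: ./busy /tmp/nsys_qie/qie_5step.sqlite
+#include <sqlite3.h>
+#include <cstdio>
+
+static long long scalar(sqlite3* db, const char* sql) {
+    sqlite3_stmt* st = nullptr;
+    long long v = 0;
+    if (sqlite3_prepare_v2(db, sql, -1, &st, nullptr) == SQLITE_OK &&
+        sqlite3_step(st) == SQLITE_ROW)
+        v = sqlite3_column_int64(st, 0);
+    sqlite3_finalize(st);
+    return v;
+}
+
+int main(int argc, char** argv) {
+    if (argc != 2) { std::fprintf(stderr, "usage: %s trace.sqlite\n", argv[0]); return 1; }
+    sqlite3* db = nullptr;
+    if (sqlite3_open_v2(argv[1], &db, SQLITE_OPEN_READONLY, nullptr) != SQLITE_OK) return 1;
+    // GPU window = MAX(end) - MIN(start); busy = SUM(end - start); both in ns
+    long long window = scalar(db, "SELECT MAX(end) - MIN(start) FROM CUPTI_ACTIVITY_KIND_KERNEL");
+    long long busy   = scalar(db, "SELECT SUM(end - start) FROM CUPTI_ACTIVITY_KIND_KERNEL");
+    long long calls  = scalar(db, "SELECT COUNT(*) FROM CUPTI_ACTIVITY_KIND_KERNEL");
+    std::printf("window=%.2fs busy=%.2fs (%.1f%% busy) launches=%lld\n",
+                window / 1e9, busy / 1e9, 100.0 * busy / window, calls);
+    sqlite3_close(db);
+    return 0;
+}
+```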
+
+## Top 15 kernels by GPU time (5-step run)
+
+| # | kernel (truncated) | GPU s | calls | % of kernel time |
+| - | ------------------ | ----- | ----- | ---------------- |
+| 1 | `flash_attn_ext_f16<128,128,64,1,...>` | 15.998 | 600 | **20.45 %** |
+| 2 | `mul_mat_q` (large) | 10.823 | 2,900 | **13.83 %** |
+| 3 | `k_bin_bcast` | 10.749 | 16,392 | **13.74 %** |
+| 4 | `mul_mat_q` | 6.139 | 580 | 7.85 % |
+| 5 | `scale_f32` | 5.230 | 7,860 | 6.68 % |
+| 6 | `k_bin_bcast` | 4.451 | 7,448 | 5.69 % |
+| 7 | `cpy_scalar` (i.e. `ggml_cont`) | 4.053 | 4,923 | 5.18 % |
+| 8 | `quantize_mmq_q8_1` (activation quant pre-mmq) | 3.175 | 7,170 | 4.06 % |
+| 9 | `unary_op_kernel` | 2.098 | 1,191 | 2.68 % |
+|10 | `cpy_scalar_contiguous` (FA K/V cast) | 1.923 | 1,800 | 2.46 % |
+|11 | `im2col_3d_kernel<__half>` (VAE encode/decode) | 1.793 | 57 | 2.29 % |
+|12 | `k_bin_bcast` | 1.599 | 2,400 | 2.04 % |
+|13 | `concat_f32_dim2` | 1.584 | 1,800 | 2.02 % |
+|14 | `concat_f32_dim0` | 1.520 | 656 | 1.94 % |
+|15 | `rms_norm_f32<256, true, false>` | 1.235 | 2,400 | 1.58 % |
+|— | (`norm_f32` LayerNorm, `pad_f32`, etc. follow) | | | rest 7 % |
+
+(Source: `nsys profile` artefact `/tmp/nsys_qie/qie_5step.{nsys-rep,sqlite}` on `zgx-5b44`.)
+
+### What this says
+
+1. **Attention is f16 with f32-accum (FA enabled)** — `flash_attn_ext_f16<128,128,64>` at 20%. 600 calls = 60 layers × 10 transformer calls. This is the *Ampere-style* (mma.sync 16x8x16 fp16) FA kernel. **There is no Blackwell-tuned FA-3 path**; ggml-cuda has no awareness of sm_100/sm_120 tensor-memory or wgmma. Each FA call averages 26.6 ms, working on ~9k tokens × 24 heads × 128 d_head.
+2. **Q4_0 mmq dominates matmul** — 2,900 launches × 3.7 ms = 10.8s on `mul_mat_q`. Plus `mul_mat_q` (6.1s, 580 calls — that's 1 per layer per cfg = MLP `to_out`) and Q8_1 activation prequant (3.2s). **The whole diffusion model is Q4 — no FP8 path exists**, even though sm_121 has native E4M3 support.
+3. **Element-wise kernels dwarf matmul** — `add` (10.7s) + `mul` (4.5s) + `scale` (5.2s) + `cpy` (4.0s) + `quant` (3.2s) + `cont→half` (1.9s) + `repeat` (1.6s) + `concat` (3.1s) + `gelu` (2.1s) = **36.3 s = 46% of GPU time** spent on bandwidth-bound glue between matmuls. Each is a separate kernel launch.
+4. **Memory bandwidth is *not* the binary bottleneck**: only 1.13s of pure-memcpy time. But the elementwise kernels above ARE bandwidth-bound — they read/write the activation tensor (~64 MB at hidden 3072 × 9k tokens fp32) repeatedly. Those 16,392 `add` launches *are* the memory traffic.
+5. **CUDA Graphs not used for DiT** — only 1 graph instantiation in the entire 75s window (it's the Qwen2.5-VL conditioner; DiT is rebuilt eagerly each transformer call). 14,800 launches/step × ~3.5 µs API overhead each ≈ **52 ms/step pure launch overhead** baked in.
+6. **No fused RMSNorm-modulate-gate kernel exists.** Per DiT block (`qwen_image.hpp:255-345`) the modulation+norm+gate sequence is 13 separate ggml ops. Multiplied by 60 blocks × 10 calls = 7,800 unfused launches — most of items 3, 5, 6 above.
+
+GB10 unified memory peak BW is ~120 GB/s (LPDDR5x). On the elementwise sweep, the ~16k launches move ≈1.9 TB of activation traffic (~120 MB per elementwise launch) in 36 s, i.e. ≈ **53 GB/s on the elementwise tail = ~44% of peak**.
For Q4_0 mmq, 11 GB of weights × 600 reads ≈ 6.6 TB at 10.8s = **610 GB/s effective** (well above the unified-memory ceiling — kernel is running mostly out of L2 + register-resident dequant; this is compute-bound on Q4-decode + tensor-core throughput for a quantized GEMM that doesn't use Blackwell's wgmma). + +## Ranked candidate wins + +| Rank | Candidate | Expected win | Risk | Effort | Where | +| ---- | --------- | ------------ | ---- | ------ | ----- | +| 1 | **Capture DiT into a CUDA Graph** (cudaStreamBeginCapture around 60-block forward, replay each step / each cfg branch) | **Eliminate ~50 ms/step launch overhead = ~5–8% step**; bigger if the elementwise tail is launch-bound, which the 14.8k calls/step and 70% API time on `cudaLaunchKernel` strongly suggest. Could reach **15–25%**. | Low. ggml already has graph infra (`ggml-cuda.cu:2927`); just needs DiT op coverage. PE-cache and modulate-index already shape-stable. | ~200 LOC + debug, 1–2 days | `ggml/src/ggml-cuda/ggml-cuda.cu` graph compatibility checks; `qwen_image.hpp` `build_graph` | +| 2 | **Hand-fused RMSNorm + modulate (scale/shift) + gated-residual kernel** to replace items 3/5/6/7 of the kernel table inside each DiT block | RMSNorm/LN + 2 muls + 1 scale + 1 add per modulated branch × 4 branches × 60 layers × 10 calls ≈ **8–12 s** of elementwise + cont could fold into 60×10×4 = 2,400 launches. Estimate **15–20% step time**. | Medium. Need to audit shape-vs-broadcast for img/txt and Q4 CFG-batched code path. Numerical regression possible in modulate broadcast. | ~600 LOC + tests, 3–5 days | new `.cu` next to `norm.cu`; wire via `ggml_qwen_image_block_modulate` custom op | +| 3 | **FP8 (E4M3) DiT weights via cuBLASLt FP8 GEMM** — replace `mul_mat_q` for Linear layers (Q,K,V,to_out, MLP up/down) with cuBLASLt FP8 GEMM; activations bf16 | Matmul drops from ~17 s (Q4_0+Q4_1+Q8_1 quant) to ~6–8 s on Blackwell tensor cores at 2× density vs bf16. Expect **10–15% step**. **TTS got ~5% from FP8 on lm-head alone; DiT has 60× more FP8-eligible matmul.** Better quality vs Q4 too. | Medium-high. Need calibration (per-tensor scaling) and a CUDA path that hands raw FP8 tensors to cuBLASLt. ggml has no FP8 type today. | ~1,500 LOC (new ggml type, packing, cuBLASLt wrapper), 1–2 weeks | new files `ggml/src/ggml-cuda/fp8/`; add `GGML_TYPE_FP8_E4M3` | +| 4 | **Replace `ggml_ext_attention_ext` cast/permute prologue with one fused QKV-permute-cast-pad kernel; or hand-call cuDNN FMHA / FlashAttention-3** | The FA kernel itself is fine (20%), but the *3 ms of prep* per call (`cpy_scalar f32→f32` + `cpy_scalar_contiguous f32→half` + `pad`) costs ~3 s/run = **3–4%**. Switching to FA-3 (Hopper/Blackwell wgmma) could cut FA itself by ~30% = **~6%**. | Medium. FA-3 needs head_dim 128 path, which *does* exist in upstream Tri Dao branch and in cuDNN 9.x. Need to handle Q4 CFG-batched mask too. | ~400 LOC + cuDNN integration, 3–4 days | `ggml_extend.hpp:1265-1380`; new dispatch in `fattn-mma-f16.cuh` for sm_121 | +| 5 | **Drop `ggml_cont` after attention output split (line 184–185 of qwen_image.hpp)** — the txt/img view-then-cont is inserted because downstream `to_out_0->forward` expects contiguous; with cuBLASLt epilogues we can ingest strided directly | Two `cpy_scalar` per layer per call × 60 × 10 = 1,200 calls, ~2.5 s = **3% step** | Low. Just verify the linear-forward strided-input path in ggml-cuda. 
| ~50 LOC, half a day | `qwen_image.hpp:184-186`; `mul_mat` strided support | +| 6 | **CFG batching (run cond+uncond as ne[3]=2)** — already partly implemented (`OMINIX_CFG_BATCHED=1`), currently disabled for 1024² because mask footprint > 256 MiB budget | If batch utilisation lifts each kernel by ~1.4× (not 2× because some kernels are already underfull), expect **20–35%** end-to-end. | Low-medium. Mask footprint problem is the actual blocker — need a sparse/blocked mask representation, not dense [N,L_q,L_k] f16. | ~300 LOC + memory plumbing, 2–3 days | `stable-diffusion.cpp:2364-2399`, `OMINIX_CFG_BATCHED` path | +| 7 | **Disable Q8_1 activation pre-quantization for low-token matmuls** — `quantize_mmq_q8_1` runs 7,170× = 3.2 s. Many small matmuls (modulation projection, LayerNorm-to-modparam) re-quantize activations that are tiny. A direct bf16×Q4 dot path would save the round-trip. | Item alone = **3–4%**. | Low. Already a tunable in `mmq.cuh`. | small | `ggml-cuda/mmq.cu` MMQ_MIN_BATCH heuristic | + +(Items 1, 2, 6 stack mostly orthogonally; 3 partially overlaps 2.) + +## Recommendation: try CUDA Graphs on the DiT first + +Reasons: +1. **It is the biggest single concrete win available with the lowest + risk and effort** (1–2 days vs 1–2 weeks for FP8). The harness is + already there in `ggml-cuda.cu`; we only need to widen graph + compatibility for the DiT op set and confirm shape stability across + denoise steps (which the existing PE-cache work indicates is + already true). +2. **The profile is unambiguous**: 14,800 launches/step, 96.9% GPU + busy, only 1 graph instantiation observed → graphs are *not* + currently active for DiT despite the project's belief. Confirming + this and turning them on is the highest-value first move. +3. **It de-risks every subsequent optimisation.** Once the DiT runs as + a captured graph, fused-kernel work (item 2) and FP8 swap-in + (item 3) become drop-in node replacements rather than + eager-launch refactors. +4. **It gives the right diagnostic signal** for the bigger question. + If turning on graphs only buys 8% (just the launch overhead), then + the bottleneck is genuinely the kernels, and FP8/fusion is + warranted. If it buys 25%+, the bottleneck was the launch cadence + itself, which would change the next priority. + +The PyTorch `torch.compile` reference at 5.6 s/step shows the workload +admits ~2.7× headroom on this same GPU — strong evidence that the win +is real and not a hardware ceiling. 
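+
+The capture plumbing itself is small; the real work is the graph-compatibility audit of the DiT op set. A minimal sketch of the pattern, assuming a hypothetical `run_dit_step(stream)` wrapper around the 60-block forward and CUDA 12's three-argument `cudaGraphInstantiate`:
+
+```cpp
+// Capture the DiT forward once, then replay it every denoise step.
+// run_dit_step(stream) is an assumed wrapper that enqueues the whole
+// 60-block forward into `stream`; shapes must be identical every step.
+#include <cuda_runtime.h>
+
+void run_dit_step(cudaStream_t stream);  // illustrative, not a real symbol
+
+void denoise_with_graph(cudaStream_t stream, int n_steps) {
+    cudaGraph_t     graph = nullptr;
+    cudaGraphExec_t exec  = nullptr;
+
+    // Pay the capture + instantiate cost once.
+    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
+    run_dit_step(stream);                  // ~14.8k launches recorded, none executed;
+                                           // any sync/host-read here aborts capture
+    cudaStreamEndCapture(stream, &graph);
+    cudaGraphInstantiate(&exec, graph, 0); // CUDA 12 three-arg form
+    cudaGraphDestroy(graph);               // exec holds its own copy
+
+    for (int step = 0; step < n_steps; ++step) {
+        // Step-dependent inputs (timestep embedding, sigma) must live in
+        // device buffers the captured kernels already read from.
+        cudaGraphLaunch(exec, stream);     // one API call replaces ~14.8k
+    }
+    cudaStreamSynchronize(stream);
+    cudaGraphExecDestroy(exec);
+}
+```
+
+If any node in the forward forces a stream sync or host readback, capture fails, which is exactly what the graph-compatibility widening in `ggml-cuda.cu` has to rule out.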
+ +## Artefacts left on `zgx-5b44` + +- `/tmp/nsys_qie/qie_5step.nsys-rep` (6.5 MB) — open in Nsight Systems +- `/tmp/nsys_qie/qie_5step.sqlite` (15 MB) — for further sql queries +- `/tmp/nsys_qie/run.log` — full ggml-cuda log of the profiled run +- `/tmp/nsys_qie/run.sh` — exact command line +- `/tmp/nsys_qie/out.png` — sanity-check output image (5-step degraded but recognisable) diff --git a/cuda_tts_perf_exploration.md b/cuda_tts_perf_exploration.md new file mode 100644 index 0000000..9cd3dbe --- /dev/null +++ b/cuda_tts_perf_exploration.md @@ -0,0 +1,64 @@ +# Qwen3-TTS CUDA / GB10 — Performance Headroom Survey + +**Profile date:** 2026-04-29 +**Host:** `zgx-3675` (NVIDIA GB10, sm_121a / Blackwell, 48 SMs, 24 MiB L2, **546.1 GB/s** unified LPDDR5X, 119 GiB) +**Daemon:** `/home/user1/ominix-cuda/build-phase21/bin/tts_server` (Phase 2.10 warm) +**Test prompt:** `"Hello world, this is the ominix CUDA TTS."` → 10.24 s audio +**Warm wall-clock:** 6.66 – 7.21 s (RTF 0.65 – 0.70) +**Profile:** `nsys profile -t cuda,nvtx,cublas,cudnn --cuda-graph-trace=node --delay=12 --duration=15` +**Capture file:** `/tmp/tts_prof/tts_warm_113139.nsys-rep` on host +**Source files referenced (all on host `zgx-3675`):** +`/home/user1/ominix-cuda/tools/qwen_tts/native/talker_cuda_engine.cpp` (engine + Phase 2.5 graph capture, Phase 2.6 FP8 LM-head, attention), +`/home/user1/ominix-cuda/tools/qwen_tts/native/tts_server.cpp` (Phase 2.10 daemon), +`/home/user1/ominix-cuda/tools/qwen_tts/native/cuda_kernels/{attn_gqa.cu,rmsnorm.cu,rope_neox.cu,swiglu.cu,decoder_ops.cu,elementwise.cu}`. + +--- + +## 1. nsys Top-10 Kernels (warm window, ~1.7 requests captured before duration cutoff) + +| Rank | % GPU | Total (ns) | Calls | Avg (us) | Kernel | +|------|------:|-----------:|------:|---------:|--------| +| 1 | **66.7 %** | 4,527,401,216 | 74,036 | 61.2 | `cublas internal::gemvx::kernel<__half,__half,__half,float>` (decode-loop projections) | +| 2 | 9.3 % | 628,577,056 | 19,200 | 32.7 | `cutlass_80_wmma_tensorop_f16_s161616gemm_f16_32x32_128x2_tn_align8` (predictor / batch GEMM) | +| 3 | 8.8 % | 595,949,920 | 1,920 | 310.4 | `cublas gemvx __half→float` (LM-head decode path, F16 fallback) | +| 4 | 5.8 % | 391,264,128 | 4 | 97 815 | `ominix_cuda::causal_conv_transpose1d_f32_kernel` (codec upsampler) | +| 5 | 2.8 % | 192,064,672 | 19,200 | 10.0 | `cublas gemvx __half` (smaller projection variant) | +| 6 | 2.3 % | 156,739,168 | 66,399 | 2.4 | `rmsnorm_f16_g32_kernel` (custom) | +| 7 | 1.2 % | 84,588,672 | 16,062 | 5.3 | `attn_decode_gqa_kernel` (custom GQA attention, NOT cuDNN FMHA) | +| 8 | 0.5 % | 31,866,944 | 32,124 | 1.0 | `rope_neox_f16_kernel` (custom) | +| 9 | 0.5 % | 31,732,480 | 32,124 | 1.0 | `add_f16_kernel` (residual) | +| 10 | 0.5 % | 30,784,032 | 12 | 2 565.3 | `dilated_causal_conv1d_im2col_f32_kernel` (codec) | + +GPU busy: **6.79 s out of 10.24 s observed** = **33.7 % idle on the GPU**. Inter-kernel gap: median 2.66 us, p90 7.07 us, totalling **1.02 s of small (<100 us) gaps** — pure launch-overhead between micro-kernels. +CUDA-Runtime API call counts: **264,376 cudaLaunchKernel** + **52,527 cudaMemcpyAsync** (708 ms / 80 % of API time on the latter). Default daemon launch did **NOT** enable Phase 2.5 graphs (`use_cuda_graphs_=false`); a re-run with `TALKER_USE_CUDA_GRAPHS=1` showed **133 `cudaGraphInstantiate` + 133 `cudaGraphLaunch` per request** (one capture per token position) and **no measurable wall-clock improvement** — graphs are rebuilt every step, so launch cost is just shifted to capture time. 
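+
+That last finding is the classic anti-pattern: capture cost paid per position instead of launch cost. One standard way to make a single captured graph serve every token position is to route the step-dependent scalars through device memory, sketched below (`DecodeArgs` / `run_decode_step` are illustrative, and this assumes the decode kernels can read `pos` from a buffer rather than a launch argument):
+
+```cpp
+// Persistent decode graph: capture once, relaunch per token. This only
+// works if the kernels take pos / KV offsets via device memory (indirection)
+// instead of by-value launch parameters, so the captured launches never change.
+#include <cuda_runtime.h>
+
+struct DecodeArgs { int pos; };  // extend with KV write offset etc.
+
+void run_decode_step(cudaStream_t stream, const DecodeArgs* d_args);  // assumed
+
+void decode_tokens(cudaStream_t stream, DecodeArgs* d_args, int n_tokens) {
+    cudaGraph_t     graph = nullptr;
+    cudaGraphExec_t exec  = nullptr;
+
+    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
+    run_decode_step(stream, d_args);  // all ~35 gemv/norm/rope launches, one token
+    cudaStreamEndCapture(stream, &graph);
+    cudaGraphInstantiate(&exec, graph, 0);
+    cudaGraphDestroy(graph);
+
+    for (int pos = 0; pos < n_tokens; ++pos) {
+        DecodeArgs h{pos};
+        // Refresh the device-side args, then replay the SAME instantiated
+        // graph: no re-capture, no re-instantiate, one launch per token.
+        cudaMemcpyAsync(d_args, &h, sizeof h, cudaMemcpyHostToDevice, stream);
+        cudaGraphLaunch(exec, stream);
+    }
+    cudaStreamSynchronize(stream);
+    cudaGraphExecDestroy(exec);
+}
+```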
+
+## 2. Per-stage breakdown (one full warm request, 7.14 s wall)
+
+| Stage | Wall | GPU busy | Idle % | Notes |
+|-------|-----:|---------:|------:|-------|
+| Prefill (text embed + first-pos forward) | ~0.36 s | ~0.41 s | low | Single forward over short prompt; wmma dominates briefly |
+| **LM autoregressive decode loop** | **~6.17 s (86 %)** | 4.50 s | **27 %** | 35+ gemv/token × N tokens × 2 nets (Talker + Predictor); launch-bound |
+| Codec / SpeechTokenizerDecoder (cuDNN+custom conv) | ~0.51 s | 0.51 s | ~0 % | Saturated; large kernels (97 ms causal\_conv\_transpose1d, 2.6 ms dilated\_causal) |
+
+## 3. Memory bandwidth utilisation
+
+GB10 spec: **546.1 GB/s**.
+Median decode gemv = 38 us; at peak BW that window can move 20.7 MB. A typical 1024×1024 f16 weight read = 2 MB → **~10 % effective HBM BW** on the steady-state decode kernel. D2D memcpy total 49.7 ms / 732 MB = **14.7 GB/s** (KV-cache movement, negligible). The pipeline is **launch-bound and SM-occupancy-bound, NOT bandwidth-bound** on GB10. There is large unused HBM headroom that bigger-batch or fused kernels could turn into wall-clock.
+
+## 4. Ranked candidate wins
+
+| Rank | Candidate | Expected payoff | Effort | Risk |
+|------|-----------|----------------:|-------|------|
+| **1** | **Persistent CUDA Graph re-launch** (capture **once**, launch N times instead of capture-per-pos) | **−25 to −35 % decode wall (≈ −1.5 to −2.0 s)**; closes the 27 % decode-loop idle gap and eliminates ~75 % of 264 k cudaLaunchKernel | Medium (Phase 2.5 already wires capture; needs cache key by `pos` …) | … |
+
+…
+
+```cpp
+  ... > 0.0f) {
+    clamp_residual_f32_(img_hidden, ...);
+    clamp_residual_f32_(txt_hidden, ...);
+  }
+}
+```
+
+**Key observation**: `dump_tensor_f32("13_img_resid1.f32", ...)` (line 4316) runs **before** the `QIE_RESID_CLAMP` block (line 4322). The dump captures the F32 residual in its **pre-clamp** state.
+
+### Single-fire dump guard
+
+Lines 3514-3519:
+```cpp
+static std::set<int> s_dump_fired;
+const bool match_multi = !s_dump_dir.empty() && !s_dump_indices.empty()
+    && s_dump_indices.count(s_block_idx_now) > 0
+    && s_dump_fired.count(s_block_idx_now) == 0;
+if (match_multi) s_dump_fired.insert(s_block_idx_now);
+```
+
+`s_dump_fired` is a process-lifetime static. Each block index dumps exactly **once** — on the first call to `forward_block_(block_idx)`. For block 0, that is step 0.
+
+### No prior clamp can have run
+
+At block 0 step 0:
+- `s_intra_calls` transitions 0 → 1 (this is the first `forward_block_` invocation)
+- No prior block call exists in the process
+- Therefore no prior `clamp_residual_f32_` call has ever executed
+- The clamp value setting (60000 or 0) cannot influence any input to block 0 step 0
+
+## Magnitude check on ac03 baselines
+
+```
+eng_baseline (CLAMP=60000): N=1572864 mean=-2.255e-01 std=6.202e+01
+                            absmax=1.235e+03 nan=0 inf=0
+cli_baseline:               N=1572864 mean= 2.712e+00 std=1.702e+02
+                            absmax=2.516e+03 nan=0 inf=0
+cos(eng60000, cli) = 0.05068731
+mag_ratio absmax eng/cli = 0.49089465
+```
+
+Engine absmax = **1234.9**, CLI absmax = **2515.6**. Both are **far below the 60000 clamp threshold** — even if the clamp had run before this dump (it didn't), it would have been a no-op on these tensors. The clamp couldn't possibly trigger at block 0 step 0 with these magnitudes.
+ +## ac01 setup state (for reference) + +- `OminiX-Ascend-w1` synced from ac03 main via git bundle: now at `3daae48`. +- ac01 NPU (910B4, 32GB HBM) idle and healthy. +- Inputs synced: `/tmp/qie_q45_inputs/` (3.2 MB, all 5 .f32.bin tensors). +- Baselines synced: `/tmp/qie_5513f_eng_blocks/block00/13_img_resid1.f32` and `/tmp/qie_5513f_cli_blocks/block00/qie_cli_blk00_13_img_resid1.f32.bin`. +- GGUF (12 GB) was not transferred — bridging through local mac would have taken ~30 min and the source-level proof made it unnecessary. + +If the empirical run is still wanted, the only blocker on ac01 is the GGUF transfer; everything else is staged. + +## Verdict + +**Clamp is IRRELEVANT to the block-0 `13_img_resid1` divergence** (NOT a band-aid masking it). + +Reasons: +1. Dump occurs strictly before clamp in source order — clamp setting cannot affect this dump. +2. Single-fire `s_dump_fired` guard means block-0 dump is captured on the very first `forward_block_` call (step 0), before any clamp has ever executed. +3. Tensor magnitudes at block 0 step 0 (absmax ~1235 engine, ~2516 CLI) are an order of magnitude below the 60000 clamp threshold, so even if the clamp had run it would have been a no-op. +4. Empirical CLAMP=0 vs CLAMP=60000 dumps would be **bitwise identical** — not a useful diagnostic at block 0. + +### Implication for #207 + +The cos=0.0507 / mag_ratio=0.491 divergence at block-0 `13_img_resid1` is **genuine, upstream, and clamp-independent**. The substep bisect on ac03 (#207) is the correct diagnostic path — the bug is in one of the substeps that produce `img_hidden` after the gated residual add: img/txt LN1, mod1, attention QKV, RMSNorm, RoPE, attn_out, or the gated residual add itself. The clamp is a separate concern for deep blocks at later steps where the residual stream grows large. + +The §5.5.44 clamp band-aid is a downstream-block / late-step hack for F16 LN saturation — it has no bearing on the block-0 divergence under investigation. diff --git a/qie_f32_ref_findings.md b/qie_f32_ref_findings.md new file mode 100644 index 0000000..052d05e --- /dev/null +++ b/qie_f32_ref_findings.md @@ -0,0 +1,310 @@ +# QIE F32 Per-Op Reference Bisect — Findings + +**Date**: 2026-04-29 +**Scope**: F32 numpy/torch references for individual QIE-Edit ops, comparing +against native engine dumps. Per-op tolerance bisect (NOT full Diffusers E2E). +**Inputs**: Native engine dumps under `/tmp/qie_q45_inputs_1024/`, +`/tmp/qie_5542_v2/block0[01]/`, `/tmp/qie_dumps_real_5520/`, +`/tmp/qie_5543_step2/`. +**Refs**: `/home/ma-user/work/qie_f32_refs/{euler,rmsnorm,adaln,patchify,block0}_ref.py`. +**Tolerances**: cos ≥ 0.999 = MATCH, rel_max ≤ 1e-4 = F16-rounding precision, +rel_mean ≤ 1e-3 = acceptable F16 drift. + +## TL;DR + +**Every weight-free op verified by F32 reference is numerically correct +within F16-rounding precision (cos = 1.000000 across the board).** The +remaining drift between native and F32-ref is uniformly bounded at +`rel_max ≈ 3-7×10⁻⁴` and `rel_mean ≈ 1×10⁻⁵` — exactly the spread +expected when an op consumes a tensor that was round-tripped through F16 +storage on the device. + +The "polka-dot" / TILE pattern magnitude leak is therefore NOT in any of +the ops covered by this bisect (Euler, SiLU, RMSNorm, AdaLN modulate, +LayerNorm, gated residual, patchify/unpatchify). 
It MUST live in a +weight-dependent op that I cannot probe without weight dumps: + +- Q/K/V projection matmuls +- attn-out projection matmul +- FFN up / gate / down matmuls (with GELU) +- t_emb upstream chain (timestep_embedder.linear_1 / linear_2 + SiLU) +- img_in / txt_in projection matmuls +- norm_out / proj_out (final unpatchify pre-step) + +The smoking gun is the **t_emb magnitude itself**: `00_t_emb.f32` has +`abs_max=111.75` identically across engine and CLI dumps. That's the +input to the AdaLN matmul (`img_mod.1 / txt_mod.1`), which then yields +chunks with `abs_max` up to **491** (block 1, chunk[1] = "shift1" under +legacy binding). The arithmetic that consumes those chunks is correct; +it's the chunk values themselves that are pathological. §5.5.7 in +`docs/qie_q2_phase4_smoke.md` already isolates this as "the underlying +defect (something is amplifying t_emb / Q4 dequant by ~10×)". + +## Per-op verdicts + +### 1. Euler step (image_diffusion_engine.cpp:6280-6296) — MATCH + +| dump pair | cos | max_abs_diff | rel_max | status | +|-----------|----:|-------------:|--------:|-------:| +| step 0 (sigma=1.0→0.75, dt=-0.25) plain Euler | 1.000000 | 2.98e-08 | 8.3e-09 | **MATCH** | +| step 0 with c_skip+c_out reconstruction (`x'=x+v*dt`) | 0.990265 | 1.27 | 0.26 | DRIFT | + +The plain Euler form `d=(x-denoised)/sigma; x+=d*dt` matches the engine +dump bit-perfectly. The §5.5.47 c_skip+c_out reconstruction +(`denoised := x + (-sigma)*v` followed by the same Euler) does NOT match +— which confirms §5.5.47's RED verdict (the engine does NOT use that +reconstruction; the dumped `denoised_host` is treated as the *denoised +prediction* directly, not as the velocity `v`). + +This is good news: the §5.5.47 hypothesis is mathematically wrong for +this engine — `denoised_host` is what the diffusers v-prediction CLI +calls "v" only if you also reinterpret the Euler form, but as-coded the +engine's plain Euler matches its own pre-Euler dump exactly. **No fix +needed in Euler step at step 0.** + +Caveat: I could not validate step 1 because step 1 is all-NaN (per +§5.5.43). The Euler ARITHMETIC at step 1 is irrelevant since `denoised` +is already NaN going in. + +### 2. SiLU(t_emb) (image_diffusion_engine.cpp:3656-3678) — MATCH + +| dump | cos | max_abs_diff | rel_max | status | +|------|----:|-------------:|--------:|-------:| +| block0/1024² silu | 1.000000 | 2.16e-04 | 1.93e-06 | **MATCH** | + +`y = x * sigmoid(x)` reproduces the engine's `aclnnSilu` output to F16 +ULP. SiLU itself is fine. + +### 3. RMSNorm (image_diffusion_engine.cpp:2725-2880) — MATCH (within F16) + +All eight cases (block 0 + block 1 × img/txt × Q/K) hit cos = 1.000000 +with `rel_max ≈ 3-4×10⁻⁴`. Slightly above my strict 1e-4 threshold so +classified ROUND_DRIFT, but that's because the dumped Q/K inputs went +F16→F32 once (engine stored F16 device tensors) and the gamma I had to +recover was estimated from the noisy Y/X ratio. The engine's RMSNorm +math is correct. + +What is interesting is the input scale: + +| stream | x abs_max | y abs_max | gamma abs_max | +|--------|----------:|----------:|--------------:| +| block 0 / 256² img_Q | 627 | 724 | 64.0 | +| block 1 / 1024² img_Q | **9544** | 12.6 | 2.6 | +| block 0 / 256² txt_Q | 1029 | 6.79 | 1.55 | +| block 1 / 1024² txt_Q | 2656 | 5.51 | 1.56 | + +Q activations grow to ~10 000 magnitude entering block 1 (1024²). 
The +post-RMSNorm Y is ~10× bigger than canonical (~1) only on block 0 256² +img_Q (gamma_max=64 — most likely a Q4_0 dequant artifact in the gamma +weight tensor itself). RMSNorm correctly normalises, but the upstream +matmul has produced inputs that are 10× too large. + +### 4. AdaLN modulate (image_diffusion_engine.cpp:2013-2080) — MATCH (legacy binding) + +| dump pair | cos | max_abs_diff | rel_max | status | +|-----------|----:|-------------:|--------:|-------:| +| block 0 1024² img modulate1 [legacy: scale=ch0, shift=ch1] | 1.000000 | 6.7e-02 | 7.3e-04 | **MATCH** | +| block 0 1024² img modulate1 [SPEC: shift=ch0, scale=ch1] | 0.132321 | 109.07 | 1.03 | DRIFT | +| block 1 1024² img modulate1 [legacy] | 1.000000 | 0.31 | 5.9e-04 | **MATCH** | +| block 1 1024² img modulate1 [SPEC] | -0.101663 | 689 | 1.14 | DRIFT | + +Confirms `modulate_(x, scale, shift)` does compute +`x * (1 + scale) + shift` correctly. The native engine is using the +**legacy chunk binding** as its source comment in §5.5.7 documents. + +But the chunks are pathologically large: +- block 1 chunk[1] (used as shift1 under legacy) has `abs_max=491` +- block 1 chunk[4] (used as shift2 under legacy) has `abs_max=317` +- block 0 chunk[4] (shift2 legacy) has `abs_max=200` + +So: the matmul math is correct, but chunks fed in are 100-500× larger +than canonical. §5.5.7 already notes this and pins legacy because under +the spec binding these would multiply (`(1 + scale=491)`) and explode by +another 491×. **The chunk-binding choice is a band-aid; the real +problem is the upstream `silu(t_emb) @ W_mod + b_mod` matmul producing +chunks that are 100× too large.** + +### 5. Patchify / unpatchify (image_diffusion_engine.cpp:5182-5260) — MATCH + +| test | cos | max_abs_diff | status | +|------|----:|-------------:|-------:| +| Engine vs vectorised diffusers patchify | 1.000000 | 0.0 | **MATCH (bit-exact)** | +| Round-trip patchify+unpatchify | 1.000000 | 0.0 | **MATCH (bit-exact)** | + +The engine's patchify and unpatchify are bit-exact equivalents of the +diffusers reshape+permute path. **Patchify is NOT the source of the +"tile pattern" visual.** The tile pattern observed in §5.5.42 onward +must come from per-token magnitude inhomogeneity introduced by deeper +(weighted) ops, not from the patch-layout transform. + +### 6. Block 0 weight-free transitions — MATCH + +12 transitions (T1-T12) verified across block 0 256² real, block 0 1024² +1-step, and block 1 1024² 1-step. 
All 12 hit cos = 1.000000 in every +test case: + +``` +T1 LayerNorm(00_img) -> 04_img_LN1 ROUND_DRIFT (cos=1.0) +T2 04_img_LN1 * (1+scale1) + shift1 -> 05_img_mod1 ROUND_DRIFT (cos=1.0) +T3 LayerNorm(00_txt) -> 06_txt_LN1 ROUND_DRIFT (cos=1.0) +T4 06_txt_LN1 * (1+t_scale1) + t_shift1 -> 07_txt_mod1 ROUND_DRIFT (cos=1.0) +T5 00_img + gate1 * 12_to_out_0 -> 13_img_resid1 MATCH (bit-exact) +T6 00_txt + t_gate1 * 12_to_add_out -> 13_txt_resid1 (covered by T5 layout) +T7 LayerNorm(13_img_resid1) -> 14_img_LN2 ROUND_DRIFT (cos=1.0) +T8 14_img_LN2 * (1+scale2) + shift2 -> 15_img_mod2 ROUND_DRIFT (cos=1.0) +T9 LayerNorm(13_txt_resid1) -> 16_txt_LN2 ROUND_DRIFT (cos=1.0) +T10 16_txt_LN2 * (1+t_scale2) + t_shift2 -> 17_txt_mod2 ROUND_DRIFT (cos=1.0) +T11 13_img_resid1 + gate2 * 20_img_ff_down -> 24_img_resid2 MATCH (bit-exact) +T12 13_txt_resid1 + t_gate2 * 23_txt_ff_down -> 24_txt_resid2 (covered) +``` + +Critically: the magnitudes of intermediates ARE pathological: + +- block 0 1024² T11 `24_img_resid2` `abs_max = 7.28×10⁶` (7 million) +- block 1 1024² T11 `24_img_resid2` `abs_max = 1.41×10⁸` (140 million) + +The img residual stream ends a single block at ±10⁷ and is ±10⁸ entering +block 2, growing ~20×/block until F32 saturates around block 27 (per +§5.5.43). The growth is from the *weighted* ops: + +- `12_to_out_0` (img attn out projection) feeds into the gated residual + T5 at magnitude that is already enormous. +- `20_img_ff_down` feeds into T11 at the magnitude that grows the + residual stream by ~10⁷ per block. + +These two MATMUL outputs are the magnitude leak vectors. Their inputs +in turn come from RMSNorm-correct Q/K/V (whose sources had abs_max=10⁴), +so the entire weighted chain is amplifying. + +## Root cause analysis + +The bisect rules out **all** non-weighted F32 ops in the per-block code. +The drift therefore lives in one of the matmuls or their dequant. Three +suspect paths, in order of suspicion: + +### Suspect 1 (HIGH): `img_mod.1` / `txt_mod.1` matmul output is 100× too large + +Direct evidence from chunk[1] (legacy shift1) `abs_max=491` at block 1. +A trained QwenImage AdaLN linear should produce shifts ~O(1) and scales +~O(0.1). Instead we see shifts of ±491 and scales of ±152. + +Possible causes: +a) Q4_0 dequant of `img_mod.1.weight` is wrong (wrong scales/biases lookup). +b) `silu(t_emb)` input to the matmul has `abs_max=111.75` (vs O(1) + expected). The amplification factor 111 from the timestep_embedder + chain alone would explain a ~100× chunk inflation. +c) The matmul is summing in F16, hitting F16 max representable at deep + reductions, but mod1 is a 256→3072 matmul; F16 reduction at this + depth shouldn't blow up. + +The §5.5.20 t_emb oracle (already on disk at +`tools/probes/qie_t_emb_oracle/qie_t_emb_oracle.py`) has bytes-equal +agreement against the F32 reference for the t_emb chain, which means +the engine *correctly computes* a t_emb that has `abs_max=111.75`. So +the upstream defect lives in either the timestep_embedder.linear_1 / +linear_2 weights themselves OR in the dequant/cast of those weights — +NOT in the SiLU+matmul math the engine performs. + +### Suspect 2 (HIGH): FFN down (and possibly attn-out) projection matmul + +Block 1 `13_img_resid1` enters at `abs_max=8.98×10⁶` and `24_img_resid2` +exits at `abs_max=1.41×10⁸`. The growth factor across block 1 is **15.7×**. +The growth is in `20_img_ff_down`'s output (since T11 is bit-exact, the +gate2 application doesn't change the order of magnitude — gate2's max +is bounded by chunk[5] `abs_max=221`). 
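+
+A quick consistency check on those magnitudes (values from the dumps
+above; the block-exit identity from T11 is
+`resid2 = resid1 + gate2 * ff_down` elementwise):
+
+```
+resid1_absmax = 8.98e6   # 13_img_resid1 entering block 1
+resid2_absmax = 1.41e8   # 24_img_resid2 leaving block 1
+gate2_absmax  = 221.0    # chunk[5] bound
+
+print(resid2_absmax / resid1_absmax)   # ~15.7x growth across block 1
+print(resid2_absmax / gate2_absmax)    # ~6.4e5 implied ff_down magnitude
+```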
+ +So `20_img_ff_down` is producing outputs ~`abs_max = 1.4×10⁸ / gate2_max +≈ 6×10⁵`, which then gates-residual into the residual stream. +Possible causes: +a) Q4_0 dequant of `ff.net.2.weight` (FFN down projection) is wrong. +b) F16 saturation of intermediate FFN activations (we already saw + `08_img_K abs_max=10744` at block 1 — the post-RMSNorm path is + normalised but the FFN path is NOT, so 10⁵+ activations entering + the down projection are F16 ulp-rounded by ~0.5). +c) The §5.5.46 BF16 widen patch hadn't covered FFN down — the residual + stream stays F32 but the hidden FFN tensor goes F16. + +### Suspect 3 (MEDIUM): img_in / txt_in / norm_out / proj_out + +These are not directly observable in the per-block dumps but feed the +chain on entry/exit. If `img_in` produces `00_img` with `abs_max=10⁴`, +that already concedes the leak before any block executes. + +The actual `00_img.f32` for block 0 / 1024² 1-step has stats: +- mean ≈ 0 +- std ≈ 1.0 (good!) +- abs_max ≈ 10ish + +So `img_in` is fine — the residual stream enters block 0 at the right +scale (~1σ). The amplification happens INSIDE the blocks. + +## Recommended fix + +The next agent should NOT touch: +- Euler step (correct — see §5.5.47 RED revisit confirmed) +- SiLU +- RMSNorm +- AdaLN modulate +- Patchify / unpatchify +- LayerNorm +- Gated residual + +The next agent SHOULD instrument and verify, in this order: + +1. **Dump `silu(t_emb) @ img_mod.1.W^T + img_mod.1.b` against an F32 + numpy reference using the actual GGUF Q4_0 dequant**. The §5.5.20 + `qie_t_emb_oracle.py` already contains the GGUF dequant scaffolding + (`load_dequant_weight`); extend it to dequant `img_mod.1.weight` and + compute the F32 matmul output, then compare to engine + `02_img_mod_out.f32`. This will localise whether the chunk + magnitudes (±491) come from genuine trained weights or from a + dequant defect. **File: `image_diffusion_engine.cpp:3681`** — + `dispatch_matmul_(scratch_q_dev_, lw.img_mod_w_q4, lw.img_mod_scale, + lw.img_mod_b, B, H, 6 * H, scratch_mod_dev_)`. + +2. **Dump `ff_norm(adaLN(img_resid1)) @ ff.net.0.weight` (FFN gate + matmul) and `... @ ff.net.0.proj.weight` (FFN up matmul) and + `gelu(...) @ ff.net.2.weight` (FFN down matmul)** for block 1, then + compare each to its F32 numpy dequant reference. The 15.7× per-block + growth strongly implicates one of these. **File: + `image_diffusion_engine.cpp` ~line 4304-4400** in the FFN dispatch + (find via `grep -n 'ff.net\|img_ff'`). + +3. **Verify the Q4_0 super-block scale tensor itself for these specific + weight names**. The §5.5.7 hypothesis "something is amplifying + t_emb / Q4 dequant by ~10×" maps to a Q4_0 super-block that has the + wrong scale stored or read. The native engine reads scales via + `dequant_upload_q4_0` — check that the scale tensor sub-name + (`.scale`) is loaded with the correct GGUF type and that the + dequantization formula in `dispatch_matmul_` matches GGML's + reference (`q4_0_dequantize_block_8`-equivalent). + +## Open questions + +1. The chunk[1] / chunk[4] amplification is consistent across block 0 + and block 1 with the same magnitude tier (~50×, ~300×). If both + blocks have similar amplification, the defect is not block-specific + — it's in the modulation matmul code path, not in any one block's + weight tensor. This suggests issue #1 above (img_mod / txt_mod + matmul) is the highest-priority root cause. + +2. We do not have a CLI ground-truth `02_img_mod_out.f32` to compare + against. 
The §5.5.46 carry-over already noted: dispatch a CLI dump + of the same intermediate to localise engine-vs-CLI divergence at + the chunk level. That should be the next agent's first probe. + +3. The §5.5.13 attention bit-correctness verification (FIA = bit-exact + F32 oracle at REAL inputs, cos=1.000000) needs to be re-confirmed at + 1024² with current dumps — it was done at 256² real-data, where + block 0 `08_img_Q abs_max=627`. At 1024² block 1 `08_img_Q abs_max + = 9544`, an order of magnitude larger. Whether FIA stays bit-exact + at the larger magnitude has not been re-tested. + +## Files + +- `/home/ma-user/work/qie_f32_refs/euler_ref.py` +- `/home/ma-user/work/qie_f32_refs/rmsnorm_ref.py` +- `/home/ma-user/work/qie_f32_refs/adaln_ref.py` +- `/home/ma-user/work/qie_f32_refs/patchify_ref.py` +- `/home/ma-user/work/qie_f32_refs/block0_ref.py` +- `/home/ma-user/work/qie_f32_refs/_compare.py` diff --git a/qie_q4_matmul_oracle_findings.md b/qie_q4_matmul_oracle_findings.md new file mode 100644 index 0000000..0a0ebd8 --- /dev/null +++ b/qie_q4_matmul_oracle_findings.md @@ -0,0 +1,161 @@ +# QIE Q4_0 / Q5_K / Q4_1 matmul oracle — magnitude leak verdict + +Per agent #201's per-op F32 bisect, all weight-free op classes (Euler, +RMSNorm, SiLU, AdaLN math, patchify, weight-free block transitions) are +bit-correct vs F32 numpy. The DiT magnitude blow-up that hits F16 +saturation around block 27 (per §5.5.43) was narrowed to two +quantized-weight matmuls in the engine, both flagged in §5.5.42 V2 +post-fix dumps as still inflated: + +1. `silu(t_emb) @ img_mod.1.weight + img_mod.1.bias` → `02_img_mod_out` + (per-block; block-1 chunks absmax ≈ 491, block-0 absmax ≈ 269) +2. `gelu(15_img_mod2 @ img_mlp.net.0.proj.W.T + b) @ img_mlp.net.2.W.T + b` + → `20_img_ff_down` (block-1 absmax ≈ 6.10×10⁵, far above F16 65504) + +Hypotheses framed by the parent agent: +- A. Trained weights are pathological → magnitudes are by design; + fix = widen output dtype to BF16/F32 in the engine matmul path. +- B. Q4_0 dispatch has a residual defect → fix dequant path. + +## Method + +Pure-F32 numpy oracle: `tmp/qie_q4_matmul_oracle.py` (deployed to +ac03 at `/tmp/qie_q4_matmul_oracle.py`). + +- Q4_0 dequant validated **byte-equal** vs `gguf-py` reference + (`quants.dequantize_blocks`) on `transformer_blocks.1.img_mod.1.weight` + (qt=2 / Q4_0). `bit_equal=True`, `max|hand-ref|=0`. + → Both dequant paths are numerically identical and trustable. +- Inputs come from engine F32 dumps under `/tmp/qie_5542_v2/blockNN/` + (post-§5.5.42 V2 Q4_0 dispatch fix). +- Oracle = `silu @ W.T + b` in pure F32 (and a F16-rounded mimic). +- Oracle compared to engine `02_img_mod_out` and `20_img_ff_down`. + +## Tensor types in QIE-Edit-2509-Q4_0.gguf (relevant) + +| Weight | qt | type | +|---|---|---| +| block-0 `img_mod.1.weight` | 13 | **Q5_K** | +| block-1 `img_mod.1.weight` | 2 | **Q4_0** | +| block-1 `img_mlp.net.0.proj.weight` (ff_up) | 2 | **Q4_0** | +| block-1 `img_mlp.net.2.weight` (ff_down) | 3 | **Q4_1** | + +Notable: the GGUF labelled "Q4_0" actually mixes Q4_0 / Q4_1 / Q5_K / +F16 / F32 across tensors. The "ff.net.2" weight is **Q4_1**, not Q4_0, +which the dispatch path treats identically (same WQBMMv2 / aclnnMm +codepath, just different scale layout). Both Q4_0 and Q4_1 paths are +exercised by the test. 
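+
+For reference, the byte-equality check reduces to running gguf-py's own
+per-block dequant over the raw tensor bytes (a sketch mirroring
+`load_dequant_weight` in `qie_t_emb_oracle.py`; assumes gguf-py is
+importable):
+
+```
+import numpy as np
+from gguf import GGUFReader, quants
+from gguf.constants import GGML_QUANT_SIZES
+
+def dequant_ref(gguf_path: str, name: str) -> np.ndarray:
+    reader = GGUFReader(gguf_path)
+    t = next(t for t in reader.tensors if t.name == name)
+    K, N = int(t.shape[0]), int(t.shape[1])
+    raw = np.asarray(t.data, dtype=np.uint8).reshape(-1)
+    elems_per_blk, bytes_per_blk = GGML_QUANT_SIZES[t.tensor_type]
+    blocks = raw.reshape((N * K) // elems_per_blk, bytes_per_blk)
+    # same per-block reference kernel the hand dequant was checked against
+    qcls = {q.qtype: q for q in quants.__Quant.__subclasses__()
+            if hasattr(q, "qtype")}[t.tensor_type]
+    return qcls.dequantize_blocks(blocks).reshape(N, K).astype(np.float32)
+```
+
+Byte-equality at this layer is what licenses treating the numpy dequant
+as ground truth in the matmul oracle below.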
+ +## Results + +### Test 1 — `img_mod.1` matmul (clean isolation, full input + output dumped) + +| Block | qt | engine absmax | oracle absmax | cos | max_abs_diff | mean_abs_diff | rel_max | +|---|---|---|---|---|---|---|---| +| 0 | Q5_K | 2.690e+02 | 2.690e+02 | **1.000000** | 1.526e-01 | 2.36e-03 | 5.67e-04 | +| 1 | Q4_0 | 4.915e+02 | 4.914e+02 | **1.000000** | 2.372e-01 | 6.77e-03 | 4.83e-04 | + +Magnitude (oracle absmax 491 at block 1) is **reproduced bit-for-bit by the +F32 oracle from the same Q4_0 weight + the same dumped silu(t_emb) input**. +The ≤ 0.25 max_abs_diff is pure F16-rounding round-trip noise from the +engine's `out_dtype=ACL_FLOAT16` cast on store. + +**Verdict for img_mod.1: A.** Magnitude is from the trained weights. +Q4_0 dispatch is correct. + +Why oracle absmax 491 ≠ engine 491.5 (one part in 4000): engine path +is `aclnnWeightQuantBatchMatmulV2(input_f16, weight_q4_0, scale_f16) → +F16 output → InplaceAdd(bias_f16)`. The oracle stores F32. Cast-back +parity at this level (`rel_max=4.8e-4`) is the expected floor for +F16 intermediate accumulation rounding, not a Q4_0 defect. + +### Test 2 — img MLP chain (ff_up → gelu → ff_down) at block 1 + +Run as a single oracle from `15_img_mod2` to `20_img_ff_down` because the +engine does not dump `19_img_gelu`. Both Q4_0 (ff_up) and Q4_1 (ff_down) +matmuls are exercised; if either had a dispatch bug, the chain would +diverge. + +| variant | engine absmax | oracle absmax | cos | max_abs_diff | mean_abs_diff | rel_max | +|---|---|---|---|---|---|---| +| F32 oracle (gelu tanh approx, ggml-cann GeluV2) | 6.103e+05 | 6.094e+05 | **0.999998** | 2.17e+03 | 3.96e+01 | 3.56e-03 | +| F32 oracle (gelu erf exact) | 6.103e+05 | 6.094e+05 | **0.999998** | 2.17e+03 | 3.96e+01 | 3.56e-03 | + +The chain output magnitude (engine absmax = 610 300, oracle 609 380) is +**reproduced by the F32 oracle to cos=0.999998 and rel_max=3.6e-3**. The +diff floor is consistent with two F16-rounded matmul intermediates plus +a BF16 final-cast on the engine side (`dump_tensor_dt(..., +ffn_down_bf16 ? PROBE_BF16 : PROBE_F16)` on line 4382 — this dump was +under `QIE_FFN_DOWN_BF16=1` since 6.10e5 > F16 max 65504). + +**Verdict for ff.net.2 (and ff.net.0): A.** Chain magnitude is from the +trained weights. Q4_0 (ff_up) and Q4_1 (ff_down) dispatch are both +numerically correct. + +## Final verdict + +**A confirmed for both weights.** The Q4_0 dispatch fix in §5.5.42 V2 +(WQBMMv2 mirror of ggml-cann mul_mat_quant) is correct. The 491 mod1 +chunks and 6.10e5 ff_down outputs are intrinsic to the trained Q4_0/Q4_1 +weights. The magnitude cascade that hits F16 saturation by block 27 is a +**dtype-saturation bug in the storage cast**, not a quantization defect. + +## Recommended fix + +The `img_mod.1` matmul path is the one currently NOT widened to BF16. +Source: +- `tools/qwen_image_edit/native/image_diffusion_engine.cpp:740` (header + default `aclDataType out_dtype = ACL_FLOAT16`). +- `tools/qwen_image_edit/native/image_diffusion_engine.cpp:3681-3682` + — img_mod.1 dispatch, no out_dtype arg → defaults to F16. +- `tools/qwen_image_edit/native/image_diffusion_engine.cpp:3688-3692` + — txt_mod.1 dispatch, same pattern. + +Block-1 already produces 491 chunks; block-0 already produces 269 chunks. +Once `(1 + scale) * x_LN` is computed, `(1+491)*x` can land near F16 max +even from a small LN1, and the residual chain compounds it. + +Mirror the §5.5.45/46 widening pattern used for QKV (line 3846, +`s_qkv_bf16 ? 
ACL_BF16 : ACL_FLOAT16`) and ff_down (line 4377, +`ffn_down_bf16 ? ACL_BF16 : ACL_FLOAT16`): + +1. Add a static-cached env knob `s_mod_bf16` parallel to `s_qkv_bf16`, + reading e.g. `QIE_MOD_BF16` (and respecting `QIE_ALL_BF16`). +2. Change line 3681/3688 to pass + `s_mod_bf16 ? ACL_BF16 : ACL_FLOAT16` as the out_dtype. +3. The downstream consumer (`scratch_mod_dev_` chunks) is already F32- + capable per §5.5.45 plumbing — verify the chunk reader at + `image_diffusion_engine.cpp:3804` handles BF16 with the existing + PROBE_BF16 / `dump_tensor_dt(...)` paths. + +## Open questions / next-agent + +- Does the `02_img_mod_out` dump at `/tmp/qie_5542_v2/block01/` already + include the post-§5.5.42 fix V2? Yes — confirmed by directory name + and oracle agreement. The 491 chunks are the **correct V2 output**. +- Why does §5.5.43 cite F16 saturation around block 27, not block 1? + Because `(1 + 491) * x_LN1` doesn't *immediately* saturate F16 when + x_LN1 is small (LN1 outputs are O(1)); the saturation is in the + residual accumulation `img += gate * to_out` which compounds across + ~27 blocks before exceeding 65504. +- Pre-§5.5.42 dumps at `/tmp/qie_5536_eng_real/block01/` showed + oracle vs engine `02_img_mod_out` cos=0.045 (per `/tmp/qie_b1_oracle.log`) + — that confirms §5.5.42 V2 fix DID resolve a real Q4_0 dispatch defect. + The current dumps are post-fix and clean. +- BF16 widening alone may not be enough if some intermediate is still + F16; audit all of `dispatch_matmul_(*, ACL_FLOAT16)` (default) call + sites; mod1, txt_mod1, and any to_add_out / similar paths still on + F16 default. + +## Reproduction + +On `ssh ac03`: + +``` +~/anaconda3/envs/PyTorch-2.7.1/bin/python /tmp/qie_q4_matmul_oracle.py +``` + +Inputs: `/tmp/qie_5542_v2/block00/`, `/tmp/qie_5542_v2/block01/`, +`/home/ma-user/work/qie_weights/Qwen-Image-Edit-2509-Q4_0.gguf`. +Run log: `/tmp/qie_q4_oracle_run.log`. +Oracle script (local copy): `tmp/qie_q4_matmul_oracle.py`. diff --git a/qie_t_emb_oracle.py b/qie_t_emb_oracle.py new file mode 100644 index 0000000..34b33d9 --- /dev/null +++ b/qie_t_emb_oracle.py @@ -0,0 +1,295 @@ +#!/usr/bin/env python3 +""" +Q2.4.5.5.20 - Pure-F32 numpy oracle for QIE-Edit time_text_embed chain. + +After §5.5.19 GREEN-B (mod1 dispatch is bit-accurate when given native +t_emb as input), drift is upstream of mod1 - i.e. in t_emb itself or +the chain that produces it. + +Engine path (image_diffusion_engine.cpp:5150-5223): + 1. sinusoidal[256] = host_timestep_embedding_f32(t=sigma*1000, dim=256, max_period=10000) + layout: [cos(arg_0..arg_127), sin(arg_0..arg_127)] + 2. cast F32->F16 + 3. t_emb_mid_f16[H] = sinusoidal_f16 @ time_linear1.W^T + time_linear1.b (dispatch_matmul_) + 4. silu in-place (aclnnSilu) + 5. t_emb_out_f16[H] = silu_t_f16 @ time_linear2.W^T + time_linear2.b (dispatch_matmul_) + 6. dump 00_t_emb.f32 (F16 cast back to F32) + +NOTE: QIE-Edit has NO text_embedder. time_text_embed = timestep_embedder ONLY. + +Dump origin: /tmp/qie_dumps_5516/ from qie_q45_real_denoise_smoke with +make_flow_sigmas(20) -> sigmas[0]=1.0 -> t_val=1000.0 (FIRST step). 
+""" +import os +import sys +import numpy as np + + +DUMP = os.environ.get("QIE_DUMPS", "/tmp/qie_dumps_5516") +GGUF = os.environ.get( + "QIE_GGUF", + "/home/ma-user/work/qie_weights/Qwen-Image-Edit-2509-Q4_0.gguf", +) +H = 3072 + + +def find_gguf_py(repo_root): + try: + import gguf # noqa: F401 + return + except ImportError: + pass + cand = os.path.join(repo_root, "gguf-py") + if os.path.isdir(cand): + sys.path.insert(0, cand) + + +def load_dequant_weight(gguf_path, name): + from gguf import GGUFReader, GGMLQuantizationType, quants + reader = GGUFReader(gguf_path) + tensor = None + for t in reader.tensors: + if t.name == name: + tensor = t + break + if tensor is None: + raise RuntimeError(f"tensor not found: {name}") + K = int(tensor.shape[0]) + N = int(tensor.shape[1]) + qt = tensor.tensor_type + raw = np.asarray(tensor.data, dtype=np.uint8).reshape(-1) + + if qt == GGMLQuantizationType.F32: + return raw.view(np.float32).reshape(N, K).astype(np.float32), "F32" + if qt == GGMLQuantizationType.F16: + return raw.view(np.float16).reshape(N, K).astype(np.float32), "F16" + if qt == GGMLQuantizationType.BF16: + u16 = raw.view(np.uint16).astype(np.uint32) + return ((u16 << 16).view(np.float32)).reshape(N, K).astype(np.float32), "BF16" + + qcls_map = {q.qtype: q for q in quants.__Quant.__subclasses__() + if hasattr(q, "qtype")} + qcls = qcls_map.get(qt) + if qcls is None: + raise RuntimeError(f"no Python dequant for type {qt}") + from gguf.constants import GGML_QUANT_SIZES + elems_per_blk, bytes_per_blk = GGML_QUANT_SIZES[qt] + n_total = N * K + n_blocks = n_total // elems_per_blk + blocks = raw.reshape(n_blocks, bytes_per_blk) + dq = qcls.dequantize_blocks(blocks) + qt_name = str(qt).split(".")[-1] + return dq.reshape(N, K).astype(np.float32), qt_name + + +def load_bias(gguf_path, name): + from gguf import GGUFReader, GGMLQuantizationType + reader = GGUFReader(gguf_path) + for t in reader.tensors: + if t.name == name: + data = np.asarray(t.data) + qt = t.tensor_type + if qt == GGMLQuantizationType.F32: + return data.astype(np.float32).reshape(-1).copy(), "F32" + if qt == GGMLQuantizationType.F16: + return data.astype(np.float32).reshape(-1), "F16" + if qt == GGMLQuantizationType.BF16: + u16 = data.view(np.uint16).astype(np.uint32) + return ((u16 << 16).view(np.float32)).reshape(-1).astype(np.float32), "BF16" + raise RuntimeError(f"bias dtype unsupported: {qt}") + raise RuntimeError(f"bias not found: {name}") + + +def cossim(a, b): + a = a.flatten().astype(np.float64) + b = b.flatten().astype(np.float64) + na = np.linalg.norm(a); nb = np.linalg.norm(b) + if na == 0.0 or nb == 0.0: + return 0.0 + return float(np.dot(a, b) / (na * nb)) + + +def f16_round(x): + return x.astype(np.float16).astype(np.float32) + + +def silu(x): + return (x * (1.0 / (1.0 + np.exp(-x.astype(np.float64))))).astype(np.float32) + + +def stats(label, x): + print(f" {label:36s} mean={x.mean():+.4e} std={x.std():.4e} " + f"min={x.min():+.4e} max={x.max():+.4e} absmax={np.abs(x).max():.4e}") + + +def host_timestep_embedding_f32(t, dim=256, max_period=10000): + """Engine layout (image_diffusion_engine.cpp:4626-4647): + [cos(arg_0..arg_half-1), sin(arg_0..arg_half-1)] + arg_j = t * exp(-log(max_period) * j / half) + """ + out = np.zeros(dim, dtype=np.float32) + half = dim // 2 + j = np.arange(half, dtype=np.float32) + freq = np.exp(-np.log(float(max_period)) * j / float(half)) + arg = float(t) * freq + out[:half] = np.cos(arg).astype(np.float32) + out[half:2*half] = np.sin(arg).astype(np.float32) + return out + + +def 
host_timestep_interleaved(t, dim=256, max_period=10000): + """Alt layout: [sin0, cos0, sin1, cos1, ...].""" + out = np.zeros(dim, dtype=np.float32) + half = dim // 2 + j = np.arange(half, dtype=np.float32) + freq = np.exp(-np.log(float(max_period)) * j / float(half)) + arg = float(t) * freq + out[0::2] = np.sin(arg).astype(np.float32) + out[1::2] = np.cos(arg).astype(np.float32) + return out + + +def host_timestep_sin_cos(t, dim=256, max_period=10000): + """Alt: [sin..., cos...] (HF default).""" + out = np.zeros(dim, dtype=np.float32) + half = dim // 2 + j = np.arange(half, dtype=np.float32) + freq = np.exp(-np.log(float(max_period)) * j / float(half)) + arg = float(t) * freq + out[:half] = np.sin(arg).astype(np.float32) + out[half:2*half] = np.cos(arg).astype(np.float32) + return out + + +def run_chain(sinu_f32, W1, b1, W2, b2, mode="f16"): + """Replicate engine: F32 sinu -> cast F16 -> matmul1(F16 weights) -> silu -> matmul2(F16 weights).""" + if mode == "f16": + x = f16_round(sinu_f32) + W1_ = f16_round(W1); b1_ = f16_round(b1) + W2_ = f16_round(W2); b2_ = f16_round(b2) + h = x @ W1_.T + b1_ + h = f16_round(h) # post linear1 (engine stores F16) + sh = f16_round(silu(h)) # post silu (engine stores F16) + y = sh @ W2_.T + b2_ + y = f16_round(y) # post linear2 (engine stores F16) + return x, h, sh, y + else: # pure F32 reference + x = sinu_f32 + h = x @ W1.T + b1 + sh = silu(h) + y = sh @ W2.T + b2 + return x, h, sh, y + + +def main(): + repo_root = os.path.abspath(os.path.join(os.path.dirname(__file__), + "..", "..", "..")) + find_gguf_py(repo_root) + + print(f"GGUF : {GGUF}") + print(f"DUMPS : {DUMP}") + print(f"H : {H}") + print() + + # ---- 1. native t_emb dump ---- + t_emb_native = np.fromfile(os.path.join(DUMP, "00_t_emb.f32"), + dtype=np.float32) + assert t_emb_native.size == H, f"shape mismatch: {t_emb_native.size}" + print("--- native 00_t_emb.f32 stats ---") + stats("00_t_emb (native)", t_emb_native) + print() + + # ---- 2. weights ---- + print("--- loading time_text_embed weights ---") + W1, W1_qt = load_dequant_weight(GGUF, "time_text_embed.timestep_embedder.linear_1.weight") + b1, b1_qt = load_bias(GGUF, "time_text_embed.timestep_embedder.linear_1.bias") + W2, W2_qt = load_dequant_weight(GGUF, "time_text_embed.timestep_embedder.linear_2.weight") + b2, b2_qt = load_bias(GGUF, "time_text_embed.timestep_embedder.linear_2.bias") + print(f" linear_1.weight shape={W1.shape} qt={W1_qt}") + print(f" linear_1.bias shape={b1.shape} qt={b1_qt}") + print(f" linear_2.weight shape={W2.shape} qt={W2_qt}") + print(f" linear_2.bias shape={b2.shape} qt={b2_qt}") + if W1.shape != (H, 256): + print(f" WARN: W1 shape != (H={H}, 256) — checking transpose...") + if W2.shape != (H, H): + print(f" WARN: W2 shape != (H, H)") + stats("W1", W1); stats("b1", b1); stats("W2", W2); stats("b2", b2) + print() + + # ---- 3. timestep ---- + # Run came from qie_q45_real_denoise_smoke -> make_flow_sigmas(20) -> + # sigmas[0]=1.0 -> t_val = 1000.0 + t_val_default = 1000.0 + print(f"--- primary candidate t_val={t_val_default} (sigma=1.0 * 1000) ---") + + # ---- 4. 
main oracle (engine layout: cos-then-sin) ---- + sinu = host_timestep_embedding_f32(t_val_default, 256, 10000) + stats("sinu_engine [cos|sin]", sinu) + x, h, sh, y = run_chain(sinu, W1, b1, W2, b2, mode="f16") + stats("post_linear1 (F16-RT)", h) + stats("post_silu (F16-RT)", sh) + stats("post_linear2 = oracle t_emb (F16-RT)", y) + cos_full = cossim(y, t_emb_native) + print(f"\n ORACLE COS (engine [cos|sin], t=1000, F16-RT): {cos_full:.6f}") + print(f" max_abs_diff = {np.abs(y - t_emb_native).max():.4e}") + print(f" mean_abs_diff = {np.abs(y - t_emb_native).mean():.4e}") + print() + + # ---- 5. alt-config sweep ---- + print("=" * 70) + print("ALT-CONFIG SWEEP") + print("=" * 70) + + configs = [ + ("engine [cos|sin] t=1000 F16", sinu, "f16"), + ("engine [cos|sin] t=1000 F32", sinu, "f32"), + ("HF [sin|cos] t=1000 F16", host_timestep_sin_cos(1000.0), "f16"), + ("interleaved t=1000 F16", host_timestep_interleaved(1000.0), "f16"), + ("engine [cos|sin] t=999 F16", host_timestep_embedding_f32(999.0), "f16"), + ("engine [cos|sin] t=1 F16", host_timestep_embedding_f32(1.0), "f16"), + ("engine [cos|sin] t=950 F16", host_timestep_embedding_f32(950.0), "f16"), + ("engine [cos|sin] t=500 F16", host_timestep_embedding_f32(500.0), "f16"), + ] + for label, s, mode in configs: + _, _, _, yy = run_chain(s, W1, b1, W2, b2, mode=mode) + c = cossim(yy, t_emb_native) + amx = np.abs(yy).max() + print(f" {label:38s} cos={c:+.6f} oracle_absmax={amx:.4e}") + print() + + # ---- 6. test transposed weights (W2[K,N] vs [N,K]) ---- + print("--- weight-orientation sanity check ---") + # If load returned (N,K) but engine expects (K,N), .T is wrong direction. + try: + sinu_eng = host_timestep_embedding_f32(1000.0) + x_ = f16_round(sinu_eng) + W1_alt = f16_round(W1) # try without transpose: x @ W1 (no .T) + if W1_alt.shape[0] == 256: + h_alt = x_ @ W1_alt + f16_round(b1) + print(f" W1 no-transpose path: shape would be {h_alt.shape}") + except Exception as e: + print(f" no-transpose check err: {e}") + print() + + # ---- 7. 
magnitude vs §5.5.7 historical ---- + print("--- magnitude reality check ---") + print(f" Current native 00_t_emb absmax: {np.abs(t_emb_native).max():.4e}") + print(f" §5.5.7 historical : 1.118e+02") + print(f" Oracle absmax (F16, t=1000) : {np.abs(y).max():.4e}") + print(f" Engine path is QUANTIZED weights (Q4_0/Q5_K) -> dequant may shift") + print() + + # ---- decision ---- + print("=" * 70) + if cos_full >= 0.999: + verdict = "GREEN-B (chain bit-accurate)" + elif cos_full >= 0.99: + verdict = "AMBER (near-bit-accurate, F16 drift only)" + else: + verdict = "GREEN-A (CHAIN BUG)" + print(f"VERDICT: {verdict} cos={cos_full:.6f}") + print("=" * 70) + + +if __name__ == "__main__": + main() diff --git a/src/handlers/audio.rs b/src/handlers/audio.rs index a5da363..186cbf4 100644 --- a/src/handlers/audio.rs +++ b/src/handlers/audio.rs @@ -10,7 +10,8 @@ use crate::engines::qwen3_tts; use crate::engines::tts_trait::{TtsCloneRequest, TtsRequest as TtsTraitRequest, TtsResponse}; use crate::error::render_error; use crate::inference::{AudioChunk, InferenceRequest, TtsRequest}; -use crate::types::{SpeechCloneRequest, SpeechRequest, TranscriptionRequest}; +use crate::types::{SpeechCloneRequest, SpeechRequest, TranscriptionRequest, VoiceNotFoundError}; +use crate::voice_registry::VoiceRegistry; use super::helpers::{get_state, send_and_wait}; @@ -174,10 +175,41 @@ pub async fn tts_qwen3( let wants_wav = req.query::("format").as_deref() == Some("wav") || request.response_format == "wav"; + // Resolve the requested voice through the same normalization as before + // (empty/"default" -> "vivian"), then reject unknown names with 404. + // Silent fallback to a default voice is what prompted this endpoint — + // fm_tts callers were getting 200 + audio in the wrong voice because + // an unregistered `voice` string was being swallowed downstream. + let resolved_voice = normalize_voice(request.voice.clone()); + let registry = VoiceRegistry::load(); + if !registry.contains(&resolved_voice) { + let requester = request_peer(req); + let available = registry.available_voices().to_vec(); + tracing::info!( + "rejected TTS request: unknown voice {} (available: {}) from {}", + resolved_voice, + available.join(", "), + requester, + ); + let message = format!( + "Voice '{}' is not registered. Available: {}", + resolved_voice, + available.join(", "), + ); + res.status_code(salvo::http::StatusCode::NOT_FOUND); + res.render(Json(VoiceNotFoundError { + error: "voice_not_found", + message, + requested_voice: resolved_voice, + available_voices: available, + })); + return Ok(()); + } + let chunk_rx = spawn_per_sentence_tts( state.inference_tx.clone(), qwen3_tts::split_sentences(&request.input), - normalize_voice(request.voice), + resolved_voice, request.language.unwrap_or_else(|| "chinese".to_string()), request.speed, request.instruct, @@ -195,6 +227,30 @@ pub async fn tts_qwen3( Ok(()) } +/// Best-effort client identifier for log lines. Prefers +/// `X-Forwarded-For` / `X-Real-IP` (gateway-provided), else the peer +/// socket. Falls back to "unknown" so logs stay single-line. 
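+/// `X-Forwarded-For` may carry a comma-separated hop chain; only the
+/// first entry (the original client) is used.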
+fn request_peer(req: &Request) -> String { + if let Some(fwd) = req.headers().get("x-forwarded-for") { + if let Ok(s) = fwd.to_str() { + if let Some(first) = s.split(',').next() { + let trimmed = first.trim(); + if !trimmed.is_empty() { + return trimmed.to_string(); + } + } + } + } + if let Some(real) = req.headers().get("x-real-ip") { + if let Ok(s) = real.to_str() { + if !s.is_empty() { + return s.to_string(); + } + } + } + req.remote_addr().to_string() +} + /// POST /v1/audio/tts/clone — Qwen3-TTS voice cloning (Base model) /// /// Dedicated endpoint for zero-shot voice cloning via x-vector speaker embedding. @@ -891,3 +947,193 @@ fn stream_pcm_response(chunk_rx: mpsc::Receiver, res: &mut Response) res.stream(stream); } + +#[cfg(test)] +mod tests { + //! Handler-level tests for `POST /v1/audio/tts/qwen3` voice validation. + //! + //! These exercise the new 404 behavior without touching the real TTS + //! pipeline. The 404 rejection short-circuits before any inference + //! request is sent, so a dummy receiver is sufficient. + //! + //! For happy-path requests (registered voice) the test spawns a stub + //! receiver that answers each `SpeechOneSentence` with a tiny PCM + //! buffer, mirroring what the real inference thread would produce. + use super::*; + use crate::inference::TtsRequest as CoreTtsRequest; + use crate::state::AppState; + use salvo::test::{ResponseExt, TestClient}; + use serde_json::Value; + use std::sync::Arc; + use tokio::sync::{broadcast, mpsc}; + + /// Build an `AppState` whose `inference_tx` is consumed by a stub + /// that answers Qwen3 TTS sentence requests with a 4-byte PCM blob. + /// Other channels drop their messages (they're not exercised by + /// `tts_qwen3`). + fn make_test_state() -> AppState { + let (inference_tx, mut inference_rx) = mpsc::channel(8); + tokio::spawn(async move { + while let Some(req) = inference_rx.recv().await { + if let crate::inference::InferenceRequest::Qwen3Tts(tts_req) = req { + match tts_req { + CoreTtsRequest::SpeechOneSentence { response_tx, .. } => { + let _ = response_tx.send(Ok(vec![0u8; 4])); + } + CoreTtsRequest::CloneOneSentence { response_tx, .. } => { + let _ = response_tx.send(Ok(vec![0u8; 4])); + } + CoreTtsRequest::PrepareCloneRef { response_tx, .. } => { + let _ = response_tx.send(Ok(())); + } + CoreTtsRequest::Speech { response_tx, .. } => { + let _ = response_tx.send(Ok(vec![0u8; 4])); + } + CoreTtsRequest::SpeechClone { response_tx, .. } => { + let _ = response_tx.send(Ok(vec![0u8; 4])); + } + } + } + } + }); + let (training_tx, _training_rx) = mpsc::channel(1); + let (progress_tx, _) = broadcast::channel(1); + let (download_tx, _download_rx) = mpsc::channel(1); + let (download_progress_tx, _) = broadcast::channel(1); + AppState { + inference_tx, + training_tx, + progress_tx, + cancel_flag: Default::default(), + download_tx, + download_progress_tx, + download_cancel_flags: Default::default(), + server_config: Arc::new(crate::server_config::ServerConfig::default()), + ascend_config: None, + ascend_tts_backend: None, + } + } + + fn make_test_service() -> salvo::Service { + let state = make_test_state(); + let router = Router::new().push(Router::with_path("v1/audio/tts/qwen3").post(tts_qwen3)); + let router = router.hoop(salvo::affix_state::inject(state)); + salvo::Service::new(router) + } + + /// Point the registry at a scratch file so tests don't accidentally + /// accept voices from the developer's real `~/.OminiX/models/voices.json`. + /// Env mutation is serialized by `voice_test_guard`. 
+ fn set_voices_json_to_empty() -> tempfile::NamedTempFile { + let file = tempfile::NamedTempFile::new().expect("tempfile"); + std::fs::write(file.path(), r#"{"voices":{}}"#).unwrap(); + std::env::set_var("OMINIX_VOICES_JSON", file.path()); + file + } + + fn set_voices_json_with(content: &str) -> tempfile::NamedTempFile { + let file = tempfile::NamedTempFile::new().expect("tempfile"); + std::fs::write(file.path(), content).unwrap(); + std::env::set_var("OMINIX_VOICES_JSON", file.path()); + file + } + + /// Serialize tests that mutate `OMINIX_VOICES_JSON`. A tokio-aware + /// mutex is used because the guard is held across await points + /// within each test. + async fn voice_test_guard() -> tokio::sync::MutexGuard<'static, ()> { + use std::sync::OnceLock; + use tokio::sync::Mutex; + static LOCK: OnceLock> = OnceLock::new(); + LOCK.get_or_init(|| Mutex::new(())).lock().await + } + + #[tokio::test] + async fn should_return_404_when_voice_is_unknown() { + let _guard = voice_test_guard().await; + let _temp = set_voices_json_to_empty(); + let service = make_test_service(); + + let mut res = TestClient::post("http://127.0.0.1:5800/v1/audio/tts/qwen3") + .json(&serde_json::json!({"voice": "yangmi", "input": "test"})) + .send(&service) + .await; + + assert_eq!(res.status_code.unwrap().as_u16(), 404); + let body: Value = res.take_json().await.expect("json body"); + assert_eq!(body["error"], "voice_not_found"); + assert_eq!(body["requested_voice"], "yangmi"); + assert!( + body["available_voices"] + .as_array() + .expect("available_voices array") + .iter() + .any(|v| v == "vivian"), + "expected vivian in available_voices: {body}" + ); + let msg = body["message"].as_str().unwrap(); + assert!(msg.contains("yangmi")); + assert!(msg.contains("vivian")); + } + + #[tokio::test] + async fn should_return_200_when_voice_is_registered_preset() { + let _guard = voice_test_guard().await; + let _temp = set_voices_json_to_empty(); + let service = make_test_service(); + + // Ask for WAV so the handler awaits the stub receiver's PCM and + // returns a complete body rather than streaming chunked. + let res = TestClient::post("http://127.0.0.1:5800/v1/audio/tts/qwen3?format=wav") + .json(&serde_json::json!({"voice": "vivian", "input": "hello"})) + .send(&service) + .await; + + assert_eq!(res.status_code.unwrap().as_u16(), 200); + } + + #[tokio::test] + async fn should_return_200_when_voice_field_is_missing() { + let _guard = voice_test_guard().await; + let _temp = set_voices_json_to_empty(); + let service = make_test_service(); + + let res = TestClient::post("http://127.0.0.1:5800/v1/audio/tts/qwen3?format=wav") + .json(&serde_json::json!({"input": "hello"})) + .send(&service) + .await; + + // Empty voice normalizes to "vivian" which is a preset — happy path. + assert_eq!(res.status_code.unwrap().as_u16(), 200); + } + + #[tokio::test] + async fn should_return_200_when_voice_matches_custom_alias() { + let _guard = voice_test_guard().await; + let _temp = set_voices_json_with(r#"{"voices":{"mia_clone":{"aliases":["mia"]}}}"#); + let service = make_test_service(); + + let res = TestClient::post("http://127.0.0.1:5800/v1/audio/tts/qwen3?format=wav") + .json(&serde_json::json!({"voice": "mia", "input": "hi"})) + .send(&service) + .await; + + assert_eq!(res.status_code.unwrap().as_u16(), 200); + } + + #[test] + fn should_include_requested_voice_when_error_serializes() { + // Sanity-check the VoiceNotFoundError shape matches the contract. 
+ let err = VoiceNotFoundError { + error: "voice_not_found", + message: "Voice 'x' is not registered. Available: vivian".into(), + requested_voice: "x".into(), + available_voices: vec!["vivian".into()], + }; + let v: Value = serde_json::from_str(&serde_json::to_string(&err).unwrap()).unwrap(); + assert_eq!(v["error"], "voice_not_found"); + assert_eq!(v["requested_voice"], "x"); + assert_eq!(v["available_voices"][0], "vivian"); + assert!(v["message"].as_str().unwrap().contains("vivian")); + } +} diff --git a/src/main.rs b/src/main.rs index b17b620..ee430e8 100644 --- a/src/main.rs +++ b/src/main.rs @@ -35,6 +35,7 @@ mod training; mod types; mod utils; mod version; +mod voice_registry; use config::Config; use inference::{InferenceRequest, TtsPoolConfig}; diff --git a/src/types/audio.rs b/src/types/audio.rs index 5775f04..c68357a 100644 --- a/src/types/audio.rs +++ b/src/types/audio.rs @@ -69,6 +69,22 @@ pub struct SpeechRequest { // Audio Speech Clone (Voice Cloning TTS) // ============================================================================ +/// Response body for a rejected TTS request whose `voice` field did not +/// match any registered voice. Returned with HTTP 404 from +/// `POST /v1/audio/tts/qwen3`. See `src/voice_registry.rs`. +#[derive(Debug, Serialize)] +pub struct VoiceNotFoundError { + /// Stable error tag: always `"voice_not_found"`. + pub error: &'static str, + /// Human-readable message including the requested voice and a hint + /// at the available set. + pub message: String, + /// Echo of the voice the client asked for. + pub requested_voice: String, + /// Canonical list of voices the client may request instead. + pub available_voices: Vec, +} + /// Clone request built from multipart form fields (not JSON). #[derive(Debug)] pub struct SpeechCloneRequest { diff --git a/src/voice_registry.rs b/src/voice_registry.rs new file mode 100644 index 0000000..690715e --- /dev/null +++ b/src/voice_registry.rs @@ -0,0 +1,193 @@ +//! Voice registry for validating TTS voice names. +//! +//! Collects the set of voices the server will actually synthesize for: +//! the built-in Qwen3-TTS preset speakers plus any custom cloned voices +//! declared in `~/.OminiX/models/voices.json`. Used to reject +//! `POST /v1/audio/tts/qwen3` requests whose `voice` field does not match +//! a registered voice, instead of silently falling back to a default. +//! +//! An empty/`"default"` voice is considered valid at the handler level +//! (the handler substitutes the documented default via `normalize_voice` +//! before looking up here). + +use std::collections::HashSet; + +/// Qwen3-TTS CustomVoice preset speaker names (built into the model). +/// +/// Kept in lockstep with `PRESET_SPEAKERS` in `handlers::training`; any +/// preset added to the model must be added to both. A future refactor +/// can unify the two copies without changing behavior. +const PRESET_SPEAKERS: &[&str] = &[ + "vivian", + "serena", + "ryan", + "aiden", + "uncle_fu", + "chinese_woman", + "chinese_man", + "dialect", + "english_man", +]; + +/// Default path of the custom-voice registry file. +const DEFAULT_VOICES_JSON: &str = "~/.OminiX/models/voices.json"; + +/// Environment variable that overrides `DEFAULT_VOICES_JSON`. Primarily +/// for tests and operators who store the voices file elsewhere. +const VOICES_JSON_ENV: &str = "OMINIX_VOICES_JSON"; + +/// Registry of voices the TTS backend will accept. 
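+///
+/// Expected `voices.json` shape (only `voices.*.aliases` is read here;
+/// other per-voice fields are ignored):
+/// `{"voices": {"mia_clone": {"aliases": ["mia"]}}}`.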
+#[derive(Debug, Clone)] +pub struct VoiceRegistry { + /// Canonical voice names — preset speakers plus custom names declared in + /// `voices.json`. Ordering: presets first (registry order), then customs + /// in JSON-iteration order. + canonical_names: Vec, + /// Lookup keys: every canonical name plus every alias. Case-sensitive + /// to match the existing model behavior (Qwen3-TTS preset names are + /// lowercase by convention). + valid_keys: HashSet, +} + +impl VoiceRegistry { + /// Load the registry from the default location + /// (`~/.OminiX/models/voices.json`, or the path in the + /// `OMINIX_VOICES_JSON` env var when set). Missing or malformed JSON + /// is tolerated — the registry still includes the preset speakers. + pub fn load() -> Self { + let path = std::env::var(VOICES_JSON_ENV) + .unwrap_or_else(|_| crate::utils::expand_tilde(DEFAULT_VOICES_JSON)); + Self::load_from_path(&path) + } + + /// Load the registry from an explicit path. Exposed for tests. + pub fn load_from_path(path: &str) -> Self { + let custom_json = std::fs::read_to_string(path) + .ok() + .and_then(|content| serde_json::from_str::(&content).ok()); + Self::from_json(custom_json.as_ref()) + } + + /// Build the registry from optional custom-voice JSON (the raw + /// contents of `voices.json`). + fn from_json(custom: Option<&serde_json::Value>) -> Self { + let mut canonical_names: Vec = + PRESET_SPEAKERS.iter().map(|s| s.to_string()).collect(); + let mut valid_keys: HashSet = canonical_names.iter().cloned().collect(); + + if let Some(config) = custom { + if let Some(voices) = config.get("voices").and_then(|v| v.as_object()) { + for (name, voice) in voices { + canonical_names.push(name.clone()); + valid_keys.insert(name.clone()); + if let Some(aliases) = voice.get("aliases").and_then(|a| a.as_array()) { + for alias in aliases.iter().filter_map(|v| v.as_str()) { + valid_keys.insert(alias.to_string()); + } + } + } + } + } + + Self { + canonical_names, + valid_keys, + } + } + + /// Return `true` when `voice` matches a registered canonical name or + /// alias. Case-sensitive. + pub fn contains(&self, voice: &str) -> bool { + self.valid_keys.contains(voice) + } + + /// Canonical list of voices for client-facing error messages. 
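+    /// Presets come first, then custom names in `voices.json` order;
+    /// aliases are accepted by `contains` but not listed here.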
+ pub fn available_voices(&self) -> &[String] { + &self.canonical_names + } +} + +#[cfg(test)] +mod tests { + use super::*; + + fn registry_with_custom(json: serde_json::Value) -> VoiceRegistry { + VoiceRegistry::from_json(Some(&json)) + } + + #[test] + fn should_accept_preset_speakers_when_no_custom_voices() { + let reg = VoiceRegistry::from_json(None); + assert!(reg.contains("vivian")); + assert!(reg.contains("serena")); + assert!(reg.contains("english_man")); + } + + #[test] + fn should_reject_unknown_voice_when_not_registered() { + let reg = VoiceRegistry::from_json(None); + assert!(!reg.contains("yangmi")); + assert!(!reg.contains("does_not_exist")); + } + + #[test] + fn should_accept_custom_voice_when_declared_in_json() { + let reg = registry_with_custom(serde_json::json!({ + "voices": { + "yangmi": {"aliases": []} + } + })); + assert!(reg.contains("yangmi")); + } + + #[test] + fn should_accept_alias_when_declared_in_json() { + let reg = registry_with_custom(serde_json::json!({ + "voices": { + "vivian_clone": {"aliases": ["vivian2", "vivian-new"]} + } + })); + assert!(reg.contains("vivian_clone")); + assert!(reg.contains("vivian2")); + assert!(reg.contains("vivian-new")); + } + + #[test] + fn should_be_case_sensitive_when_matching_voice() { + let reg = VoiceRegistry::from_json(None); + assert!(reg.contains("vivian")); + assert!(!reg.contains("Vivian")); + assert!(!reg.contains("VIVIAN")); + } + + #[test] + fn should_list_presets_first_in_available_voices() { + let reg = registry_with_custom(serde_json::json!({ + "voices": {"custom_a": {"aliases": []}} + })); + let names = reg.available_voices(); + assert_eq!(names[0], "vivian"); + assert!(names.iter().any(|n| n == "custom_a")); + // Customs appear after all presets. + let custom_idx = names.iter().position(|n| n == "custom_a").unwrap(); + assert!(custom_idx >= PRESET_SPEAKERS.len()); + } + + #[test] + fn should_tolerate_malformed_voices_json_when_loading() { + // Point at a file that does not exist — should still list presets. + let reg = VoiceRegistry::load_from_path("/nonexistent/voices.json"); + assert!(reg.contains("vivian")); + assert!(!reg.contains("yangmi")); + } + + #[test] + fn should_tolerate_voices_without_aliases_field() { + let reg = registry_with_custom(serde_json::json!({ + "voices": { + "no_alias_voice": {} + } + })); + assert!(reg.contains("no_alias_voice")); + } +} diff --git a/tools/qie_5513f_analyze.py b/tools/qie_5513f_analyze.py new file mode 100644 index 0000000..b41bd05 --- /dev/null +++ b/tools/qie_5513f_analyze.py @@ -0,0 +1,111 @@ +#!/usr/bin/env python3 +""" +Per-block residual magnitude bisect for QIE Ascend native engine vs CLI ground truth. + +Reads CLI block dumps from /tmp/qie_5513f_cli_blocks/blockNN/qie_cli_blkNN_.f32.bin +and engine block dumps from /tmp/qie_5513f_eng_blocks/blockNN/.f32 + +Tags compared: + 13_img_resid1 (post-attention residual, before LN2/FFN) + 24_img_resid2 (post-FFN residual, end-of-block — the per-block residual stream) + 13_txt_resid1 + 24_txt_resid2 + +Reports per-block: cos similarity, ratio of absmax (eng/cli), absmax for each side, +std for each side. Identifies the divergence signature. 
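+
+Usage: run directly (no args); the dump roots are hard-coded via
+ENG_DIR / CLI_DIR below.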
+""" +import numpy as np +import os +import sys + +ENG_DIR = "/tmp/qie_5513f_eng_blocks" +CLI_DIR = "/tmp/qie_5513f_cli_blocks" + +BLOCKS = [0,1,2,3,4,5,6,7,8,9,16,24,30,36,45,50,51,52,53,54,55,56,57,58,59] +TAGS_IMG = ["13_img_resid1", "24_img_resid2"] +TAGS_TXT = ["13_txt_resid1", "24_txt_resid2"] + +def load_eng(blk, tag): + p = f"{ENG_DIR}/block{blk:02d}/{tag}.f32" + if not os.path.exists(p): return None + return np.fromfile(p, dtype=np.float32) + +def load_cli(blk, tag): + p = f"{CLI_DIR}/block{blk:02d}/qie_cli_blk{blk:02d}_{tag}.f32.bin" + if not os.path.exists(p): return None + return np.fromfile(p, dtype=np.float32) + +def stats_one(a): + finite = a[np.isfinite(a)] + return dict( + n=len(a), + n_finite=len(finite), + nan=int(np.isnan(a).sum()), + inf=int(np.isinf(a).sum()), + absmax=float(np.max(np.abs(finite))) if len(finite) else float('nan'), + std=float(np.std(finite)) if len(finite) else float('nan'), + ) + +def cos_sim(a, b): + fa = np.isfinite(a) & np.isfinite(b) + if fa.sum() == 0: return float('nan') + a = a[fa].astype(np.float64) + b = b[fa].astype(np.float64) + na = np.linalg.norm(a) + nb = np.linalg.norm(b) + if na == 0 or nb == 0: return float('nan') + return float(np.dot(a, b) / (na * nb)) + +def main(): + print(f"{'blk':>3} | {'tag':<14} | {'eng_amax':>10} {'eng_std':>10} | {'cli_amax':>10} {'cli_std':>10} | {'cos':>6} | {'r_amax':>8} | {'eng_n/c':>10}") + print("-" * 110) + rows = [] + for blk in BLOCKS: + for tag in TAGS_IMG + TAGS_TXT: + eng = load_eng(blk, tag) + cli = load_cli(blk, tag) + if eng is None and cli is None: continue + if eng is None: + print(f"{blk:>3} | {tag:<14} | {'MISSING':>21} | {stats_one(cli)['absmax']:>10.3g} {stats_one(cli)['std']:>10.3g} | | |") + continue + if cli is None: + print(f"{blk:>3} | {tag:<14} | {stats_one(eng)['absmax']:>10.3g} {stats_one(eng)['std']:>10.3g} | {'MISSING':>21} | | |") + continue + if eng.shape != cli.shape: + print(f"{blk:>3} | {tag:<14} | shape mismatch eng={eng.shape} cli={cli.shape}") + continue + se = stats_one(eng) + sc = stats_one(cli) + cos = cos_sim(eng, cli) + ratio_amax = se['absmax']/sc['absmax'] if sc['absmax'] else float('nan') + nan_summary = f"{se['nan']}/{sc['nan']}" + print(f"{blk:>3} | {tag:<14} | {se['absmax']:>10.3g} {se['std']:>10.3g} | {sc['absmax']:>10.3g} {sc['std']:>10.3g} | {cos:>6.3f} | {ratio_amax:>8.3g} | {nan_summary:>10}") + rows.append((blk, tag, cos, ratio_amax, se['absmax'], sc['absmax'])) + + print("\n=== Divergence signature analysis ===") + # focus on 24_img_resid2 (end-of-block residual stream) — the canonical magnitude metric + img_r2 = [(blk, cos, r) for (blk, tag, cos, r, _, _) in rows if tag == "24_img_resid2"] + if img_r2: + print("\nFor tag 24_img_resid2 (end-of-block img residual):") + first_cos_under_95 = None + first_ratio_over_2 = None + first_ratio_under_05 = None + for blk, cos, r in img_r2: + note = [] + if first_cos_under_95 is None and cos < 0.95: + first_cos_under_95 = blk + note.append("first cos<0.95") + if first_ratio_over_2 is None and r >= 2.0: + first_ratio_over_2 = blk + note.append("first r>=2") + if first_ratio_under_05 is None and r <= 0.5: + first_ratio_under_05 = blk + note.append("first r<=0.5") + ann = " <-- " + ", ".join(note) if note else "" + print(f" blk{blk:02d}: cos={cos:.4f} ratio_amax(eng/cli)={r:.4g}{ann}") + print(f"\nFirst block with cos<0.95: {first_cos_under_95}") + print(f"First block with ratio_amax>=2 (eng>cli): {first_ratio_over_2}") + print(f"First block with ratio_amax<=0.5 (eng