perf(engine): sub-phase routed-MoE prefill diagnostic; verify+close #112/#121 #172
…#112/#168)
Add a [moe-subphase] line to the SHARPI_PREFILL_PROFILE=1 batched-prefill dump, breaking BatchedRoutedExperts into bucket / normQ / phaseA(gate+up) / silu+reduce / gateQ / phaseC(down). All sites are guarded by _prefillProfile, so there is zero cost when profiling is off.
This is the diagnostic that confirmed #112's routed-MoE prefill is dot-bound: on Carnice 35B-A3B (CPU-MoE, warm, N=1212) phaseA+phaseC are 96% of routedMoE (3779+2249 of 6270 ms), with quantization/bucketing/reduce overhead ~3%. It also serves issue #168's request for a diagnostic that pinpoints the next mixed-quant cliff.
No behavior change to the forward pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code Review
This pull request introduces detailed sub-phase profiling for the batched routed-MoE (BatchedRoutedExperts) within CudaHybridGdnForwardPass.cs. It adds private fields to track the time spent in various stages—such as bucketing, quantization, Phase A (gate+up), SiLU/reduce, gate quantization, and Phase C (down)—and prints these metrics when prefill profiling is enabled. There are no review comments, so I have no feedback to provide.
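The guarded-timing pattern described above can be illustrated with a minimal sketch. This is not the C# code from CudaHybridGdnForwardPass.cs: the class name `SubPhaseProfile`, its methods, and the exact output layout are hypothetical, chosen only to show how per-phase accumulators stay untouched when the profiling flag is off.

```python
import time


class SubPhaseProfile:
    """Accumulates per-phase wall time only when profiling is enabled.

    Hypothetical illustration; names and output format are assumptions.
    """

    # Phase names taken from the PR description.
    PHASES = ("bucket", "normQ", "phaseA", "silu+reduce", "gateQ", "phaseC")

    def __init__(self, enabled):
        self.enabled = enabled
        self.ms = {name: 0.0 for name in self.PHASES}

    def run(self, name, fn, *args):
        """Run fn(*args), timing it under `name` only if profiling is on."""
        if not self.enabled:
            return fn(*args)  # profiling off: no clock reads, no bookkeeping
        start = time.perf_counter()
        result = fn(*args)
        self.ms[name] += (time.perf_counter() - start) * 1000.0
        return result

    def dump(self):
        """Return one summary line, or an empty string when profiling is off."""
        if not self.enabled:
            return ""
        parts = " ".join(f"{k}={v:.1f}ms" for k, v in self.ms.items())
        return f"[moe-subphase] {parts}"


if __name__ == "__main__":
    prof = SubPhaseProfile(enabled=True)
    prof.run("phaseA", sum, range(100_000))
    print(prof.dump())
```

Because every timing site sits behind the same flag, the forward-pass result is identical with profiling on or off; only the accumulators differ.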
Summary
Closes #112 and #121 (both verified already-shipped — see issue comments) and lands a permanent sub-phase prefill diagnostic for the batched routed-MoE.
#112 (4-input dequant-once MoE dots) and #121 (CUDA batched FFN/MoE prefill) were substantively implemented in earlier merged PRs (#114-A/#116, #128, #129). Profiling on RTX 4070 Ti + Ryzen 9 7900X confirmed there is no bigger bit-parity-preserving lever for routed-MoE prefill on this hardware: it is compute-bound on the int8 maddubs dots at the .NET-10 ISA ceiling (no Avx512Vnni).
What this PR adds
A [moe-subphase] line on the SHARPI_PREFILL_PROFILE=1 batched-prefill dump, splitting BatchedRoutedExperts into bucket / normQ / phaseA(gate+up) / silu+reduce / gateQ / phaseC(down). All sites are guarded by _prefillProfile, so there is zero cost when profiling is off; the change is purely additive timing with no forward-pass behavior change. It also serves #168's request for a diagnostic that pinpoints the next mixed-quant cliff.
Evidence (Carnice 35B-A3B, SHARPI_CPU_MOE=1, warm, N=1212)
routedMoE = 66% of prefill; phaseA+phaseC (int8 dots) = 96% of routedMoE; quant/bucket/silu/reduce ≈ 3%. Weights are read ~13.7 GB once/chunk ≈ 0.23 s at RAM BW vs 6.3 s compute → compute-bound, not memory-bound.
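The arithmetic behind these figures can be rechecked with a short sketch. It uses only numbers quoted in this PR and its commit message (3779 ms, 2249 ms, 6270 ms, 13.7 GB, 0.23 s); the variable names are illustrative, and it assumes the 0.23 s weight-read estimate and the routedMoE time cover the same span of work.

```python
# Figures quoted in the PR description and commit message.
routed_moe_ms = 6270   # total routedMoE time
phase_a_ms = 3779      # phaseA (gate+up) int8 dots
phase_c_ms = 2249      # phaseC (down) int8 dots
weights_gb = 13.7      # weights read once per chunk
weight_read_s = 0.23   # quoted time to stream those weights from RAM

# Share of routedMoE spent in the two dot phases.
dot_share = (phase_a_ms + phase_c_ms) / routed_moe_ms

# RAM bandwidth implied by the quoted read time.
implied_bw_gb_s = weights_gb / weight_read_s

# How many times longer the compute takes than streaming the weights would.
compute_to_read = (routed_moe_ms / 1000.0) / weight_read_s

print(f"dot phases: {dot_share:.1%} of routedMoE")
print(f"implied RAM bandwidth: {implied_bw_gb_s:.1f} GB/s")
print(f"compute time / weight-read time: {compute_to_read:.1f}x")
```

A ratio well above 1 in the last line is what supports the compute-bound reading: streaming the weights accounts for only a small fraction of the measured time.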
Test
Build clean (TreatWarningsAsErrors). No forward-pass logic touched, so existing bit-parity oracles are unaffected; the diagnostic was validated live via the runs above.
🤖 Generated with Claude Code