perf(engine): sub-phase routed-MoE prefill diagnostic; verify+close #112/#121 by pekkah · Pull Request #172 · pekkah/SharpInference

pekkah · 2026-06-07T20:38:42Z

Summary

Closes #112 and #121 (both verified already-shipped — see issue comments) and lands a permanent sub-phase prefill diagnostic for the batched routed-MoE.

#112 (4-input dequant-once MoE dots) and #121 (CUDA batched FFN/MoE prefill) were substantively implemented in earlier merged PRs (#114-A/#116, #128, #129). Profiling on RTX 4070 Ti + Ryzen 9 7900X confirmed there is no bigger bit-parity-preserving lever for routed-MoE prefill on this hardware — it is compute-bound on the int8 maddubs dots at the .NET-10 ISA ceiling (no Avx512Vnni).

What this PR adds

A [moe-subphase] line on the SHARPI_PREFILL_PROFILE=1 batched-prefill dump, splitting BatchedRoutedExperts into bucket / normQ / phaseA(gate+up) / silu+reduce / gateQ / phaseC(down). All sites guarded by _prefillProfile, so zero cost when profiling is off — purely additive timing, no forward-pass behavior change. Also serves #168's request for a diagnostic that pinpoints the next mixed-quant cliff.
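The guarded-timing pattern described above can be sketched roughly as follows. This is an illustration only: the real fields and call sites in the engine are not shown in this PR text, the `SubPhaseTimer` class and its methods are hypothetical, and Python is used for brevity rather than the project's C#. The one thing taken from the PR is the set of sub-phase names and the `[moe-subphase]` line prefix.

```python
import time

# Sub-phase names as they appear on the [moe-subphase] dump line.
PHASES = ("bucket", "normQ", "phaseA(gate+up)", "silu/reduce", "gateQ", "phaseC(down)")


class SubPhaseTimer:
    """Hypothetical helper: accumulates wall-clock ms per sub-phase.

    Every site checks `enabled` first, so a disabled timer never reads
    the clock and never touches the accumulators.
    """

    def __init__(self, enabled):
        self.enabled = enabled
        self.ms = dict.fromkeys(PHASES, 0.0)
        self._t0 = 0.0

    def start(self):
        if self.enabled:
            self._t0 = time.perf_counter()

    def stop(self, phase):
        if self.enabled:
            self.ms[phase] += (time.perf_counter() - self._t0) * 1000.0

    def line(self):
        body = " ".join(f"{name}={self.ms[name]:.0f}" for name in PHASES)
        return f"[moe-subphase] {body}"
```

A caller would wrap each stage with `start()` and `stop("phaseA(gate+up)")` and print `line()` once per prefill. The timing only brackets the existing work, which is why the forward-pass results are unaffected.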

Evidence (Carnice 35B-A3B, `SHARPI_CPU_MOE=1`, warm, N=1212)

[batched-prefill] total=9598ms trunk=2370 router=902 routedMoE=6270 combine=37
[moe-subphase]    bucket=8 normQ=32 phaseA(gate+up)=3779 silu/reduce=141 gateQ=59 phaseC(down)=2249

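routedMoE is ≈ 65% of prefill (6270 of 9598 ms); phaseA+phaseC (the int8 dots) are 96% of routedMoE (6028 of 6270 ms); quantization, bucketing, and SiLU/reduce together are ≈ 4% (240 ms). The ~13.7 GB of weights are read once per chunk, ≈ 0.23 s at RAM bandwidth versus 6.3 s of compute, so the routed-MoE path is compute-bound, not memory-bound.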
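The shares above can be recomputed directly from the two dump lines. The sketch below is a small checking script, not part of the PR: the `parse` helper is hypothetical, the two strings are copied verbatim from the dump, and the 13.7 GB and 0.23 s figures are the estimates quoted in the text, so the bandwidth it prints is an implied value rather than a measurement.

```python
import re

BATCHED = "[batched-prefill] total=9598ms trunk=2370 router=902 routedMoE=6270 combine=37"
SUBPHASE = ("[moe-subphase]    bucket=8 normQ=32 phaseA(gate+up)=3779 "
            "silu/reduce=141 gateQ=59 phaseC(down)=2249")


def parse(line):
    """Return {name: milliseconds} for every name=value pair on a profile line."""
    return {k: int(v) for k, v in re.findall(r"(\S+)=(\d+)", line)}


batched = parse(BATCHED)
sub = parse(SUBPHASE)

# Share of the whole prefill spent in the routed-MoE path.
moe_share = batched["routedMoE"] / batched["total"]

# The two int8 dot phases versus everything else inside routedMoE.
dot_ms = sub["phaseA(gate+up)"] + sub["phaseC(down)"]
overhead_ms = sub["bucket"] + sub["normQ"] + sub["silu/reduce"] + sub["gateQ"]
dot_share = dot_ms / batched["routedMoE"]
overhead_share = overhead_ms / batched["routedMoE"]

# Memory side, using the figures quoted in the PR text (assumed, not measured here).
weights_gb, read_s = 13.7, 0.23
implied_bw_gbs = weights_gb / read_s
compute_to_read = (batched["routedMoE"] / 1000.0) / read_s

print(f"routedMoE / total    = {moe_share:.1%}")
print(f"dots / routedMoE     = {dot_share:.1%} ({dot_ms} ms)")
print(f"overhead / routedMoE = {overhead_share:.1%} ({overhead_ms} ms)")
print(f"implied RAM bandwidth = {implied_bw_gbs:.1f} GB/s")
print(f"compute / weight-read time = {compute_to_read:.0f}x")
```

The six sub-phases sum to 6268 ms, 2 ms short of the reported routedMoE total, so the breakdown accounts for essentially all of the routed-MoE time.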

Test

Build clean (TreatWarningsAsErrors). No forward-pass logic touched, so existing bit-parity oracles are unaffected; the diagnostic was validated live via the runs above.

🤖 Generated with Claude Code

…#112/#168)

Add a [moe-subphase] line to the SHARPI_PREFILL_PROFILE=1 batched-prefill dump breaking BatchedRoutedExperts into bucket / normQ / phaseA(gate+up) / silu+reduce / gateQ / phaseC(down). All sites are guarded by _prefillProfile, so zero cost when profiling is off.

This is the diagnostic that confirmed #112's routed-MoE prefill is dot-bound: on Carnice 35B-A3B (CPU-MoE, warm, N=1212) phaseA+phaseC are 96% of routedMoE (3779+2249 of 6270 ms), with quantization/bucketing/reduce overhead ~4%. It also serves issue #168's request for a diagnostic that pinpoints the next mixed-quant cliff.

No behavior change to the forward pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces detailed sub-phase profiling for the batched routed-MoE (BatchedRoutedExperts) within CudaHybridGdnForwardPass.cs. It adds private fields to track the time spent in various stages—such as bucketing, quantization, Phase A (gate+up), SiLU/reduce, gate quantization, and Phase C (down)—and prints these metrics when prefill profiling is enabled. There are no review comments, so I have no feedback to provide.


pekkah merged commit 6503d32 into master Jun 7, 2026
1 check passed

pekkah deleted the perf/moe-prefill-followups-112-121 branch June 7, 2026 20:38

gemini-code-assist Bot reviewed Jun 7, 2026
