perf(engine): sub-phase routed-MoE prefill diagnostic; verify+close #112/#121 #172

Merged
pekkah merged 1 commit into master from perf/moe-prefill-followups-112-121
Jun 7, 2026

@pekkah pekkah commented Jun 7, 2026

Owner

Summary

Closes #112 and #121 (both verified already-shipped — see issue comments) and lands a permanent sub-phase prefill diagnostic for the batched routed-MoE.

#112 (4-input dequant-once MoE dots) and #121 (CUDA batched FFN/MoE prefill) were substantively implemented in earlier merged PRs (#114-A/#116, #128, #129). Profiling on RTX 4070 Ti + Ryzen 9 7900X confirmed there is no bigger bit-parity-preserving lever for routed-MoE prefill on this hardware — it is compute-bound on the int8 maddubs dots at the .NET-10 ISA ceiling (no Avx512Vnni).
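For readers unfamiliar with the "int8 maddubs dots" mentioned above, here is a minimal sketch of the per-lane arithmetic of a maddubs-style instruction: unsigned 8-bit values are multiplied by signed 8-bit values, and adjacent products are summed with signed 16-bit saturation. On CPUs with AVX-512 VNNI, the multiply, widen, and accumulate steps fold into a single instruction (vpdpbusd), which is the path the summary says is not available here. The function names are hypothetical and this is not the engine's kernel, only an emulation of the instruction semantics in plain Python.

```python
def maddubs_pairs(a_u8, b_s8):
    """Emulate maddubs lane semantics on plain Python lists.

    a_u8 holds unsigned 8-bit values (0..255), b_s8 holds signed 8-bit
    values (-128..127). Each adjacent pair of products is summed and the
    sum is saturated to the signed 16-bit range.
    """
    assert len(a_u8) == len(b_s8) and len(a_u8) % 2 == 0
    out = []
    for i in range(0, len(a_u8), 2):
        pair_sum = a_u8[i] * b_s8[i] + a_u8[i + 1] * b_s8[i + 1]
        out.append(max(-32768, min(32767, pair_sum)))  # signed 16-bit saturation
    return out


def int8_dot(a_u8, b_s8):
    """Dot product built from maddubs pairs, accumulated in a wide integer."""
    return sum(maddubs_pairs(a_u8, b_s8))


print(maddubs_pairs([1, 2, 3, 4], [5, -6, 7, -8]))  # [-7, -11]
print(maddubs_pairs([255, 255], [127, 127]))        # [32767] (saturated from 64770)
print(int8_dot([1, 2, 3, 4], [5, -6, 7, -8]))       # -18
```

The saturating case shows why real kernels have to bound operand magnitudes or widen early: the 16-bit intermediate can clip before the result reaches a 32-bit accumulator.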

What this PR adds

A [moe-subphase] line on the SHARPI_PREFILL_PROFILE=1 batched-prefill dump, splitting BatchedRoutedExperts into bucket / normQ / phaseA(gate+up) / silu+reduce / gateQ / phaseC(down). All sites guarded by _prefillProfile, so zero cost when profiling is off — purely additive timing, no forward-pass behavior change. Also serves #168's request for a diagnostic that pinpoints the next mixed-quant cliff.
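The guarded-timing pattern described above can be sketched as follows. This is an illustration in Python, not the C# in the PR: the class name `SubPhaseProfiler` and its methods are hypothetical, and only the phase labels, the `[moe-subphase]` tag, and the `SHARPI_PREFILL_PROFILE=1` switch come from the description. When profiling is off, the only overhead in this sketch is a single flag check per phase.

```python
import os
import time


class SubPhaseProfiler:
    """Accumulates per-phase wall time only when profiling is enabled."""

    # Labels as they appear on the [moe-subphase] dump line.
    PHASES = ("bucket", "normQ", "phaseA(gate+up)", "silu/reduce", "gateQ", "phaseC(down)")

    def __init__(self, enabled=None):
        if enabled is None:
            # Assumption: the flag is read once from the environment.
            enabled = os.environ.get("SHARPI_PREFILL_PROFILE") == "1"
        self.enabled = enabled
        self.ms = {name: 0.0 for name in self.PHASES}

    def run(self, phase, fn, *args):
        """Call fn(*args), timing it under `phase` only if profiling is on."""
        if not self.enabled:
            return fn(*args)  # profiling off: no clock reads at all
        start = time.perf_counter()
        try:
            return fn(*args)
        finally:
            self.ms[phase] += (time.perf_counter() - start) * 1000.0

    def dump(self):
        """Return the formatted dump line, or None when profiling is off."""
        if not self.enabled:
            return None
        fields = " ".join(f"{name}={value:.0f}" for name, value in self.ms.items())
        return f"[moe-subphase] {fields}"


prof = SubPhaseProfiler(enabled=True)
prof.run("bucket", sorted, [3, 1, 2])
print(prof.dump())
```

Because the timers only wrap existing calls and never alter their arguments or results, the forward pass computes the same values whether profiling is on or off.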

Evidence (Carnice 35B-A3B, SHARPI_CPU_MOE=1, warm, N=1212)

[batched-prefill] total=9598ms trunk=2370 router=902 routedMoE=6270 combine=37
[moe-subphase]    bucket=8 normQ=32 phaseA(gate+up)=3779 silu/reduce=141 gateQ=59 phaseC(down)=2249

routedMoE ≈ 65% of prefill (6270 of 9598 ms); phaseA+phaseC (int8 dots) = 96% of routedMoE (3779+2249 of 6270 ms); quant/bucket/silu/reduce ≈ 4% (240 ms). Weights (~13.7 GB) are read once per chunk, which takes ≈ 0.23 s at RAM bandwidth vs 6.3 s of compute → compute-bound, not memory-bound.
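As a cross-check, the shares can be recomputed directly from the two dump lines. The sketch below parses the lines quoted in the evidence and derives the ratios. The helper names are hypothetical. The 13.7 GB and 0.23 s figures are taken from the text, and the RAM bandwidth is only inferred from them, not measured.

```python
import re

DUMP = """\
[batched-prefill] total=9598ms trunk=2370 router=902 routedMoE=6270 combine=37
[moe-subphase]    bucket=8 normQ=32 phaseA(gate+up)=3779 silu/reduce=141 gateQ=59 phaseC(down)=2249
"""


def parse_dump(text):
    """Map each bracketed tag to a dict of {field: milliseconds}."""
    parsed = {}
    for line in text.strip().splitlines():
        tag, rest = re.match(r"\[([^\]]+)\]\s+(.*)", line).groups()
        parsed[tag] = {k: int(v) for k, v in re.findall(r"(\S+?)=(\d+)", rest)}
    return parsed


def summarize(parsed, weight_gb=13.7, weight_read_s=0.23):
    """Derive time shares and the implied bandwidth from the parsed dump."""
    prefill, sub = parsed["batched-prefill"], parsed["moe-subphase"]
    routed = prefill["routedMoE"]
    dots = sub["phaseA(gate+up)"] + sub["phaseC(down)"]
    overhead = sum(sub.values()) - dots  # bucket + normQ + silu/reduce + gateQ
    return {
        "routed_share": routed / prefill["total"],
        "dot_share": dots / routed,
        "overhead_share": overhead / routed,
        "implied_gb_per_s": weight_gb / weight_read_s,
        "compute_to_read": (routed / 1000.0) / weight_read_s,
    }


s = summarize(parse_dump(DUMP))
print(f"routedMoE share of prefill: {s['routed_share']:.1%}")    # 65.3%
print(f"int8 dots share of routedMoE: {s['dot_share']:.1%}")     # 96.1%
print(f"other sub-phases: {s['overhead_share']:.1%}")            # 3.8%
print(f"implied RAM bandwidth: {s['implied_gb_per_s']:.1f} GB/s")  # 59.6 GB/s
print(f"compute time / weight-read time: {s['compute_to_read']:.1f}x")  # 27.3x
```

With routed-MoE compute taking roughly 27 times longer than streaming the weights once, faster memory would barely move the total, which is the basis for the compute-bound conclusion.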

Test

Build clean (TreatWarningsAsErrors). No forward-pass logic touched, so existing bit-parity oracles are unaffected; the diagnostic was validated live via the runs above.

🤖 Generated with Claude Code

…#112/#168)

Add a [moe-subphase] line to the SHARPI_PREFILL_PROFILE=1 batched-prefill dump
breaking BatchedRoutedExperts into bucket / normQ / phaseA(gate+up) /
silu+reduce / gateQ / phaseC(down). All sites guarded by _prefillProfile, so
zero cost when profiling is off.

This is the diagnostic that confirmed #112's routed-MoE prefill is dot-bound:
on Carnice 35B-A3B (CPU-MoE, warm, N=1212) phaseA+phaseC are 96% of routedMoE
(3779+2249 of 6270 ms), with quantization/bucketing/reduce overhead ~4%. It
also serves issue #168's request for a diagnostic that pinpoints the next
mixed-quant cliff. No behavior change to the forward pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah pekkah merged commit 6503d32 into master Jun 7, 2026
1 check passed
@pekkah pekkah deleted the perf/moe-prefill-followups-112-121 branch June 7, 2026 20:38

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces detailed sub-phase profiling for the batched routed-MoE (BatchedRoutedExperts) within CudaHybridGdnForwardPass.cs. It adds private fields to track the time spent in various stages—such as bucketing, quantization, Phase A (gate+up), SiLU/reduce, gate quantization, and Phase C (down)—and prints these metrics when prefill profiling is enabled. There are no review comments, so I have no feedback to provide.


Development

Successfully merging this pull request may close these issues.

perf(cpu,engine): dequant-once multi-input MoE dot kernels ("one GEMM per expert") — routed MoE is dequant-bound (#110 follow-up)