perf(moe): grouped-by-expert MoE verify batching for MTP (#210)#270
Conversation
CudaHybridGdnForwardPass.BatchVerify ran the routed-expert FFN per draft token (a CpuMoeFfnCore loop), so on routed-MoE MTP models (Qwen3.6-35B-A3B-MTP) the trunk + lm_head batched but the routed experts — the actual cost — re-read their mmap'd weights k times. MTP-on decode sat at ~parity with MTP-off. New BatchVerifyCpuMoe routes the k draft tokens' routed FFN through the existing #110 group-by-expert core (BatchedRoutedExperts): each selected expert's gate/up/down rows are read once and dotted against every draft that routed to it. The k GPU shared experts are kicked async and overlap the host routed compute (mirroring CpuMoeFfnCore's per-token GPU-shared/CPU-routed overlap, in bulk). Bit-identical to the per-token loop it replaces (the greedy-verify contract): same DispatchDot/DispatchDotQ8K kernels, same top-k accumulation order, same (routed + sharedScaled) + resid operand order. SHARPI_MTP_BATCHED_MOE_VERIFY=0 reverts to the per-token loop for parity bisection. Qwen3.6-35B-A3B-MTP Q4_K_M, RTX 4070 Ti, warm interleaved median-of-3 decode: MTP off (baseline) 26.9 t/s MTP per-token verify 27.8 t/s (1.03x) MTP batched verify 30.9 t/s (1.15x baseline, +11% over per-token) 100% MTP acceptance, coherent output. The CPU-only sibling HybridGdnForwardPass.BatchVerify is left on its per-token loop: it has no BatchedRoutedExperts core to reuse (a separate port) and CPU MTP shows no measured speedup today. Deferred as a follow-up. Tests: BatchVerifyCpuMoe_BitwiseMatchesPerToken_Carnice asserts byte-identical logits + greedy parity at every position (Carnice Q3_K/Q8_0, the Q8_KS path). Regression green: CudaMtpBatchVerifyTests (dense 27B verify), Qwen35Mtp coherence (35B MoE), BatchedPrefill parity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
/gemini review |
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces grouped-by-expert MoE verification batching for Multi-Token Prediction (MTP) on the CUDA-hybrid path, grouping draft tokens by selected expert to optimize expert weight reads. It also includes a benchmark script and a bitwise parity unit test. The feedback suggests replacing Parallel.For with a sequential loop when combining routed and shared expert outputs, as the small draft batch size (typically 2 to 4 tokens) makes the overhead of parallel scheduling counterproductive.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| Parallel.For(0, k, s_moeParallelOpts, i => | ||
| { | ||
| float* routed = routedAll + (long)i * embDimL; | ||
| float* shared = sharedAll + (long)i * embDimL; | ||
| float* outp = hiddenAll + (long)i * embDimL; | ||
| for (int r = 0; r < embDimL; r++) | ||
| outp[r] = routed[r] + shared[r]; | ||
| }); |
There was a problem hiding this comment.
Using Parallel.For for combining the routed and shared expert outputs in BatchVerifyCpuMoe introduces unnecessary overhead. Since k represents the draft verification batch size (which is typically very small, e.g., 2 to 4 tokens), the overhead of Task Parallel Library (TPL) thread scheduling, context switching, and synchronization will likely exceed the cost of a simple sequential loop.
Replacing Parallel.For with a standard sequential for loop will avoid this overhead and improve performance for small batch sizes.
for (int i = 0; i < k; i++)
{
float* routed = routedAll + (long)i * embDimL;
float* shared = sharedAll + (long)i * embDimL;
float* outp = hiddenAll + (long)i * embDimL;
for (int r = 0; r < embDimL; r++)
outp[r] = routed[r] + shared[r];
}There was a problem hiding this comment.
Good catch — applied in bcc2a7a. Replaced the Parallel.For combine with a single flat sequential pass over the contiguous [k×embDim] buffers. For the draft verify batch (k=2–4) the TPL scheduling plus the per-call closure heap-allocation dwarf the few-thousand adds, and the allocation also violated this repo's hot-path no-managed-alloc rule. Bit-identical (independent element writes); the byte-parity test stays green.
The (routed + sharedScaled) combine ran Parallel.For over k, but k is the draft verify batch (2-4 tokens) — TPL scheduling plus the per-call closure heap-alloc dwarf the ~k*embDim adds, and the alloc violates the hot-path no-managed-alloc rule. Replace with one flat sequential pass over the contiguous [k*embDim] buffers. Bit-identical (independent element writes); parity test still green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements grouped-by-expert MoE verification batching for MTP on the CUDA-hybrid path, allowing selected expert weights to be read once across the draft chain. It includes a benchmark script, configuration toggles, and unit tests verifying bitwise parity with the sequential per-token path. The reviewer noted that using Parallel.For to combine routed and shared expert outputs is inefficient because the number of draft tokens is typically very small, and suggested replacing it with a sequential for loop to avoid scheduling overhead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| long combineCount = (long)k * embDim; | ||
| for (long r = 0; r < combineCount; r++) | ||
| _bHiddenAll[r] = _bRoutedAll[r] + _bSharedAll[r]; | ||
| _gpu.UploadInto(_gpuBvFfnAll!, (nint)_bHiddenAll, k * embDim); | ||
| _gpu.AddInPlace(_gpuBvFfnAll!, blockOut); | ||
| _gpu.CopyDeviceRegion(stream, 0, _gpuBvFfnAll!, 0, (long)k * embBytes); | ||
| } | ||
|
|
There was a problem hiding this comment.
Using Parallel.For for k iterations is highly inefficient here. In speculative decoding and MTP verification, k represents the number of draft tokens being verified, which is typically extremely small (usually between 2 and 4, and capped at MaxBatchVerifyTokens).
The overhead of thread dispatching, task scheduling, and context switching in Parallel.For will significantly exceed the actual work done in each iteration (which is just a simple element-wise addition of embDim floats). Replacing it with a standard sequential for loop avoids this overhead and improves performance.
for (int i = 0; i < k; i++)
{
float* routed = routedAll + (long)i * embDimL;
float* shared = sharedAll + (long)i * embDimL;
float* outp = hiddenAll + (long)i * embDimL;
for (int r = 0; r < embDimL; r++)
outp[r] = routed[r] + shared[r];
}There was a problem hiding this comment.
This is already addressed in the reviewed commit (bcc2a7a) — the diff hunk attached to this very comment shows the combine is now a flat sequential for (long r = 0; r < combineCount; r++) loop, not Parallel.For. No change needed; looks like a re-flag of the earlier (resolved) finding.
|
Warning Gemini encountered an error creating the review. You can try again by commenting |
Closes #210.
Problem (validated)
On routed-MoE MTP models,
CudaHybridGdnForwardPass.BatchVerifyran the routed-expert FFN per draft token (CpuMoeFfnCoreloop). The trunk and lm_head batched across the k-token verify, but each draft routes to a different top-K expert set, so the routed experts — the actual cost on this model — re-read their mmap'd weights k times with no amortization. On Qwen3.6-35B-A3B-MTP this left MTP-on decode at ~parity with MTP-off.Both of the issue's claims check out:
CudaHybridGdnForwardPass.BatchVerify(and the CPU siblingHybridGdnForwardPass.BatchVerify) loopCpuMoeFfnCore/MoeFfnCoreper token, andSupportsBatchVerifyadmits MoE only when_cpuMoe(the CPU-MoE config).Fix
New
BatchVerifyCpuMoe(layer, k, ...)routes the k draft tokens' routed FFN through the existing #110 group-by-expert core (BatchedRoutedExperts): each selected expert's gate/up/down rows are read once and dotted against every draft that routed to it. Router top-K is computed per token (host), then the k GPU shared experts are kicked async and overlap the host routed compute (mirroringCpuMoeFfnCore's per-token GPU-shared/CPU-routed overlap, in bulk), and(routed + sharedScaled) + residcombines.Bit-identical to the per-token loop it replaces — the greedy-verify accept/reject contract: same
DispatchDot/DispatchDotQ8Kkernels, same top-k accumulation order, same combine operand order.SHARPI_MTP_BATCHED_MOE_VERIFY=0reverts to the per-token loop for parity bisection.Results
Qwen3.6-35B-A3B-MTP Q4_K_M, RTX 4070 Ti, warm interleaved median-of-3 decode (
scripts/bench-210-ab.ps1):The shared expert stays a per-token GPU
MatVec(a GEMM-N batch would be faster but breaks the bit-parity contract the issue asks for — reuse of the bit-exact #110 routed core).Tests
BatchVerifyCpuMoe_BitwiseMatchesPerToken_Carnice— toggles onlySHARPI_MTP_BATCHED_MOE_VERIFY, runs an identical prefill + k-tokenBatchVerify, asserts byte-identical logits (SingleToInt32Bits) + greedy parity at every position on Carnice (Q3_K + Q8_0, the hardest Q8_KS path).CudaMtpBatchVerifyTests(dense 27B verify),Qwen35Mtpcoherence (35B MoE),BatchedPrefillparity (Carnice).Three review passes (deep bit-parity / code-reviewer / silent-failure-hunter) were clean; applied a toggle-snapshot hardening so the scratch-alloc gate and the FFN-branch select can't desync.
Scope / follow-up
This implements the CUDA-hybrid path (the acceptance target). The CPU-only sibling
HybridGdnForwardPass.BatchVerifyis intentionally left on its per-token loop: it has noBatchedRoutedExpertscore to reuse (a separate ~200-line port), and CPU MTP shows no measured speedup today. Deferred.🤖 Generated with Claude Code