Skip to content

perf(moe): grouped-by-expert MoE verify batching for MTP (#210)#270

Merged
pekkah merged 2 commits into
masterfrom
perf/210-moe-verify-batching
Jun 17, 2026
Merged

perf(moe): grouped-by-expert MoE verify batching for MTP (#210)#270
pekkah merged 2 commits into
masterfrom
perf/210-moe-verify-batching

Conversation

@pekkah

@pekkah pekkah commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Closes #210.

Problem (validated)

On routed-MoE MTP models, CudaHybridGdnForwardPass.BatchVerify ran the routed-expert FFN per draft token (CpuMoeFfnCore loop). The trunk and lm_head batched across the k-token verify, but each draft routes to a different top-K expert set, so the routed experts — the actual cost on this model — re-read their mmap'd weights k times with no amortization. On Qwen3.6-35B-A3B-MTP this left MTP-on decode at ~parity with MTP-off.

Both of the issue's claims check out: CudaHybridGdnForwardPass.BatchVerify (and the CPU sibling HybridGdnForwardPass.BatchVerify) loop CpuMoeFfnCore/MoeFfnCore per token, and SupportsBatchVerify admits MoE only when _cpuMoe (the CPU-MoE config).

Fix

New BatchVerifyCpuMoe(layer, k, ...) routes the k draft tokens' routed FFN through the existing #110 group-by-expert core (BatchedRoutedExperts): each selected expert's gate/up/down rows are read once and dotted against every draft that routed to it. Router top-K is computed per token (host), then the k GPU shared experts are kicked async and overlap the host routed compute (mirroring CpuMoeFfnCore's per-token GPU-shared/CPU-routed overlap, in bulk), and (routed + sharedScaled) + resid combines.

Bit-identical to the per-token loop it replaces — the greedy-verify accept/reject contract: same DispatchDot/DispatchDotQ8K kernels, same top-k accumulation order, same combine operand order. SHARPI_MTP_BATCHED_MOE_VERIFY=0 reverts to the per-token loop for parity bisection.

Results

Qwen3.6-35B-A3B-MTP Q4_K_M, RTX 4070 Ti, warm interleaved median-of-3 decode (scripts/bench-210-ab.ps1):

cell decode t/s vs baseline
MTP off (baseline) 26.9 1.00×
MTP per-token verify 27.8 1.03×
MTP batched verify (#210) 30.9 1.149×
  • +11% over per-token verify (1.112×) — the direct, robust measure of this change.
  • 1.149× over MTP-off baseline — at the issue's 1.15× acceptance threshold (within bench noise; was ~parity before). 100% MTP acceptance, coherent output.

The shared expert stays a per-token GPU MatVec (a GEMM-N batch would be faster but breaks the bit-parity contract the issue asks for — reuse of the bit-exact #110 routed core).

Tests

  • New BatchVerifyCpuMoe_BitwiseMatchesPerToken_Carnice — toggles only SHARPI_MTP_BATCHED_MOE_VERIFY, runs an identical prefill + k-token BatchVerify, asserts byte-identical logits (SingleToInt32Bits) + greedy parity at every position on Carnice (Q3_K + Q8_0, the hardest Q8_KS path).
  • Regression green: CudaMtpBatchVerifyTests (dense 27B verify), Qwen35Mtp coherence (35B MoE), BatchedPrefill parity (Carnice).

Three review passes (deep bit-parity / code-reviewer / silent-failure-hunter) were clean; applied a toggle-snapshot hardening so the scratch-alloc gate and the FFN-branch select can't desync.

Scope / follow-up

This implements the CUDA-hybrid path (the acceptance target). The CPU-only sibling HybridGdnForwardPass.BatchVerify is intentionally left on its per-token loop: it has no BatchedRoutedExperts core to reuse (a separate ~200-line port), and CPU MTP shows no measured speedup today. Deferred.

🤖 Generated with Claude Code

CudaHybridGdnForwardPass.BatchVerify ran the routed-expert FFN per draft
token (a CpuMoeFfnCore loop), so on routed-MoE MTP models (Qwen3.6-35B-A3B-MTP)
the trunk + lm_head batched but the routed experts — the actual cost — re-read
their mmap'd weights k times. MTP-on decode sat at ~parity with MTP-off.

New BatchVerifyCpuMoe routes the k draft tokens' routed FFN through the existing
#110 group-by-expert core (BatchedRoutedExperts): each selected expert's
gate/up/down rows are read once and dotted against every draft that routed to it.
The k GPU shared experts are kicked async and overlap the host routed compute
(mirroring CpuMoeFfnCore's per-token GPU-shared/CPU-routed overlap, in bulk).

Bit-identical to the per-token loop it replaces (the greedy-verify contract):
same DispatchDot/DispatchDotQ8K kernels, same top-k accumulation order, same
(routed + sharedScaled) + resid operand order. SHARPI_MTP_BATCHED_MOE_VERIFY=0
reverts to the per-token loop for parity bisection.

Qwen3.6-35B-A3B-MTP Q4_K_M, RTX 4070 Ti, warm interleaved median-of-3 decode:
  MTP off (baseline)     26.9 t/s
  MTP per-token verify   27.8 t/s  (1.03x)
  MTP batched verify     30.9 t/s  (1.15x baseline, +11% over per-token)
100% MTP acceptance, coherent output.

The CPU-only sibling HybridGdnForwardPass.BatchVerify is left on its per-token
loop: it has no BatchedRoutedExperts core to reuse (a separate port) and CPU MTP
shows no measured speedup today. Deferred as a follow-up.

Tests: BatchVerifyCpuMoe_BitwiseMatchesPerToken_Carnice asserts byte-identical
logits + greedy parity at every position (Carnice Q3_K/Q8_0, the Q8_KS path).
Regression green: CudaMtpBatchVerifyTests (dense 27B verify), Qwen35Mtp coherence
(35B MoE), BatchedPrefill parity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah

pekkah commented Jun 17, 2026

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist

Copy link
Copy Markdown

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@pekkah

pekkah commented Jun 17, 2026

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces grouped-by-expert MoE verification batching for Multi-Token Prediction (MTP) on the CUDA-hybrid path, grouping draft tokens by selected expert to optimize expert weight reads. It also includes a benchmark script and a bitwise parity unit test. The feedback suggests replacing Parallel.For with a sequential loop when combining routed and shared expert outputs, as the small draft batch size (typically 2 to 4 tokens) makes the overhead of parallel scheduling counterproductive.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +3923 to +3930
Parallel.For(0, k, s_moeParallelOpts, i =>
{
float* routed = routedAll + (long)i * embDimL;
float* shared = sharedAll + (long)i * embDimL;
float* outp = hiddenAll + (long)i * embDimL;
for (int r = 0; r < embDimL; r++)
outp[r] = routed[r] + shared[r];
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using Parallel.For for combining the routed and shared expert outputs in BatchVerifyCpuMoe introduces unnecessary overhead. Since k represents the draft verification batch size (which is typically very small, e.g., 2 to 4 tokens), the overhead of Task Parallel Library (TPL) thread scheduling, context switching, and synchronization will likely exceed the cost of a simple sequential loop.

Replacing Parallel.For with a standard sequential for loop will avoid this overhead and improve performance for small batch sizes.

        for (int i = 0; i < k; i++)
        {
            float* routed = routedAll + (long)i * embDimL;
            float* shared = sharedAll + (long)i * embDimL;
            float* outp = hiddenAll + (long)i * embDimL;
            for (int r = 0; r < embDimL; r++)
                outp[r] = routed[r] + shared[r];
        }

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — applied in bcc2a7a. Replaced the Parallel.For combine with a single flat sequential pass over the contiguous [k×embDim] buffers. For the draft verify batch (k=2–4) the TPL scheduling plus the per-call closure heap-allocation dwarf the few-thousand adds, and the allocation also violated this repo's hot-path no-managed-alloc rule. Bit-identical (independent element writes); the byte-parity test stays green.

The (routed + sharedScaled) combine ran Parallel.For over k, but k is the draft
verify batch (2-4 tokens) — TPL scheduling plus the per-call closure heap-alloc
dwarf the ~k*embDim adds, and the alloc violates the hot-path no-managed-alloc
rule. Replace with one flat sequential pass over the contiguous [k*embDim]
buffers. Bit-identical (independent element writes); parity test still green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah

pekkah commented Jun 17, 2026

Copy link
Copy Markdown
Owner Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements grouped-by-expert MoE verification batching for MTP on the CUDA-hybrid path, allowing selected expert weights to be read once across the draft chain. It includes a benchmark script, configuration toggles, and unit tests verifying bitwise parity with the sequential per-token path. The reviewer noted that using Parallel.For to combine routed and shared expert outputs is inefficient because the number of draft tokens is typically very small, and suggested replacing it with a sequential for loop to avoid scheduling overhead.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +3923 to +3930
long combineCount = (long)k * embDim;
for (long r = 0; r < combineCount; r++)
_bHiddenAll[r] = _bRoutedAll[r] + _bSharedAll[r];
_gpu.UploadInto(_gpuBvFfnAll!, (nint)_bHiddenAll, k * embDim);
_gpu.AddInPlace(_gpuBvFfnAll!, blockOut);
_gpu.CopyDeviceRegion(stream, 0, _gpuBvFfnAll!, 0, (long)k * embBytes);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using Parallel.For for k iterations is highly inefficient here. In speculative decoding and MTP verification, k represents the number of draft tokens being verified, which is typically extremely small (usually between 2 and 4, and capped at MaxBatchVerifyTokens).

The overhead of thread dispatching, task scheduling, and context switching in Parallel.For will significantly exceed the actual work done in each iteration (which is just a simple element-wise addition of embDim floats). Replacing it with a standard sequential for loop avoids this overhead and improves performance.

        for (int i = 0; i < k; i++)
        {
            float* routed = routedAll + (long)i * embDimL;
            float* shared = sharedAll + (long)i * embDimL;
            float* outp = hiddenAll + (long)i * embDimL;
            for (int r = 0; r < embDimL; r++)
                outp[r] = routed[r] + shared[r];
        }

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already addressed in the reviewed commit (bcc2a7a) — the diff hunk attached to this very comment shows the combine is now a flat sequential for (long r = 0; r < combineCount; r++) loop, not Parallel.For. No change needed; looks like a re-flag of the earlier (resolved) finding.

@gemini-code-assist

Copy link
Copy Markdown

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@pekkah pekkah merged commit c835073 into master Jun 17, 2026
1 check passed
@pekkah pekkah deleted the perf/210-moe-verify-batching branch June 17, 2026 12:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(moe): grouped-by-expert MoE verify batching - turn 35B-A3B-MTP parity into a win

1 participant