perf(moe): grouped-by-expert MoE verify batching for MTP (#210) by pekkah · Pull Request #270 · pekkah/SharpInference

pekkah · 2026-06-17T11:30:32Z

Closes #210.

Problem (validated)

On routed-MoE MTP models, CudaHybridGdnForwardPass.BatchVerify ran the routed-expert FFN per draft token (CpuMoeFfnCore loop). The trunk and lm_head batched across the k-token verify, but each draft routes to a different top-K expert set, so the routed experts — the actual cost on this model — re-read their mmap'd weights k times with no amortization. On Qwen3.6-35B-A3B-MTP this left MTP-on decode at ~parity with MTP-off.

Both of the issue's claims check out: CudaHybridGdnForwardPass.BatchVerify (and the CPU sibling HybridGdnForwardPass.BatchVerify) loop CpuMoeFfnCore/MoeFfnCore per token, and SupportsBatchVerify admits MoE only when _cpuMoe is set (the CPU-MoE config).

Fix

New BatchVerifyCpuMoe(layer, k, ...) routes the k draft tokens' routed FFN through the existing #110 group-by-expert core (BatchedRoutedExperts): each selected expert's gate/up/down rows are read once and dotted against every draft that routed to it. Router top-K is computed per token on the host. The k GPU shared experts are then kicked off asynchronously so they overlap the host routed compute (mirroring CpuMoeFfnCore's per-token GPU-shared/CPU-routed overlap, in bulk), and the outputs are combined as (routed + sharedScaled) + resid.
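
To make the grouping concrete, here is a toy sketch in C#, not the actual BatchedRoutedExperts code. It assumes each expert is a single weight row and a token's output is a gate-weighted sum of dot products; the names GroupedExpertSketch, PerToken, Grouped and RowReads are hypothetical. It illustrates the two properties the description relies on: each selected expert row is fetched once per verify batch, and each token still accumulates its contributions in top-K slot order, so the result matches the per-token loop bit for bit.

```csharp
using System;
using System.Collections.Generic;

// Toy model (assumption): one weight row per expert, scalar output per token.
public static class GroupedExpertSketch
{
    public static int RowReads; // counts how often an expert row is fetched

    static float[] Fetch(float[][] experts, int e) { RowReads++; return experts[e]; }

    static float Dot(float[] row, float[] x)
    {
        float acc = 0f;
        for (int i = 0; i < row.Length; i++) acc += row[i] * x[i];
        return acc;
    }

    // Baseline: every token walks its own top-K list, so a shared expert is re-read.
    public static float[] PerToken(float[][] experts, float[][] x, int[][] topK, float[][] gate)
    {
        var result = new float[x.Length];
        for (int t = 0; t < x.Length; t++)
        {
            float acc = 0f;
            for (int s = 0; s < topK[t].Length; s++)
            {
                float contrib = gate[t][s] * Dot(Fetch(experts, topK[t][s]), x[t]);
                acc += contrib;
            }
            result[t] = acc;
        }
        return result;
    }

    // Grouped: fetch each selected expert once, apply it to every (token, slot) that chose it.
    public static float[] Grouped(float[][] experts, float[][] x, int[][] topK, float[][] gate)
    {
        var users = new SortedDictionary<int, List<(int Token, int Slot)>>();
        for (int t = 0; t < x.Length; t++)
            for (int s = 0; s < topK[t].Length; s++)
            {
                if (!users.TryGetValue(topK[t][s], out var list))
                    users[topK[t][s]] = list = new List<(int Token, int Slot)>();
                list.Add((t, s));
            }

        var contrib = new float[x.Length][];
        for (int t = 0; t < x.Length; t++) contrib[t] = new float[topK[t].Length];

        foreach (var entry in users)
        {
            float[] row = Fetch(experts, entry.Key); // read once for all users
            foreach (var (token, slot) in entry.Value)
                contrib[token][slot] = gate[token][slot] * Dot(row, x[token]);
        }

        // Accumulate per token in slot order, exactly like the baseline.
        var result = new float[x.Length];
        for (int t = 0; t < x.Length; t++)
        {
            float acc = 0f;
            for (int s = 0; s < contrib[t].Length; s++) acc += contrib[t][s];
            result[t] = acc;
        }
        return result;
    }
}
```

With three tokens that each pick two of three experts, the baseline fetches six rows and the grouped version fetches three, while both return the same bit patterns. The real core works on quantized gate/up/down matrices and is considerably more involved.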

Bit-identical to the per-token loop it replaces — the greedy-verify accept/reject contract: same DispatchDot/DispatchDotQ8K kernels, same top-k accumulation order, same combine operand order. SHARPI_MTP_BATCHED_MOE_VERIFY=0 reverts to the per-token loop for parity bisection.
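
The insistence on matching accumulation and operand order follows from floating-point addition not being associative. A minimal C# illustration, with made-up values and a hypothetical class name, unrelated to the real kernels:

```csharp
using System;

public static class OperandOrderSketch
{
    // Same three operands, two groupings. The explicit casts force each
    // intermediate sum to be rounded to single precision.
    public static float LeftToRight(float a, float b, float c) => (float)(a + b) + c;
    public static float RightFirst(float a, float b, float c) => a + (float)(b + c);

    // Bit-pattern equality, the kind of comparison a byte-parity check needs.
    public static bool SameBits(float x, float y) =>
        BitConverter.SingleToInt32Bits(x) == BitConverter.SingleToInt32Bits(y);
}
```

For a = 1e8f, b = -1e8f, c = 1f the first grouping gives 1 and the second gives 0, because b + c rounds back to -1e8 in single precision. A batched path that regrouped the combine could therefore change logits even when every kernel is unchanged.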

Results

Qwen3.6-35B-A3B-MTP Q4_K_M, RTX 4070 Ti, warm interleaved median-of-3 decode (scripts/bench-210-ab.ps1):

cell	decode t/s	vs baseline
MTP off (baseline)	26.9	1.00×
MTP per-token verify	27.8	1.03×
MTP batched verify (#210)	30.9	1.149×

+11% over per-token verify (1.112×) — the direct, robust measure of this change.
1.149× over MTP-off baseline — at the issue's 1.15× acceptance threshold (within bench noise; was ~parity before). 100% MTP acceptance, coherent output.
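
For reference, the quoted ratios follow directly from the decode rates in the table:

```latex
\begin{align*}
\frac{30.9}{27.8} &\approx 1.112 && \text{(batched vs per-token verify, about } +11\%\text{)}\\
\frac{30.9}{26.9} &\approx 1.149 && \text{(batched verify vs MTP off)}\\
\frac{27.8}{26.9} &\approx 1.033 && \text{(per-token verify vs MTP off)}
\end{align*}
```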

The shared expert stays a per-token GPU MatVec (a GEMM-N batch would be faster but breaks the bit-parity contract the issue asks for — reuse of the bit-exact #110 routed core).

Tests

New BatchVerifyCpuMoe_BitwiseMatchesPerToken_Carnice — toggles only SHARPI_MTP_BATCHED_MOE_VERIFY, runs an identical prefill + k-token BatchVerify, asserts byte-identical logits (SingleToInt32Bits) + greedy parity at every position on Carnice (Q3_K + Q8_0, the hardest Q8_KS path).
Regression green: CudaMtpBatchVerifyTests (dense 27B verify), Qwen35Mtp coherence (35B MoE), BatchedPrefill parity (Carnice).
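
A byte-identity assertion of this kind can be sketched as follows. This is an assumed helper, not the test in the PR; the class and method names are hypothetical. It compares bit patterns via BitConverter.SingleToInt32Bits rather than using ==, since == treats +0 and -0 as equal and never treats NaN as equal to itself.

```csharp
using System;

public static class BitwiseCompareSketch
{
    // Returns the index of the first element whose bit pattern differs,
    // or -1 when both spans are byte-identical.
    public static int FirstBitMismatch(ReadOnlySpan<float> expected, ReadOnlySpan<float> actual)
    {
        if (expected.Length != actual.Length)
            throw new ArgumentException("length mismatch");
        for (int i = 0; i < expected.Length; i++)
        {
            int e = BitConverter.SingleToInt32Bits(expected[i]);
            int a = BitConverter.SingleToInt32Bits(actual[i]);
            if (e != a) return i;
        }
        return -1;
    }
}
```

Returning the first mismatching index rather than a bool makes a failing run easier to bisect, since the position points at the token and logit that diverged.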

Three review passes (deep bit-parity / code-reviewer / silent-failure-hunter) were clean; applied a toggle-snapshot hardening so the scratch-alloc gate and the FFN-branch select can't desync.
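
The toggle-snapshot idea can be sketched like this. It is an illustration under assumptions, not the code in the PR: the class name, the members, and the rule that only the literal value 0 disables batching are hypothetical (the description states only that =0 reverts to the per-token loop). The point is that the environment variable is read once, and both the scratch allocation and the branch selection consult that single snapshot.

```csharp
using System;

public sealed class VerifyToggleSketch
{
    // Snapshot taken once at construction; later environment changes
    // cannot alter this instance.
    private readonly bool _batched;
    private readonly float[] _scratch;

    public VerifyToggleSketch(int k, int embDim)
    {
        // Assumption: only "0" disables the batched path.
        string value = Environment.GetEnvironmentVariable("SHARPI_MTP_BATCHED_MOE_VERIFY");
        _batched = value != "0";

        // The allocation gate uses the snapshot...
        _scratch = _batched ? new float[k * embDim] : Array.Empty<float>();
    }

    public bool HasScratch => _scratch.Length > 0;

    // ...and so does the branch select, so the batched branch can never
    // run without its scratch buffers.
    public string SelectBranch() => _batched ? "batched" : "per-token";
}
```

If each decision re-read the environment instead, a change between the two reads could select the batched branch with no scratch allocated.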

Scope / follow-up

This implements the CUDA-hybrid path (the acceptance target). The CPU-only sibling HybridGdnForwardPass.BatchVerify is intentionally left on its per-token loop: it has no BatchedRoutedExperts core to reuse (a separate ~200-line port), and CPU MTP shows no measured speedup today. Deferred.

🤖 Generated with Claude Code

CudaHybridGdnForwardPass.BatchVerify ran the routed-expert FFN per draft token (a CpuMoeFfnCore loop), so on routed-MoE MTP models (Qwen3.6-35B-A3B-MTP) the trunk + lm_head batched but the routed experts (the actual cost) re-read their mmap'd weights k times. MTP-on decode sat at ~parity with MTP-off.

New BatchVerifyCpuMoe routes the k draft tokens' routed FFN through the existing #110 group-by-expert core (BatchedRoutedExperts): each selected expert's gate/up/down rows are read once and dotted against every draft that routed to it. The k GPU shared experts are kicked async and overlap the host routed compute (mirroring CpuMoeFfnCore's per-token GPU-shared/CPU-routed overlap, in bulk).

Bit-identical to the per-token loop it replaces (the greedy-verify contract): same DispatchDot/DispatchDotQ8K kernels, same top-k accumulation order, same (routed + sharedScaled) + resid operand order. SHARPI_MTP_BATCHED_MOE_VERIFY=0 reverts to the per-token loop for parity bisection.

Qwen3.6-35B-A3B-MTP Q4_K_M, RTX 4070 Ti, warm interleaved median-of-3 decode:
MTP off (baseline) 26.9 t/s
MTP per-token verify 27.8 t/s (1.03x)
MTP batched verify 30.9 t/s (1.15x baseline, +11% over per-token)
100% MTP acceptance, coherent output.

The CPU-only sibling HybridGdnForwardPass.BatchVerify is left on its per-token loop: it has no BatchedRoutedExperts core to reuse (a separate port) and CPU MTP shows no measured speedup today. Deferred as a follow-up.

Tests: BatchVerifyCpuMoe_BitwiseMatchesPerToken_Carnice asserts byte-identical logits + greedy parity at every position (Carnice Q3_K/Q8_0, the Q8_KS path). Regression green: CudaMtpBatchVerifyTests (dense 27B verify), Qwen35Mtp coherence (35B MoE), BatchedPrefill parity.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah · 2026-06-17T11:31:06Z

/gemini review

gemini-code-assist · 2026-06-17T11:32:41Z

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

pekkah · 2026-06-17T11:33:03Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces grouped-by-expert MoE verification batching for Multi-Token Prediction (MTP) on the CUDA-hybrid path, grouping draft tokens by selected expert to optimize expert weight reads. It also includes a benchmark script and a bitwise parity unit test. The feedback suggests replacing Parallel.For with a sequential loop when combining routed and shared expert outputs, as the small draft batch size (typically 2 to 4 tokens) makes the overhead of parallel scheduling counterproductive.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-17T11:35:06Z

+        Parallel.For(0, k, s_moeParallelOpts, i =>
+        {
+            float* routed = routedAll + (long)i * embDimL;
+            float* shared = sharedAll + (long)i * embDimL;
+            float* outp = hiddenAll + (long)i * embDimL;
+            for (int r = 0; r < embDimL; r++)
+                outp[r] = routed[r] + shared[r];
+        });


Using Parallel.For for combining the routed and shared expert outputs in BatchVerifyCpuMoe introduces unnecessary overhead. Since k represents the draft verification batch size (which is typically very small, e.g., 2 to 4 tokens), the overhead of Task Parallel Library (TPL) thread scheduling, context switching, and synchronization will likely exceed the cost of a simple sequential loop.

Replacing Parallel.For with a standard sequential for loop will avoid this overhead and improve performance for small batch sizes.

for (int i = 0; i < k; i++)
{
    float* routed = routedAll + (long)i * embDimL;
    float* shared = sharedAll + (long)i * embDimL;
    float* outp = hiddenAll + (long)i * embDimL;
    for (int r = 0; r < embDimL; r++)
        outp[r] = routed[r] + shared[r];
}

Good catch — applied in bcc2a7a. Replaced the Parallel.For combine with a single flat sequential pass over the contiguous [k×embDim] buffers. For the draft verify batch (k=2–4) the TPL scheduling plus the per-call closure heap-allocation dwarf the few-thousand adds, and the allocation also violated this repo's hot-path no-managed-alloc rule. Bit-identical (independent element writes); the byte-parity test stays green.

The (routed + sharedScaled) combine ran Parallel.For over k, but k is the draft verify batch (2-4 tokens): TPL scheduling plus the per-call closure heap-alloc dwarf the ~k*embDim adds, and the alloc violates the hot-path no-managed-alloc rule. Replace with one flat sequential pass over the contiguous [k*embDim] buffers. Bit-identical (independent element writes); parity test still green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah · 2026-06-17T11:37:13Z

/gemini review

gemini-code-assist

Code Review

This pull request implements grouped-by-expert MoE verification batching for MTP on the CUDA-hybrid path, allowing selected expert weights to be read once across the draft chain. It includes a benchmark script, configuration toggles, and unit tests verifying bitwise parity with the sequential per-token path. The reviewer noted that using Parallel.For to combine routed and shared expert outputs is inefficient because the number of draft tokens is typically very small, and suggested replacing it with a sequential for loop to avoid scheduling overhead.

gemini-code-assist · 2026-06-17T11:37:26Z

+        long combineCount = (long)k * embDim;
+        for (long r = 0; r < combineCount; r++)
+            _bHiddenAll[r] = _bRoutedAll[r] + _bSharedAll[r];
+        _gpu.UploadInto(_gpuBvFfnAll!, (nint)_bHiddenAll, k * embDim);
+        _gpu.AddInPlace(_gpuBvFfnAll!, blockOut);
+        _gpu.CopyDeviceRegion(stream, 0, _gpuBvFfnAll!, 0, (long)k * embBytes);
+    }
+


Using Parallel.For for k iterations is highly inefficient here. In speculative decoding and MTP verification, k represents the number of draft tokens being verified, which is typically extremely small (usually between 2 and 4, and capped at MaxBatchVerifyTokens).

The overhead of thread dispatching, task scheduling, and context switching in Parallel.For will significantly exceed the actual work done in each iteration (which is just a simple element-wise addition of embDim floats). Replacing it with a standard sequential for loop avoids this overhead and improves performance.

for (int i = 0; i < k; i++)
{
    float* routed = routedAll + (long)i * embDimL;
    float* shared = sharedAll + (long)i * embDimL;
    float* outp = hiddenAll + (long)i * embDimL;
    for (int r = 0; r < embDimL; r++)
        outp[r] = routed[r] + shared[r];
}

This is already addressed in the reviewed commit (bcc2a7a) — the diff hunk attached to this very comment shows the combine is now a flat sequential for (long r = 0; r < combineCount; r++) loop, not Parallel.For. No change needed; looks like a re-flag of the earlier (resolved) finding.

gemini-code-assist · 2026-06-17T11:40:16Z

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

gemini-code-assist Bot reviewed Jun 17, 2026

pekkah merged commit c835073 into master Jun 17, 2026
1 check passed

pekkah deleted the perf/210-moe-verify-batching branch June 17, 2026 12:20

pekkah mentioned this pull request Jun 17, 2026

perf(moe): grouped-by-expert MoE verify batching for the pure-CPU GDN path (HybridGdnForwardPass) #271

Open

2 tasks

pekkah merged 2 commits into master from perf/210-moe-verify-batching
