perf(cuda): batched FFN/MoE prefill (#121), SnapKV batched SDPA (#122), Coder-30B batched trunk (#123) by pekkah · Pull Request #128 · pekkah/SharpInference

pekkah · 2026-06-04T08:29:08Z

Three follow-ups to #119 (PR #120), each bit-identical to its sequential reference and validated on an RTX 4070 Ti.

#121 — batch the FFN/MoE stage of the dense + GPU-SLRU GDN-hybrid prefill

PrefillBatchedTrunkGpuFfn batched the trunk but ran FFN/MoE per token. Now:

Dense FFN gate/up/down as single GEMM-N launches over N (BatchedGpuDenseFfn).
GPU-SLRU routed experts grouped-by-expert (BatchedGpuMoeFfn): each cached expert loaded once and matmul'd against all its tokens, then a per-token top-k-ordered reduce that preserves the exact AddScaledInPlace accumulation order of the sequential GpuMoeFfn (shared expert added last) — so it stays byte-identical.
SHARPI_BATCHED_FFN gate; unsupported dtypes / CPU-dense-fallback layers stay on the per-token path.

#122 — batched-query SDPA + Q-capture for SnapKV-active GDN-hybrid prefill

SnapKV used to force the per-position loop (the only place Q was captured). Now AttnBlockBatched captures the trailing window [wStart,N) from qAll in a single batched CopyDeviceRegion, then runs the AttentionBatched / AttentionBatchedWave (#118) SDPA — byte-identical capture and attention output vs the per-position path.

#123 — batched-trunk prefill for `CudaHybridForwardPass` (Qwen3-Coder-30B)

The -g N layer-split pure-attention MoE hybrid had no batched prefill. New PrefillBatchedTrunk / GpuLayerBatchedTrunk batch the attention trunk of the GPU layers over N (norm/proj/RoPE/QK-norm/KV-append/SDPA shared+wave), run FFN/MoE per token, transfer the N hidden rows GPU→CPU, run the CPU layers per token, then output. Gated to supported configs (non-Gemma4, non-TQ, NEOX RoPE, no L2-QK-norm, no attn-bias, GPU-resident embedding, batchable dtypes) with per-token fallback; _faulted latch added.

Validation

All new/changed oracles assert bit-identical final-token logits vs the deterministic sequential Forward reference:

BatchedFfn_BitwiseMatchesPerTokenFfn_Dense27BMtp — SHARPI_BATCHED_FFN=0 equivalence (perf(cuda): batch the FFN/MoE stage of the dense + GPU-SLRU GDN-hybrid prefill (#119 follow-up) #121).
SnapKvActive_BatchedWave_BitwiseMatchesSequential_Over4096_Carnice — >4096 wave + SnapKV capture (perf(cuda): batched-query SDPA + Q-capture for SnapKV-active GDN-hybrid prefill (#118 follow-up) #122).
new CudaHybridBatchedPrefillTests — Coder-30B single / multi-chunk / >4096, each asserting LastPrefillWasBatched so a gated-out config can't pass vacuously (perf(cuda): batched-trunk prefill for the pure-attention layer-split hybrid (CudaHybridForwardPass / Qwen3-Coder-30B) #123). Coder split: 29 GPU + 19 CPU layers.
GPU-SLRU MoE oracle pins the batched flags on; existing GDN oracles stay green (11/11).

Reviewed per-issue (pr-review-toolkit code-reviewer) + a final pass (silent-failure-hunter / pr-test-analyzer / comment-analyzer); findings addressed (view-handle leak, anti-vacuity assertions, flag pinning, the missing SHARPI_BATCHED_FFN=0 oracle, comment accuracy).

Measured perf (warm-cache ~1K-ctx A/B, RTX 4070 Ti) + README

perf(cuda): batched-trunk prefill for the pure-attention layer-split hybrid (CudaHybridForwardPass / Qwen3-Coder-30B) #123 Coder-30B CUDA hybrid: prefill 25.2 → 30.1 t/s (+19%) vs per-token Forward (SHARPI_BATCHED_PREFILL=0). README row updated to 30.1; decode unchanged (FFN/MoE + CPU layers stay per token).
perf(cuda): batch the FFN/MoE stage of the dense + GPU-SLRU GDN-hybrid prefill (#119 follow-up) #121 dense FFN (27B-MTP Q4/Q5 CUDA): prefill-neutral at ~1K (8.5↔8.5, 5.6↔5.5) — 44/64 (Q4) / 51/64 (Q5) FFN layers are CPU-mmap and dominate prefill; the launch-collapse pays off when more FFN is GPU-resident. README numbers unchanged; notes credit perf(cuda): batch the FFN/MoE stage of the dense + GPU-SLRU GDN-hybrid prefill (#119 follow-up) #121.
perf(cuda): batched-query SDPA + Q-capture for SnapKV-active GDN-hybrid prefill (#118 follow-up) #122 SnapKV: not in the perf table — no benchmarked row uses SnapKV (long-context-only win). (Aside: the ~1K SnapKV probe on 27B-MTP surfaced a pre-existing SnapKV+MTP-decode incompatibility — BatchForward2: _kvCache.Length=128 != startPos, the evicted cache vs MTP verify — unrelated to this PR; worth a separate issue.)

🤖 Generated with Claude Code

#122), Coder-30B batched trunk (#123) Three follow-ups to #119/#120, each bit-identical to its sequential reference. #121 — batch the FFN/MoE stage of PrefillBatchedTrunkGpuFfn (GDN-hybrid): - dense FFN gate/up/down as GEMM-N over N tokens (BatchedGpuDenseFfn) - GPU-SLRU routed experts grouped-by-expert (BatchedGpuMoeFfn), with a per-token top-k-ordered reduce that preserves the exact AddScaledInPlace accumulation order of the sequential GpuMoeFfn (shared expert added last) - SHARPI_BATCHED_FFN gate + per-token / unsupported-dtype fallback #122 — let the batched/wave SDPA run with SnapKV active (AttnBlockBatched): capture the trailing-window Q rows [wStart,N) from qAll into _snapKvQCapture in a single batched CopyDeviceRegion, then run AttentionBatched/Wave. Byte- identical capture + SDPA vs the per-position path. #123 — batched-trunk prefill for CudaHybridForwardPass (Coder-30B layer-split): new PrefillBatchedTrunk / GpuLayerBatchedTrunk batch the attention trunk of the GPU layers over N (norm/proj/RoPE/QK-norm/KV-append/SDPA shared+wave), per-token FFN/MoE, GPU->CPU N-row transfer + per-token CPU layers, then output. Gated (non-Gemma4, non-TQ, NEOX, no L2-QK-norm, no attn-bias, GPU-resident embedding, batchable dtypes) with per-token fallback; _faulted latch added. Tests (all bit-identical vs sequential Forward on RTX 4070 Ti): - new BatchedFfn_BitwiseMatchesPerTokenFfn_Dense27BMtp (SHARPI_BATCHED_FFN=0 equiv) - new SnapKvActive_BatchedWave_BitwiseMatchesSequential_Over4096_Carnice - new CudaHybridBatchedPrefillTests (Coder-30B single / multi-chunk / >4096), asserting LastPrefillWasBatched so the batched path can't pass vacuously - GPU-SLRU oracle pins the batched flags on; existing GDN oracles still green Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements batched-trunk prompt prefill for pure-attention configurations (Issue #123), batched FFN/MoE stages (Issue #121), and batched attention with SnapKV active (Issue #122) to improve prefill throughput, along with comprehensive parity tests. The review feedback highlights critical performance bottlenecks, specifically the sequential host-side looping for top-k reduction and the synchronous weight downloading in the batched MoE FFN path, both of which should be optimized. Additionally, several instances of using _gpu.CopyDevice for host-device transfers violate CUDA specifications and should be replaced with proper Download or UploadInto operations.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-04T08:31:47Z

+        for (int i = 0; i < N; i++)
+        {
+            // Routed accumulation in top-k order, starting from zero (matches the
+            // sequential Clear(_gpuHidden) + per-k AddScaledInPlace).
+            _gpu.Clear(_gpuHidden);
+            for (int k = 0; k < na; k++)
+            {
+                long slotIdx = (long)i * na + k;
+                var partV = _gpu.View(downPartial, slotIdx * embDim, embDim);
+                try { _gpu.AddScaledInPlace(_gpuHidden, partV, weights[(long)i * na + k]); }
+                finally { _gpu.Free(partV); }
+            }
+            // Add the per-token shared expert (sequential adds it last).
+            var sharedV = _gpu.View(hiddenAll, (long)i * embDim, embDim);
+            try { _gpu.AddInPlace(_gpuHidden, sharedV); }
+            finally { _gpu.Free(sharedV); }
+            _gpu.CopyDeviceRegion(hiddenAll, (long)i * embDim * sizeof(float),
+                                  _gpuHidden, 0, (long)embDim * sizeof(float));
+        }


Looping over the batch size 'N' on the host to perform the top-k reduction and shared expert addition introduces a massive performance bottleneck. For a batch size of 'N = 4096' and 'na = 8' active experts, this loop launches over 45,000 stream operations (including '_gpu.Clear', '_gpu.View', '_gpu.AddScaledInPlace', '_gpu.AddInPlace', and '_gpu.CopyDeviceRegion') sequentially from the host. The overhead of these P/Invoke and driver launches from C# will severely bottleneck the GPU, completely defeating the throughput benefits of the batched prefill.\n\nTo resolve this, implement a single custom batched reduction CUDA kernel that performs the top-k weighted sum and shared expert addition in parallel across all 'N' tokens on the GPU, rather than looping on the host.

gemini-code-assist · 2026-06-04T08:31:47Z

+        //    GpuMoeFfn: ffn_down @ (SiLU(gate@x) * (up@x)), then a per-token sigmoid scalar
+        //    gate from ffn_gate_inp_shexp · x. Reads each token's norm into _gpuNormBuf.
+        _gpu.Download(normAll, new Span<float>(_bfNormReadback, (int)((long)N * embDim)));
+        _gpu.Download(_gpuWGateInpShexp[layer], new Span<float>(_hostQ, embDim)); // gate-inp weight (shared across tokens)


Downloading the constant weight tensor '_gpuWGateInpShexp[layer]' from the GPU to the CPU on every single 'BatchedGpuMoeFfn' call is highly inefficient. Because '_gpu.Download' is synchronous (it calls 'StreamSynchronize'), this forces the CPU to block and wait for the GPU to finish all preceding work, completely destroying CPU-GPU pipelining and introducing significant latency.\n\nSince these weights are constant and extremely small (only '_embDim' floats or ~8 KB per layer), they should be cached on the CPU during model construction and read directly from CPU memory, completely eliminating the synchronous '_gpu.Download' call during the forward pass.

gemini-code-assist · 2026-06-04T08:31:47Z

+            var pinnedAll = EnsurePinnedHiddenAll(N);
+            _gpu.CopyDevice(pinnedAll, stream);     // device stream → pinned host (N × embDim)
+            _gpu.Synchronize();
+            float* hostAll = _gpu.MapPinned(pinnedAll);


Using '_gpu.CopyDevice' (which internally calls 'cudaMemcpyAsync' with 'cudaMemcpyDeviceToDevice') to copy from a device stream to a pinned host tensor ('pinnedAll') is a spec violation in CUDA. While it may work on some platforms due to driver-level UVA mapping, it can fail with 'cudaErrorInvalidValue' or cause undefined behavior/performance degradation on systems where UVA is not fully symmetric or P2P access is restricted.\n\nInstead, use '_gpu.Download' with the mapped host pointer, which uses 'DeviceToHost' and synchronizes the stream automatically, allowing you to remove the manual '_gpu.Synchronize()' call.

var pinnedAll = EnsurePinnedHiddenAll(N); float* hostAll = _gpu.MapPinned(pinnedAll); _gpu.Download(stream, (nint)hostAll, N * embDim);

gemini-code-assist · 2026-06-04T08:31:47Z

+            float* pinned = _gpu.MapPinned(_pinnedHidden);
+            new ReadOnlySpan<float>(_cpuHidden, embDim).CopyTo(new Span<float>(pinned, embDim));
+            _gpu.UnmapPinned(_pinnedHidden);
+            _gpu.CopyDevice(_gpuHidden, _pinnedHidden);


Using '_gpu.CopyDevice' (which internally calls 'cudaMemcpyAsync' with 'cudaMemcpyDeviceToDevice') to copy from a pinned host tensor ('_pinnedHidden') to a device tensor ('_gpuHidden') is a spec violation in CUDA.\n\nInstead, use '_gpu.UploadInto' with the mapped host pointer, which correctly uses 'HostToDevice' copy.

float* pinned = _gpu.MapPinned(_pinnedHidden); new ReadOnlySpan<float>(_cpuHidden, embDim).CopyTo(new Span<float>(pinned, embDim)); _gpu.UploadInto(_gpuHidden, (nint)pinned, embDim); _gpu.UnmapPinned(_pinnedHidden);

gemini-code-assist · 2026-06-04T08:31:47Z

+                // No GPU output weight: compute the output on CPU from the last hidden row.
+                _gpu.CopyDevice(_pinnedHidden, _gpuHidden);
+                _gpu.Synchronize();
+                float* pinned = _gpu.MapPinned(_pinnedHidden);
+                new ReadOnlySpan<float>(pinned, embDim).CopyTo(new Span<float>(_cpuHidden, embDim));
+                _gpu.UnmapPinned(_pinnedHidden);


Using '_gpu.CopyDevice' (which internally calls 'cudaMemcpyAsync' with 'cudaMemcpyDeviceToDevice') to copy from a device tensor ('_gpuHidden') to a pinned host tensor ('_pinnedHidden') is a spec violation in CUDA.\n\nInstead, use '_gpu.Download' with the mapped host pointer, which correctly uses 'DeviceToHost' copy and synchronizes the stream automatically, allowing you to remove the manual '_gpu.Synchronize()' call.

// No GPU output weight: compute the output on CPU from the last hidden row. float* pinned = _gpu.MapPinned(_pinnedHidden); _gpu.Download(_gpuHidden, (nint)pinned, embDim); new ReadOnlySpan<float>(pinned, embDim).CopyTo(new Span<float>(_cpuHidden, embDim)); _gpu.UnmapPinned(_pinnedHidden);

…121 FFN batching is prefill-neutral on the 27B-MTP rows Measured warm-cache ~1K-ctx A/B on RTX 4070 Ti (bench-textgen harness): - #123 Coder-30B CUDA hybrid: batched-trunk prefill lifts 25.2 → 30.1 t/s (+19%) vs the per-token Forward loop (SHARPI_BATCHED_PREFILL=0). Headline prefill column updated to 30.1; decode unchanged (FFN/MoE + CPU layers stay per token). - #121 dense-FFN batching (27B-MTP Q4/Q5 CUDA): bit-identical but ~1K-prefill- neutral (8.5 ↔ 8.5, 5.6 ↔ 5.5) — 44/64 (Q4) / 51/64 (Q5) FFN layers are CPU-mmap and dominate prefill; the launch-collapse helps when more FFN is GPU-resident. Number unchanged; notes credit #121 for traceability. - #122 SnapKV: no benchmarked table row uses SnapKV (long-context-only win); not added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Address the one valid finding from the PR #128 review (Gemini): BatchedGpuMoeFfn's Phase-3 per-token reduce issues ~N·(na+2) small stream ops, undercutting the grouped-GEMM win for large-N GPU-SLRU prefill. Correctness/bit-parity are unaffected and it only gates the non-default SHARPI_CPU_MOE=0 path; a batched weighted-scatter-reduce kernel is tracked in #129. Comment-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah · 2026-06-04T10:31:22Z

Thanks @gemini-code-assist — adjudicated each finding against the codebase (independently cross-checked with the repo's review toolkit). Summary:

1. Host-loop top-k reduce in BatchedGpuMoeFfn (high) — valid, deferred to #129.
Correct that the Phase-3 per-token reduce issues ~N·(na+2) small stream ops; for large-N prefill the launch overhead undercuts the grouped-GEMM win. It is bit-identical to the sequential GpuMoeFfn accumulation (load-bearing FP order), so this is a throughput optimization, not a correctness fix, and it only gates the non-default GPU-SLRU path (SHARPI_CPU_MOE=0) — the benchmarked 35B-A3B default routes MoE to CPU (BatchedRoutedExperts, already batched) and never enters this loop. A proper fix needs a new batched weighted-scatter-reduce NVRTC kernel (must preserve the k-order), so I've tracked it as #129 and added a code note pointing there rather than rushing a kernel into this PR.

2. Per-call Download(_gpuWGateInpShexp[layer], …) (high) — false positive.
The sequential reference this path must stay bit-identical to (GpuMoeFfn) downloads this same constant weight once per token; the batched version hoists it out of the per-token loop (once per layer-call, reused for all N). So #121 already improves this by ~N×. Caching at construction would help the sequential path equally and is out of scope here.

3. CopyDevice for host↔pinned transfers ×3 (medium) — false positive (established convention).
AllocatePinned registers the cudaMallocHost pointer as a UVA device pointer, and the existing per-token Forward/GpuLayer path already stages every _gpuHidden↔_pinnedHidden move via CopyGpuBuffer (== RecordComputeCopy == CopyDevice, D2D). The #123 code follows that exact UVA-dependent convention — it introduces no new dependency or risk, and the bit-parity oracles (CudaHybridBatchedPrefillTests, single/multi-chunk/>4096) pass. Switching #123 alone to Download/UploadInto would make it inconsistent with the rest of the class; a uniform migration would be a separate codebase-wide change.

No correctness/bit-parity issues — all parity oracles assert byte-identical logits vs the sequential reference and are green.

gemini-code-assist · 2026-06-04T10:31:37Z

Thanks for the detailed breakdown, @pekkah. Your assessment of the deferred items and the justification for the established conventions in the codebase are clear. Tracking the host-loop top-k reduce in #129 is the right approach to avoid blocking this PR, and the bit-parity validation confirms the integrity of the batched paths. I have no further concerns regarding these points.

gemini-code-assist Bot reviewed Jun 4, 2026

View reviewed changes

pekkah mentioned this pull request Jun 4, 2026

perf(cuda): batched weighted scatter-reduce kernel for the GPU-SLRU MoE prefill (#121 follow-up) #129

Closed

pekkah force-pushed the perf/cuda-batched-prefill-121-123 branch from 7272f28 to 6c12e6a Compare June 4, 2026 10:31

pekkah merged commit 622ca4a into master Jun 4, 2026
1 check passed

pekkah deleted the perf/cuda-batched-prefill-121-123 branch June 4, 2026 10:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(cuda): batched FFN/MoE prefill (#121), SnapKV batched SDPA (#122), Coder-30B batched trunk (#123)#128

perf(cuda): batched FFN/MoE prefill (#121), SnapKV batched SDPA (#122), Coder-30B batched trunk (#123)#128
pekkah merged 3 commits into
masterfrom
perf/cuda-batched-prefill-121-123

pekkah commented Jun 4, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

pekkah commented Jun 4, 2026

Uh oh!

gemini-code-assist Bot commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pekkah commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

#121 — batch the FFN/MoE stage of the dense + GPU-SLRU GDN-hybrid prefill

#122 — batched-query SDPA + Q-capture for SnapKV-active GDN-hybrid prefill

#123 — batched-trunk prefill for CudaHybridForwardPass (Qwen3-Coder-30B)

Validation

Measured perf (warm-cache ~1K-ctx A/B, RTX 4070 Ti) + README

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

pekkah commented Jun 4, 2026

Uh oh!

gemini-code-assist Bot commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pekkah commented Jun 4, 2026 •

edited

Loading

#123 — batched-trunk prefill for `CudaHybridForwardPass` (Qwen3-Coder-30B)