perf(cuda,engine): batched GPU-SLRU MoE prefill (#129) + SnapKV×MTP gate (#130) by pekkah · Pull Request #133 · pekkah/SharpInference

pekkah · 2026-06-04T12:33:57Z

Batched-prefill follow-ups from #128. Two independent changes + a README condensation, split into three commits.

#129 — batched GPU-SLRU MoE prefill reduce (`perf(cuda)`)

Replaces BatchedGpuMoeFfn's two host-side per-token loops with batched ops:

Phase 1 (shared expert): batched GEMM-N gate/up/down + one llm_scale_rows_inplace for the per-token sigmoid gate.
Phase 3 (reduce): one new llm_moe_weighted_reduce kernel does the top-k weighted scatter-reduce + shared add over all N tokens, replacing ~N·(na+2) tiny stream ops.

Bit-identical to the prior host reduce — each routed term contracts to the same fmaf (NVRTC fmad=true) as add_scaled_inplace, the shexp scale is a separate rounded pass (never folded into an fma), and the plain shared-add matches add_inplace, routed-first/shared-last. Only the non-default SHARPI_CPU_MOE=0 GPU-SLRU path is affected.

Verified: new per-kernel oracle (MoeWeightedReduce/ScaleRowsInPlace, no model needed) + the 22 GB BatchedTrunkGpuFfn_BitwiseMatchesSequential_GpuSlruMoe end-to-end oracle.
Benchmark (Qwen3.6-35B-A3B, RTX 4070 Ti, SHARPI_CPU_MOE=0, ~1K ctx, warm, backend confirmed CUDA-hybrid GPU-SLRU): prefill 45.3 → 54.3 t/s (+20%), decode unchanged. Reproduce with scripts/bench-129-ab.ps1.

Closes #129.

#130 — gate MTP batched-verify off on a SnapKV-evicted cache (`fix(engine)`)

BatchForward2 requires _kvCache.Length == startPos, but SnapKV eviction leaves Length at budget K while LogicalLength (RoPE position) stays at prompt length N — so MTP decode threw _kvCache.Length=128 != startPos=N. SupportsBatchVerify now also requires !KvCacheCompacted (Length != LogicalLength), so MtpDecoder falls back to the eviction-safe sequential path; defensive preconditions retained. The signal is false in all normal operation, so the working non-evicted batched path is unaffected. Coexisting batched-verify with eviction is the follow-up.

Tests: non-vacuous decode-after-eviction regression test (proven to fail without the gate), a no-eviction gate-false-positive guard, a PagedKvCache Length/LogicalLength invariant test, and a SHARPI_TRACE_MTP fallback trace line.

Closes #130.

README

Condensed the perf-table notes and prose 340 → 199 lines — kept every perf number, repo/size, the FastScan kernel table, and usage/flags; dropped the issue-by-issue implementation narrative.

Process

Implemented via the subagent workflow with a code-review gate between tasks and a 4-agent toolkit review at the end (silent-failure-hunter, comment-analyzer, pr-test-analyzer, code-reviewer). No defects; high-value findings applied (incl. the toolkit catching a factual error in a comment, now fixed). Full solution builds clean (0 warnings); the #129-only commit also builds standalone. All SnapKV/MTP/kernel tests + the GPU-SLRU oracle pass.

🤖 Generated with Claude Code

Replace BatchedGpuMoeFfn's two host-side per-token loops with batched ops: - Phase 1 shared expert: batched GEMM-N gate/up/down + one llm_scale_rows_inplace for the per-token sigmoid gate (the CPU dot is kept for bit-parity). - Phase 3 reduce: one llm_moe_weighted_reduce kernel does the top-k weighted scatter-reduce + shared add over all N tokens, replacing ~N*(na+2) tiny launches. Bit-identical to the prior host reduce: each routed term contracts to the same fmaf (NVRTC fmad=true) as add_scaled_inplace, the shexp scale is a separate rounded pass (not folded into an fma), and the plain shared add matches add_inplace with routed-first/shared-last ordering. Only the SHARPI_CPU_MOE=0 GPU-SLRU path is gated. Verified by a new per-kernel oracle (MoeWeightedReduce/ScaleRowsInPlace bit-parity, no model needed) and the 22 GB BatchedTrunkGpuFfn_BitwiseMatchesSequential_GpuSlruMoe end-to-end oracle. A/B (Qwen3.6-35B-A3B, RTX 4070 Ti, SHARPI_CPU_MOE=0, ~1K ctx): GPU-SLRU prefill 45.3 -> 54.3 t/s (+20%), decode unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

) BatchForward2 (MTP N=2 batched verify) requires _kvCache.Length == startPos, but SnapKV prefill eviction leaves Length at the budget K while LogicalLength (the RoPE position) stays at the prompt length N. The first MTP decode iteration therefore threw "_kvCache.Length=128 != startPos=N" on MTP models (e.g. Qwen3.6-27B-MTP). Gate SupportsBatchVerify off when the cache is compacted (Length != LogicalLength) in both hybrid passes, so MtpDecoder falls back to the eviction-safe sequential Forward path. The defensive BatchForward2 preconditions are kept (clearer messages). Length != LogicalLength is false in all normal operation (IncrementPosition advances both; TruncateTo/Reset keep them equal; only Compact diverges them), so the working non-evicted batched path is unaffected. Coexisting batched-verify with eviction is the #130 follow-up. Adds a non-vacuous regression test (decode-after-eviction stays coherent; proven to fail without the gate), a no-eviction gate-false-positive guard, a PagedKvCache Length/LogicalLength invariant test, and a SHARPI_TRACE_MTP fallback trace line. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Trim the issue-by-issue implementation history from the perf-table Notes cells and the deep-internals prose; keep every perf number, repo/size, the FastScan kernel table, and the usage/flags. Document the #129 GPU-SLRU prefill win in the 35B-A3B row. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request optimizes the GPU-SLRU MoE prefill path by replacing sequential host-loop stream operations with batched, single-launch CUDA kernels for row scaling and weighted reduction. It also addresses a regression (Issue #130) by disabling MTP batched-verify when the KV cache is compacted, safely falling back to sequential decoding. The review feedback suggests a valuable performance optimization: launching the new CUDA kernels with a 2D grid mapping instead of a 1D grid to eliminate expensive integer division and modulo operations on the GPU, while also avoiding potential integer overflow risks during host launch configuration.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-04T12:35:58Z

+extern ""C"" __global__ void llm_scale_rows_inplace(
+    float* __restrict__ buf, const float* __restrict__ scales, int rows, int cols)
+{
+    long idx = (long)blockIdx.x * blockDim.x + threadIdx.x;
+    long total = (long)rows * cols;
+    if (idx >= total) return;
+    int i = (int)(idx / cols);
+    buf[idx] *= scales[i];
+}


Integer division on GPUs is extremely slow. We can completely eliminate it by launching the kernel with a 2D grid mapping, where blockIdx.x maps to the column index and blockIdx.y maps to the row index.

extern "C" __global__ void llm_scale_rows_inplace( float* __restrict__ buf, const float* __restrict__ scales, int rows, int cols) { int e = blockIdx.x * blockDim.x + threadIdx.x; int i = blockIdx.y; if (e >= cols || i >= rows) return; buf[(long)i * cols + e] *= scales[i]; }

gemini-code-assist · 2026-06-04T12:35:58Z

+extern ""C"" __global__ void llm_moe_weighted_reduce(
+    const float* __restrict__ downPartial, const float* __restrict__ weights,
+    float* __restrict__ shared, int N, int na, int embDim)
+{
+    long idx = (long)blockIdx.x * blockDim.x + threadIdx.x;
+    long total = (long)N * embDim;
+    if (idx >= total) return;
+    int i = (int)(idx / embDim);
+    int e = (int)(idx - (long)i * embDim);
+    float acc = 0.0f;
+    const float* w = weights + (long)i * na;
+    const float* p = downPartial + ((long)i * na) * embDim + e;
+    for (int k = 0; k < na; k++)
+        acc += p[(long)k * embDim] * w[k];
+    acc += shared[(long)i * embDim + e];
+    shared[(long)i * embDim + e] = acc;
+}


Similarly, we can optimize llm_moe_weighted_reduce by launching it with a 2D grid mapping to avoid expensive integer division and modulo operations on the GPU.

extern "C" __global__ void llm_moe_weighted_reduce( const float* __restrict__ downPartial, const float* __restrict__ weights, float* __restrict__ shared, int N, int na, int embDim) { int e = blockIdx.x * blockDim.x + threadIdx.x; int i = blockIdx.y; if (e >= embDim || i >= N) return; float acc = 0.0f; const float* w = weights + (long)i * na; const float* p = downPartial + ((long)i * na) * embDim + e; for (int k = 0; k < na; k++) acc += p[(long)k * embDim] * w[k]; acc += shared[(long)i * embDim + e]; shared[(long)i * embDim + e] = acc; }

gemini-code-assist · 2026-06-04T12:35:58Z

+    public void ScaleRowsInPlace(Tensor buf, Tensor scales, int rows, int cols)
+    {
+        EnsureImageKernels();
+        if (!_imageKernelsAvailable)
+            throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels.");
+
+        long total = (long)rows * cols;
+        nint p0 = GetDevPtr(buf);
+        nint p1 = GetDevPtr(scales);
+        int  p2 = rows, p3 = cols;
+        nint* args = stackalloc nint[4] { (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3) };
+        Launch1D(_scaleRowsKernel, checked((int)total), args);
+    }


To support the 2D grid mapping and avoid expensive integer division on the GPU, we should launch the kernel with a 2D grid. This also eliminates the potential integer overflow risk from checked((int)total).

public void ScaleRowsInPlace(Tensor buf, Tensor scales, int rows, int cols) { EnsureImageKernels(); if (!_imageKernelsAvailable) throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels."); nint p0 = GetDevPtr(buf); nint p1 = GetDevPtr(scales); int p2 = rows, p3 = cols; nint* args = stackalloc nint[4] { (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3) }; uint gridX = (uint)((cols + 255) / 256); uint gridY = (uint)rows; int r = NvrtcInterop.LaunchKernel(_scaleRowsKernel, gridX, gridY, 1, 256, 1, 1, 0, _stream, args, null); if (r != 0) throw new InvalidOperationException($"cuLaunchKernel(scale_rows) failed: {r}"); }

gemini-code-assist · 2026-06-04T12:35:58Z

+    public void MoeWeightedReduce(Tensor downPartial, Tensor weights, Tensor shared,
+                                  int N, int na, int embDim)
+    {
+        EnsureImageKernels();
+        if (!_imageKernelsAvailable)
+            throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels.");
+
+        long total = (long)N * embDim;
+        nint p0 = GetDevPtr(downPartial);
+        nint p1 = GetDevPtr(weights);
+        nint p2 = GetDevPtr(shared);
+        int  p3 = N, p4 = na, p5 = embDim;
+        nint* args = stackalloc nint[6]
+            { (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3), (nint)(&p4), (nint)(&p5) };
+        Launch1D(_moeWeightedReduceKernel, checked((int)total), args);
+    }


We should launch MoeWeightedReduce with a 2D grid mapping to avoid expensive integer division and modulo operations on the GPU, while also eliminating the overflow risk of checked((int)total).

public void MoeWeightedReduce(Tensor downPartial, Tensor weights, Tensor shared, int N, int na, int embDim) { EnsureImageKernels(); if (!_imageKernelsAvailable) throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels."); nint p0 = GetDevPtr(downPartial); nint p1 = GetDevPtr(weights); nint p2 = GetDevPtr(shared); int p3 = N, p4 = na, p5 = embDim; nint* args = stackalloc nint[6] { (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3), (nint)(&p4), (nint)(&p5) }; uint gridX = (uint)((embDim + 255) / 256); uint gridY = (uint)N; int r = NvrtcInterop.LaunchKernel(_moeWeightedReduceKernel, gridX, gridY, 1, 256, 1, 1, 0, _stream, args, null); if (r != 0) throw new InvalidOperationException($"cuLaunchKernel(moe_weighted_reduce) failed: {r}"); }

…eview) Address the Gemini review on #133: launch llm_scale_rows_inplace and llm_moe_weighted_reduce with a 2D grid (x = element/column, y = row/token) so the kernels recover (i, e) from block/thread indices instead of a per-thread integer divide+modulo over a flattened N*embDim index. Also drops the checked((int)total) host cast (no flattened total) and adds a loud gridDim.y <= 65535 guard so an oversized N/rows fails fast rather than silently skipping rows. Bit-identical: only the index->(i,e) mapping changes; the per-term fmaf accumulation, k-order, and plain shared-add are untouched. Re-verified by the per-kernel oracle (MoeWeightedReduce/ScaleRowsInPlace) and the 22 GB GpuSlruMoe end-to-end oracle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah · 2026-06-04T12:39:42Z

Thanks — applied the 2D-grid suggestion to both kernels in 75d219a. llm_scale_rows_inplace and llm_moe_weighted_reduce now launch with a 2D grid (x = element/column, y = row/token) and recover (i, e) directly from block/thread indices, dropping the per-thread integer divide/modulo. This also removes the checked((int)total) host cast; I added a loud gridDim.y <= 65535 guard so an oversized N/rows fails fast rather than silently skipping rows.

Bit-identical (only the index→(i,e) mapping changed; the per-term fmaf accumulation, k-order, and plain shared-add are untouched) — re-verified by the per-kernel oracle (MoeWeightedReduce_BitwiseMatchesSequentialReduce / ScaleRowsInPlace_BitwiseMatchesPerRowScaleInPlace) and the 22 GB BatchedTrunkGpuFfn_BitwiseMatchesSequential_GpuSlruMoe end-to-end oracle.

pekkah and others added 3 commits June 4, 2026 15:32

gemini-code-assist Bot reviewed Jun 4, 2026

View reviewed changes

pekkah merged commit e0aaf8f into master Jun 4, 2026
1 check passed

pekkah deleted the perf/cuda-batched-prefill-followups-129-snapkv-mtp branch June 4, 2026 12:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(cuda,engine): batched GPU-SLRU MoE prefill (#129) + SnapKV×MTP gate (#130)#133

perf(cuda,engine): batched GPU-SLRU MoE prefill (#129) + SnapKV×MTP gate (#130)#133
pekkah merged 4 commits into
masterfrom
perf/cuda-batched-prefill-followups-129-snapkv-mtp

pekkah commented Jun 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

pekkah commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pekkah commented Jun 4, 2026

#129 — batched GPU-SLRU MoE prefill reduce (perf(cuda))

#130 — gate MTP batched-verify off on a SnapKV-evicted cache (fix(engine))

README

Process

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

pekkah commented Jun 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

#129 — batched GPU-SLRU MoE prefill reduce (`perf(cuda)`)

#130 — gate MTP batched-verify off on a SnapKV-evicted cache (`fix(engine)`)