perf(cuda,engine): batched GPU-SLRU MoE prefill (#129) + SnapKV×MTP gate (#130)#133
Conversation
Replace BatchedGpuMoeFfn's two host-side per-token loops with batched ops: - Phase 1 shared expert: batched GEMM-N gate/up/down + one llm_scale_rows_inplace for the per-token sigmoid gate (the CPU dot is kept for bit-parity). - Phase 3 reduce: one llm_moe_weighted_reduce kernel does the top-k weighted scatter-reduce + shared add over all N tokens, replacing ~N*(na+2) tiny launches. Bit-identical to the prior host reduce: each routed term contracts to the same fmaf (NVRTC fmad=true) as add_scaled_inplace, the shexp scale is a separate rounded pass (not folded into an fma), and the plain shared add matches add_inplace with routed-first/shared-last ordering. Only the SHARPI_CPU_MOE=0 GPU-SLRU path is gated. Verified by a new per-kernel oracle (MoeWeightedReduce/ScaleRowsInPlace bit-parity, no model needed) and the 22 GB BatchedTrunkGpuFfn_BitwiseMatchesSequential_GpuSlruMoe end-to-end oracle. A/B (Qwen3.6-35B-A3B, RTX 4070 Ti, SHARPI_CPU_MOE=0, ~1K ctx): GPU-SLRU prefill 45.3 -> 54.3 t/s (+20%), decode unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
) BatchForward2 (MTP N=2 batched verify) requires _kvCache.Length == startPos, but SnapKV prefill eviction leaves Length at the budget K while LogicalLength (the RoPE position) stays at the prompt length N. The first MTP decode iteration therefore threw "_kvCache.Length=128 != startPos=N" on MTP models (e.g. Qwen3.6-27B-MTP). Gate SupportsBatchVerify off when the cache is compacted (Length != LogicalLength) in both hybrid passes, so MtpDecoder falls back to the eviction-safe sequential Forward path. The defensive BatchForward2 preconditions are kept (clearer messages). Length != LogicalLength is false in all normal operation (IncrementPosition advances both; TruncateTo/Reset keep them equal; only Compact diverges them), so the working non-evicted batched path is unaffected. Coexisting batched-verify with eviction is the #130 follow-up. Adds a non-vacuous regression test (decode-after-eviction stays coherent; proven to fail without the gate), a no-eviction gate-false-positive guard, a PagedKvCache Length/LogicalLength invariant test, and a SHARPI_TRACE_MTP fallback trace line. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Trim the issue-by-issue implementation history from the perf-table Notes cells and the deep-internals prose; keep every perf number, repo/size, the FastScan kernel table, and the usage/flags. Document the #129 GPU-SLRU prefill win in the 35B-A3B row. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request optimizes the GPU-SLRU MoE prefill path by replacing sequential host-loop stream operations with batched, single-launch CUDA kernels for row scaling and weighted reduction. It also addresses a regression (Issue #130) by disabling MTP batched-verify when the KV cache is compacted, safely falling back to sequential decoding. The review feedback suggests a valuable performance optimization: launching the new CUDA kernels with a 2D grid mapping instead of a 1D grid to eliminate expensive integer division and modulo operations on the GPU, while also avoiding potential integer overflow risks during host launch configuration.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| extern ""C"" __global__ void llm_scale_rows_inplace( | ||
| float* __restrict__ buf, const float* __restrict__ scales, int rows, int cols) | ||
| { | ||
| long idx = (long)blockIdx.x * blockDim.x + threadIdx.x; | ||
| long total = (long)rows * cols; | ||
| if (idx >= total) return; | ||
| int i = (int)(idx / cols); | ||
| buf[idx] *= scales[i]; | ||
| } |
There was a problem hiding this comment.
Integer division on GPUs is extremely slow. We can completely eliminate it by launching the kernel with a 2D grid mapping, where blockIdx.x maps to the column index and blockIdx.y maps to the row index.
extern "C" __global__ void llm_scale_rows_inplace(
float* __restrict__ buf, const float* __restrict__ scales, int rows, int cols)
{
int e = blockIdx.x * blockDim.x + threadIdx.x;
int i = blockIdx.y;
if (e >= cols || i >= rows) return;
buf[(long)i * cols + e] *= scales[i];
}| extern ""C"" __global__ void llm_moe_weighted_reduce( | ||
| const float* __restrict__ downPartial, const float* __restrict__ weights, | ||
| float* __restrict__ shared, int N, int na, int embDim) | ||
| { | ||
| long idx = (long)blockIdx.x * blockDim.x + threadIdx.x; | ||
| long total = (long)N * embDim; | ||
| if (idx >= total) return; | ||
| int i = (int)(idx / embDim); | ||
| int e = (int)(idx - (long)i * embDim); | ||
| float acc = 0.0f; | ||
| const float* w = weights + (long)i * na; | ||
| const float* p = downPartial + ((long)i * na) * embDim + e; | ||
| for (int k = 0; k < na; k++) | ||
| acc += p[(long)k * embDim] * w[k]; | ||
| acc += shared[(long)i * embDim + e]; | ||
| shared[(long)i * embDim + e] = acc; | ||
| } |
There was a problem hiding this comment.
Similarly, we can optimize llm_moe_weighted_reduce by launching it with a 2D grid mapping to avoid expensive integer division and modulo operations on the GPU.
extern "C" __global__ void llm_moe_weighted_reduce(
const float* __restrict__ downPartial, const float* __restrict__ weights,
float* __restrict__ shared, int N, int na, int embDim)
{
int e = blockIdx.x * blockDim.x + threadIdx.x;
int i = blockIdx.y;
if (e >= embDim || i >= N) return;
float acc = 0.0f;
const float* w = weights + (long)i * na;
const float* p = downPartial + ((long)i * na) * embDim + e;
for (int k = 0; k < na; k++)
acc += p[(long)k * embDim] * w[k];
acc += shared[(long)i * embDim + e];
shared[(long)i * embDim + e] = acc;
}| public void ScaleRowsInPlace(Tensor buf, Tensor scales, int rows, int cols) | ||
| { | ||
| EnsureImageKernels(); | ||
| if (!_imageKernelsAvailable) | ||
| throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels."); | ||
|
|
||
| long total = (long)rows * cols; | ||
| nint p0 = GetDevPtr(buf); | ||
| nint p1 = GetDevPtr(scales); | ||
| int p2 = rows, p3 = cols; | ||
| nint* args = stackalloc nint[4] { (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3) }; | ||
| Launch1D(_scaleRowsKernel, checked((int)total), args); | ||
| } |
There was a problem hiding this comment.
To support the 2D grid mapping and avoid expensive integer division on the GPU, we should launch the kernel with a 2D grid. This also eliminates the potential integer overflow risk from checked((int)total).
public void ScaleRowsInPlace(Tensor buf, Tensor scales, int rows, int cols)
{
EnsureImageKernels();
if (!_imageKernelsAvailable)
throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels.");
nint p0 = GetDevPtr(buf);
nint p1 = GetDevPtr(scales);
int p2 = rows, p3 = cols;
nint* args = stackalloc nint[4] { (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3) };
uint gridX = (uint)((cols + 255) / 256);
uint gridY = (uint)rows;
int r = NvrtcInterop.LaunchKernel(_scaleRowsKernel, gridX, gridY, 1, 256, 1, 1, 0, _stream, args, null);
if (r != 0) throw new InvalidOperationException($"cuLaunchKernel(scale_rows) failed: {r}");
}| public void MoeWeightedReduce(Tensor downPartial, Tensor weights, Tensor shared, | ||
| int N, int na, int embDim) | ||
| { | ||
| EnsureImageKernels(); | ||
| if (!_imageKernelsAvailable) | ||
| throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels."); | ||
|
|
||
| long total = (long)N * embDim; | ||
| nint p0 = GetDevPtr(downPartial); | ||
| nint p1 = GetDevPtr(weights); | ||
| nint p2 = GetDevPtr(shared); | ||
| int p3 = N, p4 = na, p5 = embDim; | ||
| nint* args = stackalloc nint[6] | ||
| { (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3), (nint)(&p4), (nint)(&p5) }; | ||
| Launch1D(_moeWeightedReduceKernel, checked((int)total), args); | ||
| } |
There was a problem hiding this comment.
We should launch MoeWeightedReduce with a 2D grid mapping to avoid expensive integer division and modulo operations on the GPU, while also eliminating the overflow risk of checked((int)total).
public void MoeWeightedReduce(Tensor downPartial, Tensor weights, Tensor shared,
int N, int na, int embDim)
{
EnsureImageKernels();
if (!_imageKernelsAvailable)
throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels.");
nint p0 = GetDevPtr(downPartial);
nint p1 = GetDevPtr(weights);
nint p2 = GetDevPtr(shared);
int p3 = N, p4 = na, p5 = embDim;
nint* args = stackalloc nint[6]
{ (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3), (nint)(&p4), (nint)(&p5) };
uint gridX = (uint)((embDim + 255) / 256);
uint gridY = (uint)N;
int r = NvrtcInterop.LaunchKernel(_moeWeightedReduceKernel, gridX, gridY, 1, 256, 1, 1, 0, _stream, args, null);
if (r != 0) throw new InvalidOperationException($"cuLaunchKernel(moe_weighted_reduce) failed: {r}");
}…eview) Address the Gemini review on #133: launch llm_scale_rows_inplace and llm_moe_weighted_reduce with a 2D grid (x = element/column, y = row/token) so the kernels recover (i, e) from block/thread indices instead of a per-thread integer divide+modulo over a flattened N*embDim index. Also drops the checked((int)total) host cast (no flattened total) and adds a loud gridDim.y <= 65535 guard so an oversized N/rows fails fast rather than silently skipping rows. Bit-identical: only the index->(i,e) mapping changes; the per-term fmaf accumulation, k-order, and plain shared-add are untouched. Re-verified by the per-kernel oracle (MoeWeightedReduce/ScaleRowsInPlace) and the 22 GB GpuSlruMoe end-to-end oracle. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks — applied the 2D-grid suggestion to both kernels in 75d219a. Bit-identical (only the index→(i,e) mapping changed; the per-term |
Batched-prefill follow-ups from #128. Two independent changes + a README condensation, split into three commits.
#129 — batched GPU-SLRU MoE prefill reduce (
perf(cuda))Replaces
BatchedGpuMoeFfn's two host-side per-token loops with batched ops:llm_scale_rows_inplacefor the per-token sigmoid gate.llm_moe_weighted_reducekernel does the top-k weighted scatter-reduce + shared add over all N tokens, replacing ~N·(na+2) tiny stream ops.Bit-identical to the prior host reduce — each routed term contracts to the same
fmaf(NVRTCfmad=true) asadd_scaled_inplace, the shexp scale is a separate rounded pass (never folded into an fma), and the plain shared-add matchesadd_inplace, routed-first/shared-last. Only the non-defaultSHARPI_CPU_MOE=0GPU-SLRU path is affected.Verified: new per-kernel oracle (
MoeWeightedReduce/ScaleRowsInPlace, no model needed) + the 22 GBBatchedTrunkGpuFfn_BitwiseMatchesSequential_GpuSlruMoeend-to-end oracle.Benchmark (Qwen3.6-35B-A3B, RTX 4070 Ti,
SHARPI_CPU_MOE=0, ~1K ctx, warm, backend confirmed CUDA-hybrid GPU-SLRU): prefill 45.3 → 54.3 t/s (+20%), decode unchanged. Reproduce withscripts/bench-129-ab.ps1.Closes #129.
#130 — gate MTP batched-verify off on a SnapKV-evicted cache (
fix(engine))BatchForward2requires_kvCache.Length == startPos, but SnapKV eviction leavesLengthat budget K whileLogicalLength(RoPE position) stays at prompt length N — so MTP decode threw_kvCache.Length=128 != startPos=N.SupportsBatchVerifynow also requires!KvCacheCompacted(Length != LogicalLength), soMtpDecoderfalls back to the eviction-safe sequential path; defensive preconditions retained. The signal is false in all normal operation, so the working non-evicted batched path is unaffected. Coexisting batched-verify with eviction is the follow-up.Tests: non-vacuous decode-after-eviction regression test (proven to fail without the gate), a no-eviction gate-false-positive guard, a
PagedKvCacheLength/LogicalLength invariant test, and aSHARPI_TRACE_MTPfallback trace line.Closes #130.
README
Condensed the perf-table notes and prose 340 → 199 lines — kept every perf number, repo/size, the FastScan kernel table, and usage/flags; dropped the issue-by-issue implementation narrative.
Process
Implemented via the subagent workflow with a code-review gate between tasks and a 4-agent toolkit review at the end (silent-failure-hunter, comment-analyzer, pr-test-analyzer, code-reviewer). No defects; high-value findings applied (incl. the toolkit catching a factual error in a comment, now fixed). Full solution builds clean (0 warnings); the #129-only commit also builds standalone. All SnapKV/MTP/kernel tests + the GPU-SLRU oracle pass.
🤖 Generated with Claude Code