Skip to content

perf(cuda,engine): batched GPU-SLRU MoE prefill (#129) + SnapKV×MTP gate (#130)#133

Merged
pekkah merged 4 commits into
masterfrom
perf/cuda-batched-prefill-followups-129-snapkv-mtp
Jun 4, 2026
Merged

perf(cuda,engine): batched GPU-SLRU MoE prefill (#129) + SnapKV×MTP gate (#130)#133
pekkah merged 4 commits into
masterfrom
perf/cuda-batched-prefill-followups-129-snapkv-mtp

Conversation

@pekkah

@pekkah pekkah commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Batched-prefill follow-ups from #128. Two independent changes + a README condensation, split into three commits.

#129 — batched GPU-SLRU MoE prefill reduce (perf(cuda))

Replaces BatchedGpuMoeFfn's two host-side per-token loops with batched ops:

  • Phase 1 (shared expert): batched GEMM-N gate/up/down + one llm_scale_rows_inplace for the per-token sigmoid gate.
  • Phase 3 (reduce): one new llm_moe_weighted_reduce kernel does the top-k weighted scatter-reduce + shared add over all N tokens, replacing ~N·(na+2) tiny stream ops.

Bit-identical to the prior host reduce — each routed term contracts to the same fmaf (NVRTC fmad=true) as add_scaled_inplace, the shexp scale is a separate rounded pass (never folded into an fma), and the plain shared-add matches add_inplace, routed-first/shared-last. Only the non-default SHARPI_CPU_MOE=0 GPU-SLRU path is affected.

Verified: new per-kernel oracle (MoeWeightedReduce/ScaleRowsInPlace, no model needed) + the 22 GB BatchedTrunkGpuFfn_BitwiseMatchesSequential_GpuSlruMoe end-to-end oracle.
Benchmark (Qwen3.6-35B-A3B, RTX 4070 Ti, SHARPI_CPU_MOE=0, ~1K ctx, warm, backend confirmed CUDA-hybrid GPU-SLRU): prefill 45.3 → 54.3 t/s (+20%), decode unchanged. Reproduce with scripts/bench-129-ab.ps1.

Closes #129.

#130 — gate MTP batched-verify off on a SnapKV-evicted cache (fix(engine))

BatchForward2 requires _kvCache.Length == startPos, but SnapKV eviction leaves Length at budget K while LogicalLength (RoPE position) stays at prompt length N — so MTP decode threw _kvCache.Length=128 != startPos=N. SupportsBatchVerify now also requires !KvCacheCompacted (Length != LogicalLength), so MtpDecoder falls back to the eviction-safe sequential path; defensive preconditions retained. The signal is false in all normal operation, so the working non-evicted batched path is unaffected. Coexisting batched-verify with eviction is the follow-up.

Tests: non-vacuous decode-after-eviction regression test (proven to fail without the gate), a no-eviction gate-false-positive guard, a PagedKvCache Length/LogicalLength invariant test, and a SHARPI_TRACE_MTP fallback trace line.

Closes #130.

README

Condensed the perf-table notes and prose 340 → 199 lines — kept every perf number, repo/size, the FastScan kernel table, and usage/flags; dropped the issue-by-issue implementation narrative.

Process

Implemented via the subagent workflow with a code-review gate between tasks and a 4-agent toolkit review at the end (silent-failure-hunter, comment-analyzer, pr-test-analyzer, code-reviewer). No defects; high-value findings applied (incl. the toolkit catching a factual error in a comment, now fixed). Full solution builds clean (0 warnings); the #129-only commit also builds standalone. All SnapKV/MTP/kernel tests + the GPU-SLRU oracle pass.

🤖 Generated with Claude Code

pekkah and others added 3 commits June 4, 2026 15:32
Replace BatchedGpuMoeFfn's two host-side per-token loops with batched ops:
- Phase 1 shared expert: batched GEMM-N gate/up/down + one llm_scale_rows_inplace
  for the per-token sigmoid gate (the CPU dot is kept for bit-parity).
- Phase 3 reduce: one llm_moe_weighted_reduce kernel does the top-k weighted
  scatter-reduce + shared add over all N tokens, replacing ~N*(na+2) tiny launches.

Bit-identical to the prior host reduce: each routed term contracts to the same
fmaf (NVRTC fmad=true) as add_scaled_inplace, the shexp scale is a separate rounded
pass (not folded into an fma), and the plain shared add matches add_inplace with
routed-first/shared-last ordering. Only the SHARPI_CPU_MOE=0 GPU-SLRU path is gated.

Verified by a new per-kernel oracle (MoeWeightedReduce/ScaleRowsInPlace bit-parity,
no model needed) and the 22 GB BatchedTrunkGpuFfn_BitwiseMatchesSequential_GpuSlruMoe
end-to-end oracle. A/B (Qwen3.6-35B-A3B, RTX 4070 Ti, SHARPI_CPU_MOE=0, ~1K ctx):
GPU-SLRU prefill 45.3 -> 54.3 t/s (+20%), decode unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
)

BatchForward2 (MTP N=2 batched verify) requires _kvCache.Length == startPos, but
SnapKV prefill eviction leaves Length at the budget K while LogicalLength (the RoPE
position) stays at the prompt length N. The first MTP decode iteration therefore threw
"_kvCache.Length=128 != startPos=N" on MTP models (e.g. Qwen3.6-27B-MTP).

Gate SupportsBatchVerify off when the cache is compacted (Length != LogicalLength) in
both hybrid passes, so MtpDecoder falls back to the eviction-safe sequential Forward
path. The defensive BatchForward2 preconditions are kept (clearer messages). Length !=
LogicalLength is false in all normal operation (IncrementPosition advances both;
TruncateTo/Reset keep them equal; only Compact diverges them), so the working
non-evicted batched path is unaffected. Coexisting batched-verify with eviction is the
#130 follow-up.

Adds a non-vacuous regression test (decode-after-eviction stays coherent; proven to
fail without the gate), a no-eviction gate-false-positive guard, a PagedKvCache
Length/LogicalLength invariant test, and a SHARPI_TRACE_MTP fallback trace line.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Trim the issue-by-issue implementation history from the perf-table Notes cells and the
deep-internals prose; keep every perf number, repo/size, the FastScan kernel table, and
the usage/flags. Document the #129 GPU-SLRU prefill win in the 35B-A3B row.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the GPU-SLRU MoE prefill path by replacing sequential host-loop stream operations with batched, single-launch CUDA kernels for row scaling and weighted reduction. It also addresses a regression (Issue #130) by disabling MTP batched-verify when the KV cache is compacted, safely falling back to sequential decoding. The review feedback suggests a valuable performance optimization: launching the new CUDA kernels with a 2D grid mapping instead of a 1D grid to eliminate expensive integer division and modulo operations on the GPU, while also avoiding potential integer overflow risks during host launch configuration.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +111 to +119
extern ""C"" __global__ void llm_scale_rows_inplace(
float* __restrict__ buf, const float* __restrict__ scales, int rows, int cols)
{
long idx = (long)blockIdx.x * blockDim.x + threadIdx.x;
long total = (long)rows * cols;
if (idx >= total) return;
int i = (int)(idx / cols);
buf[idx] *= scales[i];
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Integer division on GPUs is extremely slow. We can completely eliminate it by launching the kernel with a 2D grid mapping, where blockIdx.x maps to the column index and blockIdx.y maps to the row index.

extern "C" __global__ void llm_scale_rows_inplace(
    float* __restrict__ buf, const float* __restrict__ scales, int rows, int cols)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y;
    if (e >= cols || i >= rows) return;
    buf[(long)i * cols + e] *= scales[i];
}

Comment on lines +139 to +155
extern ""C"" __global__ void llm_moe_weighted_reduce(
const float* __restrict__ downPartial, const float* __restrict__ weights,
float* __restrict__ shared, int N, int na, int embDim)
{
long idx = (long)blockIdx.x * blockDim.x + threadIdx.x;
long total = (long)N * embDim;
if (idx >= total) return;
int i = (int)(idx / embDim);
int e = (int)(idx - (long)i * embDim);
float acc = 0.0f;
const float* w = weights + (long)i * na;
const float* p = downPartial + ((long)i * na) * embDim + e;
for (int k = 0; k < na; k++)
acc += p[(long)k * embDim] * w[k];
acc += shared[(long)i * embDim + e];
shared[(long)i * embDim + e] = acc;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Similarly, we can optimize llm_moe_weighted_reduce by launching it with a 2D grid mapping to avoid expensive integer division and modulo operations on the GPU.

extern "C" __global__ void llm_moe_weighted_reduce(
    const float* __restrict__ downPartial, const float* __restrict__ weights,
    float* __restrict__ shared, int N, int na, int embDim)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y;
    if (e >= embDim || i >= N) return;
    float acc = 0.0f;
    const float* w = weights + (long)i * na;
    const float* p = downPartial + ((long)i * na) * embDim + e;
    for (int k = 0; k < na; k++)
        acc += p[(long)k * embDim] * w[k];
    acc += shared[(long)i * embDim + e];
    shared[(long)i * embDim + e] = acc;
}

Comment on lines +3775 to +3787
public void ScaleRowsInPlace(Tensor buf, Tensor scales, int rows, int cols)
{
EnsureImageKernels();
if (!_imageKernelsAvailable)
throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels.");

long total = (long)rows * cols;
nint p0 = GetDevPtr(buf);
nint p1 = GetDevPtr(scales);
int p2 = rows, p3 = cols;
nint* args = stackalloc nint[4] { (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3) };
Launch1D(_scaleRowsKernel, checked((int)total), args);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To support the 2D grid mapping and avoid expensive integer division on the GPU, we should launch the kernel with a 2D grid. This also eliminates the potential integer overflow risk from checked((int)total).

    public void ScaleRowsInPlace(Tensor buf, Tensor scales, int rows, int cols)
    {
        EnsureImageKernels();
        if (!_imageKernelsAvailable)
            throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels.");

        nint p0 = GetDevPtr(buf);
        nint p1 = GetDevPtr(scales);
        int  p2 = rows, p3 = cols;
        nint* args = stackalloc nint[4] { (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3) };
        uint gridX = (uint)((cols + 255) / 256);
        uint gridY = (uint)rows;
        int r = NvrtcInterop.LaunchKernel(_scaleRowsKernel, gridX, gridY, 1, 256, 1, 1, 0, _stream, args, null);
        if (r != 0) throw new InvalidOperationException($"cuLaunchKernel(scale_rows) failed: {r}");
    }

Comment on lines +3802 to +3817
public void MoeWeightedReduce(Tensor downPartial, Tensor weights, Tensor shared,
int N, int na, int embDim)
{
EnsureImageKernels();
if (!_imageKernelsAvailable)
throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels.");

long total = (long)N * embDim;
nint p0 = GetDevPtr(downPartial);
nint p1 = GetDevPtr(weights);
nint p2 = GetDevPtr(shared);
int p3 = N, p4 = na, p5 = embDim;
nint* args = stackalloc nint[6]
{ (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3), (nint)(&p4), (nint)(&p5) };
Launch1D(_moeWeightedReduceKernel, checked((int)total), args);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

We should launch MoeWeightedReduce with a 2D grid mapping to avoid expensive integer division and modulo operations on the GPU, while also eliminating the overflow risk of checked((int)total).

    public void MoeWeightedReduce(Tensor downPartial, Tensor weights, Tensor shared,
                                  int N, int na, int embDim)
    {
        EnsureImageKernels();
        if (!_imageKernelsAvailable)
            throw new NotSupportedException("NVRTC is not available; cannot run CUDA image kernels.");

        nint p0 = GetDevPtr(downPartial);
        nint p1 = GetDevPtr(weights);
        nint p2 = GetDevPtr(shared);
        int  p3 = N, p4 = na, p5 = embDim;
        nint* args = stackalloc nint[6]
            { (nint)(&p0), (nint)(&p1), (nint)(&p2), (nint)(&p3), (nint)(&p4), (nint)(&p5) };
        uint gridX = (uint)((embDim + 255) / 256);
        uint gridY = (uint)N;
        int r = NvrtcInterop.LaunchKernel(_moeWeightedReduceKernel, gridX, gridY, 1, 256, 1, 1, 0, _stream, args, null);
        if (r != 0) throw new InvalidOperationException($"cuLaunchKernel(moe_weighted_reduce) failed: {r}");
    }

…eview)

Address the Gemini review on #133: launch llm_scale_rows_inplace and
llm_moe_weighted_reduce with a 2D grid (x = element/column, y = row/token) so the
kernels recover (i, e) from block/thread indices instead of a per-thread integer
divide+modulo over a flattened N*embDim index. Also drops the checked((int)total)
host cast (no flattened total) and adds a loud gridDim.y <= 65535 guard so an
oversized N/rows fails fast rather than silently skipping rows.

Bit-identical: only the index->(i,e) mapping changes; the per-term fmaf accumulation,
k-order, and plain shared-add are untouched. Re-verified by the per-kernel oracle
(MoeWeightedReduce/ScaleRowsInPlace) and the 22 GB GpuSlruMoe end-to-end oracle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah

pekkah commented Jun 4, 2026

Copy link
Copy Markdown
Owner Author

Thanks — applied the 2D-grid suggestion to both kernels in 75d219a. llm_scale_rows_inplace and llm_moe_weighted_reduce now launch with a 2D grid (x = element/column, y = row/token) and recover (i, e) directly from block/thread indices, dropping the per-thread integer divide/modulo. This also removes the checked((int)total) host cast; I added a loud gridDim.y <= 65535 guard so an oversized N/rows fails fast rather than silently skipping rows.

Bit-identical (only the index→(i,e) mapping changed; the per-term fmaf accumulation, k-order, and plain shared-add are untouched) — re-verified by the per-kernel oracle (MoeWeightedReduce_BitwiseMatchesSequentialReduce / ScaleRowsInPlace_BitwiseMatchesPerRowScaleInPlace) and the 22 GB BatchedTrunkGpuFfn_BitwiseMatchesSequential_GpuSlruMoe end-to-end oracle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant