Skip to content

perf(cuda): batched FFN/MoE prefill (#121), SnapKV batched SDPA (#122), Coder-30B batched trunk (#123)#128

Merged
pekkah merged 3 commits into
masterfrom
perf/cuda-batched-prefill-121-123
Jun 4, 2026
Merged

perf(cuda): batched FFN/MoE prefill (#121), SnapKV batched SDPA (#122), Coder-30B batched trunk (#123)#128
pekkah merged 3 commits into
masterfrom
perf/cuda-batched-prefill-121-123

Conversation

@pekkah

@pekkah pekkah commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Three follow-ups to #119 (PR #120), each bit-identical to its sequential reference and validated on an RTX 4070 Ti.

#121 — batch the FFN/MoE stage of the dense + GPU-SLRU GDN-hybrid prefill

PrefillBatchedTrunkGpuFfn batched the trunk but ran FFN/MoE per token. Now:

  • Dense FFN gate/up/down as single GEMM-N launches over N (BatchedGpuDenseFfn).
  • GPU-SLRU routed experts grouped-by-expert (BatchedGpuMoeFfn): each cached expert loaded once and matmul'd against all its tokens, then a per-token top-k-ordered reduce that preserves the exact AddScaledInPlace accumulation order of the sequential GpuMoeFfn (shared expert added last) — so it stays byte-identical.
  • SHARPI_BATCHED_FFN gate; unsupported dtypes / CPU-dense-fallback layers stay on the per-token path.

#122 — batched-query SDPA + Q-capture for SnapKV-active GDN-hybrid prefill

SnapKV used to force the per-position loop (the only place Q was captured). Now AttnBlockBatched captures the trailing window [wStart,N) from qAll in a single batched CopyDeviceRegion, then runs the AttentionBatched / AttentionBatchedWave (#118) SDPA — byte-identical capture and attention output vs the per-position path.

#123 — batched-trunk prefill for CudaHybridForwardPass (Qwen3-Coder-30B)

The -g N layer-split pure-attention MoE hybrid had no batched prefill. New PrefillBatchedTrunk / GpuLayerBatchedTrunk batch the attention trunk of the GPU layers over N (norm/proj/RoPE/QK-norm/KV-append/SDPA shared+wave), run FFN/MoE per token, transfer the N hidden rows GPU→CPU, run the CPU layers per token, then output. Gated to supported configs (non-Gemma4, non-TQ, NEOX RoPE, no L2-QK-norm, no attn-bias, GPU-resident embedding, batchable dtypes) with per-token fallback; _faulted latch added.

Validation

All new/changed oracles assert bit-identical final-token logits vs the deterministic sequential Forward reference:

Reviewed per-issue (pr-review-toolkit code-reviewer) + a final pass (silent-failure-hunter / pr-test-analyzer / comment-analyzer); findings addressed (view-handle leak, anti-vacuity assertions, flag pinning, the missing SHARPI_BATCHED_FFN=0 oracle, comment accuracy).

Measured perf (warm-cache ~1K-ctx A/B, RTX 4070 Ti) + README

🤖 Generated with Claude Code

#122), Coder-30B batched trunk (#123)

Three follow-ups to #119/#120, each bit-identical to its sequential reference.

#121 — batch the FFN/MoE stage of PrefillBatchedTrunkGpuFfn (GDN-hybrid):
  - dense FFN gate/up/down as GEMM-N over N tokens (BatchedGpuDenseFfn)
  - GPU-SLRU routed experts grouped-by-expert (BatchedGpuMoeFfn), with a
    per-token top-k-ordered reduce that preserves the exact AddScaledInPlace
    accumulation order of the sequential GpuMoeFfn (shared expert added last)
  - SHARPI_BATCHED_FFN gate + per-token / unsupported-dtype fallback

#122 — let the batched/wave SDPA run with SnapKV active (AttnBlockBatched):
  capture the trailing-window Q rows [wStart,N) from qAll into _snapKvQCapture
  in a single batched CopyDeviceRegion, then run AttentionBatched/Wave. Byte-
  identical capture + SDPA vs the per-position path.

#123 — batched-trunk prefill for CudaHybridForwardPass (Coder-30B layer-split):
  new PrefillBatchedTrunk / GpuLayerBatchedTrunk batch the attention trunk of
  the GPU layers over N (norm/proj/RoPE/QK-norm/KV-append/SDPA shared+wave),
  per-token FFN/MoE, GPU->CPU N-row transfer + per-token CPU layers, then output.
  Gated (non-Gemma4, non-TQ, NEOX, no L2-QK-norm, no attn-bias, GPU-resident
  embedding, batchable dtypes) with per-token fallback; _faulted latch added.

Tests (all bit-identical vs sequential Forward on RTX 4070 Ti):
  - new BatchedFfn_BitwiseMatchesPerTokenFfn_Dense27BMtp (SHARPI_BATCHED_FFN=0 equiv)
  - new SnapKvActive_BatchedWave_BitwiseMatchesSequential_Over4096_Carnice
  - new CudaHybridBatchedPrefillTests (Coder-30B single / multi-chunk / >4096),
    asserting LastPrefillWasBatched so the batched path can't pass vacuously
  - GPU-SLRU oracle pins the batched flags on; existing GDN oracles still green

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements batched-trunk prompt prefill for pure-attention configurations (Issue #123), batched FFN/MoE stages (Issue #121), and batched attention with SnapKV active (Issue #122) to improve prefill throughput, along with comprehensive parity tests. The review feedback highlights critical performance bottlenecks, specifically the sequential host-side looping for top-k reduction and the synchronous weight downloading in the batched MoE FFN path, both of which should be optimized. Additionally, several instances of using _gpu.CopyDevice for host-device transfers violate CUDA specifications and should be replaced with proper Download or UploadInto operations.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +4095 to +4113
for (int i = 0; i < N; i++)
{
// Routed accumulation in top-k order, starting from zero (matches the
// sequential Clear(_gpuHidden) + per-k AddScaledInPlace).
_gpu.Clear(_gpuHidden);
for (int k = 0; k < na; k++)
{
long slotIdx = (long)i * na + k;
var partV = _gpu.View(downPartial, slotIdx * embDim, embDim);
try { _gpu.AddScaledInPlace(_gpuHidden, partV, weights[(long)i * na + k]); }
finally { _gpu.Free(partV); }
}
// Add the per-token shared expert (sequential adds it last).
var sharedV = _gpu.View(hiddenAll, (long)i * embDim, embDim);
try { _gpu.AddInPlace(_gpuHidden, sharedV); }
finally { _gpu.Free(sharedV); }
_gpu.CopyDeviceRegion(hiddenAll, (long)i * embDim * sizeof(float),
_gpuHidden, 0, (long)embDim * sizeof(float));
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Looping over the batch size 'N' on the host to perform the top-k reduction and shared expert addition introduces a massive performance bottleneck. For a batch size of 'N = 4096' and 'na = 8' active experts, this loop launches over 45,000 stream operations (including '_gpu.Clear', '_gpu.View', '_gpu.AddScaledInPlace', '_gpu.AddInPlace', and '_gpu.CopyDeviceRegion') sequentially from the host. The overhead of these P/Invoke and driver launches from C# will severely bottleneck the GPU, completely defeating the throughput benefits of the batched prefill.\n\nTo resolve this, implement a single custom batched reduction CUDA kernel that performs the top-k weighted sum and shared expert addition in parallel across all 'N' tokens on the GPU, rather than looping on the host.

// GpuMoeFfn: ffn_down @ (SiLU(gate@x) * (up@x)), then a per-token sigmoid scalar
// gate from ffn_gate_inp_shexp · x. Reads each token's norm into _gpuNormBuf.
_gpu.Download(normAll, new Span<float>(_bfNormReadback, (int)((long)N * embDim)));
_gpu.Download(_gpuWGateInpShexp[layer], new Span<float>(_hostQ, embDim)); // gate-inp weight (shared across tokens)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Downloading the constant weight tensor '_gpuWGateInpShexp[layer]' from the GPU to the CPU on every single 'BatchedGpuMoeFfn' call is highly inefficient. Because '_gpu.Download' is synchronous (it calls 'StreamSynchronize'), this forces the CPU to block and wait for the GPU to finish all preceding work, completely destroying CPU-GPU pipelining and introducing significant latency.\n\nSince these weights are constant and extremely small (only '_embDim' floats or ~8 KB per layer), they should be cached on the CPU during model construction and read directly from CPU memory, completely eliminating the synchronous '_gpu.Download' call during the forward pass.

Comment on lines +1121 to +1124
var pinnedAll = EnsurePinnedHiddenAll(N);
_gpu.CopyDevice(pinnedAll, stream); // device stream → pinned host (N × embDim)
_gpu.Synchronize();
float* hostAll = _gpu.MapPinned(pinnedAll);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using '_gpu.CopyDevice' (which internally calls 'cudaMemcpyAsync' with 'cudaMemcpyDeviceToDevice') to copy from a device stream to a pinned host tensor ('pinnedAll') is a spec violation in CUDA. While it may work on some platforms due to driver-level UVA mapping, it can fail with 'cudaErrorInvalidValue' or cause undefined behavior/performance degradation on systems where UVA is not fully symmetric or P2P access is restricted.\n\nInstead, use '_gpu.Download' with the mapped host pointer, which uses 'DeviceToHost' and synchronizes the stream automatically, allowing you to remove the manual '_gpu.Synchronize()' call.

            var pinnedAll = EnsurePinnedHiddenAll(N);
            float* hostAll = _gpu.MapPinned(pinnedAll);
            _gpu.Download(stream, (nint)hostAll, N * embDim);

Comment on lines +1155 to +1158
float* pinned = _gpu.MapPinned(_pinnedHidden);
new ReadOnlySpan<float>(_cpuHidden, embDim).CopyTo(new Span<float>(pinned, embDim));
_gpu.UnmapPinned(_pinnedHidden);
_gpu.CopyDevice(_gpuHidden, _pinnedHidden);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using '_gpu.CopyDevice' (which internally calls 'cudaMemcpyAsync' with 'cudaMemcpyDeviceToDevice') to copy from a pinned host tensor ('_pinnedHidden') to a device tensor ('_gpuHidden') is a spec violation in CUDA.\n\nInstead, use '_gpu.UploadInto' with the mapped host pointer, which correctly uses 'HostToDevice' copy.

            float* pinned = _gpu.MapPinned(_pinnedHidden);
            new ReadOnlySpan<float>(_cpuHidden, embDim).CopyTo(new Span<float>(pinned, embDim));
            _gpu.UploadInto(_gpuHidden, (nint)pinned, embDim);
            _gpu.UnmapPinned(_pinnedHidden);

Comment on lines +1168 to +1173
// No GPU output weight: compute the output on CPU from the last hidden row.
_gpu.CopyDevice(_pinnedHidden, _gpuHidden);
_gpu.Synchronize();
float* pinned = _gpu.MapPinned(_pinnedHidden);
new ReadOnlySpan<float>(pinned, embDim).CopyTo(new Span<float>(_cpuHidden, embDim));
_gpu.UnmapPinned(_pinnedHidden);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using '_gpu.CopyDevice' (which internally calls 'cudaMemcpyAsync' with 'cudaMemcpyDeviceToDevice') to copy from a device tensor ('_gpuHidden') to a pinned host tensor ('_pinnedHidden') is a spec violation in CUDA.\n\nInstead, use '_gpu.Download' with the mapped host pointer, which correctly uses 'DeviceToHost' copy and synchronizes the stream automatically, allowing you to remove the manual '_gpu.Synchronize()' call.

                // No GPU output weight: compute the output on CPU from the last hidden row.
                float* pinned = _gpu.MapPinned(_pinnedHidden);
                _gpu.Download(_gpuHidden, (nint)pinned, embDim);
                new ReadOnlySpan<float>(pinned, embDim).CopyTo(new Span<float>(_cpuHidden, embDim));
                _gpu.UnmapPinned(_pinnedHidden);

…121 FFN batching is prefill-neutral on the 27B-MTP rows

Measured warm-cache ~1K-ctx A/B on RTX 4070 Ti (bench-textgen harness):
- #123 Coder-30B CUDA hybrid: batched-trunk prefill lifts 25.2 → 30.1 t/s
  (+19%) vs the per-token Forward loop (SHARPI_BATCHED_PREFILL=0). Headline
  prefill column updated to 30.1; decode unchanged (FFN/MoE + CPU layers stay
  per token).
- #121 dense-FFN batching (27B-MTP Q4/Q5 CUDA): bit-identical but ~1K-prefill-
  neutral (8.5 ↔ 8.5, 5.6 ↔ 5.5) — 44/64 (Q4) / 51/64 (Q5) FFN layers are
  CPU-mmap and dominate prefill; the launch-collapse helps when more FFN is
  GPU-resident. Number unchanged; notes credit #121 for traceability.
- #122 SnapKV: no benchmarked table row uses SnapKV (long-context-only win);
  not added.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
pekkah added a commit that referenced this pull request Jun 4, 2026
Address the one valid finding from the PR #128 review (Gemini): BatchedGpuMoeFfn's
Phase-3 per-token reduce issues ~N·(na+2) small stream ops, undercutting the
grouped-GEMM win for large-N GPU-SLRU prefill. Correctness/bit-parity are
unaffected and it only gates the non-default SHARPI_CPU_MOE=0 path; a batched
weighted-scatter-reduce kernel is tracked in #129. Comment-only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address the one valid finding from the PR #128 review (Gemini): BatchedGpuMoeFfn's
Phase-3 per-token reduce issues ~N·(na+2) small stream ops, undercutting the
grouped-GEMM win for large-N GPU-SLRU prefill. Correctness/bit-parity are
unaffected and it only gates the non-default SHARPI_CPU_MOE=0 path; a batched
weighted-scatter-reduce kernel is tracked in #129. Comment-only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@pekkah pekkah force-pushed the perf/cuda-batched-prefill-121-123 branch from 7272f28 to 6c12e6a Compare June 4, 2026 10:31
@pekkah

pekkah commented Jun 4, 2026

Copy link
Copy Markdown
Owner Author

Thanks @gemini-code-assist — adjudicated each finding against the codebase (independently cross-checked with the repo's review toolkit). Summary:

1. Host-loop top-k reduce in BatchedGpuMoeFfn (high) — valid, deferred to #129.
Correct that the Phase-3 per-token reduce issues ~N·(na+2) small stream ops; for large-N prefill the launch overhead undercuts the grouped-GEMM win. It is bit-identical to the sequential GpuMoeFfn accumulation (load-bearing FP order), so this is a throughput optimization, not a correctness fix, and it only gates the non-default GPU-SLRU path (SHARPI_CPU_MOE=0) — the benchmarked 35B-A3B default routes MoE to CPU (BatchedRoutedExperts, already batched) and never enters this loop. A proper fix needs a new batched weighted-scatter-reduce NVRTC kernel (must preserve the k-order), so I've tracked it as #129 and added a code note pointing there rather than rushing a kernel into this PR.

2. Per-call Download(_gpuWGateInpShexp[layer], …) (high) — false positive.
The sequential reference this path must stay bit-identical to (GpuMoeFfn) downloads this same constant weight once per token; the batched version hoists it out of the per-token loop (once per layer-call, reused for all N). So #121 already improves this by ~N×. Caching at construction would help the sequential path equally and is out of scope here.

3. CopyDevice for host↔pinned transfers ×3 (medium) — false positive (established convention).
AllocatePinned registers the cudaMallocHost pointer as a UVA device pointer, and the existing per-token Forward/GpuLayer path already stages every _gpuHidden_pinnedHidden move via CopyGpuBuffer (== RecordComputeCopy == CopyDevice, D2D). The #123 code follows that exact UVA-dependent convention — it introduces no new dependency or risk, and the bit-parity oracles (CudaHybridBatchedPrefillTests, single/multi-chunk/>4096) pass. Switching #123 alone to Download/UploadInto would make it inconsistent with the rest of the class; a uniform migration would be a separate codebase-wide change.

No correctness/bit-parity issues — all parity oracles assert byte-identical logits vs the sequential reference and are green.

@gemini-code-assist

Copy link
Copy Markdown

Thanks for the detailed breakdown, @pekkah. Your assessment of the deferred items and the justification for the established conventions in the codebase are clear. Tracking the host-loop top-k reduce in #129 is the right approach to avoid blocking this PR, and the bit-parity validation confirms the integrity of the batched paths. I have no further concerns regarding these points.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant