perf(cuda): batched FFN/MoE prefill (#121), SnapKV batched SDPA (#122), Coder-30B batched trunk (#123)#128
Conversation
#122), Coder-30B batched trunk (#123) Three follow-ups to #119/#120, each bit-identical to its sequential reference. #121 — batch the FFN/MoE stage of PrefillBatchedTrunkGpuFfn (GDN-hybrid): - dense FFN gate/up/down as GEMM-N over N tokens (BatchedGpuDenseFfn) - GPU-SLRU routed experts grouped-by-expert (BatchedGpuMoeFfn), with a per-token top-k-ordered reduce that preserves the exact AddScaledInPlace accumulation order of the sequential GpuMoeFfn (shared expert added last) - SHARPI_BATCHED_FFN gate + per-token / unsupported-dtype fallback #122 — let the batched/wave SDPA run with SnapKV active (AttnBlockBatched): capture the trailing-window Q rows [wStart,N) from qAll into _snapKvQCapture in a single batched CopyDeviceRegion, then run AttentionBatched/Wave. Byte- identical capture + SDPA vs the per-position path. #123 — batched-trunk prefill for CudaHybridForwardPass (Coder-30B layer-split): new PrefillBatchedTrunk / GpuLayerBatchedTrunk batch the attention trunk of the GPU layers over N (norm/proj/RoPE/QK-norm/KV-append/SDPA shared+wave), per-token FFN/MoE, GPU->CPU N-row transfer + per-token CPU layers, then output. Gated (non-Gemma4, non-TQ, NEOX, no L2-QK-norm, no attn-bias, GPU-resident embedding, batchable dtypes) with per-token fallback; _faulted latch added. Tests (all bit-identical vs sequential Forward on RTX 4070 Ti): - new BatchedFfn_BitwiseMatchesPerTokenFfn_Dense27BMtp (SHARPI_BATCHED_FFN=0 equiv) - new SnapKvActive_BatchedWave_BitwiseMatchesSequential_Over4096_Carnice - new CudaHybridBatchedPrefillTests (Coder-30B single / multi-chunk / >4096), asserting LastPrefillWasBatched so the batched path can't pass vacuously - GPU-SLRU oracle pins the batched flags on; existing GDN oracles still green Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request implements batched-trunk prompt prefill for pure-attention configurations (Issue #123), batched FFN/MoE stages (Issue #121), and batched attention with SnapKV active (Issue #122) to improve prefill throughput, along with comprehensive parity tests. The review feedback highlights critical performance bottlenecks, specifically the sequential host-side looping for top-k reduction and the synchronous weight downloading in the batched MoE FFN path, both of which should be optimized. Additionally, several instances of using _gpu.CopyDevice for host-device transfers violate CUDA specifications and should be replaced with proper Download or UploadInto operations.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| for (int i = 0; i < N; i++) | ||
| { | ||
| // Routed accumulation in top-k order, starting from zero (matches the | ||
| // sequential Clear(_gpuHidden) + per-k AddScaledInPlace). | ||
| _gpu.Clear(_gpuHidden); | ||
| for (int k = 0; k < na; k++) | ||
| { | ||
| long slotIdx = (long)i * na + k; | ||
| var partV = _gpu.View(downPartial, slotIdx * embDim, embDim); | ||
| try { _gpu.AddScaledInPlace(_gpuHidden, partV, weights[(long)i * na + k]); } | ||
| finally { _gpu.Free(partV); } | ||
| } | ||
| // Add the per-token shared expert (sequential adds it last). | ||
| var sharedV = _gpu.View(hiddenAll, (long)i * embDim, embDim); | ||
| try { _gpu.AddInPlace(_gpuHidden, sharedV); } | ||
| finally { _gpu.Free(sharedV); } | ||
| _gpu.CopyDeviceRegion(hiddenAll, (long)i * embDim * sizeof(float), | ||
| _gpuHidden, 0, (long)embDim * sizeof(float)); | ||
| } |
There was a problem hiding this comment.
Looping over the batch size 'N' on the host to perform the top-k reduction and shared expert addition introduces a massive performance bottleneck. For a batch size of 'N = 4096' and 'na = 8' active experts, this loop launches over 45,000 stream operations (including '_gpu.Clear', '_gpu.View', '_gpu.AddScaledInPlace', '_gpu.AddInPlace', and '_gpu.CopyDeviceRegion') sequentially from the host. The overhead of these P/Invoke and driver launches from C# will severely bottleneck the GPU, completely defeating the throughput benefits of the batched prefill.\n\nTo resolve this, implement a single custom batched reduction CUDA kernel that performs the top-k weighted sum and shared expert addition in parallel across all 'N' tokens on the GPU, rather than looping on the host.
| // GpuMoeFfn: ffn_down @ (SiLU(gate@x) * (up@x)), then a per-token sigmoid scalar | ||
| // gate from ffn_gate_inp_shexp · x. Reads each token's norm into _gpuNormBuf. | ||
| _gpu.Download(normAll, new Span<float>(_bfNormReadback, (int)((long)N * embDim))); | ||
| _gpu.Download(_gpuWGateInpShexp[layer], new Span<float>(_hostQ, embDim)); // gate-inp weight (shared across tokens) |
There was a problem hiding this comment.
Downloading the constant weight tensor '_gpuWGateInpShexp[layer]' from the GPU to the CPU on every single 'BatchedGpuMoeFfn' call is highly inefficient. Because '_gpu.Download' is synchronous (it calls 'StreamSynchronize'), this forces the CPU to block and wait for the GPU to finish all preceding work, completely destroying CPU-GPU pipelining and introducing significant latency.\n\nSince these weights are constant and extremely small (only '_embDim' floats or ~8 KB per layer), they should be cached on the CPU during model construction and read directly from CPU memory, completely eliminating the synchronous '_gpu.Download' call during the forward pass.
| var pinnedAll = EnsurePinnedHiddenAll(N); | ||
| _gpu.CopyDevice(pinnedAll, stream); // device stream → pinned host (N × embDim) | ||
| _gpu.Synchronize(); | ||
| float* hostAll = _gpu.MapPinned(pinnedAll); |
There was a problem hiding this comment.
Using '_gpu.CopyDevice' (which internally calls 'cudaMemcpyAsync' with 'cudaMemcpyDeviceToDevice') to copy from a device stream to a pinned host tensor ('pinnedAll') is a spec violation in CUDA. While it may work on some platforms due to driver-level UVA mapping, it can fail with 'cudaErrorInvalidValue' or cause undefined behavior/performance degradation on systems where UVA is not fully symmetric or P2P access is restricted.\n\nInstead, use '_gpu.Download' with the mapped host pointer, which uses 'DeviceToHost' and synchronizes the stream automatically, allowing you to remove the manual '_gpu.Synchronize()' call.
var pinnedAll = EnsurePinnedHiddenAll(N);
float* hostAll = _gpu.MapPinned(pinnedAll);
_gpu.Download(stream, (nint)hostAll, N * embDim);| float* pinned = _gpu.MapPinned(_pinnedHidden); | ||
| new ReadOnlySpan<float>(_cpuHidden, embDim).CopyTo(new Span<float>(pinned, embDim)); | ||
| _gpu.UnmapPinned(_pinnedHidden); | ||
| _gpu.CopyDevice(_gpuHidden, _pinnedHidden); |
There was a problem hiding this comment.
Using '_gpu.CopyDevice' (which internally calls 'cudaMemcpyAsync' with 'cudaMemcpyDeviceToDevice') to copy from a pinned host tensor ('_pinnedHidden') to a device tensor ('_gpuHidden') is a spec violation in CUDA.\n\nInstead, use '_gpu.UploadInto' with the mapped host pointer, which correctly uses 'HostToDevice' copy.
float* pinned = _gpu.MapPinned(_pinnedHidden);
new ReadOnlySpan<float>(_cpuHidden, embDim).CopyTo(new Span<float>(pinned, embDim));
_gpu.UploadInto(_gpuHidden, (nint)pinned, embDim);
_gpu.UnmapPinned(_pinnedHidden);| // No GPU output weight: compute the output on CPU from the last hidden row. | ||
| _gpu.CopyDevice(_pinnedHidden, _gpuHidden); | ||
| _gpu.Synchronize(); | ||
| float* pinned = _gpu.MapPinned(_pinnedHidden); | ||
| new ReadOnlySpan<float>(pinned, embDim).CopyTo(new Span<float>(_cpuHidden, embDim)); | ||
| _gpu.UnmapPinned(_pinnedHidden); |
There was a problem hiding this comment.
Using '_gpu.CopyDevice' (which internally calls 'cudaMemcpyAsync' with 'cudaMemcpyDeviceToDevice') to copy from a device tensor ('_gpuHidden') to a pinned host tensor ('_pinnedHidden') is a spec violation in CUDA.\n\nInstead, use '_gpu.Download' with the mapped host pointer, which correctly uses 'DeviceToHost' copy and synchronizes the stream automatically, allowing you to remove the manual '_gpu.Synchronize()' call.
// No GPU output weight: compute the output on CPU from the last hidden row.
float* pinned = _gpu.MapPinned(_pinnedHidden);
_gpu.Download(_gpuHidden, (nint)pinned, embDim);
new ReadOnlySpan<float>(pinned, embDim).CopyTo(new Span<float>(_cpuHidden, embDim));
_gpu.UnmapPinned(_pinnedHidden);…121 FFN batching is prefill-neutral on the 27B-MTP rows Measured warm-cache ~1K-ctx A/B on RTX 4070 Ti (bench-textgen harness): - #123 Coder-30B CUDA hybrid: batched-trunk prefill lifts 25.2 → 30.1 t/s (+19%) vs the per-token Forward loop (SHARPI_BATCHED_PREFILL=0). Headline prefill column updated to 30.1; decode unchanged (FFN/MoE + CPU layers stay per token). - #121 dense-FFN batching (27B-MTP Q4/Q5 CUDA): bit-identical but ~1K-prefill- neutral (8.5 ↔ 8.5, 5.6 ↔ 5.5) — 44/64 (Q4) / 51/64 (Q5) FFN layers are CPU-mmap and dominate prefill; the launch-collapse helps when more FFN is GPU-resident. Number unchanged; notes credit #121 for traceability. - #122 SnapKV: no benchmarked table row uses SnapKV (long-context-only win); not added. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address the one valid finding from the PR #128 review (Gemini): BatchedGpuMoeFfn's Phase-3 per-token reduce issues ~N·(na+2) small stream ops, undercutting the grouped-GEMM win for large-N GPU-SLRU prefill. Correctness/bit-parity are unaffected and it only gates the non-default SHARPI_CPU_MOE=0 path; a batched weighted-scatter-reduce kernel is tracked in #129. Comment-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address the one valid finding from the PR #128 review (Gemini): BatchedGpuMoeFfn's Phase-3 per-token reduce issues ~N·(na+2) small stream ops, undercutting the grouped-GEMM win for large-N GPU-SLRU prefill. Correctness/bit-parity are unaffected and it only gates the non-default SHARPI_CPU_MOE=0 path; a batched weighted-scatter-reduce kernel is tracked in #129. Comment-only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
7272f28 to
6c12e6a
Compare
|
Thanks @gemini-code-assist — adjudicated each finding against the codebase (independently cross-checked with the repo's review toolkit). Summary: 1. Host-loop top-k reduce in 2. Per-call 3. No correctness/bit-parity issues — all parity oracles assert byte-identical logits vs the sequential reference and are green. |
|
Thanks for the detailed breakdown, @pekkah. Your assessment of the deferred items and the justification for the established conventions in the codebase are clear. Tracking the host-loop top-k reduce in #129 is the right approach to avoid blocking this PR, and the bit-parity validation confirms the integrity of the batched paths. I have no further concerns regarding these points. |
Three follow-ups to #119 (PR #120), each bit-identical to its sequential reference and validated on an RTX 4070 Ti.
#121 — batch the FFN/MoE stage of the dense + GPU-SLRU GDN-hybrid prefill
PrefillBatchedTrunkGpuFfnbatched the trunk but ran FFN/MoE per token. Now:BatchedGpuDenseFfn).BatchedGpuMoeFfn): each cached expert loaded once and matmul'd against all its tokens, then a per-token top-k-ordered reduce that preserves the exactAddScaledInPlaceaccumulation order of the sequentialGpuMoeFfn(shared expert added last) — so it stays byte-identical.SHARPI_BATCHED_FFNgate; unsupported dtypes / CPU-dense-fallback layers stay on the per-token path.#122 — batched-query SDPA + Q-capture for SnapKV-active GDN-hybrid prefill
SnapKV used to force the per-position loop (the only place Q was captured). Now
AttnBlockBatchedcaptures the trailing window[wStart,N)fromqAllin a single batchedCopyDeviceRegion, then runs theAttentionBatched/AttentionBatchedWave(#118) SDPA — byte-identical capture and attention output vs the per-position path.#123 — batched-trunk prefill for
CudaHybridForwardPass(Qwen3-Coder-30B)The
-g Nlayer-split pure-attention MoE hybrid had no batched prefill. NewPrefillBatchedTrunk/GpuLayerBatchedTrunkbatch the attention trunk of the GPU layers over N (norm/proj/RoPE/QK-norm/KV-append/SDPA shared+wave), run FFN/MoE per token, transfer the N hidden rows GPU→CPU, run the CPU layers per token, then output. Gated to supported configs (non-Gemma4, non-TQ, NEOX RoPE, no L2-QK-norm, no attn-bias, GPU-resident embedding, batchable dtypes) with per-token fallback;_faultedlatch added.Validation
All new/changed oracles assert bit-identical final-token logits vs the deterministic sequential
Forwardreference:BatchedFfn_BitwiseMatchesPerTokenFfn_Dense27BMtp—SHARPI_BATCHED_FFN=0equivalence (perf(cuda): batch the FFN/MoE stage of the dense + GPU-SLRU GDN-hybrid prefill (#119 follow-up) #121).SnapKvActive_BatchedWave_BitwiseMatchesSequential_Over4096_Carnice— >4096 wave + SnapKV capture (perf(cuda): batched-query SDPA + Q-capture for SnapKV-active GDN-hybrid prefill (#118 follow-up) #122).CudaHybridBatchedPrefillTests— Coder-30B single / multi-chunk / >4096, each assertingLastPrefillWasBatchedso a gated-out config can't pass vacuously (perf(cuda): batched-trunk prefill for the pure-attention layer-split hybrid (CudaHybridForwardPass / Qwen3-Coder-30B) #123). Coder split: 29 GPU + 19 CPU layers.Reviewed per-issue (pr-review-toolkit code-reviewer) + a final pass (silent-failure-hunter / pr-test-analyzer / comment-analyzer); findings addressed (view-handle leak, anti-vacuity assertions, flag pinning, the missing
SHARPI_BATCHED_FFN=0oracle, comment accuracy).Measured perf (warm-cache ~1K-ctx A/B, RTX 4070 Ti) + README
Forward(SHARPI_BATCHED_PREFILL=0). README row updated to 30.1; decode unchanged (FFN/MoE + CPU layers stay per token).BatchForward2: _kvCache.Length=128 != startPos, the evicted cache vs MTP verify — unrelated to this PR; worth a separate issue.)🤖 Generated with Claude Code