feat(cuda): Gemma 4 continuous batching — per-layer head_dim / SWA / shared-KV / PLE (#195) by pekkah · Pull Request #273 · pekkah/SharpInference

pekkah · 2026-06-17T13:52:28Z

Summary

Closes #195. Stacked on #193 (PR #272) — base is feat/193-cuda-packed-prefill so this diff shows only the Gemma-4 work; rebase to master once #272 merges.

PR #192 (#190) shipped CUDA continuous batching for dense transformers only; CudaForwardPass excluded Gemma-4 models (SupportsContinuousBatching / ThrowIfBatchingUnsupported threw on per-layer head_dim). Gemma 4 is a primary target (#82/#124), so this adds batched serving for it.

What changed (`src/SharpInference.Engine/CudaForwardPass.cs`)

FillKvCacheArrays — extracts the owned-cache allocation into the single source of truth for per-layer KV geometry (per-layer head_dim + KV-head count, SWA ring sizing via SwaRingSize, shared-KV tail aliasing via KvSourceLayer). Used by both the constructor (fills the owned arrays in place — identity preserved so _ownedKCache stays valid) and CreateCache, so a per-sequence cache the batched decode / PrefillWithCache binds is byte-identical to the owned one. Forward-reference guard + frees partial (non-aliased) allocations on OOM.
CreateCache — now allocates Gemma-4 per-layer geometry + SWA rings + shared-KV aliasing (was dense-only).
RunBatchedTrunkGemma4 + GpuLayerBatchedDecodeGemma4 — the N-sequence decode analogue of the single-token RunGemma4DeviceRegion: batched embed (+scale) + PLE pre-pass, per layer {batched RmsNorm/QKV/QK-norm/V-norm, per-sequence RoPE / KV-append / SWA-or-global attention against caches[n] at positions[n], batched O-proj + sandwich norms + FFN + PLE injection + layer_output_scale}, batched final norm + output GEMM + final-logit softcap into _decodeLogitsAll. Matmuls use the proven GpuMatMulBatched (cuBLAS GEMM, weight-amortized across the batch); the dense WS/MMQ decode routing for Gemma 4 is a follow-up.
Gates — SupportsContinuousBatching = dense OR Gemma4BatchedDecodeSupported (the Gemma-4 gate does not exclude softcap, which the decode applies); ThrowIfBatchingUnsupported admits Gemma 4. Spec-verify (SupportsBatchVerify), the 2-slot prefix cache (SupportsMultiSlotPrefix), and the feat(cuda): packed multi-prompt prefill for continuous batching — PrefillPackedMulti is sequential (#190 follow-up) #193 dense packed prefill (IsDensePackablePrefill) stay dense-only — Gemma 4 prefill uses the sequential PrefillWithCache loop (the Gemma-4-capable single-seq batched trunk).
Loader — dense CUDA full-offload already routes via SupportsContinuousBatching, so Gemma 4 full-offload now batches automatically. Image input + batching stays rejected; hybrid / Q4_0 fall back to single-user.

Testing (Gemma 4 E4B Q8_0, RTX 4070 Ti)

E4B exercises per-layer head_dim (256), SWA rings, the 18-layer shared-KV tail, and PLE. New Gemma4CudaBatchForwardMultiTests:

SupportsContinuousBatching gate is true for E4B.
PrefillWithCache vs single-user Prefill (validates CreateCache geometry + aliasing).
BatchForwardMulti N=2 and two decode steps vs single-user ForwardGemma4 — argmax-stable within the fp16-GEMM tolerance.

Regressions: dense batched + #193 packed (15) and Gemma 4 single-user + prefill trunk (14) all pass — the constructor cache refactor is byte-correct for both families. The only local 12B is Q4_0 (not GEMM-N-batchable → single-user fallback), so k_eq_v's batched path isn't covered by a runnable test; it mirrors RunGemma4DeviceRegion exactly.

Review

Code-reviewer + silent-failure-hunter passes: no critical/high issues. Partial-state (cache .Length advanced only by the caller's tail after the trunk completes; caller discards all caches on throw), double-free (aliased-layer skip), and Q4_0-slip-through (dtype gate) all verified safe. Applied the suggested forward-reference guard and fixed two stale comments.

🤖 Generated with Claude Code

…shared-KV / PLE / softcap (#195) PR #192 (#190) shipped CUDA continuous batching for dense transformers only; CudaForwardPass excluded Gemma-4 models (SupportsContinuousBatching / ThrowIfBatchingUnsupported threw on per-layer head_dim). Gemma 4 is a primary target (#82/#124), so this adds batched serving for it. - FillKvCacheArrays: extract the owned-cache allocation into the single source of truth for per-layer KV geometry (per-layer head_dim + KV-head count, SWA ring sizing, shared-KV tail aliasing). Used by BOTH the constructor (fills the owned arrays in place — identity preserved so _ownedKCache stays valid) and CreateCache, so a per-sequence cache the batched decode / PrefillWithCache binds is byte-identical to the owned one. Adds a forward-reference guard + frees partial (non-aliased) allocations on OOM. - CreateCache: now allocates Gemma-4 per-layer geometry + SWA rings + shared-KV aliasing (was dense-only). - RunBatchedTrunkGemma4 + GpuLayerBatchedDecodeGemma4: the N-sequence decode analogue of the single-token RunGemma4DeviceRegion — batched embed (+scale) + PLE pre-pass, per layer {batched RmsNorm/QKV/QK-norm/V-norm, per-sequence RoPE/KV-append/SWA-or-global attention against caches[n] at positions[n], batched O-proj + sandwich norms + FFN + PLE injection + layer_output_scale}, batched final norm + output GEMM + final-logit softcap into _decodeLogitsAll. Matmuls use the proven GpuMatMulBatched (cuBLAS GEMM, weight-amortized) — the dense WS/MMQ decode routing for Gemma 4 is a follow-up. - Gates: SupportsContinuousBatching = dense OR Gemma4BatchedDecodeSupported (the latter does NOT exclude softcap, which the decode applies); ThrowIfBatching Unsupported admits Gemma 4. Spec-verify (SupportsBatchVerify), the 2-slot prefix cache (SupportsMultiSlotPrefix), and the #193 dense packed prefill (IsDensePackablePrefill) stay dense-only — Gemma 4 prefill uses the sequential PrefillWithCache loop (the Gemma-4-capable single-seq batched trunk). - Loader: dense CUDA full-offload already routes via SupportsContinuousBatching, so Gemma 4 full-offload now batches automatically (image input + batching stays rejected; hybrid/Q4_0 fall back to single-user). Tests (Gemma 4 E4B Q8_0, 4070 Ti — exercises per-layer head_dim 256, SWA rings, the 18-layer shared-KV tail, PLE): SupportsContinuousBatching gate, PrefillWithCache vs single-user prefill (CreateCache geometry/aliasing), BatchForwardMulti N=2 and two decode steps vs single-user ForwardGemma4 — argmax-stable within the fp16-GEMM tolerance. Dense (15) + Gemma 4 single-user + prefill (14) regressions all pass (the constructor cache refactor is byte-correct for both families). Two review passes (code-reviewer + silent-failure-hunter), no critical issues. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements CUDA continuous batching and batched decode support for Gemma-4 models (issue #195). Key changes include refactoring KV-cache allocation into a shared helper to maintain layout consistency, updating batching support gates, and introducing a dedicated Gemma-4 batched decode path (RunBatchedTrunkGemma4 and GpuLayerBatchedDecodeGemma4) that handles per-layer head dimensions, SWA rings, shared-KV aliasing, and PLE injection. New integration tests are also added to verify these paths. The review feedback provides excellent, actionable optimization suggestions to avoid allocating and freeing key/value tensor views on the hot path when dealing with shared-KV layers.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-17T13:54:50Z

+        var qAll = _gpu.View(_bpQ!, 0, (long)N * qDimL);
+        var kAll = _gpu.View(_bpK!, 0, (long)N * kvDimL);
+        var vAll = _gpu.View(_bpV!, 0, (long)N * kvDimL);
+        var attnAll = _gpu.View(_bpAttnOut!, 0, (long)N * qDimL);


For shared-KV layers (kvShared is true), the key and value projections are not computed. We can avoid creating the kAll and vAll views entirely in this case, reducing the overhead of creating and freeing views on the hot path.

var qAll = _gpu.View(_bpQ!, 0, (long)N * qDimL); Tensor? kAll = null; Tensor? vAll = null; if (!kvShared) { kAll = _gpu.View(_bpK!, 0, (long)N * kvDimL); vAll = _gpu.View(_bpV!, 0, (long)N * kvDimL); } var attnAll = _gpu.View(_bpAttnOut!, 0, (long)N * qDimL);

gemini-code-assist · 2026-06-17T13:54:50Z

+            var qv = _gpu.View(_bpQ!, (long)n * qDimL, qDimL);
+            var kv = _gpu.View(_bpK!, (long)n * kvDimL, kvDimL);
+            var vv = _gpu.View(_bpV!, (long)n * kvDimL, kvDimL);
+            var av = _gpu.View(_bpAttnOut!, (long)n * qDimL, qDimL);


Similarly, inside the sequence loop, we can avoid creating the kv and vv views when kvShared is true, as they are not used or modified for shared-KV layers.

var qv = _gpu.View(_bpQ!, (long)n * qDimL, qDimL); Tensor? kv = null; Tensor? vv = null; if (!kvShared) { kv = _gpu.View(_bpK!, (long)n * kvDimL, kvDimL); vv = _gpu.View(_bpV!, (long)n * kvDimL, kvDimL); } var av = _gpu.View(_bpAttnOut!, (long)n * qDimL, qDimL);

gemini-code-assist · 2026-06-17T13:54:51Z

+            finally
+            {
+                _gpu.Free(qv); _gpu.Free(kv); _gpu.Free(vv); _gpu.Free(av);
+            }


Update the finally block to only free kv and vv if they were allocated.

finally { _gpu.Free(qv); if (kv is not null) _gpu.Free(kv); if (vv is not null) _gpu.Free(vv); _gpu.Free(av); }

gemini-code-assist · 2026-06-17T13:54:51Z

+        if (_layerOutputScale is not null)
+            _gpu.ScaleInPlace(_bpHidden!, _layerOutputScale[layer]);
+
+        _gpu.Free(qAll); _gpu.Free(kAll); _gpu.Free(vAll); _gpu.Free(attnAll);


Update the final Free calls to only free kAll and vAll if they were allocated.

_gpu.Free(qAll); if (kAll is not null) _gpu.Free(kAll); if (vAll is not null) _gpu.Free(vAll); _gpu.Free(attnAll);

…195 review) Address Gemini Code Assist feedback on PR #273: GpuLayerBatchedDecodeGemma4 created the kAll/vAll batched views and per-sequence kv/vv views on every layer, but a shared-KV tail layer (KvSourceLayer >= 0) reuses the source layer's cache and computes no K/V — those views are never written or read. Create them only on KV-owning layers (nullable + conditional), eliding two batched + 2·N per-sequence view alloc/frees per shared layer per decode step. Bit-identical (the elided views were dead); Gemma 4 E4B batched-decode oracles still pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah · 2026-06-17T14:39:43Z

Thanks @gemini-code-assist — applied the suggestion: GpuLayerBatchedDecodeGemma4 now skips creating the kAll/vAll and per-sequence kv/vv views on shared-KV tail layers (they're never written/read there), eliding two batched + 2·N per-sequence view alloc/frees per shared layer per decode step. Bit-identical (the elided views were dead); E4B batched-decode oracles still pass. (commit 92cca20)

gemini-code-assist · 2026-06-17T14:40:27Z

Understood. The reduction in view allocations for shared-KV layers should effectively minimize overhead in the batched decode path. Since the oracles pass and the logic is verified, this optimization appears complete.

…195 review) Address Gemini Code Assist feedback on PR #273: GpuLayerBatchedDecodeGemma4 created the kAll/vAll batched views and per-sequence kv/vv views on every layer, but a shared-KV tail layer (KvSourceLayer >= 0) reuses the source layer's cache and computes no K/V — those views are never written or read. Create them only on KV-owning layers (nullable + conditional), eliding two batched + 2·N per-sequence view alloc/frees per shared layer per decode step. Bit-identical (the elided views were dead); Gemma 4 E4B batched-decode oracles still pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…shared-KV / PLE (#195) (#278) * feat(cuda): Gemma 4 continuous batching — per-layer head_dim / SWA / shared-KV / PLE / softcap (#195) PR #192 (#190) shipped CUDA continuous batching for dense transformers only; CudaForwardPass excluded Gemma-4 models (SupportsContinuousBatching / ThrowIfBatchingUnsupported threw on per-layer head_dim). Gemma 4 is a primary target (#82/#124), so this adds batched serving for it. - FillKvCacheArrays: extract the owned-cache allocation into the single source of truth for per-layer KV geometry (per-layer head_dim + KV-head count, SWA ring sizing, shared-KV tail aliasing). Used by BOTH the constructor (fills the owned arrays in place — identity preserved so _ownedKCache stays valid) and CreateCache, so a per-sequence cache the batched decode / PrefillWithCache binds is byte-identical to the owned one. Adds a forward-reference guard + frees partial (non-aliased) allocations on OOM. - CreateCache: now allocates Gemma-4 per-layer geometry + SWA rings + shared-KV aliasing (was dense-only). - RunBatchedTrunkGemma4 + GpuLayerBatchedDecodeGemma4: the N-sequence decode analogue of the single-token RunGemma4DeviceRegion — batched embed (+scale) + PLE pre-pass, per layer {batched RmsNorm/QKV/QK-norm/V-norm, per-sequence RoPE/KV-append/SWA-or-global attention against caches[n] at positions[n], batched O-proj + sandwich norms + FFN + PLE injection + layer_output_scale}, batched final norm + output GEMM + final-logit softcap into _decodeLogitsAll. Matmuls use the proven GpuMatMulBatched (cuBLAS GEMM, weight-amortized) — the dense WS/MMQ decode routing for Gemma 4 is a follow-up. - Gates: SupportsContinuousBatching = dense OR Gemma4BatchedDecodeSupported (the latter does NOT exclude softcap, which the decode applies); ThrowIfBatching Unsupported admits Gemma 4. Spec-verify (SupportsBatchVerify), the 2-slot prefix cache (SupportsMultiSlotPrefix), and the #193 dense packed prefill (IsDensePackablePrefill) stay dense-only — Gemma 4 prefill uses the sequential PrefillWithCache loop (the Gemma-4-capable single-seq batched trunk). - Loader: dense CUDA full-offload already routes via SupportsContinuousBatching, so Gemma 4 full-offload now batches automatically (image input + batching stays rejected; hybrid/Q4_0 fall back to single-user). Tests (Gemma 4 E4B Q8_0, 4070 Ti — exercises per-layer head_dim 256, SWA rings, the 18-layer shared-KV tail, PLE): SupportsContinuousBatching gate, PrefillWithCache vs single-user prefill (CreateCache geometry/aliasing), BatchForwardMulti N=2 and two decode steps vs single-user ForwardGemma4 — argmax-stable within the fp16-GEMM tolerance. Dense (15) + Gemma 4 single-user + prefill (14) regressions all pass (the constructor cache refactor is byte-correct for both families). Two review passes (code-reviewer + silent-failure-hunter), no critical issues. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * perf(cuda): skip unused K/V views on Gemma 4 shared-KV decode layers (#195 review) Address Gemini Code Assist feedback on PR #273: GpuLayerBatchedDecodeGemma4 created the kAll/vAll batched views and per-sequence kv/vv views on every layer, but a shared-KV tail layer (KvSourceLayer >= 0) reuses the source layer's cache and computes no K/V — those views are never written or read. Create them only on KV-owning layers (nullable + conditional), eliding two batched + 2·N per-sequence view alloc/frees per shared layer per decode step. Bit-identical (the elided views were dead); Gemma 4 E4B batched-decode oracles still pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 17, 2026

View reviewed changes

pekkah mentioned this pull request Jun 17, 2026

feat(cuda): SnapKV-aware continuous batching — per-sequence eviction + prefer-batching (#196) #274

Merged

This was referenced Jun 17, 2026

perf(cuda): Gemma 4 batched decode uses cuBLAS GEMM, not the dense WS/MMQ decode routing (#195 follow-up) #275

Closed

test(cuda): Gemma 4 k_eq_v batched-decode path is uncovered — needs a GEMM-N-batchable 12B fixture (#195 follow-up) #276

Closed

pekkah deleted the branch feat/193-cuda-packed-prefill June 17, 2026 14:51

pekkah closed this Jun 17, 2026

pekkah mentioned this pull request Jun 17, 2026

feat(cuda): Gemma 4 continuous batching — per-layer head_dim / SWA / shared-KV / PLE (#195) #278

Merged

This was referenced Jun 17, 2026

perf(cuda): Gemma 4 batched decode routes through the WS/decode-MMQ matvec (#275) #280

Merged

test(cuda): synthetic-fixture coverage for Gemma 4 k_eq_v batched decode (#276) #281

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(cuda): Gemma 4 continuous batching — per-layer head_dim / SWA / shared-KV / PLE (#195)#273

feat(cuda): Gemma 4 continuous batching — per-layer head_dim / SWA / shared-KV / PLE (#195)#273
pekkah wants to merge 2 commits into
feat/193-cuda-packed-prefillfrom
feat/195-gemma4-cuda-batching

pekkah commented Jun 17, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Uh oh!

pekkah commented Jun 17, 2026

Uh oh!

gemini-code-assist Bot commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pekkah commented Jun 17, 2026

Summary

What changed (src/SharpInference.Engine/CudaForwardPass.cs)

Testing (Gemma 4 E4B Q8_0, RTX 4070 Ti)

Review

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

pekkah commented Jun 17, 2026

Uh oh!

gemini-code-assist Bot commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

What changed (`src/SharpInference.Engine/CudaForwardPass.cs`)