perf(cuda): SWA-scoped KV-VRAM reserve — Gemma 4 q8 auto-context 30K→123K (#228) by pekkah · Pull Request #231 · pekkah/SharpInference

pekkah · 2026-06-13T12:24:27Z

Summary

Fixes #228. CudaForwardPass.EstimateAvailableKvVram reserved max(VRAM/3, 2GB) = 4 GiB on a 12 GB card — the dominant term shrinking the KV budget and pinning Gemma 4 12B q8_0 auto-context to ~30K when far more fits (the full 256K runs via -c).

The reserve is load-bearing for dense models: their KV grows linearly to fill any budget, and an earlier smaller reserve once left ~24 MiB free running Qwen3-8B, spilling the lm-head to system RAM over PCIe and collapsing prefill 65→4 t/s (#185). So it can't be cut globally.

Fix

Factor the reserve into the pure, testable CudaForwardPass.KvVramReserveBytes(hp, vram) and scope a smaller bounded reserve to SWA/per-layer (Gemma 4) models only: a fixed system allowance (max(2 GiB, VRAM/6), not VRAM/3) plus one PrefillBatchChunk's activation working set. This is safe precisely because SWA KV saturates past the sliding-window ring (only the few global layers grow), so a larger budget can't grow KV to consume the headroom — the dense failure mode. Dense models keep max(VRAM/3, 2GB) byte-for-byte unchanged.
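As a rough illustration of the two reserve rules described above, here is a minimal C# sketch. The class name `ReserveSketch`, its method names, and passing the prefill working set in as a precomputed byte count are assumptions made for illustration; the actual `KvVramReserveBytes(hp, vram)` derives that term from the model hyperparameters, and a 12 GB card is approximated here as exactly 12 GiB.

```csharp
using System;

// Hypothetical stand-in for the reserve rules in this PR description.
// Not the repository's implementation: the real method takes (hp, vram).
static class ReserveSketch
{
    public const long GiB = 1L << 30;

    // Dense models: unchanged rule, max(VRAM/3, 2 GiB).
    public static long DenseReserve(long vramBytes) =>
        Math.Max(vramBytes / 3, 2 * GiB);

    // SWA/per-layer models: fixed system allowance max(2 GiB, VRAM/6)
    // plus one prefill chunk's activation working set (assumed precomputed).
    public static long SwaReserve(long vramBytes, long prefillWorkingSetBytes) =>
        Math.Max(2 * GiB, vramBytes / 6) + prefillWorkingSetBytes;
}
```

Under those assumptions, `ReserveSketch.DenseReserve(12 * ReserveSketch.GiB)` evaluates to 4 GiB, while the SWA system allowance for the same card is 2 GiB before the working-set term is added.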

Also enforce the model-max ceiling at the single _maxSeqLen chokepoint (Math.Min(_maxSeqLen, hp.ContextLength)) so neither an explicit -c nor a calculated context can ever exceed the model's own max, regardless of estimator.
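The ceiling itself is a single `Math.Min`. A minimal sketch, assuming both values are plain `int` token counts; `ContextClampSketch` and `Resolve` are hypothetical names used only for this example:

```csharp
using System;

// Hypothetical illustration of the model-max ceiling; the PR applies the
// equivalent Math.Min at the single _maxSeqLen chokepoint.
static class ContextClampSketch
{
    // requested: context from an explicit -c or from the estimator.
    // modelMax:  the model's own maximum (hp.ContextLength in the PR).
    public static int Resolve(int requested, int modelMax) =>
        Math.Min(requested, modelMax);
}
```

With a model maximum of 262144, `ContextClampSketch.Resolve(999999999, 262144)` returns 262144, which matches the explicit `-c 999999999` row in the results.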

Result (4070 Ti 12 GB, `gemma-4-12b-it-qat-q4_0`, `-g -1 --kv-type q8_0`)

| | master | branch |
|---|---|---|
| Gemma q8 auto-ctx | 30,840 | 123,361 (4×) |
| Gemma real 5052-token prefill | — | 1312 t/s, no OOM/spill (1290 MiB free held) |
| Qwen3-8B (dense) | 12080 bf16, 3474 MiB free | identical (byte-for-byte reserve) |
| explicit `-c 999999999` | — | clamps to 262144 |

Tests

3 GGUF-free KvVramReserveBytes unit tests: dense keeps max(VRAM/3, 2GB); SWA bounded below dense; SWA scales with model width. 42/42 CudaForwardPassKvDtypeTests green (excluding the slow >4096 chunked case, which is validated via the CLI 5K-token prefill above). Release build clean under TreatWarningsAsErrors + AOT analyzers.

Safety A/B (the load-bearing check)

Qwen3-8B (the #185 cliff model) is byte-identical — dense reserve unchanged → no spill risk by construction. Gemma's real long-context prefill (5052 tokens, crossing the chunk + SWA-ring boundary) runs at 1312 t/s with no OOM.

🤖 Generated with Claude Code

…#228)

EstimateAvailableKvVram reserved max(VRAM/3, 2GB) = 4 GiB on a 12 GB card, the dominant term shrinking the KV budget and pinning Gemma 4 12B q8_0 auto-context to ~30K when far more fits. The reserve is load-bearing for DENSE models (their KV grows linearly to fill any budget; an earlier smaller reserve spilled Qwen3-8B's lm-head to system RAM, 65→4 t/s, #185), so it can't be cut globally.

Fix: factor the reserve into the pure, testable KvVramReserveBytes(hp, vram) and scope a smaller bounded reserve to SWA/per-layer (Gemma 4) models only: a fixed system allowance (max(2GiB, VRAM/6), NOT VRAM/3) plus one PrefillBatchChunk's activation working set. Safe precisely because SWA KV SATURATES past the sliding-window ring (only the few global layers grow), so a larger budget can't grow KV to eat the headroom, which is the dense failure mode. Dense models keep max(VRAM/3, 2GB) byte-for-byte.

Also enforce the model-max ceiling at the single _maxSeqLen chokepoint (Math.Min(_maxSeqLen, hp.ContextLength)) so neither an explicit -c nor a calculated context can exceed the model's own max, regardless of estimator.

Result (4070 Ti 12 GB, gemma-4-12b-it-qat-q4_0, -g -1 --kv-type q8_0): auto-ctx 30840 → 123361 (4×); a real 5052-token prefill runs at 1312 t/s with no OOM/spill (1290 MiB free held). Qwen3-8B dense UNCHANGED (12080 bf16, 3474 MiB free, byte-identical reserve). Explicit -c 999999999 clamps to 262144.

Tests: 3 GGUF-free KvVramReserveBytes unit tests (dense keeps max(VRAM/3,2GB); SWA bounded below dense; SWA scales with model width). 42/42 CudaForwardPassKvDtypeTests green (excl. slow >4096 chunked, validated via the CLI long-prefill above). Release build clean under TreatWarningsAsErrors + AOT analyzers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a bounded VRAM reservation strategy for SWA/Gemma 4 models via the new KvVramReserveBytes method, preventing over-reservation of VRAM while maintaining the existing logic for dense models. It also ensures the resolved context length is clamped to the model's maximum. The review feedback points out several discrepancies in the prefillWorkingSet calculation compared to the actual allocations in EnsureBatchedTrunkScratch (such as handling of head dimensions, embedding buffers, and PLE buffers) and provides a code suggestion to align them accurately.


gemini-code-assist · 2026-06-13T12:26:03Z

+        long prefillWorkingSet = (long)PrefillBatchChunk *
+            (hp.EmbeddingDim * 4L + hp.IntermediateDim * 2L
+             + (long)hp.NumHeads * hp.HeadDim + 2L * hp.NumKvHeads * hp.HeadDim) * sizeof(float);


The current calculation of prefillWorkingSet has a few discrepancies compared to the actual allocations performed in EnsureBatchedTrunkScratch for SWA/Gemma 4 models:

Q and AttnOut Buffers: _bpQ and _bpAttnOut are two buffers of size n * numHeads * maxHeadDim. The current formula only accounts for one buffer ((long)hp.NumHeads * hp.HeadDim) and uses hp.HeadDim instead of the maximum head dimension (maxHeadDim), which can be larger (e.g., 512 vs 256 in Gemma 4).

K and V Buffers: _bpK and _bpV are two buffers of size n * numKvHeads * maxHeadDim. The current formula uses hp.HeadDim instead of maxHeadDim.

Embedding Buffers: _bpHidden, _bpResidual, and _bpNorm are three buffers of size n * embDim. The formula unconditionally uses hp.EmbeddingDim * 4L, which over-reserves when hp.HasPerLayerTokenEmbd is false.

PLE Buffers: When hp.HasPerLayerTokenEmbd is true, _bpProjAll (n * NumLayers * pleWidth), _bpPleRowAll (n * NumLayers * pleWidth), and _bpPleGate (n * pleWidth) are allocated but completely omitted from the reserve calculation. For Gemma 4 12B, this under-reserves VRAM by around 340 MiB.

To ensure the VRAM estimation is robust and accurate, we should align the prefillWorkingSet calculation exactly with the allocations in EnsureBatchedTrunkScratch.

    int maxHeadDim = hp.HeadDim;
    if (hp.LayerHeadDim is not null)
    {
        foreach (int hd in hp.LayerHeadDim)
        {
            if (hd > maxHeadDim) maxHeadDim = hd;
        }
    }

    long perTokenFloats = hp.EmbeddingDim * 3L
        + (long)hp.NumHeads * maxHeadDim * 2L
        + (long)hp.NumKvHeads * maxHeadDim * 2L
        + hp.IntermediateDim * 2L;
    if (hp.HasPerLayerTokenEmbd)
    {
        perTokenFloats += hp.NumLayers * (long)hp.PerLayerEmbeddingWidth * 2L
            + hp.PerLayerEmbeddingWidth + hp.EmbeddingDim;
    }

    long prefillWorkingSet = (long)PrefillBatchChunk * perTokenFloats * sizeof(float);

Addressed in 2ca5b5f + e4af59a: prefillWorkingSet now counts Q+AttnOut and K+V at the widest per-layer head_dim (×2 each), the PLE proj/row/gate stack + PleY (only when HasPerLayerTokenEmbd), and a 3×embDim base. Matches EnsureBatchedTrunkScratch exactly. New exact-value unit test KvVramReserveBytes_SwaModel_CountsPleAndWidestHeadDim pins it.

Addresses the #231 review findings:

- HIGH (silent-failure): the SWA prefill-working-set term under-counted Gemma 4's real transient buffers: it omitted the PLE stacked proj/row scratch (NumLayers×pleWidth, allocated for HasPerLayerTokenEmbd) and used hp.HeadDim instead of the WIDEST per-layer head_dim (global 512). EnsureBatchedTrunkScratch allocates these lazily at the first long prefill and they coexist with the larger KV cache, so the reserve must bound them. Now counts hidden/residual/norm/PleY, Q+AttnOut and K+V at maxHeadDim, 2 FFN buffers, and the PLE stack. (The looser formula happened to pass the 5K-prefill gate via GpuBufferPool reuse, but the reserve should be a true bound, not rely on that.)
- Important (code-reviewer): KvVramReserveBytes guarded on `IsSwaLayer is null` while SolveMaxCtxForKv uses `LayerHeadDim != null && IsSwaLayer != null`. Aligned: the SWA reserve now requires per-layer head_dim AND ≥1 real sliding-window layer, so a Gemma-arch GGUF with an all-false SWA pattern (linear-growth KV) correctly keeps the dense reserve. Also Math.Min(dense, ...) so the SWA reserve can never exceed what dense would reserve.

Re-validated (4070 Ti 12 GB): Gemma q8 auto-ctx still 123361 (the larger reserve is absorbed by the pow2 KV step, so the win is preserved), real 5052-token prefill 1313 t/s no OOM; Qwen3-8B dense byte-identical (12080 bf16, 3474 MiB free).

Tests: +2 (VRAM/6 large-card scaling isolated at 48 GiB; exact PLE + widest-head_dim working-set pin). 5/5 KvVramReserveBytes green; Release build clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah · 2026-06-13T12:37:08Z

Review cycle complete

Ran three independent lenses (correctness, silent-failure/OOM, test-coverage) + the automated gemini review.

Findings addressed (commit 2ca5b5f):

- HIGH (silent-failure-hunter): the SWA prefill-working-set term under-counted Gemma's real transient buffers: it omitted the PLE stacked scratch (NumLayers×pleWidth) and used hp.HeadDim, not the widest per-layer head_dim (global 512). Now a faithful bound on what EnsureBatchedTrunkScratch allocates. (The looser formula passed the 5K-prefill gate via GpuBufferPool reuse, but the reserve should be a true bound.)
- Important (code-reviewer): the SWA-vs-dense guard now matches SolveMaxCtxForKv (LayerHeadDim != null && IsSwaLayer != null, plus ≥1 real SWA layer), so a Gemma-arch GGUF with an all-false SWA pattern keeps the dense reserve. Added Math.Min(dense, …) so the SWA reserve can never exceed dense.
- Test gaps (pr-test-analyzer): +2 tests: VRAM/6 large-card scaling isolated at 48 GiB; exact PLE + widest-head_dim working-set pin.
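The guard alignment and the `Math.Min` cap can be sketched in isolation. This is a minimal illustration under stated assumptions, not the repository's code: `SwaGuardSketch`, `UsesSwaReserve`, and `Reserve` are hypothetical names, and the per-layer metadata is modelled as plain nullable arrays.

```csharp
#nullable enable
using System;
using System.Linq;

// Hypothetical sketch of the SWA-vs-dense decision described in the findings.
static class SwaGuardSketch
{
    // The SWA reserve applies only when per-layer head dims exist AND at
    // least one layer really uses a sliding window (an all-false pattern
    // means linear-growth KV, so the dense reserve must be kept).
    public static bool UsesSwaReserve(int[]? layerHeadDim, bool[]? isSwaLayer) =>
        layerHeadDim != null && isSwaLayer != null && isSwaLayer.Any(s => s);

    // The SWA reserve is capped so it can never exceed the dense reserve.
    public static long Reserve(long denseBytes, long swaBytes, bool usesSwa) =>
        usesSwa ? Math.Min(denseBytes, swaBytes) : denseBytes;
}
```

Keeping the predicate separate from the byte arithmetic makes the all-false SWA pattern easy to unit-test without loading a GGUF.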

Confirmed safe: Qwen3-8B (the #185 cliff model) is dense → reserve byte-identical → no spill risk by construction. Gemma's real 5052-token prefill (crossing the chunk + SWA-ring boundary) runs at 1313 t/s, no OOM; auto-ctx still 123361 (the larger reserve is absorbed by the pow2 KV step — the 4× win is preserved).

5/5 KvVramReserveBytes unit tests + 42/42 CudaForwardPassKvDtypeTests green; Release build clean under TreatWarningsAsErrors + AOT.

… presence gemini-code-assist point #3: _bpPleY is allocated only when HasPerLayerTokenEmbd, so the base is 3×embDim (hidden+residual+norm) and the +embDim PleY belongs in the PLE block. For real Gemma 4 (always PLE) the total is unchanged (4×embDim); this just avoids a ~63 MiB over-reserve for a hypothetical SWA-without-PLE model. 5/5 KvVramReserveBytes tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 13, 2026


pekkah merged commit 5eb17a5 into master Jun 13, 2026
1 check passed

pekkah deleted the perf/gemma4-swa-kv-reserve-228 branch June 13, 2026 12:39

pekkah merged 3 commits into master from perf/gemma4-swa-kv-reserve-228
