perf(cuda): SWA-scoped KV-VRAM reserve — Gemma 4 q8 auto-context 30K→123K (#228) #231

Merged
pekkah merged 3 commits into master from perf/gemma4-swa-kv-reserve-228
Jun 13, 2026

pekkah commented Jun 13, 2026

Summary

Fixes #228. CudaForwardPass.EstimateAvailableKvVram reserved max(VRAM/3, 2GB), which is 4 GiB on a 12 GB card. That reserve was the dominant term shrinking the KV budget, and it pinned Gemma 4 12B auto-context (q8_0 KV cache) to ~30K when far more fits (the full 256K runs via -c).

The reserve is load-bearing for dense models: their KV grows linearly to fill any budget, and an earlier smaller reserve once left ~24 MiB free running Qwen3-8B, spilling the lm-head to system RAM over PCIe and collapsing prefill 65→4 t/s (#185). So it can't be cut globally.

Fix

Factor the reserve into the pure, testable CudaForwardPass.KvVramReserveBytes(hp, vram) and scope a smaller bounded reserve to SWA/per-layer (Gemma 4) models only: a fixed system allowance (max(2 GiB, VRAM/6), not VRAM/3) plus one PrefillBatchChunk's activation working set. This is safe precisely because SWA KV saturates past the sliding-window ring (only the few global layers grow), so a larger budget can't grow KV to consume the headroom — the dense failure mode. Dense models keep max(VRAM/3, 2GB) byte-for-byte unchanged.
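
A minimal sketch of the two-branch reserve described above, for readers who want the arithmetic spelled out. It is not the PR's implementation: `ModelShape`, `KvReserveSketch`, the `IsSwaModel` flag, and the chunk size of 4096 are hypothetical stand-ins, and the working-set term uses the formula as first proposed in this PR (the review later in this thread tightens it).

```csharp
using System;

// Hypothetical stand-in for the model hyperparameters (the real `hp` type is not shown in this PR page).
public sealed record ModelShape(
    int EmbeddingDim,
    int IntermediateDim,
    int NumHeads,
    int NumKvHeads,
    int HeadDim,
    bool IsSwaModel);

public static class KvReserveSketch
{
    private const long GiB = 1L << 30;

    // Assumed value for illustration only; the PR names PrefillBatchChunk but does not state its size.
    public const int PrefillBatchChunk = 4096;

    // Dense models: max(VRAM/3, 2 GiB), unchanged by the PR.
    public static long DenseReserveBytes(long vramBytes) =>
        Math.Max(vramBytes / 3, 2 * GiB);

    public static long ReserveBytes(ModelShape hp, long vramBytes)
    {
        if (!hp.IsSwaModel)
            return DenseReserveBytes(vramBytes);

        // SWA models: fixed system allowance plus one prefill chunk's activations.
        long systemAllowance = Math.Max(2 * GiB, vramBytes / 6);
        long perTokenFloats = hp.EmbeddingDim * 4L
            + hp.IntermediateDim * 2L
            + (long)hp.NumHeads * hp.HeadDim
            + 2L * hp.NumKvHeads * hp.HeadDim;
        long prefillWorkingSet = PrefillBatchChunk * perTokenFloats * sizeof(float);
        return systemAllowance + prefillWorkingSet;
    }
}
```

On a 12 GiB card the dense branch yields 4 GiB, while the SWA branch yields 2 GiB plus a working set that depends on model width, which is where the extra KV budget comes from.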

Also enforce the model-max ceiling at the single _maxSeqLen chokepoint (Math.Min(_maxSeqLen, hp.ContextLength)) so neither an explicit -c nor a calculated context can ever exceed the model's own max, regardless of estimator.
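
The clamp can be illustrated in isolation. This is a sketch only: `ContextClampSketch` and `ResolveMaxSeqLen` are hypothetical names, and the real code applies `Math.Min` to the `_maxSeqLen` field rather than through a helper.

```csharp
using System;

public static class ContextClampSketch
{
    // requested: an explicit -c value or the estimator's calculated context.
    // modelMax:  the model's own maximum (hp.ContextLength in the PR).
    public static int ResolveMaxSeqLen(int requested, int modelMax) =>
        Math.Min(requested, modelMax);
}
```

With a model maximum of 262144, a request of 999999999 resolves to 262144, and a calculated context of 123361 passes through unchanged.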

Result (4070 Ti 12 GB, gemma-4-12b-it-qat-q4_0, -g -1 --kv-type q8_0)

| | master | branch |
| --- | --- | --- |
| Gemma q8 auto-ctx | 30,840 | 123,361 (4×) |
| Gemma real 5052-token prefill | | 1312 t/s, no OOM/spill (1290 MiB free held) |
| Qwen3-8B (dense) | 12080 bf16, 3474 MiB free | identical (byte-for-byte reserve) |
| explicit -c 999999999 | | clamps to 262144 |

Tests

3 GGUF-free KvVramReserveBytes unit tests: dense keeps max(VRAM/3, 2GB); SWA bounded below dense; SWA scales with model width. 42/42 CudaForwardPassKvDtypeTests green (excluding the slow >4096 chunked test, which is validated via the CLI 5K-token prefill above). Release build clean under TreatWarningsAsErrors + AOT analyzers.

Safety A/B (the load-bearing check)

Qwen3-8B (the #185 cliff model) is byte-identical — dense reserve unchanged → no spill risk by construction. Gemma's real long-context prefill (5052 tokens, crossing the chunk + SWA-ring boundary) runs at 1312 t/s with no OOM.

🤖 Generated with Claude Code

…#228)

EstimateAvailableKvVram reserved max(VRAM/3, 2GB) = 4 GiB on a 12 GB card, the
dominant term shrinking the KV budget and pinning Gemma 4 12B q8_0 auto-context to
~30K when far more fits. The reserve is load-bearing for DENSE models (their KV
grows linearly to fill any budget; an earlier smaller reserve spilled Qwen3-8B's
lm-head to system RAM, 65→4 t/s, #185), so it can't be cut globally.

Fix: factor the reserve into the pure, testable KvVramReserveBytes(hp, vram) and
scope a smaller bounded reserve to SWA/per-layer (Gemma 4) models only — a fixed
system allowance (max(2GiB, VRAM/6), NOT VRAM/3) plus one PrefillBatchChunk's
activation working set. Safe precisely because SWA KV SATURATES past the sliding-
window ring (only the few global layers grow), so a larger budget can't grow KV to
eat the headroom — the dense failure mode. Dense models keep max(VRAM/3, 2GB)
byte-for-byte.
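
The saturation argument can be made concrete with a small model of KV growth. This is an illustrative sketch, not code from the PR: `KvGrowthSketch`, the window size, the layer pattern, and the per-token byte cost are all assumptions chosen for the example.

```csharp
using System;

public static class KvGrowthSketch
{
    // Tokens one layer keeps cached at a given context length.
    // A global (full-attention) layer keeps every token; a sliding-window
    // layer keeps at most `window` tokens in its ring.
    public static long CachedTokens(long contextLen, bool isSwaLayer, long window) =>
        isSwaLayer ? Math.Min(contextLen, window) : contextLen;

    // Total KV bytes across all layers, given a hypothetical uniform per-token cost.
    public static long TotalKvBytes(
        long contextLen, bool[] isSwaLayer, long window, long bytesPerTokenPerLayer)
    {
        long total = 0;
        foreach (bool swa in isSwaLayer)
            total += CachedTokens(contextLen, swa, window) * bytesPerTokenPerLayer;
        return total;
    }
}
```

With three sliding-window layers and one global layer, doubling the context past the window only grows the global layer's share, whereas an all-global (dense) pattern doubles the whole cache. That difference is why extra budget is safe to hand to SWA models but not to dense ones.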

Also enforce the model-max ceiling at the single _maxSeqLen chokepoint
(Math.Min(_maxSeqLen, hp.ContextLength)) so neither an explicit -c nor a calculated
context can exceed the model's own max, regardless of estimator.

Result (4070 Ti 12 GB, gemma-4-12b-it-qat-q4_0, -g -1 --kv-type q8_0): auto-ctx
30840 → 123361 (4×); a real 5052-token prefill runs at 1312 t/s with no OOM/spill
(1290 MiB free held). Qwen3-8B dense UNCHANGED (12080 bf16, 3474 MiB free,
byte-identical reserve). Explicit -c 999999999 clamps to 262144.

Tests: 3 GGUF-free KvVramReserveBytes unit tests (dense keeps max(VRAM/3,2GB);
SWA bounded below dense; SWA scales with model width). 42/42 CudaForwardPassKvDtypeTests
green (excl. slow >4096 chunked, validated via the CLI long-prefill above). Release
build clean under TreatWarningsAsErrors + AOT analyzers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist Bot left a comment

Code Review

This pull request introduces a bounded VRAM reservation strategy for SWA/Gemma 4 models via the new KvVramReserveBytes method, preventing over-reservation of VRAM while maintaining the existing logic for dense models. It also ensures the resolved context length is clamped to the model's maximum. The review feedback points out several discrepancies in the prefillWorkingSet calculation compared to the actual allocations in EnsureBatchedTrunkScratch (such as handling of head dimensions, embedding buffers, and PLE buffers) and provides a code suggestion to align them accurately.

Comment on lines +4014 to +4016
long prefillWorkingSet = (long)PrefillBatchChunk *
(hp.EmbeddingDim * 4L + hp.IntermediateDim * 2L
+ (long)hp.NumHeads * hp.HeadDim + 2L * hp.NumKvHeads * hp.HeadDim) * sizeof(float);

medium

The current calculation of prefillWorkingSet has a few discrepancies compared to the actual allocations performed in EnsureBatchedTrunkScratch for SWA/Gemma 4 models:

  1. Q and AttnOut Buffers: _bpQ and _bpAttnOut are two buffers of size n * numHeads * maxHeadDim. The current formula only accounts for one buffer ((long)hp.NumHeads * hp.HeadDim) and uses hp.HeadDim instead of the maximum head dimension (maxHeadDim), which can be larger (e.g., 512 vs 256 in Gemma 4).
  2. K and V Buffers: _bpK and _bpV are two buffers of size n * numKvHeads * maxHeadDim. The current formula uses hp.HeadDim instead of maxHeadDim.
  3. Embedding Buffers: _bpHidden, _bpResidual, and _bpNorm are three buffers of size n * embDim. The formula unconditionally uses hp.EmbeddingDim * 4L, which over-reserves when hp.HasPerLayerTokenEmbd is false.
  4. PLE Buffers: When hp.HasPerLayerTokenEmbd is true, _bpProjAll (n * NumLayers * pleWidth), _bpPleRowAll (n * NumLayers * pleWidth), and _bpPleGate (n * pleWidth) are allocated but completely omitted from the reserve calculation. For Gemma 4 12B, this under-reserves around 340 MiB of VRAM.

To ensure the VRAM estimation is robust and accurate, we should align the prefillWorkingSet calculation exactly with the allocations in EnsureBatchedTrunkScratch.

        int maxHeadDim = hp.HeadDim;
        if (hp.LayerHeadDim is not null)
        {
            foreach (int hd in hp.LayerHeadDim)
            {
                if (hd > maxHeadDim) maxHeadDim = hd;
            }
        }

        long perTokenFloats = hp.EmbeddingDim * 3L
            + (long)hp.NumHeads * maxHeadDim * 2L
            + (long)hp.NumKvHeads * maxHeadDim * 2L
            + hp.IntermediateDim * 2L;

        if (hp.HasPerLayerTokenEmbd)
        {
            perTokenFloats += hp.NumLayers * (long)hp.PerLayerEmbeddingWidth * 2L
                + hp.PerLayerEmbeddingWidth
                + hp.EmbeddingDim;
        }

        long prefillWorkingSet = (long)PrefillBatchChunk * perTokenFloats * sizeof(float);

pekkah (Owner, Author) replied:

Addressed in 2ca5b5f + e4af59a: prefillWorkingSet now counts Q+AttnOut and K+V at the widest per-layer head_dim (×2 each), the PLE proj/row/gate stack + PleY (only when HasPerLayerTokenEmbd), and a 3×embDim base. Matches EnsureBatchedTrunkScratch exactly. New exact-value unit test KvVramReserveBytes_SwaModel_CountsPleAndWidestHeadDim pins it.

Addresses the #231 review findings:

- HIGH (silent-failure): the SWA prefill-working-set term under-counted Gemma 4's
  real transient buffers — it omitted the PLE stacked proj/row scratch
  (NumLayers×pleWidth, allocated for HasPerLayerTokenEmbd) and used hp.HeadDim
  instead of the WIDEST per-layer head_dim (global 512). EnsureBatchedTrunkScratch
  allocates these lazily at the first long prefill and they coexist with the larger
  KV cache, so the reserve must bound them. Now counts hidden/residual/norm/PleY,
  Q+AttnOut and K+V at maxHeadDim, 2 FFN buffers, and the PLE stack. (The looser
  formula happened to pass the 5K-prefill gate via GpuBufferPool reuse, but the
  reserve should be a true bound, not rely on that.)
- Important (code-reviewer): KvVramReserveBytes guarded on `IsSwaLayer is null`
  while SolveMaxCtxForKv uses `LayerHeadDim != null && IsSwaLayer != null`. Aligned:
  the SWA reserve now requires per-layer head_dim AND ≥1 real sliding-window layer,
  so a Gemma-arch GGUF with an all-false SWA pattern (linear-growth KV) correctly
  keeps the dense reserve. Also Math.Min(dense, ...) so the SWA reserve can never
  exceed what dense would reserve.
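
The aligned guard and the `Math.Min` bound can be sketched as follows. The names `SwaGuardSketch`, `UsesSwaReserve`, and `BoundedReserve` are hypothetical, and the array types for `LayerHeadDim` and `IsSwaLayer` are assumptions; only the conditions themselves come from the description above.

```csharp
#nullable enable
using System;
using System.Linq;

public static class SwaGuardSketch
{
    // The SWA reserve applies only when the model has per-layer head dims
    // AND at least one layer is a real sliding-window layer.
    public static bool UsesSwaReserve(int[]? layerHeadDim, bool[]? isSwaLayer) =>
        layerHeadDim is not null
        && isSwaLayer is not null
        && isSwaLayer.Any(swa => swa);

    // The SWA reserve is additionally capped so it can never exceed the dense reserve.
    public static long BoundedReserve(long denseReserve, long swaReserve, bool usesSwaReserve) =>
        usesSwaReserve ? Math.Min(denseReserve, swaReserve) : denseReserve;
}
```

An all-false SWA pattern therefore falls through to the dense reserve, matching the linear-growth KV behaviour such a model would actually have.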

Re-validated (4070 Ti 12 GB): Gemma q8 auto-ctx still 123361 (the larger reserve is
absorbed by the pow2 KV step — win preserved), real 5052-token prefill 1313 t/s no
OOM; Qwen3-8B dense byte-identical (12080 bf16, 3474 MiB free).

Tests: +2 (VRAM/6 large-card scaling isolated at 48 GiB; exact PLE + widest-head_dim
working-set pin). 5/5 KvVramReserveBytes green; Release build clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

pekkah commented Jun 13, 2026

Review cycle complete

Ran three independent lenses (correctness, silent-failure/OOM, test-coverage) + the automated gemini review.

Findings addressed (commit 2ca5b5f):

  • HIGH (silent-failure-hunter): the SWA prefill-working-set term under-counted Gemma's real transient buffers — omitted the PLE stacked scratch (NumLayers×pleWidth) and used hp.HeadDim not the widest per-layer head_dim (global 512). Now a faithful bound on what EnsureBatchedTrunkScratch allocates. (The looser formula passed the 5K-prefill gate via GpuBufferPool reuse, but the reserve should be a true bound.)
  • Important (code-reviewer): the SWA-vs-dense guard now matches SolveMaxCtxForKv (LayerHeadDim != null && IsSwaLayer != null + ≥1 real SWA layer), so a Gemma-arch GGUF with an all-false SWA pattern keeps the dense reserve. Added Math.Min(dense, …) so the SWA reserve can never exceed dense.
  • Test gaps (pr-test-analyzer): +2 tests — VRAM/6 large-card scaling isolated at 48 GiB; exact PLE + widest-head_dim working-set pin.

Confirmed safe: Qwen3-8B (the #185 cliff model) is dense → reserve byte-identical → no spill risk by construction. Gemma's real 5052-token prefill (crossing the chunk + SWA-ring boundary) runs at 1313 t/s, no OOM; auto-ctx still 123361 (the larger reserve is absorbed by the pow2 KV step — the 4× win is preserved).

5/5 KvVramReserveBytes unit tests + 42/42 CudaForwardPassKvDtypeTests green; Release build clean under TreatWarningsAsErrors + AOT.

… presence

gemini-code-assist point #3: _bpPleY is allocated only when HasPerLayerTokenEmbd,
so the base is 3×embDim (hidden+residual+norm) and the +embDim PleY belongs in the
PLE block. For real Gemma 4 (always PLE) the total is unchanged (4×embDim); this
just avoids a ~63 MiB over-reserve for a hypothetical SWA-without-PLE model. 5/5
KvVramReserveBytes tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
pekkah merged commit 5eb17a5 into master Jun 13, 2026
1 check passed
pekkah deleted the perf/gemma4-swa-kv-reserve-228 branch June 13, 2026 12:39