perf(cuda): SWA-scoped KV-VRAM reserve - Gemma 4 q8 auto-context 30K→123K (#228) #231
…#228)

EstimateAvailableKvVram reserved max(VRAM/3, 2GB) = 4 GiB on a 12 GB card, the dominant term shrinking the KV budget and pinning Gemma 4 12B q8_0 auto-context to ~30K when far more fits. The reserve is load-bearing for DENSE models (their KV grows linearly to fill any budget; an earlier smaller reserve spilled Qwen3-8B's lm-head to system RAM, 65→4 t/s, #185), so it can't be cut globally.

Fix: factor the reserve into the pure, testable KvVramReserveBytes(hp, vram) and scope a smaller bounded reserve to SWA/per-layer (Gemma 4) models only: a fixed system allowance (max(2GiB, VRAM/6), NOT VRAM/3) plus one PrefillBatchChunk's activation working set. Safe precisely because SWA KV SATURATES past the sliding-window ring (only the few global layers grow), so a larger budget can't grow KV to eat the headroom, which is the dense failure mode. Dense models keep max(VRAM/3, 2GB) byte-for-byte.

Also enforce the model-max ceiling at the single _maxSeqLen chokepoint (Math.Min(_maxSeqLen, hp.ContextLength)) so neither an explicit -c nor a calculated context can exceed the model's own max, regardless of estimator.

Result (4070 Ti 12 GB, gemma-4-12b-it-qat-q4_0, -g -1 --kv-type q8_0):
- auto-ctx 30840 → 123361 (4×)
- a real 5052-token prefill runs at 1312 t/s with no OOM/spill (1290 MiB free held)
- Qwen3-8B dense UNCHANGED (12080 bf16, 3474 MiB free, byte-identical reserve)
- explicit -c 999999999 clamps to 262144

Tests: 3 GGUF-free KvVramReserveBytes unit tests (dense keeps max(VRAM/3,2GB); SWA bounded below dense; SWA scales with model width). 42/42 CudaForwardPassKvDtypeTests green (excl. slow >4096 chunked, validated via the CLI long-prefill above). Release build clean under TreatWarningsAsErrors + AOT analyzers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
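A worked instance of the two fixed allowances on this card may help. It treats the 12 GB card as 12 GiB and "2GB" as 2 GiB (both assumptions; the commit message states only the 4 GiB result), and writes $W$ for one PrefillBatchChunk's activation working set, whose size is not given here.

```latex
\begin{align*}
R_{\text{dense}} &= \max\!\left(\tfrac{12}{3},\ 2\right)\ \text{GiB} = 4\ \text{GiB} \\
R_{\text{SWA}}   &= \max\!\left(2,\ \tfrac{12}{6}\right)\ \text{GiB} + W = 2\ \text{GiB} + W
\end{align*}
```

Under those assumptions the fixed part of the reserve halves for SWA models, and the KV budget gains $2\ \text{GiB} - W$.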
Code Review
This pull request introduces a bounded VRAM reservation strategy for SWA/Gemma 4 models via the new KvVramReserveBytes method, preventing over-reservation of VRAM while maintaining the existing logic for dense models. It also ensures the resolved context length is clamped to the model's maximum. The review feedback points out several discrepancies in the prefillWorkingSet calculation compared to the actual allocations in EnsureBatchedTrunkScratch (such as handling of head dimensions, embedding buffers, and PLE buffers) and provides a code suggestion to align them accurately.
    long prefillWorkingSet = (long)PrefillBatchChunk *
        (hp.EmbeddingDim * 4L + hp.IntermediateDim * 2L
         + (long)hp.NumHeads * hp.HeadDim + 2L * hp.NumKvHeads * hp.HeadDim) * sizeof(float);
The current calculation of prefillWorkingSet has a few discrepancies compared to the actual allocations performed in EnsureBatchedTrunkScratch for SWA/Gemma 4 models:
- Q and AttnOut Buffers: `_bpQ` and `_bpAttnOut` are two buffers of size `n * numHeads * maxHeadDim`. The current formula only accounts for one buffer (`(long)hp.NumHeads * hp.HeadDim`) and uses `hp.HeadDim` instead of the maximum head dimension (`maxHeadDim`), which can be larger (e.g., 512 vs 256 in Gemma 4).
- K and V Buffers: `_bpK` and `_bpV` are two buffers of size `n * numKvHeads * maxHeadDim`. The current formula uses `hp.HeadDim` instead of `maxHeadDim`.
- Embedding Buffers: `_bpHidden`, `_bpResidual`, and `_bpNorm` are three buffers of size `n * embDim`. The formula unconditionally uses `hp.EmbeddingDim * 4L`, which over-reserves when `hp.HasPerLayerTokenEmbd` is false.
- PLE Buffers: When `hp.HasPerLayerTokenEmbd` is true, `_bpProjAll` (`n * NumLayers * pleWidth`), `_bpPleRowAll` (`n * NumLayers * pleWidth`), and `_bpPleGate` (`n * pleWidth`) are allocated but completely omitted from the reserve calculation. For Gemma 4 12B, this under-reserves around 340 MiB of VRAM.
To ensure the VRAM estimation is robust and accurate, we should align the prefillWorkingSet calculation exactly with the allocations in EnsureBatchedTrunkScratch.
    int maxHeadDim = hp.HeadDim;
    if (hp.LayerHeadDim is not null)
    {
        foreach (int hd in hp.LayerHeadDim)
        {
            if (hd > maxHeadDim) maxHeadDim = hd;
        }
    }
    long perTokenFloats = hp.EmbeddingDim * 3L
        + (long)hp.NumHeads * maxHeadDim * 2L
        + (long)hp.NumKvHeads * maxHeadDim * 2L
        + hp.IntermediateDim * 2L;
    if (hp.HasPerLayerTokenEmbd)
    {
        perTokenFloats += hp.NumLayers * (long)hp.PerLayerEmbeddingWidth * 2L
            + hp.PerLayerEmbeddingWidth
            + hp.EmbeddingDim;
    }
    long prefillWorkingSet = (long)PrefillBatchChunk * perTokenFloats * sizeof(float);
Addressed in 2ca5b5f + e4af59a: prefillWorkingSet now counts Q+AttnOut and K+V at the widest per-layer head_dim (×2 each), the PLE proj/row/gate stack + PleY (only when HasPerLayerTokenEmbd), and a 3×embDim base. Matches EnsureBatchedTrunkScratch exactly. New exact-value unit test KvVramReserveBytes_SwaModel_CountsPleAndWidestHeadDim pins it.
Addresses the #231 review findings:

- HIGH (silent-failure): the SWA prefill-working-set term under-counted Gemma 4's real transient buffers. It omitted the PLE stacked proj/row scratch (NumLayers×pleWidth, allocated for HasPerLayerTokenEmbd) and used hp.HeadDim instead of the WIDEST per-layer head_dim (global 512). EnsureBatchedTrunkScratch allocates these lazily at the first long prefill and they coexist with the larger KV cache, so the reserve must bound them. Now counts hidden/residual/norm/PleY, Q+AttnOut and K+V at maxHeadDim, 2 FFN buffers, and the PLE stack. (The looser formula happened to pass the 5K-prefill gate via GpuBufferPool reuse, but the reserve should be a true bound, not rely on that.)
- Important (code-reviewer): KvVramReserveBytes guarded on `IsSwaLayer is null` while SolveMaxCtxForKv uses `LayerHeadDim != null && IsSwaLayer != null`. Aligned: the SWA reserve now requires per-layer head_dim AND ≥1 real sliding-window layer, so a Gemma-arch GGUF with an all-false SWA pattern (linear-growth KV) correctly keeps the dense reserve. Also Math.Min(dense, ...) so the SWA reserve can never exceed what dense would reserve.

Re-validated (4070 Ti 12 GB): Gemma q8 auto-ctx still 123361 (the larger reserve is absorbed by the pow2 KV step, so the win is preserved), real 5052-token prefill 1313 t/s no OOM; Qwen3-8B dense byte-identical (12080 bf16, 3474 MiB free).

Tests: +2 (VRAM/6 large-card scaling isolated at 48 GiB; exact PLE + widest-head_dim working-set pin). 5/5 KvVramReserveBytes green; Release build clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
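The aligned guard described in the second finding can be sketched as a small predicate. This is a minimal illustration, not the repository's code: `SwaGuardSketch` and `UsesSwaReserve` are hypothetical names, and the array types for `LayerHeadDim` and `IsSwaLayer` are assumptions.

```csharp
#nullable enable
using System;

// Hypothetical illustration of the guard; not the actual KvVramReserveBytes code.
public static class SwaGuardSketch
{
    // Returns true only when per-layer head dims are present AND at least one
    // layer is a real sliding-window layer. An all-false SWA pattern falls
    // back to the dense reserve because its KV still grows linearly.
    public static bool UsesSwaReserve(int[]? layerHeadDim, bool[]? isSwaLayer)
    {
        if (layerHeadDim is null || isSwaLayer is null) return false;
        return Array.Exists(isSwaLayer, swa => swa);
    }
}
```

With this shape, a Gemma-arch model whose SWA pattern is all false gets `false` and keeps the dense reserve, which is the case the finding calls out.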
Review cycle complete
Ran three independent lenses (correctness, silent-failure/OOM, test-coverage) + the automated gemini review. Findings addressed (commit 2ca5b5f):
Confirmed safe: Qwen3-8B (the #185 cliff model) is dense → reserve byte-identical → no spill risk by construction. Gemma's real 5052-token prefill (crossing the chunk + SWA-ring boundary) runs at 1313 t/s, no OOM; auto-ctx still 123361 (the larger reserve is absorbed by the pow2 KV step — the 4× win is preserved). 5/5 |
… presence gemini-code-assist point #3: _bpPleY is allocated only when HasPerLayerTokenEmbd, so the base is 3×embDim (hidden+residual+norm) and the +embDim PleY belongs in the PLE block. For real Gemma 4 (always PLE) the total is unchanged (4×embDim); this just avoids a ~63 MiB over-reserve for a hypothetical SWA-without-PLE model. 5/5 KvVramReserveBytes tests green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Fixes #228.
`CudaForwardPass.EstimateAvailableKvVram` reserved `max(VRAM/3, 2GB)` = 4 GiB on a 12 GB card, the dominant term shrinking the KV budget and pinning Gemma 4 12B q8_0 auto-context to ~30K when far more fits (the full 256K runs via `-c`).

The reserve is load-bearing for dense models: their KV grows linearly to fill any budget, and an earlier smaller reserve once left ~24 MiB free running Qwen3-8B, spilling the lm-head to system RAM over PCIe and collapsing prefill 65→4 t/s (#185). So it can't be cut globally.
Fix
Factor the reserve into the pure, testable `CudaForwardPass.KvVramReserveBytes(hp, vram)` and scope a smaller bounded reserve to SWA/per-layer (Gemma 4) models only: a fixed system allowance (`max(2 GiB, VRAM/6)`, not `VRAM/3`) plus one `PrefillBatchChunk`'s activation working set. This is safe precisely because SWA KV saturates past the sliding-window ring (only the few global layers grow), so a larger budget can't grow KV to consume the headroom, which is the dense failure mode. Dense models keep `max(VRAM/3, 2GB)` byte-for-byte unchanged.

Also enforce the model-max ceiling at the single `_maxSeqLen` chokepoint (`Math.Min(_maxSeqLen, hp.ContextLength)`) so neither an explicit `-c` nor a calculated context can ever exceed the model's own max, regardless of estimator.

Result (4070 Ti 12 GB, `gemma-4-12b-it-qat-q4_0`, `-g -1 --kv-type q8_0`)
- Auto-context: 30840 → 123361 (4×)
- Real 5052-token prefill: 1312 t/s, no OOM/spill (1290 MiB free held)
- Qwen3-8B dense: unchanged (12080 bf16, 3474 MiB free, byte-identical reserve)
- Explicit `-c 999999999`: clamps to 262144

Tests
3 GGUF-free `KvVramReserveBytes` unit tests: dense keeps `max(VRAM/3, 2GB)`; SWA bounded below dense; SWA scales with model width. 42/42 `CudaForwardPassKvDtypeTests` green (excl. the slow >4096 chunked, validated via the CLI 5K-token prefill above). Release build clean under TreatWarningsAsErrors + AOT analyzers.

Safety A/B (the load-bearing check)
Qwen3-8B (the #185 cliff model) is byte-identical — dense reserve unchanged → no spill risk by construction. Gemma's real long-context prefill (5052 tokens, crossing the chunk + SWA-ring boundary) runs at 1312 t/s with no OOM.
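The reserve policy and context clamp described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `KvReserveSketch` and its method names are hypothetical, "2GB" is treated as 2 GiB, the prefill working set is passed in rather than derived from hyperparameters, and the cap at the dense value follows the review-fix commit.

```csharp
using System;

// Hypothetical stand-in for the reserve logic; names and signatures are assumptions.
public static class KvReserveSketch
{
    private const long GiB = 1L << 30;

    // Dense policy as described: max(VRAM/3, 2 GiB).
    public static long DenseReserve(long vramBytes) =>
        Math.Max(vramBytes / 3, 2 * GiB);

    // SWA policy: max(2 GiB, VRAM/6) plus one prefill chunk's working set,
    // capped so it can never exceed what dense would reserve.
    public static long SwaReserve(long vramBytes, long prefillWorkingSetBytes)
    {
        long systemAllowance = Math.Max(2 * GiB, vramBytes / 6);
        return Math.Min(DenseReserve(vramBytes), systemAllowance + prefillWorkingSetBytes);
    }

    // Model-max ceiling: neither an explicit -c nor a computed context may exceed it.
    public static int ClampContext(int requested, int modelMax) =>
        Math.Min(requested, modelMax);
}
```

For a 12 GiB card, `DenseReserve` gives 4 GiB, while `SwaReserve` gives 2 GiB plus the working set, and `ClampContext(999999999, 262144)` returns 262144, matching the clamp reported in the results.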
🤖 Generated with Claude Code