The auto-context the dense CUDA path picks (-g -1, no -c) is much smaller than what actually fits. On a 4070 Ti 12 GB with gemma-4-12b-it-qat-q4_0 + --kv-type q8_0:
| mode | ctx |
| --- | --- |
| auto (-g -1 --kv-type q8_0) | 30,840 |
| forced (-c 262144 --kv-type q8_0, = model max) | 262,144 (constructs + runs a token) |
So the full 256K q8_0 KV cache fits, but auto picks ~8× less. (Found while verifying #220, which fixed a separate bug — auto ignoring --kv-type entirely.)
Root cause: a blunt, over-sized VRAM reserve
CudaForwardPass.EstimateAvailableKvVram (CudaForwardPass.cs:3946-3982) computes the KV budget as VRAM − estimatedWeights − scratch − reserve, where the reserve is the max(VRAM/3, 2GB) heuristic.
On a 12 GB card that reserve is 4094 MiB — and it's the dominant term. A SHARPI_TRACE_VRAM=1 run shows the reality:
constructor entry: free = 10854 MiB (~1.4 GB already used by the CUDA context)
before per-layer upload: free = 9316 MiB
after ALL weight uploads: free = 2061 MiB ← actual headroom for KV
The whole 256K q8_0 KV cache fits in that 2061 MiB (Gemma's ~5:1 SWA layers cap at their ~5K ring, so only the few global layers scale with context — cheap at q8_0). But the auto formula subtracts a 4 GB reserve on top of the weight estimate, leaving only ~1.3 GB of computed budget → 30,840. It's reserving ~2.5–3 GB it doesn't use.
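The arithmetic above can be checked with a small sketch. This is illustrative Python, not the C# in CudaForwardPass.cs; the function names are hypothetical, and the 12282 MiB total is an assumption back-derived from the 4094 MiB reserve (4094 × 3), which also matches the roughly 1.4 GB gap to the 10854 MiB free at constructor entry.

```python
def reserve_mib(total_vram_mib: int) -> int:
    """Reserve per the max(VRAM/3, 2 GB) heuristic named in this issue."""
    return max(total_vram_mib // 3, 2048)


def kv_budget_mib(total_vram_mib: int, est_weights_mib: int, scratch_mib: int) -> int:
    """KV budget = VRAM - estimatedWeights - scratch - reserve."""
    return total_vram_mib - est_weights_mib - scratch_mib - reserve_mib(total_vram_mib)


# Assumption: total VRAM back-derived from the reported reserve, not read from the trace.
TOTAL_MIB = 12282

# Values copied from the SHARPI_TRACE_VRAM=1 run quoted above.
FREE_AT_ENTRY_MIB = 10854
FREE_AFTER_WEIGHTS_MIB = 2061

print("reserve:", reserve_mib(TOTAL_MIB), "MiB")
print("used before constructor entry:", TOTAL_MIB - FREE_AT_ENTRY_MIB, "MiB")
print("consumed through weight upload:", FREE_AT_ENTRY_MIB - FREE_AFTER_WEIGHTS_MIB, "MiB")
```

The estimatedWeights and scratch terms are not given in the issue, so the sketch leaves them as parameters rather than guessing values.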
Why the reserve is load-bearing (do NOT just delete it)
The comment at :3970-3977 documents why it was made large: an earlier max(VRAM/5, 1 GB) reserve left only ~24 MiB free on a 12 GB card running Qwen3-8B; the driver spilled the ~600 MiB lm-head into system RAM, the matvec ran over PCIe at ~22 GB/s instead of ~400 GB/s from on-card VRAM, and prefill collapsed 65 → 4 t/s (~16×). So a reserve must cover the cuBLAS workspace + pinned staging buffer + GPU buffer-pool churn + framebuffer that grow during a run, not just at construction. The fix must not reintroduce that cliff.
It's also doubly pessimistic for SWA models specifically: context past the sliding-window ring costs only the handful of global layers (near-free at q8_0), so a VRAM/3 reserve "spends" almost-free context.
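To make the SWA point concrete, here is a minimal sketch of how KV size grows with context when sliding-window layers stop growing at their ring size. Everything in it is hypothetical: the layer counts, window, head shape, and the q8_0 cost of 34 bytes per 32 elements (the ggml block layout, assumed here rather than confirmed for this project) are toy values, not Gemma's real configuration.

```python
Q8_0_BYTES_PER_BLOCK = 34   # assumption: 32 int8 values + one fp16 scale per block
Q8_0_ELEMS_PER_BLOCK = 32


def kv_bytes(ctx: int, n_swa_layers: int, n_global_layers: int,
             window: int, kv_heads: int, head_dim: int) -> int:
    """KV cache bytes at q8_0 when SWA layers are capped at `window` tokens."""
    elems_per_token = 2 * kv_heads * head_dim            # K and V
    bytes_per_token = elems_per_token * Q8_0_BYTES_PER_BLOCK // Q8_0_ELEMS_PER_BLOCK
    swa_tokens = min(ctx, window)                         # ring buffer stops growing
    return bytes_per_token * (n_swa_layers * swa_tokens + n_global_layers * ctx)


# Toy shape with a 5:1 SWA-to-global ratio; NOT the real model dimensions.
shape = dict(n_swa_layers=5, n_global_layers=1, window=1024, kv_heads=4, head_dim=128)

inside = kv_bytes(1024, **shape) - kv_bytes(0, **shape)      # cost of the first 1024 tokens
beyond = kv_bytes(2048, **shape) - kv_bytes(1024, **shape)   # cost of the next 1024 tokens
print("first 1024 tokens:", inside, "bytes")
print("next 1024 tokens: ", beyond, "bytes")
```

With a 5:1 ratio, each token past the window costs one sixth of a token inside it, which is why a reserve sized as a fraction of total VRAM gives up a disproportionate amount of context on these models.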
Proposed work
Replace the max(VRAM/3, 2GB) heuristic with a measured / bounded reserve — e.g. a fixed allowance for the driver context + cuBLAS workspace + pinned staging + pool headroom (order ~1.5–2 GB, not a third of VRAM), ideally derived from the live gpu.FreeVramBytes after weight upload rather than a fraction of total VRAM.
Account for the transient prefill working set (activations for the chunk + cuBLAS workspace) so the budget is safe for a real long prefill, not just n=1 construction.
Applies to both the dense EstimateMaxContext/SolveMaxCtxForKv path and EstimateAvailableKvVram's consumers (TierPlanner prices its KV budget from the same family).
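One way the bounded reserve could look, as a minimal sketch: a pure function of the live free VRAM after weight upload and two allowances. The function name and both allowance values are hypothetical placeholders, not measurements; choosing them (and deciding how much of the driver context is already reflected in the live free reading) is the actual work proposed above.

```python
def bounded_kv_budget_mib(free_after_weights_mib: int,
                          runtime_allowance_mib: int,
                          prefill_transient_mib: int) -> int:
    """KV budget from live free VRAM minus a bounded reserve; never negative.

    runtime_allowance_mib: growth during a run (cuBLAS workspace, pinned
        staging, buffer-pool churn, framebuffer).
    prefill_transient_mib: activations and workspace for one prefill chunk.
    """
    budget = free_after_weights_mib - runtime_allowance_mib - prefill_transient_mib
    return max(budget, 0)


# 2061 MiB is the measured free VRAM after weight upload from the trace above.
# The allowances below are placeholders for illustration only.
print(bounded_kv_budget_mib(2061, runtime_allowance_mib=300, prefill_transient_mib=200))
```

Keeping the inputs explicit means the same function can serve both the dense path and TierPlanner, and can be exercised without a GPU.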
Acceptance
Gemma 4 12B -g -1 --kv-type q8_0 auto-context lands materially closer to what -c shows fits (target: a large multiple of today's 30,840), with no OOM on a real long-context generation (not just n=1).
Safety A/B (the load-bearing check): Qwen3-8B and Gemma 4 12B prefill t/s at the tighter reserve do not regress vs today, i.e. no weight/lm-head spill to system memory (compare against the cliff in #185, "q8_0/bf16 KV follow-ups: auto-narrow default + Tc/half2-flash q8 thunks (#179)": 65 vs 4 t/s). Validate with SHARPI_TRACE_VRAM that free VRAM after all allocations stays above a safe floor.
No change for models/dtypes where the current reserve already leaves little headroom.
A unit-testable seam if practical (the reserve calc as a pure function of total/used/weight bytes), cf. the ResolveKvDType/SolveMaxCtxForKv factoring.
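For the unit-testable seam, the step from a byte budget to a context length can also be a pure function. The sketch below is an assumption about how such a solve could be factored, not the project's SolveMaxCtxForKv: `solve_max_ctx` and its parameters are hypothetical, and it only requires that the cost function never decreases as context grows.

```python
from typing import Callable


def solve_max_ctx(budget_bytes: int,
                  cost_fn: Callable[[int], int],
                  model_max: int) -> int:
    """Largest ctx in [0, model_max] with cost_fn(ctx) <= budget_bytes.

    Assumes cost_fn is non-decreasing in ctx. Returns 0 if nothing fits.
    """
    lo, hi = 0, model_max
    while lo < hi:
        mid = (lo + hi + 1) // 2          # bias upward so the loop terminates
        if cost_fn(mid) <= budget_bytes:
            lo = mid                      # mid fits; try larger
        else:
            hi = mid - 1                  # mid is too large
    return lo


# Toy linear cost of 100 bytes per token, for illustration only.
def toy_cost(ctx: int) -> int:
    return 100 * ctx


print(solve_max_ctx(1050, toy_cost, model_max=262144))
print(solve_max_ctx(10**12, toy_cost, model_max=262144))
```

Because the solve takes the cost as a callable, a test can feed it a synthetic cost curve and assert on the result without loading a model.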