perf(cuda): EstimateAvailableKvVram reserves max(VRAM/3, 2GB) — far over-conservative, auto-context lands ~8× below what fits (Gemma 4 12B q8_0: 30K auto vs 256K real) #228

Description

@pekkah

Problem

The auto-context the dense CUDA path picks (-g -1, no -c) is much smaller than what actually fits. On a 4070 Ti 12 GB with gemma-4-12b-it-qat-q4_0 + --kv-type q8_0:

| Mode | ctx |
| --- | --- |
| auto (-g -1 --kv-type q8_0) | 30,840 |
| forced (-c 262144 --kv-type q8_0, = model max) | 262,144 (constructs + runs a token) |

So the full 256K q8_0 KV cache fits, but auto picks ~8× less. (Found while verifying #220, which fixed a separate bug — auto ignoring --kv-type entirely.)

Root cause: a blunt, over-sized VRAM reserve

CudaForwardPass.EstimateAvailableKvVram (CudaForwardPass.cs:3946-3982) computes the KV budget as VRAM − estimatedWeights − scratch − reserve, where:

long reserved = Math.Max(vramBytes / 3, 2L * 1024 * 1024 * 1024);   // :3978

On a 12 GB card that reserve is 4094 MiB — and it's the dominant term. A SHARPI_TRACE_VRAM=1 run shows the reality:

constructor entry:        free = 10854 MiB   (~1.4 GB already used by the CUDA context)
before per-layer upload:  free =  9316 MiB
after ALL weight uploads: free =  2061 MiB   ← actual headroom for KV

The whole 256K q8_0 KV cache fits in that 2061 MiB: Gemma's sliding-window (SWA) layers, which outnumber the global layers roughly 5:1, cap at their ~5K ring, so only the few global layers scale with context, and those are cheap at q8_0. But the auto formula subtracts a ~4 GB reserve on top of the weight estimate, leaving only ~1.3 GB of computed budget → 30,840. It is reserving ~2.5–3 GB that it doesn't use.
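The arithmetic behind those numbers can be checked with a short Python sketch (Python is used here only as a calculator; the code under discussion is C#). The 12,282 MiB total is an assumption, chosen because it reproduces both the 4094 MiB reserve and the ~1.4 GB context overhead; the free-VRAM samples are copied from the trace above.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def current_reserve(vram_bytes: int) -> int:
    """Mirror of the reserve expression quoted above: max(VRAM/3, 2 GiB)."""
    return max(vram_bytes // 3, 2 * GIB)

# Assumption: total VRAM reported for the 12 GB card (not stated in the trace).
total_mib = 12282

# Free-VRAM samples from the SHARPI_TRACE_VRAM=1 run, in MiB.
free_at_entry = 10854
free_after_weights = 2061

reserve_mib = current_reserve(total_mib * MIB) // MIB
context_overhead_mib = total_mib - free_at_entry          # used before the constructor runs
consumed_by_ctor_mib = free_at_entry - free_after_weights  # weights plus setup allocations

print(f"reserve:            {reserve_mib} MiB")
print(f"CUDA context:       {context_overhead_mib} MiB")
print(f"consumed by ctor:   {consumed_by_ctor_mib} MiB")
print(f"reserve / headroom: {reserve_mib / free_after_weights:.2f}x")
```

Under that assumption the reserve alone is roughly twice the headroom that is actually left once the weights are resident.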

Why the reserve is load-bearing (do NOT just delete it)

The comment at :3970-3977 documents why it was made large: an earlier max(VRAM/5, 1 GB) reserve left only ~24 MiB free on a 12 GB card running Qwen3-8B; the driver spilled the ~600 MiB lm-head into system RAM, the matvec ran over PCIe at ~22 GB/s instead of ~400 GB/s from on-device VRAM, and prefill collapsed 65 → 4 t/s (~16×). So a reserve must cover the cuBLAS workspace + pinned staging buffer + GPU buffer-pool churn + framebuffer, all of which grow during a run, not just at construction. The fix must not reintroduce that cliff.
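To put the two heuristics side by side, the Python sketch below evaluates the earlier max(VRAM/5, 1 GB) reserve and the current max(VRAM/3, 2 GB) reserve for the same card, along with the two slowdown ratios implied by the comment. The 12,282 MiB total is an assumed value for a 12 GB card, and reading "GB" as GiB is an assumption that follows the 2L * 1024 * 1024 * 1024 literal in the current code.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def reserve(vram_bytes: int, divisor: int, floor_bytes: int) -> int:
    """Generic form of both heuristics: max(VRAM / divisor, floor)."""
    return max(vram_bytes // divisor, floor_bytes)

vram = 12282 * MIB  # assumption: total VRAM on the 12 GB card

old_reserve_mib = reserve(vram, 5, 1 * GIB) // MIB      # earlier heuristic
current_reserve_mib = reserve(vram, 3, 2 * GIB) // MIB  # heuristic at :3978

# Ratios implied by the figures in the code comment.
bandwidth_ratio = 400 / 22  # on-device vs PCIe bandwidth, GB/s
prefill_ratio = 65 / 4      # prefill t/s before vs after the spill

print(f"old reserve:     {old_reserve_mib} MiB")
print(f"current reserve: {current_reserve_mib} MiB")
print(f"bandwidth ratio: {bandwidth_ratio:.1f}x, prefill ratio: {prefill_ratio:.2f}x")
```

On these assumptions the two heuristics differ by about 1.6 GB, and the observed prefill slowdown is close to the bandwidth ratio, which is what one would expect if the lm-head matvec had become PCIe-bound.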

It's also doubly pessimistic for SWA models specifically: context past the sliding-window ring costs only the handful of global layers (near-free at q8_0), so a VRAM/3 reserve "spends" almost-free context.
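A rough cost model makes the SWA point concrete. Every model dimension below (layer counts, KV heads, head size, window length) is a hypothetical placeholder rather than Gemma's real configuration, and the q8_0 cost of 34 bytes per 32 elements assumes a GGML-style block layout. The sketch is meant to show only the shape of the curve, not real sizes.

```python
from dataclasses import dataclass

# Assumption: GGML-style q8_0 block = 32 int8 values + one fp16 scale = 34 bytes.
Q8_0_BYTES_PER_ELEM = 34 / 32

@dataclass(frozen=True)
class KvShape:
    swa_layers: int     # sliding-window layers (ring-buffer KV)
    global_layers: int  # full-attention layers (KV grows with context)
    kv_heads: int
    head_dim: int
    window: int         # ring size in tokens for SWA layers

def kv_bytes(shape: KvShape, ctx: int, bytes_per_elem: float = Q8_0_BYTES_PER_ELEM) -> int:
    """Total K+V bytes at a given context length."""
    per_token_per_layer = 2 * shape.kv_heads * shape.head_dim * bytes_per_elem
    swa_tokens = min(ctx, shape.window)  # SWA layers stop growing at the ring size
    tokens = shape.global_layers * ctx + shape.swa_layers * swa_tokens
    return int(per_token_per_layer * tokens)

# Hypothetical 5:1 SWA-to-global model; not Gemma's actual dimensions.
shape = KvShape(swa_layers=20, global_layers=4, kv_heads=4, head_dim=128, window=5120)

for ctx in (5120, 32768, 262144):
    print(f"ctx={ctx:>7}: {kv_bytes(shape, ctx) / 2**20:8.1f} MiB")
```

Past the window, each additional token costs only the global layers' share, one sixth of what a fully dense model with the same layer count would pay in this hypothetical configuration.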

Proposed work

  • Replace the max(VRAM/3, 2GB) heuristic with a measured / bounded reserve — e.g. a fixed allowance for the driver context + cuBLAS workspace + pinned staging + pool headroom (order ~1.5–2 GB, not a third of VRAM), ideally derived from the live gpu.FreeVramBytes after weight upload rather than a fraction of total VRAM.
  • Keep a floor that provably avoids the spill cliff from #185 (q8_0/bf16 KV follow-ups: auto-narrow default + Tc/half2-flash q8 thunks (#179)): weights/lm-head must never page to system RAM.
  • Account for the transient prefill working set (activations for the chunk + cuBLAS workspace) so the budget is safe for a real long prefill, not just n=1 construction.
  • Apply the change to both the dense EstimateMaxContext/SolveMaxCtxForKv path and EstimateAvailableKvVram's consumers (TierPlanner prices its KV budget from the same family).
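One possible shape for such a bounded reserve, sketched in Python for brevity (a real implementation would live in the C# code). The function name, parameter names, and the example values are all hypothetical. Only the idea comes from the list above: start from measured free VRAM after weight upload, then hold back a fixed allowance plus the transient prefill working set.

```python
MIB = 1024 * 1024

def kv_budget_bytes(
    free_after_upload: int,    # measured free VRAM once all weights are resident
    runtime_allowance: int,    # hypothetical: cuBLAS workspace + staging + pool churn
    prefill_working_set: int,  # hypothetical: activations for one prefill chunk
    safety_floor: int,         # hypothetical: minimum VRAM that must stay free
) -> int:
    """KV budget derived from measured free VRAM instead of a fraction of total VRAM.

    Returns 0 when the held-back amount already exceeds what is free, so the
    caller can fall back to a minimum context rather than over-commit and spill.
    """
    for name, value in (
        ("free_after_upload", free_after_upload),
        ("runtime_allowance", runtime_allowance),
        ("prefill_working_set", prefill_working_set),
        ("safety_floor", safety_floor),
    ):
        if value < 0:
            raise ValueError(f"{name} must be non-negative, got {value}")

    # Hold back whichever is larger: the itemised allowances or the floor.
    held_back = max(runtime_allowance + prefill_working_set, safety_floor)
    return max(free_after_upload - held_back, 0)

# Illustrative values only; 2061 MiB is the traced free VRAM after weight upload.
budget = kv_budget_bytes(2061 * MIB, 384 * MIB, 128 * MIB, 256 * MIB)
print(budget // MIB, "MiB available for KV")
```

Because the live measurement is taken after the CUDA context and weights are already resident, the allowance in this sketch only needs to cover growth during a run. The right values would have to come from the safety measurements, not from this example.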

Acceptance

  • Gemma 4 12B -g -1 --kv-type q8_0 auto-context lands materially closer to what -c shows fits (target: a large multiple of today's 30,840), with no OOM on a real long-context generation (not just n=1).
  • Safety A/B (the load-bearing check): Qwen3-8B and Gemma 4 12B prefill t/s at the tighter reserve do not regress vs today, i.e. no weight/lm-head spill to system memory (compare against the #185 cliff, 65 vs 4 t/s; #185 is "q8_0/bf16 KV follow-ups: auto-narrow default + Tc/half2-flash q8 thunks (#179)"). Validate with SHARPI_TRACE_VRAM that free VRAM after all allocations stays above a safe floor.
  • No change for models/dtypes where the current reserve already leaves little headroom.
  • A unit-testable seam if practical (the reserve calc as a pure function of total/used/weight bytes), cf. the ResolveKvDType/SolveMaxCtxForKv factoring.
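The safety A/B could be automated along these lines. This is a Python sketch only: the function names, the 5% tolerance, and the 256 MiB floor are hypothetical. The 65 and 4 t/s figures, the 2061 MiB sample, and the 24 MiB sample are taken from the issue text; the middle sample (900) is made up for illustration.

```python
from typing import Sequence

def prefill_regressed(baseline_tps: float, candidate_tps: float, tolerance: float = 0.05) -> bool:
    """True when candidate prefill throughput falls more than `tolerance` below baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return candidate_tps < baseline_tps * (1.0 - tolerance)

def min_free_ok(free_samples_mib: Sequence[int], floor_mib: int) -> bool:
    """True when every free-VRAM sample taken during the run stays at or above the floor."""
    if not free_samples_mib:
        raise ValueError("need at least one free-VRAM sample")
    return min(free_samples_mib) >= floor_mib

# The spill cliff would be flagged: 65 t/s baseline vs 4 t/s.
print(prefill_regressed(65.0, 4.0))   # True
# A run within tolerance passes: 64 is above 65 * 0.95 = 61.75.
print(prefill_regressed(65.0, 64.0))  # False
# A run that dips to ~24 MiB free fails a hypothetical 256 MiB floor.
print(min_free_ok([2061, 900, 24], floor_mib=256))  # False
```

Keeping both checks as pure functions of recorded numbers means they can run in a unit test without a GPU, with the traced values fed in from a real run.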
