perf(cuda): EstimateAvailableKvVram reserves max(VRAM/3, 2GB) — far over-conservative, auto-context lands ~8× below what fits (Gemma 4 12B q8_0: 30K auto vs 256K real)

## Problem

The auto-context the dense CUDA path picks (`-g -1`, no `-c`) is much smaller than what actually fits. On a 4070 Ti 12 GB with `gemma-4-12b-it-qat-q4_0` + `--kv-type q8_0`:

| mode | ctx (tokens) |
|---|---|
| auto (`-g -1 --kv-type q8_0`) | **30,840** |
| forced (`-c 262144 --kv-type q8_0`, = model max) | **262,144** — constructs + runs a token |

So the full 256K q8_0 KV cache fits, but auto picks ~8× less. (Found while verifying #220, which fixed a *separate* bug — auto ignoring `--kv-type` entirely.)

## Root cause: a blunt, over-sized VRAM reserve

`CudaForwardPass.EstimateAvailableKvVram` (`CudaForwardPass.cs:3946-3982`) computes the KV budget as `VRAM − estimatedWeights − scratch − reserve`, where:

```
long reserved = Math.Max(vramBytes / 3, 2L * 1024 * 1024 * 1024);   // :3978
```

On a 12 GB card that reserve is **4094 MiB** — and it's the dominant term. A `SHARPI_TRACE_VRAM=1` run shows the reality:

```
constructor entry:        free = 10854 MiB   (~1.4 GB already used by the CUDA context)
before per-layer upload:  free =  9316 MiB
after ALL weight uploads: free =  2061 MiB   ← actual headroom for KV
```

The whole 256K q8_0 KV cache fits in that 2061 MiB: Gemma's SWA layers (~5:1 against global layers) cap at their ~5K ring, so only the few global layers scale with context, which is cheap at q8_0. But the auto formula subtracts the ~4 GB (4094 MiB) reserve on top of the weight estimate, leaving only ~1.3 GB of *computed* budget → 30,840. It's reserving ~2.5–3 GB it doesn't use.
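To make the size of the reserve concrete, here is a minimal sketch that evaluates the quoted `Math.Max(vramBytes / 3, 2 GiB)` expression for a few card sizes. The class and method names are hypothetical, and the 12282 MiB total is an assumption inferred from the 4094 MiB figure above (4094 × 3), not a value taken from the trace.

```csharp
using System;

public static class LegacyReserve
{
    public const long MiB = 1024L * 1024;
    public const long GiB = 1024L * MiB;

    // Same expression as the line quoted from CudaForwardPass.cs:3978.
    public static long Compute(long vramBytes) =>
        Math.Max(vramBytes / 3, 2L * GiB);

    public static void Main()
    {
        // Assumption: the card reports 12282 MiB total (inferred, see lead-in).
        long total = 12282 * MiB;
        Console.WriteLine($"reserve = {Compute(total) / MiB} MiB"); // reserve = 4094 MiB

        // The 2 GiB floor only wins at or below 6 GiB of total VRAM;
        // above that the reserve grows linearly with card size.
        foreach (long gib in new long[] { 4, 6, 8, 12, 16, 24 })
            Console.WriteLine($"{gib,2} GiB card -> {Compute(gib * GiB) / MiB} MiB reserved");
    }
}
```

The takeaway is that the reserve scales with total VRAM rather than with anything the run actually allocates, so larger cards give up proportionally more.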

## Why the reserve is load-bearing (do NOT just delete it)

The comment at `:3970-3977` documents why it was made large: an earlier `max(VRAM/5, 1 GB)` reserve left only ~24 MiB free on a 12 GB card running Qwen3-8B; the driver spilled the ~600 MiB lm-head into system RAM, the matvec ran over PCIe at ~22 GB/s instead of ~400 GB/s from VRAM, and **prefill collapsed 65 → 4 t/s (~16×)** (the #185 cliff). So the reserve must cover the cuBLAS workspace + pinned staging buffer + GPU buffer-pool churn + framebuffer that grow *during* a run, not just what exists at construction. The fix must not reintroduce that cliff.

The current `VRAM/3` reserve is also doubly pessimistic for SWA models specifically: context past the sliding-window ring costs only the handful of global layers (near-free at q8_0), so the reserve "spends" VRAM that would otherwise buy almost-free context.

## Proposed work

- Replace the `max(VRAM/3, 2GB)` heuristic with a **measured / bounded** reserve — e.g. a fixed allowance for the driver context + cuBLAS workspace + pinned staging + pool headroom (order ~1.5–2 GB, not a third of VRAM), ideally derived from the live `gpu.FreeVramBytes` after weight upload rather than a fraction of total VRAM.
- Keep a floor that provably avoids the #185 spill cliff (weights/lm-head must never page to system RAM).
- Account for the transient prefill working set (activations for the chunk + cuBLAS workspace) so the budget is safe for a real long prefill, not just `n=1` construction.
- Applies to both the dense `EstimateMaxContext`/`SolveMaxCtxForKv` path and `EstimateAvailableKvVram`'s consumers (TierPlanner prices its KV budget from the same family).
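One possible shape for the bounded reserve is a pure function over measured byte counts. This is only a sketch: `KvReserveSketch`, `KvBudgetBytes`, and all four parameters are hypothetical names that do not exist in the codebase, and it assumes free VRAM can be sampled after weight upload and before the KV cache is sized. The inputs in `Main` are illustrative, not measurements.

```csharp
using System;

public static class KvReserveSketch
{
    public const long MiB = 1024L * 1024;

    // freeAfterWeights : live free VRAM measured after all weight uploads
    // runtimeGrowth    : fixed allowance for cuBLAS workspace, pinned staging, pool churn
    // prefillWorkingSet: transient activations for one prefill chunk
    // spillFloor       : smallest reserve considered safe against the spill cliff
    public static long KvBudgetBytes(
        long freeAfterWeights, long runtimeGrowth, long prefillWorkingSet, long spillFloor)
    {
        if (freeAfterWeights < 0 || runtimeGrowth < 0 || prefillWorkingSet < 0 || spillFloor < 0)
            throw new ArgumentException("byte counts must be non-negative");

        // Reserve what is expected to grow during a run, but never less than the floor.
        long reserve = Math.Max(runtimeGrowth + prefillWorkingSet, spillFloor);

        // Clamp at zero: a tight card gets no KV budget rather than a negative one.
        return Math.Max(0L, freeAfterWeights - reserve);
    }

    public static void Main()
    {
        // Illustrative inputs only.
        long budget = KvBudgetBytes(8000 * MiB, 1024 * MiB, 512 * MiB, 1024 * MiB);
        Console.WriteLine($"KV budget = {budget / MiB} MiB");
    }
}
```

Keeping the calculation free of GPU calls means the floor and the clamping behaviour can be unit-tested without a device; the open question is which values for the allowance and floor keep the Qwen3-8B prefill numbers intact.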

## Acceptance

- [ ] Gemma 4 12B `-g -1 --kv-type q8_0` auto-context lands materially closer to what `-c` shows fits (target: a large multiple of today's 30,840), with no OOM on a real long-context generation (not just `n=1`).
- [ ] **Safety A/B (the load-bearing check):** Qwen3-8B and Gemma 4 12B prefill t/s at the tighter reserve do **not** regress vs today — i.e. no weight/lm-head spill to system memory (compare against the #185 cliff: 65 vs 4 t/s). Validate with `SHARPI_TRACE_VRAM` that free VRAM after all allocations stays above a safe floor.
- [ ] No change for models/dtypes where the current reserve already leaves little headroom.
- [ ] A unit-testable seam if practical (the reserve calc as a pure function of total/used/weight bytes), cf. the `ResolveKvDType`/`SolveMaxCtxForKv` factoring.
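The `SHARPI_TRACE_VRAM` check in the safety A/B could be automated along these lines. This sketch assumes every trace line carries a `free = <N> MiB` reading in the same form as the excerpt in this issue, including any lines emitted after KV allocation; the class name, method names, and the 1024 MiB floor are hypothetical placeholders.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class VramTraceCheck
{
    // Assumed line shape: "<label>: free = <N> MiB".
    static readonly Regex FreeMiB = new Regex(@"free\s*=\s*(\d+)\s*MiB");

    // Returns the smallest free-VRAM reading in the trace, or null if none parsed.
    public static long? MinFreeMiB(string trace)
    {
        var readings = FreeMiB.Matches(trace)
            .Cast<Match>()
            .Select(m => long.Parse(m.Groups[1].Value))
            .ToList();
        return readings.Count == 0 ? (long?)null : readings.Min();
    }

    // True only if at least one reading was found and none dips below the floor.
    public static bool StaysAboveFloor(string trace, long floorMiB)
    {
        long? min = MinFreeMiB(trace);
        return min.HasValue && min.Value >= floorMiB;
    }

    public static void Main()
    {
        string trace =
            "constructor entry:        free = 10854 MiB\n" +
            "before per-layer upload:  free =  9316 MiB\n" +
            "after ALL weight uploads: free =  2061 MiB\n";
        Console.WriteLine(MinFreeMiB(trace));            // 2061
        Console.WriteLine(StaysAboveFloor(trace, 1024)); // True
    }
}
```

Treating an empty trace as a failure is deliberate: a run where tracing was not enabled should not pass the safety check by default.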

## References
- #220: dtype-aware auto-context, the *separate* bug fixed first; this issue is the reserve-size follow-up.
- #185: KV auto-narrow + the documented PCIe spill cliff.
- #213: long-context decode perf.
- #162/#164: SWA ring sizing.
