Quantized (fp16/q8_0) KV cache for the dense CUDA path → unlock long context (Gemma 4 12B first)

## Problem

The dense CUDA path stores the KV cache in **fp32** (`CudaForwardPass.cs:594-595` allocates `_gpuKCache`/`_gpuVCache` as fp32; `GpuForwardPass.cs:199` — "K + V, fp32"). `KvAppend`, `Attention`, and `AttentionSwa` all read/write fp32 buffers with `sizeof(float)` strides. This applies to **every dense model on the CUDA path**, not just Gemma — fp32 KV is the single biggest barrier to long context on a 12 GB card.

`SHARPI_KV_DTYPE` already exists, but only on the GDN hybrid path (`CudaHybridGdnForwardPass.cs:672`), and it supports only `fp32 | bf16`: it is not wired to the dense path and has no q8_0 mode. llama.cpp serves these models at 128K on the same class of card only because `--cache-type-k/v q8_0` shrinks KV ~4× relative to fp32 storage.
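
For reference, the per-element storage cost behind the "~4×" figure can be sketched as below. This assumes the ggml `q8_0` block layout (32 int8 values plus one fp16 scale, 34 bytes per block); the dictionary and helper names are illustrative and do not exist in the codebase.

```python
# Storage cost per cached element for each candidate KV dtype.
# q8_0 is assumed to follow the ggml block layout: one fp16 scale + 32 int8 values.
Q8_0_BLOCK_ELEMS = 32
Q8_0_BLOCK_BYTES = 2 + 32  # fp16 scale (2 B) + 32 int8 quants

BYTES_PER_ELEMENT = {
    "fp32": 4.0,
    "bf16": 2.0,
    "fp16": 2.0,
    "q8_0": Q8_0_BLOCK_BYTES / Q8_0_BLOCK_ELEMS,  # 1.0625 B per element
}


def shrink_vs_fp32(dtype: str) -> float:
    """How many times smaller a KV cache of `dtype` is than an fp32 one."""
    return BYTES_PER_ELEMENT["fp32"] / BYTES_PER_ELEMENT[dtype]


for name, size in BYTES_PER_ELEMENT.items():
    print(f"{name}: {size} B/elem, {shrink_vs_fp32(name):.2f}x smaller than fp32")
```

The exact q8_0 ratio is 128/34 ≈ 3.76×, which is what "~4×" rounds up from; fp16 and bf16 both give exactly 2×.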

## This is a general backend enhancement

Quantized KV storage + quant-aware append/attention kernels are a **dense-CUDA-path capability**, not a model feature. Every dense model benefits:

- **Qwen3-8B Q4_K_M** — KV-bound at long context just like Gemma; gets the same ~4× context headroom.
- **Gemma 4 12B QAT** — the *driving use case* (see VRAM math) and the *validation target* (its SWA ring was never exercised past `-c 2048`).
- Any future dense model on the CUDA path inherits the knob for free.

Two Gemma-specific behaviors must be **preserved** (constraints, not new work):
- **SWA ring** (`SwaRingSize`/`SwaRingHeadroom`) — quantized kernels must keep honoring the modulo wrap. Non-SWA models (Qwen3) use the full-context branch unchanged.
- **`attention_k_eq_v`** (12B global layers, K=V, store K only) — V's dequant-on-read reads the same quantized K blocks.
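
The two constraints can be illustrated with a small addressing sketch. The `[slot][head][head_dim]` layout, the function names, and the `select_v_blocks` helper are assumptions made for illustration; the real kernels may lay the cache out differently.

```python
Q8_0_BLOCK_ELEMS = 32  # elements per q8_0 block (ggml layout assumed)


def ring_slot(pos: int, ring_size: int) -> int:
    """Slot in the SWA ring that holds token position `pos` (modulo wrap)."""
    return pos % ring_size


def q8_block_offset(slot: int, head: int, n_kv_heads: int, head_dim: int) -> int:
    """Index of the first q8_0 block for (slot, head).

    Assumes a hypothetical [slot][head][head_dim] layout where each head row
    is split into head_dim / 32 consecutive blocks.
    """
    assert head_dim % Q8_0_BLOCK_ELEMS == 0, "head_dim must be block-aligned"
    blocks_per_head = head_dim // Q8_0_BLOCK_ELEMS
    return (slot * n_kv_heads + head) * blocks_per_head


def select_v_blocks(k_blocks, v_blocks, k_eq_v: bool):
    """With attention_k_eq_v, V reads alias the stored K blocks."""
    return k_blocks if k_eq_v else v_blocks


# Position 10 in an 8-slot ring wraps to slot 2; with head_dim 256 and a
# single KV head, that row starts at block 2 * 8 = 16.
print(ring_slot(10, 8), q8_block_offset(ring_slot(10, 8), 0, 1, 256))
```

Because head_dim 256 is a multiple of 32, a wrapped slot always starts on a block boundary, so the modulo arithmetic does not need to change for the quantized layout under this assumption.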

## VRAM math — the driving case (Gemma 4 12B QAT, 12 GB card, 128K ctx)

48 layers, head_dim 256, ~40 SWA + 8 global; global layers MQA with `attention_k_eq_v` (K=V, store K only). SWA ring = window + `SwaRingHeadroom` (4096), capped at ctx.

| Component | fp32 (today) | q8_0 |
|---|---:|---:|
| Model (Q4_0 bulk + Q6_K tied embd) | ~7.0 GB | ~7.0 GB |
| SWA-layer KV (×40, ring) | ~3.4 GB | ~0.9 GB |
| Global-layer KV (full 128K, ×8, K=V) | ~1.1 GB | ~0.3 GB |
| **Total** | **~11.5 GB** | **~8.2 GB** |

At fp32 the model + KV alone nearly fills the card with no room for activation scratch. q8_0 leaves ~3.8 GB of nominal headroom (12 GB minus ~8.2 GB) and makes 128K viable.
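
The q8_0 column can be re-derived from the fp32 column by scaling the KV rows with the q8_0/fp32 byte ratio. This is a minimal sketch that takes the fp32 figures from the table as given and assumes the 34-byte ggml q8_0 block; headroom is computed against a nominal 12 GB, before any driver or display reservation.

```python
# fp32 figures copied from the table above (GB).
FP32_GB = {"model": 7.0, "swa_kv": 3.4, "global_kv": 1.1}

Q8_0_RATIO = 34 / 128  # q8_0 bytes / fp32 bytes for one 32-element block
CARD_GB = 12.0         # nominal card size


def totals(kv_ratio: float):
    """Scale only the KV rows; the model weights are unaffected by KV dtype."""
    swa = FP32_GB["swa_kv"] * kv_ratio
    glob = FP32_GB["global_kv"] * kv_ratio
    total = FP32_GB["model"] + swa + glob
    return swa, glob, total


for label, ratio in (("fp32", 1.0), ("q8_0", Q8_0_RATIO)):
    swa, glob, total = totals(ratio)
    print(f"{label}: SWA {swa:.1f} GB, global {glob:.1f} GB, "
          f"total {total:.1f} GB, headroom {CARD_GB - total:.1f} GB")
```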

## Scope

1. **fp16/q8_0 KV storage on the dense CUDA path** (`CudaForwardPass.cs` / `GpuForwardPass.cs`). The contained part is allocation; the real cost is the **three kernels** — `KvAppend` (quantize K/V on write), `Attention` / `AttentionSwa` (dequant-on-read, or dp4a-style q8 dot, inside the online-softmax loop). No q8_0 KV kernel precedent exists — the GDN bf16 path is a straight half-width store, not block-quantized, so it doesn't carry the scale handling. Must stay argmax-stable through the online softmax.
2. **Extend `SHARPI_KV_DTYPE` to the dense path** and honor `--cache-type-k/v`; add a `q8_0` mode alongside the existing `fp32 | bf16`. Preserve the SWA ring and `attention_k_eq_v` aliasing.
3. **Long-context validation (Gemma 4 12B QAT)** — the SWA ring (`SwaRingSize`/`SwaRingHeadroom`) was validated for E4B in #162/#164; the 12B was only run at `-c 2048`. Verify correctness past the SWA window at 8K / 32K / 128K (needle-in-haystack or argmax parity vs fp32 KV at short ctx).
4. **Re-measure** the README text-gen row at long context (`-c 32768`, `-c 131072`) and record decode degradation vs the 2K number.
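
As a CPU reference for what the three kernels have to do, the following is a minimal sketch of q8_0 quantize-on-append and dequant-on-read inside an online-softmax loop, for a single head. It is not the CUDA implementation: the scale is kept as a Python float rather than fp16, Python's `round` (round-half-to-even) stands in for the kernel's rounding mode, and all function names are hypothetical.

```python
import math

BLOCK = 32  # q8_0 block size (ggml layout assumed)


def quantize_q8_0(x):
    """Quantize floats (length a multiple of 32) into (scale, int8 list) blocks."""
    blocks = []
    for i in range(0, len(x), BLOCK):
        chunk = x[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0                      # per-block scale
        inv = 1.0 / d if d > 0.0 else 0.0     # all-zero block keeps scale 0
        qs = [max(-127, min(127, round(v * inv))) for v in chunk]
        blocks.append((d, qs))
    return blocks


def dequantize_q8_0(blocks):
    """Expand blocks back to floats: value = scale * quant."""
    return [d * q for d, qs in blocks for q in qs]


def attention_q8(q, k_cache, v_cache, scale):
    """Single-head attention over quantized K/V rows using online softmax.

    Assumes K and V rows have the same width as q.
    """
    m = -math.inf            # running max of the scores seen so far
    l = 0.0                  # running softmax denominator
    acc = [0.0] * len(q)     # running weighted sum of V rows
    for k_blocks, v_blocks in zip(k_cache, v_cache):
        k = dequantize_q8_0(k_blocks)         # dequant-on-read
        v = dequantize_q8_0(v_blocks)
        s = scale * sum(a * b for a, b in zip(q, k))
        m_new = max(m, s)
        corr = math.exp(m - m_new)            # rescale earlier partial sums
        p = math.exp(s - m_new)
        l = l * corr + p
        acc = [a * corr + p * vv for a, vv in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]
```

For `attention_k_eq_v` layers, the same sketch applies with `v_cache` passed as the same object as `k_cache`. A dp4a-style kernel would instead quantize `q` and accumulate integer dot products per block before applying the two scales; that variant is not shown here.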

## Suggested increments

- **fp16 KV first** — half-width store, no block scales, trivially argmax-stable, already ~2× context. Lands as a checkpoint and de-risks the path.
- **q8_0 second** — block quantization + scale handling for the full ~4×.
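
To make the fp16 increment concrete, here is a minimal host-side sketch of a half-width store and load using IEEE half precision via Python's `struct` format `e`. The helper names are illustrative only, and the real path would do the conversion inside `KvAppend` and the attention kernels.

```python
import struct


def store_row_fp16(row):
    """Pack a row of floats as little-endian IEEE fp16 (2 bytes each).

    Note: struct raises OverflowError for magnitudes beyond the fp16 range
    (about 65504); a kernel would need its own policy for such values.
    """
    return b"".join(struct.pack("<e", v) for v in row)


def load_row_fp16(buf):
    """Unpack a buffer of little-endian fp16 values back to Python floats."""
    return [v for (v,) in struct.iter_unpack("<e", buf)]


row = [0.5, -1.25, 3.0, 0.1]
buf = store_row_fp16(row)
print(len(buf), "bytes for", len(row), "values")
print(load_row_fp16(buf))
```

Unlike q8_0 there is no per-block scale to carry, so the append and read paths change only in element width and conversion.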

## Acceptance

- Dense CUDA path runs ≥32K (ideally 128K) context within 12 GB with q8_0 KV. Gemma 4 12B QAT `-g -1` is the gating model; Qwen3-8B validated as the non-SWA case.
- Argmax-stable vs fp32 KV at short context; coherent long-context output. Switching `SHARPI_KV_DTYPE` can be used to bisect any regression between KV dtypes.
- README text-gen row updated with a long-context decode figure.
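
The argmax-parity criterion could be checked with a harness along these lines. This is a sketch under the assumption that per-step logits from an fp32-KV run and a quantized-KV run are available as lists of floats; how they are captured is not specified here, and the function names are hypothetical.

```python
def argmax(xs):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


def argmax_parity(ref_logits, test_logits):
    """Compare greedy token choices step by step.

    Returns (matching_steps, compared_steps, first_mismatch_step_or_None).
    Only the common prefix is compared if the runs differ in length.
    """
    n = min(len(ref_logits), len(test_logits))
    matches = 0
    first_mismatch = None
    for step in range(n):
        if argmax(ref_logits[step]) == argmax(test_logits[step]):
            matches += 1
        elif first_mismatch is None:
            first_mismatch = step
    return matches, n, first_mismatch


# Toy data: the second step disagrees, the others agree.
ref = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
test = [[0.2, 0.8], [0.4, 0.6], [0.1, 0.9]]
print(argmax_parity(ref, test))
```

In a greedy decode the two runs diverge after the first mismatch, so the first mismatch step is the most useful number to report.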

