Quantized (fp16/q8_0) KV cache for the dense CUDA path → unlock long context (Gemma 4 12B first) #179

Description

@pekkah

Problem

The dense CUDA path stores the KV cache in fp32 (CudaForwardPass.cs:594-595 allocates _gpuKCache/_gpuVCache as fp32; GpuForwardPass.cs:199 — "K + V, fp32"). KvAppend, Attention, and AttentionSwa all read/write fp32 buffers with sizeof(float) strides. This applies to every dense model on the CUDA path, not just Gemma — fp32 KV is the single biggest barrier to long context on a 12 GB card.

SHARPI_KV_DTYPE already exists, but only on the GDN hybrid path (CudaHybridGdnForwardPass.cs:672), and it supports only fp32 | bf16: it is not wired to the dense path and has no q8_0 mode. llama.cpp serves these models at 128K on the same class of card only because --cache-type-k/v q8_0 shrinks KV ~4× relative to fp32 (q8_0 stores 34 bytes per 32 elements, so the exact ratio is closer to 3.8×).

This is a general backend enhancement

Quantized KV storage + quant-aware append/attention kernels are a dense-CUDA-path capability, not a model feature. Every dense model benefits:

  • Qwen3-8B Q4_K_M — KV-bound at long context just like Gemma; gets the same ~4× context headroom.
  • Gemma 4 12B QAT — the driving use case (see VRAM math) and the validation target (its SWA ring was never exercised past -c 2048).
  • Any future dense model on the CUDA path inherits the knob for free.

Two Gemma-specific behaviors must be preserved (constraints, not new work):

  • SWA ring (SwaRingSize/SwaRingHeadroom) — quantized kernels must keep honoring the modulo wrap. Non-SWA models (Qwen3) use the full-context branch unchanged.
  • attention_k_eq_v (12B global layers, K=V, store K only) — V's dequant-on-read reads the same quantized K blocks.
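
The two constraints can be pictured with a small host-side model. This is a sketch in Python, not the project's C#/CUDA code: `ring_slot`, `KvLayer`, and the capacities used are hypothetical names and values chosen for illustration. Only the modulo wrap and the K=V aliasing are taken from the description above.

```python
def ring_slot(pos, ring_size):
    """Map an absolute token position to a slot in the SWA ring (modulo wrap)."""
    return pos % ring_size


class KvLayer:
    """Toy per-layer KV store. Rows are opaque here: they could equally be
    fp32 rows or lists of quantized blocks, the indexing does not change."""

    def __init__(self, capacity, is_swa, k_eq_v=False):
        self.capacity = capacity      # ring size for SWA layers, full ctx otherwise
        self.is_swa = is_swa
        self.k_eq_v = k_eq_v          # models attention_k_eq_v (store K only)
        self.k = [None] * capacity
        # With K=V the V view aliases the K storage instead of allocating its own.
        self.v = self.k if k_eq_v else [None] * capacity

    def append(self, pos, k_row, v_row=None):
        # SWA layers wrap; non-SWA layers use the full-context branch unchanged.
        slot = ring_slot(pos, self.capacity) if self.is_swa else pos
        self.k[slot] = k_row
        if not self.k_eq_v:
            self.v[slot] = v_row
        return slot
```

Because V aliases K in the `k_eq_v` case, a quantized append writes one set of blocks and both the K read and the V read dequantize from it, which is the behavior the second bullet asks to preserve.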

VRAM math — the driving case (Gemma 4 12B QAT, 12 GB card, 128K ctx)

48 layers, head_dim 256, ~40 SWA + 8 global; global layers MQA with attention_k_eq_v (K=V, store K only). SWA ring = window + SwaRingHeadroom (4096), capped at ctx.

| Component | fp32 (today) | q8_0 |
|---|---|---|
| Model (Q4_0 bulk + Q6_K tied embd) | ~7.0 GB | ~7.0 GB |
| SWA-layer KV (×40, ring) | ~3.4 GB | ~0.9 GB |
| Global-layer KV (full 128K, ×8, K=V) | ~1.1 GB | ~0.3 GB |
| Total | ~11.5 GB | ~8.2 GB |

At fp32 the model + KV alone nearly fills the card with no room for activation scratch. q8_0 leaves ~3.5 GB of headroom and makes 128K viable.
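
The global-layer row can be reproduced from the stated shape. The sketch below assumes one KV head per global layer (MQA), 128K = 131072 tokens, and the ggml q8_0 block layout of 32 int8 values plus one fp16 scale (34 bytes per 32 elements); `global_kv_bytes` is a hypothetical helper, not project code. The SWA row is not reproduced because the window size and KV head count for those layers are not given here.

```python
# Bytes per cached element for each storage mode.
# q8_0: 32 int8 values + one fp16 scale = 34 bytes per 32 elements.
BYTES_PER_ELEM = {"fp32": 4.0, "fp16": 2.0, "q8_0": 34 / 32}


def global_kv_bytes(ctx, layers, head_dim, kv_heads, dtype, k_eq_v=True):
    """KV bytes for full-context (global) layers."""
    tensors = 1 if k_eq_v else 2      # K only when V aliases K
    elems = ctx * layers * kv_heads * head_dim * tensors
    return elems * BYTES_PER_ELEM[dtype]


for dtype in ("fp32", "fp16", "q8_0"):
    gb = global_kv_bytes(131072, 8, 256, 1, dtype) / 1e9
    print(f"{dtype}: {gb:.2f} GB")
# fp32: 1.07 GB
# fp16: 0.54 GB
# q8_0: 0.29 GB
```

These figures round to the ~1.1 GB and ~0.3 GB in the table.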

Scope

  1. fp16/q8_0 KV storage on the dense CUDA path (CudaForwardPass.cs / GpuForwardPass.cs). The contained part is allocation; the real cost is the three kernels: KvAppend (quantize K/V on write) and Attention / AttentionSwa (dequant-on-read, or a dp4a-style q8 dot, inside the online-softmax loop). No q8_0 KV kernel precedent exists: the GDN bf16 path is a straight half-width store, not block-quantized, so it doesn't carry the scale handling. Must stay argmax-stable through the online softmax.
  2. Extend SHARPI_KV_DTYPE to the dense path and honor --cache-type-k/v; add a q8_0 mode alongside the existing fp32 | bf16. Preserve the SWA ring and attention_k_eq_v aliasing.
  3. Long-context validation (Gemma 4 12B QAT): the SWA ring (SwaRingSize/SwaRingHeadroom) was validated for E4B in #162/#164; the 12B was only run at -c 2048. Verify correctness past the SWA window at 8K / 32K / 128K (needle-in-haystack, or argmax parity vs fp32 KV at short ctx).
  4. Re-measure the README text-gen row at long context (-c 32768, -c 131072) and record decode degradation vs the 2K number.
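
As a reference for item 1, the following is a minimal CPU model of quantize-on-append and dequant-on-read inside an online-softmax loop. It is a sketch under stated assumptions, not the intended kernel: the function names are hypothetical, the block size of 32 follows ggml's q8_0, the scale is kept in full precision rather than fp16, Python's `round` rounds ties to even where a kernel would likely round half away from zero, and the dp4a-style integer dot is not shown.

```python
import math

QK = 32  # elements per q8_0 block (assumed, matching ggml)


def quantize_q8_0(row):
    """Quantize a float row into (scale, int8 values) blocks; len(row) % QK == 0."""
    blocks = []
    for i in range(0, len(row), QK):
        chunk = row[i:i + QK]
        amax = max(abs(x) for x in chunk)
        d = amax / 127.0                      # per-block scale
        inv = 1.0 / d if d else 0.0           # all-zero block quantizes to zeros
        blocks.append((d, [round(x * inv) for x in chunk]))
    return blocks


def dequantize_q8_0(blocks):
    """Expand (scale, int8 values) blocks back to floats."""
    return [d * q for d, qs in blocks for q in qs]


def attention_online_softmax(q, k_cache, v_cache, scale):
    """Single-query attention over quantized K/V rows with a running max and sum."""
    m = -math.inf            # running max of scores
    l = 0.0                  # running softmax denominator
    acc = [0.0] * len(q)     # running weighted sum of V (V dim == len(q) here)
    for k_blocks, v_blocks in zip(k_cache, v_cache):
        k = dequantize_q8_0(k_blocks)         # dequant-on-read
        s = scale * sum(a * b for a, b in zip(q, k))
        m_new = max(m, s)
        corr = math.exp(m - m_new)            # rescale earlier partial sums
        p = math.exp(s - m_new)
        v = dequantize_q8_0(v_blocks)
        acc = [a * corr + p * x for a, x in zip(acc, v)]
        l = l * corr + p
        m = m_new
    return [a / l for a in acc]
```

Passing the same cache as both `k_cache` and `v_cache` models the attention_k_eq_v case. The per-element round-trip error is bounded by half the block scale, which is the quantity to watch when checking that the argmax survives the softmax.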

Suggested increments

  • fp16 KV first — half-width store, no block scales, trivially argmax-stable, already ~2× context. Lands as a checkpoint and de-risks the path.
  • q8_0 second — block quantization + scale handling for the full ~4×.
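
For contrast with the block-quantized path, the fp16 increment is a plain narrowing store. The sketch below uses Python's `struct` half-precision format to show the 2-bytes-per-element layout with no scales; `store_fp16` and `load_fp16` are hypothetical names, and the real store would be a device-side conversion in the kernel.

```python
import struct


def store_fp16(row):
    """Pack floats as IEEE half precision: 2 bytes per element, no block scales."""
    return struct.pack(f"<{len(row)}e", *row)


def load_fp16(buf):
    """Unpack a half-precision buffer back to Python floats."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))
```

For values in fp16's normal range the relative rounding error is at most 2^-11. Values beyond the fp16 maximum of 65504 make `struct.pack` raise `OverflowError`, so a real kernel would have to decide how to handle out-of-range K/V values.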

Acceptance

  • Dense CUDA path runs ≥32K (ideally 128K) context within 12 GB with q8_0 KV. Gemma 4 12B QAT -g -1 is the gating model; Qwen3-8B validated as the non-SWA case.
  • Argmax-stable vs fp32 KV at short context; coherent long-context output. SHARPI_KV_DTYPE serves as the bisect switch between KV modes.
  • README text-gen row updated with a long-context decode figure.
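
The argmax-stability criterion can be phrased as a small check over per-step logits. This is a sketch with hypothetical helper names; it assumes logits from an fp32-KV run and a quantized-KV run are available as lists of lists. The accepted SHARPI_KV_DTYPE values (with fp16 and q8_0 added as proposed here) and the fp32 default are assumptions for illustration.

```python
import os

# Existing modes are fp32 | bf16; fp16 and q8_0 are the additions proposed here.
VALID_KV_DTYPES = ("fp32", "bf16", "fp16", "q8_0")


def kv_dtype_from_env(env=os.environ, default="fp32"):
    """Read SHARPI_KV_DTYPE; the fp32 default is an assumption."""
    value = env.get("SHARPI_KV_DTYPE", default).strip().lower()
    if value not in VALID_KV_DTYPES:
        raise ValueError(f"unsupported SHARPI_KV_DTYPE: {value!r}")
    return value


def argmax(xs):
    """Index of the largest element (first one on ties)."""
    return max(range(len(xs)), key=xs.__getitem__)


def first_argmax_mismatch(ref_steps, test_steps):
    """Return the first decode step whose argmax differs, or None if all match."""
    for step, (ref, test) in enumerate(zip(ref_steps, test_steps)):
        if argmax(ref) != argmax(test):
            return step
    return None
```

Running the same prompt once per KV mode and comparing each run against fp32 with `first_argmax_mismatch` narrows a regression down to the mode that introduced it.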
