Problem
The dense CUDA path stores the KV cache in fp32: CudaForwardPass.cs:594-595 allocates _gpuKCache/_gpuVCache as fp32, and GpuForwardPass.cs:199 notes "K + V, fp32". KvAppend, Attention, and AttentionSwa all read and write fp32 buffers with sizeof(float) strides. This applies to every dense model on the CUDA path, not just Gemma; fp32 KV is the single biggest barrier to long context on a 12 GB card.
SHARPI_KV_DTYPE already exists, but only on the GDN hybrid path (CudaHybridGdnForwardPass.cs:672), where it supports only fp32 | bf16; it is not wired to the dense path and has no q8_0 mode. llama.cpp serves these models at 128K on the same class of card only because --cache-type-k/v q8_0 shrinks KV ~4× relative to fp32.
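As a rough check on the ~4× figure, assume the q8_0 layout llama.cpp uses: blocks of 32 int8 values plus one fp16 scale, so 34 bytes per block. Whether this project would adopt exactly that layout is an assumption.

```latex
\[
\frac{\text{fp32 bytes per 32 values}}{\text{q8\_0 bytes per 32 values}}
  = \frac{32 \times 4}{32 \times 1 + 2}
  = \frac{128}{34} \approx 3.76,
\qquad
\frac{\text{fp32}}{\text{fp16 or bf16}} = \frac{128}{64} = 2
\]
```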
This is a general backend enhancement
Quantized KV storage + quant-aware append/attention kernels are a dense-CUDA-path capability, not a model feature. Every dense model benefits:
Qwen3-8B Q4_K_M — KV-bound at long context just like Gemma; gets the same ~4× context headroom.
Gemma 4 12B QAT — the driving use case (see VRAM math) and the validation target (its SWA ring was never exercised past -c 2048).
Any future dense model on the CUDA path inherits the knob for free.
Two Gemma-specific behaviors must be preserved (constraints, not new work):
SWA ring (SwaRingSize/SwaRingHeadroom) — quantized kernels must keep honoring the modulo wrap. Non-SWA models (Qwen3) use the full-context branch unchanged.
attention_k_eq_v (12B global layers, K=V, store K only) — V's dequant-on-read reads the same quantized K blocks.
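The two constraints above can be sketched on the host side as follows. This is a minimal illustration, not the project's code: `kv_slot`, `LayerKv`, and `v_source` are hypothetical names, and a ring size of 0 is used here as a stand-in for "full-context layer".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: physical cache slot for the token at absolute position `pos`.
// SWA layers wrap modulo the ring size; ring_size == 0 marks a full-context layer.
inline std::size_t kv_slot(std::size_t pos, std::size_t ring_size) {
    return ring_size == 0 ? pos : pos % ring_size;
}

// Hypothetical per-layer view of the quantized cache.
struct LayerKv {
    const void* k;  // quantized K blocks
    const void* v;  // quantized V blocks; unused when the layer stores K only
    bool k_eq_v;    // mirrors attention_k_eq_v
};

// Buffer the attention kernel should dequantize V from.
inline const void* v_source(const LayerKv& layer) {
    return layer.k_eq_v ? layer.k : layer.v;
}
```

If the quantization blocks run along head_dim (256 = 8 blocks of 32), the wrap happens at token granularity and never splits a block, so the ring logic itself would not need to change.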
VRAM math — the driving case (Gemma 4 12B QAT, 12 GB card, 128K ctx)
48 layers, head_dim 256, ~40 SWA + 8 global; global layers MQA with attention_k_eq_v (K=V, store K only). SWA ring = window + SwaRingHeadroom (4096), capped at ctx.
| Component | fp32 (today) | q8_0 |
| --- | --- | --- |
| Model (Q4_0 bulk + Q6_K tied embd) | ~7.0 GB | ~7.0 GB |
| SWA-layer KV (×40, ring) | ~3.4 GB | ~0.9 GB |
| Global-layer KV (full 128K, ×8, K=V) | ~1.1 GB | ~0.3 GB |
| Total | ~11.5 GB | ~8.2 GB |
At fp32 the model + KV alone nearly fills the card, with no room for activation scratch. q8_0 leaves ~3.5 GB of headroom (12 - 8.2 = 3.8 GB nominal) and makes 128K viable.
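The q8_0 figures can be cross-checked from the fp32 ones. The sketch below assumes llama.cpp-style q8_0 blocks (32 int8 values plus a 2-byte scale, 34 bytes per 32 values); `fp32_to_q8_gb` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cmath>

// Bytes needed to store 32 cached values in each format.
constexpr double kFp32BytesPer32 = 32 * 4.0;      // 128
constexpr double kQ8BytesPer32   = 32 * 1.0 + 2;  // 34 (assumed q8_0 block layout)

// Scale an fp32 KV footprint (in GB) to its q8_0 equivalent.
inline double fp32_to_q8_gb(double fp32_gb) {
    return fp32_gb * kQ8BytesPer32 / kFp32BytesPer32;
}
```

`fp32_to_q8_gb(3.4)` comes to about 0.90 and `fp32_to_q8_gb(1.1)` to about 0.29, in line with the SWA and global rows, and 7.0 + 0.90 + 0.29 rounds to the 8.2 GB total.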
Scope
fp16/q8_0 KV storage on the dense CUDA path (CudaForwardPass.cs / GpuForwardPass.cs). Allocation is the contained part; the real cost is the three kernels: KvAppend (quantize K/V on write) and Attention / AttentionSwa (dequant-on-read, or a dp4a-style q8 dot, inside the online-softmax loop). There is no q8_0 KV kernel precedent: the GDN bf16 path is a straight half-width store, not block-quantized, so it carries no scale handling. Output must stay argmax-stable through the online softmax.
Extend SHARPI_KV_DTYPE to the dense path and honor --cache-type-k/v; add a q8_0 mode alongside the existing fp32 | bf16. Preserve the SWA ring and attention_k_eq_v aliasing.
Re-measure the README text-gen row at long context (-c 32768, -c 131072) and record decode degradation vs the 2K number.
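To make the dequant-on-read idea concrete, here is a CPU reference sketch of single-head attention over block-quantized K and V using the online-softmax recurrence. It is not the proposed CUDA kernel: `BlockQ8`, `dot_q8`, and `attend_q8` are hypothetical names, the scale is kept as a float where real q8_0 stores fp16, and at least one cached token is assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

constexpr int kBlock = 32;  // q8_0 block width (llama.cpp convention)

// Simplified q8_0 block: float scale for portability (real q8_0 uses fp16).
struct BlockQ8 {
    float scale;
    std::int8_t q[kBlock];
};

// Dot product of an fp32 query against one quantized K row (dequant-on-read).
inline float dot_q8(const float* q, const BlockQ8* k_row, int head_dim) {
    float acc = 0.0f;
    for (int b = 0; b < head_dim / kBlock; ++b) {
        float partial = 0.0f;
        for (int i = 0; i < kBlock; ++i)
            partial += q[b * kBlock + i] * static_cast<float>(k_row[b].q[i]);
        acc += partial * k_row[b].scale;  // apply the block scale once per block
    }
    return acc;
}

// Single-head attention over n_tokens >= 1 cached rows; head_dim must be a multiple of 32.
inline std::vector<float> attend_q8(const float* q, const BlockQ8* k, const BlockQ8* v,
                                    int n_tokens, int head_dim, float softmax_scale) {
    const int blocks = head_dim / kBlock;
    std::vector<float> out(head_dim, 0.0f);
    float running_max = -std::numeric_limits<float>::infinity();
    float running_sum = 0.0f;
    for (int t = 0; t < n_tokens; ++t) {
        const float score = dot_q8(q, k + t * blocks, head_dim) * softmax_scale;
        const float new_max = std::fmax(running_max, score);
        const float correction = std::exp(running_max - new_max);  // rescales earlier terms
        const float weight = std::exp(score - new_max);
        running_sum = running_sum * correction + weight;
        for (int b = 0; b < blocks; ++b) {
            const BlockQ8& vb = v[t * blocks + b];
            for (int i = 0; i < kBlock; ++i) {
                float& o = out[b * kBlock + i];
                o = o * correction + weight * vb.scale * static_cast<float>(vb.q[i]);
            }
        }
        running_max = new_max;
    }
    for (float& o : out) o /= running_sum;  // final softmax normalization
    return out;
}
```

Passing the same pointer for `k` and `v` models the attention_k_eq_v case, where V is dequantized from the stored K blocks. A dp4a-style variant would also quantize the query and accumulate in int32 before applying the two scales.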
Suggested increments
fp16 KV first — half-width store, no block scales, trivially argmax-stable, already ~2× context. Lands as a checkpoint and de-risks the path.
q8_0 second — block quantization + scale handling for the full ~4×.
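For the second increment, the write side (what KvAppend would do per 32 values) can be sketched as below. This follows the usual q8_0 recipe of scale = max|x| / 127 with round-to-nearest; the names are hypothetical and the scale is held as a float, whereas real q8_0 stores it as fp16.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kBlock = 32;  // values per q8_0 block

// Simplified q8_0 block (float scale instead of fp16).
struct BlockQ8 {
    float scale;
    std::int8_t q[kBlock];
};

// Quantize 32 floats: scale = max|x| / 127, q = round(x / scale).
inline BlockQ8 quantize_block(const float* x) {
    float amax = 0.0f;
    for (int i = 0; i < kBlock; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    BlockQ8 out{};
    out.scale = amax / 127.0f;
    const float inv = out.scale > 0.0f ? 1.0f / out.scale : 0.0f;  // guard all-zero blocks
    for (int i = 0; i < kBlock; ++i)
        out.q[i] = static_cast<std::int8_t>(std::lround(x[i] * inv));
    return out;
}

// Dequantize one block back to 32 floats.
inline void dequantize_block(const BlockQ8& b, float* x) {
    for (int i = 0; i < kBlock; ++i) x[i] = b.scale * static_cast<float>(b.q[i]);
}
```

The round-trip error per value is bounded by about half the block scale, which is the quantity the argmax-stability requirement has to tolerate.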
Acceptance
Dense CUDA path runs ≥32K (ideally 128K) context within 12 GB with q8_0 KV. Gemma 4 12B QAT -g -1 is the gating model; Qwen3-8B validated as the non-SWA case.
Argmax-stable vs fp32 KV at short context; coherent long-context output. SHARPI_KV_DTYPE lets a regression be bisected across KV dtypes.
README text-gen row updated with a long-context decode figure.
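One way to phrase the argmax-stability check is to compare greedy tokens step by step between an fp32-KV run and a quantized-KV run. The sketch below assumes per-step logits have been captured from both runs; `argmax` and `first_divergence` are hypothetical helpers, not existing project functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Index of the largest logit (first occurrence wins on ties).
inline std::size_t argmax(const std::vector<float>& logits) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < logits.size(); ++i)
        if (logits[i] > logits[best]) best = i;
    return best;
}

// First decode step whose greedy token differs between the two runs.
// Returns the number of compared steps when the runs agree throughout.
inline std::size_t first_divergence(const std::vector<std::vector<float>>& ref,
                                    const std::vector<std::vector<float>>& test) {
    const std::size_t steps = std::min(ref.size(), test.size());
    for (std::size_t s = 0; s < steps; ++s)
        if (argmax(ref[s]) != argmax(test[s])) return s;
    return steps;
}
```

A run counts as argmax-stable over the compared window when `first_divergence` returns the full step count.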
Additional scope item (SWA ring at long context): the SWA ring (SwaRingSize/SwaRingHeadroom) was validated for E4B in #162/#164; the 12B was only run at -c 2048. Verify correctness past the SWA window at 8K / 32K / 128K (needle-in-haystack, or argmax parity vs fp32 KV at short ctx).