✅ RESOLVED (2026-06-14)
Prefill went from ~108 → ~3444 t/s (E4B Q8, RTX 4070 Ti): dequant→cuBLAS/MMQ GEMM (#143, 108→1564), tensor-core flash-attention prefill (#146/#147, PR #148), and the SoA Q8_0 repack (#149). That is ~2× of llama.cpp's ~8475 t/s — within the "2–3×, not 70×" acceptance bar. The remaining ~2× compute lever (a different MMQ tiling regime) is tracked in #152 (split-K & cp.async already ruled out there). Decode was left untouched per scope (companion #142). Closing.
Finding
Gemma 4 GPU prefill is ~70× slower than llama.cpp — the single largest gap, and the one
most worth closing.
Benchmark (RTX 4070 Ti, gemma-4-E4B-it-Q8_0, all layers on GPU):
|
llama.cpp b9529 (cuda-12.4) |
SharpInference |
gap |
| Prefill (pp512) |
~8,000 t/s |
~108 t/s |
~70× |
(llama.cpp via llama-bench -ngl 99 -p 512; ours via bench-textgen warm prefill.)
Root cause
Our batched-trunk prefill (#136) collapses per-position launches into matvec / GEMM-N
kernels — but those are memory-bound: they re-stream the full weight matrix roughly every
~2 tokens (prefill ≈ 2× decode, which is all #136 bought). llama.cpp runs prefill as a
compute-bound cuBLAS GEMM that reads each weight once per ~512-token batch, so on a
compute-rich GPU it's bandwidth-bound vs compute-bound — hence the ~70×.
Proposal
Route the prefill trunk matmuls (Q/K/V/O projections, FFN gate/up/down) through a real GEMM
for the multi-token batch:
- We already link cuBLAS (
cublas64_12, used for DiT) — use cublasGemmEx with the prompt
activations as [N × dim].
- Q8_0 weights need a real GEMM input: dequantize to fp16/bf16 once per matmul (or tile on the
fly), or add a quantized-GEMM (MMQ-style) kernel. llama.cpp uses dequant→cuBLAS for large
batches and its MMQ int8 kernels otherwise.
- Decode stays on the existing matvec path (a 1-token "batch" is genuinely bandwidth-bound, so
GEMM there buys nothing — don't touch it).
Acceptance
- Gemma 4 prefill within ~2–3× of llama.cpp (not ~70×).
- Decode throughput unchanged.
- Output within fp tolerance of the current path (gemma4 CUDA forward tests stay green).
Why this one
Per "don't optimize paths that won't reach competitive speed": this is the path that can
close most of the gap. It dwarfs the CUDA-graph decode work (#136/#140, ~+5% decode, nothing
for prefill). Reference comparison: see the decode-gap companion issue.
Finding
Gemma 4 GPU prefill is ~70× slower than llama.cpp — the single largest gap, and the one
most worth closing.
Benchmark (RTX 4070 Ti,
gemma-4-E4B-it-Q8_0, all layers on GPU):(llama.cpp via
llama-bench -ngl 99 -p 512; ours viabench-textgenwarm prefill.)Root cause
Our batched-trunk prefill (#136) collapses per-position launches into matvec / GEMM-N
kernels — but those are memory-bound: they re-stream the full weight matrix roughly every
~2 tokens (prefill ≈ 2× decode, which is all #136 bought). llama.cpp runs prefill as a
compute-bound cuBLAS GEMM that reads each weight once per ~512-token batch, so on a
compute-rich GPU it's bandwidth-bound vs compute-bound — hence the ~70×.
Proposal
Route the prefill trunk matmuls (Q/K/V/O projections, FFN gate/up/down) through a real GEMM
for the multi-token batch:
cublas64_12, used for DiT) — usecublasGemmExwith the promptactivations as
[N × dim].fly), or add a quantized-GEMM (MMQ-style) kernel. llama.cpp uses dequant→cuBLAS for large
batches and its MMQ int8 kernels otherwise.
GEMM there buys nothing — don't touch it).
Acceptance
Why this one
Per "don't optimize paths that won't reach competitive speed": this is the path that can
close most of the gap. It dwarfs the CUDA-graph decode work (#136/#140, ~+5% decode, nothing
for prefill). Reference comparison: see the decode-gap companion issue.