Gemma 4 GPU prefill is ~70x slower than llama.cpp — adopt cuBLAS GEMM for prefill

> ## ✅ RESOLVED (2026-06-14)
> Prefill went from ~108 → ~3444 t/s (E4B Q8, RTX 4070 Ti): dequant→cuBLAS/MMQ GEMM (#143, 108→1564), tensor-core flash-attention prefill (#146/#147, PR #148), and the SoA Q8_0 repack (#149). That is ~2× of llama.cpp's ~8475 t/s — within the **"2–3×, not 70×"** acceptance bar. The remaining ~2× compute lever (a different MMQ tiling regime) is tracked in **#152** (split-K & cp.async already ruled out there). Decode was left untouched per scope (companion #142). Closing.
>
> ---

## Finding

Gemma 4 GPU **prefill is ~70× slower than llama.cpp** — the single largest gap, and the one
most worth closing.

Benchmark (RTX 4070 Ti, `gemma-4-E4B-it-Q8_0`, all layers on GPU):

| | llama.cpp b9529 (cuda-12.4) | SharpInference | gap |
|---|---|---|---|
| Prefill (pp512) | **~8,000 t/s** | ~108 t/s | **~70×** |

(llama.cpp via `llama-bench -ngl 99 -p 512`; ours via `bench-textgen` warm prefill.)

## Root cause

Our batched-trunk prefill (#136) collapses per-position launches into **matvec / GEMM-N**
kernels — but those are **memory-bound**: they re-stream the full weight matrix roughly every
~2 tokens (prefill ≈ 2× decode, which is all #136 bought). llama.cpp runs prefill as a
**compute-bound cuBLAS GEMM** that reads each weight **once per ~512-token batch**, so on a
compute-rich GPU it's bandwidth-bound vs compute-bound — hence the ~70×.

## Proposal

Route the prefill trunk matmuls (Q/K/V/O projections, FFN gate/up/down) through a real GEMM
for the multi-token batch:

- We already link cuBLAS (`cublas64_12`, used for DiT) — use `cublasGemmEx` with the prompt
  activations as `[N × dim]`.
- Q8_0 weights need a real GEMM input: dequantize to fp16/bf16 once per matmul (or tile on the
  fly), or add a quantized-GEMM (MMQ-style) kernel. llama.cpp uses dequant→cuBLAS for large
  batches and its MMQ int8 kernels otherwise.
- Decode stays on the existing matvec path (a 1-token "batch" is genuinely bandwidth-bound, so
  GEMM there buys nothing — don't touch it).

## Acceptance

- Gemma 4 prefill within ~2–3× of llama.cpp (not ~70×).
- Decode throughput unchanged.
- Output within fp tolerance of the current path (gemma4 CUDA forward tests stay green).

## Why this one

Per "don't optimize paths that won't reach competitive speed": this is the path that *can*
close most of the gap. It dwarfs the CUDA-graph decode work (#136/#140, ~+5% decode, nothing
for prefill). Reference comparison: see the decode-gap companion issue.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gemma 4 GPU prefill is ~70x slower than llama.cpp — adopt cuBLAS GEMM for prefill #141

✅ RESOLVED (2026-06-14)

Finding

Root cause

Proposal

Acceptance

Why this one

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Gemma 4 GPU prefill is ~70x slower than llama.cpp — adopt cuBLAS GEMM for prefill #141

Description

✅ RESOLVED (2026-06-14)

Finding

Root cause

Proposal

Acceptance

Why this one

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions