Skip to content

Gemma 4 GPU prefill is ~70x slower than llama.cpp — adopt cuBLAS GEMM for prefill #141

Description

@pekkah

✅ RESOLVED (2026-06-14)

Prefill went from ~108 → ~3444 t/s (E4B Q8, RTX 4070 Ti): dequant→cuBLAS/MMQ GEMM (#143, 108→1564), tensor-core flash-attention prefill (#146/#147, PR #148), and the SoA Q8_0 repack (#149). That is ~2× of llama.cpp's ~8475 t/s — within the "2–3×, not 70×" acceptance bar. The remaining ~2× compute lever (a different MMQ tiling regime) is tracked in #152 (split-K & cp.async already ruled out there). Decode was left untouched per scope (companion #142). Closing.


Finding

Gemma 4 GPU prefill is ~70× slower than llama.cpp — the single largest gap, and the one
most worth closing.

Benchmark (RTX 4070 Ti, gemma-4-E4B-it-Q8_0, all layers on GPU):

llama.cpp b9529 (cuda-12.4) SharpInference gap
Prefill (pp512) ~8,000 t/s ~108 t/s ~70×

(llama.cpp via llama-bench -ngl 99 -p 512; ours via bench-textgen warm prefill.)

Root cause

Our batched-trunk prefill (#136) collapses per-position launches into matvec / GEMM-N
kernels — but those are memory-bound: they re-stream the full weight matrix roughly every
~2 tokens (prefill ≈ 2× decode, which is all #136 bought). llama.cpp runs prefill as a
compute-bound cuBLAS GEMM that reads each weight once per ~512-token batch, so on a
compute-rich GPU it's bandwidth-bound vs compute-bound — hence the ~70×.

Proposal

Route the prefill trunk matmuls (Q/K/V/O projections, FFN gate/up/down) through a real GEMM
for the multi-token batch:

  • We already link cuBLAS (cublas64_12, used for DiT) — use cublasGemmEx with the prompt
    activations as [N × dim].
  • Q8_0 weights need a real GEMM input: dequantize to fp16/bf16 once per matmul (or tile on the
    fly), or add a quantized-GEMM (MMQ-style) kernel. llama.cpp uses dequant→cuBLAS for large
    batches and its MMQ int8 kernels otherwise.
  • Decode stays on the existing matvec path (a 1-token "batch" is genuinely bandwidth-bound, so
    GEMM there buys nothing — don't touch it).

Acceptance

  • Gemma 4 prefill within ~2–3× of llama.cpp (not ~70×).
  • Decode throughput unchanged.
  • Output within fp tolerance of the current path (gemma4 CUDA forward tests stay green).

Why this one

Per "don't optimize paths that won't reach competitive speed": this is the path that can
close most of the gap. It dwarfs the CUDA-graph decode work (#136/#140, ~+5% decode, nothing
for prefill). Reference comparison: see the decode-gap companion issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions