Gemma 4 batched decode: validate on a GEMM-N-batchable 12B (real k_eq_v coverage + Q4_K decode-MMQ) — #275/#276 follow-up

## Context

PR #280 (#275) routed the Gemma 4 batched decode through `BatchDecodeMatMul` (WS / decode-MMQ / GEMM-N), and PR #281 (#276) added synthetic coverage for the 12B-global `attention_k_eq_v` batched path. Both were **limited by the lack of a GEMM-N-batchable 12B**. The only local 12B is `gemma-4-12b-it-qat-q4_0.gguf`, which is Q4_0 and therefore takes the single-user fallback, and a Q8_0 12B is ~13 GB, which doesn't fit the 4070 Ti's 12 GB at a useful context. As a result:

- **#275 decode-MMQ is untested for Gemma 4.** On E4B Q8_0 the decode-MMQ tile (Q4_K-only) correctly falls back to the WS matvec, so only the WS path was benched (2.70×/2.45×/1.79×/1.10× at N=1/2/4/8). The #201/#206 int8 decode-MMQ tile that `BatchDecodeMatMul` plumbs for big Q4_K shapes at N≥5 is **never exercised** on a Gemma 4 model.
- **#276 k_eq_v coverage is synthetic-only.** The 12B-global `CopyDevice(vAll, kAll)` batched branch is validated against the single-token oracle on a tiny **synthetic** all-global F32 fixture (mutation-verified), not on a real 12B. The realistic 12B pairing of *k_eq_v-on-global + real-V-on-SWA in one model* is not reproduced (the synthetic fixture is all-global; SWA is orthogonal and covered separately by the E4B oracles).
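
The sizing constraint above can be sanity-checked with a rough weight-footprint estimate. This is a minimal sketch, not project code: `weight_gb` is a hypothetical helper, the parameter count is assumed to be exactly 12e9, and every tensor is assumed to use a single quant type (a mixed file such as Q4_K_M will differ somewhat). The block sizes are the standard GGML layouts. KV cache and activations are not included, which is why a ~12.75 GB Q8_0 file cannot run on a 12 GB card at all, while a Q4_K-class file leaves room for context.

```python
# Standard GGML block layouts: (bytes per block, weights per block).
BLOCK_LAYOUT = {
    "Q4_0": (18, 32),
    "Q8_0": (34, 32),
    "Q4_K": (144, 256),
    "Q5_K": (176, 256),
    "Q6_K": (210, 256),
}

def weight_gb(n_params: float, quant: str) -> float:
    """Approximate weight size in GB (1e9 bytes), assuming one quant type throughout."""
    bytes_per_block, weights_per_block = BLOCK_LAYOUT[quant]
    return n_params * bytes_per_block / weights_per_block / 1e9

if __name__ == "__main__":
    # Assumption: a nominal 12e9-parameter model; weights only, no KV cache.
    for quant in ("Q8_0", "Q6_K", "Q5_K", "Q4_K"):
        print(f"{quant}: {weight_gb(12e9, quant):.2f} GB")
```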

## Goal

When a **GEMM-N-batchable (Q4_K / Q5_K / Q6_K / Q8_0) 12B** becomes usable, either because one fits in 12 GB (e.g. a Q4_K_M 12B) or because a larger-VRAM box is available:

1. Add a real-12B `Gemma4CudaBatchForwardMultiTests`-style oracle that exercises the **global k_eq_v** layers end-to-end (mirrors the synthetic `Gemma4CudaKEqVBatchedDecodeTests`), closing the #276 synthetic-vs-real gap.
2. **Bench the Gemma 4 batched decode on a Q4_K 12B** so the decode-MMQ tile (N≥5, rows≥2048, cols % 256 == 0) actually engages, and confirm it beats WS at higher N. The #275 routing already dispatches it; this validates that the win exists for Gemma 4, not just the dense path.
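
To make the engagement condition in item 2 concrete, here is a minimal sketch of the eligibility check as this issue describes it. `decode_mmq_eligible` is a hypothetical name, and the real dispatch inside `BatchDecodeMatMul` may apply additional conditions. The thresholds are taken from the text above, and `cols%256` is read as "cols is a multiple of 256".

```python
def decode_mmq_eligible(quant: str, n: int, rows: int, cols: int) -> bool:
    """Hypothetical predicate: would the int8 decode-MMQ tile engage for this matmul?

    Conditions as stated in this issue: Q4_K weights only, batch N >= 5,
    rows >= 2048, and cols a multiple of 256.
    """
    return quant == "Q4_K" and n >= 5 and rows >= 2048 and cols % 256 == 0

if __name__ == "__main__":
    # Illustrative shapes only; not measured Gemma 4 tensor dimensions.
    print(decode_mmq_eligible("Q4_K", 8, 4096, 4096))  # big Q4_K shape at N=8
    print(decode_mmq_eligible("Q8_0", 8, 4096, 4096))  # Q8_0 falls back to WS
    print(decode_mmq_eligible("Q4_K", 4, 4096, 4096))  # N below the threshold
```

A bench run for item 2 would want to log which path each large matmul took at every N, so that a WS-only result is not mistaken for a decode-MMQ measurement.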

## Secondary (lower priority, from #275)

- The PLE pre-pass + injection matmuls (`_gpuInpGate` / `_gpuPleProj` / `per_layer_model_proj`) deliberately stay on cuBLAS `GpuMatMulBatched`: their pleWidth shapes weren't a decode-matvec win, and the issue scoped them out. If a Gemma 4 decode profile shows PLE GEMM is a non-trivial decode-time cost, evaluate routing them through `BatchDecodeMatMul` too, but **bench before shipping**. Small-N decode is memory-bandwidth-bound, which is where cuBLAS GEMM underperforms (same reasoning as #275), but pleWidth is narrow, so the win may not materialize.

Refs: #275 (PR #280), #276 (PR #281), #195, #201/#206 (decode-MMQ), #190 (umbrella).
