Gemma 4 batched decode: validate on a GEMM-N-batchable 12B (real k_eq_v coverage + Q4_K decode-MMQ) — #275/#276 follow-up #283

@pekkah

Context

PR #280 (#275) routed the Gemma 4 batched decode through BatchDecodeMatMul (WS / decode-MMQ / GEMM-N), and PR #281 (#276) added synthetic coverage for the 12B-global attention_k_eq_v batched path. Both were limited by the lack of a GEMM-N-batchable 12B: the only local 12B is gemma-4-12b-it-qat-q4_0.gguf (Q4_0, which takes the single-user fallback), and a Q8_0 12B is ~13 GB, which doesn't fit the 4070 Ti's 12 GB at a useful context.
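For a rough sense of which quant types could fit, here is a back-of-the-envelope sketch of weight size per ggml block format. It assumes a nominal 12e9 parameters and that every tensor uses the same quant type, which real GGUF files (including Q4_K_M mixes) do not. It also ignores KV cache and activations, so treat the numbers as lower bounds. `weight_gb` is a hypothetical helper, not something in the repo.

```python
# Bits per weight for ggml block formats: block bytes * 8 / weights per block.
BITS_PER_WEIGHT = {
    "Q4_0": 18 * 8 / 32,    # 4.5
    "Q4_K": 144 * 8 / 256,  # 4.5
    "Q5_K": 176 * 8 / 256,  # 5.5
    "Q6_K": 210 * 8 / 256,  # 6.5625
    "Q8_0": 34 * 8 / 32,    # 8.5
}


def weight_gb(params: float, quant: str) -> float:
    """Approximate weight size in GB (1e9 bytes).

    Assumption: every tensor is stored in `quant`; real files mix types.
    """
    return params * BITS_PER_WEIGHT[quant] / 8 / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: {weight_gb(12e9, quant):.2f} GB")
```

Under these assumptions a uniform Q8_0 12B comes out at 12.75 GB, consistent with the ~13 GB figure above, while a uniform Q4_K one is about 6.75 GB before context.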

Goal

When a GEMM-N-batchable (Q4_K / Q5_K / Q6_K / Q8_0) 12B that fits in VRAM becomes available (e.g. a Q4_K_M 12B on the 12 GB card, or any batchable 12B on a larger-VRAM box):

  1. Add a real-12B Gemma4CudaBatchForwardMultiTests-style oracle that exercises the global k_eq_v layers end-to-end, mirroring the synthetic Gemma4CudaKEqVBatchedDecodeTests. This closes the synthetic-vs-real gap from #276.
  2. Bench the Gemma 4 batched decode on a Q4_K 12B so the decode-MMQ tile (N≥5, rows≥2048, cols%256) actually engages, and confirm it beats WS at higher N. The #275 routing already dispatches it; this validates that the win exists for Gemma 4, not just the dense path.
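To make the engagement conditions concrete, here is an illustrative predicate built only from the thresholds quoted in this issue. It is not the actual BatchDecodeMatMul dispatch logic. The function names are hypothetical, `cols%256` is read as "cols divisible by 256", and decode-MMQ is assumed to apply to Q4_K only.

```python
# Quant types this issue lists as GEMM-N-batchable; Q4_0 is not among them
# and takes the single-user fallback.
GEMM_N_BATCHABLE = {"Q4_K", "Q5_K", "Q6_K", "Q8_0"}


def is_gemm_n_batchable(quant: str) -> bool:
    """True if the quant type is one the issue names as GEMM-N-batchable."""
    return quant in GEMM_N_BATCHABLE


def decode_mmq_engages(quant: str, n: int, rows: int, cols: int) -> bool:
    """Hypothetical check for the decode-MMQ tile conditions.

    Assumptions: Q4_K only, N >= 5, rows >= 2048, cols divisible by 256.
    """
    return quant == "Q4_K" and n >= 5 and rows >= 2048 and cols % 256 == 0


if __name__ == "__main__":
    # Sweep batch sizes for one example weight shape to see where MMQ kicks in.
    for n in (1, 2, 4, 5, 8, 16):
        print(n, decode_mmq_engages("Q4_K", n, rows=4096, cols=4096))
```

A bench for item 2 would want to sweep N across the N=5 boundary on shapes that pass this check, so that the WS and decode-MMQ numbers are compared on the same matmuls.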

Secondary (lower priority, from #275)

  • The PLE pre-pass + injection matmuls (_gpuInpGate / _gpuPleProj / per_layer_model_proj) deliberately stay on cuBLAS GpuMatMulBatched: their pleWidth shapes weren't a decode-matvec win, and the issue scoped them out. If a Gemma 4 decode profile shows PLE GEMM is a non-trivial decode-time cost, evaluate routing them through BatchDecodeMatMul too, but bench before shipping. cuBLAS GEMM is compute-bound for small-N decode (same reasoning as #275), but pleWidth is narrow, so the win may not materialize.

Refs: #275 (PR #280), #276 (PR #281), #195, #201/#206 (decode-MMQ), #190 (umbrella).

Labels: enhancement