Context
PR #280 (#275) routed the Gemma 4 batched decode through BatchDecodeMatMul (WS / decode-MMQ / GEMM-N), and PR #281 (#276) added synthetic coverage for the 12B-global attention_k_eq_v batched path. Both were limited by the lack of a GEMM-N-batchable 12B: the only local 12B is gemma-4-12b-it-qat-q4_0.gguf (Q4_0 → single-user fallback), and a Q8_0 12B is ~13 GB, which doesn't fit the 4070 Ti's 12 GB at a useful context. As a result:

- #275 (perf(cuda): Gemma 4 batched decode uses cuBLAS GEMM, not the dense WS/MMQ decode routing): the routing BatchDecodeMatMul plumbs for big Q4_K shapes at N≥5 is never exercised on a Gemma 4 model.
- #276 (test(cuda): Gemma 4 k_eq_v batched-decode path is uncovered; needs a GEMM-N-batchable 12B fixture): k_eq_v coverage is synthetic-only. The 12B-global CopyDevice(vAll, kAll) batched branch is validated against the single-token oracle on a tiny synthetic all-global F32 fixture (mutation-verified), not on a real 12B. The realistic 12B pairing of k_eq_v-on-global + real-V-on-SWA in one model is not reproduced (the synthetic fixture is all-global; SWA is orthogonal and covered separately by the E4B oracles).

Goal
When a GEMM-N-batchable (Q4_K / Q5_K / Q6_K / Q8_0) 12B that fits 12 GB becomes available (e.g. a Q4_K_M 12B, or a larger-VRAM box):

- Add a Gemma4CudaBatchForwardMultiTests-style oracle that exercises the global k_eq_v layers end-to-end (mirroring the synthetic Gemma4CudaKEqVBatchedDecodeTests), closing the #276 synthetic-vs-real gap.

Secondary (lower priority, from #275)
- The PLE pre-pass + injection matmuls (_gpuInpGate / _gpuPleProj / per_layer_model_proj) deliberately stay on cuBLAS GpuMatMulBatched (their pleWidth shapes weren't a decode-matvec win and the issue scoped them out). If a Gemma 4 decode profile shows PLE GEMM is a non-trivial decode-time cost, evaluate routing them through BatchDecodeMatMul too, but bench before shipping: cuBLAS GEMM is compute-bound for small-N decode (same reasoning as #275), but pleWidth is narrow, so the win may not materialize.

Refs: #275 (PR #280), #276 (PR #281), #195, #201/#206 (decode-MMQ), #190 (umbrella).
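The fixture constraint from Context and Goal (GEMM-N-batchable and fits 12 GB) can be sanity-checked with rough arithmetic. The sketch below is an estimate under stated assumptions, not the project's tooling: it uses a nominal 12e9 parameter count, pretends every tensor has the same quant type (real files such as a Q4_K_M mix types), and reserves a hypothetical 2 GB for KV cache and activations. The bits-per-weight values follow from the GGML block layouts; the helper names are illustrative only.

```python
# Bits per weight from the GGML block layouts: bytes_per_block * 8 / weights_per_block.
BITS_PER_WEIGHT = {
    "Q4_0": 18 * 8 / 32,    # 4.5
    "Q8_0": 34 * 8 / 32,    # 8.5
    "Q4_K": 144 * 8 / 256,  # 4.5
    "Q5_K": 176 * 8 / 256,  # 5.5
    "Q6_K": 210 * 8 / 256,  # 6.5625
}

# Quant types the Goal section lists as GEMM-N-batchable.
GEMM_N_BATCHABLE = {"Q4_K", "Q5_K", "Q6_K", "Q8_0"}


def weight_gb(n_params, quant):
    """Approximate weight footprint in GB (1e9 bytes), assuming a uniform quant type."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


def is_candidate(n_params, quant, vram_gb, headroom_gb):
    """True if the quant is batchable and weights plus headroom fit in VRAM."""
    fits = weight_gb(n_params, quant) + headroom_gb <= vram_gb
    return quant in GEMM_N_BATCHABLE and fits


N_PARAMS = 12e9      # assumption: nominal "12B"
HEADROOM_GB = 2.0    # assumption: KV cache + activations at a useful context

candidates = [q for q in BITS_PER_WEIGHT
              if is_candidate(N_PARAMS, q, vram_gb=12.0, headroom_gb=HEADROOM_GB)]
print(weight_gb(N_PARAMS, "Q8_0"))  # 12.75, consistent with the "~13 GB" above
print(candidates)                   # ['Q4_K', 'Q5_K', 'Q6_K']
```

Under these assumptions Q8_0 is batchable but too large for 12 GB, and Q4_0 fits but is not batchable, which matches the situation described in Context.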
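For readers unfamiliar with the oracle pattern referenced for #276, here is a toy CPU sketch of what "batched k_eq_v path validated against the single-token oracle, mutation-verified" means. It is not the repository's test code: all function names are hypothetical, the attention is a plain softmax over tiny lists, and the list copy merely stands in for CopyDevice(vAll, kAll).

```python
import math


def attend(q, keys, values):
    """Single-query softmax attention over one sequence's cached keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    width = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(width)]


def decode_single(q, k_cache):
    """Oracle: one sequence at a time on a k_eq_v layer, so V is simply K."""
    return attend(q, k_cache, k_cache)


def decode_batched(qs, k_caches, copy_v_from_k=True):
    """Batched path: the V buffer is filled by copying K for every sequence.

    copy_v_from_k=False simulates the mutation (copy skipped, V left zeroed).
    """
    k_all = [[row[:] for row in kc] for kc in k_caches]
    if copy_v_from_k:
        v_all = [[row[:] for row in kc] for kc in k_all]
    else:
        v_all = [[[0.0] * len(row) for row in kc] for kc in k_all]
    return [attend(q, k, v) for q, k, v in zip(qs, k_all, v_all)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two batches of vectors."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Two sequences with different cache lengths, head width 2.
qs = [[1.0, 0.5], [-0.25, 2.0]]
k_caches = [
    [[0.1, 0.2], [0.3, -0.4], [1.0, 1.0]],
    [[0.5, 0.5], [-1.0, 0.25]],
]

oracle = [decode_single(q, kc) for q, kc in zip(qs, k_caches)]
batched = decode_batched(qs, k_caches)
mutated = decode_batched(qs, k_caches, copy_v_from_k=False)

assert max_abs_diff(oracle, batched) < 1e-12  # batched path matches the oracle
assert max_abs_diff(oracle, mutated) > 1e-3   # the mutation is detected
```

The second assertion is the point of mutation verification: if removing the K-to-V copy did not break the comparison, the test would not actually be covering that branch. A real-12B version would additionally mix k_eq_v global layers with real-V SWA layers in one model, which this toy does not attempt.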