Follow-up to #118 (PR #120). The wave-based >4096 batched-query SDPA (and the ≤4096 shared-scores AttentionBatched) is gated on BatchedAttnEnabled && !snapKvActive in AttnBlockBatched. When SnapKV (issue #58) is active, the prefill falls back to the per-position KV-append + Attention loop, because that loop is also where the trailing-window query vectors are captured into _snapKvQCapture (for the post-prefill eviction scoring).
So a SnapKV-active prefill — fully reachable on any backend once SHARPI_SNAPKV_BUDGET is set and the prompt exceeds budget+window — gets neither the batched-query SDPA nor the #118 wave path; it pays the per-position attention-launch cost the rest of #114-B/#118 removed, and that cost still grows with context.
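The gating described above can be modelled in a few lines. This is only an illustrative sketch in Python, not the project's code: the function names are hypothetical, the 4096 threshold is taken from the text, and it assumes the prefill batch covers the whole prompt.

```python
from typing import Optional

# Query count above which the #118 wave path is used (figure taken from the issue text).
WAVE_THRESHOLD = 4096


def snapkv_active(prompt_len: int, budget: Optional[int], window: int) -> bool:
    """SnapKV engages once a budget is set and the prompt exceeds budget + window."""
    return budget is not None and prompt_len > budget + window


def prefill_attention_path(prompt_len: int, batched_attn_enabled: bool,
                           budget: Optional[int], window: int) -> str:
    """Model of today's gate: BatchedAttnEnabled && !snapKvActive."""
    if batched_attn_enabled and not snapkv_active(prompt_len, budget, window):
        return "wave" if prompt_len > WAVE_THRESHOLD else "batched"
    # SnapKV active (or batching disabled): per-position KV-append + Attention loop.
    return "per-position"


if __name__ == "__main__":
    print(prefill_attention_path(8192, True, None, 64))   # no budget set
    print(prefill_attention_path(8192, True, 1024, 64))   # SnapKV active
```

With a budget set and a long prompt, the model returns the per-position path even though batched attention is enabled, which is the case this issue targets.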
Goal
Let the batched / wave SDPA run with SnapKV active by capturing Q in the batched path.
Approach (sketch)
- The batched projections already produce _gpuBtQ ([N × qDim]) before the SDPA. For the trailing window [wStart, N), copy those rows into _snapKvQCapture in one batched CopyDeviceRegion (or a small strided-copy kernel) keyed by the attention-layer index, instead of the per-position copies in the fallback loop, then run AttentionBatched / AttentionBatchedWave as normal.
- Must stay bit-identical to the per-position path for both the attention output AND the captured Q (the SnapKV eviction oracle + BatchedTrunkGpuFfn_SnapKvActive_BitwiseMatchesSequential_Dense27BMtp must hold).
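The following is a host-side model, in Python, of why a single region copy can replace the per-position copies. It assumes _gpuBtQ is a dense row-major [N × qDim] buffer and that the capture buffer stores window rows contiguously in position order. Neither layout is confirmed here; if the real capture layout is strided, the "small strided-copy kernel" variant would apply instead. Function names are hypothetical.

```python
def capture_per_position(q, n, q_dim, w_start):
    """Reference: one q_dim-sized copy per trailing-window position (fallback loop)."""
    capture = [0.0] * ((n - w_start) * q_dim)
    for pos in range(w_start, n):
        src = pos * q_dim
        dst = (pos - w_start) * q_dim
        capture[dst:dst + q_dim] = q[src:src + q_dim]
    return capture


def capture_batched(q, n, q_dim, w_start):
    """Proposed: one contiguous region copy covering rows [w_start, n)."""
    return list(q[w_start * q_dim:n * q_dim])


if __name__ == "__main__":
    n, q_dim, w_start = 10, 4, 7
    q = [i * 0.5 for i in range(n * q_dim)]  # stand-in for the batched Q projections
    assert capture_batched(q, n, q_dim, w_start) == capture_per_position(q, n, q_dim, w_start)
```

Because both variants only move values and never recompute them, the captured Q stays bit-identical under these layout assumptions; any difference would have to come from the layout or from the projections themselves.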
Tests
- Extend the SnapKV-active parity oracle to assert batched == sequential with BatchedAttnEnabled on (today it passes because both arms take the per-position path; the new code must keep it bit-identical while actually using the batched SDPA).
- A >4096 SnapKV-active case (wave path + Q-capture).
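When sizing the >4096 case, it may help to pick N so that the trailing window straddles a wave boundary. The sketch below is a rough aid, assuming waves are contiguous chunks of at most 4096 query rows (the wave size is an assumption, not taken from the code) and using a hypothetical 64-row window. It also shows a bit-pattern comparison, since plain == treats 0.0 and -0.0 as equal.

```python
import struct

WAVE = 4096  # assumed wave size; the real value lives in the wave SDPA implementation


def wave_ranges(n, wave=WAVE):
    """Split query rows [0, n) into contiguous [start, end) waves."""
    return [(s, min(s + wave, n)) for s in range(0, n, wave)]


def window_rows_per_wave(n, w_start, wave=WAVE):
    """Portion of the trailing window [w_start, n) that falls inside each wave."""
    out = []
    for start, end in wave_ranges(n, wave):
        lo = max(start, w_start)
        if lo < end:
            out.append((lo, end))
    return out


def same_bits(a, b):
    """Bitwise float32 equality of two equal-length float sequences."""
    if len(a) != len(b):
        return False
    fmt = f"<{len(a)}f"
    return struct.pack(fmt, *a) == struct.pack(fmt, *b)


if __name__ == "__main__":
    n, window = 4096 + 40, 64           # hypothetical sizes
    print(window_rows_per_wave(n, n - window))
    print(same_bits([0.0], [-0.0]))     # bit patterns differ even though 0.0 == -0.0
```

With these sizes the window splits into one piece in the first wave and one in the second, so a test built on them exercises Q-capture across a wave boundary rather than inside a single wave.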