perf(cuda): batched-query SDPA + Q-capture for SnapKV-active GDN-hybrid prefill (#118 follow-up) #122

Description

@pekkah

Follow-up to #118 (PR #120). The wave-based >4096 batched-query SDPA (and the ≤4096 shared-scores AttentionBatched) are gated on BatchedAttnEnabled && !snapKvActive in AttnBlockBatched. When SnapKV (issue #58) is active, the prefill falls back to the per-position KV-append + Attention loop, because that loop is also where the trailing-window query vectors are captured into _snapKvQCapture for the post-prefill eviction scoring.

So a SnapKV-active prefill (fully reachable on any backend once SHARPI_SNAPKV_BUDGET is set and the prompt exceeds budget+window) gets neither the batched-query SDPA nor the #118 wave path. It pays the per-position attention-launch cost that the rest of #114-B/#118 removed, and that cost still grows with context.
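The gating described above can be modelled in a few lines. This is only an illustrative sketch in Python, not the project's code: `snapkv_active`, `select_path`, and the `capture_in_batched_path` flag are hypothetical names, and the 4096 threshold and the budget+window condition are taken from the text of this issue.

```python
WAVE_THRESHOLD = 4096  # wave path is used above this prompt length (per the issue text)


def snapkv_active(budget, window, n):
    """Hypothetical model: SnapKV engages once a budget is set and the
    prompt length n exceeds budget + window."""
    return budget is not None and n > budget + window


def select_path(batched_attn_enabled, snapkv_is_active, n,
                capture_in_batched_path=False):
    """Return which attention path a prefill of length n would take.

    capture_in_batched_path=False models today's gate
    (BatchedAttnEnabled && !snapKvActive); True models the goal of this
    issue, where Q-capture no longer forces the per-position loop.
    """
    if batched_attn_enabled and (not snapkv_is_active or capture_in_batched_path):
        return "wave" if n > WAVE_THRESHOLD else "batched"
    return "per_position"


if __name__ == "__main__":
    n = 8192
    active = snapkv_active(budget=1024, window=64, n=n)  # illustrative values
    print(select_path(True, active, n))                  # today's gate
    print(select_path(True, active, n, capture_in_batched_path=True))
```

Under this model, a long SnapKV-active prompt is the only case whose path changes when the flag is flipped, which is the regression this issue targets.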

Goal

Let the batched / wave SDPA run with SnapKV active by capturing Q in the batched path.

Approach (sketch)

  • The batched projections already produce _gpuBtQ ([N × qDim]) before the SDPA. For the trailing window [wStart, N), copy those rows into _snapKvQCapture, keyed by the attention-layer index, in one batched CopyDeviceRegion (or a small strided-copy kernel) instead of the per-position copies in the fallback loop. Then run AttentionBatched / AttentionBatchedWave as normal.
  • The result must stay bit-identical to the per-position path for both the attention output and the captured Q: the SnapKV eviction oracle and BatchedTrunkGpuFfn_SnapKvActive_BitwiseMatchesSequential_Dense27BMtp must still hold.
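The copy equivalence behind this approach can be sketched on the host. The Python below is a toy model, not the CUDA implementation: it assumes _gpuBtQ is a dense row-major [N × qDim] buffer, that the capture buffer for a layer holds the window rows packed from offset 0, and that wStart is N minus the window length. All function names are hypothetical.

```python
def capture_per_position(q, n, q_dim, w_start, capture, layer):
    """Model of the fallback loop: one row copy per trailing-window position."""
    for pos in range(w_start, n):
        src = pos * q_dim
        dst = (pos - w_start) * q_dim
        capture[layer][dst:dst + q_dim] = q[src:src + q_dim]


def capture_batched(q, n, q_dim, w_start, capture, layer):
    """Model of the proposed path: one contiguous region copy for the window.

    Valid only because rows are assumed densely packed in both buffers;
    otherwise a strided copy would be needed.
    """
    count = (n - w_start) * q_dim
    capture[layer][0:count] = q[w_start * q_dim:n * q_dim]


def make_capture(num_layers, window, q_dim):
    """Pre-sized capture storage keyed by attention-layer index."""
    return {layer: [0.0] * (window * q_dim) for layer in range(num_layers)}


if __name__ == "__main__":
    n, q_dim, window, layer = 10, 4, 3, 1      # illustrative sizes
    w_start = max(0, n - window)
    q = [float(i) * 0.5 for i in range(n * q_dim)]

    seq = make_capture(2, window, q_dim)
    bat = make_capture(2, window, q_dim)
    capture_per_position(q, n, q_dim, w_start, seq, layer)
    capture_batched(q, n, q_dim, w_start, bat, layer)
    assert seq == bat  # both strategies capture the same rows
```

Since both strategies only move values without recomputing them, the captured Q is the same by construction as long as the batched projections themselves produce the same rows as the per-position projections, which is what the parity oracle has to confirm.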

Tests

  • Extend the SnapKV-active parity oracle to assert batched == sequential with BatchedAttnEnabled on (today it passes because both arms take the per-position path; the new code must keep it bit-identical while actually using the batched SDPA).
  • Add a >4096 SnapKV-active case (wave path + Q-capture).
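For the parity assertions, "bit-identical" is stricter than numeric equality. The following Python sketch shows one way to compare two read-back buffers by bit pattern. It assumes the values are already on the host as Python floats, and `bitwise_equal` is a hypothetical helper rather than anything in the existing test suite.

```python
import struct


def bitwise_equal(a, b):
    """True when both sequences have the same length and every element has
    the same IEEE-754 bit pattern (packed here as little-endian doubles)."""
    if len(a) != len(b):
        return False
    return all(struct.pack("<d", x) == struct.pack("<d", y)
               for x, y in zip(a, b))


if __name__ == "__main__":
    sequential = [0.25, -1.5, 3.0]
    batched = [0.25, -1.5, 3.0]
    assert bitwise_equal(sequential, batched)

    # Numeric equality would accept this pair; a bitwise check does not.
    assert 0.0 == -0.0
    assert not bitwise_equal([0.0], [-0.0])
```

A bit-pattern comparison also treats two NaNs with the same payload as equal, which plain `==` would reject, so it avoids both false passes and false failures in an oracle of this kind.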

Metadata

Labels: perf (Performance optimization opportunity)