perf(cuda): batched-query SDPA + Q-capture for SnapKV-active GDN-hybrid prefill (#118 follow-up)

Follow-up to **#118** (PR #120). The wave-based >4096 batched-query SDPA (and the ≤4096 shared-scores `AttentionBatched`) is gated on `BatchedAttnEnabled && !snapKvActive` in `AttnBlockBatched`. When SnapKV (issue #58) is active, the prefill falls back to the **per-position** KV-append + `Attention` loop, because that loop is also where the trailing-window query vectors are captured into `_snapKvQCapture` (for the post-prefill eviction scoring).

So a SnapKV-active prefill — fully reachable on any backend once `SHARPI_SNAPKV_BUDGET` is set and the prompt exceeds budget+window — gets neither the batched-query SDPA nor the #118 wave path; it pays the per-position attention-launch cost the rest of #114-B/#118 removed, and that cost still grows with context.

## Goal
Let the batched / wave SDPA run with SnapKV active by capturing Q in the batched path.

## Approach (sketch)
- The batched projections already produce `_gpuBtQ` ([N × qDim]) before the SDPA. For the trailing window `[wStart, N)`, copy those rows into `_snapKvQCapture` in one batched `CopyDeviceRegion` (or a small strided-copy kernel) keyed by the attention-layer index — instead of the per-position copies in the fallback loop — then run `AttentionBatched` / `AttentionBatchedWave` as normal.
- Must stay bit-identical to the per-position path for both the attention output AND the captured Q (the SnapKV eviction oracle + `BatchedTrunkGpuFfn_SnapKvActive_BitwiseMatchesSequential_Dense27BMtp` must hold).
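
A minimal host-side sketch of why a single batched copy can match the per-position copies, modelling the device buffers as flat arrays. It assumes `_gpuBtQ` is row-major `[N × qDim]` and that the capture buffer stores the window rows densely starting at offset 0; both layouts, and every function name below, are illustrative assumptions rather than the actual code. Under those assumptions the rows `[wStart, N)` are contiguous, so one region copy of `(N - wStart) * qDim` elements moves exactly the same values; a strided-copy kernel would only be needed if the real capture layout differs.

```python
from array import array


def capture_per_position(bt_q, capture, n, q_dim, w_start):
    """Reference arm: one row copy per trailing-window position (models the fallback loop)."""
    for pos in range(w_start, n):
        src = pos * q_dim
        dst = (pos - w_start) * q_dim  # assumed dense capture layout
        capture[dst:dst + q_dim] = bt_q[src:src + q_dim]


def capture_batched(bt_q, capture, n, q_dim, w_start):
    """Proposed arm: a single contiguous copy of rows [w_start, n)."""
    count = (n - w_start) * q_dim
    src = w_start * q_dim
    capture[0:count] = bt_q[src:src + count]


# Tiny example: N = 6 positions, qDim = 4, trailing window of 2 rows.
N, Q_DIM, W_START = 6, 4, 4
bt_q = array("f", [float(i) for i in range(N * Q_DIM)])

cap_seq = array("f", [0.0] * ((N - W_START) * Q_DIM))
cap_bat = array("f", [0.0] * ((N - W_START) * Q_DIM))

capture_per_position(bt_q, cap_seq, N, Q_DIM, W_START)
capture_batched(bt_q, cap_bat, N, Q_DIM, W_START)

# Both arms hold the same captured rows.
same_capture = cap_seq.tobytes() == cap_bat.tobytes()
```

Since both arms are plain copies of the same source rows, the captured Q is bit-identical by construction as long as the copy happens before anything overwrites `_gpuBtQ`.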

## Tests
- Extend the SnapKV-active parity oracle to assert batched == sequential with `BatchedAttnEnabled` on (today it passes because both arms take the per-position path; the new code must keep it bit-identical while actually using the batched SDPA).
- A >4096 SnapKV-active case (wave path + Q-capture).
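
For the parity assertions, a sketch of what "bit-identical" could mean in a test helper: compare the raw bytes of the two buffers instead of using floating-point `==`, which treats `0.0` and `-0.0` as equal and never considers NaN equal to itself. The helper name is hypothetical and fp32 storage is an assumption; the real oracle may compare device buffers in a different element type.

```python
import struct


def bitwise_equal(a, b):
    """True only if both float sequences have identical fp32 bit patterns."""
    if len(a) != len(b):
        return False
    # Pack as little-endian fp32 and compare the bytes (fp32 is an assumption).
    packed_a = struct.pack(f"<{len(a)}f", *a)
    packed_b = struct.pack(f"<{len(b)}f", *b)
    return packed_a == packed_b


sequential = [1.5, -2.25, 0.0]
batched_ok = [1.5, -2.25, 0.0]
batched_bad = [1.5, -2.25, -0.0]  # equal under ==, different bits

ok = bitwise_equal(sequential, batched_ok)
bad = bitwise_equal(sequential, batched_bad)
```

The same comparison would apply to both the attention output and the captured Q, in the ≤4096 and the >4096 wave case.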

perf(cuda): batched-query SDPA + Q-capture for SnapKV-active GDN-hybrid prefill (#118 follow-up) #122
