SnapKV: CUDA hybrid GDN port (#51 follow-up)

Follow-up to #51 / PR #57.

The CPU SnapKV foundation is in master once #57 merges. The CUDA hybrid GDN path (`CudaHybridGdnForwardPass` — used by qwen35moe and qwen35 27B-MTP) is the primary 12 GB VRAM target for SnapKV but isn't covered by v1.

## Scope

1. **Last-W per-layer query capture during prefill.** `CudaHybridGdnForwardPass.Prefill` is a sequential token-by-token loop (`Forward(tokens[i], …)`), so there's no batched Q buffer to harvest. Add a GPU scratch ring `[numAttnLayers × W × qDim]` that captures `_gpuQ` per layer per token for the last `W` tokens. Reuse the trailing slots across the loop — only the last `W` need to survive into the eviction pass.

2. **Post-prefill scoring.** Run per-layer scoring against the GPU K cache (`_gpuKCache[layer]`). Two options:
   - Pure GPU: dispatch `Attention`-style kernel that emits softmaxed weights summed across heads/queries into a `[promptLen]` device float buffer.
   - GPU+CPU split: download captured Q's (small, ~6 MiB total for Qwen3.6-27B-MTP at W=64) and use existing `SimdKernels.DotF32` on a host-side K download. Simpler but pays the KV download cost; only worth it if the GPU scoring kernel is non-trivial to fuse.

3. **GPU-side compaction.** New NVRTC kernel `llm_kv_compact` that takes a `[K]` keep-position list and copies survivors into slots `[0..K)` within the flat ring. Two passes (K then V) or fused.

4. **Wire up bookkeeping.** `CudaHybridGdnForwardPass` keeps its own `PagedKvCache` purely for length tracking — switch it to call `.Compact(keep)` after the GPU compaction. Decode-side: `AttnLayer` already uses `kvPosition + 1` for the kernel `seqLen` — change to `_kvCache.Length + 1` so it sees the post-compaction count.

5. **Compose with #56 (bf16 KV).** The kernel needs a bf16-store variant too (i.e. `llm_kv_compact_bf16`), or generic byte-stride copy.

## Acceptance

- `SHARPI_SNAPKV_BUDGET=2048` on a 16 K prompt: GPU KV cache shrinks ~8× (matching the issue's projection), decode stays well-formed.
- No decode regression vs no-eviction (prefill-only cost).
- Existing `CudaHybridGdnForwardPass_*` tests pass with eviction disabled (default).

## Notes

- Captures of Q are post-RoPE / post-Q-norm (matches CPU path semantics).
- The bookkeeping cache (`_kvCache`) already supports the `LogicalLength` split from PR #57.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SnapKV: CUDA hybrid GDN port (#51 follow-up) #58

Scope

Acceptance

Notes

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

SnapKV: CUDA hybrid GDN port (#51 follow-up) #58

Description

Scope

Acceptance

Notes

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions