Skip to content

SnapKV: CUDA hybrid GDN port (#51 follow-up) #58

Description

@pekkah

Follow-up to #51 / PR #57.

The CPU SnapKV foundation is in master once #57 merges. The CUDA hybrid GDN path (CudaHybridGdnForwardPass — used by qwen35moe and qwen35 27B-MTP) is the primary 12 GB VRAM target for SnapKV but isn't covered by v1.

Scope

  1. Last-W per-layer query capture during prefill. CudaHybridGdnForwardPass.Prefill is a sequential token-by-token loop (Forward(tokens[i], …)), so there's no batched Q buffer to harvest. Add a GPU scratch ring [numAttnLayers × W × qDim] that captures _gpuQ per layer per token for the last W tokens. Reuse the trailing slots across the loop — only the last W need to survive into the eviction pass.

  2. Post-prefill scoring. Run per-layer scoring against the GPU K cache (_gpuKCache[layer]). Two options:

    • Pure GPU: dispatch Attention-style kernel that emits softmaxed weights summed across heads/queries into a [promptLen] device float buffer.
    • GPU+CPU split: download captured Q's (small, ~6 MiB total for Qwen3.6-27B-MTP at W=64) and use existing SimdKernels.DotF32 on a host-side K download. Simpler but pays the KV download cost; only worth it if the GPU scoring kernel is non-trivial to fuse.
  3. GPU-side compaction. New NVRTC kernel llm_kv_compact that takes a [K] keep-position list and copies survivors into slots [0..K) within the flat ring. Two passes (K then V) or fused.

  4. Wire up bookkeeping. CudaHybridGdnForwardPass keeps its own PagedKvCache purely for length tracking — switch it to call .Compact(keep) after the GPU compaction. Decode-side: AttnLayer already uses kvPosition + 1 for the kernel seqLen — change to _kvCache.Length + 1 so it sees the post-compaction count.

  5. Compose with feat: Bf16 KV cache for hybrid GDN models (#27) #56 (bf16 KV). The kernel needs a bf16-store variant too (i.e. llm_kv_compact_bf16), or generic byte-stride copy.

Acceptance

  • SHARPI_SNAPKV_BUDGET=2048 on a 16 K prompt: GPU KV cache shrinks ~8× (matching the issue's projection), decode stays well-formed.
  • No decode regression vs no-eviction (prefill-only cost).
  • Existing CudaHybridGdnForwardPass_* tests pass with eviction disabled (default).

Notes

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions