You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The CPU SnapKV foundation is in master once #57 merges. The CUDA hybrid GDN path (CudaHybridGdnForwardPass — used by qwen35moe and qwen35 27B-MTP) is the primary 12 GB VRAM target for SnapKV but isn't covered by v1.
Scope
Last-W per-layer query capture during prefill.CudaHybridGdnForwardPass.Prefill is a sequential token-by-token loop (Forward(tokens[i], …)), so there's no batched Q buffer to harvest. Add a GPU scratch ring [numAttnLayers × W × qDim] that captures _gpuQ per layer per token for the last W tokens. Reuse the trailing slots across the loop — only the last W need to survive into the eviction pass.
Post-prefill scoring. Run per-layer scoring against the GPU K cache (_gpuKCache[layer]). Two options:
Pure GPU: dispatch Attention-style kernel that emits softmaxed weights summed across heads/queries into a [promptLen] device float buffer.
GPU+CPU split: download captured Q's (small, ~6 MiB total for Qwen3.6-27B-MTP at W=64) and use existing SimdKernels.DotF32 on a host-side K download. Simpler but pays the KV download cost; only worth it if the GPU scoring kernel is non-trivial to fuse.
GPU-side compaction. New NVRTC kernel llm_kv_compact that takes a [K] keep-position list and copies survivors into slots [0..K) within the flat ring. Two passes (K then V) or fused.
Wire up bookkeeping.CudaHybridGdnForwardPass keeps its own PagedKvCache purely for length tracking — switch it to call .Compact(keep) after the GPU compaction. Decode-side: AttnLayer already uses kvPosition + 1 for the kernel seqLen — change to _kvCache.Length + 1 so it sees the post-compaction count.
Follow-up to #51 / PR #57.
The CPU SnapKV foundation is in master once #57 merges. The CUDA hybrid GDN path (
CudaHybridGdnForwardPass— used by qwen35moe and qwen35 27B-MTP) is the primary 12 GB VRAM target for SnapKV but isn't covered by v1.Scope
Last-W per-layer query capture during prefill.
CudaHybridGdnForwardPass.Prefillis a sequential token-by-token loop (Forward(tokens[i], …)), so there's no batched Q buffer to harvest. Add a GPU scratch ring[numAttnLayers × W × qDim]that captures_gpuQper layer per token for the lastWtokens. Reuse the trailing slots across the loop — only the lastWneed to survive into the eviction pass.Post-prefill scoring. Run per-layer scoring against the GPU K cache (
_gpuKCache[layer]). Two options:Attention-style kernel that emits softmaxed weights summed across heads/queries into a[promptLen]device float buffer.SimdKernels.DotF32on a host-side K download. Simpler but pays the KV download cost; only worth it if the GPU scoring kernel is non-trivial to fuse.GPU-side compaction. New NVRTC kernel
llm_kv_compactthat takes a[K]keep-position list and copies survivors into slots[0..K)within the flat ring. Two passes (K then V) or fused.Wire up bookkeeping.
CudaHybridGdnForwardPasskeeps its ownPagedKvCachepurely for length tracking — switch it to call.Compact(keep)after the GPU compaction. Decode-side:AttnLayeralready useskvPosition + 1for the kernelseqLen— change to_kvCache.Length + 1so it sees the post-compaction count.Compose with feat: Bf16 KV cache for hybrid GDN models (#27) #56 (bf16 KV). The kernel needs a bf16-store variant too (i.e.
llm_kv_compact_bf16), or generic byte-stride copy.Acceptance
SHARPI_SNAPKV_BUDGET=2048on a 16 K prompt: GPU KV cache shrinks ~8× (matching the issue's projection), decode stays well-formed.CudaHybridGdnForwardPass_*tests pass with eviction disabled (default).Notes
_kvCache) already supports theLogicalLengthsplit from PR feat: SnapKV prefill KV eviction (#51 v1, CPU) #57.