Symptom
When SnapKV prefill eviction is active on an MTP model (e.g. Qwen3.6-27B-MTP), the first decode iteration throws:
BatchForward2: _kvCache.Length=128 != startPos=<promptLen>. Caches must be at startPos before the batched verify call.
(128 is the SnapKV budget; <promptLen> is the original, pre-eviction prompt length.)
Root cause
SnapKV eviction is a prefill-only compaction. After it runs, PagedKvCache deliberately splits its two length notions (see PagedKvCache.cs):
_length (physical) shrinks to the kept-slot count K (= budget, e.g. 128) — the slot index of the next append.
_logicalLength stays at the original prompt length N — the absolute position the next decode token sits at, used for RoPE so cached (already-RoPE'd) keys and the incoming query share a reference frame.
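The split can be pictured with a toy model. This is a minimal sketch, not the real PagedKvCache: the class ToyKvCache and its methods Prefill, EvictToBudget, and Append are hypothetical names, and the assumption that an append advances both counters by one is ours.

```csharp
using System;

// Hypothetical toy model of the two length notions; not the real PagedKvCache.
public sealed class ToyKvCache
{
    // Physical slot count: the slot index of the next append.
    public int Length { get; private set; }

    // Absolute position of the next decode token (what RoPE would use).
    public int LogicalLength { get; private set; }

    // True once eviction has made the two lengths diverge.
    public bool IsEvicted => Length != LogicalLength;

    // Prefill writes one slot per prompt token, so both lengths match.
    public void Prefill(int promptLen)
    {
        if (promptLen < 0) throw new ArgumentOutOfRangeException(nameof(promptLen));
        Length = promptLen;
        LogicalLength = promptLen;
    }

    // Prefill-only compaction: keep at most `budget` slots, leave positions untouched.
    public void EvictToBudget(int budget)
    {
        if (budget < 0) throw new ArgumentOutOfRangeException(nameof(budget));
        Length = Math.Min(Length, budget);
    }

    // Assumption: appending one decoded token consumes one slot and one position.
    public void Append()
    {
        Length++;
        LogicalLength++;
    }
}
```

With a 4096-token prompt and a budget of 128, the model ends prefill at Length == 128 and LogicalLength == 4096, and one decode step later at 129 and 4097.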
The single-token Forward decode path already handles this Length != LogicalLength split correctly (covered by CudaHybridGdnSnapKv_LongPrompt_CacheShrinksToBudget_DecodeStaysWellFormed).
BatchForward2 (the MTP two-token batched-verify primitive, i.e. the "N=2" path, in both HybridGdnForwardPass and CudaHybridGdnForwardPass; this N is unrelated to the prompt length N above) opens with a precondition if (_kvCache.Length != startPos) throw .... MtpDecoder.DecodeBatched passes startPos = _nextPos (= logical position). Post-eviction, _kvCache.Length == K but startPos == N (the prompt length), so the precondition fails on the very first batched-verify call. There is no reconciliation between eviction and batched verify.
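To make the failure concrete, here is a small sketch of a guard with the same shape as the quoted precondition. The class and method names are hypothetical, the two parameters stand in for _kvCache.Length and startPos, and the exception type is an assumption (the source only shows "throw ...").

```csharp
using System;

public static class PreconditionSketch
{
    // Same comparison as the quoted guard: physical length must equal startPos.
    // InvalidOperationException is an assumed exception type.
    public static void CheckBatchedVerifyPrecondition(int physicalLength, int startPos)
    {
        if (physicalLength != startPos)
        {
            throw new InvalidOperationException(
                $"BatchForward2: _kvCache.Length={physicalLength} != startPos={startPos}. " +
                "Caches must be at startPos before the batched verify call.");
        }
    }
}
```

Calling CheckBatchedVerifyPrecondition(4096, 4096) passes, which is the no-eviction case. Calling it with (128, 4096), the post-eviction case for a 4096-token prompt and a budget of 128, throws.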
Fix (this PR)
Cleanly gate batched-verify off when the cache has been evicted — detect _kvCache.Length != _kvCache.LogicalLength and fall back to the sequential MTP decode path (which uses Forward, already eviction-safe). SnapKV + MTP then coexist correctly, just without the N=2 verify speedup while eviction is active. Add a decode-after-eviction coherence test.
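The gate itself reduces to one comparison. The following is a minimal sketch of that decision only, assuming the two lengths are readable as integers; DecodePath, MtpGateSketch, and ChoosePath are hypothetical names and do not reflect how MtpDecoder is actually structured.

```csharp
using System;

// Hypothetical labels for the two MTP decode paths described above.
public enum DecodePath
{
    BatchedVerify, // two-token verify via BatchForward2
    Sequential     // per-token decode via Forward (eviction-safe)
}

public static class MtpGateSketch
{
    // Batched verify is chosen only while physical and logical lengths agree;
    // any divergence means the cache was evicted, so fall back to sequential.
    public static DecodePath ChoosePath(int physicalLength, int logicalLength)
    {
        bool evicted = physicalLength != logicalLength;
        return evicted ? DecodePath.Sequential : DecodePath.BatchedVerify;
    }
}
```

Because the lengths stay apart after eviction in this model (both advance together on each append), a check like this would keep selecting the sequential path for the rest of the sequence, which matches the "no verify speedup while eviction is active" trade-off.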
Follow-up (not this PR)
Teach BatchForward2 to operate on an evicted cache directly: change the precondition to key off LogicalLength (the RoPE position), and make the two-token append/attention use physical Length for storage slots and LogicalLength (+0/+1) for RoPE, mirroring what single-token Forward already does. That restores the verify speedup under eviction. It requires re-running the MTP byte-parity oracle (MtpDecoder_GreedyParity_LlamaCpp), since MTP floating-point parity is fragile.
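The index bookkeeping for that follow-up might look like the sketch below. It is not implemented in this PR and covers only slot and RoPE-position assignment, not attention itself; EvictedBatchSketch and PlaceTwoTokens are hypothetical names, and the exception type is an assumption.

```csharp
using System;

public static class EvictedBatchSketch
{
    // Returns (storage slot, RoPE position) for the two verify tokens.
    // The precondition keys off the logical length rather than the physical one.
    public static (int Slot, int RopePosition)[] PlaceTwoTokens(
        int physicalLength, int logicalLength, int startPos)
    {
        if (logicalLength != startPos)
        {
            throw new InvalidOperationException(
                $"LogicalLength={logicalLength} != startPos={startPos}.");
        }

        return new[]
        {
            (Slot: physicalLength,     RopePosition: logicalLength),     // token +0
            (Slot: physicalLength + 1, RopePosition: logicalLength + 1), // token +1
        };
    }
}
```

For a cache evicted to 128 slots from a 4096-token prompt, PlaceTwoTokens(128, 4096, 4096) yields slots 128 and 129 with RoPE positions 4096 and 4097. Without eviction the slot and position coincide, so the same function reduces to the current behaviour.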