MTP batched-verify (BatchForward2) crashes when SnapKV eviction is active: _kvCache.Length != startPos #130

Description

@pekkah

Symptom

When SnapKV prefill eviction is active on an MTP model (e.g. Qwen3.6-27B-MTP), the first decode iteration throws:

BatchForward2: _kvCache.Length=128 != startPos=<promptLen>. Caches must be at startPos before the batched verify call.

(128 is the SnapKV budget; <promptLen> is the original, pre-eviction prompt length.)

Root cause

SnapKV eviction is a prefill-only compaction. After it runs, PagedKvCache deliberately splits its two length notions (see PagedKvCache.cs):

  • _length (physical) shrinks to the kept-slot count K (= budget, e.g. 128) — the slot index of the next append.
  • _logicalLength stays at the original prompt length N — the absolute position the next decode token sits at, used for RoPE so cached (already-RoPE'd) keys and the incoming query share a reference frame.
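The split between the two lengths can be illustrated with a toy model. This is a minimal sketch, not the real PagedKvCache: the class LengthSplitCache, its methods, and the "keep the last K positions" policy are all hypothetical stand-ins (real SnapKV selects keys by attention score). Only the Length / LogicalLength names and their meanings come from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical toy model of the two length notions; not the real PagedKvCache.
public sealed class LengthSplitCache
{
    // Each physical slot remembers the absolute position its key was RoPE'd at.
    private readonly List<int> _slots = new List<int>();
    private int _logicalLength;

    public int Length => _slots.Count;           // physical: slot index of the next append
    public int LogicalLength => _logicalLength;  // logical: position of the next decode token

    public void Append()
    {
        _slots.Add(_logicalLength);
        _logicalLength++;
    }

    // Prefill-only compaction: drop slots, leave the logical length untouched.
    public void Evict(IEnumerable<int> keepPositions)
    {
        var keep = new HashSet<int>(keepPositions);
        _slots.RemoveAll(p => !keep.Contains(p));
    }

    public int PositionAt(int slot) => _slots[slot];
}

public static class LengthSplitDemo
{
    public static LengthSplitCache PrefillAndEvict(int promptLen, int budget)
    {
        var cache = new LengthSplitCache();
        for (int i = 0; i < promptLen; i++) cache.Append();
        // Assumption for illustration only: keep the last `budget` positions.
        cache.Evict(Enumerable.Range(promptLen - budget, budget));
        return cache;
    }
}
```

With promptLen = 1000 and budget = 128, the sketch ends prefill with Length == 128 and LogicalLength == 1000. The next Append lands in slot 128 but carries position 1000.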

The single-token Forward decode path already handles this Length != LogicalLength split correctly (covered by CudaHybridGdnSnapKv_LongPrompt_CacheShrinksToBudget_DecodeStaysWellFormed).

BatchForward2 (the MTP N=2 batched-verify primitive in both HybridGdnForwardPass and CudaHybridGdnForwardPass) opens with the precondition if (_kvCache.Length != startPos) throw .... MtpDecoder.DecodeBatched passes startPos = _nextPos, which is the logical position. After eviction, _kvCache.Length == K but startPos == N, so the precondition fails on the very first batched-verify call. Nothing reconciles eviction with batched verify.

Fix (this PR)

Gate batched verify off when the cache has been evicted: detect _kvCache.Length != _kvCache.LogicalLength and fall back to the sequential MTP decode path (which uses Forward and is already eviction-safe). SnapKV and MTP then coexist correctly, just without the N=2 verify speedup while eviction is active. Also add a decode-after-eviction coherence test.
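A rough sketch of the gate, assuming a cache that exposes Length and LogicalLength as described in this issue. The ICacheLengths interface, FixedLengths, DecodePath, and VerifyGate are hypothetical names invented for illustration. The actual change in MtpDecoder may be shaped differently.

```csharp
using System;

// Hypothetical stand-ins; only the Length / LogicalLength names come from the issue.
public interface ICacheLengths
{
    int Length { get; }         // physical slot count
    int LogicalLength { get; }  // absolute position of the next token
}

public sealed class FixedLengths : ICacheLengths
{
    public FixedLengths(int length, int logicalLength)
    {
        Length = length;
        LogicalLength = logicalLength;
    }

    public int Length { get; }
    public int LogicalLength { get; }
}

public enum DecodePath { BatchedVerify, Sequential }

public static class VerifyGate
{
    // Batched verify is only chosen while physical and logical lengths agree.
    public static DecodePath Choose(ICacheLengths cache) =>
        cache.Length == cache.LogicalLength
            ? DecodePath.BatchedVerify
            : DecodePath.Sequential;

    // Mirrors the precondition quoted in the issue, for comparison.
    public static void CheckBatchPrecondition(ICacheLengths cache, int startPos)
    {
        if (cache.Length != startPos)
            throw new InvalidOperationException(
                $"BatchForward2: _kvCache.Length={cache.Length} != startPos={startPos}.");
    }
}
```

In this sketch, an evicted cache (Length 128, LogicalLength 1000) selects the sequential path and never reaches the precondition. An unevicted cache still takes batched verify. Because decode appends advance both lengths together, the two stay unequal after eviction, so the gate remains off for the rest of that sequence.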

Follow-up (not this PR)

Teach BatchForward2 to operate on an evicted cache directly: key the precondition off LogicalLength (the RoPE position), and have the two-token append/attention use physical Length for storage slots and LogicalLength + 0 and LogicalLength + 1 for RoPE, mirroring what single-token Forward already does. That restores the verify speedup under eviction. It requires re-running the MTP byte-parity oracle (MtpDecoder_GreedyParity_LlamaCpp), since MTP floating-point parity is fragile.
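The index bookkeeping for that follow-up might look like the sketch below. TokenPlacement and BatchPlacementSketch are hypothetical names, and the sketch covers only which slot and which RoPE position each of the two tokens would get. It says nothing about the attention kernels or the parity concerns mentioned above.

```csharp
using System;

// Hypothetical: where one verify token is stored and which position it is rotated at.
public readonly struct TokenPlacement
{
    public TokenPlacement(int slot, int ropePosition)
    {
        Slot = slot;
        RopePosition = ropePosition;
    }

    public int Slot { get; }          // physical storage slot
    public int RopePosition { get; }  // absolute position used for RoPE
}

public static class BatchPlacementSketch
{
    public static TokenPlacement[] PlanTwoTokens(int length, int logicalLength, int startPos)
    {
        // Precondition keyed off the logical length instead of the physical one.
        if (logicalLength != startPos)
            throw new InvalidOperationException(
                $"LogicalLength={logicalLength} != startPos={startPos}.");

        return new[]
        {
            new TokenPlacement(length, logicalLength),          // first token: +0
            new TokenPlacement(length + 1, logicalLength + 1),  // second token: +1
        };
    }
}
```

For an evicted cache with Length 128 and LogicalLength 1000, this yields slots 128 and 129 at RoPE positions 1000 and 1001. For an unevicted cache, slot and position coincide, which matches the behavior the current precondition assumes.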
