perf: #207 single-user speculative decoding - dense batched verify + MTP k-token verify on GDN-hybrid (#30) #208
…CUDA path
CudaForwardPass.BatchVerify(tokens, startPos): one packed k-token pass over the owned cache at contiguous positions [P, P+k) reusing BatchForwardMulti's trunk (every row bound to the same cache; ragged append-then-attend is exactly packed causal attention), so every trunk matmul routes through BatchDecodeMatMul and the #194 WS kernels (or opt-in #201 decode MMQ) amortize weight HBM k x. SupportsBatchVerify gates on the dense batching-capable config + uncompacted cache; rollback stays TruncateTo(P + accepted).
SpeculativeDecoder: replace the `is ForwardPass` CPU type-check with the IForwardPass.SupportsBatchVerify capability check (promoted to the interface with a default-throw BatchVerify; CPU ForwardPass opts in, TQ/gemma4 excluded). Kill-switch SHARPI_SPEC_BATCH_VERIFY=0 -> sequential fallback. Adds draft/verify/commit phase timing for bench reporting.
CLI: --draft-model now supported with full CUDA offload of a dense model (draft loads on its OWN CudaBackend - graph state is per backend); both spec runners generalized to IForwardPass and prefill via Prefill instead of the per-token Forward loop. download-model.ps1 gains qwen3-0.6b (draft for the Qwen3-8B bench pair).
Tests: pass-level BatchVerify-vs-sequential parity at k=4/6 (argmax + top-5 + maxAbs, SnapKV pinned off), rollback TruncateTo+commit oracle, e2e 48-token greedy parity (CUDA 8B target + CPU 0.6B draft, exact match), and scripted-mock routing/kill-switch/rejection coverage. CudaBatchForwardMultiTests + ContinuousBatchingTests stay green.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
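The capability check and kill-switch routing described in this commit can be sketched roughly as follows. This is a minimal Python illustration, not the repository's C# code: `ToyPass`, `verify`, and the toy "logit" are hypothetical stand-ins. Only the `SHARPI_SPEC_BATCH_VERIFY` variable name and the batched/sequential split come from the commit message.

```python
import os


class ToyPass:
    """Hypothetical stand-in for a forward pass that owns a KV cache."""

    def __init__(self, supports_batch_verify):
        self.supports_batch_verify = supports_batch_verify
        self.cache = []  # one entry per cached position

    def forward(self, token, pos):
        # Single-token decode: append at `pos`, return a toy "logit".
        assert pos == len(self.cache), "positions must stay contiguous"
        self.cache.append(token)
        return sum(self.cache) % 97

    def batch_verify(self, tokens, start_pos):
        # Packed k-token pass over [start_pos, start_pos + k). The toy simply
        # loops; a real kernel would share one weight read across the rows.
        if not self.supports_batch_verify:
            raise NotImplementedError("BatchVerify not supported by this pass")
        return [self.forward(t, start_pos + i) for i, t in enumerate(tokens)]

    def truncate_to(self, length):
        # Rollback: drop every cached position at or beyond `length`.
        del self.cache[length:]


def verify(target, tokens, start_pos, env=None):
    """Use the batched path when the pass is capable and the env allows it."""
    env = os.environ if env is None else env
    enabled = env.get("SHARPI_SPEC_BATCH_VERIFY", "1") != "0"
    if enabled and target.supports_batch_verify:
        return "batched", target.batch_verify(tokens, start_pos)
    rows = [target.forward(t, start_pos + i) for i, t in enumerate(tokens)]
    return "sequential", rows
```

Both routes return the same per-position outputs in this toy, which mirrors the parity the pass-level tests check. Only the cost model differs in the real kernels.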
… spec VRAM sizing
SpeculativeDecoder.Step now packs the CERTAIN next token (argmax of saved target logits) with k-1 draft proposals into ONE batched target pass (the llama.cpp formulation): the batch yields both the verification logits and the next step's saved logits, so the separate 13.2 ms correction-commit Forward disappears - the target runs exactly one batched pass per step. Saved draft logits are no longer decoder state (Initialize keeps the parameter for call-site compat).
CudaForwardPass: SupportsBatchVerify gets its own gate - a CONFIGURED SnapKV budget no longer disables verify, only an actual prefill-time eviction does (_kvEvictedCount > 0, the #130 GDN pattern); ThrowIfBatchingUnsupported gains a decodeOnly mode for BatchForwardMulti (decode never evicts; per-seq caches are only obtainable via CreateCache, which still rejects budgets).
CLI: cap the CUDA draft's KV ring at min(target.MaxSeqLen, 4096) unless -c is explicit - sizing it from post-target free VRAM oversubscribed the 12 GB card (34K-ctx/7GB ring: decode 75->13 t/s; even a target-matched 12K ring paged the draft weights every step: 34 t/s). Generation is bounded by BOTH windows.
Qwen3-8B Q4_K_M + Qwen3-0.6B Q8_0 draft, 4070 Ti, 200-token greedy decode (baseline 74.8 t/s): k=3 96.1, k=4 99.3 (1.33x), k=6 86.1, k=8 77.2; SHARPI_BATCH_DECODE_MMQ=1: k=5 90.4, k=6 96.2, k=8 92.1. Verify cost at k=4 is ~19.5 ms/step (matches the #201 N=4 packed-step reference). Spec output text identical to non-spec baseline.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
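The folded step can be illustrated with a small Python sketch under toy assumptions: `next_token` stands in for the target's greedy argmax, and `ToyTarget`, `make_draft`, and `speculative_generate` are hypothetical names, not the project's API. It shows the certain token riding in the verify batch, the longest matching draft prefix being accepted, and the batch output at the first mismatch becoming the next step's certain token, so no separate commit forward is needed.

```python
VOCAB = 50


def next_token(prefix):
    """Toy stand-in for the target's greedy argmax over a prefix."""
    return (sum(prefix) * 31 + len(prefix) * 7) % VOCAB


class ToyTarget:
    def __init__(self, prompt):
        self.cache = list(prompt)
        self.batched_passes = 0

    def batch_verify(self, tokens, start_pos):
        # preds[i] is the target's next token after cache + tokens[:i + 1].
        assert start_pos == len(self.cache)
        self.batched_passes += 1
        preds = []
        for tok in tokens:
            self.cache.append(tok)
            preds.append(next_token(self.cache))
        return preds

    def truncate_to(self, length):
        del self.cache[length:]


def greedy_reference(prompt, n):
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def make_draft(error_every):
    """Draft that mirrors the target but corrupts every Nth proposal (0 = never)."""
    calls = {"n": 0}

    def draft(history, count):
        seq, out = list(history), []
        for _ in range(count):
            calls["n"] += 1
            tok = next_token(seq)
            if error_every and calls["n"] % error_every == 0:
                tok = (tok + 1) % VOCAB  # deliberately wrong proposal
            out.append(tok)
            seq.append(tok)
        return out

    return draft


def speculative_generate(prompt, n, k, draft):
    target = ToyTarget(prompt)
    certain = next_token(target.cache)  # argmax of saved logits after prefill
    out = []
    while len(out) < n:
        pos = len(target.cache)
        proposals = draft(target.cache + [certain], k - 1)
        preds = target.batch_verify([certain] + proposals, pos)  # ONE pass
        accepted = 0
        while accepted < len(proposals) and proposals[accepted] == preds[accepted]:
            accepted += 1
        out.extend([certain] + proposals[:accepted])
        target.truncate_to(pos + 1 + accepted)  # roll back rejected rows
        certain = preds[accepted]  # next step's saved argmax, no commit forward
    return out[:n], target.batched_passes
```

In this toy the emitted sequence matches plain greedy decoding whatever the draft proposes. Draft quality only changes how many batched passes are needed.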
…coding
PromptLookupDraft: model-free draft source (llama.cpp lookup-decoding analog)
that proposes continuation tokens by matching the tail n-gram (max..min, most
recent occurrence wins, min=2 to suppress junk 1-gram fires) of prompt +
generated history. Zero forward-pass cost; no match -> no proposals -> the
step degrades to a plain single-token decode, so the floor is ~baseline.
SpeculativeDecoder gains a lookup mode (second ctor + prompt-token Initialize
overload): the draft phase consults the matcher instead of running k-1 draft
forwards, and there is no draft cache to truncate/sync. Verify, accept, and
rollback are unchanged - emitted tokens are the target greedy chain regardless
of proposal quality.
CLI: --draft-lookup (mutually exclusive with --draft-model), same CPU / full-
CUDA-offload targets as model-draft speculation.
Qwen3-8B Q4_K_M, 4070 Ti, greedy (baseline 74.9 t/s): echo-heavy prompt
("repeat this text three times") k=4/8/16 -> 170.5 / 182.2 / 148.7 t/s
(2.4x at k=8, draft 0 ms); adversarial no-copy prompt k=4/8 -> 76.9 / 71.8
t/s (8-9% acceptance, floor holds within ~4% of baseline).
Tests: PromptLookupDraft matcher units (longest-n-gram preference, recency,
caps, reset), scripted lookup-mode decoder trace (proposal accept + floor
path, no target Forwards), CUDA e2e 48-token greedy parity with lookup
drafting on Qwen3-8B.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
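A minimal Python sketch of the tail n-gram matcher described above. The function name `lookup_proposals` and the default `max_n=4` are assumptions for illustration. `min_n=2`, longest-n-gram preference, and "most recent occurrence wins" follow the commit message.

```python
def lookup_proposals(history, k, max_n=4, min_n=2):
    """Propose up to k tokens by matching the tail n-gram of `history`.

    `history` is prompt + generated tokens. Longer n-grams are tried first,
    and for each length the most recent earlier occurrence wins.
    """
    for n in range(max_n, min_n - 1, -1):
        if len(history) <= n:
            continue  # no room for an earlier occurrence plus a continuation
        tail = history[-n:]
        # Scan right to left; the upper bound keeps at least one token
        # available after the match and excludes the tail itself.
        for start in range(len(history) - n - 1, -1, -1):
            if history[start:start + n] == tail:
                follow = start + n
                return history[follow:follow + k]
    return []  # no match: the step degrades to a plain single-token decode
```

An empty result corresponds to the "no proposals" floor path, where the verify batch carries only the certain token.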
Multi-GB nsys-rep/sqlite dumps from kernel profiling sessions; far over GitHub's 100 MB file limit and regenerable from the bench scripts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rid path (#30)
Realizes issue #25's >=1.3x criterion: Qwen3.6-27B-MTP Q4_K_M CUDA-hybrid decode 6.2 -> 10.4 t/s (1.68x) at default settings (k=2 verify batch, 90% draft acceptance). Sweep: draftN=1/2/3 -> 9.7/8.5/9.2 t/s with a 3-slot ring, 10.4 at the shipped 1-slot default. 35B-A3B-MTP (MoE) stays at parity (23.9 vs 24.4 warm) - per-token MoE routing can't share verify weight reads; grouped-by-expert verify batching is the follow-up there.
Mechanics (the dense #207 BatchVerify seam, generalized to GDN):
- MtpDecoder.DecodeBatched is now the folded k-token loop: the certain token plus a chained MTP draft sequence (new MtpLastHidden self-chaining) run through ONE IForwardPass.BatchVerify per step; on rejection the correction rides into the NEXT step's batch - no per-step commit forward (the #207 lesson). MTP KV entries for accepted drafts are rewritten with trunk hiddens (new HiddenAt) to keep the legacy trunk-quality contract. Per-step SupportsBatchVerify re-check preserves the #130 SnapKV gate; SHARPI_DISABLE_BATCH_VERIFY still forces the sequential N=1 path. Draft/Verify/Commit phase ms exposed for WDDM-paging diagnosis.
- CudaHybridGdnForwardPass.BatchVerify reuses the #111/#114-B batched-prefill trunk (GEMM-batched projections, batched contiguous-position attention) with a per-position delta-net recurrence so a DEVICE-side GDN snapshot ring captures every token boundary; [k x vocab] all-position lm_head; pair-batched MatVec2In on the CPU mmap FFN layers (odd-k tail runs as a duplicated-input pair so per-token bits are k-parity-independent).
- The ring (slots = SHARPI_MTP_BATCH_MAX-1, default 2 -> 1 slot x 149 MiB on 27B) is reserved BEFORE TryUploadDenseFfnLayers fills VRAM and skipped under SHARPI_DISABLE_MTP=1; each extra slot displaces ~2 GPU FFN layers (~0.35 t/s), which is why k=2 ships as the default.
- Verify-path batched matmuls are latched onto the temp-free matvec re-stream: the #162 MMQ/dequant-GEMM prefill machinery allocates per-call fp16 temps (71 MB per Q6_K FFN layer, ~600 MB for the lm_head) that only amortize at prefill N and land in WDDM-paged VRAM at decode-k (measured 6.0 -> 9.2 t/s from this latch alone).
- Fixes a latent bug: BatchForward2's GPU-GDN reject path snapshotted the STALE host _gdnStateCache (live state is the device tensors), so a rejected draft never actually rewound GPU GDN state. The device ring now serves slot 0 for BatchForward2 too; the new CUDA rollback oracle test fails without the fix.
- HybridGdnForwardPass (CPU) gets the same BatchVerify with a lazily grown host ring; the MtpDecoder_GreedyParity_LlamaCpp byte oracle passes over the new path (also exercised at k=4 during bring-up).
- Engine/CLI: the SpecDraftNMax>1 rejection is lifted; --spec-draft-n-max / SHARPI_MTP_DRAFT_N select the chain depth, clamped per step to the ring capacity (MaxBatchVerifyTokens) with an actionable CLI message.
Next unlock (measured headroom): the GPU verify cost scales linearly with k on the matvec re-stream while the CPU FFN only pair-amortizes - a 4-input dense MatVec (est. ~12+ t/s at k=4) or #194-style weight-stationary batched decode kernels for the hybrid trunk.
Tests: MtpDecoderBatchVerifyTests (scripted folded-loop contracts), CudaMtpBatchVerifyTests (per-position parity vs sequential, device-ring rollback oracle, chained-acceptance e2e); HybridGdnForwardPassTests, CudaHybridGdnForwardPassTests, CudaHybridGdnSnapKvTests, CudaSpecBatchVerifyTests, SpeculativeDecoderTests, GdnStateCache suites all green.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
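The snapshot-ring rollback for recurrent (GDN) state can be sketched in Python as below. This is an illustration under assumptions, not the engine's code: `step` is a toy recurrence, and `SnapshotRing`, `batch_verify`, and `rollback` are hypothetical names. The sketch assumes the state after the last batch token is the live state, so a k-token batch needs k-1 snapshots, which is consistent with "slots = SHARPI_MTP_BATCH_MAX-1" above.

```python
def step(state, token):
    """Toy per-position recurrence standing in for the delta-net update."""
    return (state * 31 + token) % 1_000_003


class SnapshotRing:
    def __init__(self, batch_max):
        self.capacity = batch_max - 1  # the last boundary is the live state
        self.slots = [None] * self.capacity


def clamp_draft_n(draft_n, ring):
    """Clamp the chain depth so certain token + drafts fit the ring."""
    return min(draft_n, ring.capacity)


def batch_verify(state, tokens, ring):
    """Advance the recurrence per position, snapshotting each token boundary."""
    if len(tokens) - 1 > ring.capacity:
        raise ValueError("verify batch exceeds snapshot-ring capacity")
    for i, tok in enumerate(tokens):
        state = step(state, tok)
        if i < len(tokens) - 1:
            ring.slots[i] = state  # state after tokens[0..i]
    return state  # live state after the whole batch


def rollback(live_state, tokens_kept, batch_len, ring):
    """Return the state after the first `tokens_kept` tokens (>= 1)."""
    if tokens_kept == batch_len:
        return live_state  # everything accepted, nothing to rewind
    return ring.slots[tokens_kept - 1]
```

Unlike a KV cache, a recurrent state cannot be rewound by truncation, which is why each token boundary needs its own snapshot in this sketch.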
Code Review
This pull request introduces high-performance speculative decoding enhancements, including a model-free prompt-lookup (n-gram) drafting mechanism (PromptLookupDraft) and a generalized, folded k-token batched verification path (BatchVerify) across CPU and CUDA forward passes. It also adds a device-side GDN snapshot ring to CudaHybridGdnForwardPass and a host-side ring to HybridGdnForwardPass to support state rollback during multi-token MTP drafting. Feedback on the changes suggests using the more idiomatic and optimized List.CopyTo method instead of a manual loop for copying tokens in PromptLookupDraft.cs.
var proposal = new int[count];
for (int t = 0; t < count; t++) proposal[t] = h[start + t];
…dense SnapKV verify gate, matvec small-N threshold, CLI spec window guards
- MtpDecoder: an accepted STOP draft now clamps acceptance before rollback so trunk KV / GDN state / MTP KV / hidden history all end exactly at _nextPos (previously stranded at P+1+a past the stop); one-time notice when draftN exceeds the snapshot-ring capacity (the silent-clamp server case); greedy selection delegates to Sampler.Greedy (3 copies -> 1).
- ForwardPass (CPU dense): SupportsBatchVerify gates on Length==LogicalLength - a SnapKV-compacted cache routed BatchVerify's TruncateTo(startPos) past the compacted slots (the #130 gate the CUDA/GDN passes already had).
- CudaHybridGdnForwardPass: the _matVecBatchedOnly latch (leak window between _faulted=false and latch-clear) is replaced by a MatMulComputeBatchMinN=8 threshold inside GpuMatMulBatched - the dequant-GEMM/MMQ temps only amortize at prefill N; host _batchSnapshotBuf now allocated only for the SHARPI_CPU_GDN=1 trunk (~150 MB host saved per MTP load); _bvCap reset before reallocs.
- HybridGdnForwardPass: null-after-free in the scratch/ring growers (OOM-path double-free); SHARPI_MTP_BATCH_MAX semantics live in one shared GdnStateCache.ResolveMtpBatchMax.
- CLI: prompts that exceed the (possibly 4096-capped) draft KV ring now fail fast with an actionable error before prefill writes out of range; a --spec-draft-n-max beyond ring capacity warns and clamps instead of disabling MTP outright; maxNew can no longer go negative silently.
- PromptLookupDraft: List.CopyTo for the proposal copy (PR #208 review).
- IForwardPass: SupportsBatchVerify doc no longer routes readers to the retired BatchForward2 dispatch.
- README: 27B-MTP rows updated to the #30/#207 results (10.4 t/s, 1.68x).
Tests: 45 scripted + 26 model-gated green incl. MtpDecoder_GreedyParity_LlamaCpp.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
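The CLI window guards and the STOP clamp can be approximated with the Python sketch below. `plan_spec_windows` and `clamp_acceptance_at_stop` are hypothetical helpers, and treating the stop token itself as the last accepted token is an assumption. The 4096 cap, the fail-fast on oversized prompts, and the non-negative maxNew come from the commit messages.

```python
def plan_spec_windows(target_max_seq_len, prompt_len, requested_new,
                      explicit_ctx=None, draft_cap=4096):
    """Size the draft KV ring and bound generation by BOTH windows."""
    if explicit_ctx is not None:
        draft_ring = explicit_ctx  # an explicit -c overrides the cap
    else:
        draft_ring = min(target_max_seq_len, draft_cap)
    window = min(target_max_seq_len, draft_ring)
    if prompt_len > window:
        # Fail fast before prefill would write out of range.
        raise ValueError(
            f"prompt ({prompt_len} tokens) exceeds the speculation window "
            f"({window}); shorten the prompt or pass a larger context size")
    max_new = max(0, min(requested_new, window - prompt_len))
    return draft_ring, max_new


def clamp_acceptance_at_stop(proposals, accepted, stop_ids):
    """Cut acceptance at the first accepted stop token (kept as the last one)."""
    for i in range(accepted):
        if proposals[i] in stop_ids:
            return i + 1
    return accepted
```

Clamping before rollback means every piece of per-position state is truncated to the same final position, rather than some of it ending past the stop token.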
Single-user speculative decoding, end to end (#207): batched k-token verify on the dense CUDA path with a draft model or prompt-lookup drafting (goals 1–3), and the same BatchVerify seam realized on the GDN-hybrid path for MTP self-speculation (goal 4, issue #30), which finally lands issue #25's ≥1.3× MTP decode criterion.
Results (RTX 4070 Ti)
Dense path (Qwen3-8B Q4_K_M target, 200-token decode; baseline 74.8 t/s):
--draft-lookup, echo-heavy, k=8
GDN-hybrid MTP path (Qwen3.6-27B-MTP Q4_K_M, 80-token decode; baseline 6.2 t/s):
--spec-draft-n-max 3 (k=4)
Qwen3.6-35B-A3B-MTP (MoE): 23.9 vs 24.4 warm; parity by design (per-token expert routing can't share verify weight reads; grouped-by-expert verify is a follow-up).
What's in here
- IForwardPass.BatchVerify(tokens, startPos) seam + SupportsBatchVerify capability (re-checked per step; preserves the #130 SnapKV eviction gate), now with MaxBatchVerifyTokens, HiddenAt, and MtpLastHidden for the GDN side.
- CudaForwardPass.BatchVerify (dense): packed k-token pass = BatchForwardMulti rows on the owned cache at contiguous positions; exact greedy parity in practice (WS kernels keep per-token reduction chains).
- SpeculativeDecoder folded step: the certain token rides in the verify batch; the batch yields both verification logits and the next step's saved logits, so there are zero separate commit forwards (the separate commit was eating the entire win: 72→99 t/s). Draft KV capped to min(target.MaxSeqLen, 4096) to avoid WDDM paging.
- PromptLookupDraft (--draft-lookup): zero draft-forward cost, ~baseline floor on no-copy prompts.
- GDN-hybrid BatchVerify (both passes): #111/#114-B batched trunk + per-position delta-net recurrence with a per-token-boundary GDN snapshot ring (device-side on the GPU trunk, reserved before the dense-FFN VRAM fill; SHARPI_MTP_BATCH_MAX); pair-batched MatVec2In CPU FFN; [k×vocab] all-position lm_head; MTP hidden-history threading.
- MtpDecoder folded k-token loop with chained NEXTN drafting, MTP-KV trunk-quality refresh, per-phase Draft/Verify/Commit ms; --spec-draft-n-max restriction lifted in engine + CLI.
- Bug fix: BatchForward2's GPU-GDN reject path snapshotted the stale host _gdnStateCache while the live state sits in the device tensors, so a rejected MTP draft never actually rewound GPU GDN state. The device ring now serves that path; the new CUDA rollback-oracle test fails without the fix.
Correctness
- MtpDecoder_GreedyParity_LlamaCpp byte oracle passes over the new batched path (untouched, unweakened).
- New: MtpDecoderBatchVerifyTests (scripted folded-loop contracts), CudaMtpBatchVerifyTests (per-position parity vs sequential, device-ring rollback oracle, chained-acceptance e2e), CudaSpecBatchVerifyTests, extended SpeculativeDecoderTests.
- Still green: HybridGdnForwardPassTests, CudaHybridGdnForwardPassTests, CudaHybridGdnSnapKvTests (#130 both directions), ContinuousBatchingTests, GdnStateCache suites.
Closes #207. Closes #30. Closes #25.
Follow-ups (issues to be filed): 4-input dense CPU MatVec / weight-stationary hybrid-trunk decode kernels (k>2 headroom, est. ~12+ t/s), grouped-by-expert MoE verify batching (35B win), SnapKV × batched-verify coexistence (#130).
🤖 Generated with Claude Code