perf: #207 single-user speculative decoding — dense batched verify + MTP k-token verify on GDN-hybrid (#30) by pekkah · Pull Request #208 · pekkah/SharpInference

pekkah · 2026-06-11T08:14:06Z

Single-user speculative decoding, end to end (#207): batched k-token verify on the dense CUDA path with a draft model or prompt-lookup drafting (goals 1–3), and the same BatchVerify seam realized on the GDN-hybrid path for MTP self-speculation (goal 4, issue #30) — which finally lands issue #25's ≥1.3× MTP decode criterion.

Results (RTX 4070 Ti)

Dense path (Qwen3-8B Q4_K_M target, 200-token decode; baseline 74.8 t/s):

| configuration | decode t/s | vs baseline |
| --- | --- | --- |
| Qwen3-0.6B draft, k=4 (default WS) | 99.3 | 1.33× |
| prompt-lookup `--draft-lookup`, echo-heavy, k=8 | 182.2 | 2.43× |
| prompt-lookup, adversarial (no-copy) | 71.8–76.9 | ~1.0× floor |

GDN-hybrid MTP path (Qwen3.6-27B-MTP Q4_K_M, 80-token decode; baseline 6.2 t/s):

| configuration | decode t/s | vs baseline | acceptance |
| --- | --- | --- | --- |
| defaults (k=2 verify batch) | 10.4 | 1.68× | 90% |
| `--spec-draft-n-max 3` (k=4) | 9.2 | 1.48× | 84% |

Qwen3.6-35B-A3B-MTP (MoE): 23.9 vs 24.4 warm — parity by design (per-token expert routing can't share verify weight reads; grouped-by-expert verify is a follow-up).

What's in here

IForwardPass.BatchVerify(tokens, startPos) seam + SupportsBatchVerify capability (re-checked per step, which preserves the #130 SnapKV eviction gate), plus MaxBatchVerifyTokens, HiddenAt, MtpLastHidden for the GDN side.
CudaForwardPass.BatchVerify (dense): packed k-token pass = BatchForwardMulti rows on the owned cache at contiguous positions; exact greedy parity in practice (WS kernels keep per-token reduction chains).
SpeculativeDecoder folded step: the certain token rides in the verify batch; the batch yields both verification logits and the next step's saved logits — zero separate commit forwards (the separate commit was eating the entire win: 72→99 t/s). Draft KV capped to min(target.MaxSeqLen, 4096) to avoid WDDM paging.
Prompt-lookup (n-gram) drafting (--draft-lookup): zero draft-forward cost, ~baseline floor on no-copy prompts.
GDN-hybrid BatchVerify (both passes): #111/#114-B batched trunk + per-position delta-net recurrence with a per-token-boundary GDN snapshot ring (device-side on the GPU trunk, reserved before the dense-FFN VRAM fill; SHARPI_MTP_BATCH_MAX); pair-batched MatVec2In CPU FFN; [k×vocab] all-position lm_head; MTP hidden-history threading.
MtpDecoder folded k-token loop with chained NEXTN drafting, MTP-KV trunk-quality refresh, per-phase Draft/Verify/Commit ms; --spec-draft-n-max restriction lifted in engine + CLI.
Verify-path matmuls latched onto the temp-free matvec re-stream: the #162 MMQ/dequant-GEMM prefill machinery allocates per-call fp16 temps (71 MB/Q6_K FFN layer, ~600 MB lm_head) that land in WDDM-paged VRAM at decode-sized k; this latch alone was 6.0 → 9.2 t/s.
Latent bug fix: BatchForward2's GPU-GDN reject path snapshotted the stale host _gdnStateCache while the live state sits in the device tensors — a rejected MTP draft never actually rewound GPU GDN state. The device ring now serves that path; the new CUDA rollback-oracle test fails without the fix.

Correctness

MtpDecoder_GreedyParity_LlamaCpp byte oracle passes over the new batched path (untouched, unweakened).
Dense spec-decode greedy output exactly equals the non-spec baseline (model-draft and lookup modes).
New: MtpDecoderBatchVerifyTests (scripted folded-loop contracts), CudaMtpBatchVerifyTests (per-position parity vs sequential, device-ring rollback oracle, chained-acceptance e2e), CudaSpecBatchVerifyTests, extended SpeculativeDecoderTests.
Green: HybridGdnForwardPassTests, CudaHybridGdnForwardPassTests, CudaHybridGdnSnapKvTests (#130, both directions), ContinuousBatchingTests, GdnStateCache suites.

Closes #207. Closes #30. Closes #25.

Follow-ups (issues to be filed): 4-input dense CPU MatVec / weight-stationary hybrid-trunk decode kernels (k>2 headroom, est. ~12+ t/s), grouped-by-expert MoE verify batching (35B win), SnapKV × batched-verify coexistence (#130).

🤖 Generated with Claude Code

…CUDA path

CudaForwardPass.BatchVerify(tokens, startPos): one packed k-token pass over the owned cache at contiguous positions [P, P+k) reusing BatchForwardMulti's trunk (every row bound to the same cache; ragged append-then-attend is exactly packed causal attention), so every trunk matmul routes through BatchDecodeMatMul and the #194 WS kernels (or opt-in #201 decode MMQ) amortize weight HBM k x. SupportsBatchVerify gates on the dense batching-capable config + uncompacted cache; rollback stays TruncateTo(P + accepted).

SpeculativeDecoder: replace the `is ForwardPass` CPU type-check with the IForwardPass.SupportsBatchVerify capability check (promoted to the interface with a default-throw BatchVerify; CPU ForwardPass opts in, TQ/gemma4 excluded). Kill-switch SHARPI_SPEC_BATCH_VERIFY=0 -> sequential fallback. Adds draft/verify/commit phase timing for bench reporting.

CLI: --draft-model now supported with full CUDA offload of a dense model (draft loads on its OWN CudaBackend - graph state is per backend); both spec runners generalized to IForwardPass and prefill via Prefill instead of the per-token Forward loop. download-model.ps1 gains qwen3-0.6b (draft for the Qwen3-8B bench pair).

Tests: pass-level BatchVerify-vs-sequential parity at k=4/6 (argmax + top-5 + maxAbs, SnapKV pinned off), rollback TruncateTo+commit oracle, e2e 48-token greedy parity (CUDA 8B target + CPU 0.6B draft, exact match), and scripted-mock routing/kill-switch/rejection coverage. CudaBatchForwardMultiTests + ContinuousBatchingTests stay green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
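The capability-check routing described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `IForwardPassSketch`, `VerifyRouting`, `SequentialOnlyPass`, and `BatchingPass` are hypothetical names, the member signatures are assumed, and treating every value other than "0" as "batching enabled" is an assumption about the kill-switch.

```csharp
using System;

// Hypothetical stand-in for the PR's IForwardPass; member shapes are assumed.
public interface IForwardPassSketch
{
    // Capability flag; passes that cannot batch-verify keep the default.
    bool SupportsBatchVerify => false;

    // Default-throw: only passes that opt in override this.
    float[][] BatchVerify(int[] tokens, int startPos) =>
        throw new NotSupportedException("BatchVerify is not supported by this pass.");

    float[] Forward(int token, int pos);
}

// Toy pass that never opts in to batched verify.
public sealed class SequentialOnlyPass : IForwardPassSketch
{
    public float[] Forward(int token, int pos) => new[] { (float)(token + pos) };
}

// Toy pass that opts in; its batched rows equal the sequential rows.
public sealed class BatchingPass : IForwardPassSketch
{
    public bool SupportsBatchVerify => true;

    public float[] Forward(int token, int pos) => new[] { (float)(token + pos) };

    public float[][] BatchVerify(int[] tokens, int startPos)
    {
        var rows = new float[tokens.Length][];
        for (int i = 0; i < tokens.Length; i++) rows[i] = Forward(tokens[i], startPos + i);
        return rows;
    }
}

public static class VerifyRouting
{
    // envValue is the raw SHARPI_SPEC_BATCH_VERIFY value; in this sketch only
    // "0" disables batching (an assumption about the kill-switch semantics).
    public static bool UseBatchVerify(IForwardPassSketch pass, string envValue) =>
        envValue != "0" && pass.SupportsBatchVerify;

    public static float[][] Verify(IForwardPassSketch pass, int[] tokens, int startPos, string envValue)
    {
        // Re-evaluated on every call, so a pass may withdraw the capability mid-run.
        if (UseBatchVerify(pass, envValue))
            return pass.BatchVerify(tokens, startPos);

        // Sequential fallback: one Forward per token at consecutive positions.
        var rows = new float[tokens.Length][];
        for (int i = 0; i < tokens.Length; i++)
            rows[i] = pass.Forward(tokens[i], startPos + i);
        return rows;
    }
}
```

Routing on a capability flag rather than a concrete type means a new pass only has to override the two members to join the batched path, and a pass whose state changes can drop out again without the decoder knowing why.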

… spec VRAM sizing

SpeculativeDecoder.Step now packs the CERTAIN next token (argmax of saved target logits) with k-1 draft proposals into ONE batched target pass (the llama.cpp formulation): the batch yields both the verification logits and the next step's saved logits, so the separate 13.2 ms correction-commit Forward disappears - the target runs exactly one batched pass per step. Saved draft logits are no longer decoder state (Initialize keeps the parameter for call-site compat).

CudaForwardPass: SupportsBatchVerify gets its own gate - a CONFIGURED SnapKV budget no longer disables verify, only an actual prefill-time eviction does (_kvEvictedCount > 0, the #130 GDN pattern); ThrowIfBatchingUnsupported gains a decodeOnly mode for BatchForwardMulti (decode never evicts; per-seq caches are only obtainable via CreateCache, which still rejects budgets).

CLI: cap the CUDA draft's KV ring at min(target.MaxSeqLen, 4096) unless -c is explicit - sizing it from post-target free VRAM oversubscribed the 12 GB card (34K-ctx/7GB ring: decode 75->13 t/s; even a target-matched 12K ring paged the draft weights every step: 34 t/s). Generation is bounded by BOTH windows.

Qwen3-8B Q4_K_M + Qwen3-0.6B Q8_0 draft, 4070 Ti, 200-token greedy decode (baseline 74.8 t/s): k=3 96.1, k=4 99.3 (1.33x), k=6 86.1, k=8 77.2; SHARPI_BATCH_DECODE_MMQ=1: k=5 90.4, k=6 96.2, k=8 92.1. Verify cost at k=4 is ~19.5 ms/step (matches the #201 N=4 packed-step reference). Spec output text identical to non-spec baseline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
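A rough sketch of the folded greedy step described above, assuming a `batchVerify` delegate that returns one logit row per batch token. `FoldedStepSketch` and its members are hypothetical names for illustration, not the PR's SpeculativeDecoder, and KV-cache rollback is left out.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of a folded greedy verify step; names are hypothetical.
public static class FoldedStepSketch
{
    public static int ArgMax(float[] v)
    {
        int best = 0;
        for (int i = 1; i < v.Length; i++) if (v[i] > v[best]) best = i;
        return best;
    }

    // savedLogits: target logits for the next position, kept from the previous step.
    // drafts: proposed tokens that would follow the certain token.
    // batchVerify: one target pass over the batch; row i holds the logits for
    // the token that follows batch[i].
    public static (List<int> Emitted, float[] NextSaved) Step(
        float[] savedLogits, int[] drafts, Func<int[], float[][]> batchVerify)
    {
        // The certain token is the target's own greedy choice, so it is always emitted.
        int certain = ArgMax(savedLogits);
        var batch = new int[1 + drafts.Length];
        batch[0] = certain;
        Array.Copy(drafts, 0, batch, 1, drafts.Length);

        float[][] rows = batchVerify(batch);

        var emitted = new List<int> { certain };
        int accepted = 0;
        // Accept drafts while the target's greedy choice agrees with the proposal.
        while (accepted < drafts.Length && ArgMax(rows[accepted]) == drafts[accepted])
        {
            emitted.Add(drafts[accepted]);
            accepted++;
        }

        // rows[accepted] follows the last emitted token, so it becomes the next
        // step's saved logits; a real decoder would also truncate the KV cache
        // to drop the rejected positions.
        return (emitted, rows[accepted]);
    }
}
```

Because the row after the last accepted token is carried forward, a rejected draft costs nothing extra: the correction is simply the certain token of the next step, and no separate commit pass is needed.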

…coding

PromptLookupDraft: model-free draft source (llama.cpp lookup-decoding analog) that proposes continuation tokens by matching the tail n-gram (max..min, most recent occurrence wins, min=2 to suppress junk 1-gram fires) of prompt + generated history. Zero forward-pass cost; no match -> no proposals -> the step degrades to a plain single-token decode, so the floor is ~baseline.

SpeculativeDecoder gains a lookup mode (second ctor + prompt-token Initialize overload): the draft phase consults the matcher instead of running k-1 draft forwards, and there is no draft cache to truncate/sync. Verify, accept, and rollback are unchanged - emitted tokens are the target greedy chain regardless of proposal quality.

CLI: --draft-lookup (mutually exclusive with --draft-model), same CPU / full-CUDA-offload targets as model-draft speculation.

Qwen3-8B Q4_K_M, 4070 Ti, greedy (baseline 74.9 t/s): echo-heavy prompt ("repeat this text three times") k=4/8/16 -> 170.5 / 182.2 / 148.7 t/s (2.4x at k=8, draft 0 ms); adversarial no-copy prompt k=4/8 -> 76.9 / 71.8 t/s (8-9% acceptance, floor holds within ~4% of baseline).

Tests: PromptLookupDraft matcher units (longest-n-gram preference, recency, caps, reset), scripted lookup-mode decoder trace (proposal accept + floor path, no target Forwards), CUDA e2e 48-token greedy parity with lookup drafting on Qwen3-8B.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
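The tail n-gram matching described above might look something like the sketch below. It is an illustration under assumptions, not the PR's PromptLookupDraft: `PromptLookupSketch`, `Propose`, and the parameter names are hypothetical, and the real matcher's caps and reset behaviour are not modelled.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; not the PR's PromptLookupDraft implementation.
public static class PromptLookupSketch
{
    // Proposes up to maxProposals tokens by finding the most recent earlier
    // occurrence of the history's tail n-gram, trying n = maxN down to minN.
    public static int[] Propose(List<int> history, int maxN, int minN, int maxProposals)
    {
        for (int n = Math.Min(maxN, history.Count - 1); n >= minN; n--)
        {
            int tailStart = history.Count - n;

            // Scan candidate match positions from most recent to oldest.
            for (int i = tailStart - 1; i >= 0; i--)
            {
                bool match = true;
                for (int j = 0; j < n; j++)
                {
                    if (history[i + j] != history[tailStart + j]) { match = false; break; }
                }
                if (!match) continue;

                int start = i + n; // first token after the matched n-gram
                int count = Math.Min(maxProposals, history.Count - start);
                if (count <= 0) continue;

                var proposal = new int[count];
                history.CopyTo(start, proposal, 0, count);
                return proposal;
            }
        }

        // No n-gram of length >= minN recurs: no proposals, plain decode step.
        return Array.Empty<int>();
    }
}
```

With history `1 2 3 4 5 1 2 3`, `maxN = 3`, `minN = 2`, and `maxProposals = 4`, the tail `1 2 3` matches at the start of the history and the sketch proposes `4 5 1 2`. With `minN = 2`, a history such as `5 9 5` yields no proposals, since only a 1-gram recurs.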

…rid path (#30)

Realizes issue #25's >=1.3x criterion: Qwen3.6-27B-MTP Q4_K_M CUDA-hybrid decode 6.2 -> 10.4 t/s (1.68x) at default settings (k=2 verify batch, 90% draft acceptance). Sweep: draftN=1/2/3 -> 9.7/8.5/9.2 t/s with a 3-slot ring, 10.4 at the shipped 1-slot default. 35B-A3B-MTP (MoE) stays at parity (23.9 vs 24.4 warm) - per-token MoE routing can't share verify weight reads; grouped-by-expert verify batching is the follow-up there.

Mechanics (the dense #207 BatchVerify seam, generalized to GDN):

- MtpDecoder.DecodeBatched is now the folded k-token loop: the certain token plus a chained MTP draft sequence (new MtpLastHidden self-chaining) run through ONE IForwardPass.BatchVerify per step; on rejection the correction rides into the NEXT step's batch - no per-step commit forward (the #207 lesson). MTP KV entries for accepted drafts are rewritten with trunk hiddens (new HiddenAt) to keep the legacy trunk-quality contract. Per-step SupportsBatchVerify re-check preserves the #130 SnapKV gate; SHARPI_DISABLE_BATCH_VERIFY still forces the sequential N=1 path. Draft/Verify/Commit phase ms exposed for WDDM-paging diagnosis.
- CudaHybridGdnForwardPass.BatchVerify reuses the #111/#114-B batched-prefill trunk (GEMM-batched projections, batched contiguous-position attention) with a per-position delta-net recurrence so a DEVICE-side GDN snapshot ring captures every token boundary; [k x vocab] all-position lm_head; pair-batched MatVec2In on the CPU mmap FFN layers (odd-k tail runs as a duplicated-input pair so per-token bits are k-parity-independent).
- The ring (slots = SHARPI_MTP_BATCH_MAX-1, default 2 -> 1 slot x 149 MiB on 27B) is reserved BEFORE TryUploadDenseFfnLayers fills VRAM and skipped under SHARPI_DISABLE_MTP=1; each extra slot displaces ~2 GPU FFN layers (~0.35 t/s), which is why k=2 ships as the default.
- Verify-path batched matmuls are latched onto the temp-free matvec re-stream: the #162 MMQ/dequant-GEMM prefill machinery allocates per-call fp16 temps (71 MB per Q6_K FFN layer, ~600 MB for the lm_head) that only amortize at prefill N and land in WDDM-paged VRAM at decode-k (measured 6.0 -> 9.2 t/s from this latch alone).
- Fixes a latent bug: BatchForward2's GPU-GDN reject path snapshotted the STALE host _gdnStateCache (live state is the device tensors), so a rejected draft never actually rewound GPU GDN state. The device ring now serves slot 0 for BatchForward2 too; the new CUDA rollback oracle test fails without the fix.
- HybridGdnForwardPass (CPU) gets the same BatchVerify with a lazily grown host ring; the MtpDecoder_GreedyParity_LlamaCpp byte oracle passes over the new path (also exercised at k=4 during bring-up).
- Engine/CLI: the SpecDraftNMax>1 rejection is lifted; --spec-draft-n-max / SHARPI_MTP_DRAFT_N select the chain depth, clamped per step to the ring capacity (MaxBatchVerifyTokens) with an actionable CLI message.

Next unlock (measured headroom): the GPU verify cost scales linearly with k on the matvec re-stream while the CPU FFN only pair-amortizes - a 4-input dense MatVec (est. ~12+ t/s at k=4) or #194-style weight-stationary batched decode kernels for the hybrid trunk.

Tests: MtpDecoderBatchVerifyTests (scripted folded-loop contracts), CudaMtpBatchVerifyTests (per-position parity vs sequential, device-ring rollback oracle, chained-acceptance e2e); HybridGdnForwardPassTests, CudaHybridGdnForwardPassTests, CudaHybridGdnSnapKvTests, CudaSpecBatchVerifyTests, SpeculativeDecoderTests, GdnStateCache suites all green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
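The per-token-boundary snapshot ring can be pictured with a small host-side sketch. Everything here is hypothetical (`SnapshotRingSketch`, `RunBatch`, `Rollback`, and the flat `float[]` state); the PR's ring lives in device memory and its layout is not shown in this page. The sketch only illustrates why a k-token verify batch needs k-1 slots.

```csharp
using System;

// Illustrative sketch of a per-token-boundary snapshot ring for a recurrent
// state; not the PR's GdnStateCache or device ring.
public sealed class SnapshotRingSketch
{
    private readonly float[][] _slots;

    public float[] State { get; }

    public SnapshotRingSketch(int stateSize, int maxBatch)
    {
        State = new float[stateSize];
        // k tokens have k-1 interior boundaries; the final boundary is the live state.
        _slots = new float[Math.Max(0, maxBatch - 1)][];
        for (int i = 0; i < _slots.Length; i++) _slots[i] = new float[stateSize];
    }

    public int MaxBatchVerifyTokens => _slots.Length + 1;

    // Runs the recurrence over the batch, snapshotting after every token but the last.
    public void RunBatch(int[] tokens, Action<float[], int> update)
    {
        if (tokens.Length > MaxBatchVerifyTokens)
            throw new ArgumentException("Batch exceeds ring capacity.", nameof(tokens));

        for (int i = 0; i < tokens.Length; i++)
        {
            update(State, tokens[i]);
            if (i < tokens.Length - 1) Array.Copy(State, _slots[i], State.Length);
        }
    }

    // Rewinds the live state to the boundary after `kept` tokens of the last batch.
    public void Rollback(int kept, int batchLength)
    {
        if (kept < 1 || kept > batchLength) throw new ArgumentOutOfRangeException(nameof(kept));
        if (kept == batchLength) return; // everything accepted; live state is current
        Array.Copy(_slots[kept - 1], State, State.Length);
    }
}
```

In this sketch a batch limit of 2 needs a single slot, which lines up with the "default 2 -> 1 slot" sizing quoted in the commit message. The trade-off the commit describes is that each additional slot costs memory that would otherwise hold GPU FFN layers.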

gemini-code-assist

Code Review

This pull request introduces high-performance speculative decoding enhancements, including a model-free prompt-lookup (n-gram) drafting mechanism (PromptLookupDraft) and a generalized, folded k-token batched verification path (BatchVerify) across CPU and CUDA forward passes. It also adds a device-side GDN snapshot ring to CudaHybridGdnForwardPass and a host-side ring to HybridGdnForwardPass to support state rollback during multi-token MTP drafting. Feedback on the changes suggests using the more idiomatic and optimized List.CopyTo method instead of a manual loop for copying tokens in PromptLookupDraft.cs.


gemini-code-assist · 2026-06-11T08:17:58Z

+                var proposal = new int[count];
+                for (int t = 0; t < count; t++) proposal[t] = h[start + t];


Instead of using a manual loop to copy tokens from the history list to the proposal array, you can use the more idiomatic and optimized List.CopyTo method.

var proposal = new int[count];
h.CopyTo(start, proposal, 0, count);

Applied in 112b6af (List.CopyTo). Thanks.

…dense SnapKV verify gate, matvec small-N threshold, CLI spec window guards

- MtpDecoder: an accepted STOP draft now clamps acceptance before rollback so trunk KV / GDN state / MTP KV / hidden history all end exactly at _nextPos (previously stranded at P+1+a past the stop); one-time notice when draftN exceeds the snapshot-ring capacity (the silent-clamp server case); greedy selection delegates to Sampler.Greedy (3 copies -> 1).
- ForwardPass (CPU dense): SupportsBatchVerify gates on Length==LogicalLength - a SnapKV-compacted cache routed BatchVerify's TruncateTo(startPos) past the compacted slots (the #130 gate the CUDA/GDN passes already had).
- CudaHybridGdnForwardPass: the _matVecBatchedOnly latch (leak window between _faulted=false and latch-clear) is replaced by a MatMulComputeBatchMinN=8 threshold inside GpuMatMulBatched - the dequant-GEMM/MMQ temps only amortize at prefill N; host _batchSnapshotBuf now allocated only for the SHARPI_CPU_GDN=1 trunk (~150 MB host saved per MTP load); _bvCap reset before reallocs.
- HybridGdnForwardPass: null-after-free in the scratch/ring growers (OOM-path double-free); SHARPI_MTP_BATCH_MAX semantics live in one shared GdnStateCache.ResolveMtpBatchMax.
- CLI: prompts that exceed the (possibly 4096-capped) draft KV ring now fail fast with an actionable error before prefill writes out of range; a --spec-draft-n-max beyond ring capacity warns and clamps instead of disabling MTP outright; maxNew can no longer go negative silently.
- PromptLookupDraft: List.CopyTo for the proposal copy (PR #208 review).
- IForwardPass: SupportsBatchVerify doc no longer routes readers to the retired BatchForward2 dispatch.
- README: 27B-MTP rows updated to the #30/#207 results (10.4 t/s, 1.68x).

Tests: 45 scripted + 26 model-gated green incl. MtpDecoder_GreedyParity_LlamaCpp.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

pekkah and others added 5 commits June 10, 2026 22:32

chore: gitignore Nsight profiling artifacts (prof/)

2e6f423

Multi-GB nsys-rep/sqlite dumps from kernel profiling sessions; far over GitHub's 100 MB file limit and regenerable from the bench scripts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 11, 2026

pekkah merged commit 8d0d2f8 into master Jun 11, 2026
1 check passed

pekkah deleted the perf/207-cuda-spec-batch-verify branch June 11, 2026 11:40

This was referenced Jun 14, 2026

GPU draft-MTP speculative decoding for Gemma 4 12B (decode 54 → ~70 t/s) #178

Closed

perf(mtp): 4-input CPU MatVec4In for k>2 batched verify (#209) #287

Merged

pekkah merged 6 commits into master from perf/207-cuda-spec-batch-verify
