perf: #207 single-user speculative decoding — dense batched verify + MTP k-token verify on GDN-hybrid (#30)#208

Merged: pekkah merged 6 commits into master from perf/207-cuda-spec-batch-verify on Jun 11, 2026

pekkah (Owner) commented Jun 11, 2026

Single-user speculative decoding, end to end (#207): batched k-token verify on the dense CUDA path with a draft model or prompt-lookup drafting (goals 1–3), and the same BatchVerify seam realized on the GDN-hybrid path for MTP self-speculation (goal 4, issue #30) — which finally lands issue #25's ≥1.3× MTP decode criterion.

Results (RTX 4070 Ti)

Dense path (Qwen3-8B Q4_K_M target, 200-token decode; baseline 74.8 t/s):

| cell | decode t/s | vs baseline |
|---|---|---|
| Qwen3-0.6B draft, k=4 (default WS) | 99.3 | 1.33× |
| prompt-lookup --draft-lookup, echo-heavy, k=8 | 182.2 | 2.43× |
| prompt-lookup, adversarial (no-copy) | 71.8–76.9 | ~1.0× floor |

GDN-hybrid MTP path (Qwen3.6-27B-MTP Q4_K_M, 80-token decode; baseline 6.2 t/s):

| cell | decode t/s | vs baseline | acceptance |
|---|---|---|---|
| defaults (k=2 verify batch) | 10.4 | 1.68× | 90% |
| --spec-draft-n-max 3 (k=4) | 9.2 | 1.48× | 84% |

Qwen3.6-35B-A3B-MTP (MoE): 23.9 vs 24.4 warm — parity by design (per-token expert routing can't share verify weight reads; grouped-by-expert verify is a follow-up).

What's in here

Correctness

  • MtpDecoder_GreedyParity_LlamaCpp byte oracle passes over the new batched path (untouched, unweakened).
  • Dense spec-decode greedy output exactly equals the non-spec baseline (model-draft and lookup modes).
  • New: MtpDecoderBatchVerifyTests (scripted folded-loop contracts), CudaMtpBatchVerifyTests (per-position parity vs sequential, device-ring rollback oracle, chained-acceptance e2e), CudaSpecBatchVerifyTests, extended SpeculativeDecoderTests.
  • Green: HybridGdnForwardPassTests, CudaHybridGdnForwardPassTests, CudaHybridGdnSnapKvTests (#130, "MTP batched-verify (BatchForward2) crashes when SnapKV eviction is active: _kvCache.Length != startPos", both directions), ContinuousBatchingTests, GdnStateCache suites.

Closes #207. Closes #30. Closes #25.

Follow-ups (issues to be filed): 4-input dense CPU MatVec / weight-stationary hybrid-trunk decode kernels (k>2 headroom, est. ~12+ t/s), grouped-by-expert MoE verify batching (35B win), SnapKV × batched-verify coexistence (#130).

🤖 Generated with Claude Code

pekkah and others added 5 commits June 10, 2026 22:32
…CUDA path

CudaForwardPass.BatchVerify(tokens, startPos): one packed k-token pass over
the owned cache at contiguous positions [P, P+k) reusing BatchForwardMulti's
trunk (every row bound to the same cache; ragged append-then-attend is exactly
packed causal attention), so every trunk matmul routes through BatchDecodeMatMul
and the #194 WS kernels (or opt-in #201 decode MMQ) amortize weight HBM reads k-fold.
SupportsBatchVerify gates on the dense batching-capable config + uncompacted
cache; rollback stays TruncateTo(P + accepted).

SpeculativeDecoder: replace the `is ForwardPass` CPU type-check with the
IForwardPass.SupportsBatchVerify capability check (promoted to the interface
with a default-throw BatchVerify; CPU ForwardPass opts in, TQ/gemma4 excluded).
Kill-switch SHARPI_SPEC_BATCH_VERIFY=0 -> sequential fallback. Adds
draft/verify/commit phase timing for bench reporting.

CLI: --draft-model now supported with full CUDA offload of a dense model
(draft loads on its OWN CudaBackend - graph state is per backend); both spec
runners generalized to IForwardPass and prefill via Prefill instead of the
per-token Forward loop. download-model.ps1 gains qwen3-0.6b (draft for the
Qwen3-8B bench pair).

Tests: pass-level BatchVerify-vs-sequential parity at k=4/6 (argmax + top-5 +
maxAbs, SnapKV pinned off), rollback TruncateTo+commit oracle, e2e 48-token
greedy parity (CUDA 8B target + CPU 0.6B draft, exact match), and scripted-mock
routing/kill-switch/rejection coverage. CudaBatchForwardMultiTests +
ContinuousBatchingTests stay green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… spec VRAM sizing

SpeculativeDecoder.Step now packs the CERTAIN next token (argmax of saved
target logits) with k-1 draft proposals into ONE batched target pass (the
llama.cpp formulation): the batch yields both the verification logits and the
next step's saved logits, so the separate 13.2 ms correction-commit Forward
disappears - the target runs exactly one batched pass per step. Saved draft
logits are no longer decoder state (Initialize keeps the parameter for
call-site compat).
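
The folded accept rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's SpeculativeDecoder code: `GreedyVerifySketch` and `Accept` are hypothetical names, and the per-position argmax array stands in for the logits returned by the single batched target pass.

```csharp
using System;

// Illustrative sketch only; names are hypothetical and not taken from the PR.
public static class GreedyVerifySketch
{
    // batch[0] is the certain token; batch[1..] are the draft proposals.
    // targetArgmax[i] is the target's greedy choice after consuming batch[0..i].
    // Returns how many drafts were accepted and the certain token for the next step.
    public static (int Accepted, int NextCertain) Accept(int[] batch, int[] targetArgmax)
    {
        if (batch.Length == 0 || batch.Length != targetArgmax.Length)
            throw new ArgumentException("batch and targetArgmax must be non-empty and equal length");

        int accepted = 0;
        // A draft is accepted only while it equals what the target would have emitted.
        while (accepted + 1 < batch.Length && batch[accepted + 1] == targetArgmax[accepted])
            accepted++;

        // The logits at the last accepted position already name the next certain token,
        // so no separate correction forward is needed.
        return (accepted, targetArgmax[accepted]);
    }
}
```

With batch `[10, 11, 12, 13]` and target argmax `[11, 12, 99, 7]`, two drafts are accepted and `99` becomes the next step's certain token; the cache entries written for the rejected draft would then be rolled back by the caller.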

CudaForwardPass: SupportsBatchVerify gets its own gate - a CONFIGURED SnapKV
budget no longer disables verify, only an actual prefill-time eviction does
(_kvEvictedCount > 0, the #130 GDN pattern); ThrowIfBatchingUnsupported gains
a decodeOnly mode for BatchForwardMulti (decode never evicts; per-seq caches
are only obtainable via CreateCache, which still rejects budgets).

CLI: cap the CUDA draft's KV ring at min(target.MaxSeqLen, 4096) unless -c is
explicit - sizing it from post-target free VRAM oversubscribed the 12 GB card
(34K-ctx/7GB ring: decode 75->13 t/s; even a target-matched 12K ring paged the
draft weights every step: 34 t/s). Generation is bounded by BOTH windows.
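
As a rough sketch of the sizing rule above: the class and method names below are hypothetical, and it assumes an explicit -c value is used as given and that a prompt longer than the smaller window simply leaves no room for new tokens.

```csharp
using System;

// Illustrative sketch only; names are hypothetical and not the CLI's actual code.
public static class SpecWindowSketch
{
    public const int DefaultDraftRingCap = 4096;

    // Draft KV ring length: an explicit -c wins, otherwise min(target.MaxSeqLen, 4096).
    public static int DraftRingLength(int targetMaxSeqLen, int? explicitContext) =>
        explicitContext ?? Math.Min(targetMaxSeqLen, DefaultDraftRingCap);

    // Generation is bounded by both windows, so the budget comes from the smaller one.
    public static int MaxNewTokens(int targetMaxSeqLen, int draftRingLength, int promptLength, int requested)
    {
        int window = Math.Min(targetMaxSeqLen, draftRingLength);
        int room = Math.Max(0, window - promptLength); // never negative
        return Math.Min(requested, room);
    }
}
```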

Qwen3-8B Q4_K_M + Qwen3-0.6B Q8_0 draft, 4070 Ti, 200-token greedy decode
(baseline 74.8 t/s): k=3 96.1, k=4 99.3 (1.33x), k=6 86.1, k=8 77.2;
SHARPI_BATCH_DECODE_MMQ=1: k=5 90.4, k=6 96.2, k=8 92.1. Verify cost at k=4
is ~19.5 ms/step (matches the #201 N=4 packed-step reference). Spec output
text identical to non-spec baseline.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…coding

PromptLookupDraft: model-free draft source (llama.cpp lookup-decoding analog)
that proposes continuation tokens by matching the tail n-gram (max..min, most
recent occurrence wins, min=2 to suppress junk 1-gram fires) of prompt +
generated history. Zero forward-pass cost; no match -> no proposals -> the
step degrades to a plain single-token decode, so the floor is ~baseline.
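
A minimal sketch of the matching rule described above, under stated assumptions: `PromptLookupSketch.Propose` and its parameters are hypothetical names chosen for illustration, and the real PromptLookupDraft may differ in caps and bookkeeping.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; not the PR's PromptLookupDraft implementation.
public static class PromptLookupSketch
{
    // Proposes up to maxProposals tokens that followed the most recent earlier
    // occurrence of the history's tail n-gram, trying n = maxN down to minN.
    public static int[] Propose(List<int> history, int maxN, int minN, int maxProposals)
    {
        for (int n = Math.Min(maxN, history.Count - 1); n >= minN; n--)
        {
            int tailStart = history.Count - n;
            // Scan candidate start positions from most recent to oldest.
            for (int s = tailStart - 1; s >= 0; s--)
            {
                bool match = true;
                for (int j = 0; j < n; j++)
                {
                    if (history[s + j] != history[tailStart + j]) { match = false; break; }
                }
                if (!match) continue;

                int start = s + n; // first token after the matched n-gram
                int count = Math.Min(maxProposals, history.Count - start);
                var proposal = new int[count];
                history.CopyTo(start, proposal, 0, count);
                return proposal;
            }
        }
        return Array.Empty<int>(); // no match: the caller decodes one token normally
    }
}
```

Trying longer n-grams first favors more specific context, and scanning backwards makes the most recent occurrence win, which matches the preference order stated in the commit message.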

SpeculativeDecoder gains a lookup mode (second ctor + prompt-token Initialize
overload): the draft phase consults the matcher instead of running k-1 draft
forwards, and there is no draft cache to truncate/sync. Verify, accept, and
rollback are unchanged - emitted tokens are the target greedy chain regardless
of proposal quality.

CLI: --draft-lookup (mutually exclusive with --draft-model), same CPU / full-
CUDA-offload targets as model-draft speculation.

Qwen3-8B Q4_K_M, 4070 Ti, greedy (baseline 74.9 t/s): echo-heavy prompt
("repeat this text three times") k=4/8/16 -> 170.5 / 182.2 / 148.7 t/s
(2.4x at k=8, draft 0 ms); adversarial no-copy prompt k=4/8 -> 76.9 / 71.8
t/s (8-9% acceptance, floor holds within ~4% of baseline).

Tests: PromptLookupDraft matcher units (longest-n-gram preference, recency,
caps, reset), scripted lookup-mode decoder trace (proposal accept + floor
path, no target Forwards), CUDA e2e 48-token greedy parity with lookup
drafting on Qwen3-8B.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Multi-GB nsys-rep/sqlite dumps from kernel profiling sessions; far over
GitHub's 100 MB file limit and regenerable from the bench scripts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rid path (#30)

Realizes issue #25's >=1.3x criterion: Qwen3.6-27B-MTP Q4_K_M CUDA-hybrid
decode 6.2 -> 10.4 t/s (1.68x) at default settings (k=2 verify batch, 90%
draft acceptance). Sweep: draftN=1/2/3 -> 9.7/8.5/9.2 t/s with a 3-slot ring,
10.4 at the shipped 1-slot default. 35B-A3B-MTP (MoE) stays at parity (23.9
vs 24.4 warm) - per-token MoE routing can't share verify weight reads;
grouped-by-expert verify batching is the follow-up there.

Mechanics (the dense #207 BatchVerify seam, generalized to GDN):
- MtpDecoder.DecodeBatched is now the folded k-token loop: the certain token
  plus a chained MTP draft sequence (new MtpLastHidden self-chaining) run
  through ONE IForwardPass.BatchVerify per step; on rejection the correction
  rides into the NEXT step's batch - no per-step commit forward (the #207
  lesson). MTP KV entries for accepted drafts are rewritten with trunk
  hiddens (new HiddenAt) to keep the legacy trunk-quality contract.
  Per-step SupportsBatchVerify re-check preserves the #130 SnapKV gate;
  SHARPI_DISABLE_BATCH_VERIFY still forces the sequential N=1 path.
  Draft/Verify/Commit phase ms exposed for WDDM-paging diagnosis.
- CudaHybridGdnForwardPass.BatchVerify reuses the #111/#114-B batched-prefill
  trunk (GEMM-batched projections, batched contiguous-position attention)
  with a per-position delta-net recurrence so a DEVICE-side GDN snapshot
  ring captures every token boundary; [k x vocab] all-position lm_head;
  pair-batched MatVec2In on the CPU mmap FFN layers (odd-k tail runs as a
  duplicated-input pair so per-token bits are k-parity-independent).
- The ring (slots = SHARPI_MTP_BATCH_MAX-1, default 2 -> 1 slot x 149 MiB on
  27B) is reserved BEFORE TryUploadDenseFfnLayers fills VRAM and skipped
  under SHARPI_DISABLE_MTP=1; each extra slot displaces ~2 GPU FFN layers
  (~0.35 t/s), which is why k=2 ships as the default.
- Verify-path batched matmuls are latched onto the temp-free matvec
  re-stream: the #162 MMQ/dequant-GEMM prefill machinery allocates
  per-call fp16 temps (71 MB per Q6_K FFN layer, ~600 MB for the lm_head)
  that only amortize at prefill N and land in WDDM-paged VRAM at decode-k
  (measured 6.0 -> 9.2 t/s from this latch alone).
- Fixes a latent bug: BatchForward2's GPU-GDN reject path snapshotted the
  STALE host _gdnStateCache (live state is the device tensors), so a
  rejected draft never actually rewound GPU GDN state. The device ring now
  serves slot 0 for BatchForward2 too; the new CUDA rollback oracle test
  fails without the fix.
- HybridGdnForwardPass (CPU) gets the same BatchVerify with a lazily grown
  host ring; the MtpDecoder_GreedyParity_LlamaCpp byte oracle passes over
  the new path (also exercised at k=4 during bring-up).
- Engine/CLI: the SpecDraftNMax>1 rejection is lifted; --spec-draft-n-max /
  SHARPI_MTP_DRAFT_N select the chain depth, clamped per step to the ring
  capacity (MaxBatchVerifyTokens) with an actionable CLI message.
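
One way to read the ring-capacity arithmetic in the bullets above, as a hedged sketch: the helper names are hypothetical, and it assumes a verify batch is the certain token plus draftN drafts, so the chain depth is limited to the number of ring slots.

```csharp
using System;

// Illustrative sketch only; names are hypothetical and not the PR's code.
public static class MtpRingSketch
{
    // slots = SHARPI_MTP_BATCH_MAX - 1, as stated in the commit message.
    public static int RingSlots(int batchMax) => Math.Max(0, batchMax - 1);

    // Assumption: one ring slot is needed per draft token in the chain.
    public static int ClampDraftN(int requestedDraftN, int batchMax) =>
        Math.Min(requestedDraftN, RingSlots(batchMax));

    // Verify batch size k = certain token + draftN drafts.
    public static int VerifyBatchSize(int draftN) => draftN + 1;

    // Ring footprint in MiB for a given per-slot size (149 MiB per slot on 27B).
    public static int RingMiB(int batchMax, int perSlotMiB) => RingSlots(batchMax) * perSlotMiB;
}
```

Under these assumptions the default batch max of 2 gives one slot, k=2, and a 149 MiB ring on 27B, while --spec-draft-n-max 3 needs three slots (447 MiB) for a k=4 verify batch.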

Next unlock (measured headroom): the GPU verify cost scales linearly with k
on the matvec re-stream while the CPU FFN only pair-amortizes - a 4-input
dense MatVec (est. ~12+ t/s at k=4) or #194-style weight-stationary batched
decode kernels for the hybrid trunk.

Tests: MtpDecoderBatchVerifyTests (scripted folded-loop contracts),
CudaMtpBatchVerifyTests (per-position parity vs sequential, device-ring
rollback oracle, chained-acceptance e2e); HybridGdnForwardPassTests,
CudaHybridGdnForwardPassTests, CudaHybridGdnSnapKvTests,
CudaSpecBatchVerifyTests, SpeculativeDecoderTests, GdnStateCache suites
all green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces high-performance speculative decoding enhancements, including a model-free prompt-lookup (n-gram) drafting mechanism (PromptLookupDraft) and a generalized, folded k-token batched verification path (BatchVerify) across CPU and CUDA forward passes. It also adds a device-side GDN snapshot ring to CudaHybridGdnForwardPass and a host-side ring to HybridGdnForwardPass to support state rollback during multi-token MTP drafting. Feedback on the changes suggests using the more idiomatic and optimized List.CopyTo method instead of a manual loop for copying tokens in PromptLookupDraft.cs.

Comment on lines +76 to +77
```csharp
var proposal = new int[count];
for (int t = 0; t < count; t++) proposal[t] = h[start + t];
```

medium

Instead of using a manual loop to copy tokens from the history list to the proposal array, you can use the more idiomatic and optimized List.CopyTo method.

```csharp
var proposal = new int[count];
h.CopyTo(start, proposal, 0, count);
```

pekkah (Owner, Author) replied:

Applied in 112b6af (List.CopyTo). Thanks.

…dense SnapKV verify gate, matvec small-N threshold, CLI spec window guards

- MtpDecoder: an accepted STOP draft now clamps acceptance before rollback so
  trunk KV / GDN state / MTP KV / hidden history all end exactly at _nextPos
  (previously stranded at P+1+a past the stop); one-time notice when draftN
  exceeds the snapshot-ring capacity (the silent-clamp server case); greedy
  selection delegates to Sampler.Greedy (3 copies -> 1).
- ForwardPass (CPU dense): SupportsBatchVerify gates on Length==LogicalLength -
  a SnapKV-compacted cache routed BatchVerify's TruncateTo(startPos) past the
  compacted slots (the #130 gate the CUDA/GDN passes already had).
- CudaHybridGdnForwardPass: the _matVecBatchedOnly latch (leak window between
  _faulted=false and latch-clear) is replaced by a MatMulComputeBatchMinN=8
  threshold inside GpuMatMulBatched - the dequant-GEMM/MMQ temps only amortize
  at prefill N; host _batchSnapshotBuf now allocated only for the
  SHARPI_CPU_GDN=1 trunk (~150 MB host saved per MTP load); _bvCap reset
  before reallocs.
- HybridGdnForwardPass: null-after-free in the scratch/ring growers (OOM-path
  double-free); SHARPI_MTP_BATCH_MAX semantics live in one shared
  GdnStateCache.ResolveMtpBatchMax.
- CLI: prompts that exceed the (possibly 4096-capped) draft KV ring now fail
  fast with an actionable error before prefill writes out of range; a
  --spec-draft-n-max beyond ring capacity warns and clamps instead of
  disabling MTP outright; maxNew can no longer go negative silently.
- PromptLookupDraft: List.CopyTo for the proposal copy (PR #208 review).
- IForwardPass: SupportsBatchVerify doc no longer routes readers to the
  retired BatchForward2 dispatch.
- README: 27B-MTP rows updated to the #30/#207 results (10.4 t/s, 1.68x).

Tests: 45 scripted + 26 model-gated green incl. MtpDecoder_GreedyParity_LlamaCpp.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>