MoE batched verify: support BatchForward2 for IsMoE models (follow-up to #30, #44) #45

Description

@pekkah

Background

HybridGdnForwardPass.SupportsBatchVerify and CudaHybridGdnForwardPass.SupportsBatchVerify both gate on _hasMtp && !_hp.IsMoE. The reason: BatchForward2 calls MatVec2In for the dense FFN gate+up+down path, and MatVec2In has no MoE expert routing. It is a dense-only fused 2-input dot product.
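To illustrate why a fused 2-input kernel is dense-only, here is a minimal Python sketch of the idea. It is not the project's MatVec2In. The name `matvec2in` and the list-of-rows weight layout are assumptions made for illustration. Each weight row is visited once and dotted with both inputs, which only works when both tokens use the same weight matrix.

```python
def matvec2in(weight, x1, x2):
    """Hypothetical stand-in for a fused 2-input mat-vec.

    Each weight row is traversed once and multiplied against both
    input vectors, so the weight read is amortized over two tokens.
    """
    y1, y2 = [], []
    for row in weight:
        acc1 = 0.0
        acc2 = 0.0
        for w, a, b in zip(row, x1, x2):
            acc1 += w * a  # contribution to token 1's output
            acc2 += w * b  # contribution to token 2's output
        y1.append(acc1)
        y2.append(acc2)
    return y1, y2


if __name__ == "__main__":
    out1, out2 = matvec2in([[1, 2], [3, 4]], [1, 0], [0, 1])
    print(out1, out2)
```

With MoE, the two tokens may route to different expert matrices, so there is no single `weight` to share across both inputs.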

Result: MoE MTP models (currently Qwen3.6-35B-A3B-MTP after #44) fall back to sequential N=1 verify. With 100% draft acceptance, sequential MTP runs at ~0.93× the MTP-off baseline on CPU and ~0.99× on CUDA-hybrid: each step pays the MTP-forward overhead without any speculative win.

Scope

Make BatchForward2 work when _hp.IsMoE. Options:

  1. Add MoE expert routing to BatchForward2. Each token has its own top-K expert selection (router fires per token, so t1 and t2 may select different experts). The batched routed-expert sweep would need to handle this. Two paths:

    • Sequential per-token MoE FFN inside the per-layer loop (no FFN amortization, but the rest of the layer still batches: attention, GDN, norms). Should still be a net win because attention/GDN dominate at this model size.
    • True 2-input routed-expert batch: for each (k, r) pair where k=expert-rank, batch the gate/up/down across both tokens. When both tokens select the same expert, its weights are read once for both (cheap); when they select different experts, nothing can be batched. Complex.
  2. Run the MoE FFN sequentially inside BatchForward2 but batch everything else. Simplest first pass. Saves the weight reads of the t1+t2 attention and GDN MatMuls while accepting that the MoE FFN runs twice.
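The per-token routing problem can be sketched as follows. This is a hedged illustration only: `top_k_experts` and `plan_expert_sweep` are hypothetical names, and the sketch groups by expert id (not by expert rank as in the (k, r) description above) to show which experts could share a weight read across t1 and t2.

```python
def top_k_experts(router_logits, k):
    """Indices of the k largest router logits, highest first.

    Ties are broken by the lower expert index (an assumption).
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: (-router_logits[i], i))
    return order[:k]


def plan_expert_sweep(logits_t1, logits_t2, k):
    """Split the selected experts into shared and per-token groups.

    shared  : experts picked by both tokens (one weight read, two inputs)
    only_t1 : experts picked only by token 1 (no batching possible)
    only_t2 : experts picked only by token 2 (no batching possible)
    """
    e1 = top_k_experts(logits_t1, k)
    e2 = top_k_experts(logits_t2, k)
    shared = sorted(set(e1) & set(e2))
    only_t1 = [e for e in e1 if e not in shared]
    only_t2 = [e for e in e2 if e not in shared]
    return shared, only_t1, only_t2


if __name__ == "__main__":
    print(plan_expert_sweep([0.1, 0.9, 0.5, 0.3], [0.8, 0.7, 0.1, 0.2], 2))
```

Only the `shared` group benefits from a 2-input kernel. How often the two tokens overlap depends on the router, and the issue gives no measurements for that.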

The simplest path (option 2) should already cover most of the win, since the trunk's MoE FFN is bandwidth-bound on the routed experts (CPU mmap reads at ~2.3 GB/s) and the shared expert (GPU) is overlapped with the CPU routed loop. The non-FFN part of the layer (attention, GDN, norms) does benefit from 2-input weight reads.
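A minimal structural sketch of option 2, under stated assumptions: `mixer2` stands in for the batched attention/GDN/norm part of a layer, `moe_ffn` for the existing single-token MoE FFN, and the residual add is assumed. None of these names come from the codebase.

```python
def batch_forward2_layer(h1, h2, mixer2, moe_ffn):
    """One layer under option 2: batched mixer, sequential MoE FFN.

    mixer2(h1, h2) -> (a1, a2): hypothetical batched attention/GDN/norms,
                                reading its weights once for both tokens.
    moe_ffn(a)     -> f:        hypothetical single-token MoE FFN
                                (router + routed experts + shared expert).
    """
    a1, a2 = mixer2(h1, h2)  # batched: one weight pass serves t1 and t2
    f1 = moe_ffn(a1)         # sequential: token 1 routes on its own
    f2 = moe_ffn(a2)         # sequential: token 2 routes on its own
    out1 = [a + f for a, f in zip(a1, f1)]  # assumed residual connection
    out2 = [a + f for a, f in zip(a2, f2)]
    return out1, out2


if __name__ == "__main__":
    # Toy callables, used only to exercise the control flow.
    toy_mixer2 = lambda x, y: ([v * 2 for v in x], [v * 2 for v in y])
    toy_ffn = lambda x: [v + 1 for v in x]
    print(batch_forward2_layer([1, 2], [3, 4], toy_mixer2, toy_ffn))
```

The design point is that the MoE FFN call sites stay identical to the sequential path, so only the non-FFN kernels need 2-input variants.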

Acceptance criteria

  • SupportsBatchVerify returns true for MoE MTP models (CPU and CUDA-hybrid).
  • MtpDecoder dispatches to BatchForward2 for 35B-A3B-MTP.
  • Decode t/s improves over the sequential 0.93×/0.99× numbers on the README quicksort prompt with 100% draft acceptance.
  • No regression on dense 27B-MTP (existing batched-verify path stays correct).
