Background
HybridGdnForwardPass.SupportsBatchVerify and CudaHybridGdnForwardPass.SupportsBatchVerify both gate on _hasMtp && !_hp.IsMoE. Reason: BatchForward2 calls MatVec2In for the dense FFN gate+up+down path, and MatVec2In has no MoE expert routing; it is a dense-only fused 2-input dot product.
Result: MoE MTP models (currently Qwen3.6-35B-A3B-MTP after #44) fall back to sequential N=1 verify. With 100% draft acceptance, sequential MTP runs at ~0.93× the MTP-off baseline on CPU and ~0.99× on CUDA-hybrid: every step pays the MTP-forward overhead and gets no speculative win.
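To make the "fused 2-input dot" concrete, here is a minimal sketch of the idea: each weight is loaded once and applied to both tokens' inputs, which is where the weight-read saving comes from. The class name, signature, and row-major layout are assumptions for illustration and are not taken from the real MatVec2In.

```csharp
using System;

// Hypothetical stand-in for a dense 2-input mat-vec; not the real MatVec2In.
public static class MatVec2InSketch
{
    // Computes y1 = W * x1 and y2 = W * x2 in one pass over W.
    // w is assumed row-major with `rows` rows and `cols` columns.
    public static void Run(float[] w, int rows, int cols,
                           float[] x1, float[] x2,
                           float[] y1, float[] y2)
    {
        for (int r = 0; r < rows; r++)
        {
            float acc1 = 0f, acc2 = 0f;
            int rowStart = r * cols;
            for (int c = 0; c < cols; c++)
            {
                float wv = w[rowStart + c]; // single weight load shared by both tokens
                acc1 += wv * x1[c];
                acc2 += wv * x2[c];
            }
            y1[r] = acc1;
            y2[r] = acc2;
        }
    }
}
```

Nothing in this loop selects a weight matrix per token, which is why a dense kernel of this shape cannot serve routed experts as-is.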
Scope
Make BatchForward2 work when _hp.IsMoE. Options:
1. Add MoE expert routing to BatchForward2. Each token has its own top-K expert selection (the router fires per token, so t1 and t2 may select different experts), and the batched routed-expert sweep would need to handle this. Two paths:
   - Sequential per-token MoE FFN inside the per-layer loop (no FFN amortization, but the rest of the layer batches: attention, GDN, norms). Should still net positive because attention/GDN dominate at this model size.
   - True 2-input routed-expert batch: for each (k, r) pair where k=expert-rank, batch the gate/up/down across both tokens. The same expert may serve both tokens (cheap), or the tokens may select different experts (no batching). Complex.
2. Run MoE FFN sequentially inside BatchForward2 but batch everything else. Simplest first pass. Saves the t1+t2 attention and GDN MatMuls' weight reads while accepting that MoE FFN runs twice.
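For the true 2-input routed-expert batch, the core bookkeeping is working out which experts both tokens selected. The sketch below is one hypothetical way to do that, grouping by expert id rather than by rank. The class and method names are assumptions, it assumes expert ids are unique within each token's top-K list, and it leaves out the per-token router weights that a real implementation would carry alongside each id.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper; not part of the existing forward-pass classes.
public static class ExpertOverlapSketch
{
    // Splits two per-token top-K expert lists into:
    //   Shared: selected by both tokens (one weight read can serve both)
    //   OnlyT1 / OnlyT2: selected by a single token (no batching possible)
    public static (List<int> Shared, List<int> OnlyT1, List<int> OnlyT2) Partition(
        IReadOnlyList<int> t1Experts, IReadOnlyList<int> t2Experts)
    {
        var t2Set = new HashSet<int>(t2Experts);
        var shared = new List<int>();
        var onlyT1 = new List<int>();
        foreach (int e in t1Experts)
        {
            if (t2Set.Contains(e)) shared.Add(e);
            else onlyT1.Add(e);
        }

        var sharedSet = new HashSet<int>(shared);
        var onlyT2 = new List<int>();
        foreach (int e in t2Experts)
        {
            if (!sharedSet.Contains(e)) onlyT2.Add(e);
        }
        return (shared, onlyT1, onlyT2);
    }
}
```

With t1 selecting experts {3, 7, 1, 9} and t2 selecting {7, 2, 9, 5}, experts 7 and 9 land in Shared and could go through a 2-input kernel, while the other four each need a single-token pass.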
The simplest path (option 2) should already capture most of the win: the trunk's MoE FFN is bandwidth-bound on the routed experts (CPU mmap reads at ~2.3 GB/s), and the shared expert (GPU) is overlapped with the CPU routed loop. The non-FFN part of the layer (attention, GDN, norms) does benefit from 2-input weight reads.
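The control flow of the simplest path can be sketched as follows: one batched call per layer for the non-FFN work, then the existing single-token MoE FFN invoked once per token. The delegates stand in for the real kernels; their names and signatures are assumptions, not the actual BatchForward2 internals.

```csharp
using System;

// Hypothetical skeleton of the per-layer loop; not the real BatchForward2.
public static class BatchForward2Sketch
{
    // batchedMixer: attention or GDN plus norms for both tokens, reading weights once.
    // moeFfn: single-token MoE FFN (router, routed experts, shared expert), reused unchanged.
    public static void Run(int layerCount, float[] h1, float[] h2,
                           Action<int, float[], float[]> batchedMixer,
                           Action<int, float[]> moeFfn)
    {
        for (int layer = 0; layer < layerCount; layer++)
        {
            batchedMixer(layer, h1, h2); // 2-input path: weight reads amortized over t1 and t2
            moeFfn(layer, h1);           // MoE FFN for t1, own top-K routing
            moeFfn(layer, h2);           // MoE FFN for t2, own top-K routing
        }
    }
}
```

Per layer this makes one batched call and two FFN calls, so the FFN cost matches sequential verify and any saving comes only from the batched part.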
Acceptance criteria
- SupportsBatchVerify returns true for MoE MTP models (CPU and CUDA-hybrid).
- MtpDecoder dispatches to BatchForward2 for 35B-A3B-MTP.
Related
- BatchForward2 in HybridGdnForwardPass.cs and CudaHybridGdnForwardPass.cs