MoE batched verify: support BatchForward2 for IsMoE models (follow-up to #30, #44)

## Background

`HybridGdnForwardPass.SupportsBatchVerify` and `CudaHybridGdnForwardPass.SupportsBatchVerify` both gate on `_hasMtp && !_hp.IsMoE`. The reason is that `BatchForward2` calls `MatVec2In` for the dense FFN gate, up, and down projections, and `MatVec2In` has no MoE expert routing: it is a dense-only fused 2-input dot product.

Result: MoE MTP models (currently Qwen3.6-35B-A3B-MTP after #44) fall back to sequential N=1 verify. Even with 100 % draft acceptance, sequential MTP runs at ~0.93× the MTP-off baseline on CPU and ~0.99× on CUDA-hybrid, because every step pays the MTP-forward overhead without any speculative win.

## Scope

Make `BatchForward2` work when `_hp.IsMoE`. Options:

1. **Add MoE expert routing to `BatchForward2`.** Each token has its own top-K expert selection (router fires per token, so t1 and t2 may select different experts). The batched routed-expert sweep would need to handle this. Two paths:
   - Sequential per-token MoE FFN inside the per-layer loop: no FFN amortization, but the rest of the layer (attention, GDN, norms) still batches. This is the same approach as option 2 below. It should still be net positive because attention/GDN dominate at this model size.
   - True 2-input routed-expert batch: for each (k, r) pair where k=expert-rank, batch the gate/up/down across both tokens. When both tokens select the same expert, its weights are read once for both (cheap); when they select different experts, there is nothing to batch. Complex.

2. **Run MoE FFN sequentially inside `BatchForward2` but batch everything else.** Simplest first pass. Saves the t1+t2 attention and GDN MatMuls' weight reads while accepting that MoE FFN runs twice.
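For the true 2-input routed-expert batch in option 1, one possible first step is to split the two tokens' top-K selections into experts that both tokens chose (candidates for a 2-input kernel) and experts only one token chose (single-input path). The sketch below is only an illustration under that assumption: `ExpertPlanSketch` and `Plan` are hypothetical names, not existing code in the repo, and it groups by expert id rather than by rank.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ExpertPlanSketch
{
    // Hypothetical helper: partitions the top-K expert ids of two tokens.
    // Shared experts could be run once with a 2-input kernel; the others
    // still need one single-input pass each.
    public static (List<int> Shared, List<int> OnlyT1, List<int> OnlyT2) Plan(int[] topK1, int[] topK2)
    {
        var set1 = new HashSet<int>(topK1);
        var set2 = new HashSet<int>(topK2);
        var shared = topK1.Where(e => set2.Contains(e)).ToList();
        var only1 = topK1.Where(e => !set2.Contains(e)).ToList();
        var only2 = topK2.Where(e => !set1.Contains(e)).ToList();
        return (shared, only1, only2);
    }

    static void Main()
    {
        // Made-up expert ids for illustration only.
        var plan = Plan(new[] { 3, 17, 42, 8 }, new[] { 42, 5, 3, 99 });
        Console.WriteLine("shared:  " + string.Join(",", plan.Shared)); // shared:  3,42
        Console.WriteLine("only t1: " + string.Join(",", plan.OnlyT1)); // only t1: 17,8
        Console.WriteLine("only t2: " + string.Join(",", plan.OnlyT2)); // only t2: 5,99
    }
}
```

Each token would still apply its own router weight to a shared expert's output, so only the weight read is shared, not the final accumulation.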

Option 2, the simplest path, should already capture most of the win: the trunk's MoE FFN is bandwidth-bound on the routed experts (CPU mmap reads at ~2.3 GB/s), and the shared expert (GPU) is overlapped with the CPU routed loop. The non-FFN part of the layer (attention, GDN, norms) is where the 2-input weight reads pay off.
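The shape of option 2 can be sketched with a toy model. Everything below is a simplified assumption rather than the repo's code: `ToyLayer`, `MatVec2`, `MoeFfn`, and `BatchForward2Sketch` are hypothetical names, a single dense projection stands in for the attention/GDN MatMuls, routing is top-1, and weights are plain row-major floats.

```csharp
using System;

// Toy layer with hypothetical fields; not the repo's real types.
sealed class ToyLayer
{
    public float[] Proj = Array.Empty<float>();        // dim x dim, stand-in for attention/GDN weights
    public float[] Router = Array.Empty<float>();      // numExperts x dim
    public float[][] Experts = Array.Empty<float[]>(); // one dim x dim matrix per expert
}

static class Option2Sketch
{
    // Single-input matrix-vector product over row-major weights.
    static float[] MatVec(float[] w, float[] x)
    {
        int cols = x.Length, rows = w.Length / cols;
        var y = new float[rows];
        for (int r = 0; r < rows; r++)
        {
            float acc = 0f;
            for (int c = 0; c < cols; c++) acc += w[r * cols + c] * x[c];
            y[r] = acc;
        }
        return y;
    }

    // Two-input variant: each weight is loaded once and used for both tokens.
    static (float[] Y1, float[] Y2) MatVec2(float[] w, float[] x1, float[] x2)
    {
        int cols = x1.Length, rows = w.Length / cols;
        var y1 = new float[rows];
        var y2 = new float[rows];
        for (int r = 0; r < rows; r++)
        {
            float a1 = 0f, a2 = 0f;
            for (int c = 0; c < cols; c++)
            {
                float wv = w[r * cols + c];
                a1 += wv * x1[c];
                a2 += wv * x2[c];
            }
            y1[r] = a1;
            y2[r] = a2;
        }
        return (y1, y2);
    }

    // Top-1 routed FFN for one token: the router picks the expert per token.
    static float[] MoeFfn(ToyLayer l, float[] x)
    {
        float[] scores = MatVec(l.Router, x);
        int best = 0;
        for (int e = 1; e < scores.Length; e++)
            if (scores[e] > scores[best]) best = e;
        return MatVec(l.Experts[best], x);
    }

    // Option 2: batch the dense part, run the MoE FFN once per token.
    public static (float[] H1, float[] H2) BatchForward2Sketch(ToyLayer[] layers, float[] h1, float[] h2)
    {
        foreach (var l in layers)
        {
            var (a1, a2) = MatVec2(l.Proj, h1, h2); // batched: Proj is read once
            h1 = MoeFfn(l, a1);                     // sequential: token 1 routing + expert
            h2 = MoeFfn(l, a2);                     // sequential: token 2 routing + expert
        }
        return (h1, h2);
    }

    // Reference path: one token at a time, used to check the batched result.
    public static float[] ForwardSequential(ToyLayer[] layers, float[] h)
    {
        foreach (var l in layers) h = MoeFfn(l, MatVec(l.Proj, h));
        return h;
    }

    static void Main()
    {
        var layer = new ToyLayer
        {
            Proj = new float[] { 1, 0, 0, 1 },   // identity
            Router = new float[] { 1, 0, 0, 1 }, // expert 0 scores x[0], expert 1 scores x[1]
            Experts = new[] { new float[] { 2, 0, 0, 2 }, new float[] { 3, 0, 0, 3 } },
        };
        var (h1, h2) = BatchForward2Sketch(new[] { layer }, new float[] { 1, 0 }, new float[] { 0, 1 });
        Console.WriteLine(string.Join(",", h1)); // 2,0 (token 1 routed to expert 0)
        Console.WriteLine(string.Join(",", h2)); // 0,3 (token 2 routed to expert 1)
    }
}
```

Comparing `BatchForward2Sketch` against two `ForwardSequential` calls is also the kind of equivalence check that would guard the dense path against regressions, within a floating-point tolerance.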

## Acceptance criteria

- [ ] `SupportsBatchVerify` returns true for MoE MTP models (CPU and CUDA-hybrid).
- [ ] `MtpDecoder` dispatches to `BatchForward2` for 35B-A3B-MTP.
- [ ] Decode t/s improves over the sequential 0.93×/0.99× numbers on the README quicksort prompt with 100 % draft acceptance.
- [ ] No regression on dense 27B-MTP (existing batched-verify path stays correct).

## Related

- #30 (batched verify, dense-only, shipped)
- #44 (MoE MTP loader, shipped — left this as a follow-up)
- `BatchForward2` in `HybridGdnForwardPass.cs` and `CudaHybridGdnForwardPass.cs`
