Follow-up to #210 (shipped in #270 / c835073), which added grouped-by-expert MoE verify batching to the CUDA-hybrid pass (CudaHybridGdnForwardPass.BatchVerifyCpuMoe). The pure-CPU sibling was intentionally left on its per-token loop.
Scope
HybridGdnForwardPass.BatchVerify (src/SharpInference.Engine/HybridGdnForwardPass.cs) still loops MoeFfnCore per draft token, so on a routed-MoE MTP model the routed experts re-read their mmap'd weights k times — the exact problem #210 fixed for the GPU path. This pass runs only for pure-CPU execution of a GDN-hybrid model (RunCommand.cs:391 — hp.IsHybridSsm && effNGpuLayers == 0, i.e. -g 0 / no GPU), e.g. serving Qwen3.6-35B-A3B-MTP on a GPU-less box.
Why it wasn't done in #210
Unlike the CUDA pass, HybridGdnForwardPass has no BatchedRoutedExperts core to reuse — this is a real port (~200 lines): the CSR bucket machinery + Phase A/B/C gate/up/down dots + scratch buffers, against the CPU class's TensorRef/DispatchDot (not the CUDA class's CpuWeightRef). It only benefits a niche path that can't be GPU-validated and historically showed no measured MTP speedup.
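To make the "CSR bucket machinery" concrete, here is a minimal sketch of grouping (token, slot) pairs by expert in CSR form. It is written in Python for illustration, not in the repo's C#, and every name in it (`build_expert_buckets`, `weight_reads`, `routes`) is hypothetical rather than taken from BatchedRoutedExperts. It also counts how often expert weights are touched, assuming each per-token call reads an expert's weights once and a grouped pass reads each active expert once.

```python
from typing import List, Tuple


def build_expert_buckets(
    routes: List[List[int]], n_experts: int
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Group (token, slot) pairs by expert in CSR form.

    routes[t] lists the expert ids chosen for draft token t.
    Returns (offsets, entries), where entries[offsets[e]:offsets[e + 1]]
    holds the (token, slot) pairs routed to expert e.
    """
    # Pass 1: count how many (token, slot) pairs land on each expert.
    counts = [0] * n_experts
    for experts in routes:
        for e in experts:
            counts[e] += 1

    # Exclusive prefix sum turns counts into bucket start offsets.
    offsets = [0] * (n_experts + 1)
    for e in range(n_experts):
        offsets[e + 1] = offsets[e] + counts[e]

    # Pass 2: scatter each pair into its expert's bucket.
    cursor = list(offsets[:-1])
    entries: List[Tuple[int, int]] = [(-1, -1)] * offsets[-1]
    for t, experts in enumerate(routes):
        for slot, e in enumerate(experts):
            entries[cursor[e]] = (t, slot)
            cursor[e] += 1
    return offsets, entries


def weight_reads(routes: List[List[int]], n_experts: int) -> Tuple[int, int]:
    """Return (per-token reads, grouped reads) of expert weights."""
    per_token = sum(len(experts) for experts in routes)
    offsets, _ = build_expert_buckets(routes, n_experts)
    grouped = sum(1 for e in range(n_experts) if offsets[e + 1] > offsets[e])
    return per_token, grouped


# Three draft tokens, top-2 routing over 8 experts (made-up routing).
routes = [[2, 5], [2, 7], [5, 2]]
offsets, entries = build_expert_buckets(routes, 8)
reads = weight_reads(routes, 8)
```

In this toy routing the per-token loop touches expert weights six times, while the grouped pass touches three distinct experts once each. The saving grows with how much the draft tokens' routes overlap.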
Work items
- Port the group-by-expert routed FFN (mirror BatchedRoutedExperts + the BatchVerifyCpuMoe orchestration) into HybridGdnForwardPass, reused by both BatchVerify's MoE branch and ideally prefill.
- Bit-parity test: batched-vs-per-token verify byte-identical (GPU-free: HybridGdnForwardPass is CPU-only, so this can run in CI, unlike the CUDA oracle, which needs a 4070 Ti + the GGUF on disk).
- Bench pure-CPU -g 0 MoE-MTP decode (per-token vs batched).
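The bit-parity work item can be sketched as a toy test that runs the same routed FFN per token and grouped by expert, then compares the raw bytes of the outputs. This is an illustration in Python, not the C# test itself. The tiny gated FFN, the random weights, and all function names are assumptions made for the example. It also assumes the grouped path stores per-slot outputs and reduces them in the original slot order, which is one way to keep floating-point accumulation identical. Whether the real port does this is not stated in the issue.

```python
import math
import random
import struct


def dot(a, b):
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def expert_ffn(w, x):
    """Toy gated FFN: down(silu(gate . x) * (up . x))."""
    gate, up, down = w
    h = []
    for g_row, u_row in zip(gate, up):
        g = dot(g_row, x)
        h.append(g / (1.0 + math.exp(-g)) * dot(u_row, x))
    return [dot(d_row, h) for d_row in down]


def verify_per_token(weights, xs, routes):
    """Reference path: loop over tokens, then over each token's experts."""
    outs = []
    for x, route in zip(xs, routes):
        out = [0.0] * len(x)
        for e, p in route:
            y = expert_ffn(weights[e], x)
            out = [o + p * v for o, v in zip(out, y)]
        outs.append(out)
    return outs


def verify_grouped(weights, xs, routes):
    """Grouped path: visit each expert once, then reduce in slot order."""
    buckets = {}
    for t, route in enumerate(routes):
        for slot, (e, _) in enumerate(route):
            buckets.setdefault(e, []).append((t, slot))
    slot_out = [[None] * len(route) for route in routes]
    for e in sorted(buckets):
        w = weights[e]  # each expert's weights are fetched once
        for t, slot in buckets[e]:
            slot_out[t][slot] = expert_ffn(w, xs[t])
    outs = []
    for t, route in enumerate(routes):
        out = [0.0] * len(xs[t])
        for slot, (_, p) in enumerate(route):
            out = [o + p * v for o, v in zip(out, slot_out[t][slot])]
        outs.append(out)
    return outs


def to_bytes(outs):
    return b"".join(struct.pack("<d", v) for row in outs for v in row)


rng = random.Random(0)
dim, hidden, n_experts, n_tokens, top_k = 4, 3, 4, 3, 2


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


weights = [
    (rand_matrix(hidden, dim), rand_matrix(hidden, dim), rand_matrix(dim, hidden))
    for _ in range(n_experts)
]
xs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
routes = [
    [(e, rng.uniform(0.1, 1.0)) for e in rng.sample(range(n_experts), top_k)]
    for _ in range(n_tokens)
]

identical = to_bytes(verify_per_token(weights, xs, routes)) == to_bytes(
    verify_grouped(weights, xs, routes)
)
```

Comparing packed bytes rather than using a tolerance is what distinguishes a bit-parity check from an approximate one: any change in accumulation order would show up as a mismatch.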
Acceptance
MoeFfnCore loop (greedy parity preserved).