perf(moe): grouped-by-expert MoE verify batching for the pure-CPU GDN path (HybridGdnForwardPass) #271

Description

@pekkah

Follow-up to #210 (shipped in #270 / c835073), which added grouped-by-expert MoE verify batching to the CUDA-hybrid pass (CudaHybridGdnForwardPass.BatchVerifyCpuMoe). The pure-CPU sibling was intentionally left on its per-token loop.

Scope

HybridGdnForwardPass.BatchVerify (src/SharpInference.Engine/HybridGdnForwardPass.cs) still loops MoeFfnCore per draft token, so on a routed-MoE MTP model the routed experts re-read their mmap'd weights k times, once per draft token. This is the exact problem #210 fixed for the CUDA-hybrid path. The pass runs only for pure-CPU execution of a GDN-hybrid model (RunCommand.cs:391: hp.IsHybridSsm && effNGpuLayers == 0, i.e. -g 0 / no GPU), e.g. serving Qwen3.6-35B-A3B-MTP on a GPU-less box.

Why it wasn't done in #210

Unlike the CUDA pass, HybridGdnForwardPass has no BatchedRoutedExperts core to reuse — this is a real port (~200 lines): the CSR bucket machinery + Phase A/B/C gate/up/down dots + scratch buffers, against the CPU class's TensorRef/DispatchDot (not the CUDA class's CpuWeightRef). It only benefits a niche path that can't be GPU-validated and historically showed no measured MTP speedup.
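For readers unfamiliar with the "CSR bucket machinery" mentioned above, here is a minimal sketch of the general idea. It is not the repo's BatchedRoutedExperts code: the class ExpertBuckets, its methods, and the flat expertIds layout are hypothetical names chosen for illustration. The (token, slot) routing pairs are counting-sorted by expert id into an offsets/pairs layout, so each active expert's weights can be walked once for all tokens routed to it.

```csharp
using System;

// Hypothetical helper for illustration only (not the repo's BatchedRoutedExperts).
public static class ExpertBuckets
{
    // expertIds[t * topK + s] is the expert chosen for token t in routing slot s.
    // Returns Offsets (length numExperts + 1) and Pairs (length expertIds.Length):
    // Pairs[Offsets[e] .. Offsets[e + 1]) holds the flat pair indices routed to expert e.
    public static (int[] Offsets, int[] Pairs) Build(int[] expertIds, int numExperts)
    {
        var offsets = new int[numExperts + 1];

        // Histogram: count pairs per expert, shifted by one for the prefix sum.
        foreach (int e in expertIds) offsets[e + 1]++;

        // Exclusive prefix sum turns counts into bucket start offsets.
        for (int e = 0; e < numExperts; e++) offsets[e + 1] += offsets[e];

        // Scatter pair indices into their buckets. Iterating p in ascending
        // order keeps each bucket sorted by pair index (a stable sort).
        var cursor = new int[numExperts];
        Array.Copy(offsets, cursor, numExperts);
        var pairs = new int[expertIds.Length];
        for (int p = 0; p < expertIds.Length; p++)
            pairs[cursor[expertIds[p]]++] = p;

        return (offsets, pairs);
    }

    // Number of experts with at least one routed pair, i.e. how many
    // expert weight passes a grouped walk would need.
    public static int CountActiveExperts(int[] offsets)
    {
        int n = 0;
        for (int e = 0; e + 1 < offsets.Length; e++)
            if (offsets[e + 1] > offsets[e]) n++;
        return n;
    }
}
```

With 3 draft tokens, topK = 2 and expertIds = {2, 0, 2, 1, 0, 2}, a per-token loop makes 6 expert passes, while the grouped walk makes one pass per active expert (3 here). A pair index p decodes back to token p / topK and slot p % topK.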

Work items

  1. Port the group-by-expert routed FFN (mirror BatchedRoutedExperts + the BatchVerifyCpuMoe orchestration) into HybridGdnForwardPass, reused by both BatchVerify's MoE branch and ideally prefill.
  2. Bit-parity test: batched-vs-per-token verify byte-identical (GPU-free — HybridGdnForwardPass is CPU-only, so this can run in CI unlike the CUDA oracle which needs a 4070 Ti + the GGUF on disk).
  3. Bench pure-CPU -g 0 MoE-MTP decode (per-token vs batched).
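As a rough illustration of what the bit-parity check in item 2 could look like, the toy sketch below compares a per-token loop against a grouped-by-expert walk on a deliberately simplified "expert" (a single dot product instead of gate/up/down). Everything in it is an assumption for illustration: MoeParitySketch and its methods are hypothetical and do not reflect the actual HybridGdnForwardPass, MoeFfnCore, or DispatchDot signatures. It also assumes a modern .NET runtime where float arithmetic is evaluated in IEEE single precision.

```csharp
using System;

// Toy model for illustration only; names and shapes are hypothetical.
public static class MoeParitySketch
{
    // Scalar dot product with a fixed left-to-right accumulation order.
    static float Dot(float[] w, float[] x)
    {
        float acc = 0f;
        for (int i = 0; i < x.Length; i++) acc += w[i] * x[i];
        return acc;
    }

    // Reference path: each token visits its experts one slot at a time,
    // so an expert shared by several tokens is read once per token.
    public static float[] PerToken(float[][] w, float[][] x, int[] expertIds, float[] gates, int topK)
    {
        var y = new float[x.Length];
        for (int t = 0; t < x.Length; t++)
        {
            float acc = 0f;
            for (int s = 0; s < topK; s++)
            {
                int p = t * topK + s;
                acc += gates[p] * Dot(w[expertIds[p]], x[t]);
            }
            y[t] = acc;
        }
        return y;
    }

    // Grouped path: visit each expert once and compute every pair routed to it
    // into scratch, then combine per token in the same slot order as PerToken.
    // A linear scan stands in for CSR buckets to keep the sketch short.
    public static float[] Grouped(float[][] w, float[][] x, int[] expertIds, float[] gates, int topK)
    {
        var scratch = new float[expertIds.Length];
        for (int e = 0; e < w.Length; e++)
            for (int p = 0; p < expertIds.Length; p++)
                if (expertIds[p] == e) scratch[p] = Dot(w[e], x[p / topK]);

        var y = new float[x.Length];
        for (int t = 0; t < x.Length; t++)
        {
            float acc = 0f;
            for (int s = 0; s < topK; s++)
            {
                int p = t * topK + s;
                acc += gates[p] * scratch[p];
            }
            y[t] = acc;
        }
        return y;
    }

    // Compares raw bit patterns rather than using a tolerance.
    public static bool BitIdentical(float[] a, float[] b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
            if (BitConverter.SingleToInt32Bits(a[i]) != BitConverter.SingleToInt32Bits(b[i]))
                return false;
        return true;
    }
}
```

The point of the sketch is the design constraint rather than the code itself: byte-identical output requires that the grouped path keep the per-dot accumulation order and the per-token slot combine order unchanged, and only reorder which expert is visited when. A real test would presumably call both BatchVerify variants on the same inputs and compare their outputs the same way.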

Acceptance

  • Batched CPU verify byte-identical to the per-token MoeFfnCore loop (greedy parity preserved).
  • Measurable pure-CPU decode improvement on a routed-MoE MTP model, or a documented finding that the CPU path is bound elsewhere.
