perf(moe): grouped-by-expert MoE verify batching for the pure-CPU GDN path (HybridGdnForwardPass) #271

Follow-up to #210 (shipped in #270 / `c835073`), which added grouped-by-expert MoE verify batching to the **CUDA-hybrid** pass (`CudaHybridGdnForwardPass.BatchVerifyCpuMoe`). The **pure-CPU** sibling was intentionally left on its per-token loop.

## Scope

`HybridGdnForwardPass.BatchVerify` (`src/SharpInference.Engine/HybridGdnForwardPass.cs`) still loops `MoeFfnCore` once per draft token, so on a routed-MoE MTP model with k draft tokens the routed experts re-read their mmap'd weights up to k times. This is the exact problem #210 fixed for the GPU path. The pass runs only for **pure-CPU** execution of a GDN-hybrid model (`RunCommand.cs:391`: `hp.IsHybridSsm && effNGpuLayers == 0`, i.e. `-g 0` / no GPU), e.g. serving Qwen3.6-35B-A3B-MTP on a GPU-less box.
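
To make the cost concrete, here is a small counting model in Python (not the engine's C#). It assumes a hypothetical `routes[t]` list holding the distinct expert ids picked for draft token `t`, and it counts one full weight pass per (token, expert) pair for the per-token loop versus one pass per distinct expert for a grouped pass. The function name and data layout are illustrative only.

```python
from typing import List, Tuple


def expert_weight_reads(routes: List[List[int]]) -> Tuple[int, int]:
    """Count expert weight passes for one verify step.

    routes[t] holds the distinct expert ids selected for draft token t
    (hypothetical layout, not the engine's actual data structure).
    Returns (per_token_loop_reads, grouped_reads).
    """
    # Per-token loop: every token walks the weights of each expert it routed to.
    per_token = sum(len(r) for r in routes)
    # Grouped by expert: each distinct expert's weights are walked once.
    grouped = len({e for r in routes for e in r})
    return per_token, grouped


if __name__ == "__main__":
    # Three draft tokens, top-2 routing, heavy overlap on experts 3 and 7.
    routes = [[3, 7], [3, 5], [7, 3]]
    print(expert_weight_reads(routes))
```

The gap between the two counts grows with the overlap in routing across the draft tokens; with no overlap the two counts are equal and grouping saves nothing.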

## Why it wasn't done in #210

Unlike the CUDA pass, `HybridGdnForwardPass` has **no** `BatchedRoutedExperts` core to reuse, so this is a real port (~200 lines): the CSR bucket machinery, the Phase A/B/C gate/up/down dots, and the scratch buffers, all written against the CPU class's `TensorRef`/`DispatchDot` rather than the CUDA class's `CpuWeightRef`. It also only benefits a niche path that can't be GPU-validated and that historically showed no measured MTP speedup.
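
For readers unfamiliar with the bucket step, the following is a minimal Python sketch of a CSR-style grouping, assuming the buckets are built by a counting sort over (token, slot) pairs keyed by expert id. `build_csr`, `routes`, and the `(token, slot)` item layout are hypothetical names for illustration; the actual `BatchedRoutedExperts` layout may differ.

```python
from typing import List, Tuple


def build_csr(routes: List[List[int]], n_experts: int) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Group (token, slot) pairs by expert id in CSR form.

    Expert e owns items[offsets[e]:offsets[e + 1]].
    """
    # Pass 1: count how many pairs land on each expert.
    offsets = [0] * (n_experts + 1)
    for r in routes:
        for e in r:
            offsets[e + 1] += 1
    # Prefix sum turns counts into bucket start offsets.
    for e in range(n_experts):
        offsets[e + 1] += offsets[e]
    # Pass 2: scatter each pair into its expert's bucket.
    cursor = offsets[:-1]  # slicing copies, so offsets stays intact
    items: List[Tuple[int, int]] = [(-1, -1)] * offsets[-1]
    for t, r in enumerate(routes):
        for s, e in enumerate(r):
            items[cursor[e]] = (t, s)
            cursor[e] += 1
    return offsets, items


if __name__ == "__main__":
    offsets, items = build_csr([[3, 1], [1, 2], [3, 1]], n_experts=4)
    for e in range(4):
        print(e, items[offsets[e]:offsets[e + 1]])
```

With the buckets in hand, each phase (gate, up, down) can iterate experts in the outer loop and that expert's tokens in the inner loop, so an expert's weight rows are streamed once per phase rather than once per token.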

## Work items

1. Port the group-by-expert routed FFN (mirror `BatchedRoutedExperts` + the `BatchVerifyCpuMoe` orchestration) into `HybridGdnForwardPass`, for use by `BatchVerify`'s MoE branch and, ideally, prefill.
2. Bit-parity test: batched verify must be byte-identical to per-token verify. The test is GPU-free: `HybridGdnForwardPass` is CPU-only, so it can run in CI, unlike the CUDA oracle, which needs a 4070 Ti + the GGUF on disk.
3. Bench pure-CPU `-g 0` MoE-MTP decode (per-token vs batched).
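
The parity check in item 2 could look roughly like the Python sketch below. It is a toy model, not the engine: the SwiGLU-style activation, the dict-of-lists weights, and all function names are assumptions. It shows one way to keep the grouped path byte-identical: write each (token, slot) contribution to scratch, then reduce per token in the original slot order so floating-point accumulation order matches the per-token loop.

```python
import math
import random
import struct


def dot(row, x):
    acc = 0.0
    for a, b in zip(row, x):
        acc += a * b
    return acc


def expert_ffn(w, x):
    """One expert: down(silu(gate(x)) * up(x)). The activation is an assumption."""
    gate = [dot(r, x) for r in w["gate"]]
    up = [dot(r, x) for r in w["up"]]
    h = [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]
    return [dot(r, h) for r in w["down"]]


def verify_per_token(experts, xs, routes, gates):
    """Reference path: loop tokens, then that token's experts in slot order."""
    out = []
    for x, r, g in zip(xs, routes, gates):
        y = [0.0] * len(x)
        for e, wt in zip(r, g):
            c = expert_ffn(experts[e], x)
            y = [yi + wt * ci for yi, ci in zip(y, c)]
        out.append(y)
    return out


def verify_grouped(experts, xs, routes, gates):
    """Grouped path: visit each expert once, then reduce in slot order."""
    buckets = {}
    for t, r in enumerate(routes):
        for s, e in enumerate(r):
            buckets.setdefault(e, []).append((t, s))
    scratch = [[None] * len(r) for r in routes]
    for e in sorted(buckets):  # one visit per distinct expert
        for t, s in buckets[e]:
            scratch[t][s] = expert_ffn(experts[e], xs[t])
    out = []
    for t, r in enumerate(routes):
        y = [0.0] * len(xs[t])
        for s in range(len(r)):  # same accumulation order as the reference
            y = [yi + gates[t][s] * ci for yi, ci in zip(y, scratch[t][s])]
        out.append(y)
    return out


def as_bytes(rows):
    """Serialize as little-endian doubles so the comparison is bit-exact."""
    return b"".join(struct.pack("<%dd" % len(r), *r) for r in rows)


def make_case(seed=0, n_experts=4, d_model=4, d_ff=6, n_tokens=3, top_k=2):
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

    experts = [{"gate": mat(d_ff, d_model), "up": mat(d_ff, d_model),
                "down": mat(d_model, d_ff)} for _ in range(n_experts)]
    xs = mat(n_tokens, d_model)
    routes = [rng.sample(range(n_experts), top_k) for _ in range(n_tokens)]
    gates = [[rng.uniform(0, 1) for _ in range(top_k)] for _ in range(n_tokens)]
    return experts, xs, routes, gates


if __name__ == "__main__":
    case = make_case()
    assert as_bytes(verify_per_token(*case)) == as_bytes(verify_grouped(*case))
```

If the real port instead accumulates into the output while walking experts in id order, the sums can differ in the last bits from the per-token loop, which is exactly what a byte-level comparison like `as_bytes` would catch.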

## Acceptance

- [ ] Batched CPU verify byte-identical to the per-token `MoeFfnCore` loop (greedy parity preserved).
- [ ] Measurable pure-CPU decode improvement on a routed-MoE MTP model, or a documented finding that the CPU path is bound elsewhere.
