Background
This session (under #25) landed support for Qwen3.6-27B-MTP on both CPU and CUDA hybrid backends, with the following measured baselines on RTX 4070 Ti 12 GB:

| Backend | Decode t/s | Notes |
| --- | --- | --- |
| CPU | 3.1 | All FFN + GDN on CPU |
| CUDA hybrid (no FFN-on-GPU) | 4.0 | Pre-exact-size, all FFN on CPU |
| CUDA hybrid (21/64 FFN-on-GPU, exact-size) | 6.3 | Current best on 12 GB |

These numbers exist only in the design doc (`docs/qwen35moe-plan.md` Phase 11) and won't be tracked across future changes without bench rows.
Scope
```powershell
$results += .\scripts\bench-textgen.ps1 -Model $qwen36_27b_mtp -Tag "qwen36-27b-mtp-cpu" -NTokens $NTokens -Prompt $Prompt -TimeoutSec 600
$results += .\scripts\bench-textgen.ps1 -Model $qwen36_27b_mtp -Tag "qwen36-27b-mtp-cuda-hybrid" -NTokens $NTokens -Prompt $Prompt -TimeoutSec 600 -ExtraArgs @("-g","-1","--backend","cuda")
```
Out of scope (separate issues)
An all-CUDA bench row: the 27B at Q4_K_M is 17 GB and won't fit a 12 GB card in pure CUDA mode at any supported quant. Skip the row or document the OOM.
Verification
`pwsh scripts/bench-all.ps1` runs to completion without errors, prints the new rows, and the cuda-hybrid number is within ±10 % of the 6.3 t/s baseline above (allowing for run-to-run variance and the bench's prompt/length).
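The ±10 % acceptance band can be checked mechanically rather than by eye. The following is a minimal sketch only: `Test-WithinTolerance` is a hypothetical helper that does not exist in the repo, and it takes plain numbers because the shape of the objects returned by `bench-textgen.ps1` is not specified here.

```powershell
# Hypothetical helper (an assumption, not part of the repo): returns $true when
# a measured decode rate lies within a symmetric percentage band of a baseline.
function Test-WithinTolerance {
    param(
        [Parameter(Mandatory)] [double] $Measured,
        [Parameter(Mandatory)] [double] $Baseline,
        [double] $TolerancePct = 10
    )
    # Absolute deviation from the baseline, compared against the allowed margin.
    $delta = [math]::Abs($Measured - $Baseline)
    return $delta -le ($Baseline * $TolerancePct / 100)
}

# 10 % of the 6.3 t/s baseline is 0.63, so roughly 5.67 to 6.93 t/s is accepted.
$baseline = 6.3
Test-WithinTolerance -Measured 6.0 -Baseline $baseline   # inside the band
Test-WithinTolerance -Measured 5.5 -Baseline $baseline   # below the band
```

The measured values 6.0 and 5.5 are illustrative inputs, not benchmark results. A symmetric band is used because the verification criterion is stated as ±10 %; a regression-only gate would compare against the lower bound alone.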