Add bench rows for qwen36-27b-mtp (cpu, cuda-hybrid) and establish perf baseline

## Background

This session (under #25) landed support for Qwen3.6-27B-MTP on both the CPU and CUDA hybrid backends, with the following measured baselines on an RTX 4070 Ti (12 GB):

| Backend | Decode t/s | Notes |
|---|---|---|
| CPU | 3.1 | All FFN + GDN on CPU |
| CUDA hybrid (no FFN-on-GPU) | 4.0 | Pre-exact-size, all FFN on CPU |
| **CUDA hybrid (21/64 FFN-on-GPU, exact-size)** | **6.3** | Current best on 12 GB |

These numbers exist only in the design doc (`docs/qwen35moe-plan.md` Phase 11) and won't be tracked across future changes without bench rows.

## Scope

1. Add `qwen36-27b-mtp` rows to `scripts/bench-all.ps1` mirroring the existing `qwen36` pattern (`bench-all.ps1:34-35`):
   ```powershell
   $results += .\scripts\bench-textgen.ps1 -Model $qwen36_27b_mtp -Tag "qwen36-27b-mtp-cpu"          -NTokens $NTokens -Prompt $Prompt -TimeoutSec 600
   $results += .\scripts\bench-textgen.ps1 -Model $qwen36_27b_mtp -Tag "qwen36-27b-mtp-cuda-hybrid" -NTokens $NTokens -Prompt $Prompt -TimeoutSec 600 -ExtraArgs @("-g","-1","--backend","cuda")
   ```
2. Define `$qwen36_27b_mtp` path at the top of `bench-all.ps1` (likely `models/Qwen3.6-27B-MTP-Q4_K_M.gguf` per the download script entry added under #25).
3. Update README perf table once #25 lands MTP self-speculation so we have an apples-to-apples "+MTP / -MTP" comparison row.

## Out of scope (separate issues)

- An all-CUDA bench row: the 27B is 17 GB at Q4_K_M, and at no supported quant does it fit a 12 GB card in pure CUDA mode. Skip the row or document the OOM.
- MTP perf rows — file under #25 when the MTP forward path lands.

## Verification

- `pwsh scripts/bench-all.ps1` runs to completion without errors and prints the new rows, and the cuda-hybrid number is within ±10 % of the 6.3 t/s baseline above (allowing for run-to-run variance and for differences in the bench's prompt and token count).
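
The ±10 % criterion could be checked mechanically rather than by eye. A minimal sketch, assuming the measured decode rate is available as a plain number: `Test-WithinTolerance` is a hypothetical helper, not something that exists in `bench-all.ps1`, and how the rate is pulled out of `$results` is left open because the result object's shape isn't described here.

```powershell
# Hypothetical helper (not part of bench-all.ps1): returns $true when a measured
# decode rate lies within a relative tolerance of a baseline.
function Test-WithinTolerance {
    param(
        [Parameter(Mandatory)] [double]$Measured,  # measured decode t/s
        [double]$Baseline = 6.3,                   # t/s, from the Background table
        [double]$TolerancePct = 10                 # the +/-10 % from the criterion above
    )
    # Absolute deviation from the baseline, compared against the allowed band.
    $delta = [math]::Abs($Measured - $Baseline)
    return $delta -le ($Baseline * $TolerancePct / 100)
}

# With the defaults the accepted band is roughly 5.67 to 6.93 t/s.
Test-WithinTolerance -Measured 5.9   # inside the band
Test-WithinTolerance -Measured 5.5   # outside the band
```

A relative band is used because the criterion is stated as a percentage; a different card or prompt length would need its own baseline rather than a wider tolerance.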
