Fix garbled chat output; guard MoE+hybrid; copy-paste-safe README #5

Merged: pekkah merged 3 commits into master from fix/hybrid-moe-guards-and-jinja-macros on Apr 28, 2026

pekkah (Owner) commented Apr 28, 2026

Summary

  • Jinja chat template: handle {% macro %} / {% block %} / {% raw %}. Without this, any template that opens with a macro definition (the Unsloth Qwen3-Coder template) collapsed to "\n\n", the model received no prompt, and produced garbled output.
  • HybridForwardPass barriers: switch inter-layer copies from vkCmdCopyBuffer (transfer stage) to RecordComputeCopy (compute stage), and add the missing RecordBarrier() at the end of GpuLayer. With these, SmolLM2 hybrid produces coherent output at every -g N.
  • HybridForwardPass embedding: force CPU embedding+output as a workaround for a separate shader bug (issue #3, "GPU embedding lookup writes garbage when invoked from HybridForwardPass").
  • CLI guards: refuse explicit -g N on MoE models with an error linking to issue #2 ("Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn)"); -g -1 falls back to CPU with a warning.
  • README: every command is single-line so it survives copy-paste, plus a "What works today" matrix per locally-tested model.
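
The macro handling in the first bullet can be illustrated with a minimal Python sketch. It is not the project's JinjaChatTemplate code: `strip_macros` and the `TAG` pattern are hypothetical names, and the sketch only shows the idea of consuming a macro body up to its matching end tag so that the rest of the template survives. It ignores `{% raw %}` and `{% block %}`.

```python
import re

# Matches one {% ... %} tag and captures its first word (hypothetical helper).
TAG = re.compile(r"\{%-?\s*(\w+).*?-?%\}", re.S)

def strip_macros(template: str) -> str:
    """Return the template with every {% macro %}...{% endmacro %} span removed."""
    out = []
    pos = 0
    depth = 0  # greater than 0 while inside a macro body
    for m in TAG.finditer(template):
        name = m.group(1)
        if depth == 0:
            out.append(template[pos:m.start()])  # keep text before this tag
        if name == "macro":
            depth += 1                           # start discarding
        elif name == "endmacro" and depth > 0:
            depth -= 1                           # matching end tag: resume
        elif depth == 0:
            out.append(m.group(0))               # ordinary tag outside a macro
        pos = m.end()
    if depth == 0:
        out.append(template[pos:])               # keep the tail of the template
    return "".join(out)

tpl = (
    "{% macro render(x) %}BODY {% if x %}y{% endif %}{% endmacro %}\n"
    "{% for m in messages %}{{ m }}{% endfor %}"
)
print(repr(strip_macros(tpl)))
```

The point of the sketch is that the `{% for %}` loop after the macro is still present in the result, whereas a parser that bails out at `{% endmacro %}` would drop it.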

Verified

| Model | Path | Decode |
|---|---|---|
| SmolLM2 1.7B | CPU | ~51 t/s |
| SmolLM2 1.7B | -g -1 (all GPU) | ~159 t/s |
| SmolLM2 1.7B | -g 8 (hybrid) | ~61 t/s |
| Qwen3-Coder 30B-A3B | CPU + --tq | ~21 t/s |
| Qwen3-Coder 30B-A3B | -g 1 | refused with helpful error |
| Qwen3-Coder 30B-A3B | -g -1 | warns, runs on CPU at ~21 t/s |
| SmolLM2 + --tq | any | refused (head dim 64 != 128/256) |

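Tracking issues opened

  • #2 Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn)
  • #3 GPU embedding lookup writes garbage when invoked from HybridForwardPass
  • #4 CLI: --backend cuda silently ignored; clarify backend selection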

Test plan

  • dotnet test tests/SharpInference.Tests.Core — 26/26 pass
  • SmolLM2 CPU (def add_two_numbers(a, b): return a + b)
  • SmolLM2 -g -1 and -g 4/8/16/23
  • Qwen3-Coder CPU + --tq
  • Qwen3-Coder -g 1 rejection message
  • Qwen3-Coder -g -1 auto-fallback to CPU
  • Image generation pipelines not re-tested (separate path; unchanged)
  • API server not re-tested (unchanged)

🤖 Generated with Claude Code

pekkah and others added 3 commits April 28, 2026 15:21
- JinjaChatTemplate: handle {% macro %} / {% block %} / {% raw %}.
  The parser was treating macro as an unknown opening tag and bailing
  out of the top-level node list at {% endmacro %}, dropping the rest
  of any template that started with a macro definition (the Unsloth
  Qwen3-Coder template). Result: the rendered prompt was just "\n\n"
  and the model produced garbage. Now consumes and discards the macro
  body up to the matching end tag; macro calls fall back to "" via
  CallFunction's null-result path (only matters inside the
  tools-rendering branch, which we never enter when tools=null).

- HybridForwardPass: replace transfer-stage vkCmdCopyBuffer with
  RecordComputeCopy. Every existing barrier was a Compute->Compute
  memory barrier, so transfer-stage writes were never synced with
  the surrounding compute work. Keeping every copy on the compute
  stage makes the existing barriers correct.

- HybridForwardPass: add the missing barrier between the residual
  AddInPlace at the end of GpuLayer and the next layer's residual
  copy. Without it, downstream layers read stale _gpuHidden and
  output quality degrades non-linearly with the number of GPU layers
  (worked at -g 1, broke at -g 4+).

- HybridForwardPass: force CPU embedding+output (single row of dequant
  per token; negligible cost). Works around an unresolved bug where
  the EmbedLookupQ4K shader writes garbage when invoked from this
  forward pass even though it works correctly from GpuForwardPass.
  Tracked in issue #3.

- RunCommand: refuse explicit -g N (N>0) on MoE models with an error
  pointing to issue #2. -g -1 (auto) emits a warning and falls back
  to CPU. Non-MoE hybrid path is unchanged. Avoids users seeing
  silent NaN / "11111..." output and not knowing why.

- README: rewrote every multi-line `\`-continued bash block as
  single-line commands so they survive copy-paste. Added a "What
  works today" matrix per locally-present model with verified
  decode rates (SmolLM2 51 / 159 / 61 t/s on CPU / all-GPU / hybrid;
  Qwen3-Coder 21 t/s on CPU with --tq). Documented the broken
  Llama-70B local file (partial download), the missing
  RealESRGAN upscaler, and the three known-issue links.
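
The two barrier fixes above can be reasoned about with a toy hazard checker. This is a deliberately simplified Python model, not Vulkan and not the project's code: `Cmd` and `find_hazards` are hypothetical, and real barriers also involve access masks and stage ordering that are left out here. The assumption is only that a read is safe when some barrier between the write and the read names the writer's stage as source and the reader's stage as destination.

```python
from dataclasses import dataclass

@dataclass
class Cmd:
    kind: str        # "write", "read", or "barrier"
    stage: str = ""  # stage of a write/read: "compute" or "transfer"
    buf: str = ""    # buffer touched by a write/read
    src: str = ""    # barrier only: stage whose writes it makes visible
    dst: str = ""    # barrier only: stage whose reads wait on it

def find_hazards(cmds):
    """Return (write_index, read_index) pairs with no covering barrier between them."""
    hazards = []
    for r, read in enumerate(cmds):
        if read.kind != "read":
            continue
        # Walk back to the most recent write to the same buffer.
        for w in range(r - 1, -1, -1):
            write = cmds[w]
            if write.kind == "write" and write.buf == read.buf:
                covered = any(
                    b.kind == "barrier" and b.src == write.stage and b.dst == read.stage
                    for b in cmds[w + 1:r]
                )
                if not covered:
                    hazards.append((w, r))
                break
    return hazards

compute_barrier = Cmd("barrier", src="compute", dst="compute")

# Transfer-stage copy followed by a Compute->Compute barrier: not covered.
transfer_copy = [Cmd("write", "transfer", "hidden"), compute_barrier,
                 Cmd("read", "compute", "hidden")]
# Same copy issued on the compute stage: the existing barrier now covers it.
compute_copy = [Cmd("write", "compute", "hidden"), compute_barrier,
                Cmd("read", "compute", "hidden")]
# Residual add with no barrier before the next layer reads the buffer.
missing_barrier = [Cmd("write", "compute", "hidden"),
                   Cmd("read", "compute", "hidden")]

print(find_hazards(transfer_copy), find_hazards(compute_copy), find_hazards(missing_barrier))
```

In this model the first and third sequences report a hazard and the second does not, which mirrors the two changes: move the copies onto the compute stage, and add the barrier that was absent.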

Issues opened during this debugging session:
  - #2 MoE on hybrid GPU+CPU produces NaN / degenerate output
  - #3 GPU embedding lookup writes garbage in HybridForwardPass
  - #4 CLI: --backend cuda silently ignored; clarify backend selection
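
The RunCommand guard tied to issue #2 could look roughly like the Python sketch below. The function name, return shape, and message wording are assumptions made for illustration; only the behavior (refuse explicit N>0 on MoE, warn and fall back to CPU on -1, leave non-MoE untouched) comes from the commit message.

```python
def resolve_gpu_layers(requested: int, is_moe: bool):
    """Return (gpu_layers, warning). Hypothetical helper: 0 means CPU-only."""
    if not is_moe:
        return requested, None  # non-MoE hybrid path is unchanged
    if requested > 0:
        # Explicit -g N on an MoE model: refuse instead of producing NaN output.
        raise ValueError("hybrid GPU+CPU is not supported for MoE models; see issue #2")
    if requested == -1:
        # Auto mode on an MoE model: warn and run on CPU.
        return 0, "MoE model detected: falling back to CPU (see issue #2)"
    return requested, None

print(resolve_gpu_layers(8, is_moe=False))
print(resolve_gpu_layers(-1, is_moe=True))
```

Raising on the explicit request and only warning on the automatic one keeps a user's stated intent from being silently overridden.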

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s + qwen3 default system

JinjaChatTemplate previously emitted stray newlines from the Unsloth
Qwen3-Coder template:

  \n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n\n
  ^ leading                                                ^^ trailing

The leading \n came from the line after `{# Copyright #}`; the
trailing extra came from the blank line between `{% endmacro %}`
and `{%- if messages[0]... %}`. HuggingFace transformers' chat-
template renderer enables trim_blocks=True and lstrip_blocks=True by
default, which strip those. Implement the same: ApplyStripping now
handles implicit trim/lstrip for {% %} and {# #} in addition to
the explicit {%- / -%} controls.
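
As a rough illustration of the implicit stripping described above, here is a regex-based Python sketch. It is not the ApplyStripping implementation: `apply_implicit_stripping` and `literal_text` are hypothetical helpers, and the sketch ignores `{%+`, raw blocks, and tags that span lines.

```python
import re

def apply_implicit_stripping(src: str) -> str:
    # lstrip_blocks: drop spaces/tabs between the start of a line and a {% or {# tag.
    src = re.sub(r"(?m)^[ \t]+(?=\{[%#])", "", src)
    # trim_blocks: drop the first newline directly after a %} or #} tag end.
    src = re.sub(r"([%#]\})\n", r"\1", src)
    return src

def literal_text(src: str) -> str:
    # Remove block and comment tags to show which literal text would be emitted.
    return re.sub(r"\{[%#].*?[%#]\}", "", src)

src = "{# c #}\n  {% if x %}\nhi\n  {% endif %}\n"
print(repr(literal_text(src)))                            # stray newlines and indents
print(repr(literal_text(apply_implicit_stripping(src))))  # only the content line
```

Without stripping, the newline after the comment and the indentation before each block tag leak into the rendered prompt; with it, only `hi` and its own newline remain.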

Also: when a Jinja template is used (the normal path for any modern
GGUF), inject the default "You are a helpful assistant." system
message for qwen3/qwen3moe to match the hardcoded fallback path.
Without it, Qwen3 models tend to emit <|endoftext|> immediately at
temp 0 for short / under-specified prompts.

Verified:
  -p "Write a one-line Python function..." → coherent code (was
   already coherent before the fix; not a regression).
  -p "Write a short WGSL shader..." with temp 0:
     * before this commit: rendered prompt had stray \n's; model
       emitted ``` then <|im_end|>.
     * after: prompt is clean, system message present; model
       generates plausible (if hallucinated) WGSL with a more
       directive variant of the prompt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3-Coder appears to be tuned to operate without a system message.
With "You are a helpful assistant." injected, the model emits
<|endoftext|> with high confidence at greedy temp 0 for short prompts
(logit ~29 vs ~14 with no system message). Narrow the default to
qwen3 (dense) only to match the model's expected input distribution.
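
The narrowed default can be sketched in Python as follows. `with_default_system` is a hypothetical helper, and leaving an existing system message alone is an assumption; the architecture strings `qwen3` and `qwen3moe` and the default text are taken from the commit messages.

```python
DEFAULT_SYSTEM = "You are a helpful assistant."

def with_default_system(messages, arch):
    """Prepend the default system message for dense qwen3 only."""
    if arch != "qwen3":
        return messages  # qwen3moe (e.g. Qwen3-Coder) gets no injected system message
    if messages and messages[0]["role"] == "system":
        return messages  # assumption: never override a user-supplied system message
    return [{"role": "system", "content": DEFAULT_SYSTEM}] + messages

chat = [{"role": "user", "content": "Write a one-line Python function."}]
print(with_default_system(chat, "qwen3")[0]["role"])
print(with_default_system(chat, "qwen3moe")[0]["role"])
```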

Note: this does not fully fix all Qwen3-Coder degenerate-output
cases — the model still emits <|endoftext|> with high confidence
for some prompts (e.g. "Write a short wgsl shader for pbr
rendering. Output only the code."). That requires llama.cpp
reference comparison to determine whether it's expected model
behavior or a forward-pass divergence; tracking separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>