Fix garbled chat output; guard MoE+hybrid; copy-paste-safe README by pekkah · Pull Request #5 · pekkah/SharpInference

pekkah · 2026-04-28T12:21:23Z

Summary

Jinja chat template: handle {% macro %} / {% block %} / {% raw %}. Without this, any template that opens with a macro definition (the Unsloth Qwen3-Coder template) collapsed to "\n\n", the model received no prompt, and produced garbled output.
HybridForwardPass barriers: switch inter-layer copies from vkCmdCopyBuffer (transfer stage) to RecordComputeCopy (compute stage), and add the missing RecordBarrier() at the end of GpuLayer. With these, SmolLM2 hybrid produces coherent output at every tested -g N (4/8/16/23).
HybridForwardPass embedding: force CPU embedding+output as a workaround for a separate shader bug (issue #3, "GPU embedding lookup writes garbage when invoked from HybridForwardPass").
CLI guards: refuse explicit -g N on MoE models with an error linking to issue #2 ("Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn)"); -g -1 falls back to CPU with a warning.
README: every command is single-line so it survives copy-paste, plus a "What works today" matrix per locally-tested model.
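
The CLI guard rule in the summary can be sketched as a small decision function. This is a Python illustration only, not the project's C# RunCommand code: `plan_gpu_layers`, `GpuPlan`, `MoeHybridUnsupported`, the message texts, and the convention that 0 GPU layers means CPU-only are all hypothetical names and assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

ALL_GPU = -1  # stands in for "-g -1" (auto / all layers on GPU)


class MoeHybridUnsupported(Exception):
    """Raised when an explicit -g N is requested for an MoE model."""


@dataclass
class GpuPlan:
    gpu_layers: int                # assumption: 0 means "run on CPU only"
    warning: Optional[str] = None  # text to show the user, if any


def plan_gpu_layers(requested: int, is_moe: bool) -> GpuPlan:
    # Non-MoE models: the hybrid path is unchanged, so pass the request through.
    if not is_moe:
        return GpuPlan(requested)
    # Explicit -g N (N > 0) on MoE: refuse instead of producing NaN output.
    if requested > 0:
        raise MoeHybridUnsupported(
            "hybrid GPU+CPU is not supported for MoE models; see issue #2"
        )
    # -g -1 on MoE: warn and fall back to CPU.
    if requested == ALL_GPU:
        return GpuPlan(0, "MoE model detected; falling back to CPU (issue #2)")
    return GpuPlan(requested)
```

Raising on the explicit request while only warning on the automatic one keeps a user's deliberate `-g N` from being silently overridden, and still lets `-g -1` produce usable output.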

Verified

| Model | Path | Decode |
| --- | --- | --- |
| SmolLM2 1.7B | CPU | ~51 t/s |
| SmolLM2 1.7B | `-g -1` (all GPU) | ~159 t/s |
| SmolLM2 1.7B | `-g 8` (hybrid) | ~61 t/s |
| Qwen3-Coder 30B-A3B | CPU + `--tq` | ~21 t/s |
| Qwen3-Coder 30B-A3B | `-g 1` | refused with helpful error |
| Qwen3-Coder 30B-A3B | `-g -1` | warns, runs on CPU at ~21 t/s |
| SmolLM2 + `--tq` | any | refused (head dim 64 != 128/256) |

Tracking issues opened

#2 Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn): MoE on hybrid GPU+CPU produces NaN / degenerate output
#3 GPU embedding lookup writes garbage when invoked from HybridForwardPass: GPU embedding lookup writes garbage in HybridForwardPass (worked around here)
#4 CLI: clarify backend selection (no --backend flag; CUDA not wired to LLM path): --backend cuda silently ignored; clarify backend selection

Test plan

dotnet test tests/SharpInference.Tests.Core — 26/26 pass
SmolLM2 CPU (def add_two_numbers(a, b): return a + b)
SmolLM2 -g -1 and -g 4/8/16/23
Qwen3-Coder CPU + --tq
Qwen3-Coder -g 1 rejection message
Qwen3-Coder -g -1 auto-fallback to CPU
Image generation pipelines not re-tested (separate path; unchanged)
API server not re-tested (unchanged)

🤖 Generated with Claude Code

- JinjaChatTemplate: handle {% macro %} / {% block %} / {% raw %}. The parser was treating macro as an unknown opening tag and bailing out of the top-level node list at {% endmacro %}, dropping the rest of any template that started with a macro definition (the Unsloth Qwen3-Coder template). Result: the rendered prompt was just "\n\n" and the model produced garbage. Now consumes and discards the macro body up to the matching end tag; macro calls fall back to "" via CallFunction's null-result path (only matters inside the tools-rendering branch, which we never enter when tools=null).
- HybridForwardPass: replace transfer-stage vkCmdCopyBuffer with RecordComputeCopy. Every existing barrier was a Compute->Compute memory barrier, so transfer-stage writes were never synced with the surrounding compute work. Keeping every copy on the compute stage makes the existing barriers correct.
- HybridForwardPass: add the missing barrier between the residual AddInPlace at the end of GpuLayer and the next layer's residual copy. Without it, downstream layers read stale _gpuHidden and output quality degrades non-linearly with the number of GPU layers (worked at -g 1, broke at -g 4+).
- HybridForwardPass: force CPU embedding+output (single row of dequant per token; negligible cost). Worked around an unresolved bug where the EmbedLookupQ4K shader writes garbage when invoked from this forward pass even though it works correctly from GpuForwardPass. Tracking issue #3.
- RunCommand: refuse explicit -g N (N>0) on MoE models with an error pointing to issue #2. -g -1 (auto) emits a warning and falls back to CPU. Non-MoE hybrid path is unchanged. Avoids users seeing silent NaN / "11111..." output and not knowing why.
- README: rewrote every multi-line `\`-continued bash block as single-line commands so they survive copy-paste. Added a "What works today" matrix per locally-present model with verified decode rates (SmolLM2 51 / 159 / 61 t/s on CPU / all-GPU / hybrid; Qwen3-Coder 21 t/s on CPU with --tq). Documented the broken Llama-70B local file (partial download), the missing RealESRGAN upscaler, and the three known-issue links.

Issues opened during this debugging session:
- #2 MoE on hybrid GPU+CPU produces NaN / degenerate output
- #3 GPU embedding lookup writes garbage in HybridForwardPass
- #4 CLI: --backend cuda silently ignored; clarify backend selection

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
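
The "consume and discard the macro body up to the matching end tag" behaviour described in this commit message can be illustrated with a short Python sketch. It is not the JinjaChatTemplate parser from this PR: `drop_macros` and the `TAG` pattern are hypothetical, and the sketch works on the raw template text with a regex rather than on a parsed node list. It also ignores `{% raw %}` sections, which a real parser would have to respect.

```python
import re

# Matches any Jinja block tag and captures its first word (macro, endmacro, if, ...).
TAG = re.compile(r"\{%-?\s*(\w+).*?-?%\}", re.DOTALL)


def drop_macros(template: str) -> str:
    """Return the template text with every macro definition removed."""
    kept = []   # pieces of template text outside macro definitions
    pos = 0     # start of the text not yet copied to `kept`
    depth = 0   # > 0 while scanning inside a macro body
    for match in TAG.finditer(template):
        name = match.group(1)
        if name == "macro":
            if depth == 0:
                # Keep everything before the macro, then start skipping.
                kept.append(template[pos:match.start()])
            depth += 1
        elif name == "endmacro" and depth > 0:
            depth -= 1
            if depth == 0:
                # Resume copying right after the matching end tag.
                pos = match.end()
    if depth == 0:
        kept.append(template[pos:])
    return "".join(kept)


template = "{% macro f(x) %}{{ x }}{% endmacro %}\n{% if a %}hi{% endif %}"
print(repr(drop_macros(template)))  # '\n{% if a %}hi{% endif %}'
```

The point of the sketch is that everything after `{% endmacro %}` survives, which is the part the parser was dropping before this fix.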

…s + qwen3 default system

JinjaChatTemplate previously emitted stray newlines from the Unsloth Qwen3-Coder template:

    \n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n\n
    ^ leading                                                 ^^ trailing

The leading \n came from the line after `{# Copyright #}`; the trailing extra came from the blank line between `{% endmacro %}` and `{%- if messages[0]... %}`. HuggingFace transformers' chat-template renderer enables trim_blocks=True and lstrip_blocks=True by default, which strip those. Implement the same: ApplyStripping now handles implicit trim/lstrip for {% %} and {# #} in addition to the explicit {%- / -%} controls.

Also: when a Jinja template is used (the normal path for any modern GGUF), inject the default "You are a helpful assistant." system message for qwen3/qwen3moe to match the hardcoded fallback path. Without it, Qwen3 models tend to emit <|endoftext|> immediately at temp 0 for short / under-specified prompts.

Verified:
- -p "Write a one-line Python function..." → coherent code (was already coherent before the fix; not a regression).
- -p "Write a short WGSL shader..." with temp 0:
  * before this commit: rendered prompt had stray \n's; model emitted ``` then <|im_end|>.
  * after: prompt is clean, system message present; model generates plausible (if hallucinated) WGSL with a more directive variant of the prompt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
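
A rough Python sketch of the implicit stripping rules named in this commit message, for readers unfamiliar with Jinja's `trim_blocks` and `lstrip_blocks` options. It is an approximation written for this page, not the ApplyStripping implementation: `apply_implicit_stripping` is a hypothetical name, and the sketch ignores the explicit `{%-`, `-%}` and `+` modifiers as well as `{% raw %}` sections.

```python
import re


def apply_implicit_stripping(template: str) -> str:
    # lstrip_blocks: drop spaces and tabs between the start of a line and a
    # block tag "{%" or comment tag "{#". Variable tags "{{" are left alone.
    template = re.sub(r"(?m)^[ \t]+(?=\{[%#])", "", template)
    # trim_blocks: drop the first newline directly after a block or comment tag.
    template = re.sub(r"([%#]\})\n", r"\1", template)
    return template


source = "{# note #}\n  {% if x %}\nhello\n  {% endif %}\n"
print(repr(apply_implicit_stripping(source)))
# '{# note #}{% if x %}hello\n{% endif %}'
```

Only the newline that directly follows a tag is removed, so the newline after `hello` stays in the rendered prompt.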

Qwen3-Coder appears to be tuned to operate without a system message. With "You are a helpful assistant." injected, the model emits <|endoftext|> with high confidence at greedy temp 0 for short prompts (logit ~29 vs ~14 with no system message). Narrow the default to qwen3 (dense) only to match the model's expected input distribution.

Note: this does not fully fix all Qwen3-Coder degenerate-output cases. The model still emits <|endoftext|> with high confidence for some prompts (e.g. "Write a short wgsl shader for pbr rendering. Output only the code."). That requires llama.cpp reference comparison to determine whether it's expected model behavior or a forward-pass divergence; tracking separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pekkah and others added 3 commits April 28, 2026 15:21

pekkah merged commit 0bd07d4 into master Apr 28, 2026
1 check passed

pekkah deleted the fix/hybrid-moe-guards-and-jinja-macros branch April 28, 2026 19:10

pekkah mentioned this pull request Apr 29, 2026

Fix compute->host visibility in HybridForwardPass MoE path (#2) #12

Merged

5 tasks

pekkah merged 3 commits into master from fix/hybrid-moe-guards-and-jinja-macros
