Fix garbled chat output; guard MoE+hybrid; copy-paste-safe README #5
Merged
- JinjaChatTemplate: handle {% macro %} / {% block %} / {% raw %}.
The parser was treating macro as an unknown opening tag and bailing
out of the top-level node list at {% endmacro %}, dropping the rest
of any template that started with a macro definition (the Unsloth
Qwen3-Coder template). Result: the rendered prompt was just "\n\n"
and the model produced garbage. Now consumes and discards the macro
body up to the matching end tag; macro calls fall back to "" via
CallFunction's null-result path (only matters inside the
tools-rendering branch, which we never enter when tools=null).
- HybridForwardPass: replace transfer-stage vkCmdCopyBuffer with
RecordComputeCopy. Every existing barrier was a Compute->Compute
memory barrier, so transfer-stage writes were never synced with
the surrounding compute work. Keeping every copy on the compute
stage makes the existing barriers correct.
- HybridForwardPass: add the missing barrier between the residual
AddInPlace at the end of GpuLayer and the next layer's residual
copy. Without it, downstream layers read stale _gpuHidden and
output quality degrades non-linearly with the number of GPU layers
(worked at -g 1, broke at -g 4+).
- HybridForwardPass: force CPU embedding+output (single row of dequant
per token; negligible cost). Worked around an unresolved bug where
the EmbedLookupQ4K shader writes garbage when invoked from this
forward pass even though it works correctly from GpuForwardPass.
Tracking issue #3.
- RunCommand: refuse explicit -g N (N>0) on MoE models with an error
pointing to issue #2. -g -1 (auto) emits a warning and falls back
to CPU. Non-MoE hybrid path is unchanged. Avoids users seeing
silent NaN / "11111..." output and not knowing why.
- README: rewrote every multi-line `\`-continued bash block as
single-line commands so they survive copy-paste. Added a "What
works today" matrix per locally-present model with verified
decode rates (SmolLM2 51 / 159 / 61 t/s on CPU / all-GPU / hybrid;
Qwen3-Coder 21 t/s on CPU with --tq). Documented the broken
Llama-70B local file (partial download), the missing
RealESRGAN upscaler, and the three known-issue links.
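
The macro handling described in the JinjaChatTemplate bullet above can be pictured with a short Python sketch. This is not the project's parser: `drop_macros` and the `TAG` pattern are hypothetical names, and the tag matching is deliberately naive (it assumes no `{% raw %}` sections and no `%}` inside string literals). It only illustrates "consume and discard the macro body up to the matching end tag" while keeping everything after it.

```python
import re

# Hypothetical, simplified tag matcher: captures the first word of a {% ... %} tag.
TAG = re.compile(r"\{%-?\s*(\w+).*?-?%\}", re.S)

def drop_macros(template: str) -> str:
    """Return the template with every {% macro %}...{% endmacro %} span removed."""
    kept = []      # pieces of the template that survive
    pos = 0        # start of the next piece to keep
    depth = 0      # > 0 while we are inside a macro body
    for m in TAG.finditer(template):
        name = m.group(1)
        if name == "macro":
            if depth == 0:
                # Keep the text that precedes the macro definition.
                kept.append(template[pos:m.start()])
            depth += 1
        elif name == "endmacro" and depth > 0:
            depth -= 1
            if depth == 0:
                # Resume keeping text right after the matching end tag.
                pos = m.end()
    if depth == 0:
        kept.append(template[pos:])
    # An unterminated macro leaves depth > 0; its body is simply dropped.
    return "".join(kept)
```

With this approach a template that opens with a macro definition still yields the rest of its body, instead of ending at `{% endmacro %}`.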
Issues opened during this debugging session:
- #2 MoE on hybrid GPU+CPU produces NaN / degenerate output
- #3 GPU embedding lookup writes garbage in HybridForwardPass
- #4 CLI: --backend cuda silently ignored; clarify backend selection
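
For reference, the `-g` policy that points users at issue #2 can be sketched as a small decision function. This is an illustration in Python, not the RunCommand code: `resolve_gpu_layers`, its return shape, and the message strings are all assumptions made for the example.

```python
from typing import Optional, Tuple

def resolve_gpu_layers(requested: int, is_moe: bool) -> Tuple[int, Optional[str]]:
    """Return (gpu_layers, warning). Hypothetical helper, not the project's API."""
    if not is_moe:
        # Non-MoE hybrid path is unchanged: pass the request through.
        return requested, None
    if requested > 0:
        # Explicit -g N on a MoE model: refuse instead of producing NaN output.
        raise ValueError(
            f"-g {requested} is not supported on MoE models; see issue #2"
        )
    if requested == -1:
        # Auto mode: warn and fall back to CPU (zero GPU layers).
        return 0, "MoE model detected: -g -1 falls back to CPU; see issue #2"
    return requested, None
```

Raising on the explicit request and only warning on auto mode keeps the default invocation working while making the unsupported combination visible.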
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s + qwen3 default system
JinjaChatTemplate previously emitted stray newlines from the Unsloth
Qwen3-Coder template:
    \n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n\n
    ^ leading                                                 ^^ trailing
The leading \n came from the line after `{# Copyright #}`; the
trailing extra came from the blank line between `{% endmacro %}`
and `{%- if messages[0]... %}`. HuggingFace transformers' chat-
template renderer enables trim_blocks=True and lstrip_blocks=True by
default, which strip those. Implement the same: ApplyStripping now
handles implicit trim/lstrip for {% %} and {# #} in addition to
the explicit {%- / -%} controls.
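
A rough Python sketch of the implicit stripping rules, assuming the Jinja2 semantics of `trim_blocks` (drop the first newline after a block or comment tag) and `lstrip_blocks` (drop spaces and tabs between the start of a line and a block or comment tag). The function names are hypothetical and this is not the ApplyStripping code; it ignores the `{%+` opt-out and `{% raw %}` sections.

```python
import re

def apply_implicit_stripping(src: str) -> str:
    # lstrip_blocks: remove spaces/tabs from the start of a line up to {% or {#.
    src = re.sub(r"(?m)^[ \t]+(?=\{[%#])", "", src)
    # trim_blocks: remove a single newline directly after %} or #}.
    src = re.sub(r"([%#]\})\n", r"\1", src)
    return src

def strip_comments(src: str) -> str:
    # Comments render as nothing; only the text around them survives.
    return re.sub(r"\{#.*?#\}", "", src, flags=re.S)

template = "{# Copyright #}\n<|im_start|>user\n"

# Without implicit stripping, the newline after the comment leaks into the prompt.
without_stripping = strip_comments(template)

# With it, the prompt starts directly at <|im_start|>.
with_stripping = strip_comments(apply_implicit_stripping(template))
```

Here `without_stripping` begins with a newline while `with_stripping` does not, which is the leading `\n` case described above.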
Also: when a Jinja template is used (the normal path for any modern
GGUF), inject the default "You are a helpful assistant." system
message for qwen3/qwen3moe to match the hardcoded fallback path.
Without it, Qwen3 models tend to emit <|endoftext|> immediately at
temp 0 for short / under-specified prompts.
Verified:
  -p "Write a one-line Python function..." → coherent code (was
     already coherent before the fix; not a regression).
  -p "Write a short WGSL shader..." with temp 0:
     * before this commit: rendered prompt had stray \n's; model
       emitted ``` then <|im_end|>.
     * after: prompt is clean, system message present; model
       generates plausible (if hallucinated) WGSL with a more
       directive variant of the prompt.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen3-Coder appears to be tuned to operate without a system message.
With "You are a helpful assistant." injected, the model emits
<|endoftext|> with high confidence at greedy temp 0 for short prompts
(logit ~29 vs ~14 with no system message). Narrow the default to
qwen3 (dense) only, to match the model's expected input distribution.

Note: this does not fully fix all Qwen3-Coder degenerate-output cases.
The model still emits <|endoftext|> with high confidence for some
prompts (e.g. "Write a short wgsl shader for pbr rendering. Output
only the code."). Determining whether that is expected model behavior
or a forward-pass divergence requires a comparison against the
llama.cpp reference; tracking separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
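
The narrowed default can be summarized in a few lines of Python. This is a sketch under assumptions, not the project's code: `with_default_system` is a hypothetical name, the architecture strings are the `qwen3` / `qwen3moe` identifiers mentioned earlier, and skipping injection when the caller already supplied a system message is an assumption.

```python
DEFAULT_SYSTEM = "You are a helpful assistant."

def with_default_system(messages, arch):
    """Prepend the default system message for dense qwen3 only (hypothetical helper)."""
    # Assumption: an explicit system message from the caller always wins.
    has_system = any(m.get("role") == "system" for m in messages)
    if arch == "qwen3" and not has_system:
        return [{"role": "system", "content": DEFAULT_SYSTEM}] + list(messages)
    # qwen3moe (e.g. Qwen3-Coder) and all other architectures are left untouched.
    return list(messages)
```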
This was referenced Jun 14, 2026
perf(mtp): batched-verify logits-buffer reuse + PromptLookup incremental index (#209 items 5,6)
#291
Open
Summary
- JinjaChatTemplate: handle `{% macro %}` / `{% block %}` / `{% raw %}`. Without this, any template that opens with a macro definition (the Unsloth Qwen3-Coder template) collapsed to `"\n\n"`, the model received no prompt, and produced garbled output.
- HybridForwardPass: switch copies from `vkCmdCopyBuffer` (transfer stage) to `RecordComputeCopy` (compute stage), and add the missing `RecordBarrier()` at the end of `GpuLayer`. With these, SmolLM2 hybrid produces coherent output at every `-g N`.
- RunCommand: refuse explicit `-g N` on MoE models with an error linking to issue #2 (Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn)); `-g -1` falls back to CPU with a warning.

Verified
[Table of verified configurations; only these cells survived: `-g -1` (all GPU), `-g 8` (hybrid), `--tq`, `-g 1`, `-g -1`, `--tq`]

Tracking issues opened
- `--backend cuda` silently ignored; clarify backend selection

Test plan
- `dotnet test tests/SharpInference.Tests.Core`: 26/26 pass
- … (`def add_two_numbers(a, b): return a + b`)
- … `-g -1` and `-g 4/8/16/23`
- … `--tq`
- … `-g 1` rejection message
- … `-g -1` auto-fallback to CPU

🤖 Generated with Claude Code