Fix compute->host visibility in HybridForwardPass MoE path (#2) #12

Merged
pekkah merged 2 commits into master from fix/issue-2-moe-hybrid-host-barrier on Apr 29, 2026

@pekkah (Owner) commented Apr 29, 2026

Summary

Partial fix for issue #2 (Hybrid GPU+CPU MoE path producing NaN/garbled output). The
documented repro (-g 1 --tq on Qwen3-Coder 30B-A3B) now decodes coherent text at
~14.7 t/s instead of 0 NaN tokens.

Root cause (Bug 1, fixed)

HybridForwardPass.GpuMoeFfn writes the post-RmsNorm hidden state to a host-coherent
BAR buffer (_gpuPinnedNorm) via RecordComputeCopy, then submits and waits on a
fence so the CPU expert-fallback path can read it via MapPinned. On RTX 4070 Ti,
fence completion alone did not make the compute-shader writes visible to host reads.
MapPinned returned stale data, so the CPU fallback consumed bogus normPtr values and
the resulting _cpuFallbackBuf was ~1000x out of range. The residual connection then
carried the corruption through every later layer, producing the garbled tokens
reported in the issue.

Adding an explicit compute -> host pipeline barrier (SHADER_WRITE -> HOST_READ on
COMPUTE_SHADER -> HOST stages) immediately before EndRecordAndSubmit fixes it. The
new helper is VulkanBackend.RecordComputeToHostBarrier().

Residual issue (Bug 2, still open under #2)

Beyond roughly -g 9 (9 GPU layers) on Qwen3-Coder 30B-A3B, the prefetcher-cached GPU expert
path still produces wrong output even with the host barrier. Disabling the prefetcher
entirely (forcing all experts through CPU fallback) restores correctness across the
full -g 1..-1 range, which points at the GPU-expert MatMul reading prefetched
weights — most likely descriptor-set reuse across multiple recorded dispatches in
ComputePipeline._reusableDs. That fix is a deeper rework and intentionally out of
scope here.

The CLI guard from PR #5 therefore stays in place by default. Set
SHARPI_ALLOW_BROKEN_MOE_HYBRID=1 to bypass the guard for further work on Bug 2.
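
As an illustration of how an environment-variable escape hatch like this can gate a guard, here is a small sketch in C. Only the variable name comes from this PR. The function names are hypothetical, and treating exactly "1" as the only enabling value is an assumption, not a description of the actual CLI.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: the override counts as enabled only when the variable is
 * set to exactly "1" (assumption; the real CLI may accept other values). */
static inline bool override_enabled(const char *value) {
    return value != NULL && strcmp(value, "1") == 0;
}

/* Hypothetical guard: refuse MoE models on the hybrid path unless the
 * override is enabled. Non-MoE models and non-hybrid runs are unaffected. */
static inline bool should_refuse_moe_hybrid(bool is_moe, bool is_hybrid,
                                            const char *override_value) {
    return is_moe && is_hybrid && !override_enabled(override_value);
}

/* Convenience wrapper that reads the variable named in this PR. */
static inline bool should_refuse_from_environment(bool is_moe, bool is_hybrid) {
    return should_refuse_moe_hybrid(is_moe, is_hybrid,
                                    getenv("SHARPI_ALLOW_BROKEN_MOE_HYBRID"));
}
```

Keeping the decision in a pure function that takes the variable's value as a parameter makes the guard easy to unit-test without touching the process environment.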

Test plan

  • Original issue repro: SHARPI_ALLOW_BROKEN_MOE_HYBRID=1 ... -g 1 --tq -p "Hello"
    produces coherent decode (was: 0 NaN tokens)
  • Sweep -g 1..9 with the prefetcher on: all produce coherent output
  • Sweep -g 1..-1 with prefetcher disabled (debug only): all produce coherent output
  • Default behavior unchanged: guard still refuses MoE on the hybrid path with the
    issue #2 message (issue title: "Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn)")
  • Full test suite green (excluding 2 pre-existing Vulkan failures unrelated to MoE)

🤖 Generated with Claude Code

pekkah and others added 2 commits April 29, 2026 14:00
`GpuMoeFfn` writes the post-RmsNorm hidden state to a host-coherent BAR
buffer (`_gpuPinnedNorm`) so the CPU expert-fallback path can read it via
`MapPinned` after the mid-FFN submit. Fence completion alone did not make
the compute-shader writes visible to host reads on RTX 4070 Ti — MapPinned
returned stale data and the CPU fallback produced wildly out-of-range
output (~1062 magnitudes), which propagated through residuals as the
garbled MoE tokens reported in issue #2.

Add `VulkanBackend.RecordComputeToHostBarrier` (a SHADER_WRITE -> HOST_READ
pipeline barrier on COMPUTE_SHADER -> HOST stages) and call it in
`GpuMoeFfn` immediately before `EndRecordAndSubmit`. With this barrier the
issue's exact repro (`-g 1 --tq` on Qwen3-Coder 30B-A3B) now decodes
coherent text at ~14.7 t/s instead of 0 NaN tokens.

Add `SHARPI_ALLOW_BROKEN_MOE_HYBRID=1` CLI escape hatch so the
not-yet-fixed prefetcher path can still be exercised for further
investigation; the existing guard remains in place because the prefetcher
GPU-expert path still corrupts output beyond ~-g 9 (likely descriptor-set
reuse in `ComputePipeline._reusableDs` across multiple recorded
dispatches; out of scope for this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pekkah pekkah merged commit ea9ff3f into master Apr 29, 2026
1 check passed
@pekkah pekkah deleted the fix/issue-2-moe-hybrid-host-barrier branch April 29, 2026 11:07