Fix compute->host visibility in HybridForwardPass MoE path (#2) #12
Merged
`GpuMoeFfn` writes the post-RmsNorm hidden state to a host-coherent BAR buffer (`_gpuPinnedNorm`) so the CPU expert-fallback path can read it via `MapPinned` after the mid-FFN submit. Fence completion alone did not make the compute-shader writes visible to host reads on RTX 4070 Ti: `MapPinned` returned stale data and the CPU fallback produced wildly out-of-range output (~1062 magnitudes), which propagated through residuals as the garbled MoE tokens reported in issue #2.

Add `VulkanBackend.RecordComputeToHostBarrier` (a SHADER_WRITE -> HOST_READ pipeline barrier on COMPUTE_SHADER -> HOST stages) and call it in `GpuMoeFfn` immediately before `EndRecordAndSubmit`. With this barrier the issue's exact repro (`-g 1 --tq` on Qwen3-Coder 30B-A3B) now decodes coherent text at ~14.7 t/s instead of 0 NaN tokens.

Add `SHARPI_ALLOW_BROKEN_MOE_HYBRID=1` CLI escape hatch so the not-yet-fixed prefetcher path can still be exercised for further investigation; the existing guard remains in place because the prefetcher GPU-expert path still corrupts output beyond ~-g 9 (likely descriptor-set reuse in `ComputePipeline._reusableDs` across multiple recorded dispatches; out of scope for this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
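For readers less familiar with this corner of Vulkan: the specification states that signaling a fence and waiting on it from the host does not by itself guarantee that device writes are visible to the host; a memory dependency whose destination is the host is also required. The C sketch below shows the masks such a SHADER_WRITE -> HOST_READ barrier carries. It is an illustration only, not the body of `RecordComputeToHostBarrier`. To stay compilable without the Vulkan SDK it mirrors the few values it needs from the Vulkan 1.0 headers; the `SKETCH_*` names, `SketchMemoryBarrier`, `ComputeToHostBarrier`, and `make_compute_to_host_barrier` are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Values mirrored from the Vulkan 1.0 headers so the sketch builds
 * without the SDK. Real code would include <vulkan/vulkan.h> and use
 * the VK_* names given in the comments. */
enum {
    SKETCH_STRUCTURE_TYPE_MEMORY_BARRIER     = 46,         /* VK_STRUCTURE_TYPE_MEMORY_BARRIER */
    SKETCH_ACCESS_SHADER_WRITE_BIT           = 0x00000040, /* VK_ACCESS_SHADER_WRITE_BIT */
    SKETCH_ACCESS_HOST_READ_BIT              = 0x00002000, /* VK_ACCESS_HOST_READ_BIT */
    SKETCH_PIPELINE_STAGE_COMPUTE_SHADER_BIT = 0x00000800, /* VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT */
    SKETCH_PIPELINE_STAGE_HOST_BIT           = 0x00004000  /* VK_PIPELINE_STAGE_HOST_BIT */
};

/* Same fields, in the same order, as VkMemoryBarrier. */
typedef struct {
    uint32_t    sType;
    const void *pNext;
    uint32_t    srcAccessMask;
    uint32_t    dstAccessMask;
} SketchMemoryBarrier;

/* The stage masks plus the global memory barrier: the arguments that
 * vkCmdPipelineBarrier needs for a compute -> host dependency. */
typedef struct {
    uint32_t            srcStageMask;
    uint32_t            dstStageMask;
    SketchMemoryBarrier barrier;
} ComputeToHostBarrier;

/* Build the dependency: compute-shader writes must be made available
 * and visible to subsequent host reads. */
static ComputeToHostBarrier make_compute_to_host_barrier(void)
{
    ComputeToHostBarrier b;
    b.srcStageMask          = SKETCH_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
    b.dstStageMask          = SKETCH_PIPELINE_STAGE_HOST_BIT;
    b.barrier.sType         = SKETCH_STRUCTURE_TYPE_MEMORY_BARRIER;
    b.barrier.pNext         = NULL;
    b.barrier.srcAccessMask = SKETCH_ACCESS_SHADER_WRITE_BIT;
    b.barrier.dstAccessMask = SKETCH_ACCESS_HOST_READ_BIT;
    return b;
}

/* With the real header, the barrier would be recorded after the last
 * dispatch that writes the buffer and before vkEndCommandBuffer:
 *
 *   vkCmdPipelineBarrier(cmd, b.srcStageMask, b.dstStageMask, 0,
 *                        1, &b.barrier, 0, NULL, 0, NULL);
 */
```

The barrier has to sit in the same command buffer as the writing dispatch, ahead of the submit whose fence the host waits on, which matches the placement immediately before `EndRecordAndSubmit` described above.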
Summary
Partial fix for issue #2 (Hybrid GPU+CPU MoE path producing NaN/garbled output). The documented repro (`-g 1 --tq` on Qwen3-Coder 30B-A3B) now decodes coherent text at ~14.7 t/s instead of 0 NaN tokens.
Root cause (Bug 1, fixed)
`HybridForwardPass.GpuMoeFfn` writes the post-RmsNorm hidden state to a host-coherent BAR buffer (`_gpuPinnedNorm`) via `RecordComputeCopy`, then submits and waits on a fence so the CPU expert-fallback path can read it via `MapPinned`. On RTX 4070 Ti, fence completion alone did not make the compute-shader writes visible to host reads: `MapPinned` returned stale data, the CPU fallback consumed bogus `normPtr` values, the resulting `_cpuFallbackBuf` was ~1000x out of range, and the residual carried the corruption through every later layer as the garbled tokens reported in the issue.

Adding an explicit `compute -> host` pipeline barrier (`SHADER_WRITE -> HOST_READ` on `COMPUTE_SHADER -> HOST` stages) immediately before `EndRecordAndSubmit` fixes it. The new helper is `VulkanBackend.RecordComputeToHostBarrier()`.

Residual issue (Bug 2, still open under #2)
Beyond `~-g 9` GPU layers on Qwen3-Coder 30B-A3B, the prefetcher-cached GPU expert path still produces wrong output even with the host barrier. Disabling the prefetcher entirely (forcing all experts through CPU fallback) restores correctness across the full `-g 1..-1` range, which points at the GPU-expert MatMul reading prefetched weights, most likely descriptor-set reuse across multiple recorded dispatches in `ComputePipeline._reusableDs`. That fix is a deeper rework and intentionally out of scope here.
The CLI guard from PR #5 therefore stays in place by default. Set `SHARPI_ALLOW_BROKEN_MOE_HYBRID=1` to bypass the guard for further work on Bug 2.

Test plan
- `SHARPI_ALLOW_BROKEN_MOE_HYBRID=1 ... -g 1 --tq -p "Hello"` produces coherent decode (was: 0 NaN tokens)
- `-g 1..9` with prefetcher on: all produce coherent output
- `-g 1..-1` with prefetcher disabled (debug only): all produce coherent output
- issue "Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn)" #2 message
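The guard bypass mentioned earlier can be pictured as a strict opt-in check on the environment variable. The C sketch below is an assumption about the shape of such a guard, not the code from PR #5: the function names, the two boolean inputs, and the rule that only the exact value `1` enables the bypass are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical rule: the escape hatch is on only for the exact value "1".
 * Unset, empty, or any other value leaves the guard active. */
static bool escape_hatch_enabled(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Hypothetical guard: a hybrid GPU+CPU split on an MoE model is refused
 * unless the escape hatch is enabled. Every other configuration is
 * unaffected by the guard. */
static bool hybrid_moe_permitted(bool is_moe_model, bool is_hybrid_split,
                                 const char *env_value)
{
    if (!is_moe_model || !is_hybrid_split) {
        return true;
    }
    return escape_hatch_enabled(env_value);
}

/* Convenience wrapper that reads the variable named in this PR. */
static bool hybrid_moe_permitted_from_env(bool is_moe_model,
                                          bool is_hybrid_split)
{
    return hybrid_moe_permitted(is_moe_model, is_hybrid_split,
                                getenv("SHARPI_ALLOW_BROKEN_MOE_HYBRID"));
}
```

Passing the variable's value in as a parameter keeps the decision testable without mutating the process environment.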
🤖 Generated with Claude Code