Support SSD streaming for Q4_K routed experts on ROCm #451
Open
kmc6042 wants to merge 1 commit into
The ROCm streaming MoE paths were gated to the IQ2_XXS/Q2_K expert quant pair, so Q4_K expert GGUFs failed prefill with "missing compact selected experts" and could not run under --ssd-streaming at all.

Route Q4_K through the quant-agnostic machinery instead of the IQ2-only selected/split kernels:
- Prefill: allow the full-layer streaming path for Q4_K. It stages a whole layer's expert table contiguously and runs the standard matmul, so use it for any multi-token prefill since Q4_K has no batched selected-gather kernel.
- Decode: route Q4_K through the shared-overlap selected-load path and force the selected-expert loader to build a full contiguous compact buffer, since the split decode kernels only exist for the IQ2_XXS/Q2_K pair.

Also speed up Q4_K streaming by warming the routed-expert cache from the popularity hotlist:
- Implement the previously stubbed ROCm seed_experts() as a real bulk sequential preload into the resident cache, which is far cheaper than the scattered first-touch random reads it replaces. Read failures release the resident cache so partially-filled entries are never served as hits.
- Allow the hotlist/prefill cache seed for Q4_K layers, and warm the cache at the start of decode-style prefill so short prompts benefit too.

On an AMD Ryzen AI MAX+ 395 (Strix Halo, gfx1151) with the 153 GiB Q4_K DeepSeek-V4-Flash GGUF and 123 GiB RAM, this takes the model from failing to start to producing correct output, and the preload cuts decode cache misses roughly in half.

Escape hatches: DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP=1 plus the existing --ssd-streaming-cold / DS4_METAL_DISABLE_STREAMING_EXPERT_HOTLIST.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pull request overview
Enables SSD streaming execution for ROCm (Strix Halo) routed-expert GGUFs quantized as Q4_K by routing Q4_K through quant-agnostic streaming paths and adding an expert-cache warm-up pass to reduce first-touch decode misses.
Changes:
- Adds a ROCm selected-expert loader mode that forces contiguous compact buffers (avoids IQ2-only split decode paths).
- Extends ROCm streaming prefill/decode routing to support Q4_K, including full-layer prefill enablement for multi-token prompts.
- Implements popularity-based expert cache seeding on ROCm and triggers warm-up for short decode-style prefill.
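The "contiguous compact buffer" in the first item can be pictured as a gather: the slices of the selected experts are copied, in selection order, into one dense buffer that a standard kernel can consume. The following is only a minimal host-side sketch under stated assumptions. `gather_selected_experts` and its parameters are hypothetical names, every expert is assumed to occupy the same number of bytes, and the PR's actual loader reads from SSD and uploads to the GPU rather than copying with `memcpy`.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not the PR's code): copy the selected experts'
 * slices out of a full expert table into one contiguous buffer, in
 * selection order. Assumes every expert occupies `expert_bytes` bytes.
 * Returns a malloc'd buffer the caller must free, or NULL on error. */
static uint8_t *gather_selected_experts(const uint8_t *table,
                                        size_t expert_bytes,
                                        const uint32_t *selected,
                                        size_t n_selected,
                                        uint32_t n_experts) {
    uint8_t *compact = malloc(expert_bytes * n_selected);
    if (!compact) return NULL;
    for (size_t i = 0; i < n_selected; i++) {
        /* Reject out-of-range expert ids instead of reading past the table. */
        if (selected[i] >= n_experts) {
            free(compact);
            return NULL;
        }
        memcpy(compact + i * expert_bytes,
               table + (size_t)selected[i] * expert_bytes,
               expert_bytes);
    }
    return compact;
}
```

With a dense buffer, slot `i` always holds the `i`-th selected expert, so no quant-specific split layout is needed downstream.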
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| rocm/ds4_rocm_runtime.cuh | Adds g_stream_selected_force_contiguous and gates the async split pending path for selected-expert loads. |
| rocm/ds4_rocm_current_api_compat.cuh | Exposes a setter for the contiguous mode and implements ROCm seed_experts() warm-up via bulk sequential reads. |
| ds4.c | Updates ROCm streaming routing for Q4_K (prefill + decode), enables cache warm-up at decode-style prefill start, and broadens seeding applicability. |
| ds4_gpu.h | Adds a public GPU API declaration for the new contiguous-mode toggle. |
Comment on lines 73 to +77
void ds4_gpu_set_quality(bool quality);
void ds4_gpu_set_ssd_streaming(bool enabled);
void ds4_gpu_set_streaming_expert_cache_budget(uint32_t experts);
void ds4_gpu_set_streaming_expert_cache_expert_bytes(uint64_t bytes);
void ds4_gpu_stream_set_selected_force_contiguous(int enabled);
Comment on lines +13644 to +13660
static bool metal_graph_use_rocm_q4_selected_shared_overlap(
        const ds4_gpu_graph *g,
        const ds4_layer_weights *layer) {
    return g &&
           g->ssd_streaming &&
           !g->quality &&
           layer &&
           layer->ffn_gate_exps &&
           layer->ffn_up_exps &&
           layer->ffn_down_exps &&
           layer->ffn_gate_exps->type == DS4_TENSOR_Q4_K &&
           layer->ffn_up_exps->type == DS4_TENSOR_Q4_K &&
           layer->ffn_down_exps->type == DS4_TENSOR_Q4_K &&
           DS4_N_EXPERT_USED == 6 &&
           DS4_N_EXPERT >= 128 &&
           getenv("DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP") == NULL;
}
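One detail of the predicate above is worth spelling out: the escape hatch is tested with `getenv(...) == NULL`, so it keys on the variable's presence, not its value. The sketch below isolates just that check. `q4_overlap_enabled` is a hypothetical name used for illustration, and the other conditions of the real predicate are left out.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical isolation of the environment check in the predicate:
 * the Q4_K shared-overlap path stays enabled only while the variable
 * is absent from the environment. Its value is never inspected. */
static bool q4_overlap_enabled(void) {
    return getenv("DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP") == NULL;
}
```

Under this check, exporting the variable as `0` or as an empty string disables the path just as `=1` does. Only unsetting it re-enables the path.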
Comment on lines +420 to +442
        jobs[job_count++] = {entry.gate, gate_offset + gate_rel, gate_expert_bytes,
                             NULL, NULL, 0, 0, 0};
        jobs[job_count++] = {entry.up, up_offset + gate_rel, gate_expert_bytes,
                             NULL, NULL, 0, 0, 0};
        jobs[job_count++] = {entry.down, down_offset + down_rel, down_expert_bytes,
                             NULL, NULL, 0, 0, 0};
    }

    if (job_count != 0) {
        const int flushed =
            cuda_stream_read_jobs_parallel(jobs, job_count) &&
            cuda_stream_selected_upload_read_jobs(jobs, job_count);
        cuda_stream_read_jobs_free(jobs, job_count);
        if (!flushed) {
            cuda_stream_resident_cache_release();
            free(jobs);
            return 1;
        }
    }
    if (cuda_stream_cache_stats_on()) {
        g_stream_cache_stats.seed_calls++;
        g_stream_cache_stats.seed_unique += n_experts;
    }
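The snippet above queues one read job per expert tensor, flushes the jobs in bulk, and releases the resident cache if the flush fails. The sketch below shows the same two ideas in plain POSIX C: order the reads by file offset so they sweep the file front to back, and treat the preload as all-or-nothing. It is not the PR's implementation. `seed_job`, `seed_preload`, and the `valid` flag are hypothetical, the return convention (0 on success, -1 on failure) is this sketch's own, and the reads run serially, whereas the function name `cuda_stream_read_jobs_parallel` suggests the PR reads in parallel.

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical read job: copy `bytes` from file offset `offset` to `dst`. */
typedef struct {
    void  *dst;
    off_t  offset;
    size_t bytes;
} seed_job;

/* qsort comparator: ascending file offset. */
static int seed_job_cmp(const void *a, const void *b) {
    const seed_job *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Read exactly job->bytes, continuing after short reads.
 * Returns 0 on success, -1 on error or unexpected end of file. */
static int seed_read_full(int fd, const seed_job *job) {
    size_t done = 0;
    while (done < job->bytes) {
        ssize_t n = pread(fd, (char *)job->dst + done,
                          job->bytes - done, job->offset + (off_t)done);
        if (n <= 0) return -1;
        done += (size_t)n;
    }
    return 0;
}

/* Sort the jobs by offset, then run them in that order. On any failure
 * *valid is cleared, standing in for "release the resident cache", so a
 * partially filled entry can never be served as a hit. */
static int seed_preload(int fd, seed_job *jobs, size_t n, int *valid) {
    qsort(jobs, n, sizeof(*jobs), seed_job_cmp);
    for (size_t i = 0; i < n; i++) {
        if (seed_read_full(fd, &jobs[i]) != 0) {
            *valid = 0;
            return -1;
        }
    }
    *valid = 1;
    return 0;
}
```

Invalidating everything on failure is the simplest safe policy. A finer-grained design would track validity per entry, at the cost of more bookkeeping.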
Summary
Enables running Q4_K routed-expert GGUFs under `--ssd-streaming` on the ROCm (Strix Halo) backend, and warms the routed-expert cache, which roughly halves decode cache misses in the test run reported under Testing.
Before this change, the ROCm streaming MoE paths were gated to the `IQ2_XXS`/`Q2_K` expert quant pair, so a Q4_K expert GGUF failed prefill with `missing compact selected experts ... full expert table is not mapped` and could not run under `--ssd-streaming` at all.
What changed
Route Q4_K through the quant-agnostic streaming machinery instead of the IQ2-only selected/split kernels:
- Prefill: allow the full-layer streaming path for Q4_K. It stages a whole layer's expert table into a contiguous buffer and runs the standard matmul, so it is used for any multi-token prefill (Q4_K has no batched selected-gather kernel).
- Decode: route Q4_K through the shared-overlap selected-load path and force the selected-expert loader to build a full contiguous compact buffer, since the split decode kernels exist only for the `IQ2_XXS`/`Q2_K` pair.
Cache warm-up to speed up streaming:
- Implement the previously stubbed ROCm `seed_experts()` as a real bulk sequential preload of the popularity hotlist into the resident cache, which is far cheaper than the scattered first-touch random reads it replaces. Read failures release the resident cache so partially filled entries are never served as hits.
- Allow the hotlist/prefill cache seed for Q4_K layers, and warm the cache at the start of decode-style prefill so short prompts benefit too.
Testing
On an AMD Ryzen AI MAX+ 395 (Strix Halo, `gfx1151`), 123 GiB RAM, with the 153 GiB `DeepSeek-V4-Flash-Q4KExperts-...` GGUF on NVMe:
- Before: prefill failed with `missing compact selected experts`.
- After: correct output on both the short-prompt decode-style prefill path and the long-prompt (>64 token) full-layer prefill path.
- The preload cuts decode cache misses roughly in half (e.g. 3432 → 1720 misses), measured via `DS4_ROCM_STREAM_CACHE_STATS=1`.
- The `q4k-dot` unit test passes.
Escape hatches
- `DS4_ROCM_DISABLE_Q4_SELECTED_SHARED_OVERLAP=1` disables the Q4_K decode path.
- `--ssd-streaming-cold` / `DS4_METAL_DISABLE_STREAMING_EXPERT_HOTLIST` skip the preload.
🤖 Generated with Claude Code