ROCm: discrete GPU memory management#461
Open
cattivik66 wants to merge 1 commit into
Open
Conversation
Add host-mapped expert cache fallback for VRAM-constrained discrete GPUs (e.g. AMD W7800 48GB). When the streaming expert cache is exhausted, copy the expert weights to pinned host memory and map them into the GPU address space via cudaHostRegister/cudaHostGetDevicePointer. Inference proceeds (slower) instead of hard-failing. Add automatic discrete/integrated GPU detection via cudaDeviceGetAttribute so the registered-weights path activates without manual configuration. DS4_ROCM_REGISTERED_WEIGHTS env var overrides detection for edge cases. Add three env var tunables: - DS4_ROCM_STREAM_FREE_RESERVE_GIB: override VRAM free-reserve floor (default 16 GiB). - DS4_CUDA_Q8_F16_CACHE_MB: limit the optional q8->f16 cache to free VRAM for routed-expert residency (e.g. DS4_CUDA_Q8_F16_CACHE_MB=0). - DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK: set to 0 to disable host-mapped expert overflow (default: enabled). Host-mapped fallback can be disabled with DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0. All defaults preserve upstream behavior.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
With these changes, the ds4flash.gguf (DeepSeek-V4 Flash) model runs on
discrete GPUs with 48 GB VRAM (tested: AMD Radeon Pro W7800 x2). Without
this PR the upstream code hard-fails: the streaming expert cache exhausts
VRAM during model load and
cuda_stream_resident_alloc()returns -1,preventing the model from starting at all.
Sorry I was not able to test on other systems, but I expect the code to be fully
compatible with the current supported systems as they should be untouched.
The PR adds:
instead of aborting.
path activates without manual configuration.
Compilation
Built on ROCm 7.2 with hipcc:
Linker flags:
-O3 -ffast-math -g -fno-finite-math-only -pthread -D__HIP_PLATFORM_AMD__ --offload-arch=gfx1100 -lhipblas -lhipblasltProblem
Expert cache OOM --
cuda_stream_resident_alloc()returns -1 whencuda_stream_resident_make_room()fails the 16 GiB free-reserve floor.On 48 GB cards the working set of routed experts from the Flash model
exceeds the remaining VRAM after model weights + KV cache, so the
startup crashes with "streaming expert cache cannot keep ... MiB".
Raw host-pointer reads -- when
cuda_model_range_ptr_from_fd()can't fita range in the arena it returns
cuda_model_ptr(), which for discrete GPUsis a plain host virtual address. On discrete GPUs this causes slow uncached
PCIe reads on every kernel access. In distributed mode, the raw pointer
also fails the
cuda_model_range_is_cached()check, crashing the worker.Validation
Test setup
make rocm ROCM_ARCH=gfx1100Single GPU (auto-detection, no DS4_ROCM_REGISTERED_WEIGHTS needed)
DS4_ROCM_STREAM_FREE_RESERVE_GIB=4 \ DS4_CUDA_Q8_F16_CACHE_MB=0 \ ./ds4 --model ds4flash.gguf --backend rocm --ssd-streaming \ -p "Hello, who are you?" -n 30 --nothinkDual GPU (distributed, layers split 0:21 + 22:output)
The auto-detection correctly identifies the W7800 as discrete, enabling the
registered-weights fallback path automatically. No
DS4_ROCM_REGISTERED_WEIGHTSenv var is needed.
Terminal 1 (GPU 1, worker, layers 22:output):
Terminal 2 (GPU 0, coordinator, layers 0:21):
ROCR_VISIBLE_DEVICES=0 DS4_LOCK_FILE=/tmp/ds4-c.lock \ DS4_CUDA_Q8_F16_CACHE_MB=0 DS4_ROCM_STREAM_FREE_RESERVE_GIB=1 \ ./ds4 --role coordinator --layers 0:21 --ssd-streaming \ --listen 127.0.0.1 9000 \ --backend rocm -m ds4flash.gguf \ -p "Hello, who are you?" -n 50 --nothinkCoordinator output:
Both GPUs participate in every token via the distributed layer pipeline.
Each GPU uses ~42 GiB VRAM during inference.
Dual GPU distributed (full residency, no SSD)
Outcome summary
Changes
ds4_rocm.hAdded
cudaDevAttrIntegrated->hipDeviceAttributeIntegratedmapping for theHIP/CUDA compatibility layer, enabling
cudaDeviceGetAttributeto query whetherthe GPU is integrated or discrete.
rocm/ds4_rocm_runtime.cuh1. Host-mapped expert fallback (
cuda_stream_resident_alloc)When a routed expert cannot be placed in VRAM:
malloc'd host buffer.cudaHostRegister(cudaHostRegisterMapped).cudaHostGetDevicePointer.host_mapped=1so the GPU kernel reads through theregistered mapping (zero-copy over PCIe).
The fallback is enabled by default and activates only when the existing
cudaMallocpath would have returned -1. SetDS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0to restore the previous hard-fail behaviour.
Two new fields (
host_mapped,host_base) are added tocuda_stream_resident_expert;cuda_stream_resident_cache_release()was updated to callcudaHostUnregister+freefor host-mapped experts while preserving the original
cudaFreepath for VRAM experts.2. Automatic discrete GPU detection (
cuda_device_is_discrete)Queries
cudaDeviceGetAttribute(cudaDevAttrIntegrated)at first use and cachesthe result. On discrete GPUs (separate VRAM), the model arena fallthrough
returns NULL instead of a raw host pointer, forcing the
cudaHostRegistermapping path. On integrated GPUs (APUs, unified memory), the raw host pointer
path is preserved for zero-copy access.
DS4_ROCM_REGISTERED_WEIGHTSenv var overrides auto-detection:"1"= force discrete (always use registered path)"0"= force integrated (always use raw host pointer)When the attribute query fails (older ROCm versions), defaults to discrete
(the safer choice).
3.
DS4_ROCM_STREAM_FREE_RESERVE_GIB(VRAM free-reserve floor)Override the hard-coded 16 GiB free-reserve in
cuda_stream_resident_free_reserve_bytes().Fractional values accepted (e.g.
4.5). On 48 GB cards a reserve of ~4 GiB issufficient; the upstream 16 GiB leaves too little room for the Flash model's
routed-expert working set.
Default (unset): 16 GiB -- upstream behaviour unchanged.
4.
DS4_CUDA_Q8_F16_CACHE_MB(q8->f16 cache budget)The optional q8->f16 transpose cache defaults to unlimited (
UINT64_MAX).On tight discrete GPUs this cache can consume VRAM needed for expert residency.
Set to
0to disable it entirely.Default (unset): unlimited -- upstream behaviour unchanged.
Environment variables
DS4_ROCM_STREAM_FREE_RESERVE_GIB16DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK0to disable host-mapped expert overflowDS4_CUDA_Q8_F16_CACHE_MB0=disable)DS4_ROCM_REGISTERED_WEIGHTS1=discrete,0=integrated)Safety
same raw-pointer path as before.
DS4_ROCM_HOST_MAPPED_EXPERT_FALLBACK=0escape hatch restoresthe previous hard-fail path.
or CPU backends.