Description
Apple Feedback ID: FB22091885
This issue has been filed with Apple and is cross-referenced here for the MLX community. A fix may come from either side.
Kernel panic in IOGPUMemory.cpp:550 triggered by large Metal GPU memory allocation during MLX inference on M4 Max.
PANIC STRING:
"completeMemory() prepare count underflow" @IOGPUMemory.cpp:550
SYSTEM:
- Hardware: Apple M4 Max (36GB unified memory)
- macOS: 26.3 (25D125)
- Kernel: Darwin 25.3.0 xnu-12377.81.4~5/RELEASE_ARM64_T6041
REPRODUCIBLE: Yes — confirmed twice with identical call stacks.
REPRODUCTION STEPS:
- Install MLX and mlx-lm via pip on Python 3.14 ARM64
- Load a large quantized LLM (Qwen3.5-27B Q5_K_M) via mlx_vlm.load()
- Construct a prompt consisting of 147 concatenated model outputs totalling approximately 173,000 tokens
- Call mlx_vlm.generate() with this prompt — prefill phase begins processing the full context
- Kernel panics during prefill, consistently at IOGPUMemory.cpp:550
ROOT COMPONENT:
com.apple.iokit.IOGPUFamily (129.3.2)
NOTES:
- Panic does not occur with smaller prompts (under ~10,000 tokens)
- Memory capacity is not the issue: the system has 36GB and the model occupies ~26GB, leaving sufficient headroom
- Issue appears to be GPU memory accounting state corruption triggered by a single contiguous Metal allocation for a very large attention computation, not an out-of-memory condition
- Two panic logs attached with identical backtraces, confirming deterministic reproducibility
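As rough supporting arithmetic for why this allocation would be so large: a naive, unfused fp16 attention score matrix scales quadratically with context length. The numbers below are illustrative only; MLX's actual kernels may tile or fuse the computation, so the true allocation pattern is an assumption.

```python
# Illustrative arithmetic, not the actual MLX allocation: size of one
# full fp16 attention score matrix (tokens x tokens) for a single head
# during a 173K-token prefill.
tokens = 173_000
bytes_per_elem = 2  # fp16

score_matrix_bytes = tokens * tokens * bytes_per_elem
print(f"{score_matrix_bytes / 2**30:.1f} GiB per head")  # ~55.7 GiB
```

Even if the real kernel never materializes the full matrix, any intermediate buffer sized along these lines would dwarf the 36GB of unified memory, consistent with the panic being an accounting/allocation-path bug rather than simple exhaustion.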
Suggested mitigation for MLX:
Add a prefill token count guard in mlx_lm before the Metal allocator is called. If the prompt exceeds a safe threshold (empirically somewhere
below 173K tokens on M4 Max with 36GB), either raise a clear exception with guidance to chunk the prompt, or automatically split the prefill
into safe-sized segments. This would prevent the IOGPUFamily kernel panic without requiring a macOS fix.
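A minimal sketch of what such a guard could look like. The function name, threshold value, and chunking strategy here are hypothetical, not existing mlx_lm API; the safe threshold would need to be tuned empirically per hardware configuration.

```python
# Hypothetical prefill guard, sketched for illustration only.
# MAX_SAFE_PREFILL_TOKENS is a conservative placeholder, not a tuned value.
MAX_SAFE_PREFILL_TOKENS = 32_000

def chunk_prefill(token_ids, max_tokens=MAX_SAFE_PREFILL_TOKENS):
    """Split a long prompt into segments small enough that each Metal
    allocation stays below the size that triggers the panic."""
    if len(token_ids) <= max_tokens:
        return [token_ids]
    return [token_ids[i:i + max_tokens]
            for i in range(0, len(token_ids), max_tokens)]
```

Each segment would then be fed through the model sequentially while reusing the growing KV cache, so the final state matches a single monolithic prefill pass; alternatively, the guard could simply raise a descriptive exception advising the caller to chunk the prompt.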