Metal: keep selected-address SSD prefill opt-in by default #454
andreaborio wants to merge 1 commit into
If I am interpreting your speed spot check correctly the At the end of the day, for a sufficiently large batch it is reasonable to think that most experts will be needed. Maybe for small prompts it pays off... Once the correctness is established, we should measure the effectiveness of this optimization. Ciao,
Yes, same read here! And yes, the speed number is just a small sanity check, not a real benchmark. If/when the path is made correct, it probably needs a proper sweep across prompt sizes before turning it back on. Ciao
Summary
Related to #439. This PR narrows one bad default path, but I am not claiming it fully closes #439 across all Apple Metal setups.
Current understanding
On my M5 Pro, the selected-address batch prefill path and the canonical path produce different logits for the `short_code_completion` repro. With the PR default, Metal no longer auto-enables selected-address prefill and the CLI repro selects the expected lowercase `c`. Explicitly opting selected-address back in still selects uppercase `C`.

A follow-up report in #439 suggests the behavior can also depend on math mode and hardware. I re-ran a wider matrix with the issue command (`DS4_METAL_PREFILL_CHUNK=2048`, `--dump-logprobs`, top-k 10) on both this branch and clean `upstream/main` at `80ebbc3`. On this M5 Pro, `DS4_METAL_MATH_SAFE=1` is not a universal fix: it flips the argmax differently depending on whether selected-address prefill is enabled.

So the conservative change here is only: do not auto-enable the selected-address path on Metal while its numerics are not understood. The path remains available for explicit profiling/debugging.
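For readers who want to check an argmax flip between two runs themselves, here is a minimal sketch of the comparison. It assumes the top-k logprobs of each run have already been parsed into a token-to-logprob mapping; the helper name `compare_topk` and the numbers below are made up for illustration and are not taken from the actual dumps.

```python
from typing import Dict, Tuple


def compare_topk(a: Dict[str, float], b: Dict[str, float]) -> Tuple[str, str, bool, float]:
    """Compare two top-k logprob tables (token -> logprob).

    Returns the argmax token of each table, whether the argmax differs,
    and the largest absolute logprob difference over tokens present in both.
    """
    top_a = max(a, key=a.get)
    top_b = max(b, key=b.get)
    shared = a.keys() & b.keys()
    max_diff = max((abs(a[t] - b[t]) for t in shared), default=0.0)
    return top_a, top_b, top_a != top_b, max_diff


# Illustrative values only (hypothetical, not measured on any machine).
canonical = {"c": -0.62, "C": -0.81, "int": -3.10}
selected_addr = {"c": -0.83, "C": -0.60, "int": -3.05}

top_a, top_b, flipped, max_diff = compare_topk(canonical, selected_addr)
print(f"canonical argmax={top_a!r} selected-addr argmax={top_b!r} "
      f"flipped={flipped} max|dlogprob|={max_diff:.3f}")
```

Reporting the largest logprob difference alongside the flip helps separate a near-tie that tips over from a large numerical divergence.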
Validation
Machine/backend: Apple M5 Pro, 64 GiB RAM, Metal.
Model: DeepSeek-V4-Flash IQ2XXS/Q2_K imatrix GGUF.
Build/test checks run:
- `make clean && make && make ds4_test ds4_agent_test q4k-dot-test`
- `./ds4_test --server`
- `./ds4_test --metal-kernels`
- `./ds4_agent_test`
- `DS4_TEST_MODEL=... ./ds4_test --metal-ssd-streaming-cache-pressure`
- `DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --logprob-vectors`
- `DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --long-context`
- `DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --think-tool-recovery`
- `make cpu`
- `git diff --check`

Additional CLI matrix for #439 repro (`DS4_METAL_PREFILL_CHUNK=2048`, `--logprobs-top-k 10`):

`C` | `DS4_METAL_DISABLE_STREAMING_PREFILL_BATCH_SELECTED_ADDR=1` | `c` | `DS4_METAL_MATH_SAFE=1` | `c` | `C` | `c` | `DS4_METAL_ENABLE_STREAMING_PREFILL_BATCH_SELECTED_ADDR=1` | `C` | `DS4_METAL_MATH_SAFE=1` | `C` | `c`

Repeated critical cases in a different order reproduced the same logits/token choices.
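A matrix like this can be enumerated mechanically so that no combination is skipped. The sketch below only builds the environment overlays for this branch's two toggles; the helper `build_matrix` is hypothetical, and the repro command itself is deliberately left out because it is defined in #439 rather than here.

```python
from itertools import product
from typing import Dict, List

# Environment variable names as used in this PR description.
SELECTED_ADDR_ENABLE = "DS4_METAL_ENABLE_STREAMING_PREFILL_BATCH_SELECTED_ADDR"
MATH_SAFE = "DS4_METAL_MATH_SAFE"


def build_matrix(toggles: List[str], fixed: Dict[str, str]) -> List[Dict[str, str]]:
    """Return one env overlay per on/off combination of the given toggles.

    A toggle that is "off" is simply left unset; "on" sets it to "1".
    `fixed` holds variables that are identical in every case.
    """
    cases = []
    for bits in product((False, True), repeat=len(toggles)):
        env = dict(fixed)
        for name, on in zip(toggles, bits):
            if on:
                env[name] = "1"
        cases.append(env)
    return cases


cases = build_matrix(
    [SELECTED_ADDR_ENABLE, MATH_SAFE],
    {"DS4_METAL_PREFILL_CHUNK": "2048"},
)
for env in cases:
    print(" ".join(f"{k}={v}" for k, v in env.items()))
```

Each overlay can then be merged into the process environment when launching the repro, for example with `subprocess.run(cmd, env={**os.environ, **env})`, where `cmd` is the issue command.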
I am intentionally not making a performance claim in this PR. The latency and prefill-only spot checks are workload-sensitive; correctness should come first here.
Separate failure
`DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --tool-call-quality` fails on both this branch and clean `upstream/main` at `80ebbc3` with `Metal model range ... is not covered by mapped model views` in the exact path. I filed that separately as #455.