Skip to content

Metal: keep selected-address SSD prefill opt-in by default#454

Draft
andreaborio wants to merge 1 commit into
antirez:mainfrom
andreaborio:fix-439-prefill-selected-addr
Draft

Metal: keep selected-address SSD prefill opt-in by default#454
andreaborio wants to merge 1 commit into
antirez:mainfrom
andreaborio:fix-439-prefill-selected-addr

Conversation

@andreaborio

@andreaborio andreaborio commented Jun 25, 2026

Copy link
Copy Markdown

Summary

  • Keep Metal selected-address SSD-streaming batch prefill opt-in instead of auto-enabling it for Flash/Pro.
  • Leave the ROCm auto path unchanged.
  • Make the Metal SSD-streaming cache-pressure regression exercise the default selected-address behavior instead of inheriting caller env overrides.

Related to #439. This PR narrows one bad default path, but I am not claiming it fully closes #439 across all Apple Metal setups.

Current understanding

On my M5 Pro, the selected-address batch prefill path and the canonical path produce different logits for the short_code_completion repro. With the PR default, Metal no longer auto-enables selected-address prefill and the CLI repro selects the expected lowercase c. Explicitly opting selected-address back in still selects uppercase C.

A follow-up report in #439 suggests the behavior can also depend on math mode and hardware. I re-ran a wider matrix with the issue command (DS4_METAL_PREFILL_CHUNK=2048, --dump-logprobs, top-k 10) on both this branch and clean upstream/main at 80ebbc3. On this M5 Pro, DS4_METAL_MATH_SAFE=1 is not a universal fix: it flips the argmax differently depending on whether selected-address prefill is enabled.

So the conservative change here is only: do not auto-enable the selected-address path on Metal while its numerics are not understood. The path remains available for explicit profiling/debugging.

Validation

Machine/backend: Apple M5 Pro, 64 GiB RAM, Metal.
Model: DeepSeek-V4-Flash IQ2XXS/Q2_K imatrix GGUF.

Build/test checks run:

  • make clean && make && make ds4_test ds4_agent_test q4k-dot-test
  • ./ds4_test --server
  • ./ds4_test --metal-kernels
  • ./ds4_agent_test
  • DS4_TEST_MODEL=... ./ds4_test --metal-ssd-streaming-cache-pressure
  • DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --logprob-vectors
  • DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --long-context
  • DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --think-tool-recovery
  • make cpu
  • git diff --check

Additional CLI matrix for #439 repro (DS4_METAL_PREFILL_CHUNK=2048, --logprobs-top-k 10):

Tree Flags Selected logit(c) logit(C) c-C
upstream/main default C 34.5451698 34.7262421 -0.1810723
upstream/main DS4_METAL_DISABLE_STREAMING_PREFILL_BATCH_SELECTED_ADDR=1 c 35.8293610 34.2557144 1.5736466
upstream/main DS4_METAL_MATH_SAFE=1 c 35.5860977 35.0353928 0.5507049
upstream/main math-safe + selected-address disabled C 34.0360031 34.5594215 -0.5234184
this PR default c 35.8293610 34.2557144 1.5736466
this PR DS4_METAL_ENABLE_STREAMING_PREFILL_BATCH_SELECTED_ADDR=1 C 34.5451698 34.7262421 -0.1810723
this PR DS4_METAL_MATH_SAFE=1 C 34.0360031 34.5594215 -0.5234184
this PR math-safe + selected-address enabled c 35.5860977 35.0353928 0.5507049

Repeated critical cases in a different order reproduced the same logits/token choices.

I am intentionally not making a performance claim in this PR. The latency and prefill-only spot checks are workload-sensitive; correctness should come first here.

Separate failure

DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --tool-call-quality fails on both this branch and clean upstream/main at 80ebbc3 with Metal model range ... is not covered by mapped model views in the exact path. I filed that separately as #455.

@DanteCpp

Copy link
Copy Markdown

If I am interpreting your speed spot check correctly the DS4_METAL_ENABLE_STREAMING_PREFILL_BATCH_SELECTED_ADDR=1 optimization is actually a regration, on top of compromising the graph.

At the end of the day, for a sufficiently large batch it is reasonable to think that most experts will be needed. Maybe for small prompts it pays off...

Once the correctness is established, we should measure the effectiveness of this optimization.

Ciao,
Dante

@andreaborio

Copy link
Copy Markdown
Author

Yes, same read here!
For now I think the safest thing is just to stop enabling it by default: it stays available for testing, but it should not be on automatically while it can change the selected token.

And yes, the speed number is just a small sanity check, not a real benchmark. If/when the path is made correct, it probably needs a proper sweep across prompt sizes before turning it back on.

Ciao
Andrea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Metal SSD-streaming cache-pressure vector test fails at short_code_completion

2 participants