Metal: keep selected-address SSD prefill opt-in by default by andreaborio · Pull Request #454 · antirez/ds4

andreaborio · 2026-06-25T08:27:45Z

Summary

Keep Metal selected-address SSD-streaming batch prefill opt-in instead of auto-enabling it for Flash/Pro.
Leave the ROCm auto path unchanged.
Make the Metal SSD-streaming cache-pressure regression exercise the default selected-address behavior instead of inheriting caller env overrides.

Related to #439. This PR narrows one bad default path, but I am not claiming it fully closes #439 across all Apple Metal setups.

Current understanding

On my M5 Pro, the selected-address batch prefill path and the canonical path produce different logits for the short_code_completion repro. With the PR default, Metal no longer auto-enables selected-address prefill and the CLI repro selects the expected lowercase c. Explicitly opting selected-address back in still selects uppercase C.

A follow-up report in #439 suggests the behavior can also depend on math mode and hardware. I re-ran a wider matrix with the issue command (DS4_METAL_PREFILL_CHUNK=2048, --dump-logprobs, top-k 10) on both this branch and clean upstream/main at 80ebbc3. On this M5 Pro, DS4_METAL_MATH_SAFE=1 is not a universal fix: it flips the argmax differently depending on whether selected-address prefill is enabled.

So the conservative change here is only: do not auto-enable the selected-address path on Metal while its numerics are not understood. The path remains available for explicit profiling/debugging.

Validation

Machine/backend: Apple M5 Pro, 64 GiB RAM, Metal.
Model: DeepSeek-V4-Flash IQ2XXS/Q2_K imatrix GGUF.

Build/test checks run:

make clean && make && make ds4_test ds4_agent_test q4k-dot-test
./ds4_test --server
./ds4_test --metal-kernels
./ds4_agent_test
DS4_TEST_MODEL=... ./ds4_test --metal-ssd-streaming-cache-pressure
DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --logprob-vectors
DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --long-context
DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --think-tool-recovery
make cpu
git diff --check

Additional CLI matrix for #439 repro (DS4_METAL_PREFILL_CHUNK=2048, --logprobs-top-k 10):

Tree	Flags	Selected	logit(c)	logit(C)	c-C
upstream/main	default	`C`	34.5451698	34.7262421	-0.1810723
upstream/main	`DS4_METAL_DISABLE_STREAMING_PREFILL_BATCH_SELECTED_ADDR=1`	`c`	35.8293610	34.2557144	1.5736466
upstream/main	`DS4_METAL_MATH_SAFE=1`	`c`	35.5860977	35.0353928	0.5507049
upstream/main	math-safe + selected-address disabled	`C`	34.0360031	34.5594215	-0.5234184
this PR	default	`c`	35.8293610	34.2557144	1.5736466
this PR	`DS4_METAL_ENABLE_STREAMING_PREFILL_BATCH_SELECTED_ADDR=1`	`C`	34.5451698	34.7262421	-0.1810723
this PR	`DS4_METAL_MATH_SAFE=1`	`C`	34.0360031	34.5594215	-0.5234184
this PR	math-safe + selected-address enabled	`c`	35.5860977	35.0353928	0.5507049

Repeated critical cases in a different order reproduced the same logits/token choices.

I am intentionally not making a performance claim in this PR. The latency and prefill-only spot checks are workload-sensitive; correctness should come first here.

Separate failure

DS4_TEST_MODEL=... DS4_TEST_SSD_STREAMING=1 DS4_TEST_SSD_STREAMING_CACHE_GB=16 ./ds4_test --tool-call-quality fails on both this branch and clean upstream/main at 80ebbc3 with Metal model range ... is not covered by mapped model views in the exact path. I filed that separately as #455.

DanteCpp · 2026-06-25T09:54:17Z

If I am interpreting your speed spot check correctly the DS4_METAL_ENABLE_STREAMING_PREFILL_BATCH_SELECTED_ADDR=1 optimization is actually a regration, on top of compromising the graph.

At the end of the day, for a sufficiently large batch it is reasonable to think that most experts will be needed. Maybe for small prompts it pays off...

Once the correctness is established, we should measure the effectiveness of this optimization.

Ciao,
Dante

andreaborio · 2026-06-25T09:59:29Z

Yes, same read here!
For now I think the safest thing is just to stop enabling it by default: it stays available for testing, but it should not be on automatically while it can change the selected token.

And yes, the speed number is just a small sanity check, not a real benchmark. If/when the path is made correct, it probably needs a proper sweep across prompt sizes before turning it back on.

Ciao
Andrea

Disable Metal selected-addr prefill auto path

356b450

andreaborio mentioned this pull request Jun 25, 2026

Metal SSD streaming: tool-call-quality exact path misses routed expert views #455

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Metal: keep selected-address SSD prefill opt-in by default#454

Metal: keep selected-address SSD prefill opt-in by default#454
andreaborio wants to merge 1 commit into
antirez:mainfrom
andreaborio:fix-439-prefill-selected-addr

andreaborio commented Jun 25, 2026 •

edited

Loading

Uh oh!

DanteCpp commented Jun 25, 2026

Uh oh!

andreaborio commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

andreaborio commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Current understanding

Validation

Separate failure

Uh oh!

DanteCpp commented Jun 25, 2026

Uh oh!

andreaborio commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

andreaborio commented Jun 25, 2026 •

edited

Loading