win: fix cuda build #3204

Open
dhiltgen wants to merge 5 commits into ml-explore:main from dhiltgen:fix_win_build
Conversation

@dhiltgen
Contributor

@dhiltgen dhiltgen commented Mar 4, 2026

Adjust recent CUDA changes to build on Windows.

Proposed changes

HEAD on main currently fails to build on Windows with CUDA enabled. This gets the build working.

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Collaborator

@zcbenz zcbenz left a comment

Thanks!

When testing with a Go client, eval can be called on different OS
threads on Windows, which causes crashes because the CUDA device is not
properly initialized on those threads. This manifests as sporadic crashes during JIT.
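
The failure mode described above follows from the CUDA runtime tracking the current device per host thread, so a call such as `cudaSetDevice` on one thread does not carry over to another. The following is a minimal sketch of a per-thread guard, not the code from this PR. To stay runnable without a GPU it uses a hypothetical stand-in, `bind_device_to_current_thread`, where real code would call into the CUDA runtime; `ensure_device`, `eval_on_device`, and `run_evals` are likewise illustrative names.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-thread runtime call such as
// cudaSetDevice(device). It only counts invocations so that the sketch
// runs without a GPU.
std::atomic<int> g_bind_calls{0};

void bind_device_to_current_thread(int /*device*/) {
  g_bind_calls.fetch_add(1, std::memory_order_relaxed);
}

// Make sure the calling OS thread has bound the device before GPU work.
void ensure_device(int device) {
  thread_local int bound_device = -1;  // -1: nothing bound on this thread yet
  if (bound_device != device) {
    bind_device_to_current_thread(device);
    bound_device = device;
  }
}

// Hypothetical eval entry point: bind first, then do the actual work.
void eval_on_device(int device) {
  ensure_device(device);
  // ... JIT compilation and kernel launches would go here ...
}

// Run several evals on each of `num_threads` fresh OS threads and return
// how many times the device was bound in total.
int run_evals(int num_threads, int evals_per_thread) {
  g_bind_calls = 0;
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([evals_per_thread] {
      for (int i = 0; i < evals_per_thread; ++i) {
        eval_on_device(0);
      }
    });
  }
  for (auto& th : threads) {
    th.join();
  }
  return g_bind_calls.load();
}
```

With this guard, each OS thread binds the device once on its first eval and skips the call afterwards, regardless of which thread the caller's scheduler picks.
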
@dhiltgen
Contributor Author

dhiltgen commented Mar 7, 2026

While testing more deeply I noticed sporadic JIT crashes on Windows CUDA and traced it down to missing device initialization, which I've fixed in a discrete commit. Let me know if you want that split out to a different PR.

@zcbenz
Collaborator

zcbenz commented Mar 8, 2026

The new change looks good to me, thanks for fixing it! Still need to wait for the CI environment to get fixed before I can merge.

cuDNN creates a new graph each time, which is expensive with WDDM.
@dhiltgen
Contributor Author

dhiltgen commented Mar 8, 2026

I fixed another performance issue. I can move it to another PR as well if needed. It looks like cuDNN should be disabled by default on Windows due to how it interacts with WDDM. Using mlx-community/Qwen3-0.6B-4bit to test with an RTX 6000 Ada on Win 11:

  ┌───────────┬─────────────────┬───────────────────┐
  │   Mode    │ Prefill (p2048) │ Generation (g128) │
  ├───────────┼─────────────────┼───────────────────┤
  │ cuDNN ON  │ 2,022 tok/s     │ 10.4 tok/s        │
  ├───────────┼─────────────────┼───────────────────┤
  │ cuDNN OFF │ 4,822 tok/s     │ 480 tok/s         │
  └───────────┴─────────────────┴───────────────────┘
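
One way to express "disabled by default on Windows" is a platform-dependent default that can still be overridden at run time. The sketch below is illustrative only and is not taken from this PR: the environment variable `EXAMPLE_USE_CUDNN_SDPA` and all function names are hypothetical, and the only real facility assumed is the `_WIN32` macro that Windows compilers predefine.

```cpp
#include <cstdlib>
#include <string>

// Platform default for this sketch: off on Windows, on elsewhere.
constexpr bool default_use_cudnn_sdpa() {
#if defined(_WIN32)
  return false;
#else
  return true;
#endif
}

// Resolve the setting from an optional override string. A null or empty
// override falls back to the platform default; "0" disables; any other
// value enables.
bool resolve_use_cudnn_sdpa(const char* override_value) {
  if (override_value == nullptr || *override_value == '\0') {
    return default_use_cudnn_sdpa();
  }
  return std::string(override_value) != "0";
}

// Reads the hypothetical EXAMPLE_USE_CUDNN_SDPA environment variable.
bool use_cudnn_sdpa() {
  return resolve_use_cudnn_sdpa(std::getenv("EXAMPLE_USE_CUDNN_SDPA"));
}
```

Keeping an explicit override makes it possible to re-run a comparison like the one in the table without rebuilding.
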

@zcbenz
Collaborator

zcbenz commented Mar 8, 2026

I'm good disabling cuDNN SDPA for Windows, but can you check with NVIDIA whether this is something that can be fixed? Our fallback kernel is not good for prefill and long context decoding.
