[BUG] JACCL RDMA consecutive send/recv point-to-point with different shape produces wrong data or hangs. #3149

@dev-tb5-tester

Describe the bug
Consecutive point-to-point send/recv calls between two nodes that use different shapes fail on the JACCL backend: either the wrong data is sent or received, or the system hangs. Adding an all_sum barrier before each send/recv appears to fix it. The ring backend always works with the same complex sequences of communications.
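For readers without the repo at hand, the pattern described above can be sketched roughly as follows. This is not the content of test_fail.py or test_pass.py: the shapes, fill values, and the `exchange` helper are hypothetical, and the backend is assumed to be selected by the launch configuration. It uses `mx.distributed.send`, `mx.distributed.recv`, and `mx.distributed.all_sum`, with `barrier=True` corresponding to the workaround and `barrier=False` to the failing sequence.

```python
# Sketch only: shapes, values, and helper names are illustrative assumptions,
# not the actual reproduction scripts from the linked repository.
try:
    import mlx.core as mlx_core
except ImportError:  # keeps the helper importable on machines without MLX
    mlx_core = None

# Hypothetical sequence of consecutive messages with different shapes.
SHAPES = [(4,), (2, 3), (8, 8)]


def exchange(mx, rank, shapes=SHAPES, barrier=True):
    """Rank 0 sends one array per shape to rank 1; rank 1 returns what it received.

    `mx` is the mlx.core module, passed in explicitly.
    """
    received = []
    for i, shape in enumerate(shapes):
        if barrier:
            # Workaround from the report: an all_sum before each send/recv pair.
            mx.eval(mx.distributed.all_sum(mx.ones((1,))))
        if rank == 0:
            # send returns an array; evaluating it performs the send.
            mx.eval(mx.distributed.send(mx.full(shape, float(i + 1)), 1))
        else:
            buf = mx.distributed.recv(shape, mx.float32, 0)
            mx.eval(buf)
            received.append(buf)
    return received


if __name__ == "__main__" and mlx_core is not None:
    world = mlx_core.distributed.init()
    if world.size() == 2:  # only meaningful when launched on two nodes
        for i, arr in enumerate(exchange(mlx_core, world.rank())):
            # Message i should arrive filled with the value i + 1.
            assert mlx_core.all(arr == float(i + 1)).item(), f"message {i} corrupted"
```

Running the same script with `barrier=False` on both backends would be one way to compare JACCL against ring, assuming the two-node launch setup described in the repository.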

To Reproduce
I put the detailed setup instructions for my system, which uses conda and the MLX config setup, in a repo. It also contains all the hostfiles generated by the setup and two test scripts (test_fail.py / test_pass.py):
https://github.com/dev-tb5-tester/rdma_jaccl

Expected behavior
Complex series of send/recv calls on the JACCL backend should work without workarounds, but currently they do not. Prepending a barrier before each pair of point-to-point communications should not be required.

Desktop (please complete the following information):

  • OS Version: macOS 26.3
  • Environment: M4 Max + M4 Pro, Thunderbolt 5, RDMA enabled.
  • MLX version: 0.30.6

Additional context
Please see the README in the repository linked above for reproduction details and results.
