Describe the bug
Consecutive point-to-point send/recv calls with different shapes between two nodes fail on the JACCL backend: either the wrong data is sent/received, or the system hangs. Adding an all_sum barrier before each send/recv seems to fix it. The ring backend always works fine with the same complex sequences of communications.
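For illustration, here is a minimal sketch of the pattern described above. It is not a copy of the test scripts in the repo: the shapes and the names `SHAPES`, `build_plan` and `run` are hypothetical, and it assumes the backend can be selected with `mx.distributed.init(backend="jaccl")`.

```python
# Illustrative sketch only; NOT the repo's test_fail.py / test_pass.py.
# SHAPES, build_plan and run are hypothetical names chosen for this example.

SHAPES = [(4, 8), (16,), (2, 3, 5)]  # assumed example shapes, one per exchange


def build_plan(rank, shapes, use_barrier):
    """Return the ordered list of (operation, shape) steps one rank performs."""
    plan = []
    for shape in shapes:
        if use_barrier:
            # Workaround: an all_sum used as a barrier before each exchange.
            plan.append(("barrier", None))
        # Rank 0 sends and rank 1 receives, each time with a different shape.
        plan.append(("send" if rank == 0 else "recv", shape))
    return plan


def run(use_barrier, backend="jaccl", shapes=SHAPES):
    """Execute the plan on two nodes; must be launched on both ranks."""
    import mlx.core as mx  # imported lazily so build_plan works without MLX

    group = mx.distributed.init(backend=backend)  # assumed backend selection
    rank = group.rank()
    peer = 1 - rank  # two-node setup: the other rank
    for op, shape in build_plan(rank, shapes, use_barrier):
        if op == "barrier":
            mx.eval(mx.distributed.all_sum(mx.array(1.0), group=group))
        elif op == "send":
            mx.eval(mx.distributed.send(mx.ones(shape), peer, group=group))
        else:
            out = mx.distributed.recv(shape, mx.float32, peer, group=group)
            mx.eval(out)
            # The sender transmits all ones, so the sum must equal the size.
            assert out.sum().item() == float(out.size), f"wrong data for {shape}"
```

Calling `run(use_barrier=False)` on both nodes corresponds to the failing pattern, and `run(use_barrier=True)` to the workaround. Passing `backend="ring"` gives the comparison case.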
To Reproduce
I put the detailed setup instructions for my system (using conda and the MLX config setup) in a repo, together with all the hostfiles generated by the setup and two test scripts (test_fail.py / test_pass.py):
https://github.com/dev-tb5-tester/rdma_jaccl
Expected behavior
A complex series of send/recv calls should work on the JACCL backend, as it does on the ring backend; currently it does not. Prepending a barrier before each pair of point-to-point communications should not be required.
Desktop (please complete the following information):
- OS Version: macOS 26.3
- Environment: M4 Max + M4 Pro, Thunderbolt 5, RDMA enabled.
- MLX version: 0.30.6
Additional context
Please see the README of the repository linked above for the reproduction details and results.