Add IPC domain rank detection #824

Merged
Binyang2014 merged 4 commits into main from binyli/ipc-domain-core
Jun 30, 2026

@Binyang2014

Contributor

Expose the number of ranks in each GPU IPC domain through Bootstrap and Python bindings, using NVML fabric information on CUDA when available and falling back to host-local ranks otherwise.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014 Binyang2014 marked this pull request as ready for review June 28, 2026 23:34
@Binyang2014 Binyang2014 requested a review from Copilot June 29, 2026 18:01
@Binyang2014 Binyang2014 requested a review from a team June 29, 2026 18:02

Copilot AI left a comment

Contributor

Pull request overview

This PR adds a new “IPC domain” concept to Bootstrap, exposing the number of ranks that share a GPU IPC domain (using NVML GPU fabric info on CUDA when available, otherwise falling back to host-local grouping), and wires it through to the Python bindings.

Changes:

  • Add Bootstrap::getNranksPerIpcDomain() API with TcpBootstrap implementation that groups ranks by an IPC-domain hash.
  • Implement CUDA IPC-domain hashing via NVML GPU fabric info with fallback to host hash.
  • Expose the new API in Python bindings and add a basic mp_unit test.
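The grouping described in the list above can be illustrated with a minimal sketch. This is not the mscclpp implementation: `ipc_domain_hash` and `n_ranks_per_ipc_domain` are hypothetical names, the `fabric_id` bytes stand in for whatever identity the NVML fabric query returns, and the list of hashes is passed in directly where a real bootstrap would presumably exchange them between ranks.

```python
import hashlib
from collections import Counter
from typing import Optional, Sequence


def ipc_domain_hash(hostname: str, fabric_id: Optional[bytes] = None) -> int:
    """Hypothetical helper: hash the fabric identity when one is available,
    otherwise fall back to hashing the host name (host-local grouping)."""
    if fabric_id is not None:
        key = b"fabric:" + fabric_id
    else:
        key = b"host:" + hostname.encode("utf-8")
    # Truncate a SHA-256 digest to 64 bits; any stable hash would do here.
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "little")


def n_ranks_per_ipc_domain(all_hashes: Sequence[int], rank: int) -> int:
    """Count the ranks whose hash matches the hash of `rank`.

    `all_hashes[i]` is the IPC-domain hash reported by rank i (assumed to
    have been gathered from every rank beforehand)."""
    return Counter(all_hashes)[all_hashes[rank]]


# Four ranks on two hosts, no fabric info: each domain is one host.
hosts = ["node0", "node0", "node1", "node1"]
host_only = [ipc_domain_hash(h) for h in hosts]

# Same ranks, but all GPUs report the same (made-up) fabric identity.
with_fabric = [ipc_domain_hash(h, b"cluster-A") for h in hosts]
```

In this sketch, ranks that share a fabric identity land in one domain even when they sit on different hosts, while ranks without fabric information are grouped per host, which mirrors the fallback behavior the summary describes.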

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `test/mp_unit/mp_unit_tests.hpp` | Declares the new IPC-domain bootstrap test helper. |
| `test/mp_unit/bootstrap_tests.cc` | Adds a simple assertion-based test for the new IPC-domain API and updates the MPI test bootstrap to override it. |
| `src/core/utils_internal.cc` | Adds NVML-based fabric hashing (CUDA) and exposes getIpcDomainHash() with fallback behavior. |
| `src/core/include/utils_internal.hpp` | Declares getIpcDomainHash() for internal callers. |
| `src/core/bootstrap/bootstrap.cc` | Adds base default and TcpBootstrap implementation of getNranksPerIpcDomain(). |
| `python/mscclpp/_core/comm.py` | Exposes IPC-domain rank count on the Python Comm wrapper. |
| `python/csrc/core_py.cpp` | Binds get_n_ranks_per_ipc_domain into the Python extension module. |
| `include/mscclpp/core.hpp` | Extends the public C++ Bootstrap API with getNranksPerIpcDomain(). |
| `CMakeLists.txt` | Links NVML on CUDA builds to support the new NVML-based query path. |

Comment thread src/core/utils_internal.cc
Comment thread src/core/utils_internal.cc
Comment thread CMakeLists.txt
Comment thread python/mscclpp/_core/comm.py Outdated
@Binyang2014

Contributor Author

/azp run mscclpp-ut

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread src/core/utils_internal.cc
Comment thread src/core/bootstrap/bootstrap.cc

@caiomcbr caiomcbr left a comment

Contributor

LGTM

@Binyang2014 Binyang2014 merged commit 197d62c into main Jun 30, 2026
16 checks passed
@Binyang2014 Binyang2014 deleted the binyli/ipc-domain-core branch June 30, 2026 16:47