
[nccl-ep] Add ncclEpUpdateHandle to rebind topk_idx without reallocating #2085

Open
kwen2501 wants to merge 3 commits into NVIDIA:master from kwen2501:add-update-handle

Conversation

Collaborator

kwen2501 commented Apr 2, 2026

Summary

  • Add ncclEpUpdateHandle API that rebinds topk_idx on an existing handle without reallocating GPU buffers. This avoids unnecessary cudaMalloc/cudaFree traffic (both device-synchronizing) in MoE hot loops where only the routing changes between iterations.
  • Refactor ncclEpCreateHandle to delegate the computation phase to ncclEpUpdateHandle, eliminating code duplication.
  • Fix dense_prob_buffer sizing to use max_tokens_per_rank instead of num_tokens so the buffer is reusable across different token counts.
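The sizing fix in the last bullet can be illustrated with a toy helper (a sketch only; the `size_dense_prob_buffer` function, the `num_experts` parameter, and the `float` element type are assumptions for illustration, not the actual NCCL code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizing helper. Sizing by max_tokens_per_rank (a
 * group-level constant) rather than the per-iteration num_tokens
 * means the buffer is allocated once and fits every later token
 * count up to the maximum, so it can be reused across iterations. */
static size_t size_dense_prob_buffer(size_t max_tokens_per_rank,
                                     size_t num_experts) {
    return max_tokens_per_rank * num_experts * sizeof(float);
}
```

Any iteration with `num_tokens <= max_tokens_per_rank` then fits in the buffer allocated at create time, with no per-iteration reallocation.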

RFC: #2084

API

ncclResult_t ncclEpUpdateHandle(
    ncclEpHandle_t handle,
    const ncclNDTensor_t* topk_idx,
    ncclNDTensor_t* const* local_tensors,
    unsigned int num_local_tensors,
    cudaStream_t stream
);

Usage

ncclEpCreateHandle(&handle, group, topk_idx_0, ...);

// Each iteration: rebind routing without realloc
ncclEpUpdateHandle(handle, topk_idx_1, ...);
ncclEpDispatch(handle, ...);
ncclEpCombine(handle, ...);

ncclEpHandleDestroy(handle);
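The create-once/update-many split above can be mocked with a toy handle that separates allocation from the routing-dependent phase (a sketch only; the `toy_*` names, the struct layout, and the use of host `malloc` in place of `cudaMalloc` are assumptions for illustration, not the NCCL implementation):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for an EP handle: buffers sized once from group-level
 * constants, plus a routing pointer that is rebound per iteration. */
typedef struct {
    float *dense_prob_buffer; /* allocated once at create time */
    const int *topk_idx;      /* rebound by update, no realloc */
    int alloc_calls;          /* counts buffer allocations */
} toy_handle_t;

/* Update: rebind routing and (in the real API) re-run only the
 * topk_idx-dependent computation. No allocation happens here. */
static int toy_update(toy_handle_t *h, const int *topk_idx) {
    h->topk_idx = topk_idx;
    return 0;
}

/* Create: allocation-only, then delegate the computation phase to
 * update, mirroring the refactored ncclEpCreateHandle. */
static int toy_create(toy_handle_t *h, size_t max_tokens_per_rank,
                      size_t num_experts, const int *topk_idx) {
    h->dense_prob_buffer =
        malloc(max_tokens_per_rank * num_experts * sizeof(float));
    if (!h->dense_prob_buffer) return -1;
    h->alloc_calls = 1;
    return toy_update(h, topk_idx); /* shared computation phase */
}

static void toy_destroy(toy_handle_t *h) { free(h->dense_prob_buffer); }
```

Repeated `toy_update` calls leave `alloc_calls` at 1, which is the point of the API: only `toy_create` and `toy_destroy` touch the allocator.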

Changes

  • nccl_ep.h: Add ncclEpUpdateHandle declaration
  • nccl_ep.cc: Add ncclEpUpdateHandle implementation (resets buffers + re-runs convert/allgather/preprocess). Refactor ncclEpCreateHandle to be allocation-only, delegating computation to ncclEpUpdateHandle.
  • nccl_wrapper.py: Add ctypes binding for ncclEpUpdateHandle

Co-authored-by: Claude <noreply@anthropic.com>

Made with Cursor

kwen2501 and others added 3 commits April 2, 2026 02:15
…ndle`

The `ncclAllGather` and `call_metadata_preprocessing` calls are on the same CUDA
stream, so intra-stream ordering already guarantees the allgather
completes before the preprocessing kernel launches. Additionally,
nothing else appears to use the allgather result
`global_routing_map`.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Ke Wen <kwen@nvidia.com>
ncclEpCreateHandle allocates ~8 GPU buffers whose sizes depend only on
group-level constants, yet the current API forces a full destroy+create
cycle whenever topk_idx changes between iterations. This causes
unnecessary cudaMalloc/cudaFree traffic (both device-synchronizing).

Add ncclEpUpdateHandle that resets and re-runs only the topk_idx-dependent
computation (convert_topk_to_routing_map, ncclAllGather, and
call_metadata_preprocessing) on an existing handle. Refactor
ncclEpCreateHandle to delegate to ncclEpUpdateHandle for the computation
phase, eliminating code duplication.

Also fix dense_prob_buffer sizing to use max_tokens_per_rank instead of
num_tokens so the buffer is large enough for reuse across different
token counts.

Co-authored-by: Claude <noreply@anthropic.com>
RFC: NVIDIA#2084
Signed-off-by: Ke Wen <kwen@nvidia.com>
Made-with: Cursor
kwen2501 force-pushed the add-update-handle branch from 71a21ec to df56902 on April 2, 2026 02:16