
[nccl-ep] Add ncclEpUpdateHandle to rebind topk_idx without reallocating #2085

Open
kwen2501 wants to merge 3 commits into NVIDIA:master from kwen2501:add-update-handle

Conversation

Collaborator

kwen2501 commented Apr 2, 2026

Summary

  • Add ncclEpUpdateHandle API that rebinds topk_idx on an existing handle without reallocating GPU buffers. This avoids unnecessary cudaMalloc/cudaFree traffic (both device-synchronizing) in MoE hot loops where only the routing changes between iterations.
  • Refactor ncclEpCreateHandle to delegate the computation phase to ncclEpUpdateHandle, eliminating code duplication.
  • Fix dense_prob_buffer sizing to use max_tokens_per_rank instead of num_tokens so the buffer is reusable across different token counts.
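The sizing fix in the last bullet can be illustrated with a toy helper (a sketch only; the `size_dense_prob_buffer` function, the `num_experts` parameter, and the `float` element type are assumptions for illustration, not the actual NCCL code):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizing helper. Sizing by max_tokens_per_rank (a
 * group-level constant) rather than the per-iteration num_tokens
 * means the buffer is allocated once and fits every later token
 * count up to the maximum, so it can be reused across iterations. */
static size_t size_dense_prob_buffer(size_t max_tokens_per_rank,
                                     size_t num_experts) {
    return max_tokens_per_rank * num_experts * sizeof(float);
}
```

Any iteration with `num_tokens <= max_tokens_per_rank` then fits in the buffer allocated at create time, with no per-iteration reallocation.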

RFC: #2084

API

ncclResult_t ncclEpUpdateHandle(
    ncclEpHandle_t handle,
    const ncclNDTensor_t* topk_idx,
    ncclNDTensor_t* const* local_tensors,
    unsigned int num_local_tensors,
    cudaStream_t stream
);

Usage

ncclEpCreateHandle(&handle, group, topk_idx_0, ...);

// Each iteration: rebind routing without realloc
ncclEpUpdateHandle(handle, topk_idx_1, ...);
ncclEpDispatch(handle, ...);
ncclEpCombine(handle, ...);

ncclEpHandleDestroy(handle);
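The create-once/update-many split above can be mocked with a toy handle that separates allocation from the routing-dependent phase (a sketch only; the `toy_*` names, the struct layout, and the use of host `malloc` in place of `cudaMalloc` are assumptions for illustration, not the NCCL implementation):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for an EP handle: buffers sized once from group-level
 * constants, plus a routing pointer that is rebound per iteration. */
typedef struct {
    float *dense_prob_buffer; /* allocated once at create time */
    const int *topk_idx;      /* rebound by update, no realloc */
    int alloc_calls;          /* counts buffer allocations */
} toy_handle_t;

/* Update: rebind routing and (in the real API) re-run only the
 * topk_idx-dependent computation. No allocation happens here. */
static int toy_update(toy_handle_t *h, const int *topk_idx) {
    h->topk_idx = topk_idx;
    return 0;
}

/* Create: allocation-only, then delegate the computation phase to
 * update, mirroring the refactored ncclEpCreateHandle. */
static int toy_create(toy_handle_t *h, size_t max_tokens_per_rank,
                      size_t num_experts, const int *topk_idx) {
    h->dense_prob_buffer =
        malloc(max_tokens_per_rank * num_experts * sizeof(float));
    if (!h->dense_prob_buffer) return -1;
    h->alloc_calls = 1;
    return toy_update(h, topk_idx); /* shared computation phase */
}

static void toy_destroy(toy_handle_t *h) { free(h->dense_prob_buffer); }
```

Repeated `toy_update` calls leave `alloc_calls` at 1, which is the point of the API: only `toy_create` and `toy_destroy` touch the allocator.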

Changes

  • nccl_ep.h: Add ncclEpUpdateHandle declaration
  • nccl_ep.cc: Add ncclEpUpdateHandle implementation (resets buffers + re-runs convert/allgather/preprocess). Refactor ncclEpCreateHandle to be allocation-only, delegating computation to ncclEpUpdateHandle.
  • nccl_wrapper.py: Add ctypes binding for ncclEpUpdateHandle

Co-authored-by: Claude <noreply@anthropic.com>

Made with Cursor

kwen2501 and others added 3 commits April 2, 2026 02:15
…ndle`

The `ncclAllGather` and `call_metadata_preprocessing` calls are on the same CUDA
stream, so intra-stream ordering already guarantees the allgather
completes before the preprocessing kernel launches. Additionally,
nothing else appears to use the allgather result
`global_routing_map`.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Ke Wen <kwen@nvidia.com>
ncclEpCreateHandle allocates ~8 GPU buffers whose sizes depend only on
group-level constants, yet the current API forces a full destroy+create
cycle whenever topk_idx changes between iterations. This causes
unnecessary cudaMalloc/cudaFree traffic (both device-synchronizing).

Add ncclEpUpdateHandle that resets and re-runs only the topk_idx-dependent
computation (convert_topk_to_routing_map, ncclAllGather, and
call_metadata_preprocessing) on an existing handle. Refactor
ncclEpCreateHandle to delegate to ncclEpUpdateHandle for the computation
phase, eliminating code duplication.

Also fix dense_prob_buffer sizing to use max_tokens_per_rank instead of
num_tokens so the buffer is large enough for reuse across different
token counts.

Co-authored-by: Claude <noreply@anthropic.com>
RFC: NVIDIA#2084
Signed-off-by: Ke Wen <kwen@nvidia.com>
Made-with: Cursor
kwen2501 force-pushed the add-update-handle branch from 71a21ec to df56902 on April 2, 2026 02:16