support gpu direct rdma #3144
base: master
Conversation
1) Receive all data on the GPU first.
2) The GPU block is allocated from a GPU block pool.
3) The brpc header, meta and body are copied from GPU to CPU for processing.
4) To decrease the number of D2H copies, we prefetch 512B to host memory.

Co-authored-by: sunce4t <[email protected]>
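For illustration, the receive path described above might look roughly like this (a minimal sketch; `OnGdrMessage` and `kPrefetchBytes` are hypothetical names, and the 512-byte prefetch size follows the description above):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Illustrative sketch of the GDR receive path described in this PR;
// names here are not the PR's actual API.
constexpr size_t kPrefetchBytes = 512;  // header + meta usually fit here

void OnGdrMessage(const void* gpu_block, size_t len, cudaStream_t stream) {
    // 1) The NIC has already written the whole message into a block
    //    allocated from the GPU block pool (gpu_block).
    // 2) Prefetch only the first 512B to host memory so the brpc header
    //    and meta can be parsed on the CPU, avoiding a full D2H copy.
    char host_buf[kPrefetchBytes];
    const size_t n = len < kPrefetchBytes ? len : kPrefetchBytes;
    cudaMemcpyAsync(host_buf, gpu_block, n, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    // 3) Parse header/meta from host_buf; the body stays in GPU memory
    //    until the user asks for it (e.g. via IOBuf::copy_from_gpu).
}
```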
Pull Request Overview
This PR adds GPU Direct RDMA (GDR) support to the BRPC framework, enabling efficient data transfer between GPU memory and RDMA-capable network devices. The implementation includes GPU memory pool management, CUDA stream pooling for asynchronous memory operations, and protocol-level changes to handle GPU-resident data.
- Added GPU memory block pool allocator with configurable block sizes (see the sketch after this list)
- Implemented CUDA stream pooling for optimized device-to-host and device-to-device transfers
- Integrated GDR support into IOBuf, RDMA endpoint, and RPC protocol parsing
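For illustration, a fixed-size GPU block pool along the lines of the first item might look like the following (a sketch only, with hypothetical names; the PR's actual interface lives in `src/butil/gpu/gpu_block_pool.h`):

```cpp
#include <cuda_runtime.h>
#include <mutex>
#include <vector>

// Minimal sketch of a fixed-size GPU block pool; in the real
// implementation the block size would come from a configuration flag.
class GpuBlockPool {
public:
    GpuBlockPool(size_t block_size, size_t block_count)
        : block_size_(block_size) {
        for (size_t i = 0; i < block_count; ++i) {
            void* p = nullptr;
            if (cudaMalloc(&p, block_size_) == cudaSuccess) {
                free_blocks_.push_back(p);
            }
        }
    }
    void* Allocate() {
        std::lock_guard<std::mutex> g(mu_);
        if (free_blocks_.empty()) return nullptr;  // pool exhausted
        void* p = free_blocks_.back();
        free_blocks_.pop_back();
        return p;
    }
    void Release(void* p) {
        std::lock_guard<std::mutex> g(mu_);
        free_blocks_.push_back(p);
    }
private:
    size_t block_size_;
    std::mutex mu_;
    std::vector<void*> free_blocks_;
};
```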
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| src/butil/iobuf.h | Added GPU-specific copy and cut methods with conditional compilation guards |
| src/butil/iobuf.cpp | Implemented cutn_from_gpu and copy_from_gpu methods for GPU memory operations |
| src/butil/gpu/gpu_block_pool.h | New header defining GPU memory pool allocator, stream pool, and region management |
| src/butil/gpu/gpu_block_pool.cpp | Implementation of GPU memory allocation, CUDA stream management, and memory copy operations |
| src/brpc/rdma/rdma_helper.h | Added function declaration for GPU index retrieval |
| src/brpc/rdma/rdma_helper.cpp | Added GPU index configuration and initialization logic for GDR block pool |
| src/brpc/rdma/rdma_endpoint.h | Added remote receive window tracking and GDR-specific receive posting method |
| src/brpc/rdma/rdma_endpoint.cpp | Modified flow control, completion handling, and receive buffer posting for GDR support |
| src/brpc/policy/baidu_rpc_protocol.cpp | Enhanced message parsing to detect and handle GPU memory with optimized prefetch |
| bazel/config/BUILD.bazel | Added build configuration for GDR feature flag |
| BUILD.bazel | Added CUDA dependencies and GPU block pool source to build |
src/butil/iobuf.cpp
Outdated
```cpp
clock_gettime(CLOCK_MONOTONIC, &end);
double time_us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3;
```
Copilot AI (Nov 9, 2025)
The variables start, end, and time_us are defined and computed but the result is only used in a commented-out log statement (lines 766-767). Consider removing this timing code if it's not being used, or uncomment the logging if it's needed for debugging.
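A third option, sketched below, is to keep the instrumentation but compile it out by default behind a profiling macro (`BRPC_GDR_PROFILING` here is hypothetical, not something this PR defines):

```cpp
#include <time.h>

// Sketch: keep the timing instrumentation but compile it out by default.
// BRPC_GDR_PROFILING is an illustrative macro, not defined by this PR.
static double ElapsedUs(const timespec& start, const timespec& end) {
    return (end.tv_sec - start.tv_sec) * 1e6 +
           (end.tv_nsec - start.tv_nsec) / 1e3;
}

void GdrCopyWithOptionalTiming(/* ... copy arguments ... */) {
#ifdef BRPC_GDR_PROFILING
    timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
#endif
    // ... perform the GPU copy ...
#ifdef BRPC_GDR_PROFILING
    clock_gettime(CLOCK_MONOTONIC, &end);
    // LOG(INFO) << "GDRCopy took " << ElapsedUs(start, end) << " us";
#endif
}
```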
src/butil/iobuf.cpp
Outdated
```cpp
double time_us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3;
size_t copied_bytes = n - m;

// LOG(INFO) << "GDRCopy: " << copied_bytes << " bytes, "
//           << time_us << " us" << ", to_gpu " << to_gpu;
```
Copilot AI (Nov 9, 2025)
The variables time_us and copied_bytes are computed but only used in commented-out log statements (lines 1421-1422). Consider removing this unused code or uncommenting the logging if it's needed for debugging.
Suggested change (delete the unused lines):

```diff
- double time_us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3;
- size_t copied_bytes = n - m;
- // LOG(INFO) << "GDRCopy: " << copied_bytes << " bytes, "
- //           << time_us << " us" << ", to_gpu " << to_gpu;
```
src/butil/iobuf.cpp
Outdated
```cpp
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
```
Copilot AI (Nov 9, 2025)
The timing variables start and end are declared and used but the computed time_us value is only for commented-out logging. If timing is not actively needed, consider removing this instrumentation code.
src/butil/iobuf.cpp
Outdated
```cpp
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
```
Copilot AI (Nov 9, 2025)
The timing variables start and end are declared and used but the computed time_us value is only for commented-out logging. If timing is not actively needed, consider removing this instrumentation code.
Fix code style
Co-authored-by: Copilot <[email protected]>
```diff
  char header_buf[12];
- const size_t n = source->copy_to(header_buf, sizeof(header_buf));
+ size_t n = 0;
```
I think it would be better to move the GPU functionality in baidu_rpc_protocol.cpp into a separate file, e.g. baidu_rpc_with_gpu_protocol.cpp, or to define a new protocol. Currently, the macro definitions generate too many if-else branches.
Maybe putting the GPU functionality into a separate file or defining a new protocol is too heavy-weight; we have reorganized the code to reduce the macro definitions instead. Is that OK? @yanglimingcn
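As an illustration of that reorganization, the scattered guards can be concentrated into one helper so the protocol code keeps a single call site (a sketch; `BRPC_WITH_GDR`, `is_gpu_data()`, and the `copy_from_gpu` signature shown are assumptions, not the PR's actual API):

```cpp
#include "butil/iobuf.h"

// Hypothetical helper hiding the GPU/CPU branch from protocol parsing;
// BRPC_WITH_GDR and is_gpu_data() are illustrative names.
inline size_t CopyHeaderForParsing(const butil::IOBuf& source,
                                   char* buf, size_t len) {
#ifdef BRPC_WITH_GDR
    if (source.is_gpu_data()) {
        // Assumed signature; the PR adds copy_from_gpu to IOBuf.
        return source.copy_from_gpu(buf, len);
    }
#endif
    return source.copy_to(buf, len);
}
```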
I think it's better to put the GDR feature into a separate protocol file. Besides, I have an idea: we could also define a new protocol meta, with both the request meta and the response meta carrying a GDR attribute value; the corresponding meta should also contain the GPU address information that both sides need to register with the IB device. Doing the GDR operations inside the ProcessRpcRequest and ProcessRpcResponse steps would probably be more reasonable.
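As a sketch of that idea, the GDR attributes carried in such a meta might amount to the following fields (illustrative only; this PR does not define this message):

```cpp
#include <cstdint>

// Illustrative sketch of per-call GDR attributes that both sides would
// need to exchange; not a concrete proposal in this PR.
struct GdrMeta {
    uint64_t remote_addr;  // GPU buffer address registered with the HCA
    uint32_t rkey;         // remote key from ibv_reg_mr on that buffer
    uint32_t length;       // registered region size in bytes
    bool     gdr_enabled;  // whether the peer accepts GDR transfers
};
```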
If the server or client wants to know the GPU memory region of the other side, we can send this information with a separate RPC transport. But I do not see the benefits of this approach. Can you explain more?
I believe the long-term solution is to have users register memory with RDMA themselves, allowing them to lay this memory out according to their own data organization.
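Registering user-owned GPU memory with the RDMA device is usually just a matter of passing the CUDA pointer to `ibv_reg_mr` (a sketch; it assumes GPUDirect RDMA support, e.g. the nvidia-peermem kernel module, and error handling is elided):

```cpp
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

// Sketch: register a user-allocated GPU buffer with the HCA so the NIC
// can DMA into it directly. Assumes GPUDirect RDMA is available.
ibv_mr* RegisterGpuBuffer(ibv_pd* pd, size_t len) {
    void* gpu_buf = nullptr;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess) return nullptr;
    ibv_mr* mr = ibv_reg_mr(pd, gpu_buf, len,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_WRITE |
                            IBV_ACCESS_REMOTE_READ);
    // mr->lkey/rkey are then exchanged with the peer; on failure the
    // GPU buffer should be freed with cudaFree (omitted here).
    return mr;
}
```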
In this PR, brpc receives data into fragmented GPU blocks, and the user must call IOBuf::copy_from_gpu to copy these GPU blocks into contiguous HBM before using them. This D2D copy is time-consuming and unnecessary. In the future, we could let the user specify the GPU destination directly, with an interface like "rdma_memory_pool_user_specified_memory", and receive the data into it with opcodes like IBV_WR_RDMA_READ/IBV_WR_RDMA_WRITE; then the D2D copy can be skipped. Furthermore, we could parse the brpc protocol with a GPU kernel and initiate the RDMA communication from the GPU, so that both the control path and the data path live on the GPU, like NCCL GIN.
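A one-sided write into such a user-specified GPU region would then look roughly like the verbs-level sketch below (`remote_addr` and `rkey` would come from whatever meta exchange the peers agree on; all names are illustrative):

```cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Sketch: post an RDMA WRITE so the payload lands directly in the
// destination GPU buffer, skipping the extra D2D copy.
int PostGpuWrite(ibv_qp* qp, ibv_mr* local_mr, size_t len,
                 uint64_t remote_addr, uint32_t rkey) {
    ibv_sge sge = {};
    sge.addr   = reinterpret_cast<uint64_t>(local_mr->addr);
    sge.length = static_cast<uint32_t>(len);
    sge.lkey   = local_mr->lkey;

    ibv_send_wr wr = {};
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    ibv_send_wr* bad = nullptr;
    return ibv_post_send(qp, &wr, &bad);  // 0 on success
}
```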
What problem does this PR solve?
Issue Number: resolve #3102
Problem Summary:
What is changed and the side effects?
Changed:
Side effects:
Performance effects:
Breaking backward compatibility:
Check List: