Conversation

@randomkang commented Nov 9, 2025

What problem does this PR solve?

Issue Number: resolve #3102

Problem Summary:

What is changed and the side effects?

Changed:

  1. Receive all data into GPU memory first.
  2. GPU blocks are allocated from a GPU block pool.
  3. The brpc header, meta, and body are copied from GPU to CPU for processing.
  4. To reduce the number of d2h copies, the first 512B are prefetched to host memory (see the sketch below).
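
A minimal sketch of that receive path, assuming illustrative names (GpuBlock, PrefetchAndParse, kPrefetchBytes are not the PR's actual identifiers); only the CUDA runtime calls are real API:

```cpp
// Sketch only: the receive lands in a pool-owned GPU block; one async d2h
// copy prefetches the first 512B so header and meta can be parsed on the CPU.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

constexpr size_t kPrefetchBytes = 512;  // assumed constant matching item 4

struct GpuBlock {
    void*  dev_ptr;  // device memory handed out by the GPU block pool
    size_t size;     // bytes the NIC wrote into this block
};

// Copy the first kPrefetchBytes to the host in a single d2h transfer,
// instead of one small copy each for header, meta, and body prefix.
bool PrefetchAndParse(const GpuBlock& block, cudaStream_t stream) {
    char host_buf[kPrefetchBytes];
    const size_t n = std::min(block.size, kPrefetchBytes);
    if (cudaMemcpyAsync(host_buf, block.dev_ptr, n,
                        cudaMemcpyDeviceToHost, stream) != cudaSuccess) {
        return false;
    }
    cudaStreamSynchronize(stream);  // wait for the prefetch to land
    // ... parse the 12-byte baidu_std header and the meta from host_buf ...
    return true;
}
```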

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Co-authored-by: sunce4t <[email protected]>

Copilot AI left a comment

Pull Request Overview

This PR adds GPU Direct RDMA (GDR) support to the BRPC framework, enabling efficient data transfer between GPU memory and RDMA-capable network devices. The implementation includes GPU memory pool management, CUDA stream pooling for asynchronous memory operations, and protocol-level changes to handle GPU-resident data.

  • Added GPU memory block pool allocator with configurable block sizes
  • Implemented CUDA stream pooling for optimized device-to-host and device-to-device transfers (see the sketch after this list)
  • Integrated GDR support into IOBuf, RDMA endpoint, and RPC protocol parsing
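
A minimal sketch of such a CUDA stream pool, assuming a round-robin design (the class name and layout are illustrative, not the PR's actual code):

```cpp
// Sketch only: a fixed set of non-blocking CUDA streams handed out
// round-robin so concurrent d2h/d2d copies can overlap.
#include <cuda_runtime.h>
#include <mutex>
#include <vector>

class CudaStreamPool {
public:
    explicit CudaStreamPool(size_t n) : streams_(n) {
        for (cudaStream_t& s : streams_) {
            cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        }
    }
    ~CudaStreamPool() {
        for (cudaStream_t s : streams_) {
            cudaStreamDestroy(s);
        }
    }
    // Next() never blocks on a copy; callers synchronize on the
    // returned stream themselves.
    cudaStream_t Next() {
        std::lock_guard<std::mutex> lock(mu_);
        cudaStream_t s = streams_[idx_];
        idx_ = (idx_ + 1) % streams_.size();
        return s;
    }
private:
    std::mutex mu_;
    size_t idx_ = 0;
    std::vector<cudaStream_t> streams_;
};
```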

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.

| File | Description |
| ---- | ----------- |
| src/butil/iobuf.h | Added GPU-specific copy and cut methods with conditional compilation guards |
| src/butil/iobuf.cpp | Implemented cutn_from_gpu and copy_from_gpu methods for GPU memory operations |
| src/butil/gpu/gpu_block_pool.h | New header defining GPU memory pool allocator, stream pool, and region management |
| src/butil/gpu/gpu_block_pool.cpp | Implementation of GPU memory allocation, CUDA stream management, and memory copy operations |
| src/brpc/rdma/rdma_helper.h | Added function declaration for GPU index retrieval |
| src/brpc/rdma/rdma_helper.cpp | Added GPU index configuration and initialization logic for GDR block pool |
| src/brpc/rdma/rdma_endpoint.h | Added remote receive window tracking and GDR-specific receive posting method |
| src/brpc/rdma/rdma_endpoint.cpp | Modified flow control, completion handling, and receive buffer posting for GDR support |
| src/brpc/policy/baidu_rpc_protocol.cpp | Enhanced message parsing to detect and handle GPU memory with optimized prefetch |
| bazel/config/BUILD.bazel | Added build configuration for GDR feature flag |
| BUILD.bazel | Added CUDA dependencies and GPU block pool source to build |


Comment on lines 763 to 764
clock_gettime(CLOCK_MONOTONIC, &end);
double time_us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3;

Copilot AI Nov 9, 2025

The variables start, end, and time_us are defined and computed but the result is only used in a commented-out log statement (lines 766-767). Consider removing this timing code if it's not being used, or uncomment the logging if it's needed for debugging.

Comment on lines 1418 to 1422
double time_us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3;
size_t copied_bytes = n - m;

// LOG(INFO) << "GDRCopy: " << copied_bytes << " bytes, "
// << time_us << " us" << ", to_gpu " << to_gpu;

Copilot AI Nov 9, 2025

The variables time_us and copied_bytes are computed but only used in commented-out log statements (lines 1421-1422). Consider removing this unused code or uncommenting the logging if it's needed for debugging.

Suggested change
double time_us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3;
size_t copied_bytes = n - m;
// LOG(INFO) << "GDRCopy: " << copied_bytes << " bytes, "
// << time_us << " us" << ", to_gpu " << to_gpu;

Comment on lines 1394 to 1395
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);

Copilot AI Nov 9, 2025

The timing variables start and end are declared and used but the computed time_us value is only for commented-out logging. If timing is not actively needed, consider removing this instrumentation code.

Comment on lines 744 to 745
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);

Copilot AI Nov 9, 2025

The timing variables start and end are declared and used but the computed time_us value is only for commented-out logging. If timing is not actively needed, consider removing this instrumentation code.

Fix code style

Co-authored-by: Copilot <[email protected]>

char header_buf[12];
const size_t n = source->copy_to(header_buf, sizeof(header_buf));
size_t n = 0;
@yanglimingcn (Contributor) commented:

I think it would be better to move the GPU functionality in baidu_rpc_protocol.cpp into a separate file called baidu_rpc_with_gpu_protocol.cpp, or to define a new protocol. Currently, the macro definitions generate too many if-else branches (the pattern is sketched below).
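
Illustrative only, the kind of branching this refers to (BRPC_WITH_GDR and on_gpu() stand in for the PR's actual macro and accessor):

```cpp
#include <cstdio>

// Stand-in for butil::IOBuf with a GPU-residency flag.
struct Buf {
    bool gpu = false;
    bool on_gpu() const { return gpu; }
};

void ParseRpcMessage(const Buf& source) {
#if defined(BRPC_WITH_GDR)
    if (source.on_gpu()) {
        std::puts("GDR path: prefetch header/meta to host, then parse");
        return;
    }
#endif
    std::puts("host path: parse in place");
}
```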

@randomkang (Author) replied Nov 14, 2025:

Putting the GPU functionality into a separate file or defining a new protocol may be too heavyweight; we have reorganized the code to reduce the macro definitions. Is that OK? @yanglimingcn

@yanglimingcn (Contributor) replied:

I think it's better to have a separate protocol file for the GDR feature. Besides, I have an idea: we could also define a new protocol meta in which both the request meta and the response meta carry the GDR attribute value; the meta should also contain the GPU address information that both sides need to register with IB (a rough sketch below). Doing the GDR operations inside the ProcessRpcRequest and ProcessRpcResponse steps will probably be more reasonable.
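
A rough sketch of what such a meta could carry (field names are hypothetical; in practice this would extend the protobuf meta of the baidu_std protocol rather than be a C++ struct):

```cpp
#include <cstdint>

// Hypothetical GDR fields for both request meta and response meta.
struct GdrMeta {
    bool     gdr_enabled;   // the GDR attribute value
    uint64_t remote_addr;   // GPU buffer address registered with the IB device
    uint32_t rkey;          // remote key returned by ibv_reg_mr for that buffer
    uint32_t length;        // length of the registered region in bytes
};
```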

@randomkang (Author) replied:

If the server or client wants to know the other side's GPU memory region, we can send this information via a separate RPC. But I do not see the benefit of this approach. Can you explain more?

@yanglimingcn (Contributor) commented:

I believe the long-term solution is to have users register memory with RDMA, allowing users to customize this memory based on their data organization methods.
https://zhuanlan.zhihu.com/p/376989325 This link contains some relevant details.
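
A hedged sketch of that direction using only the standard verbs API: the application allocates device memory itself and registers it with the IB device (with the nvidia-peermem module loaded, ibv_reg_mr accepts device pointers), so data could be received straight into user-organized memory:

```cpp
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <cstddef>

// Allocate a user-owned GPU buffer and register it for RDMA access.
// Returns the memory region, or nullptr on failure; *out_ptr receives
// the device pointer on success.
ibv_mr* RegisterUserGpuBuffer(ibv_pd* pd, size_t len, void** out_ptr) {
    void* gpu_ptr = nullptr;
    if (cudaMalloc(&gpu_ptr, len) != cudaSuccess) {
        return nullptr;
    }
    ibv_mr* mr = ibv_reg_mr(pd, gpu_ptr, len,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_READ |
                            IBV_ACCESS_REMOTE_WRITE);
    if (mr == nullptr) {
        cudaFree(gpu_ptr);
        return nullptr;
    }
    *out_ptr = gpu_ptr;
    return mr;
}
```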

@randomkang (Author) commented:

> I believe the long-term solution is to have users register memory with RDMA, allowing users to customize this memory based on their data organization methods. https://zhuanlan.zhihu.com/p/376989325 This link contains some relevant details.

In this PR, brpc receives data into fragmented GPU blocks, and the user must call IOBuf::copy_from_gpu to copy these GPU blocks into contiguous HBM in order to use them. The d2d copy is time-consuming and not necessary.
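
A minimal sketch of that gather step; the copy_from_gpu signature shown here is an assumption for illustration (the real one is defined in src/butil/iobuf.h in this PR):

```cpp
#include <cuda_runtime.h>
#include "butil/iobuf.h"

// Gather the fragmented pool blocks held by `body` into one contiguous
// HBM buffer; this is the d2d copy that a user-specified destination
// would make unnecessary.
void* GatherAttachment(butil::IOBuf& body, size_t n) {
    void* contiguous_hbm = nullptr;
    if (cudaMalloc(&contiguous_hbm, n) != cudaSuccess) {
        return nullptr;
    }
    body.copy_from_gpu(contiguous_hbm, n);  // assumed signature
    // ... launch kernels that read from contiguous_hbm ...
    return contiguous_hbm;  // caller frees with cudaFree
}
```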

In the future, we can let the user assign the GPU destination directly with an interface like "rdma_memory_pool_user_specified_memory" and receive the data into it with opcodes like IBV_WR_RDMA_READ/IBV_WR_RDMA_WRITE. Then we can skip the d2d copy.

Furthermore, we can parse the brpc protocol with a GPU kernel and initiate the RDMA communication from the GPU. Then both the control path and the data path are on the GPU, like NCCL GIN.



Successfully merging this pull request may close these issues.

brpc: support GPU Direct RDMA
