Conversation

@randomkang commented Nov 9, 2025

What problem does this PR solve?

Issue Number: resolve #3102

Problem Summary:

What is changed and the side effects?

Changed:

  1. Receive all data into GPU memory first.
  2. GPU blocks are allocated from a GPU block pool.
  3. The brpc header, meta, and body are copied from GPU to CPU for processing.
  4. To reduce the number of d2h copies, the first 512B are prefetched to host memory (see the sketch below).
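
A minimal sketch of that receive path, assuming illustrative names (GpuBlock, PrefetchAndParse, kPrefetchBytes are not the PR's actual identifiers); only the CUDA runtime calls are real API:

```cpp
// Sketch only: the receive lands in a pool-owned GPU block; one async d2h
// copy prefetches the first 512B so header and meta can be parsed on the CPU.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

constexpr size_t kPrefetchBytes = 512;  // assumed constant matching item 4

struct GpuBlock {
    void*  dev_ptr;  // device memory handed out by the GPU block pool
    size_t size;     // bytes the NIC wrote into this block
};

// Copy the first kPrefetchBytes to the host in a single d2h transfer,
// instead of one small copy each for header, meta, and body prefix.
bool PrefetchAndParse(const GpuBlock& block, cudaStream_t stream) {
    char host_buf[kPrefetchBytes];
    const size_t n = std::min(block.size, kPrefetchBytes);
    if (cudaMemcpyAsync(host_buf, block.dev_ptr, n,
                        cudaMemcpyDeviceToHost, stream) != cudaSuccess) {
        return false;
    }
    cudaStreamSynchronize(stream);  // wait for the prefetch to land
    // ... parse the 12-byte baidu_std header and the meta from host_buf ...
    return true;
}
```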

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Co-authored-by: sunce4t <[email protected]>

Copilot AI left a comment

Pull Request Overview

This PR adds GPU Direct RDMA (GDR) support to the BRPC framework, enabling efficient data transfer between GPU memory and RDMA-capable network devices. The implementation includes GPU memory pool management, CUDA stream pooling for asynchronous memory operations, and protocol-level changes to handle GPU-resident data.

  • Added GPU memory block pool allocator with configurable block sizes
  • Implemented CUDA stream pooling for optimized device-to-host and device-to-device transfers (see the sketch after this list)
  • Integrated GDR support into IOBuf, RDMA endpoint, and RPC protocol parsing
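
A minimal sketch of such a CUDA stream pool, assuming a round-robin design (the class name and layout are illustrative, not the PR's actual code):

```cpp
// Sketch only: a fixed set of non-blocking CUDA streams handed out
// round-robin so concurrent d2h/d2d copies can overlap.
#include <cuda_runtime.h>
#include <mutex>
#include <vector>

class CudaStreamPool {
public:
    explicit CudaStreamPool(size_t n) : streams_(n) {
        for (cudaStream_t& s : streams_) {
            cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        }
    }
    ~CudaStreamPool() {
        for (cudaStream_t s : streams_) {
            cudaStreamDestroy(s);
        }
    }
    // Next() never blocks on a copy; callers synchronize on the
    // returned stream themselves.
    cudaStream_t Next() {
        std::lock_guard<std::mutex> lock(mu_);
        cudaStream_t s = streams_[idx_];
        idx_ = (idx_ + 1) % streams_.size();
        return s;
    }
private:
    std::mutex mu_;
    size_t idx_ = 0;
    std::vector<cudaStream_t> streams_;
};
```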

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 11 comments.

| File | Description |
| ---- | ----------- |
| src/butil/iobuf.h | Added GPU-specific copy and cut methods with conditional compilation guards |
| src/butil/iobuf.cpp | Implemented cutn_from_gpu and copy_from_gpu methods for GPU memory operations |
| src/butil/gpu/gpu_block_pool.h | New header defining GPU memory pool allocator, stream pool, and region management |
| src/butil/gpu/gpu_block_pool.cpp | Implementation of GPU memory allocation, CUDA stream management, and memory copy operations |
| src/brpc/rdma/rdma_helper.h | Added function declaration for GPU index retrieval |
| src/brpc/rdma/rdma_helper.cpp | Added GPU index configuration and initialization logic for GDR block pool |
| src/brpc/rdma/rdma_endpoint.h | Added remote receive window tracking and GDR-specific receive posting method |
| src/brpc/rdma/rdma_endpoint.cpp | Modified flow control, completion handling, and receive buffer posting for GDR support |
| src/brpc/policy/baidu_rpc_protocol.cpp | Enhanced message parsing to detect and handle GPU memory with optimized prefetch |
| bazel/config/BUILD.bazel | Added build configuration for GDR feature flag |
| BUILD.bazel | Added CUDA dependencies and GPU block pool source to build |


Comment on lines 763 to 764
clock_gettime(CLOCK_MONOTONIC, &end);
double time_us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3;

Copilot AI Nov 9, 2025

The variables start, end, and time_us are defined and computed but the result is only used in a commented-out log statement (lines 766-767). Consider removing this timing code if it's not being used, or uncomment the logging if it's needed for debugging.

Comment on lines 1418 to 1422
double time_us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3;
size_t copied_bytes = n - m;

// LOG(INFO) << "GDRCopy: " << copied_bytes << " bytes, "
// << time_us << " us" << ", to_gpu " << to_gpu;

Copilot AI Nov 9, 2025

The variables time_us and copied_bytes are computed but only used in commented-out log statements (lines 1421-1422). Consider removing this unused code or uncommenting the logging if it's needed for debugging.

Suggested change
double time_us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_nsec - start.tv_nsec) / 1e3;
size_t copied_bytes = n - m;
// LOG(INFO) << "GDRCopy: " << copied_bytes << " bytes, "
// << time_us << " us" << ", to_gpu " << to_gpu;

Comment on lines 1394 to 1395
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);

Copilot AI Nov 9, 2025

The timing variables start and end are declared and used but the computed time_us value is only for commented-out logging. If timing is not actively needed, consider removing this instrumentation code.

Comment on lines 744 to 745
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);

Copilot AI Nov 9, 2025

The timing variables start and end are declared and used but the computed time_us value is only for commented-out logging. If timing is not actively needed, consider removing this instrumentation code.

Fix code style

Co-authored-by: Copilot <[email protected]>

char header_buf[12];
const size_t n = source->copy_to(header_buf, sizeof(header_buf));
size_t n = 0;
@yanglimingcn (Contributor) commented:

I think it would be better to move the GPU functionality in baidu_rpc_protocol.cpp into a separate file called baidu_rpc_with_gpu_protocol.cpp, or to define a new protocol. Currently, the macro definitions generate too many if-else branches (the pattern is sketched below).
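
Illustrative only, the kind of branching this refers to (BRPC_WITH_GDR and on_gpu() stand in for the PR's actual macro and accessor):

```cpp
#include <cstdio>

// Stand-in for butil::IOBuf with a GPU-residency flag.
struct Buf {
    bool gpu = false;
    bool on_gpu() const { return gpu; }
};

void ParseRpcMessage(const Buf& source) {
#if defined(BRPC_WITH_GDR)
    if (source.on_gpu()) {
        std::puts("GDR path: prefetch header/meta to host, then parse");
        return;
    }
#endif
    std::puts("host path: parse in place");
}
```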

@randomkang (Author) replied Nov 14, 2025:

Putting the GPU functionality into a separate file or defining a new protocol may be too heavyweight; we have reorganized the code to reduce the macro definitions. Is that OK? @yanglimingcn

@yanglimingcn (Contributor) replied:

I think it's better to have a separate protocol file for the GDR feature. Besides, I have an idea: we could also define a new protocol meta in which both the request meta and the response meta carry the GDR attribute value; the meta should also contain the GPU address information that both sides need to register with IB (a rough sketch below). Doing the GDR operations inside the ProcessRpcRequest and ProcessRpcResponse steps will probably be more reasonable.
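
A rough sketch of what such a meta could carry (field names are hypothetical; in practice this would extend the protobuf meta of the baidu_std protocol rather than be a C++ struct):

```cpp
#include <cstdint>

// Hypothetical GDR fields for both request meta and response meta.
struct GdrMeta {
    bool     gdr_enabled;   // the GDR attribute value
    uint64_t remote_addr;   // GPU buffer address registered with the IB device
    uint32_t rkey;          // remote key returned by ibv_reg_mr for that buffer
    uint32_t length;        // length of the registered region in bytes
};
```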

@randomkang (Author) replied:

If the server or client wants to know the other side's GPU memory region, we can send this information via a separate RPC. But I do not see the benefit of this approach. Can you explain more?

@yanglimingcn (Contributor) commented:

I believe the long-term solution is to have users register memory with RDMA, allowing users to customize this memory based on their data organization methods.
https://zhuanlan.zhihu.com/p/376989325 This link contains some relevant details.
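
A hedged sketch of that direction using only the standard verbs API: the application allocates device memory itself and registers it with the IB device (with the nvidia-peermem module loaded, ibv_reg_mr accepts device pointers), so data could be received straight into user-organized memory:

```cpp
#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <cstddef>

// Allocate a user-owned GPU buffer and register it for RDMA access.
// Returns the memory region, or nullptr on failure; *out_ptr receives
// the device pointer on success.
ibv_mr* RegisterUserGpuBuffer(ibv_pd* pd, size_t len, void** out_ptr) {
    void* gpu_ptr = nullptr;
    if (cudaMalloc(&gpu_ptr, len) != cudaSuccess) {
        return nullptr;
    }
    ibv_mr* mr = ibv_reg_mr(pd, gpu_ptr, len,
                            IBV_ACCESS_LOCAL_WRITE |
                            IBV_ACCESS_REMOTE_READ |
                            IBV_ACCESS_REMOTE_WRITE);
    if (mr == nullptr) {
        cudaFree(gpu_ptr);
        return nullptr;
    }
    *out_ptr = gpu_ptr;
    return mr;
}
```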

@randomkang (Author) commented:

> I believe the long-term solution is to have users register memory with RDMA, allowing users to customize this memory based on their data organization methods. https://zhuanlan.zhihu.com/p/376989325 This link contains some relevant details.

In this PR, brpc receives data into fragmented GPU blocks, and the user must call IOBuf::copy_from_gpu to copy these GPU blocks into contiguous HBM in order to use them. The d2d copy is time-consuming and not necessary.
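
A minimal sketch of that gather step; the copy_from_gpu signature shown here is an assumption for illustration (the real one is defined in src/butil/iobuf.h in this PR):

```cpp
#include <cuda_runtime.h>
#include "butil/iobuf.h"

// Gather the fragmented pool blocks held by `body` into one contiguous
// HBM buffer; this is the d2d copy that a user-specified destination
// would make unnecessary.
void* GatherAttachment(butil::IOBuf& body, size_t n) {
    void* contiguous_hbm = nullptr;
    if (cudaMalloc(&contiguous_hbm, n) != cudaSuccess) {
        return nullptr;
    }
    body.copy_from_gpu(contiguous_hbm, n);  // assumed signature
    // ... launch kernels that read from contiguous_hbm ...
    return contiguous_hbm;  // caller frees with cudaFree
}
```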

In the future, we can let the user assign the GPU destination directly with an interface like "rdma_memory_pool_user_specified_memory" and receive the data into it with opcodes like IBV_WR_RDMA_READ/IBV_WR_RDMA_WRITE. Then we can skip the d2d copy.

Furthermore, we can parse the brpc protocol with a GPU kernel and initiate the RDMA communication from the GPU. Then both the control path and the data path are on the GPU, like NCCL GIN.



Successfully merging this pull request may close these issues.

brpc: support GPU Direct RDMA
