
Improve C++ error handling, resource cleanup, and API return checks#1089

Open
rgsl888prabhu wants to merge 4 commits into NVIDIA:main from rgsl888prabhu:fix/cpp-error-handling-and-resource-cleanup

Conversation

@rgsl888prabhu
Collaborator

Summary

Improve error handling and resource management across C++/CUDA code: validate fread returns and dimensions in presolve binary reader, wrap unchecked CUDA API calls with RAFT_CUDA_TRY, use RAII for FILE handles in MPS parser, add PID-based shared memory names in gRPC server, and check CUDA device query returns in version_info.

Testing

No new tests added. Existing unit and integration tests cover the modified code paths.

Documentation

No documentation changes needed.

- Add fread return value checks and dimension validation in presolve binary reader
- Wrap CUDA kernel launches and graph API calls with RAFT_CUDA_TRY in feasibility_jump
- Wrap cudaMemcpy calls with RAFT_CUDA_TRY in optimization_problem tests
- Add cudaGetDevice/cudaGetDeviceProperties return checks in version_info
- Use PID-based shared memory segment names in gRPC server
- Use RAII (unique_ptr) for FILE handle in MPS parser to prevent fd leak on error
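The RAII change for the FILE handle can be sketched roughly like this; the deleter and helper names below are illustrative stand-ins, not the actual cuOpt sources:

```cpp
#include <cstdio>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical deleter mirroring the PR's unique_ptr<FILE, FcloseDeleter> idea.
struct FcloseDeleter {
  void operator()(std::FILE* fp) const noexcept
  {
    if (fp) { std::fclose(fp); }
  }
};

using file_ptr = std::unique_ptr<std::FILE, FcloseDeleter>;

// Open a file or throw; the handle is closed automatically on every exit path,
// including exceptions, so no fd leaks on error.
file_ptr open_or_throw(const std::string& path, const char* mode)
{
  file_ptr fp{std::fopen(path.c_str(), mode)};
  if (!fp) { throw std::runtime_error("fopen failed for " + path); }
  return fp;
}

// Tiny demo helpers: write one byte, then read it back through the RAII handle.
void write_byte(const std::string& path, char c)
{
  file_ptr fp = open_or_throw(path, "wb");
  std::fputc(c, fp.get());  // fp.get() for the raw FILE*, as in the PR
}

int first_byte(const std::string& path)
{
  file_ptr fp = open_or_throw(path, "rb");
  return std::fgetc(fp.get());
}
```

The point of the pattern is that an early `return` or a `throw` anywhere after the open no longer needs a matching `fclose`.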

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@rgsl888prabhu rgsl888prabhu requested a review from a team as a code owner April 9, 2026 16:36
@rgsl888prabhu rgsl888prabhu requested review from chris-maes and rg20 April 9, 2026 16:36
@coderabbitai

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

Replaced manual FILE* usage with RAII; added strict binary-read validation for LP deserialization; made shared-memory names PID-scoped and adjusted unlink/creation calls; wrapped CUDA API/kernel/graph/copy calls with RAFT_CUDA_TRY; added guarded CUDA device/version queries and fixed UUID formatting.

Changes

- **File parsing (RAII)** (`cpp/libmps_parser/src/mps_parser.cpp`): replaced the manual `FILE*` lifecycle with `std::unique_ptr<FILE, FcloseDeleter>`; the `fopen` result is wrapped, file operations use `fp.get()`, and the explicit `fclose` is removed.
- **LP deserialization & validation** (`cpp/src/dual_simplex/presolve.hpp`): switched to a RAII-managed binary `FILE*`; added `check_read` for `fread` results; validates `num_rows`, `num_cols`, `obj_scale`, and `is_integral`; enforces a serialized-size cap; validates `A.col_start` monotonicity/bounds and `A.i` row indices; throws on failure.
- **Shared-memory naming & APIs** (`cpp/src/grpc/server/grpc_server_types.hpp`, `cpp/src/grpc/server/grpc_server_main.cpp`, `cpp/src/grpc/server/grpc_worker_infra.cpp`): introduced `make_shm_name(const char*)` to append the PID and changed the `SHM_*` constants to `std::string`; updated `create_shared_memory` and `shm_unlink` calls to use `.c_str()`.
- **CUDA error handling (kernels/graphs/copies)** (`cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`, `cpp/tests/linear_programming/unit_tests/optimization_problem_test.cu`): wrapped CUDA API calls (stream capture/end, kernel launches, graph instantiate/destroy/launch) and test `cudaMemcpy` calls with `RAFT_CUDA_TRY(...)`; added a try/catch around graph instantiation to destroy the captured graph on failure.
- **CUDA device queries & formatting** (`cpp/src/utilities/version_info.cpp`): guarded `cudaGetDevice`, `cudaGetDeviceProperties`, and `cudaRuntimeGetVersion` with conditional logging and a `has_gpu` flag; suppressed GPU logs on failure; cast `uuid.bytes[i]` to `unsigned char` for `%02x`; updated the copyright year.
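The PID-scoped shared-memory naming could look roughly like this; the exact cuOpt signature and segment names may differ:

```cpp
#include <string>
#include <unistd.h>  // getpid (POSIX)

// Sketch of a PID-scoped shared-memory name, so two server instances on the
// same host do not collide on a fixed /dev/shm segment name.
std::string make_shm_name(const char* base)
{
  return std::string(base) + "_" + std::to_string(::getpid());
}
```

With this, the `SHM_*` constants become `std::string`s built once at startup, and callers pass `.c_str()` to `shm_open`/`shm_unlink`.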

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

- **Docstring Coverage** (⚠️ Warning): docstring coverage is 11.11%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

- **Title check** (✅ Passed): the title accurately summarizes the main theme of the PR: improving error handling, resource cleanup, and API return checks across C++ and CUDA code.
- **Description check** (✅ Passed): the description is directly related to the changeset, summarizing key improvements across multiple files (presolve, MPS parser, gRPC server, CUDA calls, version_info).



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu (1)

756-805: ⚠️ Potential issue | 🟠 Major

CUDA error handling is incomplete in run_step_device.

Lines 709, 722, 730, 741, and 747 contain unchecked CUDA calls that violate the requirement for error checking on every kernel launch. Specifically:

  • Line 709: cudaStreamBeginCapture
  • Lines 722 & 730: cudaLaunchCooperativeKernel (binary/non-binary branches)
  • Lines 741 & 747: cudaLaunchCooperativeKernel (within FJ_DEBUG_LOAD_BALANCING block)
Suggested patch
-    if (use_graph) { cudaStreamBeginCapture(climber_stream, cudaStreamCaptureModeThreadLocal); }
+    if (use_graph) {
+      RAFT_CUDA_TRY(cudaStreamBeginCapture(climber_stream, cudaStreamCaptureModeThreadLocal));
+    }
@@
-            cudaLaunchCooperativeKernel(
+            RAFT_CUDA_TRY(cudaLaunchCooperativeKernel(
               (void*)compute_mtm_moves_kernel<i_t, f_t, MTMMoveType::FJ_MTM_VIOLATED, true>,
               grid_resetmoves_bin,
               blocks_resetmoves_bin,
               reset_moves_args,
               0,
-              climber_stream);
+              climber_stream));
@@
-            cudaLaunchCooperativeKernel(
+            RAFT_CUDA_TRY(cudaLaunchCooperativeKernel(
               (void*)compute_mtm_moves_kernel<i_t, f_t, MTMMoveType::FJ_MTM_VIOLATED, false>,
               grid_resetmoves,
               blocks_resetmoves,
               reset_moves_args,
               0,
-              climber_stream);
+              climber_stream));
@@
-          cudaLaunchCooperativeKernel((void*)compute_mtm_moves_kernel<i_t, f_t>,
+          RAFT_CUDA_TRY(cudaLaunchCooperativeKernel((void*)compute_mtm_moves_kernel<i_t, f_t>,
                                       grid_resetmoves_bin,
                                       blocks_resetmoves_bin,
                                       reset_moves_args,
                                       0,
-                                      climber_stream);
-          cudaLaunchCooperativeKernel((void*)load_balancing_sanity_checks<i_t, f_t>,
+                                      climber_stream));
+          RAFT_CUDA_TRY(cudaLaunchCooperativeKernel((void*)load_balancing_sanity_checks<i_t, f_t>,
                                        512,
                                        128,
                                        kernel_args,
                                        0,
-                                       climber_stream);
+                                       climber_stream));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 756
- 805, The run_step_device function has unchecked CUDA calls—specifically
cudaStreamBeginCapture and several cudaLaunchCooperativeKernel invocations in
the binary/non-binary branches and inside the FJ_DEBUG_LOAD_BALANCING block—so
add proper error checking around each of these calls (use RAFT_CUDA_TRY or
equivalent error-check macro) and ensure any failure is handled consistently
(log and abort/return) just like the surrounding kernel launches (e.g., wraps
around cudaStreamBeginCapture, the cooperative launches that invoke kernels such
as handle_local_minimum_kernel<i_t,f_t> and other cooperative kernel calls in
the binary/non-binary paths and the FJ_DEBUG_LOAD_BALANCING block) so every CUDA
API and kernel launch in run_step_device is checked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/dual_simplex/presolve.hpp`:
- Around line 76-84: The current code silently proceeds when FILE* fid =
fopen(path.c_str(), "r") returns null and manually calls fclose(fid) inside the
check_read lambda, which can leak descriptors on exceptions; change this to fail
fast by checking fid immediately after fopen and throwing a clear
std::runtime_error if fopen fails for path, and replace manual fclose usage with
an RAII wrapper (e.g., std::unique_ptr<FILE, decltype(&fclose)> or a small
FileHandle class) so that the file is always closed on scope exit; update the
check_read lambda (and any code paths that currently call fclose) to rely on the
RAII handle instead of calling fclose directly.

In `@cpp/src/utilities/version_info.cpp`:
- Around line 169-177: The call to cudaRuntimeGetVersion(&version) in function
that queries CUDA info is missing error checking, causing an
uninitialized/incorrect version to be logged if it fails; update the code near
cudaRuntimeGetVersion to check its return value (similar to how cudaGetDevice
and cudaGetDeviceProperties are checked), log a warning via CUOPT_LOG_WARN with
context (e.g., "Failed to query CUDA runtime version") and return early if the
call is not cudaSuccess so version isn't used, referencing
cudaRuntimeGetVersion, cudaGetDevice, cudaGetDeviceProperties, device_id, and
device_prop to locate the change.

---

Outside diff comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 756-805: The run_step_device function has unchecked CUDA
calls—specifically cudaStreamBeginCapture and several
cudaLaunchCooperativeKernel invocations in the binary/non-binary branches and
inside the FJ_DEBUG_LOAD_BALANCING block—so add proper error checking around
each of these calls (use RAFT_CUDA_TRY or equivalent error-check macro) and
ensure any failure is handled consistently (log and abort/return) just like the
surrounding kernel launches (e.g., wraps around cudaStreamBeginCapture, the
cooperative launches that invoke kernels such as
handle_local_minimum_kernel<i_t,f_t> and other cooperative kernel calls in the
binary/non-binary paths and the FJ_DEBUG_LOAD_BALANCING block) so every CUDA API
and kernel launch in run_step_device is checked.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5d18dbbf-bd69-4d1c-8ed7-7457c556083e

📥 Commits

Reviewing files that changed from the base of the PR and between 378add0 and 374e40d.

📒 Files selected for processing (8)
  • cpp/libmps_parser/src/mps_parser.cpp
  • cpp/src/dual_simplex/presolve.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
  • cpp/src/utilities/version_info.cpp
  • cpp/tests/linear_programming/unit_tests/optimization_problem_test.cu

@rgsl888prabhu rgsl888prabhu self-assigned this Apr 9, 2026
@rgsl888prabhu rgsl888prabhu added non-breaking Introduces a non-breaking change improvement Improves an existing functionality labels Apr 9, 2026
…g CUDA checks

- Use unique_ptr for FILE handle in presolve binary reader (RAII)
- Throw on fopen failure instead of silent skip
- Add CSC matrix validation (col_start monotonicity, row index bounds)
- Check cudaRuntimeGetVersion return value in version_info
- Wrap remaining cudaStreamBeginCapture and cudaLaunchCooperativeKernel calls
  with RAFT_CUDA_TRY in feasibility_jump

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/dual_simplex/presolve.hpp`:
- Around line 93-97: The deserialization currently coerces any bit pattern into
obj_scale and objective_is_integral; change the read-and-assign logic around
obj_scale and objective_is_integral (the block that calls check_read for
obj_constant, obj_scale, is_integral) to validate exact on-disk encodings:
accept obj_scale only if it equals 1.0f or -1.0f (reject otherwise) and accept
is_integral only if it equals 0 or 1 (treat 1 as true, 0 as false; reject other
values); on invalid values return a read/format error (or throw) instead of
silently coercing, keeping use of check_read for the fread checks and preserving
obj_constant handling.
- Around line 86-129: The file-reader currently only limits element counts
(max_dim, nnz) but still allows huge memory allocations via many large arrays;
before calling any resize() or fread() for objective, rhs, lower, upper,
A.col_start, A.i, or A.x, compute the total byte footprint required (e.g.
total_bytes = num_cols*sizeof(f_t) /*objective*/ + num_rows*sizeof(f_t) /*rhs*/
+ num_cols*sizeof(f_t) /*lower*/ + num_cols*sizeof(f_t) /*upper*/ +
(num_cols+1)*sizeof(i_t) /*col_start*/ + nnz*sizeof(i_t) /*A.i*/ +
nnz*sizeof(f_t) /*A.x*/) and compare it against a safe cap (e.g. a MAX_BYTES
constant like 1<<30 or derived from max_dim and available memory); if
total_bytes exceeds the cap, throw std::runtime_error and do not perform any
resize/fread. Ensure this check happens immediately after computing nnz (after
reading A.col_start and before any resize/fread on
objective/rhs/lower/upper/A.i/A.x) and keep existing per-field
monotonicity/index checks (A.col_start, A.i bounds) intact.
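The strict-encoding and allocation-cap checks asked for above can be sketched like this; `MAX_BYTES`, the 8-byte `f_t`/4-byte `i_t` sizes, and the helper names are assumptions for illustration, not the cuOpt implementation:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical cap mirroring the 2 GiB limit discussed in the PR.
constexpr std::uint64_t MAX_BYTES = 2ull << 30;

// Strict decoding: only the exact encodings the serializer writes are
// accepted; anything else is treated as a corrupt file.
inline double decode_obj_scale(double raw)
{
  if (raw != 1.0 && raw != -1.0) { throw std::runtime_error("invalid obj_scale"); }
  return raw;
}

inline bool decode_is_integral(int raw)
{
  if (raw != 0 && raw != 1) { throw std::runtime_error("invalid is_integral"); }
  return raw == 1;
}

// Total bytes the deserializer would allocate for a problem of this shape,
// assuming 8-byte floats (f_t) and 4-byte indices (i_t); computed before any
// resize()/fread() is issued.
inline std::uint64_t serialized_bytes(std::uint64_t num_rows,
                                      std::uint64_t num_cols,
                                      std::uint64_t nnz)
{
  constexpr std::uint64_t f = 8, i = 4;
  return num_cols * f        // objective
       + num_rows * f        // rhs
       + 2 * num_cols * f    // lower + upper bounds
       + (num_cols + 1) * i  // A.col_start
       + nnz * i             // A.i row indices
       + nnz * f;            // A.x values
}

inline void check_allocation_cap(std::uint64_t rows, std::uint64_t cols, std::uint64_t nnz)
{
  if (serialized_bytes(rows, cols, nnz) > MAX_BYTES) {
    throw std::runtime_error("serialized LP exceeds allocation cap");
  }
}
```

The key ordering point from the comment is that `check_allocation_cap` runs right after `nnz` is known and before any buffer is resized.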

In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 811-815: The cudaGraph_t 'graph' is not destroyed if
RAFT_CUDA_TRY(cudaGraphInstantiate(&graph_instance, graph)) throws, leaking the
captured graph; wrap the instantiate call in a try/catch (or equivalent RAFT
error-handling block) so that on any exception you call cudaGraphDestroy(graph)
before rethrowing, only set graph_created = true after successful instantiation,
and ensure the same fix is applied to the identical pattern in
ping_pong_graph.cu; reference the symbols cudaStreamEndCapture,
cudaGraphInstantiate, cudaGraphDestroy, graph_instance, graph, climber_stream,
RAFT_CUDA_TRY, and graph_created when making the change.
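The cleanup-on-failed-instantiate pattern can be shown with host-side stand-ins (no GPU required); `FakeGraph` and both functions below are illustrative substitutes for `cudaGraph_t`, `cudaGraphInstantiate`, and `cudaGraphDestroy`:

```cpp
#include <stdexcept>

// Stand-in for the captured cudaGraph_t; "destroyed" records whether the
// cleanup path (cudaGraphDestroy in the real code) ran.
struct FakeGraph { bool destroyed = false; };

bool instantiate_graph(FakeGraph& g, bool fail)
{
  bool graph_created = false;
  try {
    if (fail) { throw std::runtime_error("cudaGraphInstantiate failed"); }
    graph_created = true;  // set only after successful instantiation
  } catch (...) {
    g.destroyed = true;    // destroy the captured graph, then rethrow
    throw;
  }
  return graph_created;
}

// Demo of the failure path: the captured graph is destroyed even though
// instantiation threw.
bool failure_path_cleans_up()
{
  FakeGraph g;
  try { instantiate_graph(g, true); } catch (const std::runtime_error&) {}
  return g.destroyed;
}
```

In the real code the `throw` comes from `RAFT_CUDA_TRY(cudaGraphInstantiate(...))`, and the catch block calls `cudaGraphDestroy(graph)` before rethrowing.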

In `@cpp/src/utilities/version_info.cpp`:
- Around line 169-177: The function currently returns early when cudaGetDevice
or cudaGetDeviceProperties fail, which prevents the subsequent general cuOpt
version/build and CPU logs from running; remove those early returns and instead
set a local flag (e.g., bool has_gpu = false) or guard variables (using
device_id/device_prop) so GPU-specific preparation and the GPU log emission are
skipped but the function continues to emit the general version/build and CPU
logs; specifically, modify the cudaGetDevice and cudaGetDeviceProperties checks
(the blocks calling CUOPT_LOG_WARN) to clear/leave has_gpu false and continue,
and wrap only the GPU info preparation and the GPU logging call(s) in an if
(has_gpu) block so non-GPU logging always executes.
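The `has_gpu` control flow asked for above reduces to a small pattern; `query_device_ok` and the log strings below are stand-ins for `cudaGetDevice`/`cudaGetDeviceProperties` and `CUOPT_LOG_WARN`, not the actual version_info code:

```cpp
#include <string>

// Sketch of the "collect what you can" pattern: a failed GPU query clears
// has_gpu instead of returning early, so version/CPU info is still logged.
std::string log_version_info(bool query_device_ok)
{
  std::string log;
  bool has_gpu = true;
  if (!query_device_ok) {
    has_gpu = false;             // warn and continue, no early return
    log += "warn:no-gpu-info;";
  }
  if (has_gpu) { log += "gpu;"; }  // only GPU-specific output is guarded
  log += "version;cpu;";           // always emitted
  return log;
}
```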

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3d3ff6de-c56a-4aa2-af69-97f0a661d5e1

📥 Commits

Reviewing files that changed from the base of the PR and between 374e40d and 0c01c61.

📒 Files selected for processing (3)
  • cpp/src/dual_simplex/presolve.hpp
  • cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
  • cpp/src/utilities/version_info.cpp

… GPU flag

- Add 2 GiB total serialized byte cap before allocations in presolve reader
- Validate obj_scale (must be 1.0 or -1.0) and is_integral (must be 0 or 1)
- Wrap cudaGraphInstantiate in try/catch to ensure graph cleanup on failure
- Use has_gpu flag instead of early return so version/CPU info always logs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu (2)

724-824: ⚠️ Potential issue | 🟠 Major

Add CUDA error checking to all kernel launches in load_balancing_score_update().

Lines 603, 615, 623–624, and 628 in load_balancing_score_update() issue raw cudaLaunchCooperativeKernel and <<<>>> kernel launches without RAFT_CUDA_TRY or RAFT_CHECK_CUDA. Since this function is invoked at line 721 when use_load_balancing is true, launch failures in this path will not be caught and reported, creating an inconsistency with the hardened non-load-balanced path. Wrap all four kernel launches with appropriate error checking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 724
- 824, The kernel launches inside load_balancing_score_update are missing
RAFT_CUDA_TRY/RAFT_CHECK_CUDA wrappers; wrap each raw
cudaLaunchCooperativeKernel and cudaLaunchKernel call in that function with
RAFT_CUDA_TRY (or use RAFT_CHECK_CUDA where appropriate after stream ops) so
failures are reported; specifically update the launches that invoke
compute_mtm_moves_kernel<i_t,f_t,...>, update_lift_moves_kernel<i_t,f_t>,
update_breakthrough_moves_kernel<i_t,f_t> and the other raw launches in the
load_balancing_score_update flow (e.g., select_variable_kernel,
handle_local_minimum_kernel, update_assignment_kernel,
update_changed_constraints_kernel) to use RAFT_CUDA_TRY/RAFT_CHECK_CUDA
consistent with the non-load-balanced path.

709-820: ⚠️ Potential issue | 🟠 Major

Wrap stream capture region in exception-safe guard to ensure cudaStreamEndCapture always executes.

Any RAFT_CUDA_TRY failure inside the loop (lines 715–803) exits without calling cudaStreamEndCapture() at line 811. Per CUDA documentation, a stream remains invalid after failed capture until cudaStreamEndCapture() completes, leaving climber_stream unusable for the rest of the solve. Wrap the entire capture block (lines 710–810) in a try/catch that unconditionally ends capture, destroys any returned graph, and rethrows the exception.

The existing try/catch around cudaGraphInstantiate (lines 812–818) only handles failures during instantiation, not during the capture itself.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 709
- 820, The cuda stream capture started by cudaStreamBeginCapture(climber_stream)
must be wrapped in an exception-safe block so
cudaStreamEndCapture(climber_stream, &graph) is always called even if any
RAFT_CUDA_TRY inside the capture (e.g., launches in the loop using
compute_mtm_moves_kernel, update_lift_moves_kernel, select_variable_kernel,
etc.) throws; modify the code around the begin capture to use a
try/catch/finally pattern that on any exception calls cudaStreamEndCapture,
destroys any non-null graph via cudaGraphDestroy(graph), then rethrows; preserve
the existing cudaGraphInstantiate/graph_instance logic but ensure graph is
cleaned up and graph_created set only after successful instantiate.
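The "end capture on every exit path" requirement is a scope-guard pattern, sketched here with host-side stand-ins (the guard's lambda and `end_capture_calls` substitute for `cudaStreamEndCapture` plus `cudaGraphDestroy` in the real code):

```cpp
#include <functional>
#include <stdexcept>

// Minimal scope guard: its action runs on every exit path, including stack
// unwinding, so a throwing launch inside the capture region cannot leave the
// stream stuck mid-capture.
class scope_guard {
 public:
  explicit scope_guard(std::function<void()> f) : f_(std::move(f)) {}
  ~scope_guard() { f_(); }
  scope_guard(const scope_guard&)            = delete;
  scope_guard& operator=(const scope_guard&) = delete;

 private:
  std::function<void()> f_;
};

int end_capture_calls = 0;  // stands in for "cudaStreamEndCapture was invoked"

void capture_region(bool launch_throws)
{
  scope_guard end_capture{[] { ++end_capture_calls; }};
  // ... kernel launches inside the capture region would go here ...
  if (launch_throws) { throw std::runtime_error("launch failed"); }
}

// Demo: end-capture runs whether or not a launch throws.
int run_demo()
{
  try { capture_region(true); } catch (const std::runtime_error&) {}
  capture_region(false);
  return end_capture_calls;
}
```

A destructor-based guard is preferable to a manual try/catch here because every `RAFT_CUDA_TRY` in the loop becomes a potential exit point.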
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 724-824: The kernel launches inside load_balancing_score_update
are missing RAFT_CUDA_TRY/RAFT_CHECK_CUDA wrappers; wrap each raw
cudaLaunchCooperativeKernel and cudaLaunchKernel call in that function with
RAFT_CUDA_TRY (or use RAFT_CHECK_CUDA where appropriate after stream ops) so
failures are reported; specifically update the launches that invoke
compute_mtm_moves_kernel<i_t,f_t,...>, update_lift_moves_kernel<i_t,f_t>,
update_breakthrough_moves_kernel<i_t,f_t> and the other raw launches in the
load_balancing_score_update flow (e.g., select_variable_kernel,
handle_local_minimum_kernel, update_assignment_kernel,
update_changed_constraints_kernel) to use RAFT_CUDA_TRY/RAFT_CHECK_CUDA
consistent with the non-load-balanced path.
- Around line 709-820: The cuda stream capture started by
cudaStreamBeginCapture(climber_stream) must be wrapped in an exception-safe
block so cudaStreamEndCapture(climber_stream, &graph) is always called even if
any RAFT_CUDA_TRY inside the capture (e.g., launches in the loop using
compute_mtm_moves_kernel, update_lift_moves_kernel, select_variable_kernel,
etc.) throws; modify the code around the begin capture to use a
try/catch/finally pattern that on any exception calls cudaStreamEndCapture,
destroys any non-null graph via cudaGraphDestroy(graph), then rethrows; preserve
the existing cudaGraphInstantiate/graph_instance logic but ensure graph is
cleaned up and graph_created set only after successful instantiate.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9646f09b-95a1-4e2c-a65b-2047d379eafd

📥 Commits

Reviewing files that changed from the base of the PR and between 0c01c61 and 741a825.

📒 Files selected for processing (3)
  • cpp/src/dual_simplex/presolve.hpp
  • cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
  • cpp/src/utilities/version_info.cpp

