Improve C++ error handling, resource cleanup, and API return checks#1089
rgsl888prabhu wants to merge 4 commits into NVIDIA:main from
Conversation
- Add fread return value checks and dimension validation in presolve binary reader
- Wrap CUDA kernel launches and graph API calls with RAFT_CUDA_TRY in feasibility_jump
- Wrap cudaMemcpy calls with RAFT_CUDA_TRY in optimization_problem tests
- Add cudaGetDevice/cudaGetDeviceProperties return checks in version_info
- Use PID-based shared memory segment names in gRPC server
- Use RAII (unique_ptr) for FILE handle in MPS parser to prevent fd leak on error

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough

Replaced manual FILE* usage with RAII; added strict binary-read validation for LP deserialization; made shared-memory names PID-scoped and adjusted unlink/creation calls; wrapped CUDA API/kernel/graph/copy calls with `RAFT_CUDA_TRY`.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu (1)
756-805: ⚠️ Potential issue | 🟠 Major

CUDA error handling is incomplete in `run_step_device`.

Lines 709, 722, 730, 741, and 747 contain unchecked CUDA calls that violate the requirement for error checking on every kernel launch. Specifically:

- Line 709: `cudaStreamBeginCapture`
- Lines 722 & 730: `cudaLaunchCooperativeKernel` (binary/non-binary branches)
- Lines 741 & 747: `cudaLaunchCooperativeKernel` (within the `FJ_DEBUG_LOAD_BALANCING` block)

Suggested patch
```diff
- if (use_graph) { cudaStreamBeginCapture(climber_stream, cudaStreamCaptureModeThreadLocal); }
+ if (use_graph) {
+   RAFT_CUDA_TRY(cudaStreamBeginCapture(climber_stream, cudaStreamCaptureModeThreadLocal));
+ }
@@
-    cudaLaunchCooperativeKernel(
+    RAFT_CUDA_TRY(cudaLaunchCooperativeKernel(
       (void*)compute_mtm_moves_kernel<i_t, f_t, MTMMoveType::FJ_MTM_VIOLATED, true>,
       grid_resetmoves_bin,
       blocks_resetmoves_bin,
       reset_moves_args,
       0,
-      climber_stream);
+      climber_stream));
@@
-    cudaLaunchCooperativeKernel(
+    RAFT_CUDA_TRY(cudaLaunchCooperativeKernel(
       (void*)compute_mtm_moves_kernel<i_t, f_t, MTMMoveType::FJ_MTM_VIOLATED, false>,
       grid_resetmoves,
       blocks_resetmoves,
       reset_moves_args,
       0,
-      climber_stream);
+      climber_stream));
@@
-    cudaLaunchCooperativeKernel((void*)compute_mtm_moves_kernel<i_t, f_t>,
+    RAFT_CUDA_TRY(cudaLaunchCooperativeKernel((void*)compute_mtm_moves_kernel<i_t, f_t>,
       grid_resetmoves_bin,
       blocks_resetmoves_bin,
       reset_moves_args,
       0,
-      climber_stream);
-    cudaLaunchCooperativeKernel((void*)load_balancing_sanity_checks<i_t, f_t>,
+      climber_stream));
+    RAFT_CUDA_TRY(cudaLaunchCooperativeKernel((void*)load_balancing_sanity_checks<i_t, f_t>,
       512,
       128,
       kernel_args,
       0,
-      climber_stream);
+      climber_stream));
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 756 - 805, The run_step_device function has unchecked CUDA calls—specifically cudaStreamBeginCapture and several cudaLaunchCooperativeKernel invocations in the binary/non-binary branches and inside the FJ_DEBUG_LOAD_BALANCING block—so add proper error checking around each of these calls (use RAFT_CUDA_TRY or equivalent error-check macro) and ensure any failure is handled consistently (log and abort/return) just like the surrounding kernel launches (e.g., wraps around cudaStreamBeginCapture, the cooperative launches that invoke kernels such as handle_local_minimum_kernel<i_t,f_t> and other cooperative kernel calls in the binary/non-binary paths and the FJ_DEBUG_LOAD_BALANCING block) so every CUDA API and kernel launch in run_step_device is checked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/dual_simplex/presolve.hpp`:
- Around line 76-84: The current code silently proceeds when FILE* fid =
fopen(path.c_str(), "r") returns null and manually calls fclose(fid) inside the
check_read lambda, which can leak descriptors on exceptions; change this to fail
fast by checking fid immediately after fopen and throwing a clear
std::runtime_error if fopen fails for path, and replace manual fclose usage with
an RAII wrapper (e.g., std::unique_ptr<FILE, decltype(&fclose)> or a small
FileHandle class) so that the file is always closed on scope exit; update the
check_read lambda (and any code paths that currently call fclose) to rely on the
RAII handle instead of calling fclose directly.
In `@cpp/src/utilities/version_info.cpp`:
- Around line 169-177: The call to cudaRuntimeGetVersion(&version) in function
that queries CUDA info is missing error checking, causing an
uninitialized/incorrect version to be logged if it fails; update the code near
cudaRuntimeGetVersion to check its return value (similar to how cudaGetDevice
and cudaGetDeviceProperties are checked), log a warning via CUOPT_LOG_WARN with
context (e.g., "Failed to query CUDA runtime version") and return early if the
call is not cudaSuccess so version isn't used, referencing
cudaRuntimeGetVersion, cudaGetDevice, cudaGetDeviceProperties, device_id, and
device_prop to locate the change.
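The suggested check could look like the following sketch (the `CUOPT_LOG_WARN`/`CUOPT_LOG_INFO` usage is inferred from the review text, and the log message format is an assumption, not the verified production code):

```cpp
// Inside the CUDA-info query function: check the return value instead
// of logging a possibly uninitialized version number.
int version = 0;
if (cudaRuntimeGetVersion(&version) != cudaSuccess) {
  CUOPT_LOG_WARN("Failed to query CUDA runtime version");
  return;  // don't use 'version' after a failed query
}
// CUDA encodes the version as 1000*major + 10*minor.
CUOPT_LOG_INFO("CUDA runtime version: %d.%d", version / 1000, (version % 1000) / 10);
```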
---
Outside diff comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 756-805: The run_step_device function has unchecked CUDA
calls—specifically cudaStreamBeginCapture and several
cudaLaunchCooperativeKernel invocations in the binary/non-binary branches and
inside the FJ_DEBUG_LOAD_BALANCING block—so add proper error checking around
each of these calls (use RAFT_CUDA_TRY or equivalent error-check macro) and
ensure any failure is handled consistently (log and abort/return) just like the
surrounding kernel launches (e.g., wraps around cudaStreamBeginCapture, the
cooperative launches that invoke kernels such as
handle_local_minimum_kernel<i_t,f_t> and other cooperative kernel calls in the
binary/non-binary paths and the FJ_DEBUG_LOAD_BALANCING block) so every CUDA API
and kernel launch in run_step_device is checked.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5d18dbbf-bd69-4d1c-8ed7-7457c556083e
📒 Files selected for processing (8)
- cpp/libmps_parser/src/mps_parser.cpp
- cpp/src/dual_simplex/presolve.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
- cpp/src/utilities/version_info.cpp
- cpp/tests/linear_programming/unit_tests/optimization_problem_test.cu
…g CUDA checks
- Use unique_ptr for FILE handle in presolve binary reader (RAII)
- Throw on fopen failure instead of silent skip
- Add CSC matrix validation (col_start monotonicity, row index bounds)
- Check cudaRuntimeGetVersion return value in version_info
- Wrap remaining cudaStreamBeginCapture and cudaLaunchCooperativeKernel calls with RAFT_CUDA_TRY in feasibility_jump

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/dual_simplex/presolve.hpp`:
- Around line 93-97: The deserialization currently coerces any bit pattern into
obj_scale and objective_is_integral; change the read-and-assign logic around
obj_scale and objective_is_integral (the block that calls check_read for
obj_constant, obj_scale, is_integral) to validate exact on-disk encodings:
accept obj_scale only if it equals 1.0f or -1.0f (reject otherwise) and accept
is_integral only if it equals 0 or 1 (treat 1 as true, 0 as false; reject other
values); on invalid values return a read/format error (or throw) instead of
silently coercing, keeping use of check_read for the fread checks and preserving
obj_constant handling.
- Around line 86-129: The file-reader currently only limits element counts
(max_dim, nnz) but still allows huge memory allocations via many large arrays;
before calling any resize() or fread() for objective, rhs, lower, upper,
A.col_start, A.i, or A.x, compute the total byte footprint required (e.g.
total_bytes = num_cols*sizeof(f_t) /*objective*/ + num_rows*sizeof(f_t) /*rhs*/
+ num_cols*sizeof(f_t) /*lower*/ + num_cols*sizeof(f_t) /*upper*/ +
(num_cols+1)*sizeof(i_t) /*col_start*/ + nnz*sizeof(i_t) /*A.i*/ +
nnz*sizeof(f_t) /*A.x*/) and compare it against a safe cap (e.g. a MAX_BYTES
constant like 1<<30 or derived from max_dim and available memory); if
total_bytes exceeds the cap, throw std::runtime_error and do not perform any
resize/fread. Ensure this check happens immediately after computing nnz (after
reading A.col_start and before any resize/fread on
objective/rhs/lower/upper/A.i/A.x) and keep existing per-field
monotonicity/index checks (A.col_start, A.i bounds) intact.
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 811-815: The cudaGraph_t 'graph' is not destroyed if
RAFT_CUDA_TRY(cudaGraphInstantiate(&graph_instance, graph)) throws, leaking the
captured graph; wrap the instantiate call in a try/catch (or equivalent RAFT
error-handling block) so that on any exception you call cudaGraphDestroy(graph)
before rethrowing, only set graph_created = true after successful instantiation,
and ensure the same fix is applied to the identical pattern in
ping_pong_graph.cu; reference the symbols cudaStreamEndCapture,
cudaGraphInstantiate, cudaGraphDestroy, graph_instance, graph, climber_stream,
RAFT_CUDA_TRY, and graph_created when making the change.
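The cleanup this prompt asks for can be sketched as below; the identifiers come from the review text, and the sketch assumes `RAFT_CUDA_TRY` throws on error (it is not the verified production code):

```cpp
RAFT_CUDA_TRY(cudaStreamEndCapture(climber_stream, &graph));
try {
  RAFT_CUDA_TRY(cudaGraphInstantiate(&graph_instance, graph));
} catch (...) {
  cudaGraphDestroy(graph);  // don't leak the captured graph on failure
  throw;
}
graph_created = true;  // set only after successful instantiation
```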
In `@cpp/src/utilities/version_info.cpp`:
- Around line 169-177: The function currently returns early when cudaGetDevice
or cudaGetDeviceProperties fail, which prevents the subsequent general cuOpt
version/build and CPU logs from running; remove those early returns and instead
set a local flag (e.g., bool has_gpu = false) or guard variables (using
device_id/device_prop) so GPU-specific preparation and the GPU log emission are
skipped but the function continues to emit the general version/build and CPU
logs; specifically, modify the cudaGetDevice and cudaGetDeviceProperties checks
(the blocks calling CUOPT_LOG_WARN) to clear/leave has_gpu false and continue,
and wrap only the GPU info preparation and the GPU logging call(s) in an if
(has_gpu) block so non-GPU logging always executes.
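The `has_gpu` guard described above can be sketched as follows (the logging macros follow the review text, while `log_version_and_cpu_info` is a hypothetical stand-in for the always-run general logging):

```cpp
bool has_gpu = true;
int device_id = 0;
cudaDeviceProp device_prop{};
if (cudaGetDevice(&device_id) != cudaSuccess) {
  CUOPT_LOG_WARN("Failed to query current CUDA device");
  has_gpu = false;
} else if (cudaGetDeviceProperties(&device_prop, device_id) != cudaSuccess) {
  CUOPT_LOG_WARN("Failed to query CUDA device properties");
  has_gpu = false;
}
if (has_gpu) {
  // GPU-specific logging only when both queries succeeded.
  CUOPT_LOG_INFO("GPU: %s", device_prop.name);
}
// General cuOpt version/build and CPU logs always execute.
log_version_and_cpu_info();
```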
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3d3ff6de-c56a-4aa2-af69-97f0a661d5e1
📒 Files selected for processing (3)
- cpp/src/dual_simplex/presolve.hpp
- cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
- cpp/src/utilities/version_info.cpp
… GPU flag
- Add 2 GiB total serialized byte cap before allocations in presolve reader
- Validate obj_scale (must be 1.0 or -1.0) and is_integral (must be 0 or 1)
- Wrap cudaGraphInstantiate in try/catch to ensure graph cleanup on failure
- Use has_gpu flag instead of early return so version/CPU info always logs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu (2)
724-824: ⚠️ Potential issue | 🟠 Major

Add CUDA error checking to all kernel launches in `load_balancing_score_update()`.

Lines 603, 615, 623–624, and 628 in `load_balancing_score_update()` issue raw `cudaLaunchCooperativeKernel` and `<<<>>>` kernel launches without `RAFT_CUDA_TRY` or `RAFT_CHECK_CUDA`. Since this function is invoked at line 721 when `use_load_balancing` is true, launch failures in this path will not be caught and reported, creating an inconsistency with the hardened non-load-balanced path. Wrap all four kernel launches with appropriate error checking.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 724 - 824, The kernel launches inside load_balancing_score_update are missing RAFT_CUDA_TRY/RAFT_CHECK_CUDA wrappers; wrap each raw cudaLaunchCooperativeKernel and cudaLaunchKernel call in that function with RAFT_CUDA_TRY (or use RAFT_CHECK_CUDA where appropriate after stream ops) so failures are reported; specifically update the launches that invoke compute_mtm_moves_kernel<i_t,f_t,...>, update_lift_moves_kernel<i_t,f_t>, update_breakthrough_moves_kernel<i_t,f_t> and the other raw launches in the load_balancing_score_update flow (e.g., select_variable_kernel, handle_local_minimum_kernel, update_assignment_kernel, update_changed_constraints_kernel) to use RAFT_CUDA_TRY/RAFT_CHECK_CUDA consistent with the non-load-balanced path.
709-820: ⚠️ Potential issue | 🟠 Major

Wrap stream capture region in exception-safe guard to ensure cudaStreamEndCapture always executes.

Any `RAFT_CUDA_TRY` failure inside the loop (lines 715–803) exits without calling `cudaStreamEndCapture()` at line 811. Per CUDA documentation, a stream remains invalid after failed capture until `cudaStreamEndCapture()` completes, leaving `climber_stream` unusable for the rest of the solve. Wrap the entire capture block (lines 710–810) in a try/catch that unconditionally ends capture, destroys any returned graph, and rethrows the exception.

The existing try/catch around `cudaGraphInstantiate` (lines 812–818) only handles failures during instantiation, not during the capture itself.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu` around lines 709 - 820, The cuda stream capture started by cudaStreamBeginCapture(climber_stream) must be wrapped in an exception-safe block so cudaStreamEndCapture(climber_stream, &graph) is always called even if any RAFT_CUDA_TRY inside the capture (e.g., launches in the loop using compute_mtm_moves_kernel, update_lift_moves_kernel, select_variable_kernel, etc.) throws; modify the code around the begin capture to use a try/catch/finally pattern that on any exception calls cudaStreamEndCapture, destroys any non-null graph via cudaGraphDestroy(graph), then rethrows; preserve the existing cudaGraphInstantiate/graph_instance logic but ensure graph is cleaned up and graph_created set only after successful instantiate.
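An exception-safe capture region along the lines this prompt suggests could look like the sketch below; `enqueue_captured_kernels` is a hypothetical stand-in for the launches on lines 715–803, and the sketch assumes `RAFT_CUDA_TRY` throws on error:

```cpp
cudaGraph_t graph = nullptr;
RAFT_CUDA_TRY(cudaStreamBeginCapture(climber_stream, cudaStreamCaptureModeThreadLocal));
try {
  enqueue_captured_kernels(climber_stream);  // the wrapped launches inside the capture
} catch (...) {
  // End capture so climber_stream is usable again, then drop any
  // partially captured graph before rethrowing.
  cudaStreamEndCapture(climber_stream, &graph);
  if (graph != nullptr) { cudaGraphDestroy(graph); }
  throw;
}
RAFT_CUDA_TRY(cudaStreamEndCapture(climber_stream, &graph));
```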
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu`:
- Around line 724-824: The kernel launches inside load_balancing_score_update
are missing RAFT_CUDA_TRY/RAFT_CHECK_CUDA wrappers; wrap each raw
cudaLaunchCooperativeKernel and cudaLaunchKernel call in that function with
RAFT_CUDA_TRY (or use RAFT_CHECK_CUDA where appropriate after stream ops) so
failures are reported; specifically update the launches that invoke
compute_mtm_moves_kernel<i_t,f_t,...>, update_lift_moves_kernel<i_t,f_t>,
update_breakthrough_moves_kernel<i_t,f_t> and the other raw launches in the
load_balancing_score_update flow (e.g., select_variable_kernel,
handle_local_minimum_kernel, update_assignment_kernel,
update_changed_constraints_kernel) to use RAFT_CUDA_TRY/RAFT_CHECK_CUDA
consistent with the non-load-balanced path.
- Around line 709-820: The cuda stream capture started by
cudaStreamBeginCapture(climber_stream) must be wrapped in an exception-safe
block so cudaStreamEndCapture(climber_stream, &graph) is always called even if
any RAFT_CUDA_TRY inside the capture (e.g., launches in the loop using
compute_mtm_moves_kernel, update_lift_moves_kernel, select_variable_kernel,
etc.) throws; modify the code around the begin capture to use a
try/catch/finally pattern that on any exception calls cudaStreamEndCapture,
destroys any non-null graph via cudaGraphDestroy(graph), then rethrows; preserve
the existing cudaGraphInstantiate/graph_instance logic but ensure graph is
cleaned up and graph_created set only after successful instantiate.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9646f09b-95a1-4e2c-a65b-2047d379eafd
📒 Files selected for processing (3)
- cpp/src/dual_simplex/presolve.hpp
- cpp/src/mip_heuristics/feasibility_jump/feasibility_jump.cu
- cpp/src/utilities/version_info.cpp
Summary
Improve error handling and resource management across C++/CUDA code: validate fread returns and dimensions in presolve binary reader, wrap unchecked CUDA API calls with RAFT_CUDA_TRY, use RAII for FILE handles in MPS parser, add PID-based shared memory names in gRPC server, and check CUDA device query returns in version_info.
Testing
No new tests added. Existing unit and integration tests cover the modified code paths.
Documentation
No documentation changes needed.