
Remote execution implementation #939

Merged
rapids-bot[bot] merged 24 commits into NVIDIA:release/26.04 from tmckayus:grpc-server-v4
Mar 31, 2026

Conversation


@tmckayus tmckayus commented Mar 6, 2026

This change replaces the solve_lp_remote and solve_mip_remote stubs with real routines that use an embedded gRPC client to communicate with a remote cuOpt server.

Two documents, GRPC_ARCHITECTURE.md and SERVER_ARCHITECTURE.md, are included as developer guides.

The server is built by build.sh and is called cuopt_grpc_server.

Remote execution is still enabled via the env vars CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT.

To try this feature, check out the branch/PR and run the following:

$ build.sh
$ cuopt_grpc_server   # defaults are fine; run with -h for help
$ CUOPT_REMOTE_HOST=localhost CUOPT_REMOTE_PORT=8765 cuopt_cli myproblem.mps

All cuOpt APIs pick up remote execution when the env vars are set: cuopt_cli, the C API, and the Python API will all solve problems on the server as long as it is running. Just use cuOpt tools and APIs as you normally would for a local solve.

@tmckayus tmckayus requested review from a team as code owners March 6, 2026 14:56
@tmckayus tmckayus requested review from Iroy30, aliceb-nv, bdice and rg20 March 6, 2026 14:56
@tmckayus tmckayus added feature request New feature or request non-breaking Introduces a non-breaking change labels Mar 6, 2026

coderabbitai bot commented Mar 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Adds a full gRPC remote execution subsystem: protobuf/gRPC schemas and codegen, a gRPC client and server (multi-process workers, shared-memory and pipe IPC, chunked upload/download), mappers for problems/settings/solutions, extensive tests (unit + integration with TLS/mTLS), build/CI/packaging changes, and removal of prior remote stubs.

Changes

Changes by cohort:

  • Build & Packaging (build.sh, cpp/CMakeLists.txt, python/libcuopt/CMakeLists.txt, ci/utils/install_protobuf_grpc.sh, ci/build_wheel_libcuopt.sh, conda/recipes/libcuopt/recipe.yaml, dependencies.yaml, conda/environments/*): Expose the new cuopt_grpc_server target and gRPC/protobuf codegen in CMake; add a CI script to build/install protobuf/gRPC/abseil; add packaging/conda changes (openssl, c-ares, libuuid) and include cuopt_grpc_server in outputs.
  • Protobuf Definitions (cpp/src/grpc/cuopt_remote.proto, cpp/src/grpc/cuopt_remote_service.proto): Add protobuf schemas for problems, settings, results, chunked-transfer metadata, and the CuOptRemoteService gRPC service (submit, chunked upload/download, status, results, logs, incumbents).
  • Client API & Remote Entrypoints (cpp/src/grpc/client/grpc_client.hpp, cpp/src/grpc/client/grpc_client.cpp, cpp/src/grpc/client/solve_remote.cpp): New public grpc_client_t (sync/async, unary and chunked uploads/downloads, log/incumbent streaming, TLS support) and solve_remote wrappers used by local APIs; test injection helpers added.
  • Mapping Utilities (cpp/src/grpc/grpc_problem_mapper.{hpp,cpp}, cpp/src/grpc/grpc_service_mapper.{hpp,cpp}, cpp/src/grpc/grpc_settings_mapper.{hpp,cpp}, cpp/src/grpc/grpc_solution_mapper.{hpp,cpp}): Bidirectional mappers between CPU types and protobufs (problems, settings, solutions); chunked-header builders, array-chunk builders, status/termination conversions, and explicit template instantiations.
  • Server Core Types & IO (cpp/src/grpc/server/grpc_server_types.hpp, cpp/src/grpc/server/grpc_pipe_io.cpp, cpp/src/grpc/server/grpc_pipe_serialization.hpp, cpp/src/grpc/server/grpc_field_element_size.hpp, cpp/src/grpc/server/grpc_incumbent_proto.hpp): Shared-memory job/result queue layouts and control block, pipe I/O helpers, pipe serialization protocol for chunking, element-size helper, and incumbent proto helpers (guarded by CUOPT_ENABLE_GRPC).
  • Server Entry & Worker Infra (cpp/src/grpc/server/grpc_server_main.cpp, cpp/src/grpc/server/grpc_worker_infra.cpp, cpp/src/grpc/server/grpc_worker.cpp): Server main (CLI, TLS/mTLS, worker spawning), per-worker pipe creation and lifecycle, worker_process implementation performing LP/MIP solves, incumbent forwarding, and result publication.
  • Server Threads & Job Management (cpp/src/grpc/server/grpc_server_threads.cpp, cpp/src/grpc/server/grpc_job_management.cpp): Background threads (worker monitor, result/incumbent retrieval, session reaper); async submission, chunked-job handling, job tracking, cancellation, log management, and session cleanup.
  • gRPC Service Implementation (cpp/src/grpc/server/grpc_service_impl.cpp, cpp/src/grpc/server/grpc_server_main.cpp): Full CuOptRemoteService implementation and factory: SubmitJob, chunked upload/download RPCs, GetResult/CheckStatus/Cancel/Delete/Wait, StreamLogs, GetIncumbents, session state machines, and validation.
  • Pipe Serialization & Tests (cpp/src/grpc/server/grpc_pipe_serialization.hpp, cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp): Wire-format helpers and unit tests validating chunked request/result serialization and multi-chunk assembly over pipes.
  • Tests & Test Utilities (cpp/tests/linear_programming/grpc/*, cpp/tests/linear_programming/grpc/grpc_client_test.cpp, cpp/tests/linear_programming/grpc/grpc_integration_test.cpp, cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp, cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp, cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp, python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py): Large additions: client unit tests with mock stubs; integration tests that spawn cuopt_grpc_server (including TLS/mTLS); test log capture and helper utilities; test fixtures updated to use a live server; C API and Python tests adapted.
  • Removed Old Remote Stubs (cpp/src/pdlp/solve_remote.cu, cpp/src/pdlp/CMakeLists.txt): Delete prior placeholder remote-stub implementations and remove solve_remote.cu from LP source lists; remote execution now uses the gRPC client/server.
  • Documentation & Quickstart (GRPC_INTERFACE.md, GRPC_QUICK_START.md, GRPC_SERVER_ARCHITECTURE.md): Add documentation covering the gRPC interface, quick start (including TLS/mTLS), server architecture, chunked protocol, and build/test instructions.
  • Tests CMake Wiring (cpp/tests/linear_programming/CMakeLists.txt, cpp/tests/linear_programming/grpc/CMakeLists.txt): Hook the gRPC test subdirectory under CUOPT_ENABLE_GRPC and add gRPC-related test executables; the integration test depends on cuopt_grpc_server.
  • Python / Packaging CMake (python/libcuopt/CMakeLists.txt): Find the gRPC package and set the install RPATH for cuopt_grpc_server for Python packaging.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py (1)

152-163: ⚠️ Potential issue | 🟠 Major

Strengthen this warmstart test to assert numerical correctness, not just non-None outputs.

After adding CUOPT_PRESOLVE at Line 152, the test still only checks that a primal solution exists (Line 162). Please also validate warmstart numerical correctness (e.g., objective consistency/improvement within tolerance).

Proposed test-strengthening diff
     sol1 = linear_programming.Solve(dm, settings)
     ws = sol1.get_pdlp_warm_start_data()
+    obj1 = sol1.get_primal_objective()

     if ws is not None:
         settings.set_pdlp_warm_start_data(ws)
         settings.set_parameter(CUOPT_ITERATION_LIMIT, 200)
         sol2 = linear_programming.Solve(dm, settings)
         assert sol2.get_primal_solution() is not None
+        obj2 = sol2.get_primal_objective()
+        assert obj2 is not None
+        assert abs(obj2 - obj1) / max(1.0, abs(obj1)) < 1e-6

As per coding guidelines, **/*test*.{cpp,cu,py} must “Write tests validating numerical correctness of optimization results (not just 'runs without error').”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 152 - 163, The test currently only asserts
sol2.get_primal_solution() is not None; strengthen it by checking numerical
correctness: retrieve both primal solutions (via sol1.get_primal_solution() and
sol2.get_primal_solution()), compute their objective values (either via
solX.get_primal_objective() if available or by forming the dot product of the
primal vector with the problem cost from dm), then assert the objective values
are consistent within a small tolerance (e.g., abs(obj1 - obj2) <= 1e-6) or that
sol2's objective is no worse (<=) than sol1's given the increased
CUOPT_ITERATION_LIMIT; keep tolerance and exact assertion choice explicit in the
test.
🟠 Major comments (18)
cpp/src/grpc/server/grpc_worker.cpp-56-74 (1)

56-74: ⚠️ Potential issue | 🟠 Major

Potential race condition in result queue access.

store_simple_result iterates through result_queue looking for an empty slot without synchronization. If multiple workers call this concurrently, they could both find the same slot with !result_queue[i].ready and overwrite each other's results.

Consider using an atomic compare-exchange on ready similar to the job claiming pattern (lines 86-87), or protect access with a mutex.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 56 - 74,
store_simple_result currently scans result_queue and sets a slot when
result_queue[i].ready is false without synchronization, causing a race where
multiple callers can claim the same slot; update this function to claim a slot
atomically (e.g., change result_queue[i].ready to an std::atomic<bool> and
perform an atomic compare-exchange to flip false->true before writing the slot)
or surround the scan-and-write with a mutex; ensure you only write the other
fields (job_id, status, data_size, worker_index, error_message, retrieved) after
successfully claiming the slot to avoid concurrent overwrites.
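A minimal sketch of the suggested claim-then-publish pattern follows; result_slot_t, MAX_RESULTS, and the field names here are illustrative stand-ins for the PR's shared-memory layout, not its actual definitions. A three-state slot avoids the secondary hazard of a reader observing a slot marked ready before its fields are written:

```cpp
#include <atomic>
#include <cstring>

// FREE -> CLAIMED -> READY: claim atomically, fill fields, then publish.
enum slot_state : int { SLOT_FREE = 0, SLOT_CLAIMED = 1, SLOT_READY = 2 };

struct result_slot_t {
  std::atomic<int> state{SLOT_FREE};
  char job_id[64]{};
  int status{0};
};

constexpr int MAX_RESULTS = 8;
result_slot_t result_queue[MAX_RESULTS];

// Two workers can never win the same index: only one compare_exchange_strong
// on a given slot can succeed in flipping SLOT_FREE -> SLOT_CLAIMED.
int store_result(const char* job_id, int status)
{
  for (int i = 0; i < MAX_RESULTS; ++i) {
    int expected = SLOT_FREE;
    if (result_queue[i].state.compare_exchange_strong(expected, SLOT_CLAIMED)) {
      std::strncpy(result_queue[i].job_id, job_id, sizeof(result_queue[i].job_id) - 1);
      result_queue[i].status = status;
      // Publish only after all fields are written.
      result_queue[i].state.store(SLOT_READY, std::memory_order_release);
      return i;
    }
  }
  return -1;  // queue full; the caller still needs a policy for this case
}
```

In a real cross-process shared-memory segment the atomics must also be lock-free (`std::atomic<int>::is_always_lock_free`) for this to be valid between processes.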
cpp/src/grpc/server/grpc_server_types.hpp-55-64 (1)

55-64: 🛠️ Refactor suggestion | 🟠 Major

Avoid using directives in headers.

The using declarations (lines 55-61) and using namespace directive (line 63) pollute the global namespace for any file that includes this header. This can cause unexpected name collisions.

Suggested fix: Use qualified names instead

Remove these lines and use fully qualified names (e.g., grpc::Server, grpc::Status) in the implementation files that include this header. Alternatively, move these declarations inside the #ifdef CUOPT_ENABLE_GRPC guard's associated .cpp files only.

-using grpc::Server;
-using grpc::ServerBuilder;
-using grpc::ServerContext;
-using grpc::ServerReaderWriter;
-using grpc::ServerWriter;
-using grpc::Status;
-using grpc::StatusCode;
-
-using namespace cuopt::linear_programming;
-// Note: NOT using "using namespace cuopt::remote" to avoid JobStatus enum conflict
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 55 - 64, Remove the
header-level using declarations for grpc types (grpc::Server,
grpc::ServerBuilder, grpc::ServerContext, grpc::ServerReaderWriter,
grpc::ServerWriter, grpc::Status, grpc::StatusCode) and the using namespace
cuopt::linear_programming; to avoid leaking names (and the JobStatus enum
conflict). Replace usages in this header with fully-qualified names (e.g.,
grpc::Server, cuopt::linear_programming::Whatever) or move the using
declarations into the corresponding .cpp files (or inside the CUOPT_ENABLE_GRPC
guard in .cpp) so only implementation files import those namespaces; ensure any
references in grpc_server_types.hpp are updated to the qualified symbols.
cpp/src/grpc/server/grpc_worker_infra.cpp-194-205 (1)

194-205: ⚠️ Potential issue | 🟠 Major

Handle full result_queue when marking worker failures.

If all result slots are occupied, this path drops the terminal failure record. A waiting client can then stall even though the job already failed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 194 - 205, The
current loop that writes a terminal failure into result_queue stops if every
slot is already marked ready, which can drop the failure; modify the failure
handling logic so that if no slot with !ready is found it will (1) search for a
slot with ready && retrieved (i.e., already consumed) and reuse it, using the
same writes (copy_cstr to job_id/error_message, set status, data_size,
worker_index, retrieved=false, ready=true), and (2) if none exist, as a last
resort overwrite the oldest/first slot (e.g., index 0) to guarantee the client
gets a terminal response; update the block around result_queue/MAX_RESULTS and
reuse the existing copy_cstr calls so the behavior is deterministic.
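The fallback order described above can be sketched like this; the struct, MAX_RESULTS, and field names mirror the review text but are assumptions about the PR's layout, and synchronization is omitted for brevity:

```cpp
#include <cstring>

// Simplified slot: ready = published, retrieved = already consumed by client.
struct result_slot_t {
  bool ready{false};
  bool retrieved{false};
  char job_id[64]{};
};

constexpr int MAX_RESULTS = 4;
result_slot_t result_queue[MAX_RESULTS];

// Guarantee a terminal failure record is published: prefer a free slot,
// then reuse an already-retrieved slot, and as a last resort overwrite
// slot 0 so the record is never silently dropped.
int publish_failure(const char* job_id)
{
  int target = -1;
  for (int i = 0; i < MAX_RESULTS && target < 0; ++i)  // 1) free slot
    if (!result_queue[i].ready) target = i;
  for (int i = 0; i < MAX_RESULTS && target < 0; ++i)  // 2) consumed slot
    if (result_queue[i].ready && result_queue[i].retrieved) target = i;
  if (target < 0) target = 0;                          // 3) overwrite oldest
  std::strncpy(result_queue[target].job_id, job_id,
               sizeof(result_queue[target].job_id) - 1);
  result_queue[target].retrieved = false;
  result_queue[target].ready     = true;
  return target;
}
```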
cpp/src/grpc/server/grpc_server_main.cpp-50-60 (1)

50-60: ⚠️ Potential issue | 🟠 Major

Harden numeric CLI parsing and range validation.

Line 53/55/57/59 use stoi/stoll without exception handling or range checks. Invalid input or negative values (e.g., workers/port) can crash startup or create invalid runtime config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 60, The CLI
numeric parsing loop that assigns config.port, config.num_workers,
config.max_message_mb and config.max_message_b uses std::stoi/std::stoll without
error handling or range checks; wrap each conversion in a try/catch for
std::invalid_argument and std::out_of_range, validate values (e.g., port > 0 &&
port <= 65535, num_workers > 0, max_message_mb >= 0, max_message_b >= 4096) and
on invalid input log an error via the existing logger (or std::cerr) and
return/exit with non‑zero status or fall back to safe defaults; update the
parsing branches that set config.port, config.num_workers, config.max_message_mb
and the --max-message-bytes branch that sets config.max_message_b to use these
guarded conversions and checks.
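A guarded replacement for the bare stoi/stoll calls might look like the helper below; parse_int_in_range and the example bounds are illustrative, not the PR's actual symbols:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Parse a decimal integer, rejecting non-numeric input, trailing garbage,
// and values outside [lo, hi]. Returns false instead of throwing.
bool parse_int_in_range(const std::string& arg, long long lo, long long hi,
                        long long& out)
{
  try {
    std::size_t pos = 0;
    long long v = std::stoll(arg, &pos);
    if (pos != arg.size()) return false;  // trailing garbage, e.g. "80x"
    if (v < lo || v > hi) return false;   // outside the allowed range
    out = v;
    return true;
  } catch (const std::invalid_argument&) {
    return false;  // not a number at all
  } catch (const std::out_of_range&) {
    return false;  // does not fit in long long
  }
}
```

Call sites would then validate, e.g., a port with `parse_int_in_range(arg, 1, 65535, port)` and exit with a usage error on failure instead of crashing in std::stoi.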
cpp/src/grpc/server/grpc_worker_infra.cpp-120-124 (1)

120-124: ⚠️ Potential issue | 🟠 Major

Close child-end FDs on fork failure to prevent descriptor leaks.

On fork failure, only server-side pipe ends are closed. Parent-owned child ends (worker_read_fd, worker_write_fd, worker_incumbent_write_fd) remain open.

🔧 Minimal fix
   if (pid < 0) {
     std::cerr << "[Server] Failed to fork " << (is_replacement ? "replacement worker " : "worker ")
               << worker_id << "\n";
     close_worker_pipes_server(worker_id);
+    close_worker_pipes_child_ends(worker_id);
     return -1;
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 120 - 124, On fork
failure in grpc_worker_infra.cpp (the pid < 0 branch), close the parent-owned
child-end file descriptors to avoid leaks: ensure worker_read_fd,
worker_write_fd, and worker_incumbent_write_fd are closed (if valid) in addition
to calling close_worker_pipes_server(worker_id); update the pid < 0 error path
in the code that creates workers so it explicitly closes those descriptors
before returning -1.
cpp/src/grpc/client/solve_remote.cu-55-57 (1)

55-57: ⚠️ Potential issue | 🟠 Major

Validate env overrides before applying chunk/message sizes.

CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES are accepted as-is; zero/negative/too-small values can break chunk sizing and message-limit behavior.

✅ Minimal validation example
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_bytes  = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+
+  config.chunk_size_bytes = std::max<int64_t>(4096, chunk_size);
+  config.max_message_bytes =
+    (max_bytes <= 0) ? config.max_message_bytes : std::max<int64_t>(4096, max_bytes);
   config.enable_debug_log  = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The code currently
applies CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES directly which can allow
zero/negative/too-small values to break chunking; after calling parse_env_int64
for these into config.chunk_size_bytes and config.max_message_bytes, validate
each parsed value against sensible thresholds (e.g., >0 and >= a defined
MIN_CHUNK_SIZE / MIN_MESSAGE_BYTES constant) and only assign the override if it
passes validation; if invalid, keep the existing default and emit a warning/log
mentioning the env var name and rejected value (use the same symbols
config.chunk_size_bytes, config.max_message_bytes, parse_env_int64 and add
MIN_CHUNK_SIZE / MIN_MESSAGE_BYTES constants or inline thresholds).
cpp/src/grpc/client/solve_remote.cu-153-166 (1)

153-166: ⚠️ Potential issue | 🟠 Major

Don’t forward incumbent objective as the callback bound.

Line 164 sets bound_copy equal to incumbent objective. That can make gap-sensitive callback logic interpret the incumbent as already tightly bounded, which is semantically wrong when a true best bound is unavailable.

Based on learnings: objective direction is encoded by objective-sense metadata, and callback bounds should use direction-appropriate “no bound” values when the true bound is not available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 153 - 166, The incumbent
callback is incorrectly using the incumbent objective as the callback bound
(bound_copy), which can mislead gap-sensitive logic; instead, set bound_copy to
a direction-appropriate "no bound" sentinel (e.g., +infinity for minimization,
-infinity for maximization) rather than objective. In the
config.incumbent_callback lambda (referencing mip_callbacks and
get_solution_callback_t::get_solution), query the objective sense from the
callback or problem metadata (e.g., a get_objective_sense() or equivalent) and
assign bound_copy = std::numeric_limits<double>::infinity() for minimization or
-std::numeric_limits<double>::infinity() for maximization before calling
get_solution; do not forward the incumbent objective as the bound.
cpp/src/grpc/server/grpc_server_threads.cpp-253-263 (1)

253-263: ⚠️ Potential issue | 🟠 Major

Synchronize worker_pipes access across polling and worker-restart paths.

This thread reads worker_pipes (lines 254–255) while the monitor thread replaces dead workers via spawn_single_worker() / close_worker_pipes_server(), which modify the same entries without a shared lock. This is a data race: the vector can reallocate during concurrent modifications, and file descriptors can be closed while the incumbent thread polls them, causing undefined behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 253 - 263, The
polling code reads shared worker_pipes without synchronization, causing a data
race with spawn_single_worker() / close_worker_pipes_server() which mutate that
vector; fix by introducing and using a shared mutex (e.g., worker_pipes_mutex)
that both the polling thread and the worker-restart code lock when
accessing/modifying worker_pipes, and change the polling path (the loop that
builds pfds) to take the mutex, copy the needed file descriptors into a local
vector<int> (or local pfds) while holding the lock, then release the lock before
calling poll() so the poll is not done while holding the mutex; also ensure
close_worker_pipes_server() and spawn_single_worker() acquire the same mutex
when closing/removing entries to avoid closing fds being polled.
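The lock-copy-poll pattern proposed above can be sketched as follows; worker_pipes and worker_pipes_mutex are assumed names, not the PR's symbols:

```cpp
#include <mutex>
#include <vector>
#include <poll.h>

std::mutex worker_pipes_mutex;
std::vector<int> worker_pipes;  // stand-in for the per-worker incumbent fds

// Snapshot the fds while holding the lock so a concurrent worker restart
// cannot close an fd or reallocate the vector mid-iteration; the actual
// poll() then runs on the private copy, off-lock.
std::vector<pollfd> snapshot_pollfds()
{
  std::vector<pollfd> pfds;
  std::lock_guard<std::mutex> lock(worker_pipes_mutex);
  pfds.reserve(worker_pipes.size());
  for (int fd : worker_pipes) {
    pollfd p{};
    p.fd     = fd;
    p.events = POLLIN;
    pfds.push_back(p);
  }
  return pfds;
}

// Usage sketch: auto pfds = snapshot_pollfds();
//               poll(pfds.data(), pfds.size(), timeout_ms);  // off-lock
```

The restart path (spawn/close) must acquire the same mutex before mutating worker_pipes; a closed fd can still appear in an in-flight snapshot, so poll results should tolerate POLLNVAL on stale descriptors.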
cpp/src/grpc/grpc_settings_mapper.cu-144-146 (1)

144-146: ⚠️ Potential issue | 🟠 Major

Guard iteration_limit narrowing conversion.

Line 145 casts protobuf int64 directly into i_t. Oversized values can overflow/truncate and corrupt solver limits.

Proposed fix
-  if (pb_settings.iteration_limit() >= 0) {
-    settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
-  }
+  if (pb_settings.iteration_limit() >= 0) {
+    const auto iter_limit = pb_settings.iteration_limit();
+    if (iter_limit > static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+      settings.iteration_limit = std::numeric_limits<i_t>::max();
+    } else {
+      settings.iteration_limit = static_cast<i_t>(iter_limit);
+    }
+  }

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The code
currently casts pb_settings.iteration_limit() directly to i_t (setting
settings.iteration_limit) which can overflow; update the assignment in the block
that checks pb_settings.iteration_limit() to first compare the int64 value
against the representable range of i_t (use std::numeric_limits<i_t>::max() and
min()/0 as appropriate), and then either clamp to that range or return/log an
error before assigning to settings.iteration_limit; reference
pb_settings.iteration_limit(), settings.iteration_limit, and the i_t type when
making this change so the narrowing conversion is guarded.
cpp/src/grpc/server/grpc_job_management.cpp-109-113 (1)

109-113: ⚠️ Potential issue | 🟠 Major

Cap incumbent payload size before data.resize(size).

Line 111 trusts wire-provided size and can allocate arbitrarily large memory.

Proposed fix
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  constexpr uint64_t kMaxIncumbentBytes = 64ULL * 1024 * 1024;
+  if (size > kMaxIncumbentBytes) {
+    std::cerr << "[Server] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code
reads a wire-provided uint64_t size and calls data.resize(size) without
validation; before calling data.resize (in the block that uses read_from_pipe,
size and data), validate and cap size against a defined maximum payload (e.g.
kMaxPayloadSize) and reject sizes that are zero or exceed the cap by returning
false; use the same symbols from the snippet (uint64_t size, read_from_pipe(fd,
&size,...), data.resize(size)) so the check occurs immediately after reading
size and before any allocation or further read_from_pipe calls.
cpp/src/grpc/server/grpc_service_impl.cpp-699-716 (1)

699-716: ⚠️ Potential issue | 🟠 Major

StreamLogs can block indefinitely for terminal jobs without a log file.

The wait loop only exits on file existence or NOT_FOUND; a completed/failed/cancelled job with no log file never terminates the RPC.

Proposed fix
       if (waited_ms >= 2000) {
         std::string msg;
         JobStatus s = check_job_status(job_id, msg);
         if (s == JobStatus::NOT_FOUND) {
           if (config.verbose) {
             std::cout << "[gRPC] StreamLogs job not found: " << job_id << std::endl;
           }
           return Status(grpc::StatusCode::NOT_FOUND, "Job not found: " + job_id);
         }
+        if (s == JobStatus::COMPLETED || s == JobStatus::FAILED || s == JobStatus::CANCELLED) {
+          cuopt::remote::LogMessage done;
+          done.set_line("");
+          done.set_byte_offset(from_byte);
+          done.set_job_complete(true);
+          writer->Write(done);
+          return Status::OK;
+        }
         waited_ms = 0;
       }

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 699 - 716, The
StreamLogs loop can hang for terminal jobs that never produce a log file; modify
the loop (in grpc_service_impl::StreamLogs) to call check_job_status(job_id,
msg) when the file is still missing and, if the returned JobStatus is a terminal
state (e.g., JobStatus::COMPLETED, JobStatus::FAILED, JobStatus::CANCELLED or
other terminal enums your code defines), break out and return a gRPC error (for
example Status(grpc::StatusCode::NOT_FOUND, "Log file not available for job: " +
job_id)) or an appropriate Status; keep the existing verbose logging path
(config.verbose) and context->IsCancelled() checks, and retain the waited_ms
retry/backoff behavior but ensure terminal-state handling resets/short-circuits
the wait to avoid indefinite blocking.
cpp/src/grpc/server/grpc_service_impl.cpp-189-193 (1)

189-193: ⚠️ Potential issue | 🟠 Major

SendArrayChunk allows unbounded duplicate/overlapping payload growth.

meta.received_bytes and state.total_bytes are incremented unconditionally; repeated overlapping chunks can inflate memory usage until OOM.

Proposed guard
-    meta.received_bytes += static_cast<int64_t>(raw.size());
+    if (meta.received_bytes + static_cast<int64_t>(raw.size()) > array_bytes) {
+      return Status(StatusCode::INVALID_ARGUMENT, "Received bytes exceed declared array size");
+    }
+    meta.received_bytes += static_cast<int64_t>(raw.size());
     state.total_bytes += static_cast<int64_t>(raw.size());

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 189 - 193,
SendArrayChunk currently unconditionally increments meta.received_bytes and
state.total_bytes and pushes state.chunks, allowing duplicate/overlapping chunks
to inflate memory; modify SendArrayChunk to first validate the incoming chunk
against already-received ranges (e.g., track per-array received byte ranges in
state or use an interval set) and skip or trim duplicate/overlapping payloads
before modifying meta.received_bytes, state.total_bytes, state.chunks, and
state.total_chunks; also enforce a global per-array max size check and reject or
cap chunks that would push state.total_bytes beyond that limit to prevent
unbounded growth.
cpp/src/grpc/grpc_settings_mapper.cu-154-154 (1)

154-154: ⚠️ Potential issue | 🟠 Major

Validate protobuf presolver value before casting to presolver_t.

Line 154 and Line 221 trust unvalidated wire values; invalid integers become unsupported internal enum states.

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."

Also applies to: 221-221

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cu` at line 154, The code casts the
protobuf integer pb_settings.presolver() directly into presolver_t (see
assignment to settings.presolver and the similar use at the other spot), which
can produce invalid enum states for out-of-range wire values; before
static_casting to presolver_t, validate the integer is within the defined
presolver_t range (e.g., compare against the enum's min/max or use a switch/case
for known values), handle unknown values by logging an error/warning and
selecting a safe default (or returning an error), and then assign the validated
value to settings.presolver to avoid unsupported internal enum states.
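A switch-based validation of the wire value might look like this; the enum values here are illustrative, since the real presolver_t enumerators live in the cuOpt headers:

```cpp
// Illustrative enum standing in for cuOpt's presolver_t.
enum class presolver_t : int { none = 0, basic = 1, full = 2 };

// Map an untrusted wire integer to presolver_t, falling back to a safe
// default for any value outside the known set instead of blindly casting.
presolver_t presolver_from_wire(int v, presolver_t fallback)
{
  switch (v) {
    case static_cast<int>(presolver_t::none):
    case static_cast<int>(presolver_t::basic):
    case static_cast<int>(presolver_t::full):
      return static_cast<presolver_t>(v);
    default:
      return fallback;  // unknown wire value: keep the configured default
  }
}
```

A switch over the known enumerators (rather than a min/max range check) stays correct even if the enum ever gains non-contiguous values.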
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-673-699 (1)

673-699: ⚠️ Potential issue | 🟠 Major

SolveInfeasibleLP can pass without actually validating infeasibility behavior.

The assertion is gated by if (result.success && result.solution), so transport or solve regressions can still produce a passing test.

Proposed fix
   auto result = client->solve_lp(problem, settings);
-  if (result.success && result.solution) {
-    EXPECT_NE(result.solution->get_termination_status(), pdlp_termination_status_t::Optimal);
-  }
+  ASSERT_TRUE(result.success) << result.error_message;
+  ASSERT_NE(result.solution, nullptr);
+  EXPECT_NE(result.solution->get_termination_status(), pdlp_termination_status_t::Optimal);

As per coding guidelines "**/*test*.{cpp,cu,py}: Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 673
- 699, The test SolveInfeasibleLP currently skips assertions when the solver
failed or returned no solution; replace the guarded if-check with hard
assertions so the test fails on transport/solve regressions: assert that
result.success is true (ASSERT_TRUE(result.success)), assert that
result.solution is non-null (ASSERT_NE(result.solution, nullptr) or
ASSERT_TRUE(result.solution)), and then assert the termination status equals
pdlp_termination_status_t::Infeasible
(ASSERT_EQ(result.solution->get_termination_status(),
pdlp_termination_status_t::Infeasible)) so the test verifies infeasibility
behavior unconditionally.
cpp/src/grpc/grpc_problem_mapper.cu-175-182 (1)

175-182: ⚠️ Potential issue | 🟠 Major

Validate lower/upper bound array sizes before applying both.

Line 175 only checks constraint_lower_bounds_size() > 0; a malformed payload with mismatched upper-size still gets applied.

Proposed guard
-  if (pb_problem.constraint_lower_bounds_size() > 0) {
+  if (pb_problem.constraint_lower_bounds_size() > 0 ||
+      pb_problem.constraint_upper_bounds_size() > 0) {
+    if (pb_problem.constraint_lower_bounds_size() != pb_problem.constraint_upper_bounds_size()) {
+      return;  // or propagate an explicit mapping error
+    }
     std::vector<f_t> con_lb(pb_problem.constraint_lower_bounds().begin(),
                             pb_problem.constraint_lower_bounds().end());

As per coding guidelines "Ensure variables and constraints are accessed from the correct problem context (original vs presolve vs folded vs postsolve); verify index mapping consistency across problem transformations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 175 - 182, The code applies
constraint lower and upper bounds when only constraint_lower_bounds_size() > 0
is checked, which allows mismatched arrays; update the guard to validate both
pb_problem.constraint_lower_bounds_size() and
pb_problem.constraint_upper_bounds_size() are > 0 and equal before calling
cpu_problem.set_constraint_lower_bounds(...) and
cpu_problem.set_constraint_upper_bounds(...); if sizes differ or one is zero,
skip applying them (or log/return an error) to avoid out-of-bounds/misaligned
mappings and ensure index consistency across transformations.
cpp/src/grpc/grpc_problem_mapper.cu-454-461 (1)

454-461: ⚠️ Potential issue | 🟠 Major

Require complete CSR/Q triplets before setting matrix data.

Line 454 and Line 526 gate only on values array presence; applying matrices without matching indices/offsets can create malformed models.

Proposed guard pattern
-  if (!a_values.empty()) {
+  if (!a_values.empty() || !a_indices.empty() || !a_offsets.empty()) {
+    if (a_values.empty() || a_indices.empty() || a_offsets.empty()) { return; }
     cpu_problem.set_csr_constraint_matrix(...);
   }
-  if (!q_values.empty()) {
+  if (!q_values.empty() || !q_indices.empty() || !q_offsets.empty()) {
+    if (q_values.empty() || q_indices.empty() || q_offsets.empty()) { return; }
     cpu_problem.set_quadratic_objective_matrix(...);
   }

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."

Also applies to: 526-533

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 454 - 461, The current
guard only checks a_values before calling cpu_problem.set_csr_constraint_matrix,
which can yield malformed CSR data if a_indices or a_offsets are missing or
their sizes don't match; update the condition to require a_values, a_indices,
and a_offsets to be non-empty and verify their sizes are consistent (e.g.,
a_indices.size() matches a_values.size() and a_offsets.size() matches expected
row count + 1) before calling cpu_problem.set_csr_constraint_matrix, and apply
the same strengthened guard/validation to the other matrix-setting block
referenced around the second call (the similar call later in the file).
cpp/src/grpc/server/grpc_service_impl.cpp-823-824 (1)

823-824: ⚠️ Potential issue | 🟠 Major

Fix incumbent pagination cursor calculation.

next_index is always set to available, which skips data when max_count is smaller than remaining incumbents.

Proposed fix
-    response->set_next_index(available);
+    response->set_next_index(from_index + count);

As per coding guidelines "Validate algorithm correctness in optimization logic: simplex pivots, branch-and-bound decisions, routing heuristics, and constraint/objective handling must produce correct results."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 823 - 824, The
pagination logic incorrectly sets response->set_next_index(available) which
skips items when max_count < remaining; change it to compute the next cursor as
the start index plus the number of items actually returned (e.g., start +
returned_count) or equivalently start + std::min(max_count, available - start),
then call response->set_next_index(...) with that value and set done when that
next index >= available; update the code around
response->set_next_index(available) and the subsequent bool done = ...
expression accordingly.
cpp/src/grpc/client/grpc_client.cu-525-531 (1)

525-531: ⚠️ Potential issue | 🟠 Major

Use compute_chunk_size() to clamp chunk payload to negotiated message-size limits.

chunk_data_budget is derived only from config_.chunk_size_bytes without enforcing server/client limits. If user config exceeds server capabilities, chunk RPC calls will fail with message-size errors. Apply the existing compute_chunk_size() helper at both occurrences (lines 525-531 and 694-698):

🔧 Proposed fix
-  int64_t chunk_data_budget = config_.chunk_size_bytes;
-  if (chunk_data_budget <= 0) { chunk_data_budget = 1LL * 1024 * 1024; }
+  int64_t chunk_data_budget =
+    compute_chunk_size(server_max_message_bytes_, config_.max_message_bytes, config_.chunk_size_bytes);

   const int64_t proto_overhead = 64;
   if (chunk_data_budget > proto_overhead) { chunk_data_budget -= proto_overhead; }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 525 - 531, The code sets
chunk_data_budget directly from config_.chunk_size_bytes then subtracts
proto_overhead before calling build_array_chunk_requests, which can exceed
negotiated gRPC limits; replace the direct assignment with a call to
compute_chunk_size(config_.chunk_size_bytes) (or the appropriate
compute_chunk_size overload) to clamp to negotiated message-size limits, then
subtract proto_overhead and pass that clamped chunk_data_budget into
build_array_chunk_requests; apply the exact same change to the other place in
this file where chunk_data_budget is computed later (the second occurrence) so
both chunk payload computations use compute_chunk_size().
🟡 Minor comments (10)
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-152-152 (1)

152-152: ⚠️ Potential issue | 🟡 Minor

Clarify the presolve mode and add numerical validation to the warmstart test.

At line 152, the literal 0 for CUOPT_PRESOLVE obscures intent—there is no PresolverMode enum exported to Python, so add an inline comment explaining that 0 disables presolve (required for warmstart). More importantly, the test at line 162 only checks that the solution is non-None; it should validate numerical correctness of the warmstart result by comparing it against a baseline solve or checking solution bounds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` at
line 152, Replace the magic literal by documenting intent: add an inline comment
next to settings.set_parameter(CUOPT_PRESOLVE, 0) stating that 0 disables
presolve (required for warmstart) since no PresolverMode enum is exposed to
Python. Then extend the warmstart test (the test that currently only asserts
solution is not None) to perform numerical validation: run a baseline solve
without warmstart (or use known optimal objective/variable bounds) and assert
the warmstarted solution’s objective and key variable values are within a tight
tolerance (e.g., via pytest.approx) of the baseline; reference the same settings
and solution objects used in the test to obtain values for comparison.
cpp/src/grpc/server/grpc_incumbent_proto.hpp-30-32 (1)

30-32: ⚠️ Potential issue | 🟡 Minor

Unchecked serialization return value.

SerializeToArray can fail and returns a boolean, but the return value is ignored. This could lead to returning an invalid/incomplete buffer silently.

🛡️ Proposed fix
   std::vector<uint8_t> buffer(msg.ByteSizeLong());
-  msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()));
+  if (!msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size()))) {
+    return {};  // Return empty vector on failure
+  }
   return buffer;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 30 - 32, The code
ignores the boolean result of msg.SerializeToArray which can fail; update the
serialization block to check the return value of
msg.SerializeToArray(buffer.data(), static_cast<int>(buffer.size())) and handle
failure (e.g., throw a std::runtime_error or return an empty/optional buffer)
rather than returning a potentially incomplete vector; keep the existing use of
msg.ByteSizeLong() and the buffer variable but ensure you validate
SerializeToArray's result and produce a clear error/alternative return before
returning buffer.
cpp/src/grpc/server/grpc_worker.cpp-248-251 (1)

248-251: ⚠️ Potential issue | 🟡 Minor

Missing CUDA error checking on cudaMemcpy.

Per coding guidelines, CUDA errors should be caught and mapped to meaningful error codes. These cudaMemcpy calls could fail (e.g., device out of memory) but errors are not checked.

🛡️ Proposed fix pattern
-        cudaMemcpy(host_solution.data(),
-                   device_solution.data(),
-                   device_solution.size() * sizeof(double),
-                   cudaMemcpyDeviceToHost);
+        cudaError_t err = cudaMemcpy(host_solution.data(),
+                                     device_solution.data(),
+                                     device_solution.size() * sizeof(double),
+                                     cudaMemcpyDeviceToHost);
+        if (err != cudaSuccess) {
+          throw std::runtime_error(std::string("cudaMemcpy failed: ") + cudaGetErrorString(err));
+        }

Apply the same pattern to the other cudaMemcpy calls on lines 296-307.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 248 - 251, The cudaMemcpy
calls (e.g., the call copying device_solution -> host_solution) lack CUDA error
checks; wrap each cudaMemcpy (including the other calls around lines 296-307) to
capture the returned cudaError_t, check if it is not cudaSuccess, and convert it
to a meaningful error/Status (use/create a helper like mapCudaErrorToStatus or
mapCudaErrorToGrpcStatus) and return/log that error instead of proceeding;
update the cudaMemcpy calls that reference device_solution, host_solution, and
any other device/host buffers to use this pattern so failures (OOM, invalid
value, etc.) are detected and mapped to the existing error-handling flow.
cpp/src/grpc/client/test_grpc_client.cpp-401-403 (1)

401-403: ⚠️ Potential issue | 🟡 Minor

Unhandled exception from std::stod.

If the user provides an invalid --time-limit value (e.g., non-numeric), std::stod throws std::invalid_argument or std::out_of_range, resulting in an unhandled exception with a cryptic error message.

🛡️ Proposed fix
     } else if (arg == "--time-limit" && arg_idx + 1 < argc) {
-      time_limit = std::stod(argv[++arg_idx]);
+      try {
+        time_limit = std::stod(argv[++arg_idx]);
+      } catch (const std::exception& e) {
+        std::cerr << "ERROR: Invalid time limit value\n";
+        return 1;
+      }
       arg_idx++;

The same issue exists for std::stoll on line 531.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Wrap the
std::stod call that parses --time-limit (and the std::stoll call at the other
location) in a try/catch that catches std::invalid_argument and
std::out_of_range, then log a clear error via the existing logging/exit path
and/or set a safe default; specifically, around the parsing of argv in the
branch handling "--time-limit" (variable time_limit and std::stod) catch parsing
exceptions, call the same error-handling used elsewhere (print usage or error
and exit with non-zero) and do the same for the std::stoll parse to avoid
uncaught exceptions and cryptic crashes.
cpp/src/grpc/cuopt_remote_service.proto-314-318 (1)

314-318: ⚠️ Potential issue | 🟡 Minor

Inconsistent job_id type: bytes here vs string elsewhere.

IncumbentRequest.job_id is defined as bytes (line 315), but all other request messages (e.g., StatusRequest, GetResultRequest, CancelRequest, DeleteRequest) use string for job_id. This inconsistency could cause confusion and require different handling on client/server.

Suggested fix for consistency
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;        // Return incumbents starting from this index
   int32 max_count = 3;         // Optional limit (0 or negative => no limit)
 }

Similarly, Incumbent.job_id at line 324 should also be string for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 318,
IncumbentRequest.job_id (and matching Incumbent.job_id) are declared as bytes
while all other request messages use string; change the field type from bytes to
string in the proto for IncumbentRequest and Incumbent so they match
StatusRequest/GetResultRequest/CancelRequest/DeleteRequest, then
recompile/regenerate the gRPC/protobuf stubs (look for the IncumbentRequest and
Incumbent message definitions) to ensure client/server code uses the consistent
string type.
cpp/src/grpc/server/grpc_server_types.hpp-263-265 (1)

263-265: ⚠️ Potential issue | 🟡 Minor

Hardcoded shared memory names may cause conflicts with concurrent server instances.

The shared memory segment names (/cuopt_job_queue, /cuopt_result_queue, /cuopt_control) are hardcoded constants. If multiple server instances run on the same host, they will conflict.

Consider incorporating the port number or a unique identifier into the segment names.

Suggested approach
// Instead of inline constants, generate names dynamically:
inline std::string get_shm_job_queue_name(int port) {
  return "/cuopt_job_queue_" + std::to_string(port);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 263 - 265, The three
hardcoded shared-memory name constants (SHM_JOB_QUEUE, SHM_RESULT_QUEUE,
SHM_CONTROL) will collide across concurrent server instances; change them from
inline const char* values to functions that generate unique names at runtime
(e.g., append the server port, PID, or UUID) and update call sites to use the
generator functions (e.g., get_shm_job_queue_name(port),
get_shm_result_queue_name(port), get_shm_control_name(port)) so each server
instance uses distinct segment names.
GRPC_ARCHITECTURE.md-298-304 (1)

298-304: ⚠️ Potential issue | 🟡 Minor

Document required remote endpoint env vars in the Environment Variables section.

This table currently lists only transfer tuning variables. Please also include CUOPT_REMOTE_HOST and CUOPT_REMOTE_PORT, since remote execution depends on them.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_ARCHITECTURE.md` around lines 298 - 304, The Environment Variables table
is missing the remote endpoint variables: add two rows for CUOPT_REMOTE_HOST and
CUOPT_REMOTE_PORT in the "Environment Variables" section, specifying sensible
defaults (e.g., blank or "localhost" for CUOPT_REMOTE_HOST and a default port
like 50051 for CUOPT_REMOTE_PORT) and clear descriptions such as "Remote
execution host" and "Remote execution port" so remote execution is documented
alongside CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES.
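As a sketch, the missing rows might look like the following (the defaults shown are placeholders; the actual defaults, if any, should be taken from the client code rather than this example):

```text
| Variable          | Default | Description                          |
|-------------------|---------|--------------------------------------|
| CUOPT_REMOTE_HOST | (unset) | Host of the remote cuopt gRPC server |
| CUOPT_REMOTE_PORT | (unset) | Port of the remote cuopt gRPC server |
```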
GRPC_ARCHITECTURE.md-16-39 (1)

16-39: ⚠️ Potential issue | 🟡 Minor

Specify languages on fenced code blocks to satisfy markdown lint.

These blocks are missing a language identifier (MD040), which is currently failing lint hygiene.

📝 Example fix pattern
-```
+```text
 cpp/src/grpc/
 ...
-```
+```

Also applies to: 105-120, 136-151, 359-371

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_ARCHITECTURE.md` around lines 16 - 39, The markdown fenced code blocks
in GRPC_ARCHITECTURE.md are missing language identifiers (MD040); update each
triple-backtick block (including the shown cpp/src/grpc/ tree block and the
other blocks at ranges 105-120, 136-151, 359-371) to include a language token
such as "text" (i.e., change ``` to ```text) so linting passes; ensure all
similar block openings (```...) in the file are consistently annotated.
cpp/src/grpc/grpc_solution_mapper.cu-559-560 (1)

559-560: ⚠️ Potential issue | 🟡 Minor

Warm-start presence check is too narrow.

Warm-start reconstruction only triggers when RESULT_WS_CURRENT_PRIMAL is non-empty; other warm-start fields can be present and still be dropped.

Suggested adjustment
-  auto ws_primal = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL);
-  if (!ws_primal.empty()) {
+  auto ws_primal = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL);
+  auto ws_dual   = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_DUAL);
+  auto ws_init_p = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_INITIAL_PRIMAL_AVG);
+  auto ws_init_d = bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_INITIAL_DUAL_AVG);
+  if (!ws_primal.empty() || !ws_dual.empty() || !ws_init_p.empty() || !ws_init_d.empty()) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 559 - 560, The code only
triggers warm-start reconstruction when ws_primal (from
bytes_to_typed<f_t>(arrays, cuopt::remote::RESULT_WS_CURRENT_PRIMAL)) is
non-empty, which misses cases where other warm-start fields exist; change the
presence check to detect any warm-start data key in arrays (e.g.,
cuopt::remote::RESULT_WS_CURRENT_PRIMAL, RESULT_WS_CURRENT_DUAL,
RESULT_WS_PREV_PRIMAL, RESULT_WS_PREV_DUAL or other RESULT_WS_* entries) before
skipping reconstruction. Implement a small helper or boolean that queries arrays
for any of those RESULT_WS_* keys (or calls bytes_to_typed for each and checks
non-empty) and use that combined condition instead of only ws_primal.empty() so
reconstruction runs when any warm-start field is present.
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-53-57 (1)

53-57: ⚠️ Potential issue | 🟡 Minor

Add <fcntl.h> include for open() and flag constants.

Line 120 uses open() with O_WRONLY|O_CREAT|O_TRUNC, but <fcntl.h> is not explicitly included. While some systems may include it transitively through unistd.h, this is not guaranteed by the POSIX standard and creates a portability risk.

File handle cleanup is correct; fd is properly closed at line 125.

Proposed fix
 #include <signal.h>
+#include <fcntl.h>
 #include <sys/types.h>
 #include <sys/wait.h>
 #include <unistd.h>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 53
- 57, The test uses open() with flags O_WRONLY|O_CREAT|O_TRUNC (call site:
open(..., O_WRONLY|O_CREAT|O_TRUNC)) but does not include <fcntl.h>, which can
break portability; add #include <fcntl.h> to the top includes so the O_* flag
constants and open declaration are defined and the code compiles reliably across
platforms.
🧹 Nitpick comments (10)
cpp/src/grpc/server/grpc_worker.cpp (1)

149-161: Duplicate error handling blocks.

Lines 149-151 and 153-161 both handle !read_success. The first block logs to stderr, while the second handles the failure logic. These could be consolidated.

♻️ Proposed consolidation
-    if (!read_success) {
-      std::cerr << "[Worker " << worker_id << "] Failed to read job data from pipe\n";
-    }
-
     if (!read_success) {
+      std::cerr << "[Worker " << worker_id << "] Failed to read job data from pipe\n";
       store_simple_result(job_id, worker_id, 1, "Failed to read job data");
       job.worker_pid   = 0;
       // ... rest of cleanup
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 149 - 161, The duplicate
handling for !read_success should be consolidated into a single block: remove
the first standalone stderr log and merge its message into the existing failure
branch so that when read_success is false you both log the error (including
worker_id) and execute the failure logic (call store_simple_result(job_id,
worker_id, 1, "..."), reset job fields job.worker_pid, job.worker_index,
job.data_sent, job.ready, job.claimed, and continue). Update the block around
the read_success check in grpc_worker.cpp to perform logging and the cleanup
exactly once to avoid duplicate branches.
cpp/src/grpc/server/grpc_incumbent_proto.hpp (1)

44-49: Consider clearing the output vector before populating.

The assignment vector is passed by reference but not cleared before push_back calls. If a caller reuses the same vector across multiple calls, stale data will accumulate.

♻️ Proposed fix
   job_id    = incumbent_msg.job_id();
   objective = incumbent_msg.objective();
+  assignment.clear();
   assignment.reserve(incumbent_msg.assignment_size());
   for (int i = 0; i < incumbent_msg.assignment_size(); ++i) {
     assignment.push_back(incumbent_msg.assignment(i));
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 44 - 49, The
assignment vector is populated via push_back without being cleared, so reuse
will accumulate stale entries; before calling assignment.reserve(...) and the
for-loop that pushes from incumbent_msg.assignment(i), call assignment.clear()
(or use assignment.assign(...) from the repeated field) to ensure the vector is
emptied first—update the code around the symbols assignment and incumbent_msg in
grpc_incumbent_proto.hpp to clear assignment before reserving/pushing.
cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)

93-100: Verify semantic difference between result and request deserialization.

For result blobs (line 94), total_elements is used directly as byte count for resize. For request blobs (lines 151-153), total_elements * elem_size is used. This asymmetry suggests total_elements has different semantics in result vs request contexts - consider adding a clarifying comment.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 93 - 100, The
resize logic for result deserialization uses ac.total_elements() directly as
bytes while request deserialization multiplies total_elements by element size,
so confirm whether ac.total_elements() represents bytes or element count and
make semantics consistent: either change the result-path allocation (the
dest.resize call) to use total_bytes = ac.total_elements() * ac.elem_size() (or
equivalent accessor) before resizing and memcpy, or add a clear comment by the
result-path block documenting that total_elements already includes element size
in bytes; update any related uses (element_offset(), chunk_data.size(), memcpy)
to match the chosen unit to avoid off-by-factor bugs.
cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp (1)

16-20: Consider reordering includes for consistency.

Standard convention places system/library headers before local headers. The <memory> include should come before the local "grpc_client.hpp" include.

Suggested reorder
 #include <cuopt_remote_service.grpc.pb.h>
 #include <grpcpp/grpcpp.h>
-#include "grpc_client.hpp"
 
 #include <memory>
+
+#include "grpc_client.hpp"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp` around lines
16 - 20, Reorder the includes so system/library headers come before local
headers: move the <memory> include above the "grpc_client.hpp" include in
grpc_client_test_helper.hpp; ensure the block now lists
<cuopt_remote_service.grpc.pb.h>, <grpcpp/grpcpp.h>, <memory> and then
"grpc_client.hpp" to follow the standard include ordering convention.
cpp/tests/linear_programming/grpc/CMakeLists.txt (1)

24-33: Minor: Redundant GTest library linkage.

GTest::gmock_main already includes GTest::gmock, and GTest::gtest_main includes GTest::gtest. You can simplify to just the _main variants.

Simplified linkage
 target_link_libraries(GRPC_CLIENT_TEST
     PRIVATE
     cuopt
-    GTest::gmock
     GTest::gmock_main
-    GTest::gtest
     GTest::gtest_main
     gRPC::grpc++
     protobuf::libprotobuf
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/CMakeLists.txt` around lines 24 - 33,
Remove redundant GTest link entries in the target_link_libraries call for
GRPC_CLIENT_TEST: keep only the main variants (GTest::gmock_main and
GTest::gtest_main) and remove GTest::gmock and GTest::gtest to avoid duplicate
linkage; update the target_link_libraries block that references
GRPC_CLIENT_TEST, cuopt, gRPC::grpc++, and protobuf::libprotobuf accordingly so
only the necessary _main symbols remain.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (2)

359-360: Remove unused server_logs_ member.

The server_logs_ vector is declared but never populated or used. Server logs are read directly from the file system via get_server_logs(). This dead code should be removed to avoid confusion.

Remove unused member
  private:
   mutable std::mutex mutex_;
   std::vector<LogEntry> client_logs_;
-  std::vector<LogEntry> server_logs_;
   std::string server_log_path_;
   std::streampos server_log_start_pos_ = 0;  // Position in server log file when test started
   bool test_start_marked_              = false;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 359
- 360, Remove the dead member server_logs_ from the class: delete the
declaration of std::vector<LogEntry> server_logs_ (it is never populated or
used; server logs are accessed via get_server_logs()). After removing the
member, run a quick search for server_logs_ to ensure there are no remaining
references and update any tests or constructors that may have initialized it.

123-126: Callback captures this pointer — document lifetime requirement.

The lambda returned by client_callback() captures this by pointer. If the callback is invoked after the GrpcTestLogCapture instance is destroyed, this will cause undefined behavior. Consider adding a brief note in the documentation that the GrpcTestLogCapture instance must outlive any client using this callback.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 123
- 126, The lambda returned by GrpcTestLogCapture::client_callback() captures
this by pointer and can dangle if the GrpcTestLogCapture instance is destroyed,
so add a clear lifetime note: update the documentation/comment for the
GrpcTestLogCapture class and/or the client_callback() method to state that the
GrpcTestLogCapture instance must outlive any client that retains or invokes the
returned callback; reference the client_callback() function name in the comment
so callers know the requirement and consider mentioning alternative designs
(e.g., providing a weak_ptr-based factory) if you want to avoid the strict
lifetime requirement.
SERVER_ARCHITECTURE.md (1)

15-46: Add language specifier to fenced code block.

Per markdownlint, fenced code blocks should have a language specified. For ASCII diagrams, use text or plaintext.

Fix markdown lint warning
-```
+```text
 ┌────────────────────────────────────────────────────────────────────┐
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The fenced ASCII diagram block
in SERVER_ARCHITECTURE.md is missing a language specifier; change the opening
fence from ``` to ```text (or ```plaintext) so the block becomes ```text and
satisfies markdownlint for the diagram that starts with the
"┌────────────────────────────────────────────────────────────────────┐" line.
cpp/src/grpc/client/solve_remote.cu (1)

69-74: Mark currently-unused LP flags explicitly (or forward them).

problem_checking and use_pdlp_solver_mode are currently unused. Marking them explicitly avoids silent interface drift and warning churn.

Small cleanup option
 std::unique_ptr<lp_solution_interface_t<i_t, f_t>> solve_lp_remote(
   cpu_optimization_problem_t<i_t, f_t> const& cpu_problem,
   pdlp_solver_settings_t<i_t, f_t> const& settings,
-  bool problem_checking,
-  bool use_pdlp_solver_mode)
+  [[maybe_unused]] bool problem_checking,
+  [[maybe_unused]] bool use_pdlp_solver_mode)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 69 - 74, The function
solve_lp_remote currently accepts parameters problem_checking and
use_pdlp_solver_mode but never uses them, which can cause warnings and interface
drift; update the function (solve_lp_remote) to explicitly mark these parameters
as intentionally unused (e.g., cast to void or annotate with [[maybe_unused]])
or forward them to the underlying call that needs them, ensuring the symbols
problem_checking and use_pdlp_solver_mode appear in the implementation so the
intent is clear and compiler warnings are silenced.
cpp/tests/linear_programming/grpc/grpc_client_test.cpp (1)

1086-1340: Add at least one degenerate solve-case assertion in this suite.

The end-to-end mock solve tests are strong, but adding explicit infeasible/unbounded/empty (or singleton edge) solve assertions would better protect status/result mapping regressions.

As per coding guidelines: "Write tests validating numerical correctness of optimization results (not just 'runs without error'); test degenerate cases (infeasible, unbounded, empty, singleton problems)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 1086 -
1340, Add a new degenerate-case unit test (e.g., TEST_F(GrpcClientTest,
SolveLP_Infeasible)) that uses the existing GrpcClientTest fixture and
MockCuOptStub to exercise grpc_client_t::solve_lp for an infeasible LP: have
SubmitJob(...) return a job_id, have CheckStatus/WaitForCompletion indicate
COMPLETED (or skip if use_wait), and have GetResult(...) return a
cuopt::remote::LPSolution with termination_status set to PDLP_INFEASIBLE (and
resp->set_status(cuopt::remote::SUCCESS) if appropriate); then assert the
returned result indicates failure (result.success == false) and that
result.error_message contains a clear indicator of infeasibility (e.g.,
"INFEASIBLE" or "infeasible"); reuse symbols SubmitJob,
CheckStatus/WaitForCompletion, GetResult, grpc_client_t::solve_lp, and
GrpcClientTest to locate where to add the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6984e3cb-4874-432a-8307-c8f3a64fd9ba

📥 Commits

Reviewing files that changed from the base of the PR and between 1c6f86b and 6cafe37.

📒 Files selected for processing (39)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/routing/utils.py
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

♻️ Duplicate comments (9)
cpp/src/grpc/server/grpc_server_main.cpp (2)

186-212: ⚠️ Potential issue | 🔴 Critical

Add teardown on TLS config failures after workers/threads are started.

These early returns occur after runtime startup and bypass shutdown/join/cleanup paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 186 - 212, The TLS
validation branches that currently do "return 1" (around checks using config,
ssl_opts and read_file_to_string for cert/key/root and require_client) can occur
after workers/threads have been started; replace those early returns with calls
to the centralized shutdown/teardown routine (e.g., call your existing server
stop/cleanup functions such as StopServer(), shutdownWorkers()/joinThreads(), or
a new cleanup_resources() that performs shutdown and joins) and then return the
error code; ensure every failure path that currently returns from inside the TLS
setup invokes that teardown routine before exiting so workers/threads and other
resources are properly cleaned up.
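One way to guarantee teardown on every early-return path is an RAII scope guard, sketched below under the assumption that a cleanup callback (wrapping e.g. shutdownWorkers()/joinThreads()) exists; the names here are illustrative.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Minimal RAII guard: runs the given teardown exactly once when the scope
// exits, whether by normal flow or an early return.
class scope_guard {
 public:
  explicit scope_guard(std::function<void()> fn) : fn_(std::move(fn)) {}
  ~scope_guard() { if (fn_) fn_(); }
  scope_guard(const scope_guard&) = delete;
  scope_guard& operator=(const scope_guard&) = delete;
 private:
  std::function<void()> fn_;
};

// Models the TLS-validation section of main(): the early "return 1" no
// longer bypasses cleanup, because the guard's destructor runs first.
int run_with_teardown(bool tls_config_ok, int* teardown_calls) {
  scope_guard cleanup([teardown_calls] { ++(*teardown_calls); });
  if (!tls_config_ok) return 1;  // early return still triggers teardown
  return 0;
}
```

With this pattern each failure branch only needs `return 1;` and the guard handles shutdown/join/cleanup.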

234-251: ⚠️ Potential issue | 🔴 Critical

Guard BuildAndStart() failure before waiting on the server pointer.

server->Wait() is called unconditionally; if startup fails and returns null, this dereferences a null pointer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 234 - 251,
BuildAndStart() may return nullptr so avoid unguarded dereference of server;
after std::unique_ptr<Server> server(builder.BuildAndStart()), check if (server)
before calling server->Wait() and starting the shutdown logic that assumes a
live server. If BuildAndStart() fails, log an error (or print to cerr) and
exit/return non-zero. Also ensure the shutdown_thread's capture/Shutdown call
remains safe by only invoking server->Shutdown() when server is non-null; move
creation of shutdown_thread and the server->Wait() call inside the if (server)
block and handle the failure path explicitly.
cpp/src/grpc/server/grpc_job_management.cpp (1)

147-158: ⚠️ Potential issue | 🔴 Critical

Synchronize all job_queue access; current slot claim/update path is racy.

submit_job_async still does unsynchronized check-then-write on shared queue state, and check_job_status/cancel_job concurrently read/write the same fields. This can corrupt queue ownership and status transitions under concurrent requests.

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

Also applies to: 195-196, 267-270

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, The
submit_job_async path performs an unsynchronized check-then-write on shared
job_queue entries (e.g., checking job_queue[i].ready/claimed then writing fields
like job_id, ready, claimed), which races with check_job_status and cancel_job;
protect all accesses to job_queue (both scans and per-entry reads/writes) by
introducing a synchronization primitive (e.g., a std::mutex or per-entry mutex
array) and acquire the lock(s) around the loop that scans MAX_JOBS and before
modifying fields (job_queue[i].ready, claimed, cancelled, etc.), and also ensure
check_job_status and cancel_job acquire the same lock(s) before reading or
updating those fields so ownership and state transitions are atomic and
race-free.
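The check-then-write race disappears once the scan and the claim happen under one lock. A minimal sketch, with field names mirroring the comment (ready/claimed) but an illustrative struct, not the real JobQueueEntry:

```cpp
#include <array>
#include <cassert>
#include <mutex>

struct job_slot_t {
  bool ready = false;
  bool claimed = false;
};

constexpr int MAX_JOBS = 4;

struct job_queue_t {
  std::array<job_slot_t, MAX_JOBS> slots;
  std::mutex mtx;

  // Returns the claimed index, or -1 if the queue is full. The scan and the
  // write are a single critical section, so two concurrent submitters can
  // never observe the same slot as free.
  int claim_slot() {
    std::lock_guard<std::mutex> lock(mtx);
    for (int i = 0; i < MAX_JOBS; ++i) {
      if (!slots[i].claimed && !slots[i].ready) {
        slots[i].claimed = true;  // check-then-write is atomic under the lock
        return i;
      }
    }
    return -1;
  }
};
```

check_job_status and cancel_job would take the same mutex before reading or mutating slot state.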
cpp/src/grpc/client/grpc_client.cu (4)

258-266: ⚠️ Potential issue | 🔴 Critical

Guard public RPC methods before dereferencing impl_->stub.

These methods can be called before connect(). Dereferencing impl_->stub without a null check can crash.

🔧 Suggested guard pattern
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_ || !impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }

   grpc::ClientContext context;
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The public RPC
method grpc_client_t::check_status dereferences impl_->stub without ensuring
impl_ is initialized or stub is non-null (can be called before connect()), which
can crash; add a null-check guard at the start of check_status (and other public
RPC methods) that verifies impl_ and impl_->stub are valid, and return an
appropriate job_status_result_t error/failed state (or set result.error and
result.code) when the client is not connected instead of dereferencing; refer to
grpc_client_t::connect and impl_->stub when adding the guard so the check
mirrors connect()'s initialization contract.

994-995: ⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before using it as a divisor.

max_polls divides by config_.poll_interval_ms without checking <= 0, which can crash.

🔧 Suggested validation
+  if (config_.poll_interval_ms <= 0) {
+    stop_log_streaming();
+    result.error_message = "poll_interval_ms must be > 0";
+    return result;
+  }
   int poll_count = 0;
   int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;

Also applies to: 1132-1132

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 994 - 995, The code computes
max_polls by dividing by config_.poll_interval_ms without validating it; update
the logic that sets max_polls (the line initializing int max_polls) to first
validate config_.poll_interval_ms is > 0 (and also guard against extremely
large/small values), returning an error or using a safe default/policy (e.g.,
set poll_interval_ms = 1 or return a failure status) when it's <= 0 to avoid
division-by-zero and undefined behavior; apply the same validation where
max_polls is computed elsewhere (the other occurrence around the code that uses
config_.poll_interval_ms).
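The validated computation can be a small pure helper; this sketch uses a -1 sentinel for invalid configuration (the error-handling convention is an assumption, not the real client's):

```cpp
#include <cassert>

// Computes the poll budget defensively: a non-positive interval is rejected
// instead of being used as a divisor.
int compute_max_polls(int timeout_seconds, int poll_interval_ms) {
  if (poll_interval_ms <= 0) return -1;  // avoid division by zero
  if (timeout_seconds < 0) return -1;
  long long total_ms = static_cast<long long>(timeout_seconds) * 1000;
  return static_cast<int>(total_ms / poll_interval_ms);
}
```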

703-712: ⚠️ Potential issue | 🔴 Critical

Validate chunk metadata and chunk bounds before allocation/copy.

elem_size, total_elems, and elements_in_chunk are trusted too early. This can lead to division by zero, overflowed allocation sizes, or out-of-bounds memcpy.

🔧 Suggested hardening
   for (const auto& arr_desc : header->arrays()) {
     auto field_id       = arr_desc.field_id();
     int64_t total_elems = arr_desc.total_elements();
     int64_t elem_size   = arr_desc.element_size_bytes();
     if (total_elems <= 0) continue;
+    if (elem_size <= 0) {
+      last_error_ = "Invalid chunk metadata: non-positive element size";
+      return false;
+    }
+    if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+      last_error_ = "Invalid chunk metadata: byte-size overflow";
+      return false;
+    }
+    int64_t total_bytes = total_elems * elem_size;
+    if (total_bytes < 0 ||
+        static_cast<uint64_t>(total_bytes) > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+      last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+      return false;
+    }

     int64_t elems_per_chunk = chunk_data_budget / elem_size;
     if (elems_per_chunk <= 0) elems_per_chunk = 1;

-    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));

     for (int64_t elem_offset = 0; elem_offset < total_elems; elem_offset += elems_per_chunk) {
       int64_t elems_wanted = std::min(elems_per_chunk, total_elems - elem_offset);
@@
       int64_t elems_received = chunk_resp.elements_in_chunk();
       const auto& data       = chunk_resp.data();

       if (static_cast<int64_t>(data.size()) != elems_received * elem_size) {
         last_error_ = "GetResultChunk: data size mismatch";
         return false;
       }
+      if (elems_received < 0 || elem_offset + elems_received > total_elems) {
+        last_error_ = "GetResultChunk: elements_in_chunk out of bounds";
+        return false;
+      }

       std::memcpy(array_bytes.data() + elem_offset * elem_size, data.data(), data.size());

Also applies to: 731-740

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 703 - 712, Validate and
guard all chunk metadata before allocating or copying: ensure
arr_desc.element_size_bytes() (elem_size) > 0 and arr_desc.total_elements()
(total_elems) >= 0 before using them; ensure chunk_data_budget is >0 before
computing elems_per_chunk and handle division safely; check for
size_t/multiplication overflow when computing total_elems * elem_size before
constructing array_bytes and use a safe cap; when iterating chunks validate each
elements_in_chunk is within [0, total_elems] and compute copy_size =
elements_in_chunk * elem_size with overflow checks before any memcpy; add clear
error handling/logging and skip/abort the chunk on invalid metadata rather than
performing the allocation or memcpy (refer to symbols: arr_desc, field_id,
elem_size, total_elems, chunk_data_budget, elems_per_chunk, elements_in_chunk,
array_bytes, memcpy).
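The overflow and bounds checks from the hardening diff can be factored into two standalone predicates, sketched here with illustrative names:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Rejects non-positive element sizes/counts and products that would
// overflow the total byte count before any allocation happens.
bool valid_chunk_metadata(int64_t total_elems, int64_t elem_size) {
  if (total_elems <= 0 || elem_size <= 0) return false;
  if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) return false;
  return true;
}

// A received chunk starting at elem_offset must stay inside [0, total_elems].
bool chunk_in_bounds(int64_t elem_offset, int64_t elems_received, int64_t total_elems) {
  if (elems_received < 0 || elem_offset < 0) return false;
  return elem_offset <= total_elems - elems_received;
}
```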

247-251: ⚠️ Potential issue | 🔴 Critical

stop_log_streaming() can hang waiting on a blocked reader thread.

Setting stop_logs_ does not unblock a thread already stuck in stream read, so join() can block indefinitely.

🔧 Suggested fix
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;

   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
     grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);

     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 247 - 251,
stop_log_streaming currently only sets stop_logs_ and then joins log_thread_,
which can hang if the thread is blocked in a gRPC read; change the shutdown
sequence to (1) set stop_logs_, (2) explicitly cancel/close the gRPC stream used
by the reader (e.g. call the grpc::ClientContext::TryCancel() or cancel/Finish
the ClientReader/ReaderInterface instance used for log streaming), (3) ensure
any CompletionQueue or async reader is shutdown/drained so the blocked read
returns, and only then join log_thread_ (use a timed join as a fallback and log
an error if join times out); update stop_log_streaming to reference stop_logs_,
log_thread_ and the stream/reader/client-context member (the object that
performs the stream read) when implementing the cancel/cleanup steps.
cpp/src/grpc/grpc_problem_mapper.cu (1)

399-411: ⚠️ Potential issue | 🔴 Critical

Validate byte alignment before memcpy in chunked array decoders.

n is computed with integer division, but memcpy copies full byte count. Misaligned payload sizes can write past allocated vectors.

🔧 Suggested fix
   auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(double) != 0) return {};
     size_t n = it->second.size() / sizeof(double);
@@
   auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(int32_t) != 0) return {};
     size_t n = it->second.size() / sizeof(int32_t);

Also applies to: 414-426

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 411, The lambda
get_doubles uses n = it->second.size() / sizeof(double) but then memcpy's the
full byte size, which can overflow if the payload isn't aligned; validate that
it->second.size() is an exact multiple of sizeof(double) before copying, compute
expected_bytes = n * sizeof(double) and only memcpy expected_bytes (or
return/throw on misaligned size), and when converting to f_t ensure you only
read expected_bytes into the temporary vector; apply the same check/fix to the
other chunked-array decoder blocks that perform memcpy (the other double-to-f_t
conversion in this file).
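A self-contained sketch of the alignment-checked decode (returning empty on misalignment is an illustrative error policy; the real mapper may prefer to fail the whole request):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Decodes a raw byte payload into doubles only when the byte count is an
// exact multiple of sizeof(double); a misaligned payload yields an empty
// vector instead of a partial, out-of-bounds copy.
std::vector<double> decode_doubles(const std::vector<uint8_t>& bytes) {
  if (bytes.empty() || bytes.size() % sizeof(double) != 0) return {};
  std::vector<double> out(bytes.size() / sizeof(double));
  std::memcpy(out.data(), bytes.data(), bytes.size());
  return out;
}
```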
cpp/src/grpc/server/grpc_service_impl.cpp (1)

71-73: ⚠️ Potential issue | 🔴 Critical

Publish ready before releasing claimed to avoid queue slot races.

Current store order can allow another submitter to reserve the same slot before readiness is published.

🔧 Suggested ordering fix
-    job_queue[job_idx].claimed.store(false);
-    job_queue[job_idx].ready.store(true);
+    job_queue[job_idx].ready.store(true, std::memory_order_release);
+    job_queue[job_idx].claimed.store(false, std::memory_order_release);
As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

Also applies to: 265-267

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 71 - 73, The code
currently does job_queue[job_idx].claimed.store(false) before
job_queue[job_idx].ready.store(true), which can let another submitter claim the
slot before readiness is published; swap the two stores so ready.store(true) is
executed before claimed.store(false) for the job_queue entry, and apply the same
fix to the other identical location where claimed and ready are updated (the
later occurrence around the second claim/release sequence) to prevent the race.
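The publish-then-release ordering can be shown with a minimal producer/consumer pair (slot_t is illustrative, not the real queue entry): the producer writes the payload, publishes `ready` with release semantics, and only then clears `claimed`, so a consumer that acquires `ready` is guaranteed to see the payload.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct slot_t {
  int payload = 0;
  std::atomic<bool> ready{false};
  std::atomic<bool> claimed{true};  // producer owns the slot initially
};

void publish(slot_t& slot, int value) {
  slot.payload = value;
  slot.ready.store(true, std::memory_order_release);     // publish first
  slot.claimed.store(false, std::memory_order_release);  // then release slot
}

int consume(slot_t& slot) {
  while (!slot.ready.load(std::memory_order_acquire)) {
    std::this_thread::yield();
  }
  return slot.payload;  // acquire on ready synchronizes with the release
}
```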
🧹 Nitpick comments (5)
cpp/src/grpc/client/test_grpc_client.cpp (1)

401-403: Consider graceful handling for invalid numeric CLI input instead of process abort.

std::stod/std::stoll at lines 402 and 531 throw on malformed input (e.g., --time-limit abc or an invalid incumbent index). While this is test code, a try-catch wrapper would provide a friendlier CLI experience than an abort from an uncaught exception.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 401 - 403, Replace
direct calls to std::stod and std::stoll in the CLI parsing logic with
exception-safe parsing: wrap the std::stod(std::string) used to set time_limit
and the std::stoll used to set incumbent_index in try-catch blocks that catch
std::invalid_argument and std::out_of_range, print a clear error message
indicating the offending flag and value (e.g., "--time-limit" or
"--incumbent-index"), and exit with a non-zero status instead of letting the
process throw; update the parsing branches that reference time_limit and
incumbent_index so they validate the parsed numeric ranges if applicable and
fall back or exit cleanly on parse failure.
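A small exception-safe wrapper covers both findings; the helper name and its trailing-garbage policy are illustrative:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Wraps std::stod so malformed or out-of-range flag values report failure
// instead of terminating the process via an uncaught exception.
bool parse_double_arg(const std::string& text, double& out) {
  try {
    size_t pos = 0;
    out = std::stod(text, &pos);
    return pos == text.size();  // reject trailing garbage like "1.5x"
  } catch (const std::invalid_argument&) {
    return false;
  } catch (const std::out_of_range&) {
    return false;
  }
}
```

The --incumbent-index branch would use an analogous wrapper around std::stoll.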
cpp/src/grpc/cuopt_remote_service.proto (1)

9-9: Standardize job_id field type across RPC schema.

Most request messages use string job_id (StatusRequest, GetResultRequest, DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.), but IncumbentRequest.job_id and Incumbent.job_id are defined as bytes. Similarly, cuopt_remote.proto uses bytes job_id in SubmitResponse. This type inconsistency can cause friction in client libraries expecting uniform scalar types across the API contract. Either standardize on string for consistency with the majority of the schema, or use bytes uniformly if avoiding UTF-8 validation is a requirement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` at line 9, IncumbentRequest.job_id,
Incumbent.job_id and SubmitResponse.job_id are defined as bytes while most RPC
messages (StatusRequest, GetResultRequest, DeleteRequest, CancelRequest,
WaitRequest, StreamLogsRequest, etc.) use string; change those fields to type
string to standardize the schema, update any proto message definitions where
job_id is bytes to string (IncumbentRequest, Incumbent, SubmitResponse), run
proto generation to refresh client/server stubs, and adjust any
serialization/handler code that expects bytes to handle string instead.
cpp/src/grpc/grpc_solution_mapper.cu (2)

93-101: Consider bulk operations for repeated fields.

The element-by-element add_*() calls work correctly but may be slower for large solutions. Protobuf's mutable_primal_solution()->Reserve(primal.size()) followed by Add() or using mutable_primal_solution()->Assign(...) could improve performance for large-scale problems.

💡 Example using Reserve for primal_solution
   // Solution vectors - CPU solution already has data in host memory
   const auto& primal       = solution.get_primal_solution_host();
   const auto& dual         = solution.get_dual_solution_host();
   const auto& reduced_cost = solution.get_reduced_cost_host();

+  pb_solution->mutable_primal_solution()->Reserve(primal.size());
   for (const auto& v : primal) {
     pb_solution->add_primal_solution(static_cast<double>(v));
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 93 - 101, The per-element
protobuf adds (pb_solution->add_primal_solution, add_dual_solution,
add_reduced_cost) should be replaced with bulk operations to improve performance
for large vectors: call
pb_solution->mutable_primal_solution()->Reserve(primal.size()) and then use
Add() or assign via mutable_primal_solution()->Assign(...) to copy the primal
data (cast to double as needed), and do the same for dual and reduced_cost using
mutable_dual_solution()->Reserve(dual.size())/Assign and
mutable_reduced_cost()->Reserve(reduced_cost.size())/Assign; update the casts so
the bulk copy converts elements to double before Assign/adding.

559-598: Warm-start presence detection relies on empty array check.

The warm-start data presence is inferred from ws_primal.empty(). This works correctly as long as any valid warm-start data always includes the primal solution. If there's ever a case where warm-start exists but current_primal is empty, this would miss it. Consider checking the header for a dedicated warm-start flag if available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 559 - 598, The code
currently detects warm-start presence by testing ws_primal.empty() in the block
that constructs cpu_pdlp_warm_start_data_t (see bytes_to_typed,
cpu_pdlp_warm_start_data_t, and the if (!ws_primal.empty()) guard); change that
condition to prefer a dedicated header flag (e.g., use h.has_warm_start() or
h.ws_present() if such a field exists) and fall back to the primal-array
emptiness check if the flag is unavailable, e.g., replace if
(!ws_primal.empty()) with if (h.has_warm_start() || !ws_primal.empty()) so
warm-starts with an empty current_primal are still recognized.
cpp/src/grpc/client/grpc_client.hpp (1)

236-240: Non-movable semantics are intentional but could be revisited.

The class is non-copyable and non-movable due to std::atomic<bool> and std::thread members. std::atomic is itself neither copyable nor movable, and std::thread is move-only, so deleting both operations alongside the PIMPL pattern is reasonable. If movability becomes needed, consider wrapping the thread in a nullable type and reinitializing stop_logs_ after the move.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.hpp` around lines 236 - 240, The class
grpc_client_t is explicitly non-copyable and non-movable due to an
std::atomic<bool> (stop_logs_) and an internal std::thread member; if you need
movability, add a noexcept move constructor and move assignment that transfer
ownership of the PIMPL, move the std::thread (std::move on the thread member)
and ensure the source thread is left in a benign state (e.g., set to not
joinable or detached), and transfer/reset the atomic flag (use
stop_logs_.store(false) or std::exchange to set a sensible value in the
moved-from object). Update grpc_client_t::grpc_client_t(grpc_client_t&&) and
operator=(grpc_client_t&&) to move the pimpl pointer, move the thread, and
handle stop_logs_ correctly; ensure proper synchronization when transferring
thread ownership to avoid races.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/grpc/client/solve_remote.cu`:
- Around line 55-57: The env overrides for CUOPT_CHUNK_SIZE and
CUOPT_MAX_MESSAGE_BYTES must be validated before assignment: call
parse_env_int64 for each (as currently done), then only assign to
config.chunk_size_bytes and config.max_message_bytes when the parsed value is a
sane positive integer (e.g., > 0) and within expected bounds; otherwise ignore
the override (leave existing config value) and emit a warning via
process/logging (use config.enable_debug_log or existing logger). In short,
replace the direct assignments with guarded checks around parse_env_int64
results and clamp or reject values that are zero/negative or exceed your defined
upper limits before setting config.chunk_size_bytes and
config.max_message_bytes.
- Around line 164-166: The incumbent callback incorrectly uses the incumbent
objective as the "bound" (bound_copy = objective) when calling
get_callback->get_solution; change the flow so the actual best bound from the
remote solver (use get_solution_bound()) is forwarded through the incumbent
callback protocol and used as bound_copy when invoking
get_callback->get_solution; update the gRPC incumbent callback message/handler
(grpc_client.hpp protocol and the server-side code that calls the callback) to
include a bound field, populate it from get_solution_bound(), and adjust the
client-side call site in solve_remote.cu to read that bound and pass it instead
of objective to ensure callbacks receive the solver's best bound.

In `@cpp/src/grpc/cuopt_remote_service.proto`:
- Line 315: The proto uses two different types for job_id (string in
SubmitJobResponse, StatusRequest, GetResultRequest, StartChunkedDownloadRequest,
DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest vs bytes in
IncumbentRequest and Incumbent), so unify them: choose one canonical type
(prefer string for UTF-8 identifiers or bytes if binary IDs are required) and
update the job_id field in the outlier messages (IncumbentRequest and Incumbent
if switching to string, or update the eight string-typed messages if switching
to bytes) to match; after changing the proto make sure to regenerate language
bindings and add any necessary UTF-8 validation or encoding/decoding notes in
the RPC handling code.

In `@cpp/src/grpc/cuopt_remote.proto`:
- Line 6: The proto package declaration "package cuopt.remote" in
cuopt_remote.proto doesn't match Buf's package-directory rules; either change
the package line to match the file's directory-based package (so the package
mirrors the proto file location) or move cuopt_remote.proto into a directory
structure that matches "cuopt/remote"; update the package declaration (package
cuopt.remote) accordingly so the package and directory structure are consistent
and the PACKAGE_DIRECTORY_MATCH lint passes.

In `@cpp/src/grpc/grpc_settings_mapper.cu`:
- Around line 144-146: The protobuf iteration_limit can exceed the range of the
target type i_t and overflow when statically cast; before assigning
pb_settings.iteration_limit() to settings.iteration_limit (the
static_cast<i_t>), check that the value lies within numeric_limits<i_t>::min()
and numeric_limits<i_t>::max(); if it is out of range, either clamp to the
appropriate min/max or return/log an error and refuse the assignment, ensuring
you reference pb_settings.iteration_limit() and settings.iteration_limit when
implementing the guard to prevent narrowing overflow.

In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 109-113: The code currently reads a uint64_t size from the pipe
and blindly calls data.resize(size) which permits unbounded allocation; before
resizing, validate and cap the incoming size (e.g. define a
MAX_INCUMBENT_PAYLOAD or MAX_INCUMBENT_SIZE constant) and return false if size
is zero or exceeds that limit, and ensure you convert/cast size to size_t
safely; update the recv_incumbent_pipe logic around the read_from_pipe(fd,
&size, ...) and data.resize(...) calls (and any callers using fd/data) to
enforce this limit and prevent resource exhaustion.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Line 90: The loop currently treats any ParseDelimitedFromCodedStream(&ac,
&coded, &clean_eof) == false as a benign break and later returns true, which
accepts partial/corrupt payloads; change the handling so that when
ParseDelimitedFromCodedStream returns false you check clean_eof and only break
on a true clean_eof (clean EOF), otherwise return false to indicate a parse
error. Update both occurrences (the line with ParseDelimitedFromCodedStream(&ac,
&coded, &clean_eof) and the other similar occurrence) so parse failures produce
an immediate false return instead of falling through and returning true.
- Around line 151-154: The resize uses ac.total_elements() * elem_size without
validation which can overflow or trigger huge allocations; before calling
dest.resize validate that ac.total_elements() and elem_size are non-negative and
compute the product using a larger unsigned type (e.g., uint64_t) then check
product <= std::numeric_limits<size_t>::max() and <= a configured maximum
allocation threshold, and only then cast to size_t and call dest.resize; if the
checks fail, return/throw an error instead of resizing. Also apply the same
validation logic for the other occurrence around lines 158-163 (same variables:
ac, elem_size, array_field_element_size(ac.field_id()), dest).

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 50-80: The CLI parsing loop in main (grpc_server_main.cpp)
currently silently ignores unknown flags and missing values; update the for-loop
that inspects argv/argc to validate every flag: for each option that requires a
value (e.g., "-p"/"--port", "-w"/"--workers", "--max-message-mb",
"--max-message-bytes", "--tls-cert", "--tls-key", "--tls-root") check i+1 < argc
and if not, print an error to std::cerr (include the flag name), call
print_usage(argv[0]) and return a non-zero exit code; for unrecognized args (the
final else case) do the same (error + usage + non-zero exit); also wrap
stoi/stoll conversions in try/catch to handle invalid numbers and fail similarly
with an explanatory error. Ensure you reference and update the same config
fields (config.port, config.num_workers, config.max_message_mb,
config.max_message_b, config.tls_cert_path, config.tls_key_path,
config.tls_root_path) and keep behavior for boolean flags unchanged.
- Around line 157-167: The code currently calls memset on JobQueueEntry and
ResultQueueEntry which contain std::atomic members (job_queue and result_queue
initialization loop); replace the raw memset with explicit field initialization
or construction: for each JobQueueEntry/ResultQueueEntry set every member
explicitly (e.g., initialize non-atomic fields directly and initialize atomics
via .store(...) or use placement-new / a helper function that constructs the
object in shared memory) to avoid undefined behavior; update the loops that call
memset(&job_queue[i], 0, ...) and memset(&result_queue[i], 0, ...) to perform
per-field initialization instead, ensuring worker_index, ready, claimed,
cancelled, retrieved, etc. are correctly constructed and stored.

In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 300-307: signal_handler currently calls non-async-signal-safe
functions (std::cout and result_cv.notify_all()), which must be removed; change
signal_handler to only perform async-signal-safe writes: set keep_running to
false using a sig_atomic_t/atomic<bool> and, if needed, set
shm_ctrl->shutdown_requested using an atomic or sig_atomic_t field, and return
immediately (no I/O or condition_variable calls). Add a separate watcher
thread/function (e.g., monitor_shutdown or shutdown_watcher) that polls/blocks
on keep_running becoming false and then performs the logging (std::cout) and
calls result_cv.notify_all() and any other cleanup; update signal_handler,
keep_running's type, and shm_ctrl to use signal-safe atomics so the watcher can
safely observe the change.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 213-218: The setters and other methods access server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization, causing
races; add a member mutex (e.g., server_log_mutex_) and use it to guard all
reads and writes of these fields (wrap accesses in std::lock_guard<std::mutex>),
update set_server_log_path to lock the mutex when updating server_log_path_,
server_log_start_pos_, and test_start_marked_, and ensure every other method
that reads or writes these members (the other server-log readers/pollers
mentioned) also locks the same mutex before accessing them so the state is
consistently protected.

In `@GRPC_ARCHITECTURE.md`:
- Around line 9-10: The documentation sentence claiming “no custom serialization
logic is implemented” is too absolute; update the statement to acknowledge the
custom pipe blob framing helpers introduced in grpc_pipe_serialization.hpp (the
pipe blob framing helpers) and clarify scope by limiting the claim to gRPC wire
payloads (e.g., “All gRPC wire payload serialization uses protocol buffers
generated by protoc and grpc_cpp_plugin; separate pipe blob framing helpers in
grpc_pipe_serialization.hpp are used only for local pipe framing, not for gRPC
wire serialization.”).

In `@SERVER_ARCHITECTURE.md`:
- Around line 15-46: The markdown has untyped fenced code blocks (the ASCII
diagram block and other fenced regions) which trigger MD040; update each
triple-backtick fence (including the large ASCII diagram block shown and the
other blocks noted) to include an explicit language token such as text (e.g.,
change ``` to ```text) or another appropriate language (bash/cpp) so
markdownlint MD040 is satisfied across the file.

---

Duplicate comments:
In `@cpp/src/grpc/client/grpc_client.cu`:
- Around line 258-266: The public RPC method grpc_client_t::check_status
dereferences impl_->stub without ensuring impl_ is initialized or stub is
non-null (can be called before connect()), which can crash; add a null-check
guard at the start of check_status (and other public RPC methods) that verifies
impl_ and impl_->stub are valid, and return an appropriate job_status_result_t
error/failed state (or set result.error and result.code) when the client is not
connected instead of dereferencing; refer to grpc_client_t::connect and
impl_->stub when adding the guard so the check mirrors connect()'s
initialization contract.
- Around line 994-995: The code computes max_polls by dividing by
config_.poll_interval_ms without validating it; update the logic that sets
max_polls (the line initializing int max_polls) to first validate
config_.poll_interval_ms is > 0 (and also guard against extremely large/small
values), returning an error or using a safe default/policy (e.g., set
poll_interval_ms = 1 or return a failure status) when it's <= 0 to avoid
division-by-zero and undefined behavior; apply the same validation where
max_polls is computed elsewhere (the other occurrence around the code that uses
config_.poll_interval_ms).
- Around line 703-712: Validate and guard all chunk metadata before allocating
or copying: ensure arr_desc.element_size_bytes() (elem_size) > 0 and
arr_desc.total_elements() (total_elems) >= 0 before using them; ensure
chunk_data_budget is >0 before computing elems_per_chunk and handle division
safely; check for size_t/multiplication overflow when computing total_elems *
elem_size before constructing array_bytes and use a safe cap; when iterating
chunks validate each elements_in_chunk is within [0, total_elems] and compute
copy_size = elements_in_chunk * elem_size with overflow checks before any
memcpy; add clear error handling/logging and skip/abort the chunk on invalid
metadata rather than performing the allocation or memcpy (refer to symbols:
arr_desc, field_id, elem_size, total_elems, chunk_data_budget, elems_per_chunk,
elements_in_chunk, array_bytes, memcpy).
- Around line 247-251: stop_log_streaming currently only sets stop_logs_ and
then joins log_thread_, which can hang if the thread is blocked in a gRPC read;
change the shutdown sequence to (1) set stop_logs_, (2) explicitly cancel/close
the gRPC stream used by the reader (e.g. call the
grpc::ClientContext::TryCancel() or cancel/Finish the
ClientReader/ReaderInterface instance used for log streaming), (3) ensure any
CompletionQueue or async reader is shutdown/drained so the blocked read returns,
and only then join log_thread_ (use a timed join as a fallback and log an error
if join times out); update stop_log_streaming to reference stop_logs_,
log_thread_ and the stream/reader/client-context member (the object that
performs the stream read) when implementing the cancel/cleanup steps.

In `@cpp/src/grpc/grpc_problem_mapper.cu`:
- Around line 399-411: The lambda get_doubles uses n = it->second.size() /
sizeof(double) but then memcpy's the full byte size, which can overflow if the
payload isn't aligned; validate that it->second.size() is an exact multiple of
sizeof(double) before copying, compute expected_bytes = n * sizeof(double) and
only memcpy expected_bytes (or return/throw on misaligned size), and when
converting to f_t ensure you only read expected_bytes into the temporary vector;
apply the same check/fix to the other chunked-array decoder blocks that perform
memcpy (the other double-to-f_t conversion in this file).

In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 147-158: The submit_job_async path performs an unsynchronized
check-then-write on shared job_queue entries (e.g., checking
job_queue[i].ready/claimed then writing fields like job_id, ready, claimed),
which races with check_job_status and cancel_job; protect all accesses to
job_queue (both scans and per-entry reads/writes) by introducing a
synchronization primitive (e.g., a std::mutex or per-entry mutex array) and
acquire the lock(s) around the loop that scans MAX_JOBS and before modifying
fields (job_queue[i].ready, claimed, cancelled, etc.), and also ensure
check_job_status and cancel_job acquire the same lock(s) before reading or
updating those fields so ownership and state transitions are atomic and
race-free.

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 186-212: The TLS validation branches that currently do "return 1"
(around checks using config, ssl_opts and read_file_to_string for cert/key/root
and require_client) can occur after workers/threads have been started; replace
those early returns with calls to the centralized shutdown/teardown routine
(e.g., call your existing server stop/cleanup functions such as StopServer(),
shutdownWorkers()/joinThreads(), or a new cleanup_resources() that performs
shutdown and joins) and then return the error code; ensure every failure path
that currently returns from inside the TLS setup invokes that teardown routine
before exiting so workers/threads and other resources are properly cleaned up.
- Around line 234-251: BuildAndStart() may return nullptr so avoid unguarded
dereference of server; after std::unique_ptr<Server>
server(builder.BuildAndStart()), check if (server) before calling server->Wait()
and starting the shutdown logic that assumes a live server. If BuildAndStart()
fails, log an error (or print to cerr) and exit/return non-zero. Also ensure the
shutdown_thread's capture/Shutdown call remains safe by only invoking
server->Shutdown() when server is non-null; move creation of shutdown_thread and
the server->Wait() call inside the if (server) block and handle the failure path
explicitly.

In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 71-73: The code currently does
job_queue[job_idx].claimed.store(false) before
job_queue[job_idx].ready.store(true), which can let another submitter claim the
slot before readiness is published; swap the two stores so ready.store(true) is
executed before claimed.store(false) for the job_queue entry, and apply the same
fix to the other identical location where claimed and ready are updated (the
later occurrence around the second claim/release sequence) to prevent the race.

---

Nitpick comments:
In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 236-240: The class grpc_client_t is explicitly non-copyable and
non-movable due to an std::atomic<bool> (stop_logs_) and an internal std::thread
member; if you need movability, add a noexcept move constructor and move
assignment that transfer ownership of the PIMPL, move the std::thread (std::move
on the thread member) and ensure the source thread is left in a benign state
(e.g., set to not joinable or detached), and transfer/reset the atomic flag (use
stop_logs_.store(false) or std::exchange to set a sensible value in the
moved-from object). Update grpc_client_t::grpc_client_t(grpc_client_t&&) and
operator=(grpc_client_t&&) to move the pimpl pointer, move the thread, and
handle stop_logs_ correctly; ensure proper synchronization when transferring
thread ownership to avoid races.

In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Around line 401-403: Replace direct calls to std::stod and std::stoll in the
CLI parsing logic with exception-safe parsing: wrap the std::stod(std::string)
used to set time_limit and the std::stoll used to set incumbent_index in
try-catch blocks that catch std::invalid_argument and std::out_of_range, print a
clear error message indicating the offending flag and value (e.g.,
"--time-limit" or "--incumbent-index"), and exit with a non-zero status instead
of letting the process throw; update the parsing branches that reference
time_limit and incumbent_index so they validate the parsed numeric ranges if
applicable and fall back or exit cleanly on parse failure.

In `@cpp/src/grpc/cuopt_remote_service.proto`:
- Line 9: IncumbentRequest.job_id, Incumbent.job_id and SubmitResponse.job_id
are defined as bytes while most RPC messages (StatusRequest, GetResultRequest,
DeleteRequest, CancelRequest, WaitRequest, StreamLogsRequest, etc.) use string;
change those fields to type string to standardize the schema, update any proto
message definitions where job_id is bytes to string (IncumbentRequest,
Incumbent, SubmitResponse), run proto generation to refresh client/server stubs,
and adjust any serialization/handler code that expects bytes to handle string
instead.

In `@cpp/src/grpc/grpc_solution_mapper.cu`:
- Around line 93-101: The per-element protobuf adds
(pb_solution->add_primal_solution, add_dual_solution, add_reduced_cost) should
be replaced with bulk operations to improve performance for large vectors: call
pb_solution->mutable_primal_solution()->Reserve(primal.size()) and then use
Add() or assign via mutable_primal_solution()->Assign(...) to copy the primal
data (cast to double as needed), and do the same for dual and reduced_cost using
mutable_dual_solution()->Reserve(dual.size())/Assign and
mutable_reduced_cost()->Reserve(reduced_cost.size())/Assign; update the casts so
the bulk copy converts elements to double before Assign/adding.
- Around line 559-598: The code currently detects warm-start presence by testing
ws_primal.empty() in the block that constructs cpu_pdlp_warm_start_data_t (see
bytes_to_typed, cpu_pdlp_warm_start_data_t, and the if (!ws_primal.empty())
guard); change that condition to prefer a dedicated header flag (e.g., use
h.has_warm_start() or h.ws_present() if such a field exists) and fall back to
the primal-array emptiness check if the flag is unavailable, e.g., replace if
(!ws_primal.empty()) with if (h.has_warm_start() || !ws_primal.empty()) so
warm-starts with an empty current_primal are still recognized.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2a43112f-b25e-4054-b00f-1fee3a598df9

📥 Commits

Reviewing files that changed from the base of the PR and between 6cafe37 and 612de9a.

📒 Files selected for processing (38)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (8)
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (24)
cpp/src/grpc/cuopt_remote_service.proto (1)

314-325: ⚠️ Potential issue | 🟠 Major

Use one job_id scalar type across service messages.

Line 315 and Line 324 use bytes, while most service request/response messages use string for job_id. This inconsistency creates avoidable cross-language conversion edge cases.

Schema alignment sketch
 message IncumbentRequest {
-  bytes job_id = 1;
+  string job_id = 1;
   int64 from_index = 2;
   int32 max_count = 3;
 }

 message Incumbent {
   int64 index = 1;
   double objective = 2;
   repeated double assignment = 3;
-  bytes job_id = 4;
+  string job_id = 4;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 325, The
IncumbentRequest.job_id and Incumbent.job_id fields use bytes while the rest of
the API uses string for job_id; change both fields in messages IncumbentRequest
and Incumbent from bytes to string to ensure a single scalar type for job_id
across the service, and update any serialization/usage code that expects bytes
to accept a string (or perform explicit encoding/decoding at the callsite) so
cross-language conversions remain consistent.
GRPC_ARCHITECTURE.md (1)

9-10: ⚠️ Potential issue | 🟡 Minor

Scope the serialization claim to gRPC wire payloads.

Line 9-Line 10 is too absolute; local pipe framing helpers exist. Please clarify this is specifically about gRPC wire serialization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_ARCHITECTURE.md` around lines 9 - 10, The statement "All serialization
uses protocol buffers generated by `protoc` and `grpc_cpp_plugin` — no custom
serialization logic is implemented." is too absolute; update it to explicitly
scope serialization to gRPC wire payloads (e.g., "All gRPC wire payload
serialization uses protocol buffers generated by `protoc` and
`grpc_cpp_plugin`") and add a brief note acknowledging that local pipe framing
helpers (custom framing/unframing code) are used outside the gRPC wire format;
keep the original wording's intent but clarify that custom framing helpers exist
for local pipes.
SERVER_ARCHITECTURE.md (1)

15-46: ⚠️ Potential issue | 🟡 Minor

Add explicit fence languages to these code blocks.

markdownlint MD040 is still triggered by untyped fences in these sections. Use ```text for ASCII diagrams and table-like flows.

Also applies to: 110-118, 122-135, 139-148, 154-168, 196-198, 207-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The ASCII-art code fences
(e.g., the block starting with the "Main Server Process" diagram that includes
"Shared Memory Queues", "Job Queue", "Result Queue", and worker boxes) are
untyped and trigger markdownlint MD040; update each such fence to use an
explicit text language (replace ``` with ```text) for that diagram and the other
similar untyped blocks (the blocks containing table/flow-style ASCII diagrams
and blocks starting with the same diagram-like contents referenced in the
comment). Ensure every ASCII/table-like fenced block (including the ones around
the ranges called out in the review) is changed to ```text so markdownlint stops
flagging MD040.
cpp/src/grpc/server/grpc_worker_infra.cpp (1)

165-168: ⚠️ Potential issue | 🔴 Critical

Validate PID and handle EINTR in wait_for_workers().

The reap loop at Line 165-Line 168 should skip non-positive PIDs and retry interrupted waits; otherwise waitpid(0, ...)/signal interruption can lead to incorrect child reaping behavior.

Suggested fix
 void wait_for_workers()
 {
   for (pid_t pid : worker_pids) {
+    if (pid <= 0) continue;
     int status;
-    waitpid(pid, &status, 0);
+    while (waitpid(pid, &status, 0) < 0 && errno == EINTR) {}
   }
   worker_pids.clear();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 165 - 168, In
wait_for_workers(), the reap loop currently calls waitpid(pid, &status, 0) for
every pid in worker_pids without validating pid or handling EINTR; update the
loop to skip non-positive PIDs (pid <= 0) and wrap waitpid(pid, &status, 0) in a
retry loop that re-invokes waitpid when it returns -1 with errno == EINTR, and
only break on a successful return or an error other than EINTR; reference the
worker_pids iteration, the pid variable, the waitpid call and the status/errno
checks when making the change.
cpp/src/grpc/cuopt_remote.proto (1)

6-6: ⚠️ Potential issue | 🟠 Major

Align proto package and directory to satisfy Buf lint.

package cuopt.remote at Line 6 does not match the current file directory (cpp/src/grpc), which triggers PACKAGE_DIRECTORY_MATCH.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration
("package cuopt.remote;") does not match the file's directory relative to the
Buf module root, which triggers PACKAGE_DIRECTORY_MATCH; either move the file
into a directory whose path (relative to the module root) mirrors the package
(e.g., cuopt/remote/cuopt_remote.proto) or rename the package to match the
directory segments, then re-run buf lint to verify the PACKAGE_DIRECTORY_MATCH
error is resolved.
cpp/src/grpc/server/grpc_server_main.cpp (5)

157-167: ⚠️ Potential issue | 🔴 Critical

Replace raw memset on queue entries that include atomics.

JobQueueEntry / ResultQueueEntry contain std::atomic members; byte-wise zeroing those objects is undefined behavior.

As per coding guidelines "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 157 - 167, The code
currently zeroes JobQueueEntry and ResultQueueEntry with memset (affecting
job_queue and result_queue), which is undefined because those structs contain
std::atomic members; replace the memset calls by value-initializing or
explicitly initializing each member (e.g., construct entries using
JobQueueEntry() or assign default-constructed instances) and then set the atomic
flags via their store(...) calls (retain the existing store(false)/store(-1)
lines for ready/claimed/cancelled/worker_index and retrieved), ensuring no
byte-wise writes touch atomics; update the initialization loop to use assignment
or placement-new of JobQueueEntry/ResultQueueEntry rather than memset to avoid
UB.

50-80: ⚠️ Potential issue | 🟠 Major

Reject unknown or incomplete CLI flags explicitly.

The parser currently falls through for unrecognized flags and for value-requiring flags missing a value, so the server can continue with unintended defaults.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 80, The CLI
parser in the main argument loop accepts unknown flags and silently ignores
missing values for flags that require them; update the argument parsing logic
(the for-loop handling argv in grpc_server_main.cpp and the handling of flags
like
"--port"/"--workers"/"--max-message-mb"/"--max-message-bytes"/"--tls-cert"/"--tls-key"/"--tls-root")
so that when a flag that requires a value is encountered and i+1 >= argc you
call print_usage(argv[0]) (or log an explicit error) and return a non-zero exit
code, and when an unrecognized argument starting with '-' is seen you also call
print_usage(argv[0]) (or emit an error) and return non-zero—this ensures unknown
or incomplete CLI flags are rejected instead of silently falling back to
defaults.

89-90: ⚠️ Potential issue | 🔴 Critical

Do not wire POSIX signals to the current async-unsafe handler.

signal_handler (in cpp/src/grpc/server/grpc_server_types.hpp) uses operations like condition-variable notification from signal context, which is not async-signal-safe.

As per coding guidelines "Ensure race conditions are absent in multi-threaded server implementations; verify proper synchronization of shared state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 89 - 90, The current
wiring of POSIX signals via signal(SIGINT, signal_handler) / signal(SIGTERM,
signal_handler) is unsafe because signal_handler performs async-unsafe
operations (e.g., condition-variable notification); instead, change to a safe
signal handling pattern: install a minimal async-signal-safe handler that only
sets a std::sig_atomic_t or std::atomic_bool shutdown flag (or writes a byte to
a dedicated self-pipe) and return, and move any condition-variable notifications
or complex shutdown logic out of signal_handler into the main thread or a
dedicated watcher that polls the flag or reads the pipe; update references in
grpc_server_main.cpp that call signal(...) and the implementation of
signal_handler in grpc_server_types.hpp to follow this pattern so all
non-async-safe actions occur off the signal context.

234-251: ⚠️ Potential issue | 🔴 Critical

Guard BuildAndStart() failure before calling server->Wait().

BuildAndStart() can return null, but Line 250 unconditionally dereferences server.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 234 - 251,
BuildAndStart() may return nullptr so avoid unguarded dereference of server;
after calling std::unique_ptr<Server> server(builder.BuildAndStart()), check if
(!server) and handle the failure (log an error via std::cerr or similar and
return/exit with non-zero) before starting the shutdown_thread or calling
server->Wait(); also ensure the shutdown_thread lambda still checks server
before calling server->Shutdown() and join the thread if started (use a flag to
indicate thread started or create the thread only when server is valid).

185-212: ⚠️ Potential issue | 🔴 Critical

TLS error branches leak workers/threads/shared memory after startup.

These early returns execute after spawn_workers() and thread creation, so teardown is skipped on TLS configuration/read failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 185 - 212, The TLS
error branches return early after spawn_workers()/thread/shared-memory setup and
thus leak resources; either move the TLS validation block to run before
spawn_workers(), or ensure every early-return path calls the teardown routines
(e.g., shutdown_workers(), join_worker_threads(), release_shared_memory() or
whatever the existing cleanup functions are) before returning. Locate the TLS
block (uses config, ssl_opts, read_file_to_string) and change it so that
validation happens prior to calling spawn_workers(), or add calls to the
project's worker/shutdown/cleanup functions immediately before each return in
this block to properly stop threads and free shared memory.
cpp/src/grpc/server/grpc_pipe_serialization.hpp (2)

90-90: ⚠️ Potential issue | 🔴 Critical

Return failure on malformed chunk parse instead of breaking and succeeding.

A parse error currently breaks the loop and still returns true, which accepts partial/corrupt payloads as valid.

🛠️ Minimal fix
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }
+      return false;
+    }
...
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }
+      return false;
+    }

As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

Also applies to: 148-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` at line 90, The loop
currently treats any false from
google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)
as a harmless break and ultimately returns true, which accepts corrupt partial
payloads; change the logic so that when ParseDelimitedFromCodedStream returns
false you check clean_eof and only break if clean_eof is true (clean EOF),
otherwise immediately return false to signal a malformed chunk error; update
this behavior for the ParseDelimitedFromCodedStream call at the shown line and
the duplicate occurrence (around the other occurrence near line 148) so
malformed parses fail instead of succeeding.

93-95: ⚠️ Potential issue | 🟠 Major

Validate total_elements size math before resize().

The current resize paths trust unvalidated total_elements (and multiplication with elem_size), which can overflow or trigger excessive allocation on malformed input.

As per coding guidelines "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

Also applies to: 151-154

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 93 - 95, The
resize call trusts ac.total_elements() (and the later elem_size * total_elements
math) and can overflow or cause huge allocations; before calling dest.resize()
(both at the dest.empty() check and the duplicate block at the later
occurrence), validate ac.total_elements() and the computed byte_count: cast
total_elements to a fixed-width unsigned type (e.g. uint64_t), check
total_elements > 0 is within a sane upper bound and that elem_size *
total_elements does not overflow size_t (e.g. byte_count = uint64_t(elem_size) *
total_elements; if (byte_count > std::numeric_limits<size_t>::max() ||
byte_count > MAX_ALLOWED_BYTES) return/error), then perform resize using the
checked size_t value; if the validation fails, return an error/handle gracefully
instead of resizing.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)

213-218: ⚠️ Potential issue | 🟠 Major

Synchronize server-log state reads/writes consistently.

server_log_path_, server_log_start_pos_, and test_start_marked_ are accessed without a common lock across setters/readers. This can race when log polling and test setup run concurrently.

Suggested fix pattern
   void set_server_log_path(const std::string& path)
   {
+    std::lock_guard<std::mutex> lock(mutex_);
     server_log_path_      = path;
     server_log_start_pos_ = 0;
     test_start_marked_    = false;
   }

   std::string get_server_logs(bool since_test_start = true) const
   {
-    if (server_log_path_.empty()) { return ""; }
+    std::string path;
+    std::streampos start_pos = 0;
+    bool marked = false;
+    {
+      std::lock_guard<std::mutex> lock(mutex_);
+      path = server_log_path_;
+      start_pos = server_log_start_pos_;
+      marked = test_start_marked_;
+    }
+    if (path.empty()) { return ""; }

-    std::ifstream file(server_log_path_);
+    std::ifstream file(path);
     if (!file.is_open()) { return ""; }

-    if (since_test_start && test_start_marked_ && server_log_start_pos_ > 0) {
-      file.seekg(server_log_start_pos_);
+    if (since_test_start && marked && start_pos > 0) {
+      file.seekg(start_pos);
     }

As per coding guidelines: "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state."

Also applies to: 229-244, 332-355

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213
- 218, The three server-log fields (server_log_path_, server_log_start_pos_,
test_start_marked_) are accessed without synchronization; modify
set_server_log_path to acquire a dedicated mutex (e.g., server_log_mutex_) and
set all three fields while holding the lock, and update all other
readers/writers that touch these members (the other setter/getter/polling
functions referenced in the diff ranges) to also lock the same mutex when
reading or writing those fields to prevent races.
cpp/src/grpc/server/grpc_job_management.cpp (2)

109-113: ⚠️ Potential issue | 🟠 Major

Cap incumbent payload size before resizing the buffer.

This path trusts the size prefix and directly calls data.resize(size). A malformed payload can force large allocations and exhaust memory.

Suggested guardrail
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code
currently trusts the size prefix and calls data.resize(size) after
read_from_pipe; introduce a validated upper bound (e.g., a constexpr size_t
kMaxJobPayloadSize) and check that size is non-negative and <=
kMaxJobPayloadSize before calling data.resize(size) and before the subsequent
read_from_pipe; if the size is out of range, return false (or handle as a
protocol error) to prevent unbounded allocation and potential memory exhaustion.
Use the existing symbols size, data.resize, and read_from_pipe to locate and
modify the logic.

147-158: ⚠️ Potential issue | 🔴 Critical

Fix racy queue-slot reservation in async submit.

Slot selection currently performs a check-then-write on shared queue state, so concurrent submissions can reserve the same slot.

Suggested approach (atomic claim)
   for (size_t i = 0; i < MAX_JOBS; ++i) {
-    if (!job_queue[i].ready && !job_queue[i].claimed) {
+    if (job_queue[i].ready.load()) { continue; }
+    bool expected = false;
+    if (!job_queue[i].claimed.compare_exchange_strong(expected, true)) { continue; }
       copy_cstr(job_queue[i].job_id, job_id);
       job_queue[i].problem_type = is_mip ? 1 : 0;
       ...
-      job_queue[i].claimed      = false;
-      job_queue[i].cancelled    = false;
-      job_queue[i].ready        = true;
+      job_queue[i].cancelled.store(false);
+      job_queue[i].ready.store(true, std::memory_order_release);
+      job_queue[i].claimed.store(false, std::memory_order_release);

As per coding guidelines: "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158, The loop
in submit that does check-then-write on job_queue slots is racy: change
reservation to an atomic claim (e.g., make job_queue[].claimed an
std::atomic<bool> or std::atomic_flag) and perform an atomic test-and-set
(compare_exchange_strong or test_and_set) to reserve the slot before writing any
fields; once the atomic claim succeeds, initialize job_queue[i].job_id,
problem_type, data_size, worker_* etc., then set job_queue[i].ready = true
(non-atomic write) only after fully initialized; ensure headers (<atomic>) and
any needed memory-ordering are used and that no other code resets claimed
without clearing ready in the reverse order.
cpp/src/grpc/client/solve_remote.cu (2)

53-57: ⚠️ Potential issue | 🟠 Major

Guard invalid env overrides before applying gRPC sizes.

CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES are applied without sanity checks. Zero/negative values can break transfer behavior at runtime.

Suggested fix
 static void apply_env_overrides(grpc_client_config_t& config)
 {
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  const auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  const auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }
   config.enable_debug_log  = (parse_env_int64("CUOPT_GRPC_DEBUG", 0) != 0);

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 53 - 57, The env overrides
in apply_env_overrides currently accept any parsed value from parse_env_int64
for chunk_size_bytes and max_message_bytes; update apply_env_overrides to
validate the returned values before assignment: call parse_env_int64 for
"CUOPT_CHUNK_SIZE" and "CUOPT_MAX_MESSAGE_BYTES", then only assign to
config.chunk_size_bytes and config.max_message_bytes if the parsed value is > 0
and within sane bounds (e.g., enforce a minimum of 1 and a sensible upper cap to
avoid resource exhaustion); leave enable_debug_log assignment using
parse_env_int64 as-is. Ensure you reference apply_env_overrides,
config.chunk_size_bytes, config.max_message_bytes, and parse_env_int64 when
making the change.

163-166: ⚠️ Potential issue | 🟠 Major

Forward the real best bound to incumbent callbacks.

The current forwarding path sets bound_copy from the incumbent objective, so user callbacks always see a zero gap regardless of actual solver progress. This should carry the solver’s actual best bound from the remote pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 163 - 166, The code
currently sets bound_copy from objective, so get_callback->get_solution(...)
forwards the incumbent objective as the bound and can report a zero gap; change
the initialization of bound_copy to use the solver’s actual best-bound value
from the remote pipeline (replace "bound_copy = objective" with the variable
that holds the solver's best bound), then call
get_callback->get_solution(solution_copy.data(), &obj_copy, &bound_copy,
callback->get_user_data()); this ensures callbacks receive the real best bound
instead of the incumbent objective.
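To make the distinction concrete, here is a minimal sketch (incumbent_msg_t and relative_gap are hypothetical stand-ins, not cuopt types) of why copying the objective into the bound always hides the gap:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical incumbent record as it might arrive from the remote pipeline.
struct incumbent_msg_t {
  double objective;   // objective value of the incumbent solution
  double best_bound;  // solver's current best (dual) bound
};

// Gap as a callback consumer would compute it. If the client forwards the
// objective as the bound, this is always exactly zero.
inline double relative_gap(const incumbent_msg_t& msg)
{
  const double denom = std::max(std::abs(msg.objective), 1e-10);
  return std::abs(msg.objective - msg.best_bound) / denom;
}
```

With a real bound of 8.0 against an incumbent objective of 10.0 the gap is 0.2; forwarding the objective as the bound collapses it to 0.0, which is exactly what the finding flags.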
cpp/src/grpc/grpc_settings_mapper.cu (1)

144-146: ⚠️ Potential issue | 🟠 Major

Range-check iteration_limit before narrowing cast.

A large protobuf iteration_limit (int64_t) can overflow when cast to i_t (which may be int32_t), producing an invalid internal limit.

🔧 Suggested guard
   // proto3 defaults numeric fields to 0; treat negative iteration_limit as "unset"
   // so the server keeps the library default (typically max()).
   if (pb_settings.iteration_limit() >= 0) {
-    settings.iteration_limit = static_cast<i_t>(pb_settings.iteration_limit());
+    const auto limit = pb_settings.iteration_limit();
+    if (limit <= static_cast<int64_t>(std::numeric_limits<i_t>::max())) {
+      settings.iteration_limit = static_cast<i_t>(limit);
+    } else {
+      settings.iteration_limit = std::numeric_limits<i_t>::max();
+    }
   }

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cu` around lines 144 - 146, The narrowing
cast from pb_settings.iteration_limit() to settings.iteration_limit (type i_t)
can overflow; before the static_cast in the block that sets
settings.iteration_limit, range-check the int64 value against the numeric limits
of i_t and either clamp to std::numeric_limits<i_t>::max()/min() or
return/report an error if out-of-range—update the code around the check of
pb_settings.iteration_limit() to perform this validation and only then assign
via static_cast<i_t>.
cpp/src/grpc/client/grpc_client.cu (4)

258-266: ⚠️ Potential issue | 🔴 Critical

Add null stub guard and deadline to RPC methods.

Public RPC methods like check_status dereference impl_->stub without checking if a connection was established. Additionally, the ClientContext lacks a deadline, allowing indefinite blocking on network issues. The connection probe at line 191 correctly demonstrates both patterns.

🔧 Proposed fix pattern (apply to all RPC methods)
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }

   grpc::ClientContext context;
+  if (config_.timeout_seconds > 0) {
+    context.set_deadline(std::chrono::system_clock::now() +
+                         std::chrono::seconds(config_.timeout_seconds));
+  }
   auto request = build_status_request(job_id);

Apply the same pattern to: wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs, and all chunked upload/download methods.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The method
grpc_client_t::check_status currently dereferences impl_->stub and creates a
grpc::ClientContext without a deadline; add a null-stub guard at the start
(check impl_ and impl_->stub and return a job_status_result_t error/result
indicating "not connected") before calling build_status_request/CheckStatus, and
set a deadline on the grpc::ClientContext (use std::chrono::... now()+configured
RPC timeout) so the RPC won't block indefinitely; follow the same pattern for
wait_for_completion, cancel_job, delete_job, get_incumbents, stream_logs and all
chunked upload/download methods, using the same error/result handling style used
by the connection probe at line 191.

991-995: ⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before division.

Line 994 computes max_polls by dividing by config_.poll_interval_ms. If this value is <= 0, the division will either crash or produce invalid polling behavior. The same issue exists at line 1132.

🔧 Proposed validation
   } else {
     CUOPT_LOG_INFO("[grpc_client] Using polling (CheckStatus) for job %s", job_id.c_str());
+    if (config_.poll_interval_ms <= 0) {
+      stop_log_streaming();
+      result.error_message = "Invalid configuration: poll_interval_ms must be > 0";
+      return result;
+    }
     int poll_count = 0;
     int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 991 - 995, The code computes
max_polls by dividing by config_.poll_interval_ms without validating it; add a
guard in the polling branch (where CUOPT_LOG_INFO logs "Using polling
(CheckStatus) for job" and any other similar polling block around the second
occurrence) to check config_.poll_interval_ms > 0 and handle invalid values: log
an error via CUOPT_LOG_ERROR (include job_id/context), set a safe default (e.g.,
poll_interval_ms = 1) or return/fail early, then compute max_polls using the
validated/clamped value; update both locations that reference
config_.poll_interval_ms to use the validated variable.

702-712: ⚠️ Potential issue | 🔴 Critical

Validate chunked result metadata before arithmetic operations.

element_size_bytes is used at line 708 without checking for <= 0, causing division by zero. Additionally, total_elems * elem_size at line 711 can overflow before allocation, making subsequent memcpy operations unsafe.

🔧 Proposed bounds validation
   for (const auto& arr_desc : header->arrays()) {
     auto field_id       = arr_desc.field_id();
     int64_t total_elems = arr_desc.total_elements();
     int64_t elem_size   = arr_desc.element_size_bytes();
     if (total_elems <= 0) continue;
+    if (elem_size <= 0) {
+      last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+      return false;
+    }
+    // Check for overflow before allocation
+    if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+      last_error_ = "Invalid chunk metadata: total byte size overflow";
+      return false;
+    }
+    int64_t total_bytes = total_elems * elem_size;
+    if (static_cast<uint64_t>(total_bytes) > std::numeric_limits<size_t>::max()) {
+      last_error_ = "Invalid chunk metadata: exceeds addressable memory";
+      return false;
+    }

     int64_t elems_per_chunk = chunk_data_budget / elem_size;

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 702 - 712, The loop over
header->arrays() uses arr_desc.element_size_bytes() and multiplies it by
arr_desc.total_elements() without validation; add checks that
element_size_bytes() > 0 (return/continue or log and skip if not) before
computing elems_per_chunk to avoid division by zero, compute elems_per_chunk =
max(1, chunk_data_budget / elem_size) only after that check, and validate that
total_elements() * element_size_bytes() cannot overflow size_t before allocating
array_bytes (e.g., if total_elems > SIZE_MAX / elem_size then handle as
error/skip or cap/stream allocation); reference arr_desc.element_size_bytes(),
arr_desc.total_elements(), chunk_data_budget, elems_per_chunk, and the
allocation of array_bytes to implement these guards.

221-252: ⚠️ Potential issue | 🔴 Critical

Log streaming thread can hang indefinitely on shutdown.

The stop_log_streaming() method sets the stop_logs_ flag and joins the thread, but reader->Read() at line 232 blocks indefinitely if the server stops sending messages. The TryCancel() call at line 234 only executes after Read() returns, not before. You need to call context.TryCancel() from the main thread before joining.

🔧 Proposed fix using context cancellation from stop_log_streaming

Add member variables to the class (in grpc_client.hpp):

std::mutex log_context_mutex_;
grpc::ClientContext* active_log_context_ = nullptr;

Then update the methods:

 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;

   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
     grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);

     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 221 - 252, The log streaming
thread can block in reader->Read() so add a cancellable context pointer
protected by a mutex: introduce members std::mutex log_context_mutex_ and
grpc::ClientContext* active_log_context_ = nullptr; in grpc_client_t, then in
start_log_streaming() create the local grpc::ClientContext context as before but
immediately register it by locking log_context_mutex_ and setting
active_log_context_ = &context before calling impl_->stub->StreamLogs(...);
after the read/Finish() loop clear the pointer (lock and set to nullptr). In
stop_log_streaming() lock log_context_mutex_, if active_log_context_ != nullptr
call active_log_context_->TryCancel() (to wake up reader->Read()) before joining
the thread, then proceed to join/reset log_thread_ and set stop_logs_. Ensure
all accesses to active_log_context_ are mutex-protected to avoid race conditions
and that the pointer only points to the stack context while the thread is
running.
cpp/src/grpc/server/grpc_server_types.hpp (1)

300-308: ⚠️ Potential issue | 🔴 Critical

Remove non-async-signal-safe operations from signal handler.

std::cout and condition_variable::notify_all() are not async-signal-safe and can deadlock or corrupt state when called from a POSIX signal handler. The signal handler should only perform atomic writes.

🔧 Suggested approach
 inline void signal_handler(int signal)
 {
   if (signal == SIGINT || signal == SIGTERM) {
-    std::cout << "\n[gRPC Server] Received shutdown signal\n";
     keep_running = false;
     if (shm_ctrl) { shm_ctrl->shutdown_requested = true; }
-    result_cv.notify_all();
   }
 }

Move logging and result_cv.notify_all() to a dedicated shutdown-watcher thread that polls keep_running == false and then performs the cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 300 - 308, The
signal_handler currently performs non-async-signal-safe operations (std::cout
and result_cv.notify_all()), which must be removed; change signal_handler to
only perform atomic-safe actions: set keep_running = false and, if shm_ctrl
exists, set shm_ctrl->shutdown_requested = true (both already present), and
return immediately; then implement a separate shutdown-watcher thread (e.g.,
spawned at server startup) that polls or waits on an atomic/condition to detect
keep_running == false and from that thread perform logging (std::cout or logging
framework), call result_cv.notify_all(), and any other cleanup work — locate
signal_handler, keep_running, shm_ctrl, and result_cv to update behavior and add
the watcher thread to the server lifecycle.
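A sketch of the watcher-thread approach suggested above (keep_running and result_cv here are local stand-ins for the server globals, and the poll interval is arbitrary):

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <thread>

std::atomic<bool> keep_running{true};
std::condition_variable result_cv;

// The signal handler only flips the atomic; everything that is not
// async-signal-safe (logging, notify_all) runs on this ordinary thread.
std::thread start_shutdown_watcher()
{
  return std::thread([] {
    while (keep_running.load()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    // Safe here: we are on a normal thread, not inside a signal handler.
    result_cv.notify_all();
  });
}
```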
cpp/src/grpc/grpc_problem_mapper.cu (1)

399-427: ⚠️ Potential issue | 🔴 Critical

Add alignment validation before memcpy in chunked array decoding.

The get_doubles and get_ints lambdas compute the element count via integer division but then copy the full byte array. If the payload size isn't a multiple of the element size, the division rounds down while memcpy still copies every byte, writing past the end of the destination vector.

🔧 Proposed alignment checks
   auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(double) != 0) return {};  // Misaligned data
     size_t n = it->second.size() / sizeof(double);
     if constexpr (std::is_same_v<f_t, double>) {
       std::vector<double> v(n);
-      std::memcpy(v.data(), it->second.data(), it->second.size());
+      std::memcpy(v.data(), it->second.data(), n * sizeof(double));
       return v;
     } else {
       // ...
     }
   };

   auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(int32_t) != 0) return {};  // Misaligned data
     size_t n = it->second.size() / sizeof(int32_t);
     // ...
   };

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 427, The lambdas
get_doubles and get_ints must validate byte alignment before memcpy: compute
size_t bytes = it->second.size(); if bytes == 0 return {}; if (bytes %
sizeof(double) != 0) (for get_doubles) or bytes % sizeof(int32_t) != 0 (for
get_ints) then handle the misaligned payload (log/return error/throw) instead of
blindly memcpy; use size_t n = bytes / sizeof(...) and pass bytes (or n *
sizeof(...)) to memcpy rather than it->second.size(); update references in
get_doubles/get_ints and ensure the branch where types differ still copies only
the aligned byte count and converts only n elements.
🧹 Nitpick comments (3)
cpp/src/grpc/server/grpc_field_element_size.hpp (1)

17-29: Avoid silent fallback for unknown ArrayFieldId.

At Line 28, default: return 8; can mask unsupported/new enum values and mis-size chunk parsing. Prefer an explicit error path so callers can reject the request.

Suggested change
 inline int64_t array_field_element_size(cuopt::remote::ArrayFieldId field_id)
 {
   switch (field_id) {
@@
-    default: return 8;
+    default: return -1;  // unknown field id; caller should fail validation
   }
 }
As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_field_element_size.hpp` around lines 17 - 29, The
function array_field_element_size(cuopt::remote::ArrayFieldId) silently falls
back to return 8 for unknown enum values; change it to explicitly handle
unsupported/new enums by logging/propagating an error or failing fast (e.g.,
throw std::invalid_argument or assert) instead of returning 8 so callers can
reject invalid requests and avoid mis-sized parsing; update
array_field_element_size to remove the silent default and return/throw a clear
failure for unexpected cuopt::remote::ArrayFieldId values and ensure callers
handle that error path.
build.sh (1)

7-7: Enable unbound-variable checks in the script preamble.

At Line 7, prefer set -eu to catch unset variable usage early.

Suggested change
-set -e
+set -eu
Based on learnings: "In this repository, prefer using 'set -u' in Bash scripts to detect unbound variables and rely on the default unbound-variable error messages rather than implementing explicit guards with custom error messages."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@build.sh` at line 7, The script preamble currently uses "set -e"; change this
to enable unbound-variable checks by replacing that occurrence with "set -eu"
(i.e., add the -u flag) so the shell will error on unset variables; audit any
places that intentionally rely on unset variables and initialize or guard them
accordingly (locate the "set -e" token in build.sh and update it).
cpp/src/grpc/server/grpc_server_threads.cpp (1)

166-182: Deserialize outside tracker_mutex to reduce contention.

deserialize_result_pipe_blob(...) can be expensive; parsing before taking the lock and only committing parsed state under lock will improve concurrency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 166 - 182,
deserialize_result_pipe_blob(...) is done while holding tracker_mutex causing
unnecessary contention; move the expensive deserialize step out of the critical
section by first copying result_data and success locally, call
deserialize_result_pipe_blob(result_data.data(), result_data.size(), hdr,
arrays) before acquiring tracker_mutex, then acquire lock(tracker_mutex), find
job_tracker entry by job_id, verify it still exists and update it: set status to
COMPLETED or FAILED (JobStatus), set error_message on failure, or assign
result_header, result_arrays and result_size_bytes on success; ensure hdr and
arrays are moved into it->second to avoid copying and that result_size_bytes is
set from result_data.size().
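The parse-then-commit pattern can be sketched as follows (parsed_result_t, job_tracker, and commit_result are illustrative placeholders, not the server's actual types):

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

struct parsed_result_t {
  std::vector<std::uint8_t> arrays;  // stand-in for the deserialized payload
  std::size_t size_bytes = 0;
};

std::mutex tracker_mutex;
std::map<std::string, parsed_result_t> job_tracker;

void commit_result(const std::string& job_id, std::vector<std::uint8_t> raw)
{
  // Expensive work (the analogue of deserialize_result_pipe_blob) runs with
  // no lock held, so other threads can use the tracker in the meantime.
  parsed_result_t parsed;
  parsed.size_bytes = raw.size();
  parsed.arrays     = std::move(raw);

  // Only the cheap move-assignment happens under the mutex.
  std::lock_guard<std::mutex> lock(tracker_mutex);
  job_tracker[job_id] = std::move(parsed);
}
```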

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d12476d8-baa0-42cb-9006-1d214ffa2716

📥 Commits

Reviewing files that changed from the base of the PR and between 612de9a and d228718.

📒 Files selected for processing (38)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/grpc_solution_mapper.cu


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

♻️ Duplicate comments (25)
cpp/src/grpc/server/grpc_job_management.cpp (2)

147-158: ⚠️ Potential issue | 🔴 Critical

submit_job_async still uses racy check-then-write slot acquisition.

Lines 147-158 can let concurrent submitters claim/overwrite the same queue slot. Use atomic reservation (compare_exchange) and publish ordering.

Proposed fix
-  for (size_t i = 0; i < MAX_JOBS; ++i) {
-    if (!job_queue[i].ready && !job_queue[i].claimed) {
+  for (size_t i = 0; i < MAX_JOBS; ++i) {
+    if (job_queue[i].ready.load()) { continue; }
+    bool expected_claimed = false;
+    if (job_queue[i].claimed.compare_exchange_strong(expected_claimed, true)) {
       copy_cstr(job_queue[i].job_id, job_id);
       job_queue[i].problem_type = is_mip ? 1 : 0;
       job_queue[i].data_size    = request_data.size();
       job_queue[i].worker_pid   = 0;
       job_queue[i].worker_index = -1;
-      job_queue[i].data_sent    = false;
-      job_queue[i].claimed      = false;
-      job_queue[i].cancelled    = false;
-      job_queue[i].ready        = true;
+      job_queue[i].data_sent.store(false);
+      job_queue[i].cancelled.store(false);
+      job_queue[i].ready.store(true, std::memory_order_release);
+      job_queue[i].claimed.store(false, std::memory_order_release);

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 147 - 158,
submit_job_async currently uses a racy check-then-write loop over job_queue
(iterating i from 0..MAX_JOBS) checking job_queue[i].ready and
job_queue[i].claimed then writing the slot; change this to atomically reserve a
slot using an atomic CAS on a dedicated slot state (e.g., atomically
compare_exchange the claimed/empty flag for job_queue[i]) before writing any
fields, and use release semantics when publishing ready (store with
memory_order_release) and acquire semantics when consumers read
(memory_order_acquire); update submit_job_async to attempt compare_exchange on
job_queue[i].claimed (or a new atomic state) to transition from empty->reserved,
only write the job fields after successful CAS, and then set ready=true with
appropriate publish ordering so concurrent submitters cannot overwrite the same
slot.

109-113: ⚠️ Potential issue | 🟠 Major

Bound incumbent payload size before resize to prevent memory exhaustion.

Line 111 trusts a pipe-provided size and allocates directly. A malformed size prefix can force unbounded allocation.

Proposed fix
 bool recv_incumbent_pipe(int fd, std::vector<uint8_t>& data)
 {
   uint64_t size;
   if (!read_from_pipe(fd, &size, sizeof(size))) return false;
+  if (size > static_cast<uint64_t>(config.max_message_bytes)) {
+    std::cerr << "[Worker] Incumbent payload too large: " << size << "\n";
+    return false;
+  }
   data.resize(size);
   if (size > 0 && !read_from_pipe(fd, data.data(), size)) return false;
   return true;
 }

As per coding guidelines "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 109 - 113, The code
reads an untrusted uint64_t size from a pipe and calls data.resize(size)
directly; bound the incoming size before resizing to prevent memory exhaustion
by introducing a sensible constant (e.g. JOB_PAYLOAD_MAX or MAX_PAYLOAD_SIZE)
and rejecting sizes > that limit (return false) or otherwise handling them; also
validate zero/negative semantics if needed and use that check right after
read_from_pipe(fd, &size, ...) and before data.resize(size) (referencing the
local variable size, function read_from_pipe, and the vector/data.resize call).
cpp/src/grpc/cuopt_remote.proto (1)

6-6: ⚠️ Potential issue | 🟠 Major

Buf package-directory mismatch is still unresolved.

Line 6 (package cuopt.remote;) remains in a path that violates Buf PACKAGE_DIRECTORY_MATCH for this module layout, so lint/CI will continue to fail until package/path (or Buf roots) are aligned.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration in
cuopt_remote.proto ("package cuopt.remote;") doesn't match the file path,
triggering Buf's PACKAGE_DIRECTORY_MATCH rule; fix by either updating the
package to reflect the file's directory structure (e.g., change package to match
the actual path) or move cuopt_remote.proto into a directory layout matching
"cuopt/remote" (or adjust buf.yaml roots/module configuration to include the
current path). Ensure the package token "cuopt.remote" and the filesystem layout
are aligned so Buf lint passes.
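For reference, one directory layout that satisfies PACKAGE_DIRECTORY_MATCH nests the file under path segments matching the package (illustrative only; the module root actually configured for Buf in this repo may differ):

```text
<buf module root>/
└── cuopt/
    └── remote/
        └── cuopt_remote.proto   // declares "package cuopt.remote;"
```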
cpp/src/grpc/server/grpc_service_impl.cpp (1)

71-73: ⚠️ Potential issue | 🔴 Critical

Publish ready before releasing claimed to prevent queue-slot double allocation.

At lines 71 and 265, releasing claimed before publishing ready re-opens a claim window for another submitter on the same slot.

Proposed fix (both locations)
-    job_queue[job_idx].claimed.store(false);
-    job_queue[job_idx].ready.store(true);
+    job_queue[job_idx].ready.store(true, std::memory_order_release);
+    job_queue[job_idx].claimed.store(false, std::memory_order_release);

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".

Also applies to: 265-267

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 71 - 73, The code
currently clears job_queue[job_idx].claimed before publishing
job_queue[job_idx].ready, which allows another submitter to claim the same slot
between those two stores; swap the operations so you set
job_queue[job_idx].ready.store(true) before
job_queue[job_idx].claimed.store(false) (and apply the same swap at the other
occurrence around lines 265-267) to ensure the slot is marked available only
after the ready flag is visible; if using explicit memory orders, use release
semantics for the store to ready to avoid the race.
SERVER_ARCHITECTURE.md (1)

15-46: ⚠️ Potential issue | 🟡 Minor

Add fence languages to satisfy markdownlint MD040

Several fenced blocks are untyped; markdownlint MD040 will keep warning/failing until those fences specify a language (e.g., text).

Proposed patch
-```
+```text
 ┌────────────────────────────────────────────────────────────────────┐
 │                        Main Server Process                          │
 ...
 └─────────────────┘  └─────────────────┘  └─────────────────┘
-```
+```

@@
-```
+```text
 Client                     Server                      Worker
 ...
-```
+```

@@
-```
+```text
 Client                     Server                      Worker
 ...
-```
+```

@@
-```
+```text
 Client                     Server                      Worker
 ...
-```
+```

@@
-```
+```text
 Client                     Worker
 ...
-```
+```

@@
-```
+```text
 ┌─────────┐  submit   ┌───────────┐  claim   ┌────────────┐
 ...
 └───────────┘          └─────────┘
-```
+```

Also applies to: 110-118, 122-136, 139-148, 154-168, 207-217

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@SERVER_ARCHITECTURE.md` around lines 15 - 46, The markdown has multiple
untyped fenced code blocks (the large ASCII "Main Server Process" diagram and
the smaller "Client / Server / Worker" and "submit / claim" pipeline diagrams)
causing markdownlint MD040 failures; update each triple-backtick fence around
these ASCII diagrams in SERVER_ARCHITECTURE.md to specify a language (use
"text") so the fences read ```text ... ```; look for the big box diagram
containing "Main Server Process", the repeated "Client                    
Server                      Worker" blocks, and the "submit   claim" pipeline
block and add the language token to each opening fence.
cpp/src/grpc/server/grpc_server_main.cpp (3)

156-168: ⚠️ Potential issue | 🔴 Critical

Don’t memset shared-memory structs that contain std::atomic

memset() on a type with std::atomic members is undefined behavior in C++. Even if you immediately .store() afterwards, the atomic objects were never properly constructed, so their lifetime rules are already violated.

Safer initialization pattern
   for (size_t i = 0; i < MAX_JOBS; ++i) {
-    memset(&job_queue[i], 0, sizeof(JobQueueEntry));
+    new (&job_queue[i]) JobQueueEntry{};
     job_queue[i].ready.store(false);
     job_queue[i].claimed.store(false);
     job_queue[i].cancelled.store(false);
     job_queue[i].worker_index.store(-1);
   }

   for (size_t i = 0; i < MAX_RESULTS; ++i) {
-    memset(&result_queue[i], 0, sizeof(ResultQueueEntry));
+    new (&result_queue[i]) ResultQueueEntry{};
     result_queue[i].ready.store(false);
     result_queue[i].retrieved.store(false);
   }

As per coding guidelines, “Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 156 - 168, The code
currently calls memset on job_queue[i] and result_queue[i] which is undefined
for types containing std::atomic; remove the memset calls and instead properly
construct/value-initialize each entry (e.g., use placement new: new
(&job_queue[i]) JobQueueEntry(); new (&result_queue[i]) ResultQueueEntry();) and
then set the atomic fields via their .store(...) calls (worker_index.store(-1),
ready.store(false), claimed.store(false), cancelled.store(false),
retrieved.store(false)); update the loops that touch job_queue and result_queue
(and references to MAX_JOBS / MAX_RESULTS) to use this safe initialization
pattern.

173-212: ⚠️ Potential issue | 🔴 Critical

Avoid std::terminate on TLS error paths; guard BuildAndStart() failure

TLS validation errors currently return after starting workers/threads (joinable thread dtors will call std::terminate). Separately, BuildAndStart() can return null and server->Wait() will crash. Both need a single cleanup/teardown path.

Minimal direction: introduce a shutdown helper and use it on all early-exit paths
   spawn_workers();

   std::thread result_thread(result_retrieval_thread);
   std::thread incumbent_thread(incumbent_retrieval_thread);
   std::thread monitor_thread(worker_monitor_thread);
   std::thread reaper_thread(session_reaper_thread);

+  auto shutdown_runtime = [&]() {
+    keep_running                 = false;
+    shm_ctrl->shutdown_requested = true;
+    result_cv.notify_all();
+    if (result_thread.joinable()) result_thread.join();
+    if (incumbent_thread.joinable()) incumbent_thread.join();
+    if (monitor_thread.joinable()) monitor_thread.join();
+    if (reaper_thread.joinable()) reaper_thread.join();
+    wait_for_workers();
+    cleanup_shared_memory();
+  };
@@
   if (config.enable_tls) {
     if (config.tls_cert_path.empty() || config.tls_key_path.empty()) {
       std::cerr << "[Server] TLS enabled but --tls-cert/--tls-key not provided\n";
-      return 1;
+      shutdown_runtime();
+      return 1;
     }
@@
       if (ssl_opts.pem_root_certs.empty()) {
         std::cerr << "[Server] --require-client-cert requires --tls-root\n";
-        return 1;
+        shutdown_runtime();
+        return 1;
       }
@@
   std::unique_ptr<Server> server(builder.BuildAndStart());
+  if (!server) {
+    std::cerr << "[gRPC Server] Failed to start server on " << server_address << "\n";
+    shutdown_runtime();
+    return 1;
+  }
@@
   server->Wait();

As per coding guidelines, “Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state.”

Also applies to: 234-251

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 173 - 212, Spawned
worker threads and early TLS/BuildAndStart error paths can return while threads
are still joinable (causing std::terminate) and BuildAndStart() can return null
causing a crash; introduce a single shutdown helper (e.g.,
shutdown_server_and_workers()) that stops workers, signals threads to exit,
joins result_retrieval_thread, incumbent_retrieval_thread,
worker_monitor_thread, session_reaper_thread and cleans up any resources and the
grpc::Server pointer, call this helper on all early-exit/TLS error paths and
when builder.BuildAndStart() returns nullptr before returning non-zero, and
replace ad-hoc returns with calls to this helper; additionally ensure shared
solver/global state access in server handlers uses proper mutex/synchronization
(e.g., std::mutex/lock_guard around the solver state) to avoid thread-unsafe
access.

50-80: ⚠️ Potential issue | 🟠 Major

CLI parsing should reject unknown/incomplete flags and handle std::stoi failures

Right now unknown flags and missing values are silently ignored, and std::stoi/std::stoll can throw and abort the server. Also, --chunk-timeout is documented but not parsed at all.

Example patch sketch (apply pattern to all flags)
   for (int i = 1; i < argc; i++) {
     std::string arg = argv[i];
     if (arg == "-p" || arg == "--port") {
-      if (i + 1 < argc) { config.port = std::stoi(argv[++i]); }
+      if (i + 1 >= argc) {
+        std::cerr << "ERROR: " << arg << " requires a value\n";
+        print_usage(argv[0]);
+        return 1;
+      }
+      try {
+        config.port = std::stoi(argv[++i]);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid port value for " << arg << ": '" << argv[i] << "'\n";
+        return 1;
+      }
     } else if (arg == "--max-message-bytes") {
-      if (i + 1 < argc) { config.max_message_b = std::max(4096LL, std::stoll(argv[++i])); }
+      if (i + 1 >= argc) {
+        std::cerr << "ERROR: " << arg << " requires a value\n";
+        return 1;
+      }
+      try {
+        config.max_message_b = std::max<int64_t>(4096LL, std::stoll(argv[++i]));
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid value for " << arg << ": '" << argv[i] << "'\n";
+        return 1;
+      }
+    } else if (arg == "--chunk-timeout") {
+      if (i + 1 >= argc) {
+        std::cerr << "ERROR: " << arg << " requires a value\n";
+        return 1;
+      }
+      try {
+        config.chunk_timeout_seconds = std::stoi(argv[++i]);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid value for " << arg << ": '" << argv[i] << "'\n";
+        return 1;
+      }
     } else if (arg == "-h" || arg == "--help") {
       print_usage(argv[0]);
       return 0;
+    } else {
+      std::cerr << "ERROR: Unknown option '" << arg << "'\n";
+      print_usage(argv[0]);
+      return 1;
     }
   }

As per coding guidelines, “Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 50 - 80, The CLI
parsing loop in main currently silently ignores unknown flags and missing
values, and calls std::stoi/std::stoll without catching exceptions; also the
documented flag "--chunk-timeout" is not parsed. Update the argv parsing code
(the for loop handling argv[], touching config.port, config.num_workers,
config.max_message_mb, config.max_message_b, config.enable_transfer_hash,
config.enable_tls, config.tls_cert_path, config.tls_key_path,
config.tls_root_path, config.require_client, config.log_to_console,
config.verbose) to: 1) validate that flags requiring a value check (i + 1 <
argc) and on failure print_usage(argv[0]) and return a non-zero error code; 2)
wrap std::stoi/std::stoll usages in try/catch(std::exception&) to log a clear
error and exit non-zero on parse failure; 3) handle unknown flags by printing an
error + usage and exiting non-zero; and 4) add parsing for "--chunk-timeout" to
populate the appropriate config field (e.g., config.chunk_timeout) following the
same validation and exception-handling pattern. Ensure error messages include
the offending flag for easier debugging.
build.sh (1)

390-398: ⚠️ Potential issue | 🟠 Major

cuopt_grpc_server fast path still captures mixed-target builds

As written, build.sh cuopt_grpc_server cuopt will still take the server-only branch and skip the requested install target(s). That’s a surprising foot-gun for devs/CI.

Proposed patch
-    if hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll; then
+    if hasArg cuopt_grpc_server && ! hasArg libcuopt && ! buildAll \
+       && ! hasArg cuopt && ! hasArg cuopt_server && ! hasArg cuopt_mps_parser \
+       && ! hasArg cuopt_sh_client && ! hasArg docs && ! hasArg deb \
+       && ! hasArg libmps_parser; then
         # Build only the gRPC server (ninja resolves libcuopt as a dependency)
         cmake --build "${LIBCUOPT_BUILD_DIR}" --target cuopt_grpc_server ${VERBOSE_FLAG} ${JFLAG}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@build.sh` around lines 390 - 398, The fast-path condition that builds only
cuopt_grpc_server (the if branch using hasArg cuopt_grpc_server && ! hasArg
libcuopt && ! buildAll) incorrectly triggers when other targets (e.g., cuopt)
are also requested; change the guard so it only takes the server-only branch
when cuopt_grpc_server is the sole target requested (for example replace the
condition with a new predicate like hasOnlyArg cuopt_grpc_server or add explicit
negations for all other possible target args / check INSTALL_TARGET equals
cuopt_grpc_server), updating the if in build.sh that surrounds the cmake --build
"${LIBCUOPT_BUILD_DIR}" --target cuopt_grpc_server block so mixed-target
invocations fall through to the normal install/build branch.
cpp/src/grpc/client/test_grpc_client.cpp (1)

381-412: ⚠️ Potential issue | 🟠 Major

Harden numeric argument parsing and validate --time-limit / incumbent index

std::stod/std::stoll throw on bad input, and static_cast<int>(time_limit) can be undefined if the value is out of range. This tool should fail with a clean CLI error instead of terminating.

Proposed patch
   while (arg_idx < argc && argv[arg_idx][0] == '-') {
     std::string arg = argv[arg_idx];
@@
     } else if (arg == "--time-limit" && arg_idx + 1 < argc) {
-      time_limit = std::stod(argv[++arg_idx]);
+      const char* value = argv[++arg_idx];
+      try {
+        time_limit = std::stod(value);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid value for --time-limit: '" << value << "'\n";
+        return 1;
+      }
+      if (!std::isfinite(time_limit) || time_limit <= 0.0) {
+        std::cerr << "ERROR: --time-limit must be a finite positive number (got '" << value
+                  << "')\n";
+        return 1;
+      }
       arg_idx++;
@@
   grpc_client_config_t config;
   config.server_address  = server_address;
-  config.timeout_seconds = static_cast<int>(time_limit) + 120;
+  if (time_limit > static_cast<double>(std::numeric_limits<int>::max() - 120)) {
+    std::cerr << "ERROR: --time-limit too large (max "
+              << (std::numeric_limits<int>::max() - 120) << ")\n";
+    return 1;
+  }
+  config.timeout_seconds = static_cast<int>(time_limit) + 120;
@@
   } else if (mode == "incumbent") {
@@
     std::string job_id = argv[arg_idx++];
-    int64_t from_index = (arg_idx < argc) ? std::stoll(argv[arg_idx]) : 0;
+    int64_t from_index = 0;
+    if (arg_idx < argc) {
+      const char* value = argv[arg_idx];
+      try {
+        from_index = std::stoll(value);
+      } catch (const std::exception&) {
+        std::cerr << "ERROR: Invalid incumbent index: '" << value << "'\n";
+        return 1;
+      }
+      if (from_index < 0) {
+        std::cerr << "ERROR: incumbent index must be >= 0 (got '" << value << "')\n";
+        return 1;
+      }
+    }
     return mode_incumbent(client, job_id, from_index);
   }

Also applies to: 425-426, 531-532

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` around lines 381 - 412, The CLI
currently calls std::stod (for time_limit) and std::stoll (for the incumbent
index elsewhere) which throw on invalid input and can lead to undefined behavior
when casting large values; update the argument parsing in the main option loop
(handling --time-limit / variable time_limit) to validate input robustly: parse
numeric arguments using non-throwing parsing (e.g., std::from_chars for integers
and a safe check for floating parsing or wrap std::stod in try/catch), detect
conversion failure or out-of-range values, emit a clear error message via
std::cerr and return a non-zero exit code, and before doing
static_cast<int>(time_limit) or casting the stoll result ensure the parsed value
fits the target range to avoid undefined behavior; apply the same
validation/failure handling to the other occurrences parsing integers (the
std::stoll sites noted in the review).
cpp/src/grpc/client/solve_remote.cu (2)

164-166: ⚠️ Potential issue | 🟠 Major

Do not pass incumbent objective as best-bound in callbacks.

Line 164–166 sets bound_copy equal to objective, which reports a fake zero-gap state to callback consumers.

A safe interim behavior is to pass NaN (unknown bound) until the incumbent protocol includes a real bound field end-to-end.

🔧 Interim mitigation in this file
+#include <limits>
...
-          double bound_copy                 = objective;  // Use objective as bound for incumbent
+          double bound_copy = std::numeric_limits<double>::quiet_NaN();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 164 - 166, The callback
currently passes the incumbent objective as the best-bound (bound_copy =
objective) to get_callback->get_solution, which falsely reports a zero-gap;
change the interim behavior to pass an unknown bound (use NaN) instead of
objective when calling get_callback->get_solution (affecting bound_copy,
get_callback->get_solution(..., &bound_copy, callback->get_user_data()) and
related solution_copy/obj_copy variables) so callbacks receive NaN until the
protocol supplies a real bound end-to-end.

55-57: ⚠️ Potential issue | 🟠 Major

Validate env sizing overrides before applying them.

Line 55–57 still accepts zero/negative values for CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES, which can destabilize transfer behavior.

🔧 Proposed guardrails
-  config.chunk_size_bytes  = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
-  config.max_message_bytes = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  auto chunk_size = parse_env_int64("CUOPT_CHUNK_SIZE", config.chunk_size_bytes);
+  auto max_msg    = parse_env_int64("CUOPT_MAX_MESSAGE_BYTES", config.max_message_bytes);
+  if (chunk_size > 0) { config.chunk_size_bytes = chunk_size; }
+  if (max_msg > 0) { config.max_message_bytes = max_msg; }

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/solve_remote.cu` around lines 55 - 57, The env overrides
for CUOPT_CHUNK_SIZE and CUOPT_MAX_MESSAGE_BYTES may be zero/negative; update
the code that calls parse_env_int64 to validate the returned value before
assigning to config.chunk_size_bytes and config.max_message_bytes (e.g., call
parse_env_int64 for each into a temp variable, check temp > 0 and optionally cap
to a safe maximum, then assign to config.*; if invalid, retain the existing
default and emit a debug/warn via config.enable_debug_log or logger). Ensure you
reference parse_env_int64, config.chunk_size_bytes, config.max_message_bytes,
and the CUOPT_CHUNK_SIZE / CUOPT_MAX_MESSAGE_BYTES env names when implementing
the guard.
cpp/CMakeLists.txt (1)

279-294: ⚠️ Potential issue | 🟠 Major

Resolve protoc via $<TARGET_FILE:protobuf::protoc> to support in-tree protobuf targets.

Line 279–294 still assumes protobuf::protoc has IMPORTED_LOCATION*. That fails for non-imported targets (e.g., FetchContent/add_subdirectory), causing false fatal errors.

🔧 Proposed CMake adjustment
 if(TARGET protobuf::protoc)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
 else()
   find_package(protobuf CONFIG REQUIRED)
-  get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION_RELEASE)
-  if(NOT _PROTOBUF_PROTOC)
-    get_target_property(_PROTOBUF_PROTOC protobuf::protoc IMPORTED_LOCATION)
-  endif()
+  if(TARGET protobuf::protoc)
+    set(_PROTOBUF_PROTOC $<TARGET_FILE:protobuf::protoc>)
+  endif()
 endif()
In CMake, does get_target_property(... IMPORTED_LOCATION) work for non-imported targets (e.g., created via add_subdirectory/FetchContent), and is $<TARGET_FILE:...> the recommended way to reference executables in add_custom_command(COMMAND ...)?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/CMakeLists.txt` around lines 279 - 294, The current logic assumes
protobuf::protoc has IMPORTED_LOCATION properties and can fail for in-tree
targets; change the resolution to first check if the target exists and set
_PROTOBUF_PROTOC to the generator expression $<TARGET_FILE:protobuf::protoc>
when TARGET protobuf::protoc is true, otherwise fall back to
get_target_property(... IMPORTED_LOCATION_RELEASE/IMPORTED_LOCATION) or the
PROTOBUF_PROTOC_EXECUTABLE cache variable; update any add_custom_command(...)
references to use ${_PROTOBUF_PROTOC} and keep the existing fatal error check if
_PROTOBUF_PROTOC is still empty.
cpp/src/grpc/server/grpc_pipe_serialization.hpp (3)

96-100: ⚠️ Potential issue | 🔴 Critical

Validate element_offset and overflow before pointer arithmetic in result deserialization.

Line 96–100 still allows negative offsets and unchecked offset + chunk_size arithmetic, which can underflow/overflow bounds checks and lead to unsafe memcpy.

🔧 Proposed fix
-    int64_t offset         = ac.element_offset();
-    const auto& chunk_data = ac.data();
-    if (offset + static_cast<int64_t>(chunk_data.size()) <= static_cast<int64_t>(dest.size())) {
-      std::memcpy(dest.data() + offset, chunk_data.data(), chunk_data.size());
-    }
+    int64_t offset         = ac.element_offset();
+    const auto& chunk_data = ac.data();
+    if (offset < 0) { return false; }
+    size_t uoffset = static_cast<size_t>(offset);
+    if (uoffset > dest.size()) { return false; }
+    if (chunk_data.size() > dest.size() - uoffset) { return false; }
+    std::memcpy(dest.data() + uoffset, chunk_data.data(), chunk_data.size());

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 96 - 100, The
current deserialization uses offset = ac.element_offset() and then does pointer
arithmetic without validating offset or preventing overflow/underflow before
memcpy; update the logic in grpc_pipe_serialization.hpp around the block using
offset, chunk_data and dest so you first validate element_offset() is
non-negative and within dest.size(), compute size_t chunk_size =
chunk_data.size(), and perform a safe bounds check using size_t conversions
(e.g., offset <= dest.size() && offset + chunk_size <= dest.size()) while
guarding against overflow on offset + chunk_size; only call
std::memcpy(dest.data() + offset, ...) when those checks pass and otherwise
handle the error/return early.

90-90: ⚠️ Potential issue | 🔴 Critical

Return failure on non-EOF parse errors in both deserializers.

Line 90 and Line 148 still break on any parse failure and eventually return true, which can accept malformed/truncated payloads as valid.

🔧 Proposed fix
-    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) break;
+    if (!google::protobuf::util::ParseDelimitedFromCodedStream(&ac, &coded, &clean_eof)) {
+      if (clean_eof) { break; }  // normal end-of-stream
+      return false;              // malformed/truncated message
+    }

As per coding guidelines: "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files."

Also applies to: 148-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` at line 90, The loops
calling google::protobuf::util::ParseDelimitedFromCodedStream (using variables
ac, coded, clean_eof) currently use "break" on parse failure which lets the
function return true for truncated/malformed input; change those breaks to
return false on a parse failure unless the failure was a clean EOF. Concretely,
at both call sites (around the ParseDelimitedFromCodedStream lines referenced)
check the clean_eof flag after a failed parse and return false for non-EOF
failures (only allow breaking/ending the loop when clean_eof is true).

151-163: ⚠️ Potential issue | 🟠 Major

Guard total_elements * elem_size and byte-offset math before resize/copy.

Line 151–163 still performs unchecked multiplication and resize from untrusted chunk metadata. A malformed payload can overflow size math or trigger excessive allocation.

🔧 Proposed fix
+    if (ac.total_elements() < 0) { return false; }
     if (dest.empty() && ac.total_elements() > 0) {
       int64_t elem_size = array_field_element_size(ac.field_id());
-      dest.resize(static_cast<size_t>(ac.total_elements() * elem_size), 0);
+      if (elem_size <= 0) { return false; }
+      uint64_t total_bytes = static_cast<uint64_t>(ac.total_elements()) *
+                             static_cast<uint64_t>(elem_size);
+      if (total_bytes > static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+        return false;
+      }
+      dest.resize(static_cast<size_t>(total_bytes), 0);
     }

As per coding guidelines: "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 151 - 163, The
code in the block that resizes dest and copies chunk bytes must validate sizes
to avoid integer overflow and huge allocations: before calling
dest.resize(ac.total_elements() * elem_size) and before computing byte_offset,
check that ac.total_elements() and computed elem_size are > 0 and that
static_cast<size_t>(ac.total_elements()) <= SIZE_MAX /
static_cast<size_t>(elem_size); bound the requested allocation against a safe
maximum (or return an error) to prevent resource exhaustion; verify
chunk_data.size() is <= dest.size() and that byte_offset >= 0 and byte_offset <=
dest.size() and that byte_offset + chunk_data.size() does not overflow (use
size_t or checked arithmetic) before calling std::memcpy; refer to symbols dest,
ac, array_field_element_size(), total_elements(), element_offset(), data(),
elem_size, byte_offset, resize, and std::memcpy when making these checks.
cpp/src/grpc/server/grpc_server_types.hpp (1)

300-308: ⚠️ Potential issue | 🔴 Critical

Remove async-signal-unsafe operations from signal_handler().

std::cout and condition_variable::notify_all() are not async-signal-safe and can deadlock/corrupt state when invoked from a POSIX signal handler.

Suggested minimal signal-safe handler
 inline void signal_handler(int signal)
 {
   if (signal == SIGINT || signal == SIGTERM) {
-    std::cout << "\n[gRPC Server] Received shutdown signal\n";
     keep_running = false;
     if (shm_ctrl) { shm_ctrl->shutdown_requested = true; }
-    result_cv.notify_all();
   }
 }

Then have a normal thread (or the main loop) observe keep_running == false and do logging + result_cv.notify_all() outside the signal context.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 300 - 308,
signal_handler currently performs async-unsafe operations (std::cout and
result_cv.notify_all()); modify it to only perform signal-safe actions: set a
sig_atomic_t/atomic<bool> flag (keep_running) and, if needed, set
shm_ctrl->shutdown_requested (ensure that shutdown_requested is signal-safe or
updated via atomic/sig_atomic_t), then return immediately—remove std::cout and
result_cv.notify_all() from signal_handler. Update the main loop or a dedicated
watcher thread to observe keep_running (or the atomic) and then perform logging
and call result_cv.notify_all() outside the signal context (use the symbols
signal_handler, keep_running, shm_ctrl, result_cv to locate changes).
cpp/src/grpc/client/grpc_client.cu (5)

702-745: ⚠️ Potential issue | 🔴 Critical

Harden chunked-result metadata validation before division/alloc/memcpy.

element_size_bytes and total_elements come straight off the wire and are never validated. elem_size <= 0 triggers a division by zero (Line 708), and total_elems * elem_size can overflow size_t before allocation, making the memcpy (Line 739) unsafe.

Proposed bounds checks
   for (const auto& arr_desc : header->arrays()) {
     auto field_id       = arr_desc.field_id();
     int64_t total_elems = arr_desc.total_elements();
     int64_t elem_size   = arr_desc.element_size_bytes();
     if (total_elems <= 0) continue;
+    if (elem_size <= 0) {
+      last_error_ = "Invalid chunk metadata: non-positive element_size_bytes";
+      return false;
+    }
+    if (total_elems > std::numeric_limits<int64_t>::max() / elem_size) {
+      last_error_ = "Invalid chunk metadata: total byte size overflow";
+      return false;
+    }
+    const int64_t total_bytes = total_elems * elem_size;
+    if (total_bytes < 0 ||
+        static_cast<uint64_t>(total_bytes) >
+          static_cast<uint64_t>(std::numeric_limits<size_t>::max())) {
+      last_error_ = "Invalid chunk metadata: byte size exceeds addressable memory";
+      return false;
+    }

     int64_t elems_per_chunk = chunk_data_budget / elem_size;
     if (elems_per_chunk <= 0) elems_per_chunk = 1;

-    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_elems * elem_size));
+    std::vector<uint8_t> array_bytes(static_cast<size_t>(total_bytes));

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 702 - 745, The code trusts
element_size_bytes() and total_elements() and must validate them before any
division, allocation, or memcpy: check that elem_size =
arr_desc.element_size_bytes() > 0 and total_elems = arr_desc.total_elements() >=
0, ensure elems_per_chunk = chunk_data_budget / elem_size is computed only after
the elem_size > 0 check, and verify that total_elems * elem_size does not
overflow size_t before constructing array_bytes; also validate that elems_wanted
and elems_received are non-negative and that data.size() equals elems_received *
elem_size and fits within array_bytes bounds before calling std::memcpy
(references: arr_desc.field_id(), arr_desc.total_elements(),
arr_desc.element_size_bytes(), elems_per_chunk, array_bytes,
chunk_resp.elements_in_chunk(), chunk_resp.data(), and the std::memcpy call).

258-266: ⚠️ Potential issue | 🔴 Critical

Guard all public RPC methods against impl_->stub being null.

check_status(), wait_for_completion(), cancel_job(), delete_job(), get_incumbents(), and stream_logs() all dereference impl_->stub directly; calling these before connect() will crash.

Proposed guard pattern
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;
+  if (!impl_ || !impl_->stub) {
+    result.error_message = "Not connected to server";
+    return result;
+  }

   grpc::ClientContext context;
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
   auto status = impl_->stub->CheckStatus(&context, request, &response);

Also applies to: 293-303, 325-333, 354-361, 381-395, 419-430

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, All public RPC
methods must guard against impl_ or impl_->stub being null to avoid crashes when
called before connect(); update grpc_client_t::check_status,
wait_for_completion, cancel_job, delete_job, get_incumbents, and stream_logs to
early-return a sensible error result if impl_ == nullptr or impl_->stub ==
nullptr (e.g., set job_status_result_t to an error/failed state or populate the
method's error/status fields and avoid calling stub->*), and add a clear logged
message indicating the client is not connected; ensure you use the same guard
pattern in the other listed methods (the ones around the ranges noted:
check_status, wait_for_completion, cancel_job, delete_job, get_incumbents,
stream_logs) so no method dereferences impl_->stub without this null check.

221-252: ⚠️ Potential issue | 🔴 Critical

stop_log_streaming() can hang indefinitely (blocked reader->Read()).

Setting stop_logs_ doesn’t unblock a synchronous server-streaming Read(), so join() can deadlock if the server stops emitting logs (or the stream is otherwise idle). This is still the same failure mode previously called out.

Proposed fix (make stop cancel the active ClientContext)
 void grpc_client_t::start_log_streaming(const std::string& job_id)
 {
   if (!config_.stream_logs || !config_.log_callback) return;

   stop_logs_.store(false);
   log_thread_ = std::make_unique<std::thread>([this, job_id]() {
-    grpc::ClientContext context;
+    grpc::ClientContext context;
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = &context;
+    }
     auto request = build_stream_logs_request(job_id, 0);
     auto reader  = impl_->stub->StreamLogs(&context, request);

     cuopt::remote::LogMessage log_msg;
     while (reader->Read(&log_msg)) {
       if (stop_logs_.load()) {
-        context.TryCancel();
         break;
       }
       if (config_.log_callback) { config_.log_callback(log_msg.line()); }
       if (log_msg.job_complete()) { break; }
     }
     reader->Finish();
+    {
+      std::lock_guard<std::mutex> lk(log_context_mutex_);
+      active_log_context_ = nullptr;
+    }
   });
 }

 void grpc_client_t::stop_log_streaming()
 {
   stop_logs_.store(true);
+  {
+    std::lock_guard<std::mutex> lk(log_context_mutex_);
+    if (active_log_context_) { active_log_context_->TryCancel(); }
+  }
   if (log_thread_ && log_thread_->joinable()) { log_thread_->join(); }
   log_thread_.reset();
 }

You’ll also need matching members on grpc_client_t (likely in cpp/src/grpc/client/grpc_client.hpp):

mutable std::mutex log_context_mutex_;
grpc::ClientContext* active_log_context_ = nullptr;

As per coding guidelines "Ensure race conditions are absent in multi-GPU code and multi-threaded server implementations; verify proper synchronization of shared state".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 221 - 252, The
stop_log_streaming() can block because reader->Read() is synchronous; modify
start_log_streaming() and stop_log_streaming() to track and cancel the active
grpc::ClientContext: add a mutable std::mutex log_context_mutex_ and
grpc::ClientContext* active_log_context_ member, then in start_log_streaming()
create the ClientContext, lock log_context_mutex_, set active_log_context_ =
&context before calling impl_->stub->StreamLogs, unlock, and clear
active_log_context_ (under the mutex) after reader->Finish(); in
stop_log_streaming() set stop_logs_, lock log_context_mutex_, call
active_log_context_->TryCancel() if it is non-null, then join/reset the thread;
the context.TryCancel() inside the read loop becomes redundant once cancellation
is driven from stop_log_streaming(), and all access to active_log_context_ must
be synchronized to avoid races.

258-266: ⚠️ Potential issue | 🟠 Major

Add deadlines to operational RPCs (current code can block forever on network stalls).

Only the connection probe sets a deadline; most other RPC calls create a grpc::ClientContext without set_deadline(). That bypasses config_.timeout_seconds and can hang indefinitely (including inside polling loops).

Proposed helper + per-RPC use
+static inline void set_context_deadline(grpc::ClientContext& ctx, int timeout_seconds)
+{
+  if (timeout_seconds <= 0) return;
+  ctx.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(timeout_seconds));
+}
+
 job_status_result_t grpc_client_t::check_status(const std::string& job_id)
 {
   job_status_result_t result;

   grpc::ClientContext context;
+  set_context_deadline(context, config_.timeout_seconds);
   auto request = build_status_request(job_id);
   cuopt::remote::StatusResponse response;
   auto status = impl_->stub->CheckStatus(&context, request, &response);

Apply similarly to WaitForCompletion, CancelJob, DeleteResult, GetIncumbents, GetResult, and all chunked upload/download RPCs.

Also applies to: 293-303, 325-333, 354-361, 387-395, 419-443, 468-470, 509-510, 540-541, 561-562, 588-589, 625-626, 674-675, 724-725, 1000-1004, 1172-1176

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 258 - 266, The RPCs in
grpc_client_t (e.g., grpc_client_t::check_status) create grpc::ClientContext
without deadlines and can block indefinitely; add a helper (e.g.,
make_client_context_with_deadline or set_deadline_on_context) that constructs a
grpc::ClientContext and calls set_deadline(std::chrono::system_clock::now() +
std::chrono::seconds(config_.timeout_seconds)) (or uses config_.timeout_seconds
as a std::chrono duration) and use that helper everywhere you create a context
for RPCs (check_status, WaitForCompletion, CancelJob, DeleteResult,
GetIncumbents, GetResult, all chunked upload/download calls, and the other
locations noted in the review) so each RPC honors config_.timeout_seconds and
cannot hang forever.

992-995: ⚠️ Potential issue | 🔴 Critical

Validate poll_interval_ms before dividing (division-by-zero).

Both LP and MIP polling paths compute max_polls by dividing by config_.poll_interval_ms without guarding <= 0.

Proposed validation
   } else {
     CUOPT_LOG_INFO("[grpc_client] Using polling (CheckStatus) for job %s", job_id.c_str());
+    if (config_.poll_interval_ms <= 0) {
+      stop_log_streaming();
+      result.error_message = "poll_interval_ms must be > 0";
+      return result;
+    }
     int poll_count = 0;
     int max_polls  = (config_.timeout_seconds * 1000) / config_.poll_interval_ms;

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

Also applies to: 1130-1133

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cu` around lines 992 - 995, Before computing
max_polls in the LP and MIP polling paths, validate config_.poll_interval_ms > 0
to avoid division-by-zero: if config_.poll_interval_ms <= 0, emit a clear error
log (e.g., CUOPT_LOG_ERROR) that includes the invalid value and either return an
error status or substitute a safe fallback (e.g., set poll_interval_ms = 1)
before computing max_polls = (config_.timeout_seconds * 1000) /
config_.poll_interval_ms; apply this check around the existing uses of
poll_count and max_polls so both the LP and MIP paths use the validated value.
cpp/src/grpc/grpc_problem_mapper.cu (1)

399-427: ⚠️ Potential issue | 🔴 Critical

Fix potential out-of-bounds writes in get_doubles/get_ints (misaligned payload sizes).

n is computed with floor division, but memcpy copies it->second.size() bytes. If the byte payload isn’t an exact multiple of the element size, this can overflow the destination vector.

Proposed fix (validate alignment and copy only validated byte count)
   auto get_doubles = [&](int32_t field_id) -> std::vector<f_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(double) != 0) return {};
     size_t n = it->second.size() / sizeof(double);
     if constexpr (std::is_same_v<f_t, double>) {
       std::vector<double> v(n);
-      std::memcpy(v.data(), it->second.data(), it->second.size());
+      std::memcpy(v.data(), it->second.data(), n * sizeof(double));
       return v;
     } else {
       std::vector<double> tmp(n);
-      std::memcpy(tmp.data(), it->second.data(), it->second.size());
+      std::memcpy(tmp.data(), it->second.data(), n * sizeof(double));
       return std::vector<f_t>(tmp.begin(), tmp.end());
     }
   };

   auto get_ints = [&](int32_t field_id) -> std::vector<i_t> {
     auto it = arrays.find(field_id);
     if (it == arrays.end() || it->second.empty()) return {};
+    if (it->second.size() % sizeof(int32_t) != 0) return {};
     size_t n = it->second.size() / sizeof(int32_t);
     if constexpr (std::is_same_v<i_t, int32_t>) {
       std::vector<int32_t> v(n);
-      std::memcpy(v.data(), it->second.data(), it->second.size());
+      std::memcpy(v.data(), it->second.data(), n * sizeof(int32_t));
       return v;
     } else {
       std::vector<int32_t> tmp(n);
-      std::memcpy(tmp.data(), it->second.data(), it->second.size());
+      std::memcpy(tmp.data(), it->second.data(), n * sizeof(int32_t));
       return std::vector<i_t>(tmp.begin(), tmp.end());
     }
   };

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_problem_mapper.cu` around lines 399 - 427, The lambdas
get_doubles and get_ints compute element count n via floor division but then
memcpy the full it->second.size() bytes, risking out-of-bounds when payload size
isn't a multiple of element size; fix by validating alignment (check
it->second.size() % sizeof(element_type) == 0, i.e. sizeof(double) in
get_doubles and sizeof(int32_t) in get_ints) and compute a safe
byte_count = n * sizeof(element_type), resize the destination vector to n, then
memcpy only byte_count (or return an error/empty vector if misaligned); update
both get_doubles (use sizeof(double) and f_t conversions) and get_ints (use
sizeof(int32_t) and i_t conversions) accordingly so copying never exceeds the
destination buffer.
cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp (1)

213-218: ⚠️ Potential issue | 🟠 Major

Synchronize server-log state reads/writes consistently (data race).

server_log_path_, server_log_start_pos_, and test_start_marked_ are accessed without a common lock in set_server_log_path(), get_server_logs(), dump_logs(), and the accessors. This can race if tests poll logs while another thread resets paths/markers.

Proposed approach (copy shared state under lock, then do I/O unlocked)
 void set_server_log_path(const std::string& path)
 {
-  server_log_path_      = path;
-  server_log_start_pos_ = 0;
-  test_start_marked_    = false;
+  std::lock_guard<std::mutex> lock(mutex_);
+  server_log_path_      = path;
+  server_log_start_pos_ = 0;
+  test_start_marked_    = false;
 }

 std::string get_server_logs(bool since_test_start = true) const
 {
-  if (server_log_path_.empty()) { return ""; }
+  std::string path;
+  std::streampos start_pos = 0;
+  bool marked              = false;
+  {
+    std::lock_guard<std::mutex> lock(mutex_);
+    path      = server_log_path_;
+    start_pos = server_log_start_pos_;
+    marked    = test_start_marked_;
+  }
+
+  if (path.empty()) { return ""; }

-  std::ifstream file(server_log_path_);
+  std::ifstream file(path);
   if (!file.is_open()) { return ""; }

-  if (since_test_start && test_start_marked_ && server_log_start_pos_ > 0) {
-    file.seekg(server_log_start_pos_);
+  if (since_test_start && marked && start_pos > 0) {
+    file.seekg(start_pos);
   }

Consider also making server_log_path() return a copy (std::string) instead of a reference if you need it to be thread-safe.

As per coding guidelines "Prevent thread-unsafe use of global and static variables; use proper mutex/synchronization in server code accessing shared solver state".

Also applies to: 229-244, 332-345, 350-356

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp` around lines 213
- 218, There is a data race on server_log_path_, server_log_start_pos_, and
test_start_marked_: protect all reads/writes with a single mutex (e.g., add
std::mutex server_log_mutex_) and update set_server_log_path(),
get_server_logs(), dump_logs(), and the accessors to acquire the lock when
accessing those members; for I/O work, copy the guarded state (make
server_log_path() return a std::string copy rather than a const ref) while
holding the lock then release and perform file reading/writing unlocked to avoid
long holds; ensure all places that currently read the fields directly use the
mutex-protected copy to eliminate races.
cpp/src/grpc/grpc_solution_mapper.cu (1)

512-535: ⚠️ Potential issue | 🔴 Critical

Harden bytes_to_typed() against misaligned byte payloads (potential OOB write).

All branches compute n with floor division and then copy raw.size() bytes. If raw.size() is not an exact multiple of the expected element size, this overflows tmp/v.

Proposed fix
   const auto& raw = it->second;
   if constexpr (std::is_same_v<T, float>) {
+    if (raw.size() % sizeof(double) != 0) return {};
     size_t n = raw.size() / sizeof(double);
     std::vector<double> tmp(n);
-    std::memcpy(tmp.data(), raw.data(), raw.size());
+    std::memcpy(tmp.data(), raw.data(), n * sizeof(double));
     return std::vector<T>(tmp.begin(), tmp.end());
   } else if constexpr (std::is_same_v<T, double>) {
+    if (raw.size() % sizeof(double) != 0) return {};
     size_t n = raw.size() / sizeof(double);
     std::vector<double> v(n);
-    std::memcpy(v.data(), raw.data(), raw.size());
+    std::memcpy(v.data(), raw.data(), n * sizeof(double));
     return v;
   } else {
+    if (raw.size() % sizeof(T) != 0) return {};
     size_t n = raw.size() / sizeof(T);
     std::vector<T> v(n);
-    std::memcpy(v.data(), raw.data(), raw.size());
+    std::memcpy(v.data(), raw.data(), n * sizeof(T));
     return v;
   }

As per coding guidelines "Check numerical stability: prevent overflow/underflow, precision loss, division by zero/near-zero, and use epsilon comparisons for floating-point equality checks".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_solution_mapper.cu` around lines 512 - 535, The
bytes_to_typed() implementation can write out-of-bounds because it computes n =
raw.size() / sizeof(element) but then memcpy's raw.size() bytes; fix by
validating raw.size() % sizeof(element) == 0 in every branch (the wire element
is double for both the float and double branches, and sizeof(T) for the generic
branch), computing size_t n = raw.size() / sizeof(element), and copying only
n * sizeof(element) bytes into the destination buffers; if raw.size() is
misaligned, return an empty vector or log a warning. Ensure the same change is
applied in the float branch (tmp) and the double/other branches (v) so no copy
can exceed the destination buffer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 53823cec-ccad-4fbb-b90f-ec216b29d2e2

📥 Commits

Reviewing files that changed from the base of the PR and between d228718 and e46028c.

📒 Files selected for processing (39)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • conda/recipes/libcuopt/recipe.yaml
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/solve_remote.cu
  • cpp/src/pdlp/CMakeLists.txt
✅ Files skipped from review due to trivial changes (1)
  • cpp/tests/linear_programming/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (4)
  • GRPC_ARCHITECTURE.md
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • cpp/src/grpc/grpc_settings_mapper.cu


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (3)
cpp/src/grpc/client/test_grpc_client.cpp (1)

340-340: Print the client error on log-stream failure for parity with other modes.

At Line 340, a failed stream_logs returns non-zero but drops the error detail. Emit client.get_last_error() before returning to keep failure diagnosis consistent.

Proposed fix
-  return success ? 0 : 1;
+  if (!success) {
+    std::cerr << "Log stream failed: " << client.get_last_error() << "\n";
+    return 1;
+  }
+  return 0;

As per coding guidelines: "Verify error propagation from CUDA to user-facing APIs is complete; ensure CUDA errors are caught and mapped to meaningful user error codes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/test_grpc_client.cpp` at line 340, The return path after
stream_logs currently returns non-zero on failure but drops the error detail;
update the failure branch surrounding stream_logs in test_grpc_client.cpp to
call and emit client.get_last_error() (or equivalent accessor) to the log/stream
before returning the non-zero exit code so the client error is logged for parity
with other modes and aids diagnosis (locate the code around stream_logs and the
return of success ? 0 : 1 and insert a process/log call that prints
client.get_last_error() when success is false).
cpp/src/grpc/server/grpc_pipe_serialization.hpp (1)

24-27: Consider making kPipeChunkBytes configurable.

The fixed 64 MiB chunk cap is a hard resource limit; exposing it via server config/env (with this as default) would improve deploy-time tuning.

As per coding guidelines: "Check that hard-coded GPU device IDs and resource limits are made configurable; abstract multi-backend support for different CUDA versions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 24 - 27,
kPipeChunkBytes is a hard-coded constexpr limit; make it configurable by
replacing the fixed constexpr with a runtime-configured value (e.g., a
GetPipeChunkBytes() or PipeConfig::pipe_chunk_bytes() accessor) that reads an
environment variable or server config and falls back to 64ULL * 1024 * 1024 as
the default; update references to kPipeChunkBytes to call the accessor and
validate the parsed value (positive, reasonable upper bound) and document the
new env var name (e.g., PIPE_CHUNK_BYTES) and default behavior.
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp (1)

957-982: Assert the log-callback path in SolveMIPWithLogCallback.

received_logs is populated but never asserted, so this test can pass even if callback streaming regresses.

💡 Suggested assertion
   auto result = client->solve_mip(problem, settings, false);
   EXPECT_TRUE(result.success) << result.error_message;
+  EXPECT_FALSE(received_logs.empty()) << "Expected at least one streamed log line";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 957
- 982, The test registers a log callback (received_logs with log_mutex) but
never asserts anything, so add assertions after the call to client->solve_mip in
SolveMIPWithLogCallback to verify the callback was invoked and logs look
correct: acquire lock(log_mutex) and assert !received_logs.empty() (and
optionally that at least one entry contains an expected substring like
"objective" or "MIP" to ensure meaningful log content). If solve_mip is
asynchronous/streams logs, ensure you wait briefly or poll until received_logs
is non-empty before asserting to avoid flakes; reference received_logs,
log_mutex, log_callback, SolveMIPWithLogCallback, and client->solve_mip when
adding the checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/CMakeLists.txt`:
- Around line 297-304: The CMake snippet only checks for the non-namespaced
target grpc_cpp_plugin; update the check to first test for the namespaced target
gRPC::grpc_cpp_plugin (like how gRPC::grpc++ is handled) and use its target file
if present, otherwise fall back to checking TARGET grpc_cpp_plugin and then
find_program for the executable; adjust references to set
_GRPC_CPP_PLUGIN_EXECUTABLE from "$<TARGET_FILE:gRPC::grpc_cpp_plugin>" when
available to ensure installed gRPC packages are recognized.

In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp`:
- Around line 1387-1402: The test currently expects GetResult to be called even
though CheckStatus returns PROCESSING (not complete); update the test so
GetResult is not expected for PROCESSING jobs: remove or replace the
EXPECT_CALL(*mock_stub_, GetResult(...)).WillOnce(...) with an expectation that
it is not called (e.g., EXPECT_CALL(*mock_stub_, GetResult(_, _, _)).Times(0))
or simply delete the GetResult expectation, keeping the CheckStatus stub that
sets resp->set_job_status(cuopt::remote::PROCESSING) so the test asserts the
intended "do not call GetResult for non-complete status" behavior.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 275-283: The server_log_count function can infinite-loop when
substring is empty; add a guard at the start of server_log_count (which calls
get_server_logs()) to return 0 (or handle as appropriate) if substring.empty()
before entering the while loop so pos will never rely on substring.length() == 0
to advance.

---

Duplicate comments:
In `@cpp/CMakeLists.txt`:
- Around line 279-294: The current CMake logic relies on IMPORTED_LOCATION*
which can be empty for non-imported targets; update the resolution to use
$<TARGET_FILE:protobuf::protoc> when the target exists: in the branch that tests
if(TARGET protobuf::protoc) set _PROTOBUF_PROTOC to the generator expression
$<TARGET_FILE:protobuf::protoc> (instead of querying IMPORTED_LOCATION*),
otherwise fall back to find_package(protobuf CONFIG REQUIRED) and then use
IMPORTED_LOCATION_RELEASE/IMPORTED_LOCATION as a fallback; keep the existing
fatal check for empty _PROTOBUF_PROTOC. Ensure you reference the symbols
protobuf::protoc and _PROTOBUF_PROTOC in the change so the target-file generator
expression is used for in-tree/FetchContent builds.

In `@cpp/src/grpc/client/grpc_client.cu`:
- Around line 258-265: grpc_client_t::check_status (and similar RPC helpers)
currently dereferences impl_->stub without checking connection state which can
crash if called before connect(); add a null-check for impl_ and impl_->stub at
the start of grpc_client_t::check_status and for the same pattern in the other
RPC helper methods (the ones calling impl_->stub->...). If either is null,
return an appropriate error job_status_result_t (populate error code/message or
a failed state consistent with existing error handling) instead of calling the
stub; otherwise proceed to build the request and call CheckStatus as before.
- Around line 994-995: The code computes max_polls by dividing by
config_.poll_interval_ms without checking for <=0; add a guard before any
division (both where max_polls is computed and the other occurrence around the
1132 region) to validate config_.poll_interval_ms > 0, and if not either (a)
return or raise a clear error/invalid-argument from the surrounding function, or
(b) log an error and use a safe default (e.g., 1 ms) before computing max_polls;
ensure you reference config_.poll_interval_ms and update the computations that
set max_polls to avoid divide-by-zero crashes.
- Around line 221-251: The read loop in start_log_streaming blocks on
reader->Read() so stop_log_streaming can hang; to fix, add a mutex-protected
pointer member (log_context_mutex_ and active_log_context_) and set
active_log_context_ = &context (protected by log_context_mutex_) right after
creating grpc::ClientContext context in start_log_streaming, clear it (set
nullptr) after reader->Finish(); then update stop_log_streaming to lock
log_context_mutex_ and if active_log_context_ is non-null call
active_log_context_->TryCancel() before setting stop_logs_ and joining the
thread so the blocking StreamLogs read is cancelled safely.
- Around line 703-712: Validate elem_size and guard against overflow before
dividing/allocating: check elem_size > 0 before computing elems_per_chunk =
chunk_data_budget / elem_size (return/continue or log and skip if zero), ensure
total_elems > 0 (already present) and verify total_elems <= SIZE_MAX / elem_size
(or use safe_mul_check) before computing total_elems * elem_size and allocating
array_bytes, and if the allocation would exceed a configured budget or SIZE_MAX
also error/skip; apply the same checks when computing per-chunk sizes and before
any memcpy/copy operations. Reference symbols: elem_size, total_elems,
chunk_data_budget, elems_per_chunk, array_bytes, arr_desc.total_elements(),
arr_desc.element_size_bytes() (also update the similar block around the other
occurrence).
- Around line 262-265: The CheckStatus RPC (and other RPCs that create
grpc::ClientContext, e.g., the sites calling impl_->stub->CheckStatus and the
other stub->* calls listed) currently create a ClientContext with no deadline;
fix by setting a per-RPC deadline on the context before the call using
context.set_deadline(std::chrono::system_clock::now() + configured_timeout).
Prefer reusing or adding a small helper (e.g., set_deadline(context, timeout_ms)
or a method on impl_ that reads a configured RPC timeout) and apply it to the
Call sites such as the CheckStatus call (impl_->stub->CheckStatus) and the other
grpc::ClientContext creation sites flagged in the review so all RPCs
consistently use the same configured deadline.

In `@cpp/src/grpc/client/grpc_client.hpp`:
- Around line 53-99: The grpc_client_config_t allows invalid values for
intervals/sizes (poll_interval_ms, incumbent_poll_interval_ms, timeout_seconds,
chunk_size_bytes, max_message_bytes, chunked_array_threshold_bytes) which can
break runtime behavior; add a validation method (e.g.,
grpc_client_config_t::Validate() or a constructor/static factory) that enforces
and documents invariants (positive non-zero ints for intervals/timeout, sensible
min/max for chunk_size_bytes and max_message_bytes, and explicit handling for
chunked_array_threshold_bytes semantics (-1 for auto, 0 for force-chunked, >0 as
threshold)), make the client call Validate() early (e.g., before using in
grpc_client constructors or SubmitJob path) and return/throw a clear error on
invalid config or clamp values where safe; update comments to list accepted
ranges for each field.

In `@cpp/src/grpc/client/solve_remote.cu`:
- Around line 55-57: The env-derived overrides for config.chunk_size_bytes and
config.max_message_bytes are applied without validation (using parse_env_int64),
allowing zero/negative/oversized values; update the logic where
config.chunk_size_bytes, config.max_message_bytes (and optionally
config.enable_debug_log) are set so that after calling parse_env_int64 you
validate the result against sane bounds (e.g., >0 and <= a defined
MAX_MESSAGE_BYTES/CHUNK_SIZE constant), ignore or clamp invalid values, and emit
a debug/warning log via the existing logger; specifically modify the section
using parse_env_int64 to perform the check and fallback to existing config
values when the parsed value is out-of-range.
- Around line 164-166: The callback is being passed the incumbent objective as
the bound (bound_copy = objective) which can show a false zero gap; change the
code so bound_copy is set to the solver's actual best bound variable (the
maintained global/best bound used by the B&B engine) instead of 'objective'
before calling get_callback->get_solution(...); locate where the solver stores
the current best bound (e.g., variables named best_bound, global_bound,
current_bound, or similar in the surrounding code/branch-and-bound context) and
use that identifier (preserving obj_copy, solution_copy, and
callback->get_user_data() usage).

In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Around line 401-403: The parsing of numeric CLI arguments uses std::stod and
std::stoll without guards and converts a double to int unsafely (variable
time_limit and the static_cast<int>(time_limit)); wrap each std::stod/std::stoll
call in a try/catch (catch std::invalid_argument and std::out_of_range),
validate the parsed value for positivity and reasonable bounds, and return/exit
with a clear error on bad input; before converting time_limit to int (the
static_cast<int>(time_limit) usage) check that time_limit is finite, >= 0 and <=
std::numeric_limits<int>::max() (or clamp/handle overflow) to avoid UB; apply
the same defensive checks to the stoll usages (ensure values fit in target
integer types) and use the same error path if validation fails.

In `@cpp/src/grpc/cuopt_remote.proto`:
- Line 6: The proto package declaration "package cuopt.remote" in
cuopt_remote.proto conflicts with its directory path (cpp/src/grpc) and triggers
Buf PACKAGE_DIRECTORY_MATCH; fix by making the package name match the file path
or by moving the file to match the package. Either change the package line in
cpp/src/grpc/cuopt_remote.proto from "package cuopt.remote" to "package
cpp.src.grpc" (matching the directory segments as dot-separated identifiers) or
relocate cuopt_remote.proto into a cuopt/remote directory so the existing
"package cuopt.remote" matches the filesystem; update any import/usage
references accordingly and re-run buf lint to verify.

In `@cpp/src/grpc/grpc_problem_mapper.cu`:
- Around line 175-182: The code assumes constraint_upper_bounds exists when
pb_problem.constraint_lower_bounds_size() > 0; instead validate both arrays and
their sizes before calling cpu_problem.set_constraint_lower_bounds and
cpu_problem.set_constraint_upper_bounds: check
pb_problem.constraint_lower_bounds_size() > 0 &&
pb_problem.constraint_upper_bounds_size() > 0 and that their sizes match, and
only then build con_lb/con_ub and call the two set_* methods; if sizes mismatch
or one is missing, handle the case (skip setting bounds or raise/log an error)
to avoid creating an inconsistent problem state.
- Around line 435-446: get_string_list can advance s past s_end when memchr
returns null (no trailing NUL); fix by computing remaining = s_end - s, call
memchr over remaining, and if memchr returns null push the final string using
the remaining length and then break (do not set s = nul + 1). Update the loop in
get_string_list to use it->second.data() and it->second.size() to compute s and
s_end, use the computed remaining length for std::memchr, and on null result
construct names.emplace_back(s, remaining) and break instead of setting s = nul
+ 1.
- Around line 399-411: The lambdas (notably get_doubles) compute element count
as it->second.size() / sizeof(double) then memcpy the full byte payload, which
can overflow if the byte length is not a multiple of sizeof(double); add an
explicit alignment/length check before any typed memcpy: verify
it->second.size() % sizeof(double) == 0 (or the equivalent check for the other
lambda’s element type), and if the check fails either return an empty vector or
log/throw an error; only then allocate the destination vector with n =
it->second.size() / sizeof(double) and perform memcpy; apply the same fix to the
other similar lambda at the noted region (around 414-426) to avoid misaligned
copies.

In `@cpp/src/grpc/grpc_service_mapper.cu`:
- Around line 23-55: Chunked wire format is inconsistent for floats: update
chunk_typed_array (and the client-side build_array_chunk_requests that calls it)
to always emit floating arrays as IEEE double on the wire (so server get_doubles
can continue to interpret chunks as sizeof(double)). Concretely, when T ==
float, convert the input vector to a temporary std::vector<double> and use that
buffer for chunking (set elem_size = sizeof(double) and point raw at the double
buffer) so byte offsets/counts match get_doubles; for non-floating types keep
the existing behavior. Ensure build_array_chunk_requests callers use the
adjusted chunk_typed_array semantics and do not send raw float bytes anymore.
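
A minimal sketch of the suggested widening, assuming the client promotes float arrays before chunking (the function name here is illustrative, not the actual `chunk_typed_array` signature):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch: promote a float array to IEEE double before chunking,
// so a server-side reader that assumes sizeof(double) elements computes
// correct byte offsets and counts.
std::vector<std::uint8_t> widen_floats_for_wire(const std::vector<float>& in)
{
  std::vector<double> widened(in.begin(), in.end());  // float -> double
  std::vector<std::uint8_t> bytes(widened.size() * sizeof(double));
  std::memcpy(bytes.data(), widened.data(), bytes.size());
  return bytes;
}
```

With this shape, `elem_size` on the wire is always `sizeof(double)` for floating arrays, matching what `get_doubles` expects.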

In `@cpp/src/grpc/grpc_solution_mapper.cu`:
- Around line 520-534: The bytes_to_typed template is copying raw.size() bytes
into buffers sized for n elements, causing overflow when raw.size() isn't an
exact multiple of the element byte-size; update each branch (the
std::is_same_v<T,float> branch that builds tmp as double, the
std::is_same_v<T,double> branch that builds v as double, and the generic branch)
to: compute n = raw.size() / sizeof(ElemType), check that raw.size() %
sizeof(ElemType) == 0 and handle (throw/log/return error) if not, and use
memcpy(dst.data(), raw.data(), n * sizeof(ElemType)) instead of memcpy(...,
raw.size()) so you only copy the exact bytes that fit the destination (refer to
function bytes_to_typed and the template T branches).

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp`:
- Around line 46-49: The assignment vector is not cleared before appending
parsed values, so repeated calls to parse_incumbent_proto will accumulate stale
entries; fix by calling assignment.clear() (or assign empty) before
assignment.reserve(...) in the code that iterates over incumbent_msg
(referencing the assignment variable and the loop that uses
incumbent_msg.assignment_size() and incumbent_msg.assignment(i)) so the vector
starts empty each parse.
- Around line 30-32: The serialization ignores SerializeToArray's boolean return
and narrows ByteSizeLong() (size_t) to int unsafely: update the code that
constructs buffer and calls msg.SerializeToArray to first capture size_t sz =
msg.ByteSizeLong(), validate sz <= static_cast<size_t>(INT_MAX) before casting,
allocate buffer with that size, call bool ok =
msg.SerializeToArray(buffer.data(), static_cast<int>(sz)), and handle a false
result (e.g., throw or return an error) instead of ignoring it; reference the
buffer variable, msg.ByteSizeLong(), and msg.SerializeToArray(...) when making
the change.
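
The hardened serialization path might look like the sketch below. It is written as a template so it compiles without protobuf; `Msg` only needs `ByteSizeLong()` and `SerializeToArray(void*, int)`, which matches the protobuf `MessageLite` surface:

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hedged sketch of the suggested fix: capture ByteSizeLong() as size_t,
// validate before narrowing to int, and check SerializeToArray's result.
template <typename Msg>
std::vector<std::uint8_t> serialize_checked(const Msg& msg)
{
  const std::size_t sz = msg.ByteSizeLong();
  if (sz > static_cast<std::size_t>(INT_MAX)) {
    throw std::length_error("message too large to serialize");
  }
  std::vector<std::uint8_t> buffer(sz);
  if (!msg.SerializeToArray(buffer.data(), static_cast<int>(sz))) {
    throw std::runtime_error("SerializeToArray failed");
  }
  return buffer;
}
```

Whether to throw or return an error code is a design choice; the point is that a `false` result and an over-`INT_MAX` size are both handled rather than ignored.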

In `@cpp/src/grpc/server/grpc_job_management.cpp`:
- Around line 107-113: The recv_incumbent_pipe function trusts the incoming size
prefix and can allocate arbitrarily large memory; add a sanity check before
resizing by comparing the read size (variable size) against a defined cap (e.g.,
a new constant like kMaxIncumbentSize or MAX_INCUMBENT_SIZE) and reject/return
false if size is 0 or exceeds that cap, then only call data.resize(size) and
read_from_pipe when the size is within limits; update recv_incumbent_pipe to
enforce this limit and document the constant choice.
- Around line 147-158: The loop in submit_job_async is racy because threads
perform a check-then-write on job_queue[i] (checking ready/claimed then
initializing), so change the slot acquisition to an atomic claim: use an atomic
flag (e.g., make job_queue[].claimed an std::atomic_bool or add a per-slot
std::atomic<int> state) and perform a compare_exchange_strong to transition from
false->true to reserve the slot; only the thread that succeeds should populate
the slot fields (job_id, problem_type, data_size, worker_pid, worker_index,
data_sent, cancelled) and set job_queue[i].ready = true as the final publish
step. Ensure submit_job_async uses this atomic CAS on job_queue[i].claimed and
does not set ready until after all other fields are initialized.
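
The claim-then-publish protocol described above can be sketched with standard atomics (slot layout and names are hypothetical, reduced from the review's description of `job_queue`):

```cpp
#include <atomic>
#include <cstddef>

// Minimal sketch: a slot is reserved atomically with compare_exchange_strong,
// its fields are populated, and only then is `ready` stored as the final
// publish step, so consumers never observe a half-initialized entry.
struct Slot {
  std::atomic<bool> claimed{false};
  std::atomic<bool> ready{false};
  int job_id{-1};
};

// Returns the index of the claimed slot, or -1 if the queue is full.
int submit(Slot* queue, std::size_t n, int job_id)
{
  for (std::size_t i = 0; i < n; ++i) {
    bool expected = false;
    if (queue[i].claimed.compare_exchange_strong(expected, true)) {
      queue[i].job_id = job_id;                               // populate fields
      queue[i].ready.store(true, std::memory_order_release);  // then publish
      return static_cast<int>(i);
    }
  }
  return -1;
}
```

Only the thread whose CAS succeeds writes the slot, which removes the check-then-write race between concurrent producers.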

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 93-100: Validate and sanitize the metadata before resizing or
memcpy: check ac.total_elements() and ac.element_offset() are non-negative and
within reasonable caps, ensure chunk_data.size() + offset does not overflow and
fits within a bounded/validated size_t, and guard against pathological
total_elements to avoid huge allocations; in the functions using ac (e.g., where
dest.resize(...) and std::memcpy(...) are called), convert to size_t only after
validation, return or throw an error on invalid/too-large values instead of
resizing/copying, and ensure the memcpy condition uses the validated,
overflow-checked bounds.
- Line 90: The deserializers currently treat ParseDelimitedFromCodedStream
failures by breaking out and eventually returning true, accepting malformed
payloads; change the error handling around
google::protobuf::util::ParseDelimitedFromCodedStream(...) (the calls that pass
ac, coded, clean_eof) so that on a parse failure you immediately fail closed and
return false (or propagate an error) instead of break/continuing; apply the same
change to both deserializer sites that call ParseDelimitedFromCodedStream to
ensure truncated/malformed messages are rejected.
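
The overflow/cap validation could look roughly like this; the cap value and function name are assumptions for illustration, not the actual pipe deserializer:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Assumed cap on a reassembled array; the real limit should come from config.
constexpr std::size_t kMaxArrayBytes = 1ULL << 30;  // 1 GiB, illustrative

// Validate untrusted offset/length metadata before any resize or memcpy.
bool append_chunk(std::string& dest, std::int64_t offset, const std::string& chunk)
{
  if (offset < 0) return false;  // negative metadata: reject
  if (chunk.empty()) return true;
  const auto off = static_cast<std::size_t>(offset);
  if (chunk.size() > kMaxArrayBytes || off > kMaxArrayBytes - chunk.size()) {
    return false;  // would overflow size_t math or exceed the cap
  }
  if (dest.size() < off + chunk.size()) dest.resize(off + chunk.size());
  std::memcpy(&dest[off], chunk.data(), chunk.size());
  return true;
}
```

The key detail is doing the bound check as `off > cap - len` (subtraction on the known-safe side) so the comparison itself cannot overflow.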

In `@cpp/src/grpc/server/grpc_server_main.cpp`:
- Around line 234-250: Builder::BuildAndStart() can return nullptr so avoid
dereferencing server without checking; after calling builder.BuildAndStart()
(unique symbols: builder.BuildAndStart(), std::unique_ptr<Server> server,
server->Wait(), shutdown_thread, keep_running), check if server is null and if
so log an error/print to stderr, ensure shutdown_thread is signaled/joined or
cleaned up, and exit or return with a non-zero status; if server is valid
continue to start the shutdown_thread and call server->Wait() as currently
implemented.
- Around line 89-90: The signal handler registered via signal(SIGINT,
signal_handler) performs non-async-signal-safe operations (std::cout and
result_cv.notify_all()) — change it to a minimal, async-signal-safe handler:
have signal_handler set a std::sig_atomic_t or std::atomic<bool> flag (e.g.,
shutdown_requested) and (optionally) write a byte to a dedicated pipe to wake
the main loop; then remove any std::cout or condition_variable use from
signal_handler and move the actual logging and result_cv.notify_all() into the
main thread’s event loop where you check shutdown_requested (or the pipe) and
call result_cv.notify_all() and perform printing. Update grpc_server_types.hpp
to expose the atomic flag (or pipe FD) and ensure only async-signal-safe
functions are called from signal_handler.
- Around line 156-168: The loops in grpc_server_main.cpp must not call memset on
JobQueueEntry and ResultQueueEntry because they contain std::atomic members;
remove the memset calls and instead default-construct or placement-new each
element and then explicitly initialize fields. Replace memset(&job_queue[i], 0,
sizeof(JobQueueEntry)) with either new (&job_queue[i]) JobQueueEntry() or ensure
the array is value-initialized, then call job_queue[i].ready.store(false),
.claimed.store(false), .cancelled.store(false) and
job_queue[i].worker_index.store(-1); do the same for result_queue elements
(value-initialize or placement-new, then .ready.store(false) and
.retrieved.store(false)). Ensure JobQueueEntry and ResultQueueEntry constructors
handle non-atomic fields if you prefer constructor-based initialization.
- Around line 186-212: The early-return TLS error paths currently exit
immediately after spawn_workers() and creating threads, leaking workers,
threads, and shared memory; change these branches to perform proper cleanup
before returning: after calling spawn_workers() and after launching the threads
(referencing spawn_workers() and the container holding created std::thread
objects), call the corresponding shutdown/stop routine for workers (e.g.,
stop_workers() or signal shutdown on the worker pool), join all threads in the
thread container, and release any shared memory/IPC resources (the handles
created earlier, and any variables like ssl_opts or pem_root_certs can be left
untouched) before returning an error code. Ensure every early-return path that
occurs after worker spawn or thread creation follows this cleanup sequence so no
threads or shared resources remain leaked.
- Around line 50-80: The CLI parsing loop currently in main (the for-loop that
processes argv and assigns to config) silently ignores unknown flags and missing
values; update it to validate every flag and required value: for each branch
that expects a value (flags that call std::stoi/std::stoll or assign argv[++i]
to config fields like config.port, config.num_workers, config.max_message_mb,
config.max_message_b, config.tls_cert_path, config.tls_key_path,
config.tls_root_path) ensure you check i+1<argc and on failure call
print_usage(argv[0]) and return a non-zero exit code; add an else branch for
unrecognized args (arg starts with '-') that also prints usage and exits
non-zero; additionally wrap stoi/stoll conversions in try/catch to detect
invalid numeric inputs and treat them as errors that print usage and exit
non-zero so the server cannot start with unintended defaults.
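
The flag-value validation might be factored like this (names are illustrative; the real loop assigns into `config` fields):

```cpp
#include <cstddef>
#include <exception>
#include <optional>
#include <string>

// Hypothetical sketch: parse the value following a flag at argv[i], rejecting
// a missing value, trailing junk, non-numeric input, and out-of-range ports.
std::optional<int> parse_port(int argc, char** argv, int i)
{
  if (i + 1 >= argc) return std::nullopt;  // flag present, value missing
  const std::string value = argv[i + 1];
  try {
    std::size_t pos = 0;
    const int v = std::stoi(value, &pos);
    if (pos != value.size() || v <= 0 || v > 65535) {
      return std::nullopt;  // trailing junk or out of range
    }
    return v;
  } catch (const std::exception&) {
    return std::nullopt;  // invalid_argument or out_of_range from stoi
  }
}
```

On `std::nullopt` the caller would print usage and exit non-zero, so the server never starts with an unintended default.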

In `@cpp/src/grpc/server/grpc_server_threads.cpp`:
- Around line 166-229: When deserialize_result_pipe_blob(...) fails you mark
it->second.status = JobStatus::FAILED but you leave the local success flag true,
causing waiter->success to be set incorrectly; update the logic so that after a
deserialization failure you set success = false (or set waiter->success based on
it->second.status != JobStatus::FAILED) before assigning waiter->success in the
waiting_threads block; modify the code paths around
deserialize_result_pipe_blob, job_tracker, and the waiter assignment
(waiter->success) so waiters observe the actual failed state and error_message
from it->second.error_message.

In `@cpp/src/grpc/server/grpc_server_types.hpp`:
- Around line 300-307: signal_handler currently calls non-async-signal-safe APIs
(std::cout and result_cv.notify_all); change it to only perform
async-signal-safe operations: remove the std::cout and result_cv.notify_all
calls and instead set an atomic/sig_atomic_t flag (ensure keep_running is an
std::atomic_bool or set a sig_atomic_t atomic_shutdown) and, if needed, set
shm_ctrl->shutdown_requested using a sig_atomic_t field; add a normal runtime
watcher thread that observes keep_running/shutdown flag and performs logging and
calls result_cv.notify_all when it detects shutdown; refer to signal_handler,
keep_running, shm_ctrl, shutdown_requested and result_cv when making these
changes.
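
The async-signal-safe shape is small; a minimal sketch, with the logging and `result_cv.notify_all()` deferred to a normal thread that polls the flag:

```cpp
#include <csignal>

// Only a volatile sig_atomic_t write happens inside the handler; everything
// else (logging, condition-variable notification) runs on a watcher thread.
volatile std::sig_atomic_t shutdown_requested = 0;

extern "C" void signal_handler(int) { shutdown_requested = 1; }

void install_handler() { std::signal(SIGINT, signal_handler); }
```

A watcher thread (or the main loop) then checks `shutdown_requested`, prints the shutdown message, and calls `result_cv.notify_all()` from ordinary thread context, where those calls are legal.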

In `@cpp/src/grpc/server/grpc_service_impl.cpp`:
- Around line 71-73: The code releases job_queue[job_idx].claimed before
publishing job_queue[job_idx].ready, creating a race where another producer can
claim the slot before consumers see ready; invert the operations so you
store(true) to job_queue[job_idx].ready before you store(false) to
job_queue[job_idx].claimed (apply same fix at the other occurrence around lines
with job_queue[...] claimed/ready, e.g., the site referenced as 265-267) to
ensure ready is visible prior to allowing re-claim.
- Around line 633-642: Hold on to waiter->mutex only while inspecting/modifying
waiter state; do not call check_job_status() while holding waiter->mutex to
avoid lock-order inversion with tracker_mutex used in cancel paths (e.g.,
cancel_job). In WaitForCompletion, restructure the loop so you acquire
unique_lock waiter's mutex, check waiter->ready and context->IsCancelled (and
adjust waiter->waiters on cancel) but before invoking check_job_status(job_id,
msg) release the lock (lock.unlock()), call check_job_status(), then re-lock
(lock.lock()) to re-evaluate waiter->ready and continue/wait; alternatively
adopt a consistent global lock order (acquire tracker_mutex before waiter->mutex
everywhere) — prefer the unlock/call/relock pattern around the call site to
quickly eliminate the deadlock risk.
- Around line 189-192: The code currently blindly adds each chunk to
meta.received_bytes, state.total_bytes, state.chunks and increments
state.total_chunks without enforcing a cumulative bound, so add a pre-check that
rejects or truncates a chunk when meta.received_bytes + raw.size() (or
state.total_bytes + raw.size()) would exceed the declared/expected field size
(use the existing expected-size field on meta or state), and only update
meta.received_bytes, state.total_bytes, state.chunks.push_back(ac) and
++state.total_chunks after the check succeeds; on failure return an error/status
indicating oversized upload to prevent unbounded memory growth.
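
The unlock/call/relock pattern recommended for WaitForCompletion can be sketched as below; `Waiter` and the callback are simplified stand-ins for the review's `waiter->mutex`, `check_job_status`, and `tracker_mutex`:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

struct Waiter {
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;
};

// Drop the waiter mutex before calling into code that takes another lock
// (tracker_mutex in the real server), then re-acquire and re-check the
// predicate -- avoiding a lock-order inversion with the cancel path.
template <typename CheckFn>
void wait_for_completion(Waiter& w, CheckFn check_job_status)
{
  std::unique_lock<std::mutex> lock(w.m);
  while (!w.ready) {
    lock.unlock();        // never hold waiter mutex across the status call
    check_job_status();   // may lock tracker_mutex internally
    lock.lock();          // re-acquire and re-evaluate
    if (w.ready) break;
    w.cv.wait_for(lock, std::chrono::milliseconds(10));
  }
}
```

The alternative in the review, a single global lock order with `tracker_mutex` always taken first, also works, but the unlock/call/relock pattern is the smaller change.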

In `@cpp/src/grpc/server/grpc_worker_infra.cpp`:
- Around line 120-124: The fork failure branch only calls
close_worker_pipes_server(worker_id) leaving child-end FDs open and leaking;
update the pid < 0 handling to also close the child-side pipe FDs (e.g., call
the corresponding cleanup function for child ends such as
close_worker_pipes_child(worker_id) or otherwise close the child file
descriptors associated with worker_id) before returning -1, ensuring both server
and child pipe endpoints are cleaned up when fork() fails; reference symbols:
fork, pid, worker_id, is_replacement, close_worker_pipes_server.
- Around line 165-168: The loop in wait_for_workers is calling waitpid
unconditionally for every entry in worker_pids; skip any non-positive PIDs (pid
<= 0) and call waitpid only for valid PIDs, and handle EINTR by retrying the
wait in a loop until waitpid succeeds or returns a permanent error; also handle
-1 results such that if errno == ECHILD you can break/continue safely and for
other errors log/propagate as appropriate. Locate the loop that iterates over
worker_pids and update it to validate pid > 0, wrap the waitpid call in a retry
loop that checks errno == EINTR, and handle other error cases (e.g., ECHILD)
instead of proceeding blindly.
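
A hardened wait loop along the lines suggested (POSIX-only, names illustrative):

```cpp
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>

// Skip invalid PIDs, retry on EINTR, and treat ECHILD (child already reaped)
// as a benign condition rather than proceeding blindly.
bool wait_for_pid(pid_t pid)
{
  if (pid <= 0) return false;  // 0/-1 have process-group semantics; never wait
  int status = 0;
  while (waitpid(pid, &status, 0) == -1) {
    if (errno == EINTR) continue;  // interrupted by a signal: retry
    return errno == ECHILD;        // reaped elsewhere: safe to continue
  }
  return true;
}
```

`wait_for_workers` would then call this per entry in `worker_pids`, logging (rather than ignoring) any remaining error cases.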

In `@cpp/src/grpc/server/grpc_worker.cpp`:
- Around line 56-74: store_simple_result currently scans result_queue for an
entry with !ready and writes multiple non-atomic fields, causing a race; change
the slot allocation to atomically claim a slot before writing (e.g., add or
reuse an atomic claim flag on result_queue[i] and use
compare_exchange_strong/compare_exchange_weak to transition from UNCLAIMED to
CLAIMED), only proceed to copy job_id/status/data_size/error_message and set
retrieved/ready after the CAS succeeds, and ensure you set the atomic ready flag
last so other threads see a fully-initialized entry; update store_simple_result
to loop until a CAS claim succeeds (or handle full-queue) and reference
result_queue, MAX_RESULTS, and the ready/claimed flags in your changes.
- Around line 44-48: Check and handle the return value of
send_incumbent_pipe(fd_, buffer): if it indicates failure (e.g., broken pipe or
write error) log an error including job_id_ and the returned errno/details and
avoid further work or retry/cleanup as appropriate; only emit the incumbent
logging (currently using std::cout with job_id_, objective, assignment.size())
when the send succeeded or gate it behind a debug/log-level flag or rate-limit
to prevent unbounded output. Locate this logic around the call sites using
send_incumbent_pipe, build_incumbent_proto, and the variables job_id_,
objective, assignment, and fd_ to add the error check and conditional logging.
- Around line 248-252: The cudaMemcpy call copying device_solution to
host_solution currently ignores CUDA return codes; change it to capture the
cudaMemcpy return value (cudaError_t err = cudaMemcpy(...)) and if err !=
cudaSuccess throw an exception (e.g., std::runtime_error) with a descriptive
message including cudaGetErrorString(err) so the existing try/catch surfaces a
clear failure to the client; apply the same pattern to all other cudaMemcpy
sites in this file (including the similar copies near lines 296-307) and
reference the same symbols (device_solution, host_solution, cudaMemcpy) when
making the fixes.
- Around line 341-363: The loop that finds an empty slot in result_queue and
then fills it (using result_slot, ResultQueueEntry, copy_cstr, MAX_RESULTS,
worker_id) has a race: another thread may grab the same slot after the ready
check but before setting ready, and if no slot is found the result is silently
dropped; fix by protecting the selection-and-write with the same synchronization
used in store_simple_result (e.g., a mutex or atomic compare-and-set on
result_queue[i].ready) so the check-and-set is atomic, and add explicit handling
for the case result_slot remains -1 (log an error via the worker logger /
std::cerr and return or propagate a failure code) so lost results are reported.
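
For the CUDA error checks, the pattern is to capture and translate the status code; the sketch below is written against a stub error type so it compiles without CUDA (in the real code `ErrorT` is `cudaError_t`, `success` is `cudaSuccess`, and `to_string` is `cudaGetErrorString`):

```cpp
#include <stdexcept>
#include <string>

// Throw a descriptive runtime_error when a status code is not the success
// value; the existing try/catch in the worker then surfaces it to the client.
template <typename ErrorT, typename ToStringFn>
void throw_on_error(ErrorT err, ErrorT success, ToStringFn to_string,
                    const char* what)
{
  if (err != success) {
    throw std::runtime_error(std::string(what) + ": " + to_string(err));
  }
}
```

At a call site this becomes, e.g., `throw_on_error(cudaMemcpy(dst, src, n, kind), cudaSuccess, cudaGetErrorString, "cudaMemcpy")`, applied uniformly to every copy in the file.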

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 53-57: The test file uses open() and O_WRONLY/O_CREAT/O_TRUNC but
does not include <fcntl.h>, which is required on some platforms; add `#include
<fcntl.h>` to the top of grpc_integration_test.cpp alongside the existing system
headers (near the includes for <signal.h>, <sys/types.h>, <sys/wait.h>,
<unistd.h>) so that the open() call and the O_* constants used by the test
compile portably.

In `@cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp`:
- Around line 213-218: The setter set_server_log_path writes server_log_path_,
server_log_start_pos_, and test_start_marked_ without synchronization; protect
these shared members by acquiring mutex_ (e.g., std::lock_guard or
std::unique_lock on mutex_) at the start of set_server_log_path and similarly in
every reader/callback that touches server_log_path_, server_log_start_pos_, or
test_start_marked_ (the read blocks around lines ~229-244 and ~332-355). Under
the lock, perform updates or read-and-copy the values, and for readers prefer
copying the protected values into local variables while holding mutex_ then
release the lock before doing any IO/long operations to avoid holding mutex_
during blocking work.
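
The copy-under-lock pattern described above can be sketched like this (member names mirror the review; the class body is an assumed reduction of the real helper):

```cpp
#include <mutex>
#include <string>
#include <utility>

// Writers and readers both take mutex_; readers copy the shared members into
// locals and release the lock before doing any slow file IO.
class LogCapture {
 public:
  void set_server_log_path(std::string path, long start_pos)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    server_log_path_      = std::move(path);
    server_log_start_pos_ = start_pos;
  }

  std::pair<std::string, long> snapshot() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return {server_log_path_, server_log_start_pos_};  // copy, then unlock
  }

 private:
  mutable std::mutex mutex_;
  std::string server_log_path_;
  long server_log_start_pos_ = 0;
};
```

A reader would call `snapshot()`, drop the lock implicitly, and only then open/seek the log file, so `mutex_` is never held across blocking IO.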

In `@GRPC_ARCHITECTURE.md`:
- Around line 9-10: The phrase claiming "no custom serialization logic is
implemented" is too absolute; update the line to clarify the scope by stating
that protocol buffers generated by protoc and grpc_cpp_plugin are used for gRPC
wire payloads, and explicitly note the exception that
grpc_pipe_serialization.hpp provides custom pipe blob framing helpers (i.e., the
custom framing is outside gRPC wire payload serialization). Locate the sentence
in GRPC_ARCHITECTURE.md and reword it to mention "for gRPC wire payloads" and
add a parenthetical or following sentence referencing
grpc_pipe_serialization.hpp as the known custom framing helper.

---

Nitpick comments:
In `@cpp/src/grpc/client/test_grpc_client.cpp`:
- Line 340: The return path after stream_logs currently returns non-zero on
failure but drops the error detail; update the failure branch surrounding
stream_logs in test_grpc_client.cpp to call and emit client.get_last_error() (or
equivalent accessor) to the log/stream before returning the non-zero exit code
so the client error is logged for parity with other modes and aids diagnosis
(locate the code around stream_logs and the return of success ? 0 : 1 and insert
a process/log call that prints client.get_last_error() when success is false).

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp`:
- Around line 24-27: kPipeChunkBytes is a hard-coded constexpr limit; make it
configurable by replacing the fixed constexpr with a runtime-configured value
(e.g., a GetPipeChunkBytes() or PipeConfig::pipe_chunk_bytes() accessor) that
reads an environment variable or server config and falls back to 64ULL * 1024 *
1024 as the default; update references to kPipeChunkBytes to call the accessor
and validate the parsed value (positive, reasonable upper bound) and document
the new env var name (e.g., PIPE_CHUNK_BYTES) and default behavior.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp`:
- Around line 957-982: The test registers a log callback (received_logs with
log_mutex) but never asserts anything, so add assertions after the call to
client->solve_mip in SolveMIPWithLogCallback to verify the callback was invoked
and logs look correct: acquire lock(log_mutex) and assert !received_logs.empty()
(and optionally that at least one entry contains an expected substring like
"objective" or "MIP" to ensure meaningful log content). If solve_mip is
asynchronous/streams logs, ensure you wait briefly or poll until received_logs
is non-empty before asserting to avoid flakes; reference received_logs,
log_mutex, log_callback, SolveMIPWithLogCallback, and client->solve_mip when
adding the checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8bfed585-0c3e-4c04-9fc0-af2be140a6f2

📥 Commits

Reviewing files that changed from the base of the PR and between e46028c and 6c06a40.

📒 Files selected for processing (39)
  • GRPC_ARCHITECTURE.md
  • SERVER_ARCHITECTURE.md
  • build.sh
  • conda/recipes/libcuopt/recipe.yaml
  • cpp/CMakeLists.txt
  • cpp/cuopt_grpc_server.cpp
  • cpp/src/grpc/client/grpc_client.cu
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cu
  • cpp/src/grpc/client/test_grpc_client.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cu
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cu
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cu
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cu
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
💤 Files with no reviewable changes (2)
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/solve_remote.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • build.sh
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • cpp/src/grpc/grpc_settings_mapper.cu

@tmckayus tmckayus force-pushed the grpc-server-v4 branch 2 times, most recently from a94be38 to 67e1796 Compare March 6, 2026 21:20
cuopt uses an embedded gRPC client to solve problems on a remote server
@anandhkb anandhkb added this to the 26.04 milestone Mar 8, 2026
@tmckayus tmckayus force-pushed the grpc-server-v4 branch 5 times, most recently from 5fea430 to 92c17f0 Compare March 10, 2026 14:18
@rgsl888prabhu rgsl888prabhu changed the base branch from main to release/26.04 March 19, 2026 16:33
The same parameters will be removed from solve_lp in another
change.
@rg20
Contributor

rg20 commented Mar 25, 2026

@tmckayus Thanks for the great work!

I think you need to update the docs, add examples etc as well.

@tmckayus
Contributor Author

@tmckayus Thanks for the great work!

I think you need to update the docs, add examples etc as well.

Absolutely agree! I want to get this merged to avoid conflicts, and so we can put up the code generator as a follow-on. I will create a doc/example PR as well.


./build.sh -n -v ${BUILD_EXTRA_FLAGS} libmps_parser libcuopt deb --allgpuarch --cmake-args=\"-DCMAKE_INSTALL_LIBDIR=lib\"
# Workaround for abseil-cpp#1624: Mutex::Dtor() not exported from shared libabseil on Linux.
# Build gRPC/Protobuf/Abseil from source to ensure consistent ABI and avoid symbol issues.
Contributor

We always want to depend on conda packages in conda recipes. We must be ABI-compatible with the rest of conda-forge.

Contributor

Looks like the linked issue was fixed in abseil/abseil-cpp@18ac608 and released in libabseil=20260107.*, which is available on conda-forge (and is the version of libabseil that a recent build of protobuf depends on).

Contributor Author

Resolved; we've got the conda build using packages now.

@anandhkb
Contributor

@gforsyth for request on CI codeowner review

Comment on lines +32 to +38
# Workaround for abseil-cpp#1624: Mutex::Dtor() not exported from shared libabseil on Linux.
# Build gRPC/Protobuf/Abseil from source to ensure consistent ABI and avoid symbol issues.
GRPC_INSTALL_DIR="${SRC_DIR}/grpc_install"

bash ./ci/utils/install_protobuf_grpc.sh --prefix="${GRPC_INSTALL_DIR}" --build-dir="${SRC_DIR}" --skip-deps

export CMAKE_PREFIX_PATH="${GRPC_INSTALL_DIR}:${CMAKE_PREFIX_PATH:-}"
Contributor

Have we tried this build using the available conda-forge packages?

libgrpc is built against libprotobuf and libabseil with the fix for the linked issue available

Contributor Author

yes trying this in another test PR, thanks

Contributor Author

modified, works

tmckayus and others added 5 commits March 30, 2026 13:14
Co-authored-by: Gil Forsyth <gforsyth@users.noreply.github.com>
Co-authored-by: Gil Forsyth <gforsyth@users.noreply.github.com>
Co-authored-by: Gil Forsyth <gforsyth@users.noreply.github.com>
Co-authored-by: Gil Forsyth <gforsyth@users.noreply.github.com>
The previous workaround of building from source is no longer necessary since
the libabseil bug has been fixed.
@tmckayus tmckayus requested review from bdice and gforsyth March 30, 2026 20:30
Contributor

@bdice bdice left a comment

Approving to unblock since this PR is very large. I have one question on scope.

Contributor

This file is only called by build_wheel_libcuopt.sh, if I'm reading it correctly. Do we need all of the options and flexible behavior that is implemented in this script, or could we reduce its scope?

Collaborator

I think we can trim it down to the bare minimum. @tmckayus maybe we can tackle this in a follow-up PR?

@tmckayus
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit ee2d4c5 into NVIDIA:release/26.04 Mar 31, 2026
503 of 517 checks passed
rapids-bot bot pushed a commit to rapidsai/devcontainers that referenced this pull request Apr 2, 2026
After NVIDIA/cuopt#939, cuopt doesn't configure without the libgrpc C++ library.

Authors:
  - Paul Taylor (https://github.com/trxcllnt)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: #685

Labels

feature request New feature or request non-breaking Introduces a non-breaking change
