
fix(cuda_helpers): clear sticky error and avoid cache poisoning in set_shmem_of_kernel#1095

Open
np96 wants to merge 2 commits into NVIDIA:main from np96:fix/set-shmem-main

Conversation


@np96 np96 commented Apr 12, 2026

Description

When cudaFuncSetAttribute fails (e.g. requested size exceeds device limit),
the previous implementation stored the failed size in the shmem_sizes cache
and left a sticky CUDA error in the last-error slot. Subsequent calls for
the same kernel would see the cached (invalid) size and skip the attribute
call, silently proceeding without the required shared memory. The sticky
error would later be caught by an unrelated RAFT_CHECK_CUDA, producing a
confusing cudaErrorInvalidValue crash.

Fix:

  • Only update the cache on success.
  • On failure, consume the error with cudaGetLastError() so it cannot
    surface later, then return false.
  • Add five unit tests in ROUTING_UNIT_TEST covering zero request, normal
    request, too-large returns false, cache not poisoned on failure, and
    no sticky error after failure.
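The cache-on-success and error-consumption behavior above can be sketched in simplified form. This is an illustrative C++ sketch, not the actual cuda_helpers.cuh code: `fake_set_attribute` and `fake_get_last_error` stand in for cudaFuncSetAttribute and cudaGetLastError, and the device limit is hard-coded to the V100 value from the reproducer below.

```cpp
#include <cassert>
#include <unordered_map>

// Stand-ins for cudaError_t / cudaFuncSetAttribute / cudaGetLastError;
// in the real code these are the CUDA runtime API.
enum fake_error { success, invalid_value };
static fake_error last_error = success;
static const int device_limit = 98304;  // e.g. V100 sharedMemPerBlockOptin

fake_error fake_set_attribute(int size) {
  if (size > device_limit) { last_error = invalid_value; return invalid_value; }
  return success;
}

fake_error fake_get_last_error() {  // returns and clears the sticky error
  fake_error e = last_error;
  last_error = success;
  return e;
}

static std::unordered_map<const void*, int> shmem_sizes;  // kernel -> configured size

bool set_shmem_of_kernel(const void* kernel, int size) {
  auto it = shmem_sizes.find(kernel);
  if (it != shmem_sizes.end() && it->second >= size) return true;  // cache hit
  if (fake_set_attribute(size) != success) {
    fake_get_last_error();  // consume the error so it cannot surface later
    return false;           // and do NOT poison the cache with `size`
  }
  shmem_sizes[kernel] = size;  // update the cache only on success
  return true;
}
```

The buggy version did the equivalent of `shmem_sizes[kernel] = size` before checking the attribute call's result, so a too-large request both poisoned the cache and left the sticky error for an unrelated RAFT_CHECK_CUDA to trip over.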

Issue

#1094

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • Added tests
• Confirmed that issue #1094 no longer reproduces after the fix
  • Documentation
    • NA

…in set_shmem_of_kernel


Reproducer: routing.Solve crashes with cudaErrorInvalidValue at
N_VEHICLES >= 157 on V100 (sharedMemPerBlockOptin = 98304 B).
@np96 np96 requested review from a team as code owners April 12, 2026 14:54
@np96 np96 requested review from hlinsen and rgsl888prabhu April 12, 2026 14:54

copy-pr-bot bot commented Apr 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@np96 np96 requested a review from aliceb-nv April 12, 2026 14:54

coderabbitai bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c08e789c-e1f9-4660-9069-c03c8064784f

📥 Commits

Reviewing files that changed from the base of the PR and between 8e90a94 and fe62f1f.

📒 Files selected for processing (2)
  • cpp/src/utilities/cuda_helpers.cuh
  • cpp/tests/routing/unit_tests/set_shmem_of_kernel.cu
✅ Files skipped from review due to trivial changes (1)
  • cpp/tests/routing/unit_tests/set_shmem_of_kernel.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/src/utilities/cuda_helpers.cuh

📝 Walkthrough

Walkthrough

Switched set_shmem_of_kernel synchronization to a static std::shared_mutex, added a fast-path cached read, made cache updates conditional on cudaFuncSetAttribute success (capturing its return in a local err), and consumed CUDA errors on failure. Added unit tests exercising success, failure, repeat calls, and CUDA error state.

Changes

Cohort / File(s) Summary
Core implementation
cpp/src/utilities/cuda_helpers.cuh
Replaced process-wide static std::mutex with static std::shared_mutex; added shared-lock fast-path for cached aligned dynamic_request_size; call cudaFuncSetAttribute, capture return (err), update cache only on success, and call cudaGetLastError() on failure.
Tests & build
cpp/tests/routing/CMakeLists.txt, cpp/tests/routing/unit_tests/set_shmem_of_kernel.cu
Added a new CUDA GTest (set_shmem_of_kernel.cu) covering zero/normal/oversized requests, repeated-failure behavior, and CUDA error-state clearing; updated CMake to include the test source.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The PR title accurately and concisely describes the main fix: clearing sticky errors and avoiding cache poisoning in set_shmem_of_kernel.
Description check | ✅ Passed | The PR description provides clear context about the bug, the fix applied, and the tests added, all directly related to the changeset.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tests/routing/unit_tests/set_shmem_of_kernel.cu`:
- Around line 68-69: The test currently only checks that no CUDA error was
recorded after calling set_shmem_of_kernel(kernel_sticky_error, too_large); add
an assertion that the call actually failed by expecting a non-success error
(e.g., EXPECT_NE(cudaSuccess, cudaGetLastError()) or a specific error like
EXPECT_EQ(cudaErrorInvalidConfiguration, cudaGetLastError())) immediately after
set_shmem_of_kernel to ensure the sticky-error branch is exercised; locate the
call to set_shmem_of_kernel and replace or augment the following
EXPECT_EQ(cudaSuccess, cudaGetLastError()) accordingly.
- Around line 41-42: The cudaDeviceGetAttribute calls (used to set shmem_max and
derive too_large) are unchecked and may leave shmem_max uninitialized; update
each call to capture the cudaError_t return, verify it equals cudaSuccess, and
on failure fail the test or abort with a clear error message referencing the
call (e.g., the cudaDeviceGetAttribute for
cudaDevAttrMaxSharedMemoryPerBlockOptin) so downstream assertions (and variables
like shmem_max and too_large) are never used when the query failed.
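The second review comment suggests a common defensive pattern: capture the return of the attribute query and bail out before any derived value is used. A minimal sketch with a stubbed query in place of cudaDeviceGetAttribute (names like `shmem_max` and `too_large` mirror the review comment; the stub and its values are illustrative):

```cpp
#include <cassert>

enum err_t { ok, failed };

// Stand-in for cudaDeviceGetAttribute; like the real call, it fills
// *value only when it succeeds.
err_t query_attribute(int* value, bool simulate_failure) {
  if (simulate_failure) return failed;  // *value left untouched
  *value = 98304;                       // e.g. sharedMemPerBlockOptin on V100
  return ok;
}

// Check the query's return before deriving anything from its output
// parameter, so shmem_max / too_large are never used when the query failed.
bool get_limits(int* shmem_max, int* too_large, bool simulate_failure) {
  err_t e = query_attribute(shmem_max, simulate_failure);
  if (e != ok) return false;  // fail fast; in a GTest this would be ASSERT_EQ
  *too_large = *shmem_max + 1;
  return true;
}
```

In the actual test the failure branch would abort via an ASSERT on cudaSuccess rather than return false; the point is that `too_large` is only computed after the query is known to have succeeded.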

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 515e20d6-05ab-4e9c-b9c7-3216ae006fb9

📥 Commits

Reviewing files that changed from the base of the PR and between 24fdb60 and 8e90a94.

📒 Files selected for processing (3)
  • cpp/src/utilities/cuda_helpers.cuh
  • cpp/tests/routing/CMakeLists.txt
  • cpp/tests/routing/unit_tests/set_shmem_of_kernel.cu

Author

np96 commented Apr 12, 2026

Also realized there's a race in unordered_map operator[] access:

https://github.com/NVIDIA/cuopt/blob/main/cpp/src/utilities/cuda_helpers.cuh#L182

I see two options to fix it: either pre-initialize entries at startup for all accessed functions (more efficient, but requires collecting all operators per solver type and maintaining that initial setup), or use a correct double-checked locking pattern. I'll go with the second option since I'm not familiar enough with the codebase.
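The race comes from the fact that `unordered_map::operator[]` is not a pure read: on a missing key it default-inserts a value, mutating the map, which is a data race if done under only a shared (reader) lock. A minimal sketch of the difference (illustrative, not the cuOpt code):

```cpp
#include <cassert>
#include <unordered_map>

static std::unordered_map<int, int> m;

// operator[] default-inserts 0 on a miss -- it mutates the map, so it
// needs an exclusive lock even when the caller only wants to "read".
int racy_read(int key) { return m[key]; }

// find() never mutates, so it is safe under a shared lock.
bool safe_read(int key, int* out) {
  auto it = m.find(key);
  if (it == m.end()) return false;
  *out = it->second;
  return true;
}
```

This is why the double-checked pattern must use find() on the shared-lock fast path and reserve operator[] (or try_emplace) for the exclusive-lock slow path.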

I'd be happy to hear feedback and collaborate; I have a strong interest in contributing to this project.
