fix(cuda_helpers): clear sticky error and avoid cache poisoning in set_shmem_of_kernel#1095
np96 wants to merge 2 commits into NVIDIA:main
Conversation
…in set_shmem_of_kernel

When cudaFuncSetAttribute fails (e.g. the requested size exceeds the device limit), the previous implementation stored the failed size in the shmem_sizes cache and left a sticky CUDA error in the last-error slot. Subsequent calls for the same kernel would see the cached (invalid) size and skip the attribute call, silently proceeding without the required shared memory. The sticky error would later be caught by an unrelated RAFT_CHECK_CUDA, producing a confusing cudaErrorInvalidValue crash.

Fix:
- Only update the cache on success.
- On failure, consume the error with cudaGetLastError() so it cannot surface later, then return false.
- Add five unit tests in ROUTING_UNIT_TEST covering: zero request, normal request, too-large returns false, cache not poisoned on failure, and no sticky error after failure.

Reproducer: routing.Solve crashes with cudaErrorInvalidValue at N_VEHICLES >= 157 on V100 (sharedMemPerBlockOptin = 98304 B).
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tests/routing/unit_tests/set_shmem_of_kernel.cu`:
- Around line 68-69: The test currently only checks that no CUDA error was
recorded after calling set_shmem_of_kernel(kernel_sticky_error, too_large); add
an assertion that the call actually failed by expecting a non-success error
(e.g., EXPECT_NE(cudaSuccess, cudaGetLastError()) or a specific error like
EXPECT_EQ(cudaErrorInvalidConfiguration, cudaGetLastError())) immediately after
set_shmem_of_kernel to ensure the sticky-error branch is exercised; locate the
call to set_shmem_of_kernel and replace or augment the following
EXPECT_EQ(cudaSuccess, cudaGetLastError()) accordingly.
- Around line 41-42: The cudaDeviceGetAttribute calls (used to set shmem_max and
derive too_large) are unchecked and may leave shmem_max uninitialized; update
each call to capture the cudaError_t return, verify it equals cudaSuccess, and
on failure fail the test or abort with a clear error message referencing the
call (e.g., the cudaDeviceGetAttribute for
cudaDevAttrMaxSharedMemoryPerBlockOptin) so downstream assertions (and variables
like shmem_max and too_large) are never used when the query failed.
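The checked-query pattern the second comment asks for can be sketched in plain C++. The status enum, `get_device_attribute`, and `query_shmem_max_or_throw` below are illustrative stand-ins for `cudaDeviceGetAttribute` and the test's variables, not actual cuOpt or CUDA code:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a CUDA-style status-returning attribute query.
enum status_t { success = 0, error_invalid_device = 101 };

status_t get_device_attribute(int* value, bool device_ok) {
    if (!device_ok) return error_invalid_device;  // query failed; *value untouched
    *value = 98304;                               // e.g. sharedMemPerBlockOptin
    return success;
}

// Capture the return status and fail loudly BEFORE the value is ever used,
// so downstream assertions never read an uninitialized attribute.
int query_shmem_max_or_throw(bool device_ok) {
    int shmem_max = 0;
    status_t st = get_device_attribute(&shmem_max, device_ok);
    if (st != success) {
        throw std::runtime_error("get_device_attribute failed: status " +
                                 std::to_string(static_cast<int>(st)));
    }
    return shmem_max;
}
```

In the real test this would use gtest's `ASSERT_EQ(cudaSuccess, ...)` rather than an exception, but the shape is the same: check the status before deriving `too_large` from `shmem_max`.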
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 515e20d6-05ab-4e9c-b9c7-3216ae006fb9
📒 Files selected for processing (3)
cpp/src/utilities/cuda_helpers.cuh
cpp/tests/routing/CMakeLists.txt
cpp/tests/routing/unit_tests/set_shmem_of_kernel.cu
Also realized there's a race in the unordered_map operator[] access: https://github.com/NVIDIA/cuopt/blob/main/cpp/src/utilities/cuda_helpers.cuh#L182 I see two options to fix it: either pre-initialize the entries up front for all accessed functions (more efficient, but it requires collecting all operators per solver type and maintaining that initial setting), or use a correct double-checked locking pattern. I'll use the second option since I'm not familiar enough with the codebase. Would be happy to hear feedback and collaborate; I have a strong interest in contributing to this project.
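For the second option, a correct double-checked pattern for a read-mostly unordered_map cache might look like the sketch below (plain C++17; the class and member names are illustrative, not the actual cuda_helpers.cuh code): a shared lock on the lookup fast path, then an exclusive lock with a re-check (`try_emplace`) on insert, so two racing threads never mutate the map unguarded.

```cpp
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <unordered_map>

// Sketch of a thread-safe, read-mostly shmem-size cache (illustrative names).
class ShmemCache {
  public:
    // Fast path: shared lock for the common cache-hit case.
    std::optional<int> find(const void* kernel) const {
        std::shared_lock lock(mtx_);
        auto it = sizes_.find(kernel);
        if (it == sizes_.end()) return std::nullopt;
        return it->second;
    }

    // Slow path: exclusive lock. try_emplace re-checks under the lock, so if
    // another thread inserted between our shared-lock miss and this point,
    // its value wins and we never overwrite it.
    int get_or_insert(const void* kernel, int size) {
        std::unique_lock lock(mtx_);
        return sizes_.try_emplace(kernel, size).first->second;
    }

  private:
    mutable std::shared_mutex mtx_;
    std::unordered_map<const void*, int> sizes_;
};
```

The caller would do `find()` first and fall back to `get_or_insert()` only on a miss; note that plain double-checked locking on the map without the shared lock is UB, because unordered_map reads racing with writes are data races.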
Description
When cudaFuncSetAttribute fails (e.g. requested size exceeds device limit),
the previous implementation stored the failed size in the shmem_sizes cache
and left a sticky CUDA error in the last-error slot. Subsequent calls for
the same kernel would see the cached (invalid) size and skip the attribute
call, silently proceeding without the required shared memory. The sticky
error would later be caught by an unrelated RAFT_CHECK_CUDA, producing a
confusing cudaErrorInvalidValue crash.
Fix:
- Only update the cache on success.
- On failure, consume the error with cudaGetLastError() so it cannot surface later, then return false.
- Add five unit tests in ROUTING_UNIT_TEST covering: zero request, normal request, too-large returns false, cache not poisoned on failure, and no sticky error after failure.
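The fixed control flow can be sketched in plain C++ with the CUDA runtime stubbed out, so the cache behaviour is shown in isolation (`fake_func_set_attribute` and the error plumbing below are stand-ins, not the real cudaFuncSetAttribute or the actual cuOpt helper):

```cpp
#include <cstddef>
#include <unordered_map>

// Minimal stand-ins for the CUDA last-error slot; the real code calls
// cudaFuncSetAttribute and the real cudaGetLastError.
enum cudaError_t { cudaSuccess = 0, cudaErrorInvalidValue = 1 };
static cudaError_t g_last_error = cudaSuccess;

cudaError_t cudaGetLastError() {  // returns AND clears the sticky error
    cudaError_t e = g_last_error;
    g_last_error = cudaSuccess;
    return e;
}

cudaError_t fake_func_set_attribute(std::size_t bytes, std::size_t limit) {
    if (bytes > limit) {
        g_last_error = cudaErrorInvalidValue;  // attribute call failed
        return cudaErrorInvalidValue;
    }
    return cudaSuccess;
}

static std::unordered_map<const void*, std::size_t> shmem_sizes;  // per-kernel cache

// Sketch of the fix: cache only on success; consume the error on failure.
bool set_shmem_of_kernel(const void* kernel, std::size_t bytes, std::size_t limit) {
    auto it = shmem_sizes.find(kernel);
    if (it != shmem_sizes.end() && it->second == bytes) return true;  // cache hit
    if (fake_func_set_attribute(bytes, limit) != cudaSuccess) {
        (void)cudaGetLastError();  // consume: nothing sticky for a later RAFT_CHECK_CUDA
        return false;              // and do NOT record the failed size
    }
    shmem_sizes[kernel] = bytes;   // only sizes that actually took effect are cached
    return true;
}
```

With this shape, a too-large request returns false, leaves the last-error slot clean, and leaves the previously cached (valid) size in place, which is exactly what the new unit tests assert.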
Issue
#1094
Checklist