Retry after sleep random milliseconds when failed to create collection#226
Retry after sleep random milliseconds when failed to create collection#226lzxddz merged 1 commit intoeloqdata:mainfrom
Conversation
WalkthroughCentralizes randomized retry sleeps by adding OperationContext::sleepForRandomMilliseconds() (with thread-local RNG and logging) and replacing local RNG or deterministic sleeps with this API across Eloq, write-ops retry paths, and KV catalog retry loops; also updates an Eloq submodule pointer. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Caller as Component (Eloq / KV / WriteOps)
participant OpCtx as OperationContext
participant Thread as Thread
Caller->>OpCtx: sleepForRandomMilliseconds()
Note right of OpCtx #f0f9ff: thread-local RNG selects ms\nlog chosen duration
OpCtx->>Thread: sleepFor(duration)
Thread-->>OpCtx: resume
OpCtx-->>Caller: return
sequenceDiagram
autonumber
participant Client as Client
participant Exec as write_ops_exec
participant OpCtx as OperationContext
participant Storage as Storage
Client->>Exec: operation (e.g., insertBatch)
Exec->>Storage: attempt implicit collection create
Storage-->>Exec: WriteConflict
Exec->>OpCtx: sleepForRandomMilliseconds()
OpCtx-->>Exec: return
Exec->>Storage: retry operation
Storage-->>Exec: result
Exec-->>Client: return
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
🧪 Early access (Sonnet 4.5): enabledWe are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience. Note:
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/mongo/db/modules/eloq/src/eloq_record_store.cpp (2)
406-410: Incomplete refactor: Update to use centralized sleep API.This retry loop in
updateRecordstill uses the old pattern with direct RNG calls (uniformDist(randomEngine)) and manual logging. According to the PR objectives and AI summary, this should be updated to useopCtx->sleepForRandomMilliseconds()for consistency with the rest of the codebase.Apply this diff:
- mongo::Milliseconds duration{uniformDist(randomEngine)}; - MONGO_LOG(1) << "Fail to create table in Eloq. Sleep for " << duration.count() << "ms"; - opCtx->sleepFor(duration); + MONGO_LOG(1) << "Fail to update table in Eloq"; + opCtx->sleepForRandomMilliseconds(); MONGO_LOG(1) << "Retry count: " << i;
372-414: Restore RNG declarations forrandomEngineanduniformDist.The recent refactoring removed the
thread_localrandom‐number‐generator declarations but left calls touniformDist(randomEngine)at line 406, resulting in undefined symbols. Reintroduce the RNG declarations (e.g. at the top of src/mongo/db/modules/eloq/src/eloq_record_store.cpp) or move them into a shared header so thatrandomEngineanduniformDistare defined before use.
🧹 Nitpick comments (4)
src/mongo/db/operation_context.h (1)
161-161: Add documentation for the new method.The
sleepForRandomMilliseconds()method lacks documentation. Consider adding a comment that describes:
- The purpose (randomized backoff for retry logic)
- The range of sleep duration (1-100ms based on the implementation)
- When it should be used vs.
sleepFor()- That it may throw if the operation is interrupted
Example:
+ /** + * Sleeps for a random duration between 1 and 100 milliseconds. + * Used for randomized backoff in retry logic to reduce contention. + * Throws an exception if the operation is interrupted during sleep. + */ void sleepForRandomMilliseconds();src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp (2)
221-221: Consider exponential backoff instead of fixed random delay.The previous implementation used
retryCount * Milliseconds{1}, providing an increasing backoff. The new implementation uses a fixed random range (1-100ms) regardless of retry count. For highly contended scenarios with multiple retries, consider:
- Using exponential backoff with jitter
- Or making the sleep duration aware of the retry count
Example approach:
// Instead of fixed random sleep, consider: opCtx->sleepFor(Milliseconds{retryCount * uniformDist(randomEngine)}); // This provides both randomization and increasing backoffHowever, if testing shows the fixed random range is sufficient, the current approach is acceptable.
259-259: Same consideration for exponential backoff.Same feedback as line 221 - consider whether a retry-count-aware backoff would be beneficial here.
src/mongo/db/operation_context.cpp (1)
298-304: Consider making the sleep range configurable.The hardcoded range [1, 100] milliseconds may not be optimal for all retry scenarios. Consider:
- Making the range configurable via a server parameter
- Accepting min/max parameters in the method signature
- Using different ranges for different contention scenarios
However, for a first iteration, the fixed range is acceptable and can be tuned based on production experience.
Also, verify that
MONGO_LOGis the correct logging macro for this codebase version, as it's been deprecated in favor ofLOGV2in newer MongoDB versions.#!/bin/bash # Check what logging macros are used in the codebase rg -n "MONGO_LOG\|LOGV2" --type=cpp src/mongo/db/operation_context.cpp -C 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
src/mongo/db/modules/eloq/src/eloq_record_store.cpp(1 hunks)src/mongo/db/modules/eloq/tx_service(1 hunks)src/mongo/db/operation_context.cpp(2 hunks)src/mongo/db/operation_context.h(1 hunks)src/mongo/db/ops/write_ops_exec.cpp(2 hunks)src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/mongo/db/operation_context.h (1)
src/mongo/db/operation_context.cpp (2)
sleepForRandomMilliseconds(298-303)sleepForRandomMilliseconds(298-298)
🔇 Additional comments (5)
src/mongo/db/modules/eloq/tx_service (1)
1-1: Verify the tx_service submodule update
Ensure the pointer bump from af75de6e5d0d068fa9dda2a21b6cdf621ecd1bd0 to 4c571f85afe887d79f40d259efaa3601083a9ab2 in src/mongo/db/modules/eloq/tx_service includes the intended centralized randomized retry-sleep changes by inspecting the submodule’s commit history.src/mongo/db/ops/write_ops_exec.cpp (1)
428-430: LGTM! Appropriate use of randomized backoff.The addition of logging and randomized sleep on
WriteConflictduring implicit collection creation is a good practice to reduce contention during retries. The log message clearly explains the context.src/mongo/db/operation_context.cpp (2)
48-49: LGTM! Standard library inclusion.
52-56: Thread-local RNG initialization is acceptable.
Random-device usage mirrors other codebase patterns (e.g., service_executor_adaptive, session_killer), and per-thread construction aligns with the non-contention goal; no changes required.src/mongo/db/modules/eloq/src/eloq_record_store.cpp (1)
247-247: LGTM! Consistent with centralized sleep API.The change to use
opCtx->sleepForRandomMilliseconds()in the retry loop is appropriate.
59b5a6a to
625d599
Compare
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/mongo/db/operation_context.cpp (2)
53-55: Consider optimizing thread-local RNG initialization.The current pattern creates a
thread_local std::random_devicewhich is only used once for seeding. Sincerandom_devicecan be heavyweight and may have limited entropy on some platforms, consider using a temporary for seeding instead.Apply this diff to optimize the initialization:
-thread_local std::random_device r; -thread_local std::default_random_engine randomEngine{r()}; +thread_local std::default_random_engine randomEngine{std::random_device{}()}; thread_local std::uniform_int_distribution<int> uniformDist{1, 100};This creates a temporary
random_devicejust for seeding, avoiding the overhead of maintaining it as thread-local state.
298-303: LGTM: Clean implementation of randomized sleep.The implementation is straightforward and correct:
- Uses thread-local RNG for thread safety
- Logs at appropriate debug level
- Delegates to existing
sleepForinfrastructureThe fixed range [1, 100]ms is reasonable for write conflict retry backoff. If future requirements demand it, consider making the range configurable via server parameters or implementing exponential backoff for sustained contention scenarios.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
src/mongo/db/modules/eloq/src/eloq_record_store.cpp(2 hunks)src/mongo/db/modules/eloq/tx_service(1 hunks)src/mongo/db/operation_context.cpp(2 hunks)src/mongo/db/operation_context.h(1 hunks)src/mongo/db/ops/write_ops_exec.cpp(2 hunks)src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- src/mongo/db/modules/eloq/src/eloq_record_store.cpp
- src/mongo/db/modules/eloq/tx_service
- src/mongo/db/operation_context.h
- src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp
🔇 Additional comments (2)
src/mongo/db/ops/write_ops_exec.cpp (2)
423-425: LGTM: Randomized sleep on WriteConflict reduces contention.The randomized sleep before retrying collection creation is correctly placed. This backoff strategy helps reduce contention when multiple operations encounter write conflicts during implicit collection creation.
705-707: LGTM: Consistent retry strategy with insert path.The randomized sleep on WriteConflict during upsert collection creation matches the pattern used in
insertBatchAndHandleErrors. This consistency ensures uniform retry behavior across write operations.
Summary by CodeRabbit
New Features
Refactor
Chores