
Retry after sleep random milliseconds when failed to create collection#226

Merged
lzxddz merged 1 commit into eloqdata:main from lzxddz:random-sleep-retry
Oct 4, 2025

Conversation

@lzxddz
Collaborator

@lzxddz lzxddz commented Sep 30, 2025

Summary by CodeRabbit

  • New Features

    • Centralized operation-context helper to pause for a randomized duration (with logging of chosen delay).
    • Randomized backoff applied to additional retry paths to reduce contention.
  • Refactor

    • Replaced scattered local wait logic with centralized randomized-sleep calls for consistent retry behavior.
  • Chores

    • Updated submodule pointer; no functional API changes.

@lzxddz lzxddz self-assigned this Sep 30, 2025
@coderabbitai

coderabbitai bot commented Sep 30, 2025

Walkthrough

Centralizes randomized retry sleeps by adding OperationContext::sleepForRandomMilliseconds() (with a thread-local RNG and logging) and replacing local-RNG and deterministic sleeps with this API across the Eloq record store, write-ops retry paths, and KV catalog retry loops; also updates an Eloq submodule pointer.

Changes

Cohort / File(s) — Summary

  • OperationContext random-sleep API — src/mongo/db/operation_context.h, src/mongo/db/operation_context.cpp
    Adds OperationContext::sleepForRandomMilliseconds(), implemented with a thread-local RNG and logging; delegates to the existing sleepFor().
  • Eloq record-store retry — src/mongo/db/modules/eloq/src/eloq_record_store.cpp
    Removes the local RNG and replaces ad-hoc random-duration sleeps with opCtx->sleepForRandomMilliseconds() in the delete/update retry paths; removes per-retry duration logging.
  • Write ops conflict backoff — src/mongo/db/ops/write_ops_exec.cpp
    On WriteConflict during implicit collection creation, inserts opCtx->sleepForRandomMilliseconds() before retrying.
  • KV catalog entry retry — src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp
    Replaces the deterministic linear backoff (retryCount * Milliseconds{1}) with opCtx->sleepForRandomMilliseconds() in two retry loops.
  • Eloq tx_service submodule — src/mongo/db/modules/eloq/tx_service
    Updates the submodule pointer (no source changes in this repo).
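Based on the walkthrough and review comments above, the new helper can be sketched roughly as a standalone approximation. This is not the merged code: the real method lives on OperationContext, logs via MONGO_LOG(1), and delegates to sleepFor(); the std::cout and std::this_thread calls here are stand-ins.

```cpp
#include <chrono>
#include <iostream>
#include <random>
#include <thread>

// Pick a jitter duration in [1, 100] ms. One RNG per thread avoids
// locking and cross-thread contention on the engine state.
int pickRandomSleepMs() {
    thread_local std::default_random_engine randomEngine{std::random_device{}()};
    thread_local std::uniform_int_distribution<int> uniformDist{1, 100};
    return uniformDist(randomEngine);
}

// Standalone stand-in for OperationContext::sleepForRandomMilliseconds():
// log the chosen delay, then sleep for it.
void sleepForRandomMilliseconds() {
    const int ms = pickRandomSleepMs();
    std::cout << "Sleeping for " << ms << "ms before retry\n";  // stand-in for MONGO_LOG(1)
    std::this_thread::sleep_for(std::chrono::milliseconds{ms});
}
```

The thread-local distribution mirrors the reviewer's description of the implementation; the [1, 100] ms range comes from the review comments below.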

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller as Component (Eloq / KV / WriteOps)
  participant OpCtx as OperationContext
  participant Thread as Thread

  Caller->>OpCtx: sleepForRandomMilliseconds()
  Note right of OpCtx #f0f9ff: thread-local RNG selects ms\nlog chosen duration
  OpCtx->>Thread: sleepFor(duration)
  Thread-->>OpCtx: resume
  OpCtx-->>Caller: return
sequenceDiagram
  autonumber
  participant Client as Client
  participant Exec as write_ops_exec
  participant OpCtx as OperationContext
  participant Storage as Storage

  Client->>Exec: operation (e.g., insertBatch)
  Exec->>Storage: attempt implicit collection create
  Storage-->>Exec: WriteConflict
  Exec->>OpCtx: sleepForRandomMilliseconds()
  OpCtx-->>Exec: return
  Exec->>Storage: retry operation
  Storage-->>Exec: result
  Exec-->>Client: return

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • xiexiaoy
  • thweetkomputer

Poem

A jitter, a nap, a millisecond hop—
I twitch my whiskers; backoffs never stop.
OpCtx hums numbers, soft and deep,
I curl and count the randomized sleep.
Submodule set, now onward we hop. 🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 22.22%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title “Retry after sleep random milliseconds when failed to create collection” accurately describes one aspect of the changeset (adding a random delay before retrying failed collection creation) but does not capture the broader intent of centralizing randomized backoff via the OperationContext::sleepForRandomMilliseconds() API across multiple modules. It is therefore only partially related to the overall changeset.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

🧪 Early access (Sonnet 4.5): enabled

We are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience.

Note:

  • Public repositories are always opted into early access features.
  • You can enable or disable early access features from the CodeRabbit UI or by updating the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@lzxddz lzxddz requested a review from xiexiaoy September 30, 2025 09:57

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/mongo/db/modules/eloq/src/eloq_record_store.cpp (2)

406-410: Incomplete refactor: Update to use centralized sleep API.

This retry loop in updateRecord still uses the old pattern with direct RNG calls (uniformDist(randomEngine)) and manual logging. According to the PR objectives and AI summary, this should be updated to use opCtx->sleepForRandomMilliseconds() for consistency with the rest of the codebase.

Apply this diff:

-        mongo::Milliseconds duration{uniformDist(randomEngine)};
-        MONGO_LOG(1) << "Fail to create table in Eloq. Sleep for " << duration.count() << "ms";
-        opCtx->sleepFor(duration);
+        MONGO_LOG(1) << "Fail to update table in Eloq";
+        opCtx->sleepForRandomMilliseconds();
         MONGO_LOG(1) << "Retry count: " << i;

372-414: Restore RNG declarations for randomEngine and uniformDist.

The recent refactoring removed the thread_local random‐number‐generator declarations but left calls to uniformDist(randomEngine) at line 406, resulting in undefined symbols. Reintroduce the RNG declarations (e.g. at the top of src/mongo/db/modules/eloq/src/eloq_record_store.cpp) or move them into a shared header so that randomEngine and uniformDist are defined before use.

🧹 Nitpick comments (4)
src/mongo/db/operation_context.h (1)

161-161: Add documentation for the new method.

The sleepForRandomMilliseconds() method lacks documentation. Consider adding a comment that describes:

  • The purpose (randomized backoff for retry logic)
  • The range of sleep duration (1-100ms based on the implementation)
  • When it should be used vs. sleepFor()
  • That it may throw if the operation is interrupted

Example:

+    /**
+     * Sleeps for a random duration between 1 and 100 milliseconds.
+     * Used for randomized backoff in retry logic to reduce contention.
+     * Throws an exception if the operation is interrupted during sleep.
+     */
     void sleepForRandomMilliseconds();
src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp (2)

221-221: Consider exponential backoff instead of fixed random delay.

The previous implementation used retryCount * Milliseconds{1}, providing an increasing backoff. The new implementation uses a fixed random range (1-100ms) regardless of retry count. For highly contended scenarios with multiple retries, consider:

  • Using exponential backoff with jitter
  • Or making the sleep duration aware of the retry count

Example approach:

// Instead of fixed random sleep, consider:
opCtx->sleepFor(Milliseconds{retryCount * uniformDist(randomEngine)});
// This provides both randomization and increasing backoff

However, if testing shows the fixed random range is sufficient, the current approach is acceptable.


259-259: Same consideration for exponential backoff.

Same feedback as line 221 - consider whether a retry-count-aware backoff would be beneficial here.
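The retry-count-aware alternative the reviewer hints at could look like the sketch below. This is illustrative only: backoffWithJitter and its parameters are hypothetical, not code from this PR, and real MongoDB code would route the sleep through opCtx->sleepFor() rather than returning a duration.

```cpp
#include <algorithm>
#include <chrono>
#include <random>

// Exponential backoff with full jitter: the ceiling doubles with each
// retry (capped at capMs), and the actual delay is drawn uniformly from
// [1, ceiling], combining randomization with increasing backoff.
std::chrono::milliseconds backoffWithJitter(int retryCount, int baseMs = 1, int capMs = 100) {
    thread_local std::default_random_engine eng{std::random_device{}()};
    // Clamp the shift so baseMs << retryCount cannot overflow.
    const int ceiling = std::min(capMs, baseMs << std::min(retryCount, 20));
    std::uniform_int_distribution<int> dist{1, std::max(1, ceiling)};
    return std::chrono::milliseconds{dist(eng)};
}
```

Drawing from [1, ceiling] rather than a fixed window means early retries stay short while sustained contention spreads retriers across a widening interval.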

src/mongo/db/operation_context.cpp (1)

298-304: Consider making the sleep range configurable.

The hardcoded range [1, 100] milliseconds may not be optimal for all retry scenarios. Consider:

  1. Making the range configurable via a server parameter
  2. Accepting min/max parameters in the method signature
  3. Using different ranges for different contention scenarios

However, for a first iteration, the fixed range is acceptable and can be tuned based on production experience.

Also, verify that MONGO_LOG is the correct logging macro for this codebase version, as it's been deprecated in favor of LOGV2 in newer MongoDB versions.

#!/bin/bash
# Check what logging macros are used in the codebase

rg -n "MONGO_LOG|LOGV2" --type=cpp src/mongo/db/operation_context.cpp -C 2
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 077a8d2 and a0d44f8.

📒 Files selected for processing (6)
  • src/mongo/db/modules/eloq/src/eloq_record_store.cpp (1 hunks)
  • src/mongo/db/modules/eloq/tx_service (1 hunks)
  • src/mongo/db/operation_context.cpp (2 hunks)
  • src/mongo/db/operation_context.h (1 hunks)
  • src/mongo/db/ops/write_ops_exec.cpp (2 hunks)
  • src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/mongo/db/operation_context.h (1)
src/mongo/db/operation_context.cpp (2)
  • sleepForRandomMilliseconds (298-303)
  • sleepForRandomMilliseconds (298-298)
🔇 Additional comments (5)
src/mongo/db/modules/eloq/tx_service (1)

1-1: Verify the tx_service submodule update
Ensure the pointer bump from af75de6e5d0d068fa9dda2a21b6cdf621ecd1bd0 to 4c571f85afe887d79f40d259efaa3601083a9ab2 in src/mongo/db/modules/eloq/tx_service includes the intended centralized randomized retry-sleep changes by inspecting the submodule’s commit history.

src/mongo/db/ops/write_ops_exec.cpp (1)

428-430: LGTM! Appropriate use of randomized backoff.

The addition of logging and randomized sleep on WriteConflict during implicit collection creation is a good practice to reduce contention during retries. The log message clearly explains the context.

src/mongo/db/operation_context.cpp (2)

48-49: LGTM! Standard library inclusion.


52-56: Thread-local RNG initialization is acceptable.
Random-device usage mirrors other codebase patterns (e.g., service_executor_adaptive, session_killer), and per-thread construction aligns with the non-contention goal; no changes required.

src/mongo/db/modules/eloq/src/eloq_record_store.cpp (1)

247-247: LGTM! Consistent with centralized sleep API.

The change to use opCtx->sleepForRandomMilliseconds() in the retry loop is appropriate.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/mongo/db/operation_context.cpp (2)

53-55: Consider optimizing thread-local RNG initialization.

The current pattern creates a thread_local std::random_device which is only used once for seeding. Since random_device can be heavyweight and may have limited entropy on some platforms, consider using a temporary for seeding instead.

Apply this diff to optimize the initialization:

-thread_local std::random_device r;
-thread_local std::default_random_engine randomEngine{r()};
+thread_local std::default_random_engine randomEngine{std::random_device{}()};
 thread_local std::uniform_int_distribution<int> uniformDist{1, 100};

This creates a temporary random_device just for seeding, avoiding the overhead of maintaining it as thread-local state.


298-303: LGTM: Clean implementation of randomized sleep.

The implementation is straightforward and correct:

  • Uses thread-local RNG for thread safety
  • Logs at appropriate debug level
  • Delegates to existing sleepFor infrastructure

The fixed range [1, 100]ms is reasonable for write conflict retry backoff. If future requirements demand it, consider making the range configurable via server parameters or implementing exponential backoff for sustained contention scenarios.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 59b5a6a and 625d599.

📒 Files selected for processing (6)
  • src/mongo/db/modules/eloq/src/eloq_record_store.cpp (2 hunks)
  • src/mongo/db/modules/eloq/tx_service (1 hunks)
  • src/mongo/db/operation_context.cpp (2 hunks)
  • src/mongo/db/operation_context.h (1 hunks)
  • src/mongo/db/ops/write_ops_exec.cpp (2 hunks)
  • src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/mongo/db/modules/eloq/src/eloq_record_store.cpp
  • src/mongo/db/modules/eloq/tx_service
  • src/mongo/db/operation_context.h
  • src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp
🔇 Additional comments (2)
src/mongo/db/ops/write_ops_exec.cpp (2)

423-425: LGTM: Randomized sleep on WriteConflict reduces contention.

The randomized sleep before retrying collection creation is correctly placed. This backoff strategy helps reduce contention when multiple operations encounter write conflicts during implicit collection creation.


705-707: LGTM: Consistent retry strategy with insert path.

The randomized sleep on WriteConflict during upsert collection creation matches the pattern used in insertBatchAndHandleErrors. This consistency ensures uniform retry behavior across write operations.
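Putting the approved pieces together, the retry shape these comments describe can be sketched generically. The names here (retryWithJitter, the inline RNG) are illustrative stand-ins; per the review, the actual code calls opCtx->sleepForRandomMilliseconds() between attempts at implicit collection creation.

```cpp
#include <chrono>
#include <functional>
#include <random>
#include <thread>

// Retry an operation up to maxAttempts times, pausing for a random
// 1-100 ms between attempts so concurrent retriers don't stampede
// back into the same conflict window together.
bool retryWithJitter(const std::function<bool()>& op, int maxAttempts) {
    thread_local std::default_random_engine eng{std::random_device{}()};
    std::uniform_int_distribution<int> dist{1, 100};
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (op())
            return true;  // operation succeeded
        if (attempt < maxAttempts)  // no point sleeping after the final failure
            std::this_thread::sleep_for(std::chrono::milliseconds{dist(eng)});
    }
    return false;  // exhausted retries
}
```

Using the same helper in both the insert and upsert paths gives the uniform retry behavior the review calls out as a strength of this PR.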

@lzxddz lzxddz merged commit 0316526 into eloqdata:main Oct 4, 2025
3 checks passed