
Retry after sleep random milliseconds when failed to create collection#226

Merged
lzxddz merged 1 commit into eloqdata:main from lzxddz:random-sleep-retry
Oct 4, 2025

Conversation

@lzxddz
Collaborator

@lzxddz lzxddz commented Sep 30, 2025

Summary by CodeRabbit

  • New Features

    • Centralized operation-context helper to pause for a randomized duration (with logging of chosen delay).
    • Randomized backoff applied to additional retry paths to reduce contention.
  • Refactor

    • Replaced scattered local wait logic with centralized randomized-sleep calls for consistent retry behavior.
  • Chores

    • Updated submodule pointer; no functional API changes.

@lzxddz lzxddz self-assigned this Sep 30, 2025
@coderabbitai

coderabbitai bot commented Sep 30, 2025

Walkthrough

Centralizes randomized retry sleeps by adding OperationContext::sleepForRandomMilliseconds() (with a thread-local RNG and logging) and replacing local-RNG and deterministic sleeps with this API across the Eloq record store, write-ops retry paths, and KV catalog retry loops; also updates an Eloq submodule pointer.

Changes

Cohort / File(s) — Summary

  • OperationContext random-sleep API — src/mongo/db/operation_context.h, src/mongo/db/operation_context.cpp
    Adds OperationContext::sleepForRandomMilliseconds(), implemented with a thread-local RNG and logging; delegates to the existing sleepFor().
  • Eloq record-store retry — src/mongo/db/modules/eloq/src/eloq_record_store.cpp
    Removes the local RNG and replaces ad-hoc random-duration sleeps with opCtx->sleepForRandomMilliseconds() in the delete/update retry paths; removes per-retry duration logging.
  • Write ops conflict backoff — src/mongo/db/ops/write_ops_exec.cpp
    On WriteConflict during implicit collection creation, inserts opCtx->sleepForRandomMilliseconds() before retrying.
  • KV catalog entry retry — src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp
    Replaces the deterministic linear backoff (retryCount * Milliseconds{1}) with opCtx->sleepForRandomMilliseconds() in two retry loops.
  • Eloq tx_service submodule — src/mongo/db/modules/eloq/tx_service
    Updates the submodule pointer (no source changes in this repo).
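Based on the walkthrough and review comments above, the new helper can be sketched roughly as a standalone approximation. This is not the merged code: the real method lives on OperationContext, logs via MONGO_LOG(1), and delegates to sleepFor(); the std::cout and std::this_thread calls here are stand-ins.

```cpp
#include <chrono>
#include <iostream>
#include <random>
#include <thread>

// Pick a jitter duration in [1, 100] ms. One RNG per thread avoids
// locking and cross-thread contention on the engine state.
int pickRandomSleepMs() {
    thread_local std::default_random_engine randomEngine{std::random_device{}()};
    thread_local std::uniform_int_distribution<int> uniformDist{1, 100};
    return uniformDist(randomEngine);
}

// Standalone stand-in for OperationContext::sleepForRandomMilliseconds():
// log the chosen delay, then sleep for it.
void sleepForRandomMilliseconds() {
    const int ms = pickRandomSleepMs();
    std::cout << "Sleeping for " << ms << "ms before retry\n";  // stand-in for MONGO_LOG(1)
    std::this_thread::sleep_for(std::chrono::milliseconds{ms});
}
```

The thread-local distribution mirrors the reviewer's description of the implementation; the [1, 100] ms range comes from the review comments below.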

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller as Component (Eloq / KV / WriteOps)
  participant OpCtx as OperationContext
  participant Thread as Thread

  Caller->>OpCtx: sleepForRandomMilliseconds()
  Note right of OpCtx #f0f9ff: thread-local RNG selects ms\nlog chosen duration
  OpCtx->>Thread: sleepFor(duration)
  Thread-->>OpCtx: resume
  OpCtx-->>Caller: return
sequenceDiagram
  autonumber
  participant Client as Client
  participant Exec as write_ops_exec
  participant OpCtx as OperationContext
  participant Storage as Storage

  Client->>Exec: operation (e.g., insertBatch)
  Exec->>Storage: attempt implicit collection create
  Storage-->>Exec: WriteConflict
  Exec->>OpCtx: sleepForRandomMilliseconds()
  OpCtx-->>Exec: return
  Exec->>Storage: retry operation
  Storage-->>Exec: result
  Exec-->>Client: return

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • xiexiaoy
  • thweetkomputer

Poem

A jitter, a nap, a millisecond hop—
I twitch my whiskers; backoffs never stop.
OpCtx hums numbers, soft and deep,
I curl and count the randomized sleep.
Submodule set, now onward we hop. 🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 22.22%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed — The title “Retry after sleep random milliseconds when failed to create collection” accurately describes one aspect of the changeset (adding a random delay before retrying failed collection creation) but does not capture the broader intent of centralizing randomized backoff via the OperationContext::sleepForRandomMilliseconds() API across multiple modules. It is therefore only partially related to the overall changeset.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

🧪 Early access (Sonnet 4.5): enabled

We are currently testing the Sonnet 4.5 model, which is expected to improve code review quality. However, this model may lead to increased noise levels in the review comments. Please disable the early access features if the noise level causes any inconvenience.

Note:

  • Public repositories are always opted into early access features.
  • You can enable or disable early access features from the CodeRabbit UI or by updating the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@lzxddz lzxddz requested a review from xiexiaoy September 30, 2025 09:57

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/mongo/db/modules/eloq/src/eloq_record_store.cpp (2)

406-410: Incomplete refactor: Update to use centralized sleep API.

This retry loop in updateRecord still uses the old pattern with direct RNG calls (uniformDist(randomEngine)) and manual logging. According to the PR objectives and AI summary, this should be updated to use opCtx->sleepForRandomMilliseconds() for consistency with the rest of the codebase.

Apply this diff:

-        mongo::Milliseconds duration{uniformDist(randomEngine)};
-        MONGO_LOG(1) << "Fail to create table in Eloq. Sleep for " << duration.count() << "ms";
-        opCtx->sleepFor(duration);
+        MONGO_LOG(1) << "Fail to update table in Eloq";
+        opCtx->sleepForRandomMilliseconds();
         MONGO_LOG(1) << "Retry count: " << i;

372-414: Restore RNG declarations for randomEngine and uniformDist.

The recent refactoring removed the thread_local random‐number‐generator declarations but left calls to uniformDist(randomEngine) at line 406, resulting in undefined symbols. Reintroduce the RNG declarations (e.g. at the top of src/mongo/db/modules/eloq/src/eloq_record_store.cpp) or move them into a shared header so that randomEngine and uniformDist are defined before use.

🧹 Nitpick comments (4)
src/mongo/db/operation_context.h (1)

161-161: Add documentation for the new method.

The sleepForRandomMilliseconds() method lacks documentation. Consider adding a comment that describes:

  • The purpose (randomized backoff for retry logic)
  • The range of sleep duration (1-100ms based on the implementation)
  • When it should be used vs. sleepFor()
  • That it may throw if the operation is interrupted

Example:

+    /**
+     * Sleeps for a random duration between 1 and 100 milliseconds.
+     * Used for randomized backoff in retry logic to reduce contention.
+     * Throws an exception if the operation is interrupted during sleep.
+     */
     void sleepForRandomMilliseconds();
src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp (2)

221-221: Consider exponential backoff instead of fixed random delay.

The previous implementation used retryCount * Milliseconds{1}, providing an increasing backoff. The new implementation uses a fixed random range (1-100ms) regardless of retry count. For highly contended scenarios with multiple retries, consider:

  • Using exponential backoff with jitter
  • Or making the sleep duration aware of the retry count

Example approach:

// Instead of fixed random sleep, consider:
opCtx->sleepFor(Milliseconds{retryCount * uniformDist(randomEngine)});
// This provides both randomization and increasing backoff

However, if testing shows the fixed random range is sufficient, the current approach is acceptable.


259-259: Same consideration for exponential backoff.

Same feedback as line 221 - consider whether a retry-count-aware backoff would be beneficial here.
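The retry-count-aware alternative the reviewer hints at could look like the sketch below. This is illustrative only: backoffWithJitter and its parameters are hypothetical, not code from this PR, and real MongoDB code would route the sleep through opCtx->sleepFor() rather than returning a duration.

```cpp
#include <algorithm>
#include <chrono>
#include <random>

// Exponential backoff with full jitter: the ceiling doubles with each
// retry (capped at capMs), and the actual delay is drawn uniformly from
// [1, ceiling], combining randomization with increasing backoff.
std::chrono::milliseconds backoffWithJitter(int retryCount, int baseMs = 1, int capMs = 100) {
    thread_local std::default_random_engine eng{std::random_device{}()};
    // Clamp the shift so baseMs << retryCount cannot overflow.
    const int ceiling = std::min(capMs, baseMs << std::min(retryCount, 20));
    std::uniform_int_distribution<int> dist{1, std::max(1, ceiling)};
    return std::chrono::milliseconds{dist(eng)};
}
```

Drawing from [1, ceiling] rather than a fixed window means early retries stay short while sustained contention spreads retriers across a widening interval.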

src/mongo/db/operation_context.cpp (1)

298-304: Consider making the sleep range configurable.

The hardcoded range [1, 100] milliseconds may not be optimal for all retry scenarios. Consider:

  1. Making the range configurable via a server parameter
  2. Accepting min/max parameters in the method signature
  3. Using different ranges for different contention scenarios

However, for a first iteration, the fixed range is acceptable and can be tuned based on production experience.

Also, verify that MONGO_LOG is the correct logging macro for this codebase version, as it's been deprecated in favor of LOGV2 in newer MongoDB versions.

#!/bin/bash
# Check what logging macros are used in the codebase

rg -n "MONGO_LOG|LOGV2" --type=cpp src/mongo/db/operation_context.cpp -C 2
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 077a8d2 and a0d44f8.

📒 Files selected for processing (6)
  • src/mongo/db/modules/eloq/src/eloq_record_store.cpp (1 hunks)
  • src/mongo/db/modules/eloq/tx_service (1 hunks)
  • src/mongo/db/operation_context.cpp (2 hunks)
  • src/mongo/db/operation_context.h (1 hunks)
  • src/mongo/db/ops/write_ops_exec.cpp (2 hunks)
  • src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/mongo/db/operation_context.h (1)
src/mongo/db/operation_context.cpp (2)
  • sleepForRandomMilliseconds (298-303)
  • sleepForRandomMilliseconds (298-298)
🔇 Additional comments (5)
src/mongo/db/modules/eloq/tx_service (1)

1-1: Verify the tx_service submodule update
Ensure the pointer bump from af75de6e5d0d068fa9dda2a21b6cdf621ecd1bd0 to 4c571f85afe887d79f40d259efaa3601083a9ab2 in src/mongo/db/modules/eloq/tx_service includes the intended centralized randomized retry-sleep changes by inspecting the submodule’s commit history.

src/mongo/db/ops/write_ops_exec.cpp (1)

428-430: LGTM! Appropriate use of randomized backoff.

The addition of logging and randomized sleep on WriteConflict during implicit collection creation is a good practice to reduce contention during retries. The log message clearly explains the context.

src/mongo/db/operation_context.cpp (2)

48-49: LGTM! Standard library inclusion.


52-56: Thread-local RNG initialization is acceptable.
Random-device usage mirrors other codebase patterns (e.g., service_executor_adaptive, session_killer), and per-thread construction aligns with the non-contention goal; no changes required.

src/mongo/db/modules/eloq/src/eloq_record_store.cpp (1)

247-247: LGTM! Consistent with centralized sleep API.

The change to use opCtx->sleepForRandomMilliseconds() in the retry loop is appropriate.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/mongo/db/operation_context.cpp (2)

53-55: Consider optimizing thread-local RNG initialization.

The current pattern creates a thread_local std::random_device which is only used once for seeding. Since random_device can be heavyweight and may have limited entropy on some platforms, consider using a temporary for seeding instead.

Apply this diff to optimize the initialization:

-thread_local std::random_device r;
-thread_local std::default_random_engine randomEngine{r()};
+thread_local std::default_random_engine randomEngine{std::random_device{}()};
 thread_local std::uniform_int_distribution<int> uniformDist{1, 100};

This creates a temporary random_device just for seeding, avoiding the overhead of maintaining it as thread-local state.


298-303: LGTM: Clean implementation of randomized sleep.

The implementation is straightforward and correct:

  • Uses thread-local RNG for thread safety
  • Logs at appropriate debug level
  • Delegates to existing sleepFor infrastructure

The fixed range [1, 100]ms is reasonable for write conflict retry backoff. If future requirements demand it, consider making the range configurable via server parameters or implementing exponential backoff for sustained contention scenarios.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 59b5a6a and 625d599.

📒 Files selected for processing (6)
  • src/mongo/db/modules/eloq/src/eloq_record_store.cpp (2 hunks)
  • src/mongo/db/modules/eloq/tx_service (1 hunks)
  • src/mongo/db/operation_context.cpp (2 hunks)
  • src/mongo/db/operation_context.h (1 hunks)
  • src/mongo/db/ops/write_ops_exec.cpp (2 hunks)
  • src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/mongo/db/modules/eloq/src/eloq_record_store.cpp
  • src/mongo/db/modules/eloq/tx_service
  • src/mongo/db/operation_context.h
  • src/mongo/db/storage/kv/kv_collection_catalog_entry.cpp
🔇 Additional comments (2)
src/mongo/db/ops/write_ops_exec.cpp (2)

423-425: LGTM: Randomized sleep on WriteConflict reduces contention.

The randomized sleep before retrying collection creation is correctly placed. This backoff strategy helps reduce contention when multiple operations encounter write conflicts during implicit collection creation.


705-707: LGTM: Consistent retry strategy with insert path.

The randomized sleep on WriteConflict during upsert collection creation matches the pattern used in insertBatchAndHandleErrors. This consistency ensures uniform retry behavior across write operations.
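Putting the approved pieces together, the retry shape these comments describe can be sketched generically. The names here (retryWithJitter, the inline RNG) are illustrative stand-ins; per the review, the actual code calls opCtx->sleepForRandomMilliseconds() between attempts at implicit collection creation.

```cpp
#include <chrono>
#include <functional>
#include <random>
#include <thread>

// Retry an operation up to maxAttempts times, pausing for a random
// 1-100 ms between attempts so concurrent retriers don't stampede
// back into the same conflict window together.
bool retryWithJitter(const std::function<bool()>& op, int maxAttempts) {
    thread_local std::default_random_engine eng{std::random_device{}()};
    std::uniform_int_distribution<int> dist{1, 100};
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (op())
            return true;  // operation succeeded
        if (attempt < maxAttempts)  // no point sleeping after the final failure
            std::this_thread::sleep_for(std::chrono::milliseconds{dist(eng)});
    }
    return false;  // exhausted retries
}
```

Using the same helper in both the insert and upsert paths gives the uniform retry behavior the review calls out as a strength of this PR.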

@lzxddz lzxddz merged commit 0316526 into eloqdata:main Oct 4, 2025
3 checks passed