
Support multi dss shards #298

Merged
lzxddz merged 1 commit into eloqdata:main from lzxddz:dss_multi_shard-rebase
Nov 21, 2025

Conversation

@lzxddz
Collaborator

@lzxddz lzxddz commented Nov 9, 2025

Summary by CodeRabbit

  • Refactor

    • Simplified and unified service initialization and startup flows for bootstrap and single‑node modes, with clearer peer/leader configuration during startup.
  • Bug Fixes

    • Added guards and improved error handling to prevent operations (like cache reload) before startup completes or when node membership is missing.
  • New Features

    • Public API to query whether startup has completed.
  • Chores

    • Updated tracked internal module references.


@lzxddz lzxddz self-assigned this Nov 9, 2025
@lzxddz lzxddz changed the title from Support multi dss shard to Support multi dss shards Nov 9, 2025
@coderabbitai

coderabbitai bot commented Nov 9, 2025

Walkthrough

Updates Eloq KV engine bootstrap and datastore initialization call signatures and flow, adds ServiceContext::isStartupComplete() with an early-return in ReloadCache, and updates eloq submodule pointers for store_handler and tx_service.

Changes

  • Eloq KV Engine bootstrap & datastore init (src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp, src/mongo/db/modules/eloq/src/eloq_kv_engine.h): Bootstrap path now fetches/validates tmpNgConfigs before init; initDataStoreService signature reduced to (nodeId, ng_configs); TxConfigsToDssClusterConfig signature changed to (node_id, ngConfigs, ng_leaders, ds_config); StartService calls simplified to a bootstrap/single-node boolean; DataStoreServiceClient now takes that boolean as first arg; bootstrap guards and error messages updated.
  • ServiceContext & ReloadCache (src/mongo/db/service_context.h, src/mongo/db/service_context.cpp, src/mongo/db/modules/mongo_system_handler.cpp): Added public bool ServiceContext::isStartupComplete(), which reads the internal flag under a mutex. MongoSystemHandler::ReloadCache now returns early when serviceContext is null or startup is incomplete.
  • Submodule bumps, no code changes (src/mongo/db/modules/eloq/store_handler, src/mongo/db/modules/eloq/tx_service): Tracked submodule pointers updated to new commits; no direct code edits in this diff.

Sequence Diagram(s)

sequenceDiagram
    participant Engine as EloqKVEngine
    participant Config as ConfigSource
    participant DSS as DataStoreService
    participant MSC as MongoSystemHandler

    alt bootstrap
        Engine->>Config: fetch/validate cluster configs -> tmpNgConfigs
        Config-->>Engine: tmpNgConfigs
        Engine->>DSS: initDataStoreService(nodeId, tmpNgConfigs)
    else non-bootstrap
        Engine->>DSS: initDataStoreService(nodeId, ngConfigs)
    end

    Note over DSS: build ng_leaders (bootstrap/single-node -> map to self)
    DSS->>DSS: TxConfigsToDssClusterConfig(node_id, ngConfigs, ng_leaders, ds_config)
    DSS->>DSS: StartService(bootstrap_or_singleNode_flag)

    Note over MSC: ReloadCache guard
    MSC->>MSC: if serviceContext==null || !isStartupComplete() -> return

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus review on:
    • eloq_kv_engine.cpp bootstrap branch, tmpNgConfigs parsing and fallback behavior
    • Correct propagation and semantics of the bootstrap/single-node boolean into DataStoreServiceClient and StartService
    • TxConfigsToDssClusterConfig argument changes and ng_leaders construction
    • ServiceContext::isStartupComplete() synchronization and ReloadCache early-return

Suggested reviewers

  • xiexiaoy
  • githubzilla
  • thweetkomputer

Poem

🐰 In configs I dig, with leaders in sight,
A single flag guides the bootstrap night.
I hop through nodes, tidy and quick,
Guards hold the gate, steady and slick.
Hooray — cleaner starts and fewer tricks! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The PR title 'Support multi dss shards' directly relates to the main objective of enabling multiple DataStoreService shards, as evidenced by substantial refactoring of DSS initialization, configuration parsing, and cluster setup across the codebase.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 847f99e and 3b4f9c6.

📒 Files selected for processing (5)
  • src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (4 hunks)
  • src/mongo/db/modules/eloq/store_handler (1 hunks)
  • src/mongo/db/modules/eloq/tx_service (1 hunks)
  • src/mongo/db/service_context.cpp (1 hunks)
  • src/mongo/db/service_context.h (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/mongo/db/service_context.h (1)
src/mongo/db/service_context.cpp (2)
  • isStartupComplete (394-397)
  • isStartupComplete (394-394)
src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (1)
src/mongo/db/server_options.h (1)
  • bootstrap (86-269)
🔇 Additional comments (1)
src/mongo/db/modules/eloq/tx_service (1)

1-1: All integration points have been correctly updated for the submodule API changes.

Verification confirms that the parent repository's integration points are using the correct new API signatures:

  • Line 727 (eloq_kv_engine.cpp): DataStoreServiceClient constructor correctly called with new boolean parameter (opt_bootstrap || isSingleNode)
  • Line 645 (eloq_kv_engine.cpp): TxConfigsToDssClusterConfig correctly called with new signature (dss_node_id, ngConfigs, ng_leaders, ds_config)

No stale API calls were found. The submodule update is properly integrated.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (1)

409-419: Critical: Searches the wrong config map for the node lookup.

Line 409 iterates over ngConfigs (the single-entry bootstrap config) instead of tmpNgConfigs (the full multi-shard config fetched at line 392). This causes the node lookup to fail on any host that isn't the first in the cluster, preventing bootstrap from completing on non-primary nodes.

Apply this diff to fix the lookup:

     bool found = false;
     uint32_t dssNodeId = UINT32_MAX;
     // check whether this node is in cluster.
-    for (auto& pair : ngConfigs) {
+    for (auto& pair : tmpNgConfigs) {
         auto& ngNodes = pair.second;
         for (auto& ngNode : ngNodes) {
             if (ngNode.host_name_ == eloqGlobalOptions.localAddr.host() &&

Note: A similar issue was flagged in previous review comments and marked as addressed. This appears to be a regression or incomplete fix.

🧹 Nitpick comments (1)
src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (1)

406-424: Refactor: Extract duplicate node-finding logic.

The node-finding logic in lines 406-424 duplicates the logic in lines 363-386. Consider extracting this into a helper function to improve maintainability.

For example:

#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Helper function to find node in config
static std::optional<std::pair<uint32_t, uint32_t>> FindNodeInConfig(
    const std::unordered_map<uint32_t, std::vector<txservice::NodeConfig>>& configs,
    const std::string& hostName,
    uint16_t port) {
    for (auto& pair : configs) {
        auto& ngNodes = pair.second;
        for (auto& ngNode : ngNodes) {
            if (ngNode.host_name_ == hostName && ngNode.port_ == port) {
                uint32_t nodeId = ngNode.node_id_;
                uint32_t ngId = ngNode.is_candidate_ ? pair.first : 0;
                return std::make_pair(nodeId, ngId);
            }
        }
    }
    return std::nullopt;
}

Then use this helper at both line 366 and line 409.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d9e12c and da25171.

📒 Files selected for processing (6)
  • src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (6 hunks)
  • src/mongo/db/modules/eloq/src/eloq_kv_engine.h (0 hunks)
  • src/mongo/db/modules/eloq/store_handler (1 hunks)
  • src/mongo/db/modules/eloq/tx_service (1 hunks)
  • src/mongo/db/service_context.cpp (1 hunks)
  • src/mongo/db/service_context.h (1 hunks)
💤 Files with no reviewable changes (1)
  • src/mongo/db/modules/eloq/src/eloq_kv_engine.h
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/mongo/db/service_context.h
  • src/mongo/db/modules/eloq/tx_service
  • src/mongo/db/service_context.cpp
  • src/mongo/db/modules/eloq/store_handler
🧰 Additional context used
🧬 Code graph analysis (1)
src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (1)
src/mongo/db/server_options.h (1)
  • bootstrap (86-269)
🔇 Additional comments (7)
src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (7)

343-343: LGTM!

Adding true as the is_candidate_ parameter for the bootstrap solo config is correct, as the single node in bootstrap mode should be a candidate leader.


640-640: LGTM!

The isSingleNode computation correctly identifies single-node deployments.


657-663: LGTM!

The ng_leaders map construction is correct for bootstrap/single-node scenarios. Each node group is assigned the current node as its initial leader, which is appropriate for starting all DSS shards locally during bootstrap.
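The described construction can be sketched standalone as follows; NodeConfig here and the BuildSelfLeaders name are illustrative stand-ins, not the repository's actual types:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for txservice::NodeConfig (field name assumed).
struct NodeConfig {
    uint32_t node_id_;
};

// During bootstrap/single-node startup, every node group is led by the
// local node so that all DSS shards start locally.
std::unordered_map<uint32_t, uint32_t> BuildSelfLeaders(
    const std::unordered_map<uint32_t, std::vector<NodeConfig>>& ngConfigs,
    uint32_t localNodeId) {
    std::unordered_map<uint32_t, uint32_t> ng_leaders;
    for (const auto& [ngId, nodes] : ngConfigs) {
        ng_leaders[ngId] = localNodeId;  // map each node group to self
    }
    return ng_leaders;
}
```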


665-666: LGTM!

The updated TxConfigsToDssClusterConfig signature with per-node-group leaders (ng_leaders map) properly supports multi-shard DSS configurations.


731-739: LGTM!

The StartService calls correctly differentiate between non-shared storage (RocksDB, always create_if_missing=true) and shared storage (only create during bootstrap or single-node mode to prevent conflicts).


746-750: LGTM!

The DataStoreServiceClient construction correctly passes the bootstrap/single-node flag as the first parameter, maintaining consistency with the updated signature and broader multi-shard initialization flow.


1338-1341: LGTM!

The early-return guard prevents ReloadCache from executing before startup completes, avoiding potential race conditions or errors when authorization data structures aren't yet initialized.

@lzxddz lzxddz force-pushed the dss_multi_shard-rebase branch 3 times, most recently from 52f2abc to 2b9de01 on November 17, 2025 10:03
@lzxddz lzxddz requested a review from githubzilla November 17, 2025 10:52
fix single node start

update bootstrap for dss

dss client add arg bind_data_shard_with_ng

update submodule
@lzxddz lzxddz force-pushed the dss_multi_shard-rebase branch from 308b504 to 5c63e6b on November 21, 2025 03:36
@lzxddz lzxddz merged commit 43a07f5 into eloqdata:main Nov 21, 2025
1 of 3 checks passed

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (1)

662-671: Remove stray unary + before string literal in uasserted message

This expression:

uasserted(ErrorCodes::InternalError,
          +"DataStoreService initialization failed: unable to fetch config from peer " +
              ds_peer_node);

applies a stray unary + to the string literal. The literal decays to a const char*, so the expression is technically well-formed, but the + is clearly unintended noise that obscures the concatenation and should be removed.

You can fix it by dropping the unary +:

-            uasserted(ErrorCodes::InternalError,
-                      +"DataStoreService initialization failed: unable to fetch config from peer " +
-                          ds_peer_node);
+            uasserted(ErrorCodes::InternalError,
+                      "DataStoreService initialization failed: unable to fetch config from peer " +
+                          ds_peer_node);

("literal" + std::string is valid via the standard operator+ overload.)

♻️ Duplicate comments (1)
src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (1)

403-444: Fix bootstrap DSS node_id lookup to use tmpNgConfigs, not ngConfigs

In bootstrap mode, ngConfigs is a single-node placeholder (node_id_ always 0), while tmpNgConfigs holds the real multi-shard topology loaded from file/IP list. The dssNodeId search and membership check currently iterate ngConfigs, so every node ends up using dssNodeId == 0 when calling:

initDataStoreService(dssNodeId, tmpNgConfigs);

On nodes whose (host, port) map to a different node_id_ in tmpNgConfigs, DSS cannot correctly identify the local node and cluster registration breaks (same underlying problem that was flagged earlier).

Change the search to walk tmpNgConfigs:

-        // check whether this node is in cluster.
-        for (auto& pair : ngConfigs) {
+        // check whether this node is in cluster.
+        for (auto& pair : tmpNgConfigs) {
             auto& ngNodes = pair.second;
             for (auto& ngNode : ngNodes) {
                 if (ngNode.host_name_ == eloqGlobalOptions.localAddr.host() &&
                     ngNode.port_ == eloqGlobalOptions.localAddr.port()) {
                     dssNodeId = ngNode.node_id_;
                     found = true;
                     break;
                 }
             }
         }

This ensures dssNodeId and the membership check reflect the real multi-shard config used by initDataStoreService.

🧹 Nitpick comments (2)
src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (2)

651-683: DSS bootstrap/single-node wiring via ng_leaders, StartService, and client flag looks coherent

  • Using isSingleNode = (ngConfigs.size() == 1 && ngConfigs.begin()->second.size() == 1) and then driving both ng_leaders population and the StartService/DataStoreServiceClient boolean with (opt_bootstrap || isSingleNode) makes the intent clear: bootstrap or truly single-node deployments start all shards locally; multi-node non-bootstrap deployments rely on existing topology.
  • The new TxConfigsToDssClusterConfig(node_id, ngConfigs, ng_leaders, ds_config) call is consistent with that contract.

If you want to tidy it up, you could factor the repeated (opt_bootstrap || isSingleNode) into a single local const bool bootstrapOrSingleNode and reuse it in the three call sites, but that’s purely cosmetic.

Also applies to: 752-772
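The single-node predicate described above can be sketched in isolation; NodeConfig and IsSingleNode are hypothetical stand-ins for the repository's types:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for txservice::NodeConfig.
struct NodeConfig {
    uint32_t node_id_;
};

// One node group containing exactly one node => single-node deployment.
bool IsSingleNode(
    const std::unordered_map<uint32_t, std::vector<NodeConfig>>& ngConfigs) {
    return ngConfigs.size() == 1 && ngConfigs.begin()->second.size() == 1;
}
```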


1374-1382: Startup guard in ReloadCache works, but the null-check on getGlobalServiceContext() is dead code

getGlobalServiceContext() fasserts if there is no global ServiceContext, so:

auto serviceContext = mongo::getGlobalServiceContext();
if (serviceContext == nullptr || !serviceContext->isStartupComplete()) {

will never see serviceContext == nullptr. If you truly need to handle “no global context yet”, you’d have to check hasGlobalServiceContext() before calling getGlobalServiceContext(). Otherwise, you can simplify this to just checking !serviceContext->isStartupComplete() and also reuse serviceContext for makeClient instead of calling getGlobalServiceContext() again.
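A standalone analog of the suggested ordering (existence check before the startup check); ServiceContextLike and the global pointer are stand-ins, not MongoDB's actual types:

```cpp
// Stand-in for ServiceContext; real MongoDB guards the flag with a mutex.
struct ServiceContextLike {
    bool startupComplete = false;
    bool isStartupComplete() const { return startupComplete; }
};

ServiceContextLike* g_serviceContext = nullptr;

// Analog of mongo::hasGlobalServiceContext().
bool hasGlobalServiceContext() { return g_serviceContext != nullptr; }

// Mirrors the suggested guard: bail out if there is no global context
// yet, or if startup has not completed.
bool shouldReloadCache() {
    if (!hasGlobalServiceContext()) {
        return false;
    }
    return g_serviceContext->isStartupComplete();
}
```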

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da25171 and 5c63e6b.

📒 Files selected for processing (6)
  • src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (7 hunks)
  • src/mongo/db/modules/eloq/src/eloq_kv_engine.h (0 hunks)
  • src/mongo/db/modules/eloq/store_handler (1 hunks)
  • src/mongo/db/modules/eloq/tx_service (1 hunks)
  • src/mongo/db/service_context.cpp (1 hunks)
  • src/mongo/db/service_context.h (1 hunks)
💤 Files with no reviewable changes (1)
  • src/mongo/db/modules/eloq/src/eloq_kv_engine.h
✅ Files skipped from review due to trivial changes (1)
  • src/mongo/db/modules/eloq/store_handler
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/mongo/db/modules/eloq/tx_service
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-14T03:10:22.027Z
Learnt from: githubzilla
Repo: eloqdata/eloqdoc PR: 279
File: src/mongo/db/dbmain.cpp:0-0
Timestamp: 2025-11-14T03:10:22.027Z
Learning: In this repository, avoid defining --config via gflags in src/mongo/db/dbmain.cpp; MongoDB’s options_parser owns --config. Prefer gflags flags with a distinct prefix (e.g., --data_substrate_config) parsed before options_parser, with options_parser configured to allow unregistered options.

Applied to files:

  • src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp
🧬 Code graph analysis (2)
src/mongo/db/service_context.h (1)
src/mongo/db/service_context.cpp (2)
  • isStartupComplete (394-397)
  • isStartupComplete (394-394)
src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (1)
src/mongo/db/server_options.h (1)
  • bootstrap (86-269)
🔇 Additional comments (3)
src/mongo/db/service_context.h (1)

445-446: isStartupComplete accessor pairs cleanly with existing startup APIs

The new declaration is positioned appropriately next to wait/notify and matches the implementation; no API or threading concerns here.

src/mongo/db/service_context.cpp (1)

394-397: Thread-safe read of startup state looks correct

Locking _mutex and reading _startupComplete matches the existing wait/notify pattern and safely exposes the flag.
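A minimal sketch of that pattern, assuming a plain std::mutex; the class name is illustrative while the member names follow the review's description:

```cpp
#include <mutex>

// Mutex-guarded startup flag, mirroring the described accessor: both the
// writer and the reader take _mutex, so the read is race-free.
class StartupState {
public:
    void setStartupComplete() {
        std::lock_guard<std::mutex> lk(_mutex);
        _startupComplete = true;
    }
    bool isStartupComplete() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _startupComplete;
    }

private:
    std::mutex _mutex;
    bool _startupComplete = false;
};
```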

src/mongo/db/modules/eloq/src/eloq_kv_engine.cpp (1)

355-360: Solo bootstrap NodeConfig now correctly marks local node as candidate

Passing true for the solo-node NodeConfig keeps nativeNgId discovery and candidate semantics consistent during bootstrap.



Development

Successfully merging this pull request may close these issues.

Eloqdoc crashed at handle request before EloqKvEngine is initted on three nodes with multi-shard datastore

2 participants