Skip to content

docs(design): propose scaling roadmap#954

Merged
bootjp merged 7 commits into
mainfrom
design/scaling-roadmap
Jun 24, 2026
Merged

docs(design): propose scaling roadmap#954
bootjp merged 7 commits into
mainfrom
design/scaling-roadmap

Conversation

@bootjp

@bootjp bootjp commented Jun 12, 2026

Copy link
Copy Markdown
Owner

What

Adds docs/design/2026_06_12_proposed_scaling_roadmap.md — a doc-only scaling roadmap that surveys what exists, what is proposed, and what is missing for elastickv to scale further (more data, more throughput, more nodes).

This is a roadmap / implementation-plan doc, which docs/design/README.md explicitly permits ("Concrete implementation plans"). No code changes.

Structure

  1. Current scaling envelope — the seven things that bound a deployment today, each anchored to a file:line on main.
  2. Dimension-by-dimension analysis — (a) data volume → split M2/M3 + blob offload + shared cache, with range MERGE flagged as a missing design; (b) write throughput → multi-group + leader balance (docs(design): propose leader balance scheduler #953) + the multi-node bootstrap gap; (c) read throughput → lease reads (done), follower-reads (missing design, learner primitive present); (d) cross-group txn → per-group HLC vs centralized TSO assessment; (e) cluster size / membership; (f) operational scaling.
  3. Gap list → recommended new designs ranked by leverage, each with depends-on edges.
  4. Sequencing — dependency-ordered rollout tying in docs(design): propose hotspot split M2 — migration plane #945 / docs(design): propose hotspot split M3 — automation #951 / docs(design): propose leader balance scheduler #953.
  5. Open Questions.

Verified-on-tree claims (selection)

  • RunHLCLeaseRenewal proposes the ceiling to the default group only (kv/sharded_coordinator.go:1915) — the TSO doc's near-term "all led groups" fix is not yet implemented.
  • --raftBootstrapMembers is rejected for multi-group (main.go:746, len(groups) != 1); each group bootstraps single-member (multiraft_runtime.go:246-254) — the multi-node multi-group gap PR docs(design): propose leader balance scheduler #953 §1.1a (PR0) documents.
  • Each Pebble store allocates a private 256 MiB block cache (store/lsm_store.go:273-279); shared-cache is a TODO comment (:117-120), not a design.
  • FSM apply is serial per group (kv/fsm.go:123-124).
  • The raft learner primitive is implemented in code (engine AddLearner/PromoteLearner, raftadmin CLI, --raftJoinAsLearner) even though the doc filename still reads proposed — but follower-served reads (its consumer) have no design.
  • Snapshot disk offload and lease reads are implemented; TSO and range merge are not.

Notes

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 41 minutes and 59 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses rolling per-developer review limits. Reviews become available again as older review attempts age out of the rolling limit window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7c92b17-56c3-4afb-ab3b-7fdd50ed8609

📥 Commits

Reviewing files that changed from the base of the PR and between 688ba8a and 3f6e910.

📒 Files selected for processing (1)
  • docs/design/2026_06_23_proposed_scaling_roadmap.md
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch design/scaling-roadmap

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@bootjp

bootjp commented Jun 12, 2026

Copy link
Copy Markdown
Owner Author

@claude review

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @bootjp's task in 3m 55s —— View job


Claude's Review of PR #954 — Scaling Roadmap Design Doc

Tasks:

  • Load tools
  • Read the scaling roadmap design document
  • Verify code-level claims against the codebase
  • Check consistency with existing design docs
  • Post review findings

Summary

This is a solid doc-only PR with well-structured analysis and honest gap identification. The overall framing (current envelope → dimension analysis → gap list → dependency-ordered sequencing) is clear and the design-doc conventions are followed. Most code-level claims check out, but there are two stale line-number citations, one missing companion document that the text promises exists, and one open question that could materially affect the sequencing.


Issues Found

1. Stale line-number citations in kv/sharded_coordinator.go (correctness)

The two citations into kv/sharded_coordinator.go are off by roughly 46 lines, likely because PR #952 was merged between when the document was written and when it was submitted. The claims themselves are accurate; only the anchors are wrong.

Keyviz sampler (§2(f)):

kv/sharded_coordinator.go:1795-1824

Actual location of observeMutation (the per-route sampler function): lines 1841–1846. Lines 1795–1824 actually contain rawLogs / txnLogs.

RunHLCLeaseRenewal (§1.5 and §2(d)):

kv/sharded_coordinator.go:1914-1953, group, ok := c.groups[c.defaultGroup] at :1915

Actual: the function starts at line 1960, with group, ok := c.groups[c.defaultGroup] at line 1961. Lines 1914–1953 currently contain transaction prepare/commit log-building code.

Since the intro promises "Every claim about current behaviour below is anchored to a file and line on main at the propose date," both should be corrected before merging. Fix this →

2. Referenced companion doc doesn't exist yet (blocker risk)

§2(b) and §3 Gap 1 both say:

"A companion proposal, 2026_06_12_proposed_multinode_multigroup_bootstrap.md, is being written in parallel and is the authoritative spec for this gap; this roadmap defers to it."

That file is not present in docs/design/ on any reachable branch. Readers directed to "the authoritative spec" will find nothing. Two options:

  • Option A (preferred): Land the companion proposal in the same PR (or at least in the same batch) so the "defers to it" pointer resolves.
  • Option B: Soften the language to "a companion proposal will be written" and remove "is the authoritative spec" until it exists.

The companion is also listed as the root unblocking dependency in the sequencing table (step 2), so its absence here is not merely cosmetic.

3. Open Question 1 could pull TSO forward in the sequence (design concern)

§5 OQ-1 asks:

"Is the per-group renewal fix sufficient for cross-group OCC correctness in the multi-node topology, or does the first node-spanning cross-group transaction force the dedicated TSO group earlier than step 11?"

The sequencing in §4 places the dedicated TSO group last (step 11, after cross-group transactions are "common enough"). But kv/transaction.go uses commitTS values that are issued by whichever node leads each group involved in the transaction. Without a single ordering source, two concurrent cross-group transactions led by different nodes can produce non-comparable commit timestamps, breaking OCC serializability.

This means OQ-1 may not be answerable by a "focused correctness review" deferred until just before step 2 lands — the answer could require step 11 to move to step 3. It is worth at least sketching the commitTS comparison logic before accepting the sequencing as final, or explicitly noting in §4 step 11 that the TSO decision should be revisited after multi-node groups are real.


Observations (non-blocking)

Accurate code anchors (spot-checked):
All of the following checked out on main:

  • main.go:746len(groups) != 1 guard ✓
  • multiraft_runtime.go:246-254Peers: bootstrapServers (nil in multi-group), single-member bootstrap ✓
  • store/lsm_store.go:273-279pebble.NewCache(pebbleCacheBytes)
  • store/lsm_store.go:117-120 — shared-cache TODO ✓
  • kv/fsm.go:123-124 and :297 — "Raft applies are serial" comment + Apply definition ✓
  • main.go:426WithLeaseReadObserver wiring ✓
  • main.go:104--raftJoinAsLearner flag ✓
  • main.go:1606-1620 / :1613 — per-group gRPC listener ✓
  • shard_config.go:61-99parseRaftGroups
  • internal/raftengine/engine.go:233,245AddLearner / PromoteLearner interface ✓
  • internal/raftengine/etcd/engine.go:1268,1289,2550 — etcd backend implementations ✓
  • cmd/raftadmin/main.go:204-206add_learner / promote_learner CLI ✓
  • distribution/engine.go RecordAccess — confirmed only in test files, not called from any request path ✓

Gap 5 (range merge) analysis is the strongest new content. The characterization of why merge is harder than split — unifying two independent commit-timestamp streams, the !txn|… keyspace reconciliation hazard, the Composed-1 interaction — is accurate and not documented anywhere else. Worth ensuring this ends up in the eventual merge design.

Sequencing note for #953: The doc correctly distinguishes PR #953 PR1 (observe-only, can land without multi-node) from PR2–PR3 (blocked on Gap 1). This is consistent with #953's own framing.

Gap 2 / Gap 3 are independent of Gap 1 for correctness (shared cache and follower-read design can be written before multi-node bootstrap lands), even though their value depends on Gap 1. The sequencing correctly reflects this.


Verdict

Two issues should be addressed before merging: the stale line-number anchors (issue 1, quick fix) and the missing companion document (issue 2, either land it alongside this PR or soften the language). Issue 3 is a design concern worth discussing before the sequencing is treated as settled. Everything else is accurate.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive scaling roadmap document (2026_06_12_proposed_scaling_roadmap.md) that analyzes the current scaling limits of elastickv and proposes a prioritized sequence of design proposals to address identified gaps. The reviewer's feedback focuses on strengthening the architectural and operational aspects of the roadmap. Specifically, the feedback highlights the need to address live-migration and rolling-upgrade strategies, apply decoupling patterns (like interceptors) for follower/learner reads, use layered mitigations for upgrade risks during sequencing, and explicitly note the requirement for shared start timestamps in cross-group transactions to ensure MVCC consistency.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.


---

## 5. Open Questions

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When designing a production-grade system migration (such as moving from single-node multi-group to multi-node multi-group), the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption. Additionally, if there are significant operational risks like the inability to perform rolling upgrades, the document should detail potential mitigation strategies (such as implementing a temporary 'bridge' or 'proxy' mode). Consider adding an open question or a dedicated section to address these live-migration and rolling-upgrade strategies.

References
  1. When designing a production-grade system migration, the plan must consider and detail strategies for live or zero-downtime cutovers (e.g., blue/green, dual-write proxy, logical replication) to avoid service interruption.
  2. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary 'bridge' or 'proxy' mode.

Comment on lines +301 to +309
### Gap 3 — Follower / learner reads
**Problem.** Reads are leader-only; the learner primitive exists but its
consumer (off-leader reads) has no design. Read throughput cannot scale past
one node's leader capacity. **Rough milestones:** (M1) leader-issued read-ts
pipeline + a replica apply-watermark so a follower/learner can serve a
snapshot read at a leader-vouched timestamp (CLAUDE.md HLC rule). (M2) lease
invalidation interaction + staleness bound + adapter read routing to replicas.
(M3) Jepsen stale-read bound workload. **Depends-on:** Gap 1 (multi-node
groups to have a remote replica); the learner primitive (already in-tree).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When designing the follower/learner reads architecture to decouple low-level generic packages (such as raftengine or store) from high-level domain packages (such as kv or distribution), we should use an interceptor or delegate pattern with interfaces defined in the low-level package. Additionally, ensure that any such interceptor is optional (nil-safe) so that builds or environments with the high-level feature disabled can run without modification.

References
  1. To prevent tight coupling between low-level generic packages and high-level domain packages, use an interceptor or delegate pattern with interfaces defined in the low-level package. Ensure the interceptor is optional (nil-safe) so that builds or environments with the high-level feature disabled can run without modification.


---

## 4. Sequencing (dependency-ordered rollout)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When planning the sequencing and rollout of these major architectural changes, we must mitigate rolling-upgrade risks. It is recommended to use layered mitigations (such as operational guidelines and admin RPC capability gates) and defer complex coordination protocols (like cluster-version Raft entries) to separate design slices if they have a broader scope. This keeps the initial rollout slices focused and manageable.

References
  1. When mitigating rolling-upgrade risks in design documents, use layered mitigations (such as operational guidelines and admin RPC capability gates) and defer complex coordination protocols (like cluster-version Raft entries) to separate design slices if they have a broader scope.

Comment on lines +211 to +213
(TSO doc §6 "Guarantee"). Cross-group transactions (`kv/transaction.go`,
`kv/txn_codec.go`) that span groups led by different nodes need a single
ordering source for OCC commit-ts comparability. So: per-group renewal fix

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When designing cross-group transactions at scale, we must ensure that all operations within a single transaction share the same start timestamp to guarantee atomicity and a consistent snapshot view across the involved groups. This is a core correctness requirement for our MVCC system and should be explicitly highlighted in the transaction scaling analysis.

References
  1. In an MVCC system, all operations within a single transaction must share the same start timestamp to ensure atomicity and a consistent snapshot view.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1abe3500e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +182 to +184
`--raftJoinAsLearner` exists (`main.go:104`). The learner doc itself scopes
follower-served reads as an explicit non-goal (§2 "Non-goals", §8 OQ-5):
learners forward `LinearizableRead` to the leader today. So the read-replica

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Correct the learner read behavior description

When an attached learner receives LinearizableRead today it does not forward the read to the leader; internal/raftengine/etcd.Engine.handleRead returns ErrNotLeader for any non-leader state, and docs/raft_learner_operations.md documents that the caller must forward to the leader. Leaving this as “learners forward” misstates the current primitive that Gap 3 is supposed to build on, so the follower-read roadmap may omit required forwarding/proxy or API work.

Useful? React with 👍 / 👎.

Issue 1: correct stale line anchors in kv/sharded_coordinator.go
  - keyviz sampler observeMutation: 1795-1824 -> 1841-1846
  - RunHLCLeaseRenewal: 1914-1953 -> 1960-1985; defaultGroup access :1915 -> :1961
Issue 2: soften companion-doc references to in-flight PR #955 form,
  matching the #945/#951/#953 branch-reference style (3 sites + map line).
Issue 3: ground OQ-1 in the actual commitTS logic (nextStartTS/
  resolveTxnCommitTS/nextTxnTSAfter all from one c.clock) and annotate
  §4 step 11 as deferred-pending-OQ-1 with an explicit trigger condition.
Inline: fix learner LinearizableRead behaviour (engine returns ErrNotLeader,
  caller forwards; engine.go:1583); note shared-startTS invariant; add OQ-7
  for the single-node->multi-node live cutover / rolling-upgrade strategy.
@bootjp

bootjp commented Jun 12, 2026

Copy link
Copy Markdown
Owner Author

@claude review

Addressed in 09031487 (doc-only, Status: Proposed unchanged).

Issue 1 — stale line anchors (fixed). Re-verified against current main:

  • keyviz sampler §2(f): observeMutation is at kv/sharded_coordinator.go:1841-1846 (was cited 1795-1824, which now holds rawLogs/txnLogs). Corrected, and added the function name for stability.
  • RunHLCLeaseRenewal §1.5: function is kv/sharded_coordinator.go:1960-1985, with group, ok := c.groups[c.defaultGroup] at :1961 (was 1914-1953 / :1915). Corrected both citations.

The other spot-checked anchors (main.go:746, multiraft_runtime.go:246-254, lsm_store.go, fsm.go, etc.) re-verified accurate; left unchanged.

Issue 2 — companion doc reference (softened, Option B). The companion is PR #955 (design/multinode-multigroup-bootstrap), open and claude-approved but not on main. Softened all three references — §2(b), §3 Gap 1, §4 step 2 plus the in-flight-PR map line — to the same "in review as PR #955 (<file>, branch <branch>)" form already used for #945/#951/#953. "Authoritative spec" is now conditioned on "once PR #955 lands."

Issue 3 — OQ-1 / TSO sequencing (took the substantive route: sketched the commitTS logic AND annotated step 11). Grounded in code:

  • A cross-group txn issues both startTS (nextStartTS, :1429) and commitTS (resolveTxnCommitTSnextTxnTSAfter, :1102/:1376) from a single *HLC (c.clock) on the coordinator's node. maxLatestCommitTS/Observe only floors on the keys the txn touches.
  • Today this is safe only because every group shares one process-wide *HLC. The per-group renewal fix (step 1) buys per-node monotonicity only — TSO doc §6 "Guarantee" is explicit that cross-node monotonicity is not covered.
  • So the moment step 2 makes the two groups of a cross-group txn coordinate on different nodes, two concurrent cross-group txns can draw commitTS from two independently-advanced clocks → non-comparable, breaking OCC. OQ-1 therefore cannot be answered affirmatively by a review deferred to "just before step 2."
  • §4 step 11 is re-labeled "deferred-pending-OQ-1-resolution, not settled-last," with the explicit trigger: the first cross-group transaction whose participating groups are led by different nodes. OQ-1 now also raises an interim option (pin cross-group ts allocation to one designated group's leader) to evaluate in the PR docs(design): propose multi-node multi-group bootstrap #955 design, short of the full TSO group.

Inline findings:

  • codex-connector (P2, learner reads, line 184) — fixed. The doc said "learners forward LinearizableRead to the leader," which is wrong: handleRead returns ErrNotLeader for any non-leader state (internal/raftengine/etcd/engine.go:1583) and the caller must forward (docs/raft_learner_operations.md). Corrected, and noted Gap 3's follower-read design must supply the forwarding/proxy/replica-read API the primitive omits.
  • gemini (line 213, shared start timestamp) — folded in. Added a note in §2(d) that the shared-snapshot invariant (one startTS per txn via nextStartTSreqs.StartTS) is already upheld; the gap is cross-node commitTS comparability, not startTS.
  • gemini (line 399, live cutover / rolling upgrade) — folded in. Added OQ-7 on the single-node→multi-node transition (live AddVoter expansion vs blue/green dual-write proxy vs fresh-cluster migration; version-skew/capability-gating), deferring the spec to PR docs(design): propose multi-node multi-group bootstrap #955 and pointing at the TSO doc §7 phased-cutover structure as the model.
  • gemini (line 309, interceptor/nil-safe pattern) and (line 356, layered mitigations / defer cluster-version Raft entries) — noted, not folded in. These are implementation-pattern suggestions for the eventual Gap 3 follower-read design and the bootstrap rollout respectively; the roadmap explicitly defers mechanism to those docs and says it "does not duplicate the detail of the docs it references." Captured here so the downstream designs pick them up.

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @bootjp's task in 1m 58s —— View job


Re-review of PR #954 — Scaling Roadmap (post 09031487)

Tasks:

  • Load tools
  • Read updated design document
  • Verify claimed line anchor fixes against codebase
  • Verify learner-read correction
  • Spot-check OQ-1/TSO annotation and OQ-7 addition
  • Post findings

Verdict: All prior issues resolved — approved to merge

Every item from the previous review has been addressed correctly. Spot-checks below.


Verified Fixes

Issue 1 — Line anchors (fixed, confirmed).

  • observeMutation at kv/sharded_coordinator.go:1841-1846: function body runs exactly lines 1841–1846 ✓
  • RunHLCLeaseRenewal at :1960, group, ok := c.groups[c.defaultGroup] at :1961: exact match ✓
  • New OQ-1 citations (nextStartTS at :1429, resolveTxnCommitTS at :1102, nextTxnTSAfter at :1376) all confirmed ✓

Issue 2 — Companion doc reference (softened, confirmed).
The three references in §2(b), §3 Gap 1, and §4 step 2 now use the same "in review as PR #955 (<file>, branch <branch>); once it lands it is the authoritative spec" form already used for #945/#951/#953. The language is no longer asserting the doc exists before it does. ✓

Issue 3 — OQ-1/TSO sequencing (substantive analysis added, confirmed).
The full code-grounded walkthrough of why the per-group fix is necessary but not sufficient in the cross-node case is accurate:

  • c.clock is the shared *HLC; nextStartTS (:1429) and resolveTxnCommitTS (:1102nextTxnTSAfter at :1376) both issue through it
  • TSO doc §6 "Guarantee" is correctly characterized: per-node monotonicity only, not cross-node
  • Step 11 is now labeled "deferred-pending-OQ-1-resolution, not settled-last" with a concrete trigger event

The interim option (pin cross-group ts allocation to the default group's leader until the full TSO group lands) is a sound suggestion and correctly deferred to the PR #955 design. ✓

Codex P2 — Learner-read description (fixed, confirmed).
The revised text correctly states handleRead returns ErrNotLeader for any non-leader state at internal/raftengine/etcd/engine.go:1583 (line 1583 is the if e.State() != raftengine.StateLeader check that precedes the return), and the docs/raft_learner_operations.md quote ("Serve linearizable reads from the learner. Today LinearizableRead on a learner returns ErrNotLeader and the caller must forward to the leader") is present verbatim in that file. ✓

Gemini shared startTS — folded in, confirmed. The clarification correctly distinguishes the already-upheld startTS invariant (one nextStartTS propagated via reqs.StartTS) from the actual gap (cross-node commitTS comparability). Accurate and clearly stated. ✓

Gemini OQ-7 (live cutover) — folded in, confirmed. The options surveyed (live AddVoter expansion, blue/green with proxy/dualwrite.go, fresh cluster + data migration) are all grounded in in-tree code; the pointer to TSO doc §7 as a structural model for the cutover design is appropriate. ✓

Gemini interceptor/nil-safe and layered mitigations — correctly deferred. These are rightly left to the downstream Gap 3 and bootstrap designs; noting them in the trigger comment is sufficient. ✓


One small observation (non-blocking)

The OQ-1 text at line 429–431 has a minor forward-reference inconsistency: it says nextStartTS allocates one startTS "for the whole txn and propagates it via reqs.StartTS to every participating group" — this is accurate for the RawKV/Transactional gRPC path, but worth a parenthetical caveat that the DynamoDB TransactWriteItems path generates its own timestamps (since the DynamoDB adapter is not behind ShardedCoordinator.nextStartTS). Not a bug in this doc since the doc scope is the coordinator path; mentioning in case it surfaces in a downstream TSO design review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 09031487bc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +225 to +229
is in-scope-soon and load-bearing; the dedicated TSO group is only justified
once (i) multi-node multi-group is real and (ii) cross-group transactions
are common enough that the batch-allocator amortization pays for the extra
Raft group. Until both hold, the per-group fix plus the shared-`*HLC`
property is adequate. The recommendation in §4 sequences the per-group fix

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Align TSO sequencing with first cross-node txn

This paragraph still frames the dedicated TSO as a throughput/amortization decision and says the per-group fix is adequate until cross-group transactions are common. That conflicts with the later OQ-1 analysis in this same doc: once Gap 1 permits participating groups to be led by different nodes, even the first such cross-group transaction can draw commitTS from incomparable per-node HLCs, so deferring the TSO/interim single-oracle plan based on workload frequency leaves a correctness hole in the roadmap.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bca519a604

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +293 to +295
**Problem.** No startup wiring produces a Raft group whose voters span more
than one node (`main.go:746` `len(groups) != 1` guard;
`multiraft_runtime.go:246-254` single-member bootstrap). This blocks

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Narrow the bootstrap gap to multi-group configs

This overstates the missing bootstrap support: resolveBootstrapServers only rejects --raftBootstrapMembers when len(groups) != 1, and main_bootstrap_test.go has a single group fixed members case with n1,n2, so a single Raft group can already be bootstrapped with voters on multiple nodes from flags and then passed to buildRuntimeForGroup. As written, the roadmap misclassifies existing single-group multi-node/learner-read work as blocked on Gap 1; this should be scoped to multi-group configurations that need per-group voter sets.

Useful? React with 👍 / 👎.

Comment on lines +336 to +340
**Rough milestones:** (M1) data-balance policy (move a range's replica set from
an over-full node to an under-full one) reusing the M2 migration plane for the
actual move. (M2) compose with leader balance so the two schedulers don't
fight. **Depends-on:** M2 migration plane (PR #945) for the data-movement
mechanism, **and** Gap 1 (multi-node bootstrap) for somewhere to move replicas

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Separate replica placement from range migration

For the common multi-group topology where each group is bootstrapped with the same voter set, reusing the M2 range-migration plane only changes which Raft group owns the key range; it does not move any replicas off an over-full node because the route model points at a group_id while node placement is controlled by Raft membership. A region-balance design needs an explicit target-group placement constraint or membership-change step, otherwise it can claim to fix per-node data skew while merely moving data between groups colocated on the same nodes.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the current-head roadmap review comments.

  • narrowed Gap 1 to multi-group configs with per-group multi-node voter sets, while noting single-group multi-voter bootstrap already exists
  • separated range ownership migration from replica placement in Gap 4, and made region balancing depend on both the M2 migration plane and a replica-placement / membership-change design

Verification:

  • git diff --check

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ee8c13622b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +416 to +417
8. **Region balance scheduler** (Gap 4). After step 4 (migration plane) and
step 2 (multi-node). Complement to step 3's leader balance.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add the missing replica-placement prerequisite

This sequencing still unblocks the region-balance scheduler after only the M2 migration plane and multi-node bootstrap, but §3 Gap 4 says the scheduler also needs a replica-placement / membership-change design to create or reshape groups when existing target groups share the same voter set. In that common topology, following this order would ship a scheduler that can move range ownership between groups but still cannot move any replicas off an over-full node, so the rollout list should include that prerequisite here or split it into a prior step.

Useful? React with 👍 / 👎.

Comment on lines +29 to +32
(`main.go:1606-1620`, listener at `:1613`). But a *group whose voter set
spans more than one node is not deployable from the startup flags*:
`resolveBootstrapServers` rejects `--raftBootstrapMembers` whenever
`len(groups) != 1` (`main.go:746`, `ErrBootstrapMembersRequireSingleGroup`

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope bootstrap gap to multi-group configs

resolveBootstrapServers only rejects --raftBootstrapMembers when len(groups) != 1; the single-group path can already bootstrap a multi-voter group from startup flags, which this doc later acknowledges in Gap 1. As written, this current-envelope bullet says no group can span nodes at startup at all, so readers will treat existing single-group multi-voter deployments as unsupported; scope this claim to multi-group configurations.

Useful? React with 👍 / 👎.

Comment on lines +392 to +395
1. **HLC per-group ceiling renewal fix** (TSO doc §6 / M1). Smallest correct
change; closes the cross-group monotonicity gap (§1.5) *before* the
topology that exposes it exists. Land first so multi-node groups are safe
on arrival.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate multi-node safety on the TSO decision

This says the per-group HLC renewal fix makes multi-node groups safe on arrival, but §5/OQ-1 and the later step 11 text conclude that the fix only gives per-node monotonicity and that a TSO/single-oracle bridge is required before cross-group transactions led by different nodes. If an implementer follows steps 1–2 as sufficient, the existing cross-group transaction path can issue commitTS values from different HLC instances, so this step should be narrowed to the renewal gap or explicitly gate enabling that topology on the TSO/bridge decision.

Useful? React with 👍 / 👎.


Status: Proposed
Author: bootjp
Date: 2026-06-12

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Use the actual propose date

docs/design/README.md requires a new design doc's header date and filename propose date to match the author date of the first commit that added the file. This file is added by c6018d4 with AuthorDate June 23, 2026, but the header says June 12 (and the filename repeats that), so the standard git log --follow --diff-filter=A ... | tail -1 check will disagree with the doc convention from day one.

Useful? React with 👍 / 👎.

Comment on lines +448 to +452
2. **Operator-driven in-place expansion.** For clusters that can tolerate a
controlled maintenance window, upgrade all binaries first, verify capability
convergence, then add voters / learners one group at a time with live
`AddVoter` / `AddLearner` and keep the old single-voter leader serving until
the new voter set catches up. Do not permit a flag-only restart to reinterpret

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use learners for in-place replica expansion

For expanding an existing single-voter group with a fresh node, allowing direct AddVoter here is operationally unsafe: the learner runbook explains that AddLearner keeps the voter quorum unchanged during catch-up, while AddVoter immediately makes the new node count toward quorum and can stall commits until it catches up. The guardrail should prescribe AddLearner → verify catch-up → PromoteLearner for new replicas, not AddVoter / AddLearner interchangeably.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the current rollout-guardrail review items:

  • clarified that the per-group HLC ceiling fix is not sufficient to enable cross-group transactions led by different nodes; those remain gated on the TSO/single-oracle bridge.
  • tightened in-place expansion to require AddLearner -> catch-up watermark -> promotion, reserving direct AddVoter for bootstrap/offline fully caught-up peers.

Validation: git diff --check.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 360abdc0ce

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +232 to +235
load-bearing for one node leading multiple groups; before enabling the first
cross-group transaction whose participating groups are led by different
nodes, the roadmap must either pull the dedicated TSO group forward or land a
narrower single-oracle bridge that pins cross-group timestamp allocation to

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Gate cross-group txns by coordinator, not leader placement

The trigger is too narrow for the current request path: ShardedCoordinator.Dispatch allocates StartTS and resolveTxnCommitTS before each group's LeaderProxy.Commit forwards the already-stamped pb.Requests, and Internal.Forward preserves non-zero timestamps. In a multi-node deployment, two clients can therefore issue cross-group commit timestamps from different ingress/coordinator nodes even when the participating groups are all led by the same node, so gating only on groups “led by different nodes” leaves that follower-forwarded case outside the TSO/single-oracle guardrail.

Useful? React with 👍 / 👎.

Comment on lines +557 to +558
What is the supported path — operator-driven `AddVoter` expansion of an
existing single-voter group, blue/green with a dual-write proxy

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep live expansion on learner promotion

This OQ reintroduces direct live AddVoter as a supported in-place path even though §4.1 now says live expansion must add a learner, wait for catch-up, then promote, reserving direct AddVoter for bootstrap/offline fully caught-up peers. Leaving PR #955 to choose “operator-driven AddVoter expansion” makes the guardrail ambiguous and can lead implementers back to the unsafe path this revision was meant to exclude; this option should be phrased as learner-based expansion or explicitly rejected.

Useful? React with 👍 / 👎.

Comment on lines +493 to +498
Both calls go through the same `c.clock`. The apply-time OCC/ownership check
then compares these timestamps against stored `CommitTS` values (the
Composed-1 guard, `docs/design/2026_05_29_partial_composed1_cross_group_commit_guard.md`
§4.2(a)/§4.4; FSM `latest > startTS` write-conflict check). OCC
serializability depends on those `commitTS` values being **mutually
comparable** across all participating groups.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include read-only shard validation in txn guardrail

For transactions that read a shard they do not mutate, the current path uses validateReadOnlyShards, whose own comment notes a TOCTOU window between the linearizable barrier and LatestCommitTS check where a concurrent write can go undetected. A TSO/single-oracle bridge does not close that SSI gap, so the roadmap should not describe commit-ts comparability as the remaining OCC serializability requirement for cross-group txns without also gating those read-only-shard shapes or adding the dedicated read-validate FSM phase.

Useful? React with 👍 / 👎.

Comment on lines +101 to +104
- **Range split — distribute a range across groups.** Same-group split
shipped in M1 (`distribution/`). Cross-group migration (the part that
actually relocates data and reduces per-node volume) is **PR #945**
(`docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md`,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't promise M2 reduces per-node data alone

This overstates the scaling effect of the M2 migration plane: moving ownership between Raft groups only reduces per-node data when the target group has a different replica placement. Today's multi-group process hosts every group locally, and the roadmap itself later notes that M2 does not move replicas off an over-full node when groups share the same voter set, so this section should make Gap 1/region-placement a prerequisite for actual per-node volume reduction rather than attributing it to M2 alone.

Useful? React with 👍 / 👎.

Comment on lines +445 to +449
1. **Capability-gated admin operations.** `raftadmin` / coordinator admin RPCs
that enable multi-node bootstrap, leader transfers, follower reads,
migration/import, write fences, or a cross-group timestamp bridge must first
observe that every target voter advertises the matching capability. Mixed
binary clusters run in compatibility mode with these features disabled.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check capability on learners before expansion

For the live in-place expansion flow immediately below, the new replica is first added as a learner, so checking only “target voter” capabilities can miss the joiner until after it has already been admitted. To keep mixed-version clusters in compatibility mode, the guardrail should require capability convergence for every target replica/server involved in AddLearner and promotion, not just current or future voters.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the new P1 guardrail findings:

  • changed the TSO/single-oracle trigger from participant leader placement to timestamp allocation by multiple coordinator/ingress nodes, including forwarded requests that preserve non-zero timestamps.
  • added the independent read-only-shard validation guardrail: a TSO/single-oracle bridge is necessary but not sufficient while validateReadOnlyShards has a TOCTOU window, so those txn shapes stay gated until a read-validate FSM phase or equivalent lands.

Validation: git diff --check; commit hook golangci-lint --config=.golangci.yaml run --fix.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87572881ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +487 to +491
- `startTS` via `nextStartTS` (`kv/sharded_coordinator.go:1429`):
`Observe(maxLatestCommitTS(keys))` then `c.clock.NextFenced()`. The
`maxLatestCommitTS` floor is a per-key read against the store — it pins
`startTS` above the latest commit on *the keys this txn touches*, nothing
more.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Limit startTS floor claim to mutation keys

For transactions with ReadKeys that are not also mutation Elems, this overstates what the code observes: Dispatch calls nextStartTS(ctx, reqs.Elems) and maxLatestCommitTS only appends each Elem.Key before calling MaxLatestCommitTS, so read-only participant keys and adapter pre-dispatch reads are not included in this floor. Leaving the roadmap saying the floor covers the keys the txn touches/read can mislead the TSO/read-validation follow-ups into relying on a guarantee that only exists for mutation keys.

Useful? React with 👍 / 👎.


Status: Proposed
Author: bootjp
Date: 2026-06-12

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Use the add-commit date in the design header

The design-doc README requires both the filename date and header Date to match the author date of the first commit that added the file. This new file is added by commit 1e510336 on 2026-06-23, but the path/header use 2026-06-12; tooling or review processes following the documented convention will misidentify the proposal's historical anchor. Please rename the file and update this header date to 2026-06-23.

Useful? React with 👍 / 👎.

Comment on lines +264 to +269
- **keyviz** — the per-route load sampler is wired and allocation-free on the
hot path (`observeMutation`, `kv/sharded_coordinator.go:1841-1846`), with proposed extensions
for cluster fan-out (`2026_04_27_proposed_keyviz_cluster_fanout.md`),
subrange sampling (`2026_05_25_proposed_keyviz_subrange_sampling.md`),
hot-key top-K (`2026_05_28_proposed_keyviz_hot_key_topk.md`), and per-cell
conflict (implemented). It is the detection signal M3 reuses.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Add the missing read-path caveat to keyviz reuse

When M3 or later balancing consumes keyviz as the hotspot signal, this summary omits an important limitation: the in-tree coordinator comment says adapter-direct Redis/DynamoDB/S3 reads that hit MVCCStore.GetAt bypass keyviz, so read-heavy hotspots on those normal adapter paths will be invisible unless read-path sampling is added or listed as a dependency. Without that caveat, the roadmap can incorrectly treat keyviz-driven automation as covering both read and write load.

Useful? React with 👍 / 👎.

Comment on lines +307 to +308
across processes. (M3) Jepsen multi-group multi-node workload. **Depends-on:**
nothing (it is the root unblocker). *A companion proposal is in review as **PR #955**

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge List the HLC renewal fix as a bootstrap dependency

This section says Gap 1 has no dependencies, but the roadmap later makes the per-group HLC ceiling renewal fix the step that must land before multi-node multi-group bootstrap because default-group-only renewal is unsafe once replicas move across nodes. If PR #955 follows this dependency list rather than the later sequence, it can enable the topology that exposes stale ceilings without the guard; please either scope “depends-on nothing” to design-doc authoring only or list the HLC renewal fix as a rollout prerequisite.

Useful? React with 👍 / 👎.

Comment on lines +431 to +432
pull the dedicated TSO group forward or land a narrower bridge that
allocates all cross-group `commitTS` values from one designated oracle. The

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require the bridge to allocate startTS too

This guardrail narrows the bridge to allocating only commitTS, but in the multi-coordinator case StartTS is also currently allocated before forwarding by Dispatch/nextStartTS. If a single-oracle bridge leaves startTS on each node's local HLC, a transaction can start on one coordinator with a snapshot timestamp below a commit just issued by another coordinator and then read stale data; the bridge needs to pin both startTS and commitTS to the oracle, or the roadmap should explicitly keep those shapes disabled.

Useful? React with 👍 / 👎.

@bootjp

bootjp commented Jun 23, 2026

Copy link
Copy Markdown
Owner Author

@codex review

Addressed the current scaling-roadmap followups:

  • renamed the proposal file/header to the add-commit date (2026_06_23)
  • scoped the bootstrap limitation to multi-group configs; single-group multi-voter bootstrap is no longer described as unsupported
  • clarified that M2 migration only reduces per-node data when replica placement differs, so Gap 1 / replica placement / region balance are prerequisites for actual volume relief
  • added the keyviz read-path caveat for adapter-direct reads
  • made Gap 1's rollout dependency on the HLC renewal fix explicit
  • required region balance to depend on a replica-placement / membership-change design
  • changed the cross-group bridge requirement to allocate both startTS and commitTS, and kept the read-only-shard validation gate explicit
  • expanded capability checks to learners/new replicas and rephrased live expansion as learner-add/promote instead of direct live AddVoter

Validation: git diff --check HEAD passes. Docs-only change.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Hooray!

Reviewed commit: 3f6e910cf0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@bootjp bootjp merged commit 90c0f46 into main Jun 24, 2026
9 checks passed
@bootjp bootjp deleted the design/scaling-roadmap branch June 24, 2026 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant