diff --git a/docs/design/2026_06_23_proposed_scaling_roadmap.md b/docs/design/2026_06_23_proposed_scaling_roadmap.md new file mode 100644 index 00000000..91afcdfc --- /dev/null +++ b/docs/design/2026_06_23_proposed_scaling_roadmap.md @@ -0,0 +1,602 @@ +# Scaling Roadmap — what exists, what is proposed, what is missing + +Status: Proposed +Author: bootjp +Date: 2026-06-23 + +This is a roadmap / implementation-plan document (the kind `docs/design/README.md` +explicitly admits under "Concrete implementation plans"). It surveys the +current scaling envelope of elastickv, maps every scaling dimension to the +design that already addresses it, and identifies the gaps where no design +exists yet. Every claim about current behaviour below is anchored to a file +and line on `main` at the propose date; every referenced design doc is one +that actually exists in `docs/design/` (or, for in-flight work, an open PR +branch named explicitly). + +Scope: this doc proposes **no implementation**. It proposes a sequence and a +short list of new design docs to write. It does not duplicate the detail of +the docs it references; read those for mechanism. + +--- + +## 1. Current scaling envelope + +What bounds a single elastickv deployment today: + +1. **Single-process multi-group only.** Multiple Raft groups run in one + process (`--raftGroups id=addr,…`, `shard_config.go:61-99`); each group + gets its own engine and its own gRPC listener at `rt.spec.address` + (`main.go:1606-1620`, listener at `:1613`). But a *multi-group topology whose + per-group voter sets span nodes is not deployable from the startup flags*: + `resolveBootstrapServers` rejects `--raftBootstrapMembers` whenever + `len(groups) != 1` (`main.go:746`, `ErrBootstrapMembersRequireSingleGroup` + at `:736`), and `buildRuntimeForGroup` passes the same `bootstrapServers` + (which is `nil` in any multi-group config) to every group with + `LocalAddress: group.address` (`multiraft_runtime.go:246-254`), so each + group in a multi-group process bootstraps as a **single-member** cluster. + The multi-group story today is "one process hosts every group, each group + has one voter" (the demo / Jepsen M5 model). True multi-node multi-group + is not wired (see §2(b), §2(e)). + +2. **Per-group memory.** Each Raft group owns a private Pebble store, and + `NewPebbleStore` → `defaultPebbleOptionsWithCache` allocates a *fresh* + block cache per store (`store/lsm_store.go:273-279`, `pebble.NewCache(pebbleCacheBytes)` + at `:274`), default 256 MiB (`defaultPebbleCacheBytes` at `:66`). N groups + on a node means N × 256 MiB of block cache alone, plus N memtables, N WALs, + N compaction budgets. The TODO at `store/lsm_store.go:117-120` records the + intended fix (a process-wide shared cache plumbed through `NewPebbleStore`) + but it is a comment, not a design. + +3. **Leader concentration.** Leadership of each group is elected + independently by etcd/raft; nothing spreads leaderships across nodes. One + node can end up leading every group while peers idle, carrying all + leader-only work (write proposals, HLC ceiling renewal, lease reads, OCC + timestamp issuance, route-catalog proposes). This is the explicit + motivation for PR #953 (leader balance scheduler). + +4. **Range granularity is fixed after a manual split.** M1 of the hotspot + shard split shipped a durable, versioned route catalog and a same-group + `SplitRange` RPC (`adapter/distribution_server.go`, `distribution/catalog.go`, + `distribution/engine.go`, `distribution/watcher.go`). There is no data + movement (cross-group migration is M2, in flight), no auto-detection (M3, + in flight), and **no merge** (no design exists). The detection counters in + `distribution/engine.go` (`RecordAccess`) are dead code — not called from + any request path (confirmed in the M3 doc §1.3). + +5. **Cross-group timestamps via per-group HLC, and not even per-group yet.** + `ShardedCoordinator.RunHLCLeaseRenewal` proposes the physical-ceiling + renewal **only to the default group** (`kv/sharded_coordinator.go:1960-1985`, + `group, ok := c.groups[c.defaultGroup]` at `:1961`). A node that leads a + non-default group but is not a member of the default group never advances + its ceiling from that group — exactly the cross-group monotonicity gap the + centralized-TSO doc §1.1 describes. The doc's "near-term fix" (M1: iterate + *all* led groups in parallel) is **not yet implemented** on `main`. Today + this gap is masked because every group is single-node (§1.1): all FSMs in + the process share one `*HLC` (CLAUDE.md "Timestamp Oracle"), so the default + group's ceiling covers the whole process. The moment groups span nodes, the + gap becomes real. + +6. **Large values traverse the Raft log.** Every byte of an S3 object travels + through the Raft log as `s3keys.BlobKey` entries (the S3 raft-blob-offload + doc §1). WAL and snapshot size scale with object bytes, and follower + catch-up re-applies every byte on the single-threaded apply loop. + +7. **FSM apply is serial per group.** `kvFSM.Apply` runs on the group's + single apply goroutine — "Raft applies are serial" (`kv/fsm.go:123-124`, + `Apply` at `:297`). Apply throughput per group is bounded by that one + goroutine; the only way to add apply parallelism today is to add groups. + +--- + +## 2. Dimension-by-dimension analysis + +Each dimension below lists what bounds it and the design (existing, in-flight, +or missing) that addresses it. + +### (a) Data volume per node + +The amount of data one node can hold is bounded by Pebble capacity and by the +memory each group's private cache/memtable pins. + +- **Range split — distribute a range across groups.** Same-group split + shipped in M1 (`distribution/`). Cross-group migration (the part that + actually relocates data and reduces per-node volume) is **PR #945** + (`docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md`, + branch `docs/hotspot-split-m2-proposal`): a resumable `SplitJob` with + `PLANNED → BACKFILL → FENCE → DELTA_COPY → CUTOVER → CLEANUP → DONE` phases + driven by a migrator on the default-group leader. M2 is the required + ownership-migration mechanism, but it reduces per-node bytes only when the + target group has different replica placement; if every group still shares the + same local voter set, migration changes ownership but cannot move data off an + over-full node. Actual per-node volume relief therefore also depends on Gap 1 + plus the replica-placement / region-balance work in Gap 4. Auto-detection / + scheduling is **PR #951** + (`docs/design/2026_06_11_proposed_hotspot_split_milestone3_automation.md`, + branch `design/hotspot-split-m3-automation`), which drives detection off the + already-wired keyviz sampler rather than the dead `RecordAccess` counters. +- **Range MERGE — missing, no design exists.** The inverse of split. Both the + parent hotspot doc (§2.2.2) and M2 (§2.2) list merge as an explicit + non-goal, and no `*_merge_*.md` exists in `docs/design/` (verified). Why it + is needed: without merge, a workload whose hot range cools, or one that + over-splits during a transient spike, accumulates permanent route fragments. + Each fragment costs a route entry, sampler cardinality, and — once a child + lives in its own group — a full group's per-node memory overhead (§1.2). A + long-running cluster's route count is monotonically non-decreasing, which is + not sustainable. Hard parts to sketch: choosing merge candidates (adjacent + routes, both cold, same or adjacent groups); fencing both sides atomically; + reconciling the two MVCC histories and the per-key internal keyspace + (`!txn|…`, list/hash/set/zset meta) without losing a committed write or an + unresolved prepare; cutover that bumps the catalog version exactly like + split; and the Composed-1 cross-group commit guard interaction (the same + apply-time hazard M2 §7.3 closes for split). Merge is strictly harder than + split because it must *unify* two independent commit-timestamp streams, not + bisect one. +- **Blob offload — proposed.** + `docs/design/2026_04_25_proposed_s3_raft_blob_offload.md` keeps large object + payloads out of the Raft log (chunkref through Raft, chunkblob via a + peer-to-peer side channel with semi-synchronous quorum replication). This + bounds WAL/snapshot growth to O(manifest), which is the data-volume lever + for the S3 surface specifically. Still proposed; the legacy `BlobKey`-on-Raft + path is what runs today. +- **Shared Pebble cache / resource pools — TODO only, promoted to a design + item here.** The `store/lsm_store.go:117-120` TODO is the single biggest + per-node memory tax as group count grows: N independent 256 MiB caches with + no shared eviction pool. This deserves a real design (see §3, Gap 2), not a + comment. It blocks high group counts on a single node, which in turn blocks + the "many small ranges" model that split (a) produces. + +### (b) Write throughput + +- **Multi-group** is the primary write-scaling lever: independent Raft + pipelines and independent serial apply loops (§1.7). Already in-tree + structurally. +- **Leader balance — PR #953** + (`docs/design/2026_06_11_proposed_leader_balance_scheduler.md`, branch + `design/leader-balance-scheduler`) spreads group leaderships across nodes so + write-proposal load is not pinned to one node. Count-based v1 (TiKV + balance-leader equivalent), embedded in the default-group leader, default + OFF behind `--leaderBalance`. +- **Multi-node multi-group bootstrap — GAP.** Leader balance is *blocked on a + topology that does not exist*: PR #953 §1.1a ("PR0") documents that no + startup wiring produces a group whose voters span more than one node — the + `len(groups) != 1` guard at `main.go:746` and the single-member bootstrap in + `multiraft_runtime.go:246-254` mean every group in a multi-group process is + single-voter, so there is nothing to transfer a leader *to*. Closing this is + a prerequisite for write-throughput scaling beyond one node, and a companion + proposal is in review as **PR #955** + (`docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md`, branch + `design/multinode-multigroup-bootstrap`): extend `--raftGroups` / a per-group + members flag to accept a multi-node voter set per group, lifting the + `len(groups)==1` guard. Once PR #955 lands it is the authoritative spec for + this gap; this roadmap treats it as the unblocking dependency for (b), (e), + and the region-balance gap in §3. + +### (c) Read throughput + +- **Lease reads — done.** Leader-served linearizable reads without a Raft + round-trip per read (`kv/lease_state.go`, + `docs/design/2026_04_20_implemented_lease_read.md`; the lease-read observer + is wired in at `main.go:426`). This scales leader read throughput but keeps + all reads on the leader. +- **Learner reads → follower/learner reads.** The learner *primitive* is + effectively implemented despite the doc filename still reading + `2026_04_26_proposed_raft_learner.md`: `AddLearner` / `PromoteLearner` are + on the engine `Admin` interface (`internal/raftengine/engine.go:233`, `:245`) + and implemented in the etcd backend (`internal/raftengine/etcd/engine.go:1268`, + `:1289`, `ConfChangeAddLearnerNode` handling at `:2550`), the raftadmin CLI + has `add_learner` / `promote_learner` (`cmd/raftadmin/main.go:204-206`), and + `--raftJoinAsLearner` exists (`main.go:104`). The learner doc itself scopes + follower-served reads as an explicit non-goal (§2 "Non-goals", §8 OQ-5): + a `LinearizableRead` against a learner is **not** forwarded by the engine — + `handleRead` returns `ErrNotLeader` for any non-leader state + (`internal/raftengine/etcd/engine.go:1583`), and the *caller* must forward to + the leader (`docs/raft_learner_operations.md`, "Serve linearizable reads from + the learner"). So the read-replica attach primitive exists, but the design + that consumes it for off-leader reads **does not** — **follower-read design + is missing**, and it must supply the forwarding/proxy or replica-read API the + current primitive deliberately omits. Requirements to + sketch (Gap 3): a leader-issued read-timestamp pipeline so a follower/learner + serves a snapshot read at a ts the leader has vouched for (CLAUDE.md HLC + rule: never use the local wall clock or a follower-issued ts for MVCC + visibility / OCC / lease decisions); a staleness bound and a way for the + reader to know it has applied past that ts; lease invalidation interaction; + and adapter routing so reads can be steered to a replica. The low-level + packages must stay generic: `internal/raftengine` / `store` should expose an + optional, nil-safe delegate/interceptor interface for "may serve at this + leader-vouched read timestamp?" and "has this replica applied past it?", + with nil preserving today's leader-only behavior. `kv` / `distribution` + implement the policy above that interface rather than being imported by the + engine/store packages. The centralized TSO doc §9 Q4 and the learner doc + §8 OQ-5 both flag this as the open follow-on. + +### (d) Cross-group transactions at scale + +- **Per-group HLC vs centralized TSO.** Today the ceiling is renewed only on + the default group (§1.5); the TSO doc + (`docs/design/2026_04_16_proposed_centralized_tso.md`) proposes first the + near-term fix (renew on all led groups, in parallel — its M1) and then a + dedicated TSO Raft group with a batch allocator. + **Assessment of whether TSO is still needed once groups multiply:** the + near-term per-group fix (TSO doc §6) is *necessary regardless* — it is the + minimum correctness fix the moment a node leads a non-default group it is + member of, and it should land before multi-node multi-group bootstrap (b) + makes that topology reachable. The *full* dedicated-TSO component (TSO doc + M6–M7) is a larger component, but the need for a shared ordering source is + not a throughput/amortization question once cross-node cross-group + transactions are possible: with the per-group fix in place, every node's + timestamps are self-monotonic, but **global** monotonicity across coordinator + nodes is still not guaranteed without a shared oracle (TSO doc §6 + "Guarantee"). Cross-group transactions (`kv/transaction.go`, + `kv/txn_codec.go`) whose timestamps can be allocated by different ingress / + coordinator nodes need a single ordering source for OCC commit-ts + comparability, regardless of where the participating groups' current leaders + live. `LeaderProxy.Commit` / `Internal.Forward` preserve non-zero timestamps, + so leader co-location does not prove single-clock allocation. (The shared-snapshot + invariant — every operation in one txn reading at the *same* `startTS` — is + already upheld: `nextStartTS` allocates one `startTS` for the whole txn and + propagates it via `reqs.StartTS` to every participating group; the gap is the + cross-*coordinator* comparability of the per-txn `commitTS`, not the per-txn + `startTS`. See OQ-1.) So: per-group renewal fix is in-scope-soon and + load-bearing for one node leading multiple groups; before enabling any + cross-group transaction mode in which more than one coordinator node can issue + `startTS` / `commitTS`, the roadmap must either pull the dedicated TSO group + forward or land a narrower single-oracle bridge that pins cross-group + timestamp allocation to one designated leader. Until such txns are enabled, + the per-group fix plus the shared-`*HLC` property remains adequate. + +### (e) Cluster size / membership + +- **Learner / conf-change — primitive present.** `AddVoter` / `AddLearner` / + `PromoteLearner` / `RemoveServer` exist on the engine and the raftadmin CLI + (see (c) / §1). Single-step V1 conf changes only; joint consensus is + rejected (`validateConfState`). +- **Multi-node bootstrap — GAP** (same gap as (b)): the membership primitives + exist for *live* expansion (`AddVoter` after bootstrap), but there is no + startup wiring to declare a multi-node voter set per group, and no + integration harness for multi-voter groups across processes (PR #953 §1.1a; + Jepsen M5 runner records "True distributed multi-group is M6+ work", + `scripts/run-jepsen-m5-local.sh`). +- **Auto group creation — explicit non-goal today, long-term.** Both the + hotspot parent doc (§2.2.1) and M2/M3 (§2.2) state automatic Raft group + creation / membership orchestration is out of scope. The scheduler only + targets groups already in `--raftGroups`. Noting it as a long-term item: + elastic scale-out (add a node → automatically create/rebalance groups onto + it) needs an auto group lifecycle, which depends on multi-node bootstrap (b), + region balance (§3 Gap 4), and merge (a) all being in place first. + +### (f) Operational scaling + +- **keyviz** — the per-route load sampler is wired and allocation-free on the + write hot path (`observeMutation`, `kv/sharded_coordinator.go:1841-1846`), + with proposed extensions for cluster fan-out + (`2026_04_27_proposed_keyviz_cluster_fanout.md`), + subrange sampling (`2026_05_25_proposed_keyviz_subrange_sampling.md`), + hot-key top-K (`2026_05_28_proposed_keyviz_hot_key_topk.md`), and per-cell + conflict (implemented). It is the detection signal M3 reuses. Current + adapter-direct Redis/DynamoDB/S3 reads that hit `MVCCStore.GetAt` bypass this + coordinator sampler, so read-heavy hotspots remain invisible until read-path + sampling or an equivalent adapter read observation path is added. +- **admin** — admin dashboard / data browser / purge-queue are implemented + (`2026_04_24_implemented_admin_dashboard.md`, + `2026_05_22_implemented_admin_data_browser.md`, + `2026_05_16_implemented_admin_purge_queue.md`). +- **metrics cardinality** — recurring constraint across proposals (S3 + admission `protocol`/`stage` labels, SQS per-queue counters bounded by a + top-N sketch, keyviz `MaxTrackedRoutes` coarsening). Any new per-route / + per-queue / per-peer metric must carry a cardinality bound; this is a + cross-cutting operational-scaling rule, not a single design. +- **workload isolation** — `2026_04_24_proposed_workload_isolation.md` (heavy + command worker pool, optional Raft-thread pinning, per-client admission, + XREAD O(N)→O(new)), S3 PUT admission control + (`2026_04_25_proposed_s3_admission_control.md`), SQS per-queue throttling + (`2026_04_26_proposed_sqs_per_queue_throttling.md`). These bound *one + workload's* impact so it cannot starve the shared runtime / Raft control + plane; they scale a deployment by making it predictable under adversarial or + unbalanced load rather than by raising raw capacity. + +--- + +## 3. Gap list → recommended new designs (ranked by leverage) + +Each gap is a design doc that should be written (`*_proposed_*.md`). Ranked by +how much further scaling it unlocks. "Depends-on" lists hard prerequisites. + +### Gap 1 — Multi-node multi-group bootstrap (highest leverage) +**Problem.** Startup wiring can already bootstrap a *single* Raft group with +multiple voters via `--raftBootstrapMembers`, but it cannot express a +*multi-group* topology where each group has its own multi-node voter set +(`main.go:746` rejects `--raftBootstrapMembers` when `len(groups) != 1`, and +the multi-group path still falls back to single-member group bootstraps). That +multi-group limitation blocks write-throughput scaling beyond one group, leader +balance across groups, follower reads for every group, and every cluster-size +dimension that needs multiple replicated ranges. **Rough milestones:** (M1) +extend `--raftGroups` / add a per-group members flag to accept per-group +multi-node voter sets; lift the guard only for the explicitly modeled +multi-group form. (M2) integration harness that stands up multi-voter groups +across processes. (M3) Jepsen multi-group multi-node workload. **Depends-on:** +none for writing the design; rollout must land the HLC per-group ceiling +renewal fix first, as sequenced in §4, before enabling the topology that +exposes stale ceilings. *A companion proposal is in review as **PR #955** +(`docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md`, branch +`design/multinode-multigroup-bootstrap`); once it lands it is the authoritative +spec for this gap and this roadmap defers to it.* + +### Gap 2 — Shared Pebble cache / resource pools +**Problem.** Each group's store allocates a private 256 MiB block cache +(`store/lsm_store.go:273-279`); N groups = N × 256 MiB plus N memtables/WALs, +with no shared eviction. This caps how many groups (hence how many ranges) one +node can hold, throttling the "many small ranges" model that split produces. +**Rough milestones:** (M1) process-wide shared `pebble.Cache` plumbed through +`NewPebbleStore` (the existing TODO), with per-node sizing. (M2) shared +memtable / compaction concurrency budget across stores. (M3) per-group +fairness so one hot group cannot evict everyone else's working set. +**Depends-on:** none functionally; pairs naturally with Gap 1 (high group +counts only matter once multi-node groups exist). + +### Gap 3 — Follower / learner reads +**Problem.** Reads are leader-only; the learner primitive exists but its +consumer (off-leader reads) has no design. Read throughput cannot scale past +one node's leader capacity. **Rough milestones:** (M1) leader-issued read-ts +pipeline + a replica apply-watermark so a follower/learner can serve a +snapshot read at a leader-vouched timestamp (CLAUDE.md HLC rule), exposed +through optional nil-safe low-level delegate interfaces so raftengine/store +do not import `kv` / `distribution`. (M2) lease invalidation interaction + +staleness bound + adapter read routing to replicas. +(M3) Jepsen stale-read bound workload. **Depends-on:** Gap 1 (multi-node +groups to have a remote replica); the learner primitive (already in-tree). + +### Gap 4 — Region (range) balance scheduler +**Problem.** Leader balance (PR #953) spreads *leaderships*; nothing spreads +*data* (which node holds which range's replicas). After enough splits, ranges +pile up unevenly across nodes. This is TiKV's separate balance-region +scheduler; PR #953 §2.2(3) explicitly excludes replica/data movement. The M2 +migration plane can move *range ownership* between Raft groups, but it does not +by itself move replicas off an over-full node when the source and target groups +share the same voter set. A real region-balance design therefore needs both a +target-group placement constraint and, when no suitable group already exists, a +Raft membership-change / replica-placement step before or during migration. +**Rough milestones:** (M1) data-balance policy that chooses a target group whose +replica set actually reduces node skew, then reuses the M2 migration plane for +the range-ownership move only when that placement predicate holds. (M2) Raft +membership-change primitive for creating or reshaping groups when no existing +group has the needed placement. (M3) compose with leader balance so the two +schedulers don't fight. **Depends-on:** M2 migration plane (PR #945) for +range-ownership movement, a replica-placement / membership-change design for +per-node data movement, and Gap 1 (multi-node bootstrap) for somewhere to place +replicas. Edge: region balance ⟂ depends on M2 + multi-node bootstrap + replica +placement. + +### Gap 5 — Range merge +**Problem.** No way to recombine ranges; route count is monotonically +non-decreasing (§2(a)). Over-split or cooled workloads leak route fragments +and (once children are in their own groups) per-node memory. **Rough +milestones:** (M1) same-group merge of two adjacent cold routes (mirror M1 +split: catalog CAS + version bump). (M2) cross-group merge (relocate one +child's data into the other's group, reusing the M2 migration plane in +reverse). (M3) auto-merge in the M3 detector (cold-route hysteresis). +**Depends-on:** M2 migration plane for cross-group merge; the Composed-1 guard +for cutover coherence. + +### Gap 6 — Connection / transport scaling (streaming transport revival) +**Problem.** Raft inter-node messages use unary gRPC per message +(`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md` §1): each +send pays a full RTT, capping per-peer throughput at ~RTT⁻¹ regardless of +bandwidth. This bites exactly when multi-node multi-group (Gap 1) puts real +inter-node Raft traffic on the wire. **Rough milestones:** revive the existing +streaming-transport proposal (client-streaming `SendStream` per peer, biased +heartbeat select, backward-compat fallback). The blob-fetch RPC in the S3 +offload doc (§3.6) already wants to reuse the same chunked-streaming +abstraction, so landing both behind one transport layer is the leverage. +**Depends-on:** nothing hard; value scales with Gap 1. + +### Gap 7 — Auto group lifecycle (longest-term) +**Problem.** Groups are static (`--raftGroups`). Elastic scale-out (add node → +auto-create/rebalance groups) needs automatic group creation + membership +orchestration, an explicit non-goal everywhere today (§2(e)). **Rough +milestones:** out of near-term scope; sketch only. **Depends-on:** Gaps 1, 4, +5 all in place (you cannot auto-create groups before you can stand up +multi-node groups, move ranges between them, and merge fragments). + +--- + +## 4. Sequencing (dependency-ordered rollout) + +The ordering is driven by unblock-edges, not by perceived value in isolation. + +1. **HLC per-group ceiling renewal fix** (TSO doc §6 / M1). Smallest correct + change; closes the cross-group monotonicity gap (§1.5) *before* the + topology that exposes it exists. Land first so each group remains safe when + replicas move across nodes, but do not enable cross-group transactions whose + timestamps can be allocated by more than one coordinator node until step 11 + (or its single-oracle bridge) lands. +2. **Multi-node multi-group bootstrap** (Gap 1 / **PR #955**, + `2026_06_12_proposed_multinode_multigroup_bootstrap.md`). The root + unblocker for (b), (c), (e), Gap 3, Gap 4. Nothing else multi-node-shaped + can land until a group can have voters on more than one node. +3. **Leader balance scheduler** (PR #953). Its PR0 is exactly Gap 1; PR1 + (observe-only) can land against today's single-voter topology, but the + transfer-issuing PR2–PR3 are blocked on step 2. So: PR #953 PR1 in + parallel with step 2; PR2–PR3 after. +4. **Hotspot split M2 migration plane** (PR #945). The data-movement + mechanism every later data-balance/merge step reuses. Independent of the + multi-node work for its own correctness (it moves ranges between groups + that already exist), so it can proceed in parallel with steps 2–3, but its + *value* compounds once groups span nodes. +5. **Hotspot split M3 automation** (PR #951). Drives detection off keyviz; + delivers same-group auto-split standalone (does not require M2), and picks + a least-loaded target once M2 lands. After step 4 for the cross-group case. +6. **Shared Pebble cache** (Gap 2). Needed once split + multi-node lets a node + hold many groups; land before pushing high group counts in production. +7. **Follower / learner reads** (Gap 3). After step 2 (remote replicas exist) + and the learner primitive (already in-tree). +8. **Region balance scheduler** (Gap 4). After step 4 (migration plane), step 2 + (multi-node), and a replica-placement / membership-change design that can + reshape groups when existing target groups share the same voter set. + Complement to step 3's leader balance. +9. **Range merge** (Gap 5). After step 4 for cross-group merge. +10. **Streaming transport** (Gap 6). Any time after step 2 makes inter-node + Raft traffic significant; pairs with the S3 blob-fetch RPC. +11. **Dedicated TSO group or single-oracle bridge** (TSO doc M6–M7 / OQ-1) — + shown late only because the roadmap must first create node-spanning groups. + This placement is not a throughput amortization tradeoff: before enabling + any cross-group transaction mode whose `startTS` / `commitTS` can be + allocated by more than one coordinator / ingress node, step 2 must either + pull the dedicated TSO group forward or land a narrower bridge that + allocates all cross-group `startTS` and `commitTS` values from one + designated oracle. The + per-group fix gives per-node monotonicity only (TSO doc §6 "Guarantee"); + commit-ts comparability across coordinators on different nodes is not + covered by it (OQ-1), even when forwarding preserves timestamps to leaders + that happen to be co-located. Treat step 11 as + "deferred-pending-OQ-1-resolution", not "settled-last". +12. **Auto group lifecycle** (Gap 7) — long-term, after 2/4/8/9. + +In-flight PRs map cleanly: **#955** is step 2 (Gap 1 bootstrap proposal), +**#953** is step 3 (and its PR0 = step 2's intent), **#945** is step 4, +**#951** is step 5. + +### 4.1 Rolling-upgrade and live-cutover guardrails + +This roadmap does not introduce a cluster-version Raft entry as part of the +first sequencing slice; that broader coordination protocol should be its own +design if needed. The near-term mitigation is layered: + +1. **Capability-gated admin operations.** `raftadmin` / coordinator admin RPCs + that enable multi-node bootstrap, leader transfers, follower reads, + migration/import, write fences, learner admission/promotion, or a cross-group + timestamp bridge must first observe that every current voter and every target + learner/voter/server involved advertises the matching capability. Mixed + binary clusters run in compatibility mode with these features disabled. +2. **Operator-driven in-place expansion.** For clusters that can tolerate a + controlled maintenance window, upgrade all binaries first, verify capability + convergence, then expand one group at a time by adding a new replica as a + learner (`AddLearner`), waiting for its match/apply watermark to catch up, + and promoting it with the learner-promotion path. Keep the old single-voter + leader serving until the learner is caught up and promoted; reserve direct + `AddVoter` for bootstrap/offline fully-caught-up peers, not live in-place + expansion. Do not permit a flag-only restart to reinterpret an existing + single-voter group as multi-voter. +3. **Blue/green or bridge/proxy cutover for zero-downtime moves.** Deployments + that cannot accept the in-place operational window should use a fresh + multi-node cluster plus a temporary bridge/proxy mode: dual-write or + write-through to both sides, shadow-read / compare, then flip reads and retire + the old cluster. This mirrors the existing Redis migration pattern and the + TSO doc's feature-flagged shadow phase, without forcing every bootstrap PR to + carry a full migration coordinator. +4. **Deferred complex protocol.** If the capability checks above are too weak for + a later feature, the fix is not to overload this roadmap; write a separate + cluster-version / rolling-upgrade design and make that feature depend on it. + +--- + +## 5. Open Questions + +1. **Per-group HLC fix vs full TSO ordering.** Is the per-group renewal fix + (step 1) sufficient for cross-group OCC correctness in the multi-node + topology, or does the first node-spanning cross-group transaction force the + dedicated TSO group earlier than step 11? + + **Where the timestamps come from (grounded in code).** A cross-group txn's + coordinator issues *both* of its timestamps from one `*HLC` — `c.clock` on + the coordinator's node: + - `startTS` via `nextStartTS` (`kv/sharded_coordinator.go:1429`): + `Observe(maxLatestCommitTS(keys))` then `c.clock.NextFenced()`. The + `maxLatestCommitTS` floor is a per-key read against the store — it pins + `startTS` above the latest commit on *the keys this txn touches*, nothing + more. + - `commitTS` via `resolveTxnCommitTS` → `nextTxnTSAfter` + (`kv/sharded_coordinator.go:1102`, `:1376`): `c.clock.NextFenced()`, + re-allocated after `Observe(startTS)` if it did not strictly exceed + `startTS`. A caller-supplied `commitTS` is fed through `c.clock.Observe` + to keep the clock monotonic. + + Both calls go through the same `c.clock`. The apply-time OCC/ownership check + then compares these timestamps against stored `CommitTS` values (the + Composed-1 guard, `docs/design/2026_05_29_partial_composed1_cross_group_commit_guard.md` + §4.2(a)/§4.4; FSM `latest > startTS` write-conflict check). OCC + serializability depends on those `commitTS` values being **mutually + comparable** across all participating groups, and read-only participant shards + need the separate validation gate described below. + + **Why the per-group fix is necessary but not sufficient.** Today this is + safe only because every group shares the *same* process-wide `*HLC` + (single-node groups, §1.1/§1.5): one clock issues every timestamp, so all + `commitTS` are trivially comparable. The per-group renewal fix (step 1) + extends correctness to *one node leading several groups* — but its guarantee + is explicitly scoped: "all timestamps issued by a single node are strictly + monotonic … monotonicity across nodes that lead *different* groups is not + fully guaranteed without a shared TSO" (TSO doc §6 "Guarantee", §1.1). The + moment step 2 (multi-node bootstrap) makes it possible for two clients to + coordinate cross-group txns on *different nodes*, two concurrent cross-group + txns can draw `commitTS` from two different `*HLC` instances whose ceilings + were advanced independently. This is true even if the participating groups' + leaders are later co-located, because `ShardedCoordinator.Dispatch` and + `resolveTxnCommitTS` stamp the requests before `LeaderProxy.Commit` / + `Internal.Forward` forwards them. The `maxLatestCommitTS`/`Observe` floor + does not close this: it orders writes only on the specific keys read, not the + global commit order two unrelated cross-group txns need to be serializable + against each other. + + A second guardrail is independent of the timestamp oracle. Cross-group txns + with read-only participant shards currently rely on `validateReadOnlyShards`; + that path has a documented TOCTOU window between the linearizable barrier and + the `LatestCommitTS` check. A dedicated TSO group or single-oracle bridge + makes commit timestamps comparable, but it does not by itself close that + read-validation gap. The rollout gate must therefore either reject + read-only-shard txn shapes until a dedicated read-validate FSM phase lands, + or land that phase before enabling those shapes in a multi-node topology. + + **Conclusion / trigger.** OQ-1 therefore is *not* a "focused correctness + review deferrable until just before step 2 lands" — the per-group fix cannot + answer it in the affirmative for the cross-node case. The trigger to pull the + dedicated TSO group forward from step 11 is concrete: **the first + cross-group transaction mode whose timestamps may be issued by more than one + coordinator node**, not merely the first case whose participant leaders sit + on different nodes. Until then the per-group fix + shared-`*HLC` property + holds. Open sub-question: whether an interim measure short of the full TSO + group — e.g. pinning every cross-group txn's timestamp allocation to a single + designated group's leader (the default group's `c.clock`), so all cross-group + `startTS` and `commitTS` still come from one clock — can bridge the timestamp + gap between step 2 and step 11 without the full batch-allocator TSO. That interim option + must be paired with the read-only-shard validation gate above; it should be + evaluated as part of the step-2 design (PR #955) rather than left to step 11. +2. **Shared cache (Gap 2) vs per-group isolation (workload isolation doc).** A + single shared block cache trades isolation for density: one hot group can + evict a latency-sensitive group's working set. How does Gap 2's per-group + fairness reconcile with the workload-isolation proposal's CPU-side + reservations? They are the memory- and CPU-axis siblings and should share a + resource-accounting vocabulary. +3. **Merge (Gap 5) and unresolved prepares.** Merge must unify two MVCC + histories *and* two `!txn|…` keyspaces. What is the fence/drain protocol + that guarantees no in-flight prepare on either side is lost across the + merge cutover? (M2's split drain is the starting point but split bisects one + history; merge unifies two.) +4. **Region balance (Gap 4) signal.** Count-based (ranges per node) like + leader balance v1, or size/load-weighted from keyviz from day one? Leader + balance chose count-first; data balance may need size from the start + because a 1 GiB range and a 1 KiB range are not interchangeable. +5. **Follower-read staleness contract (Gap 3).** What bound does the adapter + surface advertise — bounded-staleness with an explicit lag, or + read-your-writes only? This determines whether the leader-issued read-ts + pipeline needs a per-client session token. +6. **Auto group lifecycle (Gap 7) trigger.** What signal creates a new group — + node join, aggregate range count crossing a threshold, or operator action? + Premature auto-creation interacts badly with merge (create/merge thrash). +7. **Live cutover / rolling-upgrade strategy for the single-node→multi-node + transition (Gap 1).** Moving a deployment from "one process hosts every + group, each single-voter" to genuine multi-node multi-group is a topology + change, not just a flag flip: a group that bootstrapped single-member must + add remote replicas as learners, wait for catch-up, and promote them through + the learner-promotion path (the primitives exist, §2(e)); direct live + `AddVoter` remains reserved for bootstrap/offline fully caught-up peers as + in §4.1. The cluster may run mixed binary versions mid-upgrade. What is the + supported path — operator-driven learner-add/promote expansion of an + existing single-voter group, blue/green with a dual-write proxy + (`proxy/`, the existing Redis-migration pattern), or a fresh cluster + + data migration? §4.1 defines the interim guardrails — capability-gated admin + operations, in-place expansion only after all binaries advertise support, + and bridge/proxy cutover for zero-downtime deployments — but PR #955 must + choose the supported default path and spell out rollback. (Note: the TSO + doc §7 already specifies a phased dual-write/shadow-read/feature-flag + cutover for the *timestamp* migration; the bootstrap cutover should mirror + that structure.)