docs(design): propose hotspot split M2 — migration plane#945
Conversation
Parent partial doc 2026_02_18_partial_hotspot_shard_split.md tracks M1 as shipped; M2-M4 are open. This proposal scopes M2 — the migration plane that moves a split child to a different Raft group. Scope: - SplitJob durable record under !dist|job|<id> with phase/cursor/ts fields so a leader flap resumes mid-flight. - BACKFILL / FENCE / DELTA_COPY / CUTOVER / CLEANUP state machine driven by a single migrator goroutine on the default-group leader, with per-phase recovery semantics enumerated in the matrix. - proto/distribution.proto extension: target_group_id on SplitRangeRequest; GetSplitJob / ListSplitJobs / AbandonSplitJob. - proto/internal.proto: ExportRangeVersions (server-streaming) + ImportRangeVersions (idempotent via persisted import_cursor). - store/MVCCStore.ExportVersions / ImportVersions with internal-key family dispatch and routeKey() equivalence assertion (covers !txn|, !lst|, !ddb|, !redis|, !sqs|, !s3|). - ErrRouteWriteFenced at both coordinator and FSM (defense in depth on the gRPC mesh + Raft apply path); reuses PR #927 / #932 Composed-1 alignment at CUTOVER. - 8-PR landing plan with five-lens self-review per PR. Non-goals (M3+): hotspot detection, auto-split scheduler, online merge, per-tenant policy. 7 open questions called out at the end for reviewer input — notably grace-period sizing, abandon-after-cutover policy, and backfill-vs-prepare-window semantics. Per the design-doc-first workflow in CLAUDE.md, this PR is the doc only; M2-PR1..PR7 follow once the proposal is accepted.
|
Warning Review limit reached
More reviews will be available in 23 minutes and 24 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, the refill rate gradually slows as usage increases. The highest same-day bursts are limited more strictly. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughDesign doc adding a Milestone‑2 migration plane: durable SplitJob state machine and phases, cross-group route/staging semantics, streaming MVCC Export/Import contracts and RPCs, CUTOVER/promote semantics, coordinator/FSM fencing and read-fence, migrator runtime, recovery matrix, test plan, and rollout gating. ChangesMilestone 2 Migration Design
Sequence Diagram(s)sequenceDiagram
participant Coordinator
participant SourceStore
participant TargetStore
participant Catalog
Coordinator->>SourceStore: ExportRangeVersions(job_id, start_cursor, KeyFilter)
SourceStore->>Coordinator: Stream(versions[], cursor)
Coordinator->>TargetStore: ImportRangeVersions(job_id, cursor, KeyFilter)
TargetStore->>TargetStore: persist staged MVCC pages (idempotent by job_id+cursor)
TargetStore->>Catalog: stage shadow keyspace (not bound in catalog until CUTOVER)
Coordinator->>Catalog: CUTOVER (atomic catalog rebind, commit route_version)
Catalog->>Coordinator: ack CUTOVER (route_version increment)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request proposes a design document for Milestone 2 of the Hotspot Shard Split feature, focusing on the migration plane. It outlines the data model additions (such as SplitJob), state machine transitions, RPC changes, MVCC range export/import, and coordinator/FSM integration. The review feedback highlights three key areas for improvement: defining a retention or garbage collection policy for completed SplitJob records to prevent unbounded storage growth, decoupling the low-level store package from high-level routing logic by using an interceptor or delegate pattern instead of directly referencing routeKey, and addressing rolling upgrade risks where older M1 nodes might not enforce the WriteFenced state by detailing mitigation strategies such as capability gates.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| ### 3.1 SplitJob | ||
|
|
||
| A split job is durable state for the migration. Persist under reserved keys in the default group, alongside the route catalog. | ||
|
|
||
| Key layout: | ||
|
|
||
| - `!dist|meta|next_job_id` — uint64, allocator for `job_id`. | ||
| - `!dist|job|<job_id>` — encoded `SplitJob` record. |
There was a problem hiding this comment.
The proposed design persists SplitJob records under !dist|job|<job_id> but does not specify a limit or a garbage collection/retention policy for completed (DONE, FAILED, or abandoned) jobs. Over time, as the cluster performs many hotspot splits, this collection can grow unboundedly. This poses a risk of unbounded storage growth and potential OOM issues when listing or scanning jobs (e.g., via ListSplitJobs).
To prevent unbounded memory and storage growth, please define a fixed bound or a retention/cleanup policy to automatically garbage collect or archive historical job records after a certain period or count.
References
- To prevent unbounded memory growth and potential OOM issues, apply a fixed bound to collections that can grow from external requests, such as pending configuration changes. Reject new requests when the bound is reached.
| ### 6.3 Internal-key coverage | ||
|
|
||
| §9 of the parent doc enumerates the key families. The exporter scans the raw storage range `[start, end)` filtered by **routeKey** equivalence to the moving range, so internal keys (`!txn|...`, `!lst|...`, redis hash/list/etc.) land in the same shard as their logical owner. The importer dispatches by `key_family` into the matching store helper to preserve any per-family invariant (e.g., list head pointer updates). | ||
|
|
||
| A simple safety check: every exported key must map back to the moving range under `routeKey()`. The exporter asserts this on every row before adding to the chunk; an assertion failure aborts the job and surfaces in `last_error`. This protects against a future internal-key family being added that the exporter forgot to teach. |
There was a problem hiding this comment.
The proposed design has the low-level store package (store/mvcc_store.go) directly referencing and calling routeKey() to filter keys and map them back to the moving range. This introduces tight coupling between the low-level generic storage package and high-level routing/domain logic.
Per the general rules, to prevent tight coupling between low-level generic packages and high-level domain packages, you should use an interceptor or delegate pattern with interfaces (or function values) defined in the low-level package. For example, ExportVersions can accept an optional, nil-safe key-filter function or delegate interface (e.g., func(key []byte) bool) injected from the high-level package, keeping the store package generic and decoupled.
References
- To prevent tight coupling between low-level generic packages and high-level domain packages, use an interceptor or delegate pattern with interfaces defined in the low-level package. Ensure the interceptor is optional (nil-safe) so that builds or environments with the high-level feature disabled can run without modification.
| | Lens | M2-specific risk | | ||
| |---|---| | ||
| | Data loss | CUTOVER atomicity; ImportRangeVersions idempotency; the BACKFILL/DELTA_COPY commit-ts windows have no gap. | | ||
| | Concurrency | Migrator goroutine vs leadership election; FENCE visibility timing on followers; AbandonJob race with phase advance. | | ||
| | Performance | Chunk size vs gRPC frame limit; pacing default vs prod throughput; cursor lookup cost per import. | | ||
| | Data consistency | Composed-1 alignment at CUTOVER; reader at `≤ fence_ts` on source vs reader at `≥ catalog_version+1` on target; internal-key route mapping. | | ||
| | Test coverage | Property tests over chunk-then-restart sequences; FSM unit for FENCE; Jepsen split + nemesis. | |
There was a problem hiding this comment.
The design introduces a significant operational risk during rolling upgrades. If a migration job is initiated while some nodes in the cluster are still running the older M1 version, those M1 nodes will not enforce the WriteFenced state (since the FENCE rejection logic is introduced in M2-PR5). This could allow writes to be accepted on the source group during the FENCE/DELTA_COPY phases, leading to silent write loss or inconsistency.
Per the general rules, when a design document identifies a significant operational risk like rolling upgrade incompatibility, it must detail potential mitigation strategies. Please update the design (e.g., in the self-review or as a new section) to include layered mitigations, such as using admin RPC capability gates to prevent starting any SplitRange migration jobs until all nodes have upgraded and advertised M2 capability.
References
- When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.
There was a problem hiding this comment.
Already addressed in §11.1 ("Rolling-upgrade compatibility") landed in commit 93de9b2: cluster-readiness gate at the SplitRange entry RPC keyed on the cap_migration_v2 capability bit in the existing distribution heartbeat, with M2-PR5 as the bit-advertise commit so the gate semantics are safe across rollout. Could you confirm this resolves the finding, or point at what's still uncovered?
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3aca90c833
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ├─ BACKFILL: right child created with state=MigratingTarget on target group; | ||
| │ source remains Active; writes still flow to source. | ||
| ├─ FENCE: source.right_child → WriteFenced; writes on the moving range |
There was a problem hiding this comment.
Avoid publishing overlapping migration routes
When a cross-group job enters BACKFILL, this plan creates the target right child while leaving the full source route Active, so the catalog snapshot contains overlapping ranges for the moving interval. The current engine does not ignore non-Active states when applying snapshots: routesFromCatalog converts every descriptor and validateRouteOrder rejects any prev.End > curr.Start with ErrEngineSnapshotRouteOverlap (distribution/engine.go:468-523). As written, watchers would fail to apply the first migration catalog snapshot, so the migration cannot progress through the proposed visible route states unless the design keeps the target out of the route snapshot or changes the overlap invariant.
Useful? React with 👍 / 👎.
|
|
||
| - `kv/fsm.go` already routes through `verifyComposed1` at apply. Extend `verifyOwnerFromSnapshot` to also reject when the resolved route's state is `WriteFenced` — even if a coordinator with a stale engine forwarded a write, the FSM closes the door. The error wraps `ErrRouteWriteFenced` so M4 retry logic differentiates from `ErrComposed1Violation`. | ||
|
|
||
| Reads are **not** fenced: `ShardStore.GetAt` continues to serve from source until CUTOVER. Snapshot reads at any `commit_ts ≤ fence_ts` remain consistent because the source's MVCC history is unchanged through DELTA_COPY. |
There was a problem hiding this comment.
Add a read fence for post-cutover stale routes
If a coordinator or source-group leader has not applied the CUTOVER catalog yet, leaving reads unfenced lets it keep routing the moved key to the source while new writes are already accepted on the target. The current read path resolves the group from the local engine and then serves/proxies the read without carrying an observed catalog version (kv/shard_store.go:38-53, kv/shard_store.go:1471-1477), while the Composed-1 guard cited here only checks mutations in verifyOwnerFromSnapshot (kv/fsm.go:753-773). During the CLEANUP grace window this can return the pre-cutover source value after a successful post-cutover target write, so the design needs a read-side route-version fence/redirect or another mechanism that stops stale source reads for the moved range.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md (2)
108-108: ⚖️ Poor tradeoffConfirm operational acceptance of one-way CUTOVER.
After CUTOVER completes, the migration cannot be rolled back without implementing a full reverse migration (explicitly out of scope for M2). This means if post-CUTOVER issues arise (e.g., target group instability, data corruption discovered later), the only recovery path is forward through CLEANUP or manual intervention.
Is this operational trade-off acceptable for M2, or should §12 Open Questions include a specific item about post-CUTOVER failure scenarios and operator runbooks?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` at line 108, The document currently states that CUTOVER is one-way and rollback requires a reverse migration; update the design doc to explicitly capture operational acceptance by either (A) adding a clear sentence in the CUTOVER paragraph that the team accepts the one-way behavior for M2 and lists recovery paths (CLEANUP, manual intervention, reverse migration out-of-scope), or (B) add a new item under §12 Open Questions titled "Post-CUTOVER failure scenarios and operator runbooks" that requests acceptance criteria, required runbooks, and escalation steps for issues like target-group instability or latent data corruption; reference the terms CUTOVER, PLANNED → BACKFILL, AbandonJob, and CLEANUP so reviewers can locate and verify the change.
258-261: ⚖️ Poor tradeoffVerify sufficiency of routeKey() assertion failure handling.
Lines 260-261 state that if an exported key fails the
routeKey()equivalence check, "an assertion failure aborts the job and surfaces inlast_error." Per §4 line 106, this puts the job intoFAILEDstate requiring manual operator intervention.This safety check protects against a future internal-key family being added without updating the exporter. However, silent job abortion may not provide sufficient visibility for this class of bug.
Consider:
- Is
last_errorsurfaced prominently enough (e.g., via metrics/alerts) to catch this during development/testing?- Should dev/test environments panic on this assertion to fail-fast rather than require operator log inspection?
- Should this be elevated in §12 Open Questions as a code-maintenance concern?
The defensive check is excellent; the question is whether the failure mode provides adequate feedback.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md` around lines 258 - 261, The routeKey() assertion currently aborts the exporter job and writes to last_error leaving the job in FAILED state; enhance visibility and fail-fast behavior by (1) emitting a dedicated metric/counter and structured error log whenever routeKey() mismatches occur (include routeKey, key_family, and chunk id) so monitoring/alerts can catch it, (2) adding a configurable dev-only mode on the exporter (e.g., Exporter devFailFast flag or ENV) that panics/throws immediately instead of only setting last_error to ensure fast feedback in testing, and (3) document this operational choice in the design notes (add to §12 Open Questions) so maintainers know to update alerting/ops runbooks; refer to routeKey(), exporter, last_error and the FAILED job state when making these changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md`:
- Around line 76-86: Add a language specifier to the fenced code block that
contains the migration state diagram (the block starting with "source.Active"
and the listed states BACKFILL/FENCE/DELTA_COPY/CUTOVER/CLEANUP) to satisfy
markdownlint MD040; update the opening fence from ``` to something like ```text
(or ```nohighlight) so the block is explicitly marked as plain text.
---
Nitpick comments:
In `@docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md`:
- Line 108: The document currently states that CUTOVER is one-way and rollback
requires a reverse migration; update the design doc to explicitly capture
operational acceptance by either (A) adding a clear sentence in the CUTOVER
paragraph that the team accepts the one-way behavior for M2 and lists recovery
paths (CLEANUP, manual intervention, reverse migration out-of-scope), or (B) add
a new item under §12 Open Questions titled "Post-CUTOVER failure scenarios and
operator runbooks" that requests acceptance criteria, required runbooks, and
escalation steps for issues like target-group instability or latent data
corruption; reference the terms CUTOVER, PLANNED → BACKFILL, AbandonJob, and
CLEANUP so reviewers can locate and verify the change.
- Around line 258-261: The routeKey() assertion currently aborts the exporter
job and writes to last_error leaving the job in FAILED state; enhance visibility
and fail-fast behavior by (1) emitting a dedicated metric/counter and structured
error log whenever routeKey() mismatches occur (include routeKey, key_family,
and chunk id) so monitoring/alerts can catch it, (2) adding a configurable
dev-only mode on the exporter (e.g., Exporter devFailFast flag or ENV) that
panics/throws immediately instead of only setting last_error to ensure fast
feedback in testing, and (3) document this operational choice in the design
notes (add to §12 Open Questions) so maintainers know to update alerting/ops
runbooks; refer to routeKey(), exporter, last_error and the FAILED job state
when making these changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 70d218ee-479f-4859-864a-0bdc51656929
📒 Files selected for processing (1)
docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md
… minor)
Codex P1 (line 80) — overlapping migration routes
Rewrote §3.2 state machine: catalog snapshots are kept strictly
non-overlapping throughout the job. BACKFILL/DELTA_COPY land in a
shadow keyspace (!dist|migstage|<job_id>|<raw_key>) on the target
group's MVCC store, outside the catalog. The catalog gets a NEW
WriteFenced route at FENCE-time (split out from source) and the
group-rebind happens atomically at CUTOVER. routesFromCatalog +
validateRouteOrder (distribution/engine.go:496-527) stay green.
RouteState.MigratingSource/MigratingTarget are reserved but unused
in M2.
Codex P1 (line 277) — post-CUTOVER read fence
Added §7.2: read-path now carries read_route_version (parallel to
ObservedRouteVersion on writes). New verifyReadOwner gate next to
verifyOwnerFromSnapshot returns ErrRouteMoved{new_route_version,
new_group_id} when a stale coordinator routes a read to source
after CUTOVER. Coordinator does WatcherRefresh + single retry.
CLEANUP grace window readFenceGrace=30s bounds the corner case.
The write-side Composed-1 gate is unchanged so PR #932's Jepsen
trust holds.
Gemini medium (line 51) — SplitJob unbounded growth
Added §3.1.1 + §4.2: live cap maxLiveJobs=64 enforced by
SplitRange; on terminal transition jobs move to
!dist|jobhist|<terminal_at_ms_be>|<id> with retention by
maxJobHistory=1k OR historyTTL=7d (whichever tighter); once-per-
minute sweep on default-group leader. ListSplitJobs supports
since_terminal_at_ms / phase filters and cursors.
Gemini medium (line 260) — store/routeKey tight coupling
Added store.KeyFilter delegate; ExportVersions takes the filter as
a nil-safe argument so the store package never imports kv or
knows about routing keys. routeKey() closure built in
kv/migrator_filter.go and passed in by the migrator. §6.3
rewritten with the explicit decoupling seam.
Gemini medium (line 346) — rolling-upgrade incompatibility
Added §11.1: node_capability_bitfield in the existing distribution
heartbeat; cap_migration_v2 bit gates cross-group SplitRange at
the entry RPC with ErrClusterNotReadyForMigration; in-flight
job-pause path on observed downgrade; bit is advertised only in
M2-PR5 (the commit that lands the FENCE rejection enforcement)
so the gate is safe across the rollout. Same-group split (M1)
remains unconditionally available.
CodeRabbit minor (line 86) — MD040 fenced-code-block language
All fenced blocks now carry a language tag (text/proto/go).
Replaced the bare ``` with ```text on the state-machine diagram
and consolidated to a single language across the diagram and
evolution table.
PR-plan note (§11): the rolling-upgrade gate dictates a hard
ordering between M2-PR4 (state machine) and M2-PR5 (FENCE
enforcement) — the capability bit only goes live in PR5.
|
@claude review — addressed all 6 findings from the gemini / codex / coderabbit reviews in commit 93de9b2: Codex P1 (line 80, overlapping routes) — §3.2 rewritten so catalog snapshots stay strictly non-overlapping. BACKFILL/DELTA_COPY land in a target-side shadow keyspace ( Codex P1 (line 277, post-CUTOVER read fence) — added §7.2 with Gemini medium (line 51, SplitJob unbounded growth) — added §3.1.1 / §4.2: Gemini medium (line 260, store/routeKey coupling) — Gemini medium (line 346, rolling upgrade) — added §11.1: CodeRabbit minor (line 86, MD040) — all fenced blocks now carry a language tag. |
|
Claude finished @bootjp's task in 3m 26s —— View job Review of commit 93de9b2 — addressing 6 prior findingsTodo list
Verified: all 6 prior findings are addressed
New observations on the updated designP1 — CUTOVER two-step ordering undefined (data-visibility gap)§3.2 (line 139) says:
But the CUTOVER step as described consists of two independent Raft proposals on different groups:
The design does not specify the ordering of these two. If (1) commits before (2), reads routed to the target during the rename window return empty results: the catalog says the target owns the range, but the target's MVCC only has shadow keys. The read fence in §7.2 only redirects stale source reads; it doesn't compensate for the target not yet having live data. The safe ordering is: (2) commits on the target first, then (1) executes on the default group — by the time routing switches, all data is live on the target. But:
Suggest: add a §7.5 (or a sub-bullet in §4's CUTOVER row) that specifies: "Target bulk-rename commits first ( Medium — Grace window inconsistency (§7.2.4 vs §4 vs Open Question 3)Three places cite the CLEANUP grace default and they disagree:
Pick one and make them consistent. Whichever wins should also answer Open Question 3 (or at least note the reasoning). For reference: Medium — "Wait until applied" in §7.2.2 under-specified
How is this wait implemented? Options include an applied-index condition variable, a short retry loop with a timeout, or a channel signalled by the FSM apply goroutine. The choice affects:
A one-sentence note (e.g., "implemented as a condition-variable wait on the FSM's appliedIndex, timedout by Minor —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 93de9b2d2e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // ImportVersions writes the given versions idempotently. A second call | ||
| // with the same (job_id, cursor) MUST be a no-op past the recorded | ||
| // cursor — the importer records last-applied cursor under | ||
| // !dist|job|<id>|import_cursor so a network retry doesn't double-write. | ||
| ImportVersions(ctx context.Context, jobID uint64, versions []MVCCVersion, | ||
| cursor []byte) ([]byte, error) |
There was a problem hiding this comment.
Advance target HLC when importing versions
When the source range contains commit timestamps higher than the target group's current LastCommitTS/HLC, this import contract only writes versions and an idempotency cursor, but does not require the target to observe the max imported commit_ts. Since the existing FSM observes commit timestamps only after normal pb.Request applies, a migration import path can leave the target clock below the staged data; after CUTOVER the first target write can receive a lower timestamp than an imported older value, causing MVCC time travel where later reads resurrect the pre-cutover value. The import apply needs to atomically advance metaLastCommitTS and the target HLC (or otherwise fence new writes) to at least the max imported timestamp before the route becomes active.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 17b0719 — added §6.2.1 "Target HLC advancement": ImportVersions now atomically advances both metaLastCommitTS and the HLC ceiling (hlc.SetPhysicalCeiling) to max(batch.commit_ts) under the same FSM apply lock as the MVCC writes. SplitJob carries a monotone max_imported_ts witness, and CUTOVER is gated on a final SetPhysicalCeiling(max_imported_ts) proposal on the TARGET group before the catalog CAS — so post-CUTOVER Next() is provably greater than every imported ts. Reuses the existing Composed-1 monotone-ceiling primitive (PR #927) — no new fence.
| | `snapshot_ts` | uint64 | HLC commit-ts pinned at BACKFILL entry. | | ||
| | `fence_ts` | uint64 | HLC commit-ts pinned at FENCE → DELTA_COPY entry. | | ||
| | `cutover_version` | uint64 | Catalog version produced by the CUTOVER write; populated at CUTOVER, used by §7.2 read-fence and §4.2 GC. | | ||
| | `cursor` | bytes | Next-key cursor for BACKFILL / DELTA_COPY resume. | |
There was a problem hiding this comment.
Make the export cursor address MVCC versions
The job cursor is defined as a next-key cursor, but an MVCC export can contain many committed versions for the same raw key. If a hot key's history exceeds chunkBytes, resuming by key alone either has to keep that entire key history in one unbounded chunk or risks skipping/duplicating the remaining versions after a restart; the API should make the cursor opaque over the encoded MVCC position (raw key plus commit timestamp) or explicitly require chunks never split a key's full version history.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 17b0719 — added §6.1.1 "Cursor granularity": the cursor now encodes (raw_key, commit_ts) and the exporter iterates (raw_key ASC, commit_ts DESC) with exclusive-resume semantics. Chunks may contain only versions of one raw key without an all-or-nothing constraint on the history — the next chunk continues at the next-older commit_ts of the same key. The SplitJob row was updated accordingly. Opaque encoding so the store-side type stays []byte and the exporter alone owns the codec (preserves the §6.1 store/kv decoupling seam).
…x P1/P2) Codex P1 on PR #945 (line 323) — Advance target HLC when importing Source range can contain commit_ts > target group's current HLC / metaLastCommitTS. Import-only versions-and-cursor writes leave the target clock behind; after CUTOVER the first target write at Next() can collide with imported ts and resurrect pre-cutover values on a snapshot read. Fix: ImportVersions atomically advances metaLastCommitTS AND calls hlc.SetPhysicalCeiling(batchMax) under the same FSM apply lock as the MVCC writes; SplitJob carries a monotone max_imported_ts witness; CUTOVER is gated on a final SetPhysicalCeiling(max_imported_ts) proposal on the TARGET group before the catalog CAS. Reuses the existing Composed-1 / HLC monotone-ceiling primitive — no new fence. New unit + property tests pin (a) per-batch advance, (b) CUTOVER gap-closing proposal, (c) strict Next() > max_imported_ts post-CUTOVER. Codex P2 on PR #945 (line 66) — Cursor must address MVCC versions A "next-key" cursor cannot resume safely on a hot key whose version history exceeds chunkBytes. New §6.1.1 spec: opaque cursor encodes (raw_key, commit_ts), iterated in (raw_key ASC, commit_ts DESC) order with exclusive-resume semantics. Chunks may contain only versions of one raw key; the next chunk continues with the next-older version of the same key — no all-or-nothing constraint on a key's history. Idempotent ack at the importer dedups any one resume overlap. SplitJob row (§3.1) gains `max_imported_ts uint64` and the cursor field description is updated. §3.2 state-machine diagram now shows the SetPhysicalCeiling proposal as a precondition of CUTOVER. No caller-visible wire change beyond the existing M2 proto deltas.
|
@claude review — round 2:
|
|
Claude finished @bootjp's task in 2m 51s —— View job Review — Round 2Todo list
Round-2 items — all three verified
Remaining items from prior reviewP1 — CUTOVER bulk-rename ordering still implicit (unresolved)The prior review flagged that the CUTOVER step on two different Raft groups must be sequenced: target bulk-rename first, then catalog CAS — otherwise there is a window where the catalog says the target owns the range but the target's MVCC only has shadow keys, making reads to the target return empty results. The current doc has moved closer but the ordering is still not stated explicitly in the §3.2 CUTOVER description. §6.2.1 (line 367) says It does not list The safe and intended ordering is:
Steps 2 and 3 are specified, step 1 is not. Without step 1 appearing in the CUTOVER sequence, an implementer working from §3.2 alone could issue the catalog CAS before the rename is applied, creating the data-availability gap. The leader-flap resume case (new migrator checking whether rename already landed before re-proposing the catalog write) also needs to be specced. Suggested fix: add a Medium — Grace window inconsistency (still unresolved)Three locations cite the CLEANUP grace default and they still disagree after commit 17b0719:
§7.2.4 was updated with a rationale ("comfortably exceed both lease TTL and watcher tick × 2") but the value there (30 s) still disagrees with the other two (60 s). Pick one and update all three; if 30 s wins, close Open Question 3 with the chosen rationale. Minor —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 17b0719055
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| FSM-side defense in depth: | ||
|
|
||
| - `kv/fsm.go` already routes through `verifyComposed1` at apply. Extend `verifyOwnerFromSnapshot` to also reject when the resolved route's state is `WriteFenced` — even if a coordinator with a stale engine forwarded a write, the FSM closes the door. The error wraps `ErrRouteWriteFenced` so M4 retry logic differentiates from `ErrComposed1Violation`. |
There was a problem hiding this comment.
Reject fenced writes even when the route version is unpinned
This FSM-side fence is described as an extension of verifyOwnerFromSnapshot, but today verifyComposed1 returns before that helper when ObservedRouteVersion == 0. There are still intentional zero-version write paths (for example read/write txns with ReadKeys, caller-supplied StartTS, or resolver-claimed keys in kv/sharded_coordinator.go:694-754), so a stale coordinator in those contexts can forward a write to the source during FENCE and the proposed FSM defense will not run. The fence check needs to be on an unconditional apply path, or the design needs to require all mutation flows to carry a nonzero observed route version before cross-group split is enabled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit c16e1c4. §7.1 now introduces verifyRouteNotFenced as a SEPARATE, unconditional pre-gate ahead of verifyComposed1 — not an extension of verifyOwnerFromSnapshot. It scans mutations with routeKey() against the current catalog snapshot and rejects ErrRouteWriteFenced for any key whose route is WriteFenced, with the same Phase_ABORT bypass and NO dependency on ObservedRouteVersion. So the zero-version write flows in kv/sharded_coordinator.go:694-754 (ReadKeys, caller-supplied StartTS, resolver-claimed keys) are covered — the gate runs on every applied write request unconditionally. The two gates stay independent so a future change to one cannot regress the other.
| [ target.Active(right) ] | ||
| ``` | ||
|
|
||
| Key invariant: **catalog snapshots never contain overlapping ranges**. The target shadow keyspace lives outside the catalog (BACKFILL / DELTA_COPY land into private `!dist|migstage|<job_id>|<raw_key>` keys on the target group's MVCC store; CUTOVER promotes the staged data to live keys via a single bulk-rename FSM apply). This keeps `routesFromCatalog` + `validateRouteOrder` (`distribution/engine.go:496-527`) green throughout the migration — addresses codex P1 on overlapping routes (PR #945 review). RouteState `MigratingSource` / `MigratingTarget` are reserved but unused in M2; they remain available for a future merge / multi-stage migration design. |
There was a problem hiding this comment.
Avoid an unbounded CUTOVER bulk rename
Staging every imported row under !dist|migstage|... and then promoting the whole range via a single FSM apply makes CUTOVER proportional to the entire migrated range. For any split larger than the Raft proposal/apply budget, this defeats the chunked BACKFILL/DELTA_COPY design and can block or fail exactly when the catalog must switch atomically. The design should avoid a per-key bulk move at cutover, or make the staged-to-live promotion incremental while preserving the visibility fence.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit c16e1c4. New §6.4 "Incremental staged-to-live promotion": CUTOVER is O(1) — just a catalog version bump + route swap + set staged_visibility_active on the new target route, NO per-key work. Reads use a staged-fallback through the per-route bit so imported data is visible the instant CUTOVER lands. A leader-local promoter on the target group walks !dist|migstage|<job_id>|* in cursor-resumable chunks (same chunkBytes / pacing as BACKFILL) and proposes one bounded PromoteStaged Raft entry per batch — never a single oversized apply. CLEANUP → DONE waits for the promoter to drain AND the grace window AND the staged_visibility_active flag clear. SplitJob picks up promote_cursor + promotion_completed_ts for resumability and audit.
…945) Codex P1 on PR #945 (line 407) — FENCE must run on unconditional path verifyComposed1 returns before verifyOwnerFromSnapshot when ObservedRouteVersion == 0, and several intentional zero-version write paths exist today (read/write txns with ReadKeys, caller- supplied StartTS, resolver-claimed keys in kv/sharded_coordinator.go:694-754). The naïve "extend verifyOwnerFromSnapshot to reject WriteFenced" placement would silently skip those paths — a stale coordinator could forward a write to the source during FENCE without the FSM defense ever running. Fix in §7.1: introduce verifyRouteNotFenced as a separate, unconditional pre-gate ahead of verifyComposed1. It scans mutations with routeKey() against the current catalog snapshot and rejects ErrRouteWriteFenced for any key whose route is WriteFenced, with the same Phase_ABORT bypass and no dependency on ObservedRouteVersion. Two independent gates, write side unchanged Composed-1 trust preserved. Codex P2 on PR #945 (line 147) — CUTOVER bulk rename is unbounded Staging every imported row under !dist|migstage|... and promoting via a single FSM apply makes CUTOVER O(migrated range) — defeats chunked BACKFILL/DELTA_COPY exactly when the catalog must atomically switch. New §6.4 ships incremental promotion: 1. CUTOVER apply is O(1): catalog version bump + route swap + set staged_visibility_active on the new target route. 2. Reads use a staged-fallback through the per-route bit so imported data is visible the instant CUTOVER lands. 3. A leader-local promoter on the target group walks the staged prefix in cursor-resumable chunks, copying each version to its live position with one bounded PromoteStaged apply per batch. 4. CLEANUP → DONE precondition extends to "promoter empty AND grace window elapsed AND staged_visibility_active cleared", so route is indistinguishable from any Active route once promotion finishes. SplitJob row picks up promote_cursor (opaque) and promotion_completed_ts (diagnostic). §3.2 state-machine diagram updated: CUTOVER no longer says "bulk-rename"; the staged-to- live work is explicitly background and incremental. Both fixes preserve the §6.2.1 HLC ceiling advance and §7.2 read fence guarantees — incremental promotion is a strict superset of the chunked BACKFILL/DELTA_COPY semantics applied at a different phase.
|
@claude review — round 3 (2 codex findings on commit 17b0719):
|
|
Claude finished @bootjp's task in 3m 56s —— View job Review — Round 3Todo list
Round-3 items — both verified
Prior round-2 P1 — CUTOVER two-step ordering: resolved by §6.4The round-2 P1 flagged that "target bulk-rename must commit before the catalog CAS" to avoid a data-availability gap. §6.4 eliminates the bulk-rename entirely: CUTOVER sets The CUTOVER sequence (§3.2, lines 120–134) now correctly specifies: New P1 — §6.4 step 3 "no-op when a later live write has shadowed it" can lose MVCC history§6.4 step 3 (line 411) says the Scenario:
The "shadow" optimization is safe only for a last-write-wins (non-MVCC) store. In an MVCC store with snapshot read semantics, every version of a key must be preserved across its valid range. The correct guard is idempotency: "no-op if the staged version has already been copied to the live namespace (checked by the presence of a live version at the same commit_ts)." The fix is two parts:
The MVCC compaction GC ( Remaining from round 2 (three items still open)Medium — Grace window inconsistency (unresolved, 3rd flag)Three locations still disagree after commit c16e1c4:
§7.2.4 has a clear rationale for 30 s. If 30 s is the chosen value: update §4 line 168 to 30 s and close Open Question 3 with the chosen value + rationale. If 60 s is the chosen value: update §7.2.4 to 60 s and reconcile the rationale. Either way, one pass to make all three locations consistent and close OQ3. Minor —
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c16e1c4335
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // ExportVersions iterates committed versions in the half-open range | ||
| // [start, end) whose commit_ts is in (minCommitTS, maxCommitTS], |
There was a problem hiding this comment.
Scan internal families outside raw route bounds
For splits whose logical range is a normal user-key interval such as [foo, bar), this raw [start, end) MVCC iterator never visits keys like !redis|str|foo, !lst|...foo, or !txn|lock|foo, because those prefixes sort outside the user-key range before the routeKey() filter can accept them. That means the proposed migration can cut over without copying internal state that routes to the moved key, causing data loss or broken adapter invariants; the export needs to scan the relevant internal-family prefixes or use a route-key-indexed traversal rather than relying on a raw-range scan plus filter.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 231e9a5. New §6.3.1 "Internal-family brackets": migration now runs MULTIPLE parallel exports per phase, one per relevant raw-key family (user-key, !txn|lock|, !txn|cmt|, !txn|rb|, !txn|meta|, !txn|int|, !lst|meta|, !lst|itm|, !redis|, !ddb|item|, !ddb|gsi|, !sqs|, !s3|), each driven by the same KeyFilter so routeKey() remains the single source of truth. PlanExportBrackets() returns the bracket list; per-bracket opaque cursors + done flags on the SplitJob allow independent resumability. Bracket count is small (~12 today) and fan-out parameter caps parallelism. A future adapter MUST add its bracket at the same PR that teaches routeKey() its family.
|
|
||
| #### 7.2.2 Source FSM rejects reads after CUTOVER | ||
|
|
||
| When a source-group FSM applies a lease read whose `read_route_version` is **strictly less than** the FSM's last-applied catalog version, and the requested key's `routeKey()` maps to a route the FSM no longer owns, the FSM returns `ErrRouteMoved{new_route_version, new_group_id}` instead of the value. The error is mapped to a retryable status (gRPC `codes.FailedPrecondition` + detail) and the coordinator retries after a `RouteEngine` refresh. |
There was a problem hiding this comment.
Reject reads when both caller and source are stale
This condition only rejects when the request version is older than the source FSM's local catalog version, so it misses the common post-CUTOVER case where both the coordinator and the source group have only applied the pre-CUTOVER/FENCE version. In that window another coordinator that has refreshed can already write the key on the target, while this stale coordinator routes a read to the stale source and the source serves the old value because read_route_version == local_version; the read fence needs an authoritative cutover/ownership check that also catches equal-stale source reads during the grace window.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 231e9a5. §6.4 step 2 rewritten: the staged area now participates in MVCC MERGE — newest commit_ts wins. The read takes the newest live version with commit_ts ≤ read_ts AND the newest staged version with commit_ts ≤ read_ts and returns whichever has the greater commit_ts. Four read_ts cases enumerated to prove correctness (live_write.ts > max_imported_ts > staged.ts; merge correctly surfaces live for read_ts ≥ live.ts and staged for read_ts ∈ (max_imported_ts, live.ts)). Per-family adapter invariants preserved via the key_family dispatch.
|
|
||
| No iteration over the migrated key range, no bulk rename, no per-key proposals. The Raft proposal carrying this apply is `O(1)` in the migrated data size — bounded by a handful of catalog descriptors and the SplitJob record. | ||
|
|
||
| 2. **Staged keyspace stays visible through a per-route flag.** After CUTOVER the target FSM's read path checks the route's `staged_visibility_active` bit. When set, a read for key `K` in the moved range first looks up `!dist|migstage|<job_id>|<K>` and falls back to the live key only if the staged entry is absent. (Writes always land in the live keyspace — the CUTOVER bump already routed writers to the target group via the catalog, and §6.2.1's HLC ceiling advance guarantees their `commit_ts` is strictly greater than every imported `commit_ts`, so a live write shadowing a staged version is the *correct* MVCC visibility.) This gives reads access to the imported data the instant CUTOVER lands, with **no promotion work blocking CUTOVER itself**. |
There was a problem hiding this comment.
Merge staged and live MVCC versions on reads
Looking in the staged keyspace first and falling back to live only when staged is absent makes post-CUTOVER reads return stale imported versions after any live write to the same key, until promotion deletes the staged entry. For a read timestamp after the live write's commit timestamp, both staged and live versions exist and the newest live version must win, so the read path needs to merge/compare the staged and live MVCC candidates by commit timestamp instead of treating staged presence as authoritative.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 231e9a5. §7.2.2 replaces the strict-less-than version check with an authoritative ownership check. When source has applied CUTOVER → unconditional ErrRouteMoved. When source has NOT applied CUTOVER but the heartbeat-piggybacked max_cutover_version_seen indicates a possible stale read → run GetRouteOwnership(routeKey(K), max_cutover_version_seen) against the default-group leader (cached one second). New §7.2.2a walks through the exact 7-step equal-stale scenario (t=0..6) showing the watermark + ownership-query path catches the case the original rule missed.
Codex P1 round-4 on PR #945 — three fundamental design issues in the §6.3 / §6.4 / §7.2 designs landed in c16e1c4: P1 line 310 — Scan internal families outside raw route bounds A single raw [start, end) iterator + routeKey filter misses every internal-family raw key (!txn|lock|foo, !lst|meta|foo, !redis|str|foo, !ddb|item|<tbl>|foo, !sqs|..., !s3|...) because those prefixes sort outside the user-key band — the filter rejects nothing because the iterator never visits those bytes. Without the fix, CUTOVER would leave intent locks, list metadata, and adapter-private state behind on the source. Fix in new §6.3.1 "Internal-family brackets": migration runs multiple parallel exports per phase, one per relevant raw-key family, driven by the same KeyFilter (routeKey() remains the single source of truth). The migrator computes the bracket list from the moving range's logical scope via PlanExportBrackets; each bracket gets its own opaque cursor and done flag on the SplitJob. Bracket count is small (~12 today) and bounded; the fan-out parameter caps parallelism. A future adapter MUST add its bracket at the same PR that teaches routeKey() its family — both are required for end-to-end coverage. P1 line 408 — Merge staged and live MVCC versions on reads The §6.4 step 2 "staged-first, fall back to live when staged is absent" form returned stale imported versions after any live write to the same key. After the §6.2.1 HLC advance, live writes have commit_ts > max_imported_ts; for read_ts ≥ live_write.ts the live version is the most-recent committed, but the fallback rule would still return the older staged version because the staged entry exists until promotion. Fix in §6.4 step 2: the staged area participates in MVCC merge — newest commit_ts wins. The read takes the newest live version with commit_ts ≤ read_ts AND the newest staged version with commit_ts ≤ read_ts and returns whichever has the greater commit_ts. This is the standard MVCC visibility rule applied to a second column-family-like source, not stage-shadows-live. Four read-ts cases enumerated to prove correctness; per-family adapter invariants preserved via the key_family dispatch (§6.3.1). P1 line 459 — Reject reads when both caller and source are stale The strict-less-than 'read_route_version < source_local_version' rule misses the equal-stale window: both coordinator and source pinned at the pre-CUTOVER version, while a different refreshed coordinator has already written to the target. Source's local snapshot says it still owns the key, so the read fence passes and serves the stale value. Fix in §7.2.2: replace version comparison with an authoritative ownership check scoped on whichever snapshot exists. When source has applied CUTOVER → unconditional ErrRouteMoved. When source has NOT applied CUTOVER but the heartbeat-piggybacked max_cutover_version_seen indicates a possible stale read → run GetRouteOwnership(routeKey(K), max_cutover_version_seen) against the default-group leader, cache one second. New §7.2.2a walks through the exact 7-step equal-stale scenario to demonstrate the fix catches the case the original rule missed. All three are fundamentally correctness fixes — the original wording would have produced silent data loss (310), MVCC visibility regression (408), or stale reads (459) in production. The fixes preserve the catalog-stays-non-overlapping invariant (§3.2), the §6.2.1 HLC advance, and the §7.2 read-fence scaffolding; only the under-spec'd details are tightened.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 231e9a5729
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| 1. **Per-batch advance.** On every apply of an `ImportVersions` batch, the target FSM computes `batchMax = max(versions[i].commit_ts)` and **atomically** (under the same FSM apply lock that mutates the MVCC store): | ||
| - sets `metaLastCommitTS = max(metaLastCommitTS, batchMax)`, and | ||
| - calls `hlc.SetPhysicalCeiling(batchMax)` (the same Raft-agreed primitive used by the M1 HLC renewal proposer, `kv/coordinator.go:644-669`). `SetPhysicalCeiling` is already monotone — a lower argument is a no-op — so duplicate / out-of-order batches are safe. |
There was a problem hiding this comment.
Convert HLC timestamps before setting the physical ceiling
This passes batchMax/max_imported_ts as a full HLC commit timestamp into hlc.SetPhysicalCeiling, but the current HLC stores that field as Unix milliseconds and Next() later shifts it as the physical part (kv/hlc.go). For any imported timestamp with logical bits set or even a normal ms<<16 encoding, this advances the target ceiling to an artificial far-future millisecond, causing subsequent writes to jump to roughly commit_ts<<16 and weakening the lease-ceiling fence until wall time catches up. The import path should either Observe the full HLC timestamp or pass only its physical millisecond component to SetPhysicalCeiling.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 44e6125. §6.2.1 step 1 now passes hlcPhysicalMs(batchMax) = int64(batchMax >> 16) to SetPhysicalCeiling (the same shift HLC.Next() uses to extract the physical half, kv/hlc.go:128-145) AND calls hlc.Observe(batchMax) for the local clock's last-issued value. Step 3 (CUTOVER precondition) uses the same conversion. Both functions are monotone so duplicates/out-of-order are safe.
|
|
||
| The migrator runs the brackets **in parallel up to `--migrationExportFanout` (default 4)** within a single phase (BACKFILL or DELTA_COPY): each bracket gets its own opaque cursor (the §6.1.1 codec is extended to key on `(family, raw_key, commit_ts)`), so a bracket can be paused/resumed independently and the §9 resumability matrix applies per bracket. The migrator records per-bracket `cursor` and `done` flags on the SplitJob; the phase advances only when **every bracket** reports `done = true`. Cost is at most `fanout`× the gRPC frame budget but is bounded by the total bracket count (~12 today), not by data volume. | ||
|
|
||
| A bracket whose family doesn't actually intersect the moving range (e.g. an SQS migration on a Redis-only deployment) returns its first chunk empty and `done=true` immediately — no wasted scan work. The `keys per bracket` distribution is a §7.3 / §11.x metric so operators can spot a runaway family scan. |
There was a problem hiding this comment.
Bound sparse internal-family export scans
This claims an internal-family bracket with no matching keys returns immediately, but the proposed brackets are family-wide and the only narrowing happens after iteration through routeKey(). In a cluster with a large SQS/Redis/etc. prefix outside the moving range, an export for a range with no accepted keys must scan that whole raw prefix to prove done=true because rejected rows do not count toward chunkBytes, making a single chunk unbounded in CPU/I/O and defeating the migration pacing guarantees. The exporter needs a route-key-indexed traversal or a separate scan budget that advances the cursor over rejected rows.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 44e6125. §6.3.1 ships two layers: (1) Scan-budget pacing — each ExportVersions call carries a separate maxScannedBytes (default 4× chunkBytes) that counts both accepted and rejected rows; iterator returns at the budget with cursor advanced over the rejected tail. (2) Per-family EncodeBracketStart hook tightens family-wide bounds to the moving-slice band where the family layout allows it. Layer (1) alone is sufficient for correctness; (2) is the optimization per adapter PR. New metrics: keys per bracket + rejected_rows per bracket.
|
@claude review — round 4 (3 codex P1 on commit c16e1c4):
|
|
Claude finished @bootjp's task in 5m 5s —— View job Claude finished @bootjp's task in 3m — View job Review — Round 4Todo list
Round-4 items — all three verified
New P1 —
|
| Location | Line | Value |
|---|---|---|
| §4 table, CLEANUP row | 168 | 60 s |
| §7.2.4 | 566 | 30 s (with rationale: "comfortably exceed both lease TTL and watcher tick × 2") |
| §12 Open Question 3 | 681 | 60 s |
§7.2.4 has the rationale for 30 s. Pick one value, update all three locations, and close Open Question 3.
Minor — !dist| prefix on target-group import cursor
Line 349 stores the import cursor as !dist|job|<id>|import_cursor in the target group's MVCC store. !dist| is the default-group catalog namespace throughout the codebase. Future operator tools and distribution/catalog.go's key parsers would treat this as a catalog entry. Rename to a target-scoped prefix: !migstage|cursor|<id>.
Medium — "Wait until applied" still a one-liner
Line 544: "the FSM treats read_route_version > source_local_version as a 'wait until applied' (bounded by lease TTL) before answering" — still unspecified. The implementation mechanism (condition variable on appliedIndex signalled by the apply goroutine? short retry loop?) and the timeout error type (does the timeout surface as ErrRouteMoved or a distinct ErrTimeout?) remain unspecified. A one-sentence elaboration would close this: e.g., "condition-variable wait on the FSM's appliedIndex signalled by the apply goroutine, timedout by min(remaining_lease_TTL, 200ms); on timeout, re-evaluate ownership — if still not owned, return ErrRouteMoved."
Summary
All three round-4 items (codex P1s on commit c16e1c4) are properly verified as resolved. One new P1 emerges from the latest codex review on commit 231e9a5: SetPhysicalCeiling(batchMax) receives a full HLC timestamp instead of Unix ms, confirmed by reading kv/hlc.go — passing a full HLC value would set the ceiling ~3.5 billion years ahead and corrupt every subsequent Next() call. The correct call is SetPhysicalCeiling(int64(batchMax >> hlcLogicalBits)).
The round-3 P1 (PromoteStaged "no-op when shadowed" drops MVCC history) remains unresolved for the fourth consecutive round — the "shadowed" clause on line 477 still exists and is still wrong. Three editorial items from round 2 (grace window inconsistency, !dist| prefix, wait-until-applied spec) and one new P2 (sparse family bracket scan unbounded) round out the remaining work before implementation can start cleanly.
…dex round-5) Codex P1 round-5 on PR #945 (line 367) — HLC SetPhysicalCeiling units hlc.SetPhysicalCeiling takes int64 Unix milliseconds (kv/hlc.go:222), but the previous spec passed the full HLC encoded commit_ts (ms<<16 | logical) as the argument. With logical bits set or any normal ms<<16 encoding that would advance the ceiling by 65 536× the intended milliseconds — weakening the lease fence until wall time catches up. Fix §6.2.1 steps 1 and 3: pass hlcPhysicalMs(batchMax) = int64(batchMax >> 16) (the same shift HLC.Next() uses to extract the physical half) AND call hlc.Observe(batchMax) so the local clock's last-issued value tracks the full HLC ts. Both calls are monotone so duplicates and out-of-order batches are safe. Codex P2 round-5 on PR #945 (line 445) — sparse internal-family scans The earlier 'returns its first chunk empty and done=true immediately' claim only holds when the iterator can short-circuit; under the §6.1 contract accept rejections do NOT count against chunkBytes, so a family-wide raw range can scan an arbitrarily large prefix to prove done=true for a moving range with no intersection. Fix §6.3.1: two layers. (1) Scan-budget pacing — each ExportVersions call carries a separate maxScannedBytes (default 4× chunkBytes) that counts both accepted and rejected rows; when the budget hits the iterator returns whatever it has accepted + the next cursor (over the rejected tail position). (2) Route-key sub-prefix indexing where the family layout allows it — per-family EncodeBracketStart() hook tightens bracket bounds from family-wide to moving-slice-within-family; adapters without the hook keep the scan-budget pacing alone, which is sufficient for correctness. New metrics: keys per bracket + rejected_rows per bracket.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 44e6125daf
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| | FENCE → DELTA_COPY Copy (snapshot_ts, fence_ts] from | ||
| | source to target shadow space. |
There was a problem hiding this comment.
Wait for source to enforce the fence before fixing fence_ts
When the catalog enters WriteFenced, the source FSM only learns that through its local route engine, which is refreshed asynchronously by the catalog watcher (distribution/watcher.go:11, distribution/watcher.go:87-98), and the proposed FSM fence checks the current local snapshot. If the migrator records fence_ts and starts copying (snapshot_ts, fence_ts] immediately, a stale source leader can still accept writes for the moved keys after fence_ts but before its watcher applies the fenced route; those versions are above the DELTA_COPY upper bound and will be left on the source at CUTOVER. The phase needs a source-side barrier/ack that the fenced route is applied, or fence_ts must be chosen after writes are actually rejected on the source.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 9fa5931. New §3.2a 'Source-side fence-apply barrier': migrator advances FENCE → DELTA_COPY only after every voter of the source-group Raft membership has reported catalog_version_applied >= fence_catalog_version (via the cap_migration_v2 heartbeat or a per-node GetCatalogVersion RPC). Only after the unanimous voter-ACK does the migrator query source.LastCommitTS() and pin fence_ts. Unanimous (not quorum) is required for fence safety — any voter could become the leader later and re-serve writes for the moved range in the pre-FENCE state if its catalog hadn't applied. fence_ack_cursor is durable in the SplitJob for restart-safety. State machine diagram (§3.2) updated.
|
@claude review — round 5 (1 P1 + 1 P2 on commit 231e9a5):
|
|
Claude finished @bootjp's task in 4m 52s —— View job Review — Round 5Todo list
Round-5 items — both verified
New P1 —
|
| Location | Line | Value |
|---|---|---|
| §4 table, CLEANUP row | 168 | 60 s |
| §7.2.4 | 574 | 30 s (with rationale: "comfortably exceed both lease TTL and watcher tick × 2") |
| §12 Open Question 3 | 689 | 60 s |
§7.2.4 has the rationale for 30 s. Picking 30 s: update lines 168 and 689 to match, and close Open Question 3 with the §7.2.4 rationale. Five rounds of flagging — this is a one-line fix.
Minor — !dist| prefix on target-group import cursor (5th flag)
Line 350, inside the ImportVersions doc comment: !dist|job|<id>|import_cursor is stored in the target group's MVCC store. !dist| is the default-group catalog namespace throughout the codebase (distribution/catalog.go). An operator tool or the catalog's key parsers scanning the default-group catalog would misparse these as catalog entries. Rename to a target-scoped prefix such as !migstage|cursor|<id> — one-line change.
Medium — "Wait until applied" spec (4th flag, slightly improved)
Line 552 (§7.2.2 sub-case 2): "the FSM treats this as 'wait until applied' (bounded by lease TTL) before answering, then re-evaluates ownership; if still not owned, returns ErrRouteMoved."
The round-4 addition of "then re-evaluates ownership; if still not owned, returns ErrRouteMoved" is an improvement. What remains unspecified: the implementation mechanism for "wait until applied" (condition variable on appliedIndex signalled by the apply goroutine? a retry loop with time.Sleep? a channel?) and the timeout value within the lease TTL (200 ms? 500 ms?). Given the level of implementation detail elsewhere in the doc, one sentence would close this: e.g., "condition-variable wait on the FSM's appliedIndex signalled by the apply goroutine, timedout by min(remaining_lease_TTL, 200ms); on timeout, re-evaluate ownership — if still not owned, return ErrRouteMoved."
Summary
Both round-5 items are verified and properly resolved. hlcPhysicalMs(ts) = int64(ts >> 16) is correct per kv/hlc.go (confirmed hlcLogicalBits = 16, SetPhysicalCeiling takes Unix ms, Observe takes a full HLC value). The scan-budget pacing correctly bounds per-call cost for sparse family brackets.
One new P1 emerges from the codex review of commit 44e6125: fence_ts is recorded before the source-group FSM has applied the WriteFenced route — writes accepted in the watcher-lag window after fence_ts are outside DELTA_COPY's range and would be lost at CUTOVER. The fix is a source-group catalog-version poll before DELTA_COPY begins.
The round-3 P1 (PromoteStaged "no-op when shadowed") has now been flagged five consecutive rounds without resolution; it is a correctness bug that causes MVCC history versions to vanish. Three editorial items from round 2 remain (grace window, !dist| prefix, wait-until-applied).
…und-6 P1) Codex P1 round-6 on PR #945 (line 115) — fence_ts pinned before source FSM actually rejects writes The FENCE catalog write commits on the default group, but the source FSM only learns about WriteFenced through the watcher's async catalog snapshot apply (distribution/watcher.go:11, 87-98). If the migrator records fence_ts = source.LastCommitTS() and starts DELTA_COPY immediately, a stale source leader can still accept writes for the moved keys for one watcher-tick worth of time after the catalog commit. Those writes get commit_ts > fence_ts, fall outside DELTA_COPY's (snapshot_ts, fence_ts] window, and are silently left on the source at CUTOVER — write loss after CUTOVER hands the routing to the target. Fix: new §3.2a 'Source-side fence-apply barrier'. The migrator advances FENCE → DELTA_COPY only after every voter of the source-group Raft membership has reported catalog_version_applied >= fence_catalog_version (via the cap_migration_v2 heartbeat's existing catalog_version_applied field, or a per-node GetCatalogVersion RPC if the heartbeat is stale). Only then does the migrator query source.LastCommitTS() and pin that as fence_ts. Crucially: this requires UNANIMOUS voter-ACK, not quorum. Any voter could become the leader at any future moment; without local WriteFenced applied, it would re-serve writes for the moved range in the pre-FENCE state. Quorum is necessary for catalog-write durability; unanimous voter-ACK is necessary for fence safety. State machine diagram (§3.2) updated: FENCE → DELTA_COPY now explicitly waits for the source-side fence ACK before picking fence_ts. SplitJob gains a fence_ack_cursor for restart-safety during the wait. A non-responsive voter blocks the phase; the migrator logs the pending voter list and either operator intervenes or Raft removes the voter via the leader-loss path. CUTOVER is structurally guaranteed by a different mechanism (the §7.2 read fence's authoritative ownership check), so no parallel 'every voter applies CUTOVER' gate is needed — readers are the consumers there.
|
@claude review — round 6 (1 P1 on commit 44e6125):
|
|
@codex review 対応しました。
検証: 対象は設計docのみです。 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: da40dd0e1c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| 1. the configurable grace window has elapsed (`readFenceGrace`, default 30 s — chosen to comfortably exceed both lease TTL and watcher tick × 2), and | ||
| 2. every current source voter has ACKed `AcknowledgeSourceReadDrain(job_id, fence_ts, cutover_version)`. | ||
|
|
||
| The read-drain ACK is a source-group Raft-applied proof. The apply handler first proves the voter's local serving snapshot has installed the cutover read-fence epoch for `cutover_version`, so no new source read whose normalized point/range footprint intersects the moved route can be admitted. It then checks the voter's active read-pin table and succeeds only when no pin remains for that route with `read_ts <= fence_ts`. A source read helper registers this pin before it touches MVCC and releases it only when the logical read is complete, including streaming or paginated reads that continue using the same `read_ts` across multiple storage calls. This explicitly covers long-lived paths such as S3 `getObject` / `streamObjectChunks`, Redis/DynamoDB paginated scans, and gRPC `RawScanAt` callers: a read admitted on the source before CUTOVER can finish against retained source history, but cleanup cannot delete the versions it may still need. |
There was a problem hiding this comment.
Drain all pre-cutover source reads before cleanup
This ACK only waits for pins with read_ts <= fence_ts, but reads are explicitly still served from the source during FENCE/DELTA_COPY; a RawScanAt/S3 stream admitted after fence_ts is pinned can use a current/read timestamp greater than fence_ts while still reading the pre-cutover source versions that cleanup deletes (commit_ts <= fence_ts). In that case the voter can ACK the drain while the stream is active, source cleanup can delete the moved versions, and the already-admitted read can start returning missing chunks/rows. The drain needs to block on any active read pin intersecting the moved route, not only pins at or below fence_ts.
Useful? React with 👍 / 👎.
| Before BACKFILL opens, the migrator proposes `InstallMigrationRetentionPin(job_id, route_start, route_end, min_ts=snapshot_ts)` on the source group and records `source_retention_pin_ts = snapshot_ts` in the SplitJob. The pin is route-scoped and bracket-aware: compaction may still discard unrelated keys, but for any raw row accepted by the §6.3.1 `RouteKeyFilter` it must retain committed versions and tombstones needed to answer `ExportVersions` at `snapshot_ts`. | ||
|
|
||
| After the tracker is armed, the retention floor is not allowed to wait for final `delta_floor` computation. Whenever the source FSM records a migration-window write with `commit_ts < current_pin`, it atomically lowers the active pin to `commit_ts - 1` in the same source-group Raft apply that appends the `!migwrite|<job_id>|*` row. This closes the compaction race where a low-ts write lands during BACKFILL/FENCE, `snapshot_min_admitted_ts` remembers it, but compaction prunes the older version/tombstone before DELTA_COPY finalises the lower bound. For pre-tracker locks discovered by the initial route-faithful lock scan, the migrator lowers the pin before choosing `snapshot_ts`; for later tracker rows, the source apply path owns the immediate lower. The default-group SplitJob records the monotone-down `source_retention_pin_ts` as an audit/restart witness, but the source-local pin is the safety mechanism. |
There was a problem hiding this comment.
Install retention pin before admitting tracked writes
Here the pin is installed only after snapshot_ts is chosen, but §6.1.0 arms the write tracker before that; a low-timestamp raw/one-phase/prepare write can be admitted and recorded in the tracker in the gap before any source-local pin exists. If compaction runs before the later InstallMigrationRetentionPin, the version/tombstone that delta_floor will force DELTA_COPY to read can already be pruned, recreating the lost-write race the pin is meant to close. The tracker arm needs to create a conservative pin in the same source Raft entry, or writes must not be admitted until the pin is active.
Useful? React with 👍 / 👎.
| FSM-side defense in depth — **must run on an unconditional apply path** (closes codex round-3 P1 on PR #945): | ||
|
|
||
| - The naïve placement — extend `verifyOwnerFromSnapshot` so it also rejects routes in `WriteFenced` — does NOT work. `verifyComposed1` short-circuits when `ObservedRouteVersion == 0` (`kv/fsm.go:608-611`), and there are several **intentional zero-version write paths today**: read/write txns with `ReadKeys`, caller-supplied `StartTS`, and resolver-claimed keys all dispatch with `ObservedRouteVersion = 0` (`kv/sharded_coordinator.go:694-754`, M3 sibling §3.5 + auto-pin policy). A stale coordinator in any of those flows would forward a write to the source during FENCE and the FSM defense would never run. The fence must be enforced **before** any version-conditional early return. | ||
| - M2 therefore introduces a **separate, unconditional pre-gate** `verifyRouteNotFenced(mutations)` in `kv/fsm.go` that the apply path consults **before** `verifyComposed1`. It computes each mutation's write footprint against the **current** catalog snapshot and rejects with `ErrRouteWriteFenced` when **any** footprint intersects a route whose `RouteState == WriteFenced`. For ordinary `PUT` / `DEL` / prepare / one-phase commit mutations, the footprint is the single `routeKey(mut.Key)` point, matching `verifyOwnerFromSnapshot`. For raw `Op_DEL_PREFIX`, the footprint is a range write: an empty prefix means every non-transactional key (`kv/transcoder.go:10-12`, `kv/sharded_coordinator.go:631-635`, `kv/fsm.go:482-527`), and a non-empty prefix covers the half-open prefix interval `[prefix, prefixScanEnd(prefix))` with an empty end meaning `+∞`. The gate rejects if that interval, after the same route-key normalization used by scan ownership (§7.2.2b), intersects any WriteFenced route. It must not decide from `routeKey(prefix)` alone: a prefix can start outside the moving child while still matching keys inside it (closes codex round-21 P1). No dependency on `ObservedRouteVersion`, no early return on `0`; the gate runs on every applied write request unconditionally. The gate has the same `Phase_ABORT` bypass as the Composed-1 gate (for the same lock-release reason in §3.5 of the parent partial doc). `Phase_COMMIT` is **not** a blanket bypass: it is allowed only while the SplitJob is still in the post-fence drain window (`post_fence_drain_completed=false`) and only if every committed mutation resolves a matching pre-existing `!txn|lock|` row for the same `(primary_key, start_ts, commit_ts)` in the moving range. That narrow lane lets already-prepared txns resolve after the fence has stopped new prepares (closing codex round-18 P1's starvation issue) without letting a stale coordinator create new state. Once the post-fence drain returns empty, the lane closes and any later `Phase_COMMIT` for the moved range is rejected; after CUTOVER the target has no historical intent for that key, by construction. |
There was a problem hiding this comment.
Preserve idempotent COMMIT retries after the drain
This closes the Phase_COMMIT lane once post_fence_drain_completed=true and also requires the lock row to still be present, but a prepared transaction can commit successfully during FENCE, delete its lock, and then have the commit response lost. A later COMMIT retry for the same (primary,start_ts,commit_ts) would be rejected before handleCommitRequest can use the existing commit record / already-written versions as its idempotency proof, so clients can see a fenced/moved error for a transaction that actually committed. The gate should continue allowing idempotent COMMIT retries proved by the commit record or committed versions, while still rejecting new/unprepared commits.
Useful? React with 👍 / 👎.
| } | ||
| ``` | ||
|
|
||
| The bracket plan is **disjoint by construction**. `familyUser` is not "whatever bytes fall in `[routeStart, routeEnd)`"; it is user MVCC rows after excluding `knownInternalPrefixes`. The txn exclusions are the concrete internal subprefixes only (`!txn|lock|`, `!txn|int|`, `!txn|cmt|`, `!txn|rb|`, `!txn|meta|`), not the entire `!txn|` umbrella: a legal user key such as `!txn|foo` stays in `familyUser` and migrates with ordinary raw user rows. Other exclusions are `!redis|`, `!lst|`, `!hs|`, `!st|`, `!zs|`, concrete stream storage prefixes `!stream|meta|` and `!stream|entry|`, `!ddb|`, `!sqs|`, `!s3|`, `!dist|`, `!migstage|`, `!migwrite|`, `!migfence|`, etc. Every excluded prefix must either have an explicit family bracket below or be a reserved control prefix that `StartSplitMigration` rejects up front. This prevents the user bracket from exporting the same `(raw_key, commit_ts)` that an internal-family bracket exports; importer idempotency is per `(job_id, bracket_id, batch_seq)` and is not allowed to mask cross-bracket duplicates. |
There was a problem hiding this comment.
Keep adapter-like raw user keys in a migration bracket
Excluding the whole !ddb|/!sqs|/!s3| umbrella from familyUser drops legal RawKV keys that happen to start with those bytes but do not match one of the explicit internal subprefix brackets. For example, the current raw write path only rejects txn-internal keys, and routeKey("!ddb|foo") falls back to the raw key; during a full-range migration it is filtered out of familyUser, not scanned by the DynamoDB meta/item/GSI brackets, and therefore never imported to the target. Either reserve/reject those umbrella prefixes on the data plane or exclude only concrete internal subprefixes, as this paragraph already does for !txn| and stream keys.
Useful? React with 👍 / 👎.
| 2. On `CLEANUP → DONE`, the migrator issues one Raft proposal that (a) deletes `!dist|job|<id>`, (b) writes `!dist|jobhist|<terminal_at_ms_be>|<id>` with the final body. This proposal is allowed only after target-local ack / HLC-floor / readiness / promotion records are gone and the source write tracker / write fence / cutover read fence / retention pin are disarmed. | ||
| 3. On `* → FAILED`, the job stays live under `!dist|job|<id>` and is **not** counted against `maxJobHistory`. The transition atomically records `retry_phase`. A failure before the catalog CUTOVER CAS may use `retry_phase=BACKFILL/FENCE/DELTA_COPY/CUTOVER` depending on which side effects are already armed; a failure after CUTOVER uses `retry_phase=CLEANUP`. The live cap in §3.1.1 includes retryable FAILED jobs so a cluster cannot accumulate infinite failed jobs without operator action. | ||
| 4. On `AbandonSplitJob` before CUTOVER, the migrator first runs a bounded per-job cleanup protocol, then moves the final `ABANDONED` body to `!dist|jobhist|<terminal_at_ms_be>|<id>`: | ||
| - source group: disarm `ArmMigrationWriteTracker`, delete `!migwrite|<job_id>|*`, clear any `!migfence|<job_id>` durable write fence / pending cutover read fence, roll `WriteFenced` back to `Active` if FENCE had landed, and release `source_retention_pin_ts`; |
There was a problem hiding this comment.
Roll back the catalog route during abandon
WriteFenced is the default-group catalog state, not source-local state, so doing this in the source-group cleanup alongside !migfence only clears the durable FSM fence and leaves coordinators/watchers seeing the right child as WriteFenced. An abandon after FENCE would then move the job to history while the route remains write-fenced forever. The cleanup protocol needs an explicit default-group catalog CAS that restores the fenced descriptor to the intended active state before the source/target cleanup ACKs allow ABANDONED.
Useful? React with 👍 / 👎.
| 3. With `ARMING` durable, the migrator proposes `ArmCutoverReadFence(job_id, route_start, route_end, expected_cutover_version, target_group_id)` to the **source group**. The source FSM applies this as source-group Raft state: a pending-fence entry keyed by `(job_id, route_start, route_end, expected_cutover_version)`. The operation is idempotent for the same tuple and rejects a conflicting expected version for the same job. | ||
| 4. After the source-group apply ACK, the migrator CASes the default-group SplitJob from `ARMING` to `ARMED`. From the moment the source apply lands until the source watcher applies CUTOVER and normal §7.2.2 Branch 1 takes over, any point read or scan whose **route-key-normalized interval** intersects the moving range and whose `read_route_version < expected_cutover_version` **must not read source MVCC**. If a raw internal scan cannot be normalized to route-key intervals, the source fails closed with `ErrRouteOwnershipUnknown` / retry rather than proving ownership from raw bytes. If the source has already learned that the default-group CAS committed (via watcher, heartbeat watermark, or the ownership RPC), it returns `ErrRouteMoved{expected_cutover_version, target_group_id}`. Otherwise it returns `ErrRouteCutoverPending{retry_after_ms}`; coordinators treat it like `ErrRouteMoved` with a short wait + `WatcherRefresh` retry. A transient retry is acceptable; serving stale source data is not. | ||
| 5. Only after both `target_staged_readiness_state = ARMED` and `cutover_read_fence_state = ARMED` are durable does the migrator issue the default-group CUTOVER catalog CAS that removes `source.WriteFenced(right)` and inserts `target.Active(right)`. This ordering delays target visibility/writes until the target can fail closed for missing staged metadata, the source can fence stale reads, and the default group has recovery witnesses for both facts. | ||
| 6. If the catalog CAS fails (expected version mismatch, abandoned job, leadership loss), the migrator first CASes the SplitJob to `CLEARING`, then proposes `ClearCutoverReadFence(job_id, expected_cutover_version)` to the source group and `ClearTargetStagedReadiness(job_id, expected_cutover_version)` to the target group, and only after both clear ACKs return sets the two states to `NONE` before returning the job to DELTA_COPY/FAILED. While either fence is armed and the CAS is not committed, clients see bounded retry instead of stale reads or live-only target reads. |
There was a problem hiding this comment.
Prove CUTOVER did not land before clearing guards
Treating leadership loss as a CAS failure is unsafe because the CUTOVER proposal may have committed even though the migrator lost leadership or the RPC returned ambiguously. If recovery follows this branch and clears the source read fence / target readiness after a committed CUTOVER, there is a window where the target route is visible but stale source reads are no longer fenced (and target watcher lag is no longer guarded). Before entering CLEARING, the new leader must reload the catalog/SplitJob and prove the expected cutover_version was not published; otherwise it should resume CUTOVER/CLEANUP with the guards still armed.
Useful? React with 👍 / 👎.
| - For never-overwritten keys post-CUTOVER: only the staged raw candidate exists ≤ read_ts, so the merge returns it if visible and returns not found if the staged winner is a tombstone/expired TTL version. | ||
| - For a key with multiple staged versions (the §6.1.1 hot-key case): the staged raw-candidate iterator returns the newest staged version ≤ read_ts, the live raw-candidate iterator returns the newest live version ≤ read_ts, and the merge applies visibility only to the greater of the two. Correct in every interleaving, including partial promotion of hidden versions. | ||
|
|
||
| Writes always land in the live keyspace — the CUTOVER bump already routed writers to the target group via the catalog, and §6.2.1's full-HLC floor guarantees their `commit_ts` is strictly greater than every imported `commit_ts`, so a live write shadowing a staged version is the *correct* MVCC visibility AND the merge rule above surfaces it correctly. This gives reads access to the imported data the instant CUTOVER lands, with **no promotion work blocking CUTOVER itself**, and with no staleness window where staged shadows a fresher live write. |
There was a problem hiding this comment.
Use staged metadata for write-side transaction probes
While promotion is incomplete, imported txn records and prior versions may exist only under the staged prefix, but the write path still performs store reads for transaction status and idempotency (txnCommitTS, rollback markers, one-phase PrevCommitTS probes, etc.). If a pre-cutover abort/commit response is retried after CUTOVER but before promotion reaches those internal keys, a live-only probe can miss the staged rollback/commit record and create a contradictory live commit record or duplicate one-phase write. Writes can still land in live, but their validation/dedup/status lookups need the same staged+live raw-candidate view until staged_visibility_active is cleared.
Useful? React with 👍 / 👎.
| // Milestone 2: set on CUTOVER, cleared on CLEANUP→DONE in the same | ||
| // atomic proposal. While true, the target FSM's read path runs the | ||
| // staged+live MVCC merge of §6.4 step 2 against the | ||
| // !dist|migstage|<migration_job_id>|* prefix; once cleared the route | ||
| // is indistinguishable from any other Active route. |
There was a problem hiding this comment.
Clear staged visibility before target proof cleanup
This comment says the route clears staged_visibility_active only in the final CLEANUP→DONE proposal, but §6.4 step 5 requires a prior promotion-complete CAS that clears the descriptor, then waits for every target voter to ACK that cleared descriptor before deleting !migstage|hlc_floor|, readiness, and promotion proofs. If the flag really stays set until DONE, there is no cleared descriptor for voters to ACK, so cleanup either deadlocks waiting for an impossible proof or deletes target-local proofs while the serving descriptor still requires them.
Useful? React with 👍 / 👎.
|
@codex review 対応しました。
Validation: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ff0300cde3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| 3. **CUTOVER precondition.** Before the migrator issues the CUTOVER catalog write, it ensures the target group's durable full-HLC floor is at least `max_imported_ts` by issuing a final target-group proposal (not default-group) that updates `!migstage|hlc_floor|<jobID> = max(current_floor, max_imported_ts)`, calls `SetPhysicalCeiling(hlcPhysicalMs(max_imported_ts))`, and calls `Observe(max_imported_ts)`. The proposal is gated by `cap_migration_v2` (§11.1) and is the **last** target-side write before CUTOVER. If the target HLC is already above this value (e.g. due to per-batch advancement), the proposal is a no-op except for confirming the durable floor; if a follower flap or snapshot restore dropped in-memory `HLC.last`, this proposal closes the live-leader gap before the catalog is published. | ||
| 4. **Restart/snapshot invariant.** `metaLastCommitTS` is not enough: current `HLC.Next()` reads in-memory `HLC.last` and the physical ceiling, but does **not** consult `metaLastCommitTS`. A restored node that only replays `SetPhysicalCeiling(hlcPhysicalMs(max_imported_ts))` can issue `(ms<<16)|0` even when an imported version was `(ms<<16)|logical>0`. Therefore, on target FSM startup, snapshot restore, and leadership acquisition before serving writes for any route whose migration high-water mark is non-zero, the target computes `fullFloor = max(store.LastCommitTS(), all !migstage|hlc_floor|*)`, then calls `hlc.SetPhysicalCeiling(hlcPhysicalMs(fullFloor))` and `hlc.Observe(fullFloor)`. Re-observing the **full** HLC value is the preferred design because it preserves logical-half monotonicity without artificially jumping the physical ceiling by one millisecond; if an implementation cannot load the full value, the fail-safe alternative is `SetPhysicalCeiling(hlcPhysicalMs(fullFloor)+1)` before accepting writes. | ||
| 5. **Post-CUTOVER write-side invariant, including caller-supplied timestamps.** A target-group `Next()` after CUTOVER is strictly greater than every imported `commit_ts` because the live HLC has observed the durable full floor before writes are admitted. Reads at any post-CUTOVER `Next()` therefore see imported versions as historical and a new live write as the most-recent committed version — no resurrection, no time travel. | ||
|
|
||
| This `Next()` invariant is not enough for adapters that pass a caller-supplied `commit_ts` through `resolveDispatchCommitTS` unchanged. While a target route has `staged_visibility_active=true`, the target FSM must also enforce a write-side floor before any live MVCC write or prepare lock is created: for the route's `migration_job_id`, load the target-local full-HLC floor `!migstage|hlc_floor|<job_id>` (which is `>= max_imported_ts`) and reject any raw write, one-phase txn, or prepared txn whose supplied / already-allocated `commit_ts <= floor` with `ErrMigrationTimestampTooLow` (mapped to a retryable coordinator error that refreshes and reallocates unless the caller explicitly requires that timestamp). This check runs on the target apply path next to the staged/live merge flag, not only in the coordinator, so a stale or direct RawKV caller cannot insert a low live version below an already-imported staged version. Coordinator-allocated writes pass naturally because `Next()` has observed the full floor; the rejection exists only for externally supplied or replayed low timestamps. Once promotion completes and CLEANUP clears `staged_visibility_active`, the route becomes ordinary MVCC again and no migration-specific timestamp floor remains. |
There was a problem hiding this comment.
Floor target timestamps at the fence boundary
When the moved range is empty or only contains old versions, max_imported_ts can be far below the source's fence_ts; flooring the target only at max_imported_ts lets the first post-CUTOVER target write receive a commit timestamp <= fence_ts. Because the API exposes snapshot timestamps via RawLatestCommitTS/RawGet(ts), a client that observed a source timestamp around the fence before cutover could later read at that same timestamp and see a write that actually happened after cutover, violating MVCC real-time ordering. The target floor and the supplied-timestamp rejection should use at least max(max_imported_ts, fence_ts) (or an equivalent cutover timestamp), not just imported-version max.
Useful? React with 👍 / 👎.
| } | ||
| ``` | ||
|
|
||
| The bracket plan is **disjoint by construction**. `familyUser` is not "whatever bytes fall in `[routeStart, routeEnd)`"; it is user MVCC rows after excluding `knownInternalPrefixes`. The exclusions are **concrete internal subprefixes only** unless the data plane already rejects the entire umbrella prefix. For txn, exclude `!txn|lock|`, `!txn|int|`, `!txn|cmt|`, `!txn|rb|`, and `!txn|meta|`, not the entire `!txn|` umbrella: a legal user key such as `!txn|foo` stays in `familyUser` and migrates with ordinary raw user rows. The same rule applies to adapter-looking raw keys: DDB excludes `!ddb|meta|table|`, `!ddb|meta|gen|`, `!ddb|item|`, and `!ddb|gsi|`, not every `!ddb|*`; SQS/S3 exclude only their concrete persisted storage subprefixes returned by the adapter bracket helpers; stream excludes only `!stream|meta|` and `!stream|entry|`. A raw user key such as `!ddb|foo`, `!sqs|foo`, `!s3|foo`, or `!stream|foo` remains in `familyUser` unless the corresponding adapter explicitly reserves/rejects that exact prefix on the write path. Store-owned Redis/list/hash/set/zset umbrellas (`!redis|`, `!lst|`, `!hs|`, `!st|`, `!zs|`) remain excluded because those prefixes are treated as internal storage families in the data plane. Reserved migration/control prefixes (`!dist|`, `!migstage|`, `!migwrite|`, `!migfence|`, etc.) are rejected by `StartSplitMigration` up front rather than exported as data. Every excluded prefix must either have an explicit family bracket below or be a reserved control prefix. This prevents the user bracket from exporting the same `(raw_key, commit_ts)` that an internal-family bracket exports; importer idempotency is per `(job_id, bracket_id, batch_seq)` and is not allowed to mask cross-bracket duplicates. |
There was a problem hiding this comment.
Keep SQS-looking raw keys on their raw route
This exception says a raw user key like !sqs|foo remains in familyUser, but the current routeKey() implementation in kv/shard_key.go maps any !sqs|... key to !sqs|route|global. In a migration of a normal raw interval containing !sqs|foo but not the SQS-global route, the user bracket will scan the raw row and then filter it out by the SQS route key, while no concrete SQS bracket owns it, so the key is left behind on the source. Either reserve the whole !sqs| umbrella on the write path or require M2 to narrow sqsRouteKey to only the concrete storage prefixes named here.
Useful? React with 👍 / 👎.
| FSM-side defense in depth — **must run on an unconditional apply path** (closes codex round-3 P1 on PR #945): | ||
|
|
||
| - The naïve placement — extend `verifyOwnerFromSnapshot` so it also rejects routes in `WriteFenced` — does NOT work. `verifyComposed1` short-circuits when `ObservedRouteVersion == 0` (`kv/fsm.go:608-611`), and there are several **intentional zero-version write paths today**: read/write txns with `ReadKeys`, caller-supplied `StartTS`, and resolver-claimed keys all dispatch with `ObservedRouteVersion = 0` (`kv/sharded_coordinator.go:694-754`, M3 sibling §3.5 + auto-pin policy). A stale coordinator in any of those flows would forward a write to the source during FENCE and the FSM defense would never run. The fence must be enforced **before** any version-conditional early return. | ||
| - M2 therefore introduces a **separate, unconditional pre-gate** `verifyRouteNotFenced(mutations)` in `kv/fsm.go` that the apply path consults **before** `verifyComposed1`. It computes each mutation's write footprint against the **current** catalog snapshot and rejects with `ErrRouteWriteFenced` when **any** footprint intersects a route whose `RouteState == WriteFenced`. For ordinary `PUT` / `DEL` / prepare / one-phase commit mutations, the footprint is the single `routeKey(mut.Key)` point, matching `verifyOwnerFromSnapshot`. For raw `Op_DEL_PREFIX`, the footprint is a range write: an empty prefix means every non-transactional key (`kv/transcoder.go:10-12`, `kv/sharded_coordinator.go:631-635`, `kv/fsm.go:482-527`), and a non-empty prefix covers the half-open prefix interval `[prefix, prefixScanEnd(prefix))` with an empty end meaning `+∞`. The gate rejects if that interval, after the same route-key normalization used by scan ownership (§7.2.2b), intersects any WriteFenced route. It must not decide from `routeKey(prefix)` alone: a prefix can start outside the moving child while still matching keys inside it (closes codex round-21 P1). No dependency on `ObservedRouteVersion`, no early return on `0`; the gate runs on every applied write request unconditionally. The gate has the same `Phase_ABORT` bypass as the Composed-1 gate (for the same lock-release reason in §3.5 of the parent partial doc). `Phase_COMMIT` is **not** a blanket bypass: before the post-fence drain completes (`post_fence_drain_completed=false`), it is allowed only if every committed mutation resolves a matching pre-existing `!txn|lock|` row for the same `(primary_key, start_ts, commit_ts)` in the moving range. That narrow lane lets already-prepared txns resolve after the fence has stopped new prepares (closing codex round-18 P1's starvation issue) without letting a stale coordinator create new state. After the post-fence drain completes, **new/unprepared commits remain rejected**, but an idempotent COMMIT retry for a transaction that already committed must still be admitted when the FSM can prove the prior success from the commit record, rollback/commit marker, or already-written committed versions for the exact `(primary_key, start_ts, commit_ts)`. The idempotency proof uses the staged+live lookup rule from §6.4 while `staged_visibility_active` is true, so a response-lost COMMIT that succeeded before CUTOVER does not later surface as a fenced/moved failure merely because its lock row was deleted. |
There was a problem hiding this comment.
Do not treat rollback markers as COMMIT success
For a post-drain response-lost COMMIT retry, a rollback marker proves that the transaction was aborted, not that the original commit succeeded. Allowing the retry lane to return idempotent COMMIT success from a rollback/commit marker can make a client believe its writes committed even though only rollback state exists and no committed versions will be migrated or served. This proof list should distinguish commit/version proof from rollback proof and only the former may produce COMMIT success.
Useful? React with 👍 / 👎.
| **2a. Range-scan merge iterator (closes codex round-14 P1 line 562).** The point-read rule above generalises to range reads via a **raw-candidate merged iterator**, not by scanning live alone and not by merging only reader-visible rows. For `ScanAt(read_ts, scan_start, scan_end)` or `RawScanAt(...)` on a target route with `staged_visibility_active = true` and `scan` overlapping the route's `[routeStart, routeEnd)`, the target FSM constructs a merged iterator that: | ||
| - walks raw live MVCC candidates in `[scan_start, scan_end)` at `read_ts`, including tombstones and TTL headers, AND | ||
| - walks raw staged MVCC candidates under `!dist|migstage|<migration_job_id>|<encode(scan_start)..encode(scan_end))` at `read_ts`, with the same headers and raw-key window, | ||
| - and merges by `(logical_key, commit_ts)` in `(raw_key ASC, commit_ts DESC)` order. For each logical key, the merged iterator chooses the newest candidate (`max commit_ts ≤ read_ts`) from across **both** sources, then emits it only if the winner is neither a tombstone nor expired at `read_ts`. If the newest candidate is hidden, the iterator suppresses that logical key and MUST NOT emit an older candidate from the other source. |
There was a problem hiding this comment.
Cover reverse scans in the staged merge
The staged/live merge is specified only for forward scan order here, but the repo has ReverseScanAt and RawScanAt(reverse=true) paths used by clients. During staged_visibility_active, a reverse scan over the moved range would otherwise use the existing live-only reverse iterator or an ASC-only merge and return empty/partial results until promotion catches up. The target read path needs an equivalent reverse merged raw-candidate iterator and tests for ReverseScanAt / RawScanAt with reverse=true.
Useful? React with 👍 / 👎.
| // Catalog version observed by ShardStore when it routed this read. | ||
| // Required on proxied and local-leader reads after cap_migration_v2. | ||
| uint64 read_route_version = 4; |
There was a problem hiding this comment.
Stamp RawKV read versions server-side
Requiring read_route_version on local-leader reads makes the user-visible RawGet/RawScanAt RPCs incompatible with existing clients, which cannot know the distribution catalog version and will keep sending the proto default 0. In an upgraded cluster those reads would be treated as legacy and fail with ownership errors even when no migration is in progress. The server-side RawKV adapter should resolve/stamp the observed route version before calling the shared read-fence helper, limiting zero-version rejection to internal proxy requests that actually lost the stamped value.
Useful? React with 👍 / 👎.
|
@codex review Addressed the current review blockers on the M2 migration proposal:
Validation: git diff --check HEAD~1..HEAD |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 15d74813ec
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| 4. **Restart/snapshot invariant.** `metaLastCommitTS` is not enough: current `HLC.Next()` reads in-memory `HLC.last` and the physical ceiling, but does **not** consult `metaLastCommitTS`. A restored node that only replays `SetPhysicalCeiling(hlcPhysicalMs(max_imported_ts))` can issue `(ms<<16)|0` even when an imported version was `(ms<<16)|logical>0`, and a sparse migration can issue a timestamp below `fence_ts` if it never imported a high version. Therefore, on target FSM startup, snapshot restore, and leadership acquisition before serving writes for any route whose migration high-water mark is non-zero, the target computes `fullFloor = max(store.LastCommitTS(), all !migstage|hlc_floor|*)`, then calls `hlc.SetPhysicalCeiling(hlcPhysicalMs(fullFloor))` and `hlc.Observe(fullFloor)`. Re-observing the **full** HLC value is the preferred design because it preserves logical-half monotonicity without artificially jumping the physical ceiling by one millisecond; if an implementation cannot load the full value, the fail-safe alternative is `SetPhysicalCeiling(hlcPhysicalMs(fullFloor)+1)` before accepting writes. | ||
| 5. **Post-CUTOVER write-side invariant, including caller-supplied timestamps.** A target-group `Next()` after CUTOVER is strictly greater than both every imported `commit_ts` and the source `fence_ts` because the live HLC has observed `cutoverFloor` before writes are admitted. Reads at any post-CUTOVER `Next()` therefore see imported/source-fenced versions as historical and a new live write as the most-recent committed version — no resurrection, no time travel. | ||
|
|
||
| This `Next()` invariant is not enough for adapters that pass a caller-supplied `commit_ts` through `resolveDispatchCommitTS` unchanged. While a target route has `staged_visibility_active=true`, the target FSM must also enforce a write-side floor before any live MVCC write or prepare lock is created: for the route's `migration_job_id`, load the target-local full-HLC floor `!migstage|hlc_floor|<job_id>` (which is `>= max(max_imported_ts, fence_ts)`) and reject any raw write, one-phase txn, or prepared txn whose supplied / already-allocated `commit_ts <= floor` with `ErrMigrationTimestampTooLow` (mapped to a retryable coordinator error that refreshes and reallocates unless the caller explicitly requires that timestamp). This check runs on the target apply path next to the staged/live merge flag, not only in the coordinator, so a stale or direct RawKV caller cannot insert a low live version below an already-imported staged version or below the source fence boundary. Coordinator-allocated writes pass naturally because `Next()` has observed the full floor; the rejection exists only for externally supplied or replayed low timestamps. Once promotion completes and CLEANUP clears `staged_visibility_active`, the route becomes ordinary MVCC again and no migration-specific timestamp floor remains. |
There was a problem hiding this comment.
Preserve the write timestamp floor after promotion
When promotion finishes but callers can still supply commit_ts values, clearing the migration-specific floor here allows a raw/one-phase/prepare write with commit_ts <= max(max_imported_ts, fence_ts) to be accepted after staged_visibility_active is false. The doc itself notes that supplied timestamps are passed through unchanged, and the stores accept older timestamps while only advancing LastCommitTS monotonically, so such an acknowledged write can remain hidden behind the migrated versions and violate post-cutover visibility. Please keep enforcing a durable per-route floor, or reject low supplied timestamps globally, after the staged merge flag is cleared.
Useful? React with 👍 / 👎.
| 1. After DELTA_COPY completes and after the target full-HLC floor proposal in §6.2.1, the migrator fixes `expected_cutover_version = current_catalog_version + 1`, persists `target_staged_readiness_state = ARMING`, and applies §6.4 step 0's `ApplyTargetStagedReadiness` proposal on the target group. Only after the target ACK does it record `target_staged_readiness_state = ARMED`. This closes the target-side watcher-lag gap before the source read barrier is armed. | ||
| 2. With target readiness armed, but **before** any source-group read-fence proposal, the migrator persists `cutover_read_fence_state = ARMING` on the default-group SplitJob. This default-group intent is durable proof that recovery must reconcile the source-side fence even if the leader dies during the next RPC. | ||
| 3. With `ARMING` durable, the migrator proposes `ArmCutoverReadFence(job_id, route_start, route_end, expected_cutover_version, target_group_id)` to the **source group**. The source FSM applies this as source-group Raft state: a pending-fence entry keyed by `(job_id, route_start, route_end, expected_cutover_version)`. The operation is idempotent for the same tuple and rejects a conflicting expected version for the same job. | ||
| 4. After the source-group apply ACK, the migrator CASes the default-group SplitJob from `ARMING` to `ARMED`. From the moment the source apply lands until the source watcher applies CUTOVER and normal §7.2.2 Branch 1 takes over, any point read or scan whose **route-key-normalized interval** intersects the moving range and whose `read_route_version < expected_cutover_version` **must not read source MVCC**. If a raw internal scan cannot be normalized to route-key intervals, the source fails closed with `ErrRouteOwnershipUnknown` / retry rather than proving ownership from raw bytes. If the source has already learned that the default-group CAS committed (via watcher, heartbeat watermark, or the ownership RPC), it returns `ErrRouteMoved{expected_cutover_version, target_group_id}`. Otherwise it returns `ErrRouteCutoverPending{retry_after_ms}`; coordinators treat it like `ErrRouteMoved` with a short wait + `WatcherRefresh` retry. A transient retry is acceptable; serving stale source data is not. | ||
| 5. Only after both `target_staged_readiness_state = ARMED` and `cutover_read_fence_state = ARMED` are durable does the migrator issue the default-group CUTOVER catalog CAS that removes `source.WriteFenced(right)` and inserts `target.Active(right)`. This ordering delays target visibility/writes until the target can fail closed for missing staged metadata, the source can fence stale reads, and the default group has recovery witnesses for both facts. |
There was a problem hiding this comment.
Wait for all voters to apply cutover guards
These steps mark the target readiness and source read-fence guards ARMED after a single target/source apply ACK, then publish target.Active. If a source or target follower that has not applied the corresponding guard becomes leader in this window, it can serve stale source reads or live-only target reads/writes until its watcher catches up—the same failure mode §3.2a avoids by requiring every current voter to apply the write fence. Please require per-voter applied ACKs for both pre-CUTOVER guards, or explicitly block leadership serving until the committed guards are loaded.
Useful? React with 👍 / 👎.
| FSM-side defense in depth — **must run on an unconditional apply path** (closes codex round-3 P1 on PR #945): | ||
|
|
||
| - The naïve placement — extend `verifyOwnerFromSnapshot` so it also rejects routes in `WriteFenced` — does NOT work. `verifyComposed1` short-circuits when `ObservedRouteVersion == 0` (`kv/fsm.go:608-611`), and there are several **intentional zero-version write paths today**: read/write txns with `ReadKeys`, caller-supplied `StartTS`, and resolver-claimed keys all dispatch with `ObservedRouteVersion = 0` (`kv/sharded_coordinator.go:694-754`, M3 sibling §3.5 + auto-pin policy). A stale coordinator in any of those flows would forward a write to the source during FENCE and the FSM defense would never run. The fence must be enforced **before** any version-conditional early return. | ||
| - M2 therefore introduces a **separate, unconditional pre-gate** `verifyRouteNotFenced(mutations)` in `kv/fsm.go` that the apply path consults **before** `verifyComposed1`. It computes each mutation's write footprint against the **current** catalog snapshot and rejects with `ErrRouteWriteFenced` when **any** footprint intersects a route whose `RouteState == WriteFenced`. For ordinary `PUT` / `DEL` / prepare / one-phase commit mutations, the footprint is the single `routeKey(mut.Key)` point, matching `verifyOwnerFromSnapshot`. For raw `Op_DEL_PREFIX`, the footprint is a range write: an empty prefix means every non-transactional key (`kv/transcoder.go:10-12`, `kv/sharded_coordinator.go:631-635`, `kv/fsm.go:482-527`), and a non-empty prefix covers the half-open prefix interval `[prefix, prefixScanEnd(prefix))` with an empty end meaning `+∞`. The gate rejects if that interval, after the same route-key normalization used by scan ownership (§7.2.2b), intersects any WriteFenced route. It must not decide from `routeKey(prefix)` alone: a prefix can start outside the moving child while still matching keys inside it (closes codex round-21 P1). No dependency on `ObservedRouteVersion`, no early return on `0`; the gate runs on every applied write request unconditionally. The gate has the same `Phase_ABORT` bypass as the Composed-1 gate (for the same lock-release reason in §3.5 of the parent partial doc). `Phase_COMMIT` is **not** a blanket bypass: before the post-fence drain completes (`post_fence_drain_completed=false`), it is allowed only if every committed mutation resolves a matching pre-existing `!txn|lock|` row for the same `(primary_key, start_ts, commit_ts)` in the moving range. That narrow lane lets already-prepared txns resolve after the fence has stopped new prepares (closing codex round-18 P1's starvation issue) without letting a stale coordinator create new state. After the post-fence drain completes, **new/unprepared commits remain rejected**, but an idempotent COMMIT retry for a transaction that already committed must still be admitted when the FSM can prove the prior success from the commit record, a commit marker, or already-written committed versions for the exact `(primary_key, start_ts, commit_ts)`. The same staged+live probe also checks rollback markers, but a rollback marker proves the transaction already aborted and must return the idempotent abort/rollback result (or reject the COMMIT as already rolled back), never COMMIT success. The idempotency proof uses the staged+live lookup rule from §6.4 while `staged_visibility_active` is true, so a response-lost COMMIT that succeeded before CUTOVER does not later surface as a fenced/moved failure merely because its lock row was deleted. |
There was a problem hiding this comment.
Require transaction identity for COMMIT proof
In the post-drain retry path, a moved secondary can have its lock cleaned up while the primary commit record lives outside the moved range, leaving only ordinary committed MVCC versions as the local proof. Those versions carry key/commit_ts/value but not primary_key or start_ts, and caller-supplied timestamps can collide, so treating a version at the requested timestamp as exact COMMIT proof can acknowledge a transaction that never committed. Please add a per-key commit marker/txn identity to the proof, or only return COMMIT success from records that actually encode the transaction identity.
Useful? React with 👍 / 👎.
| Two sub-cases of the legacy framing are preserved: | ||
|
|
||
| - **Coordinator stale, source up-to-date**: source's `source_local_version ≥ cutover_version` and source no longer owns the key → unconditional `ErrRouteMoved`. Coordinator refreshes → re-routes to target → succeeds. | ||
| - **Source stale, coordinator further ahead (`read_route_version > source_local_version`)**: the FSM blocks on a **condition variable signalled by the apply goroutine** when it advances `appliedIndex`, with a timeout of `min(remaining_lease_TTL, 200 ms)` (the FSM tracks `appliedIndex` natively; the condition variable wraps it with a one-shot notifier consumed by the read-fence path). On the signal the FSM re-evaluates ownership; if `source_local_version` has now caught up to `read_route_version` AND the source still owns the key under the new snapshot, the read proceeds. If `source_local_version` is still behind on timeout, OR if the catch-up reveals the source no longer owns the key, the FSM returns `ErrRouteMoved`. This avoids a wedge while keeping the read consistent (closes claude round-2 spec-mechanism flag, 4-round flag). |
There was a problem hiding this comment.
Wait on catalog version rather than applied index
In this branch source_local_version is the distribution engine's catalog version, which is advanced by CatalogWatcher.ApplySnapshot, not by the FSM Raft apply goroutine/applied index. A condition variable signalled only from Raft apply can therefore sleep until timeout even when the watcher has already applied read_route_version, producing spurious ErrRouteMoved/retry for source-owned reads after benign catalog changes. Please wake this wait from the catalog watcher/engine version update, or poll the engine version directly.
Useful? React with 👍 / 👎.
|
@codex review Addressed the current review blockers in commit 0067d6f:
Validation: git diff --check |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0067d6f33b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| 0. **Target staged-readiness barrier before the target can become Active.** After DELTA_COPY completes and after the final target full-HLC floor proposal in §6.2.1, but before the default-group CUTOVER CAS can publish `target.Active(right)`, the migrator persists `target_staged_readiness_state = ARMING` on the default-group SplitJob and proposes/probes `ApplyTargetStagedReadiness(job_id, route_start, route_end, expected_cutover_version, migration_job_id, min_write_ts_exclusive=max(max_imported_ts, fence_ts))` until every current target voter has applied it and advanced `target_staged_readiness_ack_cursor`. The target FSM persists `!migstage|ready|<job_id>` in its own Raft state/snapshots and treats that record as a fail-closed guard for the moving route interval. | ||
|
|
||
| While the readiness record is armed, any target read or write whose route-key-normalized point/range intersects `[route_start, route_end)` must prove one of two things before touching live MVCC: either the target's local route snapshot contains `staged_visibility_active=true` with the matching `migration_job_id` and catalog version, or the default-group ownership lookup says the CUTOVER has not committed yet and the request should retry. If neither proof is available (watcher lag, ownership RPC timeout, snapshot restore before route replay), the target returns `ErrRouteCutoverPending` / `ErrRouteOwnershipUnknown` and **must not** read live-only data or accept a live write for the interval. This closes the target-side mirror of §7.2.2e: a fresh coordinator may learn the new target route from the default group before the target FSM has applied the same descriptor locally, so the target must fail closed until it can run the staged/live merge and timestamp floor. |
There was a problem hiding this comment.
Allow cleared target descriptors past readiness guard
When the promotion-complete CAS clears staged_visibility_active, §6.4 step 5 intentionally keeps !migstage|ready|<job_id> until every target voter ACKs the cleared descriptor, but this guard only allows serving when the local descriptor still has staged_visibility_active=true. In the cleanup window—especially while one target voter is lagging—a target leader that has already applied the cleared descriptor will fail this proof and return retry/unknown for a route it now owns, making the migrated range unavailable until cleanup can delete the readiness record. The readiness guard needs to also accept the matching cleared descriptor with the retained min_write_ts_exclusive.
Useful? React with 👍 / 👎.
|
|
||
| ### 5.1.1 `proto/service.proto` read-path fields | ||
|
|
||
| The read fence in §7.2 is wired through the concrete user-visible read RPCs, not an abstract helper message. M2 extends the existing RawKV service messages as follows: |
There was a problem hiding this comment.
Include RawLatestCommitTS in the read fence
The existing RawKV API also exposes RawLatestCommitTS (adapter/grpc.go reads r.store.LatestCommitTS directly, and ShardStore.LatestCommitTS proxies through it), but this section only adds route-version fencing to RawGet and RawScanAt. After CUTOVER, a stale coordinator/source leader can still answer RawLatestCommitTS for a moved key from source MVCC, and the target can report missing/low latest-ts before staged rows are promoted, so timestamp reads are not atomic with the staged/live data view. Please extend this request/path through the same ownership fence and staged/live merge semantics.
Useful? React with 👍 / 👎.
| 1. After DELTA_COPY completes and after the target full-HLC floor proposal in §6.2.1, the migrator fixes `expected_cutover_version = current_catalog_version + 1`, persists `target_staged_readiness_state = ARMING`, and applies/probes §6.4 step 0's `ApplyTargetStagedReadiness` proposal on every current target voter. Only after `target_staged_readiness_ack_cursor` shows every current target voter has applied the guard does it record `target_staged_readiness_state = ARMED`. This closes the target-side watcher-lag gap before the source read barrier is armed; a follower that later becomes target leader already has the fail-closed readiness record or refuses to serve the moving interval until it catches up. | ||
| 2. With target readiness armed, but **before** any source-group read-fence proposal, the migrator persists `cutover_read_fence_state = ARMING` on the default-group SplitJob. This default-group intent is durable proof that recovery must reconcile the source-side fence even if the leader dies during the next RPC. | ||
| 3. With `ARMING` durable, the migrator proposes/probes `ArmCutoverReadFence(job_id, route_start, route_end, expected_cutover_version, target_group_id)` on every current **source group** voter. Each source FSM applies this as source-group Raft state: a pending-fence entry keyed by `(job_id, route_start, route_end, expected_cutover_version)`. The operation is idempotent for the same tuple and rejects a conflicting expected version for the same job. | ||
| 4. After `source_cutover_read_fence_ack_cursor` shows every current source voter has applied the pending fence, the migrator CASes the default-group SplitJob from `ARMING` to `ARMED`. From the moment a source voter applies the fence until its watcher applies CUTOVER and normal §7.2.2 Branch 1 takes over, any point read or scan whose **route-key-normalized interval** intersects the moving range and whose `read_route_version < expected_cutover_version` **must not read source MVCC**. A source voter that restarts or wins leadership before applying the guard must fail closed for the moving interval until it has either applied the guard or loaded a catalog snapshot at/after `expected_cutover_version`; it may not serve from the stale pre-CUTOVER snapshot. If a raw internal scan cannot be normalized to route-key intervals, the source fails closed with `ErrRouteOwnershipUnknown` / retry rather than proving ownership from raw bytes. If the source has already learned that the default-group CAS committed (via watcher, heartbeat watermark, or the ownership RPC), it returns `ErrRouteMoved{expected_cutover_version, target_group_id}`. Otherwise it returns `ErrRouteCutoverPending{retry_after_ms}`; coordinators treat it like `ErrRouteMoved` with a short wait + `WatcherRefresh` retry. A transient retry is acceptable; serving stale source data is not. | ||
| 5. Only after both `target_staged_readiness_state = ARMED` and `cutover_read_fence_state = ARMED` are durable does the migrator issue the default-group CUTOVER catalog CAS that removes `source.WriteFenced(right)` and inserts `target.Active(right)`. This ordering delays target visibility/writes until the target can fail closed for missing staged metadata, the source can fence stale reads, and the default group has recovery witnesses for both facts. |
There was a problem hiding this comment.
Re-arm guards after membership changes
These barriers only wait for the voters that are current when the migrator samples the configuration. If a voter is added or a learner is promoted after target_staged_readiness_state / cutover_read_fence_state reaches ARMED but before CUTOVER (or before cleanup disarms proofs), that new possible leader has not applied ApplyTargetStagedReadiness or ArmCutoverReadFence and can serve live-only target reads/writes or stale source reads. The migration needs to pin a Raft configuration epoch, block/promote membership changes during guarded phases, or re-run the ACK barriers whenever the source/target voter set changes.
Useful? React with 👍 / 👎.
|
|
||
| Per-family adapter helpers (Redis list head pointers, DynamoDB GSI shadow rows, etc.) follow the same merge rule on the corresponding internal-family keys — the merge operates on raw keys, so the per-family invariants the `key_family` dispatch (§6.3.1) already enforces still apply. | ||
|
|
||
| Write-side transaction probes also use the staged+live raw-candidate view while `staged_visibility_active=true`. New writes still create live MVCC versions/intents only, but any validation, deduplication, or status lookup that decides whether a write is idempotent or legal must search the same logical record across live and staged keyspaces before concluding "absent." This includes `txnCommitTS` commit-record checks, rollback markers, one-phase `PrevCommitTS` probes, primary/secondary lock status lookups, and adapter-specific internal keys that participate in OCC. Otherwise a retry after CUTOVER but before promotion reaches those internal keys could miss staged proof and create a contradictory live record. The proof result is type-specific: primary commit records, identity-bearing commit markers, or per-key transaction-success markers carrying `(primary_key,start_ts,commit_ts)` can drive idempotent COMMIT success; a plain committed MVCC version that lacks transaction identity is only a cross-check and must not be the sole success proof; rollback markers drive the idempotent abort/rollback result. The staged+live probe rule is cleared only after the default-group promotion-complete CAS clears `staged_visibility_active` and every target voter has ACKed the cleared descriptor (§6.4 step 5). |
There was a problem hiding this comment.
Define local commit proof for moved secondaries
This requires post-drain COMMIT retries to find identity-bearing proof in the target's staged/live view, but for a transaction whose primary stays outside the moved range and only a secondary key moves, the existing !txn|cmt|<primary> proof routes outside the target slice (as noted in §6.1.0) and is not migrated. The text names a per-key success marker, but the data model/PR plan never adds a marker that is written with secondary commits and exported, so a response-lost COMMIT retry after CUTOVER has no local identity proof and will be rejected despite the transaction having committed. Please specify either that per-key proof record end-to-end or a cross-group primary-proof lookup.
Useful? React with 👍 / 👎.
| Layered mitigations: | ||
|
|
||
| 1. **Capability advertisement.** Each node advertises a `node_capability_bitfield` in its periodic heartbeat to the default group (existing distribution heartbeat — extend by one field). M2-capable nodes set the `cap_migration_v2` bit at boot. | ||
| 2. **Cluster-readiness gate at the entry RPC.** `StartSplitMigrationRequest` is rejected at `adapter/distribution_server.go` with `ErrClusterNotReadyForMigration` unless **every active node** advertises `cap_migration_v2` and the default-group leader itself supports the new RPC. Same-group `SplitRange` (M1) remains unconditionally available — its semantics and wire shape did not change. |
There was a problem hiding this comment.
Block same-group splits while a migration is live
Leaving the M1 SplitRange path unconditionally available during a cross-group job lets an operator split the same source route while BACKFILL/FENCE/DELTA_COPY still rely on the original source_route_id, [route_start, route_end), fence records, and cleanup cursors. That concurrent catalog rewrite can change route ids/bounds under the migrator, so later FENCE/CUTOVER/cleanup CASes may fence or delete the wrong interval or fail after data has already been staged. Same-group splits should be rejected (or proven disjoint) while any live SplitJob overlaps the route.
Useful? React with 👍 / 👎.
|
@codex review Addressed the current inline findings in commit 804882a:
Validation: git diff --check HEAD~1..HEAD |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 804882aea2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| {Start: txnLockPrefix(), End: txnLockPrefixEnd(), Family: familyTxnLock}, | ||
| {Start: txnCmtPrefix(), End: txnCmtPrefixEnd(), Family: familyTxnCommit}, | ||
| {Start: txnRbPrefix(), End: txnRbPrefixEnd(), Family: familyTxnRollback}, | ||
| {Start: txnMetaPrefix(), End: txnMetaPrefixEnd(), Family: familyTxnMeta}, | ||
| {Start: txnIntPrefix(), End: txnIntPrefixEnd(), Family: familyTxnInternal}, |
There was a problem hiding this comment.
Add the missing txn success export bracket
When M2-PR0 starts writing !txn|ok|<locked_key>|... markers, this bracket list never exports them even though §6.3.1 later excludes !txn|ok| from familyUser. In a cross-group migration where a transaction primary stays on the source but a committed secondary moves to the target, the target will lose the local identity-bearing success proof; a response-lost COMMIT retry after CUTOVER can no longer prove the prior success from staged/live probes and will return fenced/moved/reject despite the version having committed. Add an explicit txnOkPrefix/familyTxnSuccess bracket here (with the matching routeKey decoder).
Useful? React with 👍 / 👎.
| | Leader loss during target promotion | The target leader resumes from target-local `!migstage|promote|<job_id>.promote_cursor`; cursor advancement, live copy, and staged delete are one target-group Raft apply, so a persisted cursor cannot skip rows whose copy did not land. Default-group SplitJob state is only a witness and never drives the cursor. | None — reads keep using the raw staged/live merge until promotion finishes. | | ||
| | Target promotion finished but default-group promotion-complete CAS did not land | Target-local `PromotionState.done=true` and empty staged prefix are durable, and target-local ack / HLC-floor / readiness / promotion records are retained as recovery proof. The new migrator retries the default-group CAS that clears `staged_visibility_active` / `migration_job_id` and records `target_promotion_done=true`; only after the target observes the cleared descriptor does it delete those records. Until then, reads pay the merge cost over an empty staged prefix. | None. | | ||
| | Source cleanup reaches data-delete done before every source voter has applied CUTOVER | The source disarm step remains blocked on `source_cutover_ack_cursor`. The migrator keeps probing `AcknowledgeCutoverRouteRemoval(job_id, cutover_version)` for non-ACKed voters; if a voter is stalled or temporarily unreachable, the job stays in `CLEANUP` (or `FAILED` with `retry_phase=CLEANUP` if operator action is required) and keeps `!migfence`, the pending cutover read fence, and the retention pin armed. | Possible continued retryable rejects on stale source paths; no stale source writes/reads. | | ||
| | Target restart or snapshot restore after import / HLC-floor proposal | The restored target computes `fullFloor = max(store.LastCommitTS(), all !migstage|hlc_floor|*)`, then calls `SetPhysicalCeiling(hlcPhysicalMs(fullFloor))` and `Observe(fullFloor)` before accepting writes for routes with a non-zero migration high-water mark. The first post-restore `Next()` is therefore `> max_imported_ts`, including when the imported max has a non-zero logical half. | None. | |
There was a problem hiding this comment.
Preserve descriptor floors during target restart
This recovery rule omits the RouteDescriptor.min_write_ts_exclusive floors that §6.2.1 makes permanent after promotion. Once cleanup has deleted !migstage|hlc_floor|<job_id>, a target restart that follows this matrix computes fullFloor only from LastCommitTS and remaining per-job floors, so an empty/sparse migrated range can start issuing or accepting caller-supplied timestamps at or below fence_ts after cleanup. Include the local route descriptor floors here too, matching the startup invariant above.
Useful? React with 👍 / 👎.
|
|
||
| 1. **Capability advertisement.** Each node advertises a `node_capability_bitfield` in its periodic heartbeat to the default group (existing distribution heartbeat — extend by one field). M2-capable nodes set the `cap_migration_v2` bit at boot. | ||
| 2. **Cluster-readiness gate at the entry RPC.** `StartSplitMigrationRequest` is rejected at `adapter/distribution_server.go` with `ErrClusterNotReadyForMigration` unless **every active node** advertises `cap_migration_v2` and the default-group leader itself supports the new RPC. Same-group `SplitRange` (M1) keeps its semantics and wire shape, but it is not allowed to overlap a live cross-group SplitJob (§3.1.1): overlapping requests return `ErrSplitJobOverlap`, while disjoint same-group splits remain available. | ||
| 3. **In-flight job invalidation on downgrade.** If the default-group leader observes any node losing `cap_migration_v2` mid-migration (rare — would only happen with a botched rollback), the migrator pauses (stays in current phase, no new state advancement) and surfaces `ErrClusterCapabilityRegressed`. Operator must either roll forward or `AbandonSplitJob`. |
There was a problem hiding this comment.
Fail closed when capability regresses mid-job
If a node is rolled back after the tracker, fences, or readiness guards are armed, it no longer understands the source write tracker or the !migfence/readiness checks that this design relies on; simply pausing the default-group migrator leaves that voter able to become leader and accept untracked BACKFILL writes or post-FENCE traffic while the job remains resumable. The downgrade path needs to make participating groups fail closed, remove the voter, or abort/restart from a safe phase before allowing traffic, not just stop phase advancement.
Useful? React with 👍 / 👎.
|
@codex review Addressed the current findings in commit 70a88b0:
Validation: git diff --check HEAD~1..HEAD |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 70a88b0f01
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| ``` | ||
|
|
||
| These fields are persisted on the default-group `RouteDescriptor` record itself, propagated through the normal catalog watcher snapshot, and included in the same CUTOVER CAS that publishes `target.Active(right)`. They are not transient SplitJob-only metadata: a target leader that only has its route snapshot must be able to decide whether staged reads are active, which `!dist|migstage|<migration_job_id>|*` prefix to merge, and the minimum live commit timestamp it may admit. Because a freshly refreshed coordinator can route to the target before the target watcher has applied that descriptor locally, CUTOVER is also preceded by the target-group staged-readiness barrier in §6.4 step 0; a target with the readiness record but without the matching route descriptor fails closed instead of serving live-only data or accepting a write below `min_write_ts_exclusive`. Existing descriptors decode as `staged_visibility_active=false`, `migration_job_id=0`, and `min_write_ts_exclusive=0`; the default-group promotion-complete CAS clears only the staged fields after the target-local promotion completion proof in §6.4 step 5. It **does not** clear `min_write_ts_exclusive`: the floor is a durable route invariant after cross-group migration, later child descriptors copy the parent floor on split and descriptor rewrites preserve `max(existing_floor, new_floor)`, and it continues to reject caller-supplied / replayed `commit_ts <= floor` even after the route is otherwise ordinary MVCC. The final CLEANUP→DONE move happens after the per-job source/target cleanup ACKs are durable. |
There was a problem hiding this comment.
Define the durable RouteDescriptor codec upgrade
This assumes existing route records will decode with default values after adding the staged/floor fields, but the durable catalog is not encoded as this protobuf today: distribution/catalog.go uses a custom catalogRouteCodecVersion binary format and DecodeRouteDescriptor rejects both unknown codec versions and trailing bytes. When PR4 persists these fields on catalog routes, either old descriptors/new descriptors will fail to decode or the target route will lose staged_visibility_active / min_write_ts_exclusive; the design needs an explicit v1/v2 route-codec migration or dual decoder for the durable catalog records.
Useful? React with 👍 / 👎.
| // family-prefixed raw key back to its routing key, so the | ||
| // filter rejects any entry whose routing key falls outside | ||
| // [routeStart, routeEnd). | ||
| {Start: txnLockPrefix(), End: txnLockPrefixEnd(), Family: familyTxnLock}, |
There was a problem hiding this comment.
Don't export txn locks as migrated data
This bracket puts !txn|lock| into the BACKFILL/DELTA_COPY data export plan even though §6.1 says ExportVersions excludes in-flight intent locks and §12 says the target's empty intent-lock space is correct after the fence drain. If implemented literally, a pre-fence prepared lock can be materialized on the target after CUTOVER with coordinator state that was never migrated; if the exporter filters it out, the bracket/cursor/test expectations are inconsistent. Please keep the lock scan drain-only or remove familyTxnLock from the data export plan.
Useful? React with 👍 / 👎.
|
|
||
| The tracker is keyed by the **locked key**, not the primary, so a secondary lock in the moving range counts even when the txn's primary routes elsewhere. Verified against `kv/fsm.go:1136-1142` and `kv/txn_keys.go:50-58`: the durable `!txn|cmt|<primaryKey>|<startTS>` commit record is written **only when committing the primary** (`if committingPrimary`) and is keyed by `primaryKey`, so for such a txn the commit record's `routeKey()` falls **outside** `[routeStart, routeEnd)` — a commit-record family scan filtered by `routeKey()` never sees it. The migrator may still use the exact `PrimaryKey` + `StartTS` decoded from a live lock to do a **direct point lookup** for backward-compat locks where the M2-PR0 `commit_ts` field is absent or zero, but it never relies on a route-faithful commit-record-family scan for moved-secondaries. | ||
|
|
||
| M2-PR0 also adds a route-local **per-key transaction-success marker** for committed participants: `!txn|ok|<locked_key>|<start_ts>|<commit_ts>|<primary_key>` (exact byte encoding versioned with the txn key codecs). The commit apply path writes this marker in the same source/target Raft apply that installs the committed MVCC version for each participant key; its value redundantly encodes `(locked_key, primary_key, start_ts, commit_ts)` and a small marker version. Because `locked_key` is the first routed component, `routeKey(txnSuccessKey)` equals `routeKey(locked_key)`, so a secondary key that moves without its primary carries local identity-bearing proof through BACKFILL/DELTA_COPY. The export bracket planner treats `!txn|ok|` as a concrete txn family bracket, the `routeKey()` decoder understands it, and cleanup/txn-GC removes it with the same retention horizon as the primary commit/rollback markers after no idempotent retry can need it. Post-drain COMMIT idempotency may return success from this marker (or a primary commit record / identity-bearing commit marker); it must not fall back to an ordinary committed MVCC version that lacks `(primary_key,start_ts)`. |
There was a problem hiding this comment.
Make txn success keys self-delimiting
This key shape embeds two arbitrary variable-length keys but does not specify how the decoder finds the end of locked_key before the fixed timestamps and trailing primary_key. In cases where user keys contain separator bytes or timestamp-like suffixes, routeKey(txnSuccessKey) and the export/cleanup bracket cannot reliably route the marker by the locked key, which loses the moved-secondary idempotent COMMIT proof this marker is meant to provide. Please specify a length-prefixed or otherwise self-delimiting encoding for the routed component.
Useful? React with 👍 / 👎.
| 4. On `AbandonSplitJob` before CUTOVER, the migrator first runs a bounded per-job cleanup protocol, then moves the final `ABANDONED` body to `!dist|jobhist|<terminal_at_ms_be>|<id>`: | ||
| - source group: disarm `ArmMigrationWriteTracker`, delete `!migwrite|<job_id>|*`, clear any `!migfence|<job_id>` durable write fence / pending cutover read fence, and release `source_retention_pin_ts`; | ||
| - target group: delete staged rows under `!dist|migstage|<job_id>|*`, import ack records under `!migstage|ack|<job_id>|*`, `!migstage|hlc_floor|<job_id>`, `!migstage|ready|<job_id>`, and `!migstage|promote|<job_id>` in cursor-bounded batches; |
There was a problem hiding this comment.
Persist abandon cleanup before deleting side effects
Abandon cleanup deletes source trackers/fences and target staged/ack state before the job is durably moved out of its live pre-CUTOVER phase. If the default leader or RPC dies after either cleanup bullet but before the final ABANDONED history CAS, recovery can still see a retryable BACKFILL/FENCE/DELTA_COPY job and replay it with missing tracker/fence/staged state, so writes admitted during that partial abandon can be missed. The protocol needs a durable ABANDONING/CLEARING state or an equivalent witness before side effects are removed.
Useful? React with 👍 / 👎.
|
@codex review Addressed current P2 findings in commit 4904a2b:
Validation: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4904a2b2ba
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| 1. **Capability advertisement.** Each node advertises a `node_capability_bitfield` in its periodic heartbeat to the default group (existing distribution heartbeat — extend by one field). M2-capable nodes set the `cap_migration_v2` bit at boot. | ||
| 2. **Cluster-readiness gate at the entry RPC.** `StartSplitMigrationRequest` is rejected at `adapter/distribution_server.go` with `ErrClusterNotReadyForMigration` unless **every active node** advertises `cap_migration_v2` and the default-group leader itself supports the new RPC. Same-group `SplitRange` (M1) keeps its semantics and wire shape, but it is not allowed to overlap a live cross-group SplitJob (§3.1.1): overlapping requests return `ErrSplitJobOverlap`, while disjoint same-group splits remain available. | ||
| 3. **In-flight job invalidation on downgrade.** If the default-group leader observes any node losing `cap_migration_v2` mid-migration (rare — would only happen with a botched rollback), the job records `capability_regressed = true` / `ErrClusterCapabilityRegressed`, stops phase advancement, and makes every still-M2-capable coordinator fail closed for the moving route on the participating source/target groups until the regression is resolved. This is stronger than "pause the migrator": once the write tracker, source fence/read-fence, or target readiness guard may be armed, a downgraded voter that can become leader is unsafe because it may ignore `!migwrite`, `!migfence`, or `!migstage|ready`. Recovery must do one of three safe things before traffic resumes: (a) roll the node forward so it advertises `cap_migration_v2` again and then re-run the per-voter ACK barriers under the current voter-set epoch (§3.1); (b) remove/demote the regressed node from every default/source/target voter set that can serve the live SplitJob, then re-run the ACK barriers for the remaining voters; or (c) if and only if the job is still before any source/target guard or staged import became durable, abort through the normal pre-CUTOVER `AbandonSplitJob` cleanup. A hard rollback of a serving participant without roll-forward/removal is treated as cluster-unavailable for the moving range, not as a resumable paused job. | ||
| 4. **M2-PR0..PR7 land in this order so the capability bit only opens when the full feature is present (closes claude round-11 P2 on PR #945).** PR0-PR3 are wire/codec/store additions with **no behaviour change for M1** (the existing `SplitRange` RPC remains same-group only; `StartSplitMigration` returns `UNIMPLEMENTED` or `ErrClusterNotReadyForMigration` until the feature gate opens); PR4 introduces the state machine but no data move; PR5 lands FENCE rejection (`verifyRouteNotFenced` + `ErrRouteWriteFenced`); PR6 lands `ExportRangeVersions` / `ImportRangeVersions` server-side handlers + §6.4 step 2 staged-merge read path + §7.2.2e source-side cutover read-fence arm; PR7 lands the background promoter + `AbandonSplitJob` + CLEANUP drain. **The `cap_migration_v2` bit goes live only in PR7 commit**, when the full end-to-end migration path can complete — not in PR5 (where FENCE is enforced but BACKFILL has no export handler yet, so a SplitJob would create successfully and then fail at BACKFILL, giving operators false assurance). The earlier-rounds intention was "the bit guards against rollout-skew silent execution," and the corrected intent is "the bit guards against operators starting a migration in a half-shipped cluster." If a finer-grained gate is desired later, OQ-18 records the option to split into `cap_fence_v2` (PR5) + `cap_data_move_v2` (PR6/PR7), with `StartSplitMigration` gated on the conjunction — out of scope for v1. |
There was a problem hiding this comment.
Gate PR0 txn side effects during rolling upgrades
In a rolling upgrade where PR0 is deployed before PR7 opens cap_migration_v2, PR0 is specified above to change the live txn apply path (prepare locks carry commit_ts and COMMIT writes !txn|ok| markers), but this line classifies PR0-PR3 as no-behaviour-change. That lets upgraded and not-yet-upgraded voters apply normal transaction traffic with different generated side effects before StartSplitMigration is even allowed, risking replica divergence or missing the idempotency proof. Please either gate the PR0 writes until every voter advertises the needed codec capability, or move the capability bit earlier with an explicit mixed-version rule.
Useful? React with 👍 / 👎.
|
@codex review Addressed current-head P1 inline finding in 13ca07f:
Validation: git diff --check HEAD~1..HEAD. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13ca07fc38
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| #### 6.1.0a Source MVCC retention pin — export windows must remain readable | ||
|
|
||
| Before BACKFILL opens, the source already has the conservative route-scoped pin installed by `ArmMigrationWriteTracker` in §6.1.0 step 1. After `snapshot_ts` is chosen and the initial route-faithful lock scan / current tracker rows have produced `snapshot_min_admitted_ts`, the migrator may replace the conservative no-prune floor with the current safe floor `min(snapshot_ts, snapshot_min_admitted_ts - 1)` and records that value as `source_retention_pin_ts` in the SplitJob. The pin is route-scoped and bracket-aware: compaction may still discard unrelated keys, but for any raw row accepted by the §6.3.1 `RouteKeyFilter` it must retain committed versions and tombstones needed to answer `ExportVersions` at the current BACKFILL/DELTA_COPY lower bound. There is no window where tracked writes are admitted while the moving route has no source-local retention pin. |
There was a problem hiding this comment.
Keep the source pin at the BACKFILL lower bound
In the BACKFILL phase the export window is (0, snapshot_ts], so replacing the initial no-prune pin with min(snapshot_ts, snapshot_min_admitted_ts - 1) before the BACKFILL cursor has copied the range lets source compaction discard older committed versions or tombstones that are still inside the BACKFILL export window. On a long migration where compaction runs after this pin change but before a bracket reaches a key with history below snapshot_ts, the target will never import those MVCC versions, causing historical reads/deletes to regress after CUTOVER. The pin should stay at the actual BACKFILL lower bound until BACKFILL completes, and only then move to the DELTA_COPY lower bound.
Useful? React with 👍 / 👎.
| - Post-CUTOVER, `WriteFenced(right)` is removed from the source snapshot, so a fence-only gate is no longer enough. M2 therefore also introduces an unconditional **current write owner gate** `verifyCurrentWriteOwner(mutations)` on the same pre-version path. It rejects any non-abort write whose normalized footprint does not resolve to a route owned by the local Raft group under the current serving snapshot. If the footprint resolves to a foreign route, return `ErrRouteMoved{new_route_version, new_group_id}`; if it cannot be normalized or ownership is unavailable, return `ErrRouteOwnershipUnknown` and do not apply. This gate runs even when there is no `WriteFenced` route and even when `ObservedRouteVersion == 0`, so clearing `!migfence` after §6.5 cannot reopen stale writes to the old source. The same narrow prepared-intent resolution lane described above applies before `post_fence_drain_completed`; after the drain, only idempotent COMMIT retries with a durable transaction-identity prior-success proof are allowed to return success, rollback proof returns the prior abort/rollback result, and every new/unprepared `Phase_COMMIT` for the moved route is rejected on source like any other write. | ||
| - Concretely, the apply path becomes: `verifyNoMigrationControlPrefix` → `verifyCurrentWriteOwner` → `verifyRouteNotFenced` → `verifyComposed1` → existing handlers. The owner/fence gates share a small `mutationFootprints(mut)` helper: point lookups call `OwnerOf(routeKey(mut.Key))`, while `Op_DEL_PREFIX` calls `IntersectingRoutes(normalizePrefixFootprint(mut.Key))`. A raw user prefix normalizes to the raw prefix interval; known internal families (`!lst|meta|d|`, `!lst|claim|`, `!hs|`, `!st|`, `!zs|`, concrete stream prefixes `!stream|meta|` and `!stream|entry|`, DynamoDB/S3/SQS families) use the same decoders as the export bracket planner and read-side scan gate when the prefix contains enough bytes to recover logical owner bounds. A family-wide internal prefix that does not carry an owner segment, or any unrecognized internal prefix, fails closed with `ErrRouteOwnershipUnknown` for the owner gate and `ErrRouteWriteFenced` if the current snapshot contains any WriteFenced route, because the FSM cannot prove it is disjoint. `kv/sharded_coordinator.go` may preflight the same intersection before broadcasting `DEL_PREFIX`, but correctness is source-FSM-local: every broadcast recipient re-runs the gates before `handleDelPrefix`, so a stale coordinator cannot slip source tombstones in after FENCE or after CUTOVER. Cost is bounded by one route-interval lookup for prefix deletes and the existing per-mutation `OwnerOf` call for point writes. The gates stay independent so a future change to one cannot regress the others (same isolation rationale as `verifyReadOwner` next to `verifyOwnerFromSnapshot` in §7.2.5). | ||
| - Independently of `WriteFenced`, the same unconditional pre-gate sequence starts with `verifyNoMigrationControlPrefix(mutations)`. It rejects any user-plane mutation footprint that intersects the registered migration/control prefixes from §3.2 (`!dist|`, `!dist|migstage|`, `!migstage|`, `!migwrite|`, `!migfence|`, etc.) before a key reaches `handleRawRequest`, `handleOnePhaseTxnRequest`, `handlePrepare`, or `handleDelPrefix`. This covers point keys and prefix-delete ranges, including `Op_DEL_PREFIX ""`; an empty prefix is a user-data request, not permission to delete catalog, staged import rows, HLC floors, readiness records, promotion records, or source fences. The implementation may share the `mutationFootprints` interval helper with `verifyRouteNotFenced`, but it must not be conditional on there being an active WriteFenced route. Typed catalog/migration Raft proposals are the only writers to those prefixes. |
There was a problem hiding this comment.
Preserve broadcast DEL_PREFIX semantics in owner gates
For DEL_PREFIX requests that are intentionally broadcast to every shard (including empty-prefix flushes and S3/DynamoDB safety-net prefix deletes), each recipient's normalized footprint can include routes owned by other groups and reserved/control subranges that the local store would not actually mutate as user data. As written, the new pre-gates reject any non-local owner or reserved-prefix intersection before the local prefix delete runs, so a valid multi-route prefix delete can fail on every recipient once M2 is enabled. The gate needs to split/limit the operation to the local user-owned portions, while still rejecting the whole request when any intersecting route is WriteFenced or truly targets migration control keys.
Useful? React with 👍 / 👎.
| versions []MVCCVersion, cursor []byte) ([]byte, error) | ||
| ``` | ||
|
|
||
| Idempotency: persist `(jobID, bracketID) → (max_applied_batch_seq, acked_cursor)` under `!migstage|ack|<jobID>|<bracketID>` after each batch's apply (the bracket dimension is required — a single `!migstage|cursor|<jobID>` record would be clobbered by parallel brackets per §6.1.1), and on a duplicate request drop any batch whose `batchSeq` is `≤` the recorded high-water mark while returning the stored `acked_cursor`. The `cursor` is also persisted on the SplitJob's `bracket_progress` entry for exporter resume, separate from the dedup record. Empty batches follow the same ack path; they are not skipped. Non-empty batches preserve the complete MVCC value header relevant to visibility: `tombstone` and `expire_at` are part of the imported version, not recomputed on the target. |
There was a problem hiding this comment.
Reject per-bracket import gaps
The target records only a single max_applied_batch_seq, so if a later batch for the same bracket is applied before an earlier one, the earlier batch is later treated as ≤ the high-water mark and dropped without importing its versions. This can happen if the migrator pipelines per-bracket imports or a retry races with an already-arrived later RPC; the text even says duplicate/out-of-order batches are safe in §6.2.1, but the high-water scheme is safe only for strictly sequential stored+1 application. The importer should either reject/cache batchSeq > stored+1 until the gap is filled, or the migrator contract must explicitly forbid per-bracket out-of-order delivery.
Useful? React with 👍 / 👎.
|
@codex review Addressed current-head P1 inline findings in b7039bf:
Validation: git diff --check HEAD~1..HEAD. This was docs-only, so no code caller audit was needed. |
|
Codex Review: Didn't find any major issues. 🚀 Reviewed commit: ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
## What
Adds `docs/design/2026_06_12_proposed_scaling_roadmap.md` — a doc-only
scaling roadmap that surveys what exists, what is proposed, and what is
**missing** for elastickv to scale further (more data, more throughput,
more nodes).
This is a roadmap / implementation-plan doc, which
`docs/design/README.md` explicitly permits ("Concrete implementation
plans"). No code changes.
## Structure
1. **Current scaling envelope** — the seven things that bound a
deployment today, each anchored to a file:line on `main`.
2. **Dimension-by-dimension analysis** — (a) data volume → split M2/M3 +
blob offload + shared cache, with range MERGE flagged as a missing
design; (b) write throughput → multi-group + leader balance (#953) + the
multi-node bootstrap gap; (c) read throughput → lease reads (done),
follower-reads (missing design, learner primitive present); (d)
cross-group txn → per-group HLC vs centralized TSO assessment; (e)
cluster size / membership; (f) operational scaling.
3. **Gap list → recommended new designs** ranked by leverage, each with
depends-on edges.
4. **Sequencing** — dependency-ordered rollout tying in #945 / #951 /
#953.
5. **Open Questions.**
## Verified-on-tree claims (selection)
- `RunHLCLeaseRenewal` proposes the ceiling to the **default group
only** (`kv/sharded_coordinator.go:1915`) — the TSO doc's near-term "all
led groups" fix is **not** yet implemented.
- `--raftBootstrapMembers` is rejected for multi-group (`main.go:746`,
`len(groups) != 1`); each group bootstraps single-member
(`multiraft_runtime.go:246-254`) — the multi-node multi-group gap PR
#953 §1.1a (PR0) documents.
- Each Pebble store allocates a private 256 MiB block cache
(`store/lsm_store.go:273-279`); shared-cache is a TODO comment
(`:117-120`), not a design.
- FSM apply is serial per group (`kv/fsm.go:123-124`).
- The raft **learner primitive is implemented** in code (engine
`AddLearner`/`PromoteLearner`, raftadmin CLI, `--raftJoinAsLearner`)
even though the doc filename still reads `proposed` — but
follower-served reads (its consumer) have no design.
- Snapshot disk offload and lease reads are implemented; TSO and range
merge are not.
## Notes
- References a companion proposal being written in parallel,
`2026_06_12_proposed_multinode_multigroup_bootstrap.md`, by name (it is
the authoritative spec for Gap 1).
- All referenced design docs were confirmed to exist (or are named
in-flight PR branches: #945 `docs/hotspot-split-m2-proposal`, #951
`design/hotspot-split-m3-automation`, #953
`design/leader-balance-scheduler`).
Design-doc-only PR. No implementation code. Per the repo's design-doc-first workflow, this lands the M3 proposal for review before any implementation. New file: `docs/design/2026_06_11_proposed_hotspot_split_milestone3_automation.md` (`Status: Proposed`, `Author: bootjp`, `Date: 2026-06-11`). ## Summary Milestone 3 of the [hotspot shard split](docs/design/2026_02_18_partial_hotspot_shard_split.md) design: **automation** — auto-detection + an auto-split scheduler. M1 (durable versioned route catalog, manual same-group `SplitRange`, route watcher) has shipped; [M2 (#945)](#945) proposes the cross-group migration plane. M3 composes with M2 but delivers first value standalone: auto-detection + same-group auto-split (the M1 capability) work without M2; cross-group target selection slots in behind the same `SplitRange` interface once M2 lands. It closes the gap the parent doc §1 calls out: `distribution/engine.go`'s `RecordAccess`/threshold-`splitRange` exists but is wired into **nothing** (only tests call it; verified by grep) and is midpoint-only + mutates the engine directly (bypassing the catalog/watcher). ## Key design points - **Detection signal — reuse the keyviz sampler, retire the engine counters.** The keyviz `MemSampler` is already wired into the exact request-path functions the parent doc names (`groupMutations`→`observeMutation`, `observeRead` in `kv/sharded_coordinator.go`), is allocation-free on the hot path, splits reads vs writes, is windowed, has sub-range + Top-K split-key evidence, and bounds cardinality. Driving detection off it avoids a second parallel pipeline. The engine's dead `RecordAccess`/`splitRange`/`Route.Load` path is removed in M3-PR1. M3 adds **zero** hot-path cost — it consumes already-flushed windows off-path. - **Aggregation — leader-local, no new RPC.** The detector runs on the default-group leader and reads its own `MemSampler.Snapshot` for routes it leads (the leader serves the load it would split on). No `ReportAccess` RPC and no RPC piggybacking; cluster fan-out stays an admin-UI concern and can slot behind the detector interface later if needed. - **Detector** — `score = write_ops*Ww + read_ops*Wr` (4/1), 50k ops/min threshold, 3-consecutive-window up-side hysteresis + 0.8× down-side band, per-route 10-min cooldown (monotonic deadline, not wall clock), split-key from observed distribution (sub-range p50 / single-key isolation / decline when no evidence — never blind midpoint), guardrails: min route span, max routes, per-cycle split budget. - **Scheduler** — invokes the existing `proto.Distribution.SplitRange` RPC with `expected_catalog_version` CAS; **all catalog mutations go through `SplitRange`** (repo convention). Same-group today; once M2 lands the **least-loaded-group** target policy flips the `target_group_id` field on the same RPC. - **Safety/ops** — default OFF behind `--autoSplit`; runtime kill switch (`atomic.Bool`); fixed-enum metrics (no per-route/key labels); stable slog keys; detector state is **leader-local and resets on election** (re-earns confidence; cooldown re-seeded from `parent_route_id` lineage) — stated explicitly so non-replication isn't mistaken for an oversight. - **PR breakdown** — M3-PR1 (retire engine path + metric scaffolding), M3-PR2 (aggregation + pure detector, observe-only), M3-PR3 (scheduler + flag + kill switch), each independently shippable; M3-PR4 enables cross-group target selection post-M2. Every current-state claim cites concrete `file:line` evidence, verified before writing. ## Open questions 1. **OQ-1** — Fully delete `Route.Load`/`Engine.Stats`, or keep `Stats` for diagnostics? 2. **OQ-2** — Down-side hysteresis: hold the counter in the `[0.8×, 1.0×)` band, or decay it? 3. **OQ-3** — No sub-range/Top-K evidence: decline to split (proposed), or allow a min/max-observed-key midpoint? 4. **OQ-4** — `minRouteSpan` without per-route key counts: approximate from sampler evidence, or add a cheap key-count estimate? 5. **OQ-5** — Leader-local scoring sufficient, or is there a heavy-follower-read workload needing cluster fan-out? 6. **OQ-6** — Post-M2: enable cross-group target automatically, or gate behind `--autoSplitCrossGroup`? 7. **OQ-7** — Durable cooldown across elections: is `parent_route_id` + a catalog split timestamp enough, or do we need durable `last_split_at`? 8. **OQ-8** — Express threshold in ops/min (proposed) or ops-per-`Step` to avoid coupling sensitivity to `--keyvizStep`? ## Test plan Doc-only — no Go tests run. Markdown verified clean; all cited `file:line` references confirmed against `main`. The doc's own §8.2 lays out the implementation-phase test strategy (table-driven detector unit tests + `pgregory.net/rapid` invariants, detector→`SplitRange` integration in the 3-node demo, Jepsen hotspot workload deferred to M4 per the parent doc).
Summary
docs/design/.Key design points
!dist|job|<id>so a default-group leader flap resumes mid-migration without operator action.routeKey()equivalence check on every exported row, so!txn|,!lst|,!ddb|,!redis|,!sqs|,!s3|families all land in the right shard.target_group_id == 0or equal to source short-circuits through the existing fast path.Open questions for reviewers
7 questions called out at
§12— notably: grace-period sizing, abandon-after-cutover policy, backfill-vs-prepare semantics, andMVCCVersion.key_familystrict-enum vs numeric.Test plan
*_proposed_*→*_partial_*once PR1 lands.Summary by CodeRabbit