From ffb9c73fc76d6ef0aeea2ff86f863109e3c70ff0 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 13:20:42 +0900 Subject: [PATCH 01/10] =?UTF-8?q?docs(design):=20Composed-1=20M5=20?= =?UTF-8?q?=E2=80=94=20Jepsen=20route-shuffle=20workload=20proposal?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Companion to 2026_05_29_proposed_composed1_cross_group_commit_guard.md (parent — proposes the cross-group commit-time ownership guard M1+M2+M3+M4 implemented in PR #900). M5 is the integration-level proof from the parent doc's milestone table. This proposal elaborates the row into a full design: * cmd/elastickv-split — tiny CLI that invokes SplitRange once by route ID + split key + expected version. Lets the Jepsen nemesis shell out instead of re-implementing the gRPC client in Clojure. * jepsen/src/elastickv/composed1_nemesis.clj — route-shuffle nemesis that periodically splits an existing route. Composes with the existing partition+kill nemesis package. * dynamodb-append-workload setup hook — issues one initial SplitRange so the workload spans >=2 shards from t=0, exercising the 2PC path (prewrite -> primary commit -> secondary commit + the new ErrTxnSecondaryRouteShiftedAfterPrimaryCommit sentinel). Forward-looking posture (same as parent): today's SplitRange is same-group only, so the Composed-1 hazard the M3/M4 guard catches cannot yet be *triggered* in production. M5 ships the scaffolding so: 1. The current SplitRange is exercised under realistic concurrent multi-shard write load and proved non-regressing (workload finds zero G1c — baseline M4 contract). 2. When a future PR introduces a route-mutating RPC that DOES shift ownership across groups, the M5 workload — with a one-line nemesis swap — becomes the integration-level proof that M3+M4 hold under cross-group churn. Two-phase milestone breakdown: * M5a: CLI + workload setup (mergeable on its own; trivially finds zero G1c without the nemesis). 
* M5b: Route-shuffle nemesis itself + cadence-tuning analysis. Open questions tracked in section 5; lifecycle questions about renaming the parent doc are in OQ-4. --- ...posed_composed1_m5_jepsen_route_shuffle.md | 284 ++++++++++++++++++ 1 file changed, 284 insertions(+) create mode 100644 docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md new file mode 100644 index 00000000..a7f4d8ea --- /dev/null +++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md @@ -0,0 +1,284 @@ +# Composed-1 M5 — Jepsen route-shuffle workload + +Status: Proposed +Author: bootjp +Date: 2026-06-02 +Parent design: +[`2026_05_29_proposed_composed1_cross_group_commit_guard.md`](2026_05_29_proposed_composed1_cross_group_commit_guard.md) + +> **Forward-looking proposal, same posture as the parent doc.** +> Today's `SplitRange` is same-group only (per CLAUDE.md and +> `adapter/distribution_server.go`'s implementation), so the +> Composed-1 hazard the M3/M4 guard catches cannot yet be +> *triggered* in production. M5 ships the integration-test +> scaffolding — workload shape, nemesis, success criterion — so +> that: +> +> 1. The current `SplitRange` is exercised under realistic +> concurrent multi-shard write load and proved non-regressing +> (the workload finds **no** G1c, which is the baseline M4 +> contract). +> 2. When a future PR introduces a route-mutating RPC that DOES +> shift ownership across groups (cross-group `SplitRange`, +> `MoveRange`, online rebalancer), the M5 workload — with a +> one-line nemesis change to call the new RPC — becomes the +> integration-level proof that M3+M4 hold under cross-group +> churn. + +--- + +## 1. 
Goals and non-goals + +### 1.1 Goals + +- **G1.** Add a `route-shuffle` nemesis to the DynamoDB Jepsen + suite that issues `SplitRange` against the cluster at a + configurable cadence concurrently with the existing DynamoDB + workload's `TransactWriteItems` traffic. +- **G2.** Force the workload to issue **multi-shard** transactions + with high probability so the 2PC path (`dispatchMultiShardTxn`, + `commitSecondaryTxns`, the new + `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit` sentinel) is + exercised. +- **G3.** Verify the workload's Elle checker reports **zero G1c** + cycles after the run. This is M4's "Done when" criterion + promoted from a unit test to an integration test. +- **G4.** Land a tiny CLI helper (`cmd/elastickv-split`, or a + subcommand in `elastickv-admin`) that issues a single + `SplitRange` RPC by route ID + split key + expected version. + The Jepsen nemesis shells out to it rather than re-implementing + the gRPC client in Clojure. + +### 1.2 Non-goals + +- **NG1.** A cross-group `SplitRange` or `MoveRange` RPC. The + parent doc explicitly defers this; M5 must not depend on it. +- **NG2.** Reproducing a real Composed-1 anomaly. With same-group + `SplitRange` only, no such anomaly is reachable; the workload's + job is to prove the gate is non-regressing today and to be + ready for tomorrow. +- **NG3.** New Jepsen workload primitives (new operation types, + new generators outside the route-shuffle nemesis). The + existing `dynamodb-append-workload` is the right surface. +- **NG4.** Changing the DynamoDB adapter or Composed-1 code on + the server side. M5 is purely test-harness work. + +--- + +## 2. Why this matters now + +PR #900 lands M1+M2+M3+M4 on `feat/composed1-m4-retry`. The unit +tests cover each milestone in isolation. Two gaps remain that +only an integration test can close: + +1. 
**End-to-end ordering.** The M3 gate runs inside the FSM + apply path; the M4 retry runs inside `ShardedCoordinator`; + the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit` + sentinel surfaces from `commitSecondaryTxns`. Each is unit- + tested in isolation. None of the tests run the full + prewrite → primary-commit → secondary-commit chain on real + Raft groups under concurrent client load, which is where + subtle apply-ordering bugs hide. +2. **Workload realism.** PR #900's nine review rounds each + surfaced an auto-pin overreach (read-write, caller-StartTS, + 2PC secondary, resolver-claimed). The review process is + thorough, but a Jepsen run is the empirical check: if the + workload's Elle checker finds G1c against the M4 build, the + review missed something. + +If we ship M4 to `main` without M5, every later change to +routing, OCC, or the FSM apply path lacks the integration-level +sentinel that would catch a regression. M5 closes that gap. + +--- + +## 3. Design + +### 3.1 SplitRange invocation CLI (`cmd/elastickv-split`) + +A standalone Go binary, ~80 lines: + +``` +elastickv-split \ + --address 127.0.0.1:50051 \ + --route-id 100 \ + --split-key /q1/00001 \ + --expected-version 7 +``` + +Reads the four flags, dials the leader, issues +`proto.Distribution/SplitRange`, prints the new catalog version +and the two child route IDs on success. Non-zero exit on any +error so the Jepsen nemesis sees the failure. + +The CLI lives in `cmd/elastickv-split/main.go`. No tests beyond +a smoke test (`main_test.go`) that runs `elastickv-split --help` +and asserts non-zero exit on missing flags. The real coverage +is the Jepsen run itself. + +**Alternative considered:** add a `split` subcommand to +`cmd/elastickv-admin/`. Rejected because `elastickv-admin` is +the HTTP fanout admin and conflating it with a gRPC control- +plane invocation would muddy its scope. A standalone tool is +clearer. 
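A minimal sketch of the flag handling described in §3.1, assuming the four flag names from the invocation example. The `splitArgs` type and `parseArgs` function are hypothetical names chosen for illustration. The dial and the `SplitRange` call are left out because the client API is not shown in this doc.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// splitArgs holds the four inputs the CLI needs (hypothetical type name).
type splitArgs struct {
	Address         string
	RouteID         uint64
	SplitKey        string
	ExpectedVersion uint64
}

// parseArgs parses and validates the flags. It returns an error instead of
// exiting, so main can map any failure to a non-zero exit code.
func parseArgs(argv []string) (splitArgs, error) {
	var a splitArgs
	fs := flag.NewFlagSet("elastickv-split", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet; a real CLI would print usage
	fs.StringVar(&a.Address, "address", "", "gRPC address of the leader")
	fs.Uint64Var(&a.RouteID, "route-id", 0, "ID of the route to split")
	fs.StringVar(&a.SplitKey, "split-key", "", "key at which to split the route")
	fs.Uint64Var(&a.ExpectedVersion, "expected-version", 0, "expected catalog version")
	if err := fs.Parse(argv); err != nil {
		return splitArgs{}, err
	}
	// Record which flags were set explicitly, so 0 stays a legal value.
	seen := map[string]bool{}
	fs.Visit(func(f *flag.Flag) { seen[f.Name] = true })
	for _, name := range []string{"address", "route-id", "split-key", "expected-version"} {
		if !seen[name] {
			return splitArgs{}, errors.New("missing required flag --" + name)
		}
	}
	return a, nil
}

func main() {
	// Example argument list. A real binary would pass os.Args[1:] and call
	// os.Exit with a non-zero code on error so the nemesis sees the failure.
	a, err := parseArgs([]string{
		"--address", "127.0.0.1:50051",
		"--route-id", "100",
		"--split-key", "/q1/00001",
		"--expected-version", "7",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("would split route %d at %q (catalog version %d) via %s\n",
		a.RouteID, a.SplitKey, a.ExpectedVersion, a.Address)
}
```

Returning an error from `parseArgs` rather than exiting inside it keeps the missing-flag path checkable from the smoke test without spawning a subprocess.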
+ +### 3.2 Route-shuffle nemesis (`jepsen/src/elastickv/composed1_nemesis.clj`) + +A new Clojure file with a single `route-shuffle-nemesis` +function returning a `jepsen.nemesis/Nemesis` instance: + +``` +(defn route-shuffle-nemesis + "Periodically invokes elastickv-split against the cluster. + :start -> shuffle one route (pick a non-edge split key) + :stop -> no-op (splits are durable, no rollback)" + [opts] + (reify nemesis/Nemesis + (setup! [this test] ...) + (invoke! [this test op] ... + ;; shell out to elastickv-split with a fresh split key + ) + (teardown! [this test] ...))) +``` + +The nemesis is composed with the existing +`jepsen.nemesis.combined/nemesis-package` (partitions + kills) +via `jepsen.nemesis/compose`. The combined nemesis becomes the +workload's `:nemesis`. + +**Split key picking strategy.** A simple monotonically-increasing +counter: every `:start` invocation appends a fresh integer +suffix to a fixed key prefix the workload reserves. This avoids +collisions with the workload's keyspace and guarantees the +split always picks a key that's between existing keys (so the +operation succeeds against a real catalog). + +**Expected version.** The nemesis calls `ListRoutes` once at +setup to learn the current catalog version, then increments its +local copy by 1 after each successful split. Catalog drift +(another split landing concurrently) is rare in practice — if it +happens, the nemesis logs and refreshes from `ListRoutes`. + +### 3.3 Multi-shard workload guarantee + +The existing `dynamodb-append-workload` writes to a per-key +queue. With a single shard layout, every write goes to that +shard — no 2PC, no Composed-1 exposure. + +M5 needs the workload to consistently span shards. 
Two options: + +| Option | Mechanism | Pro | Con | +|---|---|---|---| +| **A** Force initial split | The test setup issues one `SplitRange` before the workload starts | Workload runs on 2+ shards from t=0 | Adds a setup step; needs a known split key | +| **B** Multi-key txns | Modify each `:append` op to write to ≥2 keys with deterministic routing across shards | Workload exercises 2PC even on a 1-shard layout | Changes the workload's operation shape (harder to compare against historical runs) | + +**Choose A.** Less invasive to the workload, and the +route-shuffle nemesis itself increases the shard count over +time, giving organic multi-shard coverage. + +The setup hook (`db/setup!` in Jepsen parlance, or the test's +`:setup` map) runs `elastickv-split` once with a split key in +the middle of the workload's keyspace. + +### 3.4 Success criterion + +The workload's existing Elle checker emits a `:valid?` boolean +and a list of detected cycles (`:G0`, `:G1a`, `:G1b`, `:G1c`, +`:G-single`, etc.). M5's pass condition: + +``` +(and (:valid? results) + (zero? (count (filter #(= (:type %) :G1c) (:anomalies results)))) + (zero? (count (filter #(= (:type %) :G-single) (:anomalies results))))) +``` + +`G1c` is the parent doc's explicit safety violation; `G-single` +is the closely-related single-item anomaly we already chase in +the existing workload. Other anomaly types (G0, G1a, G1b) +indicate orthogonal bugs and should also fail the run, but the +parent doc's M5 row names G1c as the headline criterion. + +### 3.5 Cadence + +Default `:route-shuffle-interval` = `30s`. Configurable via the +test CLI. Rationale: the workload's typical txn rate is ~10 +ops/sec per client across 5 concurrent clients (= ~50 ops/sec +total), so a 30s shuffle interval puts ~1500 txns (50 × 30) +between shuffles: enough to +plausibly catch a mid-2PC race, but rare enough that the run +doesn't degenerate into "every txn races a split." + +The route-shuffle nemesis composes with the existing +partition+kill nemesis. 
The combined nemesis fires at the +shortest of its members' intervals (Jepsen default +behaviour); kills/partitions remain at their existing 40s. + +--- + +## 4. Milestone breakdown + +Two phases. The phases land in this order; the first is +mergeable on its own. + +| Phase | Title | Scope | Done when | +|---|---|---|---| +| M5a | CLI + workload setup | `cmd/elastickv-split` binary; `dynamodb-append-workload`'s `:setup` issues the initial split; no nemesis yet. | `./scripts/run-jepsen-local.sh` runs unchanged but the cluster starts with 2 shards. Workload finds zero G1c (trivially, no shuffle). | +| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into `dynamodb-append-workload`'s nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). | A `./scripts/run-jepsen-local.sh` run with `--composed1-route-shuffle` produces zero G1c after ≥10 shuffles during a 5-minute run. | + +M5a is a small, focused PR (Go CLI + Clojure setup hook + +docs). M5b carries the nemesis itself plus the cadence-tuning +analysis. + +--- + +## 5. Open questions + +- **OQ-1.** Should the nemesis also issue an `Abort`-shaped + fault that interrupts an in-flight 2PC mid-prewrite? The + existing partition nemesis effectively does this. Tentative + answer: no, the partition nemesis is enough; adding a + prewrite-interrupt would test `abortPreparedTxn`, which is + out of M5's scope. +- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two is + cleaner but doubles the review burden. Tentative answer: two + if M5a's CLI work runs ≥150 lines (likely); one if M5a fits in + a single screen. Decide at implementation time. +- **OQ-3.** Where does the new `cmd/elastickv-split` slot in + the README and the `make` targets? Likely add it to + `make tools`, mirror in `docs/operations/` (does this dir + exist? — check at implementation). Out of scope for the + design doc itself. 
+- **OQ-4.** Should the M5 design doc rename happen with PR #900 + merge (since M1–M4 ship)? Yes per CLAUDE.md's lifecycle + guidance: rename `*_proposed_*.md` → `*_partial_*.md` after + PR #900 lands, then this M5 doc tracks the open milestone. + When M5 ships, rename the parent to `*_implemented_*.md` and + this M5 doc to `*_implemented_*.md` as well (or fold the M5 + content back into the parent — tentative answer: keep them + separate so the M5 design history isn't lost). + +--- + +## 6. Self-review summary + +Five-pass per CLAUDE.md: + +1. **Data loss.** No new write paths; the CLI invokes the + existing `SplitRange` RPC which already has full unit + e2e + coverage. Nemesis-driven calls of an existing RPC can't lose + committed writes (worst case: a split fails and the test + fails, no data effect). +2. **Concurrency / distributed failures.** The nemesis runs + under Jepsen's existing concurrency harness alongside + partitions + kills. Combined behaviour is the *point* of the + test — if anything breaks, the workload finds it. No new + server-side concurrency code is being introduced. +3. **Performance.** Nemesis fires every 30s; CLI invocation is a + single short-lived gRPC call. No measurable impact on hot + paths. +4. **Data consistency.** This IS the data-consistency check + (G1c = serializability violation). The success criterion is + the property we want. +5. **Test coverage.** M5 ships the integration test; the + smoke test on the CLI is the only unit-level coverage, + correctly. The CLI's logic is thin enough that a smoke test + plus the Jepsen run constitute adequate coverage. 
From f5d2ad7ac7c695b8dc1b86119a59feb080fd95da Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 16:15:09 +0900 Subject: [PATCH 02/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20multi-ta?= =?UTF-8?q?ble=20workload=20+=20post-review=20revisions=20(PR=20#905)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex P1 + 3 gemini medium findings on the original PR #905 revision (ffb9c73f). All addressed by revising sections 3.2, 3.3, and 4 (milestone breakdown) and adding OQ-5 / OQ-6: * codex P1 — "Don't rely on item-key splits to shard DynamoDB txns." Verified against kv/shard_key.go:94-124: every DynamoDB table-metadata, item, and GSI key normalises to a SINGLE per- table route key (!ddb|route|table|). Splitting inside a single-table workload's item keyspace cannot put two items on different shards, so the 2PC path (dispatchMultiShardTxn, secondary commits, ErrTxnSecondaryRouteShiftedAfterPrimaryCommit) would never fire — invalidating G2. Fix: replace single-key-split (Option A) with a NEW workload variant dynamodb-append-multi-table-workload that creates N=4 tables (jepsen_append_t1 … jepsen_append_t4) and writes to >=2 distinct tables per TransactWriteItems. The router maps each table to its own route key, so cross-table txns naturally fan out across shards. The setup hook splits the table-route keyspace at !ddb|route|table|jepsen_append_t2. * gemini medium R1 — "Lexicographical Shard Split Issue." The prior /split/ split-key prefix was lexicographically smaller than the workload's keyspace ("/" < "0" in ASCII), so every workload key ended up on the rightmost shard and G2 was never exercised. Fix: anchor split keys to the table-route prefix !ddb|route|table|... so the split lands INSIDE the active workload route range. * gemini medium R2 — "Route ID Resolution for SplitRange." 
Successful SplitRange deletes the parent route ID and creates two child IDs, so a cached ID from a one-time setup-time ListRoutes call is stale on the next shuffle. Fix: nemesis re-queries ListRoutes on every :start invocation, walks the snapshot to find the route covering the chosen split key, and uses that route's ID + snapshot.version as expected_catalog_version. Catalog drift surfaces as ErrCatalogVersionMismatch from the server and the nemesis refreshes on the next tick. * gemini medium R3 — "Gating of Initial Split in Setup Hook." Jepsen db/setup! runs on EVERY node; an ungated initial split would be attempted concurrently by all nodes. Fix: gate the setup-time split on (when (= node (first (:nodes test))) ...) so only the first node attempts it. Also: * Updated §4 milestone table: M5a now ships the new workload variant (not just a setup hook), so it is meaningfully bigger than the original §4 row suggested. * Added OQ-5 (is N=4 the right default?) and OQ-6 (first-node gate semantics) as follow-ups for implementation time. * Resolved OQ-4: PR #900 has merged, so the parent doc rename *_proposed_*.md → *_partial_*.md should now land as a separate small doc-only PR. --- ...posed_composed1_m5_jepsen_route_shuffle.md | 150 ++++++++++++------ 1 file changed, 102 insertions(+), 48 deletions(-) diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md index a7f4d8ea..40bde00a 100644 --- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md +++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md @@ -136,7 +136,12 @@ function returning a `jepsen.nemesis/Nemesis` instance: (reify nemesis/Nemesis (setup! [this test] ...) (invoke! [this test op] ... - ;; shell out to elastickv-split with a fresh split key + ;; 1. 
call ListRoutes to find the route currently covering + ;; the chosen split key — route IDs change after every + ;; split, so a cached ID from setup is stale + ;; 2. pick a split key inside that route's range + ;; 3. shell out to elastickv-split with route-id + + ;; split-key + expected-version from ListRoutes ) (teardown! [this test] ...))) ``` @@ -146,39 +151,76 @@ The nemesis is composed with the existing via `jepsen.nemesis/compose`. The combined nemesis becomes the workload's `:nemesis`. -**Split key picking strategy.** A simple monotonically-increasing -counter: every `:start` invocation appends a fresh integer -suffix to a fixed key prefix the workload reserves. This avoids -collisions with the workload's keyspace and guarantees the -split always picks a key that's between existing keys (so the -operation succeeds against a real catalog). - -**Expected version.** The nemesis calls `ListRoutes` once at -setup to learn the current catalog version, then increments its -local copy by 1 after each successful split. Catalog drift -(another split landing concurrently) is rare in practice — if it -happens, the nemesis logs and refreshes from `ListRoutes`. - -### 3.3 Multi-shard workload guarantee - -The existing `dynamodb-append-workload` writes to a per-key -queue. With a single shard layout, every write goes to that -shard — no 2PC, no Composed-1 exposure. - -M5 needs the workload to consistently span shards. 
Two options: - -| Option | Mechanism | Pro | Con | -|---|---|---|---| -| **A** Force initial split | The test setup issues one `SplitRange` before the workload starts | Workload runs on 2+ shards from t=0 | Adds a setup step; needs a known split key | -| **B** Multi-key txns | Modify each `:append` op to write to ≥2 keys with deterministic routing across shards | Workload exercises 2PC even on a 1-shard layout | Changes the workload's operation shape (harder to compare against historical runs) | - -**Choose A.** Less invasive to the workload, and the -route-shuffle nemesis itself increases the shard count over -time, giving organic multi-shard coverage. - -The setup hook (`db/setup!` in Jepsen parlance, or the test's -`:setup` map) runs `elastickv-split` once with a split key in -the middle of the workload's keyspace. +**Split key picking strategy (gemini medium R1).** Pick a split +key from inside the DynamoDB **table-route** key space +(`!ddb|route|table|` — see `kv/shard_key.go:94-124`). +Concretely, with N tables `jepsen_append_t1` … +`jepsen_append_tN` per §3.3, the route key for table `tK` is +`!ddb|route|table|jepsen_append_tK`. Splits happen between +adjacent table-route keys — e.g. between `…jepsen_append_t2` +and `…jepsen_append_t3`. This guarantees: + +- The split key falls **inside** the active workload route + range (not lexicographically before or after, which would + leave all workload keys on one side of the split). +- Each side of the split owns a distinct set of tables, so + cross-table `TransactWriteItems` actually exercises 2PC. + +A prior revision of this doc proposed a `/split/` prefix. +That was lexicographically smaller than the workload's keyspace +(`/` < `0` in ASCII), so every workload key ended up on the +rightmost shard and the 2PC path was never exercised. Fixed +above by anchoring split keys to the table-route prefix. 
+ +**Route ID resolution (gemini medium R2).** The nemesis CANNOT +rely on a single `ListRoutes` call + a local counter: every +successful split deletes the parent route ID and creates two +fresh child IDs, so a cached route ID is stale on the next +shuffle. On every `:start` invocation the nemesis re-queries +`ListRoutes`, walks the returned snapshot to find the route +whose range contains the chosen split key, and passes that +route's ID as the route to split and the snapshot's `version` as the +`SplitRangeRequest`'s `expected_catalog_version`. Catalog +drift (another split landing concurrently between +`ListRoutes` and `SplitRange`) surfaces as +`ErrCatalogVersionMismatch` from the server; the nemesis logs +and refreshes on the next tick. + +### 3.3 Multi-shard workload guarantee (revised post-codex P1) + +**Original §3.3 (Option A: single-key split in workload keyspace) +was wrong.** `kv/shard_key.go:94-124` normalises every DynamoDB +table-metadata, item, and GSI key to a single per-table route +key (`!ddb|route|table|`). So every +`jepsen_append` item resolves to the SAME catalog point +regardless of its partition-key value, and a `SplitRange` +inside the item keyspace cannot put two items on different +shards. The 2PC path (`dispatchMultiShardTxn`, secondary +commits, the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit` +sentinel) would never fire, invalidating G2 (codex P1 on +PR #905). + +**Revised strategy: multi-table workload.** The M5 workload +creates `N` tables (default `N = 4`): `jepsen_append_t1` … +`jepsen_append_t4`. Each `TransactWriteItems` operation writes +to **at least two** distinct tables. The router maps each +table to its own table-route key, so a cross-table txn +naturally fans out across whichever shards own those route +keys. The setup hook splits the table-route keyspace at +`!ddb|route|table|jepsen_append_t2` so table 1 lives on one +shard and tables 2–4 on another from t=0. 
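The "at least two distinct tables per transaction" rule can be sketched as follows. This is an illustration only: the real workload is Clojure, and `tableNames`, `pickTables`, and `newRNG` are hypothetical helper names. The sketch assumes the subset size is drawn uniformly from 2 to N, which the doc does not specify.

```go
package main

import (
	"fmt"
	"math/rand"
)

// tableNames returns jepsen_append_t1 through jepsen_append_tN.
func tableNames(n int) []string {
	names := make([]string, 0, n)
	for i := 1; i <= n; i++ {
		names = append(names, fmt.Sprintf("jepsen_append_t%d", i))
	}
	return names
}

// newRNG builds a seeded generator so runs are reproducible.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// pickTables returns k distinct tables for one transaction, with k chosen
// uniformly from [2, len(tables)]. It returns nil when fewer than two
// tables exist, because a cross-table txn is then impossible.
func pickTables(rng *rand.Rand, tables []string) []string {
	if len(tables) < 2 {
		return nil
	}
	k := 2 + rng.Intn(len(tables)-1)
	perm := rng.Perm(len(tables)) // random order with no repeats
	out := make([]string, 0, k)
	for _, idx := range perm[:k] {
		out = append(out, tables[idx])
	}
	return out
}

func main() {
	rng := newRNG(1)
	tables := tableNames(4)
	for i := 0; i < 3; i++ {
		fmt.Println(pickTables(rng, tables))
	}
}
```

Taking a prefix of a random permutation guarantees distinct tables, which is the property that makes a transaction touch more than one table-route key.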
+ +| Concern | Resolution | +|---|---| +| Workload shape change | Append ops still write a single value per row; the change is the table they write to (one per row, ≥2 rows per txn — picked from a per-txn random subset of `t1…tN`). | +| Elle compatibility | The append checker keys on `(table, partition-key)` pairs already (the workload's history shape supports this); cross-table txns appear as multi-key ops, which Elle handles natively. | +| Comparison with historical runs | Historical runs used a single table — the M5 workload is a NEW workload variant `dynamodb-append-multi-table-workload` rather than a modification of `dynamodb-append-workload`. Both ship; the existing one stays for trend comparison. | + +The setup hook (Jepsen `db/setup!`) is gated to run only on +the FIRST node (`(when (= node (first (:nodes test))) …)`) so +the initial split is not attempted concurrently by every +cluster node and does not cause catalog-version conflicts +during bootstrap (gemini medium R3). ### 3.4 Success criterion @@ -221,8 +263,8 @@ mergeable on its own. | Phase | Title | Scope | Done when | |---|---|---|---| -| M5a | CLI + workload setup | `cmd/elastickv-split` binary; `dynamodb-append-workload`'s `:setup` issues the initial split; no nemesis yet. | `./scripts/run-jepsen-local.sh` runs unchanged but the cluster starts with 2 shards. Workload finds zero G1c (trivially, no shuffle). | -| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into `dynamodb-append-workload`'s nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). | A `./scripts/run-jepsen-local.sh` run with `--composed1-route-shuffle` produces zero G1c after ≥10 shuffles during a 5-minute run. 
| +| M5a | CLI + multi-table workload | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; setup hook (gated to first node) issues the initial split between table-route keys. | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables split across 2 shards; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. | +| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. | M5a is a small, focused PR (Go CLI + Clojure setup hook + docs). M5b carries the nemesis itself plus the cadence-tuning @@ -238,23 +280,35 @@ analysis. answer: no, the partition nemesis is enough; adding a prewrite-interrupt would test `abortPreparedTxn`, which is out of M5's scope. -- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two is - cleaner but doubles the review burden. Tentative answer: two - if M5a's CLI work runs ≥150 lines (likely); one if M5a fits in - a single screen. Decide at implementation time. +- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two + is cleaner but doubles the review burden. With the §3.3 + revision M5a is now meaningfully bigger (a new workload + variant, not just a setup hook), so two-PR is now the more + likely shape. Decide at implementation time. - **OQ-3.** Where does the new `cmd/elastickv-split` slot in the README and the `make` targets? 
Likely add it to `make tools`, mirror in `docs/operations/` (does this dir exist? — check at implementation). Out of scope for the design doc itself. -- **OQ-4.** Should the M5 design doc rename happen with PR #900 - merge (since M1–M4 ship)? Yes per CLAUDE.md's lifecycle - guidance: rename `*_proposed_*.md` → `*_partial_*.md` after - PR #900 lands, then this M5 doc tracks the open milestone. - When M5 ships, rename the parent to `*_implemented_*.md` and - this M5 doc to `*_implemented_*.md` as well (or fold the M5 - content back into the parent — tentative answer: keep them - separate so the M5 design history isn't lost). +- **OQ-4** (resolved post-PR #900 merge). The parent doc + rename `*_proposed_*.md` → `*_partial_*.md` should land as a + separate small doc-only PR now that PR #900 is merged. When + M5 ships, rename both this doc and the parent to + `*_implemented_*.md` (tentative — keep both files separate + so the M5 design history isn't lost). +- **OQ-5** (new, codex P1 follow-up). Is `N = 4` tables the + right default? Trade-offs: more tables = better 2PC + fan-out coverage but slower setup and noisier history. The + workload's existing `:concurrency` defaults to 5, so 4 + tables means each client touches ~all of them per txn on + average. Defer to implementation; revisit if the workload + becomes I/O-bound on table-meta lookups. +- **OQ-6** (new, gemini medium R3 follow-up). The first-node + gate for setup splits assumes Jepsen's `(first (:nodes test))` + is stable across nodes; verify this matches actual Jepsen + semantics (it should — `:nodes` is the test config, not a + per-node view). Out of scope to design more carefully; will + test at M5a implementation. 
--- From 248128672675265b1cc85d448bdf3f77fc445d40 Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 17:33:33 +0900 Subject: [PATCH 03/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20multi-gr?= =?UTF-8?q?oup=20cluster=20startup=20+=20post-review=20fixes=20(PR=20#905)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Six findings on f5d2ad7a — two codex P1, two claude P2, two claude P3. All addressed: * codex P1 #1 (encode table segments before choosing split keys). My split key used the raw table name; the actual table-route key uses dynamoRouteTableKey(encodeDynamoSegment( tableName)) — base64 RawURLEncoding (adapter/dynamodb.go: 8337-8365). Split at the raw name sorts after every base64-encoded segment and leaves every table on one side of the split. Fix: §3.2 now specifies the Clojure helper encode-dynamo-segment that mirrors the Go encoding, and the split key is constructed as `(str "!ddb|route|table|" (encode-dynamo-segment "jepsen_append_t2"))`. * codex P1 #2 (seed cross-group routes before claiming 2PC coverage). Today's SplitRange is same-group only — both children inherit the parent's GroupID. dispatchTxn enters dispatchMultiShardTxn only when mutations group to ≥2 Raft GROUPS, not ≥2 routes. An initial SplitRange in the setup hook would leave tables 1 and 2-4 in the same group and the workload would still take the single-shard path. Fix: §3.3 rewritten substantially. Multi-group cluster startup is now required — M5a extends scripts/run-jepsen-local.sh to pass --shardRanges so the cluster launches with ≥2 Raft groups, table-route keys for t1..t4 statically split across groups (e.g. tables 1-2 → group 1, tables 3-4 → group 2). The Jepsen db/setup! hook no longer issues SplitRange — it calls ListRoutes (gated to first node) and VERIFIES the multi-group routing is in place, failing fast if not. 
Also added §3.3's "Why the nemesis is still useful" paragraph explaining that nemesis-driven SplitRange (same-group) churns catalog versions without moving ownership, exercising M3 + M4 drift detection — a non-regression check. When a future cross-group MoveRange lands, swapping it into the nemesis upgrades the workload from non-regression to real Composed-1 anomaly hunting. * claude P2 #1 (success criterion in §3.4 will always pass). My Clojure snippet used `(filter #(= (:type %) :G1c) (:anomalies results))` but :anomalies is a map (keyword → seq), not a flat list of maps with :type. Iterating a map yields [key val] vectors with no :type field — the filter always returns empty, the check always passes silently. Fix: §3.4 corrected to `(nil? (get (:anomalies results) :G1c))`. Added doc note about the prior broken form so future readers understand the trap. * claude P2 #2 (lexicographic direction inverted in §3.2). My old explanation said `/split/` was smaller than workload keys, but `!` is ASCII 33 and `/` is 47, so `/split/` is LARGER than `!ddb|route|table|*`. Fix: §3.2 corrected — workload keys would land on the LEFT shard, not the rightmost. The fix (anchoring to the table-route prefix) is correct in either direction. * claude P3 #1 (NG3 contradicts the new multi-table variant). Fix: NG3 tightened to "no new operation types or generators BEYOND the multi-table workload variant and the route- shuffle nemesis specified in §3." * claude P3 #2 (setup hook route ID resolution not described). Resolved differently from claude's suggestion: §3.3's revision means the setup hook no longer needs route ID resolution at all (no mutation), just a ListRoutes verification. * Added OQ-7 (codex P1 #1 follow-up) for the --shardRanges boundary key encoding helper. 
--- ...posed_composed1_m5_jepsen_route_shuffle.md | 196 ++++++++++++------ 1 file changed, 133 insertions(+), 63 deletions(-) diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md index 40bde00a..d6770a8b 100644 --- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md +++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md @@ -57,9 +57,12 @@ Parent design: `SplitRange` only, no such anomaly is reachable; the workload's job is to prove the gate is non-regressing today and to be ready for tomorrow. -- **NG3.** New Jepsen workload primitives (new operation types, - new generators outside the route-shuffle nemesis). The - existing `dynamodb-append-workload` is the right surface. +- **NG3.** New Jepsen workload primitives **beyond** the new + `dynamodb-append-multi-table-workload` variant (§3.3) and the + new `route-shuffle-nemesis` (§3.2). No new operation types, + no new generators outside those two surfaces. Pre-existing + `dynamodb-append-workload` stays as-is for trend comparison + with historical runs. - **NG4.** Changing the DynamoDB adapter or Composed-1 code on the server side. M5 is purely test-harness work. @@ -151,26 +154,41 @@ The nemesis is composed with the existing via `jepsen.nemesis/compose`. The combined nemesis becomes the workload's `:nemesis`. -**Split key picking strategy (gemini medium R1).** Pick a split -key from inside the DynamoDB **table-route** key space -(`!ddb|route|table|` — see `kv/shard_key.go:94-124`). -Concretely, with N tables `jepsen_append_t1` … -`jepsen_append_tN` per §3.3, the route key for table `tK` is -`!ddb|route|table|jepsen_append_tK`. Splits happen between -adjacent table-route keys — e.g. between `…jepsen_append_t2` -and `…jepsen_append_t3`. 
This guarantees: - -- The split key falls **inside** the active workload route - range (not lexicographically before or after, which would - leave all workload keys on one side of the split). -- Each side of the split owns a distinct set of tables, so - cross-table `TransactWriteItems` actually exercises 2PC. - -A prior revision of this doc proposed a `/split/` prefix. -That was lexicographically smaller than the workload's keyspace -(`/` < `0` in ASCII), so every workload key ended up on the -rightmost shard and the 2PC path was never exercised. Fixed -above by anchoring split keys to the table-route prefix. +**Split key picking strategy (gemini medium R1 + codex P1 #1 +on f5d2ad7a).** Pick a split key from inside the DynamoDB +**table-route** key space. The exact encoding matters: the +route key is built by `dynamoRouteTableKey(encodeDynamoSegment(tableName))` +(`adapter/dynamodb.go:8337-8365`, `kv/shard_key.go:117-124`). +`encodeDynamoSegment` is base64 `RawURLEncoding` — for table +`jepsen_append_t1` the real route key is +`!ddb|route|table|am...` (base64 of the literal table name), +**not** `!ddb|route|table|jepsen_append_t1`. A prior revision +of this doc used the raw table name, which would sort after +every base64-encoded segment and leave all workload tables on +one side of the split. + +Concretely, the M5 split key construction is: + +```clojure +(str "!ddb|route|table|" + (encode-dynamo-segment "jepsen_append_t2")) ; base64 RawURLEncoding +``` + +The Clojure side gets a small helper (`composed1-nemesis/encode-dynamo-segment`) +that mirrors the Go `encodeDynamoSegment` exactly. The +Composed-1 doc and the encoding helper land together so any +future change to the encoding surface is caught by an +unambiguous reference point. 
+ +A prior revision of this doc also wrongly explained the old +`/split/` proposal: it said `/` < workload keys, but the +DynamoDB table-route prefix starts with `!` (ASCII 33) and +`/` is ASCII 47, so `/split/` is lexicographically +**larger** than `!ddb|route|table|*` — workload keys would land +on the **left** shard, not the rightmost (claude[bot] P2 on +f5d2ad7a). The fix (anchoring to the table-route prefix) +is correct in either direction; the explanation above is now +also correct. **Route ID resolution (gemini medium R2).** The nemesis CANNOT rely on a single `ListRoutes` call + a local counter — every @@ -186,59 +204,101 @@ drift (another split landing concurrently between `ErrCatalogVersionMismatch` from the server; the nemesis logs and refreshes on the next tick. -### 3.3 Multi-shard workload guarantee (revised post-codex P1) - -**Original §3.3 (Option A: single-key split in workload keyspace) -was wrong.** `kv/shard_key.go:94-124` normalises every DynamoDB -table-metadata, item, and GSI key to a single per-table route -key (`!ddb|route|table|`). So every -`jepsen_append` item resolves to the SAME catalog point -regardless of its partition-key value, and a `SplitRange` -inside the item keyspace cannot put two items on different -shards. The 2PC path (`dispatchMultiShardTxn`, secondary -commits, the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit` -sentinel) would never fire — invalidating G2 (codex P1 on -PR #905). - -**Revised strategy: multi-table workload.** The M5 workload -creates `N` tables (default `N = 4`): `jepsen_append_t1` … -`jepsen_append_t4`. Each `TransactWriteItems` operation writes -to **at least two** distinct tables. The router maps each -table to its own table-route key, so a cross-table txn -naturally fans out across whichever shards own those route -keys. The setup hook splits the table-route keyspace at -`!ddb|route|table|jepsen_append_t2` so tables 1 lives on one -shard and tables 2–4 on another from t=0. 
+### 3.3 Multi-shard workload guarantee (revised post-codex P1 #2) + +**Original §3.3 (single-key item split) and revised §3.3 +(multi-table with setup-hook SplitRange) were both incomplete.** + +Two facts about today's routing surface make the "split at +setup" approach insufficient on its own: + +1. `kv/shard_key.go:94-124` normalises every DynamoDB + table-metadata, item, and GSI key to a single per-table + route key (`!ddb|route|table|`). So + single-table workloads have one route key and cannot + fan out across shards. +2. `adapter/distribution_server.go`'s `SplitRange` only + creates child routes with the **parent's GroupID**. + `dispatchTxn` enters `dispatchMultiShardTxn` only when + mutations group to ≥2 distinct Raft **groups**, not just + ≥2 routes (codex P1 #2 on f5d2ad7a). A `SplitRange` + inside the table-route keyspace produces two routes still + in the same group — single-shard transaction path, + 2PC never fires. + +**Revised strategy: multi-table workload + multi-group cluster +startup.** Both halves are required. + +| Half | Mechanism | +|---|---| +| Multi-table workload | A NEW workload variant `dynamodb-append-multi-table-workload` creates N=4 tables (`jepsen_append_t1` … `jepsen_append_t4`) and writes to ≥2 distinct tables per `TransactWriteItems`. The router maps each table to its own route key so cross-table txns engage multi-route routing. | +| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to pass `--shardRanges` so the cluster starts with at least 2 Raft groups, and the table-route keys for `t1…t4` are statically split between groups: e.g. `tables 1-2 → group 1, tables 3-4 → group 2`. The DDB leader address per-group is wired via `--raftDynamoMap` (already supported by the runner — only the `--shardRanges` argument is new). | + +`encodeDynamoSegment("jepsen_append_t1")` etc. are computed at +script setup time and inlined into the `--shardRanges` +boundary keys. 
An `m5_setup.sh` helper (or a Go binary, see +OQ-7) emits the boundary keys deterministically. + +The Jepsen `db/setup!` hook does NOT issue any `SplitRange`. +Instead it calls `ListRoutes` once (gated to the first node +per `(when (= node (first (:nodes test))) …)`) and **verifies** +that the expected multi-group routing is in place; if not, it +fails fast with a clear error so the operator knows the +launch script needs to be re-run with the right +`--shardRanges`. This avoids the gemini-medium R3 setup-hook +concurrency hazard entirely (no mutation, just a read). It +also resolves the claude[bot] P3 follow-up about needing +ListRoutes for route-ID resolution in setup — the hook now +needs ListRoutes for **verification** rather than mutation. | Concern | Resolution | |---|---| | Workload shape change | Append ops still write a single value per row; the change is the table they write to (one per row, ≥2 rows per txn — picked from a per-txn random subset of `t1…tN`). | | Elle compatibility | The append checker keys on `(table, partition-key)` pairs already (the workload's history shape supports this); cross-table txns appear as multi-key ops, which Elle handles natively. | -| Comparison with historical runs | Historical runs used a single table — the M5 workload is a NEW workload variant `dynamodb-append-multi-table-workload` rather than a modification of `dynamodb-append-workload`. Both ship; the existing one stays for trend comparison. | - -The setup hook (Jepsen `db/setup!`) is gated to run only on -the FIRST node (`(when (= node (first (:nodes test))) …)`) so -the initial split is not attempted concurrently by every -cluster node and does not cause catalog-version conflicts -during bootstrap (gemini medium R3). +| Comparison with historical runs | The pre-existing `dynamodb-append-workload` (single table, single group) stays as-is for trend comparison. The M5 workload is a new variant alongside it. 
| + +**Why the nemesis is still useful even though SplitRange is +same-group only.** The route-shuffle nemesis (§3.2) issues +`SplitRange` calls that churn the catalog version + route IDs +without moving ownership across groups. This exercises the +M3 ObservedRouteVersion drift detection and the M4 retry +path under concurrent route-mutating control-plane traffic, +which is the closest non-regression check the current routing +surface allows. When a future cross-group `MoveRange` or +cross-group `SplitRange` lands, swapping that RPC into the +nemesis turns the workload from a "no-regression under +same-group churn" check into a "no G1c under cross-group +movement" check — matching the parent doc's forward-looking +posture. ### 3.4 Success criterion The workload's existing Elle checker emits a `:valid?` boolean -and a list of detected cycles (`:G0`, `:G1a`, `:G1b`, `:G1c`, -`:G-single`, etc.). M5's pass condition: +and an `:anomalies` map keyed by anomaly type — `{:G0 […], :G1a +[…], :G1c […], :G-single […], …}`. M5's pass condition: ``` (and (:valid? results) - (zero? (count (filter #(= (:type %) :G1c) (:anomalies results)))) - (zero? (count (filter #(= (:type %) :G-single) (:anomalies results))))) + (nil? (get (:anomalies results) :G1c)) + (nil? (get (:anomalies results) :G-single))) ``` +A prior revision of this doc used `(filter #(= (:type %) :G1c) +…)` over `(:anomalies results)`, which is wrong: iterating a +map yields `[key val]` vectors with no `:type` field, so the +filter always returned an empty seq and the check would +silently pass on any G1c run (claude[bot] P2 on f5d2ad7a). +The corrected form above keys off the map directly. + `G1c` is the parent doc's explicit safety violation; `G-single` is the closely-related single-item anomaly we already chase in the existing workload. Other anomaly types (G0, G1a, G1b) indicate orthogonal bugs and should also fail the run, but the parent doc's M5 row names G1c as the headline criterion. +`(:valid? 
results)` is the canonical Elle pass/fail bit; the +explicit G1c / G-single checks are belt-and-suspenders so a +future Elle refactor that subdivides the cycle taxonomy still +fails on the specific anomalies M5 cares about. ### 3.5 Cadence @@ -263,7 +323,7 @@ mergeable on its own. | Phase | Title | Scope | Done when | |---|---|---|---| -| M5a | CLI + multi-table workload | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; setup hook (gated to first node) issues the initial split between table-route keys. | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables split across 2 shards; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. | +| M5a | CLI + multi-table workload + multi-group launch | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; **`scripts/run-jepsen-local.sh` extended to pass `--shardRanges` so the cluster launches with ≥2 Raft groups** and table-route keys for `t1…tN` are statically split across groups; setup hook calls `ListRoutes` from the first node and verifies the multi-group routing is in place (read-only, no mutation). | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables 1-2 owned by group 1 and tables 3-4 owned by group 2; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. | | M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. 
| A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. | M5a is a small, focused PR (Go CLI + Clojure setup hook + @@ -304,11 +364,21 @@ analysis. average. Defer to implementation; revisit if the workload becomes I/O-bound on table-meta lookups. - **OQ-6** (new, gemini medium R3 follow-up). The first-node - gate for setup splits assumes Jepsen's `(first (:nodes test))` - is stable across nodes; verify this matches actual Jepsen - semantics (it should — `:nodes` is the test config, not a - per-node view). Out of scope to design more carefully; will - test at M5a implementation. + gate for the setup verification assumes Jepsen's + `(first (:nodes test))` is stable across nodes; verify this + matches actual Jepsen semantics (it should — `:nodes` is the + test config, not a per-node view). Out of scope to design + more carefully; will test at M5a implementation. +- **OQ-7** (new, codex P1 #1 follow-up). The `--shardRanges` + boundary keys for the multi-group launch (§3.3) need to be + emitted as bytes that exactly match + `dynamoRouteTableKey(encodeDynamoSegment(tableName))`. Two + options: (a) a tiny Go helper (`cmd/elastickv-route-key`) + that prints the encoded key for a given table name, called + by `scripts/run-jepsen-local.sh` to build `--shardRanges`; + (b) inline the base64 encoding in shell. Tentative answer: + (a), because the encoding lives in Go and any drift would + silently mis-route. Decide at implementation. 
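Review note on OQ-7 option (a): the boundary-key helper could be sketched roughly as below. This is a minimal sketch, assuming only what §3.2 states (the `!ddb|route|table|` prefix and base64 `RawURLEncoding` of the table name). The names `tableRoutePrefix`, `encodeSegment`, and `routeKey` are hypothetical stand-ins; the real `encodeDynamoSegment` and `dynamoRouteTableKey` in the Go tree are not shown in this series and may differ in detail.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// tableRoutePrefix is the table-route prefix quoted in §3.2.
const tableRoutePrefix = "!ddb|route|table|"

// encodeSegment is a hypothetical stand-in for encodeDynamoSegment.
// ASSUMPTION: it is exactly base64 RawURLEncoding, that is, the URL-safe
// alphabet with no '=' padding, as the design doc describes.
func encodeSegment(name string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(name))
}

// routeKey is a hypothetical stand-in for dynamoRouteTableKey applied to
// the encoded segment: prefix followed by the encoded table name.
func routeKey(table string) string {
	return tableRoutePrefix + encodeSegment(table)
}

func main() {
	tables := []string{
		"jepsen_append_t1",
		"jepsen_append_t2",
		"jepsen_append_t3",
		"jepsen_append_t4",
	}
	// Print one candidate --shardRanges boundary key per table.
	for _, t := range tables {
		fmt.Println(routeKey(t))
	}

	// The raw-name key from the earlier revision compares greater than
	// every encoded key, which is why it could not partition the tables.
	raw := tableRoutePrefix + "jepsen_append_t2"
	fmt.Println("raw key sorts after encoded t4:", raw > routeKey("jepsen_append_t4"))
}
```

For these four table names the encoded keys happen to sort in table order, but the base64 URL alphabet is not order-preserving in general, so the launch script would probably want to check the ordering of the emitted keys rather than assume it.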
---
From 673bee8cc1e0e88d8b8ff02d9d49cebdc97f4560 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Tue, 2 Jun 2026 17:39:26 +0900
Subject: [PATCH 04/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20markdown?=
 =?UTF-8?q?lint=20cleanup=20(MD027=20+=20MD040)=20on=20PR=20#905?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

coderabbitai minor on 24812867:

* MD027 (multiple-space-after-blockquote) at lines 17,21 in the
  upfront posture block. Trimmed double-spaces after `>` to a single
  space.

* MD040 (fenced-code-language) at the three plain ``` blocks:
  - the elastickv-split CLI usage (```bash)
  - the route-shuffle-nemesis Clojure sketch (```clojure)
  - the M5 success-criterion Clojure snippet (```clojure)

Doc-only, no semantic changes.
---
 ...oposed_composed1_m5_jepsen_route_shuffle.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index d6770a8b..457d3f37 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -14,12 +14,12 @@ Parent design: > scaffolding — workload shape, nemesis, success criterion — so > that: > -> 1. The current `SplitRange` is exercised under realistic -> concurrent multi-shard write load and proved non-regressing -> (the workload finds **no** G1c, which is the baseline M4 -> contract). -> 2. When a future PR introduces a route-mutating RPC that DOES -> shift ownership across groups (cross-group `SplitRange`, +> 1. The current `SplitRange` is exercised under realistic +> concurrent multi-shard write load and proved non-regressing +> (the workload finds **no** G1c, which is the baseline M4 +> contract). +> 2.
When a future PR introduces a route-mutating RPC that DOES +> shift ownership across groups (cross-group `SplitRange`, > `MoveRange`, online rebalancer), the M5 workload — with a > one-line nemesis change to call the new RPC — becomes the > integration-level proof that M3+M4 hold under cross-group
@@ -101,7 +101,7 @@ sentinel that would catch a regression. M5 closes that gap. A standalone Go binary, ~80 lines: -``` +```bash elastickv-split \ --address 127.0.0.1:50051 \ --route-id 100 \
@@ -130,7 +130,7 @@ clearer. A new Clojure file with a single `route-shuffle-nemesis` function returning a `jepsen.nemesis/Nemesis` instance: -``` +```clojure (defn route-shuffle-nemesis "Periodically invokes elastickv-split against the cluster. :start -> shuffle one route (pick a non-edge split key)
@@ -277,7 +277,7 @@ The workload's existing Elle checker emits a `:valid?` boolean and an `:anomalies` map keyed by anomaly type — `{:G0 […], :G1a […], :G1c […], :G-single […], …}`. M5's pass condition: -``` +```clojure (and (:valid? results) (nil? (get (:anomalies results) :G1c)) (nil? (get (:anomalies results) :G-single)))
From 3ca2a7f7c4d4acbba6f05c3f39ee5b3aef1e84b4 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Tue, 2 Jun 2026 17:44:00 +0900
Subject: [PATCH 05/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20claude[b?=
 =?UTF-8?q?ot]=20P2+P3=20on=2024812867=20(launch=20script=20scope=20+=20Ba?=
 =?UTF-8?q?se64=20nuance)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two findings on 24812867 from claude[bot]'s post-revision review:

* P2 — '"only --shardRanges is new" understates M5a's launch script
  changes' (§3.3 table cell). Confirmed against
  shard_config.go:61-99: without --raftGroups, every --shardRanges
  entry collapses into the default single group 1 and groupMutations
  returns gids = [1] for every key. Single-shard path, 2PC never
  fires.

  Fix: §3.3 table cell rewritten to list FIVE coordinated
  launch-script changes M5a must make:
    1. --raftGroups declaring >=2 groups
    2. --shardRanges with t1-t2 in group 1, t3-t4 in group 2
    3. Cluster topology decision (tentative: 2 single-node groups
       for CI speed; scale to 6 nodes if needed)
    4. Per-group --raftBootstrapMembers
    5. Expanded --raftDynamoMap + matching port allocations
  Avoids the implementer discovering the topology gap mid-PR.

* P3 — encode-dynamo-segment Clojure implementation needs Base64
  nuance. Go uses base64.RawURLEncoding (URL-safe charset WITHOUT
  '=' padding). Java's three Base64 variants differ:
    - Base64/getEncoder (standard '+'/'/', with padding) — wrong
    - Base64/getUrlEncoder (URL-safe, WITH padding) — wrong
    - Base64/getUrlEncoder + .withoutPadding — correct
  Failure mode is silent: wrong encoding produces split keys outside
  every route's range; ListRoutes returns no match; nemesis logs and
  skips every :start; run appears clean while the nemesis is a
  no-op.

  Fix: §3.2 now calls out .withoutPadding explicitly and documents
  the silent-no-op failure mode.
---
 ...posed_composed1_m5_jepsen_route_shuffle.md | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index 457d3f37..0fadddca 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -175,9 +175,20 @@ Concretely, the M5 split key construction is: ``` The Clojure side gets a small helper (`composed1-nemesis/encode-dynamo-segment`) -that mirrors the Go `encodeDynamoSegment` exactly. The -Composed-1 doc and the encoding helper land together so any -future change to the encoding surface is caught by an +that mirrors the Go `encodeDynamoSegment` exactly.
The Go +implementation uses `base64.RawURLEncoding` — URL-safe charset +(`-`/`_`) **without `=` padding**. The Clojure side MUST use +`(.withoutPadding (java.util.Base64/getUrlEncoder))`; +`Base64/getEncoder` (standard alphabet with `+`/`/`) and the +default `Base64/getUrlEncoder` (URL-safe **with** padding) are +both wrong (claude[bot] P3 on 24812867). The failure mode is +silent and non-obvious: a wrong encoding produces split keys +that fall outside every route's range, `ListRoutes` returns no +matching route, the nemesis logs and skips every `:start`, +and the run appears clean while the nemesis is a no-op. + +The Composed-1 doc and the encoding helper land together so +any future change to the encoding surface is caught by an unambiguous reference point. A prior revision of this doc also wrongly explained the old @@ -232,7 +243,7 @@ startup.** Both halves are required. | Half | Mechanism | |---|---| | Multi-table workload | A NEW workload variant `dynamodb-append-multi-table-workload` creates N=4 tables (`jepsen_append_t1` … `jepsen_append_t4`) and writes to ≥2 distinct tables per `TransactWriteItems`. The router maps each table to its own route key so cross-table txns engage multi-route routing. | -| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to pass `--shardRanges` so the cluster starts with at least 2 Raft groups, and the table-route keys for `t1…t4` are statically split between groups: e.g. `tables 1-2 → group 1, tables 3-4 → group 2`. The DDB leader address per-group is wired via `--raftDynamoMap` (already supported by the runner — only the `--shardRanges` argument is new). | +| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to launch a multi-group cluster. 
Per `shard_config.go:61-99` (claude[bot] P2 on 24812867), `--shardRanges` alone is not enough: without `--raftGroups`, every shard range's `groupID` collapses into the single default group 1 and `groupMutations` returns `gids = [1]` for every key — single-shard path, 2PC never fires. M5a must therefore add **five** coordinated launch-script changes:

1. **`--raftGroups`** declaring ≥2 groups (e.g. `1=127.0.0.1:50051,2=127.0.0.1:50054`).
2. **`--shardRanges`** boundary keys placing `t1-t2` in group 1 and `t3-t4` in group 2.
3. **Cluster topology decision** — either 6 nodes (3-per-group, production-like consensus) or 2 single-node groups (`--raftBootstrap` per group, faster for CI). Tentative pick: 2 single-node groups for M5a, scale to 6 nodes if/when the workload outgrows it (low cost — flip a flag).
4. **Per-group `--raftBootstrapMembers`** (different members sets for each group's nodes).
5. **Expanded `--raftDynamoMap`** covering both groups' leader addresses, plus the matching port allocations (currently `5005{1,2,3}`/`6380{1,2,3}` for the single-group 3-node layout — need `5005{1,4}`/`6380{1,4}` for the 2-single-node layout, or `5005{1..6}`/`6380{1..6}` for 6-node). |
`encodeDynamoSegment("jepsen_append_t1")` etc. are computed at script setup time and inlined into the `--shardRanges`
From 87dcfe3f31a8bd953f745869478ed9e8b5ef7f18 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Tue, 2 Jun 2026 17:48:24 +0900
Subject: [PATCH 06/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20update?=
 =?UTF-8?q?=20=C2=A74=20wording=20(claude[bot]=20minor=20on=20673bee8c)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The 'M5a is a small, focused PR' line at the bottom of §4 was
accurate for the original M5a scope (Go CLI + setup hook + docs) but
became understated after the codex P1 #2 revision expanded M5a to
include the multi-table workload variant and the five-change
multi-group launch script.

Updated to reflect actual scope:

M5a now ships:
  * cmd/elastickv-split (Go CLI)
  * dynamodb-append-multi-table-workload (Clojure)
  * scripts/run-jepsen-local.sh multi-group launch extension
    (5 coordinated changes per §3.3 table)
  * setup-hook verification (read-only ListRoutes + assert)
  * docs

Likely lands as a single PR for atomicity (workload variant only
makes sense alongside multi-group launch + verification) but
reviewable as four roughly independent diff sections.
--- ...2_proposed_composed1_m5_jepsen_route_shuffle.md | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md index 0fadddca..f013bd39 100644 --- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md +++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md @@ -337,9 +337,17 @@ mergeable on its own. | M5a | CLI + multi-table workload + multi-group launch | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; **`scripts/run-jepsen-local.sh` extended to pass `--shardRanges` so the cluster launches with ≥2 Raft groups** and table-route keys for `t1…tN` are statically split across groups; setup hook calls `ListRoutes` from the first node and verifies the multi-group routing is in place (read-only, no mutation). | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables 1-2 owned by group 1 and tables 3-4 owned by group 2; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. | | M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. | -M5a is a small, focused PR (Go CLI + Clojure setup hook + -docs). M5b carries the nemesis itself plus the cadence-tuning -analysis. 
+M5a is a single focused PR but no longer "small" after the
+codex P1 #2 expansion: `cmd/elastickv-split` (Go CLI), the
+new `dynamodb-append-multi-table-workload` (Clojure), the
+`scripts/run-jepsen-local.sh` multi-group launch extension
+(five coordinated changes — see §3.3 table), the setup-hook
+verification (read-only `ListRoutes` + assertion), plus docs.
+Likely to land as a single PR for atomicity (the workload
+variant only makes sense alongside the multi-group launch and
+the verification hook), but reviewable as four roughly
+independent diff sections. M5b carries the nemesis itself
+plus the cadence-tuning analysis.
---
From b752a89448529d54fc025b7bbfbd407cd73b689a Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Tue, 2 Jun 2026 17:54:12 +0900
Subject: [PATCH 07/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20bootstra?=
 =?UTF-8?q?p=20flag=20+=20interior=20split=20keys=20(codex=20P2=20+=20clau?=
 =?UTF-8?q?de[bot]=20P2=20on=203ca2a7f7)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two new findings on 3ca2a7f7:

* codex P2 + claude[bot] P2 (same defect, independently flagged):
  --raftBootstrapMembers is rejected on multi-group startup
  (main.go:735-741 — ErrBootstrapMembersRequireSingleGroup guard).
  My §3.3 item 4 said 'Per-group --raftBootstrapMembers', which
  would abort startup before Jepsen could run.

  Fix: §3.3 item 4 now correctly specifies --raftBootstrap (boolean)
  for the tentative 2-single-node-groups topology. Single-member
  groups need no peer discovery so --raftBootstrap is necessary and
  sufficient. The existing --raftBootstrapMembers in
  run-jepsen-local.sh must be REMOVED for multi-group processes —
  it's only valid for single-group multi-node formations.

* codex P2 #2 — split key reuse becomes no-op. My §3.2 example
  showed the boundary key !ddb|route|table|, but after the
  setup-time split that key is the Start of the right child route.
  adapter/distribution_server.go:357-372's validateSplitKey rejects
  splitKey == parent.Start/End with 'split key at route boundary' —
  so every subsequent nemesis tick at the same key would fail. In a
  5-minute run expecting >=10 shuffles, the nemesis would no-op
  silently and the run would appear clean while catalog churn never
  landed.

  Fix: §3.2 now splits into TWO distinct split-key constructions:
    (a) Setup-time boundary key (one-shot, in
        scripts/run-jepsen-local.sh) — IS a route boundary by design
        (the static partition between groups).
    (b) Per-shuffle interior key (every nemesis :start) — MUST lie
        strictly between the current route's Start and End.
        Strategy: append a base64-alphabet byte plus a fresh counter
        to route.start; the result sorts after start but before any
        non-base64 byte (e.g. ASCII '|'=124).
  Implementation details (counter persistence, collision recovery)
  deferred to M5b; the design contract is: each shuffle's split key
  is strictly interior to the route's current range.
---
 ...posed_composed1_m5_jepsen_route_shuffle.md | 49 ++++++++++++++++++-
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index f013bd39..2e4b40a2 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -167,13 +167,58 @@ of this doc used the raw table name, which would sort after every base64-encoded segment and leave all workload tables on one side of the split. -Concretely, the M5 split key construction is: +There are TWO distinct split keys to construct.
Getting them +right is the difference between an honest non-regression +check and a silent no-op: + +**(a) Setup-time boundary key (one-shot, in +`scripts/run-jepsen-local.sh`).** This is the key that +partitions the table-route keyspace between groups at cluster +launch via `--shardRanges`. It IS a route boundary by +design — group 1 covers `[… , bk)`, group 2 covers +`[bk, …)`. Constructed as: ```clojure (str "!ddb|route|table|" (encode-dynamo-segment "jepsen_append_t2")) ; base64 RawURLEncoding ``` +**(b) Per-shuffle interior key (every nemesis `:start` invocation).** +After the setup-time boundary lands, the table-route key +`!ddb|route|table|` is now the +**`Start`** of the right child route. `validateSplitKey` +in `adapter/distribution_server.go:357-372` rejects +`splitKey == parent.Start` or `splitKey == parent.End` with +"split key at route boundary" — so reusing the setup-time key +would fail every nemesis tick after the first (codex P2 #2 on +3ca2a7f7). In a 5-minute run expecting ≥10 shuffles, the +nemesis would no-op silently and the workload would never +exercise the catalog-version churn M3+M4 must absorb. + +The nemesis MUST derive a fresh interior key per `:start` +that lies strictly between the current route's `Start` and +`End`. Strategy: + +```clojure +(defn fresh-interior-split-key + "Returns a key strictly inside [route.start, route.end). + The base64 RawURL alphabet (A-Z, a-z, 0-9, -, _) admits + simple suffix appending: any byte from that alphabet + appended to route.start sorts after route.start but + before any byte that is not in the alphabet (such as ASCII + '|' = 124, which is greater than every base64 char)." + [route] + ;; Append a fresh alphabetic byte plus an incrementing + ;; counter; check the result is < route.end and retry with + ;; a different prefix byte on collision. + (str (:start route) "A" (System/nanoTime))) +``` + +Exact strategy details (counter persistence across nemesis +restarts, collision recovery, etc.) 
are M5b implementation +notes — the design contract is: each shuffle's split key is +strictly interior to the route's current range. + The Clojure side gets a small helper (`composed1-nemesis/encode-dynamo-segment`) that mirrors the Go `encodeDynamoSegment` exactly. The Go implementation uses `base64.RawURLEncoding` — URL-safe charset @@ -243,7 +288,7 @@ startup.** Both halves are required. | Half | Mechanism | |---|---| | Multi-table workload | A NEW workload variant `dynamodb-append-multi-table-workload` creates N=4 tables (`jepsen_append_t1` … `jepsen_append_t4`) and writes to ≥2 distinct tables per `TransactWriteItems`. The router maps each table to its own route key so cross-table txns engage multi-route routing. | -| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to launch a multi-group cluster. Per `shard_config.go:61-99` (claude[bot] P2 on 24812867), `--shardRanges` alone is not enough: without `--raftGroups`, every shard range's `groupID` collapses into the single default group 1 and `groupMutations` returns `gids = [1]` for every key — single-shard path, 2PC never fires. M5a must therefore add **five** coordinated launch-script changes:

1. **`--raftGroups`** declaring ≥2 groups (e.g. `1=127.0.0.1:50051,2=127.0.0.1:50054`).
2. **`--shardRanges`** boundary keys placing `t1-t2` in group 1 and `t3-t4` in group 2.
3. **Cluster topology decision** — either 6 nodes (3-per-group, production-like consensus) or 2 single-node groups (`--raftBootstrap` per group, faster for CI). Tentative pick: 2 single-node groups for M5a, scale to 6 nodes if/when the workload outgrows it (low cost — flip a flag).
4. **Per-group `--raftBootstrapMembers`** (different members sets for each group's nodes).
5. **Expanded `--raftDynamoMap`** covering both groups' leader addresses, plus the matching port allocations (currently `5005{1,2,3}`/`6380{1,2,3}` for the single-group 3-node layout — need `5005{1,4}`/`6380{1,4}` for the 2-single-node layout, or `5005{1..6}`/`6380{1..6}` for 6-node). | +| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to launch a multi-group cluster. Per `shard_config.go:61-99` (claude[bot] P2 on 24812867), `--shardRanges` alone is not enough: without `--raftGroups`, every shard range's `groupID` collapses into the single default group 1 and `groupMutations` returns `gids = [1]` for every key — single-shard path, 2PC never fires. M5a must therefore add **five** coordinated launch-script changes:

1. **`--raftGroups`** declaring ≥2 groups (e.g. `1=127.0.0.1:50051,2=127.0.0.1:50054`).
2. **`--shardRanges`** boundary keys placing `t1-t2` in group 1 and `t3-t4` in group 2.
3. **Cluster topology decision** — either 6 nodes (3-per-group, production-like consensus) or 2 single-node groups (`--raftBootstrap` per group, faster for CI). Tentative pick: 2 single-node groups for M5a, scale to 6 nodes if/when the workload outgrows it (low cost — flip a flag).
4. **Per-node `--raftBootstrap` (boolean) — NOT `--raftBootstrapMembers`.** `main.go:735-741` has a hard guard: `resolveBootstrapServers` returns `ErrBootstrapMembersRequireSingleGroup` whenever `--raftBootstrapMembers` is set and `--raftGroups` parses to more than one group (codex P2 + claude[bot] P2 on 3ca2a7f7). For the tentative 2-single-node-groups topology, each process hosts ONE group with ONE member — single-member groups need no peer discovery, so `--raftBootstrap` is both necessary and sufficient. The existing `--raftBootstrapMembers` in `run-jepsen-local.sh` must be **removed** for multi-group processes; it is only valid for single-group, multi-node formations.
5. **Expanded `--raftDynamoMap`** covering both groups' leader addresses, plus the matching port allocations (currently `5005{1,2,3}`/`6380{1,2,3}` for the single-group 3-node layout — need `5005{1,4}`/`6380{1,4}` for the 2-single-node layout, or `5005{1..6}`/`6380{1..6}` for 6-node). | `encodeDynamoSegment("jepsen_append_t1")` etc. are computed at script setup time and inlined into the `--shardRanges` From 6f024aaa7378a289dc09990a13de831b629fd72f Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 18:02:42 +0900 Subject: [PATCH 08/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20note=20C?= =?UTF-8?q?lojure=20byte-vs-string=20coercion=20in=20deferred=20M5b=20impl?= =?UTF-8?q?ementation=20list=20(claude[bot]=20minor=20on=20b752a894)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit claude[bot] flagged a non-blocking M5b implementation concern on b752a894: ListRoutes protobuf response fields may be Java byte arrays rather than strings depending on deserialization. Calling Clojure (str byte-array) silently yields '[B@…' (Java object reference) instead of the bytes, producing a wrong split key that would land outside the route's range — a silent mis-routing footgun. The existing 'Implementation details deferred to M5b' line in §3.2 already covers this in spirit, but explicitly listing 'byte-vs-string coercion of ListRoutes response fields' next to 'counter persistence' and 'collision recovery' makes the specific footgun visible at design-doc-read time so the M5b implementer can budget for it rather than discover it through debugging silent mis-routing. Doc-only, no semantic changes. 
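For illustration only, a minimal Clojure sketch of the coercion this note asks the M5b implementer to budget for. The helper name `route-key->string` is hypothetical, and decoding route keys as ISO-8859-1 (one char per byte) is an assumption rather than something the M5 doc specifies. Whether `ListRoutes` fields actually arrive as byte arrays depends on the client's deserialization.

```clojure
(import 'java.nio.charset.StandardCharsets)

(defn route-key->string
  "Hypothetical helper: returns route key k as a string.
  Byte arrays are decoded one byte per char (ISO-8859-1, an
  assumption); any other value goes through str unchanged."
  [k]
  (if (bytes? k)
    (String. ^bytes k StandardCharsets/ISO_8859_1)
    (str k)))

;; Calling (str some-byte-array) directly would yield a "[B@..."
;; object reference; decoding first keeps the key's actual bytes.
(route-key->string
  (.getBytes "!ddb|route|table|abc" StandardCharsets/ISO_8859_1))
;; => "!ddb|route|table|abc"
```

A nemesis following this sketch would pass `(:start route)` through the helper before appending any suffix, so the derived split key is built from the route's real start bytes.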
--- ...06_02_proposed_composed1_m5_jepsen_route_shuffle.md | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md index 2e4b40a2..7bb4e341 100644 --- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md +++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md @@ -215,9 +215,13 @@ that lies strictly between the current route's `Start` and ``` Exact strategy details (counter persistence across nemesis -restarts, collision recovery, etc.) are M5b implementation -notes — the design contract is: each shuffle's split key is -strictly interior to the route's current range. +restarts, collision recovery, byte-vs-string coercion of +`ListRoutes` response fields — `(str (:start route) …)` +silently yields `"[B@…"` if `:start` is a Java byte array +rather than a string, a silent mis-routing footgun flagged +by claude[bot] on b752a894) are M5b implementation notes — +the design contract is: each shuffle's split key is strictly +interior to the route's current range. The Clojure side gets a small helper (`composed1-nemesis/encode-dynamo-segment`) that mirrors the Go `encodeDynamoSegment` exactly. The Go From f92a029e284c16efb69b2822eec023ebc57fa3ad Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 21:31:37 +0900 Subject: [PATCH 09/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20single-p?= =?UTF-8?q?rocess=20two-group=20topology=20(codex=20P1=20on=206f024aaa)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Codex P1 on 6f024aaa correctly identifies that the proposed '2 processes, each hosting one group' topology is NOT supported by the current server: * shard_config.go:399-410 (validateShardRanges) requires every --shardRanges group ID to appear in that process's --raftGroups. Process A with --raftGroups '1=...' 
but a shardRanges entry pointing to group 2 fails validation. * main.go:764 (buildShardGroups) starts a local Raft runtime for every group listed in --raftGroups. Configuring both processes with --raftGroups '1=...,2=...' makes both try to host both groups and race on the Raft listeners. Fix: §3.3 items 3-5 rewritten for the SUPPORTED topology — ONE PROCESS hosting both single-member groups (two Raft addresses, shared Dynamo endpoint). Cross-group txns go through the in-process router and 2PC fires across the two co-located groups. Trade-off explicitly documented: because both groups live in one process, partition/kill nemeses can't separate them. Only the route-shuffle nemesis (API-level) exercises the cross-group path meaningfully. True distributed multi-group (multi-process) requires server-side support for 'remote-only groups in --raftGroups' or equivalent — M6+ work, out of M5 scope. This is a strictly correct narrowing of the topology choice (removed the unsupported '2 separate processes' and '6 nodes' options; the only supported shape is now spelled out). The M5a Done-when criterion still holds: cluster starts with tables 1-2 in group 1 and tables 3-4 in group 2, workload exercises dispatchMultiShardTxn. --- .../2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md index 7bb4e341..0eec2488 100644 --- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md +++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md @@ -292,7 +292,7 @@ startup.** Both halves are required. | Half | Mechanism | |---|---| | Multi-table workload | A NEW workload variant `dynamodb-append-multi-table-workload` creates N=4 tables (`jepsen_append_t1` … `jepsen_append_t4`) and writes to ≥2 distinct tables per `TransactWriteItems`. 
The router maps each table to its own route key so cross-table txns engage multi-route routing. | -| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to launch a multi-group cluster. Per `shard_config.go:61-99` (claude[bot] P2 on 24812867), `--shardRanges` alone is not enough: without `--raftGroups`, every shard range's `groupID` collapses into the single default group 1 and `groupMutations` returns `gids = [1]` for every key — single-shard path, 2PC never fires. M5a must therefore add **five** coordinated launch-script changes:

1. **`--raftGroups`** declaring ≥2 groups (e.g. `1=127.0.0.1:50051,2=127.0.0.1:50054`).
2. **`--shardRanges`** boundary keys placing `t1-t2` in group 1 and `t3-t4` in group 2.
3. **Cluster topology decision** — either 6 nodes (3-per-group, production-like consensus) or 2 single-node groups (`--raftBootstrap` per group, faster for CI). Tentative pick: 2 single-node groups for M5a, scale to 6 nodes if/when the workload outgrows it (low cost — flip a flag).
4. **Per-node `--raftBootstrap` (boolean) — NOT `--raftBootstrapMembers`.** `main.go:735-741` has a hard guard: `resolveBootstrapServers` returns `ErrBootstrapMembersRequireSingleGroup` whenever `--raftBootstrapMembers` is set and `--raftGroups` parses to more than one group (codex P2 + claude[bot] P2 on 3ca2a7f7). For the tentative 2-single-node-groups topology, each process hosts ONE group with ONE member — single-member groups need no peer discovery, so `--raftBootstrap` is both necessary and sufficient. The existing `--raftBootstrapMembers` in `run-jepsen-local.sh` must be **removed** for multi-group processes; it is only valid for single-group, multi-node formations.
5. **Expanded `--raftDynamoMap`** covering both groups' leader addresses, plus the matching port allocations (currently `5005{1,2,3}`/`6380{1,2,3}` for the single-group 3-node layout — need `5005{1,4}`/`6380{1,4}` for the 2-single-node layout, or `5005{1..6}`/`6380{1..6}` for 6-node). | +| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to launch a multi-group cluster. Per `shard_config.go:61-99` (claude[bot] P2 on 24812867), `--shardRanges` alone is not enough: without `--raftGroups`, every shard range's `groupID` collapses into the single default group 1 and `groupMutations` returns `gids = [1]` for every key — single-shard path, 2PC never fires. M5a must therefore add **five** coordinated launch-script changes:

1. **`--raftGroups`** declaring ≥2 groups (e.g. `1=127.0.0.1:50051,2=127.0.0.1:50054`).
2. **`--shardRanges`** boundary keys placing `t1-t2` in group 1 and `t3-t4` in group 2.
3. **Cluster topology decision — constrained by the current server.** `shard_config.go:399-410` (`validateShardRanges`) requires every `--shardRanges` group ID to appear in that process's `--raftGroups`, and `main.go:764` (`buildShardGroups`) starts a local Raft runtime for every group listed in `--raftGroups` (codex P1 on 6f024aaa). Consequence: "two processes, each hosting one group" is NOT supported — Process A with `--raftGroups "1=…"` and a `--shardRanges` entry pointing to group 2 fails validation; configuring both processes with `--raftGroups "1=…,2=…"` makes both try to host both groups and race on the Raft listeners.
The supported topology for M5a is **one process hosting both single-member groups**: two Raft addresses (`50051` for group 1, `50054` for group 2) plus a single shared Dynamo endpoint (e.g. `63801`). Each group is single-member (just this process), so `--raftBootstrap` works for both. Cross-group txns go through the in-process router, so 2PC fires across the two co-located groups.
**Limitation accepted for M5a:** because both groups live in one process, partition/kill nemeses can't separate them — only the route-shuffle nemesis (API-level) exercises the cross-group path meaningfully. True distributed multi-group (multi-process) requires server-side support for "remote-only groups in `--raftGroups`" or equivalent, which is M6+ work and out of M5's scope.
4. **Per-process `--raftBootstrap` (boolean) — NOT `--raftBootstrapMembers`.** `main.go:735-741` has a hard guard: `resolveBootstrapServers` returns `ErrBootstrapMembersRequireSingleGroup` whenever `--raftBootstrapMembers` is set and `--raftGroups` parses to more than one group (codex P2 + claude[bot] P2 on 3ca2a7f7). For the single-process-two-groups topology, the process needs no peer discovery for either of its two single-member groups, so `--raftBootstrap` is necessary and sufficient. The existing `--raftBootstrapMembers` in `run-jepsen-local.sh` must be **removed** for the M5a launch; it is only valid for single-group, multi-node formations.
5. **Single `--raftDynamoMap`** entry pointing the process's Dynamo address at itself (no cross-process leader fan-out is needed since there's only one process); port allocation simplifies — drop the `5005{1,2,3}`/`6380{1,2,3}` 3-node layout to a 1-process layout with two Raft ports (`50051`, `50054`) and one Dynamo port (`63801`). | `encodeDynamoSegment("jepsen_append_t1")` etc. are computed at script setup time and inlined into the `--shardRanges` From 58d6ac5edc8fe7adef9a62fe0bc5c22272b2c0ff Mon Sep 17 00:00:00 2001 From: "Yoshiaki Ueda (bootjp)" Date: Tue, 2 Jun 2026 21:43:40 +0900 Subject: [PATCH 10/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20=C2=A74?= =?UTF-8?q?=20row=20must=20require=20both=20--raftGroups=20AND=20--shardRa?= =?UTF-8?q?nges=20(coderabbit=20Major=20on=20f92a029e)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit coderabbit Major on f92a029e correctly identifies that §4's M5a milestone row was inconsistent with §3.3's five-change contract: the row said the launch extension passes '--shardRanges so the cluster launches with >=2 Raft groups' but §3.3 spells out that --shardRanges alone collapses every range to the single default group 1 (shard_config.go:61-99 — the claude[bot] P2 finding on 24812867 that drove the §3.3 rewrite in the first place). Both --raftGroups AND --shardRanges are required for the multi-group contract. Fix: §4's M5a Scope cell now explicitly says 'pass BOTH --raftGroups (declaring >=2 groups) AND --shardRanges (placing tables across those groups)', and the cell cross-references §3.3 for the 'shardRanges alone collapses to group 1' explanation. Avoids the implementer following §4 in isolation and missing the --raftGroups requirement. Doc-only, no semantic changes to the design itself — just internal consistency between §4 row and §3.3 table. 
--- .../2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md index 0eec2488..01ea9e40 100644 --- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md +++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md @@ -383,7 +383,7 @@ mergeable on its own. | Phase | Title | Scope | Done when | |---|---|---|---| -| M5a | CLI + multi-table workload + multi-group launch | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; **`scripts/run-jepsen-local.sh` extended to pass `--shardRanges` so the cluster launches with ≥2 Raft groups** and table-route keys for `t1…tN` are statically split across groups; setup hook calls `ListRoutes` from the first node and verifies the multi-group routing is in place (read-only, no mutation). | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables 1-2 owned by group 1 and tables 3-4 owned by group 2; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. | +| M5a | CLI + multi-table workload + multi-group launch | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; **`scripts/run-jepsen-local.sh` extended to pass BOTH `--raftGroups` (declaring ≥2 groups) AND `--shardRanges` (placing tables across those groups)** — per §3.3, `--shardRanges` alone collapses every range to the single default group 1 and 2PC never fires; both flags are required for the multi-group contract. 
Table-route keys for `t1…tN` are statically split across the declared groups; setup hook calls `ListRoutes` from the first node and verifies the multi-group routing is in place (read-only, no mutation). | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables 1-2 owned by group 1 and tables 3-4 owned by group 2; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. | | M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. | M5a is a single focused PR but no longer "small" after the