From ffb9c73fc76d6ef0aeea2ff86f863109e3c70ff0 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Tue, 2 Jun 2026 13:20:42 +0900
Subject: [PATCH 01/10] =?UTF-8?q?docs(design):=20Composed-1=20M5=20?=
 =?UTF-8?q?=E2=80=94=20Jepsen=20route-shuffle=20workload=20proposal?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Companion to 2026_05_29_proposed_composed1_cross_group_commit_guard.md
(parent — proposes the cross-group commit-time ownership guard
M1+M2+M3+M4 implemented in PR #900).

M5 is the integration-level proof from the parent doc's
milestone table.  This proposal elaborates the row into a full
design:

  * cmd/elastickv-split — tiny CLI that invokes SplitRange once
    by route ID + split key + expected version.  Lets the
    Jepsen nemesis shell out instead of re-implementing the
    gRPC client in Clojure.
  * jepsen/src/elastickv/composed1_nemesis.clj — route-shuffle
    nemesis that periodically splits an existing route.
    Composes with the existing partition+kill nemesis package.
  * dynamodb-append-workload setup hook — issues one initial
    SplitRange so the workload spans >=2 shards from t=0,
    exercising the 2PC path (prewrite -> primary commit ->
    secondary commit + the new
    ErrTxnSecondaryRouteShiftedAfterPrimaryCommit sentinel).
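
As a rough illustration of the CLI's flag handling, the Go sketch
below parses the four flags listed in section 3.1 of the doc
(`--address`, `--route-id`, `--split-key`, `--expected-version`) and
treats any missing flag as an error. It is a minimal sketch under
stated assumptions, not the proposed `main.go`: the `splitArgs` type
and `parseSplitArgs` function are hypothetical names, and the gRPC
dial and the `SplitRange` call are deliberately left out.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// splitArgs holds the four inputs the proposal names for elastickv-split.
// The type itself is hypothetical; only the flag names come from the doc.
type splitArgs struct {
	Address         string
	RouteID         uint64
	SplitKey        string
	ExpectedVersion uint64
}

// parseSplitArgs parses and validates the flags. It returns an error
// instead of exiting so the caller decides the process exit code.
func parseSplitArgs(args []string) (splitArgs, error) {
	var a splitArgs
	fs := flag.NewFlagSet("elastickv-split", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage noise out of the nemesis log
	fs.StringVar(&a.Address, "address", "", "gRPC address to dial")
	fs.Uint64Var(&a.RouteID, "route-id", 0, "ID of the route to split")
	fs.StringVar(&a.SplitKey, "split-key", "", "key at which to split")
	fs.Uint64Var(&a.ExpectedVersion, "expected-version", 0, "expected catalog version")
	if err := fs.Parse(args); err != nil {
		return splitArgs{}, err
	}
	// Record which flags were set explicitly, so that a legitimate zero
	// value can be told apart from a flag that was never passed.
	seen := map[string]bool{}
	fs.Visit(func(f *flag.Flag) { seen[f.Name] = true })
	for _, name := range []string{"address", "route-id", "split-key", "expected-version"} {
		if !seen[name] {
			return splitArgs{}, errors.New("missing required flag --" + name)
		}
	}
	return a, nil
}

func main() {
	a, err := parseSplitArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "elastickv-split:", err)
		os.Exit(2) // non-zero exit so a shelling-out caller sees the failure
	}
	// A real tool would dial a.Address and issue SplitRange here.
	fmt.Printf("would split route %d at %q (expected version %d) via %s\n",
		a.RouteID, a.SplitKey, a.ExpectedVersion, a.Address)
}
```

Returning an error from the parser keeps the non-zero-exit contract in
one place (`main`), which is what the nemesis relies on to notice a
failed split.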

Forward-looking posture (same as parent): today's SplitRange
is same-group only, so the Composed-1 hazard the M3/M4 guard
catches cannot yet be *triggered* in production.  M5 ships the
scaffolding so:

  1. The current SplitRange is exercised under realistic
     concurrent multi-shard write load and proved non-regressing
     (workload finds zero G1c — baseline M4 contract).
  2. When a future PR introduces a route-mutating RPC that DOES
     shift ownership across groups, the M5 workload — with a
     one-line nemesis swap — becomes the integration-level proof
     that M3+M4 hold under cross-group churn.

Two-phase milestone breakdown:

  * M5a: CLI + workload setup (mergeable on its own; trivially
    finds zero G1c without the nemesis).
  * M5b: Route-shuffle nemesis itself + cadence-tuning analysis.

Open questions tracked in section 5; lifecycle questions about
renaming the parent doc are in OQ-4.
---
 ...posed_composed1_m5_jepsen_route_shuffle.md | 284 ++++++++++++++++++
 1 file changed, 284 insertions(+)
 create mode 100644 docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
new file mode 100644
index 00000000..a7f4d8ea
--- /dev/null
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -0,0 +1,284 @@
+# Composed-1 M5 — Jepsen route-shuffle workload
+
+Status: Proposed
+Author: bootjp
+Date: 2026-06-02
+Parent design:
+[`2026_05_29_proposed_composed1_cross_group_commit_guard.md`](2026_05_29_proposed_composed1_cross_group_commit_guard.md)
+
+> **Forward-looking proposal, same posture as the parent doc.**
+> Today's `SplitRange` is same-group only (per CLAUDE.md and
+> `adapter/distribution_server.go`'s implementation), so the
+> Composed-1 hazard the M3/M4 guard catches cannot yet be
+> *triggered* in production. M5 ships the integration-test
+> scaffolding — workload shape, nemesis, success criterion — so
+> that:
+>
+>   1. The current `SplitRange` is exercised under realistic
+>      concurrent multi-shard write load and proved non-regressing
+>      (the workload finds **no** G1c, which is the baseline M4
+>      contract).
+>   2. When a future PR introduces a route-mutating RPC that DOES
+>      shift ownership across groups (cross-group `SplitRange`,
+>      `MoveRange`, online rebalancer), the M5 workload — with a
+>      one-line nemesis change to call the new RPC — becomes the
+>      integration-level proof that M3+M4 hold under cross-group
+>      churn.
+
+---
+
+## 1. Goals and non-goals
+
+### 1.1 Goals
+
+- **G1.** Add a `route-shuffle` nemesis to the DynamoDB Jepsen
+  suite that issues `SplitRange` against the cluster at a
+  configurable cadence concurrently with the existing DynamoDB
+  workload's `TransactWriteItems` traffic.
+- **G2.** Force the workload to issue **multi-shard** transactions
+  with high probability so the 2PC path (`dispatchMultiShardTxn`,
+  `commitSecondaryTxns`, the new
+  `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit` sentinel) is
+  exercised.
+- **G3.** Verify the workload's Elle checker reports **zero G1c**
+  cycles after the run. This is M4's "Done when" criterion
+  promoted from a unit test to an integration test.
+- **G4.** Land a tiny CLI helper (`cmd/elastickv-split`, or a
+  subcommand in `elastickv-admin`) that issues a single
+  `SplitRange` RPC by route ID + split key + expected version.
+  The Jepsen nemesis shells out to it rather than re-implementing
+  the gRPC client in Clojure.
+
+### 1.2 Non-goals
+
+- **NG1.** A cross-group `SplitRange` or `MoveRange` RPC. The
+  parent doc explicitly defers this; M5 must not depend on it.
+- **NG2.** Reproducing a real Composed-1 anomaly. With same-group
+  `SplitRange` only, no such anomaly is reachable; the workload's
+  job is to prove the gate is non-regressing today and to be
+  ready for tomorrow.
+- **NG3.** New Jepsen workload primitives (new operation types,
+  new generators outside the route-shuffle nemesis). The
+  existing `dynamodb-append-workload` is the right surface.
+- **NG4.** Changing the DynamoDB adapter or Composed-1 code on
+  the server side. M5 is purely test-harness work.
+
+---
+
+## 2. Why this matters now
+
+PR #900 lands M1+M2+M3+M4 on `feat/composed1-m4-retry`. The unit
+tests cover each milestone in isolation. Two gaps remain that
+only an integration test can close:
+
+1. **End-to-end ordering.** The M3 gate runs inside the FSM
+   apply path; the M4 retry runs inside `ShardedCoordinator`;
+   the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit`
+   sentinel surfaces from `commitSecondaryTxns`. Each is unit-
+   tested in isolation. None of the tests run the full
+   prewrite → primary-commit → secondary-commit chain on real
+   Raft groups under concurrent client load, which is where
+   subtle apply-ordering bugs hide.
+2. **Workload realism.** PR #900's nine review rounds each
+   surfaced an auto-pin overreach (read-write, caller-StartTS,
+   2PC secondary, resolver-claimed). The review process is
+   thorough, but a Jepsen run is the empirical check: if the
+   workload's Elle checker finds G1c against the M4 build, the
+   review missed something.
+
+If we ship M4 to `main` without M5, every later change to
+routing, OCC, or the FSM apply path lacks the integration-level
+sentinel that would catch a regression. M5 closes that gap.
+
+---
+
+## 3. Design
+
+### 3.1 SplitRange invocation CLI (`cmd/elastickv-split`)
+
+A standalone Go binary, ~80 lines:
+
+```
+elastickv-split \
+  --address 127.0.0.1:50051 \
+  --route-id 100 \
+  --split-key /q1/00001 \
+  --expected-version 7
+```
+
+Reads the four flags, dials the leader, issues
+`proto.Distribution/SplitRange`, prints the new catalog version
+and the two child route IDs on success. Non-zero exit on any
+error so the Jepsen nemesis sees the failure.
+
+The CLI lives in `cmd/elastickv-split/main.go`. No tests beyond
+a smoke test (`main_test.go`) that runs `elastickv-split --help`
+and asserts non-zero exit on missing flags. The real coverage
+is the Jepsen run itself.
+
+**Alternative considered:** add a `split` subcommand to
+`cmd/elastickv-admin/`. Rejected because `elastickv-admin` is
+the HTTP fanout admin and conflating it with a gRPC control-
+plane invocation would muddy its scope. A standalone tool is
+clearer.
+
+### 3.2 Route-shuffle nemesis (`jepsen/src/elastickv/composed1_nemesis.clj`)
+
+A new Clojure file with a single `route-shuffle-nemesis`
+function returning a `jepsen.nemesis/Nemesis` instance:
+
+```
+(defn route-shuffle-nemesis
+  "Periodically invokes elastickv-split against the cluster.
+   :start  -> shuffle one route (pick a non-edge split key)
+   :stop   -> no-op (splits are durable, no rollback)"
+  [opts]
+  (reify nemesis/Nemesis
+    (setup! [this test] ...)
+    (invoke! [this test op] ...
+       ;; shell out to elastickv-split with a fresh split key
+       )
+    (teardown! [this test] ...)))
+```
+
+The nemesis is composed with the existing
+`jepsen.nemesis.combined/nemesis-package` (partitions + kills)
+via `jepsen.nemesis/compose`. The combined nemesis becomes the
+workload's `:nemesis`.
+
+**Split key picking strategy.** A simple monotonically-increasing
+counter: every `:start` invocation appends a fresh integer
+suffix to a fixed key prefix the workload reserves. This avoids
+collisions with the workload's keyspace and guarantees the
+split always picks a key that's between existing keys (so the
+operation succeeds against a real catalog).
+
+**Expected version.** The nemesis calls `ListRoutes` once at
+setup to learn the current catalog version, then increments its
+local copy by 1 after each successful split. Catalog drift
+(another split landing concurrently) is rare in practice — if it
+happens, the nemesis logs and refreshes from `ListRoutes`.
+
+### 3.3 Multi-shard workload guarantee
+
+The existing `dynamodb-append-workload` writes to a per-key
+queue. With a single shard layout, every write goes to that
+shard — no 2PC, no Composed-1 exposure.
+
+M5 needs the workload to consistently span shards. Two options:
+
+| Option | Mechanism | Pro | Con |
+|---|---|---|---|
+| **A** Force initial split | The test setup issues one `SplitRange` before the workload starts | Workload runs on 2+ shards from t=0 | Adds a setup step; needs a known split key |
+| **B** Multi-key txns | Modify each `:append` op to write to ≥2 keys with deterministic routing across shards | Workload exercises 2PC even on a 1-shard layout | Changes the workload's operation shape (harder to compare against historical runs) |
+
+**Choose A.** Less invasive to the workload, and the
+route-shuffle nemesis itself increases the shard count over
+time, giving organic multi-shard coverage.
+
+The setup hook (`db/setup!` in Jepsen parlance, or the test's
+`:setup` map) runs `elastickv-split` once with a split key in
+the middle of the workload's keyspace.
+
+### 3.4 Success criterion
+
+The workload's existing Elle checker emits a `:valid?` boolean
+and a list of detected cycles (`:G0`, `:G1a`, `:G1b`, `:G1c`,
+`:G-single`, etc.). M5's pass condition:
+
+```
+(and (:valid? results)
+     (zero? (count (filter #(= (:type %) :G1c) (:anomalies results))))
+     (zero? (count (filter #(= (:type %) :G-single) (:anomalies results)))))
+```
+
+`G1c` is the parent doc's explicit safety violation; `G-single`
+is the closely related single anti-dependency cycle we already chase in
+the existing workload. Other anomaly types (G0, G1a, G1b)
+indicate orthogonal bugs and should also fail the run, but the
+parent doc's M5 row names G1c as the headline criterion.
+
+### 3.5 Cadence
+
+Default `:route-shuffle-interval` = `30s`. Configurable via the
+test CLI. Rationale: the workload's typical txn rate is ~10
+ops/sec per client with 5 concurrent clients (= 50 ops/sec), so a
+30s shuffle puts ~1500 txns between shuffles — enough to
+plausibly catch a mid-2PC race, but rare enough that the run
+doesn't degenerate into "every txn races a split."
+
+The route-shuffle nemesis composes with the existing
+partition+kill nemesis. The combined nemesis fires at the
+shortest of its members' intervals (Jepsen default
+behaviour); kills/partitions remain at their existing 40s.
+
+---
+
+## 4. Milestone breakdown
+
+Two phases. The phases land in this order; the first is
+mergeable on its own.
+
+| Phase | Title | Scope | Done when |
+|---|---|---|---|
+| M5a | CLI + workload setup | `cmd/elastickv-split` binary; `dynamodb-append-workload`'s `:setup` issues the initial split; no nemesis yet. | `./scripts/run-jepsen-local.sh` runs unchanged but the cluster starts with 2 shards. Workload finds zero G1c (trivially, no shuffle). |
+| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into `dynamodb-append-workload`'s nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). | A `./scripts/run-jepsen-local.sh` run with `--composed1-route-shuffle` produces zero G1c after ≥10 shuffles during a 5-minute run. |
+
+M5a is a small, focused PR (Go CLI + Clojure setup hook +
+docs). M5b carries the nemesis itself plus the cadence-tuning
+analysis.
+
+---
+
+## 5. Open questions
+
+- **OQ-1.** Should the nemesis also issue an `Abort`-shaped
+  fault that interrupts an in-flight 2PC mid-prewrite? The
+  existing partition nemesis effectively does this. Tentative
+  answer: no, the partition nemesis is enough; adding a
+  prewrite-interrupt would test `abortPreparedTxn`, which is
+  out of M5's scope.
+- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two is
+  cleaner but doubles the review burden. Tentative answer: two
+  if M5a's CLI work runs ≥150 lines (likely); one if M5a fits in
+  a single screen. Decide at implementation time.
+- **OQ-3.** Where does the new `cmd/elastickv-split` slot in
+  the README and the `make` targets? Likely add it to
+  `make tools`, mirror in `docs/operations/` (does this dir
+  exist? — check at implementation). Out of scope for the
+  design doc itself.
+- **OQ-4.** Should the M5 design doc rename happen with PR #900
+  merge (since M1–M4 ship)? Yes per CLAUDE.md's lifecycle
+  guidance: rename `*_proposed_*.md` → `*_partial_*.md` after
+  PR #900 lands, then this M5 doc tracks the open milestone.
+  When M5 ships, rename the parent to `*_implemented_*.md` and
+  this M5 doc to `*_implemented_*.md` as well (or fold the M5
+  content back into the parent — tentative answer: keep them
+  separate so the M5 design history isn't lost).
+
+---
+
+## 6. Self-review summary
+
+Five-pass per CLAUDE.md:
+
+1. **Data loss.** No new write paths; the CLI invokes the
+   existing `SplitRange` RPC which already has full unit + e2e
+   coverage. Nemesis-driven calls of an existing RPC can't lose
+   committed writes (worst case: a split fails and the test
+   fails, no data effect).
+2. **Concurrency / distributed failures.** The nemesis runs
+   under Jepsen's existing concurrency harness alongside
+   partitions + kills. Combined behaviour is the *point* of the
+   test — if anything breaks, the workload finds it. No new
+   server-side concurrency code is being introduced.
+3. **Performance.** Nemesis fires every 30s; CLI invocation is a
+   single short-lived gRPC call. No measurable impact on hot
+   paths.
+4. **Data consistency.** This IS the data-consistency check
+   (G1c = serializability violation). The success criterion is
+   the property we want.
+5. **Test coverage.** M5 ships the integration test; the CLI
+   smoke test is the only unit-level coverage, and that is
+   intentional. The CLI's logic is thin enough that a smoke
+   test plus the Jepsen run constitutes adequate coverage.

From f5d2ad7ac7c695b8dc1b86119a59feb080fd95da Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Tue, 2 Jun 2026 16:15:09 +0900
Subject: [PATCH 02/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20multi-ta?=
 =?UTF-8?q?ble=20workload=20+=20post-review=20revisions=20(PR=20#905)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex P1 + 3 gemini medium findings on the original PR #905
revision (ffb9c73f).  All addressed by revising sections 3.2,
3.3, and 4 (milestone breakdown) and adding OQ-5 / OQ-6:

* codex P1 — "Don't rely on item-key splits to shard DynamoDB
  txns."  Verified against kv/shard_key.go:94-124: every DynamoDB
  table-metadata, item, and GSI key normalises to a SINGLE per-
  table route key (!ddb|route|table|<tableSegment>).  Splitting
  inside a single-table workload's item keyspace cannot put two
  items on different shards, so the 2PC path
  (dispatchMultiShardTxn, secondary commits,
  ErrTxnSecondaryRouteShiftedAfterPrimaryCommit) would never
  fire — invalidating G2.

  Fix: replace single-key-split (Option A) with a NEW workload
  variant dynamodb-append-multi-table-workload that creates N=4
  tables (jepsen_append_t1 … jepsen_append_t4) and writes to >=2
  distinct tables per TransactWriteItems.  The router maps each
  table to its own route key, so cross-table txns naturally fan
  out across shards.  The setup hook splits the table-route
  keyspace at !ddb|route|table|jepsen_append_t2.

* gemini medium R1 — "Lexicographical Shard Split Issue."  The
  prior /split/<int> split-key prefix was lexicographically
  smaller than the workload's keyspace ("/" < "0" in ASCII),
  so every workload key ended up on the rightmost shard and
  G2 was never exercised.

  Fix: anchor split keys to the table-route prefix
  !ddb|route|table|... so the split lands INSIDE the active
  workload route range.

* gemini medium R2 — "Route ID Resolution for SplitRange."
  Successful SplitRange deletes the parent route ID and
  creates two child IDs, so a cached ID from a one-time
  setup-time ListRoutes call is stale on the next shuffle.

  Fix: nemesis re-queries ListRoutes on every :start
  invocation, walks the snapshot to find the route covering
  the chosen split key, and uses that route's ID +
  snapshot.version as expected_catalog_version.  Catalog drift
  surfaces as ErrCatalogVersionMismatch from the server and the
  nemesis refreshes on the next tick.

* gemini medium R3 — "Gating of Initial Split in Setup Hook."
  Jepsen db/setup! runs on EVERY node; an ungated initial
  split would be attempted concurrently by all nodes.

  Fix: gate the setup-time split on
  (when (= node (first (:nodes test))) ...) so only the first
  node attempts it.
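
The route-ID resolution step for R2 can be sketched in Go as follows.
This is an illustrative sketch only: the `route` and `snapshot` types
are hypothetical simplifications of a `ListRoutes` response, ranges are
assumed to be half-open `[Start, End)` with an empty `End` meaning
unbounded, and rejecting a key equal to a route's start is an
assumption based on the doc's "non-edge split key" wording.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// route is a hypothetical, simplified view of one ListRoutes entry.
// Ranges are assumed half-open [Start, End); empty End means unbounded.
type route struct {
	ID    uint64
	Start []byte
	End   []byte
}

// snapshot pairs the routes with the catalog version they were read at.
type snapshot struct {
	Version uint64
	Routes  []route
}

// resolveSplit returns the route ID and expected catalog version to send
// with a split at key. It fails when no route covers the key, or when the
// key sits on a route's start boundary (assumed to be an invalid split).
func resolveSplit(s snapshot, key []byte) (routeID, version uint64, err error) {
	for _, r := range s.Routes {
		afterStart := bytes.Compare(key, r.Start) >= 0
		beforeEnd := len(r.End) == 0 || bytes.Compare(key, r.End) < 0
		if !afterStart || !beforeEnd {
			continue
		}
		if bytes.Equal(key, r.Start) {
			return 0, 0, errors.New("split key equals route start")
		}
		return r.ID, s.Version, nil
	}
	return 0, 0, errors.New("no route covers split key")
}

func main() {
	// After one split the parent route ID is gone and two child IDs
	// exist, so the lookup must run against a fresh snapshot each time.
	s := snapshot{Version: 8, Routes: []route{
		{ID: 101, Start: []byte(""), End: []byte("m")},
		{ID: 102, Start: []byte("m"), End: nil},
	}}
	id, v, err := resolveSplit(s, []byte("t"))
	fmt.Println(id, v, err)
}
```

Because the route ID and the version come from the same snapshot, a
concurrent split between the lookup and the `SplitRange` call shows up
as a version mismatch on the server rather than as a split against a
deleted route.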

Also:

  * Updated §4 milestone table: M5a now ships the new workload
    variant (not just a setup hook), so it is meaningfully
    bigger than the original §4 row suggested.
  * Added OQ-5 (is N=4 the right default?) and OQ-6 (first-node
    gate semantics) as follow-ups for implementation time.
  * Resolved OQ-4: PR #900 has merged, so the parent doc rename
    *_proposed_*.md → *_partial_*.md should now land as a
    separate small doc-only PR.
---
 ...posed_composed1_m5_jepsen_route_shuffle.md | 150 ++++++++++++------
 1 file changed, 102 insertions(+), 48 deletions(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index a7f4d8ea..40bde00a 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -136,7 +136,12 @@ function returning a `jepsen.nemesis/Nemesis` instance:
   (reify nemesis/Nemesis
     (setup! [this test] ...)
     (invoke! [this test op] ...
-       ;; shell out to elastickv-split with a fresh split key
+       ;; 1. call ListRoutes to find the route currently covering
+       ;;    the chosen split key — route IDs change after every
+       ;;    split, so a cached ID from setup is stale
+       ;; 2. pick a split key inside that route's range
+       ;; 3. shell out to elastickv-split with route-id +
+       ;;    split-key + expected-version from ListRoutes
        )
     (teardown! [this test] ...)))
 ```
@@ -146,39 +151,76 @@ The nemesis is composed with the existing
 via `jepsen.nemesis/compose`. The combined nemesis becomes the
 workload's `:nemesis`.
 
-**Split key picking strategy.** A simple monotonically-increasing
-counter: every `:start` invocation appends a fresh integer
-suffix to a fixed key prefix the workload reserves. This avoids
-collisions with the workload's keyspace and guarantees the
-split always picks a key that's between existing keys (so the
-operation succeeds against a real catalog).
-
-**Expected version.** The nemesis calls `ListRoutes` once at
-setup to learn the current catalog version, then increments its
-local copy by 1 after each successful split. Catalog drift
-(another split landing concurrently) is rare in practice — if it
-happens, the nemesis logs and refreshes from `ListRoutes`.
-
-### 3.3 Multi-shard workload guarantee
-
-The existing `dynamodb-append-workload` writes to a per-key
-queue. With a single shard layout, every write goes to that
-shard — no 2PC, no Composed-1 exposure.
-
-M5 needs the workload to consistently span shards. Two options:
-
-| Option | Mechanism | Pro | Con |
-|---|---|---|---|
-| **A** Force initial split | The test setup issues one `SplitRange` before the workload starts | Workload runs on 2+ shards from t=0 | Adds a setup step; needs a known split key |
-| **B** Multi-key txns | Modify each `:append` op to write to ≥2 keys with deterministic routing across shards | Workload exercises 2PC even on a 1-shard layout | Changes the workload's operation shape (harder to compare against historical runs) |
-
-**Choose A.** Less invasive to the workload, and the
-route-shuffle nemesis itself increases the shard count over
-time, giving organic multi-shard coverage.
-
-The setup hook (`db/setup!` in Jepsen parlance, or the test's
-`:setup` map) runs `elastickv-split` once with a split key in
-the middle of the workload's keyspace.
+**Split key picking strategy (gemini medium R1).** Pick a split
+key from inside the DynamoDB **table-route** key space
+(`!ddb|route|table|<tableSegment>` — see `kv/shard_key.go:94-124`).
+Concretely, with N tables `jepsen_append_t1` …
+`jepsen_append_tN` per §3.3, the route key for table `tK` is
+`!ddb|route|table|jepsen_append_tK`. Splits happen between
+adjacent table-route keys — e.g. between `…jepsen_append_t2`
+and `…jepsen_append_t3`. This guarantees:
+
+- The split key falls **inside** the active workload route
+  range (not lexicographically before or after, which would
+  leave all workload keys on one side of the split).
+- Each side of the split owns a distinct set of tables, so
+  cross-table `TransactWriteItems` actually exercises 2PC.
+
+A prior revision of this doc proposed a `/split/<int>` prefix.
+That was lexicographically smaller than the workload's keyspace
+(`/` < `0` in ASCII), so every workload key ended up on the
+rightmost shard and the 2PC path was never exercised. Fixed
+above by anchoring split keys to the table-route prefix.
+
+**Route ID resolution (gemini medium R2).** The nemesis CANNOT
+rely on a single `ListRoutes` call + a local counter — every
+successful split deletes the parent route ID and creates two
+fresh child IDs, so a cached route ID is stale on the next
+shuffle. On every `:start` invocation the nemesis re-queries
+`ListRoutes`, walks the returned snapshot to find the route
+whose range contains the chosen split key, and uses that
+route's ID + the snapshot's `version` as the
+`SplitRangeRequest`'s `expected_catalog_version`. Catalog
+drift (another split landing concurrently between
+`ListRoutes` and `SplitRange`) surfaces as
+`ErrCatalogVersionMismatch` from the server; the nemesis logs
+and refreshes on the next tick.
+
+### 3.3 Multi-shard workload guarantee (revised post-codex P1)
+
+**Original §3.3 (Option A: single-key split in workload keyspace)
+was wrong.** `kv/shard_key.go:94-124` normalises every DynamoDB
+table-metadata, item, and GSI key to a single per-table route
+key (`!ddb|route|table|<tableSegment>`). So every
+`jepsen_append` item resolves to the SAME catalog point
+regardless of its partition-key value, and a `SplitRange`
+inside the item keyspace cannot put two items on different
+shards. The 2PC path (`dispatchMultiShardTxn`, secondary
+commits, the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit`
+sentinel) would never fire — invalidating G2 (codex P1 on
+PR #905).
+
+**Revised strategy: multi-table workload.** The M5 workload
+creates `N` tables (default `N = 4`): `jepsen_append_t1` …
+`jepsen_append_t4`. Each `TransactWriteItems` operation writes
+to **at least two** distinct tables. The router maps each
+table to its own table-route key, so a cross-table txn
+naturally fans out across whichever shards own those route
+keys. The setup hook splits the table-route keyspace at
+`!ddb|route|table|jepsen_append_t2` so table 1 lives on one
+shard and tables 2–4 on another from t=0.
+
+| Concern | Resolution |
+|---|---|
+| Workload shape change | Append ops still write a single value per row; the change is the table they write to (one per row, ≥2 rows per txn — picked from a per-txn random subset of `t1…tN`). |
+| Elle compatibility | The append checker keys on `(table, partition-key)` pairs already (the workload's history shape supports this); cross-table txns appear as multi-key ops, which Elle handles natively. |
+| Comparison with historical runs | Historical runs used a single table — the M5 workload is a NEW workload variant `dynamodb-append-multi-table-workload` rather than a modification of `dynamodb-append-workload`. Both ship; the existing one stays for trend comparison. |
+
+The setup hook (Jepsen `db/setup!`) is gated to run only on
+the FIRST node (`(when (= node (first (:nodes test))) …)`) so
+the initial split is not attempted concurrently by every
+cluster node and does not cause catalog-version conflicts
+during bootstrap (gemini medium R3).
 
 ### 3.4 Success criterion
 
@@ -221,8 +263,8 @@ mergeable on its own.
 
 | Phase | Title | Scope | Done when |
 |---|---|---|---|
-| M5a | CLI + workload setup | `cmd/elastickv-split` binary; `dynamodb-append-workload`'s `:setup` issues the initial split; no nemesis yet. | `./scripts/run-jepsen-local.sh` runs unchanged but the cluster starts with 2 shards. Workload finds zero G1c (trivially, no shuffle). |
-| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into `dynamodb-append-workload`'s nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). | A `./scripts/run-jepsen-local.sh` run with `--composed1-route-shuffle` produces zero G1c after ≥10 shuffles during a 5-minute run. |
+| M5a | CLI + multi-table workload | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; setup hook (gated to first node) issues the initial split between table-route keys. | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables split across 2 shards; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. |
+| M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. |
 
 M5a is a small, focused PR (Go CLI + Clojure setup hook +
 docs). M5b carries the nemesis itself plus the cadence-tuning
@@ -238,23 +280,35 @@ analysis.
   answer: no, the partition nemesis is enough; adding a
   prewrite-interrupt would test `abortPreparedTxn`, which is
   out of M5's scope.
-- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two is
-  cleaner but doubles the review burden. Tentative answer: two
-  if M5a's CLI work runs ≥150 lines (likely); one if M5a fits in
-  a single screen. Decide at implementation time.
+- **OQ-2.** Do we ship M5a + M5b in a single PR or two? Two
+  is cleaner but doubles the review burden. With the §3.3
+  revision M5a is now meaningfully bigger (a new workload
+  variant, not just a setup hook), so two-PR is now the more
+  likely shape. Decide at implementation time.
 - **OQ-3.** Where does the new `cmd/elastickv-split` slot in
   the README and the `make` targets? Likely add it to
   `make tools`, mirror in `docs/operations/` (does this dir
   exist? — check at implementation). Out of scope for the
   design doc itself.
-- **OQ-4.** Should the M5 design doc rename happen with PR #900
-  merge (since M1–M4 ship)? Yes per CLAUDE.md's lifecycle
-  guidance: rename `*_proposed_*.md` → `*_partial_*.md` after
-  PR #900 lands, then this M5 doc tracks the open milestone.
-  When M5 ships, rename the parent to `*_implemented_*.md` and
-  this M5 doc to `*_implemented_*.md` as well (or fold the M5
-  content back into the parent — tentative answer: keep them
-  separate so the M5 design history isn't lost).
+- **OQ-4** (resolved post-PR #900 merge). The parent doc
+  rename `*_proposed_*.md` → `*_partial_*.md` should land as a
+  separate small doc-only PR now that PR #900 is merged. When
+  M5 ships, rename both this doc and the parent to
+  `*_implemented_*.md` (tentative — keep both files separate
+  so the M5 design history isn't lost).
+- **OQ-5** (new, codex P1 follow-up). Is `N = 4` tables the
+  right default? Trade-offs: more tables = better 2PC
+  fan-out coverage but slower setup and noisier history. The
+  workload's existing `:concurrency` defaults to 5, so 4
+  tables means each client touches ~all of them per txn on
+  average. Defer to implementation; revisit if the workload
+  becomes I/O-bound on table-meta lookups.
+- **OQ-6** (new, gemini medium R3 follow-up). The first-node
+  gate for setup splits assumes Jepsen's `(first (:nodes test))`
+  is stable across nodes; verify this matches actual Jepsen
+  semantics (it should — `:nodes` is the test config, not a
+  per-node view). Out of scope to design more carefully; will
+  test at M5a implementation.
 
 ---
 

From 248128672675265b1cc85d448bdf3f77fc445d40 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Tue, 2 Jun 2026 17:33:33 +0900
Subject: [PATCH 03/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20multi-gr?=
 =?UTF-8?q?oup=20cluster=20startup=20+=20post-review=20fixes=20(PR=20#905)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Six findings on f5d2ad7a — two codex P1, two claude P2, two
claude P3. All addressed:

* codex P1 #1 (encode table segments before choosing split
  keys).  My split key used the raw table name; the actual
  table-route key uses dynamoRouteTableKey(encodeDynamoSegment(
  tableName)) — base64 RawURLEncoding (adapter/dynamodb.go:
  8337-8365).  A split key built from the raw name sorts
  after every base64-encoded segment and leaves every table
  on one side of the split.

  Fix: §3.2 now specifies the Clojure helper
  encode-dynamo-segment that mirrors the Go encoding, and the
  split key is constructed as
  `(str "!ddb|route|table|" (encode-dynamo-segment "jepsen_append_t2"))`.

* codex P1 #2 (seed cross-group routes before claiming 2PC
  coverage).  Today's SplitRange is same-group only — both
  children inherit the parent's GroupID.  dispatchTxn enters
  dispatchMultiShardTxn only when mutations group to ≥2 Raft
  GROUPS, not ≥2 routes.  An initial SplitRange in the
  setup hook would leave tables 1 and 2-4 in the same group
  and the workload would still take the single-shard path.

  Fix: §3.3 rewritten substantially.  Multi-group cluster
  startup is now required — M5a extends
  scripts/run-jepsen-local.sh to pass --shardRanges so the
  cluster launches with ≥2 Raft groups, table-route keys for
  t1..t4 statically split across groups (e.g. tables 1-2 →
  group 1, tables 3-4 → group 2).  The Jepsen db/setup! hook
  no longer issues SplitRange — it calls ListRoutes (gated
  to first node) and VERIFIES the multi-group routing is in
  place, failing fast if not.  Also added §3.3's "Why the
  nemesis is still useful" paragraph explaining that
  nemesis-driven SplitRange (same-group) churns catalog
  versions without moving ownership, exercising M3 + M4
  drift detection — a non-regression check.  When a future
  cross-group MoveRange lands, swapping it into the nemesis
  upgrades the workload from non-regression to real
  Composed-1 anomaly hunting.

* claude P2 #1 (success criterion in §3.4 will always pass).
  My Clojure snippet used
  `(filter #(= (:type %) :G1c) (:anomalies results))` but
  :anomalies is a map (keyword → seq), not a flat list of
  maps with :type.  Iterating a map yields [key val] vectors
  with no :type field, so the filter always returns an empty
  seq and the check passes silently.

  Fix: §3.4 corrected to
  `(nil? (get (:anomalies results) :G1c))`.  Added doc note
  about the prior broken form so future readers understand
  the trap.

* claude P2 #2 (lexicographic direction inverted in §3.2).
  My old explanation said `/split/<int>` was smaller than
  workload keys, but `!` is ASCII 33 and `/` is 47, so
  `/split/<int>` is LARGER than `!ddb|route|table|*`.

  Fix: §3.2 corrected — workload keys would land on the LEFT
  shard, not the rightmost.  The fix (anchoring to the
  table-route prefix) is correct in either direction.

* claude P3 #1 (NG3 contradicts the new multi-table variant).

  Fix: NG3 tightened to "no new operation types or generators
  BEYOND the multi-table workload variant and the route-
  shuffle nemesis specified in §3."

* claude P3 #2 (setup hook route ID resolution not described).
  Resolved differently from claude's suggestion: §3.3's
  revision means the setup hook no longer needs route ID
  resolution at all (no mutation), just a ListRoutes
  verification.

* Added OQ-7 (codex P1 #1 follow-up) for the
  --shardRanges boundary key encoding helper.
---
 ...posed_composed1_m5_jepsen_route_shuffle.md | 196 ++++++++++++------
 1 file changed, 133 insertions(+), 63 deletions(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index 40bde00a..d6770a8b 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -57,9 +57,12 @@ Parent design:
   `SplitRange` only, no such anomaly is reachable; the workload's
   job is to prove the gate is non-regressing today and to be
   ready for tomorrow.
-- **NG3.** New Jepsen workload primitives (new operation types,
-  new generators outside the route-shuffle nemesis). The
-  existing `dynamodb-append-workload` is the right surface.
+- **NG3.** New Jepsen workload primitives **beyond** the new
+  `dynamodb-append-multi-table-workload` variant (§3.3) and the
+  new `route-shuffle-nemesis` (§3.2).  No new operation types,
+  no new generators outside those two surfaces.  Pre-existing
+  `dynamodb-append-workload` stays as-is for trend comparison
+  with historical runs.
 - **NG4.** Changing the DynamoDB adapter or Composed-1 code on
   the server side. M5 is purely test-harness work.
 
@@ -151,26 +154,41 @@ The nemesis is composed with the existing
 via `jepsen.nemesis/compose`. The combined nemesis becomes the
 workload's `:nemesis`.
 
-**Split key picking strategy (gemini medium R1).** Pick a split
-key from inside the DynamoDB **table-route** key space
-(`!ddb|route|table|<tableSegment>` — see `kv/shard_key.go:94-124`).
-Concretely, with N tables `jepsen_append_t1` …
-`jepsen_append_tN` per §3.3, the route key for table `tK` is
-`!ddb|route|table|jepsen_append_tK`. Splits happen between
-adjacent table-route keys — e.g. between `…jepsen_append_t2`
-and `…jepsen_append_t3`. This guarantees:
-
-- The split key falls **inside** the active workload route
-  range (not lexicographically before or after, which would
-  leave all workload keys on one side of the split).
-- Each side of the split owns a distinct set of tables, so
-  cross-table `TransactWriteItems` actually exercises 2PC.
-
-A prior revision of this doc proposed a `/split/<int>` prefix.
-That was lexicographically smaller than the workload's keyspace
-(`/` < `0` in ASCII), so every workload key ended up on the
-rightmost shard and the 2PC path was never exercised. Fixed
-above by anchoring split keys to the table-route prefix.
+**Split key picking strategy (gemini medium R1 + codex P1 #1
+on f5d2ad7a).** Pick a split key from inside the DynamoDB
+**table-route** key space. The exact encoding matters: the
+route key is built by `dynamoRouteTableKey(encodeDynamoSegment(tableName))`
+(`adapter/dynamodb.go:8337-8365`, `kv/shard_key.go:117-124`).
+`encodeDynamoSegment` is base64 `RawURLEncoding` — for table
+`jepsen_append_t1` the real route key is
+`!ddb|route|table|am...` (base64 of the literal table name),
+**not** `!ddb|route|table|jepsen_append_t1`. A prior revision
+of this doc used the raw table name, which would sort after
+every base64-encoded segment and leave all workload tables on
+one side of the split.
+
+Concretely, the M5 split key construction is:
+
+```clojure
+(str "!ddb|route|table|"
+     (encode-dynamo-segment "jepsen_append_t2"))  ; base64 RawURLEncoding
+```
+
+The Clojure side gets a small helper (`composed1-nemesis/encode-dynamo-segment`)
+that mirrors the Go `encodeDynamoSegment` exactly. The
+Composed-1 doc and the encoding helper land together so any
+future change to the encoding surface is caught by an
+unambiguous reference point.
+
+A prior revision of this doc also wrongly explained the old
+`/split/<int>` proposal: it said `/` < workload keys, but the
+DynamoDB table-route prefix starts with `!` (ASCII 33) and
+`/` is ASCII 47, so `/split/<int>` is lexicographically
+**larger** than `!ddb|route|table|*` — workload keys would land
+on the **left** shard, not the rightmost (claude[bot] P2 on
+f5d2ad7a). The fix (anchoring to the table-route prefix)
+is correct in either direction; the explanation above is now
+also correct.
 
 **Route ID resolution (gemini medium R2).** The nemesis CANNOT
 rely on a single `ListRoutes` call + a local counter — every
@@ -186,59 +204,101 @@ drift (another split landing concurrently between
 `ErrCatalogVersionMismatch` from the server; the nemesis logs
 and refreshes on the next tick.
 
-### 3.3 Multi-shard workload guarantee (revised post-codex P1)
-
-**Original §3.3 (Option A: single-key split in workload keyspace)
-was wrong.** `kv/shard_key.go:94-124` normalises every DynamoDB
-table-metadata, item, and GSI key to a single per-table route
-key (`!ddb|route|table|<tableSegment>`). So every
-`jepsen_append` item resolves to the SAME catalog point
-regardless of its partition-key value, and a `SplitRange`
-inside the item keyspace cannot put two items on different
-shards. The 2PC path (`dispatchMultiShardTxn`, secondary
-commits, the new `ErrTxnSecondaryRouteShiftedAfterPrimaryCommit`
-sentinel) would never fire — invalidating G2 (codex P1 on
-PR #905).
-
-**Revised strategy: multi-table workload.** The M5 workload
-creates `N` tables (default `N = 4`): `jepsen_append_t1` …
-`jepsen_append_t4`. Each `TransactWriteItems` operation writes
-to **at least two** distinct tables. The router maps each
-table to its own table-route key, so a cross-table txn
-naturally fans out across whichever shards own those route
-keys. The setup hook splits the table-route keyspace at
-`!ddb|route|table|jepsen_append_t2` so tables 1 lives on one
-shard and tables 2–4 on another from t=0.
+### 3.3 Multi-shard workload guarantee (revised post-codex P1 #2)
+
+**Original §3.3 (single-key item split) and revised §3.3
+(multi-table with setup-hook SplitRange) were both incomplete.**
+
+Two facts about today's routing surface make the "split at
+setup" approach insufficient on its own:
+
+1. `kv/shard_key.go:94-124` normalises every DynamoDB
+   table-metadata, item, and GSI key to a single per-table
+   route key (`!ddb|route|table|<tableSegment>`).  So
+   single-table workloads have one route key and cannot
+   fan out across shards.
+2. `adapter/distribution_server.go`'s `SplitRange` only
+   creates child routes with the **parent's GroupID**.
+   `dispatchTxn` enters `dispatchMultiShardTxn` only when
+   mutations group to ≥2 distinct Raft **groups**, not just
+   ≥2 routes (codex P1 #2 on f5d2ad7a).  A `SplitRange`
+   inside the table-route keyspace produces two routes still
+   in the same group — single-shard transaction path,
+   2PC never fires.
+
+**Revised strategy: multi-table workload + multi-group cluster
+startup.** Both halves are required.
+
+| Half | Mechanism |
+|---|---|
+| Multi-table workload | A NEW workload variant `dynamodb-append-multi-table-workload` creates N=4 tables (`jepsen_append_t1` … `jepsen_append_t4`) and writes to ≥2 distinct tables per `TransactWriteItems`.  The router maps each table to its own route key so cross-table txns engage multi-route routing. |
+| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to pass `--shardRanges` so the cluster starts with at least 2 Raft groups, and the table-route keys for `t1…t4` are statically split between groups: e.g. `tables 1-2 → group 1, tables 3-4 → group 2`.  The DDB leader address per-group is wired via `--raftDynamoMap` (already supported by the runner — only the `--shardRanges` argument is new). |
+
+`encodeDynamoSegment("jepsen_append_t1")` etc. are computed at
+script setup time and inlined into the `--shardRanges`
+boundary keys.  An `m5_setup.sh` helper (or a Go binary, see
+OQ-7) emits the boundary keys deterministically.
+
+The Jepsen `db/setup!` hook does NOT issue any `SplitRange`.
+Instead it calls `ListRoutes` once (gated to the first node
+per `(when (= node (first (:nodes test))) …)`) and **verifies**
+that the expected multi-group routing is in place; if not, it
+fails fast with a clear error so the operator knows the
+launch script needs to be re-run with the right
+`--shardRanges`.  This avoids the gemini-medium R3 setup-hook
+concurrency hazard entirely (no mutation, just a read).  It
+also resolves the claude[bot] P3 follow-up about needing
+ListRoutes for route-ID resolution in setup — the hook now
+needs ListRoutes for **verification** rather than mutation.
 
 | Concern | Resolution |
 |---|---|
 | Workload shape change | Append ops still write a single value per row; the change is the table they write to (one per row, ≥2 rows per txn — picked from a per-txn random subset of `t1…tN`). |
 | Elle compatibility | The append checker keys on `(table, partition-key)` pairs already (the workload's history shape supports this); cross-table txns appear as multi-key ops, which Elle handles natively. |
-| Comparison with historical runs | Historical runs used a single table — the M5 workload is a NEW workload variant `dynamodb-append-multi-table-workload` rather than a modification of `dynamodb-append-workload`. Both ship; the existing one stays for trend comparison. |
-
-The setup hook (Jepsen `db/setup!`) is gated to run only on
-the FIRST node (`(when (= node (first (:nodes test))) …)`) so
-the initial split is not attempted concurrently by every
-cluster node and does not cause catalog-version conflicts
-during bootstrap (gemini medium R3).
+| Comparison with historical runs | The pre-existing `dynamodb-append-workload` (single table, single group) stays as-is for trend comparison.  The M5 workload is a new variant alongside it. |
+
+**Why the nemesis is still useful even though SplitRange is
+same-group only.**  The route-shuffle nemesis (§3.2) issues
+`SplitRange` calls that churn the catalog version + route IDs
+without moving ownership across groups.  This exercises the
+M3 ObservedRouteVersion drift detection and the M4 retry
+path under concurrent route-mutating control-plane traffic,
+which is the closest non-regression check the current routing
+surface allows.  When a future cross-group `MoveRange` or
+cross-group `SplitRange` lands, swapping that RPC into the
+nemesis turns the workload from a "no-regression under
+same-group churn" check into a "no G1c under cross-group
+movement" check — matching the parent doc's forward-looking
+posture.
 
 ### 3.4 Success criterion
 
 The workload's existing Elle checker emits a `:valid?` boolean
-and a list of detected cycles (`:G0`, `:G1a`, `:G1b`, `:G1c`,
-`:G-single`, etc.). M5's pass condition:
+and an `:anomalies` map keyed by anomaly type — `{:G0 […], :G1a
+[…], :G1c […], :G-single […], …}`.  M5's pass condition:
 
 ```
 (and (:valid? results)
-     (zero? (count (filter #(= (:type %) :G1c) (:anomalies results))))
-     (zero? (count (filter #(= (:type %) :G-single) (:anomalies results)))))
+     (nil? (get (:anomalies results) :G1c))
+     (nil? (get (:anomalies results) :G-single)))
 ```
 
+A prior revision of this doc used `(filter #(= (:type %) :G1c)
+…)` over `(:anomalies results)`, which is wrong: iterating a
+map yields `[key val]` vectors with no `:type` field, so the
+filter always returned an empty seq and the check would
+silently pass on any G1c run (claude[bot] P2 on f5d2ad7a).
+The corrected form above keys off the map directly.
+
 `G1c` is the parent doc's explicit safety violation; `G-single`
 is the closely-related single-item anomaly we already chase in
 the existing workload. Other anomaly types (G0, G1a, G1b)
 indicate orthogonal bugs and should also fail the run, but the
 parent doc's M5 row names G1c as the headline criterion.
+`(:valid? results)` is the canonical Elle pass/fail bit; the
+explicit G1c / G-single checks are belt-and-suspenders so a
+future Elle refactor that subdivides the cycle taxonomy still
+fails on the specific anomalies M5 cares about.
 
 ### 3.5 Cadence
 
@@ -263,7 +323,7 @@ mergeable on its own.
 
 | Phase | Title | Scope | Done when |
 |---|---|---|---|
-| M5a | CLI + multi-table workload | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; setup hook (gated to first node) issues the initial split between table-route keys. | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables split across 2 shards; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. |
+| M5a | CLI + multi-table workload + multi-group launch | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; **`scripts/run-jepsen-local.sh` extended to pass `--shardRanges` so the cluster launches with ≥2 Raft groups** and table-route keys for `t1…tN` are statically split across groups; setup hook calls `ListRoutes` from the first node and verifies the multi-group routing is in place (read-only, no mutation). | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables 1-2 owned by group 1 and tables 3-4 owned by group 2; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. |
 | M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. |
 
 M5a is a small, focused PR (Go CLI + Clojure setup hook +
@@ -304,11 +364,21 @@ analysis.
   average. Defer to implementation; revisit if the workload
   becomes I/O-bound on table-meta lookups.
 - **OQ-6** (new, gemini medium R3 follow-up). The first-node
-  gate for setup splits assumes Jepsen's `(first (:nodes test))`
-  is stable across nodes; verify this matches actual Jepsen
-  semantics (it should — `:nodes` is the test config, not a
-  per-node view). Out of scope to design more carefully; will
-  test at M5a implementation.
+  gate for the setup verification assumes Jepsen's
+  `(first (:nodes test))` is stable across nodes; verify this
+  matches actual Jepsen semantics (it should — `:nodes` is the
+  test config, not a per-node view). Out of scope to design
+  more carefully; will test at M5a implementation.
+- **OQ-7** (new, codex P1 #1 follow-up). The `--shardRanges`
+  boundary keys for the multi-group launch (§3.3) need to be
+  emitted as bytes that exactly match
+  `dynamoRouteTableKey(encodeDynamoSegment(tableName))`. Two
+  options: (a) a tiny Go helper (`cmd/elastickv-route-key`)
+  that prints the encoded key for a given table name, called
+  by `scripts/run-jepsen-local.sh` to build `--shardRanges`;
+  (b) inline the base64 encoding in shell. Tentative answer:
+  (a), because the encoding lives in Go and any drift would
+  silently mis-route. Decide at implementation.
 
 ---
 

From 673bee8cc1e0e88d8b8ff02d9d49cebdc97f4560 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Tue, 2 Jun 2026 17:39:26 +0900
Subject: [PATCH 04/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20markdown?=
 =?UTF-8?q?lint=20cleanup=20(MD027=20+=20MD040)=20on=20PR=20#905?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

coderabbitai minor on 24812867:

* MD027 (multiple-space-after-blockquote) at lines 17,21 in the
  upfront posture block.  Trimmed double-spaces after `>` to a
  single space.

* MD040 (fenced-code-language) at the three plain ``` blocks:
  - the elastickv-split CLI usage (```bash)
  - the route-shuffle-nemesis Clojure sketch (```clojure)
  - the M5 success-criterion Clojure snippet (```clojure)

Doc-only, no semantic changes.
---
 ...oposed_composed1_m5_jepsen_route_shuffle.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index d6770a8b..457d3f37 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -14,12 +14,12 @@ Parent design:
 > scaffolding — workload shape, nemesis, success criterion — so
 > that:
 >
->   1. The current `SplitRange` is exercised under realistic
->      concurrent multi-shard write load and proved non-regressing
->      (the workload finds **no** G1c, which is the baseline M4
->      contract).
->   2. When a future PR introduces a route-mutating RPC that DOES
->      shift ownership across groups (cross-group `SplitRange`,
+> 1. The current `SplitRange` is exercised under realistic
+>    concurrent multi-shard write load and proved non-regressing
+>    (the workload finds **no** G1c, which is the baseline M4
+>    contract).
+> 2. When a future PR introduces a route-mutating RPC that DOES
+>    shift ownership across groups (cross-group `SplitRange`,
 >      `MoveRange`, online rebalancer), the M5 workload — with a
 >      one-line nemesis change to call the new RPC — becomes the
 >      integration-level proof that M3+M4 hold under cross-group
@@ -101,7 +101,7 @@ sentinel that would catch a regression. M5 closes that gap.
 
 A standalone Go binary, ~80 lines:
 
-```
+```bash
 elastickv-split \
   --address 127.0.0.1:50051 \
   --route-id 100 \
@@ -130,7 +130,7 @@ clearer.
 A new Clojure file with a single `route-shuffle-nemesis`
 function returning a `jepsen.nemesis/Nemesis` instance:
 
-```
+```clojure
 (defn route-shuffle-nemesis
   "Periodically invokes elastickv-split against the cluster.
    :start  -> shuffle one route (pick a non-edge split key)
@@ -277,7 +277,7 @@ The workload's existing Elle checker emits a `:valid?` boolean
 and an `:anomalies` map keyed by anomaly type — `{:G0 […], :G1a
 […], :G1c […], :G-single […], …}`.  M5's pass condition:
 
-```
+```clojure
 (and (:valid? results)
      (nil? (get (:anomalies results) :G1c))
      (nil? (get (:anomalies results) :G-single)))

From 3ca2a7f7c4d4acbba6f05c3f39ee5b3aef1e84b4 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Tue, 2 Jun 2026 17:44:00 +0900
Subject: [PATCH 05/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20claude[b?=
 =?UTF-8?q?ot]=20P2+P3=20on=2024812867=20(launch=20script=20scope=20+=20Ba?=
 =?UTF-8?q?se64=20nuance)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two findings on 24812867 from claude[bot]'s post-revision review:

* P2 — '"only --shardRanges is new" understates M5a's launch
  script changes' (§3.3 table cell).  Confirmed against
  shard_config.go:61-99: without --raftGroups, every
  --shardRanges entry collapses into the default single group 1
  and groupMutations returns gids = [1] for every key.
  Single-shard path, 2PC never fires.

  Fix: §3.3 table cell rewritten to list FIVE coordinated
  launch-script changes M5a must make:
    1. --raftGroups declaring >=2 groups
    2. --shardRanges with t1-t2 in group 1, t3-t4 in group 2
    3. Cluster topology decision (tentative: 2 single-node
       groups for CI speed; scale to 6 nodes if needed)
    4. Per-group --raftBootstrapMembers
    5. Expanded --raftDynamoMap + matching port allocations

  Avoids the implementer discovering the topology gap mid-PR.

* P3 — encode-dynamo-segment Clojure implementation needs
  Base64 nuance.  Go uses base64.RawURLEncoding (URL-safe charset
  WITHOUT '=' padding).  Java's three Base64 variants differ:
    - Base64/getEncoder (standard '+'/'/', with padding) — wrong
    - Base64/getUrlEncoder (URL-safe, WITH padding) — wrong
    - Base64/getUrlEncoder + .withoutPadding — correct

  Failure mode is silent: wrong encoding produces split keys
  outside every route's range; ListRoutes returns no match;
  nemesis logs and skips every :start; run appears clean while
  the nemesis is a no-op.

  Fix: §3.2 now calls out .withoutPadding explicitly and
  documents the silent-no-op failure mode.
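
For reference, the three variants can be compared on the Go side,
where StdEncoding, URLEncoding and RawURLEncoding correspond to
Java's getEncoder, getUrlEncoder and getUrlEncoder + withoutPadding
respectively.  This is only an illustrative sketch; the `variants`
helper and the two-byte probe input are assumptions made for the
demonstration and are not part of the adapter.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// variants returns the standard, URL-safe padded, and URL-safe
// unpadded base64 encodings of b (hypothetical helper).
func variants(b []byte) (std, url, rawURL string) {
	return base64.StdEncoding.EncodeToString(b),
		base64.URLEncoding.EncodeToString(b),
		base64.RawURLEncoding.EncodeToString(b)
}

func main() {
	// A 16-byte table name leaves one trailing byte, so the padded
	// encoders append "==" and only RawURLEncoding omits it.
	std, url, raw := variants([]byte("jepsen_append_t1"))
	fmt.Println("std:   ", std)
	fmt.Println("url:   ", url)
	fmt.Println("rawURL:", raw)

	// Probe bytes chosen to hit alphabet indexes 62 and 63, where the
	// standard ('+', '/') and URL-safe ('-', '_') alphabets differ.
	pStd, pURL, pRaw := variants([]byte{0xfb, 0xff})
	fmt.Println("probe: ", pStd, pURL, pRaw)
}
```

For the table names used here the alphabets happen to agree, so
padding is the only visible difference between the variants.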
---
 ...posed_composed1_m5_jepsen_route_shuffle.md | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index 457d3f37..0fadddca 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -175,9 +175,20 @@ Concretely, the M5 split key construction is:
 ```
 
 The Clojure side gets a small helper (`composed1-nemesis/encode-dynamo-segment`)
-that mirrors the Go `encodeDynamoSegment` exactly. The
-Composed-1 doc and the encoding helper land together so any
-future change to the encoding surface is caught by an
+that mirrors the Go `encodeDynamoSegment` exactly. The Go
+implementation uses `base64.RawURLEncoding` — URL-safe charset
+(`-`/`_`) **without `=` padding**. The Clojure side MUST use
+`(.withoutPadding (java.util.Base64/getUrlEncoder))`;
+`Base64/getEncoder` (standard alphabet with `+`/`/`) and the
+default `Base64/getUrlEncoder` (URL-safe **with** padding) are
+both wrong (claude[bot] P3 on 24812867).  The failure mode is
+silent and non-obvious: a wrong encoding produces split keys
+that fall outside every route's range, `ListRoutes` returns no
+matching route, the nemesis logs and skips every `:start`,
+and the run appears clean while the nemesis is a no-op.
+
+The Composed-1 doc and the encoding helper land together so
+any future change to the encoding surface is caught by an
 unambiguous reference point.
 
 A prior revision of this doc also wrongly explained the old
@@ -232,7 +243,7 @@ startup.** Both halves are required.
 | Half | Mechanism |
 |---|---|
 | Multi-table workload | A NEW workload variant `dynamodb-append-multi-table-workload` creates N=4 tables (`jepsen_append_t1` … `jepsen_append_t4`) and writes to ≥2 distinct tables per `TransactWriteItems`.  The router maps each table to its own route key so cross-table txns engage multi-route routing. |
-| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to pass `--shardRanges` so the cluster starts with at least 2 Raft groups, and the table-route keys for `t1…t4` are statically split between groups: e.g. `tables 1-2 → group 1, tables 3-4 → group 2`.  The DDB leader address per-group is wired via `--raftDynamoMap` (already supported by the runner — only the `--shardRanges` argument is new). |
+| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to launch a multi-group cluster.  Per `shard_config.go:61-99` (claude[bot] P2 on 24812867), `--shardRanges` alone is not enough: without `--raftGroups`, every shard range's `groupID` collapses into the single default group 1 and `groupMutations` returns `gids = [1]` for every key — single-shard path, 2PC never fires.  M5a must therefore add **five** coordinated launch-script changes:<br><br>1. **`--raftGroups`** declaring ≥2 groups (e.g. `1=127.0.0.1:50051,2=127.0.0.1:50054`).<br>2. **`--shardRanges`** boundary keys placing `t1-t2` in group 1 and `t3-t4` in group 2.<br>3. **Cluster topology decision** — either 6 nodes (3-per-group, production-like consensus) or 2 single-node groups (`--raftBootstrap` per group, faster for CI).  Tentative pick: 2 single-node groups for M5a, scale to 6 nodes if/when the workload outgrows it (low cost — flip a flag).<br>4. **Per-group `--raftBootstrapMembers`** (different members sets for each group's nodes).<br>5. **Expanded `--raftDynamoMap`** covering both groups' leader addresses, plus the matching port allocations (currently `5005{1,2,3}`/`6380{1,2,3}` for the single-group 3-node layout — need `5005{1,4}`/`6380{1,4}` for the 2-single-node layout, or `5005{1..6}`/`6380{1..6}` for 6-node). |
 
 `encodeDynamoSegment("jepsen_append_t1")` etc. are computed at
 script setup time and inlined into the `--shardRanges`

From 87dcfe3f31a8bd953f745869478ed9e8b5ef7f18 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Tue, 2 Jun 2026 17:48:24 +0900
Subject: [PATCH 06/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20update?=
 =?UTF-8?q?=20=C2=A74=20wording=20(claude[bot]=20minor=20on=20673bee8c)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The 'M5a is a small, focused PR' line at the bottom of §4
was accurate for the original M5a scope (Go CLI + setup hook
+ docs) but became understated after the codex P1 #2 revision
expanded M5a to include the multi-table workload variant and
the five-change multi-group launch script.

Updated to reflect actual scope: M5a now ships:
  * cmd/elastickv-split (Go CLI)
  * dynamodb-append-multi-table-workload (Clojure)
  * scripts/run-jepsen-local.sh multi-group launch extension
    (5 coordinated changes per §3.3 table)
  * setup-hook verification (read-only ListRoutes + assert)
  * docs

Likely lands as a single PR for atomicity (workload variant
only makes sense alongside multi-group launch + verification)
but reviewable as four roughly independent diff sections.
---
 ...2_proposed_composed1_m5_jepsen_route_shuffle.md | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index 0fadddca..f013bd39 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -337,9 +337,17 @@ mergeable on its own.
 | M5a | CLI + multi-table workload + multi-group launch | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; **`scripts/run-jepsen-local.sh` extended to pass `--shardRanges` so the cluster launches with ≥2 Raft groups** and table-route keys for `t1…tN` are statically split across groups; setup hook calls `ListRoutes` from the first node and verifies the multi-group routing is in place (read-only, no mutation). | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables 1-2 owned by group 1 and tables 3-4 owned by group 2; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. |
 | M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. |
 
-M5a is a small, focused PR (Go CLI + Clojure setup hook +
-docs). M5b carries the nemesis itself plus the cadence-tuning
-analysis.
+M5a is a single focused PR but no longer "small" after the
+codex P1 #2 expansion: `cmd/elastickv-split` (Go CLI), the
+new `dynamodb-append-multi-table-workload` (Clojure), the
+`scripts/run-jepsen-local.sh` multi-group launch extension
+(five coordinated changes — see §3.3 table), the setup-hook
+verification (read-only `ListRoutes` + assertion), plus docs.
+Likely to land as a single PR for atomicity (the workload
+variant only makes sense alongside the multi-group launch and
+the verification hook), but reviewable as four roughly
+independent diff sections.  M5b carries the nemesis itself
+plus the cadence-tuning analysis.
 
 ---
 

From b752a89448529d54fc025b7bbfbd407cd73b689a Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Tue, 2 Jun 2026 17:54:12 +0900
Subject: [PATCH 07/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20bootstra?=
 =?UTF-8?q?p=20flag=20+=20interior=20split=20keys=20(codex=20P2=20+=20clau?=
 =?UTF-8?q?de[bot]=20P2=20on=203ca2a7f7)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two new findings on 3ca2a7f7:

* codex P2 + claude[bot] P2 (same defect, independently flagged):
  --raftBootstrapMembers is rejected on multi-group startup
  (main.go:735-741 — ErrBootstrapMembersRequireSingleGroup
  guard).  My §3.3 item 4 said 'Per-group --raftBootstrapMembers',
  which would abort startup before Jepsen could run.

  Fix: §3.3 item 4 now correctly specifies --raftBootstrap
  (boolean) for the tentative 2-single-node-groups topology.
  Single-member groups need no peer discovery so --raftBootstrap
  is necessary and sufficient.  The existing --raftBootstrapMembers
  in run-jepsen-local.sh must be REMOVED for multi-group
  processes — it's only valid for single-group multi-node
  formations.

* codex P2 #2 — split key reuse becomes no-op.  My §3.2 example
  showed the boundary key !ddb|route|table|<encode(t2)>, but
  after the setup-time split that key is the Start of the right
  child route.  adapter/distribution_server.go:357-372's
  validateSplitKey rejects splitKey == parent.Start/End with
  'split key at route boundary' — so every subsequent nemesis
  tick at the same key would fail.  In a 5-minute run expecting
  >=10 shuffles, the nemesis would no-op silently and the run
  would appear clean while catalog churn never landed.

  Fix: §3.2 now distinguishes TWO split-key constructions:
    (a) Setup-time boundary key (one-shot, in
        scripts/run-jepsen-local.sh) — IS a route boundary by
        design (the static partition between groups).
    (b) Per-shuffle interior key (every nemesis :start) — MUST
        lie strictly between the current route's Start and End.
        Strategy: append a base64-alphabet byte plus a fresh
        counter to route.start; the result sorts after start
        but before any non-base64 byte (e.g. ASCII '|'=124).

  Implementation details (counter persistence, collision
  recovery) deferred to M5b; the design contract is: each
  shuffle's split key is strictly interior to the route's
  current range.
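
  A minimal Go sketch of that contract, for illustration only.
  It is not the nemesis (which is Clojure) and not the server's
  validateSplitKey.  tableRouteKey, strictlyInterior and
  interiorSplitKey are hypothetical names; using the t3 table
  key as the route End and treating an empty End as unbounded
  are assumptions.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
)

// tableRouteKey builds "!ddb|route|table|<base64 RawURL of table>",
// the key shape given in §3.2.
func tableRouteKey(table string) []byte {
	seg := base64.RawURLEncoding.EncodeToString([]byte(table))
	return []byte("!ddb|route|table|" + seg)
}

// strictlyInterior reports whether start < key < end in byte order.
// ASSUMPTION: an empty end means the route is unbounded on the right.
func strictlyInterior(key, start, end []byte) bool {
	if bytes.Compare(key, start) <= 0 {
		return false
	}
	return len(end) == 0 || bytes.Compare(key, end) < 0
}

// interiorSplitKey appends 'A' plus a zero-padded counter to start,
// then verifies the candidate instead of trusting the construction.
func interiorSplitKey(start, end []byte, counter uint64) ([]byte, error) {
	key := append([]byte{}, start...)
	key = append(key, fmt.Sprintf("A%020d", counter)...)
	if !strictlyInterior(key, start, end) {
		return nil, fmt.Errorf("candidate %q is not inside (%q, %q)", key, start, end)
	}
	return key, nil
}

func main() {
	start := tableRouteKey("jepsen_append_t2")
	end := tableRouteKey("jepsen_append_t3") // assumed End, illustration only

	key, err := interiorSplitKey(start, end, 1)
	if err != nil {
		fmt.Println("retry with another prefix byte:", err)
		return
	}
	fmt.Printf("interior key: %s\n", key)
	// Reusing the boundary itself is what the nemesis must avoid.
	fmt.Println("boundary key interior?", strictlyInterior(start, start, end))
}
```

  The explicit check matters because the suffix trick only
  guarantees key > Start; whether key < End depends on where
  End diverges from Start.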
---
 ...posed_composed1_m5_jepsen_route_shuffle.md | 49 ++++++++++++++++++-
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index f013bd39..2e4b40a2 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -167,13 +167,58 @@ of this doc used the raw table name, which would sort after
 every base64-encoded segment and leave all workload tables on
 one side of the split.
 
-Concretely, the M5 split key construction is:
+There are TWO distinct split keys to construct.  Getting them
+right is the difference between an honest non-regression
+check and a silent no-op:
+
+**(a) Setup-time boundary key (one-shot, in
+`scripts/run-jepsen-local.sh`).** This is the key that
+partitions the table-route keyspace between groups at cluster
+launch via `--shardRanges`.  It IS a route boundary by
+design — group 1 covers `[… , bk)`, group 2 covers
+`[bk, …)`.  Constructed as:
 
 ```clojure
 (str "!ddb|route|table|"
      (encode-dynamo-segment "jepsen_append_t2"))  ; base64 RawURLEncoding
 ```
 
+**(b) Per-shuffle interior key (every nemesis `:start` invocation).**
+After the setup-time boundary lands, the table-route key
+`!ddb|route|table|<encode("jepsen_append_t2")>` is now the
+**`Start`** of the right child route.  `validateSplitKey`
+in `adapter/distribution_server.go:357-372` rejects
+`splitKey == parent.Start` or `splitKey == parent.End` with
+"split key at route boundary" — so reusing the setup-time key
+would fail every nemesis tick after the first (codex P2 #2 on
+3ca2a7f7).  In a 5-minute run expecting ≥10 shuffles, the
+nemesis would no-op silently and the workload would never
+exercise the catalog-version churn M3+M4 must absorb.
+
+The nemesis MUST derive a fresh interior key per `:start`
+that lies strictly between the current route's `Start` and
+`End`.  Strategy:
+
+```clojure
+(defn fresh-interior-split-key
+  "Returns a candidate key strictly inside (route.start, route.end).
+   Appending any suffix to route.start yields a key that sorts
+   after route.start.  Every base64 RawURL character (A-Z, a-z,
+   0-9, -, _) is below ASCII '|' = 124, so the candidate also
+   sorts before any key that extends route.start with '|'.
+   The caller must still verify candidate < route.end."
+  [route]
+  ;; System/nanoTime stands in for a fresh counter; M5b must
+  ;; check the result is < route.end and retry with a different
+  ;; prefix byte on collision.
+  (str (:start route) "A" (System/nanoTime)))
+```
+
+Exact strategy details (counter persistence across nemesis
+restarts, collision recovery, etc.) are M5b implementation
+notes — the design contract is: each shuffle's split key is
+strictly interior to the route's current range.
+
 The Clojure side gets a small helper (`composed1-nemesis/encode-dynamo-segment`)
 that mirrors the Go `encodeDynamoSegment` exactly. The Go
 implementation uses `base64.RawURLEncoding` — URL-safe charset
@@ -243,7 +288,7 @@ startup.** Both halves are required.
 | Half | Mechanism |
 |---|---|
 | Multi-table workload | A NEW workload variant `dynamodb-append-multi-table-workload` creates N=4 tables (`jepsen_append_t1` … `jepsen_append_t4`) and writes to ≥2 distinct tables per `TransactWriteItems`.  The router maps each table to its own route key so cross-table txns engage multi-route routing. |
-| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to launch a multi-group cluster.  Per `shard_config.go:61-99` (claude[bot] P2 on 24812867), `--shardRanges` alone is not enough: without `--raftGroups`, every shard range's `groupID` collapses into the single default group 1 and `groupMutations` returns `gids = [1]` for every key — single-shard path, 2PC never fires.  M5a must therefore add **five** coordinated launch-script changes:<br><br>1. **`--raftGroups`** declaring ≥2 groups (e.g. `1=127.0.0.1:50051,2=127.0.0.1:50054`).<br>2. **`--shardRanges`** boundary keys placing `t1-t2` in group 1 and `t3-t4` in group 2.<br>3. **Cluster topology decision** — either 6 nodes (3-per-group, production-like consensus) or 2 single-node groups (`--raftBootstrap` per group, faster for CI).  Tentative pick: 2 single-node groups for M5a, scale to 6 nodes if/when the workload outgrows it (low cost — flip a flag).<br>4. **Per-group `--raftBootstrapMembers`** (different members sets for each group's nodes).<br>5. **Expanded `--raftDynamoMap`** covering both groups' leader addresses, plus the matching port allocations (currently `5005{1,2,3}`/`6380{1,2,3}` for the single-group 3-node layout — need `5005{1,4}`/`6380{1,4}` for the 2-single-node layout, or `5005{1..6}`/`6380{1..6}` for 6-node). |
+| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to launch a multi-group cluster.  Per `shard_config.go:61-99` (claude[bot] P2 on 24812867), `--shardRanges` alone is not enough: without `--raftGroups`, every shard range's `groupID` collapses into the single default group 1 and `groupMutations` returns `gids = [1]` for every key — single-shard path, 2PC never fires.  M5a must therefore add **five** coordinated launch-script changes:<br><br>1. **`--raftGroups`** declaring ≥2 groups (e.g. `1=127.0.0.1:50051,2=127.0.0.1:50054`).<br>2. **`--shardRanges`** boundary keys placing `t1-t2` in group 1 and `t3-t4` in group 2.<br>3. **Cluster topology decision** — either 6 nodes (3-per-group, production-like consensus) or 2 single-node groups (`--raftBootstrap` per group, faster for CI).  Tentative pick: 2 single-node groups for M5a, scale to 6 nodes if/when the workload outgrows it (low cost — flip a flag).<br>4. **Per-node `--raftBootstrap` (boolean) — NOT `--raftBootstrapMembers`.**  `main.go:735-741` has a hard guard: `resolveBootstrapServers` returns `ErrBootstrapMembersRequireSingleGroup` whenever `--raftBootstrapMembers` is set and `--raftGroups` parses to more than one group (codex P2 + claude[bot] P2 on 3ca2a7f7).  For the tentative 2-single-node-groups topology, each process hosts ONE group with ONE member — single-member groups need no peer discovery, so `--raftBootstrap` is both necessary and sufficient.  The existing `--raftBootstrapMembers` in `run-jepsen-local.sh` must be **removed** for multi-group processes; it is only valid for single-group, multi-node formations.<br>5. **Expanded `--raftDynamoMap`** covering both groups' leader addresses, plus the matching port allocations (currently `5005{1,2,3}`/`6380{1,2,3}` for the single-group 3-node layout — need `5005{1,4}`/`6380{1,4}` for the 2-single-node layout, or `5005{1..6}`/`6380{1..6}` for 6-node). |
 
 `encodeDynamoSegment("jepsen_append_t1")` etc. are computed at
 script setup time and inlined into the `--shardRanges`

From 6f024aaa7378a289dc09990a13de831b629fd72f Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Tue, 2 Jun 2026 18:02:42 +0900
Subject: [PATCH 08/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20note=20C?=
 =?UTF-8?q?lojure=20byte-vs-string=20coercion=20in=20deferred=20M5b=20impl?=
 =?UTF-8?q?ementation=20list=20(claude[bot]=20minor=20on=20b752a894)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

claude[bot] flagged a non-blocking M5b implementation concern
on b752a894: ListRoutes protobuf response fields may be Java
byte arrays rather than strings depending on deserialization.
Calling Clojure (str byte-array) silently yields '[B@…' (Java's
default Object.toString for a byte array: type tag plus identity
hash) instead of the bytes, producing a wrong split key that
would land outside the route's range: a silent mis-routing
footgun.

The existing 'Implementation details deferred to M5b' line in
§3.2 already covers this in spirit, but explicitly listing
'byte-vs-string coercion of ListRoutes response fields' next
to 'counter persistence' and 'collision recovery' makes the
specific footgun visible at design-doc-read time so the M5b
implementer can budget for it rather than discover it through
debugging silent mis-routing.

Doc-only, no semantic changes.
---
 ...06_02_proposed_composed1_m5_jepsen_route_shuffle.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index 2e4b40a2..7bb4e341 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -215,9 +215,13 @@ that lies strictly between the current route's `Start` and
 ```
 
 Exact strategy details (counter persistence across nemesis
-restarts, collision recovery, etc.) are M5b implementation
-notes — the design contract is: each shuffle's split key is
-strictly interior to the route's current range.
+restarts, collision recovery, byte-vs-string coercion of
+`ListRoutes` response fields — `(str (:start route) …)`
+silently yields `"[B@…"` if `:start` is a Java byte array
+rather than a string, a silent mis-routing footgun flagged
+by claude[bot] on b752a894) are M5b implementation notes —
+the design contract is: each shuffle's split key is strictly
+interior to the route's current range.
 
 The Clojure side gets a small helper (`composed1-nemesis/encode-dynamo-segment`)
 that mirrors the Go `encodeDynamoSegment` exactly. The Go

From f92a029e284c16efb69b2822eec023ebc57fa3ad Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Tue, 2 Jun 2026 21:31:37 +0900
Subject: [PATCH 09/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20single-p?=
 =?UTF-8?q?rocess=20two-group=20topology=20(codex=20P1=20on=206f024aaa)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex P1 on 6f024aaa correctly identifies that the proposed
'2 processes, each hosting one group' topology is NOT
supported by the current server:

  * shard_config.go:399-410 (validateShardRanges) requires
    every --shardRanges group ID to appear in that process's
    --raftGroups.  Process A with --raftGroups '1=...' but a
    shardRanges entry pointing to group 2 fails validation.

  * main.go:764 (buildShardGroups) starts a local Raft runtime
    for every group listed in --raftGroups.  Configuring both
    processes with --raftGroups '1=...,2=...' makes both try
    to host both groups and race on the Raft listeners.

Fix: §3.3 items 3-5 rewritten for the SUPPORTED topology —
ONE PROCESS hosting both single-member groups (two Raft
addresses, shared Dynamo endpoint).  Cross-group txns go
through the in-process router and 2PC fires across the two
co-located groups.

Trade-off explicitly documented: because both groups live in
one process, partition/kill nemeses can't separate them.
Only the route-shuffle nemesis (API-level) exercises the
cross-group path meaningfully.  True distributed multi-group
(multi-process) requires server-side support for 'remote-only
groups in --raftGroups' or equivalent — M6+ work, out of M5
scope.

This strictly narrows the topology choice: the unsupported
'2 separate processes' and '6 nodes' options are removed and
the only supported shape is now spelled out.  The
M5a Done-when criterion still holds: cluster starts with
tables 1-2 in group 1 and tables 3-4 in group 2, workload
exercises dispatchMultiShardTxn.
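
The first constraint can be sketched in Go as follows.  This is
an illustration under stated assumptions, not the server's
validateShardRanges: parseRaftGroups and checkShardRangeGroups
are hypothetical names, the 'id=addr,id=addr' flag shape is
taken from the §3.3 example, and shard ranges are reduced to
the list of group IDs they reference.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseRaftGroups parses "1=addr,2=addr" into groupID -> address.
func parseRaftGroups(flag string) (map[uint64]string, error) {
	groups := make(map[uint64]string)
	for _, entry := range strings.Split(flag, ",") {
		id, addr, ok := strings.Cut(entry, "=")
		if !ok {
			return nil, fmt.Errorf("malformed raftGroups entry %q", entry)
		}
		gid, err := strconv.ParseUint(strings.TrimSpace(id), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad group ID in %q: %w", entry, err)
		}
		groups[gid] = strings.TrimSpace(addr)
	}
	return groups, nil
}

// checkShardRangeGroups rejects any shard range whose group ID is
// not declared in this process's raft groups.
func checkShardRangeGroups(groups map[uint64]string, rangeGroupIDs []uint64) error {
	for _, gid := range rangeGroupIDs {
		if _, ok := groups[gid]; !ok {
			return fmt.Errorf("shard range references group %d, not in --raftGroups", gid)
		}
	}
	return nil
}

func main() {
	rangeGroupIDs := []uint64{1, 2} // t1-t2 in group 1, t3-t4 in group 2
	flags := []string{
		"1=127.0.0.1:50051",                   // one group per process: rejected
		"1=127.0.0.1:50051,2=127.0.0.1:50054", // one process, both groups: accepted
	}
	for _, f := range flags {
		groups, err := parseRaftGroups(f)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s -> %v\n", f, checkShardRangeGroups(groups, rangeGroupIDs))
	}
}
```

The sketch covers only the validation half; the second
constraint (every listed group gets a local Raft runtime) is
why listing both groups in two separate processes is not a
workaround.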
---
 .../2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md    | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index 7bb4e341..0eec2488 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -292,7 +292,7 @@ startup.** Both halves are required.
 | Half | Mechanism |
 |---|---|
 | Multi-table workload | A NEW workload variant `dynamodb-append-multi-table-workload` creates N=4 tables (`jepsen_append_t1` … `jepsen_append_t4`) and writes to ≥2 distinct tables per `TransactWriteItems`.  The router maps each table to its own route key so cross-table txns engage multi-route routing. |
-| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to launch a multi-group cluster.  Per `shard_config.go:61-99` (claude[bot] P2 on 24812867), `--shardRanges` alone is not enough: without `--raftGroups`, every shard range's `groupID` collapses into the single default group 1 and `groupMutations` returns `gids = [1]` for every key — single-shard path, 2PC never fires.  M5a must therefore add **five** coordinated launch-script changes:<br><br>1. **`--raftGroups`** declaring ≥2 groups (e.g. `1=127.0.0.1:50051,2=127.0.0.1:50054`).<br>2. **`--shardRanges`** boundary keys placing `t1-t2` in group 1 and `t3-t4` in group 2.<br>3. **Cluster topology decision** — either 6 nodes (3-per-group, production-like consensus) or 2 single-node groups (`--raftBootstrap` per group, faster for CI).  Tentative pick: 2 single-node groups for M5a, scale to 6 nodes if/when the workload outgrows it (low cost — flip a flag).<br>4. **Per-node `--raftBootstrap` (boolean) — NOT `--raftBootstrapMembers`.**  `main.go:735-741` has a hard guard: `resolveBootstrapServers` returns `ErrBootstrapMembersRequireSingleGroup` whenever `--raftBootstrapMembers` is set and `--raftGroups` parses to more than one group (codex P2 + claude[bot] P2 on 3ca2a7f7).  For the tentative 2-single-node-groups topology, each process hosts ONE group with ONE member — single-member groups need no peer discovery, so `--raftBootstrap` is both necessary and sufficient.  The existing `--raftBootstrapMembers` in `run-jepsen-local.sh` must be **removed** for multi-group processes; it is only valid for single-group, multi-node formations.<br>5. **Expanded `--raftDynamoMap`** covering both groups' leader addresses, plus the matching port allocations (currently `5005{1,2,3}`/`6380{1,2,3}` for the single-group 3-node layout — need `5005{1,4}`/`6380{1,4}` for the 2-single-node layout, or `5005{1..6}`/`6380{1..6}` for 6-node). |
+| Multi-group cluster | M5a extends `scripts/run-jepsen-local.sh` to launch a multi-group cluster.  Per `shard_config.go:61-99` (claude[bot] P2 on 24812867), `--shardRanges` alone is not enough: without `--raftGroups`, every shard range's `groupID` collapses into the single default group 1 and `groupMutations` returns `gids = [1]` for every key — single-shard path, 2PC never fires.  M5a must therefore add **five** coordinated launch-script changes:<br><br>1. **`--raftGroups`** declaring ≥2 groups (e.g. `1=127.0.0.1:50051,2=127.0.0.1:50054`).<br>2. **`--shardRanges`** boundary keys placing `t1-t2` in group 1 and `t3-t4` in group 2.<br>3. **Cluster topology decision — constrained by the current server.**  `shard_config.go:399-410` (`validateShardRanges`) requires every `--shardRanges` group ID to appear in that process's `--raftGroups`, and `main.go:764` (`buildShardGroups`) starts a local Raft runtime for every group listed in `--raftGroups` (codex P1 on 6f024aaa).  Consequence: "two processes, each hosting one group" is NOT supported — Process A with `--raftGroups "1=…"` and a `--shardRanges` entry pointing to group 2 fails validation; configuring both processes with `--raftGroups "1=…,2=…"` makes both try to host both groups and race on the Raft listeners.<br>The supported topology for M5a is **one process hosting both single-member groups** — two Raft addresses (`50051` for group 1, `50054` for group 2), shared by a single Dynamo endpoint (e.g. `63801`).  Each group is single-member (just this process) so `--raftBootstrap` works for both.  Cross-group txns go through the in-process router → 2PC fires across the two co-located groups.<br>**Limitation accepted for M5a:** because both groups live in one process, partition/kill nemeses can't separate them — only the route-shuffle nemesis (API-level) exercises the cross-group path meaningfully.  True distributed multi-group (multi-process) requires server-side support for "remote-only groups in `--raftGroups`" or equivalent, which is M6+ work and out of M5's scope.<br>4. **Per-process `--raftBootstrap` (boolean) — NOT `--raftBootstrapMembers`.**  `main.go:735-741` has a hard guard: `resolveBootstrapServers` returns `ErrBootstrapMembersRequireSingleGroup` whenever `--raftBootstrapMembers` is set and `--raftGroups` parses to more than one group (codex P2 + claude[bot] P2 on 3ca2a7f7).  For the single-process-two-groups topology, the process needs no peer discovery for either of its two single-member groups, so `--raftBootstrap` is necessary and sufficient.  The existing `--raftBootstrapMembers` in `run-jepsen-local.sh` must be **removed** for the M5a launch; it is only valid for single-group, multi-node formations.<br>5. **Single `--raftDynamoMap`** entry pointing the process's Dynamo address at itself (no cross-process leader fan-out is needed since there's only one process); port allocation simplifies — drop the `5005{1,2,3}`/`6380{1,2,3}` 3-node layout to a 1-process layout with two Raft ports (`50051`, `50054`) and one Dynamo port (`63801`). |
 
 `encodeDynamoSegment("jepsen_append_t1")` etc. are computed at
 script setup time and inlined into the `--shardRanges`

From 58d6ac5edc8fe7adef9a62fe0bc5c22272b2c0ff Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)" <contact@bootjp.me>
Date: Tue, 2 Jun 2026 21:43:40 +0900
Subject: [PATCH 10/10] =?UTF-8?q?docs(design):=20M5=20=E2=80=94=20=C2=A74?=
 =?UTF-8?q?=20row=20must=20require=20both=20--raftGroups=20AND=20--shardRa?=
 =?UTF-8?q?nges=20(coderabbit=20Major=20on=20f92a029e)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

coderabbit Major on f92a029e correctly identifies that §4's
M5a milestone row was inconsistent with §3.3's five-change
contract: the row said the launch extension passes
'--shardRanges so the cluster launches with >=2 Raft groups'
but §3.3 spells out that --shardRanges alone collapses every
range to the single default group 1 (shard_config.go:61-99 —
the claude[bot] P2 finding on 24812867 that drove the §3.3
rewrite in the first place).  Both --raftGroups AND
--shardRanges are required for the multi-group contract.

Fix: §4's M5a Scope cell now explicitly says 'pass BOTH
--raftGroups (declaring >=2 groups) AND --shardRanges (placing
tables across those groups)', and the cell cross-references
§3.3 for the 'shardRanges alone collapses to group 1'
explanation.  Avoids the implementer following §4 in
isolation and missing the --raftGroups requirement.

Doc-only, no semantic changes to the design itself — just
internal consistency between §4 row and §3.3 table.
---
 .../2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md    | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
index 0eec2488..01ea9e40 100644
--- a/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+++ b/docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
@@ -383,7 +383,7 @@ mergeable on its own.
 
 | Phase | Title | Scope | Done when |
 |---|---|---|---|
-| M5a | CLI + multi-table workload + multi-group launch | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; **`scripts/run-jepsen-local.sh` extended to pass `--shardRanges` so the cluster launches with ≥2 Raft groups** and table-route keys for `t1…tN` are statically split across groups; setup hook calls `ListRoutes` from the first node and verifies the multi-group routing is in place (read-only, no mutation). | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables 1-2 owned by group 1 and tables 3-4 owned by group 2; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. |
+| M5a | CLI + multi-table workload + multi-group launch | `cmd/elastickv-split` binary; new `dynamodb-append-multi-table-workload` that creates N tables and writes to ≥2 tables per `TransactWriteItems`; **`scripts/run-jepsen-local.sh` extended to pass BOTH `--raftGroups` (declaring ≥2 groups) AND `--shardRanges` (placing tables across those groups)** — per §3.3, `--shardRanges` alone collapses every range to the single default group 1 and 2PC never fires; both flags are required for the multi-group contract.  Table-route keys for `t1…tN` are statically split across the declared groups; setup hook calls `ListRoutes` from the first node and verifies the multi-group routing is in place (read-only, no mutation). | `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table` runs from t=0 with tables 1-2 owned by group 1 and tables 3-4 owned by group 2; the workload exercises `dispatchMultiShardTxn` (verifiable via server-side log markers or a probe metric); Elle finds zero G1c. |
 | M5b | Route-shuffle nemesis | `jepsen/src/elastickv/composed1_nemesis.clj`; compose into the multi-table workload's nemesis package; CLI flag `--composed1-route-shuffle` (default off, on under `run-jepsen-local.sh`). Nemesis re-queries `ListRoutes` before every split and picks split keys from inside the table-route keyspace. | A `./scripts/run-jepsen-local.sh --workload dynamodb-append-multi-table --composed1-route-shuffle` run produces zero G1c after ≥10 shuffles during a 5-minute run. |
 
 M5a is a single focused PR but no longer "small" after the