From 7095953658f1b02e20323a1df0e3ad0ab005e775 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Thu, 4 Jun 2026 16:08:11 +0900
Subject: [PATCH 1/4] =?UTF-8?q?feat(scripts):=20M5a=20=E2=80=94=20run-jeps?=
 =?UTF-8?q?en-m5-local.sh=20single-process=20two-group=20launcher?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Third slice of Composed-1 M5a per PR #905 design doc §3.3. New launch
script alongside the existing scripts/run-jepsen-local.sh (unchanged
so the historical 3-node single-group workloads stay on it).

Topology:
  * ONE process hosts BOTH single-member Raft groups (the only shape
    today's server supports — codex P1 on PR #905 6f024aaa confirmed
    against shard_config.go:399-410 / main.go:764).
  * Two Raft listeners: 50051 (group 1), 50054 (group 2).
  * One shared Dynamo endpoint: 63801. --raftDynamoMap sends both
    Raft addresses to the same Dynamo (no cross-process leader
    fan-out).

Flag selection (each one tied to a finding from PR #905):
  * --raftBootstrap (boolean) NOT --raftBootstrapMembers. The latter
    is rejected by resolveBootstrapServers on any multi-group process
    (main.go:735-741) — codex P2 + claude[bot] P2 on PR #905 3ca2a7f7.
  * --raftGroups AND --shardRanges (both required). --shardRanges
    alone collapses every range into the default group 1 (coderabbit
    Major on PR #905 f92a029e).
  * --shardRanges boundary keys computed via cmd/elastickv-route-key
    rather than inlined base64 in shell, so the encoding stays in
    lock-step with adapter/dynamodb.go's encodeDynamoSegment (codex
    P1 #1 on PR #905 ffb9c73f).

Boundary computation:
  T1_KEY = !ddb|route|table|
  T3_KEY = !ddb|route|table|
  Group 1: [T1_KEY, T3_KEY) → tables 1, 2
  Group 2: [T3_KEY, +inf)   → tables 3, 4

Trade-off explicitly accepted: because both groups live in one
process, partition / kill nemeses can't separate them — only the
route-shuffle nemesis (slated for M5b) exercises the cross-group path
meaningfully under this topology.
True distributed multi-group requires server-side support for
remote-only groups in --raftGroups (M6+ work, out of M5 scope).

CI portability: lein binary resolves via $LEIN env, then PATH, then
macOS Homebrew default /opt/homebrew/bin/lein. Fails with exit 127 if
none found.

Smoke verification:
  * bash -n scripts/run-jepsen-m5-local.sh -> OK
  * elastickv-route-key smoke against jepsen_append_t1 returns
    !ddb|route|table|amVwc2VuX2FwcGVuZF90MQ (matches the test pin in
    cmd/elastickv-route-key/main_test.go).

End-to-end Jepsen verification deferred until the setup-hook
verification slice (Clojure ListRoutes gRPC client) lands; that slice
is the next commit on this branch and will assert the multi-group
routing is actually in place before any workload op runs.
---
 scripts/run-jepsen-m5-local.sh | 174 +++++++++++++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100755 scripts/run-jepsen-m5-local.sh

diff --git a/scripts/run-jepsen-m5-local.sh b/scripts/run-jepsen-m5-local.sh
new file mode 100755
index 00000000..ff3032cf
--- /dev/null
+++ b/scripts/run-jepsen-m5-local.sh
@@ -0,0 +1,174 @@
+#!/usr/bin/env bash
+# Run the Composed-1 M5 multi-table DynamoDB Jepsen workload against a
+# single-process two-group elastickv cluster.
+#
+# Why this script exists separately from run-jepsen-local.sh: the M5
+# workload requires a multi-Raft-group cluster topology that the
+# existing 3-node single-group layout cannot provide. Per the design
+# doc (docs/design/2026_06_02_proposed_composed1_m5_jepsen_route_shuffle.md
+# §3.3), today's `validateShardRanges` / `buildShardGroups` only
+# support a "single process hosts all groups" model — separate
+# processes per group fail validation or race on Raft listeners.
+# So this script launches ONE process hosting BOTH single-member
+# groups, with two Raft listeners (50051, 50054) and one shared
+# DynamoDB endpoint (63801).
+#
+# Trade-off accepted: partition / kill nemeses can't isolate one
+# group from the other since they share a process. Only the
+# (future) route-shuffle nemesis exercises the cross-group path
+# meaningfully under this topology. True distributed multi-group is
+# M6+ work — see the parent design doc.
+#
+# Usage:
+#   ./scripts/run-jepsen-m5-local.sh               # build + start + test
+#   ./scripts/run-jepsen-m5-local.sh --no-rebuild  # skip go build
+#   ./scripts/run-jepsen-m5-local.sh --no-cluster  # reuse running cluster
+set -euo pipefail
+
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+BINARY=/tmp/elastickv4-m5-binary
+ROUTE_KEY_BIN=/tmp/elastickv4-m5-route-key
+DATA_DIR=/tmp/elastickv-m5-test-run
+PID_FILE=/tmp/elastickv-m5-test-run.pid
+
+# ---- topology: one process, two single-member Raft groups ----
+RAFT_ADDR_G1="127.0.0.1:50051"
+RAFT_ADDR_G2="127.0.0.1:50054"
+DYNAMO_ADDR="127.0.0.1:63801"
+PROC_ADDR="$RAFT_ADDR_G1" # the process's primary gRPC address
+RAFT_GROUPS="1=$RAFT_ADDR_G1,2=$RAFT_ADDR_G2"
+RAFT_DYNAMO_MAP="$RAFT_ADDR_G1=$DYNAMO_ADDR,$RAFT_ADDR_G2=$DYNAMO_ADDR"
+
+NO_REBUILD=false
+NO_CLUSTER=false
+for arg in "$@"; do
+  case "$arg" in
+    --no-rebuild) NO_REBUILD=true ;;
+    --no-cluster) NO_CLUSTER=true ;;
+  esac
+done
+
+# ---- build (server + route-key helper) ----
+if ! $NO_REBUILD; then
+  echo "[build] compiling elastickv server..."
+  cd "$REPO_ROOT"
+  go build -o "$BINARY" .
+  echo "[build] compiling elastickv-route-key helper..."
+  go build -o "$ROUTE_KEY_BIN" ./cmd/elastickv-route-key
+  echo "[build] done -> $BINARY, $ROUTE_KEY_BIN"
+fi
+
+# ---- compute --shardRanges boundary keys ----
+# Multi-table workload uses tables jepsen_append_t{1..4}. Tables 1-2
+# go to group 1, tables 3-4 to group 2. Boundary keys are the
+# byte-for-byte route-key encoding of the table names — computed via
+# the elastickv-route-key Go helper rather than inlined in shell so
+# the base64 encoding stays in sync with adapter/dynamodb.go's
+# encodeDynamoSegment (codex P1 #1 on PR #905 ffb9c73f).
+T1_KEY="$("$ROUTE_KEY_BIN" jepsen_append_t1)"
+T3_KEY="$("$ROUTE_KEY_BIN" jepsen_append_t3)"
+# Group 1: [T1_KEY, T3_KEY) — tables 1, 2
+# Group 2: [T3_KEY, +inf) — tables 3, 4
+# Keys outside [T1_KEY, +inf) fall through to the default group; this
+# workload only writes table-route keys so that range is unused.
+SHARD_RANGES="${T1_KEY}:${T3_KEY}=1,${T3_KEY}:=2"
+echo "[shard-ranges] $SHARD_RANGES"
+
+# ---- stop any previously managed cluster ----
+stop_cluster() {
+  if [ -f "$PID_FILE" ]; then
+    echo "[cluster] stopping previous cluster..."
+    while IFS= read -r pid; do
+      kill "$pid" 2>/dev/null || true
+    done < "$PID_FILE"
+    rm -f "$PID_FILE"
+  fi
+}
+
+# ---- start cluster: ONE process hosting both groups ----
+if ! $NO_CLUSTER; then
+  stop_cluster
+  rm -rf "$DATA_DIR"
+  mkdir -p "$DATA_DIR"
+  : > "$PID_FILE"
+
+  echo "[cluster] starting single-process two-group cluster..."
+  # Notes on flag selection:
+  #   --raftBootstrap : boolean; each group is single-member so no
+  #                     peer discovery is needed. --raftBootstrapMembers
+  #                     is rejected by resolveBootstrapServers on any
+  #                     multi-group process (main.go:735-741) and so
+  #                     MUST NOT appear here (codex P2 + claude[bot]
+  #                     P2 on PR #905 3ca2a7f7).
+  #   --raftGroups    : declares both groups with distinct Raft
+  #                     listeners.
+  #   --shardRanges   : places t1-t2 in group 1 and t3-t4 in group 2.
+  #                     Both flags are required for the multi-group
+  #                     contract: --shardRanges alone collapses
+  #                     everything to the default group 1
+  #                     (coderabbit Major on PR #905 f92a029e).
+  #   --raftDynamoMap : both Raft addresses point at the same Dynamo
+  #                     endpoint since there's only one process.
+  nohup "$BINARY" \
+    --address "$PROC_ADDR" \
+    --dynamoAddress "$DYNAMO_ADDR" \
+    --redisAddress "" \
+    --s3Address "" \
+    --sqsAddress "" \
+    --metricsAddress "" \
+    --pprofAddress "" \
+    --raftId "n1" \
+    --raftDataDir "${DATA_DIR}/n1" \
+    --raftBootstrap \
+    --raftGroups "$RAFT_GROUPS" \
+    --shardRanges "$SHARD_RANGES" \
+    --raftDynamoMap "$RAFT_DYNAMO_MAP" \
+    > "${DATA_DIR}/n1.log" 2>&1 &
+  echo $! >> "$PID_FILE"
+
+  echo "[cluster] waiting for Dynamo endpoint ($DYNAMO_ADDR)..."
+  for i in $(seq 1 90); do
+    if nc -z 127.0.0.1 63801; then
+      echo "[cluster] up after ${i}s"
+      break
+    fi
+    sleep 1
+    if [ "$i" -eq 90 ]; then
+      echo "[cluster] FAILED to start - dumping log:"
+      tail -n 100 "${DATA_DIR}/n1.log" || true
+      exit 1
+    fi
+  done
+fi
+
+# ---- run M5 Jepsen multi-table workload ----
+cd "$REPO_ROOT/jepsen"
+
+# Resolve lein: prefer LEIN env override, then PATH (works on CI), then
+# the macOS Homebrew default. Failing to find lein is fatal.
+LEIN_BIN="${LEIN:-$(command -v lein || echo /opt/homebrew/bin/lein)}"
+if [ ! -x "$LEIN_BIN" ]; then
+  echo "[jepsen] lein not found (tried \$LEIN, PATH, /opt/homebrew/bin/lein)" >&2
+  exit 127
+fi
+
+echo "[jepsen] running DynamoDB multi-table list-append workload via $LEIN_BIN..."
+mkdir -p tmp-home .lein
+HOME="$(pwd)/tmp-home" LEIN_HOME="$(pwd)/.lein" \
+  LEIN_JVM_OPTS="-Duser.home=$(pwd)/tmp-home" \
+  "$LEIN_BIN" run -m elastickv.dynamodb-multi-table-workload \
+  --local \
+  --time-limit 30 \
+  --rate 5 \
+  --dynamo-port 63801 \
+  || EXIT_CODE=$?
+
+EXIT_CODE=${EXIT_CODE:-0}
+
+# ---- teardown ----
+if ! $NO_CLUSTER; then
+  echo "[cluster] stopping..."
+  stop_cluster
+fi
+
+exit "$EXIT_CODE"

From f6181449774b8488988c1da085650b3323dc937f Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Thu, 4 Jun 2026 19:50:06 +0900
Subject: [PATCH 2/4] =?UTF-8?q?feat(scripts):=20M5a=20=E2=80=94=20build=20?=
 =?UTF-8?q?elastickv-list-routes=20+=20wire=20--list-routes-bin=20/=20--gr?=
 =?UTF-8?q?pc-host-port?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Follow-up commit on PR #924's launch script. Brings the setup-hook
verification online by:

  * Adding 'go build ./cmd/elastickv-list-routes' alongside the
    existing server + route-key builds in the script's [build]
    section. The output binary lives at /tmp/elastickv4-m5-list-routes
    matching the BINARY / ROUTE_KEY_BIN naming convention.
  * Passing --list-routes-bin and --grpc-host-port to lein run so the
    workload's verify-multi-group-routing! (introduced in PR #925)
    shells out to the freshly-built binary against the process's
    primary gRPC address (PROC_ADDR=127.0.0.1:50051).

Without these flags the hook falls back to PATH lookup, which fails in
a tmp-build environment. With them, every local Jepsen run starts with
a clear yes/no on whether the --raftGroups / --shardRanges combo
actually produced a multi-group routing catalog — silencing the entire
class of 'workload ran clean but never exercised dispatchMultiShardTxn'
results that motivated the design doc §3.3 hook in the first place.

Cross-PR ordering note: this commit depends on PR #925's
--list-routes-bin / --grpc-host-port CLI flags being merged. Until
that lands, this commit's launch will fail with 'Unknown option:
--list-routes-bin' — kept on this PR for review ergonomics; flag
commit is tagged 'do not merge before PR #925' in the PR description.

Verification: bash -n scripts/run-jepsen-m5-local.sh -> OK.
---
 scripts/run-jepsen-m5-local.sh | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/scripts/run-jepsen-m5-local.sh b/scripts/run-jepsen-m5-local.sh
index ff3032cf..edf30a16 100755
--- a/scripts/run-jepsen-m5-local.sh
+++ b/scripts/run-jepsen-m5-local.sh
@@ -28,6 +28,7 @@ set -euo pipefail
 REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
 BINARY=/tmp/elastickv4-m5-binary
 ROUTE_KEY_BIN=/tmp/elastickv4-m5-route-key
+LIST_ROUTES_BIN=/tmp/elastickv4-m5-list-routes
 DATA_DIR=/tmp/elastickv-m5-test-run
 PID_FILE=/tmp/elastickv-m5-test-run.pid
 
@@ -48,14 +49,19 @@ for arg in "$@"; do
   esac
 done
 
-# ---- build (server + route-key helper) ----
+# ---- build (server + route-key + list-routes helpers) ----
 if ! $NO_REBUILD; then
   echo "[build] compiling elastickv server..."
   cd "$REPO_ROOT"
   go build -o "$BINARY" .
   echo "[build] compiling elastickv-route-key helper..."
   go build -o "$ROUTE_KEY_BIN" ./cmd/elastickv-route-key
-  echo "[build] done -> $BINARY, $ROUTE_KEY_BIN"
+  echo "[build] compiling elastickv-list-routes helper..."
+  # Used by the Jepsen workload's setup-hook verification
+  # (verify-multi-group-routing!). Confirms the cluster booted with
+  # >=2 distinct Raft groups before any workload op runs.
+  go build -o "$LIST_ROUTES_BIN" ./cmd/elastickv-list-routes
+  echo "[build] done -> $BINARY, $ROUTE_KEY_BIN, $LIST_ROUTES_BIN"
 fi
 
 # ---- compute --shardRanges boundary keys ----
@@ -154,6 +160,10 @@ fi
 
 echo "[jepsen] running DynamoDB multi-table list-append workload via $LEIN_BIN..."
 mkdir -p tmp-home .lein
+# --list-routes-bin / --grpc-host-port wire the setup-hook verification
+# (verify-multi-group-routing!) at the workload's first setup! call.
+# Without them the hook falls back to PATH lookup which fails when
+# run from this script's tmp build.
 HOME="$(pwd)/tmp-home" LEIN_HOME="$(pwd)/.lein" \
   LEIN_JVM_OPTS="-Duser.home=$(pwd)/tmp-home" \
   "$LEIN_BIN" run -m elastickv.dynamodb-multi-table-workload \
@@ -161,6 +171,8 @@ HOME="$(pwd)/tmp-home" LEIN_HOME="$(pwd)/.lein" \
   --time-limit 30 \
   --rate 5 \
   --dynamo-port 63801 \
+  --list-routes-bin "$LIST_ROUTES_BIN" \
+  --grpc-host-port "$PROC_ADDR" \
   || EXIT_CODE=$?
 
 EXIT_CODE=${EXIT_CODE:-0}

From 834b1721005d1d6add5e678f8c836cb7e92a232a Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Thu, 4 Jun 2026 19:54:00 +0900
Subject: [PATCH 3/4] =?UTF-8?q?feat(scripts):=20M5a=20=E2=80=94=204=20gemi?=
 =?UTF-8?q?ni=20medium=20fixes=20on=20PR=20#924=20(cleanup,=20nc,=20helper?=
 =?UTF-8?q?-presence)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

gemini-code-assist medium findings on f6181449:

  * helper-binary presence guard (line 75) — with --no-rebuild,
    $ROUTE_KEY_BIN / $LIST_ROUTES_BIN may not exist; the script would
    fail with a misleading 'No such file or directory' buried in
    set -e output. Added a for-loop guard that checks all three
    helpers ([$ROUTE_KEY_BIN, $LIST_ROUTES_BIN, $BINARY]) up-front and
    exits 1 with a clear remediation message.
  * nc -> /dev/tcp (line 140) — bash's built-in
    (echo > /dev/tcp/127.0.0.1/63801) probe works on minimal CI images
    that don't ship netcat. Same semantics (succeeds iff the port
    accepts a TCP connection); no behavioural change.
  * trap stop_cluster EXIT INT TERM (line 99) — installed BEFORE the
    cluster launch so an exception during launch (bind-port collision,
    missing flag) still tears down the half-started state. Covers
    normal exit, Ctrl-C, and CI cancellation.
  * trailing manual stop_cluster removed (line 184) — the EXIT trap is
    the canonical teardown path; calling stop_cluster manually would
    double-call on success (harmless but noisy).

Verification: bash -n scripts/run-jepsen-m5-local.sh -> OK.
---
 scripts/run-jepsen-m5-local.sh | 35 +++++++++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/scripts/run-jepsen-m5-local.sh b/scripts/run-jepsen-m5-local.sh
index edf30a16..0071aa25 100755
--- a/scripts/run-jepsen-m5-local.sh
+++ b/scripts/run-jepsen-m5-local.sh
@@ -71,6 +71,18 @@ fi
 # the elastickv-route-key Go helper rather than inlined in shell so
 # the base64 encoding stays in sync with adapter/dynamodb.go's
 # encodeDynamoSegment (codex P1 #1 on PR #905 ffb9c73f).
+#
+# When --no-rebuild is set, every helper binary must already exist
+# from a previous run; fail fast with a clear message rather than
+# letting `set -e` swallow a misleading 'No such file or directory'
+# (gemini medium on PR #924).
+for bin in "$ROUTE_KEY_BIN" "$LIST_ROUTES_BIN" "$BINARY"; do
+  if [ ! -x "$bin" ]; then
+    echo "[error] required helper not found at $bin." >&2
+    echo "        Re-run without --no-rebuild to compile the helpers." >&2
+    exit 1
+  fi
+done
 T1_KEY="$("$ROUTE_KEY_BIN" jepsen_append_t1)"
 T3_KEY="$("$ROUTE_KEY_BIN" jepsen_append_t3)"
 # Group 1: [T1_KEY, T3_KEY) — tables 1, 2
@@ -93,6 +105,14 @@ stop_cluster() {
 
 # ---- start cluster: ONE process hosting both groups ----
 if ! $NO_CLUSTER; then
+  # Install the cleanup hook BEFORE starting the cluster so an
+  # exception during launch (e.g. bind-port collision) still
+  # tears down the half-started state. EXIT covers normal flow,
+  # INT/TERM cover user Ctrl-C and CI cancellation. Without
+  # this the failure path leaks background processes that hold
+  # the Raft / Dynamo ports for the next run (gemini medium on
+  # PR #924).
+  trap stop_cluster EXIT INT TERM
   stop_cluster
   rm -rf "$DATA_DIR"
   mkdir -p "$DATA_DIR"
@@ -134,7 +154,10 @@ if ! $NO_CLUSTER; then
 
   echo "[cluster] waiting for Dynamo endpoint ($DYNAMO_ADDR)..."
   for i in $(seq 1 90); do
-    if nc -z 127.0.0.1 63801; then
+    # Use bash's built-in /dev/tcp probe rather than `nc` so the
+    # script runs on minimal CI images that may not ship netcat
+    # (gemini medium on PR #924).
+    if (echo > /dev/tcp/127.0.0.1/63801) >/dev/null 2>&1; then
       echo "[cluster] up after ${i}s"
       break
     fi
@@ -178,9 +201,11 @@ HOME="$(pwd)/tmp-home" LEIN_HOME="$(pwd)/.lein" \
 EXIT_CODE=${EXIT_CODE:-0}
 
 # ---- teardown ----
-if ! $NO_CLUSTER; then
-  echo "[cluster] stopping..."
-  stop_cluster
-fi
+# Cluster shutdown is handled by the `trap stop_cluster EXIT INT TERM`
+# installed above the cluster launch. No explicit teardown call is
+# needed here; doing so would double-call stop_cluster on success
+# (harmless but noisy) and double-call on failure (which is also
+# harmless since stop_cluster is idempotent, but the EXIT trap path
+# is the canonical one — see gemini medium on PR #924).
 
 exit "$EXIT_CODE"

From 0d44fd1f0179df76464b5d53a69b8e30711d9766 Mon Sep 17 00:00:00 2001
From: "Yoshiaki Ueda (bootjp)"
Date: Thu, 4 Jun 2026 20:04:28 +0900
Subject: [PATCH 4/4] =?UTF-8?q?feat(scripts):=20M5a=20=E2=80=94=20claude[b?=
 =?UTF-8?q?ot]=20minor=20+=20suggestion=20on=20PR=20#924?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

claude[bot] review on 834b1721 — all 4 gemini fixes confirmed correct.
Two non-blocking notes addressed inline:

  * Minor — line 75-78 comment overstated scope. The guard runs
    unconditionally, not just under --no-rebuild. Rewrote the comment
    to reflect the actual semantics (catches both the --no-rebuild
    case AND a fresh-build environment where a helper somehow produced
    a non-executable).
  * Suggestion — cross-PR dependency machine-readable guard. Added a
    pre-flight check before 'go build ./cmd/elastickv-list-routes'
    that surfaces a clear error if cmd/elastickv-list-routes/ doesn't
    exist in the current tree (i.e. PR #925 hasn't been merged yet).
    Without this guard, anyone trying to run the script from PR #924
    alone gets an opaque 'package not found' error from go build with
    no remediation hint.

Verification: bash -n scripts/run-jepsen-m5-local.sh -> OK.
---
 scripts/run-jepsen-m5-local.sh | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/scripts/run-jepsen-m5-local.sh b/scripts/run-jepsen-m5-local.sh
index 0071aa25..4afd2653 100755
--- a/scripts/run-jepsen-m5-local.sh
+++ b/scripts/run-jepsen-m5-local.sh
@@ -51,6 +51,16 @@ done
 
 # ---- build (server + route-key + list-routes helpers) ----
 if ! $NO_REBUILD; then
+  # Pre-flight: cmd/elastickv-list-routes lands in PR #925. If this
+  # branch is run before #925 merges, `go build` would emit an
+  # opaque package-not-found error. Surface the cross-PR dependency
+  # in a machine-readable way (claude[bot] suggestion on PR #924).
+  if [ ! -d "$REPO_ROOT/cmd/elastickv-list-routes" ]; then
+    echo "[error] cmd/elastickv-list-routes/ not found in this branch." >&2
+    echo "        PR #924 depends on PR #925 (setup-hook + list-routes CLI)." >&2
+    echo "        Merge #925 first, or check out the integrated branch." >&2
+    exit 1
+  fi
   echo "[build] compiling elastickv server..."
   cd "$REPO_ROOT"
   go build -o "$BINARY" .
@@ -72,10 +82,13 @@ fi
 # the base64 encoding stays in sync with adapter/dynamodb.go's
 # encodeDynamoSegment (codex P1 #1 on PR #905 ffb9c73f).
 #
-# When --no-rebuild is set, every helper binary must already exist
-# from a previous run; fail fast with a clear message rather than
-# letting `set -e` swallow a misleading 'No such file or directory'
-# (gemini medium on PR #924).
+# Guard: every helper binary must exist before continuing. Runs
+# unconditionally — catches both --no-rebuild (helpers expected from
+# a previous run) AND a fresh-build environment where a helper
+# somehow produced a non-executable. Failing fast with a clear
+# remediation message is strictly better than letting `set -e`
+# swallow a misleading 'No such file or directory' deeper in the
+# script (gemini medium + claude[bot] minor on PR #924).
 for bin in "$ROUTE_KEY_BIN" "$LIST_ROUTES_BIN" "$BINARY"; do
   if [ ! -x "$bin" ]; then
     echo "[error] required helper not found at $bin." >&2