CI: Raft consensus testing under failure conditions#87
Conversation
📝 WalkthroughWalkthroughAdded Raft consensus functional testing infrastructure to CI: a new GitHub Actions job that orchestrates a 3-node Raft cluster via Docker Compose, runs topology discovery and failover scenarios, and validates leader election, cluster data serving, and node rejoin operations. Changes
Sequence Diagram(s)sequenceDiagram
participant Test as test-raft.sh
participant R1 as orchestrator-raft1
participant R2 as orchestrator-raft2
participant R3 as orchestrator-raft3
participant MySQL as mysql1/2/3
rect rgba(100, 149, 237, 0.5)
Note over Test,R3: Startup & Leader Election
Test->>R1: Start container (port 3100)
Test->>R1: Poll /api/raft-status (wait ~90s)
Test->>R2: Start container (port 3101)
Test->>R3: Start container (port 3102)
Test->>R1: Query /api/raft-leader
Test->>R2: Query /api/raft-leader
Test->>R3: Query /api/raft-leader
Test->>R1: Verify exactly one /api/raft-state == "Leader"
end
rect rgba(144, 238, 144, 0.5)
Note over Test,MySQL: Topology Discovery Phase
Test->>R1: Identify leader via /api/raft-state
Test->>R1: Trigger /api/discover/mysql{1,2,3}/3306
Test->>MySQL: Query MySQL topology
Test->>R1: Poll /api/clusters (wait ~60s)
Test->>R2: Query /api/clusters (follower check)
Test->>R3: Query /api/clusters (follower check)
end
rect rgba(255, 182, 193, 0.5)
Note over Test,R3: Failover & Re-election
Test->>R1: Stop old leader container
Test->>R2: Poll /api/raft-leader (wait ~60s)
Test->>R3: Poll /api/raft-leader (wait ~60s)
Test->>R2: Verify new leader elected
Test->>R2: Query /api/clusters
Test->>R1: Restart old leader container
Test->>R1: Wait /api/raft-leader matches new leader (~60s)
Test->>R1: Validate /api/raft-state == "Follower"
end
rect rgba(255, 215, 0, 0.5)
Note over Test,R3: Final Consensus Verification
Test->>R1: Query /api/raft-leader
Test->>R2: Query /api/raft-leader
Test->>R3: Query /api/raft-leader
Test->>R1: Query /api/raft-health
Test->>R2: Query /api/raft-health
Test->>R3: Query /api/raft-health
Test->>Test: Verify all nodes agree on leader
end
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces functional tests for Raft consensus by adding three orchestrator nodes to the Docker Compose environment along with their respective configuration files. A new test script, test-raft.sh, is included to verify leader election, topology discovery, failover, and node rejoining. Review feedback recommends improving the Docker setup by using pre-built images instead of installing dependencies at runtime and refactoring the test script to use helper functions for repeated API interactions and JSON parsing to enhance maintainability.
| - ./orchestrator-raft1.conf.json:/orchestrator/orchestrator.conf.json:ro | ||
| command: > | ||
| bash -c " | ||
| apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 && |
There was a problem hiding this comment.
tests/functional/test-raft.sh
Outdated
| LEADER1=$(curl -sf "http://localhost:3100/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "") | ||
| LEADER2=$(curl -sf "http://localhost:3101/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "") | ||
| LEADER3=$(curl -sf "http://localhost:3102/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "") |
There was a problem hiding this comment.
The logic for fetching and parsing the Raft leader from a node is repeated multiple times throughout the script. Consider defining a helper function to reduce duplication and improve maintainability.
Example:
get_raft_leader() {
local port=$1
curl -sf "http://localhost:${port}/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo ""
}
tests/functional/test-raft.sh
Outdated
| # Verify exactly one node reports itself as Leader state | ||
| LEADER_COUNT=0 | ||
| for port in "${RAFT_PORTS[@]}"; do | ||
| STATE=$(curl -sf "http://localhost:${port}/api/raft-state" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "") |
| CLUSTER_FOUND=false | ||
| for i in $(seq 1 60); do | ||
| CLUSTERS=$(curl -sf "http://localhost:${LEADER_PORT}/api/clusters" 2>/dev/null || echo "[]") | ||
| COUNT=$(echo "$CLUSTERS" | python3 -c "import json,sys; c=json.load(sys.stdin); print(len(c))" 2>/dev/null || echo "0") |
cfab30d to
a6c5739
Compare
197e86f to
81793d4
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/functional/test-failover-advanced.sh`:
- Line 4: The cd invocation cd "$(dirname "$0")/../.." can fail and leave the
script running in the wrong directory; update that line to handle errors by
checking its exit status and exiting on failure (e.g., change it to perform the
cd and then do || { echo "Failed to change directory"; exit 1; } or otherwise
abort on non-zero status) so the script does not continue when the directory
change fails.
In `@tests/functional/test-raft.sh`:
- Line 4: The script currently runs cd "$(dirname "$0")/../.." without checking
its success; update the invocation so the script aborts on failure (e.g.,
replace with a guarded form like if ! cd "$(dirname "$0")/../.."; then echo
"Failed to change directory to $(dirname "$0")/../.."; exit 1; fi or append || {
echo "Failed to cd"; exit 1; } ), ensuring the working directory is validated
before proceeding.
- Line 149: The curl invocation that sets FCLUSTERS uses a malformed option
`--max-time 10L`; update the command in the FCLUSTERS assignment so the timeout
and follow-redirect flags are separate (e.g., `--max-time 10 -L` or just
`--max-time 10` if you don't need redirects) to remove the stray `L` and ensure
redirects are followed when desired; modify the line that defines FCLUSTERS
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: bdbc12a7-24a2-4916-beb5-6b7bdcb53ca7
📒 Files selected for processing (8)
.github/workflows/functional.ymltests/functional/docker-compose.ymltests/functional/lib.shtests/functional/orchestrator-raft1.conf.jsontests/functional/orchestrator-raft2.conf.jsontests/functional/orchestrator-raft3.conf.jsontests/functional/test-failover-advanced.shtests/functional/test-raft.sh
| #!/bin/bash | ||
| # Tier B+: Advanced failover tests — intermediate master, errant GTID, co-master | ||
| set -uo pipefail # no -e: we handle failures ourselves | ||
| cd "$(dirname "$0")/../.." |
There was a problem hiding this comment.
Add error handling for cd command.
Same issue as in test-raft.sh - if the cd fails, the script continues from an unexpected location.
🛡️ Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cd "$(dirname "$0")/../.." | |
| cd "$(dirname "$0")/../.." || exit 1 |
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional/test-failover-advanced.sh` at line 4, The cd invocation cd
"$(dirname "$0")/../.." can fail and leave the script running in the wrong
directory; update that line to handle errors by checking its exit status and
exiting on failure (e.g., change it to perform the cd and then do || { echo
"Failed to change directory"; exit 1; } or otherwise abort on non-zero status)
so the script does not continue when the directory change fails.
| #!/bin/bash | ||
| # Raft consensus tests -- verify leader election, failover, and follower redirect | ||
| set -uo pipefail | ||
| cd "$(dirname "$0")/../.." |
There was a problem hiding this comment.
Add error handling for cd command.
If the cd fails (e.g., directory doesn't exist), the script will continue from an unexpected location, which could cause failures or incorrect behavior.
🛡️ Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cd "$(dirname "$0")/../.." | |
| cd "$(dirname "$0")/../.." || exit 1 |
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional/test-raft.sh` at line 4, The script currently runs cd
"$(dirname "$0")/../.." without checking its success; update the invocation so
the script aborts on failure (e.g., replace with a guarded form like if ! cd
"$(dirname "$0")/../.."; then echo "Failed to change directory to $(dirname
"$0")/../.."; exit 1; fi or append || { echo "Failed to cd"; exit 1; } ),
ensuring the working directory is validated before proceeding.
| fi | ||
| FPORT="${RAFT_PORTS[$idx]}" | ||
| # Followers may redirect or serve data directly; either is valid | ||
| FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]") |
There was a problem hiding this comment.
Typo: 10L should be 10 -L or just 10.
The curl option --max-time 10L appears to have an errant L. If the intent is to follow redirects (which makes sense for followers that may redirect to the leader), use -L as a separate flag.
🔧 Proposed fix
- FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
+ FCLUSTERS=$(curl -sfL --max-time 10 "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]") | |
| FCLUSTERS=$(curl -sfL --max-time 10 "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional/test-raft.sh` at line 149, The curl invocation that sets
FCLUSTERS uses a malformed option `--max-time 10L`; update the command in the
FCLUSTERS assignment so the timeout and follow-redirect flags are separate
(e.g., `--max-time 10 -L` or just `--max-time 10` if you don't need redirects)
to remove the stray `L` and ensure redirects are followed when desired; modify
the line that defines FCLUSTERS accordingly.
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tests/functional/test-raft.sh (3)
79-81: Consider quoting the URL to satisfy shellcheck.The URL in the curl command is technically unquoted. While safe here since
portcomes from a hardcoded array, quoting prevents potential issues if the code is modified later.🔧 Proposed fix
- echo " Node :${port} raft-status: $(curl -sf --max-time 10 http://localhost:${port}/api/raft-status 2>/dev/null || echo 'unreachable')" + echo " Node :${port} raft-status: $(curl -sf --max-time 10 "http://localhost:${port}/api/raft-status" 2>/dev/null || echo 'unreachable')"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional/test-raft.sh` around lines 79 - 81, The for-loop iterating RAFT_PORTS uses an unquoted URL in the curl call; update the curl invocation inside the loop so the URL and the port variable are quoted (e.g., "http://localhost:${port}/api/raft-status") to satisfy shellcheck and prevent future word-splitting or globbing issues when using RAFT_PORTS and the curl command.
237-239: Consider quoting the URL for consistency.Same shellcheck suggestion as line 80 - quote the URL to prevent potential word splitting.
🔧 Proposed fix
- echo " Node :${port} status: $(curl -sf --max-time 10 http://localhost:${port}/api/raft-status 2>/dev/null || echo 'unreachable')" + echo " Node :${port} status: $(curl -sf --max-time 10 "http://localhost:${port}/api/raft-status" 2>/dev/null || echo 'unreachable')"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional/test-raft.sh` around lines 237 - 239, Loop uses an unquoted URL in the curl call which can lead to word-splitting; update the curl invocation inside the for loop that iterates REMAINING_PORTS to quote the interpolated URL (use "$port" within the URL string) so the argument passed to curl is a single token (e.g., curl -sf --max-time 10 "http://localhost:${port}/api/raft-status" ...).
62-64: Consider extracting JSON parsing to a helper function.The pattern
curl ... | python3 -c "import json,sys; print(json.load(sys.stdin))"is repeated throughout the script (lines 62-64, 87, 102, 123, 147, 177, 223-224, 253, 261, 284, 300, 314). Extracting this to a helper would reduce duplication and make the script easier to maintain.♻️ Proposed helper function
Add near the top of the script after sourcing lib.sh:
# Helper to fetch JSON endpoint and extract scalar value raft_api() { local port=$1 local endpoint=$2 curl -sf --max-time 10 "http://localhost:${port}/api/${endpoint}" 2>/dev/null \ | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "" }Then usage becomes:
- LEADER1=$(curl -sf --max-time 10 "http://localhost:3100/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "") + LEADER1=$(raft_api 3100 raft-leader)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional/test-raft.sh` around lines 62 - 64, The script repeats the curl+python JSON-extraction pipeline for endpoints (used to set variables like LEADER1/LEADER2/LEADER3 and other checks); extract that into a small helper function (e.g., raft_api) defined near the top after sourcing lib.sh that accepts a port and an endpoint (port, endpoint) and returns the scalar JSON value or empty string on failure, then replace each occurrence of the inline pipeline (curl ... | python3 -c "import json,sys; print(json.load(sys.stdin))" ...) with calls like raft_api 3100 raft-leader (or raft_api "$port" "some-endpoint") for LEADER*, the checks at lines referenced, and any other occurrences to remove duplication and centralize error handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/functional/test-raft.sh`:
- Around line 79-81: The for-loop iterating RAFT_PORTS uses an unquoted URL in
the curl call; update the curl invocation inside the loop so the URL and the
port variable are quoted (e.g., "http://localhost:${port}/api/raft-status") to
satisfy shellcheck and prevent future word-splitting or globbing issues when
using RAFT_PORTS and the curl command.
- Around line 237-239: Loop uses an unquoted URL in the curl call which can lead
to word-splitting; update the curl invocation inside the for loop that iterates
REMAINING_PORTS to quote the interpolated URL (use "$port" within the URL
string) so the argument passed to curl is a single token (e.g., curl -sf
--max-time 10 "http://localhost:${port}/api/raft-status" ...).
- Around line 62-64: The script repeats the curl+python JSON-extraction pipeline
for endpoints (used to set variables like LEADER1/LEADER2/LEADER3 and other
checks); extract that into a small helper function (e.g., raft_api) defined near
the top after sourcing lib.sh that accepts a port and an endpoint (port,
endpoint) and returns the scalar JSON value or empty string on failure, then
replace each occurrence of the inline pipeline (curl ... | python3 -c "import
json,sys; print(json.load(sys.stdin))" ...) with calls like raft_api 3100
raft-leader (or raft_api "$port" "some-endpoint") for LEADER*, the checks at
lines referenced, and any other occurrences to remove duplication and centralize
error handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8d356ac6-1896-44fb-90f6-2dd3b42449bc
📒 Files selected for processing (2)
go/raft/rel_store.gotests/functional/test-raft.sh
✅ Files skipped from review due to trivial changes (1)
- go/raft/rel_store.go
71d1b7f to
f7d24c2
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/functional/docker-compose.yml (1)
208-231: Consider adding a healthcheck for Raft orchestrator services.Unlike the existing
orchestratorservice (line 165-170), the Raft services lack healthchecks. While the test script handles bootstrap ordering manually, adding healthchecks would improve CI reliability and enabledepends_onwithcondition: service_healthybetween nodes or for downstream steps.♻️ Optional: Add healthcheck to each Raft service
ports: - "3100:3099" depends_on: mysql1: condition: service_healthy + healthcheck: + test: ["CMD", "curl", "-sf", "http://localhost:3099/api/raft-status"] + interval: 5s + timeout: 3s + retries: 60 + start_period: 30s networks:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional/docker-compose.yml` around lines 208 - 231, The Raft node services (e.g., orchestrator-raft1) are missing Docker healthchecks; add a healthcheck block to each Raft orchestrator service (orchestrator-raft1 and the other orchestrator-raftN services) similar to the non‑Raft orchestrator service: include a test that hits the internal HTTP port (3099) such as ["CMD", "curl", "-f", "http://localhost:3099/health"] (or equivalent endpoint), and set sensible interval/timeout/retries/start_period values (for example interval: 10s, timeout: 5s, retries: 5, start_period: 5s) so depends_on can use condition: service_healthy and CI ordering becomes reliable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/functional/docker-compose.yml`:
- Around line 208-231: The Raft node services (e.g., orchestrator-raft1) are
missing Docker healthchecks; add a healthcheck block to each Raft orchestrator
service (orchestrator-raft1 and the other orchestrator-raftN services) similar
to the non‑Raft orchestrator service: include a test that hits the internal HTTP
port (3099) such as ["CMD", "curl", "-f", "http://localhost:3099/health"] (or
equivalent endpoint), and set sensible interval/timeout/retries/start_period
values (for example interval: 10s, timeout: 5s, retries: 5, start_period: 5s) so
depends_on can use condition: service_healthy and CI ordering becomes reliable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3f8db2b3-e6dd-4122-8511-304330439e88
📒 Files selected for processing (8)
.github/workflows/functional.ymlgo/raft/rel_store.gotests/functional/docker-compose.ymltests/functional/orchestrator-raft1.conf.jsontests/functional/orchestrator-raft2.conf.jsontests/functional/orchestrator-raft3.conf.jsontests/functional/test-failover-advanced.shtests/functional/test-raft.sh
✅ Files skipped from review due to trivial changes (4)
- go/raft/rel_store.go
- tests/functional/orchestrator-raft3.conf.json
- tests/functional/orchestrator-raft1.conf.json
- tests/functional/orchestrator-raft2.conf.json
- Add 3-node orchestrator Raft cluster to docker-compose.yml
- Add Raft config files for each node (orchestrator-raft{1,2,3}.conf.json)
- Add test-raft.sh with 4 test phases: cluster formation & leader
election, leader serves topology, leader failure & re-election,
node rejoin
- Add functional-raft CI job in GitHub Actions workflow
All curl calls lacked --max-time, causing potential test hangs if any API endpoint was slow to respond.
Starting all 3 Raft nodes simultaneously causes each to call BootstrapCluster independently, creating conflicting initial states and perpetual election cycles. Fix by starting node 1 first, waiting for it to be ready, then starting nodes 2 and 3.
The raft_log.data and raft_store.store_value columns were NOT NULL, but hashicorp/raft can produce log entries with nil Data (e.g. LogNoop when a new leader takes over). This caused "NOT NULL constraint failed: raft_log.data" errors immediately after election, causing the leader to step down and creating perpetual election cycles.
d8e4de2 to
2ea8a25
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
tests/functional/test-raft.sh (2)
4-4:⚠️ Potential issue | 🟡 MinorGuard
cdfailure on Line 4.If directory change fails, subsequent docker/paths are evaluated from an undefined working directory.
🔧 Suggested fix
-cd "$(dirname "$0")/../.." +cd "$(dirname "$0")/../.." || exit 1🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional/test-raft.sh` at line 4, The script runs cd "$(dirname "$0")/../.." without checking failure; modify the tests/functional/test-raft.sh startup to guard that cd (e.g., change the line to cd "$(dirname "$0")/../.." || { echo "Failed to change directory"; exit 1; } or add set -e at top) so that if the cd command in the script fails the script aborts and subsequent docker/path operations are not run from an undefined working directory.
176-176:⚠️ Potential issue | 🟠 MajorFix malformed curl flags on Line 176; current check is unreliable.
--max-time 10Lis malformed and can make the follower check always fall back to[], masking real behavior.🔧 Suggested fix
- FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]") + FCLUSTERS=$(curl -sfL --max-time 10 "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")#!/bin/bash # Verify malformed curl usage and confirm expected curl flags exist. rg -n -- '--max-time[[:space:]]+[0-9]+L' tests/functional/test-raft.sh curl --help all | rg -n -- '--max-time|-L'🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional/test-raft.sh` at line 176, The curl invocation that sets FCLUSTERS is using a malformed flag `--max-time 10L` which causes curl to fail and fallback to `[]`; update the curl call that assigns FCLUSTERS to use a valid timeout (e.g., `--max-time 10` or an appropriate numeric value) and ensure other flags (`-s -f`) remain correct so the follower check behaves reliably; locate the FCLUSTERS assignment line with the `curl ... "http://localhost:${FPORT}/api/clusters"` invocation and replace the `10L` timeout token with a valid numeric timeout value.
🧹 Nitpick comments (1)
tests/functional/docker-compose.yml (1)
215-221: Reset per-node Raft state at container boot to reduce flaky reruns.Right now each node only
mkdir -p /tmp/raft*. If an existing stopped container is reused, stale Raft/SQLite state can leak across runs and affect election behavior.♻️ Suggested hardening
command: > bash -c " apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 && mkdir -p /tmp/raft1 && + rm -rf /tmp/raft1/* && cd /orchestrator && orchestrator -config orchestrator.conf.json http " @@ command: > bash -c " apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 && mkdir -p /tmp/raft2 && + rm -rf /tmp/raft2/* && cd /orchestrator && orchestrator -config orchestrator.conf.json http " @@ command: > bash -c " apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 && mkdir -p /tmp/raft3 && + rm -rf /tmp/raft3/* && cd /orchestrator && orchestrator -config orchestrator.conf.json http "Also applies to: 240-246, 265-271
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional/docker-compose.yml` around lines 215 - 221, The container start command only runs "mkdir -p /tmp/raft1" which leaves stale Raft/SQLite state if a stopped container is reused; change the startup shell in the orchestrator command to remove any existing Raft directory before recreating it (for example use "rm -rf /tmp/raft1 && mkdir -p /tmp/raft1") so old state is wiped on boot, and apply the same change to the other node startup commands that create /tmp/raft2 and /tmp/raft3 (i.e., replace their "mkdir -p /tmp/raftN" with "rm -rf /tmp/raftN && mkdir -p /tmp/raftN").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/functional/test-raft.sh`:
- Around line 299-307: The test currently treats a restarted node reporting
"Leader" as a skip, weakening the requirement that a rejoining node must settle
as a Follower; update the logic in tests/functional/test-raft.sh to treat RSTATE
== "Leader" as a failure instead of skipping by changing the branch that checks
RSTATE (the code using variables RESTARTED_PORT and RSTATE and calling
pass/skip/fail) so that only "Follower" results in pass and any other value —
including "Leader" — triggers fail with a clear message like "Restarted node in
unexpected state: $RSTATE".
- Around line 168-188: The test currently treats missing follower data as a
non-fatal skip; change it to enforce the follower redirect/serve behavior by
failing the test when a follower neither serves cluster data nor issues a valid
redirect. In the loop that uses FPORT, FCLUSTERS and FCOUNT, update the check to
attempt detecting an HTTP 3xx redirect (inspect status header from curl -I or
-sS -D -) or a non-empty JSON array; if neither condition is met set
FOLLOWER_HAS_DATA=false and call the test failure helper (replace skip "..."
with fail "Follower on :${FPORT} neither served data nor redirected to leader")
so CI fails when followers do not accept redirect/serve behavior, and keep the
pass "Followers serve cluster data (Raft state replicated)" when all followers
pass.
---
Duplicate comments:
In `@tests/functional/test-raft.sh`:
- Line 4: The script runs cd "$(dirname "$0")/../.." without checking failure;
modify the tests/functional/test-raft.sh startup to guard that cd (e.g., change
the line to cd "$(dirname "$0")/../.." || { echo "Failed to change directory";
exit 1; } or add set -e at top) so that if the cd command in the script fails
the script aborts and subsequent docker/path operations are not run from an
undefined working directory.
- Line 176: The curl invocation that sets FCLUSTERS is using a malformed flag
`--max-time 10L` which causes curl to fail and fallback to `[]`; update the curl
call that assigns FCLUSTERS to use a valid timeout (e.g., `--max-time 10` or an
appropriate numeric value) and ensure other flags (`-s -f`) remain correct so
the follower check behaves reliably; locate the FCLUSTERS assignment line with
the `curl ... "http://localhost:${FPORT}/api/clusters"` invocation and replace
the `10L` timeout token with a valid numeric timeout value.
---
Nitpick comments:
In `@tests/functional/docker-compose.yml`:
- Around line 215-221: The container start command only runs "mkdir -p
/tmp/raft1" which leaves stale Raft/SQLite state if a stopped container is
reused; change the startup shell in the orchestrator command to remove any
existing Raft directory before recreating it (for example use "rm -rf /tmp/raft1
&& mkdir -p /tmp/raft1") so old state is wiped on boot, and apply the same
change to the other node startup commands that create /tmp/raft2 and /tmp/raft3
(i.e., replace their "mkdir -p /tmp/raftN" with "rm -rf /tmp/raftN && mkdir -p
/tmp/raftN").
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a6f07ef4-bea0-4cda-90b1-5f440a39d31d
📒 Files selected for processing (7)
.github/workflows/functional.ymlgo/raft/rel_store.gotests/functional/docker-compose.ymltests/functional/orchestrator-raft1.conf.jsontests/functional/orchestrator-raft2.conf.jsontests/functional/orchestrator-raft3.conf.jsontests/functional/test-raft.sh
✅ Files skipped from review due to trivial changes (4)
- go/raft/rel_store.go
- tests/functional/orchestrator-raft3.conf.json
- tests/functional/orchestrator-raft1.conf.json
- tests/functional/orchestrator-raft2.conf.json
| # Verify followers can also return cluster data (Raft replicates state) | ||
| FOLLOWER_HAS_DATA=true | ||
| for idx in 0 1 2; do | ||
| if [ "$idx" = "$LEADER_INDEX" ]; then | ||
| continue | ||
| fi | ||
| FPORT="${RAFT_PORTS[$idx]}" | ||
| # Followers may redirect or serve data directly; either is valid | ||
| FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]") | ||
| FCOUNT=$(echo "$FCLUSTERS" | python3 -c "import json,sys; c=json.load(sys.stdin); print(len(c))" 2>/dev/null || echo "0") | ||
| if [ "$FCOUNT" -lt 1 ]; then | ||
| FOLLOWER_HAS_DATA=false | ||
| echo " Follower on :${FPORT} returned $FCOUNT clusters" | ||
| fi | ||
| done | ||
| if $FOLLOWER_HAS_DATA; then | ||
| pass "Followers serve cluster data (Raft state replicated)" | ||
| else | ||
| # This is not necessarily a failure -- followers may need more time | ||
| skip "Some followers do not yet serve cluster data (may need more replication time)" | ||
| fi |
There was a problem hiding this comment.
Follower redirect acceptance criterion is not enforced.
This block treats missing follower data as skip, so CI can pass without proving follower-to-leader redirect behavior required by the PR objective.
🔧 Suggested assertion pattern
- # Verify followers can also return cluster data (Raft replicates state)
- FOLLOWER_HAS_DATA=true
+ # Verify follower requests are redirected (or explicitly served) in a strict way
+ FOLLOWER_BEHAVIOR_OK=true
for idx in 0 1 2; do
if [ "$idx" = "$LEADER_INDEX" ]; then
continue
fi
FPORT="${RAFT_PORTS[$idx]}"
- # Followers may redirect or serve data directly; either is valid
- FCLUSTERS=$(curl -sfL --max-time 10 "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
- FCOUNT=$(echo "$FCLUSTERS" | python3 -c "import json,sys; c=json.load(sys.stdin); print(len(c))" 2>/dev/null || echo "0")
- if [ "$FCOUNT" -lt 1 ]; then
- FOLLOWER_HAS_DATA=false
- echo " Follower on :${FPORT} returned $FCOUNT clusters"
- fi
+ HEADERS=$(curl -sS -D - -o /dev/null --max-time 10 "http://localhost:${FPORT}/api/clusters" || true)
+ CODE=$(echo "$HEADERS" | awk 'toupper($1) ~ /^HTTP/ {print $2}' | tail -1)
+ LOCATION=$(echo "$HEADERS" | awk 'tolower($1)=="location:" {print $2}' | tr -d '\r')
+ if [ "$CODE" -ge 300 ] && [ "$CODE" -lt 400 ]; then
+ [ -n "$LOCATION" ] || FOLLOWER_BEHAVIOR_OK=false
+ elif [ "$CODE" != "200" ]; then
+ FOLLOWER_BEHAVIOR_OK=false
+ fi
done
- if $FOLLOWER_HAS_DATA; then
- pass "Followers serve cluster data (Raft state replicated)"
+ if $FOLLOWER_BEHAVIOR_OK; then
+ pass "Follower behavior validated for /api/clusters"
else
- # This is not necessarily a failure -- followers may need more time
- skip "Some followers do not yet serve cluster data (may need more replication time)"
+ fail "Follower /api/clusters behavior did not meet redirect/serve expectations"
fi🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional/test-raft.sh` around lines 168 - 188, The test currently
treats missing follower data as a non-fatal skip; change it to enforce the
follower redirect/serve behavior by failing the test when a follower neither
serves cluster data nor issues a valid redirect. In the loop that uses FPORT,
FCLUSTERS and FCOUNT, update the check to attempt detecting an HTTP 3xx redirect
(inspect status header from curl -I or -sS -D -) or a non-empty JSON array; if
neither condition is met set FOLLOWER_HAS_DATA=false and call the test failure
helper (replace skip "..." with fail "Follower on :${FPORT} neither served data
nor redirected to leader") so CI fails when followers do not accept
redirect/serve behavior, and keep the pass "Followers serve cluster data (Raft
state replicated)" when all followers pass.
| # Verify the restarted node is a follower (not a new leader) | ||
| RSTATE=$(curl -sf --max-time 10 "http://localhost:${RESTARTED_PORT}/api/raft-state" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "") | ||
| if [ "$RSTATE" = "Follower" ]; then | ||
| pass "Restarted node is a Follower (stable leader)" | ||
| elif [ "$RSTATE" = "Leader" ]; then | ||
| # Leadership may have shifted -- still valid if all agree | ||
| skip "Restarted node became Leader (leadership may have shifted)" | ||
| else | ||
| fail "Restarted node in unexpected state: $RSTATE" |
There was a problem hiding this comment.
Node rejoin check should fail if the restarted node does not settle as follower.
Line 305 currently downgrades a leader state to skip, which weakens the explicit “rejoins as follower” acceptance criterion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional/test-raft.sh` around lines 299 - 307, The test currently
treats a restarted node reporting "Leader" as a skip, weakening the requirement
that a rejoining node must settle as a Follower; update the logic in
tests/functional/test-raft.sh to treat RSTATE == "Leader" as a failure instead
of skipping by changing the branch that checks RSTATE (the code using variables
RESTARTED_PORT and RSTATE and calling pass/skip/fail) so that only "Follower"
results in pass and any other value — including "Leader" — triggers fail with a
clear message like "Restarted node in unexpected state: $RSTATE".
Summary
Adds a 3-node orchestrator Raft cluster to the functional test infrastructure and tests consensus behavior under failure conditions (#84).
Changes
docker-compose.yml(IPs .40-.42, ports 3100-3102)orchestrator-raft{1,2,3}.conf.json) with SQLite backend and Raft peer discoverytest-raft.shwith 4 test phases:functional-raftCI job (depends on shared build job)Depends on #86
Closes #84
Test plan
Summary by CodeRabbit
Tests
Chores