
CI: Raft consensus testing under failure conditions #87

Merged
renecannao merged 4 commits into master from issue84-raft-testing on Apr 9, 2026

Conversation

@renecannao

@renecannao renecannao commented Apr 9, 2026

Summary

Adds a 3-node orchestrator Raft cluster to the functional test infrastructure and tests consensus behavior under failure conditions (#84).

Changes

  • Added 3 Raft orchestrator containers to docker-compose.yml (IPs .40-.42, ports 3100-3102)
  • Created Raft config files (orchestrator-raft{1,2,3}.conf.json) with SQLite backend and Raft peer discovery
  • Created test-raft.sh with 4 test phases:
    1. Cluster formation: Start 3 nodes, verify leader elected, all nodes agree
    2. Leader serves topology: Discover MySQL cluster through leader, verify API responses
    3. Leader failure: Stop leader, verify re-election within timeout, new leader serves API
    4. Node rejoin: Restart stopped node, verify it rejoins as follower
  • Added functional-raft CI job (depends on shared build job)
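
For reference, one node's config (e.g. orchestrator-raft1.conf.json) might look roughly like the sketch below. The key names follow orchestrator's documented Raft and SQLite settings; the subnet, port, and file paths are assumptions for illustration, not the PR's actual values.

```json
{
  "Debug": true,
  "ListenAddress": ":3099",
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/tmp/raft1/orchestrator.sqlite3",
  "RaftEnabled": true,
  "RaftDataDir": "/tmp/raft1",
  "RaftBind": "172.30.0.40",
  "RaftNodes": ["172.30.0.40", "172.30.0.41", "172.30.0.42"],
  "HTTPAdvertise": "http://172.30.0.40:3099"
}
```

The other two nodes would differ only in RaftBind, HTTPAdvertise, the data paths, and the host port mapping.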

Depends on #86
Closes #84

Test plan

  • 3-node Raft cluster elects a leader within 90s
  • All nodes agree on the same leader
  • Leader serves MySQL topology data via API
  • Killing leader triggers re-election with new leader
  • Stopped node rejoins as follower after restart
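
The "all nodes agree on the same leader" check can be sketched as a pair of bash helpers. The ports and the /api/raft-leader endpoint match the test-raft.sh excerpts quoted in the reviews; the helper names are illustrative, not the script's actual code.

```shell
#!/bin/bash
# Sketch of the "all nodes agree on the same leader" check. Ports and
# the /api/raft-leader endpoint match test-raft.sh; the helper names
# (get_raft_leader, all_agree) are illustrative, not the script's own.
RAFT_PORTS=(3100 3101 3102)

# Fetch the leader a node reports, or "" if unreachable/unparseable.
get_raft_leader() {
    local port=$1
    curl -sf --max-time 10 "http://localhost:${port}/api/raft-leader" 2>/dev/null \
        | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo ""
}

# Succeed only when every argument is non-empty and identical.
all_agree() {
    local first=$1 v
    [ -n "$first" ] || return 1
    for v in "$@"; do
        [ "$v" = "$first" ] || return 1
    done
}
```

Usage, assuming all three nodes are up: `all_agree "$(get_raft_leader 3100)" "$(get_raft_leader 3101)" "$(get_raft_leader 3102)"` exits 0 only on agreement.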

Summary by CodeRabbit

  • Tests

    • Added comprehensive functional Raft consensus tests validating leader election, topology discovery, leader failover, and node rejoin scenarios across a three-node cluster.
  • Chores

    • Integrated Raft functional testing into CI/CD pipeline with automated log collection and cleanup.
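
The CI wiring might look roughly like the sketch below. Only the job name and its dependency on the shared build job come from the PR description; step names, artifact name, paths, and action versions are assumptions.

```yaml
# Hypothetical shape of the functional-raft job; step names, artifact
# name, and action versions are assumptions, not the actual workflow.
functional-raft:
  runs-on: ubuntu-latest
  needs: build
  steps:
    - uses: actions/checkout@v4
    - name: Download orchestrator binary
      uses: actions/download-artifact@v4
      with:
        name: orchestrator-binary
        path: tests/functional/
    - name: Start 3-node Raft cluster
      run: docker compose -f tests/functional/docker-compose.yml up -d
    - name: Run Raft tests
      run: tests/functional/test-raft.sh
    - name: Collect logs
      if: always()
      run: docker compose -f tests/functional/docker-compose.yml logs > compose-logs.txt
    - name: Cleanup
      if: always()
      run: docker compose -f tests/functional/docker-compose.yml down -v
```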

@coderabbitai

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Added Raft consensus functional testing infrastructure to CI: a new GitHub Actions job that orchestrates a 3-node Raft cluster via Docker Compose, runs topology discovery and failover scenarios, and validates leader election, cluster data serving, and node rejoin operations.

Changes

  • GitHub Actions Workflow (.github/workflows/functional.yml):
    Added new functional-raft job that downloads the orchestrator binary, starts a 3-node Raft cluster via Docker Compose, runs functional test scripts, collects orchestrator and compose logs, and cleans up the environment.
  • Docker Compose Services (tests/functional/docker-compose.yml):
    Introduced three Raft orchestrator services (orchestrator-raft1, orchestrator-raft2, orchestrator-raft3) with distinct static IPs, port mappings, mounted binaries and configs, and health dependencies on MySQL.
  • Raft Configuration Files (tests/functional/orchestrator-raft1.conf.json, tests/functional/orchestrator-raft2.conf.json, tests/functional/orchestrator-raft3.conf.json):
    Added three identical-structure Raft runtime configurations, each with debug logging, SQLite persistence, Raft consensus enabled, distinct HTTP advertise addresses, and a static 3-node cluster node list.
  • Functional Test Script (tests/functional/test-raft.sh):
    Implemented end-to-end Raft test that validates: leader election on startup, leader-driven topology discovery, follower health checks, leader failover and re-election, node rejoin, and final leader consensus verification.
  • Raft Schema Update (go/raft/rel_store.go):
    Removed NOT NULL constraints from the data column in the raft_log table and the store_value column in the raft_store table to allow nullable blob fields.
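
The schema change is easy to reproduce in isolation: hashicorp/raft can emit log entries with nil Data (e.g. LogNoop after an election), which a NOT NULL data column rejects. A minimal sketch using python3's sqlite3 module; only the table and column names come from the PR summary, the column types are assumptions.

```shell
#!/bin/bash
# Reproduce the constraint failure fixed in go/raft/rel_store.go:
# a NOT NULL data column rejects the nil-Data log entries that
# hashicorp/raft produces after an election. Schema is a sketch; only
# the table/column names come from the PR summary.
python3 - <<'EOF'
import sqlite3
db = sqlite3.connect(":memory:")

# Old schema: NOT NULL on data, so inserting a nil/NULL blob fails.
db.execute("CREATE TABLE raft_log_old (log_index INTEGER PRIMARY KEY, data BLOB NOT NULL)")
try:
    db.execute("INSERT INTO raft_log_old (log_index, data) VALUES (1, NULL)")
    print("old schema: insert succeeded")
except sqlite3.IntegrityError as e:
    print("old schema: %s" % e)

# New schema: nullable blob accepts the entry.
db.execute("CREATE TABLE raft_log (log_index INTEGER PRIMARY KEY, data BLOB)")
db.execute("INSERT INTO raft_log (log_index, data) VALUES (1, NULL)")
print("new schema: insert succeeded")
EOF
```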

Sequence Diagram(s)

sequenceDiagram
    participant Test as test-raft.sh
    participant R1 as orchestrator-raft1
    participant R2 as orchestrator-raft2
    participant R3 as orchestrator-raft3
    participant MySQL as mysql1/2/3

    rect rgba(100, 149, 237, 0.5)
      Note over Test,R3: Startup & Leader Election
      Test->>R1: Start container (port 3100)
      Test->>R1: Poll /api/raft-status (wait ~90s)
      Test->>R2: Start container (port 3101)
      Test->>R3: Start container (port 3102)
      Test->>R1: Query /api/raft-leader
      Test->>R2: Query /api/raft-leader
      Test->>R3: Query /api/raft-leader
      Test->>R1: Verify exactly one /api/raft-state == "Leader"
    end

    rect rgba(144, 238, 144, 0.5)
      Note over Test,MySQL: Topology Discovery Phase
      Test->>R1: Identify leader via /api/raft-state
      Test->>R1: Trigger /api/discover/mysql{1,2,3}/3306
      Test->>MySQL: Query MySQL topology
      Test->>R1: Poll /api/clusters (wait ~60s)
      Test->>R2: Query /api/clusters (follower check)
      Test->>R3: Query /api/clusters (follower check)
    end

    rect rgba(255, 182, 193, 0.5)
      Note over Test,R3: Failover & Re-election
      Test->>R1: Stop old leader container
      Test->>R2: Poll /api/raft-leader (wait ~60s)
      Test->>R3: Poll /api/raft-leader (wait ~60s)
      Test->>R2: Verify new leader elected
      Test->>R2: Query /api/clusters
      Test->>R1: Restart old leader container
      Test->>R1: Wait /api/raft-leader matches new leader (~60s)
      Test->>R1: Validate /api/raft-state == "Follower"
    end

    rect rgba(255, 215, 0, 0.5)
      Note over Test,R3: Final Consensus Verification
      Test->>R1: Query /api/raft-leader
      Test->>R2: Query /api/raft-leader
      Test->>R3: Query /api/raft-leader
      Test->>R1: Query /api/raft-health
      Test->>R2: Query /api/raft-health
      Test->>R3: Query /api/raft-health
      Test->>Test: Verify all nodes agree on leader
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Three nodes dance the Raft consensus jig,
With leader elected, the cluster grows big!
Through failover storms and rejoins so bright,
These tests keep our Raft standing upright! 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Out of Scope Changes check: ❓ Inconclusive
    Explanation: The minor schema change in go/raft/rel_store.go (removing NOT NULL constraints) appears tangentially related to Raft testing but is not directly justified by the linked issue requirements.
    Resolution: Clarify whether the NOT NULL constraint removal in rel_store.go is necessary for the Raft tests or if it should be in a separate PR addressing schema requirements.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly and specifically describes the main objective: adding CI tests for Raft consensus behavior under failure conditions, which aligns with the primary changeset (adding Raft cluster setup, test scripts, and CI job).
  • Linked Issues check: ✅ Passed. All coding requirements from #84 are met: 3-node Raft cluster setup, leader election verification, failover testing, topology operations validation, and CI integration via the functional-raft job.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces functional tests for Raft consensus by adding three orchestrator nodes to the Docker Compose environment along with their respective configuration files. A new test script, test-raft.sh, is included to verify leader election, topology discovery, failover, and node rejoining. Review feedback recommends improving the Docker setup by using pre-built images instead of installing dependencies at runtime and refactoring the test script to use helper functions for repeated API interactions and JSON parsing to enhance maintainability.

- ./orchestrator-raft1.conf.json:/orchestrator/orchestrator.conf.json:ro
command: >
bash -c "
apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 &&

medium

Installing dependencies like curl and sqlite3 at runtime via apt-get update in the container's command is inefficient and can lead to flaky tests if external repositories are temporarily unavailable. It is better to use a custom Dockerfile or a pre-built image that already contains these tools.
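
One way to act on this suggestion is a small test image built once and reused by all three services. The base image, package set, and working directory below are assumptions, not taken from this PR.

```dockerfile
# Hypothetical test image with the tools preinstalled, so the compose
# command no longer needs apt-get at container start. Base image and
# paths are assumptions, not taken from this PR.
FROM debian:bookworm-slim
RUN apt-get update -qq \
 && apt-get install -y -qq --no-install-recommends curl sqlite3 ca-certificates \
 && rm -rf /var/lib/apt/lists/*
WORKDIR /orchestrator
CMD ["bash"]
```

Each orchestrator-raftN service would then set `build:` (or a prebuilt `image:`) and keep only the mkdir and orchestrator invocation in its command.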

Comment on lines +35 to +37
LEADER1=$(curl -sf "http://localhost:3100/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")
LEADER2=$(curl -sf "http://localhost:3101/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")
LEADER3=$(curl -sf "http://localhost:3102/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")

medium

The logic for fetching and parsing the Raft leader from a node is repeated multiple times throughout the script. Consider defining a helper function to reduce duplication and improve maintainability.

Example:

get_raft_leader() {
    local port=$1
    curl -sf "http://localhost:${port}/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo ""
}

# Verify exactly one node reports itself as Leader state
LEADER_COUNT=0
for port in "${RAFT_PORTS[@]}"; do
STATE=$(curl -sf "http://localhost:${port}/api/raft-state" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")

medium

Similar to the leader fetching logic, the Raft state retrieval is also repeated. A helper function would be beneficial here as well.

CLUSTER_FOUND=false
for i in $(seq 1 60); do
CLUSTERS=$(curl -sf "http://localhost:${LEADER_PORT}/api/clusters" 2>/dev/null || echo "[]")
COUNT=$(echo "$CLUSTERS" | python3 -c "import json,sys; c=json.load(sys.stdin); print(len(c))" 2>/dev/null || echo "0")

medium

The JSON parsing logic using python3 is repeated. While effective, it adds verbosity to the script. Consider a helper function for JSON extraction or using jq if available in the test environment.
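
Both ideas can be folded into a single helper that prefers jq when installed and falls back to python3 otherwise. The function names and interface here are illustrative, not code from the PR.

```shell
#!/bin/bash
# Illustrative helpers combining the two suggestions: one place for the
# curl + JSON-scalar pattern, using jq when present, python3 otherwise.

# Read a JSON document on stdin and print it as a bare scalar.
json_scalar() {
    if command -v jq >/dev/null 2>&1; then
        jq -r '.'
    else
        python3 -c "import json,sys; print(json.load(sys.stdin))"
    fi
}

# Fetch an API endpoint from a node and print its JSON body as a
# scalar, or "" on any failure.
raft_api() {
    local port=$1 endpoint=$2
    curl -sf --max-time 10 "http://localhost:${port}/api/${endpoint}" 2>/dev/null \
        | json_scalar 2>/dev/null || echo ""
}
```

Call sites then shrink to the form `LEADER1=$(raft_api 3100 raft-leader)`.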

@renecannao renecannao force-pushed the issue82-83-failover-pg-matrix branch from cfab30d to a6c5739 Compare April 9, 2026 09:34
@renecannao renecannao force-pushed the issue84-raft-testing branch from 197e86f to 81793d4 Compare April 9, 2026 09:38
@renecannao renecannao changed the base branch from issue82-83-failover-pg-matrix to master April 9, 2026 10:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/functional/test-failover-advanced.sh`:
- Line 4: The cd invocation cd "$(dirname "$0")/../.." can fail and leave the
script running in the wrong directory; update that line to handle errors by
checking its exit status and exiting on failure (e.g., change it to perform the
cd and then do || { echo "Failed to change directory"; exit 1; } or otherwise
abort on non-zero status) so the script does not continue when the directory
change fails.

In `@tests/functional/test-raft.sh`:
- Line 4: The script currently runs cd "$(dirname "$0")/../.." without checking
its success; update the invocation so the script aborts on failure (e.g.,
replace with a guarded form like if ! cd "$(dirname "$0")/../.."; then echo
"Failed to change directory to $(dirname "$0")/../.."; exit 1; fi or append || {
echo "Failed to cd"; exit 1; } ), ensuring the working directory is validated
before proceeding.
- Line 149: The curl invocation that sets FCLUSTERS uses a malformed option
`--max-time 10L`; update the command in the FCLUSTERS assignment so the timeout
and follow-redirect flags are separate (e.g., `--max-time 10 -L` or just
`--max-time 10` if you don't need redirects) to remove the stray `L` and ensure
redirects are followed when desired; modify the line that defines FCLUSTERS
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdbc12a7-24a2-4916-beb5-6b7bdcb53ca7

📥 Commits

Reviewing files that changed from the base of the PR and between 5b13080 and 730d7d3.

📒 Files selected for processing (8)
  • .github/workflows/functional.yml
  • tests/functional/docker-compose.yml
  • tests/functional/lib.sh
  • tests/functional/orchestrator-raft1.conf.json
  • tests/functional/orchestrator-raft2.conf.json
  • tests/functional/orchestrator-raft3.conf.json
  • tests/functional/test-failover-advanced.sh
  • tests/functional/test-raft.sh

#!/bin/bash
# Tier B+: Advanced failover tests — intermediate master, errant GTID, co-master
set -uo pipefail # no -e: we handle failures ourselves
cd "$(dirname "$0")/../.."

⚠️ Potential issue | 🟡 Minor

Add error handling for cd command.

Same issue as in test-raft.sh - if the cd fails, the script continues from an unexpected location.

🛡️ Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


#!/bin/bash
# Raft consensus tests -- verify leader election, failover, and follower redirect
set -uo pipefail
cd "$(dirname "$0")/../.."

⚠️ Potential issue | 🟡 Minor

Add error handling for cd command.

If the cd fails (e.g., directory doesn't exist), the script will continue from an unexpected location, which could cause failures or incorrect behavior.

🛡️ Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


fi
FPORT="${RAFT_PORTS[$idx]}"
# Followers may redirect or serve data directly; either is valid
FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")

⚠️ Potential issue | 🟡 Minor

Typo: 10L should be 10 -L or just 10.

The curl option --max-time 10L appears to have an errant L. If the intent is to follow redirects (which makes sense for followers that may redirect to the leader), use -L as a separate flag.

🔧 Proposed fix
-        FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
+        FCLUSTERS=$(curl -sfL --max-time 10 "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
tests/functional/test-raft.sh (3)

79-81: Consider quoting the URL to satisfy shellcheck.

The URL in the curl command is technically unquoted. While safe here since port comes from a hardcoded array, quoting prevents potential issues if the code is modified later.

🔧 Proposed fix
-        echo "  Node :${port} raft-status: $(curl -sf --max-time 10 http://localhost:${port}/api/raft-status 2>/dev/null || echo 'unreachable')"
+        echo "  Node :${port} raft-status: $(curl -sf --max-time 10 "http://localhost:${port}/api/raft-status" 2>/dev/null || echo 'unreachable')"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` around lines 79 - 81, The for-loop iterating
RAFT_PORTS uses an unquoted URL in the curl call; update the curl invocation
inside the loop so the URL and the port variable are quoted (e.g.,
"http://localhost:${port}/api/raft-status") to satisfy shellcheck and prevent
future word-splitting or globbing issues when using RAFT_PORTS and the curl
command.

237-239: Consider quoting the URL for consistency.

Same shellcheck suggestion as line 80 - quote the URL to prevent potential word splitting.

🔧 Proposed fix
-            echo "  Node :${port} status: $(curl -sf --max-time 10 http://localhost:${port}/api/raft-status 2>/dev/null || echo 'unreachable')"
+            echo "  Node :${port} status: $(curl -sf --max-time 10 "http://localhost:${port}/api/raft-status" 2>/dev/null || echo 'unreachable')"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` around lines 237 - 239, Loop uses an unquoted
URL in the curl call which can lead to word-splitting; update the curl
invocation inside the for loop that iterates REMAINING_PORTS to quote the
interpolated URL (use "$port" within the URL string) so the argument passed to
curl is a single token (e.g., curl -sf --max-time 10
"http://localhost:${port}/api/raft-status" ...).

62-64: Consider extracting JSON parsing to a helper function.

The pattern curl ... | python3 -c "import json,sys; print(json.load(sys.stdin))" is repeated throughout the script (lines 62-64, 87, 102, 123, 147, 177, 223-224, 253, 261, 284, 300, 314). Extracting this to a helper would reduce duplication and make the script easier to maintain.

♻️ Proposed helper function

Add near the top of the script after sourcing lib.sh:

# Helper to fetch JSON endpoint and extract scalar value
raft_api() {
    local port=$1
    local endpoint=$2
    curl -sf --max-time 10 "http://localhost:${port}/api/${endpoint}" 2>/dev/null \
        | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo ""
}

Then usage becomes:

-        LEADER1=$(curl -sf --max-time 10 "http://localhost:3100/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")
+        LEADER1=$(raft_api 3100 raft-leader)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` around lines 62 - 64, The script repeats the
curl+python JSON-extraction pipeline for endpoints (used to set variables like
LEADER1/LEADER2/LEADER3 and other checks); extract that into a small helper
function (e.g., raft_api) defined near the top after sourcing lib.sh that
accepts a port and an endpoint (port, endpoint) and returns the scalar JSON
value or empty string on failure, then replace each occurrence of the inline
pipeline (curl ... | python3 -c "import json,sys; print(json.load(sys.stdin))"
...) with calls like raft_api 3100 raft-leader (or raft_api "$port"
"some-endpoint") for LEADER*, the checks at lines referenced, and any other
occurrences to remove duplication and centralize error handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8d356ac6-1896-44fb-90f6-2dd3b42449bc

📥 Commits

Reviewing files that changed from the base of the PR and between 730d7d3 and 71d1b7f.

📒 Files selected for processing (2)
  • go/raft/rel_store.go
  • tests/functional/test-raft.sh
✅ Files skipped from review due to trivial changes (1)
  • go/raft/rel_store.go

@renecannao renecannao force-pushed the issue84-raft-testing branch from 71d1b7f to f7d24c2 Compare April 9, 2026 13:27

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/functional/docker-compose.yml (1)

208-231: Consider adding a healthcheck for Raft orchestrator services.

Unlike the existing orchestrator service (line 165-170), the Raft services lack healthchecks. While the test script handles bootstrap ordering manually, adding healthchecks would improve CI reliability and enable depends_on with condition: service_healthy between nodes or for downstream steps.

♻️ Optional: Add healthcheck to each Raft service
     ports:
       - "3100:3099"
     depends_on:
       mysql1:
         condition: service_healthy
+    healthcheck:
+      test: ["CMD", "curl", "-sf", "http://localhost:3099/api/raft-status"]
+      interval: 5s
+      timeout: 3s
+      retries: 60
+      start_period: 30s
     networks:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/docker-compose.yml` around lines 208 - 231, The Raft node
services (e.g., orchestrator-raft1) are missing Docker healthchecks; add a
healthcheck block to each Raft orchestrator service (orchestrator-raft1 and the
other orchestrator-raftN services) similar to the non‑Raft orchestrator service:
include a test that hits the internal HTTP port (3099) such as ["CMD", "curl",
"-f", "http://localhost:3099/health"] (or equivalent endpoint), and set sensible
interval/timeout/retries/start_period values (for example interval: 10s,
timeout: 5s, retries: 5, start_period: 5s) so depends_on can use condition:
service_healthy and CI ordering becomes reliable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3f8db2b3-e6dd-4122-8511-304330439e88

📥 Commits

Reviewing files that changed from the base of the PR and between 71d1b7f and f7d24c2.

📒 Files selected for processing (8)
  • .github/workflows/functional.yml
  • go/raft/rel_store.go
  • tests/functional/docker-compose.yml
  • tests/functional/orchestrator-raft1.conf.json
  • tests/functional/orchestrator-raft2.conf.json
  • tests/functional/orchestrator-raft3.conf.json
  • tests/functional/test-failover-advanced.sh
  • tests/functional/test-raft.sh
✅ Files skipped from review due to trivial changes (4)
  • go/raft/rel_store.go
  • tests/functional/orchestrator-raft3.conf.json
  • tests/functional/orchestrator-raft1.conf.json
  • tests/functional/orchestrator-raft2.conf.json

- Add 3-node orchestrator Raft cluster to docker-compose.yml
- Add Raft config files for each node (orchestrator-raft{1,2,3}.conf.json)
- Add test-raft.sh with 4 test phases: cluster formation & leader
  election, leader serves topology, leader failure & re-election,
  node rejoin
- Add functional-raft CI job in GitHub Actions workflow

All curl calls lacked --max-time, causing potential test hangs if any API endpoint was slow to respond.

Starting all 3 Raft nodes simultaneously causes each to call BootstrapCluster independently, creating conflicting initial states and perpetual election cycles. Fix by starting node 1 first, waiting for it to be ready, then starting nodes 2 and 3.
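
The staggered bootstrap described in that commit message could be sketched as follows. Service names follow docker-compose.yml; the wait_for and staggered_start helpers are illustrative, not code from the PR.

```shell
#!/bin/bash
# Sketch of the staggered bootstrap: start node 1 alone, wait until
# its API answers, then start nodes 2 and 3 so only one node ever
# bootstraps the cluster. Service names follow docker-compose.yml;
# wait_for/staggered_start are illustrative, not code from the PR.

# Retry a command up to $1 times, sleeping 1s between failed attempts.
wait_for() {
    local tries=$1 i
    shift
    for ((i = 1; i <= tries; i++)); do
        "$@" && return 0
        sleep 1
    done
    return 1
}

staggered_start() {
    docker compose up -d orchestrator-raft1 || return 1
    wait_for 90 curl -sf --max-time 5 "http://localhost:3100/api/raft-status" \
        >/dev/null || { echo "node 1 never became ready" >&2; return 1; }
    docker compose up -d orchestrator-raft2 orchestrator-raft3
}
```
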

The raft_log.data and raft_store.store_value columns were NOT NULL, but hashicorp/raft can produce log entries with nil Data (e.g. LogNoop when a new leader takes over). This caused "NOT NULL constraint failed: raft_log.data" errors immediately after election, making the leader step down and creating perpetual election cycles.
@renecannao renecannao force-pushed the issue84-raft-testing branch from d8e4de2 to 2ea8a25 Compare April 9, 2026 14:17

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
tests/functional/test-raft.sh (2)

4-4: ⚠️ Potential issue | 🟡 Minor

Guard cd failure on Line 4.

If directory change fails, subsequent docker/paths are evaluated from an undefined working directory.

🔧 Suggested fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` at line 4, The script runs cd "$(dirname
"$0")/../.." without checking failure; modify the tests/functional/test-raft.sh
startup to guard that cd (e.g., change the line to cd "$(dirname "$0")/../.." ||
{ echo "Failed to change directory"; exit 1; } or add set -e at top) so that if
the cd command in the script fails the script aborts and subsequent docker/path
operations are not run from an undefined working directory.

176-176: ⚠️ Potential issue | 🟠 Major

Fix malformed curl flags on Line 176; current check is unreliable.

--max-time 10L is malformed and can make the follower check always fall back to [], masking real behavior.

🔧 Suggested fix
-        FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
+        FCLUSTERS=$(curl -sfL --max-time 10 "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
#!/bin/bash
# Verify malformed curl usage and confirm expected curl flags exist.
rg -n -- '--max-time[[:space:]]+[0-9]+L' tests/functional/test-raft.sh
curl --help all | rg -n -- '--max-time|-L'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` at line 176, The curl invocation that sets
FCLUSTERS is using a malformed flag `--max-time 10L` which causes curl to fail
and fallback to `[]`; update the curl call that assigns FCLUSTERS to use a valid
timeout (e.g., `--max-time 10` or an appropriate numeric value) and ensure other
flags (`-s -f`) remain correct so the follower check behaves reliably; locate
the FCLUSTERS assignment line with the `curl ...
"http://localhost:${FPORT}/api/clusters"` invocation and replace the `10L`
timeout token with a valid numeric timeout value.
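Typos of the `10L` kind can be caught before curl ever runs by validating arguments in a wrapper; a sketch (the `api_get` helper is hypothetical, not part of the test suite):

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical wrapper: refuse non-numeric timeouts so typos like "10L"
# fail fast instead of silently breaking every curl call.
api_get() {
    local timeout="$1" url="$2"
    case "$timeout" in
        ''|*[!0-9]*) echo "api_get: invalid timeout '$timeout'" >&2; return 2 ;;
    esac
    curl -sfL --max-time "$timeout" "$url"
}

# A bad timeout is rejected before curl ever runs.
api_get 10L "http://localhost:3100/api/clusters" || echo "rejected"
```

Centralizing the curl flags in one place also keeps `-s`, `-f`, and `-L` consistent across all the API checks in the script.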
🧹 Nitpick comments (1)
tests/functional/docker-compose.yml (1)

215-221: Reset per-node Raft state at container boot to reduce flaky reruns.

Right now each node only runs mkdir -p /tmp/raft*. If an existing stopped container is reused, stale Raft/SQLite state can leak across runs and affect election behavior.

♻️ Suggested hardening
   command: >
     bash -c "
       apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 &&
       mkdir -p /tmp/raft1 &&
+      rm -rf /tmp/raft1/* &&
       cd /orchestrator &&
       orchestrator -config orchestrator.conf.json http
     "
@@
   command: >
     bash -c "
       apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 &&
       mkdir -p /tmp/raft2 &&
+      rm -rf /tmp/raft2/* &&
       cd /orchestrator &&
       orchestrator -config orchestrator.conf.json http
     "
@@
   command: >
     bash -c "
       apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 &&
       mkdir -p /tmp/raft3 &&
+      rm -rf /tmp/raft3/* &&
       cd /orchestrator &&
       orchestrator -config orchestrator.conf.json http
     "

Also applies to: 240-246, 265-271

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/docker-compose.yml` around lines 215 - 221, The container
start command only runs "mkdir -p /tmp/raft1" which leaves stale Raft/SQLite
state if a stopped container is reused; change the startup shell in the
orchestrator command to remove any existing Raft directory before recreating it
(for example use "rm -rf /tmp/raft1 && mkdir -p /tmp/raft1") so old state is
wiped on boot, and apply the same change to the other node startup commands that
create /tmp/raft2 and /tmp/raft3 (i.e., replace their "mkdir -p /tmp/raftN" with
"rm -rf /tmp/raftN && mkdir -p /tmp/raftN").
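The same wipe-then-recreate pattern can be factored into one helper if the compose file ever grows more nodes; a sketch (the `reset_raft_dir` function is hypothetical, not in the current compose file):

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: wipe and recreate a node's Raft state directory so a
# reused container always boots from a clean slate.
reset_raft_dir() {
    local dir="$1"
    rm -rf "$dir"
    mkdir -p "$dir"
}

for n in 1 2 3; do
    reset_raft_dir "/tmp/raft$n"
done
ls -d /tmp/raft1 /tmp/raft2 /tmp/raft3
```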

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a6f07ef4-bea0-4cda-90b1-5f440a39d31d

📥 Commits

Reviewing files that changed from the base of the PR and between f7d24c2 and 2ea8a25.

📒 Files selected for processing (7)
  • .github/workflows/functional.yml
  • go/raft/rel_store.go
  • tests/functional/docker-compose.yml
  • tests/functional/orchestrator-raft1.conf.json
  • tests/functional/orchestrator-raft2.conf.json
  • tests/functional/orchestrator-raft3.conf.json
  • tests/functional/test-raft.sh
✅ Files skipped from review due to trivial changes (4)
  • go/raft/rel_store.go
  • tests/functional/orchestrator-raft3.conf.json
  • tests/functional/orchestrator-raft1.conf.json
  • tests/functional/orchestrator-raft2.conf.json

Comment on lines +168 to +188
    # Verify followers can also return cluster data (Raft replicates state)
    FOLLOWER_HAS_DATA=true
    for idx in 0 1 2; do
        if [ "$idx" = "$LEADER_INDEX" ]; then
            continue
        fi
        FPORT="${RAFT_PORTS[$idx]}"
        # Followers may redirect or serve data directly; either is valid
        FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
        FCOUNT=$(echo "$FCLUSTERS" | python3 -c "import json,sys; c=json.load(sys.stdin); print(len(c))" 2>/dev/null || echo "0")
        if [ "$FCOUNT" -lt 1 ]; then
            FOLLOWER_HAS_DATA=false
            echo "  Follower on :${FPORT} returned $FCOUNT clusters"
        fi
    done
    if $FOLLOWER_HAS_DATA; then
        pass "Followers serve cluster data (Raft state replicated)"
    else
        # This is not necessarily a failure -- followers may need more time
        skip "Some followers do not yet serve cluster data (may need more replication time)"
    fi

⚠️ Potential issue | 🟠 Major

Follower redirect acceptance criterion is not enforced.

This block treats missing follower data as skip, so CI can pass without proving follower-to-leader redirect behavior required by the PR objective.

🔧 Suggested assertion pattern
-    # Verify followers can also return cluster data (Raft replicates state)
-    FOLLOWER_HAS_DATA=true
+    # Verify follower requests are redirected (or explicitly served) in a strict way
+    FOLLOWER_BEHAVIOR_OK=true
     for idx in 0 1 2; do
         if [ "$idx" = "$LEADER_INDEX" ]; then
             continue
         fi
         FPORT="${RAFT_PORTS[$idx]}"
-        # Followers may redirect or serve data directly; either is valid
-        FCLUSTERS=$(curl -sfL --max-time 10 "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
-        FCOUNT=$(echo "$FCLUSTERS" | python3 -c "import json,sys; c=json.load(sys.stdin); print(len(c))" 2>/dev/null || echo "0")
-        if [ "$FCOUNT" -lt 1 ]; then
-            FOLLOWER_HAS_DATA=false
-            echo "  Follower on :${FPORT} returned $FCOUNT clusters"
-        fi
+        HEADERS=$(curl -sS -D - -o /dev/null --max-time 10 "http://localhost:${FPORT}/api/clusters" || true)
+        CODE=$(echo "$HEADERS" | awk 'toupper($1) ~ /^HTTP/ {print $2}' | tail -1)
+        LOCATION=$(echo "$HEADERS" | awk 'tolower($1)=="location:" {print $2}' | tr -d '\r')
+        if [ "$CODE" -ge 300 ] && [ "$CODE" -lt 400 ]; then
+            [ -n "$LOCATION" ] || FOLLOWER_BEHAVIOR_OK=false
+        elif [ "$CODE" != "200" ]; then
+            FOLLOWER_BEHAVIOR_OK=false
+        fi
     done
-    if $FOLLOWER_HAS_DATA; then
-        pass "Followers serve cluster data (Raft state replicated)"
+    if $FOLLOWER_BEHAVIOR_OK; then
+        pass "Follower behavior validated for /api/clusters"
     else
-        # This is not necessarily a failure -- followers may need more time
-        skip "Some followers do not yet serve cluster data (may need more replication time)"
+        fail "Follower /api/clusters behavior did not meet redirect/serve expectations"
     fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` around lines 168 - 188, The test currently
treats missing follower data as a non-fatal skip; change it to enforce the
follower redirect/serve behavior by failing the test when a follower neither
serves cluster data nor issues a valid redirect. In the loop that uses FPORT,
FCLUSTERS and FCOUNT, update the check to attempt detecting an HTTP 3xx redirect
(inspect status header from curl -I or -sS -D -) or a non-empty JSON array; if
neither condition is met set FOLLOWER_HAS_DATA=false and call the test failure
helper (replace skip "..." with fail "Follower on :${FPORT} neither served data
nor redirected to leader") so CI fails when followers do not accept
redirect/serve behavior, and keep the pass "Followers serve cluster data (Raft
state replicated)" when all followers pass.
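The header-parsing half of the suggested assertion can be exercised offline; a sketch using a captured 307 response (the sample headers are fabricated for illustration):

```shell
#!/bin/bash
set -euo pipefail

# Sample headers as a redirecting follower might return them (made up here).
HEADERS=$'HTTP/1.1 307 Temporary Redirect\r\nLocation: http://orchestrator-raft1:3000/api/clusters\r\n\r\n'

# Same extraction logic as the suggested fix: status code and Location header.
CODE=$(echo "$HEADERS" | awk 'toupper($1) ~ /^HTTP/ {print $2}' | tail -1)
LOCATION=$(echo "$HEADERS" | awk 'tolower($1)=="location:" {print $2}' | tr -d '\r')

if [ "$CODE" -ge 300 ] && [ "$CODE" -lt 400 ] && [ -n "$LOCATION" ]; then
    echo "redirect to $LOCATION"
fi
```

The `tail -1` keeps only the final status line, so the check still works when curl follows an intermediate hop and prints several `HTTP/…` lines.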

Comment on lines +299 to +307
    # Verify the restarted node is a follower (not a new leader)
    RSTATE=$(curl -sf --max-time 10 "http://localhost:${RESTARTED_PORT}/api/raft-state" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")
    if [ "$RSTATE" = "Follower" ]; then
        pass "Restarted node is a Follower (stable leader)"
    elif [ "$RSTATE" = "Leader" ]; then
        # Leadership may have shifted -- still valid if all agree
        skip "Restarted node became Leader (leadership may have shifted)"
    else
        fail "Restarted node in unexpected state: $RSTATE"
⚠️ Potential issue | 🟠 Major

Node rejoin check should fail if the restarted node does not settle as follower.

Line 305 currently downgrades a leader state to skip, which weakens the explicit “rejoins as follower” acceptance criterion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` around lines 299 - 307, The test currently
treats a restarted node reporting "Leader" as a skip, weakening the requirement
that a rejoining node must settle as a Follower; update the logic in
tests/functional/test-raft.sh to treat RSTATE == "Leader" as a failure instead
of skipping by changing the branch that checks RSTATE (the code using variables
RESTARTED_PORT and RSTATE and calling pass/skip/fail) so that only "Follower"
results in pass and any other value — including "Leader" — triggers fail with a
clear message like "Restarted node in unexpected state: $RSTATE".
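The strict version of that branch collapses to a single equality check; a sketch of the pass/fail shape the prompt asks for (the `assert_follower` helper is hypothetical, and `pass`/`fail` are stubbed here):

```shell
#!/bin/bash
set -euo pipefail

pass() { echo "PASS: $1"; }
fail() { echo "FAIL: $1" >&2; return 1; }

# Hypothetical strict check: only "Follower" passes; "Leader" or anything
# else (including an empty API response) is a hard failure.
assert_follower() {
    local state="$1"
    if [ "$state" = "Follower" ]; then
        pass "Restarted node is a Follower (stable leader)"
    else
        fail "Restarted node in unexpected state: $state"
    fi
}

assert_follower "Follower"
assert_follower "Leader" || echo "strict check rejected Leader"
```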

@renecannao renecannao merged commit 6b6f645 into master Apr 9, 2026
17 checks passed


Development

Successfully merging this pull request may close these issues.

CI: Raft consensus testing under failure conditions
