
CI: Raft consensus testing under failure conditions #87

Merged
renecannao merged 4 commits into master from issue84-raft-testing on Apr 9, 2026

Conversation

@renecannao

@renecannao renecannao commented Apr 9, 2026

Summary

Adds a 3-node orchestrator Raft cluster to the functional test infrastructure and tests consensus behavior under failure conditions (#84).

Changes

  • Added 3 Raft orchestrator containers to docker-compose.yml (IPs .40-.42, ports 3100-3102)
  • Created Raft config files (orchestrator-raft{1,2,3}.conf.json) with SQLite backend and Raft peer discovery
  • Created test-raft.sh with 4 test phases:
    1. Cluster formation: Start 3 nodes, verify leader elected, all nodes agree
    2. Leader serves topology: Discover MySQL cluster through leader, verify API responses
    3. Leader failure: Stop leader, verify re-election within timeout, new leader serves API
    4. Node rejoin: Restart stopped node, verify it rejoins as follower
  • Added functional-raft CI job (depends on shared build job)
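
For reference, one node's config (e.g. orchestrator-raft1.conf.json) might look roughly like the sketch below. The key names follow orchestrator's documented Raft and SQLite settings; the subnet, port, and file paths are assumptions for illustration, not the PR's actual values.

```json
{
  "Debug": true,
  "ListenAddress": ":3099",
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/tmp/raft1/orchestrator.sqlite3",
  "RaftEnabled": true,
  "RaftDataDir": "/tmp/raft1",
  "RaftBind": "172.30.0.40",
  "RaftNodes": ["172.30.0.40", "172.30.0.41", "172.30.0.42"],
  "HTTPAdvertise": "http://172.30.0.40:3099"
}
```

The other two nodes would differ only in RaftBind, HTTPAdvertise, the data paths, and the host port mapping.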

Depends on #86
Closes #84

Test plan

  • 3-node Raft cluster elects a leader within 90s
  • All nodes agree on the same leader
  • Leader serves MySQL topology data via API
  • Killing leader triggers re-election with new leader
  • Stopped node rejoins as follower after restart
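
The "all nodes agree on the same leader" check can be sketched as a pair of bash helpers. The ports and the /api/raft-leader endpoint match the test-raft.sh excerpts quoted in the reviews; the helper names are illustrative, not the script's actual code.

```shell
#!/bin/bash
# Sketch of the "all nodes agree on the same leader" check. Ports and
# the /api/raft-leader endpoint match test-raft.sh; the helper names
# (get_raft_leader, all_agree) are illustrative, not the script's own.
RAFT_PORTS=(3100 3101 3102)

# Fetch the leader a node reports, or "" if unreachable/unparseable.
get_raft_leader() {
    local port=$1
    curl -sf --max-time 10 "http://localhost:${port}/api/raft-leader" 2>/dev/null \
        | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo ""
}

# Succeed only when every argument is non-empty and identical.
all_agree() {
    local first=$1 v
    [ -n "$first" ] || return 1
    for v in "$@"; do
        [ "$v" = "$first" ] || return 1
    done
}
```

Usage, assuming all three nodes are up: `all_agree "$(get_raft_leader 3100)" "$(get_raft_leader 3101)" "$(get_raft_leader 3102)"` exits 0 only on agreement.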

Summary by CodeRabbit

  • Tests

    • Added comprehensive functional Raft consensus tests validating leader election, topology discovery, leader failover, and node rejoin scenarios across a three-node cluster.
  • Chores

    • Integrated Raft functional testing into CI/CD pipeline with automated log collection and cleanup.
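
The CI wiring might look roughly like the sketch below. Only the job name and its dependency on the shared build job come from the PR description; step names, artifact name, paths, and action versions are assumptions.

```yaml
# Hypothetical shape of the functional-raft job; step names, artifact
# name, and action versions are assumptions, not the actual workflow.
functional-raft:
  runs-on: ubuntu-latest
  needs: build
  steps:
    - uses: actions/checkout@v4
    - name: Download orchestrator binary
      uses: actions/download-artifact@v4
      with:
        name: orchestrator-binary
        path: tests/functional/
    - name: Start 3-node Raft cluster
      run: docker compose -f tests/functional/docker-compose.yml up -d
    - name: Run Raft tests
      run: tests/functional/test-raft.sh
    - name: Collect logs
      if: always()
      run: docker compose -f tests/functional/docker-compose.yml logs > compose-logs.txt
    - name: Cleanup
      if: always()
      run: docker compose -f tests/functional/docker-compose.yml down -v
```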

@coderabbitai

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Added Raft consensus functional testing infrastructure to CI: a new GitHub Actions job that orchestrates a 3-node Raft cluster via Docker Compose, runs topology discovery and failover scenarios, and validates leader election, cluster data serving, and node rejoin operations.

Changes

  • GitHub Actions Workflow (.github/workflows/functional.yml):
    Added new functional-raft job that downloads the orchestrator binary, starts a 3-node Raft cluster via Docker Compose, runs functional test scripts, collects orchestrator and compose logs, and cleans up the environment.
  • Docker Compose Services (tests/functional/docker-compose.yml):
    Introduced three Raft orchestrator services (orchestrator-raft1, orchestrator-raft2, orchestrator-raft3) with distinct static IPs, port mappings, mounted binaries and configs, and health dependencies on MySQL.
  • Raft Configuration Files (tests/functional/orchestrator-raft1.conf.json, tests/functional/orchestrator-raft2.conf.json, tests/functional/orchestrator-raft3.conf.json):
    Added three identical-structure Raft runtime configurations, each with debug logging, SQLite persistence, Raft consensus enabled, distinct HTTP advertise addresses, and a static 3-node cluster node list.
  • Functional Test Script (tests/functional/test-raft.sh):
    Implemented end-to-end Raft test that validates: leader election on startup, leader-driven topology discovery, follower health checks, leader failover and re-election, node rejoin, and final leader consensus verification.
  • Raft Schema Update (go/raft/rel_store.go):
    Removed NOT NULL constraints from the data column in the raft_log table and the store_value column in the raft_store table to allow nullable blob fields.
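
The schema change is easy to reproduce in isolation: hashicorp/raft can emit log entries with nil Data (e.g. LogNoop after an election), which a NOT NULL data column rejects. A minimal sketch using python3's sqlite3 module; only the table and column names come from the PR summary, the column types are assumptions.

```shell
#!/bin/bash
# Reproduce the constraint failure fixed in go/raft/rel_store.go:
# a NOT NULL data column rejects the nil-Data log entries that
# hashicorp/raft produces after an election. Schema is a sketch; only
# the table/column names come from the PR summary.
python3 - <<'EOF'
import sqlite3
db = sqlite3.connect(":memory:")

# Old schema: NOT NULL on data, so inserting a nil/NULL blob fails.
db.execute("CREATE TABLE raft_log_old (log_index INTEGER PRIMARY KEY, data BLOB NOT NULL)")
try:
    db.execute("INSERT INTO raft_log_old (log_index, data) VALUES (1, NULL)")
    print("old schema: insert succeeded")
except sqlite3.IntegrityError as e:
    print("old schema: %s" % e)

# New schema: nullable blob accepts the entry.
db.execute("CREATE TABLE raft_log (log_index INTEGER PRIMARY KEY, data BLOB)")
db.execute("INSERT INTO raft_log (log_index, data) VALUES (1, NULL)")
print("new schema: insert succeeded")
EOF
```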

Sequence Diagram(s)

sequenceDiagram
    participant Test as test-raft.sh
    participant R1 as orchestrator-raft1
    participant R2 as orchestrator-raft2
    participant R3 as orchestrator-raft3
    participant MySQL as mysql1/2/3

    rect rgba(100, 149, 237, 0.5)
      Note over Test,R3: Startup & Leader Election
      Test->>R1: Start container (port 3100)
      Test->>R1: Poll /api/raft-status (wait ~90s)
      Test->>R2: Start container (port 3101)
      Test->>R3: Start container (port 3102)
      Test->>R1: Query /api/raft-leader
      Test->>R2: Query /api/raft-leader
      Test->>R3: Query /api/raft-leader
      Test->>R1: Verify exactly one /api/raft-state == "Leader"
    end

    rect rgba(144, 238, 144, 0.5)
      Note over Test,MySQL: Topology Discovery Phase
      Test->>R1: Identify leader via /api/raft-state
      Test->>R1: Trigger /api/discover/mysql{1,2,3}/3306
      Test->>MySQL: Query MySQL topology
      Test->>R1: Poll /api/clusters (wait ~60s)
      Test->>R2: Query /api/clusters (follower check)
      Test->>R3: Query /api/clusters (follower check)
    end

    rect rgba(255, 182, 193, 0.5)
      Note over Test,R3: Failover & Re-election
      Test->>R1: Stop old leader container
      Test->>R2: Poll /api/raft-leader (wait ~60s)
      Test->>R3: Poll /api/raft-leader (wait ~60s)
      Test->>R2: Verify new leader elected
      Test->>R2: Query /api/clusters
      Test->>R1: Restart old leader container
      Test->>R1: Wait /api/raft-leader matches new leader (~60s)
      Test->>R1: Validate /api/raft-state == "Follower"
    end

    rect rgba(255, 215, 0, 0.5)
      Note over Test,R3: Final Consensus Verification
      Test->>R1: Query /api/raft-leader
      Test->>R2: Query /api/raft-leader
      Test->>R3: Query /api/raft-leader
      Test->>R1: Query /api/raft-health
      Test->>R2: Query /api/raft-health
      Test->>R3: Query /api/raft-health
      Test->>Test: Verify all nodes agree on leader
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Three nodes dance the Raft consensus jig,
With leader elected, the cluster grows big!
Through failover storms and rejoins so bright,
These tests keep our Raft standing upright! 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Out of Scope Changes check: ❓ Inconclusive
    Explanation: The minor schema change in go/raft/rel_store.go (removing NOT NULL constraints) appears tangentially related to Raft testing but is not directly justified by the linked issue requirements.
    Resolution: Clarify whether the NOT NULL constraint removal in rel_store.go is necessary for the Raft tests or if it should be in a separate PR addressing schema requirements.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly and specifically describes the main objective: adding CI tests for Raft consensus behavior under failure conditions, which aligns with the primary changeset (adding Raft cluster setup, test scripts, and CI job).
  • Linked Issues check: ✅ Passed. All coding requirements from #84 are met: 3-node Raft cluster setup, leader election verification, failover testing, topology operations validation, and CI integration via the functional-raft job.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces functional tests for Raft consensus by adding three orchestrator nodes to the Docker Compose environment along with their respective configuration files. A new test script, test-raft.sh, is included to verify leader election, topology discovery, failover, and node rejoining. Review feedback recommends improving the Docker setup by using pre-built images instead of installing dependencies at runtime and refactoring the test script to use helper functions for repeated API interactions and JSON parsing to enhance maintainability.

- ./orchestrator-raft1.conf.json:/orchestrator/orchestrator.conf.json:ro
command: >
bash -c "
apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 &&

medium

Installing dependencies like curl and sqlite3 at runtime via apt-get update in the container's command is inefficient and can lead to flaky tests if external repositories are temporarily unavailable. It is better to use a custom Dockerfile or a pre-built image that already contains these tools.
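
One way to act on this suggestion is a small test image built once and reused by all three services. The base image, package set, and working directory below are assumptions, not taken from this PR.

```dockerfile
# Hypothetical test image with the tools preinstalled, so the compose
# command no longer needs apt-get at container start. Base image and
# paths are assumptions, not taken from this PR.
FROM debian:bookworm-slim
RUN apt-get update -qq \
 && apt-get install -y -qq --no-install-recommends curl sqlite3 ca-certificates \
 && rm -rf /var/lib/apt/lists/*
WORKDIR /orchestrator
CMD ["bash"]
```

Each orchestrator-raftN service would then set `build:` (or a prebuilt `image:`) and keep only the mkdir and orchestrator invocation in its command.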

Comment on lines +35 to +37
LEADER1=$(curl -sf "http://localhost:3100/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")
LEADER2=$(curl -sf "http://localhost:3101/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")
LEADER3=$(curl -sf "http://localhost:3102/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")

medium

The logic for fetching and parsing the Raft leader from a node is repeated multiple times throughout the script. Consider defining a helper function to reduce duplication and improve maintainability.

Example:

get_raft_leader() {
    local port=$1
    curl -sf "http://localhost:${port}/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo ""
}

# Verify exactly one node reports itself as Leader state
LEADER_COUNT=0
for port in "${RAFT_PORTS[@]}"; do
STATE=$(curl -sf "http://localhost:${port}/api/raft-state" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")

medium

Similar to the leader fetching logic, the Raft state retrieval is also repeated. A helper function would be beneficial here as well.

CLUSTER_FOUND=false
for i in $(seq 1 60); do
CLUSTERS=$(curl -sf "http://localhost:${LEADER_PORT}/api/clusters" 2>/dev/null || echo "[]")
COUNT=$(echo "$CLUSTERS" | python3 -c "import json,sys; c=json.load(sys.stdin); print(len(c))" 2>/dev/null || echo "0")

medium

The JSON parsing logic using python3 is repeated. While effective, it adds verbosity to the script. Consider a helper function for JSON extraction or using jq if available in the test environment.
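
Both ideas can be folded into a single helper that prefers jq when installed and falls back to python3 otherwise. The function names and interface here are illustrative, not code from the PR.

```shell
#!/bin/bash
# Illustrative helpers combining the two suggestions: one place for the
# curl + JSON-scalar pattern, using jq when present, python3 otherwise.

# Read a JSON document on stdin and print it as a bare scalar.
json_scalar() {
    if command -v jq >/dev/null 2>&1; then
        jq -r '.'
    else
        python3 -c "import json,sys; print(json.load(sys.stdin))"
    fi
}

# Fetch an API endpoint from a node and print its JSON body as a
# scalar, or "" on any failure.
raft_api() {
    local port=$1 endpoint=$2
    curl -sf --max-time 10 "http://localhost:${port}/api/${endpoint}" 2>/dev/null \
        | json_scalar 2>/dev/null || echo ""
}
```

Call sites then shrink to the form `LEADER1=$(raft_api 3100 raft-leader)`.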

@renecannao renecannao force-pushed the issue82-83-failover-pg-matrix branch from cfab30d to a6c5739 Compare April 9, 2026 09:34
@renecannao renecannao force-pushed the issue84-raft-testing branch from 197e86f to 81793d4 Compare April 9, 2026 09:38
@renecannao renecannao changed the base branch from issue82-83-failover-pg-matrix to master April 9, 2026 10:03

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/functional/test-failover-advanced.sh`:
- Line 4: The cd invocation cd "$(dirname "$0")/../.." can fail and leave the
script running in the wrong directory; update that line to handle errors by
checking its exit status and exiting on failure (e.g., change it to perform the
cd and then do || { echo "Failed to change directory"; exit 1; } or otherwise
abort on non-zero status) so the script does not continue when the directory
change fails.

In `@tests/functional/test-raft.sh`:
- Line 4: The script currently runs cd "$(dirname "$0")/../.." without checking
its success; update the invocation so the script aborts on failure (e.g.,
replace with a guarded form like if ! cd "$(dirname "$0")/../.."; then echo
"Failed to change directory to $(dirname "$0")/../.."; exit 1; fi or append || {
echo "Failed to cd"; exit 1; } ), ensuring the working directory is validated
before proceeding.
- Line 149: The curl invocation that sets FCLUSTERS uses a malformed option
`--max-time 10L`; update the command in the FCLUSTERS assignment so the timeout
and follow-redirect flags are separate (e.g., `--max-time 10 -L` or just
`--max-time 10` if you don't need redirects) to remove the stray `L` and ensure
redirects are followed when desired; modify the line that defines FCLUSTERS
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: bdbc12a7-24a2-4916-beb5-6b7bdcb53ca7

📥 Commits

Reviewing files that changed from the base of the PR and between 5b13080 and 730d7d3.

📒 Files selected for processing (8)
  • .github/workflows/functional.yml
  • tests/functional/docker-compose.yml
  • tests/functional/lib.sh
  • tests/functional/orchestrator-raft1.conf.json
  • tests/functional/orchestrator-raft2.conf.json
  • tests/functional/orchestrator-raft3.conf.json
  • tests/functional/test-failover-advanced.sh
  • tests/functional/test-raft.sh

#!/bin/bash
# Tier B+: Advanced failover tests — intermediate master, errant GTID, co-master
set -uo pipefail # no -e: we handle failures ourselves
cd "$(dirname "$0")/../.."

⚠️ Potential issue | 🟡 Minor

Add error handling for cd command.

Same issue as in test-raft.sh - if the cd fails, the script continues from an unexpected location.

🛡️ Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


#!/bin/bash
# Raft consensus tests -- verify leader election, failover, and follower redirect
set -uo pipefail
cd "$(dirname "$0")/../.."

⚠️ Potential issue | 🟡 Minor

Add error handling for cd command.

If the cd fails (e.g., directory doesn't exist), the script will continue from an unexpected location, which could cause failures or incorrect behavior.

🛡️ Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


fi
FPORT="${RAFT_PORTS[$idx]}"
# Followers may redirect or serve data directly; either is valid
FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")

⚠️ Potential issue | 🟡 Minor

Typo: 10L should be 10 -L or just 10.

The curl option --max-time 10L appears to have an errant L. If the intent is to follow redirects (which makes sense for followers that may redirect to the leader), use -L as a separate flag.

🔧 Proposed fix
-        FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
+        FCLUSTERS=$(curl -sfL --max-time 10 "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
tests/functional/test-raft.sh (3)

79-81: Consider quoting the URL to satisfy shellcheck.

The URL in the curl command is technically unquoted. While safe here since port comes from a hardcoded array, quoting prevents potential issues if the code is modified later.

🔧 Proposed fix
-        echo "  Node :${port} raft-status: $(curl -sf --max-time 10 http://localhost:${port}/api/raft-status 2>/dev/null || echo 'unreachable')"
+        echo "  Node :${port} raft-status: $(curl -sf --max-time 10 "http://localhost:${port}/api/raft-status" 2>/dev/null || echo 'unreachable')"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` around lines 79 - 81, The for-loop iterating
RAFT_PORTS uses an unquoted URL in the curl call; update the curl invocation
inside the loop so the URL and the port variable are quoted (e.g.,
"http://localhost:${port}/api/raft-status") to satisfy shellcheck and prevent
future word-splitting or globbing issues when using RAFT_PORTS and the curl
command.

237-239: Consider quoting the URL for consistency.

Same shellcheck suggestion as line 80 - quote the URL to prevent potential word splitting.

🔧 Proposed fix
-            echo "  Node :${port} status: $(curl -sf --max-time 10 http://localhost:${port}/api/raft-status 2>/dev/null || echo 'unreachable')"
+            echo "  Node :${port} status: $(curl -sf --max-time 10 "http://localhost:${port}/api/raft-status" 2>/dev/null || echo 'unreachable')"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` around lines 237 - 239, Loop uses an unquoted
URL in the curl call which can lead to word-splitting; update the curl
invocation inside the for loop that iterates REMAINING_PORTS to quote the
interpolated URL (use "$port" within the URL string) so the argument passed to
curl is a single token (e.g., curl -sf --max-time 10
"http://localhost:${port}/api/raft-status" ...).

62-64: Consider extracting JSON parsing to a helper function.

The pattern curl ... | python3 -c "import json,sys; print(json.load(sys.stdin))" is repeated throughout the script (lines 62-64, 87, 102, 123, 147, 177, 223-224, 253, 261, 284, 300, 314). Extracting this to a helper would reduce duplication and make the script easier to maintain.

♻️ Proposed helper function

Add near the top of the script after sourcing lib.sh:

# Helper to fetch JSON endpoint and extract scalar value
raft_api() {
    local port=$1
    local endpoint=$2
    curl -sf --max-time 10 "http://localhost:${port}/api/${endpoint}" 2>/dev/null \
        | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo ""
}

Then usage becomes:

-        LEADER1=$(curl -sf --max-time 10 "http://localhost:3100/api/raft-leader" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")
+        LEADER1=$(raft_api 3100 raft-leader)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` around lines 62 - 64, The script repeats the
curl+python JSON-extraction pipeline for endpoints (used to set variables like
LEADER1/LEADER2/LEADER3 and other checks); extract that into a small helper
function (e.g., raft_api) defined near the top after sourcing lib.sh that
accepts a port and an endpoint (port, endpoint) and returns the scalar JSON
value or empty string on failure, then replace each occurrence of the inline
pipeline (curl ... | python3 -c "import json,sys; print(json.load(sys.stdin))"
...) with calls like raft_api 3100 raft-leader (or raft_api "$port"
"some-endpoint") for LEADER*, the checks at lines referenced, and any other
occurrences to remove duplication and centralize error handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8d356ac6-1896-44fb-90f6-2dd3b42449bc

📥 Commits

Reviewing files that changed from the base of the PR and between 730d7d3 and 71d1b7f.

📒 Files selected for processing (2)
  • go/raft/rel_store.go
  • tests/functional/test-raft.sh
✅ Files skipped from review due to trivial changes (1)
  • go/raft/rel_store.go

@renecannao renecannao force-pushed the issue84-raft-testing branch from 71d1b7f to f7d24c2 Compare April 9, 2026 13:27

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tests/functional/docker-compose.yml (1)

208-231: Consider adding a healthcheck for Raft orchestrator services.

Unlike the existing orchestrator service (line 165-170), the Raft services lack healthchecks. While the test script handles bootstrap ordering manually, adding healthchecks would improve CI reliability and enable depends_on with condition: service_healthy between nodes or for downstream steps.

♻️ Optional: Add healthcheck to each Raft service
     ports:
       - "3100:3099"
     depends_on:
       mysql1:
         condition: service_healthy
+    healthcheck:
+      test: ["CMD", "curl", "-sf", "http://localhost:3099/api/raft-status"]
+      interval: 5s
+      timeout: 3s
+      retries: 60
+      start_period: 30s
     networks:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/docker-compose.yml` around lines 208 - 231, The Raft node
services (e.g., orchestrator-raft1) are missing Docker healthchecks; add a
healthcheck block to each Raft orchestrator service (orchestrator-raft1 and the
other orchestrator-raftN services) similar to the non‑Raft orchestrator service:
include a test that hits the internal HTTP port (3099) such as ["CMD", "curl",
"-f", "http://localhost:3099/health"] (or equivalent endpoint), and set sensible
interval/timeout/retries/start_period values (for example interval: 10s,
timeout: 5s, retries: 5, start_period: 5s) so depends_on can use condition:
service_healthy and CI ordering becomes reliable.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3f8db2b3-e6dd-4122-8511-304330439e88

📥 Commits

Reviewing files that changed from the base of the PR and between 71d1b7f and f7d24c2.

📒 Files selected for processing (8)
  • .github/workflows/functional.yml
  • go/raft/rel_store.go
  • tests/functional/docker-compose.yml
  • tests/functional/orchestrator-raft1.conf.json
  • tests/functional/orchestrator-raft2.conf.json
  • tests/functional/orchestrator-raft3.conf.json
  • tests/functional/test-failover-advanced.sh
  • tests/functional/test-raft.sh
✅ Files skipped from review due to trivial changes (4)
  • go/raft/rel_store.go
  • tests/functional/orchestrator-raft3.conf.json
  • tests/functional/orchestrator-raft1.conf.json
  • tests/functional/orchestrator-raft2.conf.json

- Add 3-node orchestrator Raft cluster to docker-compose.yml
- Add Raft config files for each node (orchestrator-raft{1,2,3}.conf.json)
- Add test-raft.sh with 4 test phases: cluster formation & leader
  election, leader serves topology, leader failure & re-election,
  node rejoin
- Add functional-raft CI job in GitHub Actions workflow

All curl calls lacked --max-time, causing potential test hangs if any API endpoint was slow to respond.

Starting all 3 Raft nodes simultaneously causes each to call BootstrapCluster independently, creating conflicting initial states and perpetual election cycles. Fix by starting node 1 first, waiting for it to be ready, then starting nodes 2 and 3.
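
The staggered bootstrap described in that commit message could be sketched as follows. Service names follow docker-compose.yml; the wait_for and staggered_start helpers are illustrative, not code from the PR.

```shell
#!/bin/bash
# Sketch of the staggered bootstrap: start node 1 alone, wait until
# its API answers, then start nodes 2 and 3 so only one node ever
# bootstraps the cluster. Service names follow docker-compose.yml;
# wait_for/staggered_start are illustrative, not code from the PR.

# Retry a command up to $1 times, sleeping 1s between failed attempts.
wait_for() {
    local tries=$1 i
    shift
    for ((i = 1; i <= tries; i++)); do
        "$@" && return 0
        sleep 1
    done
    return 1
}

staggered_start() {
    docker compose up -d orchestrator-raft1 || return 1
    wait_for 90 curl -sf --max-time 5 "http://localhost:3100/api/raft-status" \
        >/dev/null || { echo "node 1 never became ready" >&2; return 1; }
    docker compose up -d orchestrator-raft2 orchestrator-raft3
}
```
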

The raft_log.data and raft_store.store_value columns were NOT NULL, but hashicorp/raft can produce log entries with nil Data (e.g. LogNoop when a new leader takes over). This caused "NOT NULL constraint failed: raft_log.data" errors immediately after election, making the leader step down and creating perpetual election cycles.
@renecannao renecannao force-pushed the issue84-raft-testing branch from d8e4de2 to 2ea8a25 Compare April 9, 2026 14:17

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
tests/functional/test-raft.sh (2)

4-4: ⚠️ Potential issue | 🟡 Minor

Guard cd failure on Line 4.

If directory change fails, subsequent docker/paths are evaluated from an undefined working directory.

🔧 Suggested fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` at line 4, The script runs cd "$(dirname
"$0")/../.." without checking failure; modify the tests/functional/test-raft.sh
startup to guard that cd (e.g., change the line to cd "$(dirname "$0")/../.." ||
{ echo "Failed to change directory"; exit 1; } or add set -e at top) so that if
the cd command in the script fails the script aborts and subsequent docker/path
operations are not run from an undefined working directory.

176-176: ⚠️ Potential issue | 🟠 Major

Fix malformed curl flags on Line 176; current check is unreliable.

--max-time 10L is malformed and can make the follower check always fall back to [], masking real behavior.

🔧 Suggested fix
-        FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
+        FCLUSTERS=$(curl -sfL --max-time 10 "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
#!/bin/bash
# Verify malformed curl usage and confirm expected curl flags exist.
rg -n -- '--max-time[[:space:]]+[0-9]+L' tests/functional/test-raft.sh
curl --help all | rg -n -- '--max-time|-L'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` at line 176, The curl invocation that sets
FCLUSTERS is using a malformed flag `--max-time 10L` which causes curl to fail
and fallback to `[]`; update the curl call that assigns FCLUSTERS to use a valid
timeout (e.g., `--max-time 10` or an appropriate numeric value) and ensure other
flags (`-s -f`) remain correct so the follower check behaves reliably; locate
the FCLUSTERS assignment line with the `curl ...
"http://localhost:${FPORT}/api/clusters"` invocation and replace the `10L`
timeout token with a valid numeric timeout value.
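Typos of the `10L` kind can be caught before curl ever runs by validating arguments in a wrapper; a sketch (the `api_get` helper is hypothetical, not part of the test suite):

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical wrapper: refuse non-numeric timeouts so typos like "10L"
# fail fast instead of silently breaking every curl call.
api_get() {
    local timeout="$1" url="$2"
    case "$timeout" in
        ''|*[!0-9]*) echo "api_get: invalid timeout '$timeout'" >&2; return 2 ;;
    esac
    curl -sfL --max-time "$timeout" "$url"
}

# A bad timeout is rejected before curl ever runs.
api_get 10L "http://localhost:3100/api/clusters" || echo "rejected"
```

Centralizing the curl flags in one place also keeps `-s`, `-f`, and `-L` consistent across all the API checks in the script.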
🧹 Nitpick comments (1)
tests/functional/docker-compose.yml (1)

215-221: Reset per-node Raft state at container boot to reduce flaky reruns.

Right now each node only runs mkdir -p /tmp/raft*. If an existing stopped container is reused, stale Raft/SQLite state can leak across runs and affect election behavior.

♻️ Suggested hardening
   command: >
     bash -c "
       apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 &&
       mkdir -p /tmp/raft1 &&
+      rm -rf /tmp/raft1/* &&
       cd /orchestrator &&
       orchestrator -config orchestrator.conf.json http
     "
@@
   command: >
     bash -c "
       apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 &&
       mkdir -p /tmp/raft2 &&
+      rm -rf /tmp/raft2/* &&
       cd /orchestrator &&
       orchestrator -config orchestrator.conf.json http
     "
@@
   command: >
     bash -c "
       apt-get update -qq && apt-get install -y -qq curl sqlite3 > /dev/null 2>&1 &&
       mkdir -p /tmp/raft3 &&
+      rm -rf /tmp/raft3/* &&
       cd /orchestrator &&
       orchestrator -config orchestrator.conf.json http
     "

Also applies to: 240-246, 265-271

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/docker-compose.yml` around lines 215 - 221, The container
start command only runs "mkdir -p /tmp/raft1" which leaves stale Raft/SQLite
state if a stopped container is reused; change the startup shell in the
orchestrator command to remove any existing Raft directory before recreating it
(for example use "rm -rf /tmp/raft1 && mkdir -p /tmp/raft1") so old state is
wiped on boot, and apply the same change to the other node startup commands that
create /tmp/raft2 and /tmp/raft3 (i.e., replace their "mkdir -p /tmp/raftN" with
"rm -rf /tmp/raftN && mkdir -p /tmp/raftN").
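The same wipe-then-recreate pattern can be factored into one helper if the compose file ever grows more nodes; a sketch (the `reset_raft_dir` function is hypothetical, not in the current compose file):

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical helper: wipe and recreate a node's Raft state directory so a
# reused container always boots from a clean slate.
reset_raft_dir() {
    local dir="$1"
    rm -rf "$dir"
    mkdir -p "$dir"
}

for n in 1 2 3; do
    reset_raft_dir "/tmp/raft$n"
done
ls -d /tmp/raft1 /tmp/raft2 /tmp/raft3
```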

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a6f07ef4-bea0-4cda-90b1-5f440a39d31d

📥 Commits

Reviewing files that changed from the base of the PR and between f7d24c2 and 2ea8a25.

📒 Files selected for processing (7)
  • .github/workflows/functional.yml
  • go/raft/rel_store.go
  • tests/functional/docker-compose.yml
  • tests/functional/orchestrator-raft1.conf.json
  • tests/functional/orchestrator-raft2.conf.json
  • tests/functional/orchestrator-raft3.conf.json
  • tests/functional/test-raft.sh
✅ Files skipped from review due to trivial changes (4)
  • go/raft/rel_store.go
  • tests/functional/orchestrator-raft3.conf.json
  • tests/functional/orchestrator-raft1.conf.json
  • tests/functional/orchestrator-raft2.conf.json

Comment on lines +168 to +188
    # Verify followers can also return cluster data (Raft replicates state)
    FOLLOWER_HAS_DATA=true
    for idx in 0 1 2; do
        if [ "$idx" = "$LEADER_INDEX" ]; then
            continue
        fi
        FPORT="${RAFT_PORTS[$idx]}"
        # Followers may redirect or serve data directly; either is valid
        FCLUSTERS=$(curl -sf --max-time 10L "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
        FCOUNT=$(echo "$FCLUSTERS" | python3 -c "import json,sys; c=json.load(sys.stdin); print(len(c))" 2>/dev/null || echo "0")
        if [ "$FCOUNT" -lt 1 ]; then
            FOLLOWER_HAS_DATA=false
            echo "  Follower on :${FPORT} returned $FCOUNT clusters"
        fi
    done
    if $FOLLOWER_HAS_DATA; then
        pass "Followers serve cluster data (Raft state replicated)"
    else
        # This is not necessarily a failure -- followers may need more time
        skip "Some followers do not yet serve cluster data (may need more replication time)"
    fi

⚠️ Potential issue | 🟠 Major

Follower redirect acceptance criterion is not enforced.

This block treats missing follower data as skip, so CI can pass without proving follower-to-leader redirect behavior required by the PR objective.

🔧 Suggested assertion pattern
-    # Verify followers can also return cluster data (Raft replicates state)
-    FOLLOWER_HAS_DATA=true
+    # Verify follower requests are redirected (or explicitly served) in a strict way
+    FOLLOWER_BEHAVIOR_OK=true
     for idx in 0 1 2; do
         if [ "$idx" = "$LEADER_INDEX" ]; then
             continue
         fi
         FPORT="${RAFT_PORTS[$idx]}"
-        # Followers may redirect or serve data directly; either is valid
-        FCLUSTERS=$(curl -sfL --max-time 10 "http://localhost:${FPORT}/api/clusters" 2>/dev/null || echo "[]")
-        FCOUNT=$(echo "$FCLUSTERS" | python3 -c "import json,sys; c=json.load(sys.stdin); print(len(c))" 2>/dev/null || echo "0")
-        if [ "$FCOUNT" -lt 1 ]; then
-            FOLLOWER_HAS_DATA=false
-            echo "  Follower on :${FPORT} returned $FCOUNT clusters"
-        fi
+        HEADERS=$(curl -sS -D - -o /dev/null --max-time 10 "http://localhost:${FPORT}/api/clusters" || true)
+        CODE=$(echo "$HEADERS" | awk 'toupper($1) ~ /^HTTP/ {print $2}' | tail -1)
+        LOCATION=$(echo "$HEADERS" | awk 'tolower($1)=="location:" {print $2}' | tr -d '\r')
+        if [ "$CODE" -ge 300 ] && [ "$CODE" -lt 400 ]; then
+            [ -n "$LOCATION" ] || FOLLOWER_BEHAVIOR_OK=false
+        elif [ "$CODE" != "200" ]; then
+            FOLLOWER_BEHAVIOR_OK=false
+        fi
     done
-    if $FOLLOWER_HAS_DATA; then
-        pass "Followers serve cluster data (Raft state replicated)"
+    if $FOLLOWER_BEHAVIOR_OK; then
+        pass "Follower behavior validated for /api/clusters"
     else
-        # This is not necessarily a failure -- followers may need more time
-        skip "Some followers do not yet serve cluster data (may need more replication time)"
+        fail "Follower /api/clusters behavior did not meet redirect/serve expectations"
     fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` around lines 168 - 188, The test currently
treats missing follower data as a non-fatal skip; change it to enforce the
follower redirect/serve behavior by failing the test when a follower neither
serves cluster data nor issues a valid redirect. In the loop that uses FPORT,
FCLUSTERS and FCOUNT, update the check to attempt detecting an HTTP 3xx redirect
(inspect status header from curl -I or -sS -D -) or a non-empty JSON array; if
neither condition is met set FOLLOWER_HAS_DATA=false and call the test failure
helper (replace skip "..." with fail "Follower on :${FPORT} neither served data
nor redirected to leader") so CI fails when followers do not accept
redirect/serve behavior, and keep the pass "Followers serve cluster data (Raft
state replicated)" when all followers pass.
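The header-parsing half of the suggested assertion can be exercised offline; a sketch using a captured 307 response (the sample headers are fabricated for illustration):

```shell
#!/bin/bash
set -euo pipefail

# Sample headers as a redirecting follower might return them (made up here).
HEADERS=$'HTTP/1.1 307 Temporary Redirect\r\nLocation: http://orchestrator-raft1:3000/api/clusters\r\n\r\n'

# Same extraction logic as the suggested fix: status code and Location header.
CODE=$(echo "$HEADERS" | awk 'toupper($1) ~ /^HTTP/ {print $2}' | tail -1)
LOCATION=$(echo "$HEADERS" | awk 'tolower($1)=="location:" {print $2}' | tr -d '\r')

if [ "$CODE" -ge 300 ] && [ "$CODE" -lt 400 ] && [ -n "$LOCATION" ]; then
    echo "redirect to $LOCATION"
fi
```

The `tail -1` keeps only the final status line, so the check still works when curl follows an intermediate hop and prints several `HTTP/…` lines.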

Comment on lines +299 to +307
    # Verify the restarted node is a follower (not a new leader)
    RSTATE=$(curl -sf --max-time 10 "http://localhost:${RESTARTED_PORT}/api/raft-state" 2>/dev/null | python3 -c "import json,sys; print(json.load(sys.stdin))" 2>/dev/null || echo "")
    if [ "$RSTATE" = "Follower" ]; then
        pass "Restarted node is a Follower (stable leader)"
    elif [ "$RSTATE" = "Leader" ]; then
        # Leadership may have shifted -- still valid if all agree
        skip "Restarted node became Leader (leadership may have shifted)"
    else
        fail "Restarted node in unexpected state: $RSTATE"
⚠️ Potential issue | 🟠 Major

Node rejoin check should fail if the restarted node does not settle as follower.

Line 305 currently downgrades a leader state to skip, which weakens the explicit “rejoins as follower” acceptance criterion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-raft.sh` around lines 299 - 307, The test currently
treats a restarted node reporting "Leader" as a skip, weakening the requirement
that a rejoining node must settle as a Follower; update the logic in
tests/functional/test-raft.sh to treat RSTATE == "Leader" as a failure instead
of skipping by changing the branch that checks RSTATE (the code using variables
RESTARTED_PORT and RSTATE and calling pass/skip/fail) so that only "Follower"
results in pass and any other value — including "Leader" — triggers fail with a
clear message like "Restarted node in unexpected state: $RSTATE".
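The strict version of that branch collapses to a single equality check; a sketch of the pass/fail shape the prompt asks for (the `assert_follower` helper is hypothetical, and `pass`/`fail` are stubbed here):

```shell
#!/bin/bash
set -euo pipefail

pass() { echo "PASS: $1"; }
fail() { echo "FAIL: $1" >&2; return 1; }

# Hypothetical strict check: only "Follower" passes; "Leader" or anything
# else (including an empty API response) is a hard failure.
assert_follower() {
    local state="$1"
    if [ "$state" = "Follower" ]; then
        pass "Restarted node is a Follower (stable leader)"
    else
        fail "Restarted node in unexpected state: $state"
    fi
}

assert_follower "Follower"
assert_follower "Leader" || echo "strict check rejected Leader"
```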

@renecannao renecannao merged commit 6b6f645 into master Apr 9, 2026
17 checks passed


Development

Successfully merging this pull request may close these issues.

CI: Raft consensus testing under failure conditions
