
Add functional test CI: Docker MySQL topology + ProxySQL + failover tests#51

Merged
renecannao merged 11 commits into master from functional-tests-ci
Mar 24, 2026

Conversation

@renecannao renecannao commented Mar 24, 2026

Closes #46, closes #47, closes #48, closes #49, closes #50

Summary

Every manual test from the QA session is now automated and runs in GitHub Actions.

Infrastructure (tests/functional/docker-compose.yml)

  • MySQL 8.4 master + 2 replicas with GTID replication
  • ProxySQL with writer (HG 10) and reader (HG 20) hostgroups
  • Auto-configured: replication, orchestrator user, ProxySQL remote admin

Test Suites

Tier A — Smoke (test-smoke.sh): 25+ checks

  • Discovery (cluster + instances)
  • Web UI + static files
  • API v1 endpoints
  • API v2 endpoints with structured responses
  • Prometheus /metrics
  • Health endpoints (/health/live, /health/ready, /health/leader)
  • ProxySQL CLI (proxysql-test, proxysql-servers)
  • ProxySQL API (/api/proxysql/servers)

Tier B — Failover (test-failover.sh): 12+ checks

  • Graceful master takeover with ProxySQL hook validation
  • Hard failover: docker stop mysql1, verify detection + promotion + ProxySQL update
  • Recovery audit via /api/v2/recoveries
  • Topology restore between tests

Tier C — Regression (test-regression.sh): 22+ checks

  • Chi router v1 API regression
  • API v2 response format and error codes
  • Prometheus metric presence and format
  • Health endpoints
  • Web UI and static file serving

GitHub Actions (functional.yml)

  • Triggers: push to master, PRs, manual dispatch
  • Starts Docker infrastructure, waits for health, verifies replication
  • Runs all 3 test suites sequentially
  • Uploads logs as artifacts on failure
  • 15 minute timeout

Test plan

  • CI runs the functional workflow
  • All 3 test suites pass (smoke, failover, regression)

Summary by CodeRabbit

  • Tests

    • Added a full functional test suite (smoke, regression, failover) with shared test helpers, end-to-end topology and replication checks, ProxySQL and Prometheus validations, web/API/UI checks, and a consolidated pass/fail reporter.
  • Chores

    • Added a CI workflow to run these functional tests (push/PR/manual), provision test infrastructure, run sequential suites, collect logs/artifacts on success/failure, and perform automated cleanup.

Comprehensive functional tests running against real MySQL topology + ProxySQL:

Infrastructure (docker-compose.yml):
- MySQL 8.4 master + 2 replicas with GTID replication
- ProxySQL with writer (HG 10) and reader (HG 20) hostgroups
- Auto-configured replication and orchestrator user

Test suites:
- test-smoke.sh (Tier A): Discovery, API v1/v2, Prometheus, health
  endpoints, ProxySQL CLI/API, web UI, static files (25+ checks)
- test-failover.sh (Tier B): Graceful master takeover with ProxySQL
  hook validation, hard failover (kill master), recovery audit (12+ checks)
- test-regression.sh (Tier C): Full chi router regression, API v2
  response format, Prometheus metrics, health endpoints (22+ checks)

GitHub Actions workflow (functional.yml):
- Triggers on push to master, PRs, and manual dispatch
- Starts Docker infrastructure, builds orchestrator, runs all 3 suites
- Uploads orchestrator and docker logs as artifacts on failure
- 15 minute timeout

Closes #46, closes #47, closes #48, closes #49, closes #50
Copilot AI review requested due to automatic review settings March 24, 2026 09:02

coderabbitai bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a functional test harness: Docker Compose topology (MySQL master + 2 replicas + ProxySQL), initialization and replication setup, orchestrator test config and binary build, three test suites (smoke, regression, failover), shared helpers, a runner and GitHub Actions workflow that boots the stack, runs tests, collects logs/artifacts, and always tears down.

Changes

  • CI Workflow (.github/workflows/functional.yml): New GitHub Actions workflow "Functional Tests": builds the Go orchestrator, boots docker-compose, waits for service health, runs replication setup, launches orchestrator, executes the test suites, uploads logs/artifacts, and always tears down.
  • Compose & ProxySQL config (tests/functional/docker-compose.yml, tests/functional/proxysql/proxysql.cnf): Adds a docker-compose environment with a network, 3 MySQL services and ProxySQL, healthchecks, hostgroup/server mapping, and ProxySQL admin config.
  • MySQL configs (tests/functional/mysql/master.cnf, tests/functional/mysql/replica.cnf, tests/functional/mysql/replica2.cnf): Server configs for the master and two replicas (GTID, binlog, server-id, replica read-only).
  • MySQL init SQL (tests/functional/mysql/init-master.sql, tests/functional/mysql/init-replica.sql): Initialization SQL to create the orchestrator and repl users and grant appropriate privileges.
  • Replication setup (tests/functional/setup-replication.sh): Script to configure GTID-based replication for the replicas, with retries and verification loops.
  • Test runner & helpers (tests/functional/lib.sh, tests/functional/run-all.sh): Shared test library (counters, HTTP helpers, discovery/wait helpers, docker inspection) and a runner that builds, starts infra, verifies replication, runs suites, collects logs, and cleans up.
  • Test suites (tests/functional/test-smoke.sh, tests/functional/test-regression.sh, tests/functional/test-failover.sh): Three executable suites: smoke (basic endpoints, discovery, CLI), regression (API v1/v2, metrics, health, UI), and failover (graceful takeover and hard failover flows with ProxySQL and recovery API checks).
  • Orchestrator test config (tests/functional/orchestrator-test.conf.json): Test-specific orchestrator JSON: debug, :3099, SQLite backend, MySQL creds, ProxySQL admin settings, hostgroup mappings, recovery filters, Prometheus enabled.
  • Runtime artifacts (/tmp/orchestrator-test.log, /tmp/orchestrator-test.pid, created at runtime): Workflows and scripts produce log and PID files and upload them as artifacts on completion or failure.
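The retry-and-verify pattern attributed to setup-replication.sh above could be sketched as follows (the `retry` helper and its parameters are illustrative, not taken from the actual script):

```shell
#!/usr/bin/env bash
# Illustrative retry helper: rerun a command until it succeeds or the
# attempt budget is exhausted.
# Usage: retry <max_attempts> <delay_seconds> <command...>
retry() {
  local max="$1" delay="$2"; shift 2
  local attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "Giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# In the real script the wrapped command would be a replication probe,
# e.g. a `docker compose exec mysql2 mysql ... "SHOW REPLICA STATUS\G"` call.
retry 30 2 true && echo "replication configured"
```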

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions
    participant DC as Docker Compose (mysql1,mysql2,mysql3,proxysql)
    participant SETUP as setup-replication.sh
    participant ORC as Orchestrator (bin/orchestrator)
    participant TEST as Test Suites (smoke/regression/failover)
    participant ART as Artifact Uploader

    GH->>DC: docker compose up -d (tests/functional/docker-compose.yml)
    DC-->>GH: services start & healthchecks
    GH->>SETUP: run tests/functional/setup-replication.sh
    SETUP->>DC: exec MySQL CHANGE REPLICATION / START REPLICA
    SETUP-->>GH: replication verified
    GH->>ORC: build & start orchestrator (orchestrator-test.conf.json)
    ORC-->>GH: HTTP API ready (/api/clusters)
    GH->>TEST: run test-smoke.sh -> test-regression.sh -> test-failover.sh
    TEST->>ORC: HTTP API calls (discovery, recovery endpoints)
    TEST->>DC: docker compose exec for ProxySQL/MySQL inspection
    TEST-->>GH: PASS/FAIL results
    GH->>ORC: collect /tmp/orchestrator-test.log
    GH->>DC: docker compose down -v --remove-orphans
    GH->>ART: upload logs/artifacts

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Add built-in ProxySQL failover hooks #36 — Adds end-to-end functional tests, docker-compose and ProxySQL config, and test scripts that exercise the same ProxySQL integration and CLI commands used here.

Poem

🐰 I hopped into compose and nudged each pod,

Seeded GTIDs, warmed Proxy, and gave the code a prod.
Smoke, regression, failover — the warren chimed with zings,
Logs tucked in burrows, PID snug as strings.
A happy hop: CI passes, and the burrow sings!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 54.55%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Add functional test CI: Docker MySQL topology + ProxySQL + failover tests' accurately and concisely describes the main change: adding a comprehensive functional test suite with Docker infrastructure and CI automation.
  • Linked Issues check: ✅ Passed. All five linked issues (#46–#50) are fully addressed: Docker infrastructure with MySQL+ProxySQL [#46], Tier A smoke tests with 25+ checks [#47], Tier B failover tests with graceful/hard failover [#48], Tier C regression tests with 22+ checks [#49], and the GitHub Actions workflow [#50].
  • Out of Scope Changes check: ✅ Passed. All changes are directly scoped to functional testing infrastructure and CI: Docker Compose configuration, MySQL/ProxySQL initialization, test scripts, CI workflow, and supporting helpers. No unrelated code modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and automated functional testing framework for the orchestrator. By leveraging Docker Compose, it sets up a realistic MySQL replication topology with ProxySQL, enabling comprehensive validation of topology discovery, API endpoints, and critical failover mechanisms. This significantly enhances the reliability and maintainability of the orchestrator by ensuring that all key functionalities are continuously tested in an integrated environment.

Highlights

  • Automated Functional Testing: Implemented comprehensive functional tests to automate all previous manual QA sessions, covering core orchestrator functionalities.
  • Docker-based Test Infrastructure: Established a Docker Compose environment featuring a MySQL 8.4 master with two replicas, integrated with ProxySQL for writer/reader hostgroups, and automated configuration for replication and orchestrator/ProxySQL users.
  • Tiered Test Suites: Introduced three distinct test suites: Smoke (25+ checks for basic functionality), Failover (12+ checks for graceful and hard failovers with ProxySQL hook validation), and Regression (22+ checks for API v1/v2, Prometheus, and Web UI consistency).
  • GitHub Actions Integration: Configured a GitHub Actions workflow (functional.yml) to automatically trigger these functional tests on pushes to master, pull requests, and manual dispatches, ensuring continuous validation of the system.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/functional.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive functional test suite, which is a great addition for ensuring the stability and correctness of orchestrator, especially around failover scenarios. The test setup using Docker Compose is well-structured. I've identified a few areas for improvement, mainly concerning the robustness and maintainability of the test environment and scripts. Key points include pinning Docker image versions for reproducibility, avoiding hardcoded credentials, fixing a potential bug with duplicate MySQL server-ids, and making the test scripts more robust. Addressing these points will make the new test suite even more valuable.

@@ -0,0 +1,9 @@
[mysqld]
server-id=100

high

Both mysql2 and mysql3 replicas use this same configuration file, which means they will both have server-id=100. This creates a server-id collision. While the current test setup might work, this is a latent bug that will cause issues if orchestrator attempts to reconfigure replication between these two instances (e.g., making mysql3 a replica of mysql2 after a failover). Each MySQL instance in a replication topology must have a unique server-id.

To fix this, you should use separate configuration files for each replica (e.g., replica2.cnf and replica3.cnf) with unique server-ids, and update tests/functional/docker-compose.yml accordingly.

image: mysql:8.4
hostname: mysql1
environment:
MYSQL_ROOT_PASSWORD: testpass

medium

Hardcoded credentials are used here and throughout the new test files. While this is a test environment, it's a security risk and a bad practice that could be copied elsewhere. Consider using environment variables to manage secrets. You can define them in a .env file (which should be added to .gitignore) and Docker Compose will automatically pick them up. This would centralize credential management and improve security hygiene.

This issue is also present in:

  • tests/functional/docker-compose.yml (lines 13, 24, 32, 43, 51, 71)
  • tests/functional/mysql/init-master.sql
  • tests/functional/mysql/init-replica.sql
  • tests/functional/orchestrator-test.conf.json
  • tests/functional/proxysql/proxysql.cnf
  • tests/functional/run-all.sh
  • tests/functional/test-failover.sh
  • tests/functional/lib.sh

- orchnet

proxysql:
image: proxysql/proxysql:latest

medium

Using the latest tag for Docker images can lead to non-reproducible builds and may cause tests to fail unexpectedly if a new version is pushed with breaking changes. It's best practice to pin images to a specific version.

    image: proxysql/proxysql:2.6.2

Comment on lines +33 to +42
test_endpoint() {
  local NAME="$1" URL="$2" EXPECT="$3"
  local CODE
  CODE=$(curl -s -o /dev/null -w "%{http_code}" "$URL" 2>&1)
  if [ "$CODE" = "$EXPECT" ]; then
    pass "$NAME (HTTP $CODE)"
  else
    fail "$NAME (HTTP $CODE, expected $EXPECT)"
  fi
}

medium

The use of 2>&1 can lead to incorrect behavior if curl fails to connect. In such cases, the error message from curl is redirected to stdout and gets assigned to the CODE variable, which will then contain more than just the HTTP status code. A more robust approach is to remove the redirection and check curl's exit code separately.

Suggested change:

test_endpoint() {
  local NAME="$1" URL="$2" EXPECT="$3"
  local CODE
  CODE=$(curl -s -o /dev/null -w "%{http_code}" "$URL")
  if [ $? -ne 0 ]; then
    fail "$NAME" "curl command failed for $URL"
    return
  fi
  if [ "$CODE" = "$EXPECT" ]; then
    pass "$NAME (HTTP $CODE)"
  else
    fail "$NAME (HTTP $CODE, expected $EXPECT)"
  fi
}

Comment on lines +45 to +54
test_body_contains() {
  local NAME="$1" URL="$2" EXPECT="$3"
  local BODY
  BODY=$(curl -s "$URL" 2>&1)
  if echo "$BODY" | grep -q "$EXPECT"; then
    pass "$NAME"
  else
    fail "$NAME" "Response does not contain '$EXPECT'"
  fi
}

medium

This function is not fully robust. If curl fails with a network error, the error message is captured in BODY, and the test might fail with a generic message. It's better to explicitly check curl's exit code. Additionally, using grep -F is safer for matching fixed strings, as it prevents misinterpretation of special characters as regex.

Suggested change:

test_body_contains() {
  local NAME="$1" URL="$2" EXPECT="$3"
  local BODY
  BODY=$(curl -s "$URL")
  if [ $? -ne 0 ]; then
    fail "$NAME" "curl command failed for $URL"
    return
  fi
  if echo "$BODY" | grep -qF -- "$EXPECT"; then
    pass "$NAME"
  else
    fail "$NAME" "Response does not contain '$EXPECT'"
  fi
}

echo "Waiting for topology discovery..."
for i in $(seq 1 60); do
local COUNT
COUNT=$(curl -s "$ORC_URL/api/cluster/$MASTER_HOST:3306" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null)

medium

The Python one-liner for parsing JSON is not fully robust. If curl returns a non-JSON response (e.g., an HTML error page), the json.load call will fail, and the COUNT variable will be empty. This can lead to silent failures in the loop condition. It's safer to handle this case by defaulting to 0 if the Python script fails.

Suggested change:

COUNT=$(curl -s "$ORC_URL/api/cluster/$MASTER_HOST:3306" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null || echo 0)

fail "Graceful takeover API returned: $CODE" "$(echo "$RESULT" | head -c 200)"
fi

sleep 3

medium

Using fixed sleep durations can make tests flaky or unnecessarily slow. It's more reliable to poll for the desired state. For example, instead of sleeping for 3 seconds after triggering a graceful takeover, you could create a polling loop that repeatedly checks the read_only status of mysql1 and mysql2 until they reflect the role change. This would make the test faster and more robust against timing variations.

This applies to other sleep commands in this file as well (e.g., lines 92, 96, 142, 162).
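A polling helper along those lines might look like the following sketch (`poll_until` and `is_writable` are hypothetical names; the container, credentials, and read_only check mirror the test setup described in this PR):

```shell
#!/usr/bin/env bash
# Poll a condition instead of sleeping a fixed duration.
# Usage: poll_until <timeout_seconds> <interval_seconds> <command...>
poll_until() {
  local timeout="$1" interval="$2"; shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Hypothetical probe for test-failover.sh: succeed once mysql2 reports
# read_only = 0 after a graceful takeover.
is_writable() {
  docker compose exec -T mysql2 \
    mysql -uroot -ptestpass -N -e "SELECT @@read_only" 2>/dev/null | grep -q '^0$'
}

# Replaces a bare `sleep 3`:
# poll_until 60 2 is_writable || echo "mysql2 did not become writable" >&2
```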


Copilot AI left a comment


Pull request overview

Adds a Docker-based functional test environment (MySQL + ProxySQL) and wires three tiered functional test suites into a GitHub Actions workflow so manual QA checks run automatically on PRs and master.

Changes:

  • Introduces tests/functional/ Docker Compose topology (MySQL primary/replicas + ProxySQL) plus an orchestrator test config.
  • Adds Tier A/B/C bash test suites (smoke, failover, regression) and a local run-all.sh runner.
  • Adds .github/workflows/functional.yml to run the functional test suite in CI and upload logs as artifacts.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
.github/workflows/functional.yml New CI workflow to build orchestrator, start Compose topology, and run functional suites.
tests/functional/docker-compose.yml Defines MySQL+ProxySQL topology used by functional tests.
tests/functional/lib.sh Shared bash helpers (HTTP checks, discovery, mysql/proxysql helpers, summary).
tests/functional/run-all.sh Local runner to build, start infra, start orchestrator, and run all suites.
tests/functional/test-smoke.sh Tier A smoke checks (UI, v1/v2 APIs, metrics, health, ProxySQL API/CLI).
tests/functional/test-regression.sh Tier C regression coverage for key endpoints, metrics, health, static assets.
tests/functional/test-failover.sh Tier B graceful + hard failover validation and ProxySQL hook assertions.
tests/functional/orchestrator-test.conf.json Orchestrator config tuned for functional tests (SQLite backend, ProxySQL settings, metrics).
tests/functional/proxysql/proxysql.cnf ProxySQL config (admin creds, hostgroups, monitor + mysql users).
tests/functional/mysql/master.cnf MySQL primary config (GTID/binlog settings).
tests/functional/mysql/replica.cnf MySQL replica config (GTID/binlog settings, read-only).
tests/functional/mysql/init-master.sql Initializes users for orchestrator + replication.
tests/functional/mysql/init-replica.sql Initializes orchestrator user and sets up replication from primary.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +4 to +8
mysql1:
  image: mysql:8.4
  hostname: mysql1
  environment:
    MYSQL_ROOT_PASSWORD: testpass

Copilot AI Mar 24, 2026


The functional workflow starts orchestrator as a host process, but these Compose services do not publish MySQL/ProxySQL ports to the runner. A host process also cannot resolve mysql1/proxysql via Docker DNS, so orchestrator won't be able to connect to the topology as configured. Either (a) run orchestrator as a Compose service on orchnet and expose 3099, or (b) publish the needed ports and update the test config/URLs to use host-reachable addresses.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,9 @@
[mysqld]
server-id=100

Copilot AI Mar 24, 2026


Both replicas mount this same config, so they will share the same server-id value. In MySQL replication, each server must have a unique server_id; duplicates can cause replicas to disconnect or behave unpredictably (especially under GTID). Provide distinct configs (e.g., replica2.cnf/replica3.cnf) or override server-id per service.

Suggested change:

# server-id must be set uniquely per replica instance (e.g., via instance-specific config)

Comment on lines +68 to +80
echo "--- Step 4: Run smoke tests ---"
bash tests/functional/test-smoke.sh
SMOKE_EXIT=$?

echo ""
echo "--- Step 5: Run regression tests ---"
bash tests/functional/test-regression.sh
REGRESSION_EXIT=$?

echo ""
echo "--- Step 6: Run failover tests ---"
bash tests/functional/test-failover.sh
FAILOVER_EXIT=$?

Copilot AI Mar 24, 2026


Because this script runs with set -e, if any suite script exits non-zero the entire run-all.sh will terminate immediately, and the subsequent *_EXIT=$? assignments and cleanup won't run. Disable -e around the suite execution (or use bash ...; SMOKE_EXIT=$? with set +e), so you can collect all exit codes and always perform teardown.

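Under these `set -e` semantics, one way to record each suite's exit code without aborting is a small wrapper like this (a sketch; `capture_exit` is a hypothetical helper, not part of run-all.sh):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Record a command's exit code without letting `set -e` kill the script:
# the `|| rc=$?` arm counts as handling the failure, so execution continues.
# Usage: capture_exit VAR command...
capture_exit() {
  local __var="$1"; shift
  local rc=0
  "$@" || rc=$?
  printf -v "$__var" '%s' "$rc"
}

SMOKE_EXIT=0
capture_exit SMOKE_EXIT bash -c 'exit 0'     # suite succeeded
FAILOVER_EXIT=0
capture_exit FAILOVER_EXIT bash -c 'exit 3'  # suite failed; script keeps going
echo "smoke=$SMOKE_EXIT failover=$FAILOVER_EXIT"
```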
echo "Failover: exit $FAILOVER_EXIT"

# Exit with failure if any suite failed
[ "$SMOKE_EXIT" -ne 0 ] || [ "$REGRESSION_EXIT" -ne 0 ] || [ "$FAILOVER_EXIT" -ne 0 ] && exit 1

Copilot AI Mar 24, 2026


The final failure check uses a chained ||/&& expression whose short-circuiting means it may not exit 1 when smoke/regression fails (because earlier || terms evaluate true and skip the trailing && exit 1). Use an explicit if (or group with { ...; }) to exit non-zero when any suite exit code is non-zero.

Suggested change:

if [ "$SMOKE_EXIT" -ne 0 ] || [ "$REGRESSION_EXIT" -ne 0 ] || [ "$FAILOVER_EXIT" -ne 0 ]; then
  exit 1
fi

- orchnet

proxysql:
image: proxysql/proxysql:latest

Copilot AI Mar 24, 2026


Using proxysql/proxysql:latest makes the functional tests non-reproducible and can break CI when upstream publishes a new image with behavior/config changes. Pin to a known-good ProxySQL version tag (and ideally document the required major/minor).

Suggested change:

image: proxysql/proxysql:2.6.3


fail() {
echo " ❌ FAIL: $1"
[ -n "$2" ] && echo " $2"

Copilot AI Mar 24, 2026


With set -u enabled in all test scripts, referencing $2 here will trigger an "unbound variable" error when fail is called with a single argument (which happens throughout the suite). Use a safe expansion like ${2-} (or shift/iterate over args) before testing/printing the optional details parameter.

Suggested change:

[ -n "${2-}" ] && echo "     ${2-}"

echo "Waiting for topology discovery..."
for i in $(seq 1 60); do
local COUNT
COUNT=$(curl -s "$ORC_URL/api/cluster/$MASTER_HOST:3306" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null)

Copilot AI Mar 24, 2026


discover_topology assumes /api/cluster/... always returns a JSON array; when it returns an error envelope/object (or temporarily invalid JSON while the cluster is being discovered), json.load may fail or len() may yield a non-instance count (e.g., number of object keys), causing flakiness or false positives. Make the parsing resilient by treating non-list/parse errors as count=0 and continuing to retry.

Suggested change:

COUNT=$(curl -s "$ORC_URL/api/cluster/$MASTER_HOST:3306" 2>/dev/null | python3 -c 'import json, sys
try:
    data = json.load(sys.stdin)
    if isinstance(data, list):
        print(len(data))
    else:
        print(0)
except Exception:
    print(0)
' 2>/dev/null)

- mysql3 gets server-id=200 (was 100, same as mysql2)
- Replication configured by setup-replication.sh after containers are up
  (init scripts run during mysqld startup and can't connect to master)
- Workflow uses setup-replication.sh instead of inline verification

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@.github/workflows/functional.yml`:
- Around line 56-65: The current readiness loop only checks mysql2 for
Replica_IO_Running which can leave mysql3 or the SQL thread unready; update the
readiness check (the timeout loop that runs the mysql -uroot -ptestpass "SHOW
REPLICA STATUS\G" command) to verify all replicas (e.g., mysql2 and mysql3) and
that both Replica_IO_Running and Replica_SQL_Running are "Yes" for each host
before proceeding; implement this by iterating the target containers (mysql2,
mysql3) and asserting both fields are "Yes" in the SHOW REPLICA STATUS\G output
(or by checking the results of grepping for "Replica_IO_Running: Yes" and
"Replica_SQL_Running: Yes" per container) and only exit success when every
replica passes, otherwise keep waiting until timeout.
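A loop implementing that check could be sketched as follows (function names are illustrative; the container names, credentials, and SHOW REPLICA STATUS fields come from the comment above):

```shell
#!/usr/bin/env bash
# Wait until every replica reports both replication threads running.
# The probe function is passed in so it can be swapped out for testing.
# Usage: wait_for_replicas <probe_fn> <max_attempts> <host...>
wait_for_replicas() {
  local probe="$1" max="$2"; shift 2
  local attempt host ok
  for attempt in $(seq 1 "$max"); do
    ok=1
    for host in "$@"; do
      "$probe" "$host" || { ok=0; break; }
    done
    if [ "$ok" -eq 1 ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Illustrative probe: both Replica_IO_Running and Replica_SQL_Running must
# be "Yes" in the SHOW REPLICA STATUS output for the given container.
check_replica() {
  local out
  out=$(docker compose exec -T "$1" mysql -uroot -ptestpass \
        -e "SHOW REPLICA STATUS\G" 2>/dev/null)
  echo "$out" | grep -q 'Replica_IO_Running: Yes' &&
    echo "$out" | grep -q 'Replica_SQL_Running: Yes'
}

# In the workflow this would run as:
# wait_for_replicas check_replica 60 mysql2 mysql3
```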

In `@tests/functional/docker-compose.yml`:
- Line 5: Replace floating image tags in the docker-compose.yml to pin images to
immutable versions: locate the image entries "mysql:8.4" (appearing at the three
occurrences) and "proxysql/proxysql:latest" and update them to digest-pinned
image references (e.g., mysql@sha256:<digest> and
proxysql/proxysql@sha256:<digest>) or a specific immutable version tag; fetch
the appropriate digests from the upstream registries (Docker Hub or the image
provider) and substitute them so the functional test suite uses fixed,
reproducible images.

In `@tests/functional/lib.sh`:
- Around line 57-63: The readiness check in wait_for_orchestrator currently
treats any HTTP response (including 4xx/5xx) as success because curl is invoked
with -s -o /dev/null; change the curl invocation in wait_for_orchestrator (the
call hitting "$ORC_URL/api/clusters") to use flags that cause non-2xx responses
to exit nonzero (e.g., replace -s -o /dev/null with -fsS -o /dev/null or
equivalent) so the loop only reports "Orchestrator ready" on actual HTTP
success.
- Around line 33-53: The test functions test_endpoint and test_body_contains
currently let curl failures abort the whole suite and mix stderr into outputs;
update them to run curl without inheriting set -e (e.g., append || true or
capture exit code immediately), capture stdout and stderr separately (don’t use
2>&1 into the asserted payload), check curl’s exit status and treat any non-zero
exit as a test failure (call fail with diagnostic including the curl stderr and
exit code), and only then assert on the HTTP code or body content; reference
these changes inside the test_endpoint and test_body_contains functions so
failures are recorded instead of aborting the script.
- Around line 9-22: The three reporting functions (pass, fail, skip) are unsafe
under set -euo pipefail: change the post-increment arithmetic ((...++)) to a
safe assignment form (e.g. PASS_COUNT=$((PASS_COUNT+1)),
SKIP_COUNT=$((SKIP_COUNT+1)), FAIL_COUNT=$((FAIL_COUNT+1))) so the increment
never produces a failing exit status, and make fail() guard the optional second
parameter by using a parameter expansion fallback (e.g. "${2:-}") or testing
with "${2+set}" before echoing so referencing $2 cannot trigger an unbound
variable error; update the functions pass, fail, and skip accordingly (ensuring
their echo lines remain) to preserve behavior under strict mode.

In `@tests/functional/mysql/replica.cnf`:
- Line 2: The replica.cnf currently hardcodes server-id=100 which causes
duplicate IDs; remove the server-id line from tests/functional/mysql/replica.cnf
and instead add per-service command overrides in the docker-compose service
definitions for mysql2 and mysql3 (e.g., set --server-id=2 for mysql2 and
--server-id=3 for mysql3) so each replica has a unique MySQL server-id; update
the mysql2 and mysql3 service blocks to include the command override (or
equivalent entrypoint args) rather than relying on the shared replica.cnf.

In `@tests/functional/run-all.sh`:
- Around line 68-80: The script currently aborts early due to set -e when any
test script fails, so wrap each test invocation to prevent immediate exit while
still capturing its exit code. Note that "bash ... || true; SMOKE_EXIT=$?"
always records 0 (the status of true); use a status-preserving form instead,
e.g. "bash tests/functional/test-smoke.sh && SMOKE_EXIT=0 || SMOKE_EXIT=$?"
(and similarly for test-regression.sh and test-failover.sh, assigning
REGRESSION_EXIT and FAILOVER_EXIT), ensuring the cleanup/summary and trap
handlers always run; keep the rest of the logic unchanged so the final summary
uses those variables.
- Around line 48-55: The loop that sets REPL_OK using command substitution exits
under set -euo pipefail when grep finds no match; change the command to avoid
failing the pipeline (e.g., append "|| true" or use grep -q with || true and
capture exit status/output) and parse the mysql SHOW REPLICA STATUS output
safely; extend the check to run against both replicas (mysql2 and mysql3) and
validate both fields Replica_IO_Running and Replica_SQL_Running (aggregate
counts into REPL_OK or separate variables), retry up to the existing timeout,
and if after retries replication is not fully ready, print a clear error and
exit non-zero so the script fails instead of continuing. Ensure you update the
invocation(s) that use $COMPOSE exec -T mysql ... and the REPL_OK logic
accordingly.

In `@tests/functional/test-failover.sh`:
- Around line 78-99: The script restores MySQL and ProxySQL but doesn't verify
the resulting topology; add explicit checks using the helper functions
mysql_read_only, mysql_source_host, and proxysql_servers (from
tests/functional/lib.sh) before declaring PASS and proceeding to the
hard-failover case. After the discovery curls and sleep, call
mysql_read_only(mysql1) to assert mysql1 is writable (read_only=0),
mysql_read_only(mysql2/mysql3) to assert they are read-only, use
mysql_source_host(mysql2) and mysql_source_host(mysql3) to assert their source
is mysql1, and use proxysql_servers() (or proxysql_servers for hostgroup 10/20)
to verify ProxySQL backends were applied; if any check fails, print diagnostics
and exit non‑zero instead of passing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1895a8cf-54c8-4e03-adda-336b217b2263

📥 Commits

Reviewing files that changed from the base of the PR and between 7fe2446 and 4e3df8a.

📒 Files selected for processing (13)
  • .github/workflows/functional.yml
  • tests/functional/docker-compose.yml
  • tests/functional/lib.sh
  • tests/functional/mysql/init-master.sql
  • tests/functional/mysql/init-replica.sql
  • tests/functional/mysql/master.cnf
  • tests/functional/mysql/replica.cnf
  • tests/functional/orchestrator-test.conf.json
  • tests/functional/proxysql/proxysql.cnf
  • tests/functional/run-all.sh
  • tests/functional/test-failover.sh
  • tests/functional/test-regression.sh
  • tests/functional/test-smoke.sh

Comment on lines +56 to +65
timeout 30 bash -c '
  while true; do
    REPL=$(docker compose exec -T mysql2 mysql -uroot -ptestpass -Nse "SHOW REPLICA STATUS\G" 2>/dev/null | grep "Replica_IO_Running: Yes" || true)
    if [ -n "$REPL" ]; then
      echo "Replication running"
      exit 0
    fi
    sleep 1
  done
' || { echo "Replication not running"; exit 1; }

⚠️ Potential issue | 🟠 Major

Wait for full replica readiness before starting the suites.

This only checks mysql2 and only Replica_IO_Running. CI can proceed with mysql3 still broken or with a stopped SQL thread, which makes the later failover assertions flaky against a partially healthy topology.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/functional.yml around lines 56 - 65, The current readiness
loop only checks mysql2 for Replica_IO_Running which can leave mysql3 or the SQL
thread unready; update the readiness check (the timeout loop that runs the mysql
-uroot -ptestpass "SHOW REPLICA STATUS\G" command) to verify all replicas (e.g.,
mysql2 and mysql3) and that both Replica_IO_Running and Replica_SQL_Running are
"Yes" for each host before proceeding; implement this by iterating the target
containers (mysql2, mysql3) and asserting both fields are "Yes" in the SHOW
REPLICA STATUS\G output (or by checking the results of grepping for
"Replica_IO_Running: Yes" and "Replica_SQL_Running: Yes" per container) and only
exit success when every replica passes, otherwise keep waiting until timeout.
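
The stricter check could be sketched as follows. Container names and
credentials are taken from the compose file in this PR; the status parsing is
factored into a helper so it can be exercised without Docker.

```shell
# Succeeds only when the given SHOW REPLICA STATUS output reports both
# replication threads as running.
both_threads_running() {
    printf '%s\n' "$1" | grep -q "Replica_IO_Running: Yes" &&
        printf '%s\n' "$1" | grep -q "Replica_SQL_Running: Yes"
}

# Poll every replica until all report both threads running, or time out.
wait_for_replicas() {
    local attempts="${1:-30}" i c status all_ok
    for i in $(seq 1 "$attempts"); do
        all_ok=1
        for c in mysql2 mysql3; do
            status=$(docker compose exec -T "$c" \
                mysql -uroot -ptestpass -e "SHOW REPLICA STATUS\G" 2>/dev/null || true)
            both_threads_running "$status" || all_ok=0
        done
        if [ "$all_ok" -eq 1 ]; then
            echo "Replication running on all replicas"
            return 0
        fi
        sleep 1
    done
    echo "Replication not running after ${attempts}s" >&2
    return 1
}
```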


services:
mysql1:
image: mysql:8.4

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Declared images:"
rg -n '^\s*image:\s*' tests/functional/docker-compose.yml

echo
echo "Floating tags detected (latest or major.minor only):"
rg -nP 'image:\s*\S+:(latest|\d+\.\d+)$' tests/functional/docker-compose.yml || true

Repository: ProxySQL/orchestrator

Length of output: 355


Pin container images to immutable versions (ideally digest-pinned).

Floating tags (mysql:8.4 and proxysql/proxysql:latest) make the functional test suite non-reproducible and vulnerable to upstream image drift. Each pull of the image may yield different contents, causing tests to break without code changes.

Apply to lines 5, 21, 40 (all mysql:8.4), and line 59 (proxysql/proxysql:latest).

🔧 Proposed fix
- image: mysql:8.4
+ image: mysql:8.4.x@sha256:<digest>

- image: proxysql/proxysql:latest
+ image: proxysql/proxysql:<stable-version>@sha256:<digest>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/docker-compose.yml` at line 5, Replace floating image tags
in the docker-compose.yml to pin images to immutable versions: locate the image
entries "mysql:8.4" (appearing at the three occurrences) and
"proxysql/proxysql:latest" and update them to digest-pinned image references
(e.g., mysql@sha256:<digest> and proxysql/proxysql@sha256:<digest>) or a
specific immutable version tag; fetch the appropriate digests from the upstream
registries (Docker Hub or the image provider) and substitute them so the
functional test suite uses fixed, reproducible images.
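
One possible way to resolve the digests to pin (a sketch; run it locally and
paste the printed references into docker-compose.yml). The digest values
themselves must come from the registry at pin time; the Docker calls are
guarded so the snippet is a no-op without a daemon.

```shell
# Build an immutable "repo@sha256:..." reference from its parts.
pinned_ref() {
    printf '%s@%s' "$1" "$2"
}

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    for img in mysql:8.4 proxysql/proxysql:latest; do
        docker pull -q "$img" >/dev/null
        # RepoDigests holds "repo@sha256:<digest>" for a pulled image.
        digest=$(docker inspect --format '{{index .RepoDigests 0}}' "$img" | cut -d@ -f2)
        repo=${img%%:*}
        echo "$img -> $(pinned_ref "$repo" "$digest")"
    done
fi
```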

Comment on lines +9 to +22
pass() {
    echo "  ✅ PASS: $1"
    ((PASS_COUNT++))
}

fail() {
    echo "  ❌ FAIL: $1"
    [ -n "$2" ] && echo "         $2"
    ((FAIL_COUNT++))
}

skip() {
    echo "  ⚠️  SKIP: $1"
    ((SKIP_COUNT++))

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
bash -euo pipefail -c 'count=0; ((count++)); echo "unreached"'
bash -euo pipefail -c 'f(){ [ -n "$2" ] && echo "$2"; }; f only-one-arg'

Repository: ProxySQL/orchestrator

Length of output: 106


🏁 Script executed:

#!/bin/bash
# Check lines 1-8 to see if counters are initialized
head -n 30 tests/functional/lib.sh

Repository: ProxySQL/orchestrator

Length of output: 566


🏁 Script executed:

#!/bin/bash
# Test 1: Post-increment at 0 returns old value (0), causing exit status 1 under set -e
bash -euo pipefail -c 'count=0; ((count++)); echo "reached: $count"'
echo "Test 1 exit code: $?"

# Test 2: Post-increment at 1+ returns old value, causing exit status 0 (truthy)
bash -euo pipefail -c 'count=1; ((count++)); echo "reached: $count"'
echo "Test 2 exit code: $?"

# Test 3: Assignment form always succeeds
bash -euo pipefail -c 'count=0; count=$((count+1)); echo "reached: $count"'
echo "Test 3 exit code: $?"

Repository: ProxySQL/orchestrator

Length of output: 146


pass/fail/skip are not safe under the strict mode the suites enable.

The post-increment ((...++)) evaluates to the old value and returns exit status 1 when the counter is 0, causing the script to exit on the first call to any of these functions. Additionally, fail() dereferences $2 without guarding for missing positional parameters, which causes an unbound variable error under set -u when called with a single argument.

Because every suite does set -euo pipefail before sourcing tests/functional/lib.sh, the first PASS, the first SKIP, or any one-argument FAIL exits the script instead of recording the result.

Suggested fix
 pass() {
     echo "  ✅ PASS: $1"
-    ((PASS_COUNT++))
+    PASS_COUNT=$((PASS_COUNT + 1))
 }
 
 fail() {
     echo "  ❌ FAIL: $1"
-    [ -n "$2" ] && echo "         $2"
-    ((FAIL_COUNT++))
+    if [ $# -ge 2 ] && [ -n "$2" ]; then
+        echo "         $2"
+    fi
+    FAIL_COUNT=$((FAIL_COUNT + 1))
 }
 
 skip() {
     echo "  ⚠️  SKIP: $1"
-    ((SKIP_COUNT++))
+    SKIP_COUNT=$((SKIP_COUNT + 1))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/lib.sh` around lines 9 - 22, The three reporting functions
(pass, fail, skip) are unsafe under set -euo pipefail: change the post-increment
arithmetic ((...++)) to a safe assignment form (e.g.
PASS_COUNT=$((PASS_COUNT+1)), SKIP_COUNT=$((SKIP_COUNT+1)),
FAIL_COUNT=$((FAIL_COUNT+1))) so the increment never produces a failing exit
status, and make fail() guard the optional second parameter by using a parameter
expansion fallback (e.g. "${2:-}") or testing with "${2+set}" before echoing so
referencing $2 cannot trigger an unbound variable error; update the functions
pass, fail, and skip accordingly (ensuring their echo lines remain) to preserve
behavior under strict mode.

Comment on lines +57 to +63
wait_for_orchestrator() {
    echo "Waiting for orchestrator to be ready..."
    for i in $(seq 1 30); do
        if curl -s -o /dev/null "$ORC_URL/api/clusters" 2>/dev/null; then
            echo "Orchestrator ready after ${i}s"
            return 0
        fi

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does curl return a zero exit status for HTTP 404/500 responses unless -f/--fail is used?

💡 Result:

Yes, curl returns a zero exit status for HTTP 404 and 500 responses unless the -f or --fail option is used. By default, curl considers the transfer successful as long as the HTTP response is received, regardless of the status code (4xx or 5xx). The --fail option makes curl return a non-zero exit code (typically 22) for HTTP responses with status codes >= 400.



Fix readiness check to require HTTP success response.

curl -s -o /dev/null exits with code 0 for HTTP 404/500 responses, allowing wait_for_orchestrator() to incorrectly report success when the endpoint is failing. Use curl -fsS to make curl fail on HTTP error status codes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/lib.sh` around lines 57 - 63, The readiness check in
wait_for_orchestrator currently treats any HTTP response (including 4xx/5xx) as
success because curl is invoked with -s -o /dev/null; change the curl invocation
in wait_for_orchestrator (the call hitting "$ORC_URL/api/clusters") to use flags
that cause non-2xx responses to exit nonzero (e.g., replace -s -o /dev/null with
-fsS -o /dev/null or equivalent) so the loop only reports "Orchestrator ready"
on actual HTTP success.
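
A minimal sketch of the corrected loop. ORC_URL is defaulted here so the
snippet stands alone (lib.sh sets the real value), and an attempts parameter
is added for testability; the real function takes no arguments.

```shell
# Default ORC_URL so the sketch is self-contained; lib.sh sets the real value.
ORC_URL="${ORC_URL:-http://127.0.0.1:3000}"

wait_for_orchestrator() {
    local attempts="${1:-30}" i
    echo "Waiting for orchestrator to be ready..."
    for i in $(seq 1 "$attempts"); do
        # -f makes curl exit non-zero on HTTP >= 400, so a 404/500 from the
        # endpoint no longer counts as "ready".
        if curl -fsS -o /dev/null "$ORC_URL/api/clusters" 2>/dev/null; then
            echo "Orchestrator ready after ${i}s"
            return 0
        fi
        sleep 1
    done
    echo "Orchestrator not ready after ${attempts}s" >&2
    return 1
}
```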

@@ -0,0 +1,9 @@
[mysqld]
server-id=100

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Replica server-id definitions:"
rg -n 'server-id\s*=' tests/functional/mysql/*.cnf

echo
echo "Where replica.cnf is mounted:"
rg -n 'replica\.cnf' tests/functional/docker-compose.yml -C2

Repository: ProxySQL/orchestrator

Length of output: 638


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the full service definitions for mysql2 and mysql3
echo "=== Full mysql2 service definition ==="
rg -A 20 'mysql2:' tests/functional/docker-compose.yml

echo
echo "=== Full mysql3 service definition ==="
rg -A 20 'mysql3:' tests/functional/docker-compose.yml

# Check if there are any other replicas
echo
echo "=== All service definitions ==="
rg '^\s+[a-z]+[a-z0-9]*:$' tests/functional/docker-compose.yml | head -20

Repository: ProxySQL/orchestrator

Length of output: 2156


Assign unique MySQL server-id values per replica.

Both mysql2 and mysql3 services mount the same replica.cnf with server-id=100, violating MySQL's replication requirement that each server have a unique ID. This causes replication connection conflicts and breaks the topology state needed for failover tests.

  • Remove the fixed server-id from replica.cnf
  • Add unique server-id per service via docker-compose command override (e.g., mysql2: --server-id=2, mysql3: --server-id=3)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/mysql/replica.cnf` at line 2, The replica.cnf currently
hardcodes server-id=100 which causes duplicate IDs; remove the server-id line
from tests/functional/mysql/replica.cnf and instead add per-service command
overrides in the docker-compose service definitions for mysql2 and mysql3 (e.g.,
set --server-id=2 for mysql2 and --server-id=3 for mysql3) so each replica has a
unique MySQL server-id; update the mysql2 and mysql3 service blocks to include
the command override (or equivalent entrypoint args) rather than relying on the
shared replica.cnf.
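
After moving server-id out of the shared replica.cnf (e.g., a compose
`command: --server-id=2` override for mysql2 and `--server-id=3` for mysql3),
the effective IDs can be verified at runtime. A sketch, with the uniqueness
assertion factored out so it runs without Docker:

```shell
# Fails if any of the given server-id values repeats.
assert_unique_ids() {
    ! printf '%s\n' "$@" | sort | uniq -d | grep -q .
}

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    ids=""
    for c in mysql1 mysql2 mysql3; do
        # Query each container for the server-id it actually runs with.
        id=$(docker compose -f tests/functional/docker-compose.yml \
            exec -T "$c" mysql -uroot -ptestpass -Nse "SELECT @@server_id")
        ids="$ids $id"
    done
    # shellcheck disable=SC2086
    assert_unique_ids $ids || { echo "duplicate server-id detected" >&2; exit 1; }
fi
```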

Comment on lines +48 to +55
for i in $(seq 1 30); do
    REPL_OK=$($COMPOSE exec -T mysql2 mysql -uroot -ptestpass -Nse "SHOW REPLICA STATUS\G" 2>/dev/null | grep "Replica_IO_Running: Yes" | wc -l)
    if [ "$REPL_OK" -ge 1 ]; then
        echo "Replication is running"
        break
    fi
    sleep 1
done

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
bash -euo pipefail -c 'x=$(printf "" | grep foo | wc -l); echo "unreached"'

Repository: ProxySQL/orchestrator

Length of output: 47


🏁 Script executed:

cd /root/codebase && find . -name "run-all.sh" -type f

Repository: ProxySQL/orchestrator

Length of output: 129


🏁 Script executed:

cd /root/codebase && sed -n '1,60p' tests/functional/run-all.sh | cat -n

Repository: ProxySQL/orchestrator

Length of output: 129


🏁 Script executed:

cd /root/codebase && head -30 tests/functional/run-all.sh | cat -n

Repository: ProxySQL/orchestrator

Length of output: 129


🏁 Script executed:

pwd && git ls-files tests/functional/run-all.sh

Repository: ProxySQL/orchestrator

Length of output: 111


🏁 Script executed:

cat -n tests/functional/run-all.sh | head -60

Repository: ProxySQL/orchestrator

Length of output: 2250


🏁 Script executed:

head -20 tests/functional/run-all.sh && echo "---" && sed -n '40,65p' tests/functional/run-all.sh | cat -n

Repository: ProxySQL/orchestrator

Length of output: 1648


🏁 Script executed:

cat tests/functional/docker-compose.yml | grep -E "mysql[0-9]|container_name|MYSQL_REPLICATION" -A 2 -B 2

Repository: ProxySQL/orchestrator

Length of output: 973


🏁 Script executed:

rg "mysql3" tests/functional/

Repository: ProxySQL/orchestrator

Length of output: 992


🏁 Script executed:

rg "Replica_SQL_Running" tests/functional/

Repository: ProxySQL/orchestrator

Length of output: 47


The replication verification loop will exit prematurely on first iteration if replication is not immediately ready due to pipefail behavior.

With set -euo pipefail on line 3, the command substitution fails when grep finds no match in the pipeline, causing the script to exit immediately instead of retrying. Additionally:

  • Only mysql2 is verified, but mysql3 (another replica in the topology) is not checked
  • Only Replica_IO_Running is verified; Replica_SQL_Running should also be validated
  • The loop completes silently without explicit failure if replication never starts, allowing the script to continue to Step 3
Suggested fix
-echo "Verifying replication..."
-for i in $(seq 1 30); do
-    REPL_OK=$($COMPOSE exec -T mysql2 mysql -uroot -ptestpass -Nse "SHOW REPLICA STATUS\G" 2>/dev/null | grep "Replica_IO_Running: Yes" | wc -l)
-    if [ "$REPL_OK" -ge 1 ]; then
-        echo "Replication is running"
-        break
-    fi
-    sleep 1
-done
+echo "Verifying replication..."
+replica_ready() {
+    local container="$1"
+    local status
+    status=$($COMPOSE exec -T "$container" mysql -uroot -ptestpass -Nse "SHOW REPLICA STATUS\G" 2>/dev/null || true)
+    echo "$status" | grep -q "Replica_IO_Running: Yes" &&
+    echo "$status" | grep -q "Replica_SQL_Running: Yes"
+}
+
+for i in $(seq 1 30); do
+    if replica_ready mysql2 && replica_ready mysql3; then
+        echo "Replication is running"
+        break
+    fi
+    if [ "$i" -eq 30 ]; then
+        echo "FATAL: Replication not running on all replicas after 30s"
+        exit 1
+    fi
+    sleep 1
+done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/run-all.sh` around lines 48 - 55, The loop that sets REPL_OK
using command substitution exits under set -euo pipefail when grep finds no
match; change the command to avoid failing the pipeline (e.g., append "|| true"
or use grep -q with || true and capture exit status/output) and parse the mysql
SHOW REPLICA STATUS output safely; extend the check to run against both replicas
(mysql2 and mysql3) and validate both fields Replica_IO_Running and
Replica_SQL_Running (aggregate counts into REPL_OK or separate variables), retry
up to the existing timeout, and if after retries replication is not fully ready,
print a clear error and exit non-zero so the script fails instead of continuing.
Ensure you update the invocation(s) that use $COMPOSE exec -T mysql ... and the
REPL_OK logic accordingly.

Comment on lines +68 to +80
echo "--- Step 4: Run smoke tests ---"
bash tests/functional/test-smoke.sh
SMOKE_EXIT=$?

echo ""
echo "--- Step 5: Run regression tests ---"
bash tests/functional/test-regression.sh
REGRESSION_EXIT=$?

echo ""
echo "--- Step 6: Run failover tests ---"
bash tests/functional/test-failover.sh
FAILOVER_EXIT=$?

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
bash -euo pipefail -c 'false; echo "unreached"'

Repository: ProxySQL/orchestrator

Length of output: 47


🏁 Script executed:

# Check if the file exists and examine the structure
if [ -f tests/functional/run-all.sh ]; then
  echo "=== File exists, checking total lines ==="
  wc -l tests/functional/run-all.sh
  
  echo ""
  echo "=== Checking for set -e/errexit ==="
  head -20 tests/functional/run-all.sh
  
  echo ""
  echo "=== Lines 68-80 (the section in review) ==="
  sed -n '68,80p' tests/functional/run-all.sh
  
  echo ""
  echo "=== Checking if there's a summary/cleanup block after line 80 ==="
  tail -30 tests/functional/run-all.sh
else
  echo "File not found"
fi

Repository: ProxySQL/orchestrator

Length of output: 2087


set -e prevents the suite summary and cleanup from running on test failure.

If any test suite (smoke, regression, or failover) exits non-zero, the script terminates immediately before capturing its exit code. The cleanup block (killing ORC_PID, stopping Docker Compose) and the final summary are never reached, leaving infrastructure orphaned. All subsequent test suites also fail to run. Wrap each suite call in conditional statements to allow the script to continue and capture exit codes:

Suggested fix
 echo ""
 echo "--- Step 4: Run smoke tests ---"
-bash tests/functional/test-smoke.sh
-SMOKE_EXIT=$?
+if bash tests/functional/test-smoke.sh; then
+    SMOKE_EXIT=0
+else
+    SMOKE_EXIT=$?
+fi
 
 echo ""
 echo "--- Step 5: Run regression tests ---"
-bash tests/functional/test-regression.sh
-REGRESSION_EXIT=$?
+if bash tests/functional/test-regression.sh; then
+    REGRESSION_EXIT=0
+else
+    REGRESSION_EXIT=$?
+fi
 
 echo ""
 echo "--- Step 6: Run failover tests ---"
-bash tests/functional/test-failover.sh
-FAILOVER_EXIT=$?
+if bash tests/functional/test-failover.sh; then
+    FAILOVER_EXIT=0
+else
+    FAILOVER_EXIT=$?
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/run-all.sh` around lines 68 - 80, The script currently
aborts early due to set -e when any test script fails, so wrap each test
invocation to prevent immediate exit while still capturing its exit code. Do
not use "bash ... || true; SMOKE_EXIT=$?": after "|| true", "$?" is always 0
(the status of true). Use a status-preserving form instead, e.g. "bash
tests/functional/test-smoke.sh && SMOKE_EXIT=0 || SMOKE_EXIT=$?" (and
similarly for test-regression.sh and test-failover.sh, assigning
REGRESSION_EXIT and FAILOVER_EXIT), ensuring the cleanup/summary and trap
handlers always run; keep the rest of the logic unchanged so the final summary
uses those variables.

Comment on lines +78 to +99
# Restore mysql1 as master
docker compose -f tests/functional/docker-compose.yml exec -T mysql1 \
    mysql -uroot -ptestpass -e "STOP REPLICA; RESET REPLICA ALL; SET GLOBAL read_only=0;" 2>/dev/null
docker compose -f tests/functional/docker-compose.yml exec -T mysql2 \
    mysql -uroot -ptestpass -e "STOP REPLICA; CHANGE REPLICATION SOURCE TO SOURCE_HOST='mysql1', SOURCE_PORT=3306, SOURCE_USER='repl', SOURCE_PASSWORD='repl_pass', SOURCE_AUTO_POSITION=1; START REPLICA; SET GLOBAL read_only=1;" 2>/dev/null
docker compose -f tests/functional/docker-compose.yml exec -T mysql3 \
    mysql -uroot -ptestpass -e "STOP REPLICA; CHANGE REPLICATION SOURCE TO SOURCE_HOST='mysql1', SOURCE_PORT=3306, SOURCE_USER='repl', SOURCE_PASSWORD='repl_pass', SOURCE_AUTO_POSITION=1; START REPLICA; SET GLOBAL read_only=1;" 2>/dev/null

# Reset ProxySQL
docker compose -f tests/functional/docker-compose.yml exec -T proxysql \
    mysql -h127.0.0.1 -P6032 -uradmin -pradmin -e \
    "DELETE FROM mysql_servers WHERE hostgroup_id IN (10,20); INSERT INTO mysql_servers (hostgroup_id,hostname,port) VALUES (10,'mysql1',3306),(20,'mysql2',3306),(20,'mysql3',3306); LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;" 2>/dev/null

# Re-discover after topology change
sleep 5
curl -s "$ORC_URL/api/discover/mysql1/3306" > /dev/null
curl -s "$ORC_URL/api/discover/mysql2/3306" > /dev/null
curl -s "$ORC_URL/api/discover/mysql3/3306" > /dev/null
sleep 15

echo "Topology restored, waiting for orchestrator to stabilize..."
pass "Topology restored for hard failover test"

⚠️ Potential issue | 🟠 Major

Validate the restored topology before starting the hard-failover case.

This section records PASS after a couple of sleeps, but none of the MySQL or ProxySQL rewiring is actually verified. If restore drifted, the DeadMaster test starts from the wrong topology and its outcome is misleading. Re-check tests/functional/lib.sh helpers mysql_read_only, mysql_source_host, and proxysql_servers here before continuing.

Suggested fix
 # Re-discover after topology change
 sleep 5
 curl -s "$ORC_URL/api/discover/mysql1/3306" > /dev/null
 curl -s "$ORC_URL/api/discover/mysql2/3306" > /dev/null
 curl -s "$ORC_URL/api/discover/mysql3/3306" > /dev/null
 sleep 15
 
 echo "Topology restored, waiting for orchestrator to stabilize..."
-pass "Topology restored for hard failover test"
+RO1=$(mysql_read_only mysql1)
+RO2=$(mysql_read_only mysql2)
+RO3=$(mysql_read_only mysql3)
+SRC2=$(mysql_source_host mysql2)
+SRC3=$(mysql_source_host mysql3)
+HG10=$(proxysql_servers 10)
+HG20=$(proxysql_servers 20)
+
+if [ "$RO1" = "0" ] && [ "$RO2" = "1" ] && [ "$RO3" = "1" ] && \
+   [ "$SRC2" = "mysql1" ] && [ "$SRC3" = "mysql1" ] && \
+   echo "$HG10" | grep -q "mysql1" && \
+   echo "$HG20" | grep -q "mysql2" && \
+   echo "$HG20" | grep -q "mysql3"; then
+    pass "Topology restored for hard failover test"
+else
+    fail "Topology restore validation failed" \
+         "RO=$RO1/$RO2/$RO3 SRC=$SRC2/$SRC3 HG10=$HG10 HG20=$HG20"
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-failover.sh` around lines 78 - 99, The script restores
MySQL and ProxySQL but doesn't verify the resulting topology; add explicit
checks using the helper functions mysql_read_only, mysql_source_host, and
proxysql_servers (from tests/functional/lib.sh) before declaring PASS and
proceeding to the hard-failover case. After the discovery curls and sleep, call
mysql_read_only(mysql1) to assert mysql1 is writable (read_only=0),
mysql_read_only(mysql2/mysql3) to assert they are read-only, use
mysql_source_host(mysql2) and mysql_source_host(mysql3) to assert their source
is mysql1, and use proxysql_servers() (or proxysql_servers for hostgroup 10/20)
to verify ProxySQL backends were applied; if any check fails, print diagnostics
and exit non‑zero instead of passing.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
.github/workflows/functional.yml (1)

56-61: Consider polling orchestrator health instead of fixed sleep.

A 3-second sleep may be insufficient if orchestrator takes longer to initialize (database setup, port binding, etc.). A health check poll would be more reliable.

♻️ Example: Poll health endpoint
         bin/orchestrator -config tests/functional/orchestrator-test.conf.json http > /tmp/orchestrator-test.log 2>&1 &
         echo "$!" > /tmp/orchestrator-test.pid
-        sleep 3
+        for i in $(seq 1 30); do
+          curl -sf http://localhost:3000/api/health >/dev/null && break
+          sleep 1
+        done

Adjust the port and endpoint to match your orchestrator configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/functional.yml around lines 56 - 61, Replace the fixed
"sleep 3" wait with a health-check polling loop that calls the orchestrator's
HTTP health endpoint (after starting bin/orchestrator -config
tests/functional/orchestrator-test.conf.json http) until it returns success or a
timeout is reached; ensure you still write the PID to
/tmp/orchestrator-test.pid, stream or capture /tmp/orchestrator-test.log for
debug on failure, and fail the workflow if the health check times out. Use the
started process (bin/orchestrator) and its HTTP health endpoint (adjust
port/path as needed) for polling with a short interval and a reasonable overall
timeout so tests only proceed once the orchestrator is actually ready.
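
A generic polling helper makes the intent explicit and fails the job on timeout; a sketch is below, where the probe command, port 3000, and /api/health path are assumptions taken from the suggestion above, not verified configuration:

```shell
#!/bin/bash
# wait_for: run a probe command once per second until it succeeds or
# the attempt budget is exhausted; returns non-zero on timeout so the
# workflow step fails instead of proceeding against a dead service.
wait_for() {
    probe=$1
    attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if $probe >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "wait_for: '$probe' did not succeed within ${attempts}s" >&2
    return 1
}

# Hypothetical probe; adjust port and path to the real orchestrator config.
orchestrator_ready() {
    curl -sf http://localhost:3000/api/health
}
```

For example, `wait_for orchestrator_ready 30 || exit 1` right after starting the binary keeps the PID/log handling unchanged while gating the tests on readiness.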
tests/functional/setup-replication.sh (1)

20-20: Consider preserving stderr for debugging.

Suppressing stderr during retries hides diagnostic information (like the authentication error that caused the pipeline failure). Consider logging to a temp file or only suppressing on non-final attempts.

♻️ Example: Log errors on final retry
-        " 2>/dev/null && break
+        " 2>&1 | { [ "$i" -lt 30 ] && cat >/dev/null || cat; } && break

Or, more simply: remove 2>/dev/null and accept the noise during retries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/setup-replication.sh` at line 20, The retry loop in
setup-replication.sh suppresses stderr with the redirection "2>/dev/null &&
break", hiding useful diagnostics; update the loop so stderr is preserved for
the final attempt (e.g., write stderr to a temp file and only discard it on
non-final retries, or remove the "2>/dev/null" entirely) so that authentication
and other errors are visible when retries fail; locate the retry construct
containing the literal "2>/dev/null && break" and modify the redirection logic
to log or retain stderr on the final try.
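
One way to keep the diagnostics without per-attempt noise is to capture stderr to a temp file and only print it once all retries are exhausted; a sketch (the `retry` helper and its interface are illustrative, not existing project code):

```shell
#!/bin/bash
# retry N CMD...: run CMD up to N times, one second apart, capturing
# stderr on each attempt; on total failure, print the last attempt's
# stderr so the root cause (e.g. an authentication error) is visible.
retry() {
    attempts=$1
    shift
    errlog=$(mktemp)
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" 2>"$errlog"; then
            rm -f "$errlog"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "retry: failed after $attempts attempts; last stderr:" >&2
    cat "$errlog" >&2
    rm -f "$errlog"
    return 1
}
```

The setup loop could then invoke its mysql command through `retry 30 ...` and still surface the replication error that originally broke the pipeline.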
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/functional/setup-replication.sh`:
- Around line 12-22: The replication setup fails due to caching_sha2_password
requiring RSA public key exchange; update the CHANGE REPLICATION SOURCE TO
statement used in the setup loop to include SOURCE_PUBLIC_KEY='...' by adding
the MySQL option GET_SOURCE_PUBLIC_KEY=1 (or the equivalent SOURCE_PUBLIC_KEY
setting) so the replica can fetch the source public key for authentication;
modify the block that runs CHANGE REPLICATION SOURCE TO / START REPLICA (the
command that sets
SOURCE_HOST/SOURCE_PORT/SOURCE_USER/SOURCE_PASSWORD/SOURCE_AUTO_POSITION) to
include the public-key option so START REPLICA succeeds without Last_IO_Errno
2061.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dc0cdc68-486b-4fe1-b9c4-e9db66e395f1

📥 Commits

Reviewing files that changed from the base of the PR and between 4e3df8a and 607b618.

📒 Files selected for processing (5)
  • .github/workflows/functional.yml
  • tests/functional/docker-compose.yml
  • tests/functional/mysql/init-replica.sql
  • tests/functional/mysql/replica2.cnf
  • tests/functional/setup-replication.sh
✅ Files skipped from review due to trivial changes (2)
  • tests/functional/mysql/replica2.cnf
  • tests/functional/docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/functional/mysql/init-replica.sql

…scovery

- Remove set -e from test scripts (tests handle failures themselves)
- Discover topology dynamically (cluster name varies by environment)
- Seed all replicas during discovery, not just master
- Use CLUSTER_NAME variable throughout instead of hardcoded names
- Failover test uses dynamic cluster name for takeover API

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
tests/functional/test-failover.sh (1)

79-99: ⚠️ Potential issue | 🟠 Major

Restore phase still reports PASS without validating restored topology.

This is the same unresolved issue previously flagged: the script can proceed to hard-failover from an unverified baseline.

Proposed fix
 # Re-discover after topology change
 sleep 5
 curl -s "$ORC_URL/api/discover/mysql1/3306" > /dev/null
 curl -s "$ORC_URL/api/discover/mysql2/3306" > /dev/null
 curl -s "$ORC_URL/api/discover/mysql3/3306" > /dev/null
 sleep 15
 
 echo "Topology restored, waiting for orchestrator to stabilize..."
-pass "Topology restored for hard failover test"
+RO1=$(mysql_read_only mysql1)
+RO2=$(mysql_read_only mysql2)
+RO3=$(mysql_read_only mysql3)
+SRC2=$(mysql_source_host mysql2)
+SRC3=$(mysql_source_host mysql3)
+HG10=$(proxysql_servers 10)
+HG20=$(proxysql_servers 20)
+
+if [ "$RO1" = "0" ] && [ "$RO2" = "1" ] && [ "$RO3" = "1" ] && \
+   [ "$SRC2" = "mysql1" ] && [ "$SRC3" = "mysql1" ] && \
+   echo "$HG10" | grep -q "mysql1" && \
+   echo "$HG20" | grep -q "mysql2" && \
+   echo "$HG20" | grep -q "mysql3"; then
+  pass "Topology restored for hard failover test"
+else
+  fail "Topology restore validation failed" "RO=$RO1/$RO2/$RO3 SRC=$SRC2/$SRC3 HG10=$HG10 HG20=$HG20"
+  summary
+  exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-failover.sh` around lines 79 - 99, The script currently
calls the discovery endpoints (curl .../api/discover/...) and then immediately
reports success with pass "Topology restored for hard failover test" without
validating the actual cluster state; add a verification step after the
discovery/sleep block and before the pass call that queries the
orchestrator/topology API (or MySQL status) to assert mysql1 is PRIMARY and
mysql2/mysql3 are replicas with replication running and read_only=1, and fail
the test (exit non-zero or call the existing fail helper) if any check fails;
locate where the discovery curls and the pass "Topology restored for hard
failover test" are used and add these checks there.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2a3e65ba-13dc-4c1a-b713-2e394cc66c8b

📥 Commits

Reviewing files that changed from the base of the PR and between 607b618 and 9a15a4b.

📒 Files selected for processing (5)
  • tests/functional/lib.sh
  • tests/functional/setup-replication.sh
  • tests/functional/test-failover.sh
  • tests/functional/test-regression.sh
  • tests/functional/test-smoke.sh
✅ Files skipped from review due to trivial changes (1)
  • tests/functional/lib.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/functional/setup-replication.sh

#!/bin/bash
# Tier B: Failover tests — verify failover and ProxySQL hooks against real services
set -uo pipefail # no -e: we handle failures ourselves
cd "$(dirname "$0")/../.."


⚠️ Potential issue | 🟡 Minor

Add fail-fast handling for repository-root cd.

Continuing after a failed directory change can invalidate all subsequent relative-path commands.

Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-failover.sh` at line 4, The script performs cd
"$(dirname "$0")/../.." without handling failures, which can cause subsequent
relative-path operations to run in the wrong directory; update the
test-failover.sh entry so the repository-root cd is fail-fast by checking its
exit status or enabling errexit (e.g., use a conditional or set -e) so that if
cd "$(dirname "$0")/../.." fails the script exits immediately and does not
continue with invalid relative paths.

#!/bin/bash
# Tier C: Regression tests — verify all API endpoints and features
set -uo pipefail # no -e: we handle failures ourselves
cd "$(dirname "$0")/../.."


⚠️ Potential issue | 🟡 Minor

Handle cd failure explicitly before sourcing project-relative paths.

If this cd fails, subsequent relative paths run in the wrong directory and produce misleading failures.

Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-regression.sh` at line 4, The script currently runs cd
"$(dirname "$0")/../.." without checking its result; update
tests/functional/test-regression.sh to explicitly handle a failing cd by
checking its exit status (e.g., test the return value or use ||) and print a
clear error and exit non‑zero if the cd fails before any further relative path
operations or sourcing; reference the existing cd command in the top of the
script and ensure the script halts immediately on failure to avoid running
subsequent commands in the wrong directory.

#!/bin/bash
# Tier A: Smoke tests — verify basic functionality against real services
set -uo pipefail # no -e: we handle failures ourselves
cd "$(dirname "$0")/../.."


⚠️ Potential issue | 🟡 Minor

Guard the repository-root cd with || exit.

Without this, path-sensitive checks may execute from the wrong working directory.

Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-smoke.sh` at line 4, The repository-root cd command (cd
"$(dirname "$0")/../..") is unguarded and can leave the script running from the
wrong directory; update that invocation to fail fast by appending a guard such
as || exit 1 (or || { echo "failed to change directory"; exit 1; }) so the
script stops if the chdir fails and subsequent path-sensitive checks run from
the intended repository root.

Comment on lines +27 to +31
INST_COUNT=$(curl -s "$ORC_URL/api/cluster/$CLUSTER_NAME" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null || echo "0")
if [ "$INST_COUNT" -ge 2 ]; then
pass "Instances discovered: $INST_COUNT"
else
fail "Instances discovered: $INST_COUNT (expected >= 2)"


⚠️ Potential issue | 🟠 Major

Smoke discovery threshold is too weak for the intended 3-node topology.

This currently passes with only 2 discovered instances, which can mask partial topology failures in Tier A.

Proposed fix
-INST_COUNT=$(curl -s "$ORC_URL/api/cluster/$CLUSTER_NAME" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null || echo "0")
-if [ "$INST_COUNT" -ge 2 ]; then
+INST_COUNT=$(curl -s "$ORC_URL/api/cluster/$CLUSTER_NAME" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null || echo "0")
+if [ "$INST_COUNT" -ge 3 ]; then
     pass "Instances discovered: $INST_COUNT"
 else
-    fail "Instances discovered: $INST_COUNT (expected >= 2)"
+    fail "Instances discovered: $INST_COUNT (expected >= 3)"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-smoke.sh` around lines 27 - 31, The smoke test
currently treats INST_COUNT >= 2 as success which is too low for the intended
3‑node topology; update the conditional that checks INST_COUNT (the curl +
python3 JSON length pipeline assigning INST_COUNT) so it requires at least 3
(e.g., change the numeric comparator from 2 to 3) and adjust the pass/fail
messages produced by pass and fail to reflect the new expected minimum of 3
instances; keep using the same variables (INST_COUNT, ORC_URL, CLUSTER_NAME) and
the existing pass/fail helpers.

Orchestrator needs to reach MySQL containers by their Docker hostnames
(mysql1, mysql2, mysql3). Running it on the host requires /etc/hosts hacks.

Solution: Add orchestrator as a Docker Compose service that:
- Builds from source using golang:1.25.7 image
- Joins the same network as MySQL and ProxySQL
- Port-forwards 3099 to host for test scripts

Also: expose MySQL and ProxySQL ports for host-side CLI tests.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c9d952b2-1693-48dd-889c-bf252f022fc4

📥 Commits

Reviewing files that changed from the base of the PR and between 9a15a4b and ef5cc19.

📒 Files selected for processing (3)
  • .github/workflows/functional.yml
  • tests/functional/docker-compose.yml
  • tests/functional/test-smoke.sh
✅ Files skipped from review due to trivial changes (1)
  • tests/functional/docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/functional.yml


# Prerequisites
wait_for_orchestrator || { echo "FATAL: Orchestrator not reachable"; exit 1; }
discover_topology "mysql1"


⚠️ Potential issue | 🟠 Major

Handle discover_topology failure explicitly.

Line 11 ignores the function’s exit code. If discovery times out, the suite continues and can produce misleading downstream results instead of recording the root failure.

Suggested fix
-discover_topology "mysql1"
+if ! discover_topology "mysql1"; then
+    fail "Topology discovery timed out" "Expected 3 discovered instances within timeout"
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-smoke.sh` at line 11, The call to discover_topology
currently ignores its exit code so failures are swallowed; update the invocation
to check its return status and fail the test early (e.g., run discover_topology
"mysql1" and if it returns non-zero then log an error and exit non‑zero).
Specifically, after calling discover_topology, test its exit code (or use `||`
to short-circuit) and call the existing test-failure/cleanup routine or `exit 1`
so the suite stops on discovery timeout; ensure you reference the
discover_topology invocation and use the same logging/teardown helpers used
elsewhere in the script.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/functional/test-regression.sh (1)

30-35: Consider extracting the HTTP status code check into a reusable helper.

The manual curl call for the 404 test works correctly, but if similar error-code validations are needed elsewhere, a helper like test_status_code in lib.sh could improve consistency. This is a minor observation; the current implementation is functional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-regression.sh` around lines 30 - 35, Extract the
repeated HTTP status check into a reusable helper (e.g., add function
test_status_code in lib.sh) and replace the inline curl+comparison in the test
with a call to that helper; specifically, move the curl invocation (currently
assigned to V2_404 and compared to "404") into test_status_code which accepts a
URL and expected code, and update this test to call test_status_code
"$ORC_URL/api/v2/instances/nonexistent/9999" 404 so the logic is centralized and
can be reused elsewhere.
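
If such a helper were added to lib.sh, it could look like the sketch below; `test_status_code` and `http_status` are hypothetical names, and the minimal `pass`/`fail` stand-ins only approximate the real reporters:

```shell
#!/bin/bash
# Minimal stand-ins for lib.sh's reporters (assumed interface).
pass() { echo "PASS: $1"; }
fail() { echo "FAIL: $1"; }

# Fetch only the HTTP status code for a URL.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1" 2>/dev/null
}

# test_status_code URL EXPECTED: pass when the response code matches.
test_status_code() {
    url=$1
    expected=$2
    actual=$(http_status "$url")
    if [ "$actual" = "$expected" ]; then
        pass "HTTP $expected from $url"
    else
        fail "expected HTTP $expected from $url, got $actual"
    fi
}
```

The 404 regression check would then reduce to `test_status_code "$ORC_URL/api/v2/instances/nonexistent/9999" 404`.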

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 451f234a-dbf6-40ba-8557-61a142374d92

📥 Commits

Reviewing files that changed from the base of the PR and between ef5cc19 and 0533931.

📒 Files selected for processing (3)
  • tests/functional/test-failover.sh
  • tests/functional/test-regression.sh
  • tests/functional/test-smoke.sh

Comment on lines +152 to +158
# Check recovery via API
RECOVERY_API=$(curl -s "$ORC_URL/api/v2/recoveries" 2>/dev/null)
if echo "$RECOVERY_API" | grep -q '"IsSuccessful":true'; then
pass "Recovery audit: /api/v2/recoveries shows successful recovery"
else
fail "Recovery audit: no successful recovery in API response"
fi


⚠️ Potential issue | 🟡 Minor

JSON grep pattern may be fragile with whitespace.

The grep for "IsSuccessful":true on line 154 will fail if the JSON response is pretty-printed with spaces (e.g., "IsSuccessful": true). Since Python is already available and used elsewhere in this script, consider using it for reliable JSON parsing.

Proposed fix
 # Check recovery via API
 RECOVERY_API=$(curl -s "$ORC_URL/api/v2/recoveries" 2>/dev/null)
-if echo "$RECOVERY_API" | grep -q '"IsSuccessful":true'; then
+HAS_SUCCESS=$(echo "$RECOVERY_API" | python3 -c "
+import json, sys
+d = json.load(sys.stdin)
+for r in d.get('data', []):
+    if r.get('IsSuccessful'):
+        print('YES')
+        sys.exit(0)
+print('NO')
+" 2>/dev/null)
+if [ "$HAS_SUCCESS" = "YES" ]; then
     pass "Recovery audit: /api/v2/recoveries shows successful recovery"
 else
     fail "Recovery audit: no successful recovery in API response"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-failover.sh` around lines 152 - 158, The current grep
against RECOVERY_API for the literal string '"IsSuccessful":true' is fragile to
JSON whitespace; replace the grep check with a small Python JSON parse that
reads RECOVERY_API and tests whether any object in the response has IsSuccessful
== True (use the RECOVERY_API variable, parse it with json.loads, and exit 0/1
accordingly), and then use that exit code to call pass/fail; locate the block
using the RECOVERY_API variable and the recovery API check to implement this
change.

