
Add functional test CI: Docker MySQL topology + ProxySQL + failover tests#51

Merged
renecannao merged 11 commits into master from functional-tests-ci
Mar 24, 2026

Conversation

@renecannao renecannao commented Mar 24, 2026

Closes #46, closes #47, closes #48, closes #49, closes #50

Summary

Every manual test from the QA session is now automated and runs in GitHub Actions.

Infrastructure (tests/functional/docker-compose.yml)

  • MySQL 8.4 master + 2 replicas with GTID replication
  • ProxySQL with writer (HG 10) and reader (HG 20) hostgroups
  • Auto-configured: replication, orchestrator user, ProxySQL remote admin

Test Suites

Tier A — Smoke (test-smoke.sh): 25+ checks

  • Discovery (cluster + instances)
  • Web UI + static files
  • API v1 endpoints
  • API v2 endpoints with structured responses
  • Prometheus /metrics
  • Health endpoints (/health/live, /health/ready, /health/leader)
  • ProxySQL CLI (proxysql-test, proxysql-servers)
  • ProxySQL API (/api/proxysql/servers)

Tier B — Failover (test-failover.sh): 12+ checks

  • Graceful master takeover with ProxySQL hook validation
  • Hard failover: docker stop mysql1, verify detection + promotion + ProxySQL update
  • Recovery audit via /api/v2/recoveries
  • Topology restore between tests

Tier C — Regression (test-regression.sh): 22+ checks

  • Chi router v1 API regression
  • API v2 response format and error codes
  • Prometheus metric presence and format
  • Health endpoints
  • Web UI and static file serving

GitHub Actions (functional.yml)

  • Triggers: push to master, PRs, manual dispatch
  • Starts Docker infrastructure, waits for health, verifies replication
  • Runs all 3 test suites sequentially
  • Uploads logs as artifacts on failure
  • 15 minute timeout

Test plan

  • CI runs the functional workflow
  • All 3 test suites pass (smoke, failover, regression)

Summary by CodeRabbit

  • Tests

    • Added a full functional test suite (smoke, regression, failover) with shared test helpers, end-to-end topology and replication checks, ProxySQL and Prometheus validations, web/API/UI checks, and a consolidated pass/fail reporter.
  • Chores

    • Added a CI workflow to run these functional tests (push/PR/manual), provision test infrastructure, run sequential suites, collect logs/artifacts on success/failure, and perform automated cleanup.

Comprehensive functional tests running against real MySQL topology + ProxySQL:

Infrastructure (docker-compose.yml):
- MySQL 8.4 master + 2 replicas with GTID replication
- ProxySQL with writer (HG 10) and reader (HG 20) hostgroups
- Auto-configured replication and orchestrator user

Test suites:
- test-smoke.sh (Tier A): Discovery, API v1/v2, Prometheus, health
  endpoints, ProxySQL CLI/API, web UI, static files (25+ checks)
- test-failover.sh (Tier B): Graceful master takeover with ProxySQL
  hook validation, hard failover (kill master), recovery audit (12+ checks)
- test-regression.sh (Tier C): Full chi router regression, API v2
  response format, Prometheus metrics, health endpoints (22+ checks)

GitHub Actions workflow (functional.yml):
- Triggers on push to master, PRs, and manual dispatch
- Starts Docker infrastructure, builds orchestrator, runs all 3 suites
- Uploads orchestrator and docker logs as artifacts on failure
- 15 minute timeout

Closes #46, closes #47, closes #48, closes #49, closes #50
Copilot AI review requested due to automatic review settings March 24, 2026 09:02

coderabbitai bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a functional test harness: Docker Compose topology (MySQL master + 2 replicas + ProxySQL), initialization and replication setup, orchestrator test config and binary build, three test suites (smoke, regression, failover), shared helpers, a runner and GitHub Actions workflow that boots the stack, runs tests, collects logs/artifacts, and always tears down.

Changes

  • CI Workflow (.github/workflows/functional.yml): New GitHub Actions workflow "Functional Tests": builds the Go orchestrator, boots docker-compose, waits for service health, runs replication setup, launches orchestrator, executes the test suites, uploads logs/artifacts, and always tears down.
  • Compose & ProxySQL config (tests/functional/docker-compose.yml, tests/functional/proxysql/proxysql.cnf): Adds a docker-compose environment with a network, 3 MySQL services and ProxySQL, healthchecks, hostgroup/server mapping, and ProxySQL admin config.
  • MySQL configs (tests/functional/mysql/master.cnf, tests/functional/mysql/replica.cnf, tests/functional/mysql/replica2.cnf): Server configs for the master and two replicas (GTID, binlog, server-id, replica read-only).
  • MySQL init SQL (tests/functional/mysql/init-master.sql, tests/functional/mysql/init-replica.sql): Initialization SQL to create the orchestrator and repl users and grant appropriate privileges.
  • Replication setup (tests/functional/setup-replication.sh): Script to configure GTID-based replication for the replicas, with retries and verification loops.
  • Test runner & helpers (tests/functional/lib.sh, tests/functional/run-all.sh): Shared test library (counters, HTTP helpers, discovery/wait helpers, docker inspection) and a runner that builds, starts infra, verifies replication, runs suites, collects logs, and cleans up.
  • Test suites (tests/functional/test-smoke.sh, tests/functional/test-regression.sh, tests/functional/test-failover.sh): Three executable suites: smoke (basic endpoints, discovery, CLI), regression (API v1/v2, metrics, health, UI), and failover (graceful takeover and hard failover flows with ProxySQL and recovery API checks).
  • Orchestrator test config (tests/functional/orchestrator-test.conf.json): Test-specific orchestrator JSON: debug, :3099, SQLite backend, MySQL creds, ProxySQL admin settings, hostgroup mappings, recovery filters, Prometheus enabled.
  • Runtime artifacts (/tmp/orchestrator-test.log, /tmp/orchestrator-test.pid, created at runtime): Workflows and scripts produce log and PID files and upload them as artifacts on completion or failure.
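The retry-and-verify pattern attributed to setup-replication.sh above could be sketched as follows (the `retry` helper and its parameters are illustrative, not taken from the actual script):

```shell
#!/usr/bin/env bash
# Illustrative retry helper: rerun a command until it succeeds or the
# attempt budget is exhausted.
# Usage: retry <max_attempts> <delay_seconds> <command...>
retry() {
  local max="$1" delay="$2"; shift 2
  local attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "Giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# In the real script the wrapped command would be a replication probe,
# e.g. a `docker compose exec mysql2 mysql ... "SHOW REPLICA STATUS\G"` call.
retry 30 2 true && echo "replication configured"
```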

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions
    participant DC as Docker Compose (mysql1,mysql2,mysql3,proxysql)
    participant SETUP as setup-replication.sh
    participant ORC as Orchestrator (bin/orchestrator)
    participant TEST as Test Suites (smoke/regression/failover)
    participant ART as Artifact Uploader

    GH->>DC: docker compose up -d (tests/functional/docker-compose.yml)
    DC-->>GH: services start & healthchecks
    GH->>SETUP: run tests/functional/setup-replication.sh
    SETUP->>DC: exec MySQL CHANGE REPLICATION / START REPLICA
    SETUP-->>GH: replication verified
    GH->>ORC: build & start orchestrator (orchestrator-test.conf.json)
    ORC-->>GH: HTTP API ready (/api/clusters)
    GH->>TEST: run test-smoke.sh -> test-regression.sh -> test-failover.sh
    TEST->>ORC: HTTP API calls (discovery, recovery endpoints)
    TEST->>DC: docker compose exec for ProxySQL/MySQL inspection
    TEST-->>GH: PASS/FAIL results
    GH->>ORC: collect /tmp/orchestrator-test.log
    GH->>DC: docker compose down -v --remove-orphans
    GH->>ART: upload logs/artifacts

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Add built-in ProxySQL failover hooks #36 — Adds end-to-end functional tests, docker-compose and ProxySQL config, and test scripts that exercise the same ProxySQL integration and CLI commands used here.

Poem

🐰 I hopped into compose and nudged each pod,

Seeded GTIDs, warmed Proxy, and gave the code a prod.
Smoke, regression, failover — the warren chimed with zings,
Logs tucked in burrows, PID snug as strings.
A happy hop: CI passes, and the burrow sings!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 54.55%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title 'Add functional test CI: Docker MySQL topology + ProxySQL + failover tests' accurately and concisely describes the main change: adding a comprehensive functional test suite with Docker infrastructure and CI automation.
  • Linked Issues check: ✅ Passed. All five linked issues (#46–#50) are fully addressed: Docker infrastructure with MySQL+ProxySQL [#46], Tier A smoke tests with 25+ checks [#47], Tier B failover tests with graceful/hard failover [#48], Tier C regression tests with 22+ checks [#49], and the GitHub Actions workflow [#50].
  • Out of Scope Changes check: ✅ Passed. All changes are directly scoped to functional testing infrastructure and CI: Docker Compose configuration, MySQL/ProxySQL initialization, test scripts, CI workflow, and supporting helpers. No unrelated code modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and automated functional testing framework for the orchestrator. By leveraging Docker Compose, it sets up a realistic MySQL replication topology with ProxySQL, enabling comprehensive validation of topology discovery, API endpoints, and critical failover mechanisms. This significantly enhances the reliability and maintainability of the orchestrator by ensuring that all key functionalities are continuously tested in an integrated environment.

Highlights

  • Automated Functional Testing: Implemented comprehensive functional tests to automate all previous manual QA sessions, covering core orchestrator functionalities.
  • Docker-based Test Infrastructure: Established a Docker Compose environment featuring a MySQL 8.4 master with two replicas, integrated with ProxySQL for writer/reader hostgroups, and automated configuration for replication and orchestrator/ProxySQL users.
  • Tiered Test Suites: Introduced three distinct test suites: Smoke (25+ checks for basic functionality), Failover (12+ checks for graceful and hard failovers with ProxySQL hook validation), and Regression (22+ checks for API v1/v2, Prometheus, and Web UI consistency).
  • GitHub Actions Integration: Configured a GitHub Actions workflow (functional.yml) to automatically trigger these functional tests on pushes to master, pull requests, and manual dispatches, ensuring continuous validation of the system.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/functional.yml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review: /gemini review performs a code review for the current pull request in its current state.
  • Pull Request Summary: /gemini summary provides a summary of the current pull request in its current state.
  • Comment: @gemini-code-assist responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help: /gemini help displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its suggestions are incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive functional test suite, which is a great addition for ensuring the stability and correctness of orchestrator, especially around failover scenarios. The test setup using Docker Compose is well-structured. I've identified a few areas for improvement, mainly concerning the robustness and maintainability of the test environment and scripts. Key points include pinning Docker image versions for reproducibility, avoiding hardcoded credentials, fixing a potential bug with duplicate MySQL server-ids, and making the test scripts more robust. Addressing these points will make the new test suite even more valuable.

@@ -0,0 +1,9 @@
[mysqld]
server-id=100

high

Both mysql2 and mysql3 replicas use this same configuration file, which means they will both have server-id=100. This creates a server-id collision. While the current test setup might work, this is a latent bug that will cause issues if orchestrator attempts to reconfigure replication between these two instances (e.g., making mysql3 a replica of mysql2 after a failover). Each MySQL instance in a replication topology must have a unique server-id.

To fix this, you should use separate configuration files for each replica (e.g., replica2.cnf and replica3.cnf) with unique server-ids, and update tests/functional/docker-compose.yml accordingly.

image: mysql:8.4
hostname: mysql1
environment:
MYSQL_ROOT_PASSWORD: testpass

medium

Hardcoded credentials are used here and throughout the new test files. While this is a test environment, it's a security risk and a bad practice that could be copied elsewhere. Consider using environment variables to manage secrets. You can define them in a .env file (which should be added to .gitignore) and Docker Compose will automatically pick them up. This would centralize credential management and improve security hygiene.

This issue is also present in:

  • tests/functional/docker-compose.yml (lines 13, 24, 32, 43, 51, 71)
  • tests/functional/mysql/init-master.sql
  • tests/functional/mysql/init-replica.sql
  • tests/functional/orchestrator-test.conf.json
  • tests/functional/proxysql/proxysql.cnf
  • tests/functional/run-all.sh
  • tests/functional/test-failover.sh
  • tests/functional/lib.sh

- orchnet

proxysql:
image: proxysql/proxysql:latest

medium

Using the latest tag for Docker images can lead to non-reproducible builds and may cause tests to fail unexpectedly if a new version is pushed with breaking changes. It's best practice to pin images to a specific version.

    image: proxysql/proxysql:2.6.2

Comment on lines +33 to +42
test_endpoint() {
  local NAME="$1" URL="$2" EXPECT="$3"
  local CODE
  CODE=$(curl -s -o /dev/null -w "%{http_code}" "$URL" 2>&1)
  if [ "$CODE" = "$EXPECT" ]; then
    pass "$NAME (HTTP $CODE)"
  else
    fail "$NAME (HTTP $CODE, expected $EXPECT)"
  fi
}

medium

The use of 2>&1 can lead to incorrect behavior if curl fails to connect. In such cases, the error message from curl is redirected to stdout and gets assigned to the CODE variable, which will then contain more than just the HTTP status code. A more robust approach is to remove the redirection and check curl's exit code separately.

Suggested change:

test_endpoint() {
  local NAME="$1" URL="$2" EXPECT="$3"
  local CODE
  CODE=$(curl -s -o /dev/null -w "%{http_code}" "$URL")
  if [ $? -ne 0 ]; then
    fail "$NAME" "curl command failed for $URL"
    return
  fi
  if [ "$CODE" = "$EXPECT" ]; then
    pass "$NAME (HTTP $CODE)"
  else
    fail "$NAME (HTTP $CODE, expected $EXPECT)"
  fi
}

Comment on lines +45 to +54
test_body_contains() {
  local NAME="$1" URL="$2" EXPECT="$3"
  local BODY
  BODY=$(curl -s "$URL" 2>&1)
  if echo "$BODY" | grep -q "$EXPECT"; then
    pass "$NAME"
  else
    fail "$NAME" "Response does not contain '$EXPECT'"
  fi
}

medium

This function is not fully robust. If curl fails with a network error, the error message is captured in BODY, and the test might fail with a generic message. It's better to explicitly check curl's exit code. Additionally, using grep -F is safer for matching fixed strings, as it prevents misinterpretation of special characters as regex.

Suggested change:

test_body_contains() {
  local NAME="$1" URL="$2" EXPECT="$3"
  local BODY
  BODY=$(curl -s "$URL")
  if [ $? -ne 0 ]; then
    fail "$NAME" "curl command failed for $URL"
    return
  fi
  if echo "$BODY" | grep -qF -- "$EXPECT"; then
    pass "$NAME"
  else
    fail "$NAME" "Response does not contain '$EXPECT'"
  fi
}

echo "Waiting for topology discovery..."
for i in $(seq 1 60); do
local COUNT
COUNT=$(curl -s "$ORC_URL/api/cluster/$MASTER_HOST:3306" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null)

medium

The Python one-liner for parsing JSON is not fully robust. If curl returns a non-JSON response (e.g., an HTML error page), the json.load call will fail, and the COUNT variable will be empty. This can lead to silent failures in the loop condition. It's safer to handle this case by defaulting to 0 if the Python script fails.

Suggested change:

COUNT=$(curl -s "$ORC_URL/api/cluster/$MASTER_HOST:3306" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null || echo 0)

fail "Graceful takeover API returned: $CODE" "$(echo "$RESULT" | head -c 200)"
fi

sleep 3

medium

Using fixed sleep durations can make tests flaky or unnecessarily slow. It's more reliable to poll for the desired state. For example, instead of sleeping for 3 seconds after triggering a graceful takeover, you could create a polling loop that repeatedly checks the read_only status of mysql1 and mysql2 until they reflect the role change. This would make the test faster and more robust against timing variations.

This applies to other sleep commands in this file as well (e.g., lines 92, 96, 142, 162).
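A polling helper along those lines might look like the following sketch (`poll_until` and `is_writable` are hypothetical names; the container, credentials, and read_only check mirror the test setup described in this PR):

```shell
#!/usr/bin/env bash
# Poll a condition instead of sleeping a fixed duration.
# Usage: poll_until <timeout_seconds> <interval_seconds> <command...>
poll_until() {
  local timeout="$1" interval="$2"; shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Hypothetical probe for test-failover.sh: succeed once mysql2 reports
# read_only = 0 after a graceful takeover.
is_writable() {
  docker compose exec -T mysql2 \
    mysql -uroot -ptestpass -N -e "SELECT @@read_only" 2>/dev/null | grep -q '^0$'
}

# Replaces a bare `sleep 3`:
# poll_until 60 2 is_writable || echo "mysql2 did not become writable" >&2
```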


Copilot AI left a comment


Pull request overview

Adds a Docker-based functional test environment (MySQL + ProxySQL) and wires three tiered functional test suites into a GitHub Actions workflow so manual QA checks run automatically on PRs and master.

Changes:

  • Introduces tests/functional/ Docker Compose topology (MySQL primary/replicas + ProxySQL) plus an orchestrator test config.
  • Adds Tier A/B/C bash test suites (smoke, failover, regression) and a local run-all.sh runner.
  • Adds .github/workflows/functional.yml to run the functional test suite in CI and upload logs as artifacts.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
.github/workflows/functional.yml New CI workflow to build orchestrator, start Compose topology, and run functional suites.
tests/functional/docker-compose.yml Defines MySQL+ProxySQL topology used by functional tests.
tests/functional/lib.sh Shared bash helpers (HTTP checks, discovery, mysql/proxysql helpers, summary).
tests/functional/run-all.sh Local runner to build, start infra, start orchestrator, and run all suites.
tests/functional/test-smoke.sh Tier A smoke checks (UI, v1/v2 APIs, metrics, health, ProxySQL API/CLI).
tests/functional/test-regression.sh Tier C regression coverage for key endpoints, metrics, health, static assets.
tests/functional/test-failover.sh Tier B graceful + hard failover validation and ProxySQL hook assertions.
tests/functional/orchestrator-test.conf.json Orchestrator config tuned for functional tests (SQLite backend, ProxySQL settings, metrics).
tests/functional/proxysql/proxysql.cnf ProxySQL config (admin creds, hostgroups, monitor + mysql users).
tests/functional/mysql/master.cnf MySQL primary config (GTID/binlog settings).
tests/functional/mysql/replica.cnf MySQL replica config (GTID/binlog settings, read-only).
tests/functional/mysql/init-master.sql Initializes users for orchestrator + replication.
tests/functional/mysql/init-replica.sql Initializes orchestrator user and sets up replication from primary.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +4 to +8
mysql1:
  image: mysql:8.4
  hostname: mysql1
  environment:
    MYSQL_ROOT_PASSWORD: testpass

Copilot AI Mar 24, 2026


The functional workflow starts orchestrator as a host process, but these Compose services do not publish MySQL/ProxySQL ports to the runner. A host process also cannot resolve mysql1/proxysql via Docker DNS, so orchestrator won't be able to connect to the topology as configured. Either (a) run orchestrator as a Compose service on orchnet and expose 3099, or (b) publish the needed ports and update the test config/URLs to use host-reachable addresses.

Copilot uses AI. Check for mistakes.
@@ -0,0 +1,9 @@
[mysqld]
server-id=100

Copilot AI Mar 24, 2026


Both replicas mount this same config, so they will share the same server-id value. In MySQL replication, each server must have a unique server_id; duplicates can cause replicas to disconnect or behave unpredictably (especially under GTID). Provide distinct configs (e.g., replica2.cnf/replica3.cnf) or override server-id per service.

Suggested change:

# server-id must be set uniquely per replica instance (e.g., via instance-specific config)

Comment on lines +68 to +80
echo "--- Step 4: Run smoke tests ---"
bash tests/functional/test-smoke.sh
SMOKE_EXIT=$?

echo ""
echo "--- Step 5: Run regression tests ---"
bash tests/functional/test-regression.sh
REGRESSION_EXIT=$?

echo ""
echo "--- Step 6: Run failover tests ---"
bash tests/functional/test-failover.sh
FAILOVER_EXIT=$?

Copilot AI Mar 24, 2026


Because this script runs with set -e, if any suite script exits non-zero the entire run-all.sh will terminate immediately, and the subsequent *_EXIT=$? assignments and cleanup won't run. Disable -e around the suite execution (or use bash ...; SMOKE_EXIT=$? with set +e), so you can collect all exit codes and always perform teardown.

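Under these `set -e` semantics, one way to record each suite's exit code without aborting is a small wrapper like this (a sketch; `capture_exit` is a hypothetical helper, not part of run-all.sh):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Record a command's exit code without letting `set -e` kill the script:
# the `|| rc=$?` arm counts as handling the failure, so execution continues.
# Usage: capture_exit VAR command...
capture_exit() {
  local __var="$1"; shift
  local rc=0
  "$@" || rc=$?
  printf -v "$__var" '%s' "$rc"
}

SMOKE_EXIT=0
capture_exit SMOKE_EXIT bash -c 'exit 0'     # suite succeeded
FAILOVER_EXIT=0
capture_exit FAILOVER_EXIT bash -c 'exit 3'  # suite failed; script keeps going
echo "smoke=$SMOKE_EXIT failover=$FAILOVER_EXIT"
```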
echo "Failover: exit $FAILOVER_EXIT"

# Exit with failure if any suite failed
[ "$SMOKE_EXIT" -ne 0 ] || [ "$REGRESSION_EXIT" -ne 0 ] || [ "$FAILOVER_EXIT" -ne 0 ] && exit 1

Copilot AI Mar 24, 2026


The final failure check uses a chained ||/&& expression whose short-circuiting means it may not exit 1 when smoke/regression fails (because earlier || terms evaluate true and skip the trailing && exit 1). Use an explicit if (or group with { ...; }) to exit non-zero when any suite exit code is non-zero.

Suggested change:

if [ "$SMOKE_EXIT" -ne 0 ] || [ "$REGRESSION_EXIT" -ne 0 ] || [ "$FAILOVER_EXIT" -ne 0 ]; then
  exit 1
fi

- orchnet

proxysql:
image: proxysql/proxysql:latest

Copilot AI Mar 24, 2026


Using proxysql/proxysql:latest makes the functional tests non-reproducible and can break CI when upstream publishes a new image with behavior/config changes. Pin to a known-good ProxySQL version tag (and ideally document the required major/minor).

Suggested change:

image: proxysql/proxysql:2.6.3


fail() {
echo " ❌ FAIL: $1"
[ -n "$2" ] && echo " $2"

Copilot AI Mar 24, 2026


With set -u enabled in all test scripts, referencing $2 here will trigger an "unbound variable" error when fail is called with a single argument (which happens throughout the suite). Use a safe expansion like ${2-} (or shift/iterate over args) before testing/printing the optional details parameter.

Suggested change:

[ -n "${2-}" ] && echo "     ${2-}"

echo "Waiting for topology discovery..."
for i in $(seq 1 60); do
local COUNT
COUNT=$(curl -s "$ORC_URL/api/cluster/$MASTER_HOST:3306" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null)

Copilot AI Mar 24, 2026


discover_topology assumes /api/cluster/... always returns a JSON array; when it returns an error envelope/object (or temporarily invalid JSON while the cluster is being discovered), json.load may fail or len() may yield a non-instance count (e.g., number of object keys), causing flakiness or false positives. Make the parsing resilient by treating non-list/parse errors as count=0 and continuing to retry.

Suggested change:

COUNT=$(curl -s "$ORC_URL/api/cluster/$MASTER_HOST:3306" 2>/dev/null | python3 -c 'import json, sys
try:
    data = json.load(sys.stdin)
    if isinstance(data, list):
        print(len(data))
    else:
        print(0)
except Exception:
    print(0)
' 2>/dev/null)

- mysql3 gets server-id=200 (was 100, same as mysql2)
- Replication configured by setup-replication.sh after containers are up
  (init scripts run during mysqld startup and can't connect to master)
- Workflow uses setup-replication.sh instead of inline verification

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@.github/workflows/functional.yml`:
- Around line 56-65: The current readiness loop only checks mysql2 for
Replica_IO_Running which can leave mysql3 or the SQL thread unready; update the
readiness check (the timeout loop that runs the mysql -uroot -ptestpass "SHOW
REPLICA STATUS\G" command) to verify all replicas (e.g., mysql2 and mysql3) and
that both Replica_IO_Running and Replica_SQL_Running are "Yes" for each host
before proceeding; implement this by iterating the target containers (mysql2,
mysql3) and asserting both fields are "Yes" in the SHOW REPLICA STATUS\G output
(or by checking the results of grepping for "Replica_IO_Running: Yes" and
"Replica_SQL_Running: Yes" per container) and only exit success when every
replica passes, otherwise keep waiting until timeout.
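A loop implementing that check could be sketched as follows (function names are illustrative; the container names, credentials, and SHOW REPLICA STATUS fields come from the comment above):

```shell
#!/usr/bin/env bash
# Wait until every replica reports both replication threads running.
# The probe function is passed in so it can be swapped out for testing.
# Usage: wait_for_replicas <probe_fn> <max_attempts> <host...>
wait_for_replicas() {
  local probe="$1" max="$2"; shift 2
  local attempt host ok
  for attempt in $(seq 1 "$max"); do
    ok=1
    for host in "$@"; do
      "$probe" "$host" || { ok=0; break; }
    done
    if [ "$ok" -eq 1 ]; then
      return 0
    fi
    sleep 1
  done
  return 1
}

# Illustrative probe: both Replica_IO_Running and Replica_SQL_Running must
# be "Yes" in the SHOW REPLICA STATUS output for the given container.
check_replica() {
  local out
  out=$(docker compose exec -T "$1" mysql -uroot -ptestpass \
        -e "SHOW REPLICA STATUS\G" 2>/dev/null)
  echo "$out" | grep -q 'Replica_IO_Running: Yes' &&
    echo "$out" | grep -q 'Replica_SQL_Running: Yes'
}

# In the workflow this would run as:
# wait_for_replicas check_replica 60 mysql2 mysql3
```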

In `@tests/functional/docker-compose.yml`:
- Line 5: Replace floating image tags in the docker-compose.yml to pin images to
immutable versions: locate the image entries "mysql:8.4" (appearing at the three
occurrences) and "proxysql/proxysql:latest" and update them to digest-pinned
image references (e.g., mysql@sha256:<digest> and
proxysql/proxysql@sha256:<digest>) or a specific immutable version tag; fetch
the appropriate digests from the upstream registries (Docker Hub or the image
provider) and substitute them so the functional test suite uses fixed,
reproducible images.

In `@tests/functional/lib.sh`:
- Around line 57-63: The readiness check in wait_for_orchestrator currently
treats any HTTP response (including 4xx/5xx) as success because curl is invoked
with -s -o /dev/null; change the curl invocation in wait_for_orchestrator (the
call hitting "$ORC_URL/api/clusters") to use flags that cause non-2xx responses
to exit nonzero (e.g., replace -s -o /dev/null with -fsS -o /dev/null or
equivalent) so the loop only reports "Orchestrator ready" on actual HTTP
success.
- Around line 33-53: The test functions test_endpoint and test_body_contains
currently let curl failures abort the whole suite and mix stderr into outputs;
update them to run curl without inheriting set -e (e.g., append || true or
capture exit code immediately), capture stdout and stderr separately (don’t use
2>&1 into the asserted payload), check curl’s exit status and treat any non-zero
exit as a test failure (call fail with diagnostic including the curl stderr and
exit code), and only then assert on the HTTP code or body content; reference
these changes inside the test_endpoint and test_body_contains functions so
failures are recorded instead of aborting the script.
- Around line 9-22: The three reporting functions (pass, fail, skip) are unsafe
under set -euo pipefail: change the post-increment arithmetic ((...++)) to a
safe assignment form (e.g. PASS_COUNT=$((PASS_COUNT+1)),
SKIP_COUNT=$((SKIP_COUNT+1)), FAIL_COUNT=$((FAIL_COUNT+1))) so the increment
never produces a failing exit status, and make fail() guard the optional second
parameter by using a parameter expansion fallback (e.g. "${2:-}") or testing
with "${2+set}" before echoing so referencing $2 cannot trigger an unbound
variable error; update the functions pass, fail, and skip accordingly (ensuring
their echo lines remain) to preserve behavior under strict mode.

In `@tests/functional/mysql/replica.cnf`:
- Line 2: The replica.cnf currently hardcodes server-id=100 which causes
duplicate IDs; remove the server-id line from tests/functional/mysql/replica.cnf
and instead add per-service command overrides in the docker-compose service
definitions for mysql2 and mysql3 (e.g., set --server-id=2 for mysql2 and
--server-id=3 for mysql3) so each replica has a unique MySQL server-id; update
the mysql2 and mysql3 service blocks to include the command override (or
equivalent entrypoint args) rather than relying on the shared replica.cnf.

In `@tests/functional/run-all.sh`:
- Around line 68-80: The script currently aborts early due to set -e when any
test script fails, so wrap each test invocation to prevent immediate exit while
still capturing its exit code. Note that "bash ... || true; SMOKE_EXIT=$?"
always records 0 (the status of true); use a status-preserving form instead,
e.g. "bash tests/functional/test-smoke.sh && SMOKE_EXIT=0 || SMOKE_EXIT=$?"
(and similarly for test-regression.sh and test-failover.sh, assigning
REGRESSION_EXIT and FAILOVER_EXIT), ensuring the cleanup/summary and trap
handlers always run; keep the rest of the logic unchanged so the final summary
uses those variables.
- Around line 48-55: The loop that sets REPL_OK using command substitution exits
under set -euo pipefail when grep finds no match; change the command to avoid
failing the pipeline (e.g., append "|| true" or use grep -q with || true and
capture exit status/output) and parse the mysql SHOW REPLICA STATUS output
safely; extend the check to run against both replicas (mysql2 and mysql3) and
validate both fields Replica_IO_Running and Replica_SQL_Running (aggregate
counts into REPL_OK or separate variables), retry up to the existing timeout,
and if after retries replication is not fully ready, print a clear error and
exit non-zero so the script fails instead of continuing. Ensure you update the
invocation(s) that use $COMPOSE exec -T mysql ... and the REPL_OK logic
accordingly.

In `@tests/functional/test-failover.sh`:
- Around line 78-99: The script restores MySQL and ProxySQL but doesn't verify
the resulting topology; add explicit checks using the helper functions
mysql_read_only, mysql_source_host, and proxysql_servers (from
tests/functional/lib.sh) before declaring PASS and proceeding to the
hard-failover case. After the discovery curls and sleep, call
mysql_read_only(mysql1) to assert mysql1 is writable (read_only=0),
mysql_read_only(mysql2/mysql3) to assert they are read-only, use
mysql_source_host(mysql2) and mysql_source_host(mysql3) to assert their source
is mysql1, and use proxysql_servers() (or proxysql_servers for hostgroup 10/20)
to verify ProxySQL backends were applied; if any check fails, print diagnostics
and exit non‑zero instead of passing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1895a8cf-54c8-4e03-adda-336b217b2263

📥 Commits

Reviewing files that changed from the base of the PR and between 7fe2446 and 4e3df8a.

📒 Files selected for processing (13)
  • .github/workflows/functional.yml
  • tests/functional/docker-compose.yml
  • tests/functional/lib.sh
  • tests/functional/mysql/init-master.sql
  • tests/functional/mysql/init-replica.sql
  • tests/functional/mysql/master.cnf
  • tests/functional/mysql/replica.cnf
  • tests/functional/orchestrator-test.conf.json
  • tests/functional/proxysql/proxysql.cnf
  • tests/functional/run-all.sh
  • tests/functional/test-failover.sh
  • tests/functional/test-regression.sh
  • tests/functional/test-smoke.sh

Comment on lines +56 to +65
timeout 30 bash -c '
  while true; do
    REPL=$(docker compose exec -T mysql2 mysql -uroot -ptestpass -Nse "SHOW REPLICA STATUS\G" 2>/dev/null | grep "Replica_IO_Running: Yes" || true)
    if [ -n "$REPL" ]; then
      echo "Replication running"
      exit 0
    fi
    sleep 1
  done
' || { echo "Replication not running"; exit 1; }

⚠️ Potential issue | 🟠 Major

Wait for full replica readiness before starting the suites.

This only checks mysql2 and only Replica_IO_Running. CI can proceed with mysql3 still broken or with a stopped SQL thread, which makes the later failover assertions flaky against a partially healthy topology.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/functional.yml around lines 56 - 65, The current readiness
loop only checks mysql2 for Replica_IO_Running which can leave mysql3 or the SQL
thread unready; update the readiness check (the timeout loop that runs the mysql
-uroot -ptestpass "SHOW REPLICA STATUS\G" command) to verify all replicas (e.g.,
mysql2 and mysql3) and that both Replica_IO_Running and Replica_SQL_Running are
"Yes" for each host before proceeding; implement this by iterating the target
containers (mysql2, mysql3) and asserting both fields are "Yes" in the SHOW
REPLICA STATUS\G output (or by checking the results of grepping for
"Replica_IO_Running: Yes" and "Replica_SQL_Running: Yes" per container) and only
exit success when every replica passes, otherwise keep waiting until timeout.
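
The stricter check could be sketched as follows. Container names and
credentials are taken from the compose file in this PR; the status parsing is
factored into a helper so it can be exercised without Docker.

```shell
# Succeeds only when the given SHOW REPLICA STATUS output reports both
# replication threads as running.
both_threads_running() {
    printf '%s\n' "$1" | grep -q "Replica_IO_Running: Yes" &&
        printf '%s\n' "$1" | grep -q "Replica_SQL_Running: Yes"
}

# Poll every replica until all report both threads running, or time out.
wait_for_replicas() {
    local attempts="${1:-30}" i c status all_ok
    for i in $(seq 1 "$attempts"); do
        all_ok=1
        for c in mysql2 mysql3; do
            status=$(docker compose exec -T "$c" \
                mysql -uroot -ptestpass -e "SHOW REPLICA STATUS\G" 2>/dev/null || true)
            both_threads_running "$status" || all_ok=0
        done
        if [ "$all_ok" -eq 1 ]; then
            echo "Replication running on all replicas"
            return 0
        fi
        sleep 1
    done
    echo "Replication not running after ${attempts}s" >&2
    return 1
}
```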


services:
mysql1:
image: mysql:8.4

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Declared images:"
rg -n '^\s*image:\s*' tests/functional/docker-compose.yml

echo
echo "Floating tags detected (latest or major.minor only):"
rg -nP 'image:\s*\S+:(latest|\d+\.\d+)$' tests/functional/docker-compose.yml || true

Repository: ProxySQL/orchestrator

Length of output: 355


Pin container images to immutable versions (ideally digest-pinned).

Floating tags (mysql:8.4 and proxysql/proxysql:latest) make the functional test suite non-reproducible and vulnerable to upstream image drift. Each pull of the image may yield different contents, causing tests to break without code changes.

Apply to lines 5, 21, 40 (all mysql:8.4), and line 59 (proxysql/proxysql:latest).

🔧 Proposed fix
- image: mysql:8.4
+ image: mysql:8.4.x@sha256:<digest>

- image: proxysql/proxysql:latest
+ image: proxysql/proxysql:<stable-version>@sha256:<digest>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/docker-compose.yml` at line 5, Replace floating image tags
in the docker-compose.yml to pin images to immutable versions: locate the image
entries "mysql:8.4" (appearing at the three occurrences) and
"proxysql/proxysql:latest" and update them to digest-pinned image references
(e.g., mysql@sha256:<digest> and proxysql/proxysql@sha256:<digest>) or a
specific immutable version tag; fetch the appropriate digests from the upstream
registries (Docker Hub or the image provider) and substitute them so the
functional test suite uses fixed, reproducible images.
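
One possible way to resolve the digests to pin (a sketch; run it locally and
paste the printed references into docker-compose.yml). The digest values
themselves must come from the registry at pin time; the Docker calls are
guarded so the snippet is a no-op without a daemon.

```shell
# Build an immutable "repo@sha256:..." reference from its parts.
pinned_ref() {
    printf '%s@%s' "$1" "$2"
}

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    for img in mysql:8.4 proxysql/proxysql:latest; do
        docker pull -q "$img" >/dev/null
        # RepoDigests holds "repo@sha256:<digest>" for a pulled image.
        digest=$(docker inspect --format '{{index .RepoDigests 0}}' "$img" | cut -d@ -f2)
        repo=${img%%:*}
        echo "$img -> $(pinned_ref "$repo" "$digest")"
    done
fi
```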

Comment on lines +9 to +22
pass() {
    echo "  ✅ PASS: $1"
    ((PASS_COUNT++))
}

fail() {
    echo "  ❌ FAIL: $1"
    [ -n "$2" ] && echo "         $2"
    ((FAIL_COUNT++))
}

skip() {
    echo "  ⚠️  SKIP: $1"
    ((SKIP_COUNT++))

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
bash -euo pipefail -c 'count=0; ((count++)); echo "unreached"'
bash -euo pipefail -c 'f(){ [ -n "$2" ] && echo "$2"; }; f only-one-arg'

Repository: ProxySQL/orchestrator

Length of output: 106


🏁 Script executed:

#!/bin/bash
# Check lines 1-8 to see if counters are initialized
head -n 30 tests/functional/lib.sh

Repository: ProxySQL/orchestrator

Length of output: 566


🏁 Script executed:

#!/bin/bash
# Test 1: Post-increment at 0 returns old value (0), causing exit status 1 under set -e
bash -euo pipefail -c 'count=0; ((count++)); echo "reached: $count"'
echo "Test 1 exit code: $?"

# Test 2: Post-increment at 1+ returns old value, causing exit status 0 (truthy)
bash -euo pipefail -c 'count=1; ((count++)); echo "reached: $count"'
echo "Test 2 exit code: $?"

# Test 3: Assignment form always succeeds
bash -euo pipefail -c 'count=0; count=$((count+1)); echo "reached: $count"'
echo "Test 3 exit code: $?"

Repository: ProxySQL/orchestrator

Length of output: 146


pass/fail/skip are not safe under the strict mode the suites enable.

The post-increment ((...++)) evaluates to the old value and returns exit status 1 when the counter is 0, causing the script to exit on the first call to any of these functions. Additionally, fail() dereferences $2 without guarding for missing positional parameters, which causes an unbound variable error under set -u when called with a single argument.

Because every suite does set -euo pipefail before sourcing tests/functional/lib.sh, the first PASS, the first SKIP, or any one-argument FAIL exits the script instead of recording the result.

Suggested fix
 pass() {
     echo "  ✅ PASS: $1"
-    ((PASS_COUNT++))
+    PASS_COUNT=$((PASS_COUNT + 1))
 }
 
 fail() {
     echo "  ❌ FAIL: $1"
-    [ -n "$2" ] && echo "         $2"
-    ((FAIL_COUNT++))
+    if [ $# -ge 2 ] && [ -n "$2" ]; then
+        echo "         $2"
+    fi
+    FAIL_COUNT=$((FAIL_COUNT + 1))
 }
 
 skip() {
     echo "  ⚠️  SKIP: $1"
-    ((SKIP_COUNT++))
+    SKIP_COUNT=$((SKIP_COUNT + 1))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/lib.sh` around lines 9 - 22, The three reporting functions
(pass, fail, skip) are unsafe under set -euo pipefail: change the post-increment
arithmetic ((...++)) to a safe assignment form (e.g.
PASS_COUNT=$((PASS_COUNT+1)), SKIP_COUNT=$((SKIP_COUNT+1)),
FAIL_COUNT=$((FAIL_COUNT+1))) so the increment never produces a failing exit
status, and make fail() guard the optional second parameter by using a parameter
expansion fallback (e.g. "${2:-}") or testing with "${2+set}" before echoing so
referencing $2 cannot trigger an unbound variable error; update the functions
pass, fail, and skip accordingly (ensuring their echo lines remain) to preserve
behavior under strict mode.

Comment on lines +57 to +63
wait_for_orchestrator() {
    echo "Waiting for orchestrator to be ready..."
    for i in $(seq 1 30); do
        if curl -s -o /dev/null "$ORC_URL/api/clusters" 2>/dev/null; then
            echo "Orchestrator ready after ${i}s"
            return 0
        fi

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does curl return a zero exit status for HTTP 404/500 responses unless -f/--fail is used?

💡 Result:

Yes, curl returns a zero exit status for HTTP 404 and 500 responses unless the -f or --fail option is used. By default, curl considers the transfer successful as long as the HTTP response is received, regardless of the status code (4xx or 5xx). The --fail option makes curl return a non-zero exit code (typically 22) for HTTP responses with status codes >= 400.



Fix readiness check to require HTTP success response.

curl -s -o /dev/null exits with code 0 for HTTP 404/500 responses, allowing wait_for_orchestrator() to incorrectly report success when the endpoint is failing. Use curl -fsS to make curl fail on HTTP error status codes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/lib.sh` around lines 57 - 63, The readiness check in
wait_for_orchestrator currently treats any HTTP response (including 4xx/5xx) as
success because curl is invoked with -s -o /dev/null; change the curl invocation
in wait_for_orchestrator (the call hitting "$ORC_URL/api/clusters") to use flags
that cause non-2xx responses to exit nonzero (e.g., replace -s -o /dev/null with
-fsS -o /dev/null or equivalent) so the loop only reports "Orchestrator ready"
on actual HTTP success.
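
A minimal sketch of the corrected loop. ORC_URL is defaulted here so the
snippet stands alone (lib.sh sets the real value), and an attempts parameter
is added for testability; the real function takes no arguments.

```shell
# Default ORC_URL so the sketch is self-contained; lib.sh sets the real value.
ORC_URL="${ORC_URL:-http://127.0.0.1:3000}"

wait_for_orchestrator() {
    local attempts="${1:-30}" i
    echo "Waiting for orchestrator to be ready..."
    for i in $(seq 1 "$attempts"); do
        # -f makes curl exit non-zero on HTTP >= 400, so a 404/500 from the
        # endpoint no longer counts as "ready".
        if curl -fsS -o /dev/null "$ORC_URL/api/clusters" 2>/dev/null; then
            echo "Orchestrator ready after ${i}s"
            return 0
        fi
        sleep 1
    done
    echo "Orchestrator not ready after ${attempts}s" >&2
    return 1
}
```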

@@ -0,0 +1,9 @@
[mysqld]
server-id=100

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Replica server-id definitions:"
rg -n 'server-id\s*=' tests/functional/mysql/*.cnf

echo
echo "Where replica.cnf is mounted:"
rg -n 'replica\.cnf' tests/functional/docker-compose.yml -C2

Repository: ProxySQL/orchestrator

Length of output: 638


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check the full service definitions for mysql2 and mysql3
echo "=== Full mysql2 service definition ==="
rg -A 20 'mysql2:' tests/functional/docker-compose.yml

echo
echo "=== Full mysql3 service definition ==="
rg -A 20 'mysql3:' tests/functional/docker-compose.yml

# Check if there are any other replicas
echo
echo "=== All service definitions ==="
rg '^\s+[a-z]+[a-z0-9]*:$' tests/functional/docker-compose.yml | head -20

Repository: ProxySQL/orchestrator

Length of output: 2156


Assign unique MySQL server-id values per replica.

Both mysql2 and mysql3 services mount the same replica.cnf with server-id=100, violating MySQL's replication requirement that each server have a unique ID. This causes replication connection conflicts and breaks the topology state needed for failover tests.

  • Remove the fixed server-id from replica.cnf
  • Add unique server-id per service via docker-compose command override (e.g., mysql2: --server-id=2, mysql3: --server-id=3)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/mysql/replica.cnf` at line 2, The replica.cnf currently
hardcodes server-id=100 which causes duplicate IDs; remove the server-id line
from tests/functional/mysql/replica.cnf and instead add per-service command
overrides in the docker-compose service definitions for mysql2 and mysql3 (e.g.,
set --server-id=2 for mysql2 and --server-id=3 for mysql3) so each replica has a
unique MySQL server-id; update the mysql2 and mysql3 service blocks to include
the command override (or equivalent entrypoint args) rather than relying on the
shared replica.cnf.
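
After moving server-id out of the shared replica.cnf (e.g., a compose
`command: --server-id=2` override for mysql2 and `--server-id=3` for mysql3),
the effective IDs can be verified at runtime. A sketch, with the uniqueness
assertion factored out so it runs without Docker:

```shell
# Fails if any of the given server-id values repeats.
assert_unique_ids() {
    ! printf '%s\n' "$@" | sort | uniq -d | grep -q .
}

if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
    ids=""
    for c in mysql1 mysql2 mysql3; do
        # Query each container for the server-id it actually runs with.
        id=$(docker compose -f tests/functional/docker-compose.yml \
            exec -T "$c" mysql -uroot -ptestpass -Nse "SELECT @@server_id")
        ids="$ids $id"
    done
    # shellcheck disable=SC2086
    assert_unique_ids $ids || { echo "duplicate server-id detected" >&2; exit 1; }
fi
```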

Comment on lines +48 to +55
for i in $(seq 1 30); do
    REPL_OK=$($COMPOSE exec -T mysql2 mysql -uroot -ptestpass -Nse "SHOW REPLICA STATUS\G" 2>/dev/null | grep "Replica_IO_Running: Yes" | wc -l)
    if [ "$REPL_OK" -ge 1 ]; then
        echo "Replication is running"
        break
    fi
    sleep 1
done

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
bash -euo pipefail -c 'x=$(printf "" | grep foo | wc -l); echo "unreached"'

Repository: ProxySQL/orchestrator

Length of output: 47


🏁 Script executed:

cd /root/codebase && find . -name "run-all.sh" -type f

Repository: ProxySQL/orchestrator

Length of output: 129


🏁 Script executed:

cd /root/codebase && sed -n '1,60p' tests/functional/run-all.sh | cat -n

Repository: ProxySQL/orchestrator

Length of output: 129


🏁 Script executed:

cd /root/codebase && head -30 tests/functional/run-all.sh | cat -n

Repository: ProxySQL/orchestrator

Length of output: 129


🏁 Script executed:

pwd && git ls-files tests/functional/run-all.sh

Repository: ProxySQL/orchestrator

Length of output: 111


🏁 Script executed:

cat -n tests/functional/run-all.sh | head -60

Repository: ProxySQL/orchestrator

Length of output: 2250


🏁 Script executed:

head -20 tests/functional/run-all.sh && echo "---" && sed -n '40,65p' tests/functional/run-all.sh | cat -n

Repository: ProxySQL/orchestrator

Length of output: 1648


🏁 Script executed:

cat tests/functional/docker-compose.yml | grep -E "mysql[0-9]|container_name|MYSQL_REPLICATION" -A 2 -B 2

Repository: ProxySQL/orchestrator

Length of output: 973


🏁 Script executed:

rg "mysql3" tests/functional/

Repository: ProxySQL/orchestrator

Length of output: 992


🏁 Script executed:

rg "Replica_SQL_Running" tests/functional/

Repository: ProxySQL/orchestrator

Length of output: 47


The replication verification loop will exit prematurely on first iteration if replication is not immediately ready due to pipefail behavior.

With set -euo pipefail on line 3, the command substitution fails when grep finds no match in the pipeline, causing the script to exit immediately instead of retrying. Additionally:

  • Only mysql2 is verified, but mysql3 (another replica in the topology) is not checked
  • Only Replica_IO_Running is verified; Replica_SQL_Running should also be validated
  • The loop completes silently without explicit failure if replication never starts, allowing the script to continue to Step 3
Suggested fix
-echo "Verifying replication..."
-for i in $(seq 1 30); do
-    REPL_OK=$($COMPOSE exec -T mysql2 mysql -uroot -ptestpass -Nse "SHOW REPLICA STATUS\G" 2>/dev/null | grep "Replica_IO_Running: Yes" | wc -l)
-    if [ "$REPL_OK" -ge 1 ]; then
-        echo "Replication is running"
-        break
-    fi
-    sleep 1
-done
+echo "Verifying replication..."
+replica_ready() {
+    local container="$1"
+    local status
+    status=$($COMPOSE exec -T "$container" mysql -uroot -ptestpass -Nse "SHOW REPLICA STATUS\G" 2>/dev/null || true)
+    echo "$status" | grep -q "Replica_IO_Running: Yes" &&
+    echo "$status" | grep -q "Replica_SQL_Running: Yes"
+}
+
+for i in $(seq 1 30); do
+    if replica_ready mysql2 && replica_ready mysql3; then
+        echo "Replication is running"
+        break
+    fi
+    if [ "$i" -eq 30 ]; then
+        echo "FATAL: Replication not running on all replicas after 30s"
+        exit 1
+    fi
+    sleep 1
+done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/run-all.sh` around lines 48 - 55, The loop that sets REPL_OK
using command substitution exits under set -euo pipefail when grep finds no
match; change the command to avoid failing the pipeline (e.g., append "|| true"
or use grep -q with || true and capture exit status/output) and parse the mysql
SHOW REPLICA STATUS output safely; extend the check to run against both replicas
(mysql2 and mysql3) and validate both fields Replica_IO_Running and
Replica_SQL_Running (aggregate counts into REPL_OK or separate variables), retry
up to the existing timeout, and if after retries replication is not fully ready,
print a clear error and exit non-zero so the script fails instead of continuing.
Ensure you update the invocation(s) that use $COMPOSE exec -T mysql ... and the
REPL_OK logic accordingly.

Comment on lines +68 to +80
echo "--- Step 4: Run smoke tests ---"
bash tests/functional/test-smoke.sh
SMOKE_EXIT=$?

echo ""
echo "--- Step 5: Run regression tests ---"
bash tests/functional/test-regression.sh
REGRESSION_EXIT=$?

echo ""
echo "--- Step 6: Run failover tests ---"
bash tests/functional/test-failover.sh
FAILOVER_EXIT=$?

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
bash -euo pipefail -c 'false; echo "unreached"'

Repository: ProxySQL/orchestrator

Length of output: 47


🏁 Script executed:

# Check if the file exists and examine the structure
if [ -f tests/functional/run-all.sh ]; then
  echo "=== File exists, checking total lines ==="
  wc -l tests/functional/run-all.sh
  
  echo ""
  echo "=== Checking for set -e/errexit ==="
  head -20 tests/functional/run-all.sh
  
  echo ""
  echo "=== Lines 68-80 (the section in review) ==="
  sed -n '68,80p' tests/functional/run-all.sh
  
  echo ""
  echo "=== Checking if there's a summary/cleanup block after line 80 ==="
  tail -30 tests/functional/run-all.sh
else
  echo "File not found"
fi

Repository: ProxySQL/orchestrator

Length of output: 2087


set -e prevents the suite summary and cleanup from running on test failure.

If any test suite (smoke, regression, or failover) exits non-zero, the script terminates immediately before capturing its exit code. The cleanup block (killing ORC_PID, stopping Docker Compose) and the final summary are never reached, leaving infrastructure orphaned. All subsequent test suites also fail to run. Wrap each suite call in conditional statements to allow the script to continue and capture exit codes:

Suggested fix
 echo ""
 echo "--- Step 4: Run smoke tests ---"
-bash tests/functional/test-smoke.sh
-SMOKE_EXIT=$?
+if bash tests/functional/test-smoke.sh; then
+    SMOKE_EXIT=0
+else
+    SMOKE_EXIT=$?
+fi
 
 echo ""
 echo "--- Step 5: Run regression tests ---"
-bash tests/functional/test-regression.sh
-REGRESSION_EXIT=$?
+if bash tests/functional/test-regression.sh; then
+    REGRESSION_EXIT=0
+else
+    REGRESSION_EXIT=$?
+fi
 
 echo ""
 echo "--- Step 6: Run failover tests ---"
-bash tests/functional/test-failover.sh
-FAILOVER_EXIT=$?
+if bash tests/functional/test-failover.sh; then
+    FAILOVER_EXIT=0
+else
+    FAILOVER_EXIT=$?
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/run-all.sh` around lines 68 - 80, The script currently
aborts early due to set -e when any test script fails, so wrap each test
invocation to prevent immediate exit while still capturing its exit code. Do
not use "bash ... || true; SMOKE_EXIT=$?": after "|| true", "$?" is always 0
(the status of true). Use a status-preserving form instead, e.g. "bash
tests/functional/test-smoke.sh && SMOKE_EXIT=0 || SMOKE_EXIT=$?" (and
similarly for test-regression.sh and test-failover.sh, assigning
REGRESSION_EXIT and FAILOVER_EXIT), ensuring the cleanup/summary and trap
handlers always run; keep the rest of the logic unchanged so the final summary
uses those variables.

Comment on lines +78 to +99
# Restore mysql1 as master
docker compose -f tests/functional/docker-compose.yml exec -T mysql1 \
    mysql -uroot -ptestpass -e "STOP REPLICA; RESET REPLICA ALL; SET GLOBAL read_only=0;" 2>/dev/null
docker compose -f tests/functional/docker-compose.yml exec -T mysql2 \
    mysql -uroot -ptestpass -e "STOP REPLICA; CHANGE REPLICATION SOURCE TO SOURCE_HOST='mysql1', SOURCE_PORT=3306, SOURCE_USER='repl', SOURCE_PASSWORD='repl_pass', SOURCE_AUTO_POSITION=1; START REPLICA; SET GLOBAL read_only=1;" 2>/dev/null
docker compose -f tests/functional/docker-compose.yml exec -T mysql3 \
    mysql -uroot -ptestpass -e "STOP REPLICA; CHANGE REPLICATION SOURCE TO SOURCE_HOST='mysql1', SOURCE_PORT=3306, SOURCE_USER='repl', SOURCE_PASSWORD='repl_pass', SOURCE_AUTO_POSITION=1; START REPLICA; SET GLOBAL read_only=1;" 2>/dev/null

# Reset ProxySQL
docker compose -f tests/functional/docker-compose.yml exec -T proxysql \
    mysql -h127.0.0.1 -P6032 -uradmin -pradmin -e \
    "DELETE FROM mysql_servers WHERE hostgroup_id IN (10,20); INSERT INTO mysql_servers (hostgroup_id,hostname,port) VALUES (10,'mysql1',3306),(20,'mysql2',3306),(20,'mysql3',3306); LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;" 2>/dev/null

# Re-discover after topology change
sleep 5
curl -s "$ORC_URL/api/discover/mysql1/3306" > /dev/null
curl -s "$ORC_URL/api/discover/mysql2/3306" > /dev/null
curl -s "$ORC_URL/api/discover/mysql3/3306" > /dev/null
sleep 15

echo "Topology restored, waiting for orchestrator to stabilize..."
pass "Topology restored for hard failover test"

⚠️ Potential issue | 🟠 Major

Validate the restored topology before starting the hard-failover case.

This section records PASS after a couple of sleeps, but none of the MySQL or ProxySQL rewiring is actually verified. If restore drifted, the DeadMaster test starts from the wrong topology and its outcome is misleading. Re-check tests/functional/lib.sh helpers mysql_read_only, mysql_source_host, and proxysql_servers here before continuing.

Suggested fix
 # Re-discover after topology change
 sleep 5
 curl -s "$ORC_URL/api/discover/mysql1/3306" > /dev/null
 curl -s "$ORC_URL/api/discover/mysql2/3306" > /dev/null
 curl -s "$ORC_URL/api/discover/mysql3/3306" > /dev/null
 sleep 15
 
 echo "Topology restored, waiting for orchestrator to stabilize..."
-pass "Topology restored for hard failover test"
+RO1=$(mysql_read_only mysql1)
+RO2=$(mysql_read_only mysql2)
+RO3=$(mysql_read_only mysql3)
+SRC2=$(mysql_source_host mysql2)
+SRC3=$(mysql_source_host mysql3)
+HG10=$(proxysql_servers 10)
+HG20=$(proxysql_servers 20)
+
+if [ "$RO1" = "0" ] && [ "$RO2" = "1" ] && [ "$RO3" = "1" ] && \
+   [ "$SRC2" = "mysql1" ] && [ "$SRC3" = "mysql1" ] && \
+   echo "$HG10" | grep -q "mysql1" && \
+   echo "$HG20" | grep -q "mysql2" && \
+   echo "$HG20" | grep -q "mysql3"; then
+    pass "Topology restored for hard failover test"
+else
+    fail "Topology restore validation failed" \
+         "RO=$RO1/$RO2/$RO3 SRC=$SRC2/$SRC3 HG10=$HG10 HG20=$HG20"
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-failover.sh` around lines 78 - 99, The script restores
MySQL and ProxySQL but doesn't verify the resulting topology; add explicit
checks using the helper functions mysql_read_only, mysql_source_host, and
proxysql_servers (from tests/functional/lib.sh) before declaring PASS and
proceeding to the hard-failover case. After the discovery curls and sleep, call
mysql_read_only(mysql1) to assert mysql1 is writable (read_only=0),
mysql_read_only(mysql2/mysql3) to assert they are read-only, use
mysql_source_host(mysql2) and mysql_source_host(mysql3) to assert their source
is mysql1, and use proxysql_servers() (or proxysql_servers for hostgroup 10/20)
to verify ProxySQL backends were applied; if any check fails, print diagnostics
and exit non‑zero instead of passing.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
.github/workflows/functional.yml (1)

56-61: Consider polling orchestrator health instead of fixed sleep.

A 3-second sleep may be insufficient if orchestrator takes longer to initialize (database setup, port binding, etc.). A health check poll would be more reliable.

♻️ Example: Poll health endpoint
         bin/orchestrator -config tests/functional/orchestrator-test.conf.json http > /tmp/orchestrator-test.log 2>&1 &
         echo "$!" > /tmp/orchestrator-test.pid
-        sleep 3
+        for i in $(seq 1 30); do
+          curl -sf http://localhost:3000/api/health >/dev/null && break
+          sleep 1
+        done

Adjust the port and endpoint to match your orchestrator configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/functional.yml around lines 56 - 61, Replace the fixed
"sleep 3" wait with a health-check polling loop that calls the orchestrator's
HTTP health endpoint (after starting bin/orchestrator -config
tests/functional/orchestrator-test.conf.json http) until it returns success or a
timeout is reached; ensure you still write the PID to
/tmp/orchestrator-test.pid, stream or capture /tmp/orchestrator-test.log for
debug on failure, and fail the workflow if the health check times out. Use the
started process (bin/orchestrator) and its HTTP health endpoint (adjust
port/path as needed) for polling with a short interval and a reasonable overall
timeout so tests only proceed once the orchestrator is actually ready.
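
A generic polling helper makes the intent explicit and fails the job on timeout; a sketch is below, where the probe command, port 3000, and /api/health path are assumptions taken from the suggestion above, not verified configuration:

```shell
#!/bin/bash
# wait_for: run a probe command once per second until it succeeds or
# the attempt budget is exhausted; returns non-zero on timeout so the
# workflow step fails instead of proceeding against a dead service.
wait_for() {
    probe=$1
    attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if $probe >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "wait_for: '$probe' did not succeed within ${attempts}s" >&2
    return 1
}

# Hypothetical probe; adjust port and path to the real orchestrator config.
orchestrator_ready() {
    curl -sf http://localhost:3000/api/health
}
```

For example, `wait_for orchestrator_ready 30 || exit 1` right after starting the binary keeps the PID/log handling unchanged while gating the tests on readiness.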
tests/functional/setup-replication.sh (1)

20-20: Consider preserving stderr for debugging.

Suppressing stderr during retries hides diagnostic information (like the authentication error that caused the pipeline failure). Consider logging to a temp file or only suppressing on non-final attempts.

♻️ Example: Log errors on final retry
-        " 2>/dev/null && break
+        " 2>&1 | { [ "$i" -lt 30 ] && cat >/dev/null || cat; } && break

Or, more simply: remove 2>/dev/null and accept the noise during retries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/setup-replication.sh` at line 20, The retry loop in
setup-replication.sh suppresses stderr with the redirection "2>/dev/null &&
break", hiding useful diagnostics; update the loop so stderr is preserved for
the final attempt (e.g., write stderr to a temp file and only discard it on
non-final retries, or remove the "2>/dev/null" entirely) so that authentication
and other errors are visible when retries fail; locate the retry construct
containing the literal "2>/dev/null && break" and modify the redirection logic
to log or retain stderr on the final try.
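
One way to keep the diagnostics without per-attempt noise is to capture stderr to a temp file and only print it once all retries are exhausted; a sketch (the `retry` helper and its interface are illustrative, not existing project code):

```shell
#!/bin/bash
# retry N CMD...: run CMD up to N times, one second apart, capturing
# stderr on each attempt; on total failure, print the last attempt's
# stderr so the root cause (e.g. an authentication error) is visible.
retry() {
    attempts=$1
    shift
    errlog=$(mktemp)
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" 2>"$errlog"; then
            rm -f "$errlog"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "retry: failed after $attempts attempts; last stderr:" >&2
    cat "$errlog" >&2
    rm -f "$errlog"
    return 1
}
```

The setup loop could then invoke its mysql command through `retry 30 ...` and still surface the replication error that originally broke the pipeline.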
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/functional/setup-replication.sh`:
- Around line 12-22: The replication setup fails due to caching_sha2_password
requiring RSA public key exchange; update the CHANGE REPLICATION SOURCE TO
statement used in the setup loop to include SOURCE_PUBLIC_KEY='...' by adding
the MySQL option GET_SOURCE_PUBLIC_KEY=1 (or the equivalent SOURCE_PUBLIC_KEY
setting) so the replica can fetch the source public key for authentication;
modify the block that runs CHANGE REPLICATION SOURCE TO / START REPLICA (the
command that sets
SOURCE_HOST/SOURCE_PORT/SOURCE_USER/SOURCE_PASSWORD/SOURCE_AUTO_POSITION) to
include the public-key option so START REPLICA succeeds without Last_IO_Errno
2061.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dc0cdc68-486b-4fe1-b9c4-e9db66e395f1

📥 Commits

Reviewing files that changed from the base of the PR and between 4e3df8a and 607b618.

📒 Files selected for processing (5)
  • .github/workflows/functional.yml
  • tests/functional/docker-compose.yml
  • tests/functional/mysql/init-replica.sql
  • tests/functional/mysql/replica2.cnf
  • tests/functional/setup-replication.sh
✅ Files skipped from review due to trivial changes (2)
  • tests/functional/mysql/replica2.cnf
  • tests/functional/docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/functional/mysql/init-replica.sql

…scovery

- Remove set -e from test scripts (tests handle failures themselves)
- Discover topology dynamically (cluster name varies by environment)
- Seed all replicas during discovery, not just master
- Use CLUSTER_NAME variable throughout instead of hardcoded names
- Failover test uses dynamic cluster name for takeover API

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (1)
tests/functional/test-failover.sh (1)

79-99: ⚠️ Potential issue | 🟠 Major

Restore phase still reports PASS without validating restored topology.

This is the same unresolved issue previously flagged: the script can proceed to hard-failover from an unverified baseline.

Proposed fix
 # Re-discover after topology change
 sleep 5
 curl -s "$ORC_URL/api/discover/mysql1/3306" > /dev/null
 curl -s "$ORC_URL/api/discover/mysql2/3306" > /dev/null
 curl -s "$ORC_URL/api/discover/mysql3/3306" > /dev/null
 sleep 15
 
 echo "Topology restored, waiting for orchestrator to stabilize..."
-pass "Topology restored for hard failover test"
+RO1=$(mysql_read_only mysql1)
+RO2=$(mysql_read_only mysql2)
+RO3=$(mysql_read_only mysql3)
+SRC2=$(mysql_source_host mysql2)
+SRC3=$(mysql_source_host mysql3)
+HG10=$(proxysql_servers 10)
+HG20=$(proxysql_servers 20)
+
+if [ "$RO1" = "0" ] && [ "$RO2" = "1" ] && [ "$RO3" = "1" ] && \
+   [ "$SRC2" = "mysql1" ] && [ "$SRC3" = "mysql1" ] && \
+   echo "$HG10" | grep -q "mysql1" && \
+   echo "$HG20" | grep -q "mysql2" && \
+   echo "$HG20" | grep -q "mysql3"; then
+  pass "Topology restored for hard failover test"
+else
+  fail "Topology restore validation failed" "RO=$RO1/$RO2/$RO3 SRC=$SRC2/$SRC3 HG10=$HG10 HG20=$HG20"
+  summary
+  exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-failover.sh` around lines 79 - 99, The script currently
calls the discovery endpoints (curl .../api/discover/...) and then immediately
reports success with pass "Topology restored for hard failover test" without
validating the actual cluster state; add a verification step after the
discovery/sleep block and before the pass call that queries the
orchestrator/topology API (or MySQL status) to assert mysql1 is PRIMARY and
mysql2/mysql3 are replicas with replication running and read_only=1, and fail
the test (exit non-zero or call the existing fail helper) if any check fails;
locate where the discovery curls and the pass "Topology restored for hard
failover test" are used and add these checks there.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2a3e65ba-13dc-4c1a-b713-2e394cc66c8b

📥 Commits

Reviewing files that changed from the base of the PR and between 607b618 and 9a15a4b.

📒 Files selected for processing (5)
  • tests/functional/lib.sh
  • tests/functional/setup-replication.sh
  • tests/functional/test-failover.sh
  • tests/functional/test-regression.sh
  • tests/functional/test-smoke.sh
✅ Files skipped from review due to trivial changes (1)
  • tests/functional/lib.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/functional/setup-replication.sh

#!/bin/bash
# Tier B: Failover tests — verify failover and ProxySQL hooks against real services
set -uo pipefail # no -e: we handle failures ourselves
cd "$(dirname "$0")/../.."


⚠️ Potential issue | 🟡 Minor

Add fail-fast handling for repository-root cd.

Continuing after a failed directory change can invalidate all subsequent relative-path commands.

Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-failover.sh` at line 4, The script performs cd
"$(dirname "$0")/../.." without handling failures, which can cause subsequent
relative-path operations to run in the wrong directory; update the
test-failover.sh entry so the repository-root cd is fail-fast by checking its
exit status or enabling errexit (e.g., use a conditional or set -e) so that if
cd "$(dirname "$0")/../.." fails the script exits immediately and does not
continue with invalid relative paths.

#!/bin/bash
# Tier C: Regression tests — verify all API endpoints and features
set -uo pipefail # no -e: we handle failures ourselves
cd "$(dirname "$0")/../.."


⚠️ Potential issue | 🟡 Minor

Handle cd failure explicitly before sourcing project-relative paths.

If this cd fails, subsequent relative paths run in the wrong directory and produce misleading failures.

Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-regression.sh` at line 4, The script currently runs cd
"$(dirname "$0")/../.." without checking its result; update
tests/functional/test-regression.sh to explicitly handle a failing cd by
checking its exit status (e.g., test the return value or use ||) and print a
clear error and exit non‑zero if the cd fails before any further relative path
operations or sourcing; reference the existing cd command in the top of the
script and ensure the script halts immediately on failure to avoid running
subsequent commands in the wrong directory.

#!/bin/bash
# Tier A: Smoke tests — verify basic functionality against real services
set -uo pipefail # no -e: we handle failures ourselves
cd "$(dirname "$0")/../.."


⚠️ Potential issue | 🟡 Minor

Guard the repository-root cd with || exit.

Without this, path-sensitive checks may execute from the wrong working directory.

Proposed fix
-cd "$(dirname "$0")/../.."
+cd "$(dirname "$0")/../.." || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 4-4: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-smoke.sh` at line 4, The repository-root cd command (cd
"$(dirname "$0")/../..") is unguarded and can leave the script running from the
wrong directory; update that invocation to fail fast by appending a guard such
as || exit 1 (or || { echo "failed to change directory"; exit 1; }) so the
script stops if the chdir fails and subsequent path-sensitive checks run from
the intended repository root.

Comment on lines +27 to +31
INST_COUNT=$(curl -s "$ORC_URL/api/cluster/$CLUSTER_NAME" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null || echo "0")
if [ "$INST_COUNT" -ge 2 ]; then
pass "Instances discovered: $INST_COUNT"
else
fail "Instances discovered: $INST_COUNT (expected >= 2)"


⚠️ Potential issue | 🟠 Major

Smoke discovery threshold is too weak for the intended 3-node topology.

This currently passes with only 2 discovered instances, which can mask partial topology failures in Tier A.

Proposed fix
-INST_COUNT=$(curl -s "$ORC_URL/api/cluster/$CLUSTER_NAME" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null || echo "0")
-if [ "$INST_COUNT" -ge 2 ]; then
+INST_COUNT=$(curl -s "$ORC_URL/api/cluster/$CLUSTER_NAME" 2>/dev/null | python3 -c "import json,sys; print(len(json.load(sys.stdin)))" 2>/dev/null || echo "0")
+if [ "$INST_COUNT" -ge 3 ]; then
     pass "Instances discovered: $INST_COUNT"
 else
-    fail "Instances discovered: $INST_COUNT (expected >= 2)"
+    fail "Instances discovered: $INST_COUNT (expected >= 3)"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-smoke.sh` around lines 27 - 31, The smoke test
currently treats INST_COUNT >= 2 as success which is too low for the intended
3‑node topology; update the conditional that checks INST_COUNT (the curl +
python3 JSON length pipeline assigning INST_COUNT) so it requires at least 3
(e.g., change the numeric comparator from 2 to 3) and adjust the pass/fail
messages produced by pass and fail to reflect the new expected minimum of 3
instances; keep using the same variables (INST_COUNT, ORC_URL, CLUSTER_NAME) and
the existing pass/fail helpers.

Orchestrator needs to reach MySQL containers by their Docker hostnames
(mysql1, mysql2, mysql3). Running it on the host requires /etc/hosts hacks.

Solution: Add orchestrator as a Docker Compose service that:
- Builds from source using golang:1.25.7 image
- Joins the same network as MySQL and ProxySQL
- Port-forwards 3099 to host for test scripts

Also: expose MySQL and ProxySQL ports for host-side CLI tests.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c9d952b2-1693-48dd-889c-bf252f022fc4

📥 Commits

Reviewing files that changed from the base of the PR and between 9a15a4b and ef5cc19.

📒 Files selected for processing (3)
  • .github/workflows/functional.yml
  • tests/functional/docker-compose.yml
  • tests/functional/test-smoke.sh
✅ Files skipped from review due to trivial changes (1)
  • tests/functional/docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/functional.yml


# Prerequisites
wait_for_orchestrator || { echo "FATAL: Orchestrator not reachable"; exit 1; }
discover_topology "mysql1"


⚠️ Potential issue | 🟠 Major

Handle discover_topology failure explicitly.

Line 11 ignores the function’s exit code. If discovery times out, the suite continues and can produce misleading downstream results instead of recording the root failure.

Suggested fix
-discover_topology "mysql1"
+if ! discover_topology "mysql1"; then
+    fail "Topology discovery timed out" "Expected 3 discovered instances within timeout"
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-smoke.sh` at line 11, The call to discover_topology
currently ignores its exit code so failures are swallowed; update the invocation
to check its return status and fail the test early (e.g., run discover_topology
"mysql1" and if it returns non-zero then log an error and exit non‑zero).
Specifically, after calling discover_topology, test its exit code (or use `||`
to short-circuit) and call the existing test-failure/cleanup routine or `exit 1`
so the suite stops on discovery timeout; ensure you reference the
discover_topology invocation and use the same logging/teardown helpers used
elsewhere in the script.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/functional/test-regression.sh (1)

30-35: Consider extracting the HTTP status code check into a reusable helper.

The manual curl call for the 404 test works correctly, but if similar error-code validations are needed elsewhere, a helper like test_status_code in lib.sh could improve consistency. This is a minor observation; the current implementation is functional.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-regression.sh` around lines 30 - 35, Extract the
repeated HTTP status check into a reusable helper (e.g., add function
test_status_code in lib.sh) and replace the inline curl+comparison in the test
with a call to that helper; specifically, move the curl invocation (currently
assigned to V2_404 and compared to "404") into test_status_code which accepts a
URL and expected code, and update this test to call test_status_code
"$ORC_URL/api/v2/instances/nonexistent/9999" 404 so the logic is centralized and
can be reused elsewhere.
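
If such a helper were added to lib.sh, it could look like the sketch below; `test_status_code` and `http_status` are hypothetical names, and the minimal `pass`/`fail` stand-ins only approximate the real reporters:

```shell
#!/bin/bash
# Minimal stand-ins for lib.sh's reporters (assumed interface).
pass() { echo "PASS: $1"; }
fail() { echo "FAIL: $1"; }

# Fetch only the HTTP status code for a URL.
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1" 2>/dev/null
}

# test_status_code URL EXPECTED: pass when the response code matches.
test_status_code() {
    url=$1
    expected=$2
    actual=$(http_status "$url")
    if [ "$actual" = "$expected" ]; then
        pass "HTTP $expected from $url"
    else
        fail "expected HTTP $expected from $url, got $actual"
    fi
}
```

The 404 regression check would then reduce to `test_status_code "$ORC_URL/api/v2/instances/nonexistent/9999" 404`.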

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 451f234a-dbf6-40ba-8557-61a142374d92

📥 Commits

Reviewing files that changed from the base of the PR and between ef5cc19 and 0533931.

📒 Files selected for processing (3)
  • tests/functional/test-failover.sh
  • tests/functional/test-regression.sh
  • tests/functional/test-smoke.sh

Comment on lines +152 to +158
# Check recovery via API
RECOVERY_API=$(curl -s "$ORC_URL/api/v2/recoveries" 2>/dev/null)
if echo "$RECOVERY_API" | grep -q '"IsSuccessful":true'; then
pass "Recovery audit: /api/v2/recoveries shows successful recovery"
else
fail "Recovery audit: no successful recovery in API response"
fi


⚠️ Potential issue | 🟡 Minor

JSON grep pattern may be fragile with whitespace.

The grep for "IsSuccessful":true on line 154 will fail if the JSON response is pretty-printed with spaces (e.g., "IsSuccessful": true). Since Python is already available and used elsewhere in this script, consider using it for reliable JSON parsing.

Proposed fix
 # Check recovery via API
 RECOVERY_API=$(curl -s "$ORC_URL/api/v2/recoveries" 2>/dev/null)
-if echo "$RECOVERY_API" | grep -q '"IsSuccessful":true'; then
+HAS_SUCCESS=$(echo "$RECOVERY_API" | python3 -c "
+import json, sys
+d = json.load(sys.stdin)
+for r in d.get('data', []):
+    if r.get('IsSuccessful'):
+        print('YES')
+        sys.exit(0)
+print('NO')
+" 2>/dev/null)
+if [ "$HAS_SUCCESS" = "YES" ]; then
     pass "Recovery audit: /api/v2/recoveries shows successful recovery"
 else
     fail "Recovery audit: no successful recovery in API response"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-failover.sh` around lines 152 - 158, The current grep
against RECOVERY_API for the literal string '"IsSuccessful":true' is fragile to
JSON whitespace; replace the grep check with a small Python JSON parse that
reads RECOVERY_API and tests whether any object in the response has IsSuccessful
== True (use the RECOVERY_API variable, parse it with json.loads, and exit 0/1
accordingly), and then use that exit code to call pass/fail; locate the block
using the RECOVERY_API variable and the recovery API check to implement this
change.

