CI: Advanced failover tests and PostgreSQL version matrix #86
renecannao merged 3 commits into master
Conversation
📝 Walkthrough
Added three new advanced MySQL failover test scenarios (chain topology, errant GTID, co-master) to the functional test suite. Parameterized the PostgreSQL Docker image version in tests/functional/docker-compose.yml. Updated the GitHub Actions workflow to run PostgreSQL tests against versions 15, 16, and 17 with a matrix strategy, and integrated the new MySQL failover test.
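The matrix strategy described in the walkthrough might look roughly like the following in the workflow file. This is a sketch under assumptions: the job and step layout here is hypothetical, and only the version list (15, 16, 17), the `pg_version` matrix key, and the `PG_IMAGE` variable are taken from this PR.

```yaml
# Hypothetical excerpt of .github/workflows/functional.yml — job/step names
# are assumptions; the version list comes from the PR description.
jobs:
  functional-postgresql:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        pg_version: [15, 16, 17]
    steps:
      - uses: actions/checkout@v4
      - name: Run PostgreSQL tests
        run: bash tests/functional/test-postgresql.sh
        env:
          PG_IMAGE: postgres:${{ matrix.pg_version }}
```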
Sequence Diagrams

```mermaid
sequenceDiagram
    participant TestScript as Test 1: Chain<br/>Failover
    participant MySQL as MySQL<br/>Containers
    participant Orchestrator as Orchestrator
    participant API as /api/v2/<br/>recoveries
    TestScript->>MySQL: Setup chain topology<br/>(mysql1→mysql2→mysql3)
    MySQL-->>TestScript: Topology verified
    TestScript->>MySQL: Stop mysql2<br/>(intermediate master)
    TestScript->>Orchestrator: Monitor recovery detection
    Orchestrator->>MySQL: Detect topology change
    Orchestrator->>API: Log DeadIntermediateMaster<br/>recovery
    TestScript->>API: Poll until recovery<br/>analysis found
    API-->>TestScript: Recovery event (90s timeout)
    TestScript->>MySQL: Verify mysql3 relocated<br/>to mysql1 replica
    MySQL-->>TestScript: Validation complete
```

```mermaid
sequenceDiagram
    participant TestScript as Test 2: Errant<br/>GTID
    participant MySQL as MySQL<br/>Containers
    participant Orchestrator as Orchestrator
    participant API as /api/instance/<br/>mysql2/3306
    TestScript->>MySQL: Disable read_only on mysql2
    TestScript->>MySQL: Create errant_test database<br/>(errant GTID)
    TestScript->>MySQL: Re-enable read_only
    TestScript->>Orchestrator: Trigger discovery
    Orchestrator->>MySQL: Inspect mysql2 GTID state
    TestScript->>API: Poll GtidErrant field
    API-->>TestScript: GtidErrant detected (60s timeout)
    TestScript->>TestScript: Validation complete
    TestScript->>MySQL: Drop errant_test database<br/>cleanup
```

```mermaid
sequenceDiagram
    participant TestScript as Test 3: Co-Master<br/>Failover
    participant MySQL as MySQL<br/>Containers
    participant Orchestrator as Orchestrator
    participant API as /api/instance/<br/>mysql1/3306
    TestScript->>MySQL: Setup co-master<br/>(mysql1↔mysql2)
    TestScript->>Orchestrator: Trigger discovery
    Orchestrator->>MySQL: Detect co-master pair
    TestScript->>API: Check IsCoMaster flag
    API-->>TestScript: Co-master confirmed
    TestScript->>MySQL: Stop mysql2
    TestScript->>Orchestrator: Monitor recovery
    Orchestrator->>API: Log DeadCoMaster recovery
    TestScript->>API: Poll recoveries endpoint
    API-->>TestScript: Recovery found (90s timeout)
    TestScript->>MySQL: Reset mysql1 replication<br/>cleanup topology
    TestScript->>Orchestrator: Rediscover all instances
    TestScript->>TestScript: Validation complete
```
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~50 minutes

The new test script introduces substantial multi-step logic across three distinct failover scenarios with dense orchestrator API interactions and error handling. Workflow and docker-compose changes are simpler configuration updates, but the heterogeneous nature of the diff (configuration, parameterization, and complex test logic) combined with the length and complexity of the test script drives the overall effort.
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request introduces environment variable support for PostgreSQL images in the functional test environment and adds a new test suite, test-failover-advanced.sh, which covers intermediate master failures, errant GTID detection, and co-master failover scenarios. The review feedback focuses on improving the maintainability of the new shell script by extracting repeated logic into a helper function, ensuring consistent read-only state configurations during cleanup, and simplifying redundant conditional checks.
```shell
source tests/functional/lib.sh

COMPOSE="docker compose -f tests/functional/docker-compose.yml"
```
The logic for waiting for a MySQL container to become reachable is repeated multiple times in this script. Defining a helper function improves maintainability and ensures consistent error handling across the test suite.
```shell
# Helper to wait for a MySQL container to be ready
wait_for_mysql() {
  local HOST="$1"
  for i in $(seq 1 30); do
    if $COMPOSE exec -T "$HOST" mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```
```shell
for HOST in mysql1 mysql2 mysql3; do
  for i in $(seq 1 30); do
    if $COMPOSE exec -T "$HOST" mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
      break
    fi
    sleep 1
  done
done
```
Use the wait_for_mysql helper function to simplify the code and ensure the script fails fast if the initial setup cannot reach the containers.
```diff
-for HOST in mysql1 mysql2 mysql3; do
-  for i in $(seq 1 30); do
-    if $COMPOSE exec -T "$HOST" mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
-      break
-    fi
-    sleep 1
-  done
-done
+# Wait for all MySQL containers to be reachable
+for HOST in mysql1 mysql2 mysql3; do
+  wait_for_mysql "$HOST" || { echo "FATAL: $HOST not reachable"; exit 1; }
+done
```
```shell
for i in $(seq 1 30); do
  if $COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
    break
  fi
  sleep 1
done
```
Use the wait_for_mysql helper function here to reduce code duplication.
```diff
-for i in $(seq 1 30); do
-  if $COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
-    break
-  fi
-  sleep 1
-done
+wait_for_mysql mysql2 || fail "mysql2 not reachable after restart"
```

(The helper already retries internally, so it should not be wrapped in another retry loop.)
```shell
$COMPOSE exec -T mysql3 \
  mysql -uroot -ptestpass -e "$STOP_SQL $CHANGE_TO_MYSQL1 $START_SQL" 2>/dev/null
```
This cleanup step is inconsistent with the initial setup and other cleanup blocks (e.g., line 305). It should explicitly set read_only and super_read_only to ensure the replica is in the correct state after relocation.
```diff
 $COMPOSE exec -T mysql3 \
-  mysql -uroot -ptestpass -e "$STOP_SQL $CHANGE_TO_MYSQL1 $START_SQL" 2>/dev/null
+  mysql -uroot -ptestpass -e "$STOP_SQL $CHANGE_TO_MYSQL1 $START_SQL SET GLOBAL read_only=1; SET GLOBAL super_read_only=1;" 2>/dev/null
```
```shell
for i in $(seq 1 30); do
  if $COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
    break
  fi
  sleep 1
done
```
Use the wait_for_mysql helper function here to reduce code duplication.
```diff
-for i in $(seq 1 30); do
-  if $COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
-    break
-  fi
-  sleep 1
-done
+wait_for_mysql mysql2 || fail "mysql2 not reachable after restart"
```

(As above, the helper retries internally, so no outer loop is needed.)
Force-pushed from e81056f to fa42679
Force-pushed from cfab30d to a6c5739
- Add test-failover-advanced.sh with intermediate master failure, errant GTID detection, and co-master failover scenarios
- Parameterize PostgreSQL image in docker-compose.yml (PG_IMAGE env)
- Add PG version matrix (15, 16, 17) to functional-postgresql CI job
All curl calls lacked --max-time, causing potential test hangs if any API endpoint was slow to respond.
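The `PG_IMAGE` parameterization mentioned in the commit message above might look roughly like this in `tests/functional/docker-compose.yml`. This is a sketch: only the `PG_IMAGE` variable name comes from the PR; the service name, default tag, and environment settings here are assumptions.

```yaml
# Hypothetical excerpt of tests/functional/docker-compose.yml — service
# details are assumed; only PG_IMAGE is taken from the commit message.
services:
  postgres:
    image: "${PG_IMAGE:-postgres:16}"
    environment:
      POSTGRES_PASSWORD: testpass
```

Compose substitutes `PG_IMAGE` from the environment, so the CI matrix can export `PG_IMAGE=postgres:15` (or 16, 17) per job without touching the file.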
Force-pushed from 972f2c2 to 9666f57
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/functional.yml (1)
172-255: ⚠️ Potential issue | 🟠 Major
Let the PostgreSQL matrix fail the job.
Line 255 still keeps `continue-on-error: true`, so regressions on postgres 15/16/17 will still report green. That means the new matrix does not actually enforce the multi-version support promised by issue #83.

Suggested fix:

```diff
       - name: Run PostgreSQL tests
         run: bash tests/functional/test-postgresql.sh
-        continue-on-error: true  # PG support is new — iterating on CI stability
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/functional.yml around lines 172 - 255, The "Run PostgreSQL tests" workflow step currently sets continue-on-error: true which masks failures in the PostgreSQL matrix; update the "Run PostgreSQL tests" step (the step with the name "Run PostgreSQL tests" and the setting continue-on-error: true) to remove that key or set it to false so that failures in the pg_version matrix (postgres 15/16/17) cause the job to fail and surface regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/functional/test-failover-advanced.sh`:
- Around line 252-282: The test only waits for a DeadCoMaster recovery record
(using RECOVERED and the /api/v2/recoveries poll) but doesn't verify actual
topology changes; after detecting recovery (when RECOVERED=true) add a follow-up
polling loop that queries the orchestrator topology (e.g.,
ORC_URL/api/v2/topology or the relevant instance status endpoint) for up to N
seconds and assert the expected state: that mysql1 is no longer an active
co-master/upstream, mysql3 is in the expected role/replica position, and the
cluster has a single promoted master if applicable; fail the test if the
topology does not converge within the timeout and include a debug dump of the
topology JSON as before.
- Around line 166-211: The test only asserts GtidErrant on /api/instance but
never exercises the /api/errant-gtid and /api/locate-gtid-errant APIs nor clears
the errant GTID state on mysql2, leaving mysql2 dirty; after detecting
GTID_ERRANT you should call the orchestrator endpoints (ORC_URL/api/errant-gtid
and ORC_URL/api/locate-gtid-errant) to validate and clear the errant entry and
then restore the replica state on mysql2 (remove errant history and reset any
GTID markers for errant_test) before finishing cleanup so subsequent tests are
not affected. Ensure the sequence happens only when ERRANT_DETECTED is true and
include checks that the API calls succeeded before declaring cleanup/pass.
---
Outside diff comments:
In @.github/workflows/functional.yml:
- Around line 172-255: The "Run PostgreSQL tests" workflow step currently sets
continue-on-error: true which masks failures in the PostgreSQL matrix; update
the "Run PostgreSQL tests" step (the step with the name "Run PostgreSQL tests"
and the setting continue-on-error: true) to remove that key or set it to false
so that failures in the pg_version matrix (postgres 15/16/17) cause the job to
fail and surface regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c4118674-8f40-4dda-b788-536211c90570
📒 Files selected for processing (3)
.github/workflows/functional.yml
tests/functional/docker-compose.yml
tests/functional/test-failover-advanced.sh
```shell
# Inject an errant transaction on mysql2
$COMPOSE exec -T mysql2 \
  mysql -uroot -ptestpass -e "
    SET GLOBAL read_only=0;
    SET GLOBAL super_read_only=0;
    CREATE DATABASE IF NOT EXISTS errant_test;
    SET GLOBAL read_only=1;
    SET GLOBAL super_read_only=1;
  " 2>/dev/null

echo "Waiting for orchestrator to detect errant GTID (max 60s)..."
ERRANT_DETECTED=false
for i in $(seq 1 60); do
  # Force a refresh
  curl -s --max-time 10 "$ORC_URL/api/discover/mysql2/3306" > /dev/null 2>&1
  sleep 2
  GTID_ERRANT=$(curl -s --max-time 10 "$ORC_URL/api/instance/mysql2/3306" 2>/dev/null | python3 -c "
import json, sys
d = json.load(sys.stdin)
errant = d.get('GtidErrant', '')
print(errant)
" 2>/dev/null || echo "")
  if [ -n "$GTID_ERRANT" ]; then
    echo "Errant GTID detected after ${i}s: $GTID_ERRANT"
    ERRANT_DETECTED=true
    break
  fi
done

if [ "$ERRANT_DETECTED" = "true" ]; then
  pass "Errant GTID detected on mysql2: $GTID_ERRANT"
else
  fail "Errant GTID not detected on mysql2 within 60s"
fi

# Cleanup: remove the errant database
$COMPOSE exec -T mysql2 \
  mysql -uroot -ptestpass -e "
    SET GLOBAL read_only=0;
    SET GLOBAL super_read_only=0;
    DROP DATABASE IF EXISTS errant_test;
    SET GLOBAL read_only=1;
    SET GLOBAL super_read_only=1;
  " 2>/dev/null

pass "Test 2 cleanup: errant database removed"
```
Exercise the errant-GTID APIs and restore replica state afterward.
This scenario only asserts that GtidErrant becomes non-empty on /api/instance/mysql2/3306. Issue #82 explicitly calls for validating /api/errant-gtid and locate-gtid-errant, and Lines 201-209 only drop the schema — they do not clear the errant GTID history. So the test misses part of the acceptance criteria and leaves mysql2 logically dirty for the later cases in this job.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional/test-failover-advanced.sh` around lines 166 - 211, The test
only asserts GtidErrant on /api/instance but never exercises the
/api/errant-gtid and /api/locate-gtid-errant APIs nor clears the errant GTID
state on mysql2, leaving mysql2 dirty; after detecting GTID_ERRANT you should
call the orchestrator endpoints (ORC_URL/api/errant-gtid and
ORC_URL/api/locate-gtid-errant) to validate and clear the errant entry and then
restore the replica state on mysql2 (remove errant history and reset any GTID
markers for errant_test) before finishing cleanup so subsequent tests are not
affected. Ensure the sequence happens only when ERRANT_DETECTED is true and
include checks that the API calls succeeded before declaring cleanup/pass.
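A hedged sketch of the follow-up step the prompt asks for. The endpoint paths (`/api/errant-gtid`, `/api/locate-gtid-errant`) are taken from the review comment and issue #82, but the exact URL shape (including the `mysql2/3306` suffix) is an assumption about the orchestrator API, not a verified signature:

```shell
# Sketch: exercise the errant-GTID endpoints only after detection succeeded.
# ORC_URL and ERRANT_DETECTED are assumed to be set by the surrounding test;
# the endpoint paths below are assumptions based on the review comment.
exercise_errant_gtid_apis() {
  if [ "$ERRANT_DETECTED" != "true" ]; then
    return 0  # nothing to validate yet
  fi
  # -f makes curl fail on HTTP errors so the test cannot pass silently.
  curl -sf --max-time 10 "$ORC_URL/api/errant-gtid/mysql2/3306" >/dev/null \
    || { echo "FAIL: /api/errant-gtid"; return 1; }
  curl -sf --max-time 10 "$ORC_URL/api/locate-gtid-errant/mysql2/3306" >/dev/null \
    || { echo "FAIL: /api/locate-gtid-errant"; return 1; }
  echo "OK"
}

# With detection not yet confirmed, the function is a no-op:
ERRANT_DETECTED=false
exercise_errant_gtid_apis && echo "skipped (not detected)"
```

Gating on `ERRANT_DETECTED` keeps the extra API calls from masking a failed detection, and the `-f` flag ensures a non-2xx response fails the cleanup rather than passing it.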
```shell
echo "Waiting for orchestrator to detect DeadCoMaster and recover (max 90s)..."
RECOVERED=false
for i in $(seq 1 90); do
  RECOVERIES=$(curl -s --max-time 10 "$ORC_URL/api/v2/recoveries" 2>/dev/null)
  HAS_RECOVERY=$(echo "$RECOVERIES" | python3 -c "
import json, sys
d = json.load(sys.stdin)
data = d.get('data', [])
for r in data:
    a = r.get('AnalysisEntry', {}).get('Analysis', '')
    s = r.get('IsSuccessful', False)
    if 'DeadCoMaster' in a and s:
        print('RECOVERED')
        sys.exit(0)
print('WAITING')
" 2>/dev/null)
  if [ "$HAS_RECOVERY" = "RECOVERED" ]; then
    echo "DeadCoMaster recovery detected after ${i}s"
    RECOVERED=true
    break
  fi
  sleep 1
done

if [ "$RECOVERED" = "true" ]; then
  pass "DeadCoMaster detected and recovered"
else
  fail "DeadCoMaster: no recovery detected within 90s"
  echo "  DEBUG: Recent recoveries:"
  curl -s --max-time 10 "$ORC_URL/api/v2/recoveries" 2>/dev/null | python3 -m json.tool 2>/dev/null | head -30
fi
```
Assert the post-recovery topology, not just the recovery record.
A successful DeadCoMaster entry in /api/v2/recoveries only proves orchestrator emitted a recovery event. The test never verifies that the topology actually converged afterward, so this can go green while mysql1 is still attached upstream, still marked co-master, or mysql3 is no longer where you expect it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional/test-failover-advanced.sh` around lines 252 - 282, The test
only waits for a DeadCoMaster recovery record (using RECOVERED and the
/api/v2/recoveries poll) but doesn't verify actual topology changes; after
detecting recovery (when RECOVERED=true) add a follow-up polling loop that
queries the orchestrator topology (e.g., ORC_URL/api/v2/topology or the relevant
instance status endpoint) for up to N seconds and assert the expected state:
that mysql1 is no longer an active co-master/upstream, mysql3 is in the expected
role/replica position, and the cluster has a single promoted master if
applicable; fail the test if the topology does not converge within the timeout
and include a debug dump of the topology JSON as before.
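In the spirit of the prompt above, a convergence check could poll `/api/instance/mysql1/3306` until the `IsCoMaster` flag clears. `IsCoMaster` is the field the test already reads; everything else here is an assumption, and a canned payload stands in for the real curl call so the parsing step runs standalone:

```shell
# Sketch of the post-recovery convergence check. In the real test,
# INSTANCE_JSON would come from:
#   curl -s --max-time 10 "$ORC_URL/api/instance/mysql1/3306"
# The payload below is a hypothetical stand-in for illustration.
INSTANCE_JSON='{"IsCoMaster": false, "Key": {"Hostname": "mysql1", "Port": 3306}}'

CONVERGED=$(echo "$INSTANCE_JSON" | python3 -c "
import json, sys
d = json.load(sys.stdin)
# Converged once the surviving node no longer carries the co-master flag.
print('yes' if not d.get('IsCoMaster', False) else 'no')
")
echo "converged=$CONVERGED"
```

Wrapping this in a bounded retry loop (like the existing 90s polls) and failing with a topology dump when it never prints `yes` would close the gap the reviewer describes.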
Summary
Changes
- test-failover-advanced.sh with chain topology, errant GTID, and co-master tests
- docker-compose.yml with PG_IMAGE env var
- functional-postgresql CI job with unique artifact names

Depends on #85
Closes #82, closes #83
Test plan
Summary by CodeRabbit
Tests
Chores