
CI: Advanced failover tests and PostgreSQL version matrix (#86)

Merged
renecannao merged 3 commits into master from issue82-83-failover-pg-matrix on Apr 9, 2026
Conversation


renecannao commented Apr 9, 2026

Summary

Changes

  • Created test-failover-advanced.sh with chain topology, errant GTID, and co-master tests
  • Parameterized docker-compose.yml with PG_IMAGE env var
  • Added PG version matrix (15, 16, 17) to functional-postgresql CI job with unique artifact names
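Based on the changes listed above, the matrix and image wiring might look roughly like this in the workflow (a sketch only; step names, the `pg_version` key, and the artifact path are assumptions, not the exact file):

```yaml
# Hedged sketch of the functional-postgresql job shape; see
# .github/workflows/functional.yml in the diff for the real definition.
functional-postgresql:
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false          # let all PG versions run even if one fails
    matrix:
      pg_version: [15, 16, 17]
  env:
    PG_IMAGE: postgres:${{ matrix.pg_version }}
  steps:
    - uses: actions/checkout@v4
    - name: Run PostgreSQL tests
      run: bash tests/functional/test-postgresql.sh
    - name: Upload logs
      if: always()
      uses: actions/upload-artifact@v4
      with:
        name: postgresql-logs-pg-${{ matrix.pg_version }}  # unique per version
        path: tests/functional/logs/                       # hypothetical path
```

The version suffix on the artifact name is what keeps the three matrix legs from clobbering each other's uploads.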

Depends on #85
Closes #82, closes #83

Test plan

  • Intermediate master failure: chain topology → kill intermediate → replica relocated
  • Errant GTID: inject transaction on replica → orchestrator detects GtidErrant
  • Co-master failover: circular replication → kill one co-master → recovery
  • PostgreSQL tests pass on PG 15, 16, and 17

Summary by CodeRabbit

  • Tests

    • Added comprehensive advanced failover test scenarios for MySQL to validate recovery mechanisms
    • Expanded PostgreSQL compatibility testing across versions 15, 16, and 17 with automated test artifact management
  • Chores

    • Improved testing infrastructure with flexible database image configuration supporting multiple versions


coderabbitai bot commented Apr 9, 2026

Warning

Rate limit exceeded

@renecannao has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 17 minutes and 24 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 17 minutes and 24 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d0262808-e905-440e-a998-975f0c197b40

📥 Commits

Reviewing files that changed from the base of the PR and between 9666f57 and 7237681.

📒 Files selected for processing (1)
  • tests/functional/test-named-channels.sh
📝 Walkthrough

Added three new advanced MySQL failover test scenarios (chain topology, errant GTID, co-master) to the functional test suite. Parameterized PostgreSQL Docker image version in tests/functional/docker-compose.yml. Updated GitHub Actions workflow to run PostgreSQL tests against versions 15, 16, and 17 with a matrix strategy, and integrated the new MySQL failover test.

Changes

  • GitHub Actions Workflow Configuration (.github/workflows/functional.yml): Added a test-failover-advanced.sh step for MySQL. Introduced a PostgreSQL version matrix (15/16/17) with fail-fast: false. Wired the PG_IMAGE=postgres:${{ matrix.pg_version }} environment variable into the PostgreSQL-related steps. Updated artifact names for PostgreSQL logs to include a version suffix (-pg-${{ matrix.pg_version }}).
  • Docker Compose Parameterization (tests/functional/docker-compose.yml): Parameterized the PostgreSQL image for the pgprimary and pgstandby1 services from the fixed postgres:17 to ${PG_IMAGE:-postgres:17}, enabling version override with a default fallback.
  • Advanced Failover Test Suite (tests/functional/test-failover-advanced.sh): New comprehensive test script covering three failover scenarios: (1) DeadIntermediateMaster detection with replica relocation in a chain topology, (2) errant GTID injection and detection via the orchestrator API, (3) co-master setup, failure, and recovery. Uses Docker Compose orchestration, shared helper functions, and JSON parsing for orchestrator API validation.
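All three scenarios share the same poll-until-found shape around the orchestrator API. A self-contained sketch of that pattern, with a stub standing in for the real `curl` call (the stub, its argument, and the simulated third-poll success are illustrative assumptions):

```shell
#!/bin/sh
# Poll-until-found loop as used by the failover tests. fetch_recoveries is a
# stub for: curl -s --max-time 10 "$ORC_URL/api/v2/recoveries"
# Here it simulates the recovery record appearing on the third poll.
fetch_recoveries() {
  if [ "$1" -ge 3 ]; then
    echo '{"Analysis":"DeadIntermediateMaster","IsSuccessful":true}'
  else
    echo '[]'
  fi
}

RECOVERED=false
i=0
while [ "$i" -lt 90 ]; do          # 90s budget, matching the test's timeout
  i=$((i + 1))
  if fetch_recoveries "$i" | grep -q DeadIntermediateMaster; then
    RECOVERED=true
    break
  fi
  # In the real script this would be: sleep 1
done

echo "recovered=$RECOVERED after $i polls"
```

Breaking out of the loop on first match keeps the happy path fast while the timeout bounds the failure path.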

Sequence Diagrams

sequenceDiagram
    participant TestScript as Test 1: Chain<br/>Failover
    participant MySQL as MySQL<br/>Containers
    participant Orchestrator as Orchestrator
    participant API as /api/v2/<br/>recoveries

    TestScript->>MySQL: Setup chain topology<br/>(mysql1→mysql2→mysql3)
    MySQL-->>TestScript: Topology verified
    TestScript->>MySQL: Stop mysql2<br/>(intermediate master)
    TestScript->>Orchestrator: Monitor recovery detection
    Orchestrator->>MySQL: Detect topology change
    Orchestrator->>API: Log DeadIntermediateMaster<br/>recovery
    TestScript->>API: Poll until recovery<br/>analysis found
    API-->>TestScript: Recovery event (90s timeout)
    TestScript->>MySQL: Verify mysql3 relocated<br/>to mysql1 replica
    MySQL-->>TestScript: Validation complete
sequenceDiagram
    participant TestScript as Test 2: Errant<br/>GTID
    participant MySQL as MySQL<br/>Containers
    participant Orchestrator as Orchestrator
    participant API as /api/instance/<br/>mysql2/3306

    TestScript->>MySQL: Disable read_only on mysql2
    TestScript->>MySQL: Create errant_test database<br/>(errant GTID)
    TestScript->>MySQL: Re-enable read_only
    TestScript->>Orchestrator: Trigger discovery
    Orchestrator->>MySQL: Inspect mysql2 GTID state
    TestScript->>API: Poll GtidErrant field
    API-->>TestScript: GtidErrant detected (60s timeout)
    TestScript->>TestScript: Validation complete
    TestScript->>MySQL: Drop errant_test database<br/>cleanup
sequenceDiagram
    participant TestScript as Test 3: Co-Master<br/>Failover
    participant MySQL as MySQL<br/>Containers
    participant Orchestrator as Orchestrator
    participant API as /api/instance/<br/>mysql1/3306

    TestScript->>MySQL: Setup co-master<br/>(mysql1↔mysql2)
    TestScript->>Orchestrator: Trigger discovery
    Orchestrator->>MySQL: Detect co-master pair
    TestScript->>API: Check IsCoMaster flag
    API-->>TestScript: Co-master confirmed
    TestScript->>MySQL: Stop mysql2
    TestScript->>Orchestrator: Monitor recovery
    Orchestrator->>API: Log DeadCoMaster recovery
    TestScript->>API: Poll recoveries endpoint
    API-->>TestScript: Recovery found (90s timeout)
    TestScript->>MySQL: Reset mysql1 replication<br/>cleanup topology
    TestScript->>Orchestrator: Rediscover all instances
    TestScript->>TestScript: Validation complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

The new test script introduces substantial multi-step logic across three distinct failover scenarios, with dense orchestrator API interactions and error handling. The workflow and docker-compose changes are simpler configuration updates, but the heterogeneous nature of the diff (configuration, parameterization, and complex test logic), combined with the length of the test script, drives the overall effort.

Poem

🐰 Hop, skip, and bounce through failover dreams,
Chain topologies and errant GTID schemes,
Co-masters dancing in orchestrated grace,
Postgres versions racing through time and space—
Recovery resilience tested with care,
From fifteen to seventeen, we're there! 🎉

🚦 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly summarizes the two main changes, advanced failover tests and the PostgreSQL version matrix, matching the primary objectives.
  • Linked Issues Check: ✅ Passed. All coding requirements from #82 and #83 are met: three failover scenarios implemented [#82], PostgreSQL version matrix with PG_IMAGE parameterization added [#83].
  • Out of Scope Changes Check: ✅ Passed. All changes directly support the linked issues: the new failover test script, Docker Compose parameterization, and CI workflow matrix are within scope.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files; docstring coverage check skipped.



gemini-code-assist bot left a comment


Code Review

This pull request introduces environment variable support for PostgreSQL images in the functional test environment and adds a new test suite, test-failover-advanced.sh, which covers intermediate master failures, errant GTID detection, and co-master failover scenarios. The review feedback focuses on improving the maintainability of the new shell script by extracting repeated logic into a helper function, ensuring consistent read-only state configurations during cleanup, and simplifying redundant conditional checks.

source tests/functional/lib.sh

COMPOSE="docker compose -f tests/functional/docker-compose.yml"

medium

The logic for waiting for a MySQL container to become reachable is repeated multiple times in this script. Defining a helper function improves maintainability and ensures consistent error handling across the test suite.

Suggested change
# Helper to wait for a MySQL container to be ready
wait_for_mysql() {
local HOST="$1"
for i in $(seq 1 30); do
if $COMPOSE exec -T "$HOST" mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
return 0
fi
sleep 1
done
return 1
}

Comment on lines +20 to +27
for HOST in mysql1 mysql2 mysql3; do
for i in $(seq 1 30); do
if $COMPOSE exec -T "$HOST" mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
break
fi
sleep 1
done
done
medium

Use the wait_for_mysql helper function to simplify the code and ensure the script fails fast if the initial setup cannot reach the containers.

Suggested change
for HOST in mysql1 mysql2 mysql3; do
for i in $(seq 1 30); do
if $COMPOSE exec -T "$HOST" mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
break
fi
sleep 1
done
done
# Wait for all MySQL containers to be reachable
for HOST in mysql1 mysql2 mysql3; do
wait_for_mysql "$HOST" || { echo "FATAL: $HOST not reachable"; exit 1; }
done

Comment on lines +142 to +147
for i in $(seq 1 30); do
if $COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
break
fi
sleep 1
done
medium

Use the wait_for_mysql helper function here to reduce code duplication.

Suggested change
for i in $(seq 1 30); do
if $COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
break
fi
sleep 1
done
wait_for_mysql mysql2 || fail "mysql2 not reachable after restart"

Comment on lines +151 to +152
$COMPOSE exec -T mysql3 \
mysql -uroot -ptestpass -e "$STOP_SQL $CHANGE_TO_MYSQL1 $START_SQL" 2>/dev/null
medium

This cleanup step is inconsistent with the initial setup and other cleanup blocks (e.g., line 305). It should explicitly set read_only and super_read_only to ensure the replica is in the correct state after relocation.

Suggested change
$COMPOSE exec -T mysql3 \
mysql -uroot -ptestpass -e "$STOP_SQL $CHANGE_TO_MYSQL1 $START_SQL" 2>/dev/null
$COMPOSE exec -T mysql3 \
mysql -uroot -ptestpass -e "$STOP_SQL $CHANGE_TO_MYSQL1 $START_SQL SET GLOBAL read_only=1; SET GLOBAL super_read_only=1;" 2>/dev/null

errant = d.get('GtidErrant', '')
print(errant)
" 2>/dev/null || echo "")
if [ -n "$GTID_ERRANT" ] && [ "$GTID_ERRANT" != "" ]; then
medium

The condition [ -n "$GTID_ERRANT" ] && [ "$GTID_ERRANT" != "" ] is redundant. A single -n check is sufficient to verify that the string is non-empty.

Suggested change
if [ -n "$GTID_ERRANT" ] && [ "$GTID_ERRANT" != "" ]; then
if [ -n "$GTID_ERRANT" ]; then
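For reference, `-n` alone already covers the empty-string case, so the second test can never change the outcome; a quick sketch (the `check` helper is illustrative):

```shell
#!/bin/sh
# [ -n "$s" ] is true exactly when $s is non-empty, so an additional
# [ "$s" != "" ] conjunct adds nothing.
check() {
  if [ -n "$1" ]; then echo nonempty; else echo empty; fi
}

EMPTY_RESULT=$(check "")
NONEMPTY_RESULT=$(check "3E11FA47-71CA-11E1-9E33-C80AA9429562:1-5")

echo "$EMPTY_RESULT $NONEMPTY_RESULT"
```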

Comment on lines +290 to +295
for i in $(seq 1 30); do
if $COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
break
fi
sleep 1
done
medium

Use the wait_for_mysql helper function here to reduce code duplication.

Suggested change
for i in $(seq 1 30); do
if $COMPOSE exec -T mysql2 mysql -uroot -ptestpass -e "SELECT 1" >/dev/null 2>&1; then
break
fi
sleep 1
done
wait_for_mysql mysql2 || fail "mysql2 not reachable after restart"

renecannao force-pushed the issue80-81-mysql-matrix-channels branch 2 times, most recently from e81056f to fa42679 on April 9, 2026 07:33
renecannao force-pushed the issue82-83-failover-pg-matrix branch from cfab30d to a6c5739 on April 9, 2026 09:34
renecannao changed the base branch from issue80-81-mysql-matrix-channels to master on April 9, 2026 10:01
- Add test-failover-advanced.sh with intermediate master failure,
  errant GTID detection, and co-master failover scenarios
- Parameterize PostgreSQL image in docker-compose.yml (PG_IMAGE env)
- Add PG version matrix (15, 16, 17) to functional-postgresql CI job
All curl calls lacked --max-time, causing potential test hangs
if any API endpoint was slow to respond.
renecannao force-pushed the issue82-83-failover-pg-matrix branch from 972f2c2 to 9666f57 on April 9, 2026 13:28

coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/functional.yml (1)

172-255: ⚠️ Potential issue | 🟠 Major

Let the PostgreSQL matrix fail the job.

Line 255 still keeps continue-on-error: true, so regressions on postgres 15/16/17 will still report green. That means the new matrix does not actually enforce the multi-version support promised by issue #83.

Suggested fix
     - name: Run PostgreSQL tests
       run: bash tests/functional/test-postgresql.sh
-      continue-on-error: true  # PG support is new — iterating on CI stability
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/functional.yml around lines 172 - 255, The "Run PostgreSQL
tests" workflow step currently sets continue-on-error: true which masks failures
in the PostgreSQL matrix; update the "Run PostgreSQL tests" step (the step with
the name "Run PostgreSQL tests" and the setting continue-on-error: true) to
remove that key or set it to false so that failures in the pg_version matrix
(postgres 15/16/17) cause the job to fail and surface regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c4118674-8f40-4dda-b788-536211c90570

📥 Commits

Reviewing files that changed from the base of the PR and between c3f54cc and 9666f57.

📒 Files selected for processing (3)
  • .github/workflows/functional.yml
  • tests/functional/docker-compose.yml
  • tests/functional/test-failover-advanced.sh

Comment on lines +166 to +211
# Inject an errant transaction on mysql2
$COMPOSE exec -T mysql2 \
mysql -uroot -ptestpass -e "
SET GLOBAL read_only=0;
SET GLOBAL super_read_only=0;
CREATE DATABASE IF NOT EXISTS errant_test;
SET GLOBAL read_only=1;
SET GLOBAL super_read_only=1;
" 2>/dev/null

echo "Waiting for orchestrator to detect errant GTID (max 60s)..."
ERRANT_DETECTED=false
for i in $(seq 1 60); do
# Force a refresh
curl -s --max-time 10 "$ORC_URL/api/discover/mysql2/3306" > /dev/null 2>&1
sleep 2
GTID_ERRANT=$(curl -s --max-time 10 "$ORC_URL/api/instance/mysql2/3306" 2>/dev/null | python3 -c "
import json, sys
d = json.load(sys.stdin)
errant = d.get('GtidErrant', '')
print(errant)
" 2>/dev/null || echo "")
if [ -n "$GTID_ERRANT" ] && [ "$GTID_ERRANT" != "" ]; then
echo "Errant GTID detected after ${i}s: $GTID_ERRANT"
ERRANT_DETECTED=true
break
fi
done

if [ "$ERRANT_DETECTED" = "true" ]; then
pass "Errant GTID detected on mysql2: $GTID_ERRANT"
else
fail "Errant GTID not detected on mysql2 within 60s"
fi

# Cleanup: remove the errant database
$COMPOSE exec -T mysql2 \
mysql -uroot -ptestpass -e "
SET GLOBAL read_only=0;
SET GLOBAL super_read_only=0;
DROP DATABASE IF EXISTS errant_test;
SET GLOBAL read_only=1;
SET GLOBAL super_read_only=1;
" 2>/dev/null

pass "Test 2 cleanup: errant database removed"

⚠️ Potential issue | 🟠 Major

Exercise the errant-GTID APIs and restore replica state afterward.

This scenario only asserts that GtidErrant becomes non-empty on /api/instance/mysql2/3306. Issue #82 explicitly calls for validating /api/errant-gtid and locate-gtid-errant, and Lines 201-209 only drop the schema — they do not clear the errant GTID history. So the test misses part of the acceptance criteria and leaves mysql2 logically dirty for the later cases in this job.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-failover-advanced.sh` around lines 166 - 211, The test
only asserts GtidErrant on /api/instance but never exercises the
/api/errant-gtid and /api/locate-gtid-errant APIs nor clears the errant GTID
state on mysql2, leaving mysql2 dirty; after detecting GTID_ERRANT you should
call the orchestrator endpoints (ORC_URL/api/errant-gtid and
ORC_URL/api/locate-gtid-errant) to validate and clear the errant entry and then
restore the replica state on mysql2 (remove errant history and reset any GTID
markers for errant_test) before finishing cleanup so subsequent tests are not
affected. Ensure the sequence happens only when ERRANT_DETECTED is true and
include checks that the API calls succeeded before declaring cleanup/pass.

Comment on lines +252 to +282
echo "Waiting for orchestrator to detect DeadCoMaster and recover (max 90s)..."
RECOVERED=false
for i in $(seq 1 90); do
RECOVERIES=$(curl -s --max-time 10 "$ORC_URL/api/v2/recoveries" 2>/dev/null)
HAS_RECOVERY=$(echo "$RECOVERIES" | python3 -c "
import json, sys
d = json.load(sys.stdin)
data = d.get('data', [])
for r in data:
a = r.get('AnalysisEntry', {}).get('Analysis', '')
s = r.get('IsSuccessful', False)
if 'DeadCoMaster' in a and s:
print('RECOVERED')
sys.exit(0)
print('WAITING')
" 2>/dev/null)
if [ "$HAS_RECOVERY" = "RECOVERED" ]; then
echo "DeadCoMaster recovery detected after ${i}s"
RECOVERED=true
break
fi
sleep 1
done

if [ "$RECOVERED" = "true" ]; then
pass "DeadCoMaster detected and recovered"
else
fail "DeadCoMaster: no recovery detected within 90s"
echo " DEBUG: Recent recoveries:"
curl -s --max-time 10 "$ORC_URL/api/v2/recoveries" 2>/dev/null | python3 -m json.tool 2>/dev/null | head -30
fi
⚠️ Potential issue | 🟠 Major

Assert the post-recovery topology, not just the recovery record.

A successful DeadCoMaster entry in /api/v2/recoveries only proves orchestrator emitted a recovery event. The test never verifies that the topology actually converged afterward, so this can go green while mysql1 is still attached upstream, still marked co-master, or mysql3 is no longer where you expect it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional/test-failover-advanced.sh` around lines 252 - 282, The test
only waits for a DeadCoMaster recovery record (using RECOVERED and the
/api/v2/recoveries poll) but doesn't verify actual topology changes; after
detecting recovery (when RECOVERED=true) add a follow-up polling loop that
queries the orchestrator topology (e.g., ORC_URL/api/v2/topology or the relevant
instance status endpoint) for up to N seconds and assert the expected state:
that mysql1 is no longer an active co-master/upstream, mysql3 is in the expected
role/replica position, and the cluster has a single promoted master if
applicable; fail the test if the topology does not converge within the timeout
and include a debug dump of the topology JSON as before.
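A hedged sketch of such a post-recovery check, in the same python3-over-JSON style the script already uses; the canned payload stands in for a live API response, and its field values are assumptions (only the `IsCoMaster` field name is taken from the test's own checks):

```shell
#!/bin/sh
# Post-recovery topology assertion sketch. SAMPLE stands in for the output of:
# curl -s --max-time 10 "$ORC_URL/api/instance/mysql1/3306"
SAMPLE='{"Key":{"Hostname":"mysql1","Port":3306},"IsCoMaster":false,"ReadOnly":false}'

IS_CO_MASTER=$(echo "$SAMPLE" | python3 -c "
import json, sys
d = json.load(sys.stdin)
print('yes' if d.get('IsCoMaster', False) else 'no')
")

if [ "$IS_CO_MASTER" = "no" ]; then
  echo "mysql1 no longer a co-master, topology converged"
else
  echo "mysql1 still marked co-master"
fi
```

In the real test this parse would sit inside the same bounded polling loop used for the recovery record, so a slow convergence fails loudly instead of passing on the recovery event alone.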

@renecannao renecannao merged commit 347a9bb into master Apr 9, 2026
16 checks passed


Development

Successfully merging this pull request may close these issues.

  • CI: PostgreSQL version matrix (15, 16, 17) (#83)
  • CI: Expanded failover scenario tests (#82)

1 participant