🏥 CI Failure Investigation - Run #36032
Summary
The Integration: CLI Completion & Other job fails because TestMCPRegistryClient_LiveGetServer now calls the live MCP registry, and the registry is returning 503 upstream connect error or disconnect/reset before headers with a delayed connect failure, so the test cannot fetch io.github.netdata/mcp-server.
Failure Details
- Run: 22068117409
- Commit: 5e5b9d282752b1430867cdc76a09603348c08d4c
- Trigger: push
Root Cause Analysis
TestMCPRegistryClient_LiveGetServer connects to the live MCP registry while exercising GetServer; the registry returned 503 upstream connect error or disconnect/reset before headers with the latest retry reporting delayed connect error: Connection refused, so the subtest cannot complete.
- Both subtests (get_github_server and get_nonexistent_server) assert specific output but receive the same 503, which is treated as a failure instead of being skipped or mocked.
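One way to tell an upstream outage apart from a genuine assertion failure is to classify the error text before asserting on it. The following is a minimal sketch, not the project's code: the function name looksLikeOutage is hypothetical, the markers are copied from the error text in this run, and the 502/504 entries are an added assumption about related gateway errors. Matching on strings is used only because the client's error types are not shown in this report.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeOutage reports whether an error message resembles an upstream
// registry outage rather than a real test failure. The markers come from the
// error text in this run; 502 and 504 are assumed to be equally transient.
func looksLikeOutage(msg string) bool {
	markers := []string{
		"returned status 502",
		"returned status 503",
		"returned status 504",
		"upstream connect error",
		"Connection refused",
	}
	for _, m := range markers {
		if strings.Contains(msg, m) {
			return true
		}
	}
	return false
}

func main() {
	outage := "MCP registry returned status 503: upstream connect error or disconnect/reset before headers"
	notFound := "server 'example' not found in registry" // hypothetical message
	fmt.Println(looksLikeOutage(outage))   // true
	fmt.Println(looksLikeOutage(notFound)) // false
}
```

In a test, a true result could lead to t.Skipf instead of t.Fatalf, so that a 503 from the registry is reported as a skip rather than a regression.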
Failed Jobs and Errors
- Integration: CLI Completion & Other: TestMCPRegistryClient_LiveGetServer/get_github_server
  mcp_registry_live_test.go:141: GetServer failed for 'io.github.netdata/mcp-server': MCP registry returned status 503: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused
- Integration: CLI Completion & Other: TestMCPRegistryClient_LiveGetServer/get_nonexistent_server
  mcp_registry_live_test.go:175: Expected error to contain 'not found in registry', got: MCP registry returned status 503: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: Connection refused
Investigation Findings
- Running go test -v -tags integration ./pkg/cli -run TestMCPRegistryClient_LiveGetServer against the live registry reproduces the 503/delayed connect error because the test looks up io.github.netdata/mcp-server and the registry is currently refusing connections.
- The integration suite therefore fails before reporting a specific test since the package-level run detects the panic/failure and aborts, logging that no individual test passed cleanly.
Prevention Strategies
- Avoid calling production MCP services directly from CI without handling known failure modes (503s, connection refused, etc.) and mark the tests as flaky or skipped when the service is down.
- Use local stubs or recorded fixtures for MCP responses in GitHub Actions so network availability does not gate the whole suite.
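A local stub along the lines suggested above can be built with net/http/httptest from the Go standard library. This is a sketch under stated assumptions: the /servers/ URL layout, the JSON body, and the names newStubRegistry and fetch are hypothetical and do not reflect the real registry API or the project's client.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newStubRegistry starts an in-process HTTP server that stands in for the
// MCP registry. The URL layout and JSON shape are hypothetical.
func newStubRegistry() *httptest.Server {
	known := map[string]string{
		"io.github.netdata/mcp-server": `{"name":"io.github.netdata/mcp-server"}`,
	}
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		name := strings.TrimPrefix(r.URL.Path, "/servers/")
		body, ok := known[name]
		if !ok {
			// Mirrors the 'not found in registry' text the live test expects.
			http.Error(w, "not found in registry", http.StatusNotFound)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		io.WriteString(w, body)
	}))
}

// fetch issues a GET against baseURL and returns the status code and body.
func fetch(baseURL, name string) (int, string, error) {
	resp, err := http.Get(baseURL + "/servers/" + name)
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, "", err
	}
	return resp.StatusCode, string(data), nil
}

func main() {
	srv := newStubRegistry()
	defer srv.Close()

	code, body, err := fetch(srv.URL, "io.github.netdata/mcp-server")
	fmt.Println(code, body, err)

	code, body, err = fetch(srv.URL, "does/not-exist")
	fmt.Println(code, strings.TrimSpace(body), err)
}
```

Pointing the client under test at srv.URL keeps both the found and not-found paths deterministic in GitHub Actions, while a separate opt-in job can still exercise the live registry.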
AI Team Self-Improvement
- When generating tests that talk to MCP or other external services, guard them with explicit skip/retry logic and explain that 5xx/delayed connect errors should not be treated as regressions.
- Prefer mocking remote MCP responses in CI workflows so the tests stay deterministic even if the upstream service is temporarily unreachable.
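The skip/retry guard described above might look like the following sketch. All names here (errUnavailable, callWithRetry, skipIfUnavailable, the skipper interface) are hypothetical and introduced for illustration; the skipper interface is simply the one method of *testing.T that the helper needs, which lets the logic run outside a test binary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUnavailable marks a transient upstream failure (hypothetical sentinel).
var errUnavailable = errors.New("registry unavailable")

// skipper is the subset of *testing.T used by skipIfUnavailable.
type skipper interface {
	Skipf(format string, args ...any)
}

// callWithRetry runs fn up to attempts times, sleeping delay between tries.
// It retries only while fn reports errUnavailable; success or any other
// error is returned immediately.
func callWithRetry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errUnavailable) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return err
}

// skipIfUnavailable skips the test when err is still a transient failure
// after retries. With a real *testing.T, Skipf does not return.
func skipIfUnavailable(t skipper, err error) {
	if errors.Is(err, errUnavailable) {
		t.Skipf("registry unreachable, skipping live test: %v", err)
	}
}

// recordingSkipper is a stand-in for *testing.T used in this demo.
type recordingSkipper struct{ skipped bool }

func (r *recordingSkipper) Skipf(format string, args ...any) {
	r.skipped = true
	fmt.Println("SKIP:", fmt.Sprintf(format, args...))
}

func main() {
	calls := 0
	err := callWithRetry(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return fmt.Errorf("status 503: %w", errUnavailable)
		}
		return nil
	})
	fmt.Println(calls, err)

	err = callWithRetry(2, 10*time.Millisecond, func() error { return errUnavailable })
	s := &recordingSkipper{}
	skipIfUnavailable(s, err)
	fmt.Println(s.skipped)
}
```

Bounding the retries keeps the job from hanging on a prolonged outage, and skipping only on the sentinel means a wrong response from a healthy registry still fails the test.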
🩺 Diagnosis provided by CI Failure Doctor
To install this workflow, run gh aw add githubnext/agentics/workflows/ci-doctor.md@ea350161ad5dcc9624cf510f134c6a9e39a6f94d. View source at https://github.com/githubnext/agentics/tree/ea350161ad5dcc9624cf510f134c6a9e39a6f94d/workflows/ci-doctor.md.
Recommended Actions
- TestMCPRegistryClient_LiveGetServer (and similar MCP live tests): 5xx/delayed-connect responses should be skipped or stubbed instead of failing the suite, e.g., detect the 503 and mark the test as skipped when the registry is unreachable.
Historical Context
- A TestMCPRegistryClient_LiveGetServer failure caused by the MCP registry returning a 503 has been investigated before; see #15700 for the prior investigation.