Clean up stale gh-aw gateway containers before binding port on self-hosted runners#39998
Conversation
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
|
Hey A few things to address before this is ready for review:
If you'd like a hand completing this, assign the following prompt to your coding agent:
|
… runners - Add --name awmg-mcpg to docker run in mcp_setup_generator.go so the container can be identified and stopped by name - Add pre-start cleanup (docker rm -f awmg-mcpg) in start_mcp_gateway.sh and start_mcp_gateway.cjs to remove any stale container before launch - Restructure stop_mcp_gateway.sh to run docker stop/rm before PID check so containers get cleaned up even if PID was not captured - Update golden test files and integration test assertions - Regenerate all 250 .lock.yml workflow files via make recompile Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR aims to prevent MCP gateway startup failures on persistent self-hosted runners by ensuring stale gh-aw gateway containers don’t keep host port 8080 bound between jobs. It does this by introducing a stable container name (awmg-mcpg) and adding pre-start cleanup logic, plus updating generated workflow artifacts to match the new docker run shape.
Changes:
- Add
--name awmg-mcpgto the generated MCP gatewaydocker runcommand so the container can be reliably managed by name. - Add “stale container” cleanup (
docker rm -f awmg-mcpg … || true) before starting the gateway (both Bash and JS implementations). - Update tests and compiled
.lock.ymlworkflow outputs to reflect the updated docker command.
Show a summary per file
| File | Description |
|---|---|
| pkg/workflow/mcp_setup_generator.go | Adds --name awmg-mcpg to the generated gateway container command. |
| pkg/workflow/mcp_setup_generator_test.go | Asserts the generated docker command contains the new container name flag. |
| actions/setup/sh/start_mcp_gateway.sh | Adds pre-start removal of a stale awmg-mcpg container. |
| actions/setup/js/start_mcp_gateway.cjs | Adds pre-start removal of a stale awmg-mcpg container (JS runner). |
| actions/setup/sh/stop_mcp_gateway.sh | Changes gateway teardown ordering to stop/remove by container name earlier. |
| actions/setup/sh/start_mcp_gateway_test.sh | Adds a shell test that checks the stale-container cleanup exists. |
| pkg/workflow/testdata/TestWasmGolden_CompileFixtures/basic-copilot.golden | Updates golden output to include --name awmg-mcpg. |
| pkg/workflow/testdata/TestWasmGolden_CompileFixtures/playwright-cli-mode.golden | Updates golden output to include --name awmg-mcpg. |
| pkg/workflow/testdata/TestWasmGolden_CompileFixtures/smoke-copilot.golden | Updates golden output to include --name awmg-mcpg. |
| pkg/workflow/testdata/TestWasmGolden_CompileFixtures/with-imports.golden | Updates golden output to include --name awmg-mcpg. |
| pkg/workflow/testdata/TestWasmGolden_AllEngines/claude.golden | Updates golden output to include --name awmg-mcpg. |
| pkg/workflow/testdata/TestWasmGolden_AllEngines/codex.golden | Updates golden output to include --name awmg-mcpg. |
| pkg/workflow/testdata/TestWasmGolden_AllEngines/copilot.golden | Updates golden output to include --name awmg-mcpg. |
| pkg/workflow/testdata/TestWasmGolden_AllEngines/gemini.golden | Updates golden output to include --name awmg-mcpg. |
| actions/setup-cli/install.sh | Updates the script sync header comment. |
| .github/workflows/workflow-normalizer.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/workflow-health-manager.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/workflow-generator.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/video-analyzer.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/update-astro.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/tidy.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/test-workflow.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/test-quality-sentinel.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/test-project-url-default.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/test-dispatcher.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/super-linter.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/spec-librarian.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/spec-extractor.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/smoke-project.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/smoke-pi.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/smoke-copilot-sdk.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/smoke-ci.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/security-review.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/security-compliance.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/ruflo-backed-task.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/research.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/repo-tree-map.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/release.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/refiner.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/q.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/python-data-charts.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/pr-description-caveman.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/plan.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/pdf-summary.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/outcome-collector.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/org-health-report.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/objective-impact-report.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/metrics-collector.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/mergefest.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/linter-miner.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/lint-monster.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/layout-spec-maintainer.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/jsweep.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/gpclean.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/firewall.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/firewall-escape.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/example-permissions-warning.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/example-failure-category-filter.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/dev-hawk.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/designer-drift-audit.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/dependabot-worker.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/delight.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/daily-windows-terminal-integration-builder.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/daily-team-status.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/daily-safeoutputs-git-simulator.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/daily-max-ai-credits-test.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/daily-credit-limit-test.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/daily-byok-ollama-test.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/craft.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/copilot-opt.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/copilot-centralization-optimizer.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/copilot-centralization-drilldown.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/contribution-check.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/code-simplifier.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/ci-coach.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/brave.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/bot-detection.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/artifacts-summary.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/architecture-guardian.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/archie.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/agentic-token-trend-audit.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/agentic-token-optimizer.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/agentic-token-audit.lock.yml | Regenerates compiled workflow output with updated docker command. |
| .github/workflows/ace-editor.lock.yml | Regenerates compiled workflow output with updated docker command. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 265/265 changed files
- Comments generated: 3
| # Stop the named gateway container first — this is more reliable than killing the | ||
| # docker run wrapper PID because it asks the Docker daemon to stop the container | ||
| # directly. This ensures the host port is freed even on persistent self-hosted | ||
| # runners where a prior kill of the wrapper process left the container running. | ||
| # This runs unconditionally (even when GATEWAY_PID is empty) so that partially | ||
| # started containers are cleaned up if the start step failed before capturing PID. | ||
| echo "Stopping awmg-mcpg container..." | ||
| docker stop awmg-mcpg 2>/dev/null || docker rm -f awmg-mcpg 2>/dev/null || true | ||
|
|
| # On persistent self-hosted runners a prior job's gateway container may still be | ||
| # running and holding the host port, causing "bind: address already in use" when | ||
| # we try to start the new one. Force-removing by the well-known container name | ||
| # is idempotent and safe: if no such container exists the command exits 0. |
| // On persistent self-hosted runners a prior job's gateway container may still | ||
| // be running and holding the host port, causing "bind: address already in use" | ||
| // when we try to start the new one. Force-removing by the well-known container | ||
| // name is idempotent and safe: if no such container exists Docker exits 0. |
|
✅ PR Code Quality Reviewer completed the code quality review. |
|
✅ Design Decision Gate 🏗️ completed the design decision gate check. No ADR enforcement needed: PR #39998 does not have the 'implementation' label (has_implementation_label=false) and has only 12 new lines of code in business logic directories (≤100 threshold). Neither enforcement condition is met. |
|
✅ Test Quality Sentinel completed test quality analysis. |
|
🧠 Matt Pocock Skills Reviewer has completed the skills-based review. ✅ |
🧪 Test Quality Sentinel Report
📊 Metrics & Test Classification (1 test analyzed)
Go: 1 ( Score Breakdown
Test AnalysisThe modification adds one new The shell test addition ( Note: Verdict
|
There was a problem hiding this comment.
🧠 Reviewed using Matt Pocock's skills by Matt Pocock Skills Reviewer
Comments that could not be inline-anchored
actions/setup/sh/stop_mcp_gateway.sh:21
[/diagnose] docker stop || docker rm -f only runs the force-remove if docker stop fails. When docker stop succeeds the container enters stopped state, and release of the name depends on async --rm daemon cleanup — a brief window where a subsequent docker run --name awmg-mcpg could fail with name already in use. The pre-start docker rm -f in start_mcp_gateway.sh closes this window, so no bug today — but for a cleanup path whose sole purpose is freeing the name/port, a single …
actions/setup/sh/start_mcp_gateway.sh:87
[/diagnose] The validation block checks for -i, --rm, and --network flags but not --name awmg-mcpg. If MCP_GATEWAY_DOCKER_COMMAND is ever set manually (or generated by an older/patched compiler), the cleanup at line 159 silently no-ops (no container to remove), the new container starts unnamed, and future stop scripts can't tear it down by name. Adding a validation keeps the invariant visible:
<details>
<summary>💡 Suggested validation</summary>
if ! echo "$MCP_GATEWAY_D…
</details>
<details><summary>pkg/workflow/mcp_setup_generator_test.go:551</summary>
**[/tdd]** `containerNameSnippet` gets a `require.Contains` check, but unlike every other flag snippet (lines 552–560 all have `require.Less` ordering guards), there's no assertion that `--name awmg-mcpg` appears before `--add-host`. A future refactor could silently reorder the flags and still pass this test.
<details>
<summary>💡 Suggested ordering assertion</summary>
```go
require.Less(t, strings.Index(yamlStr, containerNameSnippet), strings.Index(yamlStr, addHostSnippet),
"--name awmg-…
</details>
<details><summary>actions/setup/sh/start_mcp_gateway_test.sh:307</summary>
**[/tdd]** The test greps for the presence of `docker rm -f awmg-mcpg` and `awmg-mcpg` in the script, but doesn't verify that cleanup actually runs *before* the gateway launch. A later edit could move the cleanup after the `$MCP_GATEWAY_DOCKER_COMMAND` invocation and this test would still pass.
<details>
<summary>💡 Stronger ordering check</summary>
```bash
CLEANUP_LINE=$(grep -n 'docker rm -f awmg-mcpg' "$SCRIPT_PATH" | head -1 | cut -d: -f1)
LAUNCH_LINE=$(grep -n 'MCP_GATEWAY_DOCKER_COMMAND…
</details>There was a problem hiding this comment.
REQUEST_CHANGES — The approach is correct and the pre-start cleanup in start_mcp_gateway.sh / .cjs is solid, but stop_mcp_gateway.sh has a blocking correctness issue plus a latency concern.
Blocking issues (must fix before merge)
1. Container stopped before graceful /close endpoint (existing comment, restated for emphasis)
docker stop awmg-mcpg is placed before the /close endpoint attempt and before the $GATEWAY_PID process check. Once the container is stopped, the docker run wrapper ($GATEWAY_PID) exits, the /close curl call always fails (gateway is gone), and the remaining PID-based logic becomes a no-op. This makes every workflow teardown a forced shutdown — the graceful mechanism added for spec compliance is now dead code.
Fix: move the docker stop block to after the /close-then-SIGTERM fallback chain, as the final cleanup step.
2. docker stop default 10-second grace period adds latency (new comment on line 21)
Even after fixing the ordering, docker stop without --time 0 waits up to 10 s for the container to respond to SIGTERM before force-killing. At this point in the script graceful shutdown has already been attempted; using docker stop --time 0 (or equivalent) avoids a silent 10-second CI stall when the container is stuck.
Non-blocking (existing comments)
- Comment in
start_mcp_gateway.shandstart_mcp_gateway.cjsinaccurately statesdocker rm -fexits 0 when the container doesn't exist — it exits non-zero; the|| trueis what makes it idempotent.
🔎 Code quality review by PR Code Quality Reviewer
| # This runs unconditionally (even when GATEWAY_PID is empty) so that partially | ||
| # started containers are cleaned up if the start step failed before capturing PID. | ||
| echo "Stopping awmg-mcpg container..." | ||
| docker stop awmg-mcpg 2>/dev/null || docker rm -f awmg-mcpg 2>/dev/null || true |
There was a problem hiding this comment.
docker stop uses the default 10-second grace period, adding up to 10 s of latency on every workflow teardown if the container is unresponsive.
💡 Suggested fix
This call is a force-cleanup (the /close endpoint is the intended graceful path). Pass --time 0 so Docker skips the SIGTERM wait and immediately forces the container to exit:
docker stop --time 0 awmg-mcpg 2>/dev/null || docker rm -f awmg-mcpg 2>/dev/null || trueThis is orthogonal to the ordering issue already flagged: even after moving this block to after the /close attempt, you still want --time 0 here because graceful shutdown has already been attempted and this is the unconditional cleanup step.
|
@copilot run pr-finisher skill |
- stop_mcp_gateway.sh: use EXIT trap for container cleanup so /close endpoint is attempted first; container teardown still runs on every exit path (missing PID, process gone, successful /close, kill fallback) - start_mcp_gateway.sh: fix misleading comment about docker rm -f idempotency; clarify that || true provides the idempotency, not Docker - start_mcp_gateway.cjs: same comment fix as the shell script Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Done in commit
|
|
🚀 Smoke Gemini MISSION COMPLETE! Gemini has spoken. ✨ |
|
🎬 THE END — Smoke Claude MISSION: ACCOMPLISHED! The hero saves the day! ✨ |
|
📰 BREAKING: Smoke Copilot - AOAI (Entra) is now investigating this pull request. Sources say the story is developing... |
|
📰 BREAKING: Smoke Copilot is now investigating this pull request. Sources say the story is developing... |
|
📰 BREAKING: Smoke Copilot - AOAI (apikey) is now investigating this pull request. Sources say the story is developing... |
|
Commit pushed:
|
Agent Container Tool Check
Result: 12/12 tools available ✅ Overall Status: PASS
|
|
✅ Test 1: GitHub MCP Warning Firewall blocked 1 domainThe following domain was blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "localhost"See Network Configuration for more information.
|
|
Smoke test 27758408912 Warning Firewall blocked 6 domainsThe following domains were blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "accounts.google.com"
- "android.clients.google.com"
- "clients2.google.com"
- "contentautofill.googleapis.com"
- "safebrowsingohttpgateway.googleapis.com"
- "www.google.com"See Network Configuration for more information.
|
Comment MemoryNote This comment is managed by comment memory.It stores persistent context for this thread in the code block at the top of this comment. Warning Firewall blocked 6 domainsThe following domains were blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "accounts.google.com"
- "android.clients.google.com"
- "clients2.google.com"
- "contentautofill.googleapis.com"
- "safebrowsingohttpgateway.googleapis.com"
- "www.google.com"See Network Configuration for more information.
|
|
Smoke run: PASS Warning Firewall blocked 6 domainsThe following domains were blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "accounts.google.com"
- "android.clients.google.com"
- "clients2.google.com"
- "contentautofill.googleapis.com"
- "safebrowsingohttpgateway.googleapis.com"
- "www.google.com"See Network Configuration for more information.
|
Comment MemoryNote This comment is managed by comment memory.It stores persistent context for this thread in the code block at the top of this comment. Warning Firewall blocked 6 domainsThe following domains were blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "accounts.google.com"
- "android.clients.google.com"
- "clients2.google.com"
- "contentautofill.googleapis.com"
- "safebrowsingohttpgateway.googleapis.com"
- "www.google.com"See Network Configuration for more information.
|
There was a problem hiding this comment.
Smoke review done. Cave eyes saw 2 anchors.
Warning
Firewall blocked 6 domains
The following domains were blocked by the firewall during workflow execution:
accounts.google.comandroid.clients.google.comclients2.google.comcontentautofill.googleapis.comsafebrowsingohttpgateway.googleapis.comwww.google.com
To allow these domains, add them to the
network.allowedlist in your workflow frontmatter:
network:
allowed:
- defaults
- "accounts.google.com"
- "android.clients.google.com"
- "clients2.google.com"
- "contentautofill.googleapis.com"
- "safebrowsingohttpgateway.googleapis.com"
- "www.google.com"See Network Configuration for more information.
📰 BREAKING: Report filed by Smoke Copilot
🧪 Smoke Test: Claude — Run 27758408911Core #1-12: ✅ all passed Overall: PARTIAL (0 failed, 1 skipped) — Claude engine nominal! 💥 Warning Firewall blocked 6 domainsThe following domains were blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "accounts.google.com"
- "android.clients.google.com"
- "clients2.google.com"
- "contentautofill.googleapis.com"
- "safebrowsingohttpgateway.googleapis.com"
- "www.google.com"See Network Configuration for more information.
|
There was a problem hiding this comment.
💥 Automated smoke test review - all systems nominal!
Warning
Firewall blocked 6 domains
The following domains were blocked by the firewall during workflow execution:
accounts.google.comandroid.clients.google.comclients2.google.comcontentautofill.googleapis.comsafebrowsingohttpgateway.googleapis.comwww.google.com
To allow these domains, add them to the
network.allowedlist in your workflow frontmatter:
network:
allowed:
- defaults
- "accounts.google.com"
- "android.clients.google.com"
- "clients2.google.com"
- "contentautofill.googleapis.com"
- "safebrowsingohttpgateway.googleapis.com"
- "www.google.com"See Network Configuration for more information.
💥 [THE END] — Illustrated by Smoke Claude
| @@ -0,0 +1,5 @@ | |||
| --- | |||
| "gh-aw": patch | |||
There was a problem hiding this comment.
Good: pinning the changeset type to patch matches the bugfix scope of this PR.
| "gh-aw": patch | ||
| --- | ||
|
|
||
| Fixed stale `gh-aw` gateway containers on self-hosted runners so port 8080 is cleaned up before the next gateway start. |
There was a problem hiding this comment.
Clear changeset summary — it explains the port 8080 cleanup behavior well.
|
Smoke tests failed: 1✅ 2❌ 3❌ 4❌ 5✅ 6✅ overall FAIL. See issue #40047.
|
On persistent self-hosted runners, leftover
ghcr.io/github/gh-aw-mcpgcontainers from prior jobs hold port 8080, causing the next job to fail at gateway startup withbind: address already in use. The prior stop logic (kill $GATEWAY_PID) does not stop the Docker container — only the wrapper process — leaving the gateway bound to the host port.Changes
mcp_setup_generator.go: Add--name awmg-mcpgto the generateddocker runcommand, enabling reliable container management by name (same pattern asawmg-proxy/awmg-cli-proxy).start_mcp_gateway.sh/start_mcp_gateway.cjs: Add pre-start cleanup (docker rm -f awmg-mcpg 2>/dev/null || true) before launching the container. Idempotent — no-ops when no stale container exists.stop_mcp_gateway.sh: Movedocker stop awmg-mcpg/docker rm -f awmg-mcpgbefore theGATEWAY_PIDempty-check so container teardown happens unconditionally, even if the start step never captured a PID.Tests / golden files: Update
mcp_setup_generator_test.go,start_mcp_gateway_test.sh, WASM golden snapshots, and all 250 compiled.lock.ymlworkflow files to reflect the new docker command shape.Generated docker command (before → after)
Changeset
gh-awgateway containers on self-hosted runners so port 8080 is cleaned up before the next gateway start.✨ PR Review Safe Output Test - Run 27758408911
Warning
Firewall blocked 6 domains
The following domains were blocked by the firewall during workflow execution:
accounts.google.comandroid.clients.google.comclients2.google.comcontentautofill.googleapis.comsafebrowsingohttpgateway.googleapis.comwww.google.comSee Network Configuration for more information.