From 205ee2c58b6f6dcf68269c2eb200b5e8d640fb68 Mon Sep 17 00:00:00 2001
From: danielmeppiel
Date: Sat, 25 Apr 2026 00:20:37 +0200
Subject: [PATCH 1/2] fix(ci): add merge_group trigger to Merge Gate so it reports in queue

Branch protection / merge-queue ruleset requires the 'gate' check on
both PR-time and merge-queue contexts, but the gate workflow only fired
on 'pull_request'. In the merge queue, GitHub fires 'merge_group' events
against a temp merge commit -- the gate check was never created on that
SHA, so PRs sat in the queue with 'gate' stuck in 'Expected -- Waiting
for status to be reported' indefinitely (observed on PR #899).

Changes
-------

.github/workflows/merge-gate.yml
  - Add 'merge_group' (types: checks_requested) and keep existing
    'pull_request' + 'workflow_dispatch' triggers.
  - Resolve head SHA per event:
      workflow_dispatch -> gh api .../pulls/N --jq .head.sha
      merge_group       -> github.event.merge_group.head_sha
      pull_request      -> github.event.pull_request.head.sha
  - Branch EXPECTED_CHECKS by event:
      pull_request / workflow_dispatch:
        'Build & Test (Linux),APM Self-Check'
      merge_group:
        + 'Build (Linux),Smoke Test (Linux),
           Integration Tests (Linux),Release Validation (Linux)'
        (the merge_group-only checks emitted by ci-integration.yml
         plus the ci.yml checks that also run on merge_group)
  - Bump TIMEOUT_MIN 30 -> 55 and job timeout-minutes 35 -> 60 to absorb
    ci-integration.yml's theoretical worst-case critical path
    (Build -> Smoke -> Integration[20m] -> Release Validation[20m]).
  - Update header comment + recovery instructions to cover both contexts.

.github/scripts/ci/merge_gate_wait.sh
  - Accept new optional EVENT_NAME env var; emit event-aware recovery
    instructions on exit code 2 (in merge_group context, pushing a
    commit does NOT retrigger the merge_group event -- the user must
    re-queue).
  - Add '&filter=latest' to the Checks API query so GitHub returns only
    the latest run per name, removing reliance on client-side sort and
    pagination order.
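
For illustration only (not part of this patch): a minimal shell sketch of
the per-event EXPECTED_CHECKS branching listed above. The workflow itself
selects the list with a GitHub Actions expression; the function name
`expected_checks_for` is hypothetical and exists only for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not in the patch: prints the comma-separated
# check-run names the gate waits for, given the triggering event name.
expected_checks_for() {
  local event="$1"
  # Checks emitted by ci.yml at PR time (and again in the merge queue).
  local pr_checks='Build & Test (Linux),APM Self-Check'
  # Additional checks that run only in merge_group context.
  local queue_checks='Build (Linux),Smoke Test (Linux),Integration Tests (Linux),Release Validation (Linux)'

  case "$event" in
    merge_group) printf '%s,%s\n' "$pr_checks" "$queue_checks" ;;
    *)           printf '%s\n' "$pr_checks" ;;
  esac
}

expected_checks_for pull_request
expected_checks_for merge_group
```

Any event other than merge_group (including workflow_dispatch) falls
through to the PR-time list, and 'gate' appears in neither list.
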

Concurrency
-----------

The existing key
  'merge-gate-${{ github.event.pull_request.number || inputs.pr_number || github.ref }}'
falls through to github.ref in merge_group context. github.ref there is
'refs/heads/gh-readonly-queue/main/pr-N-', unique per queue entry, so
cancel-in-progress dedupes correctly within a single temp branch and
never collides across PR/merge_group channels.

Self-deadlock
-------------

'gate' is intentionally absent from EXPECTED_CHECKS in both contexts.

Audit
-----

Design audited against live GitHub docs:
  - docs.github.com/.../webhook-events-and-payloads#merge_group
  - docs.github.com/.../managing-a-merge-queue
  - docs.github.com/en/rest/checks/runs

Verdict: ship with the event-aware recovery message included here.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/scripts/ci/merge_gate_wait.sh |  23 ++++--
 .github/workflows/merge-gate.yml      | 103 +++++++++++++++++---------
 2 files changed, 87 insertions(+), 39 deletions(-)

diff --git a/.github/scripts/ci/merge_gate_wait.sh b/.github/scripts/ci/merge_gate_wait.sh
index c5b17d91f..7c14d285c 100755
--- a/.github/scripts/ci/merge_gate_wait.sh
+++ b/.github/scripts/ci/merge_gate_wait.sh
@@ -18,10 +18,14 @@
 # Inputs (environment variables):
 #   GH_TOKEN         required. Token with 'checks:read' for the repo.
 #   REPO             required. owner/repo (e.g. microsoft/apm).
-#   SHA              required. Head SHA of the PR.
+#   SHA              required. Head SHA to poll (PR head, merge_group temp
+#                    branch head, or workflow_dispatch-resolved PR head).
 #   EXPECTED_CHECKS  required. Comma-separated list of check-run names to
 #                    wait for. Whitespace around commas is trimmed.
 #                    Example: "Build & Test (Linux),Build (Linux)"
+#   EVENT_NAME       optional. The triggering event ('pull_request',
+#                    'merge_group', 'workflow_dispatch'). Used only to
+#                    emit the right recovery instructions on timeout.
 #   TIMEOUT_MIN      optional. Total wall-clock budget in minutes.
 #                    Default: 30.
 #   POLL_SEC         optional. Poll interval in seconds. Default: 30.
@@ -96,11 +100,13 @@ while [ "$(date +%s)" -lt "$deadline" ]; do
     [ "${check_status[i]}" = "pending" ] || continue
     pending_count=$((pending_count + 1))
 
-    # Filter by check-run name server-side. Most-recent first.
+    # Filter by check-run name server-side, asking GitHub for only the
+    # latest run per name (avoids client-side sort / pagination races
+    # when a check has been re-run on the same SHA).
     encoded=$(jq -rn --arg n "$c" '$n|@uri')
     payload=$(gh api \
       -H "Accept: application/vnd.github+json" \
-      "repos/${REPO}/commits/${SHA}/check-runs?check_name=${encoded}&per_page=10" \
+      "repos/${REPO}/commits/${SHA}/check-runs?check_name=${encoded}&filter=latest&per_page=10" \
       2>/dev/null) || payload='{"check_runs":[]}'
 
     total=$(echo "$payload" | jq '.check_runs | length' 2>/dev/null || echo 0)
@@ -166,8 +172,15 @@ if [ "${#missing[@]}" -gt 0 ]; then
     for c in "${missing[@]}"; do echo " - ${c}"; done
     echo ""
     echo "This usually indicates a transient GitHub Actions webhook delivery failure. Recovery:"
-    echo " 1. Push an empty commit to retrigger: git commit --allow-empty -m 'ci: retrigger' && git push"
-    echo " 2. If that fails, close and reopen the PR."
+    if [ "${EVENT_NAME:-}" = "merge_group" ]; then
+      echo " Merge-queue context: pushing a commit will NOT retrigger the merge_group event."
+      echo " 1. Remove the PR from the merge queue and re-add it."
+      echo " 2. If it still fails, push an empty commit to the PR branch and re-queue:"
+      echo " git commit --allow-empty -m 'ci: retrigger' && git push"
+    else
+      echo " 1. Push an empty commit to retrigger: git commit --allow-empty -m 'ci: retrigger' && git push"
+      echo " 2. If that fails, close and reopen the PR."
+    fi
     echo ""
     echo "Merge Gate catches this failure mode so it surfaces as a clear red check instead of a stuck 'Expected -- Waiting'. See .github/workflows/merge-gate.yml."
   } >&2
diff --git a/.github/workflows/merge-gate.yml b/.github/workflows/merge-gate.yml
index aaa6d85e8..96b8dee92 100644
--- a/.github/workflows/merge-gate.yml
+++ b/.github/workflows/merge-gate.yml
@@ -1,29 +1,42 @@
 # Merge Gate -- single-authority orchestrator that aggregates ALL required
-# PR-time checks into one verdict. Branch protection requires only this
-# check; this workflow polls the Checks API for all underlying checks.
+# checks into one verdict for BOTH PR-time and merge-queue contexts.
+# Branch protection / merge-queue ruleset requires only this check; this
+# workflow polls the Checks API for all underlying checks.
 #
 # Why this file exists:
 #   GitHub's required-status-checks model is name-based, not workflow-based.
-#   Without this gate, branch protection had to require each PR-time check
-#   individually -- adding or renaming a check meant a ruleset edit. With
-#   this gate, branch protection requires only 'gate' (the check-run name
-#   of the job below) and the gate aggregates whatever underlying checks
-#   we declare in EXPECTED_CHECKS. Tide / bors single-authority pattern.
+#   Without this gate, every required check would need to be listed in the
+#   ruleset (and listed again for the merge-queue ruleset) -- adding or
+#   renaming a check meant editing rulesets. With this gate, the ruleset
+#   requires only 'gate' and the gate aggregates whatever underlying checks
+#   we declare in EXPECTED_CHECKS for each event context.
 #
-# Why a single trigger (not dual pull_request + pull_request_target):
+# Why both pull_request and merge_group triggers:
+#   The merge-queue ruleset also requires 'gate'. Without a merge_group
+#   trigger, the gate check would never fire against the merge-queue temp
+#   branch SHA and PRs would sit in the queue with 'gate' stuck in
+#   "Expected -- Waiting for status to be reported" indefinitely
+#   (observed on PR #899). The merge_group trigger fires the same gate
+#   logic against the temp branch SHA and aggregates the merge-queue-time
+#   checks (ci.yml + ci-integration.yml).
+#
+# Why no pull_request_target dual trigger:
 #   We tried dual-trigger redundancy in PR #865 to harden against rare
 #   dropped 'pull_request' webhook deliveries (observed once on PR #856).
 #   It backfired: 'concurrency: cancel-in-progress' produced TWO check-runs
 #   per SHA -- one SUCCESS and one CANCELLED -- which poisons branch
 #   protection's status-check rollup ('CANCELLED' counts as failure ->
-#   PR BLOCKED). No GitHub Actions primitive cleanly de-duplicates checks
-#   across event channels. World-class OSS projects (k8s, rust, deno,
-#   next.js) accept this and use a single trigger plus manual recovery.
+#   PR BLOCKED). pull_request and merge_group are different event channels
+#   that target different SHAs, so they don't collide.
 #
-# Recovery if a 'pull_request' webhook is dropped:
-#   - Push an empty commit: git commit --allow-empty -m 'retrigger' && git push
-#   - Or trigger manually: gh workflow run merge-gate.yml -f pr_number=NNN
-#   - Or close + reopen the PR.
+# Recovery if a webhook is dropped:
+#   pull_request context:
+#     - Push an empty commit: git commit --allow-empty -m 'retrigger' && git push
+#     - Or trigger manually: gh workflow run merge-gate.yml -f pr_number=NNN
+#     - Or close + reopen the PR.
+#   merge_group context (NB: pushing a commit will not retrigger the
+#   merge_group event, only pull_request):
+#     - Remove the PR from the merge queue and re-add it.
 
 name: Merge Gate
 
@@ -34,6 +47,9 @@ on:
       - 'docs/**'
       - '.gitignore'
       - 'LICENSE'
+  merge_group:
+    branches: [ main ]
+    types: [ checks_requested ]
   workflow_dispatch:
     inputs:
       pr_number:
@@ -41,9 +57,11 @@ on:
         required: true
         type: string
 
-# Dedup pushes to the same PR: cancel any older in-flight gate run on
-# the same PR head. Now safe -- only one trigger channel, so cancellations
-# only happen on rapid push-after-push, not on cross-event collisions.
+# Dedup pushes to the same PR / merge-queue entry: cancel any older
+# in-flight gate run on the same head. In merge_group context, github.ref
+# is refs/heads/gh-readonly-queue/main/pr-N-, which is unique per
+# queue entry -- so cancel-in-progress only cancels rapid push-after-push
+# on the same temp branch, never across PR <-> merge_group channels.
 concurrency:
   group: merge-gate-${{ github.event.pull_request.number || inputs.pr_number || github.ref }}
   cancel-in-progress: true
@@ -57,26 +75,37 @@ jobs:
   gate:
     name: gate
     runs-on: ubuntu-24.04
-    timeout-minutes: 35
+    # Job timeout sized above the poll budget (TIMEOUT_MIN below) to leave
+    # headroom for setup/teardown without false-failing the gate.
+    timeout-minutes: 60
     steps:
-      - name: Resolve PR head SHA
+      - name: Resolve head SHA
         id: sha
         env:
           GH_TOKEN: ${{ github.token }}
         run: |
-          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
-            sha=$(gh api "repos/${{ github.repository }}/pulls/${{ inputs.pr_number }}" --jq '.head.sha')
-          else
-            sha="${{ github.event.pull_request.head.sha }}"
-          fi
+          case "${{ github.event_name }}" in
+            workflow_dispatch)
+              sha=$(gh api "repos/${{ github.repository }}/pulls/${{ inputs.pr_number }}" --jq '.head.sha')
+              ;;
+            merge_group)
+              # Temp merge commit on the merge-queue temp branch; check
+              # runs from ci.yml/ci-integration.yml are reported here, NOT
+              # against the PR head SHA.
+              sha="${{ github.event.merge_group.head_sha }}"
+              ;;
+            *)
+              sha="${{ github.event.pull_request.head.sha }}"
+              ;;
+          esac
           if [ -z "$sha" ]; then
-            echo "::error::Could not resolve PR head SHA"
+            echo "::error::Could not resolve head SHA for event ${{ github.event_name }}"
             exit 1
           fi
           echo "sha=$sha" >> "$GITHUB_OUTPUT"
-          echo "[merge-gate] resolved head SHA: $sha"
+          echo "[merge-gate] event=${{ github.event_name }} resolved head SHA: $sha"
 
-      - name: Checkout PR head
+      - name: Checkout head
         uses: actions/checkout@v4
         with:
           ref: ${{ steps.sha.outputs.sha }}
@@ -88,17 +117,23 @@ jobs:
           GH_TOKEN: ${{ github.token }}
           REPO: ${{ github.repository }}
           SHA: ${{ steps.sha.outputs.sha }}
-          # All PR-time checks the gate aggregates. Keep this in sync with
-          # the underlying workflows. Currently only ci.yml emits PR-time
-          # checks ('Build & Test (Linux)', 'APM Self-Check');
-          # ci-integration.yml is merge_group-only and is NOT polled here.
+          EVENT_NAME: ${{ github.event_name }}
+          # All required checks the gate aggregates, branched by event:
+          #   pull_request / workflow_dispatch -> only ci.yml runs at PR time
+          #   merge_group -> ci.yml AND ci-integration.yml run
+          # Keep this in sync with the underlying workflows.
           # NOTE: 'gate' (this job) MUST NOT appear here -- it would
           # deadlock waiting for itself.
-          EXPECTED_CHECKS: 'Build & Test (Linux),APM Self-Check'
-          TIMEOUT_MIN: '30'
+          EXPECTED_CHECKS: ${{ github.event_name == 'merge_group' && 'Build & Test (Linux),APM Self-Check,Build (Linux),Smoke Test (Linux),Integration Tests (Linux),Release Validation (Linux)' || 'Build & Test (Linux),APM Self-Check' }}
+          # Poll budget: ci-integration.yml chains Build -> Smoke ->
+          # Integration (timeout 20m) -> Release Validation (timeout 20m).
+          # Theoretical worst case ~50m; observed today ~5m end-to-end.
+          # Sized at 55m to absorb growth without false-failing the gate.
+          TIMEOUT_MIN: '55'
           POLL_SEC: '30'
         run: |
           chmod +x .github/scripts/ci/merge_gate_wait.sh
           .github/scripts/ci/merge_gate_wait.sh
+

From 9f3749c4c79ea99cad3f04ad06e03d902bd9feb2 Mon Sep 17 00:00:00 2001
From: danielmeppiel
Date: Sat, 25 Apr 2026 00:28:42 +0200
Subject: [PATCH 2/2] fix(ci): drop paths-ignore from gate + ci so docs-only PRs satisfy gate

Both .github/workflows/merge-gate.yml and .github/workflows/ci.yml
carried identical paths-ignore (docs/**, .gitignore, LICENSE). For a
docs-only PR neither workflow fires, so the 'gate' check-run is never
created -- if the PR ruleset requires 'gate', branch protection displays
it as 'Expected -- Waiting' forever and the PR cannot merge.

Removing paths-ignore from BOTH (not just one) is required: dropping it
only from merge-gate.yml would leave the gate polling for ci.yml checks
that never appear, timing out at TIMEOUT_MIN with exit 2 (false
failure). Removing from both means ci.yml runs on docs-only PRs (~5 min
of free GitHub-hosted runner time) and the gate aggregates as normal --
coherent regardless of which ruleset tier requires gate.

Caught in code review on PR #921. Same observation was flagged but left
out-of-scope in the original PR description; folding in now.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .github/workflows/ci.yml         | 4 ----
 .github/workflows/merge-gate.yml | 4 ----
 2 files changed, 8 deletions(-)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 5cb128b8c..499dfe9b2 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -6,10 +6,6 @@ env:
 on:
   pull_request:
     branches: [ main ]
-    paths-ignore:
-      - 'docs/**'
-      - '.gitignore'
-      - 'LICENSE'
   # Tier 1 also runs in merge queue context so the same unit + build checks
   # execute against the tentative merge commit that the queue creates. See
   # microsoft/apm#770 for the design.
diff --git a/.github/workflows/merge-gate.yml b/.github/workflows/merge-gate.yml
index 96b8dee92..e0d7598f5 100644
--- a/.github/workflows/merge-gate.yml
+++ b/.github/workflows/merge-gate.yml
@@ -43,10 +43,6 @@ name: Merge Gate
 on:
   pull_request:
     branches: [ main ]
-    paths-ignore:
-      - 'docs/**'
-      - '.gitignore'
-      - 'LICENSE'
   merge_group:
     branches: [ main ]
     types: [ checks_requested ]