From 198d3b8f9a6681f7f679a06d565069cff9ce79ee Mon Sep 17 00:00:00 2001
From: Tri Lam
Date: Mon, 18 May 2026 22:20:46 -0700
Subject: [PATCH 1/2] [ci] split verify into 3 parallel jobs + aggregator
 (~7m -> ~2:45m)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The verify job ran 16 steps sequentially for 427s wall time on PR #70.
Split into three parallel jobs grouped by the longest pole:

  - verify-test:   coverage-check (race tests)                  ~125s
  - verify-lint:   vet + lint                                   ~155s
  - verify-static: license/generate/build-tags/tidy/register-lint/
                   nccl-fr RCE/actionlint/zizmor/fuzz/govulncheck/
                   doc-check/build                              ~140s

A final aggregator job `verify` declares
`needs: [verify-test, verify-lint, verify-static]` and echoes success.
Branch protection already requires `verify`; keeping the name means zero
protection churn while the three sub-jobs surface as informational rows.

Wall time drops from ~7m to ~2:45m (bounded by verify-lint). Runner-minute
cost is ~3.3x current — each job pays setup-go cache.

`make ci` is unchanged — still sequential developer convenience.

Signed-off-by: Tri Lam
---
 .github/workflows/ci.yml | 72 ++++++++++++++++++++++++++++++----------
 1 file changed, 54 insertions(+), 18 deletions(-)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 52ba8ce6..4cd9ba50 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -9,18 +9,52 @@ permissions:
   contents: read
 
 jobs:
-  verify:
-    # Each step name mirrors a Makefile target so a failing gate is
-    # identifiable at a glance in the GitHub UI without scrolling
-    # through one undifferentiated bash step. Sequential + fail-fast
-    # matches `make ci`'s local-run semantics.
-    #
-    # We previously ran a 2-arm matrix (go.mod + stable) for early
-    # warning of Go-toolchain regressions; dropped because the second
-    # arm added two visible check rows per PR for ~no actionable
-    # signal at the cadence we ship. If we ever need that signal
-    # back, add it as a separate scheduled job, not a matrix.
-    name: verify
+  # `verify` is split into three parallel jobs (verify-test, verify-lint,
+  # verify-static) that feed an aggregator named `verify`. Wall time drops
+  # from ~7m to ~2:45m without touching branch protection — the aggregator
+  # inherits failure from any sub-job via `needs:` short-circuit, so the
+  # existing required-check `verify` stays accurate. `make ci` is still
+  # sequential locally; only CI parallelizes.
+  #
+  # Partition rationale: keep the longest single step (`coverage-check`,
+  # ~125s) on its own job so it bounds wall time. Pair `vet` + `lint`
+  # because they share golangci-lint setup. Everything else lands in
+  # verify-static — a grab-bag bounded by `build` (~55s) + `fuzz` (~40s).
+  #
+  # We previously ran a 2-arm matrix (go.mod + stable) for early warning
+  # of Go-toolchain regressions; dropped because the second arm added
+  # two visible check rows per PR for ~no actionable signal at the
+  # cadence we ship. If we ever need that signal back, add it as a
+  # separate scheduled job, not a matrix.
+
+  verify-test:
+    name: verify-test
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+      - uses: actions/setup-go@4a3601121dd01d1626a1e23e37211e3254c1c06c # v6.4.0
+        with:
+          go-version-file: go.mod
+          cache: true
+      - name: test (race) + coverage-check
+        run: make coverage-check
+
+  verify-lint:
+    name: verify-lint
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+      - uses: actions/setup-go@4a3601121dd01d1626a1e23e37211e3254c1c06c # v6.4.0
+        with:
+          go-version-file: go.mod
+          cache: true
+      - name: vet
+        run: make vet
+      - name: lint
+        run: make lint
+
+  verify-static:
+    name: verify-static
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
@@ -32,14 +66,10 @@ jobs:
         run: make license-check
       - name: generate-check
         run: make generate-check
-      - name: vet
-        run: make vet
       - name: build-tags
         run: make build-tags
       - name: tidy-check
         run: make tidy-check
-      - name: lint
-        run: make lint
       - name: nccl_fr RCE gate
         run: make nccl-fr-rce-gate
       - name: register-lint
@@ -52,8 +82,6 @@ jobs:
           echo "$HOME/.local/bin" >> "$GITHUB_PATH"
       - name: zizmor
         run: make zizmor
-      - name: test (race) + coverage-check
-        run: make coverage-check
       - name: 30s fuzz (nccl_fr parser)
         run: make ci-fuzz-nccl-fr
       - name: govulncheck
@@ -63,6 +91,14 @@ jobs:
       - name: build
         run: make build
 
+  verify:
+    name: verify
+    runs-on: ubuntu-latest
+    needs: [verify-test, verify-lint, verify-static]
+    steps:
+      - name: aggregator
+        run: echo "all verify-* gates passed"
+
   build:
     # Cross-compiles release-candidate binaries for the platforms we ship.
     # One job, two arches: one Go-toolchain setup instead of two.

From dfb67ca21275e48138d9f2f9a92c76813475698f Mon Sep 17 00:00:00 2001
From: Tri Lam
Date: Mon, 18 May 2026 22:31:56 -0700
Subject: [PATCH 2/2] [ci] verify split: document where new gates belong

Adds one sentence to the partition comment block so a contributor adding
a new lint/test/scanner gate knows the default bucket (verify-static) and
the rebalance trigger (when it pushes verify-static past the verify-lint
pole). Surfaced during review of the parallelization change.

Signed-off-by: Tri Lam
---
 .github/workflows/ci.yml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
index 4cd9ba50..0075b0de 100644
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -20,6 +20,8 @@ jobs:
   # ~125s) on its own job so it bounds wall time. Pair `vet` + `lint`
   # because they share golangci-lint setup. Everything else lands in
   # verify-static — a grab-bag bounded by `build` (~55s) + `fuzz` (~40s).
+  # When adding a new gate: default to verify-static. Promote it to its
+  # own job only when it pushes verify-static past the verify-lint pole.
   #
   # We previously ran a 2-arm matrix (go.mod + stable) for early warning
   # of Go-toolchain regressions; dropped because the second arm added
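
Review note on the aggregator in PATCH 1/2: the comment block says `verify` inherits failure from any sub-job through `needs:`. As far as I understand GitHub's behavior, a job whose `needs` dependency fails is skipped rather than failed, and a skipped job can be reported as passing for a required check. That is worth confirming against the GitHub docs and this repository's branch-protection settings before relying on it. If it holds here, one possible hardening is sketched below. It reuses the job names from the patch; the `RESULTS` variable name and the loop are illustrative assumptions, not part of the series.

```yaml
  verify:
    name: verify
    runs-on: ubuntu-latest
    needs: [verify-test, verify-lint, verify-static]
    # Run even when a dependency failed, so the required check reports
    # an explicit failure instead of being skipped.
    if: ${{ always() }}
    steps:
      - name: aggregator
        env:
          # Space-separated results of the three sub-jobs, e.g.
          # "success failure success". RESULTS is a hypothetical name.
          RESULTS: ${{ join(needs.*.result, ' ') }}
        run: |
          for r in $RESULTS; do
            if [ "$r" != "success" ]; then
              echo "a verify-* gate did not succeed: $RESULTS"
              exit 1
            fi
          done
          echo "all verify-* gates passed"
```

Passing the results through `env:` rather than interpolating the expression into the script keeps the step friendly to the zizmor and actionlint gates that already run in verify-static. With `always()`, the aggregator also runs (and fails) on a cancelled run; `!cancelled()` is an alternative if that is not wanted.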