ci(compat): publish v1.0-rc1 compat-matrix workflow#355

Merged
trilamsr merged 3 commits into main from ci/compat-matrix-v1-rc1
Jun 1, 2026
@trilamsr trilamsr commented Jun 1, 2026

Summary

Closes v1.0-rc1 cut criteria 5 (support matrix published) and 6 (compat CI matrix green) together: ships the missing .github/workflows/compat-matrix.yml and creates docs/SUPPORT-MATRIX.md with every Tier-1 row backed by a named CI workflow per the criterion 5 rubric.

  • .github/workflows/compat-matrix.yml: triggers on schedule (weekly, Mon 04:00 UTC) + workflow_dispatch + push tags: v*. The matrix is skinny by design (~10-15 min wall-time):

    • 4 cells: (k8s 1.30 / 1.31 / 1.32 / 1.33) × pinned OTel v0.130 — blocking.
    • 1 cell: k8s 1.32 × OTel v0.129 (previous minor via make bump-otel reverse) — continue-on-error: true, informational until a published N-back support policy lands.
    • Per cell: OCB build → docker build → kind cluster → helm install --wait → port-forward smoke (healthcheck on :13133 + self-metrics on :8888, asserting non-zero otelcol_* counters) → uninstall + tear-down.
    • report aggregator job dedupes red runs onto a single compat-matrix-red-labeled tracking issue (criterion 6 rubric: "failures opening an issue automatically").
  • docs/SUPPORT-MATRIX.md — new file. k8s 1.30 / 1.31 / 1.33 flip from "Tier 2 — compat-matrix pending" to Tier 1 (CI-gated by compat-matrix.yml); 1.32 stays Tier 1 (already gated by chart.yml, now also by the matrix); 1.28 / 1.29 stay Tier 2 with an explicit "compat-matrix not exercised" note (outside the upstream-supported window). Note: this file is also created by #350; whichever PR merges second resolves the trivial conflict by adopting both edits.

  • docs/v1-rc1-cut-criteria.md — criterion 5 ☐ → ☑ and criterion 6 ☐ → ☑ with the workflow + matrix cells named.

  • docs/RELEASE-CHECKLIST.md — RC "Compat CI" line cites the new workflow + a manual workflow_dispatch step during release-prep.
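
The self-metrics half of the per-cell smoke check can be sketched as a small shell function. This is a minimal sketch, not the workflow's actual step: the helper name `has_nonzero_otelcol_counter` is hypothetical, the sample payload is illustrative, and it assumes the :8888 endpoint returns Prometheus text exposition format without timestamps, already fetched through the port-forward and piped in on stdin.

```shell
# Hypothetical helper: succeed iff stdin (Prometheus text format) contains
# at least one otelcol_* sample whose value is non-zero.
has_nonzero_otelcol_counter() {
  awk '
    /^#/ { next }                 # skip HELP/TYPE comment lines
    /^otelcol_/ {
      # Assumption: no timestamps, so the value is the last field.
      if ($NF + 0 != 0) { found = 1 }
    }
    END { if (found) exit 0; exit 1 }
  '
}

# Illustrative payload; a real run would pipe the scraped metrics instead.
sample='# TYPE otelcol_process_uptime counter
otelcol_process_uptime 12.5
otelcol_exporter_sent_spans 0'

if printf '%s\n' "$sample" | has_nonzero_otelcol_counter; then
  echo "self-metrics ok"
else
  echo "self-metrics empty"
fi
```

Checking for any non-zero `otelcol_*` sample, rather than one specific counter, keeps the assertion stable if individual metric names change between OTel minors.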

Why this design

  • Not on PRs to main. Each cell takes ~10 min; 5 cells in parallel = 10-15 min wall-time. That would dominate PR CI for zero per-PR signal (the cross-product moves with upstream releases, not with code diffs).
  • k8s axis is the fat one. Go pin is gated by ci.yml every PR; OTel axis is gated by ci.yml + chart.yml + install-bench.yml every PR. The compat matrix's job is the (k8s × OTel) cross-product, which nothing else exercises.
  • Tag-trigger included. Every v* tag re-validates before goreleaser ships — catches a release that would crash on a newer k8s minor we haven't bumped to in months.

Disciplines

### Added
- `.github/workflows/compat-matrix.yml` — weekly + tag-triggered
  (k8s 1.30/1.31/1.32/1.33 × OTel v0.130) compat coverage with an
  informational previous-minor (v0.129) cell. Closes v1.0-rc1 cut
  criterion 6.
- `docs/SUPPORT-MATRIX.md` — published support envelope (Go × OTel
  × k8s × distro × GPU vendor × CNI). Every Tier-1 row maps to a
  named CI workflow. Closes v1.0-rc1 cut criterion 5.

### Changed
- `docs/v1-rc1-cut-criteria.md` — criteria 5 + 6 marked shipped.
- `docs/RELEASE-CHECKLIST.md` — RC compat-CI gate cites the new
  workflow + a `workflow_dispatch` step during release-prep.
  • actionlint .github/workflows/compat-matrix.yml — clean
  • python3 -c "import yaml; yaml.safe_load(open(...))" — clean
  • bash scripts/zizmor.sh (--min-severity=high) — clean. Workflow-level perms scoped to contents: read; issues: write granted only on the report job. setup-go cache disabled (this workflow publishes on tag, so the cache-poisoning audit fires on cache: true).
  • make doc-check — clean
  • Pre-commit hook (lint, vet, mod-verify, attribute-namespace-check) — clean

Test plan

  • After merge: trigger workflow_dispatch once to confirm all 5 cells run end-to-end on main.
  • Confirm the first scheduled run (next Monday 04:00 UTC) lands green; if red, confirm the report job opens an issue labeled compat-matrix-red.
  • On the next v* tag (release-prep PR), confirm the workflow gates the tag push before goreleaser ships.

trilamsr and others added 2 commits June 1, 2026 02:23
Closes v1.0-rc1 cut criteria 5 + 6:

- .github/workflows/compat-matrix.yml: weekly cron + workflow_dispatch
  + v* tag triggers run (k8s 1.30/1.31/1.32/1.33 x pinned OTel v0.130)
  plus a non-blocking (k8s 1.32 x previous-minor OTel v0.129) cell.
  Each cell does OCB build, docker build, kind install, helm install,
  port-forward healthcheck + self-metrics. report job opens a
  dedupe-by-label tracking issue on red per the criterion 6 rubric.
- docs/SUPPORT-MATRIX.md: k8s rows 1.30/1.31/1.33 flipped from
  'Tier 2 - compat-matrix pending' to Tier 1 with the new workflow
  as source of truth; 1.28/1.29 stay Tier 2 (out of upstream support
  window); OTel previous-minor row tagged informational until N-back
  policy adopted.
- docs/v1-rc1-cut-criteria.md: criterion 5 unboxed (support matrix
  published) and criterion 6 unboxed (compat CI matrix wired).
- docs/RELEASE-CHECKLIST.md: RC additional gates cite the new
  workflow + manual workflow_dispatch step during release-prep.

Disciplines:
- actionlint clean
- python3 yaml.safe_load clean
- zizmor --min-severity=high clean (workflow-level permissions
  scoped down to read; issues:write granted only on the report
  job; setup-go cache disabled on this publishing workflow)
- All actions pinned to commit SHA matching the canonical
  references in chart.yml + install-bench.yml

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr commented Jun 1, 2026


Addressed reviewer findings:

Fixed:

  • comment now matches code (otelcol_* not tracecore_*).
  • close-on-green logic: report job comments + closes any open compat-matrix-red issue when matrix is all-green; prevents stale-issue misattribution.
  • demotion rubric tightened: 'without an open, compat-matrix-red-labeled issue.'
  • SUPPORT-MATRIX 'NOT promise' calls out smoke = startup + self-telemetry, NOT verdict round-trip.
  • tag trigger clarified as informational; release-prep maintainers run workflow_dispatch BEFORE tagging to gate.
  • concurrency group docstring explains per-ref serialization intent.
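
The open / dedupe / close-on-green behaviour of the report job reduces to a small decision table. The sketch below is an assumption-laden illustration, not the workflow's code: `report_action` and its argument convention are hypothetical, and "comment on the existing issue" is assumed to be how red runs are deduped.

```shell
# Hypothetical decision helper for the report job.
#   $1: matrix outcome for the blocking cells, "green" or "red"
#   $2: number of open issues labeled compat-matrix-red
# Prints one of: open, comment, close, noop
report_action() {
  outcome=$1
  open_issues=$2
  case "$outcome" in
    red)
      if [ "$open_issues" -gt 0 ]; then
        echo comment   # dedupe: reuse the existing tracking issue
      else
        echo open      # first red run: create the tracking issue
      fi
      ;;
    green)
      if [ "$open_issues" -gt 0 ]; then
        echo close     # comment + close so the issue does not go stale
      else
        echo noop
      fi
      ;;
    *)
      echo "unknown outcome: $outcome" >&2
      return 2
      ;;
  esac
}

report_action red 0     # prints "open"
report_action green 1   # prints "close"
```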

# Conflicts:
#	docs/RELEASE-CHECKLIST.md
#	docs/SUPPORT-MATRIX.md
#	docs/v1-rc1-cut-criteria.md
@trilamsr trilamsr merged commit ef2eab3 into main Jun 1, 2026
11 checks passed
@trilamsr trilamsr deleted the ci/compat-matrix-v1-rc1 branch June 1, 2026 10:17
trilamsr added a commit that referenced this pull request Jun 2, 2026
Closes #445.

## Root cause

The `changes` job in `.github/workflows/ci.yml` has existed since #355
(Mon 2026-06-01) emitting an `outputs.code` boolean that flags docs-only
PRs. The accompanying comment claimed `verify-test`, `verify-lint`, and
`build` skip on doc-only changes — **but no job actually consumed the
output**. Every PR has been running the full ci-full suite
(coverage-check, vet+lint, build×2 arches, verify-static, sdk-python)
regardless of diff scope.

PR #442 (#424) wired `bench-allocs-check` into `make ci-full` and the CI
`bench-check` step picked up the +80s standalone cost (+54s effective
with warm cache). That made the long-orphaned `changes`-job miswiring
visible enough to file as #445, but the underlying defect is older: a
path-filter shipped without consumers.

## Change

Two outputs on the `changes` job, consumed by `if:` conditions:

- **`code`** (true iff any changed file is neither under `docs/` nor a
`*.md` file) gates `verify-test`, `verify-lint`, `build`,
`smoke-test-binary`, `sdk-python`. Go gates cannot regress on doc-only edits.
- **`bench`** (true iff `module/pkg/patterns/**`, `bench/**`,
`scripts/bench-*.sh`, or `Makefile` changed) gates the `bench-check`
step *inside* `verify-static`. `bench-allocs-check` cannot regress on a
diff that doesn't touch detector source, the bench harness, the bench
scripts, or the bench-target wiring.

Both filters **fail open** on an empty diff or unreachable base ref, so
every push to `main` and every bench-touching PR keeps the gate
mandatory.
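
The two filters and their fail-open behaviour can be sketched in POSIX shell as follows. This is a hedged mirror of the rules stated above, not the code in `ci.yml`: the function name `classify_changes` and the stdin/stdout convention are hypothetical, and an unreachable base ref is assumed to surface as an empty path list.

```shell
# Hypothetical mirror of the `changes` job filter. Reads changed paths on
# stdin (one per line) and prints "code=<bool> bench=<bool>".
classify_changes() {
  code=false
  bench=false
  seen=0
  while IFS= read -r path || [ -n "$path" ]; do
    [ -n "$path" ] || continue
    seen=1
    case "$path" in
      docs/*|*.md) ;;          # documentation: does not set code
      *) code=true ;;
    esac
    case "$path" in
      module/pkg/patterns/*|bench/*|scripts/bench-*.sh|Makefile) bench=true ;;
    esac
  done
  # Fail open: with no paths to inspect, keep both gates mandatory.
  if [ "$seen" -eq 0 ]; then
    code=true
    bench=true
  fi
  echo "code=$code bench=$bench"
}

printf '%s\n' docs/SUPPORT-MATRIX.md PRINCIPLES.md | classify_changes
# prints "code=false bench=false"
printf '%s\n' Makefile PRINCIPLES.md | classify_changes
# prints "code=true bench=true"
```

In a shell `case` pattern `*` also matches `/`, so `docs/*` and `module/pkg/patterns/*` cover nested paths without a separate `**` form.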

`verify-static` and `validator-recipe` keep running on every PR — they
own the doc-touching gates (`doc-check`, `cut-criteria-check`,
`slo-rules-check`, recipe-YAML validation under
`docs/integrations/examples/`).

The `verify` aggregator now treats `result=skipped` as satisfied on the
doc-only path for the skippable sub-jobs (verify-test, verify-lint,
sdk-python). verify-static + validator-recipe still must be `success` on
every PR — encoded in the aggregator, not relying on branch-protection
SKIPPED-is-OK semantics.
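
The aggregator rule can be sketched as a single predicate. This is an illustration under assumptions, not the job's actual script: `verify_ok` and its argument order are hypothetical, and the result strings are the standard GitHub Actions job results (`success`, `failure`, `cancelled`, `skipped`).

```shell
# Hypothetical aggregator check.
#   $1:     changes.outputs.code ("true" or "false")
#   $2, $3: results of verify-static and validator-recipe
#   $4...:  results of the skippable jobs (verify-test, verify-lint, sdk-python)
verify_ok() {
  code_changed=$1
  static=$2
  recipe=$3
  shift 3
  # Always-on jobs must be "success"; "skipped" is never acceptable here.
  [ "$static" = "success" ] || return 1
  [ "$recipe" = "success" ] || return 1
  for result in "$@"; do
    case "$result" in
      success) ;;
      skipped)
        # A skip only counts as satisfied on the doc-only path.
        [ "$code_changed" = "false" ] || return 1
        ;;
      *) return 1 ;;
    esac
  done
  return 0
}

# Doc-only PR: the heavy jobs were skipped, the always-on jobs passed.
verify_ok false success success skipped skipped skipped && echo "verify: ok"
```

Encoding the rule this way means a skipped job on a code-touching PR fails the aggregator instead of passing silently.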

## Wall-time impact

| PR shape | Before | After | Δ |
|---|---|---|---|
| Docs-only (e.g. #423) | ~283s (full suite) | bounded by `verify-static` minus bench-check step | drops the +54s bench delta + the verify-test 125s pole + the verify-lint 60s pole |
| Bench-touching (e.g. #442, this one) | ~283s | ~283s | 0 (gate still runs) |
| Non-bench code (e.g. processor change) | ~283s | ~203s | -80s (bench-check skip) |

## TDD (red→green)

Mirrored the workflow filter logic in `/tmp/test-changes-filter.sh` (12
cases × code+bench outputs = 24 assertions). Covered: docs-only,
detector source, bench-script, bench-registry-via-`bench-*.sh`-glob,
Makefile, non-detector go, bench-baseline.txt under testdata, mixed
docs+code, empty-diff fail-open, bench/ dir, top-level `.md`
(PRINCIPLES.md), workflow-file change. All 24 green.

Real-PR validation:
- `gh pr diff 423` (docs-only) → `code=false bench=false` (heavy gates +
bench-check skip)
- `gh pr diff 442` (Makefile + PRINCIPLES.md) → `code=true bench=true`
(gates run)

## Verify (green)

- `make actionlint` — clean (initial run flagged SC2221/SC2222 on a
redundant `scripts/bench-registry.sh` pattern that the
`scripts/bench-*.sh` glob already subsumed; removed)
- `make lint` — 0 issues
- `make ci-fast` — clean (lint, vet, mod-verify,
attribute-namespace-check, doc-check, alert-check,
chart-appversion-check, rfc-status-check, slo-rules-check)
- `make doc-check` — clean
- Pre-commit hooks — clean (golangci-lint, vet, mod-verify,
attribute-namespace-check, hit-line-format-stable,
no-autoupdate-check_test)

## A+ audit notes

Audited other ci-full jobs for docs-only skip eligibility:
- `validator-recipe` — NOT skipped. Recipe YAMLs live under
`docs/integrations/examples/`, so docs-classified diffs can include
functional content. Out-of-scope for safe skip.
- `verify-static` — runs always. Contains `doc-check`,
`cut-criteria-check`, `slo-rules-check`, `actionlint`, `zizmor`,
`register-lint`, `deprecation-check` — all can regress on docs.
Step-level skip applied only to the `install benchstat` + `bench-check`
pair via `changes.outputs.bench`.
- `validator-recipe` and `smoke-test-binary` are not eligible for the
`bench` output (they don't touch detector code).

PRINCIPLES.md §10 updated with the path-filter policy (which jobs skip
on docs-only, why verify-static stays, fail-open semantics) so the next
reader can see the policy without grepping the workflow.

```release-notes
- CI bench-check step now skips on PRs that don't touch the detector tree, the bench harness, the bench scripts, or the Makefile. Heavy Go gates (verify-test, verify-lint, build, smoke-test-binary, sdk-python) skip on docs-only PRs. The verify aggregator treats SKIPPED-on-docs-only as success; verify-static and validator-recipe still run on every PR. Closes #445.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>