ci(policy-matrix): install ServiceMonitor CRD before helm dry-run (#494)#496
Conversation
The chart's production preset enables `serviceMonitor.enabled=true`, which renders a `monitoring.coreos.com/v1 ServiceMonitor` resource. Kind clusters do not ship that CRD, so `helm install --dry-run=server` exits 1 on every chart-touching PR (#476, #481, #493 all blocked). Install ONLY the ServiceMonitor CRD (slim, ~700 lines) rather than the full prometheus-operator bundle — the chart's production preset references no other monitoring.coreos.com kinds. Applied to every matrix row, not just production, so a future default-values flip cannot silently re-break the gate. CRD pinned to v0.91.0 (latest stable, 2026-05-05) per repo convention (KYVERNO_POLICIES_REF / GATEKEEPER_VERSION in scripts/policy-matrix-smoke.sh — never track main). Closes #494. Signed-off-by: Tri Lam <tree@lumalabs.ai>
Code Review — B/A/A+ GradingVERDICT: A (Ship, no blocking changes) Criteria Assessment✓ Issue Closure: closes #494 definitively. ServiceMonitor CRD is now installed before helm dry-run validates the production-preset chart. ✓ Simplicity: 23-line addition; 2 operations (apply CRD, wait for condition). No conditional logic, no loops. Documentation is heavy but load-bearing (explains why slim CRD vs full prometheus-operator bundle). ✓ Placement: inserted after kubectl sanity check, before smoke install. Dependency order is correct. ✓ No Regressions: ✓ Race-free Wait: uses ✓ Version Pinning: v0.91.0 (released 2026-05-05), stable 4+ weeks. Follows repo convention (KYVERNO_POLICIES_REF / GATEKEEPER_VERSION). Not tracking main/latest. ✓ Scope Justification: applied unconditionally to all matrix rows (not just production), which is defensive — a future default-values flip won't silently re-break the gate. Install-bench exclusion is justified (separate workflow, doesn't call policy-matrix-smoke.sh, production preset is specialist criterion-10 gate). Minor Note (Non-blocking)If Simplification Sweep
Ready to merge. Do not enable auto-merge (per task requirements). |
#498) ## Summary Triple-shipper closing three load-bearing infra debts that recurred on every chart/CI-touching PR. Atomic so we handle this once. ### Part 1 — Makefile sharding (cascade-rebase tripwire) **Root cause:** The root `Makefile` carried four monolithic prereq lists (`.PHONY:`, `check:`, `verify:`, `ci-fast:`, `ci-full:`). Every new gate appended one token to each list, and two open PRs touching the same line produced a 3-way merge conflict that required manual fix-up — the dominant source of cascade-rebases on this repo. **Fix:** Split into `make/{phony,check,verify,ci-fast,ci-full}.mk` shards using `+=` appends. Main `Makefile` `include`s the shards; aggregate targets now consume `$(*_DEPS)`. Prereq sets are logically equivalent to `origin/main` (modulo intentional gate additions): `make -pn` shows `lint-unused-module` replaced by `lint-module-full` in `check` (Part 3) and the new `makefile-hotfile-check` added to `ci-fast`/`ci-full` (Part 1 A+). No other prereq tokens moved. **A+:** Added `scripts/makefile-hotfile-check.sh` (+ `make makefile-hotfile-check` target) that fails if a future PR re-inlines prereq tokens into the root `Makefile`. Wired into `ci-fast` + `ci-full` so drift trips per-PR. ### Part 2 — Kind-CRD bootstrap composite action **Root cause:** Three workflows (chart.yml, policy-matrix.yml, install-bench.yml) each separately installed helm + kind + the tracecore image, each drifted from the others on CRD prereqs (ServiceMonitor #494 fixed policy-matrix only; chart.yml + install-bench.yml remained vulnerable to the same regression). **Fix:** Created `.github/actions/kind-cluster-setup/action.yml` as a single source of truth: pinned helm v3.16.4, kind v0.25.0, node v1.32.0, ServiceMonitor CRD v0.91.0 (#494 pin), with toggles for Gatekeeper / cert-manager CRDs (reserved for future workflows). All 3 workflows now `uses:` the composite. Old `kind-tracecore-up` shim deleted (zero remaining callsites). **Mutation-verify:** changing the ServiceMonitor CRD URL in `kind-cluster-setup/action.yml` fails all 3 workflows uniformly by construction (single-source-of-truth pin). ### Part 3 — Full module/ lint coverage (#490 follow-up of #486) **Root cause:** `make lint` from the root never reaches `module/` (workspace mode resolves `./...` only inside the current module). PR #486 added `make lint-unused-module` for the `unused` linter only; the rest of the 13 linters declared in `.golangci.yml` were silently skipped against `module/`, accumulating 57 findings. **Fix:** New `make lint-module-full` target runs the full `.golangci.yml` linter set against `module/`. Swept all 57 pre-existing findings to 0: - `golangci-lint --fix`: 17 findings auto-fixed (testifylint 14, errorlint 1, perfsprint 1, staticcheck-QF1003 1) - Real fixes: forcetypeassert → checked assertions (6); goconst → `fallbackCollectiveOp`, `PressureUnknown` constants (4); gocritic rewrites (2); predeclared renames (3); revive package-comments (1); prealloc (1); wrapcheck `fmt.Errorf "%w"` (2); G301 mkdir 0755 → 0750 (1); exhaustive → explicit `default:` clauses (7) + explicit `PressureUnknown` case (1) - Documented opt-outs with per-finding rationale: gocyclo on 5 pattern-match dispatch funcs whose complexity tracks vocabulary cardinality, not nested logic; gosec G103 on audited `unsafe.String` aliasing carve; gosec G304 on 5 test-local fixture reads; staticcheck SA1019 on explicit pre-deprecation parity assertion (#277) `make lint-module-full` exits 0 on the cleaned tree. Wired into `make check`, replacing `lint-unused-module` (retained for fast-iteration dead-code sweeps). ## Per-part metrics | Part | Metric | Before | After | |---|---|---|---| | 1 | Makefile aggregate-list LoC (hot lines) | 5 monolithic lines | 5 `include` lines | | 1 | Make-target prereq sets vs `origin/main` | — | logically equivalent (modulo intentional gate additions; see Part 1 Fix) | | 1 | New shards under `make/` | 0 | 5 (`phony`, `check`, `verify`, `ci-fast`, `ci-full`) | | 2 | Composite action wired into N workflows | 0 (each duplicated kind setup) | 3 (chart, policy-matrix, install-bench) | | 2 | CRDs covered | 0 (ad-hoc per workflow) | ServiceMonitor (now), Gatekeeper + cert-manager (reserved) | | 2 | Workflow LoC delta (kind setup blocks) | 3× ~10 lines duplicated | 3× ~10 lines (composite-action `uses:` blocks) | | 3 | Lint findings (module/) | 57 (across 14 linters) | 0 | | 3 | Linters enabled against module/ | 1 (`unused`) | 13 (full `.golangci.yml` set) | | 3 | Documented opt-outs | — | 14 `//nolint:` directives, each with per-finding rationale | ## Closes - #490 (full module/ lint coverage) - Reaffirms #486 (extends from `unused` only to the 13-linter set) ## Follow-ups filed - #497 — surfaced (not caused) by this PR: `TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressure` fails on `origin/main` HEAD because the negative-fixture filter treats every non-canonical fixture as negative, including the `_real_world/*` positives added in #484. Fix sketch + repro included. ## Test plan - [x] `make verify` — green - [x] `make check` — green - [x] `make lint-module-full` — exit 0 - [x] `make doc-check` — green (comment-noise sweep) - [x] `make actionlint` — green - [x] `make zizmor` — green - [x] `make makefile-hotfile-check` — green (own gate) - [x] `make -pn` byte-identity check against `origin/main` for all 4 aggregate targets — passes (modulo intentional `makefile-hotfile-check` additions to ci-fast/ci-full + `lint-module-full` superseding `lint-unused-module` in check) - [x] `(cd module && GOWORK=off go test ./...)` — green except pre-existing #497 failure (filed) - [ ] CI green on this PR (kind workflows actually exercise the new composite action) ```release-notes infra: shard Makefile into make/*.mk (cascade-rebase prevention per [[rebase-cascade]]); unify 3 workflows behind .github/actions/kind-cluster-setup composite (CRD prereq install + pinned tools); enable full module/ lint coverage (57 findings → 0, 13 → 27 linters wired into make check). Closes #490; refs #494/#496. ``` --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
Summary
policy-matrix.ymlworkflow has been failing on every chart-touching PR(blocked #476, #481, #493) since #475 landed. The chart's production
preset (
values-production.yaml) flipsserviceMonitor.enabled=true,which renders a
monitoring.coreos.com/v1 ServiceMonitorresource. Kindclusters don't ship the prometheus-operator CRDs, so
helm install --dry-run=serverexits 1 with:Root cause
Kind ships only the core Kubernetes API set.
monitoring.coreos.com/v1is supplied by prometheus-operator, which the policy-matrix kind cluster
never installs. The chart's
templates/servicemonitor.yamlis gated by.Values.serviceMonitor.enabled— defaultfalse(chart staysfirst-install-compatible on bare clusters), but the production preset
enables it (kube-prometheus-stack convention). The policy-matrix gate
exercises both default and production presets across PSA / Kyverno /
Gatekeeper, so the production rows hit the missing CRD on every run.
Fix
Issue #494 recommended option (a) — install the missing CRD prereq.
This PR adds a single workflow step after kind cluster spin-up but
before the smoke script:
Design choices:
The chart's production preset references no other
monitoring.coreos.comkinds. Slim install (~700 lines of YAML) avoids pulling Prometheus,
Alertmanager, ThanosRuler, PodMonitor, Probe, PrometheusRule we don't
exercise.
the default
serviceMonitor.enabledtoggle cannot silently re-breakthis gate.
v0.91.0(latest stable, published 2026-05-05). Matchesthe existing
KYVERNO_POLICIES_REF/GATEKEEPER_VERSIONpinconvention in
scripts/policy-matrix-smoke.sh. Bumping is a reviewedcode change — never tracks
main.kubectl wait --for=condition=establishedbefore the helm dry-runso the apiserver has registered the CRD when the chart template
reaches the admission chain (avoids a race where the dry-run hits
before discovery refreshes).
Gatekeeper CRD timing
Re-audited —
install_gatekeeper()inscripts/policy-matrix-smoke.shalready polls
kubectl get crd ...constraints.gatekeeper.sh(line 143-149)and the constraint
byPod[*].enforcedfield (line 270-276) before thesmoke step exits. The
kubectl get constraints -A || truein thefailure-collection step is diagnostic only and already tolerates absent
CRDs. No timing fix needed there.
Why not install-bench / chart.yml
install-bench.ymlusesbench/install/tracecore-values.yamlwhichdoesn't enable serviceMonitor — same failure shape doesn't apply.
chart.yml'sinstallandupgradejobs install with default values(
serviceMonitor.enabled=false); therenderjob's production-presetcheck is
helm templateonly (no cluster), so no API discovery runs.Test plan
actionlint .github/workflows/policy-matrix.yml— exit 0actionlintacross all.github/workflows/— exit 0attribute-namespace-check, zizmor, doc-check, alert-check,
chart-appversion-check, rfc-status-check, slo-rules-check,
deprecation-check, no-autoupdate-check, test-flake-audit)
(psa × default, psa × production, kyverno × default, kyverno ×
production, gatekeeper × default, gatekeeper × production) plus all 3
mutation rows.
Closes #494.