ci(policy-matrix): install ServiceMonitor CRD before helm dry-run (#494) by trilamsr · Pull Request #496 · TraceCoreAI/tracecore

trilamsr · 2026-06-02T06:45:22Z

Summary

The policy-matrix.yml workflow has been failing on every chart-touching PR
(blocked #476, #481, #493) since #475 landed. The chart's production
preset (values-production.yaml) flips serviceMonitor.enabled=true,
which renders a monitoring.coreos.com/v1 ServiceMonitor resource. Kind
clusters don't ship the prometheus-operator CRDs, so
helm install --dry-run=server exits 1 with:

no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first

Root cause

Kind ships only the core Kubernetes API set. monitoring.coreos.com/v1
is supplied by prometheus-operator, which the policy-matrix kind cluster
never installs. The chart's templates/servicemonitor.yaml is gated by
.Values.serviceMonitor.enabled — default false (chart stays
first-install-compatible on bare clusters), but the production preset
enables it (kube-prometheus-stack convention). The policy-matrix gate
exercises both default and production presets across PSA / Kyverno /
Gatekeeper, so the production rows hit the missing CRD on every run.
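
Whether a given cluster serves the ServiceMonitor kind can be checked by asking the API server which resources it exposes for the group. The sketch below is illustrative only: the helper name `has_servicemonitor_kind` is hypothetical (it is not part of the repository), and it assumes `kubectl` is on PATH and pointed at the kind cluster.

```shell
# Hypothetical helper (not part of the repo): succeeds when the API server
# serves the ServiceMonitor resource in the monitoring.coreos.com group.
has_servicemonitor_kind() {
  # `-o name` prints one "<plural>.<group>" entry per line.
  kubectl api-resources --api-group=monitoring.coreos.com -o name 2>/dev/null \
    | grep -qx 'servicemonitors.monitoring.coreos.com'
}

# Example use (assumes kubectl is configured for the kind cluster):
#   if has_servicemonitor_kind; then echo "CRD present"; else echo "CRD missing"; fi
```

On a bare kind cluster the check would be expected to fail, which matches the dry-run error quoted in the Summary.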

Fix

Issue #494 recommended option (a) — install the missing CRD prereq.
This PR adds a single workflow step after kind cluster spin-up but
before the smoke script:

- name: Install prometheus-operator ServiceMonitor CRD (issue #494)
  run: |
    kubectl apply -f \
      "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.91.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml"
    kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s

Design choices:

- Slim CRD (ServiceMonitor only) vs full prometheus-operator bundle.
  The chart's production preset references no other monitoring.coreos.com
  kinds. The slim install (~700 lines of YAML) avoids pulling in the
  Prometheus, Alertmanager, ThanosRuler, PodMonitor, Probe, and
  PrometheusRule CRDs, none of which we exercise.
- Applied to every matrix row (not just production). A future flip of
  the default serviceMonitor.enabled toggle cannot silently re-break
  this gate.
- Pinned to v0.91.0 (latest stable, published 2026-05-05). Matches
  the existing KYVERNO_POLICIES_REF / GATEKEEPER_VERSION pin
  convention in scripts/policy-matrix-smoke.sh. Bumping is a reviewed
  code change; the pin never tracks main.
- kubectl wait --for=condition=established runs before the helm dry-run
  so the apiserver has registered the CRD by the time the rendered chart
  reaches the admission chain (avoids a race where the dry-run hits
  before discovery refreshes).
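
The pin-and-wait choices above can be sketched as a small shell helper. This is a sketch under stated assumptions, not the workflow's actual code: the variable name `PROMETHEUS_OPERATOR_VERSION` and both function names are hypothetical, and the manifest is assumed to be fetched from raw.githubusercontent.com.

```shell
# Assumed pin variable (hypothetical name), mirroring the
# KYVERNO_POLICIES_REF / GATEKEEPER_VERSION convention described above.
PROMETHEUS_OPERATOR_VERSION="${PROMETHEUS_OPERATOR_VERSION:-v0.91.0}"

# Build the URL of the single ServiceMonitor CRD manifest for a given tag.
# Assumption: the manifest is served from raw.githubusercontent.com.
servicemonitor_crd_url() {
  printf 'https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/%s/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml\n' "$1"
}

# Apply the CRD, then block until the API server reports it Established.
# The wait only runs if the apply succeeded.
install_servicemonitor_crd() {
  kubectl apply -f "$(servicemonitor_crd_url "$PROMETHEUS_OPERATOR_VERSION")" &&
    kubectl wait --for=condition=established \
      crd/servicemonitors.monitoring.coreos.com --timeout=60s
}
```

Keeping the tag in one variable means a version bump touches a single line, which keeps the bump reviewable in the same way as the other pins.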

Gatekeeper CRD timing

Re-audited: install_gatekeeper() in scripts/policy-matrix-smoke.sh
already polls kubectl get crd ...constraints.gatekeeper.sh (lines 143-149)
and the constraint byPod[*].enforced field (lines 270-276) before the
smoke step exits. The kubectl get constraints -A || true in the
failure-collection step is diagnostic only and already tolerates absent
CRDs. No timing fix needed there.
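
For contrast with kubectl wait, a manual poll loop of the general kind described here can be sketched as follows. This is not the code in install_gatekeeper(); the helper name `retry_until` and its arguments are hypothetical and only illustrate the pattern.

```shell
# Hypothetical generic poll helper; NOT the code in policy-matrix-smoke.sh.
# Usage: retry_until <attempts> <sleep-seconds> <command> [args...]
# Runs the command until it succeeds or the attempts are exhausted.
retry_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (the CRD name is a placeholder, not taken from the script):
#   retry_until 30 2 kubectl get crd <some-crd-name>
```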

Why not install-bench / chart.yml

install-bench.yml uses bench/install/tracecore-values.yaml which
doesn't enable serviceMonitor — same failure shape doesn't apply.
chart.yml's install and upgrade jobs install with default values
(serviceMonitor.enabled=false); the render job's production-preset
check is helm template only (no cluster), so no API discovery runs.
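
The difference between the two render paths can be sketched as two wrappers around helm. The release name, chart directory, and the location of values-production.yaml below are placeholders chosen for illustration, not paths confirmed by the repository.

```shell
# Assumptions: release name and chart directory are placeholders.
RELEASE="${RELEASE:-tracecore}"
CHART_DIR="${CHART_DIR:-./chart}"

# Client-side render only: no cluster contact, so a missing CRD goes unnoticed.
render_offline() {
  helm template "$RELEASE" "$CHART_DIR" -f "$CHART_DIR/values-production.yaml"
}

# Server-side dry-run: manifests are checked against the live API server,
# which must already serve every kind the chart renders.
render_against_cluster() {
  helm install "$RELEASE" "$CHART_DIR" -f "$CHART_DIR/values-production.yaml" \
    --dry-run=server
}
```

Only the second form depends on API discovery, which is why the missing ServiceMonitor CRD surfaces in policy-matrix and not in a template-only check.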

Test plan

- actionlint .github/workflows/policy-matrix.yml: exit 0
- actionlint across all .github/workflows/: exit 0
- Pre-push hook suite passed locally (golangci-lint, vet, mod verify,
  attribute-namespace-check, zizmor, doc-check, alert-check,
  chart-appversion-check, rfc-status-check, slo-rules-check,
  deprecation-check, no-autoupdate-check, test-flake-audit)
- policy-matrix workflow runs green on this PR: all 6 matrix rows
  (psa × default, psa × production, kyverno × default, kyverno ×
  production, gatekeeper × default, gatekeeper × production) plus all 3
  mutation rows.

Closes #494.

ci: install prometheus-operator ServiceMonitor CRD in policy-matrix kind cluster so chart-touching PRs no longer fail on the production preset's ServiceMonitor render

The chart's production preset enables `serviceMonitor.enabled=true`, which renders a `monitoring.coreos.com/v1 ServiceMonitor` resource. Kind clusters do not ship that CRD, so `helm install --dry-run=server` exits 1 on every chart-touching PR (#476, #481, #493 all blocked). Install ONLY the ServiceMonitor CRD (slim, ~700 lines) rather than the full prometheus-operator bundle: the chart's production preset references no other monitoring.coreos.com kinds. Applied to every matrix row, not just production, so a future default-values flip cannot silently re-break the gate. CRD pinned to v0.91.0 (latest stable, 2026-05-05) per repo convention (KYVERNO_POLICIES_REF / GATEKEEPER_VERSION in scripts/policy-matrix-smoke.sh; never track main).

Closes #494.

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr · 2026-06-02T06:48:18Z

Code Review — B/A/A+ Grading

VERDICT: A (Ship, no blocking changes)

Criteria Assessment

✓ Issue Closure: closes #494 definitively. ServiceMonitor CRD is now installed before helm dry-run validates the production-preset chart.

✓ Simplicity: 23-line addition; 2 operations (apply CRD, wait for condition). No conditional logic, no loops. Documentation is heavy but load-bearing (explains why slim CRD vs full prometheus-operator bundle).

✓ Placement: inserted after kubectl sanity check, before smoke install. Dependency order is correct.

✓ No Regressions: kubectl apply is idempotent; applying the CRD to all 6 matrix rows is safe. Isolated from chart, smoke script, other workflows.

✓ Race-free Wait: uses kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s. This is the right primitive (not sleep). 60s timeout is conservative for kind.

✓ Version Pinning: v0.91.0 (released 2026-05-05), stable 4+ weeks. Follows repo convention (KYVERNO_POLICIES_REF / GATEKEEPER_VERSION). Not tracking main/latest.

✓ Scope Justification: applied unconditionally to all matrix rows (not just production), which is defensive — a future default-values flip won't silently re-break the gate. Install-bench exclusion is justified (separate workflow, doesn't call policy-matrix-smoke.sh, production preset is specialist criterion-10 gate).

Minor Note (Non-blocking)

If kubectl apply fails (bad URL, network timeout), the step should fail at that command rather than waiting out the 60s timeout: GitHub Actions runs run: blocks on Linux runners with bash -e by default, so kubectl wait is never reached. Assuming the workflow keeps the default shell, an explicit set -e at the top of the run: block would be redundant. No change is needed for the B/A grade.
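
The effect of errexit on a two-command run: block can be reproduced locally without a cluster. In this sketch `false` stands in for a failing kubectl apply and the echo stands in for the following kubectl wait; the function names are hypothetical.

```shell
# Stand-in script: `false` plays the role of a failing `kubectl apply`,
# the echo plays the role of the `kubectl wait` that follows it.
SCRIPT='false; echo "second command ran"'

# With errexit (-e) the shell stops at the first failing command.
run_with_errexit() {
  bash -e -c "$SCRIPT"
}

# Without errexit the second command still runs.
run_without_errexit() {
  bash -c "$SCRIPT"
}
```

Comparing the output of the two functions shows whether the second command is reached after the first one fails.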

Simplification Sweep

Comments: 21 lines of documentation. Necessary context for future maintainers understanding the CRD size trade-off. ✓ Keep.
Commands: Multi-line apply with backslash continuation. Readable; no change needed.
Polling pattern: Actually improves on existing gatekeeper polling (L139–149 in policy-matrix-smoke.sh); uses kubectl wait --for=condition=established instead of manual retry loop.

Ready to merge. Do not enable auto-merge (per task requirements).

#498)

## Summary

Triple-shipper closing three load-bearing infra debts that recurred on every chart/CI-touching PR. Atomic so we handle this once.

### Part 1: Makefile sharding (cascade-rebase tripwire)

**Root cause:** The root `Makefile` carried five monolithic prereq lists (`.PHONY:`, `check:`, `verify:`, `ci-fast:`, `ci-full:`). Every new gate appended one token to each list, and two open PRs touching the same line produced a 3-way merge conflict that required manual fix-up. This was the dominant source of cascade-rebases on this repo.

**Fix:** Split into `make/{phony,check,verify,ci-fast,ci-full}.mk` shards using `+=` appends. Main `Makefile` `include`s the shards; aggregate targets now consume `$(*_DEPS)`. Prereq sets are logically equivalent to `origin/main` (modulo intentional gate additions): `make -pn` shows `lint-unused-module` replaced by `lint-module-full` in `check` (Part 3) and the new `makefile-hotfile-check` added to `ci-fast`/`ci-full` (Part 1 A+). No other prereq tokens moved.

**A+:** Added `scripts/makefile-hotfile-check.sh` (+ `make makefile-hotfile-check` target) that fails if a future PR re-inlines prereq tokens into the root `Makefile`. Wired into `ci-fast` + `ci-full` so drift trips per-PR.

### Part 2: Kind-CRD bootstrap composite action

**Root cause:** Three workflows (chart.yml, policy-matrix.yml, install-bench.yml) each separately installed helm + kind + the tracecore image, and each drifted from the others on CRD prereqs (ServiceMonitor #494 fixed policy-matrix only; chart.yml + install-bench.yml remained vulnerable to the same regression).

**Fix:** Created `.github/actions/kind-cluster-setup/action.yml` as a single source of truth: pinned helm v3.16.4, kind v0.25.0, node v1.32.0, ServiceMonitor CRD v0.91.0 (#494 pin), with toggles for Gatekeeper / cert-manager CRDs (reserved for future workflows). All 3 workflows now `uses:` the composite. Old `kind-tracecore-up` shim deleted (zero remaining callsites).

**Mutation-verify:** changing the ServiceMonitor CRD URL in `kind-cluster-setup/action.yml` fails all 3 workflows uniformly by construction (single-source-of-truth pin).

### Part 3: Full module/ lint coverage (#490 follow-up of #486)

**Root cause:** `make lint` from the root never reaches `module/` (workspace mode resolves `./...` only inside the current module). PR #486 added `make lint-unused-module` for the `unused` linter only; the rest of the 13 linters declared in `.golangci.yml` were silently skipped against `module/`, accumulating 57 findings.

**Fix:** New `make lint-module-full` target runs the full `.golangci.yml` linter set against `module/`. Swept all 57 pre-existing findings to 0:

- `golangci-lint --fix`: 17 findings auto-fixed (testifylint 14, errorlint 1, perfsprint 1, staticcheck-QF1003 1)
- Real fixes: forcetypeassert → checked assertions (6); goconst → `fallbackCollectiveOp`, `PressureUnknown` constants (4); gocritic rewrites (2); predeclared renames (3); revive package-comments (1); prealloc (1); wrapcheck `fmt.Errorf "%w"` (2); G301 mkdir 0755 → 0750 (1); exhaustive → explicit `default:` clauses (7) + explicit `PressureUnknown` case (1)
- Documented opt-outs with per-finding rationale: gocyclo on 5 pattern-match dispatch funcs whose complexity tracks vocabulary cardinality, not nested logic; gosec G103 on audited `unsafe.String` aliasing carve; gosec G304 on 5 test-local fixture reads; staticcheck SA1019 on explicit pre-deprecation parity assertion (#277)

`make lint-module-full` exits 0 on the cleaned tree. Wired into `make check`, replacing `lint-unused-module` (retained for fast-iteration dead-code sweeps).

## Per-part metrics

| Part | Metric | Before | After |
|---|---|---|---|
| 1 | Makefile aggregate-list LoC (hot lines) | 5 monolithic lines | 5 `include` lines |
| 1 | Make-target prereq sets vs `origin/main` | n/a | logically equivalent (modulo intentional gate additions; see Part 1 Fix) |
| 1 | New shards under `make/` | 0 | 5 (`phony`, `check`, `verify`, `ci-fast`, `ci-full`) |
| 2 | Composite action wired into N workflows | 0 (each duplicated kind setup) | 3 (chart, policy-matrix, install-bench) |
| 2 | CRDs covered | 0 (ad-hoc per workflow) | ServiceMonitor (now), Gatekeeper + cert-manager (reserved) |
| 2 | Workflow LoC delta (kind setup blocks) | 3× ~10 lines duplicated | 3× ~10 lines (composite-action `uses:` blocks) |
| 3 | Lint findings (module/) | 57 (across 14 linters) | 0 |
| 3 | Linters enabled against module/ | 1 (`unused`) | 13 (full `.golangci.yml` set) |
| 3 | Documented opt-outs | n/a | 14 `//nolint:` directives, each with per-finding rationale |

## Closes

- #490 (full module/ lint coverage)
- Reaffirms #486 (extends from `unused` only to the 13-linter set)

## Follow-ups filed

- #497, surfaced (not caused) by this PR: `TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressure` fails on `origin/main` HEAD because the negative-fixture filter treats every non-canonical fixture as negative, including the `_real_world/*` positives added in #484. Fix sketch + repro included.

## Test plan

- [x] `make verify`: green
- [x] `make check`: green
- [x] `make lint-module-full`: exit 0
- [x] `make doc-check`: green (comment-noise sweep)
- [x] `make actionlint`: green
- [x] `make zizmor`: green
- [x] `make makefile-hotfile-check`: green (own gate)
- [x] `make -pn` byte-identity check against `origin/main` for all 4 aggregate targets: passes (modulo intentional `makefile-hotfile-check` additions to ci-fast/ci-full + `lint-module-full` superseding `lint-unused-module` in check)
- [x] `(cd module && GOWORK=off go test ./...)`: green except pre-existing #497 failure (filed)
- [ ] CI green on this PR (kind workflows actually exercise the new composite action)

```release-notes
infra: shard Makefile into make/*.mk (cascade-rebase prevention per [[rebase-cascade]]); unify 3 workflows behind .github/actions/kind-cluster-setup composite (CRD prereq install + pinned tools); enable full module/ lint coverage (57 findings → 0, 1 → 13 linters wired into make check). Closes #490; refs #494/#496.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr enabled auto-merge (squash) June 2, 2026 06:48

trilamsr merged commit 6cdf3b9 into main Jun 2, 2026
19 of 21 checks passed

trilamsr deleted the ci/494-policy-matrix-crd-prereqs branch June 2, 2026 06:54
