Skip to content

ci(policy-matrix): install ServiceMonitor CRD before helm dry-run (#494)#496

Merged
trilamsr merged 1 commit into
mainfrom
ci/494-policy-matrix-crd-prereqs
Jun 2, 2026
Merged

ci(policy-matrix): install ServiceMonitor CRD before helm dry-run (#494)#496
trilamsr merged 1 commit into
mainfrom
ci/494-policy-matrix-crd-prereqs

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

policy-matrix.yml workflow has been failing on every chart-touching PR
(blocked #476, #481, #493) since #475 landed. The chart's production
preset (values-production.yaml) flips serviceMonitor.enabled=true,
which renders a monitoring.coreos.com/v1 ServiceMonitor resource. Kind
clusters don't ship the prometheus-operator CRDs, so
helm install --dry-run=server exits 1 with:

no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first

Root cause

Kind ships only the core Kubernetes API set. monitoring.coreos.com/v1
is supplied by prometheus-operator, which the policy-matrix kind cluster
never installs. The chart's templates/servicemonitor.yaml is gated by
.Values.serviceMonitor.enabled — default false (chart stays
first-install-compatible on bare clusters), but the production preset
enables it (kube-prometheus-stack convention). The policy-matrix gate
exercises both default and production presets across PSA / Kyverno /
Gatekeeper, so the production rows hit the missing CRD on every run.

Fix

Issue #494 recommended option (a) — install the missing CRD prereq.
This PR adds a single workflow step after kind cluster spin-up but
before the smoke script:

- name: Install prometheus-operator ServiceMonitor CRD (issue #494)
  run: |
    kubectl apply -f \
      "https://github.com/prometheus-operator/prometheus-operator/v0.91.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml"
    kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s

Design choices:

  • Slim CRD (ServiceMonitor only) vs full prometheus-operator bundle.
    The chart's production preset references no other monitoring.coreos.com
    kinds. Slim install (~700 lines of YAML) avoids pulling Prometheus,
    Alertmanager, ThanosRuler, PodMonitor, Probe, PrometheusRule we don't
    exercise.
  • Applied to every matrix row (not just production). A future flip of
    the default serviceMonitor.enabled toggle cannot silently re-break
    this gate.
  • Pinned to v0.91.0 (latest stable, published 2026-05-05). Matches
    the existing KYVERNO_POLICIES_REF / GATEKEEPER_VERSION pin
    convention in scripts/policy-matrix-smoke.sh. Bumping is a reviewed
    code change — never tracks main.
  • kubectl wait --for=condition=established before the helm dry-run
    so the apiserver has registered the CRD when the chart template
    reaches the admission chain (avoids a race where the dry-run hits
    before discovery refreshes).

Gatekeeper CRD timing

Re-audited — install_gatekeeper() in scripts/policy-matrix-smoke.sh
already polls kubectl get crd ...constraints.gatekeeper.sh (line 143-149)
and the constraint byPod[*].enforced field (line 270-276) before the
smoke step exits. The kubectl get constraints -A || true in the
failure-collection step is diagnostic only and already tolerates absent
CRDs. No timing fix needed there.

Why not install-bench / chart.yml

  • install-bench.yml uses bench/install/tracecore-values.yaml which
    doesn't enable serviceMonitor — same failure shape doesn't apply.
  • chart.yml's install and upgrade jobs install with default values
    (serviceMonitor.enabled=false); the render job's production-preset
    check is helm template only (no cluster), so no API discovery runs.

Test plan

  • actionlint .github/workflows/policy-matrix.yml — exit 0
  • actionlint across all .github/workflows/ — exit 0
  • Pre-push hook suite passed locally (golangci-lint, vet, mod verify,
    attribute-namespace-check, zizmor, doc-check, alert-check,
    chart-appversion-check, rfc-status-check, slo-rules-check,
    deprecation-check, no-autoupdate-check, test-flake-audit)
  • policy-matrix workflow runs green on this PR — all 6 matrix rows
    (psa × default, psa × production, kyverno × default, kyverno ×
    production, gatekeeper × default, gatekeeper × production) plus all 3
    mutation rows.

Closes #494.

ci: install prometheus-operator ServiceMonitor CRD in policy-matrix kind cluster so chart-touching PRs no longer fail on the production preset's ServiceMonitor render

The chart's production preset enables `serviceMonitor.enabled=true`,
which renders a `monitoring.coreos.com/v1 ServiceMonitor` resource.
Kind clusters do not ship that CRD, so `helm install --dry-run=server`
exits 1 on every chart-touching PR (#476, #481, #493 all blocked).

Install ONLY the ServiceMonitor CRD (slim, ~700 lines) rather than the
full prometheus-operator bundle — the chart's production preset
references no other monitoring.coreos.com kinds. Applied to every
matrix row, not just production, so a future default-values flip
cannot silently re-break the gate.

CRD pinned to v0.91.0 (latest stable, 2026-05-05) per repo
convention (KYVERNO_POLICIES_REF / GATEKEEPER_VERSION in
scripts/policy-matrix-smoke.sh — never track main).

Closes #494.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Code Review — B/A/A+ Grading

VERDICT: A (Ship, no blocking changes)

Criteria Assessment

Issue Closure: closes #494 definitively. ServiceMonitor CRD is now installed before helm dry-run validates the production-preset chart.

Simplicity: 23-line addition; 2 operations (apply CRD, wait for condition). No conditional logic, no loops. Documentation is heavy but load-bearing (explains why slim CRD vs full prometheus-operator bundle).

Placement: inserted after kubectl sanity check, before smoke install. Dependency order is correct.

No Regressions: kubectl apply is idempotent; applying the CRD to all 6 matrix rows is safe. Isolated from chart, smoke script, other workflows.

Race-free Wait: uses kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s. This is the right primitive (not sleep). 60s timeout is conservative for kind.

Version Pinning: v0.91.0 (released 2026-05-05), stable 4+ weeks. Follows repo convention (KYVERNO_POLICIES_REF / GATEKEEPER_VERSION). Not tracking main/latest.

Scope Justification: applied unconditionally to all matrix rows (not just production), which is defensive — a future default-values flip won't silently re-break the gate. Install-bench exclusion is justified (separate workflow, doesn't call policy-matrix-smoke.sh, production preset is specialist criterion-10 gate).

Minor Note (Non-blocking)

If kubectl apply fails (bad URL, network timeout), the subsequent kubectl wait will timeout at 60s before the step fails. This is slow-fail but safe (the step fails either way). Could tighten error handling with set -e at the top of the run: block, but acceptable for B/A grade.

Simplification Sweep

  • Comments: 21 lines of documentation. Necessary context for future maintainers understanding the CRD size trade-off. ✓ Keep.
  • Commands: Multi-line apply with backslash continuation. Readable; no change needed.
  • Polling pattern: Actually improves on existing gatekeeper polling (L139–149 in policy-matrix-smoke.sh); uses kubectl wait --for=condition=established instead of manual retry loop.

Ready to merge. Do not enable auto-merge (per task requirements).

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 06:48
@trilamsr trilamsr merged commit 6cdf3b9 into main Jun 2, 2026
19 of 21 checks passed
@trilamsr trilamsr deleted the ci/494-policy-matrix-crd-prereqs branch June 2, 2026 06:54
trilamsr added a commit that referenced this pull request Jun 2, 2026
#498)

## Summary

Triple-shipper closing three load-bearing infra debts that recurred on
every chart/CI-touching PR. Atomic so we handle this once.

### Part 1 — Makefile sharding (cascade-rebase tripwire)

**Root cause:** The root `Makefile` carried four monolithic prereq lists
(`.PHONY:`, `check:`, `verify:`, `ci-fast:`, `ci-full:`). Every new gate
appended one token to each list, and two open PRs touching the same line
produced a 3-way merge conflict that required manual fix-up — the
dominant source of cascade-rebases on this repo.

**Fix:** Split into `make/{phony,check,verify,ci-fast,ci-full}.mk`
shards using `+=` appends. Main `Makefile` `include`s the shards;
aggregate targets now consume `$(*_DEPS)`. Prereq sets are logically
equivalent to `origin/main` (modulo intentional gate additions): `make
-pn` shows `lint-unused-module` replaced by `lint-module-full` in
`check` (Part 3) and the new `makefile-hotfile-check` added to
`ci-fast`/`ci-full` (Part 1 A+). No other prereq tokens moved.

**A+:** Added `scripts/makefile-hotfile-check.sh` (+ `make
makefile-hotfile-check` target) that fails if a future PR re-inlines
prereq tokens into the root `Makefile`. Wired into `ci-fast` + `ci-full`
so drift trips per-PR.

### Part 2 — Kind-CRD bootstrap composite action

**Root cause:** Three workflows (chart.yml, policy-matrix.yml,
install-bench.yml) each separately installed helm + kind + the tracecore
image, each drifted from the others on CRD prereqs (ServiceMonitor #494
fixed policy-matrix only; chart.yml + install-bench.yml remained
vulnerable to the same regression).

**Fix:** Created `.github/actions/kind-cluster-setup/action.yml` as a
single source of truth: pinned helm v3.16.4, kind v0.25.0, node v1.32.0,
ServiceMonitor CRD v0.91.0 (#494 pin), with toggles for Gatekeeper /
cert-manager CRDs (reserved for future workflows). All 3 workflows now
`uses:` the composite. Old `kind-tracecore-up` shim deleted (zero
remaining callsites).

**Mutation-verify:** changing the ServiceMonitor CRD URL in
`kind-cluster-setup/action.yml` fails all 3 workflows uniformly by
construction (single-source-of-truth pin).

### Part 3 — Full module/ lint coverage (#490 follow-up of #486)

**Root cause:** `make lint` from the root never reaches `module/`
(workspace mode resolves `./...` only inside the current module). PR
#486 added `make lint-unused-module` for the `unused` linter only; the
rest of the 13 linters declared in `.golangci.yml` were silently skipped
against `module/`, accumulating 57 findings.

**Fix:** New `make lint-module-full` target runs the full
`.golangci.yml` linter set against `module/`. Swept all 57 pre-existing
findings to 0:
- `golangci-lint --fix`: 17 findings auto-fixed (testifylint 14,
errorlint 1, perfsprint 1, staticcheck-QF1003 1)
- Real fixes: forcetypeassert → checked assertions (6); goconst →
`fallbackCollectiveOp`, `PressureUnknown` constants (4); gocritic
rewrites (2); predeclared renames (3); revive package-comments (1);
prealloc (1); wrapcheck `fmt.Errorf "%w"` (2); G301 mkdir 0755 → 0750
(1); exhaustive → explicit `default:` clauses (7) + explicit
`PressureUnknown` case (1)
- Documented opt-outs with per-finding rationale: gocyclo on 5
pattern-match dispatch funcs whose complexity tracks vocabulary
cardinality, not nested logic; gosec G103 on audited `unsafe.String`
aliasing carve; gosec G304 on 5 test-local fixture reads; staticcheck
SA1019 on explicit pre-deprecation parity assertion (#277)

`make lint-module-full` exits 0 on the cleaned tree. Wired into `make
check`, replacing `lint-unused-module` (retained for fast-iteration
dead-code sweeps).

## Per-part metrics

| Part | Metric | Before | After |
|---|---|---|---|
| 1 | Makefile aggregate-list LoC (hot lines) | 5 monolithic lines | 5
`include` lines |
| 1 | Make-target prereq sets vs `origin/main` | — | logically
equivalent (modulo intentional gate additions; see Part 1 Fix) |
| 1 | New shards under `make/` | 0 | 5 (`phony`, `check`, `verify`,
`ci-fast`, `ci-full`) |
| 2 | Composite action wired into N workflows | 0 (each duplicated kind
setup) | 3 (chart, policy-matrix, install-bench) |
| 2 | CRDs covered | 0 (ad-hoc per workflow) | ServiceMonitor (now),
Gatekeeper + cert-manager (reserved) |
| 2 | Workflow LoC delta (kind setup blocks) | 3× ~10 lines duplicated |
3× ~10 lines (composite-action `uses:` blocks) |
| 3 | Lint findings (module/) | 57 (across 14 linters) | 0 |
| 3 | Linters enabled against module/ | 1 (`unused`) | 13 (full
`.golangci.yml` set) |
| 3 | Documented opt-outs | — | 14 `//nolint:` directives, each with
per-finding rationale |

## Closes

- #490 (full module/ lint coverage)
- Reaffirms #486 (extends from `unused` only to the 13-linter set)

## Follow-ups filed

- #497 — surfaced (not caused) by this PR:
`TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressure`
fails on `origin/main` HEAD because the negative-fixture filter treats
every non-canonical fixture as negative, including the `_real_world/*`
positives added in #484. Fix sketch + repro included.

## Test plan

- [x] `make verify` — green
- [x] `make check` — green
- [x] `make lint-module-full` — exit 0
- [x] `make doc-check` — green (comment-noise sweep)
- [x] `make actionlint` — green
- [x] `make zizmor` — green
- [x] `make makefile-hotfile-check` — green (own gate)
- [x] `make -pn` byte-identity check against `origin/main` for all 4
aggregate targets — passes (modulo intentional `makefile-hotfile-check`
additions to ci-fast/ci-full + `lint-module-full` superseding
`lint-unused-module` in check)
- [x] `(cd module && GOWORK=off go test ./...)` — green except
pre-existing #497 failure (filed)
- [ ] CI green on this PR (kind workflows actually exercise the new
composite action)

```release-notes
infra: shard Makefile into make/*.mk (cascade-rebase prevention per [[rebase-cascade]]); unify 3 workflows behind .github/actions/kind-cluster-setup composite (CRD prereq install + pinned tools); enable full module/ lint coverage (57 findings → 0, 13 → 27 linters wired into make check). Closes #490; refs #494/#496.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ci(policy-matrix): install ServiceMonitor + Gatekeeper CRDs in kind setup (regression since #475)

1 participant