ci(policy-matrix): production values + mutation gate (#138 A+)#475
Conversation
Extend the existing policy-matrix workflow with the A+ deliverables from the original #138 task brief that were deferred when PR #289 landed: 1. Two-dimensional matrix — every engine (PSA-restricted, Kyverno baseline+restricted, Gatekeeper PSP) now runs against BOTH the chart defaults AND values-production.yaml (the v1.0-rc1 cut-criteria-10 preset). The task brief was explicit: validate production values against real policy engines, not just defaults. 2. policy-matrix-mutation job — a falsifier that applies the conftest testdata fixture bad-allowprivilegeescalation.yaml via kubectl apply --dry-run=server and asserts every engine REJECTS it. Without this gate, a no-op policy bundle (forgot Enforce mode, forgot constraint apply, forgot PSA label) would let the policy-matrix rows pass green and ship false confidence. 3. Chart README — new "Live-cluster policy validation" subsection cross-links to the workflow + smoke script + local repro recipe. The smoke script gains a VALUES_FILE env knob for the production overlay and a SKIP_SMOKE knob the mutation job uses to provision the engine without running the chart dry-run (the chart's values.schema forbids the mutation at template-render time; we bypass helm and apply the bad fixture directly so the API server's policy chain is what we exercise). actionlint + shellcheck + zizmor (vs repo artipacked baseline) clean. Signed-off-by: Tri Lam <tree@lumalabs.ai>
Adversarial Review: PR #475 — policy-matrix A+ follow-upB/A/A+ Grading Criteria
Findings.github/workflows/policy-matrix.yml:206: Kyverno denial message format is bundle-version-dependent. If upstream .github/workflows/policy-matrix.yml:174: SKIP_SMOKE=1 in mutation job hides engine provisioning errors. If install/kubernetes/tracecore/README.md:589+: 47-line addition includes integration docs (workflow explanation, engine/bundle table, mutation rationale, local repro recipe). All content load-bearing; no fluff detected. scripts/policy-matrix-smoke.sh:24-40: VALUES_FILE parameter defaults to empty string; helm layering conditional works correctly. File existence validated. Sound. Overall design: Production preset dimension + automated falsifier close stated A+ gaps from #138. Matrix remains proportional (3×2), testdata DRY wins, ENV knobs necessary. VERDICTGrade: A — Ship ready. Both identified risks have documented mitigations; neither blocks merge. |
…) (#496) ## Summary `policy-matrix.yml` workflow has been failing on every chart-touching PR (blocked #476, #481, #493) since #475 landed. The chart's production preset (`values-production.yaml`) flips `serviceMonitor.enabled=true`, which renders a `monitoring.coreos.com/v1 ServiceMonitor` resource. Kind clusters don't ship the prometheus-operator CRDs, so `helm install --dry-run=server` exits 1 with: ``` no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1" ensure CRDs are installed first ``` ## Root cause Kind ships only the core Kubernetes API set. `monitoring.coreos.com/v1` is supplied by prometheus-operator, which the policy-matrix kind cluster never installs. The chart's `templates/servicemonitor.yaml` is gated by `.Values.serviceMonitor.enabled` — default `false` (chart stays first-install-compatible on bare clusters), but the production preset enables it (kube-prometheus-stack convention). The policy-matrix gate exercises both default and production presets across PSA / Kyverno / Gatekeeper, so the production rows hit the missing CRD on every run. ## Fix Issue #494 recommended option (a) — install the missing CRD prereq. This PR adds a single workflow step after kind cluster spin-up but before the smoke script: ```yaml - name: Install prometheus-operator ServiceMonitor CRD (issue #494) run: | kubectl apply -f \ "https://github.com/prometheus-operator/prometheus-operator/v0.91.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml" kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s ``` Design choices: - **Slim CRD (ServiceMonitor only)** vs full prometheus-operator bundle. The chart's production preset references no other `monitoring.coreos.com` kinds. Slim install (~700 lines of YAML) avoids pulling Prometheus, Alertmanager, ThanosRuler, PodMonitor, Probe, PrometheusRule we don't exercise. - **Applied to every matrix row** (not just production). A future flip of the default `serviceMonitor.enabled` toggle cannot silently re-break this gate. - **Pinned to `v0.91.0`** (latest stable, published 2026-05-05). Matches the existing `KYVERNO_POLICIES_REF` / `GATEKEEPER_VERSION` pin convention in `scripts/policy-matrix-smoke.sh`. Bumping is a reviewed code change — never tracks `main`. - **`kubectl wait --for=condition=established`** before the helm dry-run so the apiserver has registered the CRD when the chart template reaches the admission chain (avoids a race where the dry-run hits before discovery refreshes). ## Gatekeeper CRD timing Re-audited — `install_gatekeeper()` in `scripts/policy-matrix-smoke.sh` already polls `kubectl get crd ...constraints.gatekeeper.sh` (line 143-149) and the constraint `byPod[*].enforced` field (line 270-276) before the smoke step exits. The `kubectl get constraints -A || true` in the failure-collection step is diagnostic only and already tolerates absent CRDs. No timing fix needed there. ## Why not install-bench / chart.yml - `install-bench.yml` uses `bench/install/tracecore-values.yaml` which doesn't enable serviceMonitor — same failure shape doesn't apply. - `chart.yml`'s `install` and `upgrade` jobs install with default values (`serviceMonitor.enabled=false`); the `render` job's production-preset check is `helm template` only (no cluster), so no API discovery runs. ## Test plan - [x] `actionlint .github/workflows/policy-matrix.yml` — exit 0 - [x] `actionlint` across all `.github/workflows/` — exit 0 - [x] Pre-push hook suite passed locally (golangci-lint, vet, mod verify, attribute-namespace-check, zizmor, doc-check, alert-check, chart-appversion-check, rfc-status-check, slo-rules-check, deprecation-check, no-autoupdate-check, test-flake-audit) - [ ] policy-matrix workflow runs green on this PR — all 6 matrix rows (psa × default, psa × production, kyverno × default, kyverno × production, gatekeeper × default, gatekeeper × production) plus all 3 mutation rows. Closes #494. ```release-notes ci: install prometheus-operator ServiceMonitor CRD in policy-matrix kind cluster so chart-touching PRs no longer fail on the production preset's ServiceMonitor render ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
## Summary Removes `.github/workflows/policy-matrix.yml`. Engine-specific admission validation (PSA-restricted × Kyverno × Gatekeeper × default+production) delivered negative ROI at rc1. ## Root cause 4 PRs blocked or chasing this workflow's flakes (#475 introduction, #481, #498, #501). Caught zero real regressions; only its own infra bugs: - ServiceMonitor CRD bootstrap race (#494) - AppArmor host-capability mismatch (#481 → #493) - kubectl wait .status.conditions nil race (#500 → #501) ## Coverage retained (without policy-matrix) - `conftest` — offline PSS-baseline + restricted validation. - `helm lint` — chart structural validation. - `kubeconform` — K8s API conformance. - `kubectl apply --dry-run=server` (chart.yml install/upgrade jobs) — API-level breakage on generic kind cluster. ## What stays in tree - `scripts/policy-matrix-smoke.sh` + Gatekeeper/Kyverno bundle refs — cheap reactivation when GA triggers fire. - `install/kubernetes/tracecore/policies/conftest/**` — offline policy bundle (still active). ## Re-enable triggers (tracked in #502) - GA criterion #1 (third-party audit) requests engine-specific compat validation. - First operator running under Kyverno/Gatekeeper reports admission rot. - CRD-bootstrap pattern stabilises across other workflows. ## Test plan - [x] `make doc-check` exit 0 (post comment-edit in kind-cluster-setup action.yml). - [x] No remaining policy-matrix.yml references in repo (verified by grep). - [x] Pre-commit hooks green (lint/vet/mod-verify/attribute-namespace). - [x] README + install-bench stale refs scrubbed (follow-up commit). ```release-notes ci: defer engine-specific policy-matrix workflow (PSA × Kyverno × Gatekeeper admission validation) to GA. Coverage retained via conftest + helm lint + kubeconform + kubectl apply --dry-run=server. Re-enable tracked in #502. ``` Refs #502 #475 #494 #500. --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
Summary
Completes the A+ deliverables for issue #138 that were deferred when PR #289 landed. #138 is already closed, so this PR is a follow-up that closes the remaining gaps from the original task brief, not a reopen.
What #289 shipped (B/A grade):
policy-matrix.ymlworkflow with three engines (PSA-restricted, Kyverno baseline+restricted, Gatekeeper PSP constraint templates) runninghelm install --dry-run=serveragainst the chart with default values only.What this PR adds:
install/kubernetes/tracecore/values-production.yaml(the v1.0-rc1 cut-criteria-10 preset — NetworkPolicy + PDB + ServiceMonitor + hardened gracePeriod + pinned image policy). The original [followup] Live-cluster policy-engine validation for example daemonsets #138 task brief was explicit: validate production values against real policy engines, not just defaults. Matrix grows from 3 rows to 6.policy-matrix-mutationjob — automated falsifier. Applies the existing conftest testdata fixturebad-allowprivilegeescalation.yamlviakubectl apply --dry-run=serveragainst a namespace governed by each engine, then asserts the API server rejects it. Without this gate, a no-op policy bundle (forgot Enforce mode on Kyverno, forgot to apply Gatekeeper constraints, forgot the PSA namespace label) would let everypolicy-matrixrow pass green and ship false confidence.helmfor the mutation because the chart'svalues.schema.jsonpinscontainerSecurityContext.allowPrivilegeEscalation: const false— helm itself would reject the values before the API server saw the manifest. The point of the mutation gate is to exercise the API server's policy chain, not the chart schema (the conftest gate inchart.ymlalready covers that).VALUES_FILE(production overlay) andSKIP_SMOKE(engine-only provision, used by the mutation job).Test plan
actionlint .github/workflows/policy-matrix.yml— exit 0 (clean).shellcheck scripts/policy-matrix-smoke.sh— exit 0 (clean).zizmor .github/workflows/policy-matrix.yml— sameartipackedlow-confidence baseline as the rest of the repo, no new findings.bash -n scripts/policy-matrix-smoke.sh— syntax clean.policy-matrixrows green (3 engines × {default, production} values profiles).policy-matrix-mutationrows green (each engine rejectsbad-allowprivilegeescalation.yamlwithallowPrivilegeEscalationin the denial).Root cause (why this is needed)
Issue #138 acceptance criteria included: "passes today; fails on the daemonsets if they violate a policy". PR #289 implemented the "passes today" half against defaults. It did not implement (a) the production-preset coverage the rc1 cut now requires or (b) an automated mutation falsifier — the test plan listed the mutation as a manual checkbox. Both are load-bearing for the falsifiability claim of the gate. This PR closes that gap.
Hard rules respected
chart.yml's install-to-Ready M5b gate (touched zero install/upgrade-job lines).docs/followups/opportunistic.md"premature".