fix(chart): flip AppArmor default to opt-in (#492) (refs #481) by trilamsr · Pull Request #493 · TraceCoreAI/tracecore

trilamsr · 2026-06-02T06:16:03Z

Summary

PR #481 shipped securityHardening.appArmorProfile.enabled: true as the default in install/kubernetes/tracecore/values.yaml. Kubelet rejects pod-create when pod.securityContext.appArmorProfile references a profile the host cannot resolve, so the chart no longer installs on AppArmor-less nodes — including the ubuntu-latest GitHub Actions runner image (AppArmor dropped post-2024) and RHEL/SELinux production hosts. install-bench regressed; PRs #491, #484, #479, #431 are blocked behind this.

This PR implements option (a) from #492: flip the default to opt-in. values-production.yaml keeps enabled: true since AppArmor-equipped Linux clusters (the production target) ship RuntimeDefault via containerd / CRI-O.

Root cause

Default-on AppArmor in values.yaml violated the chart contract that the default render installs on a vanilla cluster. The defense-in-depth posture is correct for production-preset users; it was wrong as the unconditional default. PR #481 didn't add a CI gate to assert "default render installs on a host without AppArmor", so the regression escaped review.
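
Whether a given node belongs to the "host without AppArmor" class can be probed from the kernel's sysfs flag. The snippet below is a minimal sketch, not something this PR ships: the function name `host_has_apparmor` and its optional sysfs-root argument are hypothetical, and it checks only the kernel flag, not whether the container runtime can actually load the RuntimeDefault profile.

```shell
# Hypothetical probe: succeed only if the kernel reports AppArmor as enabled.
# The optional first argument overrides the sysfs root (handy for tests);
# it defaults to /sys on a real host.
host_has_apparmor() {
  sysfs_root="${1:-/sys}"
  flag_file="$sysfs_root/module/apparmor/parameters/enabled"
  # The file is absent when the apparmor module is not built in or loaded,
  # and contains "Y" when AppArmor is enabled.
  [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "Y" ]
}
```

A CI job or an operator could call `host_has_apparmor || echo "AppArmor-less host"` before deciding whether to opt in to the hardening.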

Changes

install/kubernetes/tracecore/values.yaml: securityHardening.appArmorProfile.enabled: true -> false; in-line guidance reflects opt-in posture and names the failing-host classes (CI runners, RHEL/SELinux).
install/kubernetes/tracecore/values-production.yaml: unchanged — production preset still hardens with enabled: true.
install/kubernetes/tracecore/README.md: defaults table + Defense-in-depth section explain the opt-in posture, point operators at values-production.yaml for the prior behavior, and link #492.
.github/workflows/chart.yml: AppArmor mutation tests reshuffled from 6 to 8 cases. T1/T2 now assert default render emits no AppArmor field or annotation on K8s 1.30 + 1.28 (regression-prevent for #492). T3/T4 cover the opt-in path (--set enabled=true) and pin pre-#492 production-preset behavior. T7/T8 explicitly pass --set enabled=true so the Localhost-profile contract still fires under the new default. Production-preset assertion (appArmorProfile.type=RuntimeDefault from values-production.yaml) is untouched.
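
The T1/T2 gate can be approximated as a text check over a rendered manifest. This is a minimal sketch under assumptions: the helper name `assert_no_apparmor` is hypothetical, it reads the render from stdin, and the real steps in chart.yml may be structured differently.

```shell
# Hypothetical helper: fail if a rendered manifest (read from stdin) carries
# either the structured appArmorProfile field or the legacy beta annotation.
assert_no_apparmor() {
  rendered="$(cat)"
  # Structured field (pod or container securityContext.appArmorProfile).
  if printf '%s\n' "$rendered" | grep -q 'appArmorProfile:'; then
    echo "FAIL: structured appArmorProfile field present" >&2
    return 1
  fi
  # Legacy annotation prefix; -F matches the dots literally.
  if printf '%s\n' "$rendered" | grep -qF 'container.apparmor.security.beta.kubernetes.io'; then
    echo "FAIL: legacy AppArmor annotation present" >&2
    return 1
  fi
  echo "OK: no AppArmor field or annotation"
}
```

A workflow step might feed it with something like `helm template demo install/kubernetes/tracecore --kube-version 1.28 | assert_no_apparmor`; the release name `demo` follows the upgrade examples in this PR, and the exact invocation used in CI is an assumption.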

Backward compatibility

Behavior change for default-values users. Operators who installed via helm install ... install/kubernetes/tracecore (no production preset) and depended on the AppArmor hardening that #481 added will see it disappear on next upgrade. Two ways to keep the prior behavior:

# Option 1 — adopt the production preset (recommended).
helm upgrade demo install/kubernetes/tracecore \
  --values install/kubernetes/tracecore/values-production.yaml

# Option 2 — keep your current values, just flip the flag.
helm upgrade demo install/kubernetes/tracecore \
  --set securityHardening.appArmorProfile.enabled=true

Operators who did not depend on the #481 hardening (it was three days old, and opt-in is the chart-hygiene norm for defense-in-depth knobs) get a chart that installs on AppArmor-less hosts again.

Test plan

Refs

Closes #492 (refs #481).

**Breaking (default-values users only).** `securityHardening.appArmorProfile.enabled` now defaults to `false` in `values.yaml` so the chart installs on AppArmor-less nodes (CI runners, RHEL/SELinux). The `values-production.yaml` preset still ships `enabled: true`, so production Linux clusters that package the `RuntimeDefault` profile (every distro with containerd / CRI-O) keep the hardening when they layer that preset. Operators upgrading default-values installs who want the prior behavior can either adopt `values-production.yaml` or pass `--set securityHardening.appArmorProfile.enabled=true`. Fixes the install-bench regression introduced in #481.

PR #481 shipped securityHardening.appArmorProfile.enabled: true as the default in install/kubernetes/tracecore/values.yaml. Kubelet rejects pod-create when the appArmorProfile field references a profile the host cannot resolve, which breaks installs on AppArmor-less nodes including the ubuntu-latest GitHub Actions runner image (post-2024) and RHEL/SELinux hosts. install-bench + downstream PRs (#491, #484, #479, #431) all regressed.

Flip the default to false in values.yaml so the chart installs on a vanilla cluster. values-production.yaml retains enabled: true since production Linux clusters package the RuntimeDefault profile via containerd / CRI-O. Operators who want the hardening either layer values-production.yaml or set the flag explicitly.

Chart mutation tests in .github/workflows/chart.yml updated:
- T1, T2 now assert default render emits NEITHER the structured appArmorProfile field nor the legacy container.apparmor.security.beta.kubernetes.io annotation on K8s 1.30 or 1.28 (regression-prevent for #492).
- T3, T4 cover the opt-in path (--set enabled=true) and pin the pre-#492 behaviour for production-preset users.
- T7, T8 explicitly pass --set enabled=true so the Localhost-profile contract is still exercised under the new default.
- Production-preset assertion at line ~509 is untouched and still asserts appArmorProfile.type=RuntimeDefault.

README defense-in-depth section clarifies opt-in posture and points operators at values-production.yaml.

Verified locally:
- helm lint: 0 warnings.
- conftest: 52 default / 91 production tests pass.
- All 8 mutation gates green.

Signed-off-by: Tri Lam <tree@lumalabs.ai>

…) (#496)

## Summary

`policy-matrix.yml` workflow has been failing on every chart-touching PR (blocked #476, #481, #493) since #475 landed. The chart's production preset (`values-production.yaml`) flips `serviceMonitor.enabled=true`, which renders a `monitoring.coreos.com/v1 ServiceMonitor` resource. Kind clusters don't ship the prometheus-operator CRDs, so `helm install --dry-run=server` exits 1 with:

```
no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first
```

## Root cause

Kind ships only the core Kubernetes API set. `monitoring.coreos.com/v1` is supplied by prometheus-operator, which the policy-matrix kind cluster never installs. The chart's `templates/servicemonitor.yaml` is gated by `.Values.serviceMonitor.enabled`, which defaults to `false` (chart stays first-install-compatible on bare clusters), but the production preset enables it (kube-prometheus-stack convention). The policy-matrix gate exercises both default and production presets across PSA / Kyverno / Gatekeeper, so the production rows hit the missing CRD on every run.

## Fix

Issue #494 recommended option (a): install the missing CRD prereq. This PR adds a single workflow step after kind cluster spin-up but before the smoke script:

```yaml
- name: Install prometheus-operator ServiceMonitor CRD (issue #494)
  run: |
    kubectl apply -f \
      "https://github.com/prometheus-operator/prometheus-operator/v0.91.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml"
    kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s
```

Design choices:

- **Slim CRD (ServiceMonitor only)** vs full prometheus-operator bundle. The chart's production preset references no other `monitoring.coreos.com` kinds. Slim install (~700 lines of YAML) avoids pulling Prometheus, Alertmanager, ThanosRuler, PodMonitor, Probe, PrometheusRule we don't exercise.
- **Applied to every matrix row** (not just production). A future flip of the default `serviceMonitor.enabled` toggle cannot silently re-break this gate.
- **Pinned to `v0.91.0`** (latest stable, published 2026-05-05). Matches the existing `KYVERNO_POLICIES_REF` / `GATEKEEPER_VERSION` pin convention in `scripts/policy-matrix-smoke.sh`. Bumping is a reviewed code change; it never tracks `main`.
- **`kubectl wait --for=condition=established`** before the helm dry-run so the apiserver has registered the CRD when the chart template reaches the admission chain (avoids a race where the dry-run hits before discovery refreshes).

## Gatekeeper CRD timing

Re-audited: `install_gatekeeper()` in `scripts/policy-matrix-smoke.sh` already polls `kubectl get crd ...constraints.gatekeeper.sh` (line 143-149) and the constraint `byPod[*].enforced` field (line 270-276) before the smoke step exits. The `kubectl get constraints -A || true` in the failure-collection step is diagnostic only and already tolerates absent CRDs. No timing fix needed there.

## Why not install-bench / chart.yml

- `install-bench.yml` uses `bench/install/tracecore-values.yaml`, which doesn't enable serviceMonitor, so the same failure shape doesn't apply.
- `chart.yml`'s `install` and `upgrade` jobs install with default values (`serviceMonitor.enabled=false`); the `render` job's production-preset check is `helm template` only (no cluster), so no API discovery runs.

## Test plan

- [x] `actionlint .github/workflows/policy-matrix.yml`: exit 0
- [x] `actionlint` across all `.github/workflows/`: exit 0
- [x] Pre-push hook suite passed locally (golangci-lint, vet, mod verify, attribute-namespace-check, zizmor, doc-check, alert-check, chart-appversion-check, rfc-status-check, slo-rules-check, deprecation-check, no-autoupdate-check, test-flake-audit)
- [ ] policy-matrix workflow runs green on this PR: all 6 matrix rows (psa × default, psa × production, kyverno × default, kyverno × production, gatekeeper × default, gatekeeper × production) plus all 3 mutation rows.

Closes #494.

```release-notes
ci: install prometheus-operator ServiceMonitor CRD in policy-matrix kind cluster so chart-touching PRs no longer fail on the production preset's ServiceMonitor render
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>

## Summary

Removes `.github/workflows/policy-matrix.yml`. Engine-specific admission validation (PSA-restricted × Kyverno × Gatekeeper × default+production) delivered negative ROI at rc1.

## Root cause

4 PRs blocked or chasing this workflow's flakes (#475 introduction, #481, #498, #501). Caught zero real regressions; only its own infra bugs:

- ServiceMonitor CRD bootstrap race (#494)
- AppArmor host-capability mismatch (#481 → #493)
- kubectl wait .status.conditions nil race (#500 → #501)

## Coverage retained (without policy-matrix)

- `conftest`: offline PSS-baseline + restricted validation.
- `helm lint`: chart structural validation.
- `kubeconform`: K8s API conformance.
- `kubectl apply --dry-run=server` (chart.yml install/upgrade jobs): API-level breakage on generic kind cluster.

## What stays in tree

- `scripts/policy-matrix-smoke.sh` + Gatekeeper/Kyverno bundle refs: cheap reactivation when GA triggers fire.
- `install/kubernetes/tracecore/policies/conftest/**`: offline policy bundle (still active).

## Re-enable triggers (tracked in #502)

- GA criterion #1 (third-party audit) requests engine-specific compat validation.
- First operator running under Kyverno/Gatekeeper reports admission rot.
- CRD-bootstrap pattern stabilises across other workflows.

## Test plan

- [x] `make doc-check` exit 0 (post comment-edit in kind-cluster-setup action.yml).
- [x] No remaining policy-matrix.yml references in repo (verified by grep).
- [x] Pre-commit hooks green (lint/vet/mod-verify/attribute-namespace).
- [x] README + install-bench stale refs scrubbed (follow-up commit).

```release-notes
ci: defer engine-specific policy-matrix workflow (PSA × Kyverno × Gatekeeper admission validation) to GA. Coverage retained via conftest + helm lint + kubeconform + kubectl apply --dry-run=server. Re-enable tracked in #502.
```

Refs #502 #475 #494 #500.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr enabled auto-merge (squash) June 2, 2026 06:21

trilamsr merged commit a0aa4df into main Jun 2, 2026
19 of 25 checks passed

trilamsr deleted the fix/492-apparmor-default-opt-in branch June 2, 2026 06:24

This was referenced Jun 2, 2026

ci(policy-matrix): install ServiceMonitor + Gatekeeper CRDs in kind setup (regression since #475) #494

Closed

ci(policy-matrix): install ServiceMonitor CRD before helm dry-run (#494) #496

Merged

trilamsr mentioned this pull request Jun 2, 2026

chore: defer engine-specific policy-matrix workflow to GA #503

Merged

4 tasks

trilamsr merged 1 commit into main from fix/492-apparmor-default-opt-in
