Skip to content

fix(chart): flip AppArmor default to opt-in (#492) (refs #481)#493

Merged
trilamsr merged 1 commit into
mainfrom
fix/492-apparmor-default-opt-in
Jun 2, 2026
Merged

fix(chart): flip AppArmor default to opt-in (#492) (refs #481)#493
trilamsr merged 1 commit into
mainfrom
fix/492-apparmor-default-opt-in

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

PR #481 shipped securityHardening.appArmorProfile.enabled: true as the default in install/kubernetes/tracecore/values.yaml. Kubelet rejects pod-create when pod.securityContext.appArmorProfile references a profile the host cannot resolve, so the chart no longer installs on AppArmor-less nodes — including the ubuntu-latest GitHub Actions runner image (AppArmor dropped post-2024) and RHEL/SELinux production hosts. install-bench regressed; PRs #491, #484, #479, #431 are blocked behind this.

This PR implements option (a) from #492: flip the default to opt-in. values-production.yaml keeps enabled: true since AppArmor-equipped Linux clusters (the production target) ship RuntimeDefault via containerd / CRI-O.

Root cause

Default-on AppArmor in values.yaml violated the chart contract that the default render installs on a vanilla cluster. The defense-in-depth posture is correct for production-preset users; it was wrong as the unconditional default. PR #481 didn't add a CI gate to assert "default render installs on a host without AppArmor", so the regression escaped review.

Changes

Backward compatibility

Behavior change for default-values users. Operators who installed via helm install ... install/kubernetes/tracecore (no production preset) and depended on the AppArmor hardening that #481 added will see it disappear on next upgrade. Two ways to keep the prior behavior:

# Option 1 — adopt the production preset (recommended).
helm upgrade demo install/kubernetes/tracecore \
  --values install/kubernetes/tracecore/values-production.yaml

# Option 2 — keep your current values, just flip the flag.
helm upgrade demo install/kubernetes/tracecore \
  --set securityHardening.appArmorProfile.enabled=true

Operators who relied on the chart's documented default (#481 was three days old; opt-in is the chart-hygiene norm for defense-in-depth knobs) get a quieter install on AppArmor-less hosts.

Test plan

Refs

Closes #492 (refs #481).

**Breaking (default-values users only).** `securityHardening.appArmorProfile.enabled` now defaults to `false` in `values.yaml` so the chart installs on AppArmor-less nodes (CI runners, RHEL/SELinux). The `values-production.yaml` preset still ships `enabled: true` — production Linux clusters that package the `RuntimeDefault` profile (every distro with containerd / CRI-O) keep the hardening when they layer that preset. Operators upgrading default-values installs who want the prior behavior can either adopt `values-production.yaml` or set `--set securityHardening.appArmorProfile.enabled=true`. Fixes the install-bench regression introduced in #481.

PR #481 shipped securityHardening.appArmorProfile.enabled: true as the
default in install/kubernetes/tracecore/values.yaml. Kubelet rejects
pod-create when the appArmorProfile field references a profile the
host cannot resolve, which breaks installs on AppArmor-less nodes
including the ubuntu-latest GitHub Actions runner image (post-2024)
and RHEL/SELinux hosts. install-bench + downstream PRs (#491, #484,
#479, #431) all regressed.

Flip the default to false in values.yaml so the chart installs on a
vanilla cluster. values-production.yaml retains enabled: true since
production Linux clusters package the RuntimeDefault profile via
containerd / CRI-O. Operators who want the hardening either layer
values-production.yaml or set the flag explicitly.

Chart mutation tests in .github/workflows/chart.yml updated:
- T1, T2 now assert default render emits NEITHER the structured
  appArmorProfile field nor the legacy
  container.apparmor.security.beta.kubernetes.io annotation on
  K8s 1.30 or 1.28 (regression-prevent for #492).
- T3, T4 cover the opt-in path (--set enabled=true) and pin the
  pre-#492 behaviour for production-preset users.
- T7, T8 explicitly pass --set enabled=true so the Localhost-profile
  contract is still exercised under the new default.
- Production-preset assertion at line ~509 is untouched and still
  asserts appArmorProfile.type=RuntimeDefault.

README defense-in-depth section clarifies opt-in posture and points
operators at values-production.yaml.

Verified locally:
- helm lint: 0 warnings.
- conftest: 52 default / 91 production tests pass.
- All 8 mutation gates green.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 06:21
@trilamsr trilamsr merged commit a0aa4df into main Jun 2, 2026
19 of 25 checks passed
@trilamsr trilamsr deleted the fix/492-apparmor-default-opt-in branch June 2, 2026 06:24
trilamsr added a commit that referenced this pull request Jun 2, 2026
…) (#496)

## Summary

`policy-matrix.yml` workflow has been failing on every chart-touching PR
(blocked #476, #481, #493) since #475 landed. The chart's production
preset (`values-production.yaml`) flips `serviceMonitor.enabled=true`,
which renders a `monitoring.coreos.com/v1 ServiceMonitor` resource. Kind
clusters don't ship the prometheus-operator CRDs, so
`helm install --dry-run=server` exits 1 with:

```
no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first
```

## Root cause

Kind ships only the core Kubernetes API set. `monitoring.coreos.com/v1`
is supplied by prometheus-operator, which the policy-matrix kind cluster
never installs. The chart's `templates/servicemonitor.yaml` is gated by
`.Values.serviceMonitor.enabled` — default `false` (chart stays
first-install-compatible on bare clusters), but the production preset
enables it (kube-prometheus-stack convention). The policy-matrix gate
exercises both default and production presets across PSA / Kyverno /
Gatekeeper, so the production rows hit the missing CRD on every run.

## Fix

Issue #494 recommended option (a) — install the missing CRD prereq.
This PR adds a single workflow step after kind cluster spin-up but
before the smoke script:

```yaml
- name: Install prometheus-operator ServiceMonitor CRD (issue #494)
  run: |
    kubectl apply -f \
      "https://github.com/prometheus-operator/prometheus-operator/v0.91.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml"
    kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s
```

Design choices:

- **Slim CRD (ServiceMonitor only)** vs full prometheus-operator bundle.
The chart's production preset references no other
`monitoring.coreos.com`
  kinds. Slim install (~700 lines of YAML) avoids pulling Prometheus,
  Alertmanager, ThanosRuler, PodMonitor, Probe, PrometheusRule we don't
  exercise.
- **Applied to every matrix row** (not just production). A future flip
of
  the default `serviceMonitor.enabled` toggle cannot silently re-break
  this gate.
- **Pinned to `v0.91.0`** (latest stable, published 2026-05-05). Matches
  the existing `KYVERNO_POLICIES_REF` / `GATEKEEPER_VERSION` pin
  convention in `scripts/policy-matrix-smoke.sh`. Bumping is a reviewed
  code change — never tracks `main`.
- **`kubectl wait --for=condition=established`** before the helm dry-run
  so the apiserver has registered the CRD when the chart template
  reaches the admission chain (avoids a race where the dry-run hits
  before discovery refreshes).

## Gatekeeper CRD timing

Re-audited — `install_gatekeeper()` in `scripts/policy-matrix-smoke.sh`
already polls `kubectl get crd ...constraints.gatekeeper.sh` (line
143-149)
and the constraint `byPod[*].enforced` field (line 270-276) before the
smoke step exits. The `kubectl get constraints -A || true` in the
failure-collection step is diagnostic only and already tolerates absent
CRDs. No timing fix needed there.

## Why not install-bench / chart.yml

- `install-bench.yml` uses `bench/install/tracecore-values.yaml` which
  doesn't enable serviceMonitor — same failure shape doesn't apply.
- `chart.yml`'s `install` and `upgrade` jobs install with default values
  (`serviceMonitor.enabled=false`); the `render` job's production-preset
  check is `helm template` only (no cluster), so no API discovery runs.

## Test plan

- [x] `actionlint .github/workflows/policy-matrix.yml` — exit 0
- [x] `actionlint` across all `.github/workflows/` — exit 0
- [x] Pre-push hook suite passed locally (golangci-lint, vet, mod
verify,
  attribute-namespace-check, zizmor, doc-check, alert-check,
  chart-appversion-check, rfc-status-check, slo-rules-check,
  deprecation-check, no-autoupdate-check, test-flake-audit)
- [ ] policy-matrix workflow runs green on this PR — all 6 matrix rows
  (psa × default, psa × production, kyverno × default, kyverno ×
  production, gatekeeper × default, gatekeeper × production) plus all 3
  mutation rows.

Closes #494.

```release-notes
ci: install prometheus-operator ServiceMonitor CRD in policy-matrix kind cluster so chart-touching PRs no longer fail on the production preset's ServiceMonitor render
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

Removes `.github/workflows/policy-matrix.yml`. Engine-specific admission
validation (PSA-restricted × Kyverno × Gatekeeper × default+production)
delivered negative ROI at rc1.

## Root cause

4 PRs blocked or chasing this workflow's flakes (#475 introduction,
#481, #498, #501). Caught zero real regressions; only its own infra
bugs:
- ServiceMonitor CRD bootstrap race (#494)
- AppArmor host-capability mismatch (#481#493)
- kubectl wait .status.conditions nil race (#500#501)

## Coverage retained (without policy-matrix)

- `conftest` — offline PSS-baseline + restricted validation.
- `helm lint` — chart structural validation.
- `kubeconform` — K8s API conformance.
- `kubectl apply --dry-run=server` (chart.yml install/upgrade jobs) —
API-level breakage on generic kind cluster.

## What stays in tree

- `scripts/policy-matrix-smoke.sh` + Gatekeeper/Kyverno bundle refs —
cheap reactivation when GA triggers fire.
- `install/kubernetes/tracecore/policies/conftest/**` — offline policy
bundle (still active).

## Re-enable triggers (tracked in #502)

- GA criterion #1 (third-party audit) requests engine-specific compat
validation.
- First operator running under Kyverno/Gatekeeper reports admission rot.
- CRD-bootstrap pattern stabilises across other workflows.

## Test plan

- [x] `make doc-check` exit 0 (post comment-edit in kind-cluster-setup
action.yml).
- [x] No remaining policy-matrix.yml references in repo (verified by
grep).
- [x] Pre-commit hooks green (lint/vet/mod-verify/attribute-namespace).
- [x] README + install-bench stale refs scrubbed (follow-up commit).

```release-notes
ci: defer engine-specific policy-matrix workflow (PSA × Kyverno × Gatekeeper admission validation) to GA. Coverage retained via conftest + helm lint + kubeconform + kubectl apply --dry-run=server. Re-enable tracked in #502.
```

Refs #502 #475 #494 #500.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

regression(chart): #481 AppArmor default-on breaks install-bench on AppArmor-less hosts

1 participant