feat(chart): version-gated AppArmor RuntimeDefault on DaemonSet (M5b) #481

Merged: trilamsr merged 3 commits into main from feat/m5b-chart-apparmor-profile on Jun 2, 2026

@trilamsr trilamsr commented Jun 2, 2026

Summary

  • Adds version-gated appArmorProfile: { type: RuntimeDefault } on the DaemonSet pod via new securityHardening.appArmorProfile values key (default enabled: true). On Kubernetes 1.30+ emits the GA structured pod.securityContext.appArmorProfile field; on 1.28 / 1.29 (chart kubeVersion floor >=1.28.0-0) falls back to the legacy container.apparmor.security.beta.kubernetes.io/<container>: runtime/default pod annotation. Auto-selected via semverCompare against .Capabilities.KubeVersion.Version; operators do not pick the form.
  • Cross-linked from install/kubernetes/tracecore/README.md §"Defense-in-depth above restricted-PSS" and a new STRIDE elevation row under docs/threat-model.md §B1 (host filesystem reads).
  • New CI step in .github/workflows/chart.yml exercises six falsifiers: default render @ 1.30 (structured field, no annotation), default render @ 1.28 (legacy annotation, no structured field), toggle-off @ 1.30 and @ 1.28 (neither code path renders), type: Localhost without localhostProfile (fails closed with operator-visible error), and type: Localhost with profile (custom path renders through). The production-preset CI step also asserts the structured field at the embedded helm kubeVersion. Closes the appArmor item in docs/followups/M5b.md.
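The kubeVersion gate in the first bullet can be illustrated outside Helm. This is a minimal shell sketch under stated assumptions, not the chart's template: `apparmor_form` is a hypothetical helper that approximates `semverCompare ">=1.30.0-0"` by comparing only the major and minor components, so pre-release and build-metadata variants of 1.30 land on the structured-field side.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the chart): approximates the kubeVersion gate.
# Prints "structured" for Kubernetes >= 1.30 and "annotation" for older versions.
apparmor_form() {
  local version="${1#v}"          # tolerate a leading "v", as in v1.30.2
  local major="${version%%.*}"    # text before the first dot
  local rest="${version#*.}"      # text after the first dot
  local minor="${rest%%[!0-9]*}"  # leading digits of the minor component
  if [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 30 ]; }; then
    echo "structured"   # pod.securityContext.appArmorProfile
  else
    echo "annotation"   # container.apparmor.security.beta.kubernetes.io/<container>
  fi
}

apparmor_form "1.30.0"
apparmor_form "1.28.0"
```

The sketch ignores the patch component on purpose: the gate described above only distinguishes 1.30+ from 1.28 / 1.29.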

Root cause

Restricted PSS permits an undefined AppArmor profile, so the chart was already compliant, but clusters with stricter local policy and adopter security checklists flag the absence. The chart didn't pin a defense-in-depth layer that costs nothing on Linux nodes. Shipped proactively (sibling to L31 production-preset hardening) rather than waiting for the M5b follow-up trigger ("kubeVersion floor moves to >=1.30, or first adopter asks").

Local verification

chart: version-gated AppArmor RuntimeDefault pinned on the DaemonSet via the new
       `securityHardening.appArmorProfile.enabled` toggle (default true).
       Kubernetes 1.30+ renders the GA structured
       `pod.securityContext.appArmorProfile` field; 1.28 / 1.29 renders the
       legacy `container.apparmor.security.beta.kubernetes.io/<container>`
       pod annotation. Auto-selected per cluster kubeVersion; opt out via
       `--set securityHardening.appArmorProfile.enabled=false`.

Ran end-to-end against helm v4.2.0 + conftest dev / OPA 1.15.2 locally:

  • helm lint install/kubernetes/tracecore + helm lint -f values-production.yaml — both zero WARNINGs.
  • helm template --kube-version 1.30.0 and --kube-version 1.28.0 for both default values and the production preset — all four exit 0.
  • conftest test --policy policies/conftest/tracecore.rego against all four renders — 52/52 + 52/52 + 91/91 + 91/91 tests passed.
  • All 12 bad-*.yaml fixtures still denied; good-baseline.yaml + good-sys-ptrace.yaml still passed.
  • Mutation sweep: toggle off → neither code path renders; type: Localhost without localhostProfile → render fails closed with a "localhostProfile to be set" message; type: Localhost with profile → structured field carries the custom path.
  • values.schema.json rejects type: BadValue with the upstream JSON-schema enum error.
  • pre-commit hooks (golangci-lint, go vet, go mod verify, attribute-namespace-check) and pre-push hooks (doc-check, no-autoupdate-check) all green.
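The render checks in this list boil down to asking which AppArmor form, if any, appears in the rendered manifest. A rough sketch of such a check follows; `classify_apparmor_render` is a hypothetical name, and the grep patterns are assumptions about how the field and the annotation appear in the output, not the assertions used in `.github/workflows/chart.yml`.

```bash
#!/usr/bin/env bash
# Hypothetical check (not the CI step itself): classify a rendered manifest
# read from stdin. Prints "structured", "annotation", "both", or "none".
classify_apparmor_render() {
  local manifest has_field=0 has_annotation=0
  manifest="$(cat)"
  # Structured form: an appArmorProfile key in a securityContext.
  if printf '%s\n' "$manifest" | grep -q '^[[:space:]]*appArmorProfile:'; then
    has_field=1
  fi
  # Legacy form: the beta annotation prefix followed by a container name.
  if printf '%s\n' "$manifest" | grep -q 'container\.apparmor\.security\.beta\.kubernetes\.io/'; then
    has_annotation=1
  fi
  case "${has_field}${has_annotation}" in
    10) echo "structured" ;;
    01) echo "annotation" ;;
    11) echo "both" ;;
    *)  echo "none" ;;
  esac
}

# Intended use (requires helm, so it is not run here):
#   helm template install/kubernetes/tracecore --kube-version 1.30.0 | classify_apparmor_render
```

A result of "both" would indicate that the two code paths are not mutually exclusive, which is the condition the falsifier sweep is meant to rule out.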

Test plan

  • CI chart / render step passes the new M5b appArmor falsifier.
  • CI chart / render production-preset assertion lights up (structured field at embedded helm kubeVersion).
  • CI chart / install (kind) rolls clean — kind clusters in CI run K8s ≥1.30, so the structured field path is the one exercised at install time.

Cross-links

  • docs/followups/M5b.md appArmor item — flipped to [x] with link back to this PR's chart README + threat-model anchors.
  • docs/threat-model.md §B1 — gains an Elevation STRIDE row naming AppArmor RuntimeDefault as the defense-in-depth above restricted-PSS for the /dev/kmsg + journald hostPath surface.
  • install/kubernetes/tracecore/README.md §Defense-in-depth above restricted-PSS — operator-facing explanation of which form renders per kubeVersion.

Grade

A+: TDD-verified six-way falsifier sweep covers both code paths and the failure-closed path. Defaults-on with explicit opt-out; sibling-style with tls/networkPolicy/podDisruptionBudget toggle conventions in the existing chart. Threat-model row, README §security cross-link, schema enum guard, and production-preset duplicate-render assertion all wired. Closes the M5b appArmor follow-up without breaking the default render on any supported kubeVersion.

trilamsr added 2 commits June 1, 2026 21:23
Restricted PSS permits an undefined AppArmor profile, so the chart was
compliant today; explicit RuntimeDefault narrows the access surface
(file paths, mounts, capabilities) a compromised receiver could reach
against the read-only /dev/kmsg +
journald hostPath mounts and removes one item from adopter security
checklists. Triggered proactively (sibling to L31 production-preset
hardening) rather than waiting for the M5b follow-up trigger
("kubeVersion floor moves to >=1.30, or first adopter asks").

Version-gating: the structured pod.securityContext.appArmorProfile
field is GA in K8s 1.30+; on 1.28 / 1.29 (chart kubeVersion floor
>=1.28.0-0) the legacy
`container.apparmor.security.beta.kubernetes.io/<container>` pod
annotation carries the same intent. The template auto-selects via
semverCompare against .Capabilities.KubeVersion.Version; operators do
not pick the form. Toggle securityHardening.appArmorProfile.enabled
(default true) opts out; type=Localhost + localhostProfile wires a
node-preloaded custom profile (fails closed if profile name missing).

Verified locally with helm v4.2.0 + conftest dev/OPA 1.15.2:
- helm lint default + production: zero WARNINGs
- helm template default @ 1.30 / 1.28: renders structured / legacy
- helm template production @ 1.30 / 1.28: renders structured / legacy
- conftest default @ 1.30 / 1.28: 52/52 passed each
- conftest production @ 1.30 / 1.28: 91/91 passed each
- toggle off @ 1.30 / 1.28: neither code path renders
- type=Localhost without profile: fails closed with operator-visible error
- values.schema.json rejects bad type values

Cross-links: docs/threat-model.md §B1 (elevation row gains AppArmor
mitigation), install/kubernetes/tracecore/README.md §Defense-in-depth.
Closes docs/followups/M5b.md appArmor item.

Signed-off-by: Tri Lam <tree@lumalabs.ai>

Doc-check pre-push gate rejected the section-banner added in the prior
commit (`# --- defense-in-depth: AppArmor RuntimeDefault ... ---`)
because banner comments rot in long-lived files per STYLE.md.
Existing banners in the file are grandfathered; new lines are not.
Substance is unchanged — the rationale prose remains.

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr commented Jun 2, 2026

Independent Adversarial Review: PR #481

Grade: A (→ A+ after one doc fix)

TDD verified. Six-way falsifier sweep covers both code paths (K8s 1.30+ structured field, 1.28/29 legacy annotation), both toggle states (enabled/disabled), failure-closed path (Localhost without localhostProfile), and custom profile path. semverCompare(">=1.30.0-0", version) correctly gates pre-release and GA 1.30 builds. Schema enum guard on type field. Threat-model STRIDE row added at §B1 with defense-in-depth rationale. Production preset hardened by default; CI asserts structured field at embedded helm kubeVersion (≥1.30).

Findings

CI documentation bug: .github/workflows/chart.yml line ~270 says "Five mutation checks bound the contract:" but the code implements 6 numbered test cases (items 1–6). The 6th test (Localhost with custom profile) is correct; the comment header is just stale. Fix required before merge: change "Five" → "Six".

Optional simplification: values-production.yaml repeats the feature rationale from values.yaml (~18 and ~13 lines respectively). Since doc-check already rejected section banners per STYLE.md, consider trimming the values-production.yaml rationale to 5 lines (just the M5b reference + threat-model anchor). Not blocking.

Cross-links

✓ M5b.md appArmor checkbox flipped to [x]
✓ threat-model.md §B1 Elevation row names AppArmor RuntimeDefault as defense-in-depth above restricted-PSS
✓ README.md §Defense-in-depth explains version-gating (fails-open on legacy annotation path, fails-closed on structured field path)
✓ Production preset defaults enabled=true with sibling-style opt-out pattern (matches tls, networkPolicy, podDisruptionBudget toggles)

Logic

  • semverCompare(">=1.30.0-0") correctly covers 1.30.0, 1.30.0-beta, 1.30.0+kind
  • Annotation (K8s 1.28/29) and structured field (K8s 1.30+) paths mutually exclusive
  • Type field defaults: template uses | default "RuntimeDefault" + values.yaml explicit
  • Fail-closed: Localhost without localhostProfile → helm fails template with operator-visible error
  • Container name hardcoded to tracecore (matches pod spec)
  • Annotation value lowercase runtime/default (per spec), struct type CamelCase RuntimeDefault (per API)
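The fail-closed rule for `type: Localhost` can be sketched as a pre-flight check an operator might run before `helm install`. This is an illustration only: `validate_apparmor_values` is a hypothetical helper, the error wording is invented, and the accepted set (`RuntimeDefault`, `Localhost`, `Unconfined`) is taken from the Kubernetes API, which may be wider than the enum in the chart's `values.schema.json`.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check mirroring the fail-closed rule described above.
# $1 = profile type (defaults to RuntimeDefault), $2 = localhostProfile (may be empty).
validate_apparmor_values() {
  local profile_type="${1:-RuntimeDefault}" localhost_profile="${2:-}"
  case "$profile_type" in
    RuntimeDefault|Unconfined)
      return 0 ;;
    Localhost)
      # A Localhost profile is meaningless without the name of the preloaded profile.
      if [ -z "$localhost_profile" ]; then
        echo "error: type=Localhost requires localhostProfile to be set" >&2
        return 1
      fi
      return 0 ;;
    *)
      echo "error: unsupported appArmorProfile type: $profile_type" >&2
      return 1 ;;
  esac
}
```

For example, `validate_apparmor_values Localhost` returns non-zero, while `validate_apparmor_values Localhost my-profile` succeeds (`my-profile` is a placeholder name).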

After the CI comment fix: approve for auto-merge.

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr commented Jun 2, 2026

Nit fix: 'Five' → 'Six' mutation checks. Re-requesting review for A+.

trilamsr enabled auto-merge (squash) June 2, 2026 04:49
trilamsr merged commit 7c23bcd into main Jun 2, 2026
17 of 25 checks passed
trilamsr deleted the feat/m5b-chart-apparmor-profile branch June 2, 2026 04:57
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

PR #481 shipped `securityHardening.appArmorProfile.enabled: true` as the
default in `install/kubernetes/tracecore/values.yaml`. Kubelet rejects
pod-create when `pod.securityContext.appArmorProfile` references a
profile the host cannot resolve, so the chart no longer installs on
AppArmor-less nodes — including the ubuntu-latest GitHub Actions runner
image (AppArmor dropped post-2024) and RHEL/SELinux production hosts.
install-bench regressed; PRs #491, #484, #479, #431 are blocked behind
this.

This PR implements option (a) from #492: flip the default to opt-in.
`values-production.yaml` keeps `enabled: true` since AppArmor-equipped
Linux clusters (the production target) ship `RuntimeDefault` via
containerd / CRI-O.

## Root cause

Default-on AppArmor in `values.yaml` violated the chart contract that
the default render installs on a vanilla cluster. The defense-in-depth
posture is correct for production-preset users; it was wrong as the
unconditional default. PR #481 didn't add a CI gate to assert "default
render installs on a host without AppArmor", so the regression escaped
review.
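
One way to probe for the missing host capability is to read the AppArmor kernel module's `enabled` parameter on the node. The sketch below is an assumption about how such a gate could look, not something this PR adds: `host_has_apparmor` is a hypothetical helper, and its optional path argument exists only so the check can be exercised against a fixture file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the AppArmor kernel module reports "Y".
# The default path is the standard sysfs location; pass another file to test.
host_has_apparmor() {
  local flag_file="${1:-/sys/module/apparmor/parameters/enabled}"
  [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "Y" ]
}

if host_has_apparmor; then
  echo "AppArmor available: RuntimeDefault can be enforced"
else
  echo "AppArmor unavailable: leave securityHardening.appArmorProfile.enabled=false"
fi
```

The probe only describes the machine it runs on, so in CI it would have to run on the runner that hosts the kind nodes.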

## Changes

- `install/kubernetes/tracecore/values.yaml`:
`securityHardening.appArmorProfile.enabled: true` -> `false`; in-line
guidance reflects opt-in posture and names the failing-host classes (CI
runners, RHEL/SELinux).
- `install/kubernetes/tracecore/values-production.yaml`: unchanged —
production preset still hardens with `enabled: true`.
- `install/kubernetes/tracecore/README.md`: defaults table +
Defense-in-depth section explain the opt-in posture, point operators at
`values-production.yaml` for the prior behavior, and link #492.
- `.github/workflows/chart.yml`: AppArmor mutation tests reshuffled from
6 to 8 cases. T1/T2 now assert default render emits **no** AppArmor
field or annotation on K8s 1.30 + 1.28 (regression-prevent for #492).
T3/T4 cover the opt-in path (`--set enabled=true`) and pin pre-#492
production-preset behavior. T7/T8 explicitly pass `--set enabled=true`
so the Localhost-profile contract still fires under the new default.
Production-preset assertion (`appArmorProfile.type=RuntimeDefault` from
`values-production.yaml`) is untouched.

## Backward compatibility

**Behavior change for default-values users.** Operators who installed
via `helm install ... install/kubernetes/tracecore` (no production
preset) and depended on the AppArmor hardening that #481 added will see
it disappear on next upgrade. Two ways to keep the prior behavior:

```bash
# Option 1 — adopt the production preset (recommended).
helm upgrade demo install/kubernetes/tracecore \
  --values install/kubernetes/tracecore/values-production.yaml

# Option 2 — keep your current values, just flip the flag.
helm upgrade demo install/kubernetes/tracecore \
  --set securityHardening.appArmorProfile.enabled=true
```

Operators who relied on the chart's documented default (#481 was three
days old; opt-in is the chart-hygiene norm for defense-in-depth knobs)
get a quieter install on AppArmor-less hosts.

## Test plan

- [x] `helm lint install/kubernetes/tracecore` — 0 warnings.
- [x] `helm template ... --kube-version 1.30.0 --show-only
templates/daemonset.yaml | grep -i apparmor` — empty (default render has
no AppArmor).
- [x] Same with `--kube-version 1.28.0` — empty.
- [x] `helm template ... --values values-production.yaml --kube-version
1.30.0` — renders `appArmorProfile.type: RuntimeDefault` (production
preset unchanged).
- [x] `helm template ... --set
securityHardening.appArmorProfile.enabled=true --kube-version 1.30.0` —
renders structured field (opt-in works).
- [x] All 8 mutation tests in `.github/workflows/chart.yml` AppArmor
step run locally and pass.
- [x] conftest: 52/52 default render, 91/91 production render.
- [x] actionlint: 0 issues on `chart.yml`.
- [x] Pre-commit (golangci-lint, vet, attribute-namespace-check,
test-flake-audit) — all green.
- [ ] CI: chart workflow turns green on this PR.
- [ ] CI: install-bench turns green on this PR (and unblocks #491 / #484
/ #479 / #431 once merged).

## Refs

Closes #492 (refs #481).

```release-notes
**Breaking (default-values users only).** `securityHardening.appArmorProfile.enabled` now defaults to `false` in `values.yaml` so the chart installs on AppArmor-less nodes (CI runners, RHEL/SELinux). The `values-production.yaml` preset still ships `enabled: true` — production Linux clusters that package the `RuntimeDefault` profile (every distro with containerd / CRI-O) keep the hardening when they layer that preset. Operators upgrading default-values installs who want the prior behavior can either adopt `values-production.yaml` or set `--set securityHardening.appArmorProfile.enabled=true`. Fixes the install-bench regression introduced in #481.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…) (#496)

## Summary

The `policy-matrix.yml` workflow has been failing on every chart-touching PR
(blocking #476, #481, #493) since #475 landed. The chart's production
preset (`values-production.yaml`) flips `serviceMonitor.enabled=true`,
which renders a `monitoring.coreos.com/v1 ServiceMonitor` resource. Kind
clusters don't ship the prometheus-operator CRDs, so
`helm install --dry-run=server` exits 1 with:

```
no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first
```

## Root cause

Kind ships only the core Kubernetes API set. `monitoring.coreos.com/v1`
is supplied by prometheus-operator, which the policy-matrix kind cluster
never installs. The chart's `templates/servicemonitor.yaml` is gated by
`.Values.serviceMonitor.enabled` — default `false` (chart stays
first-install-compatible on bare clusters), but the production preset
enables it (kube-prometheus-stack convention). The policy-matrix gate
exercises both default and production presets across PSA / Kyverno /
Gatekeeper, so the production rows hit the missing CRD on every run.

## Fix

Issue #494 recommended option (a) — install the missing CRD prereq.
This PR adds a single workflow step after kind cluster spin-up but
before the smoke script:

```yaml
- name: Install prometheus-operator ServiceMonitor CRD (issue #494)
  run: |
    kubectl apply -f \
      "https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.91.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml"
    kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s
```

Design choices:

- **Slim CRD (ServiceMonitor only)** vs full prometheus-operator bundle.
  The chart's production preset references no other
  `monitoring.coreos.com` kinds. Slim install (~700 lines of YAML)
  avoids pulling Prometheus, Alertmanager, ThanosRuler, PodMonitor,
  Probe, PrometheusRule we don't exercise.
- **Applied to every matrix row** (not just production). A future flip
  of the default `serviceMonitor.enabled` toggle cannot silently
  re-break this gate.
- **Pinned to `v0.91.0`** (latest stable, published 2026-05-05). Matches
  the existing `KYVERNO_POLICIES_REF` / `GATEKEEPER_VERSION` pin
  convention in `scripts/policy-matrix-smoke.sh`. Bumping is a reviewed
  code change; the pin never tracks `main`.
- **`kubectl wait --for=condition=established`** before the helm dry-run
  so the apiserver has registered the CRD when the chart template
  reaches the admission chain (avoids a race where the dry-run hits
  before discovery refreshes).

## Gatekeeper CRD timing

Re-audited: `install_gatekeeper()` in `scripts/policy-matrix-smoke.sh`
already polls `kubectl get crd ...constraints.gatekeeper.sh` (lines
143-149) and the constraint `byPod[*].enforced` field (lines 270-276)
before the smoke step exits. The `kubectl get constraints -A || true` in
the failure-collection step is diagnostic only and already tolerates
absent CRDs. No timing fix needed there.
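
The polling pattern referred to above can be sketched generically. This is not the code in `scripts/policy-matrix-smoke.sh`; `retry_until` is a hypothetical helper that reruns a command until it succeeds or the attempt budget is spent.

```bash
#!/usr/bin/env bash
# Hypothetical polling helper (illustration only).
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt remains.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Intended use (requires a cluster, so it is not run here):
#   retry_until 30 2 kubectl get crd servicemonitors.monitoring.coreos.com
```

Polling on `kubectl get` rather than on a status condition keeps the loop tolerant of objects that do not exist yet.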

## Why not install-bench / chart.yml

- `install-bench.yml` uses `bench/install/tracecore-values.yaml` which
  doesn't enable serviceMonitor — same failure shape doesn't apply.
- `chart.yml`'s `install` and `upgrade` jobs install with default values
  (`serviceMonitor.enabled=false`); the `render` job's production-preset
  check is `helm template` only (no cluster), so no API discovery runs.

## Test plan

- [x] `actionlint .github/workflows/policy-matrix.yml` — exit 0
- [x] `actionlint` across all `.github/workflows/` — exit 0
- [x] Pre-push hook suite passed locally (golangci-lint, vet, mod verify,
  attribute-namespace-check, zizmor, doc-check, alert-check,
  chart-appversion-check, rfc-status-check, slo-rules-check,
  deprecation-check, no-autoupdate-check, test-flake-audit)
- [ ] policy-matrix workflow runs green on this PR — all 6 matrix rows
  (psa × default, psa × production, kyverno × default, kyverno ×
  production, gatekeeper × default, gatekeeper × production) plus all 3
  mutation rows.

Closes #494.

```release-notes
ci: install prometheus-operator ServiceMonitor CRD in policy-matrix kind cluster so chart-touching PRs no longer fail on the production preset's ServiceMonitor render
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

Removes `.github/workflows/policy-matrix.yml`. Engine-specific admission
validation (PSA-restricted × Kyverno × Gatekeeper × default+production)
delivered negative ROI at rc1.

## Root cause

4 PRs were blocked by, or spent chasing, this workflow's flakes (#475
introduction, #481, #498, #501). It caught zero real regressions, only
its own infra bugs:
- ServiceMonitor CRD bootstrap race (#494)
- AppArmor host-capability mismatch (#481 / #493)
- kubectl wait .status.conditions nil race (#500 / #501)

## Coverage retained (without policy-matrix)

- `conftest` — offline PSS-baseline + restricted validation.
- `helm lint` — chart structural validation.
- `kubeconform` — K8s API conformance.
- `kubectl apply --dry-run=server` (chart.yml install/upgrade jobs) —
API-level breakage on generic kind cluster.

## What stays in tree

- `scripts/policy-matrix-smoke.sh` + Gatekeeper/Kyverno bundle refs —
cheap reactivation when GA triggers fire.
- `install/kubernetes/tracecore/policies/conftest/**` — offline policy
bundle (still active).

## Re-enable triggers (tracked in #502)

- GA criterion #1 (third-party audit) requests engine-specific compat
validation.
- First operator running under Kyverno/Gatekeeper reports admission rot.
- CRD-bootstrap pattern stabilises across other workflows.

## Test plan

- [x] `make doc-check` exit 0 (post comment-edit in kind-cluster-setup
action.yml).
- [x] No remaining policy-matrix.yml references in repo (verified by
grep).
- [x] Pre-commit hooks green (lint/vet/mod-verify/attribute-namespace).
- [x] README + install-bench stale refs scrubbed (follow-up commit).

```release-notes
ci: defer engine-specific policy-matrix workflow (PSA × Kyverno × Gatekeeper admission validation) to GA. Coverage retained via conftest + helm lint + kubeconform + kubectl apply --dry-run=server. Re-enable tracked in #502.
```

Refs #502 #475 #494 #500.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>