diff --git a/install/kubernetes/tracecore/README.md b/install/kubernetes/tracecore/README.md index 694e75e6..758efa4c 100644 --- a/install/kubernetes/tracecore/README.md +++ b/install/kubernetes/tracecore/README.md @@ -1,699 +1,298 @@ # tracecore Helm chart Minimal-privilege DaemonSet for the [tracecore](https://github.com/tracecoreai/tracecore) -OpenTelemetry collector. Renders a `restricted`-class Pod Security Standard -pod spec by default; per-receiver toggles let an operator wire upstream -OCB-bundled OTel receivers (hostmetrics, journaldreceiver, -filelogreceiver, k8sobjectsreceiver, prometheusreceiver) and the -bundled in-tree moat receivers (nccl_fr, pyspy) without changing the -template. - -| Chart attribute | Value | -| --- | --- | -| `apiVersion` | v2 | -| `version` (chart) | 0.1.0 | -| `appVersion` (binary) | tracked to the tracecore release the chart was tested against | -| `kubeVersion` | `>=1.28.0-0` | +OpenTelemetry collector. Renders a `restricted`-class PSS pod spec by +default; per-receiver toggles wire OCB-bundled OTel receivers (hostmetrics, +journald, filelog, k8sobjects, prometheus) and in-tree moat receivers +(nccl_fr, pyspy) without changing the template. Chart `apiVersion: v2`, +`version: 0.1.0`, `appVersion` tracks the tracecore release, +`kubeVersion: >=1.28.0-0`. + +Canonical docs: knobs in [`values.yaml`](./values.yaml); hardened overlay +[`values-production.yaml`](./values-production.yaml); incident runbook +[`docs/RUNBOOK-tracecore.md`](../../../docs/RUNBOOK-tracecore.md); SLOs ++ alerts [`docs/SLOs.md`](../../../docs/SLOs.md); backend recipes +[`docs/integrations/`](../../../docs/integrations/); cosign / SLSA +verification [`docs/reproducibility.md`](../../../docs/reproducibility.md); +auto-update contract +[RFC-0008](../../../docs/rfcs/0008-auto-update-boundary.md). 
## Install -Pull the chart as an OCI artifact from GitHub Container Registry (the -production path; works on air-gapped clusters once the registry is -mirrored): +OCI (production; air-gap-friendly once the registry is mirrored): ```bash helm install tracecore oci://ghcr.io/tracecoreai/charts/tracecore \ - --version 0.2.0 \ - --namespace tracecore-system --create-namespace + --version 0.2.0 --namespace tracecore-system --create-namespace ``` `helm pull oci://ghcr.io/tracecoreai/charts/tracecore --version 0.2.0` -fetches the same `.tgz` for offline review or air-gap mirroring; the -chart is cosign-signed keyless under the same workflow identity as the -binary archives and container image (see `docs/reproducibility.md` for -the `cosign verify` invocation). +fetches the same `.tgz` for offline review. cosign-signed keyless under +the same identity as the binary + image; verify per +[`docs/reproducibility.md`](../../../docs/reproducibility.md). -Or add the chart from a local checkout (development / unreleased SHAs): +Local checkout (dev): +`helm install tracecore install/kubernetes/tracecore --namespace tracecore-system --create-namespace`. +Render-only dry-run: +`helm template tracecore install/kubernetes/tracecore -n tracecore-system | kubectl apply --dry-run=server -f -`. -```bash -helm install tracecore install/kubernetes/tracecore \ - --namespace tracecore-system --create-namespace -``` - -Or render and apply manually for a dry-run review: - -```bash -helm template tracecore install/kubernetes/tracecore \ - --namespace tracecore-system \ - | kubectl apply --dry-run=server -f - -``` - -The default values enable the upstream OCB-bundled `hostmetricsreceiver` -(loadscraper @ 1s) paired with the upstream `debug` exporter; the -DaemonSet boots cleanly on a no-GPU cluster and writes load-average -metrics to pod stdout, visible via `kubectl logs`. 
To enable additional -receivers (`pyspy`, `nccl_fr`) or swap exporters (`otlphttp`, vendor -backends via the free-form `config:` block), see `values.yaml` and the -deviations table in "Pod Security Standard compliance" below. Swap -`debug` for `otlphttp` (also OCB-bundled) before treating the DaemonSet -as a steady-state production deployment — see the worked overlay below. -Operator-side migration recipes for receivers retired by RFC-0013 PR-K -ship under [`docs/integrations/`](../../../docs/integrations/). +Defaults: `hostmetricsreceiver` + `debug` exporter; the DaemonSet boots +on no-GPU clusters. Swap `debug` for `otlphttp` before production — see +[`docs/integrations/otel-backend.md`](../../../docs/integrations/otel-backend.md). ## Upgrade -The chart follows SemVer. Backwards-incompatible values changes carry a -MAJOR bump and a `BREAKING CHANGES.md` entry under the chart directory. -Patch and minor upgrades: +SemVer. Backwards-incompatible values changes carry a MAJOR bump plus a +`BREAKING CHANGES.md` entry. Patch/minor: ```bash helm upgrade tracecore install/kubernetes/tracecore \ --namespace tracecore-system --reuse-values ``` -To inspect the rendered config diff before applying: +Diff first via [helm-diff](https://github.com/databus23/helm-diff): +`helm diff upgrade tracecore install/kubernetes/tracecore -n tracecore-system`. -```bash -helm diff upgrade tracecore install/kubernetes/tracecore \ - --namespace tracecore-system -``` - -(requires the [helm-diff plugin](https://github.com/databus23/helm-diff).) - -### Upgrade posture - -The tracecore binary contains no in-binary self-update mechanism, no -background fetcher, and no remote update channel. Upgrades are -operator-pulled: pick a tag, `helm upgrade` (or wire Flux / Argo CD / -RenovateBot / image-updater against the published chart and image -tags). 
For manual upgrade without external automation, the -`helm upgrade` command at the top of this section is the supported -path; for unattended updates, wire one of the delivery systems above. - -On failed upgrade, roll back via `helm rollback tracecore --namespace tracecore-system`; -confirm the rollback landed via pod status rather than a probe call. -`helm rollback` exits 0 on revision-revert even when the new -revision's pod is still failing, so post-rollback status is the -load-bearing signal: +On failed upgrade: `helm rollback tracecore -n tracecore-system`, then +confirm via pod status (rollback exits 0 even when the new revision is +still failing): ```bash kubectl -n tracecore-system rollout status daemonset/tracecore --timeout=5m kubectl -n tracecore-system get pods -l app.kubernetes.io/name=tracecore -kubectl -n tracecore-system logs -l app.kubernetes.io/name=tracecore --tail=100 ``` -The OCB-assembled binary uses the upstream +Kubelet probes target upstream [`healthcheckextension`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/healthcheckextension) -for kubelet liveness/readiness probes. The extension serves both -probes on a single path (chart default `/` on port `:13133` — the -extension's defaults), so there is no distinct separate-path readiness -endpoint to curl: the kubelet's readiness probe state IS the readiness -signal, and `kubectl rollout status` is the right surface to gate -human-driven rollback on. For port-forwarded ad-hoc checks during -incident response: +at `:13133/` (both probes, single path). Ad-hoc: run +`kubectl -n tracecore-system port-forward daemonset/tracecore 13133:13133` +in one shell and `curl localhost:13133/` in another (`port-forward` +runs in the foreground until interrupted). +In Flux / Argo CD, gate promotion on +`status.numberReady == status.desiredNumberScheduled`. 
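
For a hand-rolled pipeline without Flux or Argo CD, the `status.numberReady == status.desiredNumberScheduled` gate can be approximated in shell. This is a minimal sketch, not something the chart ships: `daemonset_ready` and `tracecore_rollout_ok` are hypothetical helper names, the release is assumed to use the default `tracecore` name in `tracecore-system`, and the extra `desired > 0` guard is an assumption of this sketch (it refuses to promote when no node is eligible), not part of the invariant above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the chart): succeeds only when every
# scheduled DaemonSet pod is Ready. The two status counters are passed as
# arguments so the comparison can be exercised without a cluster.
daemonset_ready() {
  local ready="${1:-}" desired="${2:-}"
  # Reject missing or non-numeric counters instead of guessing.
  [[ "$ready" =~ ^[0-9]+$ && "$desired" =~ ^[0-9]+$ ]] || return 1
  # Assumption of this sketch: zero scheduled pods counts as "not ready".
  (( desired > 0 && ready == desired ))
}

# Reads both counters from the live DaemonSet. Assumes kubectl is on PATH and
# the release uses the default name and namespace from this README.
tracecore_rollout_ok() {
  local counters
  counters="$(kubectl -n tracecore-system get daemonset tracecore \
    -o jsonpath='{.status.numberReady} {.status.desiredNumberScheduled}')" || return 1
  # shellcheck disable=SC2086  # intentional word splitting into two arguments
  daemonset_ready $counters
}
```

A promotion step could then run `tracecore_rollout_ok || exit 1` after `kubectl rollout status` returns, so a rollout that reports success while pods are still unready does not proceed.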
-```bash -kubectl -n tracecore-system port-forward daemonset/tracecore 13133:13133 -curl -sS http://localhost:13133/ # 200 OK ⇔ ready, 503 ⇔ not-ready -``` - -In a delivery system (Flux / Argo CD), gate promotion on the -DaemonSet's `status.numberReady == status.desiredNumberScheduled` -invariant — the same signal `kubectl rollout status` polls — so a -failed rollout never auto-promotes. - -Container images publish to `ghcr.io/tracecoreai/tracecore:` on -every release tag. Each image carries a keyless cosign signature and a -SLSA v1.0 provenance attestation stored alongside the manifest in the -registry; verify both before deploying per -[`docs/reproducibility.md`](../../../docs/reproducibility.md) steps 8 -and 9. Stable releases (no `-` in the SemVer pre-release field) also -float `:latest`; pre-releases do not. +### Upgrade posture -The full rationale and the contract for what tracecore commits to -(immutable digests, lockstep `appVersion`/binary) lives in -[RFC-0008: auto-update boundary](../../../docs/rfcs/0008-auto-update-boundary.md). +No in-binary self-update, no background fetcher, no remote update +channel. Upgrades are operator-pulled: pick a tag, `helm upgrade`, or +wire Flux / Argo CD / RenovateBot against published chart + image tags. +Images publish to `ghcr.io/tracecoreai/tracecore:` with cosign + +SLSA v1.0 provenance; stable releases float `:latest`. Full contract: +[RFC-0008](../../../docs/rfcs/0008-auto-update-boundary.md). ## Uninstall -```bash -helm uninstall tracecore --namespace tracecore-system -kubectl delete namespace tracecore-system # only if no other workloads live there -``` - -The chart does not create the namespace and will not delete it on -uninstall. ConfigMaps/ServiceAccounts owned by the release are removed -automatically; PersistentVolumeClaims (if any are added via the -`config:` override) are not. +`helm uninstall tracecore -n tracecore-system`. 
Chart does not create +the namespace and will not delete it; PVCs added via `config:` are not +removed automatically. ## Values reference -| Path | Type | Default | Purpose | -| --- | --- | --- | --- | -| `namespace` | string | `tracecore-system` | Target namespace for all chart objects. | -| `image.repository` | string | `ghcr.io/tracecoreai/tracecore` | Container image repository. | -| `image.tag` | string | `""` (falls back to `.Chart.AppVersion`) | Override for local kind-loaded images. Ignored when `image.digest` is non-empty. | -| `image.digest` | string | `""` | Pin the image by `sha256:` content hash; takes precedence over `tag`. Production posture — paste the published digest from the release notes. | -| `image.pullPolicy` | string | `IfNotPresent` | Standard kubelet pull policy. | -| `serviceAccount.create` | bool | `true` | Render a ServiceAccount alongside the DaemonSet. | -| `serviceAccount.automount` | bool | `false` | The collector does not call the API server by default. | -| `podSecurityContext.runAsNonRoot` | bool | `true` | restricted-PSS gate. | -| `podSecurityContext.runAsUser` | int | `65532` | Non-zero UID. | -| `podSecurityContext.seccompProfile.type` | string | `RuntimeDefault` | restricted-PSS gate. | -| `containerSecurityContext.allowPrivilegeEscalation` | bool | `false` | restricted-PSS gate. | -| `containerSecurityContext.readOnlyRootFilesystem` | bool | `true` | tracecore writes only to `/tmp` (emptyDir). | -| `containerSecurityContext.capabilities.drop` | list | `[ALL]` | restricted-PSS gate. | -| `containerSecurityContext.capabilities.add` | list | `[]` | SYS_PTRACE is the only allowed addition; conftest rejects any other. | -| `securityHardening.appArmorProfile.enabled` | bool | `false` (default `values.yaml`); `true` in `values-production.yaml` | Pin AppArmor `RuntimeDefault` on the DaemonSet pod (M5b). 
Opt-in by default so the chart installs on AppArmor-less nodes (CI runners, RHEL/SELinux hosts); the production preset turns it on. Version-gated: K8s 1.30+ renders `pod.securityContext.appArmorProfile`; 1.28 / 1.29 renders the legacy `container.apparmor.security.beta.kubernetes.io/tracecore` annotation. Auto-selected via `.Capabilities.KubeVersion.Version`. See [#492](https://github.com/lumalabs/tracecore/issues/492). | -| `securityHardening.appArmorProfile.type` | string | `RuntimeDefault` | `RuntimeDefault` \| `Unconfined` \| `Localhost`. The latter requires `localhostProfile` (chart fails closed without it). | -| `securityHardening.appArmorProfile.localhostProfile` | string | `""` | Path within the node's AppArmor profile directory; required when `type: Localhost`. | -| `telemetry.enabled` | bool | `true` | Toggle for the chart-rendered self-metrics + healthcheck surface. With `enabled: false` the chart omits both the `service.telemetry.metrics` block and the `healthcheckextension`, and the kubelet probes drop off the rendered DaemonSet. | -| `telemetry.metricsListen` | string | `0.0.0.0:8888` | `service.telemetry.metrics` Prometheus-scrape listener for the collector's own metrics (chart port `telemetry`). | -| `telemetry.healthListen` | string | `0.0.0.0:13133` | `healthcheckextension` listener; kubelet liveness AND readiness probes hit this port (chart port `health`). The extension serves both probes on the single path at `telemetry.healthPath` — there is no separate-path readiness endpoint. | -| `telemetry.healthPath` | string | `/` | Path served by `healthcheckextension`; default `/` matches the extension's default. Override only if you front the collector with a proxy that requires a non-root path. | -| `serviceMonitor.enabled` | bool | `false` | Render a Prometheus Operator `ServiceMonitor`. Default OFF — the CRD comes from kube-prometheus-stack / prometheus-operator and is not on bare clusters. Flip to true on Operator-managed clusters. 
| -| `serviceMonitor.namespace` | string | `""` (falls back to `.Values.namespace`) | Override the namespace the ServiceMonitor lands in (e.g. `monitoring` if the Operator selects from there). | -| `serviceMonitor.labels` | map | `{}` | Extra labels — e.g. `{release: prometheus-stack}` to match `serviceMonitorSelector`. | -| `serviceMonitor.interval` | string | `30s` | Scrape interval. | -| `serviceMonitor.scrapeTimeout` | string | `10s` | Per-scrape timeout. | -| `prometheusScrape.enabled` | bool | `true` | Stamp `prometheus.io/scrape` + `prometheus.io/port` + `prometheus.io/path` annotations on the DaemonSet pods. Picked up by vanilla Prometheus + `kubernetes_sd_configs` `role: pod`. Harmless on Operator clusters. | -| `receivers..enabled` | bool | varies | Toggle per receiver. `hostmetrics` on by default; `pyspy` off. | -| `exporters..enabled` | bool | varies | Toggle per exporter. `debug` on by default; `otlphttp` off. | -| `pipelines.` | map | `metrics: {receivers:[hostmetrics], exporters:[debug]}` | Pipeline wiring. References to disabled components are silently dropped at render time. | -| `config` | map | `{}` | Free-form override deep-merged INTO the rendered tracecore config last. Do NOT place credentials here; ConfigMaps are unencrypted in etcd. | -| `resources.requests` | map | `{cpu: 10m, memory: 32Mi}` | Conservative defaults; tune for receiver load. | -| `resources.limits` | map | `{cpu: 100m, memory: 128Mi}` | Conservative defaults; tune for receiver load. | -| `updateStrategy` | map | RollingUpdate / maxUnavailable=1 | DaemonSet rollout cadence. Bump `maxUnavailable` to a percentage (e.g. `10%`) on fleets >500 nodes to avoid multi-hour rollouts. | -| `minReadySeconds` | int | `10` | Soak window past first readiness-probe pass before the DaemonSet controller treats a pod as Available and rolls forward. Catches the ready→crash flap a single probe pass cannot. 
Per-pod rollout cost grows by this many seconds; on fleets >500 nodes, increase `updateStrategy.rollingUpdate.maxUnavailable` to keep total rollout-time bounded. | -| `terminationGracePeriodSeconds` | int | `30` | SIGTERM→SIGKILL grace window. Production preset raises to 60s so an in-flight `otlphttp` batch flushes before kubelet kills the pod. | -| `podDisruptionBudget.enabled` | bool | `false` | Render a `policy/v1` PodDisruptionBudget protecting the DaemonSet against voluntary disruptions (drain, cluster autoscaler eviction). DaemonSet rolling updates bypass the PDB by k8s invariant. | -| `podDisruptionBudget.minAvailable` | int OR string | `""` | Mutually exclusive with `maxUnavailable`. Integer (`1`) OR percentage string (`"50%"`). When neither is set with `enabled: true`, chart falls back to `minAvailable: 1`. | -| `podDisruptionBudget.maxUnavailable` | int OR string | `""` | Alternative to `minAvailable`. When both set, `minAvailable` wins. | -| `priorityClassName` | string | `""` | PriorityClass for node-pressure eviction survival. Empty falls back to the cluster default; supply a name when tracecore is part of your incident-response surface. | -| `tolerations` | list | `[]` | Drop a `{operator: Exists}` entry to schedule on tainted nodes (control plane, GPU pools). | -| `probes.liveness.{initialDelaySeconds,periodSeconds,failureThreshold}` | map | `{10, 30, 3}` | Liveness probe timing (100s before kubelet restarts on `/healthz` failure). | -| `probes.readiness.{initialDelaySeconds,periodSeconds,failureThreshold}` | map | `{5, 10, 4}` | Readiness probe timing (45s grace window). | -| `networkPolicy.enabled` | bool | `false` | Default-deny ingress/egress NetworkPolicy with allow-list rules for scrape-in + OTLP-out (#301). Off by default for CNI compatibility; enable on Calico / Cilium / kube-router. | -| `networkPolicy.allowedScrapers` | list | `[]` | NetworkPolicyPeer list (namespaceSelector, podSelector, ipBlock). 
Empty resolves to `namespaceSelector: {}` (same-namespace scrapers). | -| `networkPolicy.allowedEgressEndpoints` | list | `[]` | `{cidr, port, protocol, except?}` entries for OTLP-out. Operator declares so the policy is auditable. | -| `networkPolicy.dnsNamespaceSelector` | map | `{kubernetes.io/metadata.name: kube-system}` | DNS resolver namespace label. Override if your DNS lives elsewhere. | -| `networkPolicy.dnsPodSelector` | map | `{k8s-app: kube-dns}` | DNS resolver pod label. Override for non-coredns/kube-dns setups. | -| `networkPolicy.kubeletProbes.enabled` | bool | `true` | Carve an `ipBlock` ingress rule on the `health` port so kubelet liveness/readiness probes survive the default-deny baseline. Probes originate from the node IP (host network), which is NOT selectable via namespaceSelector / podSelector; without this rule, every pod flips NotReady within one `failureThreshold` window (M5b chart opportunistic #1). Disable only when a CNI-specific rule already covers host-network probe traffic (Cilium `fromEntities: [host, remote-node]`, Calico host-endpoint selector). | -| `networkPolicy.kubeletProbes.cidr` | string | `0.0.0.0/0` | Source CIDR for the probe rule. Default permissive because kube-apiserver does not expose a cluster-wide node-CIDR primitive a chart can template against; the rule is L4-scoped to the health port so the surface stays narrow. Tighten to the cluster node CIDR if it is fixed and known. | -| `networkPolicy.kubeletProbes.except` | list | `[]` | CIDRs to exclude from `kubeletProbes.cidr` (NetworkPolicy `ipBlock.except` semantics). | -| `tls.enabled` | bool | `false` | Mount a `kubernetes.io/tls` Secret (typically [cert-manager](../../../docs/integrations/cert-manager-mtls.md)-issued) into the DaemonSet at `tls.mountPath`. Operators wire `tls.cert_file` / `tls.key_file` / `tls.ca_file` (or `client_ca_file`) into the free-form `config:` block referencing the projected file literals; the chart does NOT inject `tls:` clauses (#301). 
| -| `tls.certificateRef` | string | `""` | Name of the `kubernetes.io/tls` Secret in `.Values.namespace`. Required when `tls.enabled` is true; the helm-template render fails closed with a clear error if empty. | -| `tls.mountPath` | string | `/etc/tracecore/tls` | Absolute directory the Secret projects into. Schema-validated `^/`. Path literals across `docs/integrations/` assume the default. | - -The chart's authoritative defaults live in -[`values.yaml`](./values.yaml); the table above is a narrative -companion, not the schema. If the two disagree, `values.yaml` wins — -file a bug against this README. +Authoritative defaults + per-knob commentary live in +[`values.yaml`](./values.yaml). Top-level groups: + +- `image.*` — pinning (set `image.digest` for production). +- `podSecurityContext.*` / `containerSecurityContext.*` / + `securityHardening.appArmorProfile.*` — PSS hardening (see + [Pod Security Standard compliance](#pod-security-standard-compliance)). +- `telemetry.*` — self-metrics + healthcheck listeners. +- `serviceMonitor.*` / `prometheusScrape.enabled` — Prometheus wiring + (pick one; both double-scrape). +- `receivers.*` / `exporters.*` / `pipelines.*` — structured + toggles; disabled-component references are dropped at render time. +- `config` — free-form deep-merge last. Do NOT place credentials; + ConfigMaps are unencrypted in etcd. +- `resources.*`, `updateStrategy`, `minReadySeconds`, + `terminationGracePeriodSeconds`, `podDisruptionBudget.*`, + `priorityClassName`, `tolerations`, `probes.*` — rollout, capacity, + scheduling, availability. +- `networkPolicy.*` — opt-in default-deny with scrape / OTLP / probe + allow-list. +- `tls.*` — operator-supplied mTLS Secret projection; wiring via + `config:`. See + [`docs/integrations/cert-manager-mtls.md`](../../../docs/integrations/cert-manager-mtls.md). + +If `values.yaml` and any commentary disagree, `values.yaml` wins. 
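
The image-pinning precedence described for `image.*` (a non-empty `image.digest` wins over `image.tag`, and an empty tag falls back to the chart `appVersion`) can be illustrated with a small shell sketch. `image_ref` is a hypothetical helper written for this example, not the chart's template logic; the `repository@digest` output form is assumed to match what the chart renders, and the `1.0.0` version and the short digest are placeholders rather than published values.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the documented precedence:
#   digest (if set)  >  tag (if set)  >  chart appVersion.
# Arguments: repository, tag, digest, appVersion.
image_ref() {
  local repo="$1" tag="$2" digest="$3" app_version="$4"
  if [[ -n "$digest" ]]; then
    # Digest pins by content hash; the tag is ignored entirely.
    printf '%s@%s\n' "$repo" "$digest"
  else
    # Empty tag falls back to the chart's appVersion.
    printf '%s:%s\n' "$repo" "${tag:-$app_version}"
  fi
}

repo=ghcr.io/tracecoreai/tracecore

image_ref "$repo" ""    ""                1.0.0   # ghcr.io/tracecoreai/tracecore:1.0.0
image_ref "$repo" "dev" ""                1.0.0   # ghcr.io/tracecoreai/tracecore:dev
image_ref "$repo" "dev" "sha256:0123abcd" 1.0.0   # ghcr.io/tracecoreai/tracecore@sha256:0123abcd
```

To confirm what a given overlay actually resolves to, rendering with `helm template` and inspecting the DaemonSet's `image:` field remains the authoritative check.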
## Production preset -For a hardened, opinionated starting overlay, layer -[`values-production.yaml`](./values-production.yaml) on top of the -default values. The preset is the rc1-binding answer to -[`docs/v1-rc1-cut-criteria.md` §10](../../../docs/v1-rc1-cut-criteria.md) -and turns on: - -- **NetworkPolicy** (default-deny ingress/egress; operator declares - `allowedScrapers` + `allowedEgressEndpoints` in their site overlay). -- **PodDisruptionBudget** (`minAvailable: 1`) so a `kubectl drain` or - cluster-autoscaler eviction cannot take the last collector pod - offline. -- **ServiceMonitor** (kube-prometheus-stack convention; vanilla - annotation-scrape disabled to avoid double-scrape). -- **Resource bounds** sized at p50/p95 for the hostmetrics + - otlphttp pairing (`requests: 50m/128Mi`, `limits: 500m/512Mi`). -- **Hardened probes** — wider liveness/readiness windows so a - transient OTLP-out backend unreachability does not flip the pod - NotReady (and miss a scrape). -- **`terminationGracePeriodSeconds: 60`** — double the upstream OTel - default so an in-flight `otlphttp` batch flushes before SIGKILL. -- **`logs.level: warn`** via the deep-merge `config:` block; the - collector's per-batch debug lines drown an aggregator under - steady-state load. -- **`tolerations: [{operator: Exists}]`** so tracecore lands on - control-plane and tainted GPU pools by default. -- **AppArmor `RuntimeDefault`** (M5b follow-up) — pins the AppArmor - profile via the GA `pod.securityContext.appArmorProfile` field on - K8s 1.30+ and the legacy annotation on 1.28 / 1.29. Hardens the - syscall surface above what restricted-PSS requires. - -The preset assumes the cluster CNI honors NetworkPolicy -(Calico / Cilium / kube-router / canal-flannel — NOT bare Flannel). -Operators on a NetworkPolicy-ignorant CNI MUST disable the policy -explicitly (`--set networkPolicy.enabled=false`) or rendering it -will mislead. 
- -**Image pinning is operator responsibility.** The committed -`values-production.yaml` leaves `image.digest` empty so the chart -renders against a fresh checkout via `image.tag` (= `.Chart.AppVersion`). -Before promoting the install beyond a dev cluster, paste the published -digest from the release notes: +[`values-production.yaml`](./values-production.yaml) — hardened overlay +satisfying +[`docs/v1-rc1-cut-criteria.md` §10](../../../docs/v1-rc1-cut-criteria.md): +NetworkPolicy on, PDB `minAvailable: 1`, ServiceMonitor, p50/p95-sized +resources, wider probes, 60s termination grace, `logs.level: warn`, +`tolerations: [{operator: Exists}]`, AppArmor `RuntimeDefault`. + +Requires a NetworkPolicy-honoring CNI (Calico / Cilium / kube-router / +canal-flannel — NOT bare Flannel); else `--set networkPolicy.enabled=false`. + +**Image pinning is operator responsibility.** The preset leaves +`image.digest` empty; paste the published digest before production: ```yaml # site-overlay.yaml -image: - digest: sha256:0123456789abcdef... # from the v1.0.0-rc1 release notes +image: {digest: sha256:0123456789abcdef...} # from release notes ``` -Apply: - ```bash helm install tracecore install/kubernetes/tracecore \ - --namespace tracecore-system --create-namespace \ + -n tracecore-system --create-namespace \ -f install/kubernetes/tracecore/values-production.yaml \ -f site-overlay.yaml ``` -Or, against the OCI-published chart: - -```bash -helm install tracecore oci://ghcr.io/tracecoreai/charts/tracecore \ - --version 0.2.0 \ - --namespace tracecore-system --create-namespace \ - -f values-production.yaml \ - -f site-overlay.yaml -``` - -The chart's CI (`.github/workflows/chart.yml`) gates that -`helm lint -f values-production.yaml` and `helm template -f -values-production.yaml` both render clean and that the rendered -DaemonSet passes the conftest privilege-escalation policy bundle — -the preset cannot drift unnoticed. 
- -Cross-references: - -- [`docs/v1-rc1-cut-criteria.md` §10](../../../docs/v1-rc1-cut-criteria.md) - for the rubric this preset satisfies. -- [`docs/RELEASE-CHECKLIST.md`](../../../docs/RELEASE-CHECKLIST.md) - "Production-preset Helm values" row. -- The reference architectures under - [`docs/reference-architectures/`](../../../docs/reference-architectures/) - (criterion 9) wrap this preset with site-specific receiver - selection. +CI (`.github/workflows/chart.yml`) gates `helm lint` + `helm template` ++ conftest against the preset — it cannot drift unnoticed. See also +[`docs/RELEASE-CHECKLIST.md`](../../../docs/RELEASE-CHECKLIST.md) and +[`docs/reference-architectures/`](../../../docs/reference-architectures/). ## Common configurations -A few worked examples for typical adopter overlays. Save each as a -file and pass with `-f `; `--reuse-values` preserves anything -not overridden. - -**Scrape NVIDIA DCGM on every node (post-RFC-0013 PR-J recipe; requires the -[dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) DaemonSet reachable):** - -The in-tree `dcgm` receiver was retired by PR-K; the replacement recipe -runs the upstream `prometheusreceiver` against an external dcgm-exporter -DaemonSet. See [`docs/integrations/prometheus-scrape.md`](../../../docs/integrations/prometheus-scrape.md) -for the full config; the chart-side overlay is: - -```yaml -# dcgm-overlay.yaml -config: - receivers: - prometheus/dcgm: - config: - scrape_configs: - - job_name: dcgm - scrape_interval: 15s - static_configs: - - targets: ["dcgm-exporter.gpu-operator:9400"] - service: - pipelines: - metrics: - receivers: [hostmetrics, prometheus/dcgm] - exporters: [debug] -``` - -Apply: `helm upgrade tracecore install/kubernetes/tracecore -n tracecore-system -f dcgm-overlay.yaml` - +Save each overlay as a file; apply with `-f `. `--reuse-values` +preserves unoverridden fields. 
-**Route output to an OTLP backend (structured `exporters.otlphttp` toggle):** +**OTLP backend** (structured toggle): ```yaml -# otlp-overlay.yaml exporters: - debug: - enabled: false - otlphttp: - enabled: true - endpoint: https://collector.example.com:4318 + debug: {enabled: false} + otlphttp: {enabled: true, endpoint: https://collector.example.com:4318} pipelines: - metrics: - receivers: [hostmetrics] - exporters: [otlphttp] + metrics: {receivers: [hostmetrics], exporters: [otlphttp]} ``` -The full otlphttp field reference (headers, compression, timeout, retry_on_failure, sending_queue, tls.*, ...) follows the upstream [`otlphttpexporter`](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlphttpexporter) README. For fields the structured block doesn't expose, use the free-form `config.exporters.otlphttp.*` deep-merge block. +Headers, compression, retry, sending_queue, tls.* via deep-merge under +`config.exporters.otlphttp.*` — see upstream +[`otlphttpexporter`](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlphttpexporter). +Full recipe (including DCGM scrape via external dcgm-exporter, vanilla +Prometheus pod-SD wiring, multi-cluster): +[`docs/integrations/`](../../../docs/integrations/). -**Wire Prometheus scrape so `dashboards/slo-rules.yaml` lights up (issue #296):** - -The SLO rules under `dashboards/slo-rules.yaml` query the canonical -`job="tracecore"` label. The chart ships two complementary scrape paths; -pick one per cluster (running both produces double scrapes against the -same endpoint). - -On Operator-managed clusters (kube-prometheus-stack, prometheus-operator), -turn on the `ServiceMonitor`: +**Prometheus scrape** so `dashboards/slo-rules.yaml` lights up +(`job="tracecore"`). 
On operator-managed clusters +(kube-prometheus-stack): ```yaml -# scrape-operator-overlay.yaml -serviceMonitor: - enabled: true - # If kube-prometheus-stack was installed with - # serviceMonitorSelector.matchLabels.release=prometheus-stack: - labels: - release: prometheus-stack -prometheusScrape: - enabled: false # avoid double-scrape +serviceMonitor: {enabled: true, labels: {release: prometheus-stack}} +prometheusScrape: {enabled: false} # avoid double-scrape ``` -The ServiceMonitor's `jobLabel: app.kubernetes.io/name` + the chart's -`app.kubernetes.io/name: tracecore` selector label renders the scrape job -as `job="tracecore"` — exactly what `slo-rules.yaml` queries. - -On vanilla Prometheus clusters (no Operator), the chart's default -pod-annotation scrape works out of the box. Wire the standard pod-SD -job in `prometheus.yml`: +Vanilla Prometheus: chart pod annotations are on by default — see +[`docs/integrations/prometheus-scrape.md`](../../../docs/integrations/prometheus-scrape.md). +Verify rules: `promtool check rules install/kubernetes/tracecore/dashboards/slo-rules.yaml`. -```yaml -scrape_configs: - - job_name: tracecore - kubernetes_sd_configs: - - role: pod - relabel_configs: - - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] - action: keep - regex: "true" - - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name] - action: keep - regex: tracecore - - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port] - action: replace - target_label: __address__ - regex: (.+) - replacement: $1 - - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] - target_label: __metrics_path__ - regex: (.+) -``` - -Verify with `promtool check rules install/kubernetes/tracecore/dashboards/slo-rules.yaml` -and `kubectl -n tracecore-system port-forward svc/tracecore 8888:8888 && curl localhost:8888/metrics`. 
- -**Default-deny NetworkPolicy with allow-list for scrape + OTLP-out + kubelet probes (issues #301, M5b chart opportunistic #1):** - -The chart ships an opt-in `NetworkPolicy` template that isolates the -collector pods at L3/L4. Off by default for CNI compatibility (Flannel -without canal ignores NetworkPolicy and rendering one would mislead). -Enable on Calico / Cilium / kube-router clusters. See -[`docs/threat-model.md`](../../../docs/threat-model.md) §6.G for the -audit-RFP scope this template satisfies (network-surface inventory + -default-deny verification). - -The policy carves three rule families back open against the -`policyTypes: [Ingress, Egress]` baseline: -- **Scrape-in** — Prometheus / ServiceMonitor traffic to the - `telemetry` + `health` ports, restricted by - `networkPolicy.allowedScrapers` (default: same-namespace). -- **Kubelet probes** — liveness + readiness probes from the node IP - to the `health` port. Probes originate from host-network, NOT from - a selectable namespace/pod, so the rule uses `ipBlock` - (default `0.0.0.0/0`, L4-scoped to the health port). -- **Egress** — DNS to the cluster resolver + OTLP-out to - `networkPolicy.allowedEgressEndpoints`. +**Default-deny NetworkPolicy** (scrape-in / OTLP-out / kubelet-probe +allow-list): ```yaml -# networkpolicy-overlay.yaml networkPolicy: enabled: true - # Scrape-in: restrict to the Prometheus namespace. Without this - # entry the policy resolves to `namespaceSelector: {}` (same-namespace - # scrapers only); kube-prometheus installs typically live in - # `monitoring`, so an explicit selector is the production posture. allowedScrapers: - namespaceSelector: - matchLabels: - kubernetes.io/metadata.name: monitoring - # OTLP-out: the operator-declared exporter destination. CIDR + port - # pair so the rule works against external endpoints (a managed-OTLP - # backend, a sibling cluster's aggregation listener). 
The chart does - # NOT inspect the configured exporter to derive this — declaring it - # explicitly keeps the policy auditable. + matchLabels: {kubernetes.io/metadata.name: monitoring} allowedEgressEndpoints: - - cidr: 10.0.0.0/8 - port: 4318 - protocol: TCP + - {cidr: 10.0.0.0/8, port: 4318, protocol: TCP} ``` -Apply: `helm upgrade tracecore install/kubernetes/tracecore -n tracecore-system -f networkpolicy-overlay.yaml` - -Pair with the [cert-manager mTLS recipe](../../../docs/integrations/cert-manager-mtls.md) -for cryptographic identity on the egress path; the NetworkPolicy keeps -the L3 surface narrow even when a downstream backend's auth is -compromised. +Kubelet probes originate from the node IP (host-network), NOT +selectable via namespace/pod selectors; the chart's +`kubeletProbes.cidr` default carves an L4-scoped `ipBlock` on the +health port. Pair with +[`cert-manager mTLS`](../../../docs/integrations/cert-manager-mtls.md) +for cryptographic identity on egress. Audit scope: +[`docs/threat-model.md`](../../../docs/threat-model.md) §6.G. -Verify on Calico: `calicoctl get networkpolicy -n tracecore-system`. On -Cilium: `cilium policy get`. The render passes `helm lint` + the -chart's conftest gate by construction (no privilege-escalation paths -on a NetworkPolicy object). +**Schedule on tainted nodes:** `tolerations: [{operator: Exists}]`. -**Run on every node including tainted ones (control plane, GPU pools):** - -```yaml -# all-nodes-overlay.yaml -tolerations: - - operator: Exists -``` +More backends (Loki, Tempo, Datadog, Honeycomb, ClickHouse, +multi-cluster, journald/kernel, k8sobjects, filelog): +[`docs/integrations/`](../../../docs/integrations/). ## Troubleshooting -**Pod stuck in `CrashLoopBackOff` after install.** Run `kubectl logs` -on the failing pod; the most common cause on first install is an -unreachable `image.repository`. 
The default tag is the chart's -`appVersion` — pre-release clusters need either an explicit -`--set image.tag=` or a `kind load docker-image` step. - -**`helm install` succeeds but `helm status` reports -`STATUS: deployed` with zero ready pods.** Either no nodes match the -default tolerations (the chart tolerates nothing by default — only -worker nodes are eligible) or the kubelet probe is failing. Inspect -`kubectl describe pod` for taint mismatches and `kubectl logs` for -listener bind errors. Override `tolerations: [{operator: Exists}]` to -schedule on control-plane and GPU-tainted nodes. - -**Receiver shows up in `tracecore validate --explain` but emits no -data.** The receiver is enabled but its hardware/kernel dependency is -unavailable. Check the per-receiver README under -`components/receivers//`; degraded mode is the documented -contract for missing dependencies. - -**`helm lint` reports `[WARNING]`.** Treat WARNING as error — the CI -gate fails on any WARNING. The common cause is a stale `Chart.yaml` -`apiVersion` (must be v2) or a missing `kubeVersion` clause. - -**Conftest rejects the rendered DaemonSet.** The chart's own output -must pass the bundled policy; if it does not, you have changed the -template in a way that violates the minimum-privilege charter. Re-read -[`policies/conftest/tracecore.rego`](./policies/conftest/tracecore.rego) -and the fixture set under `policies/conftest/testdata/` before -patching the template. - -**`OOMKilled` after wiring journald/filelog ingest.** Kernel/journald -sources can buffer large batches under load; the chart's default -`resources.limits.memory: 128Mi` is sized for the hostmetrics + debug -pairing. Receivers that buffer large batches (filelog, journald, -kafkareceiver) push RSS well above that. Bump to `256Mi` or higher -(`--set resources.limits.memory=256Mi`) and monitor RSS with -`kubectl top pod`. 
Use the [`journald-kernel`](../../../docs/integrations/journald-kernel.md) -recipe via the free-form `config:` block for kernel/journald ingest now -that the in-tree `kernelevents` receiver was retired by PR-K. - -**Rollout takes hours on fleets above ~500 nodes.** Default -`updateStrategy.rollingUpdate.maxUnavailable: 1` × per-node readiness -grace serializes the rollout. The per-node grace is - -``` -probes.readiness.initialDelaySeconds (5s) - + probes.readiness.periodSeconds (10s) × probes.readiness.failureThreshold (4) - + minReadySeconds (10s) -= ~55s upper bound under default values -``` - -Override with `--set updateStrategy.rollingUpdate.maxUnavailable=10%` -to parallelize; if a fast rollout matters more than ready→crash flap -detection, also `--set minReadySeconds=0` (accept the soak-window -race in exchange for ~10s/pod off the wall-clock). - -**ImagePullBackOff on first install.** The default image -(`ghcr.io/tracecoreai/tracecore`) is a public registry; air-gapped -clusters must mirror the image to an internal registry and set -`--set image.repository=/tracecore` (+ optional -`imagePullSecrets`). For local evaluation against an unreleased SHA, -build the image with [`ko`](https://ko.build) (the production image -builder per RFC-0013 PR-D; see `.ko.yaml` at the repo root) and load -it into kind. Since RFC-0013 PR-A2 (2026-05-30) the binary is -generated by OCB under `./_build/`, so the ko invocation runs from -inside that directory: -`make build && cd ./_build && KO_CONFIG_PATH=../.ko.yaml -KO_DOCKER_REPO=ko.local ko build --bare --local . --tags dev && -kind load docker-image ko.local/tracecore:dev`; then install with -`--set image.repository=ko.local/tracecore --set image.tag=dev`. -The chart-local reference `Dockerfile` under -`install/kubernetes/tracecore/Dockerfile` remains for kind-CI use -(`.github/workflows/chart.yml` + `install-bench.yml`); it is not the -production build path. 
- -**`helm upgrade --reuse-values` ignores a chart-level default I want.** -`--reuse-values` is intentionally additive: a chart that added a new -field (e.g. `priorityClassName` in chart `0.1.x`) keeps the operator's -old missing-field state. Re-render with explicit overrides or omit -`--reuse-values` to pick up new defaults — `helm diff upgrade` shows -exactly which fields would change. - -**Cluster-wide PSS enforcement.** The chart renders pods that comply -with `restricted`; cluster-level enforcement is the operator's -responsibility. Label the target namespace once at install time: - -```bash -kubectl label namespace tracecore-system \ - pod-security.kubernetes.io/enforce=restricted \ - pod-security.kubernetes.io/audit=restricted \ - pod-security.kubernetes.io/warn=restricted -``` - -**`make helm-install-rolling-report` reports median above 300s.** The -M3 carry-forward rubric (`docs/MILESTONES.md` L209) requires the -`helm install` + DaemonSet `Ready` wall-clock to land at a median ≤5 -min across 10 successful CI runs. `chart.yml`'s `install` job uploads -each run's `helm-install-duration-` artifact; the script -`scripts/helm-install-rolling.sh` (operator entry point: `make -helm-install-rolling-report`) downloads the last 10 via `gh run -download` and computes the median. - -When the median trips the 300s gate: - -1. Run `make helm-install-rolling-report` locally to see per-run - samples. Borderline (~290-310s) often means flake noise; sustained - means real regression. -2. If a single run jumped to 400-500s, `gh run view --log` and - look for image-pull or probe-misconfig stalls in the kind-up step. -3. If every run jumped, suspect a chart template edit. `git bisect` - between the last-green run sha and the first-red run sha against - `install/kubernetes/tracecore/`. - -The single-run ≤300s gate is the hard fail inside the workflow; the -rolling-median view is the carry-forward layer that flips ⧗ → ☑ once -10 successful main-branch runs have artifacts. 
Sibling pattern: PR -#446's `bench-cv-rolling` for per-detector allocs/op CV. +Operational incident playbook: +[`docs/RUNBOOK-tracecore.md`](../../../docs/RUNBOOK-tracecore.md). +Chart-install gotchas only below. + +- **`CrashLoopBackOff` on install.** Usually unreachable + `image.repository`. `--set image.tag=` or `kind load docker-image` + for pre-release. +- **`STATUS: deployed`, zero ready pods.** No nodes match default + tolerations (chart tolerates nothing — only worker nodes) OR the + kubelet probe fails. Inspect `kubectl describe pod` + `kubectl logs`; + override `tolerations: [{operator: Exists}]` for control-plane / + GPU-tainted nodes. +- **`helm lint` `[WARNING]`.** CI treats as error. Usually stale + `Chart.yaml` `apiVersion` or missing `kubeVersion`. +- **Conftest rejects the render.** The chart's own output must pass + the bundled policy — see + [`policies/conftest/tracecore.rego`](./policies/conftest/tracecore.rego). +- **`OOMKilled` after journald/filelog ingest.** Default + `resources.limits.memory: 128Mi` sizes for hostmetrics + debug. + `--set resources.limits.memory=256Mi`+ for buffering receivers; see + [`docs/integrations/journald-kernel.md`](../../../docs/integrations/journald-kernel.md). +- **Rollout takes hours on >500-node fleets.** Default + `maxUnavailable: 1` serializes. `--set updateStrategy.rollingUpdate.maxUnavailable=10%`; + for fastest rollout also `--set minReadySeconds=0` (accept ready→crash + flap soak-window race). +- **`ImagePullBackOff` on first install.** Public registry; air-gapped + clusters must mirror + `--set image.repository=/tracecore`. + Local-SHA evaluation: build with [`ko`](https://ko.build) (RFC-0013 PR-D): + `make build && cd ./_build && KO_CONFIG_PATH=../.ko.yaml KO_DOCKER_REPO=ko.local ko build --bare --local . --tags dev && kind load docker-image ko.local/tracecore:dev` + then `--set image.repository=ko.local/tracecore --set image.tag=dev`. 
+- **`helm upgrade --reuse-values` ignores a new chart default.** + `--reuse-values` is additive: new fields keep the operator's old + missing state. Omit it or override explicitly; `helm diff upgrade` + shows the delta. +- **Cluster-wide PSS enforcement** is operator-side. Label the namespace: + `kubectl label namespace tracecore-system pod-security.kubernetes.io/enforce=restricted pod-security.kubernetes.io/audit=restricted pod-security.kubernetes.io/warn=restricted`. +- **`make helm-install-rolling-report` median > 300s.** M3 rubric + requires `helm install` + `Ready` median ≤5min across 10 CI runs + (`chart.yml` uploads `helm-install-duration-` artifacts; + `scripts/helm-install-rolling.sh` aggregates). Borderline (~290-310s) + is flake noise; sustained = regression, `git bisect` against + `install/kubernetes/tracecore/`. See `docs/MILESTONES.md` L209. ## Pod Security Standard compliance -The chart targets the Kubernetes [`restricted`](https://kubernetes.io/docs/concepts/security/pod-security-standards/) -Pod Security Standard. 
Every restricted-profile assertion is enforced -by the bundled conftest policy and CI gate: - -| Assertion | Where enforced | -| --- | --- | -| `securityContext.runAsNonRoot: true` | values.yaml `podSecurityContext.runAsNonRoot` | -| `securityContext.runAsUser != 0` | values.yaml `podSecurityContext.runAsUser` (default 65532) | -| `seccompProfile.type: RuntimeDefault` | values.yaml `podSecurityContext.seccompProfile.type` | -| `allowPrivilegeEscalation: false` | values.yaml `containerSecurityContext.allowPrivilegeEscalation` | -| `readOnlyRootFilesystem: true` | values.yaml `containerSecurityContext.readOnlyRootFilesystem` + conftest deny | -| `capabilities.drop: [ALL]` | values.yaml `containerSecurityContext.capabilities.drop` | -| `hostPID: false` | DaemonSet template (not values-tunable) + conftest deny | -| `hostIPC: false` | DaemonSet template (not values-tunable) + conftest deny | -| `hostNetwork: false` | DaemonSet template (not values-tunable) + conftest deny | - -### Defense-in-depth above restricted-PSS - -Restricted PSS *permits* an undefined AppArmor profile, so the chart -default values are compliant. The chart offers one step further: -pinning `RuntimeDefault` — the syscall-narrowing profile shipped with -every containerd / CRI-O package — under -`securityHardening.appArmorProfile.enabled`. This narrows the syscall -surface a compromised receiver could reach against the read-only -`/dev/kmsg` + journald hostPath mounts; see -[`docs/threat-model.md`](../../../docs/threat-model.md) §B1 for the -boundary. - -**Default posture is opt-in (`enabled: false`) in `values.yaml`.** -Kubelet rejects pod-create when the `appArmorProfile` field references -a profile the node cannot resolve, and that breaks installs on stock -CI runners (e.g. ubuntu-latest GitHub Actions images post-2024) and -RHEL/SELinux nodes that ship without AppArmor. 
The -`values-production.yaml` preset flips `enabled: true` — that's the -right posture for AppArmor-equipped Linux production clusters (the -common case). Operators who know their nodes carry the profile should -either layer `values-production.yaml` or set -`securityHardening.appArmorProfile.enabled: true` directly. See -[#492](https://github.com/lumalabs/tracecore/issues/492) for the -regression that prompted the opt-in flip. - -When enabled, the chart auto-selects the render form via -`semverCompare` against `.Capabilities.KubeVersion.Version`: - -- **Kubernetes 1.30+** — emits the GA structured field - `pod.securityContext.appArmorProfile: { type: RuntimeDefault }`. - Kubelet rejects pod-create on an unknown profile name (fails closed). -- **Kubernetes 1.28 / 1.29** — emits the legacy pod annotation - `container.apparmor.security.beta.kubernetes.io/tracecore: runtime/default`. - Deprecated in K8s 1.30 but still honored. Fails open - (unknown-profile name is silently dropped) — that's the upstream - semantics, not a chart bug. The 1.30 floor closes the gap. - -Operators do not pick which form renders. Override -`type: Localhost` + `localhostProfile: ` to wire a node-preloaded -custom profile. +Targets Kubernetes +[`restricted`](https://kubernetes.io/docs/concepts/security/pod-security-standards/) +PSS; every assertion is enforced by bundled conftest + CI. The defaults +in `values.yaml` (`runAsNonRoot`, `runAsUser: 65532`, +`seccompProfile: RuntimeDefault`, `allowPrivilegeEscalation: false`, +`readOnlyRootFilesystem: true`, `capabilities.drop: [ALL]`) plus the +non-tunable DaemonSet template (`hostPID/hostIPC/hostNetwork: false`) +cover the restricted-PSS surface. + +Defense-in-depth: `securityHardening.appArmorProfile.enabled` pins +`RuntimeDefault` (off by default, so opt-in; on in `values-production.yaml`). +Chart auto-selects render form via `.Capabilities.KubeVersion.Version` +— GA structured field on K8s 1.30+, legacy annotation on 1.28/1.29.
+Threat-model boundary: +[`docs/threat-model.md`](../../../docs/threat-model.md) §B1. Opt-in +rationale: [#492](https://github.com/lumalabs/tracecore/issues/492). ### Documented deviations -The `restricted` profile permits the empty capability set only. The -chart's deviations from a literal reading of `restricted`: - -1. **`SYS_PTRACE` is allowed in `capabilities.add`.** Some receivers - (e.g. future host-process inspection receivers) need `ptrace` to - read `/proc/` of other processes for failure attribution. The - capability is in the conftest allowlist; any other addition rejects - the build. - -2. **Host-path mounts are required for journald/kmsg ingest.** The - PR-J [`journald-kernel`](../../../docs/integrations/journald-kernel.md) - recipe requires `hostPath` mounts (`/dev/kmsg` read-only, optionally - `/var/log/journal` and `/run/systemd/journal`). The chart does not - render those mounts by default; operators opt in via the `config:` - override and accept the deviation. - -3. **DCGM scrape connects to an external dcgm-exporter.** Per the PR-J - [`prometheus-scrape`](../../../docs/integrations/prometheus-scrape.md) - recipe, GPU metrics come from an external dcgm-exporter DaemonSet - (operator-deployed) scraped by `prometheusreceiver`. The chart does - not run `nv-hostengine` in-process and does not add capabilities - for it. - -Each deviation is bounded by the conftest policy: the policy only -permits SYS_PTRACE, never relaxes hostPID/hostIPC/hostNetwork, and -fails the CI gate on any privileged container. +1. **`SYS_PTRACE` allowed in `capabilities.add`.** Reserved for a + future host-process inspection receiver; conftest allowlists only + this capability. +2. **Host-path mounts for journald/kmsg ingest.** Not rendered by + default; operators opt in via `config:` per + [`docs/integrations/journald-kernel.md`](../../../docs/integrations/journald-kernel.md). +3.
**DCGM via external dcgm-exporter.** Chart does not run + `nv-hostengine` or add capabilities for it — see + [`docs/integrations/prometheus-scrape.md`](../../../docs/integrations/prometheus-scrape.md). + +Each deviation is conftest-bounded: only SYS_PTRACE permitted, +hostPID/hostIPC/hostNetwork stay false, CI fails on any privileged container. ### Live-cluster policy validation (deferred to GA) Engine-specific admission validation (PSA-restricted × Kyverno × -Gatekeeper) is deferred to v1.0-ga. The chart-side `conftest` gate + -`helm lint` + `kubeconform` + `kubectl apply --dry-run=server` (in -[`.github/workflows/chart.yml`](../../../.github/workflows/chart.yml)) -cover structural and API-conformance breakage at rc1. See issue #502 -for re-enable triggers. - -The local smoke harness ships in tree at -[`scripts/policy-matrix-smoke.sh`](../../../scripts/policy-matrix-smoke.sh) -for operators who want to spot-check engine-specific compatibility on -their own cluster: - -```bash -export POLICY_ENGINE=kyverno # or psa | gatekeeper -export VALUES_FILE=install/kubernetes/tracecore/values-production.yaml -bash scripts/policy-matrix-smoke.sh -``` +Gatekeeper) is deferred to v1.0-ga. rc1 coverage: chart-side `conftest` ++ `helm lint` + `kubeconform` + `kubectl apply --dry-run=server` in +[`.github/workflows/chart.yml`](../../../.github/workflows/chart.yml); +re-enable tracked in +[#502](https://github.com/lumalabs/tracecore/issues/502). Local +spot-check: +`POLICY_ENGINE=kyverno VALUES_FILE=install/kubernetes/tracecore/values-production.yaml bash scripts/policy-matrix-smoke.sh` +(`POLICY_ENGINE` ∈ `psa|kyverno|gatekeeper`).
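
To spot-check all three engines in one pass, the single-engine command above can be wrapped in a loop. This is a minimal sketch, not part of the chart: the `run_policy_matrix` helper and the `SMOKE_SCRIPT` override are hypothetical names introduced here, and it assumes `scripts/policy-matrix-smoke.sh` exits non-zero when an engine rejects the render and is invoked from the repo root.

```bash
# Hypothetical wrapper around scripts/policy-matrix-smoke.sh (names below are
# illustrative; only POLICY_ENGINE and VALUES_FILE come from the README).
SMOKE_SCRIPT="${SMOKE_SCRIPT:-scripts/policy-matrix-smoke.sh}"
VALUES_FILE="${VALUES_FILE:-install/kubernetes/tracecore/values-production.yaml}"

run_policy_matrix() {
  local engine failed=""
  for engine in psa kyverno gatekeeper; do
    # Assumption: the smoke script signals rejection via a non-zero exit code.
    if POLICY_ENGINE="$engine" VALUES_FILE="$VALUES_FILE" bash "$SMOKE_SCRIPT"; then
      echo "PASS $engine"
    else
      echo "FAIL $engine"
      failed="$failed $engine"
    fi
  done
  # Succeed only if no engine failed.
  [ -z "$failed" ]
}
```

Call `run_policy_matrix` after sourcing the snippet; it keeps going after a failure so one run reports every engine, and its exit status is non-zero if any engine failed.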