
[feat] M5b — Helm chart + minimal-privilege pod spec#29

Merged
trilamsr merged 6 commits into main from feat/m5b-helm-chart on May 15, 2026

@trilamsr trilamsr commented May 15, 2026

Problem

tracecore has no operator-installable surface today. M5 / M20 install
benchmarks have nothing to install and v0.1.0 (M21) cannot ship until
M5b lands per MILESTONES.md §M5b.

Impact

Operators evaluating tracecore today must hand-roll a DaemonSet from
components/receivers/*/example-daemonset.yaml snippets. This blocks
the critical path to v0.1.0 (M3 → M5b → M6 → M21).

Solution

install/kubernetes/tracecore/ — Helm chart v2:

  • Chart.yaml: apiVersion v2, SemVer version 0.1.0, appVersion held
    in sync with the binary by the CI gate.
  • values.yaml: namespace, per-receiver receivers.<name>.enabled
    toggles (clockreceiver on, dcgm/kernelevents off), free-form
    config: override deep-merged INTO the rendered config last,
    resource/probe/toleration/updateStrategy knobs.
  • DaemonSet template renders a Kubernetes restricted Pod Security
    Standard pod spec: runAsNonRoot, runAsUser: 65532,
    seccompProfile: RuntimeDefault, allowPrivilegeEscalation: false,
    readOnlyRootFilesystem: true, capabilities.drop: [ALL],
    hostPID/hostIPC/hostNetwork pinned false at the template level
    (not values-tunable).
  • policies/conftest/tracecore.rego rejects privileged, hostPID,
    hostIPC, hostNetwork, missing readOnlyRootFilesystem, and any
    capability addition other than SYS_PTRACE. Six fixture-driven bad
    manifests + two good (baseline + SYS_PTRACE) pin policy behavior.
  • README.md with the five rubric-required H2 sections (Install,
    Upgrade, Uninstall, Values reference, Troubleshooting) plus a
    documented PSS deviation table.
  • .github/workflows/chart.yml runs OUTSIDE the 60s make ci budget:
    helm lint (zero WARNINGs gate), helm template + tracecore validate
    on rendered all-off and one-receiver-on configs, yq assertions on
    Chart.yaml + DaemonSet securityContext fields, conftest
    deny-fixture + allow-fixture tests, end-to-end kind-cluster install
    asserting STATUS: deployed and install-to-Ready ≤300s.
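
The conftest policy itself is written in Rego and is not reproduced in this description. As a rough illustration only, the initial denial set listed above can be sketched in Python over a parsed DaemonSet manifest. The function name `deny` and the message strings are hypothetical; the field paths follow the Kubernetes pod spec.

```python
ALLOWED_ADDED_CAPS = {"SYS_PTRACE"}  # the only capability addition the policy tolerates


def deny(manifest: dict) -> list[str]:
    """Return denial messages for a workload manifest (illustrative, not the real Rego)."""
    pod = ((manifest.get("spec") or {}).get("template") or {}).get("spec")
    if pod is None:
        return []  # not a pod-bearing document (e.g. a ConfigMap): nothing to check

    msgs = []
    # Host namespace sharing is rejected outright.
    for field in ("hostPID", "hostIPC", "hostNetwork"):
        if pod.get(field) is True:
            msgs.append(f"{field} must not be true")

    for c in pod.get("containers") or []:
        sc = c.get("securityContext") or {}
        name = c.get("name", "?")
        if sc.get("privileged") is True:
            msgs.append(f"{name}: privileged must not be true")
        # A missing field counts as a violation, not only an explicit false.
        if sc.get("readOnlyRootFilesystem") is not True:
            msgs.append(f"{name}: readOnlyRootFilesystem must be true")
        added = set((sc.get("capabilities") or {}).get("add") or [])
        extra = added - ALLOWED_ADDED_CAPS
        if extra:
            msgs.append(f"{name}: disallowed capabilities {sorted(extra)}")
    return msgs
```

A manifest that drops ALL and adds only SYS_PTRACE yields an empty list; one that sets `hostPID: true` or adds any other capability yields one message per violated rule.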

Test plan

  • chart workflow / render job green (helm lint + helm template
    + tracecore validate + yq + conftest).
  • chart workflow / install (kind) job green (kind 0.25.0,
    kindest/node v1.32.0, install-to-Ready measured + asserted ≤300s).
  • AI-vocab grep gate returns no hits on install/ and
    .github/workflows/chart.yml.
  • Existing CI workflow stays green; make ci budget unchanged.
  • Local helm lint install/kubernetes/tracecore → 0 issues
    (verified pre-push).
  • Local conftest test against policies/conftest/testdata/
    bad/good fixtures behaves as expected (verified pre-push).

Open follow-ups

  • Reproducible image build (M3) is independent of this PR. The chart
    ships a reference Dockerfile for the kind-install CI gate; M3 will
    own the canonical release image.
  • 10-run install-to-Ready median measurement (the §M5b non-functional
    rubric) accumulates as the workflow runs over time on main.

[feat] Helm chart at `install/kubernetes/tracecore/` ships a restricted-PSS DaemonSet with per-receiver toggles, conftest-enforced minimum-privilege policy, and a kind-cluster install CI gate. Operators can `helm install tracecore install/kubernetes/tracecore --namespace tracecore-system --create-namespace`.

trilamsr added 6 commits May 15, 2026 02:53
tracecore lacks an operator-installable surface; without a chart
M5/M20 install benchmarks have nothing to install and v0.1.0 cannot
ship per MILESTONES.md §M5b. Operators evaluating tracecore would
need to hand-roll a DaemonSet from
`components/receivers/*/example-daemonset.yaml` snippets.

Add `install/kubernetes/tracecore/`:

- Chart.yaml apiVersion v2, SemVer version 0.1.0, appVersion
  pinned to the binary release the chart was tested against.
- values.yaml exposes `namespace`, per-receiver
  `receivers.<name>.enabled` toggles, free-form `config:` override,
  resource/probe/tolerations knobs.
- Templates render a Kubernetes `restricted` Pod Security Standard
  DaemonSet: runAsNonRoot, runAsUser 65532, seccompProfile
  RuntimeDefault, allowPrivilegeEscalation false, readOnlyRootFilesystem
  true, capabilities drop [ALL] add []. hostPID/hostIPC/hostNetwork
  are pinned false at the template level, not values-tunable.
- Conftest policy rejects privileged, hostPID, hostIPC, a missing
  readOnlyRootFilesystem, and any capability addition other than
  SYS_PTRACE, with fixture-driven good and bad manifests.
- README has the five required H2 sections (Install, Upgrade,
  Uninstall, Values reference, Troubleshooting) plus a deviations
  table documenting SYS_PTRACE allowance and hostPath mount opt-ins.
- `.github/workflows/chart.yml` runs OUTSIDE the 60s `make ci`
  budget: helm lint, helm template, `tracecore validate` on rendered
  all-off and one-on configs, yq assertions on Chart.yaml +
  securityContext fields, conftest deny/allow fixtures, and an
  end-to-end kind-cluster install that asserts install-to-Ready ≤300s.

Test plan:
- helm lint install/kubernetes/tracecore → 0 issues (verified locally).
- helm template ... | tracecore validate → exit 0 for both all-off
  and one-receiver-on values overlays (verified locally).
- conftest test against six bad fixtures (each denies on its specific
  rule) and two good fixtures + the chart's own render (all pass —
  verified locally with conftest v0.62.0 + OPA 1.15.2).
- yq assertions parse the rendered DaemonSet and pin every
  restricted-PSS field (verified inline in the workflow).
- AI-vocab grep gate returns no hits on install/ and the new workflow.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
The `helm status — STATUS: deployed` step name contained an unquoted
colon, which YAML interprets as a mapping value and rejects at parse
time. actionlint flags it cleanly; quoting the name keeps the colon
literal.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
When the Helm release name contains the chart name, the chart's
fullname template returns the release name unchanged ("tracecore"),
not the "release-chart" composite. The install gate's hard-coded
"tracecore-tracecore" mismatched this, so `kubectl rollout status`
errored out NotFound right after a successful `helm install --wait`.

Resolve the DaemonSet by label selector
(`app.kubernetes.io/instance=tracecore`) instead. Same fix for the
follow-up port-forward step.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
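
The naming behavior described in the commit above matches the `fullname` helper that `helm create` scaffolds; whether this chart's helper is identical to that scaffold is an assumption. A Python sketch of the scaffold's logic, with an illustrative function name and parameters:

```python
def fullname(release: str, chart: str, name_override: str = "", fullname_override: str = "") -> str:
    """Mimic the default `helm create` fullname helper (names capped at 63 characters)."""

    def trunc(s: str) -> str:
        # Equivalent of `trunc 63 | trimSuffix "-"` in the template.
        return s[:63].removesuffix("-")

    if fullname_override:
        return trunc(fullname_override)
    name = name_override or chart
    if name in release:
        # The release name already contains the chart name: use it unchanged.
        return trunc(release)
    return trunc(f"{release}-{name}")


print(fullname("tracecore", "tracecore"))  # tracecore
print(fullname("prod", "tracecore"))       # prod-tracecore
```

Selecting the DaemonSet by the `app.kubernetes.io/instance` label sidesteps both outcomes, which is why the label selector is the more robust choice for the install gate.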
Pass-1 review surfaced gaps a values-override could exploit, plus
supply-chain and operator-UX paper cuts:

- conftest policy now denies pod-level securityContext.runAsNonRoot=false
  and a missing seccompProfile.type (PSS restricted requirements that the
  prior rules did not enforce). Both rules are gated on
  input.spec.template.spec so ConfigMap/ServiceAccount documents are
  exempt.
- Three new bad fixtures (bad-hostnetwork, bad-runasroot,
  bad-missing-seccomp) close gaps in deny-path coverage; every existing fixture
  now carries the pod-level securityContext so each one fails on its
  specific rule, not on the new policy floor.
- Dockerfile pins both base images by sha256 digest. Tags are mutable;
  digests are the supply-chain anchor.
- chart workflow installs yq and conftest via `go install`, which
  resolves through the Go module proxy + checksum database. The prior
  `curl` of the yq release binary had no integrity check.
- New CI gate asserts the rendered DaemonSet probe paths match the
  rendered configmap's telemetry.paths — without it, a template edit
  could wire /healthz to a path tracecore never serves.
- values.yaml: priorityClassName knob (defaults to empty so behavior
  is unchanged for operators not opting in); readiness probe grace
  window widened from 17s to 45s; inline comments on dcgm.endpoint
  and serviceAccount.automount call out the CrashLoop trap and the
  forward-looking SA purpose.
- NOTES.txt validate command now references the chart by its
  repo-rooted path so an operator copy-pasting from `helm install` output
  outside the checkout doesn't hit a "could not find chart" error.

All 8 bad fixtures fail on their specific rule; both good fixtures and
the chart's own render pass; helm lint, actionlint, and the AI-vocab
gate are clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
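
The probe-path gate from the commit above can be pictured as a comparison between two rendered documents. The sketch below is a Python illustration, not the workflow's actual assertions. It assumes the liveness probe maps to `telemetry.paths.healthz`, the readiness probe maps to `telemetry.paths.readyz`, and both are `httpGet` probes on the first container; the function name is hypothetical.

```python
def probe_path_mismatches(daemonset: dict, config: dict) -> list[str]:
    """Compare httpGet probe paths in a DaemonSet against telemetry.paths in the config."""
    container = daemonset["spec"]["template"]["spec"]["containers"][0]
    paths = config["telemetry"]["paths"]
    # Assumed mapping between probe kind and configured endpoint.
    expected = {
        "livenessProbe": paths["healthz"],
        "readinessProbe": paths["readyz"],
    }

    problems = []
    for probe, want in expected.items():
        got = ((container.get(probe) or {}).get("httpGet") or {}).get("path")
        if got != want:
            problems.append(f"{probe} path {got!r} != configured {want!r}")
    return problems
```

An empty result means the probes point at paths the binary is configured to serve; any entry identifies which probe drifted.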
Pass-2 review surfaced four restricted-PSS holes that values overrides
could still slip through, an under-tested rendering surface, and a set
of adopter docs that left common questions unanswered:

- conftest now denies runAsUser=0, runAsGroup=0, hostUsers=true,
  procMount=Unmasked, and allowPrivilegeEscalation=true. Five new bad
  fixtures pin each rule. The hostPID/hostIPC/hostNetwork rules gain
  explicit `input.spec.template.spec` guards so non-pod docs cannot
  cause undefined evaluations to leak through.
- chart workflow adds a render-correctness gate for the two
  value-conditional template paths yq cannot infer from default-render
  output: priorityClassName injection and probe omission when
  telemetry.enabled=false.
- README gains rows for priorityClassName + probes in the values
  table, a "Common configurations" section with three worked overlays
  (DCGM, OTLP backend, all-nodes tolerations), and Troubleshooting
  entries for OOMKilled, large-fleet rollout, ImagePullBackOff,
  --reuse-values caveat, and namespace PSS enforcement.
- NOTES.txt warns when stdoutexporter is rendering against the default
  image — the chart's "validation default" should not silently graduate
  to production. values.yaml warns operators away from putting
  credentials in the `config:` override block (ConfigMap → unencrypted
  etcd).
- .helmignore drops the kind-install Dockerfile from the packaged
  chart (it is CI scaffolding, not a distribution artifact).
- testdata/README.md documents the fixture shape so a new contributor
  adds a fixture that fails on its rule, not on the policy floor.
- CHANGELOG entry extended to enumerate the full conftest denial set
  and the new values knobs.

All 13 bad fixtures fail on their specific rule; good fixtures + chart
render pass; helm lint, actionlint, and the AI-vocab gate are clean;
priorityClassName + telemetry-off render checks verified locally.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
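
The property "each bad fixture fails on its specific rule, not on the policy floor" can be checked mechanically: evaluate the fixture and require that exactly one rule fires, and that it is the expected one. The Python sketch below illustrates that idea for the five rules added in this pass. The rule names, the `RULES` table, and the placement of `runAsUser`/`runAsGroup` on the pod-level securityContext are assumptions for illustration; the real rules live in the Rego policy.

```python
def pod_spec(doc: dict):
    """Return the pod spec of a workload document, or None for non-pod documents."""
    return ((doc.get("spec") or {}).get("template") or {}).get("spec")


def _container_contexts(pod: dict) -> list[dict]:
    return [c.get("securityContext") or {} for c in pod.get("containers") or []]


# Hypothetical rule names mapped to predicates over the pod spec.
RULES = {
    "run-as-user-root": lambda pod: (pod.get("securityContext") or {}).get("runAsUser") == 0,
    "run-as-group-root": lambda pod: (pod.get("securityContext") or {}).get("runAsGroup") == 0,
    "host-users": lambda pod: pod.get("hostUsers") is True,
    "proc-mount-unmasked": lambda pod: any(
        sc.get("procMount") == "Unmasked" for sc in _container_contexts(pod)
    ),
    "allow-privilege-escalation": lambda pod: any(
        sc.get("allowPrivilegeEscalation") is True for sc in _container_contexts(pod)
    ),
}


def fired(doc: dict) -> set[str]:
    """Names of all rules that fire on a document; empty for non-pod documents."""
    pod = pod_spec(doc)
    if pod is None:
        return set()
    return {name for name, rule in RULES.items() if rule(pod)}


def fixture_is_isolated(doc: dict, expected_rule: str) -> bool:
    """True when the fixture trips exactly the rule it was written to exercise."""
    return fired(doc) == {expected_rule}
```

A fixture that sets `hostUsers: true` but also forgets a non-root `runAsUser` would trip two rules and fail the isolation check, which is the regression the fixture README is meant to prevent.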
values.schema.json validates the values.yaml shape at `helm install` /
`helm template` time. Catches typos and type mismatches (e.g.
`receivers.clockreceiver.enabled=notabool`) before the template
renders. The schema:

- pins image.pullPolicy to {Always, IfNotPresent, Never}
- requires podSecurityContext.seccompProfile.type ∈ {RuntimeDefault, Localhost}
- requires containerSecurityContext.allowPrivilegeEscalation=false
  and readOnlyRootFilesystem=true (matches the conftest charter)
- bounds containerSecurityContext.capabilities.add to {SYS_PTRACE}
- pins telemetry.paths.{metrics,healthz,readyz} to absolute paths
- pins runAsUser / runAsGroup minimum to 1 (rejects root)

Chart.yaml gains Artifact Hub annotations (artifacthub.io/license,
prerelease, links, changes, operator) so the chart publishes to
Artifact Hub cleanly when the project opens external distribution.

docs/FOLLOWUPS.md gains a new "M5b chart — opportunistic deferrals"
section enumerating five items review surfaced as out-of-scope or
trigger-based: NetworkPolicy template, image scanning + SBOM on the
chart image, per-receiver resource guidance docs, 10-run
install-to-Ready median aggregate, and appArmorProfile rendering. Each carries
an explicit trigger.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
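
Helm enforces the constraints listed in the commit above through JSON Schema. As a reading aid only, the same checks can be written out as plain Python over a parsed values dict. The function name `check_values` is hypothetical, and placing `runAsUser`/`runAsGroup` under `podSecurityContext` is an assumption, since the commit message does not say where they sit.

```python
def check_values(v: dict) -> list[str]:
    """Return the dotted paths of values that violate the listed constraints."""
    errs = []

    if v["image"]["pullPolicy"] not in {"Always", "IfNotPresent", "Never"}:
        errs.append("image.pullPolicy")

    psc = v["podSecurityContext"]
    if psc["seccompProfile"]["type"] not in {"RuntimeDefault", "Localhost"}:
        errs.append("podSecurityContext.seccompProfile.type")
    for key in ("runAsUser", "runAsGroup"):
        # Minimum of 1 rejects root (UID/GID 0); location of these keys is assumed.
        if key in psc and psc[key] < 1:
            errs.append(f"podSecurityContext.{key}")

    csc = v["containerSecurityContext"]
    if csc["allowPrivilegeEscalation"] is not False:
        errs.append("containerSecurityContext.allowPrivilegeEscalation")
    if csc["readOnlyRootFilesystem"] is not True:
        errs.append("containerSecurityContext.readOnlyRootFilesystem")
    added = set((csc.get("capabilities") or {}).get("add") or [])
    if not added <= {"SYS_PTRACE"}:
        errs.append("containerSecurityContext.capabilities.add")

    for name in ("metrics", "healthz", "readyz"):
        if not v["telemetry"]["paths"][name].startswith("/"):
            errs.append(f"telemetry.paths.{name}")

    return errs
```

The value of doing this in the schema rather than in templates is that a bad overlay is rejected before any manifest is rendered.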
@trilamsr trilamsr merged commit 316d63f into main May 15, 2026
7 checks passed
@trilamsr trilamsr deleted the feat/m5b-helm-chart branch May 15, 2026 11:47
trilamsr added a commit that referenced this pull request May 15, 2026
Bring M5b (#29) into the M3 branch so PR #28 is mergeable. One real
content conflict in CHANGELOG.md — both branches added an entry
under [Unreleased] / Added; resolved by keeping both, M3 listed
before M5b per the M3 → M5b → M6 → M21 minimum-viable v0.1.0
dependency chain documented in MILESTONES.md. docs/FOLLOWUPS.md
auto-merged cleanly (M5b's additions land in disjoint sections from
the M3 release-pipeline-hardening section added on this branch). No
other files conflicted.

`make ci` green (12.5 s wall, well under PRINCIPLES §10 60 s).
doc-check passes; 214 markdown links resolve.

Per MEMORY.md `feedback_no_history_rewrites` the resolution is a
merge commit, not a rebase — origin/main is pushed history and
cannot be rebased over.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Catch the branch up to main: PR #30 (M4b failure-injection harness)
and PR #31 (M11 NCCL FlightRecorder receiver + safe pickle parser)
landed since the prior merge of #29. Both auto-merged cleanly into
disjoint sections of CHANGELOG.md and docs/FOLLOWUPS.md — no manual
conflict resolution required.

`make ci` green (58.6 s wall, within PRINCIPLES §10 60 s; up from
12 s pre-merge because the M4b harness + chaos suite and M11 pickle
parser added a substantial test surface). doc-check: 193 markdown
links resolve. release.yml unchanged across the merge so the
end-to-end attestation run on v0.0.0-m3test-7 still characterises
the post-merge HEAD's release behavior.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
…act Hub annotations (#36)

## Problem

The M5b CHANGELOG entry that landed in #29 understates scope: the
follow-on commit added `values.schema.json` and Artifact Hub
annotations on `Chart.yaml`, but the [Unreleased]/Added bullet was
not re-edited before merge.

## Impact

Adopters scanning the changelog before evaluating tracecore miss two
operator-visible features: schema-validated values overlays and
Artifact Hub publish-readiness.

## Solution

Extend the M5b bullet with one sentence covering both. No other
entries touched; no chart-side changes.

## Test plan

- [x] `make doc-check` clean.
- [ ] `pr-lint` green.
- [ ] `CI verify` unchanged.

```release-notes
[docs] CHANGELOG M5b bullet extended to mention values.schema.json and Artifact Hub annotations that shipped in #29 but were missed in the entry.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
## Problem
v0.1.0 (M21) needs every release artifact to be byte-reproducible from
the same SHA and verifiable end-to-end. M3 owns that pipeline.

## Impact
Closes MILESTONES.md §M3. Unblocks M21 (release tag).

## Solution
- `release.yml` builds `linux/amd64` twice at every `v*` tag into
isolated `mktemp -d` dirs with fresh `GOCACHE` (cold-vs-cold
byte-equality is the assertion a third party reproduces — not
warm-vs-warm); `diffoscope` fails closed on any byte diff. `-trimpath` +
`SOURCE_DATE_EPOCH` honored; embedded `BuildDate` cross-checked against
`strings` of the binary.
- CycloneDX SBOM via `cyclonedx-gomod mod`. Coverage gate intersects
direct `go.mod` requires with `go list -deps ./cmd/tracecore`, so the
SBOM must enumerate every runtime-reachable direct module (test-only
deps like `testify`/`goleak` correctly excluded — they don't ship in the
binary).
- Cosign keyless `sign-blob` + smoke `verify-blob`. Cert identity pinned
to
`^https://github.com/<repo>/\.github/workflows/release\.yml@refs/tags/`
so a sibling workflow on another branch cannot mint a passing bundle. No
long-lived secrets; `id-token: write` is the only elevated permission
outside the `release` job.
- SLSA v1.0 provenance via `actions/attest-build-provenance`. Bundle's
`predicateType == https://slsa.dev/provenance/v1` and
`subject[0].digest.sha256` are cross-checked against the build job's
digest before the artifact is uploaded.
- Every third-party action pinned to a 40-char commit SHA. SBOM-job
checkout pins to `${{ github.sha }}` (not the mutable tag) so a
force-push between build and SBOM cannot diverge the BOM from the
binary.
- `docs/reproducibility.md` is the third-party verification recipe
(rebuild → diffoscope → cosign → `gh attestation verify --bundle
--signer-workflow --owner --predicate-type` → SBOM inspect). Works
offline because step 6 reads the bundle from disk.
`scripts/doc-check.sh` gates its presence and `bash -n` syntax of every
fenced block; mutation-tested.
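
The SBOM coverage gate described in the list above reduces to set arithmetic: every module that is both a direct `go.mod` requirement and reachable from `./cmd/tracecore` must appear in the SBOM. A minimal Python sketch follows, with the three module lists passed in as plain sets. Parsing `go.mod`, `go list -deps`, and the CycloneDX document is omitted, and the function name is hypothetical.

```python
def missing_from_sbom(
    direct_requires: set[str],
    runtime_deps: set[str],
    sbom_modules: set[str],
) -> set[str]:
    """Modules the SBOM must list but does not."""
    # Only direct requirements that actually ship in the binary need coverage;
    # test-only direct requirements drop out of the intersection.
    must_cover = direct_requires & runtime_deps
    return must_cover - sbom_modules


direct = {"example.com/runtime/lib", "github.com/stretchr/testify", "go.uber.org/goleak"}
runtime = {"example.com/runtime/lib", "example.com/indirect/dep"}  # placeholder module paths
print(missing_from_sbom(direct, runtime, {"example.com/runtime/lib"}))  # set()
```

The gate fails when the returned set is non-empty. Test-only modules such as `testify` and `goleak` never enter `must_cover`, so their absence from the SBOM is not an error.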

M3 steps live in `release.yml`, not `make ci`. The base branch absorbed
M5b (#29), M4b (#30), and M11 (#31) during this PR's life; merges with
each are reflected in the branch and `make ci` still runs under the
PRINCIPLES §10 60 s budget on darwin/arm64 (`(unverified — local
measurement, not CI-cited)`).

## Test plan
- [x] `make ci` green on the latest commit; under 60 s — see `verify` on
the PR's CI workflow run.
- [x] doc-check mutation test: inject `if then fi` into a fenced block →
gate fails with block-#; restore → gate passes.
- [x] `release.yml` runs end-to-end on `v0.0.0-m3test-7` (commit
`b6c745f`) with diffoscope / cosign / SLSA / SBOM all green — [run
25914673989](https://github.com/TraceCoreAI/tracecore/actions/runs/25914673989).
`release.yml` is unchanged across the subsequent commits on this branch
(the two merges of `main` plus two doc fixups touched no workflow file),
so the run still characterizes the current HEAD's release behavior.
- [x] Recipe verified locally against `v0.0.0-m3test-7`: `cosign
verify-blob` → `Verified OK`; `gh attestation verify` exit 0 with
`predicateType=https://slsa.dev/provenance/v1` and `buildSignerURI`
pinned to `release.yml@refs/tags/v0.0.0-m3test-7`.
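
The certificate-identity pin is an anchored regular expression over the signing workflow's URI. The Python sketch below substitutes `TraceCoreAI/tracecore` (taken from the run URL above) for `<repo>`; the sample identities in the comments are constructed for illustration, and the function name is hypothetical.

```python
import re

# Pattern from the release workflow description, with <repo> filled in as an assumption.
IDENTITY_RE = re.compile(
    r"^https://github.com/TraceCoreAI/tracecore/\.github/workflows/release\.yml@refs/tags/"
)


def identity_ok(cert_identity: str) -> bool:
    """True only for release.yml running on a tag ref."""
    return IDENTITY_RE.match(cert_identity) is not None


base = "https://github.com/TraceCoreAI/tracecore/.github/workflows/"
print(identity_ok(base + "release.yml@refs/tags/v0.0.0-m3test-7"))  # True
print(identity_ok(base + "release.yml@refs/heads/main"))            # False
print(identity_ok(base + "chart.yml@refs/tags/v0.0.0-m3test-7"))    # False
```

Pinning both the workflow file and the `refs/tags/` prefix is what prevents a different workflow, or the same workflow on a branch, from producing a bundle that verifies.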

## Out of scope (filed as follow-ups)
Above-the-floor hardening (SLSA L3 via the reusable-workflow generator,
`zizmor` / `actionlint`, repo tag-protection on `v*`, nightly
rebuild-and-diff cron, `go mod verify`, build-env sanitization, cosign +
`gh attestation` flag tightening, CycloneDX `mod`→`app`, Rekor
log-index, recipe polish, apt cache) lives in
[`docs/FOLLOWUPS.md`](docs/FOLLOWUPS.md) under "M3 release-pipeline
hardening (post-PR #28)" and as a Carry-forward bullet on
`MILESTONES.md` §M21.

---------

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
…test notes (#48)

## Problem

Four durable lessons surfaced during the M5b chart work (PR #29 +
follow-up #36) were not recorded in repo-resident notes.

## Impact

Without anchors in `docs/notes/`, the next contributor re-discovers
the same patterns: PR bodies drift across review rounds, memory-rule
collisions get relitigated, rego rules fire on non-pod documents,
iterative reviews repeat prior findings.

## Solution

Three new topic notes under `docs/notes/`, each with anchors that
catch regression:

- `pr-workflow.md` — body-sync per review round + iterative-review
  prompt discipline.
- `memory-rules.md` — forward-only compliance as the resolution for
  rule collisions.
- `conftest.md` — `input.spec.template.spec` guard on every rego
  deny rule that references pod-spec fields.
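
Why the guard matters: in Rego, `not` applied to an undefined reference evaluates to true, so a rule such as "deny when `seccompProfile.type` is missing" also fires on documents that have no pod spec at all. The Python analogy below reproduces that failure mode with dictionary lookups. The function names are illustrative, and this is not the project's policy code.

```python
def _dig(doc, *keys):
    """Follow keys through nested dicts; return None if any step is missing."""
    cur = doc
    for k in keys:
        if not isinstance(cur, dict) or k not in cur:
            return None
        cur = cur[k]
    return cur


def deny_missing_seccomp_unguarded(doc: dict) -> bool:
    # Fires whenever the field is absent, including on documents with no pod spec.
    return _dig(doc, "spec", "template", "spec",
                "securityContext", "seccompProfile", "type") is None


def deny_missing_seccomp_guarded(doc: dict) -> bool:
    pod = _dig(doc, "spec", "template", "spec")
    if pod is None:
        return False  # guard: ConfigMap, ServiceAccount, etc. are exempt
    return _dig(pod, "securityContext", "seccompProfile", "type") is None


configmap = {"kind": "ConfigMap", "data": {"config.yaml": "..."}}
print(deny_missing_seccomp_unguarded(configmap))  # True  (false positive)
print(deny_missing_seccomp_guarded(configmap))    # False
```

Both variants still deny a DaemonSet whose pod spec lacks the seccomp profile; only the guarded one stays silent on non-pod documents in the same render.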

`AGENTS.md` topic index gains one line per new note. Captured via the
`learn-from-mistakes` skill — banned-vocabulary check clean, AI
attribution clean, anchors present, AGENTS.md at 55/150 lines.

This PR replaces #41, which was opened off a now-stale base. Closing
#41 in favor of this branch keeps history append-only per the
project's no-rebase-after-push rule.

## Test plan

- [x] `make doc-check` clean post-rebase (202 markdown links resolve).
- [ ] `pr-lint` green.
- [ ] `verify` job unchanged.

```release-notes
[docs] Three new topic notes (`pr-workflow`, `memory-rules`, `conftest`) capture lessons from the M5b chart work; `AGENTS.md` topic index extended.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>