[feat] M5b — Helm chart + minimal-privilege pod spec#29
Merged
tracecore lacks an operator-installable surface; without a chart, the M5/M20 install benchmarks have nothing to install and v0.1.0 cannot ship per MILESTONES.md §M5b. Operators evaluating tracecore would need to hand-roll a DaemonSet from `components/receivers/*/example-daemonset.yaml` snippets.

Add `install/kubernetes/tracecore/`:

- Chart.yaml apiVersion v2, SemVer version 0.1.0, appVersion pinned to the binary release the chart was tested against.
- values.yaml exposes `namespace`, per-receiver `receivers.<name>.enabled` toggles, free-form `config:` override, resource/probe/tolerations knobs.
- Templates render a Kubernetes `restricted` Pod Security Standard DaemonSet: runAsNonRoot, runAsUser 65532, seccompProfile RuntimeDefault, allowPrivilegeEscalation false, readOnlyRootFilesystem true, capabilities drop [ALL] add []. hostPID/hostIPC/hostNetwork are pinned false at the template level, not values-tunable.
- Conftest policy rejects privileged/hostPID/hostIPC/missing-readOnlyRootFilesystem/disallowed-capabilities-except-SYS_PTRACE with fixture-driven good and bad manifests.
- README has the five required H2 sections (Install, Upgrade, Uninstall, Values reference, Troubleshooting) plus a deviations table documenting SYS_PTRACE allowance and hostPath mount opt-ins.
- `.github/workflows/chart.yml` runs OUTSIDE the 60s `make ci` budget: helm lint, helm template, `tracecore validate` on rendered all-off and one-on configs, yq assertions on Chart.yaml + securityContext fields, conftest deny/allow fixtures, and an end-to-end kind-cluster install that asserts install-to-Ready ≤300s.

Test plan:

- helm lint install/kubernetes/tracecore → 0 issues (verified locally).
- helm template ... | tracecore validate → exit 0 for both all-off and one-receiver-on values overlays (verified locally).
- conftest test against six bad fixtures (each denies on its specific rule) and two good fixtures + the chart's own render (all pass; verified locally with conftest v0.62.0 + OPA 1.15.2).
- yq assertions parse the rendered DaemonSet and pin every restricted-PSS field (verified inline in the workflow).
- AI-vocab grep gate returns no hits on install/ and the new workflow.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
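
The restricted-PSS field assertions described in this commit could be sketched roughly as below. This is a Python stand-in for the workflow's yq checks, not the workflow itself: `sample_daemonset` and `restricted_violations` are hypothetical names, and the split between pod-level and container-level `securityContext` fields is an assumption, since the commit message lists the fields without saying where each one lives.

```python
def sample_daemonset() -> dict:
    """Hand-written stand-in for the rendered DaemonSet (not real chart output)."""
    return {
        "kind": "DaemonSet",
        "spec": {"template": {"spec": {
            "hostPID": False,
            "hostIPC": False,
            "hostNetwork": False,
            # Assumption: these three fields sit on the pod-level securityContext.
            "securityContext": {
                "runAsNonRoot": True,
                "runAsUser": 65532,
                "seccompProfile": {"type": "RuntimeDefault"},
            },
            "containers": [{
                "name": "tracecore",
                # Assumption: these fields sit on the container securityContext.
                "securityContext": {
                    "allowPrivilegeEscalation": False,
                    "readOnlyRootFilesystem": True,
                    "capabilities": {"drop": ["ALL"], "add": []},
                },
            }],
        }}},
    }


def restricted_violations(daemonset: dict) -> list[str]:
    """Return one message per field that deviates from the values named in the commit."""
    pod = daemonset["spec"]["template"]["spec"]
    pod_sc = pod.get("securityContext") or {}
    problems = []
    # A missing host* flag defaults to false in Kubernetes, so only true is flagged.
    for flag in ("hostPID", "hostIPC", "hostNetwork"):
        if pod.get(flag, False):
            problems.append(f"{flag} must be false")
    if pod_sc.get("runAsNonRoot") is not True:
        problems.append("runAsNonRoot must be true")
    if pod_sc.get("runAsUser") != 65532:
        problems.append("runAsUser must be 65532")
    if (pod_sc.get("seccompProfile") or {}).get("type") != "RuntimeDefault":
        problems.append("seccompProfile.type must be RuntimeDefault")
    for container in pod.get("containers", []):
        name = container.get("name", "?")
        sc = container.get("securityContext") or {}
        caps = sc.get("capabilities") or {}
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if sc.get("readOnlyRootFilesystem") is not True:
            problems.append(f"{name}: readOnlyRootFilesystem must be true")
        if caps.get("drop") != ["ALL"]:
            problems.append(f"{name}: capabilities.drop must be [ALL]")
        if (caps.get("add") or []) != []:
            problems.append(f"{name}: capabilities.add must be empty")
    return problems
```

Returning a list of messages instead of raising on the first failure mirrors how a CI gate would report every drifted field in one run.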
The `helm status — STATUS: deployed` step name contained an unquoted colon, which YAML interprets as a mapping value and rejects at parse time. actionlint flags it cleanly; quoting the name keeps the colon literal. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
When the Helm release name contains the chart name, the chart's
fullname template returns the release name unchanged ("tracecore"),
not the "release-chart" composite. The install gate's hard-coded
"tracecore-tracecore" mismatched this, so `kubectl rollout status`
errored out NotFound right after a successful `helm install --wait`.
Resolve the DaemonSet by label selector
(`app.kubernetes.io/instance=tracecore`) instead. Same fix for the
follow-up port-forward step.
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
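
The naming behavior behind this fix can be illustrated with a small Python model. It is a sketch modeled on the `fullname` helper that `helm create` scaffolds by default; the chart's actual template is not shown in this PR, so the function name, the override parameters, and the 63-character truncation are assumptions carried over from that scaffold.

```python
def fullname(release_name: str, chart_name: str,
             name_override: str = "", fullname_override: str = "") -> str:
    """Model of the scaffolded Helm fullname helper (assumed, not the chart's own code)."""
    if fullname_override:
        return fullname_override[:63].removesuffix("-")
    name = name_override or chart_name
    # If the release name already contains the chart name, it is used unchanged.
    if name in release_name:
        return release_name[:63].removesuffix("-")
    # Otherwise the helper builds the "release-chart" composite.
    return f"{release_name}-{name}"[:63].removesuffix("-")


# The case from the commit: release "tracecore" installing chart "tracecore".
print(fullname("tracecore", "tracecore"))
# A release whose name does not contain the chart name gets the composite.
print(fullname("prod", "tracecore"))
```

Because the resulting name depends on the release name, selecting the DaemonSet by the `app.kubernetes.io/instance` label, as the commit does, avoids hard-coding either form.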
Pass-1 review surfaced gaps a values-override could exploit, plus supply-chain and operator-UX paper cuts:

- conftest policy now denies pod-level securityContext.runAsNonRoot=false and a missing seccompProfile.type (PSS restricted requirements that the prior rules did not enforce). Both rules are gated on input.spec.template.spec so ConfigMap/ServiceAccount documents are exempt.
- Three new bad fixtures (bad-hostnetwork, bad-runasroot, bad-missing-seccomp) close gaps in deny-path coverage; every existing fixture now carries the pod-level securityContext so each one fails on its specific rule, not on the new policy floor.
- Dockerfile pins both base images by sha256 digest. Tags are mutable; digests are the supply-chain anchor.
- chart workflow installs yq and conftest via `go install`, which resolves through the Go module proxy + checksum database. The prior `curl` of the yq release binary had no integrity check.
- New CI gate asserts the rendered DaemonSet probe paths match the rendered configmap's telemetry.paths; without it, a template edit could wire /healthz to a path tracecore never serves.
- values.yaml: priorityClassName knob (defaults to empty so behavior is unchanged for operators not opting in); readiness probe grace window widened from 17s to 45s; inline comments on dcgm.endpoint and serviceAccount.automount call out the CrashLoop trap and the forward-looking SA purpose.
- NOTES.txt validate command now references the chart by its repo-rooted path so an operator copy-pasting from `helm install` output outside the checkout doesn't hit a "could not find chart" error.

All 8 bad fixtures fail on their specific rule; both good fixtures and the chart's own render pass; helm lint, actionlint, and the AI-vocab gate are clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
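
The probe-path gate from this commit might look something like the following Python sketch. The names are hypothetical, and the mapping of `livenessProbe` to `healthz` and `readinessProbe` to `readyz` is an assumption; the commit only states that rendered probe paths must match the rendered `telemetry.paths`.

```python
# Assumed mapping between probe kinds and telemetry.paths keys.
PROBE_TO_PATH_KEY = {"livenessProbe": "healthz", "readinessProbe": "readyz"}


def probe_path_mismatches(container: dict, telemetry_paths: dict) -> list[str]:
    """Compare each rendered HTTP probe path with the path the config says is served."""
    mismatches = []
    for probe, key in PROBE_TO_PATH_KEY.items():
        http_get = (container.get(probe) or {}).get("httpGet")
        if http_get is None:
            continue  # probe not rendered, so there is nothing to compare
        served = telemetry_paths.get(key)
        if http_get.get("path") != served:
            mismatches.append(
                f"{probe} targets {http_get.get('path')!r} but config serves {served!r}"
            )
    return mismatches


# Hypothetical rendered values: the liveness probe has drifted from the config.
container = {
    "livenessProbe": {"httpGet": {"path": "/health", "port": 8888}},
    "readinessProbe": {"httpGet": {"path": "/readyz", "port": 8888}},
}
paths = {"metrics": "/metrics", "healthz": "/healthz", "readyz": "/readyz"}
for line in probe_path_mismatches(container, paths):
    print(line)
```

A gate built this way fails on the drift itself, before an install ever reaches the kind cluster and shows up as a pod that never becomes Ready.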
Pass-2 review surfaced four restricted-PSS holes that values overrides could still slip through, an under-tested rendering surface, and a set of adopter docs that left common questions unanswered:

- conftest now denies runAsUser=0, runAsGroup=0, hostUsers=true, procMount=Unmasked, and allowPrivilegeEscalation=true. Five new bad fixtures pin each rule. The hostPID/hostIPC/hostNetwork rules gain explicit `input.spec.template.spec` guards so non-pod docs cannot cause undefined evaluations to leak through.
- chart workflow adds a render-correctness gate for the two value-conditional template paths yq cannot infer from default-render output: priorityClassName injection and probe omission when telemetry.enabled=false.
- README gains rows for priorityClassName + probes in the values table, a "Common configurations" section with three worked overlays (DCGM, OTLP backend, all-nodes tolerations), and Troubleshooting entries for OOMKilled, large-fleet rollout, ImagePullBackOff, --reuse-values caveat, and namespace PSS enforcement.
- NOTES.txt warns when stdoutexporter is rendering against the default image; the chart's "validation default" should not silently graduate to production. values.yaml warns operators away from putting credentials in the `config:` override block (ConfigMap → unencrypted etcd).
- .helmignore drops the kind-install Dockerfile from the packaged chart (it is CI scaffolding, not a distribution artifact).
- testdata/README.md documents the fixture shape so a new contributor adds a fixture that fails on its rule, not on the policy floor.
- CHANGELOG entry extended to enumerate the full conftest denial set and the new values knobs.

All 13 bad fixtures fail on their specific rule; good fixtures + chart render pass; helm lint, actionlint, and the AI-vocab gate are clean; priorityClassName + telemetry-off render checks verified locally.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
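
The guard pattern this commit describes, where every deny rule first checks that the document has a pod spec, can be modeled in Python as below. This is an illustrative sketch of the idea, not the repository's rego: the `deny` function, its messages, and the choice to check `runAsUser`/`runAsGroup` at both pod and container level are assumptions.

```python
def deny(doc: dict) -> list[str]:
    """Return denial messages for one rendered document, modeling guarded rego rules."""
    pod = ((doc.get("spec") or {}).get("template") or {}).get("spec")
    if pod is None:
        # Guard: ConfigMap, ServiceAccount and other non-pod documents are exempt.
        return []
    msgs = []
    for flag in ("hostPID", "hostIPC", "hostNetwork", "hostUsers"):
        if pod.get(flag) is True:
            msgs.append(f"{flag}=true is not allowed")
    # Collect the pod-level context plus each container's context.
    contexts = [("pod", pod.get("securityContext") or {})]
    for container in pod.get("containers", []):
        contexts.append((container.get("name", "?"), container.get("securityContext") or {}))
    for owner, sc in contexts:
        if sc.get("runAsUser") == 0:
            msgs.append(f"{owner}: runAsUser=0 is not allowed")
        if sc.get("runAsGroup") == 0:
            msgs.append(f"{owner}: runAsGroup=0 is not allowed")
        if sc.get("procMount") == "Unmasked":
            msgs.append(f"{owner}: procMount=Unmasked is not allowed")
        if sc.get("allowPrivilegeEscalation") is True:
            msgs.append(f"{owner}: allowPrivilegeEscalation=true is not allowed")
    return msgs


# A ConfigMap has no spec.template.spec, so no rule fires on it.
print(deny({"kind": "ConfigMap", "data": {"config.yaml": "receivers: {}"}}))
```

Each bad fixture in the commit is meant to trip exactly one rule; in this model that corresponds to `deny` returning a single message for a fixture that is otherwise compliant.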
values.schema.json validates the values.yaml shape at `helm install` /
`helm template` time. Catches typos and type mismatches (e.g.
`receivers.clockreceiver.enabled=notabool`) before the template
renders. The schema:
- pins image.pullPolicy to {Always, IfNotPresent, Never}
- requires podSecurityContext.seccompProfile.type ∈ {RuntimeDefault, Localhost}
- requires containerSecurityContext.allowPrivilegeEscalation=false
and readOnlyRootFilesystem=true (matches the conftest charter)
- bounds containerSecurityContext.capabilities.add to {SYS_PTRACE}
- pins telemetry.paths.{metrics,healthz,readyz} to absolute paths
- pins runAsUser / runAsGroup minimum to 1 (rejects root)
Chart.yaml gains Artifact Hub annotations (artifacthub.io/license,
prerelease, links, changes, operator) so the chart publishes to
Artifact Hub cleanly when the project opens external distribution.
docs/FOLLOWUPS.md gains a new "M5b chart — opportunistic deferrals"
section enumerating five items review surfaced as out-of-scope or
trigger-based: NetworkPolicy template, image scanning + SBOM on the
chart image, per-receiver resource guidance docs, 10-run install-to-
Ready median aggregate, and appArmorProfile rendering. Each carries
an explicit trigger.
Signed-off-by: Tri Lam <trilamsr@gmail.com>
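
The constraints this commit lists for `values.schema.json` can be restated as a small hand-rolled Python check. This is a sketch, not the schema: Helm evaluates the real JSON Schema itself, the function names are hypothetical, and placing `runAsUser`/`runAsGroup` under `podSecurityContext` is an assumption the commit message does not confirm.

```python
def example_values() -> dict:
    """Hypothetical values overlay that satisfies every constraint listed in the commit."""
    return {
        "image": {"pullPolicy": "IfNotPresent"},
        "podSecurityContext": {
            "runAsUser": 65532,
            "runAsGroup": 65532,
            "seccompProfile": {"type": "RuntimeDefault"},
        },
        "containerSecurityContext": {
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"], "add": []},
        },
        "telemetry": {"paths": {"metrics": "/metrics", "healthz": "/healthz", "readyz": "/readyz"}},
    }


def schema_errors(values: dict) -> list[str]:
    """Return one message per violated constraint."""
    errors = []
    pod_sc = values.get("podSecurityContext") or {}
    ctr_sc = values.get("containerSecurityContext") or {}
    if (values.get("image") or {}).get("pullPolicy") not in {"Always", "IfNotPresent", "Never"}:
        errors.append("image.pullPolicy must be Always, IfNotPresent, or Never")
    if (pod_sc.get("seccompProfile") or {}).get("type") not in {"RuntimeDefault", "Localhost"}:
        errors.append("seccompProfile.type must be RuntimeDefault or Localhost")
    if ctr_sc.get("allowPrivilegeEscalation") is not False:
        errors.append("allowPrivilegeEscalation must be false")
    if ctr_sc.get("readOnlyRootFilesystem") is not True:
        errors.append("readOnlyRootFilesystem must be true")
    added = (ctr_sc.get("capabilities") or {}).get("add") or []
    if not set(added) <= {"SYS_PTRACE"}:
        errors.append("capabilities.add may only contain SYS_PTRACE")
    paths = (values.get("telemetry") or {}).get("paths") or {}
    for key in ("metrics", "healthz", "readyz"):
        if not str(paths.get(key, "")).startswith("/"):
            errors.append(f"telemetry.paths.{key} must be an absolute path")
    for key in ("runAsUser", "runAsGroup"):
        uid = pod_sc.get(key)
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if isinstance(uid, bool) or not isinstance(uid, int) or uid < 1:
            errors.append(f"{key} must be an integer >= 1")
    return errors
```

The value of doing this in the schema rather than in templates is timing: a bad overlay is rejected before anything renders.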
trilamsr added a commit that referenced this pull request on May 15, 2026
Bring M5b (#29) into the M3 branch so PR #28 is mergeable. One real content conflict in CHANGELOG.md — both branches added an entry under [Unreleased] / Added; resolved by keeping both, M3 listed before M5b per the M3 → M5b → M6 → M21 minimum-viable v0.1.0 dependency chain documented in MILESTONES.md. docs/FOLLOWUPS.md auto-merged cleanly (M5b's additions land in disjoint sections from the M3 release-pipeline-hardening section added on this branch). No other files conflicted. `make ci` green (12.5 s wall, well under PRINCIPLES §10 60 s). doc-check passes; 214 markdown links resolve. Per MEMORY.md `feedback_no_history_rewrites` the resolution is a merge commit, not a rebase — origin/main is pushed history and cannot be rebased over. Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request on May 15, 2026
Catch the branch up to main: PR #30 (M4b failure-injection harness) and PR #31 (M11 NCCL FlightRecorder receiver + safe pickle parser) landed since the prior merge of #29. Both auto-merged cleanly into disjoint sections of CHANGELOG.md and docs/FOLLOWUPS.md — no manual conflict resolution required. `make ci` green (58.6 s wall, within PRINCIPLES §10 60 s; up from 12 s pre-merge because the M4b harness + chaos suite and M11 pickle parser added a substantial test surface). doc-check: 193 markdown links resolve. release.yml unchanged across the merge so the end-to-end attestation run on v0.0.0-m3test-7 still characterises the post-merge HEAD's release behavior. Signed-off-by: Tri Lam <trilamsr@gmail.com>
This was referenced May 15, 2026
trilamsr added a commit that referenced this pull request on May 15, 2026
…act Hub annotations (#36)

## Problem

The M5b CHANGELOG entry that landed in #29 understates scope: the follow-on commit added `values.schema.json` and Artifact Hub annotations on `Chart.yaml`, but the [Unreleased]/Added bullet was not re-edited before merge.

## Impact

Adopters scanning the changelog before evaluating tracecore miss two operator-visible features: schema-validated values overlays and Artifact Hub publish-readiness.

## Solution

Extend the M5b bullet with one sentence covering both. No other entries touched; no chart-side changes.

## Test plan

- [x] `make doc-check` clean.
- [ ] `pr-lint` green.
- [ ] `CI verify` unchanged.

```release-notes
[docs] CHANGELOG M5b bullet extended to mention values.schema.json and Artifact Hub annotations that shipped in #29 but were missed in the entry.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request on May 15, 2026
## Problem

v0.1.0 (M21) needs every release artifact to be byte-reproducible from the same SHA and verifiable end-to-end. M3 owns that pipeline.

## Impact

Closes MILESTONES.md §M3. Unblocks M21 (release tag).

## Solution

- `release.yml` builds `linux/amd64` twice at every `v*` tag into isolated `mktemp -d` dirs with fresh `GOCACHE` (cold-vs-cold byte-equality, not warm-vs-warm, is the assertion a third party reproduces); `diffoscope` fails closed on any byte diff. `-trimpath` + `SOURCE_DATE_EPOCH` honored; embedded `BuildDate` cross-checked against `strings` of the binary.
- CycloneDX SBOM via `cyclonedx-gomod mod`. Coverage gate intersects direct `go.mod` requires with `go list -deps ./cmd/tracecore`, so the SBOM must enumerate every runtime-reachable direct module (test-only deps like `testify`/`goleak` are correctly excluded; they don't ship in the binary).
- Cosign keyless `sign-blob` + smoke `verify-blob`. Cert identity pinned to `^https://github.com/<repo>/\.github/workflows/release\.yml@refs/tags/` so a sibling workflow on another branch cannot mint a passing bundle. No long-lived secrets; `id-token: write` is the only elevated permission outside the `release` job.
- SLSA v1.0 provenance via `actions/attest-build-provenance`. Bundle's `predicateType == https://slsa.dev/provenance/v1` and `subject[0].digest.sha256` are cross-checked against the build job's digest before the artifact is uploaded.
- Every third-party action pinned to a 40-char commit SHA. SBOM-job checkout pins to `${{ github.sha }}` (not the mutable tag) so a force-push between build and SBOM cannot diverge the BOM from the binary.
- `docs/reproducibility.md` is the third-party verification recipe (rebuild → diffoscope → cosign → `gh attestation verify --bundle --signer-workflow --owner --predicate-type` → SBOM inspect). Works offline because step 6 reads the bundle from disk. `scripts/doc-check.sh` gates its presence and `bash -n` syntax of every fenced block; mutation-tested.

M3 steps live in `release.yml`, not `make ci`. The base branch absorbed M5b (#29), M4b (#30), and M11 (#31) during this PR's life; merges with each are reflected in the branch and `make ci` still runs under the PRINCIPLES §10 60 s budget on darwin/arm64 (`(unverified: local measurement, not CI-cited)`).

## Test plan

- [x] `make ci` green on the latest commit; under 60 s (see `verify` on the PR's CI workflow run).
- [x] doc-check mutation test: inject `if then fi` into a fenced block → gate fails with block-#; restore → gate passes.
- [x] `release.yml` runs end-to-end on `v0.0.0-m3test-7` (commit `b6c745f`) with diffoscope / cosign / SLSA / SBOM all green: [run 25914673989](https://github.com/TraceCoreAI/tracecore/actions/runs/25914673989). `release.yml` is unchanged across the subsequent commits on this branch (the two merges of `main` plus two doc fixups touched no workflow file), so the run still characterizes the current HEAD's release behavior.
- [x] Recipe verified locally against `v0.0.0-m3test-7`: `cosign verify-blob` → `Verified OK`; `gh attestation verify` exit 0 with `predicateType=https://slsa.dev/provenance/v1` and `buildSignerURI` pinned to `release.yml@refs/tags/v0.0.0-m3test-7`.

## Out of scope (filed as follow-ups)

Above-the-floor hardening (SLSA L3 via the reusable-workflow generator, `zizmor` / `actionlint`, repo tag-protection on `v*`, nightly rebuild-and-diff cron, `go mod verify`, build-env sanitization, cosign + `gh attestation` flag tightening, CycloneDX `mod`→`app`, Rekor log-index, recipe polish, apt cache) lives in [`docs/FOLLOWUPS.md`](docs/FOLLOWUPS.md) under "M3 release-pipeline hardening (post-PR #28)" and as a Carry-forward bullet on `MILESTONES.md` §M21.

---------

Signed-off-by: Tri Lam <trilamsr@gmail.com>
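
The effect of the pinned certificate identity described in the Solution can be shown with a short Python sketch. Python's `re` stands in for the matcher cosign applies, and `TraceCoreAI/tracecore` is substituted for the `<repo>` placeholder based on the run URL in the test plan; both are assumptions for illustration.

```python
import re

# Assumption: the <repo> placeholder from the PR body, filled in from the run URL.
REPO = "TraceCoreAI/tracecore"

# Pattern copied from the PR body with <repo> filled in.
IDENTITY_RE = re.compile(
    r"^https://github.com/" + re.escape(REPO)
    + r"/\.github/workflows/release\.yml@refs/tags/"
)


def identity_allowed(certificate_identity: str) -> bool:
    """True when the signing identity is release.yml running on a tag ref."""
    return IDENTITY_RE.search(certificate_identity) is not None


base = f"https://github.com/{REPO}/.github/workflows"
# The release workflow on a tag is accepted.
print(identity_allowed(f"{base}/release.yml@refs/tags/v0.0.0-m3test-7"))
# The same workflow on a branch, or a sibling workflow on a tag, is rejected.
print(identity_allowed(f"{base}/release.yml@refs/heads/main"))
print(identity_allowed(f"{base}/other.yml@refs/tags/v0.0.0-m3test-7"))
```

Anchoring the pattern at `^` and including `@refs/tags/` is what excludes branch runs and sibling workflows; the tag itself is left open so one pattern covers every release.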
This was referenced May 15, 2026
trilamsr added a commit that referenced this pull request on May 15, 2026
…test notes (#48)

## Problem

Four durable lessons surfaced during the M5b chart work (PR #29 + follow-up #36) were not recorded in repo-resident notes.

## Impact

Without anchors in `docs/notes/`, the next contributor re-discovers the same patterns: PR bodies drift across review rounds, memory-rule collisions get relitigated, rego rules fire on non-pod documents, iterative reviews repeat prior findings.

## Solution

Three new topic notes under `docs/notes/`, each with anchors that catch regression:

- `pr-workflow.md`: body-sync per review round + iterative-review prompt discipline.
- `memory-rules.md`: forward-only compliance as the resolution for rule collisions.
- `conftest.md`: `input.spec.template.spec` guard on every rego deny rule that references pod-spec fields.

`AGENTS.md` topic index gains one line per new note.

Captured via the `learn-from-mistakes` skill: banned-vocabulary check clean, AI attribution clean, anchors present, AGENTS.md at 55/150 lines.

This PR replaces #41, which was opened off a now-stale base. Closing #41 in favor of this branch keeps history append-only per the project's no-rebase-after-push rule.

## Test plan

- [x] `make doc-check` clean post-rebase (202 markdown links resolve).
- [ ] `pr-lint` green.
- [ ] `verify` job unchanged.

```release-notes
[docs] Three new topic notes (`pr-workflow`, `memory-rules`, `conftest`) capture lessons from the M5b chart work; `AGENTS.md` topic index extended.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Problem
tracecore has no operator-installable surface today. M5 / M20 install
benchmarks have nothing to install and v0.1.0 (M21) cannot ship until
M5b lands per MILESTONES.md §M5b.
Impact
Operators evaluating tracecore today must hand-roll a DaemonSet from
`components/receivers/*/example-daemonset.yaml` snippets. This blocks
the critical path to v0.1.0 (M3 → M5b → M6 → M21).
Solution
- `install/kubernetes/tracecore/`: Helm chart v2:
  - `Chart.yaml`: apiVersion v2, SemVer version `0.1.0`, `appVersion` held in sync with the binary by the CI gate.
  - `values.yaml`: `namespace`, per-receiver `receivers.<name>.enabled` toggles (clockreceiver on, dcgm/kernelevents off), free-form `config:` override deep-merged INTO the rendered config last, resource/probe/toleration/updateStrategy knobs.
  - `restricted` Pod Security Standard pod spec: `runAsNonRoot`, `runAsUser: 65532`, `seccompProfile: RuntimeDefault`, `allowPrivilegeEscalation: false`, `readOnlyRootFilesystem: true`, `capabilities.drop: [ALL]`, `hostPID`/`hostIPC`/`hostNetwork` pinned `false` at the template level (not values-tunable).
- `policies/conftest/tracecore.rego` rejects `privileged`, `hostPID`, `hostIPC`, `hostNetwork`, missing `readOnlyRootFilesystem`, and any capability addition other than `SYS_PTRACE`. Six fixture-driven bad manifests + two good (baseline + SYS_PTRACE) pin policy behavior.
- `README.md` with the five rubric-required H2 sections (Install, Upgrade, Uninstall, Values reference, Troubleshooting) plus a documented PSS deviation table.
- `.github/workflows/chart.yml` runs OUTSIDE the 60s `make ci` budget: helm lint (zero WARNINGs gate), helm template + `tracecore validate` on rendered all-off and one-receiver-on configs, yq assertions on `Chart.yaml` + DaemonSet `securityContext` fields, conftest deny-fixture + allow-fixture tests, end-to-end kind-cluster install asserting `STATUS: deployed` and install-to-Ready ≤300s.

Test plan

- `chart` workflow / `render` job green (helm lint + helm template + `tracecore validate` + yq + conftest).
- `chart` workflow / `install (kind)` job green (kind 0.25.0, kindest/node v1.32.0, install-to-Ready measured + asserted ≤300s).
- AI-vocab grep gate returns no hits on `install/` and `.github/workflows/chart.yml`.
- `CI` workflow stays green; `make ci` budget unchanged.
- `helm lint install/kubernetes/tracecore` → 0 issues (verified pre-push).
- `conftest test` against `policies/conftest/testdata/` bad/good fixtures behaves as expected (verified pre-push).
Open follow-ups
- … ships a reference `Dockerfile` for the kind-install CI gate; M3 will own the canonical release image.
- … rubric) accumulates as the workflow runs over time on `main`.