Skip to content

chore(pivot): PR-K.3 — drop helm dead toggles for deleted receivers#234

Merged
trilamsr merged 1 commit into
mainfrom
cleanup-helm-dead-toggles-pr-k-3-partial
May 31, 2026
Merged

chore(pivot): PR-K.3 — drop helm dead toggles for deleted receivers#234
trilamsr merged 1 commit into
mainfrom
cleanup-helm-dead-toggles-pr-k-3-partial

Conversation

@trilamsr

Copy link
Copy Markdown
Contributor

PR-K.2 (#217) deleted the in-tree clockreceiver, dcgm, kernelevents, containerstdout receivers and stdoutexporter from the binary; the chart still carried their values blocks, RBAC, volume mounts, and policy exemptions. Operators turning any toggle on would get a render that the OCB binary cannot load. This drops the dead surface — pre-1.0, no operator-deprecation tax owed (per single-contributor latitude).

Partial close of #220: chart bits only. The k8sevents values key referenced in the issue is not touched here (the receiver still ships); the component-bug-kernelevents.yml issue template referenced in #220 was already absent on main.

What changed

  • values.yaml: deletes the 5 retired toggle blocks (clockreceiver, dcgm, kernelevents, containerstdout, stdoutexporter); keeps pyspy (in-tree-only pending OTel Profiles GA, per scope).
  • templates/_helpers.tpl: drops the containerstdout chart-only-keys omit branch in renderedConfig; refreshes the retire-soon comment to present-state language.
  • templates/daemonset.yaml: drops the conditional automountServiceAccountToken, conditional root securityContext, conditional downward-API env block, and conditional volumes/volumeMounts. The mainline restricted-PSS path is the only path left.
  • templates/containerstdout-rbac.yaml: deleted.
  • templates/NOTES.txt: replaces the multi-toggle WARNING with a pyspy-only note.
  • policies/conftest/tracecore.rego: drops containerstdout_enabled exemption from the runAsNonRoot / runAsUser=0 / runAsGroup=0 rules, and removes the full M15 operational-invariants block (4 deny rules + 2 helpers) that no longer have a receiver to gate.
  • README.md: replaces the dcgm overlay worked example with the otlphttp pattern; updates the defaults table + lead paragraph from clockreceiver+stdoutexporter to hostmetrics+debug; rewrites the kernelevents-OOMKilled troubleshooting note as a generic high-volume-receiver bullet; trims deviations table.
  • ci/{all-receivers-off,one-receiver-on,pyspy-on}-values.yaml: drops enabled: false entries for the deleted toggles (no render impact, just cleanup).

Verification

Locally, against the worktree:

  • helm lint install/kubernetes/tracecore/ — 0 failures, 0 WARNINGs.
  • helm template against the default values + all 3 ci fixtures — renders clean; default automountServiceAccountToken: false; pyspy-on still emits the pyspy receiver block; all-receivers-off correctly produces no receivers: and no pipelines:.
  • conftest against the rendered default DaemonSet — 39/39 pass. Each bad-*.yaml fixture in policies/conftest/testdata/ still denies (12/13 pass with 1 expected deny each — bad-runasuser-0 / bad-runasgroup-0 / bad-runasroot now deny without the exemption escape hatch).
  • good-baseline.yaml + good-sys-ptrace.yaml still pass — 26/26.
  • make check + make doc-check — green.

CI (chart workflow): exercises every gate above against an actual helm + conftest install, plus the kind-cluster e2e install, plus tracecore validate on the one-receiver-on rendered config.

[CHANGE] helm chart: dropped values toggles for clockreceiver, dcgm, kernelevents, containerstdout, and stdoutexporter (retired from the binary in v0.1.0-m9-alpha per RFC-0013 PR-K.2). The chart now ships hostmetrics + debug as the hardware-free default. Operators on prior overlays that set `enabled: false` on these keys must remove those entries on upgrade.

Refs: #220

PR-K.2 deleted clockreceiver/dcgm/kernelevents/containerstdout
in-tree receivers + stdoutexporter; the chart still carried their
toggles, RBAC, volume mounts, and policy exemptions. Wipe them so
the values surface matches what the OCB binary can actually load.

values.yaml drops the 5 retired blocks; helpers/daemonset/NOTES drop
the conditional branches; containerstdout-rbac.yaml deleted entirely;
rego policy drops the runAsUser=0 exemption and the M15 operational
invariants that no longer have a receiver to gate. README replaces
the dcgm overlay example with the otlphttp pattern. CI fixtures
shrink to just the toggles that still resolve.

helm lint, helm template against all 3 ci fixtures, conftest against
the rendered DaemonSet + every bad-*.yaml fixture, make check, and
make doc-check all green.

Refs: #220

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) May 31, 2026 18:34
@trilamsr trilamsr merged commit a813ec6 into main May 31, 2026
14 checks passed
@trilamsr trilamsr deleted the cleanup-helm-dead-toggles-pr-k-3-partial branch May 31, 2026 18:43
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
All three referenced PRs (#224 / #246 / #234) merged ahead of v0.2.0
cut. Stale [ ] checkboxes flagged by PR #251 review.

No body changes — only the open-items checklist.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Release-prep cut for v0.2.0. Pure version-string bump — no code or dep
changes.

- `builder-config.yaml` `dist.version`: `0.1.0-m9-alpha` → `0.2.0`.
- `install/kubernetes/tracecore/Chart.yaml` `version`: `0.1.0` →
`0.2.0`; `appVersion`: `"0.1.0-m9-alpha"` → `"0.2.0"` (kept in lockstep
by `scripts/chart-appversion-check.sh`).
- `CHANGELOG.md`: promote the existing `[Unreleased]` body to a new `##
[0.2.0] - 2026-05-31` section with a one-paragraph user-facing summary
on top (OCB-built binary, OTel v0.130, in-repo `module/` submodule,
self-tel rename, ko-published image, RFC-0013 pointer). A fresh empty
`[Unreleased]` header sits above it.
- `docs/migration/v0.1-to-v0.2.md`: flip 3 stale `[ ]` checkboxes
(PR-I.1b #224, PR-I.2 #246, PR-K.3 #234) to `[x]` — all three landed
pre-bump.

After this merges, the operator follow-up is the manual `git tag -s
v0.2.0 && git push origin v0.2.0` — that triggers
`.github/workflows/release.yml` (goreleaser + ko + cosign + SLSA
provenance). The tag push is intentionally NOT in this PR per
release-sequencing discipline.

## Verification

- `make check` — green (fmt, tidy-check, lint, vet, mod-verify) pre-bump
AND post-bump.
- `bash scripts/chart-appversion-check.sh` — green: `Chart.yaml
appVersion (0.2.0) matches builder-config.yaml dist.version`.
- `make build` — succeeded; OCB-built `./_build/tracecore` linked.
- `./_build/tracecore --version` — prints `tracecore version
v0.1.0-m1-206-g69f8981-dev` pre-tag (expected: no v0.2.0 tag yet; `git
describe --match 'v*' --dirty=-dev` resolves against `v0.1.0-m1`). The
release pipeline pins `TRACECORE_VERSION="$TAG"` for both goreleaser and
ko-publish, so the in-image binary published from the v0.2.0 tag will
report `v0.2.0` verbatim. The hardcoded `dist.version` fallback covers
tarball-extract / no-git scenarios.
- `make test` — green (root unit tests, race detector, all packages).
- `cd module && go test ./...` — green (submodule: nccl_fr parser,
patterns, replay, patterndetectorprocessor, rankjoinprocessor,
ncclfrreceiver).

## Known follow-up (not in this PR)

- CHANGELOG `[0.2.0]` section still carries pre-pivot M1-M11 `Added`
rows that describe code PR-F.1/F.2/K.2 (same section) explicitly delete.
Honest history but operators reading top-down can't tell what landed in
v0.2.0 vs what was pre-pivot scaffolding. Filed for a separate sweep PR
— kept out of this scope-of-3 contract.

## Test plan

- [ ] Reviewer confirms `Chart.yaml::appVersion` matches
`builder-config.yaml::dist.version` after merge (`bash
scripts/chart-appversion-check.sh`).
- [ ] Reviewer confirms `./_build/tracecore --version` will report
`v0.2.0` when built from the v0.2.0 tag commit.
- [ ] Operator follow-up post-merge: `git tag -s v0.2.0 -m "tracecore
v0.2.0" <merge-sha> && git push origin v0.2.0` — release workflow fires
automatically from the `push: tags: ['v*']` trigger.

## Release notes

```release-notes
- release(v0.2.0): bump builder-config.yaml dist.version and install/kubernetes/tracecore/Chart.yaml version + appVersion from 0.1.0-m9-alpha to 0.2.0; promote CHANGELOG [Unreleased] body to a tagged [0.2.0] - 2026-05-31 section; sweep stale migration-guide checkboxes.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…460) (#466)

## Summary

Closes #460. The `exit 0` on `scripts/doc-check.sh` ran unconditionally
whenever `docs/FAILURE-MODES.md` carried no `Test*`/`Fuzz*`/`Benchmark*`
identifiers (its current state on `main` — `grep -c` = 0), silently
bypassing every gate below it. Fix scopes the skip to the Go-test parity
block only (if/else, not `exit`), then surfaces and fixes the dead refs
the gates were supposed to be catching.

## Root cause

Commit a57883f (#13) shipped `doc-check.sh` with one gate — the Go-test
name parity check — so `[ -z "$referenced" ] && exit 0` was correct
then. PRs #28, #56, #115, #131, #144, #149, #195, #234, #241, #443,
#455, #459 (and others) appended gates **below** that line without
recognising they'd become dead code whenever `FAILURE-MODES.md` lost its
`Test*` references. PR #459 worked around the bug by placing its new
YAML gate *above* line 99 and tracked the root cause separately as #460.

## What surfaced

Once `exit 0` was removed, three real issues fired:

1. **Dead `.md` link**: `docs/FOLLOWUPS.md` → `followups/otlphttp.md`.
The shard was never committed to `main`'s ancestry. Folded into the
existing "Shards deleted post-v0.2.0 as fully resolved-via-pivot" prose
block (sibling treatment to M9, M14, M16).
2. **Banned-phrase hits** (3x `production-grade`): reworded in
`docs/cut-criteria.yaml.md` (2x) and
`install/kubernetes/tracecore/README.md` (1x) to falsifiable language.
3. **`docs/getting-started.md` block cap**: 7 fenced bash/sh blocks. The
M6 cap of 5 was set for the quickstart only — `## Install via Helm` and
`## Air-gapped install` are alternate deployment paths that landed
post-M6 and aren't part of the quickstart budget. Rescoped the gate to
count blocks inside the `## Walkthrough` H2 section only (1 block, well
under cap).

## Gate count

Empirically verified via `grep -c '^doc-check: '` on `make doc-check`
output on a clean tree:

| State | Status lines emitted | Gates the early-exit was hiding |
|---|---|---|
| Pre-fix on `main` (post-#459) | 3 (trust-posture, YAML cross-link,
parity-skip) | 14 |
| Post-fix this PR (post-rebase) | 17 | 0 |

The "14 gates hidden" number is invariant across the rebase: it counts
gates placed below the early-exit line. The "3 → 17" total reflects
post-#459 reality on `main`; pre-#459 baseline was "2 → 16" (the figure
originally in this PR body), and #459 itself worked around the bug by
placing its YAML gate above line 99.

## Mutation tests

Each gate below the original early-exit was confirmed to fire post-fix:

| Mutation | Gate expected to fire | Exit code post-mutation | Exit code
post-restore |
|---|---|---|---|
| Inject `[bad](nonexistent-ghost.md)` into `docs/FOLLOWUPS.md` |
markdown link-rot | 1 | 0 |
| Append `blazing-fast` + `rock-solid` to `docs/getting-started.md` |
banned-phrase lint | 1 | 0 |
| Delete `<!-- tested-against: ... -->` from
`docs/integrations/datadog.md` | M6 recipe markers | 1 | 0 |

## Test plan

- [x] `make doc-check` exits 0 on clean tree (re-run post-rebase onto
origin/main; 17 status lines)
- [x] 3 mutation tests above each toggle exit 1 → 0 across mutate /
restore
- [x] Pre-push hooks green: golangci-lint (0 issues), `go vet ./...`,
`go mod verify`, `attribute-namespace-check` (100 attrs, all
documented), `register-lint`, `actionlint`, `zizmor`,
`deprecation-check`, `no-autoupdate-check`
- [x] Rebased onto current `origin/main` (includes #459, #461, #462,
#456); no conflicts; gate count re-verified empirically post-rebase
- [x] No changes to gates above line 99 (the trust-posture callout +
YAML cross-link gate from #459 still run and emit unchanged status
lines)

## Self-grade

**A+** — root cause named in commit body (a57883f #13 with one gate;
gates appended below without exit-path awareness); 3 mutation tests
(success criteria required 1–2); rescoped the getting-started gate to
match M6 intent rather than papering over the surfaced overflow; the `[
-z "$referenced" ]` legitimate skip is preserved via if/else (not `:`
no-op, which would have left the `defined=` / `orphans=` block running
on empty input); gate count corrected empirically post-rebase per
reviewer B feedback.

```release-notes
- fix(ci): `scripts/doc-check.sh` no longer exits 0 at the Go-test parity gate when `docs/FAILURE-MODES.md` carries no `Test*` references. 14 gates below that line (link-rot, banned-phrase, M6 recipe markers, etc.) are now actually enforced on every `make doc-check` invocation. Closes #460.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant