Skip to content

feat(pivot): PR-K.2 — delete 4 in-tree receivers + xidgen#217

Merged
trilamsr merged 2 commits into
mainfrom
pr-k-2-delete-receivers
May 31, 2026
Merged

feat(pivot): PR-K.2 — delete 4 in-tree receivers + xidgen#217
trilamsr merged 2 commits into
mainfrom
pr-k-2-delete-receivers

Conversation

@trilamsr

@trilamsr trilamsr commented May 31, 2026

Copy link
Copy Markdown
Contributor

Summary

RFC-0013 §migration PR-K.2: delete the four in-tree receivers
(clockreceiver, kernelevents, k8sevents, containerstdout)
plus the xidgen failure-injector (whose sole consumer was
kernelevents's wire shape per RFC §migration L253). PR-K.1
(#211) just severed internal/synthesis/patterns/ + replay from
k8sevents, unblocking the source-tree cut. PR-J (#195) shipped
the four upstream-OTel recipes that replace the in-tree receivers
in the bundled Helm-chart pipeline; this PR retires the
behind-the-curtain code.

Chart cleanup (values keys, DaemonSet template refs, NOTES.txt
deprecation warnings) intentionally stays in PR-K.3 so operators
get one minor of deprecation telemetry before the values shape
breaks.

Branch state note: PR-F.2 (#215, internal/{componentstatus,pipeline,pipelinebuilder,config,consumer,fanout,runtime/lifecycle} deletion) and PR-I.1a (#214, module/ Go submodule scaffold + go.work) both landed on main mid-flight and have been merged into this branch. The _test.go placeholder-name migration originally scoped here (RFC §migration's "~86 fixture refs" line) is now moot — PR-F.2 deleted every internal/* file that held those refs, so the migration target evaporated. make ci + make verify + make build re-run green against the merged tree.

[CHANGE] In-tree receivers clockreceiver/kernelevents/k8sevents/containerstdout deleted in favor of PR-J upstream-OTel recipes. xidgen failure-injector deleted alongside kernelevents (sole consumer).

What lands

Deletions — receivers (4)

Path Files LOC Replacement (PR-J recipe)
components/receivers/clockreceiver/ 10 ~1.4k hostmetricsreceiver (loadscraper @ 1s) — PR-E landed in #180. RFC-0013 §migration originally named telemetrygeneratorreceiver but that receiver does not exist in opentelemetry-collector-contrib (contrib #41687 + #43657 both closed not_planned).
components/receivers/kernelevents/ 49 ~13.2k journaldreceiver + filelogreceiver (kmsg) + OTTL Xid transform. Customer-stable kernelevents.xid + gpu.id attributes preserved via the OTTL transform per RFC-0013 §3.
components/receivers/k8sevents/ 37 ~6.8k k8sobjectsreceiver (watch mode on events) + OTTL k8s.event.hint transform. 11-entry hint enum preserved via the OTTL transform per RFC-0013 §3. The typed internal/synthesis/patterns/Record + NodeRecord (severed in PR-K.1) keeps M19's pod-evicted detector pinned.
components/receivers/containerstdout/ 56 ~6.4k filelogreceiver + container stanza + file_storage extension. Per-rank attribution + dataloader-timing extraction move to OTTL transforms in the bundled recipe per RFC-0013 §3.

Deletions — supporting infra

Path Why
tools/failure-inject/xidgen/ (2 files, 281 LOC) Sole consumer was the kernelevents wire shape per RFC-0013 §migration L253. Operators inject NVRM Xid via real /dev/kmsg (sudo tee) or systemd-cat against the journald recipe.
install/kubernetes/tracecore/ci/containerstdout-on-values.yaml (41 LOC) Chart-render fixture for the deleted receiver. Remaining chart fixtures (all-receivers-off-values.yaml, one-receiver-on-values.yaml, pyspy-on-values.yaml) untouched; their clockreceiver: enabled: false / kernelevents: enabled: false rows survive to PR-K.3's chart cleanup.
.github/ISSUE_TEMPLATE/component-bug-kernelevents.yml Receiver-specific bug template; no surviving receiver.

failure-inject CLI surface

failure-inject xid --code=N [--format=…] [--count=N] removed.
failure-inject {pod-evict,nccl-hang,cpu-steal} unchanged.
tools/failure-inject/testdata/golden.sha256 drops the two xid
golden rows. .github/workflows/chaos.yml drops the two
two-run-determinism steps that exercised the xid byte-determinism
contract; pod-evict's determinism + the golden-SHA replay loop
survive.

Tooling shed

  • Makefile: retire test-extras-sustained body (was kernelevents-
    only; now @true — target retained so downstream automation has
    a stable name; future sustained-load suites slot back in).
    Retire test-extras-fuzz-kmsg / test-extras-fuzz-journald;
    test-extras-fuzz loop drops to nccl-fr only. Drop the
    kernelevents row from test-extras-race. Empty the bench-check
    for-loop (k8sevents was the only baseline; PR-F.2 already
    rewrote the comment block to reflect this — the merge keeps
    both edits aligned).
  • go.mod: sheds k8s.io/{api,apimachinery,client-go,klog/v2,kube-openapi,utils},
    sigs.k8s.io/{json,randfill,structured-merge-diff/v6,yaml},
    gopkg.in/{evanphx/json-patch.v4,inf.v0} — the dep cluster
    k8sevents dragged in. go.uber.org/goleak also dropped post-
    merge (was held only by PR-F.2's-deleted
    internal/pipeline/chaos_test.go).

Doc + comment sweep

Comment-only references to deleted receivers in
components/exporters/{otlphttp,stdoutexporter}/,
components/receivers/{nccl_fr,pyspy}/,
internal/synthesis/patterns/{doc,model,verdict}.go rewired to
surviving references (or to upstream recipe pointers).
docs/README.md per-component-docs table drops five dead links
(caught by doc-check.sh's rotten-link gate — that is in fact
how I caught the last drift). bench/install/README.md
tick-alias note + schema-v2-rename note updated.
tools/failure-inject/README.md xid section removed; status
banner rewritten.

CHANGELOG (new PR-K.2 entry under Unreleased + wave-4 paragraph
re-balanced), MILESTONES (M1 + M9 + M10 + M15 status lines flipped
from "DELETED at v0.2.0" → "DELETED in PR-K.2" with file pointers
to the integration recipes), docs/migration/v0.1-to-v0.2.md
(PR-K.1 + PR-K.2 checkboxes flipped, PR-F.2 + PR-I.1a status
block updated post-merge), AGENTS.md (queued-for-deletion
paragraph updated to current reality) all swept.

What evaporated mid-flight

Originally this PR was also going to migrate a handful of
_test.go files that held placeholder-name string references
to clockreceiver (in internal/pipeline/saferun_test.go,
internal/config/fuzz_test.go, internal/pipelinebuilder/fuzz_test.go).
PR-F.2 (#215) deleted those files outright while this branch
was being drafted, so the migration target disappeared. No
follow-up needed.

Net LOC delta

192 files changed, 120 insertions(+), 28,927 deletions(-)

What is intentionally NOT in this PR

  • **Helm-chart receivers.clockreceiver / receivers.kernelevents
    / receivers.containerstdout toggles + DaemonSet template refs
    • containerstdout-rbac.yaml template** — stays for PR-K.3 so
      operators see NOTES.txt deprecation warnings before the values
      shape breaks. The toggles are already inert post-PR-A2 (enabling
      any of them in values.yaml crashes the OCB binary at boot
      because the factories are not registered) — keeping them as
      no-ops for one minor preserves operator UX.
  • internal/{componentstatus,pipeline,…}/ deletion — already
    done in PR-F.2 (feat(pivot): PR-F.2 — delete in-tree boot-path internals #215) before this branch landed; merged in.
  • Chart values.yaml # clockreceiver — in-tree heartbeat retired by RFC-0013 PR-A2 style retire-banners — stays for
    PR-K.3 alongside the actual values-keys cleanup.
  • tools/failure-inject/ncclhang/ — KEPT. Used by
    pkg/nccl/fr_parser/synthesize_test.go +
    bench/overhead/nccl_fr_bench_test.go; this is the canonical
    example of a failure-injector that survives the v0.2.0 cut.

Root cause

The four receivers + xidgen survived only because PR-K.1's pattern-
library severance from k8sevents was still in flight. With #211
merged at the start of this session, the deletion is unblocked.
There is no workaround being applied here — this PR is the
root-cause deletion of the in-tree receivers themselves, which
RFC-0013 §migration set as the v0.2.0 deletion target.

Test plan

  • make ci green post-merge (verify + license + nccl-fr-rce-gate +
    register-lint + actionlint + zizmor + coverage-check +
    ci-fuzz-nccl-fr + govulncheck + doc-check + no-autoupdate-check +
    build).
  • make verify green post-merge.
  • make build green post-merge (OCB compile against
    builder-config.yaml with GOWORK=off per the PR-I.1a
    isolation guard yields ./_build/tracecore).
  • go test ./... green across the post-merge tree.
  • Hard pre-flight: zero external Go importers for any
    deletion target (clockreceiver, kernelevents, k8sevents,
    containerstdout, xidgen) — re-verified after the merge.
  • CI: chart-render job validates the surviving chart fixtures
    (all-receivers-off-values.yaml, one-receiver-on-values.yaml,
    pyspy-on-values.yaml); the deleted containerstdout-on-values.yaml
    drop-out should not regress conftest coverage of the
    containerstdout-allowlist / operational-invariant rules
    because the template guard still fires when
    containerstdout.enabled=true is set in any future values
    file. PR-K.3 will reassess once the chart-side keys go.
  • CI: install (kind) job continues to render the bench
    tracecore-values.yaml against the OCB binary
    (hostmetricsreceiver heartbeat surface).
  • CI: harness-determinism (amd64/arm64) job no longer runs
    the xid byte-determinism steps; pod-evict + golden-SHA loop
    survive. Expected: 2 fewer steps per matrix arm.

Gates that should fail this PR if I missed something

  • doc-check's "dead markdown link" gate would catch any
    surviving link into the deleted dirs (caught the
    docs/README.md regressions on the first run; fixed and
    re-verified).
  • go vet ./... would catch a stale import; ran clean.
  • golangci-lint run ./... would catch unused imports or dead
    code introduced by the sweep; 0 issues reported.
  • go mod tidy -diff would catch a missing dep prune; ran clean
    after the post-merge prune of go.uber.org/goleak.

Refs RFC-0013 §migration PR-K.2.

Tri Lam added 2 commits May 31, 2026 02:35
Delete `components/receivers/{clockreceiver,kernelevents,k8sevents,containerstdout}/`
in favor of the PR-J upstream-OTel recipes (hostmetricsreceiver,
journaldreceiver+filelogreceiver+OTTL Xid transform, k8sobjectsreceiver+OTTL
hint transform, filelogreceiver+container stanza+file_storage). Delete
`tools/failure-inject/xidgen/` (sole consumer was the kernelevents wire
shape per RFC-0013 §migration L253) and strip the `xid` subcommand +
golden SHAs + chaos-workflow determinism steps that exercised it.
Delete the `containerstdout-on-values.yaml` chart fixture (chart values
keys + DaemonSet template refs survive to PR-K.3's chart-cleanup cycle).
Delete `.github/ISSUE_TEMPLATE/component-bug-kernelevents.yml`.

Migrate 6 _test.go string-literal placeholder references (`clockreceiver/primary`
→ `hostmetrics/primary`) in `internal/pipeline/saferun_test.go`,
`internal/config/fuzz_test.go`, and `internal/pipelinebuilder/fuzz_test.go`.
Pure rename; no contract change. Update comment-only references in
`components/exporters/{otlphttp,stdoutexporter}/`,
`components/receivers/{nccl_fr,pyspy}/`,
`internal/synthesis/patterns/{doc,model,verdict}.go`, and
`internal/runtime/lifecycle/lifecycle.go`. Retire the kernelevents-only
Makefile targets (`test-extras-sustained`, `test-extras-fuzz-{kmsg,journald}`,
the kernelevents row in `test-extras-race`); empty the `bench-check` for-loop
(k8sevents was the only per-package baseline). go.mod sheds the k8s.io/*
+ sigs.k8s.io/* + gopkg.in/* dep cluster (28 lines off go.mod, 55 off go.sum).

Sweep CHANGELOG (new PR-K.2 entry under Unreleased + wave-4 paragraph
updated to reflect the three previously-open ports having landed),
MILESTONES (M1, M9, M10, M15 status lines flipped from "DELETED at
v0.2.0" → "DELETED in PR-K.2" with file pointers to docs/integrations/
recipes), docs/migration/v0.1-to-v0.2.md (PR-K.1 + PR-K.2 checkboxes
flipped, PR-F.2 gate language updated, PR-I.2 gate satisfied),
docs/README.md (per-component-docs table drops deleted-receiver rows),
AGENTS.md (queued-for-deletion paragraph updated to current reality),
bench/install/README.md (clockreceiver tick-alias note updated),
tools/failure-inject/README.md (xid section removed).

Hard pre-flight: zero external Go importers for any deletion target.
`make ci`, `make verify`, `make build` (OCB compile), and
`tracecore validate` against every `docs/integrations/examples/*.yaml`
all green.

Net: 196 files changed, 28,943 deletions, 131 insertions.

Refs RFC-0013 §migration PR-K.2.

Signed-off-by: Tri Lam <tri@maydow.com>
# Conflicts:
#	MILESTONES.md
#	components/receivers/kernelevents/README.md
#	docs/migration/v0.1-to-v0.2.md
#	internal/config/fuzz_test.go
#	internal/pipeline/saferun_test.go
#	internal/pipelinebuilder/fuzz_test.go
#	internal/runtime/lifecycle/lifecycle.go
@trilamsr trilamsr merged commit 7e5740e into main May 31, 2026
18 of 22 checks passed
@trilamsr trilamsr deleted the pr-k-2-delete-receivers branch May 31, 2026 13:50
trilamsr pushed a commit that referenced this pull request May 31, 2026
PR-K.2 (#217) and PR-F.2 (#215) deleted clockreceiver, containerstdout,
k8sevents, kernelevents from the binary; the namespace-alignment section
in v0.1-to-v0.2.md was authored before those merges landed and still
listed all eight. Trim the per-component substitution table to the four
surviving in-tree components (nccl_fr, pyspy, otlphttp, stdoutexporter)
and point operators at the existing "Orphan in-tree components" table
for the deleted-receiver migration path. Refresh the PromQL example
to use otlphttp (still present) instead of containerstdout (deleted).

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
…216)

## Summary

The four surviving in-tree components (`nccl_fr`, `pyspy`, `otlphttp`,
`stdoutexporter`) each emit self-telemetry through their own
per-component MeterProvider. This PR renames their instrument names from
the v0.1.x `tracecore.*` family to the upstream
`otelcol_<role>_<component>_<metric>` convention so the in-tree
namespace does not collide with the OCB pipeline-runtime's own
`otelcol_*` family.

Per [RFC-0013 §migration v0.1.0](docs/rfcs/0013-distro-first-pivot.md)
row 119: *"self-tel metric rename `tracecore.*` → `otelcol_*`"*.

**Scope reduced post-PR-K.2 / PR-F.2.** The four legacy in-tree
receivers originally listed for rename (`clockreceiver`,
`containerstdout`, `k8sevents`, `kernelevents`) plus the in-tree
boot-path internals were deleted from the binary in #217 and #215 while
this PR was in CI. The namespace-alignment edits to those files were
dropped during the post-#217 merge; the surviving net diff covers only
the four components still in the binary.

## Rename matrix

OTel-dot form (Prometheus scrape renders dots as underscores):

| v0.1.x instrument | post-rename instrument |
|---|---|
| `tracecore.receiver.errors_total` |
`otelcol.receiver.<name>.errors_total` |
| `tracecore.receiver.emissions_total` |
`otelcol.receiver.<name>.emissions_total` |
| `tracecore.receiver.collection_latency_seconds` |
`otelcol.receiver.<name>.collection_latency_seconds` |
| `tracecore.receiver.degraded_seconds_total` |
`otelcol.receiver.<name>.degraded_seconds_total` |
| `tracecore.receiver.last_activity_unix_seconds` |
`otelcol.receiver.<name>.last_activity_unix_seconds` |
| `tracecore.exporter.calls_total` |
`otelcol.exporter.<name>.calls_total` |
| `tracecore.selftelemetry.init_errors_total` |
`otelcol.selftelemetry.init_errors_total` |

Where `<name>` is the OCB component name without underscores.
Per-component substitutions:

| Component | `<name>` |
|---|---|
| `components/receivers/nccl_fr` | `ncclfr` (underscore stripped per
RFC) |
| `components/receivers/pyspy` | `pyspy` |
| `components/exporters/otlphttp` | `otlphttp` |
| `components/exporters/stdoutexporter` | `stdoutexporter` |

**Label shape is preserved.** `component_id` still partitions
per-instance; `kind` / `result` values are unchanged. Dashboards and
alerts that filtered on labels need only the metric-name rename, not a
label-selector rewrite.

## What changed

- **Four `selftel.go` files** (otlphttp, stdoutexporter, nccl_fr, pyspy)
— instrument literals updated; header comments rewritten to document the
new convention + label-preservation invariant.
- **Four `selftel_test.go` files** — `findInstrument(..., "...")` and
`scopeOf(..., "...")` assertions updated; `receiverInstrumentPrefix` /
`exporterInstrumentPrefix` constants (used by the `failingReceiverMP` /
`failingExporterMP` synthetic-failure seams) bumped to the new prefix so
a future drift back to `tracecore.*` would fail compile-time on the
prefix mismatch.
- **`docs/examples/prometheus-alerts.example.yaml`** — rewritten as
receiver-agnostic starter using regex matchers
(`{__name__=~"otelcol_receiver_.*_errors_total"}`) so a new in-tree
receiver inherits coverage on first scrape. Removed the
`tracecore_exporter_failure_rate` / `tracecore_build_info` rules — those
v0.1.x gauges no longer exist post-RFC-0013.
- **`docs/migration/v0.1-to-v0.2.md`** — new "In-tree receiver /
exporter namespace alignment (RFC-0013 v0.1.0)" section listing only the
four surviving in-tree components; points operators at the existing
"Orphan in-tree components" table for deleted-receiver migration.
- **`NOTE on ExporterCarrier removal`** blocks in otlphttp.go +
stdoutexporter.go collapsed — they referenced
`cmd/tracecore.collect.collectFailureRateReaders` (deleted in PR-A2) and
`internal/selftelemetry` (deleted in PR-F.1). New comments point at the
PromQL recipe
`rate(otelcol_exporter_<name>_calls_total{result="error"}[5m])`.
- **CHANGELOG** — `[Unreleased]` entry added documenting the alignment.

## Test plan

- [x] `go vet ./...` clean (root + module/ submodule)
- [x] `go test ./...` all green (root module; module/ submodule has no
test files)
- [x] `make check` (fmt + tidy-check + lint + vet + mod-verify) green
- [x] `make doc-check` green
- [x] `TestSelfTelemetry_*` assertions pass for every surviving package
- [x] `TestFactory_FallsBackToNoopWhenMeterFails`
synthetic-register-failure seam still trips on the renamed prefix
(`failingReceiverMP` / `failingExporterMP`)

## Merge resolution (post-#215, #217)

Conflicts surfaced as "deleted by them" across 20 files in the four
deleted receivers. Resolution: accept all upstream deletions; the
namespace-alignment edits to those files were moot because the files
themselves no longer exist on main. No content edits required to the
surviving four components from the merge — their namespace-alignment
commits remain intact.

```release-notes
**Breaking (pre-1.0)**: in-tree component self-telemetry metric names renamed from `tracecore.*` to `otelcol.<role>.<name>.*` per RFC-0013 §migration v0.1.0 namespace alignment. Affects the four surviving in-tree components (`nccl_fr`, `pyspy`, `otlphttp`, `stdoutexporter`); the four legacy in-tree receivers (`clockreceiver`, `containerstdout`, `k8sevents`, `kernelevents`) were already deleted from the binary in #215 / #217 and migrate via the upstream-receiver replacements documented in the "Orphan in-tree components" table in `docs/migration/v0.1-to-v0.2.md`. Label shape preserved (`component_id`, `kind`, `result` unchanged). Operators with dashboards / alerts on the v0.1.x names rename per the table in `docs/migration/v0.1-to-v0.2.md` under "In-tree receiver / exporter namespace alignment".
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request May 31, 2026
`scripts/doc-check.sh` `scan_paths` listed four docs, three of which
were deleted in #215 (PR-F.2, `internal/pipeline/README.md`) and #217
(PR-K.2, `components/receivers/dcgm/RUNBOOK.md` +
`components/receivers/kernelevents/RUNBOOK.md`). The `grep` swallowed
the "No such file" errors via `2>/dev/null`, so the gate appeared to
pass while only `docs/FAILURE-MODES.md` was actually scanned — silent
rot, not a hard CI fail.

Also drops `pkg/` from the `defined` Go-test-symbol grep roots (deleted
in PR-F.2) and adds `module/` (introduced in PR-I.1a / extended here in
PR-I.1b), so the gate keeps full coverage of the post-pivot test corpus.

Side fix: `go.work` header claimed the Go directive matched root +
submodule `go.mod` exactly. `module/go.mod` pins `go 1.22.0` to track
the collector v0.110.0 OCB-distribution baseline; root is `1.26.3`.
Workspace mode only requires `>= every member module's directive`, so
1.26.3 >= 1.22.0 is fine — but the comment claim was wrong. Updated to
acknowledge the delta + name the reason (OCB baseline floor).

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
#224)

## Summary

RFC-0013 §migration **PR-I.1b** — mechanical move of `nccl_fr` receiver
+ safe-pickle parser into the in-repo Go submodule scaffolded by #214.

- `git mv components/receivers/nccl_fr → module/receiver/ncclfrreceiver`
(Go package renamed `ncclfr → ncclfrreceiver` to match the OCB receiver
name; **OCB component type string `nccl_fr` unchanged** — operator
scrape names + dashboards do not regress, per PR-B1 metric-namespace
decision).
- `git mv pkg/nccl/fr_parser → module/pkg/nccl/fr_parser`.
- `builder-config.yaml` uncomments the PR-I.1a placeholder: adds `gomod:
github.com/tracecoreai/tracecore/module v0.1.0` + `import:
github.com/tracecoreai/tracecore/module/receiver/ncclfrreceiver` (split
because the single-`go.mod` submodule puts the receiver one path-segment
below the module root — a per-component `gomod:` would fail since
`module/receiver/ncclfrreceiver/` has no `go.mod` of its own).
- Root-level `replaces: github.com/tracecoreai/tracecore/module =>
../module` (`../module`, not `./module`, because OCB writes the replaces
verbatim into `./_build/go.mod`, one directory deeper than repo root).
- `module/go.mod` pins collector deps to **v0.110.0** + otel to
**v1.30.0** (the OCB-distribution baseline). MVS inside `_build/go.mod`
would otherwise pull forward to v1.59.0 — `scraperhelper` was split out
of `collector/receiver` between those two release lines, which would
break the `hostmetricsreceiver@v0.110.0` build added in PR-E.
- Root `go.mod` adds `replace github.com/tracecoreai/tracecore/module =>
./module` + matching `require` so `go mod tidy` (which ignores
`go.work`) resolves the submodule from the in-repo checkout rather than
the proxy. (Drops the first time a release builds against a published
`module/vX.Y.Z` tag.)
- Importers updated in root: `tools/genfixtures` (+ `-out` default),
`bench/overhead/nccl_fr_bench_test.go`, `scripts/nccl-fr-rce-gate.sh`
(runs `go list -deps` from inside `module/`), `Makefile`
(`generate-fixtures` / `test-extras-fuzz-nccl-fr` / `ci-fuzz-nccl-fr` /
`nccl-fr-rce-gate`), `.github/workflows/nccl-fr-fuzz-nightly.yml`, and
two doc-comment refs in
`components/exporters/stdoutexporter/selftel{,_test}.go`.
- OTel instrumentation scope on `selftel.go` moves to the new Go import
path
(`github.com/tracecoreai/tracecore/module/receiver/ncclfrreceiver`),
OTel convention: scope = Go import path.

**No operator-visible metric/log-data regression** — receiver type
string + metric instrument names unchanged across the OCB wire boundary.
The OTel instrumentation scope name does move (per OTel convention scope
= Go import path; matches bullet 7 + RFC-0013 §Migration + CHANGELOG),
but the scope is a logger/meter attribute, not part of metric instrument
names or the component-type-string operators configure against — so
dashboards and scrape jobs do not regress.

Post-merge: tag `module/v0.1.0 <merge-sha>` (the first version pinned in
`builder-config.yaml`).

## Test plan

Locally run + green on this branch (verified at commit `da44388`;
subsequent commits are an `origin/main` merge + a `doc-check.sh`
`scan_paths` fix-up that prunes three docs deleted by #215/#217 — no
source semantic change):

- [x] `make check` — fmt, golangci-lint, vet, mod-verify (0 issues).
- [x] `make verify` — license-check, generate-fixtures-check,
build-tags, **nccl-fr-rce-gate** (parser depends only on stdlib),
register-lint, actionlint, zizmor, doc-check, no-autoupdate-check.
- [x] `make build` — OCB v0.110.0 builds `./_build/tracecore` with the
submodule receiver wired in.
- [x] `./_build/tracecore components` — confirms `nccl_fr` receiver
registered (alongside hostmetrics, filelog, journald, k8sobjects, otlp,
prometheus).
- [x] `go test ./...` (root module).
- [x] `cd module && go test ./...` (submodule:
`module/pkg/nccl/fr_parser` + `module/receiver/ncclfrreceiver`).
- [x] `make ci-fuzz-nccl-fr` — 30s `FuzzParseFRPickle` on the moved
parser, no crashers.
- [x] `bash scripts/nccl-fr-rce-gate.sh` — parser deps clean (no
`os/exec`, `plugin`, `reflect.Call`, `reflect.MakeFunc`).

CI to confirm:

- [ ] All required checks green on this PR.
- [ ] `nccl-fr-fuzz-nightly` runs against the new path on next nightly
trigger.

Post-merge:

- [ ] Tag `module/v0.1.0 <merge-sha> && git push origin module/v0.1.0`.

## Why this is a root-cause fix, not a workaround

Two non-obvious decisions in this diff — both root-cause, not
workarounds:

1. **`replaces: ../module` (not `./module`) in builder-config.yaml.**
OCB writes the `replaces:` directive verbatim into `./_build/go.mod`.
Paths in `go.mod` resolve relative to that file's directory, which is
one level deeper than repo root. `./module` would fail to resolve;
`../module` resolves correctly to `<repo>/module/`.
2. **`module/go.mod` pins collector v0.110.0, not the root's v1.59.0.**
OCB's MVS reconciles `_build/go.mod`'s require graph against the
submodule's `go.mod`. If the submodule pinned the higher version, MVS
pulls forward inside `_build/go.mod` and `hostmetricsreceiver@v0.110.0`
(added in PR-E to replace `clockreceiver`) fails to build because
`scraperhelper` moved out of `go.opentelemetry.io/collector/receiver`
between v0.110.0 and v1.59.0. The root module independently uses v1.59.0
for its non-OCB code paths; MVS reconciles to the higher version where
root + submodule overlap, which is fine for root.

```release-notes
none
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
…234)

PR-K.2 (#217) deleted the in-tree `clockreceiver`, `dcgm`,
`kernelevents`, `containerstdout` receivers and `stdoutexporter` from
the binary; the chart still carried their values blocks, RBAC, volume
mounts, and policy exemptions. Operators turning any toggle on would get
a render that the OCB binary cannot load. This drops the dead surface —
pre-1.0, no operator-deprecation tax owed (per single-contributor
latitude).

Partial close of #220: chart bits only. The `k8sevents` values key
referenced in the issue is not touched here (the receiver still ships);
the `component-bug-kernelevents.yml` issue template referenced in #220
was already absent on `main`.

## What changed

- `values.yaml`: deletes the 5 retired toggle blocks (`clockreceiver`,
`dcgm`, `kernelevents`, `containerstdout`, `stdoutexporter`); keeps
`pyspy` (in-tree-only pending OTel Profiles GA, per scope).
- `templates/_helpers.tpl`: drops the `containerstdout` chart-only-keys
omit branch in `renderedConfig`; refreshes the retire-soon comment to
present-state language.
- `templates/daemonset.yaml`: drops the conditional
`automountServiceAccountToken`, conditional root `securityContext`,
conditional downward-API env block, and conditional
volumes/volumeMounts. The mainline `restricted`-PSS path is the only
path left.
- `templates/containerstdout-rbac.yaml`: deleted.
- `templates/NOTES.txt`: replaces the multi-toggle WARNING with a
pyspy-only note.
- `policies/conftest/tracecore.rego`: drops `containerstdout_enabled`
exemption from the `runAsNonRoot` / `runAsUser=0` / `runAsGroup=0`
rules, and removes the full M15 operational-invariants block (4 deny
rules + 2 helpers) that no longer have a receiver to gate.
- `README.md`: replaces the dcgm overlay worked example with the
otlphttp pattern; updates the defaults table + lead paragraph from
clockreceiver+stdoutexporter to hostmetrics+debug; rewrites the
kernelevents-OOMKilled troubleshooting note as a generic
high-volume-receiver bullet; trims deviations table.
- `ci/{all-receivers-off,one-receiver-on,pyspy-on}-values.yaml`: drops
`enabled: false` entries for the deleted toggles (no render impact, just
cleanup).

## Verification

Locally, against the worktree:

- `helm lint install/kubernetes/tracecore/` — 0 failures, 0 WARNINGs.
- `helm template` against the default values + all 3 ci fixtures —
renders clean; default `automountServiceAccountToken: false`; `pyspy-on`
still emits the pyspy receiver block; `all-receivers-off` correctly
produces no `receivers:` and no `pipelines:`.
- `conftest` against the rendered default DaemonSet — 39/39 pass. Each
`bad-*.yaml` fixture in `policies/conftest/testdata/` still denies
(12/13 pass with 1 expected deny each — `bad-runasuser-0` /
`bad-runasgroup-0` / `bad-runasroot` now deny without the exemption
escape hatch).
- `good-baseline.yaml` + `good-sys-ptrace.yaml` still pass — 26/26.
- `make check` + `make doc-check` — green.

CI (`chart` workflow): exercises every gate above against an actual
`helm` + `conftest` install, plus the kind-cluster e2e install, plus
`tracecore validate` on the `one-receiver-on` rendered config.

```release-notes
[CHANGE] helm chart: dropped values toggles for clockreceiver, dcgm, kernelevents, containerstdout, and stdoutexporter (retired from the binary in v0.1.0-m9-alpha per RFC-0013 PR-K.2). The chart now ships hostmetrics + debug as the hardware-free default. Operators on prior overlays that set `enabled: false` on these keys must remove those entries on upgrade.
```

Refs: #220

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
Closes #227.

## Summary

`make bench-check` has been a no-op since PR-K.2 (#217) deleted
`components/receivers/k8sevents/` — its `testdata/bench-baseline.txt`
was the sole row in the gate's `for pkg in …` loop, leaving the loop
body literally empty (`for pkg in ; do …`). The target stayed as a stub
so `make ci` automation kept a stable invocation, with a TODO to
re-register packages later. That re-registration is this PR.

## Root cause

`Makefile:71-101` (pre-diff) iterated an empty list. The wrapper around
`benchstat`+`scripts/bench-check.sh` was intact; only the *registry of
which packages to gate* had been emptied. So the fix is purely a
registration + baseline-capture exercise — no logic change to the gate's
comparison machinery, plus one scope-tightening change called out under
"Adversarial-review fix" below.

## What this PR does

1. **Registers two packages** with deterministic-allocation
micro-benchmarks for gating:
- `module/pkg/patterns` — `BenchmarkPodEvictedDetector_1kEventWindow`
(the M19 detector budget). (Path is post-PR-I.2a / #242; this branch was
rebased onto that merge during the PR.)
- `components/receivers/pyspy` — `BenchmarkParseDump` +
`BenchmarkStackID` (the M18 join-key hash + faulthandler parser).

2. **Captures committed baselines** at
`module/pkg/patterns/testdata/bench-baseline.txt` and
`components/receivers/pyspy/testdata/bench-baseline.txt`, both at
`count=10 -benchtime=500ms -benchmem`. `count=10` is the minimum that
survives benchstat's `≥6 samples for confidence interval at level 0.95`
warning with headroom for one outlier.

3. **Extracts the registry to `scripts/bench-registry.sh`** sourced by
both `scripts/bench-check-all.sh` (gate runner, wired to `make
bench-check`) and `scripts/bench-baseline.sh` (regenerator, wired to
`make bench-baseline`). Single source of truth — drift between "what we
gate" and "what we regenerate" can no longer happen.

Why scripts and not an inline Make loop: the bench regex
`^Benchmark(ParseDump|StackID)$` contains parentheses that the
Make-invoked `/bin/sh` tokenises as subshell metacharacters. Quoting
under `make` shellquote escaping is more fragile than just punting to
plain shell.

4. **Restricts the gate to `B/op` + `allocs/op` tables** in
`scripts/bench-check.sh`; `sec/op` is no longer gated. See
"Adversarial-review fix" below.

5. **Wires the gate into CI** as a new step in
`.github/workflows/ci.yml`'s `verify-static` job, after `make build`. A
pinned `go install golang.org/x/perf/cmd/benchstat@…` step installs the
tool on the runner.

## Adversarial-review fix (mid-PR)

The first commit gated sec/op too, with the design note claiming pure
wall-clock jitter would show as benchstat's `~` (non-significant) and
never reach the `+NN%` parser. **That's false.** Repeating `make
bench-check` immediately after a baseline capture, with identical code
on identical hardware, fired:

```
PodEvictedDetector_1kEventWindow-10  sec/op  +16.50% (p=0.001 n=10)
PodEvictedDetector_1kEventWindow-10  B/op    +0.01% (p=0.005 n=10)
PodEvictedDetector_1kEventWindow-10  allocs/op  ~ (p=1.000 n=10)
```

The 16.5% delta was background-load drift between two runs minutes
apart. benchstat marks it significant because the CV is small (~5-10%)
and the means are reliably different — both true symptoms of *real* but
*uninteresting* wall-clock variance. Gating on sec/op would produce
continuous false fires across the team without catching anything code
review can act on.

**Fix**: `scripts/bench-check.sh` now tracks the active benchstat table
by indented header line (`sec/op` / `B/op` / `allocs/op`) and only
counts `+NN%` deltas under `B/op` + `allocs/op`. Both stay pinned to 0%
CV across runs (deterministic Go allocator + identical bytes-allocated
per code path) and only move when the code does. `make bench-check` on
clean main now passes consistently across runs; the TDD planted
regression (heap-allocating sink in `stackID`) still trips the gate at
`+800% B/op` + `+133% allocs/op`.

## Design notes

- **Why these two benches and not
bench/overhead/nccl_fr_bench_test.go**: that benchmark replays 1 GiB of
fixtures per iter (~90s on M1 Max) and *already* self-asserts on
`HeapAlloc` delta against the NORTHSTARS O2 ceiling. Adding it to
`bench-check` would push CI past the budget and produce nothing the
in-bench assertion doesn't already gate. Stays advisory.
- **Hardware skew is irrelevant for B/op + allocs/op**: baselines were
captured on Apple M1 Max; CI runs on `ubuntu-latest` (x86_64). The Go
allocator is deterministic across GOARCH for a given code path —
`b.ReportAllocs()` returns identical numbers on both. The historical
k8sevents baseline lived under the same cross-arch arrangement.
- **`make ci` unchanged**: `bench-check` is NOT added to the `make ci`
recipe. That would force every contributor to install benchstat. CI
installs benchstat itself in the workflow; local dev runs `make
bench-check` only when intentionally vetting a perf change.
- **`module/` submodule resolution**: `module/pkg/patterns` lives in the
in-repo Go submodule (`module/go.mod`); local dev runs resolve through
`go.work` and OCB through `builder-config.yaml`'s `replaces: ./module`.
`go test ./module/pkg/patterns/…` works from the repo root because of
the workspace.

## TDD record

- **Define gate**: scripts/bench-check-all.sh + bench-registry.sh
register two packages; threshold 10%, env-overridable; only B/op +
allocs/op rows count.
- **Red (no plant)**: `make bench-check` on clean main passes — exit 0,
both packages report `PASS: no benchmarks regressed by more than 10% vs
baseline.` (Re-run multiple times across different machine load — stays
green.)
- **Red→Green (planted regression)**: edited `stackID` to allocate 64
B/iter via a package-level sink (escape-analysis-routed to the heap, so
allocs/op actually moved); `make bench-check` exit 2 (make wraps the
inner exit 1). Output flagged `StackID-10 +800.00% (p=0.000 n=10)` on
B/op and `+133.33% (p=0.000 n=10)` on allocs/op.
- **Green (revert)**: plant removed; `make bench-check` exit 0; `git
diff main -- components/receivers/pyspy/stackid.go` empty.

## Release notes

```release-notes
### Added
- `make bench-check` is now a real perf-regression gate again (was a no-op since PR-K.2). Registry-driven via `scripts/bench-registry.sh` (single source of truth, sourced by both `bench-check-all.sh` runner and `bench-baseline.sh` regenerator); the runner gates `B/op` + `allocs/op` deltas against a committed `testdata/bench-baseline.txt` via benchstat; any row regressing past `THRESHOLD%` (default 10, env-overridable) fails. `sec/op` is deliberately not gated (wall-clock CV on dev hardware and shared CI runners routinely crosses 10% from background load alone, even when benchstat marks the delta significant). Two packages registered: `module/pkg/patterns` (M19 PodEvictedDetector budget) and `components/receivers/pyspy` (M18 ParseDump + StackID hash). Regenerate baselines with `make bench-baseline` after a vetted, intentional perf change and commit the diff.
- CI `verify-static` job now runs `make bench-check` after installing a pinned benchstat. Local dev only needs benchstat when intentionally running the perf gate.
```

## Test plan

- [x] `make check` (fmt, tidy-check, lint, vet, mod-verify) — clean
(re-run post-adversarial-fix)
- [x] `make actionlint` — clean
- [x] `make zizmor` — clean (`No findings to report. Good job!`)
- [x] `make license-check` — clean
- [x] `make doc-check` — clean
- [x] `make bench-check` on clean main — exit 0, both packages PASS,
repeated runs stay green (false-positive bug fixed)
- [x] `make bench-check` with deliberate alloc-regression planted in
`stackID` — exit 2, gate flags `+800% B/op` + `+133% allocs/op`
- [x] `make bench-check` after revert — exit 0 again, stackid.go diff vs
main is empty
- [ ] CI `verify-static` runs `make bench-check` green on this PR
(awaiting CI)

## Self-grading

**A** — root cause named, fix scoped to missing registration without
expanding scope (no new benchmark framework, no rewrite of the
benchstat-comparison machinery beyond the sec/op-vs-alloc scoping fix).
TDD cycle complete with a deliberate regression that the gate catches.
Adversarial self-review found a real false-positive bug (sec/op gating)
and fixed it in-PR before merge. Registry extracted to a single source
of truth so the two consumers can't drift. CHANGELOG entry written.

Not A+ because: even with sec/op excluded, the gate still depends on
benchstat's text-output format remaining stable; a future major version
that changes column layout could silently disable the gate (it would
skip every row, exit 0, print PASS). A future iteration could parse
benchstat's `-format=csv` output instead, which is its versioned
machine-readable interface. Out of scope here; the current text format
hasn't changed for years and matches the existing
`scripts/bench-check.sh` shape that already shipped with the k8sevents
baseline.

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant