feat(pivot): PR-K.2 — delete 4 in-tree receivers + xidgen#217
Merged
Conversation
added 2 commits
May 31, 2026 02:35
Delete `components/receivers/{clockreceiver,kernelevents,k8sevents,containerstdout}/`
in favor of the PR-J upstream-OTel recipes (hostmetricsreceiver,
journaldreceiver+filelogreceiver+OTTL Xid transform, k8sobjectsreceiver+OTTL
hint transform, filelogreceiver+container stanza+file_storage). Delete
`tools/failure-inject/xidgen/` (sole consumer was the kernelevents wire
shape per RFC-0013 §migration L253) and strip the `xid` subcommand +
golden SHAs + chaos-workflow determinism steps that exercised it.
Delete the `containerstdout-on-values.yaml` chart fixture (chart values
keys + DaemonSet template refs survive to PR-K.3's chart-cleanup cycle).
Delete `.github/ISSUE_TEMPLATE/component-bug-kernelevents.yml`.
Migrate 6 _test.go string-literal placeholder references (`clockreceiver/primary`
→ `hostmetrics/primary`) in `internal/pipeline/saferun_test.go`,
`internal/config/fuzz_test.go`, and `internal/pipelinebuilder/fuzz_test.go`.
Pure rename; no contract change. Update comment-only references in
`components/exporters/{otlphttp,stdoutexporter}/`,
`components/receivers/{nccl_fr,pyspy}/`,
`internal/synthesis/patterns/{doc,model,verdict}.go`, and
`internal/runtime/lifecycle/lifecycle.go`. Retire the kernelevents-only
Makefile targets (`test-extras-sustained`, `test-extras-fuzz-{kmsg,journald}`,
the kernelevents row in `test-extras-race`); empty the `bench-check` for-loop
(k8sevents was the only per-package baseline). go.mod sheds the k8s.io/*
+ sigs.k8s.io/* + gopkg.in/* dep cluster (28 lines off go.mod, 55 off go.sum).
Sweep CHANGELOG (new PR-K.2 entry under Unreleased + wave-4 paragraph
updated to reflect the three previously-open ports having landed),
MILESTONES (M1, M9, M10, M15 status lines flipped from "DELETED at
v0.2.0" → "DELETED in PR-K.2" with file pointers to docs/integrations/
recipes), docs/migration/v0.1-to-v0.2.md (PR-K.1 + PR-K.2 checkboxes
flipped, PR-F.2 gate language updated, PR-I.2 gate satisfied),
docs/README.md (per-component-docs table drops deleted-receiver rows),
AGENTS.md (queued-for-deletion paragraph updated to current reality),
bench/install/README.md (clockreceiver tick-alias note updated),
tools/failure-inject/README.md (xid section removed).
Hard pre-flight: zero external Go importers for any deletion target.
`make ci`, `make verify`, `make build` (OCB compile), and
`tracecore validate` against every `docs/integrations/examples/*.yaml`
all green.
Net: 196 files changed, 28,943 deletions, 131 insertions.
Refs RFC-0013 §migration PR-K.2.
Signed-off-by: Tri Lam <tri@maydow.com>
# Conflicts: # MILESTONES.md # components/receivers/kernelevents/README.md # docs/migration/v0.1-to-v0.2.md # internal/config/fuzz_test.go # internal/pipeline/saferun_test.go # internal/pipelinebuilder/fuzz_test.go # internal/runtime/lifecycle/lifecycle.go
This was referenced May 31, 2026
trilamsr
pushed a commit
that referenced
this pull request
May 31, 2026
PR-K.2 (#217) and PR-F.2 (#215) deleted clockreceiver, containerstdout, k8sevents, kernelevents from the binary; the namespace-alignment section in v0.1-to-v0.2.md was authored before those merges landed and still listed all eight. Trim the per-component substitution table to the four surviving in-tree components (nccl_fr, pyspy, otlphttp, stdoutexporter) and point operators at the existing "Orphan in-tree components" table for the deleted-receiver migration path. Refresh the PromQL example to use otlphttp (still present) instead of containerstdout (deleted). Signed-off-by: Tri Lam <tri@maydow.com>
6 tasks
trilamsr
added a commit
that referenced
this pull request
May 31, 2026
…216) ## Summary The four surviving in-tree components (`nccl_fr`, `pyspy`, `otlphttp`, `stdoutexporter`) each emit self-telemetry through their own per-component MeterProvider. This PR renames their instrument names from the v0.1.x `tracecore.*` family to the upstream `otelcol_<role>_<component>_<metric>` convention so the in-tree namespace does not collide with the OCB pipeline-runtime's own `otelcol_*` family. Per [RFC-0013 §migration v0.1.0](docs/rfcs/0013-distro-first-pivot.md) row 119: *"self-tel metric rename `tracecore.*` → `otelcol_*`"*. **Scope reduced post-PR-K.2 / PR-F.2.** The four legacy in-tree receivers originally listed for rename (`clockreceiver`, `containerstdout`, `k8sevents`, `kernelevents`) plus the in-tree boot-path internals were deleted from the binary in #217 and #215 while this PR was in CI. The namespace-alignment edits to those files were dropped during the post-#217 merge; the surviving net diff covers only the four components still in the binary. ## Rename matrix OTel-dot form (Prometheus scrape renders dots as underscores): | v0.1.x instrument | post-rename instrument | |---|---| | `tracecore.receiver.errors_total` | `otelcol.receiver.<name>.errors_total` | | `tracecore.receiver.emissions_total` | `otelcol.receiver.<name>.emissions_total` | | `tracecore.receiver.collection_latency_seconds` | `otelcol.receiver.<name>.collection_latency_seconds` | | `tracecore.receiver.degraded_seconds_total` | `otelcol.receiver.<name>.degraded_seconds_total` | | `tracecore.receiver.last_activity_unix_seconds` | `otelcol.receiver.<name>.last_activity_unix_seconds` | | `tracecore.exporter.calls_total` | `otelcol.exporter.<name>.calls_total` | | `tracecore.selftelemetry.init_errors_total` | `otelcol.selftelemetry.init_errors_total` | Where `<name>` is the OCB component name without underscores. Per-component substitutions: | Component | `<name>` | |---|---| | `components/receivers/nccl_fr` | `ncclfr` (underscore stripped per RFC) | | `components/receivers/pyspy` | `pyspy` | | `components/exporters/otlphttp` | `otlphttp` | | `components/exporters/stdoutexporter` | `stdoutexporter` | **Label shape is preserved.** `component_id` still partitions per-instance; `kind` / `result` values are unchanged. Dashboards and alerts that filtered on labels need only the metric-name rename, not a label-selector rewrite. ## What changed - **Four `selftel.go` files** (otlphttp, stdoutexporter, nccl_fr, pyspy) — instrument literals updated; header comments rewritten to document the new convention + label-preservation invariant. - **Four `selftel_test.go` files** — `findInstrument(..., "...")` and `scopeOf(..., "...")` assertions updated; `receiverInstrumentPrefix` / `exporterInstrumentPrefix` constants (used by the `failingReceiverMP` / `failingExporterMP` synthetic-failure seams) bumped to the new prefix so a future drift back to `tracecore.*` would fail compile-time on the prefix mismatch. - **`docs/examples/prometheus-alerts.example.yaml`** — rewritten as receiver-agnostic starter using regex matchers (`{__name__=~"otelcol_receiver_.*_errors_total"}`) so a new in-tree receiver inherits coverage on first scrape. Removed the `tracecore_exporter_failure_rate` / `tracecore_build_info` rules — those v0.1.x gauges no longer exist post-RFC-0013. - **`docs/migration/v0.1-to-v0.2.md`** — new "In-tree receiver / exporter namespace alignment (RFC-0013 v0.1.0)" section listing only the four surviving in-tree components; points operators at the existing "Orphan in-tree components" table for deleted-receiver migration. - **`NOTE on ExporterCarrier removal`** blocks in otlphttp.go + stdoutexporter.go collapsed — they referenced `cmd/tracecore.collect.collectFailureRateReaders` (deleted in PR-A2) and `internal/selftelemetry` (deleted in PR-F.1). New comments point at the PromQL recipe `rate(otelcol_exporter_<name>_calls_total{result="error"}[5m])`. - **CHANGELOG** — `[Unreleased]` entry added documenting the alignment. ## Test plan - [x] `go vet ./...` clean (root + module/ submodule) - [x] `go test ./...` all green (root module; module/ submodule has no test files) - [x] `make check` (fmt + tidy-check + lint + vet + mod-verify) green - [x] `make doc-check` green - [x] `TestSelfTelemetry_*` assertions pass for every surviving package - [x] `TestFactory_FallsBackToNoopWhenMeterFails` synthetic-register-failure seam still trips on the renamed prefix (`failingReceiverMP` / `failingExporterMP`) ## Merge resolution (post-#215, #217) Conflicts surfaced as "deleted by them" across 20 files in the four deleted receivers. Resolution: accept all upstream deletions; the namespace-alignment edits to those files were moot because the files themselves no longer exist on main. No content edits required to the surviving four components from the merge — their namespace-alignment commits remain intact. ```release-notes **Breaking (pre-1.0)**: in-tree component self-telemetry metric names renamed from `tracecore.*` to `otelcol.<role>.<name>.*` per RFC-0013 §migration v0.1.0 namespace alignment. Affects the four surviving in-tree components (`nccl_fr`, `pyspy`, `otlphttp`, `stdoutexporter`); the four legacy in-tree receivers (`clockreceiver`, `containerstdout`, `k8sevents`, `kernelevents`) were already deleted from the binary in #215 / #217 and migrate via the upstream-receiver replacements documented in the "Orphan in-tree components" table in `docs/migration/v0.1-to-v0.2.md`. Label shape preserved (`component_id`, `kind`, `result` unchanged). Operators with dashboards / alerts on the v0.1.x names rename per the table in `docs/migration/v0.1-to-v0.2.md` under "In-tree receiver / exporter namespace alignment". ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
11 tasks
trilamsr
pushed a commit
that referenced
this pull request
May 31, 2026
`scripts/doc-check.sh` `scan_paths` listed four docs, three of which were deleted in #215 (PR-F.2, `internal/pipeline/README.md`) and #217 (PR-K.2, `components/receivers/dcgm/RUNBOOK.md` + `components/receivers/kernelevents/RUNBOOK.md`). The `grep` swallowed the "No such file" errors via `2>/dev/null`, so the gate appeared to pass while only `docs/FAILURE-MODES.md` was actually scanned — silent rot, not a hard CI fail. Also drops `pkg/` from the `defined` Go-test-symbol grep roots (deleted in PR-F.2) and adds `module/` (introduced in PR-I.1a / extended here in PR-I.1b), so the gate keeps full coverage of the post-pivot test corpus. Side fix: `go.work` header claimed the Go directive matched root + submodule `go.mod` exactly. `module/go.mod` pins `go 1.22.0` to track the collector v0.110.0 OCB-distribution baseline; root is `1.26.3`. Workspace mode only requires `>= every member module's directive`, so 1.26.3 >= 1.22.0 is fine — but the comment claim was wrong. Updated to acknowledge the delta + name the reason (OCB baseline floor). Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
May 31, 2026
#224) ## Summary RFC-0013 §migration **PR-I.1b** — mechanical move of `nccl_fr` receiver + safe-pickle parser into the in-repo Go submodule scaffolded by #214. - `git mv components/receivers/nccl_fr → module/receiver/ncclfrreceiver` (Go package renamed `ncclfr → ncclfrreceiver` to match the OCB receiver name; **OCB component type string `nccl_fr` unchanged** — operator scrape names + dashboards do not regress, per PR-B1 metric-namespace decision). - `git mv pkg/nccl/fr_parser → module/pkg/nccl/fr_parser`. - `builder-config.yaml` uncomments the PR-I.1a placeholder: adds `gomod: github.com/tracecoreai/tracecore/module v0.1.0` + `import: github.com/tracecoreai/tracecore/module/receiver/ncclfrreceiver` (split because the single-`go.mod` submodule puts the receiver one path-segment below the module root — a per-component `gomod:` would fail since `module/receiver/ncclfrreceiver/` has no `go.mod` of its own). - Root-level `replaces: github.com/tracecoreai/tracecore/module => ../module` (`../module`, not `./module`, because OCB writes the replaces verbatim into `./_build/go.mod`, one directory deeper than repo root). - `module/go.mod` pins collector deps to **v0.110.0** + otel to **v1.30.0** (the OCB-distribution baseline). MVS inside `_build/go.mod` would otherwise pull forward to v1.59.0 — `scraperhelper` was split out of `collector/receiver` between those two release lines, which would break the `hostmetricsreceiver@v0.110.0` build added in PR-E. - Root `go.mod` adds `replace github.com/tracecoreai/tracecore/module => ./module` + matching `require` so `go mod tidy` (which ignores `go.work`) resolves the submodule from the in-repo checkout rather than the proxy. (Drops the first time a release builds against a published `module/vX.Y.Z` tag.) - Importers updated in root: `tools/genfixtures` (+ `-out` default), `bench/overhead/nccl_fr_bench_test.go`, `scripts/nccl-fr-rce-gate.sh` (runs `go list -deps` from inside `module/`), `Makefile` (`generate-fixtures` / `test-extras-fuzz-nccl-fr` / `ci-fuzz-nccl-fr` / `nccl-fr-rce-gate`), `.github/workflows/nccl-fr-fuzz-nightly.yml`, and two doc-comment refs in `components/exporters/stdoutexporter/selftel{,_test}.go`. - OTel instrumentation scope on `selftel.go` moves to the new Go import path (`github.com/tracecoreai/tracecore/module/receiver/ncclfrreceiver`), OTel convention: scope = Go import path. **No operator-visible metric/log-data regression** — receiver type string + metric instrument names unchanged across the OCB wire boundary. The OTel instrumentation scope name does move (per OTel convention scope = Go import path; matches bullet 7 + RFC-0013 §Migration + CHANGELOG), but the scope is a logger/meter attribute, not part of metric instrument names or the component-type-string operators configure against — so dashboards and scrape jobs do not regress. Post-merge: tag `module/v0.1.0 <merge-sha>` (the first version pinned in `builder-config.yaml`). ## Test plan Locally run + green on this branch (verified at commit `da44388`; subsequent commits are an `origin/main` merge + a `doc-check.sh` `scan_paths` fix-up that prunes three docs deleted by #215/#217 — no source semantic change): - [x] `make check` — fmt, golangci-lint, vet, mod-verify (0 issues). - [x] `make verify` — license-check, generate-fixtures-check, build-tags, **nccl-fr-rce-gate** (parser depends only on stdlib), register-lint, actionlint, zizmor, doc-check, no-autoupdate-check. - [x] `make build` — OCB v0.110.0 builds `./_build/tracecore` with the submodule receiver wired in. - [x] `./_build/tracecore components` — confirms `nccl_fr` receiver registered (alongside hostmetrics, filelog, journald, k8sobjects, otlp, prometheus). - [x] `go test ./...` (root module). - [x] `cd module && go test ./...` (submodule: `module/pkg/nccl/fr_parser` + `module/receiver/ncclfrreceiver`). - [x] `make ci-fuzz-nccl-fr` — 30s `FuzzParseFRPickle` on the moved parser, no crashers. - [x] `bash scripts/nccl-fr-rce-gate.sh` — parser deps clean (no `os/exec`, `plugin`, `reflect.Call`, `reflect.MakeFunc`). CI to confirm: - [ ] All required checks green on this PR. - [ ] `nccl-fr-fuzz-nightly` runs against the new path on next nightly trigger. Post-merge: - [ ] Tag `module/v0.1.0 <merge-sha> && git push origin module/v0.1.0`. ## Why this is a root-cause fix, not a workaround Two non-obvious decisions in this diff — both root-cause, not workarounds: 1. **`replaces: ../module` (not `./module`) in builder-config.yaml.** OCB writes the `replaces:` directive verbatim into `./_build/go.mod`. Paths in `go.mod` resolve relative to that file's directory, which is one level deeper than repo root. `./module` would fail to resolve; `../module` resolves correctly to `<repo>/module/`. 2. **`module/go.mod` pins collector v0.110.0, not the root's v1.59.0.** OCB's MVS reconciles `_build/go.mod`'s require graph against the submodule's `go.mod`. If the submodule pinned the higher version, MVS pulls forward inside `_build/go.mod` and `hostmetricsreceiver@v0.110.0` (added in PR-E to replace `clockreceiver`) fails to build because `scraperhelper` moved out of `go.opentelemetry.io/collector/receiver` between v0.110.0 and v1.59.0. The root module independently uses v1.59.0 for its non-OCB code paths; MVS reconciles to the higher version where root + submodule overlap, which is fine for root. ```release-notes none ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
May 31, 2026
…234) PR-K.2 (#217) deleted the in-tree `clockreceiver`, `dcgm`, `kernelevents`, `containerstdout` receivers and `stdoutexporter` from the binary; the chart still carried their values blocks, RBAC, volume mounts, and policy exemptions. Operators turning any toggle on would get a render that the OCB binary cannot load. This drops the dead surface — pre-1.0, no operator-deprecation tax owed (per single-contributor latitude). Partial close of #220: chart bits only. The `k8sevents` values key referenced in the issue is not touched here (the receiver still ships); the `component-bug-kernelevents.yml` issue template referenced in #220 was already absent on `main`. ## What changed - `values.yaml`: deletes the 5 retired toggle blocks (`clockreceiver`, `dcgm`, `kernelevents`, `containerstdout`, `stdoutexporter`); keeps `pyspy` (in-tree-only pending OTel Profiles GA, per scope). - `templates/_helpers.tpl`: drops the `containerstdout` chart-only-keys omit branch in `renderedConfig`; refreshes the retire-soon comment to present-state language. - `templates/daemonset.yaml`: drops the conditional `automountServiceAccountToken`, conditional root `securityContext`, conditional downward-API env block, and conditional volumes/volumeMounts. The mainline `restricted`-PSS path is the only path left. - `templates/containerstdout-rbac.yaml`: deleted. - `templates/NOTES.txt`: replaces the multi-toggle WARNING with a pyspy-only note. - `policies/conftest/tracecore.rego`: drops `containerstdout_enabled` exemption from the `runAsNonRoot` / `runAsUser=0` / `runAsGroup=0` rules, and removes the full M15 operational-invariants block (4 deny rules + 2 helpers) that no longer have a receiver to gate. - `README.md`: replaces the dcgm overlay worked example with the otlphttp pattern; updates the defaults table + lead paragraph from clockreceiver+stdoutexporter to hostmetrics+debug; rewrites the kernelevents-OOMKilled troubleshooting note as a generic high-volume-receiver bullet; trims deviations table. - `ci/{all-receivers-off,one-receiver-on,pyspy-on}-values.yaml`: drops `enabled: false` entries for the deleted toggles (no render impact, just cleanup). ## Verification Locally, against the worktree: - `helm lint install/kubernetes/tracecore/` — 0 failures, 0 WARNINGs. - `helm template` against the default values + all 3 ci fixtures — renders clean; default `automountServiceAccountToken: false`; `pyspy-on` still emits the pyspy receiver block; `all-receivers-off` correctly produces no `receivers:` and no `pipelines:`. - `conftest` against the rendered default DaemonSet — 39/39 pass. Each `bad-*.yaml` fixture in `policies/conftest/testdata/` still denies (12/13 pass with 1 expected deny each — `bad-runasuser-0` / `bad-runasgroup-0` / `bad-runasroot` now deny without the exemption escape hatch). - `good-baseline.yaml` + `good-sys-ptrace.yaml` still pass — 26/26. - `make check` + `make doc-check` — green. CI (`chart` workflow): exercises every gate above against an actual `helm` + `conftest` install, plus the kind-cluster e2e install, plus `tracecore validate` on the `one-receiver-on` rendered config. ```release-notes [CHANGE] helm chart: dropped values toggles for clockreceiver, dcgm, kernelevents, containerstdout, and stdoutexporter (retired from the binary in v0.1.0-m9-alpha per RFC-0013 PR-K.2). The chart now ships hostmetrics + debug as the hardware-free default. Operators on prior overlays that set `enabled: false` on these keys must remove those entries on upgrade. ``` Refs: #220 Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced May 31, 2026
trilamsr
added a commit
that referenced
this pull request
May 31, 2026
Closes #227. ## Summary `make bench-check` has been a no-op since PR-K.2 (#217) deleted `components/receivers/k8sevents/` — its `testdata/bench-baseline.txt` was the sole row in the gate's `for pkg in …` loop, leaving the loop body literally empty (`for pkg in ; do …`). The target stayed as a stub so `make ci` automation kept a stable invocation, with a TODO to re-register packages later. That re-registration is this PR. ## Root cause `Makefile:71-101` (pre-diff) iterated an empty list. The wrapper around `benchstat`+`scripts/bench-check.sh` was intact; only the *registry of which packages to gate* had been emptied. So the fix is purely a registration + baseline-capture exercise — no logic change to the gate's comparison machinery, plus one scope-tightening change called out under "Adversarial-review fix" below. ## What this PR does 1. **Registers two packages** with deterministic-allocation micro-benchmarks for gating: - `module/pkg/patterns` — `BenchmarkPodEvictedDetector_1kEventWindow` (the M19 detector budget). (Path is post-PR-I.2a / #242; this branch was rebased onto that merge during the PR.) - `components/receivers/pyspy` — `BenchmarkParseDump` + `BenchmarkStackID` (the M18 join-key hash + faulthandler parser). 2. **Captures committed baselines** at `module/pkg/patterns/testdata/bench-baseline.txt` and `components/receivers/pyspy/testdata/bench-baseline.txt`, both at `count=10 -benchtime=500ms -benchmem`. `count=10` is the minimum that survives benchstat's `≥6 samples for confidence interval at level 0.95` warning with headroom for one outlier. 3. **Extracts the registry to `scripts/bench-registry.sh`** sourced by both `scripts/bench-check-all.sh` (gate runner, wired to `make bench-check`) and `scripts/bench-baseline.sh` (regenerator, wired to `make bench-baseline`). Single source of truth — drift between "what we gate" and "what we regenerate" can no longer happen. Why scripts and not an inline Make loop: the bench regex `^Benchmark(ParseDump|StackID)$` contains parentheses that the Make-invoked `/bin/sh` tokenises as subshell metacharacters. Quoting under `make` shellquote escaping is more fragile than just punting to plain shell. 4. **Restricts the gate to `B/op` + `allocs/op` tables** in `scripts/bench-check.sh`; `sec/op` is no longer gated. See "Adversarial-review fix" below. 5. **Wires the gate into CI** as a new step in `.github/workflows/ci.yml`'s `verify-static` job, after `make build`. A pinned `go install golang.org/x/perf/cmd/benchstat@…` step installs the tool on the runner. ## Adversarial-review fix (mid-PR) The first commit gated sec/op too, with the design note claiming pure wall-clock jitter would show as benchstat's `~` (non-significant) and never reach the `+NN%` parser. **That's false.** Repeating `make bench-check` immediately after a baseline capture, with identical code on identical hardware, fired: ``` PodEvictedDetector_1kEventWindow-10 sec/op +16.50% (p=0.001 n=10) PodEvictedDetector_1kEventWindow-10 B/op +0.01% (p=0.005 n=10) PodEvictedDetector_1kEventWindow-10 allocs/op ~ (p=1.000 n=10) ``` The 16.5% delta was background-load drift between two runs minutes apart. benchstat marks it significant because the CV is small (~5-10%) and the means are reliably different — both true symptoms of *real* but *uninteresting* wall-clock variance. Gating on sec/op would produce continuous false fires across the team without catching anything code review can act on. **Fix**: `scripts/bench-check.sh` now tracks the active benchstat table by indented header line (`sec/op` / `B/op` / `allocs/op`) and only counts `+NN%` deltas under `B/op` + `allocs/op`. Both stay pinned to 0% CV across runs (deterministic Go allocator + identical bytes-allocated per code path) and only move when the code does. `make bench-check` on clean main now passes consistently across runs; the TDD planted regression (heap-allocating sink in `stackID`) still trips the gate at `+800% B/op` + `+133% allocs/op`. ## Design notes - **Why these two benches and not bench/overhead/nccl_fr_bench_test.go**: that benchmark replays 1 GiB of fixtures per iter (~90s on M1 Max) and *already* self-asserts on `HeapAlloc` delta against the NORTHSTARS O2 ceiling. Adding it to `bench-check` would push CI past the budget and produce nothing the in-bench assertion doesn't already gate. Stays advisory. - **Hardware skew is irrelevant for B/op + allocs/op**: baselines were captured on Apple M1 Max; CI runs on `ubuntu-latest` (x86_64). The Go allocator is deterministic across GOARCH for a given code path — `b.ReportAllocs()` returns identical numbers on both. The historical k8sevents baseline lived under the same cross-arch arrangement. - **`make ci` unchanged**: `bench-check` is NOT added to the `make ci` recipe. That would force every contributor to install benchstat. CI installs benchstat itself in the workflow; local dev runs `make bench-check` only when intentionally vetting a perf change. - **`module/` submodule resolution**: `module/pkg/patterns` lives in the in-repo Go submodule (`module/go.mod`); local dev runs resolve through `go.work` and OCB through `builder-config.yaml`'s `replaces: ./module`. `go test ./module/pkg/patterns/…` works from the repo root because of the workspace. ## TDD record - **Define gate**: scripts/bench-check-all.sh + bench-registry.sh register two packages; threshold 10%, env-overridable; only B/op + allocs/op rows count. - **Red (no plant)**: `make bench-check` on clean main passes — exit 0, both packages report `PASS: no benchmarks regressed by more than 10% vs baseline.` (Re-run multiple times across different machine load — stays green.) - **Red→Green (planted regression)**: edited `stackID` to allocate 64 B/iter via a package-level sink (escape-analysis-routed to the heap, so allocs/op actually moved); `make bench-check` exit 2 (make wraps the inner exit 1). Output flagged `StackID-10 +800.00% (p=0.000 n=10)` on B/op and `+133.33% (p=0.000 n=10)` on allocs/op. - **Green (revert)**: plant removed; `make bench-check` exit 0; `git diff main -- components/receivers/pyspy/stackid.go` empty. ## Release notes ```release-notes ### Added - `make bench-check` is now a real perf-regression gate again (was a no-op since PR-K.2). Registry-driven via `scripts/bench-registry.sh` (single source of truth, sourced by both `bench-check-all.sh` runner and `bench-baseline.sh` regenerator); the runner gates `B/op` + `allocs/op` deltas against a committed `testdata/bench-baseline.txt` via benchstat; any row regressing past `THRESHOLD%` (default 10, env-overridable) fails. `sec/op` is deliberately not gated (wall-clock CV on dev hardware and shared CI runners routinely crosses 10% from background load alone, even when benchstat marks the delta significant). Two packages registered: `module/pkg/patterns` (M19 PodEvictedDetector budget) and `components/receivers/pyspy` (M18 ParseDump + StackID hash). Regenerate baselines with `make bench-baseline` after a vetted, intentional perf change and commit the diff. - CI `verify-static` job now runs `make bench-check` after installing a pinned benchstat. Local dev only needs benchstat when intentionally running the perf gate. ``` ## Test plan - [x] `make check` (fmt, tidy-check, lint, vet, mod-verify) — clean (re-run post-adversarial-fix) - [x] `make actionlint` — clean - [x] `make zizmor` — clean (`No findings to report. Good job!`) - [x] `make license-check` — clean - [x] `make doc-check` — clean - [x] `make bench-check` on clean main — exit 0, both packages PASS, repeated runs stay green (false-positive bug fixed) - [x] `make bench-check` with deliberate alloc-regression planted in `stackID` — exit 2, gate flags `+800% B/op` + `+133% allocs/op` - [x] `make bench-check` after revert — exit 0 again, stackid.go diff vs main is empty - [ ] CI `verify-static` runs `make bench-check` green on this PR (awaiting CI) ## Self-grading **A** — root cause named, fix scoped to missing registration without expanding scope (no new benchmark framework, no rewrite of the benchstat-comparison machinery beyond the sec/op-vs-alloc scoping fix). TDD cycle complete with a deliberate regression that the gate catches. Adversarial self-review found a real false-positive bug (sec/op gating) and fixed it in-PR before merge. Registry extracted to a single source of truth so the two consumers can't drift. CHANGELOG entry written. Not A+ because: even with sec/op excluded, the gate still depends on benchstat's text-output format remaining stable; a future major version that changes column layout could silently disable the gate (it would skip every row, exit 0, print PASS). A future iteration could parse benchstat's `-format=csv` output instead, which is its versioned machine-readable interface. Out of scope here; the current text format hasn't changed for years and matches the existing `scripts/bench-check.sh` shape that already shipped with the k8sevents baseline. --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
RFC-0013 §migration PR-K.2: delete the four in-tree receivers
(
clockreceiver,kernelevents,k8sevents,containerstdout)plus the
xidgenfailure-injector (whose sole consumer waskernelevents's wire shape per RFC §migration L253). PR-K.1(#211) just severed
internal/synthesis/patterns/+replayfromk8sevents, unblocking the source-tree cut. PR-J (#195) shippedthe four upstream-OTel recipes that replace the in-tree receivers
in the bundled Helm-chart pipeline; this PR retires the
behind-the-curtain code.
Chart cleanup (values keys, DaemonSet template refs,
NOTES.txtdeprecation warnings) intentionally stays in PR-K.3 so operators
get one minor of deprecation telemetry before the values shape
breaks.
What lands
Deletions — receivers (4)
components/receivers/clockreceiver/hostmetricsreceiver(loadscraper @ 1s) — PR-E landed in #180. RFC-0013 §migration originally namedtelemetrygeneratorreceiverbut that receiver does not exist in opentelemetry-collector-contrib (contrib #41687 + #43657 both closednot_planned).components/receivers/kernelevents/journaldreceiver+filelogreceiver(kmsg) + OTTL Xid transform. Customer-stablekernelevents.xid+gpu.idattributes preserved via the OTTL transform per RFC-0013 §3.components/receivers/k8sevents/k8sobjectsreceiver(watch mode onevents) + OTTLk8s.event.hinttransform. 11-entry hint enum preserved via the OTTL transform per RFC-0013 §3. The typedinternal/synthesis/patterns/Record+NodeRecord(severed in PR-K.1) keeps M19's pod-evicted detector pinned.components/receivers/containerstdout/filelogreceiver+ container stanza +file_storageextension. Per-rank attribution + dataloader-timing extraction move to OTTL transforms in the bundled recipe per RFC-0013 §3.Deletions — supporting infra
tools/failure-inject/xidgen/(2 files, 281 LOC)/dev/kmsg(sudo tee) orsystemd-catagainst the journald recipe.install/kubernetes/tracecore/ci/containerstdout-on-values.yaml(41 LOC)all-receivers-off-values.yaml,one-receiver-on-values.yaml,pyspy-on-values.yaml) untouched; theirclockreceiver: enabled: false/kernelevents: enabled: falserows survive to PR-K.3's chart cleanup..github/ISSUE_TEMPLATE/component-bug-kernelevents.ymlfailure-inject CLI surface
failure-inject xid --code=N [--format=…] [--count=N]removed.failure-inject {pod-evict,nccl-hang,cpu-steal}unchanged.tools/failure-inject/testdata/golden.sha256drops the twoxidgolden rows.
.github/workflows/chaos.ymldrops the twotwo-run-determinism steps that exercised the xid byte-determinism
contract; pod-evict's determinism + the golden-SHA replay loop
survive.
Tooling shed
Makefile: retiretest-extras-sustainedbody (was kernelevents-only; now
@true— target retained so downstream automation hasa stable name; future sustained-load suites slot back in).
Retire
test-extras-fuzz-kmsg/test-extras-fuzz-journald;test-extras-fuzzloop drops to nccl-fr only. Drop thekernelevents row from
test-extras-race. Empty thebench-checkfor-loop (k8sevents was the only baseline; PR-F.2 already
rewrote the comment block to reflect this — the merge keeps
both edits aligned).
go.mod: shedsk8s.io/{api,apimachinery,client-go,klog/v2,kube-openapi,utils},sigs.k8s.io/{json,randfill,structured-merge-diff/v6,yaml},gopkg.in/{evanphx/json-patch.v4,inf.v0}— the dep clusterk8sevents dragged in.
go.uber.org/goleakalso dropped post-merge (was held only by PR-F.2's-deleted
internal/pipeline/chaos_test.go).Doc + comment sweep
Comment-only references to deleted receivers in
components/exporters/{otlphttp,stdoutexporter}/,components/receivers/{nccl_fr,pyspy}/,internal/synthesis/patterns/{doc,model,verdict}.gorewired tosurviving references (or to upstream recipe pointers).
docs/README.mdper-component-docs table drops five dead links(caught by
doc-check.sh's rotten-link gate — that is in facthow I caught the last drift).
bench/install/README.mdtick-alias note + schema-v2-rename note updated.
tools/failure-inject/README.mdxid section removed; statusbanner rewritten.
CHANGELOG (new PR-K.2 entry under Unreleased + wave-4 paragraph
re-balanced), MILESTONES (M1 + M9 + M10 + M15 status lines flipped
from "DELETED at v0.2.0" → "DELETED in PR-K.2" with file pointers
to the integration recipes),
docs/migration/v0.1-to-v0.2.md(PR-K.1 + PR-K.2 checkboxes flipped, PR-F.2 + PR-I.1a status
block updated post-merge), AGENTS.md (queued-for-deletion
paragraph updated to current reality) all swept.
What evaporated mid-flight
Originally this PR was also going to migrate a handful of
_test.gofiles that held placeholder-name string referencesto
clockreceiver(ininternal/pipeline/saferun_test.go,internal/config/fuzz_test.go,internal/pipelinebuilder/fuzz_test.go).PR-F.2 (#215) deleted those files outright while this branch
was being drafted, so the migration target disappeared. No
follow-up needed.
Net LOC delta
What is intentionally NOT in this PR
receivers.clockreceiver/receivers.kernelevents/
receivers.containerstdouttoggles + DaemonSet template refscontainerstdout-rbac.yamltemplate** — stays for PR-K.3 sooperators see
NOTES.txtdeprecation warnings before the valuesshape breaks. The toggles are already inert post-PR-A2 (enabling
any of them in
values.yamlcrashes the OCB binary at bootbecause the factories are not registered) — keeping them as
no-ops for one minor preserves operator UX.
internal/{componentstatus,pipeline,…}/deletion — alreadydone in PR-F.2 (feat(pivot): PR-F.2 — delete in-tree boot-path internals #215) before this branch landed; merged in.
values.yaml# clockreceiver — in-tree heartbeat retired by RFC-0013 PR-A2style retire-banners — stays forPR-K.3 alongside the actual values-keys cleanup.
tools/failure-inject/ncclhang/— KEPT. Used bypkg/nccl/fr_parser/synthesize_test.go+bench/overhead/nccl_fr_bench_test.go; this is the canonicalexample of a failure-injector that survives the v0.2.0 cut.
Root cause
The four receivers + xidgen survived only because PR-K.1's pattern-
library severance from
k8seventswas still in flight. With #211merged at the start of this session, the deletion is unblocked.
There is no workaround being applied here — this PR is the
root-cause deletion of the in-tree receivers themselves, which
RFC-0013 §migration set as the v0.2.0 deletion target.
Test plan
make cigreen post-merge (verify + license + nccl-fr-rce-gate +register-lint + actionlint + zizmor + coverage-check +
ci-fuzz-nccl-fr + govulncheck + doc-check + no-autoupdate-check +
build).
make verifygreen post-merge.make buildgreen post-merge (OCB compile againstbuilder-config.yamlwithGOWORK=offper the PR-I.1aisolation guard yields
./_build/tracecore).go test ./...green across the post-merge tree.deletion target (
clockreceiver,kernelevents,k8sevents,containerstdout,xidgen) — re-verified after the merge.chart-renderjob validates the surviving chart fixtures(
all-receivers-off-values.yaml,one-receiver-on-values.yaml,pyspy-on-values.yaml); the deletedcontainerstdout-on-values.yamldrop-out should not regress conftest coverage of the
containerstdout-allowlist / operational-invariant rules
because the template guard still fires when
containerstdout.enabled=trueis set in any future valuesfile. PR-K.3 will reassess once the chart-side keys go.
install (kind)job continues to render the benchtracecore-values.yaml against the OCB binary
(hostmetricsreceiver heartbeat surface).
harness-determinism (amd64/arm64)job no longer runsthe xid byte-determinism steps; pod-evict + golden-SHA loop
survive. Expected: 2 fewer steps per matrix arm.
Gates that should fail this PR if I missed something
doc-check's "dead markdown link" gate would catch anysurviving link into the deleted dirs (caught the
docs/README.mdregressions on the first run; fixed andre-verified).
go vet ./...would catch a stale import; ran clean.golangci-lint run ./...would catch unused imports or deadcode introduced by the sweep; 0 issues reported.
go mod tidy -diffwould catch a missing dep prune; ran cleanafter the post-merge prune of
go.uber.org/goleak.Refs RFC-0013 §migration PR-K.2.