feat(pivot): PR-B1: port nccl_fr off internal selftel + lifecycle #184
Merged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request on May 31, 2026
## Summary

- Port `components/receivers/clockreceiver` off `internal/selftelemetry` + `internal/runtime/lifecycle`. Helpers travel as siblings (`selftel.go` + `lifecycle.go`) co-located with the receiver, per RFC-0013 §migration PR-B (mechanical mirror of PR-B1 #184 for nccl_fr).
- Receiver-scoped meter is acquired from `set.Telemetry.MeterProvider`. Instrumentation scope name = the receiver's Go import path (`github.com/tracecoreai/tracecore/components/receivers/clockreceiver`).
- Metric names + label shape preserved (`tracecore.receiver.{errors_total,emissions_total,collection_latency_seconds,degraded_seconds_total,last_activity_unix_seconds}` with `component_id` + `kind`). The `cmd/tracecore` integration tests that assert these names by string still pass.
- Lifecycle sibling is slimmer than the internal helper: drops `Add()` + the post-Shutdown silent-no-op path because clockreceiver is single-source (same shape as the nccl_fr sibling). `panicCallback(any)` signature preserved so the panic → `IncError(kindPanic)` + `SetDegraded(true)` wiring is untouched.
- `kind` enum is trimmed to the two values clockreceiver actually emits (`downstream`, `panic`); no speculative kinds.

## Why port-before-delete

clockreceiver is scheduled for deletion in PR-K, so a fair question is "why port it at all?" Three reasons:

1. **Unblocks PR-F earlier.** The `internal/selftelemetry` + `internal/runtime/lifecycle` deletes in PR-F are blocked on every consumer porting off or being deleted. Porting clockreceiver here means PR-F can land before PR-K, instead of waiting on the kernelevents / stdoutexporter / dcgm porting wave plus the clockreceiver delete.
2. **Mechanical mirror of PR-B1.** This diff is a sibling-rename of PR #184; the review surface is the trimmed `kind` enum + the package-name swap. No new design choices.
3. **Anchors the pattern.** clockreceiver is the "canonical example receiver" per its package doc; even briefly carrying the wrong shape would signal that the migration is optional.

Partial credit toward the `internal/selftelemetry` + `internal/runtime/lifecycle` deletes scheduled for PR-F.

## Test plan

- [x] `make check` (gofumpt, golangci-lint, go vet, mod verify): 0 issues.
- [x] `go test ./...` repo-wide: green for the clockreceiver and `cmd/tracecore` packages, which assert `tracecore_receiver_*` metric names. (kernelevents `TestReceiver_SLIBudget` shows a known p99-latency overshoot under full-suite load; passes in isolation with `-count=3`; unrelated to this PR.)
- [x] New tests: `selftel_test.go` (6 tests: noop safety, nil-MP, errors_total + kind, emissions_total, scope-name pin, init_errors_total + factory-fallback seam), `lifecycle_test.go` (5 tests: happy path, idempotent Start/Shutdown, panic recovery, shutdown-deadline).
- [x] 9-lens adversarial review (sibling-impl drift, lifecycle drift, test correctness, cardinality, doc-rot, ctx usage, race-safety, helper duplication, scope creep): no findings. Bonus catch: `r.logger` now falls back to `slog.Default()` when `set.Telemetry.Logger` is nil; the original would have nil-dereferenced.
- [x] `grep "internal/selftelemetry\|internal/runtime/lifecycle" components/receivers/clockreceiver/*.go`: only doc-comment mentions remain (no imports).

## Follow-ups

- PR-F: delete `internal/{componentstatus,selftelemetry,runtime/lifecycle}` once every remaining consumer (stdoutexporter, dcgm, kernelevents, containerstdout) has ported or been deleted.
- PR-K: delete clockreceiver itself (replaced by hostmetricsreceiver heartbeat).

Binding doc: [docs/rfcs/0013-distro-first-pivot.md](docs/rfcs/0013-distro-first-pivot.md) §migration PR-B.

## Release notes

```release-notes
[CHANGE] clockreceiver migrates its self-telemetry + lifecycle helpers to co-located sibling files; emitted metric names + labels are unchanged but the OTel instrumentation scope is now the receiver's Go import path.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced May 31, 2026
trilamsr pushed a commit that referenced this pull request on May 31, 2026
Two cross-cut-reviewer findings on PR #188 before re-enabling auto-merge:

1. KindParse was speculative: the dcgm cgo client returns typed values, never raw bytes, so no call site ever emitted it. Removed from the canonical Kind* const block in selftel.go, the canonicalKindByName resolver map in docs_parity_test.go, and the noop-safety test's IncError sweep in selftel_test.go. The AST-walking TestCanonicalKindByName_CoversConstBlock guard stays green because both halves move together.
2. Mirror PR #184's nccl_fr register-failure pattern: added a failingDcgmMP test seam (wraps a real MeterProvider but fails every `tracecore.receiver.*` instrument registration) plus TestSelfTelemetry_NewReceiver_RegisterFailureReturnsErr that asserts newSelfTelemetry surfaces a wrapped errSyntheticDcgmFailure rather than returning a partially-wired impl. Pins the criterion-6 symmetry the cross-cut review flagged so a future refactor that reorders the constructor can't silently bypass the register-failure path on this receiver.

TestRecordInitError_NilProviderIsSafe was already present from the initial port; no change needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request on May 31, 2026
## Summary RFC-0013 PR-B2: ports `components/receivers/kernelevents` off `internal/selftelemetry` + `internal/runtime/lifecycle` to package-local sibling types, following PR-B1 (#184) for `nccl_fr`. One more receiver off the internal helpers; clears another blocker for PR-F (delete `internal/selftelemetry` + `internal/runtime/lifecycle`). ## Why `internal/selftelemetry` and `internal/runtime/lifecycle` are slated for deletion in RFC-0013 PR-F. Every in-tree user must migrate first. PR-B1 landed the pattern (`nccl_fr`); PR-B2 applies it to `kernelevents`. Receivers landing post-PR-F will own their selftel + lifecycle in their own submodule per OTel convention — this PR puts kernelevents on that footing. ## Key difference from PR-B1 PR-B1 slimmed nccl_fr's sibling lifecycle by dropping `Add()` (single source). kernelevents is multi-source (kmsg + journald) and registers one driver goroutine per source under the same WaitGroup via `Add()`, so this PR keeps `Add()` and ports it (with its post-Shutdown and pre-Start refusal guards) from `internal/runtime/lifecycle`. ## What changed - New `components/receivers/kernelevents/selftel.go` — local `selfTelemetry` interface, `kind` type, noop impl, real OTel-backed impl, `recordInitError`. Metric names + label shape preserved (`tracecore.receiver.errors_total{kind,component_id}` + siblings); scope name is the receiver's Go import path (OTel convention). Receiver-local kinds (`kmsg_oversized`, `kmsg_overflow`, `journalctl_crash`) sit alongside the canonical mirror set (`parse`, `downstream`, `panic`, `cardinality`); merges what used to live in `kinds.go` (deleted). - New `components/receivers/kernelevents/lifecycle.go` — local `lifecycle` type with `Start` / `Shutdown` / `Add`. TOCTOU-safe Add()/Shutdown via shared mutex; idempotent Shutdown that stashes the first error so deadline misses aren't swallowed; panic recovery routes through the same `onPanic` callback for both Start'd and Add'd goroutines. 
- New `selftel_test.go` + `lifecycle_test.go` — TDD-first, no `internal/telemetry` dep, OTel `sdkmetric.NewManualReader` for metric assertions. Covers noop safety, nil-MP error, errors_total + kind + component_id, scope-name pin, emissions monotonicity + negative-discard, init_errors_total, recordInitError nil-safety, Start/Shutdown happy path, Start-twice sentinel, idempotent Shutdown, panic callback fires, shutdown-deadline returns ctx err, Add() registers under same WaitGroup, Add'd panic fires callback, Add-before-Start no-op, Add-after-Shutdown no-op. - Rewires `kernelevents.go`, `source.go`, `kmsg.go`, `journald.go`, `factory.go`, `export_test.go`, `nullsource_test.go`, `source_template.go.example`, README — drop the `internal/*` imports, switch to the local `selfTelemetry` / `lifecycle` / `kind` symbols. `FakeTelemetry.ErrorKinds()` now returns `[]string` (was `[]selftelemetry.Kind`) — only consumer is in-package tests. - `runbook_kinds_test.go` AST gate rewritten to resolve local `kind` const decls (was: selector-walk for `selftelemetry.Kind...`); expected set unchanged so a future regression that drops a kind still fails the gate. - `internal/runtime/lifecycle/lifecycle.go` doc comment updated: kernelevents removed from the "Used by" list; dcgm is the last in-tree user, noted alongside the deletion plan. ## Root cause Not a bug fix — this is a structural migration to unblock PR-F. kernelevents-specific requirement: multi-source receiver needs `Add()` preserved across the migration (PR-B1's pattern dropped it for the single-source case). No workaround; full port. 
## Test plan - [x] `make check` — 0 issues (golangci-lint + vet + mod verify) - [x] `go test ./...` — entire repo green - [x] `go test -race -run "^(TestSelfTelemetry|TestLifecycle|TestRecordInit|TestSourceInterface|TestKmsgSource)" ./components/receivers/kernelevents/` — race-clean - [x] `TestRUNBOOK_KindsMatchEmitted_PinsExpectedSet` still pins the same kind set after the AST walker rewrite - [x] Verified `kernelevents.FakeTelemetry` + helpers have no out-of-package consumers ### Known pre-existing flake (NOT caused by this PR) `TestReceiver_SLIBudget` (bench_test.go) flakes when p99 emit latency exceeds 5ms — reproduced on clean `main` at 12.9ms p99 without these changes. Tracking separately; not blocking. ## Release notes ```release-notes [CHANGE] kernelevents: receiver self-telemetry + lifecycle helpers moved from internal/selftelemetry + internal/runtime/lifecycle to package-local siblings (RFC-0013 PR-B2). Metric names, label shape, and observable behavior are unchanged. The OTel instrumentation scope name moves from `github.com/tracecoreai/tracecore/internal/selftelemetry` to `github.com/tracecoreai/tracecore/components/receivers/kernelevents`, matching OTel convention; dashboards that pinned the old scope name need updating. ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request on May 31, 2026
## Summary Port `components/exporters/stdoutexporter` off `internal/selftelemetry`. This is the exporter-side half of RFC-0013 §migration PR-B — sibling to the receiver-side PR-B1 (#184) that ported `components/receivers/nccl_fr`. Same shape as PR-B1: receiver-scoped self-telemetry helpers travel as siblings (`selftel.go` + `selftel_test.go`) co-located with the component, scope-name = the component's Go import path, metric names + label shape preserved bit-for-bit so dashboards / alerts do not regress. ## Root cause Not a bug — a planned dependency severance. v0.1.x stdoutexporter imports `internal/selftelemetry` for the `Exporter` interface, the `NewNoopExporter` / `NewExporter` constructors, the `Kind` type, and `RecordInitError`. RFC-0013 PR-F deletes `internal/selftelemetry` entirely (the runtime that consumes it dies in the same wave). Every component carrying that import has to migrate to a sibling helper before PR-F can land. PR-B1 did the receiver; this PR does the exporter. ## Changes - `components/exporters/stdoutexporter/selftel.go` (new): sibling `selfExporter` interface + `selfExporterImpl` + `recordInitError`. Emits `tracecore.exporter.calls_total{result,kind,component_id}` on a meter acquired from `set.Telemetry.MeterProvider`. Scope = `github.com/tracecoreai/tracecore/components/exporters/stdoutexporter` (OTel convention). - `components/exporters/stdoutexporter/selftel_test.go` (new): 7 tests — noop safety, nil-MP error sentinel, calls_total emission + `{result,kind,component_id}` label shape, scope-name standard, `init_errors_total` tick + nil-MP safety, factory noop-fallback wiring with synthetic register failure, and a compile-time sibling-types guard. - `components/exporters/stdoutexporter/stdoutexporter.go` (rewired): use sibling types; drop `internal/selftelemetry` import; drop `(*stdoutExporter).SelfExporter()`. 
## ExporterCarrier removal — rationale PR-B1 (nccl_fr receiver) didn't expose any selftelemetry types to the runtime — it was a clean port. stdoutexporter is different: v0.1.x exposed `SelfExporter() selftelemetry.Exporter` so `cmd/tracecore/collect.collectFailureRateReaders` could feed `tracecore.exporter.failure_rate`. This PR drops that contract: - The runtime's reader-collection path silently skips components that don't implement `ExporterCarrier` — documented "no per-exporter signal" degraded mode. - stdoutexporter is the canonical debug / example exporter (writes JSON lines to stdout). Operators don't alert on its failure_rate. Real backends in `components/exporters/otlphttp` carry that contract instead. - `tracecore_exporter_failure_rate` still surfaces in scrape via the SLO observable gauge (reports 0 with no readers registered) — the M2 acceptance check in `cmd/tracecore/integration_telemetry_test.go` still passes. - `tracecore_exporter_calls_total` continues to surface because the sibling impl emits it on `set.Telemetry.MeterProvider` directly, just under a new scope. - PR-F deletes the entire `internal/selftelemetry` package, so the `ExporterCarrier` contract evaporates regardless. Rationale documented inline in `stdoutexporter.go` and the `selfExporter` docstring so a six-months-cold reader knows why the carrier is missing. ## Test plan - [x] `make check` (gofumpt, golangci-lint, go vet, go mod verify) — green - [x] `go test ./components/exporters/stdoutexporter/... -count=1` — 13 tests pass (7 new selftel + 6 existing) - [x] `go test -race ./components/exporters/stdoutexporter/... -count=1` — green - [x] `go test ./cmd/tracecore/ -run TestIntegration_TelemetrySurface_EndToEnd -count=1` — passes (asserts both `tracecore_exporter_calls_total` and `tracecore_exporter_failure_rate` appear in scrape) - [x] `go test ./... 
-count=1 -short` — full repo green; no name regression in `cmd/tracecore`, `internal/telemetry`, or `internal/selftelemetry` (which still has its own tests until PR-F) - [x] Scope-name pin: assertion against `github.com/tracecoreai/tracecore/components/exporters/stdoutexporter` in selftel_test.go - [x] Sibling-types compile-time guard: `asSelfExporter` helper forces compile break if the type ever moves back to `internal/selftelemetry` ## Pattern continuity References `#184` (PR-B1, nccl_fr) for the receiver sibling pattern. Test helpers (`newTestMeterProvider`, `collectRM`, `scopeOf`, `kvMatch`, `dumpNames`, `failingExporterMP`) mirror the nccl_fr equivalents so an M8+ exporter author can read either pair and infer the convention. ## Release notes NONE — this is a pivot-internal port. Metric names + labels are unchanged on the scrape surface. The unexported `SelfExporter()` method removal is not observable to any v0.1.x operator (only `cmd/tracecore`'s reader-collection path read it, and that path's "skip on absent" branch is documented behavior). ```release-notes NONE ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request on May 31, 2026
## Summary Mirrors PR-B1's nccl_fr sibling pattern (#184) for `components/receivers/dcgm`. Adds two package-local files — `selftel.go` and `lifecycle.go` — and rewires the receiver + every test off `internal/selftelemetry` and `internal/runtime/lifecycle`. Both internal packages are slated for deletion in RFC-0013 PR-F; this PR removes dcgm as a blocker. - `selftel.go` owns the `Kind` type, the canonical `Kind*` const block, the receiver-local `kindWatch` / `kindMIG`, the `selfTelemetry` interface + OTel-backed impl, and `recordInitError`. Scope name pinned to the receiver's Go import path per the PR-B1 standard. - `lifecycle.go` is a single-source slim variant of the internal helper (no `Add()` since dcgm doesn't run auxiliary watchers). Exports `ErrAlreadyStarted` so the double-Start contract test continues to compile via `errors.Is`. - The dcgm-specific bits PR-B1 didn't cover: receiver-local Kind constants layered on the canonical set (same const block, no separate import), explicit `SetDegraded(true|false)` transition path exercised by `degraded_seconds_total` accumulation across cycles, and the `WithSelfTelemetry` test seam that now accepts the package-local interface. ## Why port before the planned delete dcgm is scheduled for removal in PR-F. The port still earns its keep because: - It collapses the PR-F critical path from two PRs (dcgm-delete → internal-pkg-delete) to one (PR-F can land both atomically once dcgm is no longer a dependent). - The work is a mechanical mirror of PR-B1 — `selftel.go` + `lifecycle.go` are near-line-for-line copies with the dcgm-specific delta isolated to the const block and the SetDegraded contract. - Per `feedback_no_bloat.md`, the only justification for adding code that's about to be deleted is "it unblocks a delete." That's exactly this PR's posture. ## Root cause RFC-0013's PR-B-series unblocks PR-F by decoupling every `components/*` receiver from `internal/*` packages that should not have receiver-level dependents. 
Each receiver owns its own self-telemetry surface + lifecycle helper; `internal/selftelemetry` + `internal/runtime/lifecycle` become deletable in PR-F. Not a workaround — the port is the actual fix. ## TDD coverage added in `selftel_test.go` - Noop safety across every canonical Kind + the receiver-local `kindWatch` / `kindMIG`. - `errNilMeterProvider` sentinel returned (not noop-substituted) when MP is nil. - `tracecore.receiver.errors_total` partitioned by `kind` with `component_id` label, for both canonical kinds AND local kinds (separate test). - `tracecore.receiver.emissions_total` monotonic + drops negative values. - OTel scope name pinned to `github.com/tracecoreai/tracecore/components/receivers/dcgm`. - `degraded_seconds_total` accumulates across multiple degrade→recover cycles (the dcgm-specific contract). - `init_errors_total` ticks with `kind=receiver` / `reason=instrument_register`; nil MP is silently safe. - AST-walk parity test that asserts `docs_parity_test.go`'s hand-maintained `canonicalKindByName` matches the const block in `selftel.go` — so a future `KindFoo` addition without a paired map entry fails at the moment of drift. ## Test plan - [x] `make check` (gofmt + tidy-check + golangci-lint + go vet + mod-verify) → clean. - [x] `go test ./...` → clean. - [x] `go test -race ./components/receivers/dcgm/...` → clean. - [x] `go build ./components/receivers/dcgm/...` → clean. - [x] New parity test `TestCanonicalKindByName_CoversConstBlock` self-validates against the const block. - [x] `TestRUNBOOK_KindsMatchEmitted` continues to pass — the AST walker resolves both the local-kind idents (`kindWatch`, `kindMIG`) and the now-bare canonical `KindFoo` idents. - [x] `TestReceiver_UsesLifecycleHelper` updated to assert the new `*lifecycle` field-type shape (Ident inside StarExpr) — a revert to the deleted `internal/runtime/lifecycle.Lifecycle` would fail here. 
- [x] `TestReceiver_M2WiringFromMeterProvider` + `TestReceiver_RecordsInitErrorWhenConstructorFails` continue to exercise the end-to-end Prometheus scrape path against the real `internal/telemetry` MeterProvider — noop-fallback observability preserved. ## Release notes ```release-notes NONE: internal refactor only. Receiver behavior + metric names + label shape unchanged. Operators see no scrape-side diff. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
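The AST-walk parity check described under "TDD coverage" can be approximated with the standard `go/parser` and `go/ast` packages. The sketch below is an assumption-laden illustration, not the repository's `TestCanonicalKindByName_CoversConstBlock`: the embedded source string stands in for `selftel.go`, the three kinds listed are a guess at the canonical set, and `constKindNames` / `missingFromMap` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
	"strings"
)

// src stands in for selftel.go; the const names here are illustrative.
const src = `package dcgm

type Kind string

const (
	KindDownstream  Kind = "downstream"
	KindPanic       Kind = "panic"
	KindCardinality Kind = "cardinality"
)
`

// canonicalKindByName plays the role of the hand-maintained resolver map.
var canonicalKindByName = map[string]string{
	"KindDownstream":  "downstream",
	"KindPanic":       "panic",
	"KindCardinality": "cardinality",
}

// constKindNames parses Go source and returns every const identifier whose
// name starts with "Kind", sorted for stable comparison.
func constKindNames(source string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "selftel.go", source, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	for _, decl := range f.Decls {
		gd, ok := decl.(*ast.GenDecl)
		if !ok || gd.Tok != token.CONST {
			continue // skip funcs, types, vars, imports
		}
		for _, spec := range gd.Specs {
			vs, ok := spec.(*ast.ValueSpec)
			if !ok {
				continue
			}
			for _, id := range vs.Names {
				if strings.HasPrefix(id.Name, "Kind") {
					names = append(names, id.Name)
				}
			}
		}
	}
	sort.Strings(names)
	return names, nil
}

// missingFromMap lists const names that have no entry in the map, which is
// the drift the parity test is meant to catch.
func missingFromMap(names []string, m map[string]string) []string {
	var missing []string
	for _, n := range names {
		if _, ok := m[n]; !ok {
			missing = append(missing, n)
		}
	}
	return missing
}

func main() {
	names, err := constKindNames(src)
	if err != nil {
		panic(err)
	}
	fmt.Println("consts:", names)
	fmt.Println("missing from map:", missingFromMap(names, canonicalKindByName))
}
```

A real test would read the file from disk and fail when `missingFromMap` is non-empty, so adding a `KindFoo` const without a paired map entry fails at the moment of drift.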
trilamsr added a commit that referenced this pull request on May 31, 2026
…189) ## Summary RFC-0013 PR-A2 (sequencing gate for PR-B2 / PR-F / PR-I): retire the hand-wired `./cmd/tracecore` entry point and adopt the OpenTelemetry Collector Builder (OCB) output at `./_build/tracecore` as the canonical binary. After this lands, all receivers register through OCB's generated `otelcol.Factories` instead of the bespoke `cmd/tracecore/components.go`. Big diff (-3,869 / +388 across 53 files) because PR-A2 is the load-bearing pivot point; PR-B2 / PR-F / PR-I that follow can land surgically. ```release-notes CLI surface change — `tracecore` now uses the upstream OCB-generated CLI: - `tracecore collect --config=…` → `tracecore --config=…` (collect was default; OCB main runs the collector by default) - `tracecore receivers list` → `tracecore components` (now shows receivers + processors + exporters + extensions + connectors) - `tracecore debug dump` → removed (no OCB equivalent; use `tracecore components` + the live config when filing issues) - `tracecore failure-inject {nccl-hang,pod-evict}` → use the standalone `tools/failure-inject` binary (already ships xid, nccl-hang, pod-evict, cpu-steal subcommands) - `--log.format=text` / `--shutdown.drain-budget=…` / `--version-short` → removed; OCB upstream uses `--feature-gates` + `--set` flags Chart-shape changes — operator-visible: - Default pipeline flips from clockreceiver→stdoutexporter (in-tree, not registered by OCB) to hostmetrics→debug (both upstream OCB-bundled). Fresh install on a no-GPU cluster boots and emits load-average metrics immediately, same as before. - `telemetry.listen` + `telemetry.paths.{metrics,healthz,readyz}` → `telemetry.metricsListen` + `telemetry.healthListen` + `telemetry.healthPath`. The legacy single-listener block is gone because upstream `service.telemetry` and `healthcheckextension` are two separate processes. Probes hit `:13133/` (healthcheckextension default) instead of `:8888/healthz`. Prometheus scrape port (8888) is unchanged. 
- Self-telemetry metric names rename `tracecore_*` → `otelcol_*` (upstream vocabulary). Dashboards on any `tracecore_receiver_*`, `tracecore_exporter_*`, `tracecore_queue_*`, `tracecore_component_*`, or `tracecore_build_info` must be rewritten; see `docs/migration/v0.1-to-v0.2.md` for the exact map.
```

## Breaking changes: orphan components (PR-A2 → PR-J/K bridge)

The OCB-assembled binary registers only the components in `builder-config.yaml`: 6 receivers, 4 exporters, 3 extensions, 4 processors. The chart's per-component toggles for the legacy in-tree set survive this PR so the values shape doesn't break for operators that pin them, but **enabling any of the following in chart values will cause the pod to fail at startup with an "unknown factory" error until PR-B2 / PR-J / PR-K rewire them**:

| Component | Kind | Replacement (planned) |
|---|---|---|
| `clockreceiver` | receiver | `hostmetrics` (PR-E shipped; now default) |
| `containerstdout` | receiver | `filelogreceiver` + container stanza + `file_storage` extension (PR-J) |
| `dcgm` | receiver | `dcgm-exporter` DaemonSet + `prometheusreceiver` (PR-J) |
| `k8sevents` | receiver | `k8sobjectsreceiver` + OTTL `k8s.event.hint` transform (PR-J) |
| `kernelevents` | receiver | `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL Xid transform (PR-J) |
| `nccl_fr` | receiver | In-repo Go submodule via OCB `gomod:` (PR-B2 + PR-I) |
| `pyspy` | receiver | Deferred until OTel Profiles GA |
| `stdoutexporter` | exporter | `debug` (OCB-bundled; now default) |
| `otlphttp` (in-tree clone) | exporter | `otlphttpexporter` (OCB-bundled; same `otlphttp` name in chart values, same field shape: `endpoint`, `compression`, `headers`, `tls.*`, `timeout`, `retry_on_failure`, `sending_queue`; pass-through render so any upstream field works) |

To verify what's actually registered in the binary you're running: `./_build/tracecore components`.
The chart's `NOTES.txt` surfaces a WARNING when an operator enables any of these, and the chart-render CI workflow now runs `tracecore validate` against the default + one-receiver-on fixtures so a chart edit that emits a non-OCB key trips CI before reaching `helm install`. ## What landed ### Deletions (3,032 LOC across 22 source + 7 test files) - `cmd/tracecore/` entire tree: `main.go`, `collect.go`, `validate.go`, `debug.go`, `receivers.go`, `signals.go`, `failure_inject.go`, `openflags_{linux,other}.go`, `receiver_variants{,_dcgm_cgo,_dcgm_stub}.go`, `components.go`, + every `_test.go` (`collect_test`, `debug_test`, `failure_inject{,_linux}_test`, `integration_test`, `integration_telemetry_test`, `main_test`, `receivers_test`) - `components.yaml` + `tools/components-gen/{main,main_test}.go` — superseded by `builder-config.yaml`; OCB owns codegen now - `components/receivers/kernelevents/runbook_test.go` — depended on the deleted `tracecore debug dump` subcommand; kernelevents itself is scheduled for deletion in PR-K ### Build path swap | File | Change | |---|---| | `Makefile` | `build` target now runs OCB (was: legacy `go build ./cmd/tracecore`); dropped `generate`, `generate-check`, `run`, legacy `-ldflags -X` version injection; `coverage` `-coverpkg` drops `./cmd/...` | | `install/kubernetes/tracecore/Dockerfile` | Builds via OCB (`make build` then copy `_build/tracecore`) | | `install/kubernetes/tracecore/templates/daemonset.yaml` | `args` drops the `collect` subcommand; probes hit the new `health` port (13133) at `healthPath` | | `install/kubernetes/tracecore/templates/_helpers.tpl` | `renderedConfig` emits upstream OTel shape — `service.telemetry.metrics.address` + `extensions.health_check` + `service.extensions: [health_check]` — instead of the legacy single-listener `telemetry:` top-level block | | `install/kubernetes/tracecore/values.yaml` | Default pipeline flipped to `hostmetrics → debug`; in-tree-only toggles kept with explicit doc-comments naming 
PR-J/K as their migration owner | | `.goreleaser.yaml` | Switched to `builder: prebuilt` against `./_build/{Os}-{Arch}/tracecore`; release.yml gains a per-platform pre-build step | | `.ko.yaml` | Builds from inside `./_build/` (the OCB submodule) via `KO_CONFIG_PATH=../.ko.yaml`; `main: .` | | `.github/workflows/ci.yml` | `package` job runs OCB; old `build-ocb` drift gate replaced by `smoke-test-binary` job consuming the package artefact | | `.github/workflows/release.yml` | New "Pre-build OCB binaries" step before goreleaser; ko-publish step runs from `cd ./_build/` so OCB submodule resolves | | `.github/workflows/chart.yml` | Restored the `tracecore validate` gate against the default + one-receiver-on chart renders; path triggers swap `cmd/tracecore/**` → `builder-config.yaml` | | `.github/workflows/install-bench.yml` | Path triggers swap `cmd/tracecore/**` → `builder-config.yaml` | | `builder-config.yaml` | `dist.version` bumped to `0.1.0-m9-alpha` to match `Chart.yaml` `appVersion`; `chart-appversion-check.sh` now reads it as source of truth | | `scripts/chart-appversion-check.sh` | Read `dist.version` from `builder-config.yaml` (was: `internal/version/version.go`) | | `scripts/smoke.sh` | Rewritten for OCB binary — hostmetrics → debug config, expects upstream lifecycle log lines | | `scripts/validator-recipe.sh` | `BIN` default now `./_build/tracecore` | | `scripts/{doc-check,no-autoupdate-check}.sh` | Drop `cmd` from scan paths | ### New integration seam - `internal/integration/ocb_scrape_test.go`: spawns `_build/tracecore` against a hostmetrics → debug config, polls the upstream `:NNNN/metrics` surface, asserts both `otelcol_process_uptime` and `otelcol_receiver_accepted_metric_points` are present, then SIGTERMs the subprocess and asserts clean exit. The test skips when `_build/tracecore` is absent so a fresh `git clone` + `go test ./...` stays green; `make build` is the prereq. 
This is the regression gate for the chart's operator-facing self-telemetry contract (RFC-0013 §3): if a future upstream OCB release renames either metric, the chart's `service.telemetry.metrics.address` advertisement breaks downstream dashboards silently — this test fires first.
### Integration recipe migration
- `docs/integrations/examples/honeycomb.yaml` + `otel-backend.yaml`: `clockreceiver` → `hostmetrics` loadscraper (the OCB-supported equivalent per RFC-0013 PR-E). The other two recipes carry `pending-rfc-0013-pr-a` markers and are still skipped by `validator-recipe.sh`.
### Doc rot fixes
- `docs/FAILURE-MODES.md`: rerouted 8 entries that referenced deleted in-tree tests to their upstream OCB owners.
- `docs/FLAKY-TESTS.md`: moved the two in-tree integration flakes to Resolved.
- `STYLE.md`: rewrote repo-layout + component-registration + CLI + build-release sections around OCB.
- `PRINCIPLES.md`: dropped concrete-example reference to deleted file.
- `install/kubernetes/tracecore/README.md`: ko local-build steps now run from inside `./_build/`.
- `docs/rfcs/0013-distro-first-pivot.md`: PR-A2 entry rewritten as landed.
- `docs/migration/v0.1-to-v0.2.md`: added rows for the self-telemetry metric-name rename (`tracecore_*` → `otelcol_*`) and the `telemetry.*` chart values key rename.
## Sequencing constraints honored
- `components/receivers/{clockreceiver, containerstdout, dcgm, k8sevents, kernelevents, nccl_fr, pyspy}` survive as orphan-but-compiling code until PR-K deletes them along with the chart-fixture migration.
- `internal/{pipeline, selftelemetry, telemetry, componentstatus, pipelinebuilder, consumer, fanout, runtime}` survive until PR-F (after PR-B1 lifted nccl_fr off `internal/selftelemetry` in #184).
- `tracecore validate` gate in chart workflow is **restored** on the default + one-receiver-on fixtures (was temporarily disabled in the first push of this PR; the chart's renderedConfig template migration was completed in the same PR).
## Sequencing gate satisfied
PR-B2 (`nccl_fr` import swap to upstream OCB types), PR-F (`internal/*` deletion), and PR-I (Go submodule extraction) are unblocked. The legacy boot path is gone; the OCB-driven boot path is live.
## Test plan
- [x] `make build` → produces `./_build/tracecore` binary via OCB
- [x] `./_build/tracecore --version` reports `0.1.0-m9-alpha` matching `Chart.yaml`
- [x] `./_build/tracecore components` lists 6 receivers + 4 exporters + 3 extensions + 4 processors (the `builder-config.yaml` inventory)
- [x] `./_build/tracecore validate --config=<rendered chart default>` exits 0
- [x] `./_build/tracecore validate --config=<rendered chart one-receiver-on fixture>` exits 0
- [x] `make smoke` passes (hostmetrics → debug, 1.5s window, clean shutdown)
- [x] `make check` passes
- [x] `go test ./internal/integration/...` passes (new OCB scrape test; ~1.2s)
- [x] `helm lint install/kubernetes/tracecore` clean (1 chart, 0 failed; icon advisory only)
- [x] `helm template demo install/kubernetes/tracecore --show-only templates/daemonset.yaml` renders `args: [--config=…]` + two ports (`telemetry`, `health`) + probes hitting `health` port at `/`
- [x] `helm template demo install/kubernetes/tracecore --show-only templates/configmap.yaml | yq '.data["config.yaml"]'` renders upstream OTel shape: `service.telemetry.metrics.address`, `extensions.health_check`, `service.extensions: [health_check]`
- [x] `conftest test --policy install/kubernetes/tracecore/policies/conftest/tracecore.rego /tmp/chart-render.yaml` — 51 tests pass
- [ ] CI: `package` job builds amd64 + arm64 OCB binaries
- [ ] CI: `smoke-test-binary` job runs `--version` + `components` on the package artefact
- [ ] CI: `chart` workflow lints + templates + validate + yq + conftest pass
- [ ] CI: `install-bench` workflow's kind-cluster install still rolls out (bench values updated to drop the obsolete `stdoutexporter` reference and align `otlphttp.endpoint` with the upstream `otlphttpexporter` schema — pass-through render so all upstream fields work without chart changes)
---------
Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 31, 2026
trilamsr added a commit that referenced this pull request May 31, 2026
…block) (#196)
## Summary
RFC-0013 PR-F unblock: ports `components/receivers/k8sevents` off `internal/selftelemetry` + `internal/runtime/lifecycle` to package-local sibling types, mirroring PR-B1 (#184, nccl_fr) and PR-B2 (#187, kernelevents). One more receiver off the internal helpers; clears another blocker for PR-F's deletion of those internal packages.
## Why
`internal/selftelemetry` and `internal/runtime/lifecycle` are slated for deletion in RFC-0013 PR-F. Every in-tree user must migrate first. PR-B1 landed the pattern (`nccl_fr`); PR-B2 applied it to `kernelevents`; this PR applies it to `k8sevents`. Receivers landing post-PR-F will own their selftel + lifecycle in their own submodule per OTel convention — this PR puts k8sevents on that footing.
## Pattern choice (multi-source, keeps `Add()`)
k8sevents is multi-source (Events informer + Node informer driver goroutines both join the receiver's lifecycle), so this PR follows the PR-B2 kernelevents sibling rather than the slimmer PR-B1 nccl_fr sibling — keeps `lifecycle.Add()` (with its post-Shutdown and pre-Start refusal guards, plus the mutex-guarded TOCTOU-safe `Add`/`Shutdown` interleave) so the SharedInformerFactory driver goroutine joins the same WaitGroup as the run-loop goroutine.
## What changed
- **New** `components/receivers/k8sevents/selftel.go` — local `selfTelemetry` interface + noop + OTel-backed implementations + `recordInitError` fallback ticker. Instrumentation scope name pinned to k8sevents' Go import path per OTel convention (PR-I will move it to the submodule path when k8sevents goes external). Metric names (`tracecore.receiver.errors_total{kind,component_id}` and siblings) preserved so dashboards/alerts don't regress.
- **New** `components/receivers/k8sevents/lifecycle.go` — local `lifecycle` helper (Start + Add + Shutdown + panic-recovery + TOCTOU-safe mutex-guarded `closed`/`Add` interleave).
- **New** `selftel_test.go` + `lifecycle_test.go` — TDD-first, no `internal/telemetry` dep. Pin: noop safety across every kind, nil-MeterProvider error, errors_total kind+component_id labels, scope-name = receiver Go import path, init_errors_total tick, factory noop fallback, Start/Shutdown/Add contracts, Add refusal-modes, TOCTOU concurrent Add-during-Shutdown stress.
- `factory.go`, `receiver.go` — drop internal imports, wire local `selfTelemetry` + `lifecycle`. `r.lc` retyped from `*lifecycle.Lifecycle` to `*lifecycle`; `r.telemetry` retyped from `selftelemetry.Receiver` to `selfTelemetry`. Exported `KindWatch` / `KindBackpressureDrop` / `KindNode*` constants retyped from `selftelemetry.Kind` to package-local `kind` alias (load-bearing for the `K8sEventsReceiverDegraded` alert label values, which stay byte-identical). Added re-exported `KindParse` / `KindDownstream` / `KindCardinality` / `KindPanic` so external `_test` packages can still partition assertions on canonical kinds without depending on the deleted internal package.
- `export_test.go` — test-only `recordingTel` replaced by exported `CapturingTelemetry` (mirrors `selftelemetry.CapturingReceiver`, trimmed to the k8sevents surface). External `_test` packages get a stable shim that survives the internal/selftelemetry deletion.
- `receiver_test.go`, `node_failure_modes_test.go` — drop `internal/selftelemetry` import; redirect through `k8sevents.CapturingTelemetry`. Per-test bodies otherwise unchanged (kept the `recordingTel` symbol as a thin wrapper).
## Verification
- `make check` clean (golangci-lint 0 issues, vet clean, mod-verify)
- `make verify` clean (incl. doc-check + alert-check + chart-check + no-autoupdate)
- `go test -race -count=1 ./...` — entire repo green (incl. k8sevents + kernelevents + every internal package, fixture, tool)
- `go test -race -count=1 ./components/receivers/k8sevents/...` — all green; lifecycle TOCTOU stress test passes deterministically under `-race`
- Goroutine leaks: pre-existing `TestReceiver_GoleakNoLeakAfterShutdown` continues to pass after the lifecycle rewire
## Out of scope (deferred)
- PR-F itself (deleting `internal/selftelemetry` + `internal/runtime/lifecycle`) — gated on the other 3 remaining wave-1 consumers (pyspy, otlphttp, containerstdout) landing their ports. This PR is the fourth of the four remaining ports.
```release-notes
NONE
```
Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
…k) (#194)
## Summary
Ports `components/receivers/pyspy` off `internal/selftelemetry` + `internal/runtime/lifecycle` using the sibling co-location pattern established by PR #184 (nccl_fr reference impl). `selftel.go` + `lifecycle.go` now live next to the receiver source. Unblocks **RFC-0013 PR-F** deletion of those internal moats.
## What changed
- `selftel.go` (new) — receiver-scoped `selfTelemetry` interface + `selfTelemetryImpl` (OTel-backed) + `noopSelfTelemetry` + `recordInitError`. Metric names + label shape preserved 1:1 with `internal/selftelemetry`:
  - `tracecore.receiver.errors_total{kind,component_id}`
  - `tracecore.receiver.emissions_total{component_id}`
  - `tracecore.receiver.collection_latency_seconds{component_id}` (12-bucket boundaries, matches v0.1.x)
  - `tracecore.receiver.degraded_seconds_total{component_id}` (observable counter)
  - `tracecore.receiver.last_activity_unix_seconds{component_id}` (observable gauge)
  - `tracecore.selftelemetry.init_errors_total{kind,component_id,reason}` (factory-fallback signal)
- `lifecycle.go` (new) — slim single-source variant (no `Add()`). pyspy fans its three goroutines (scan loop, trigger cadence driver, trigger run loop) via its own `sync.WaitGroup` inside `runAll`, so lifecycle only needs `Start` + `Shutdown`.
- `kinds.go` — `kind` is now a receiver-local type. `kindPanic` is declared locally (replaces `selftelemetry.KindPanic`).
- `factory.go` — uses local `newSelfTelemetry` / `recordInitError` / `reasonInstrumentRegister`.
- `pyspy.go` — fields/types switched to local sibling.
- `pyspy_test.go` — uses sibling-local `fakeTelemetry` captor (replaces `selftelemetry.CapturingReceiver`).
- `selftel_test.go` (new, RED-first TDD) — 7 tests covering noop safety, nil-MP, errors_total kind+component_id, scope-name pin, init_errors_total, nil-MP safety on recordInitError, factory fallback path.
- `fake_telemetry_test.go` (new) — test-only captor implementing the `selfTelemetry` interface; mirrors the kernelevents `export_test.go` pattern.
## Scope-name standard
Instrumentation scope pinned to the receiver's Go import path:
```
github.com/tracecoreai/tracecore/components/receivers/pyspy
```
When the receiver moves to `module/receiver/pyspyreceiver/` in PR-I.1, the scope name moves with it (matches OTel convention).
## Lifecycle pattern decision
pyspy uses **single-source** lifecycle (like nccl_fr), NOT multi-source (like kernelevents). The receiver's `runAll` callback owns a local `sync.WaitGroup` and fans into three goroutines from within the single `lifecycle.Start(ctx, r.runAll)` call. No `Add()` needed → dropped from the slim sibling.
## Tests
```
$ go test ./components/receivers/pyspy/ -count=1
ok github.com/tracecoreai/tracecore/components/receivers/pyspy 0.735s
$ go test -race ./components/receivers/pyspy/ -count=1
ok github.com/tracecoreai/tracecore/components/receivers/pyspy 1.745s
$ make check
0 issues.
```
13 targeted tests pass (7 new selftel + 6 existing receiver lifecycle tests).
## Wave context
pyspy is 1 of 4 remaining unported consumers of `internal/selftelemetry` + `internal/runtime/lifecycle`. Sibling agents (otlphttp / containerstdout / k8sevents) are landing in parallel; this PR's scope is the pyspy receiver only — zero file overlap with the parallel work.
## Test plan
- [x] `go test ./components/receivers/pyspy/ -count=1` — green
- [x] `go test -race ./components/receivers/pyspy/ -count=1` — green
- [x] `make check` — 0 lint issues, vet clean, modules verified
- [x] Scope name asserted at `github.com/tracecoreai/tracecore/components/receivers/pyspy`
- [x] Factory fallback ticks `tracecore.selftelemetry.init_errors_total` on instrument-register failure
- [x] Noop hot-path methods never panic; nil-MP rejected with sentinel error
- [x] Existing receiver tests (Start_EmptyUDSDir, Start_NonExistent, Start_TopLevelDisabled, Shutdown_HonorsContextBudget, Shutdown_Idempotent) still pass against the sibling
```release-notes
NONE
```
---------
Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
6 tasks
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary
- Port `components/receivers/nccl_fr` off the v0.1.x internal facades
(`internal/pipeline`, `internal/consumer`, `internal/runtime/lifecycle`)
onto upstream
`go.opentelemetry.io/collector/{component,receiver,consumer}`
v1.59.0 — the canonical types the OCB-generated `_build/main.go`
already consumes for all third-party receivers.
- Factory is now `receiver.NewFactory(componentType,
createDefaultConfig,
receiver.WithLogs(createLogs, component.StabilityLevelBeta))` instead
of a hand-rolled struct implementing
`internal/pipeline.ReceiverFactory`.
Stability level (`Beta`) preserved across the swap so OCB-surfaced
metadata doesn't regress.
- Receiver struct renamed to `ncclFRReceiver` (avoids collision with the
upstream `receiver` package name) and implements `receiver.Logs` via
a `Start(ctx, component.Host) error` / `Shutdown(ctx) error` pair —
the `pipeline.ComponentState` embed was dropped (upstream
`component.Component` carries no equivalent mixin; the lifecycle
bookkeeping the receiver actually needs lives in the sibling
`lifecycle.go` helper added in PR-B1 #184).
- Logger swapped from `*slog.Logger` → upstream's `*zap.Logger`
(the type carried in `component.TelemetrySettings.Logger`). All log
call sites converted to `zap.String/Int64/Duration/Error` fields;
log messages and fields are byte-for-byte preserved so operator
alerting on log content does not regress.
## Hard gate
PR-I.1 (submodule extraction to `module/receiver/ncclfrreceiver/`)
requires
`grep -rE 'internal/(pipeline|consumer|runtime/lifecycle)' components/receivers/nccl_fr/`
to return zero hits. This PR clears it:
```
$ grep -rn 'internal/pipeline\|internal/consumer\|internal/runtime/lifecycle' components/receivers/nccl_fr/*.go
(no matches)
```
(Two comment-only mentions remain in `lifecycle.go` and `factory.go` —
historical context for the v0.1.x → v0.2.0 migration, not imports.)
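Since a text grep cannot tell an import from a comment mention, one way to check the gate on imports alone is to parse only the import declarations. The sketch below uses the standard `go/parser` package in `ImportsOnly` mode; `bannedImports` and the example module path are hypothetical names for illustration and are not part of the repo's tooling.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// bannedImports parses only the import block of src and returns every import
// path containing one of the banned fragments. Comments are never inspected.
func bannedImports(filename, src string, banned []string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, spec := range f.Imports {
		path, err := strconv.Unquote(spec.Path.Value)
		if err != nil {
			return nil, err
		}
		for _, b := range banned {
			if strings.Contains(path, b) {
				hits = append(hits, path)
				break
			}
		}
	}
	return hits, nil
}

func main() {
	// Hypothetical file: one comment mention, one real banned import.
	src := `// Package nccl_fr used to depend on internal/pipeline in v0.1.x.
package nccl_fr

import (
	"context"

	"example.com/mod/internal/consumer"
)
`
	banned := []string{"internal/pipeline", "internal/consumer", "internal/runtime/lifecycle"}
	hits, err := bannedImports("factory.go", src, banned)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("banned imports:", hits)
}
```

In this example the comment that mentions `internal/pipeline` is ignored and only the real import of `internal/consumer` is reported.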
## Predecessor
PR-B1 #184 (merged 2026-05-30) ported the **self-telemetry + lifecycle**
helpers into the package as siblings. This PR (PR-B2) handles the
**pipeline + consumer + factory** layer — the last remaining
`internal/*` imports.
## Test plan
- [x] `go build ./...` — green
- [x] `go test ./components/receivers/nccl_fr/... -race` — 12 tests pass
- [x] `go test ./...` — green except pre-existing flake in
`components/receivers/kernelevents/TestReceiver_SLIBudget`
(verified flaky on stashed/PR-B2-not-applied tree)
- [x] `make check` — golangci-lint + go vet + go mod verify — green
- [x] Scope-name pin (`TestSelfTelemetry_ScopeNameIsReceiverImportPath`)
still asserts
`github.com/tracecoreai/tracecore/components/receivers/nccl_fr`
- [x] Factory fallback contract
(`TestFactory_FallsBackToNoopWhenMeterFails`)
still surfaces `tracecore.selftelemetry.init_errors_total` when
every `tracecore.receiver.*` instrument registration is synthetically
failed
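The factory fallback contract in the last test-plan item (fall back to a noop and tick an init-error counter when instrument registration fails) can be illustrated without the OTel SDK. This is a sketch under stated assumptions: `registrar`, `countingTelemetry`, `buildTelemetry`, and the plain `int` counter are stand-ins invented for the example; the real sibling is described as OTel-backed and ticks `tracecore.selftelemetry.init_errors_total`.

```go
package main

import (
	"errors"
	"fmt"
)

// kind is a receiver-local error kind, trimmed to two example values.
type kind string

const (
	kindDownstream kind = "downstream"
	kindPanic      kind = "panic"
)

// selfTelemetry is a hypothetical receiver-local interface.
type selfTelemetry interface {
	IncError(k kind)
	SetDegraded(degraded bool)
}

// noopSelfTelemetry is always safe to call.
type noopSelfTelemetry struct{}

func (noopSelfTelemetry) IncError(kind)    {}
func (noopSelfTelemetry) SetDegraded(bool) {}

// registrar stands in for a meter provider (an assumption for this sketch).
type registrar interface {
	Register(name string) error
}

// countingTelemetry is an in-memory implementation used instead of OTel.
type countingTelemetry struct {
	errs     map[kind]int
	degraded bool
}

func (c *countingTelemetry) IncError(k kind)    { c.errs[k]++ }
func (c *countingTelemetry) SetDegraded(d bool) { c.degraded = d }

var errNilRegistrar = errors.New("selftel: nil registrar")

// newSelfTelemetry rejects a nil registrar and fails on the first instrument
// that cannot be registered.
func newSelfTelemetry(r registrar) (selfTelemetry, error) {
	if r == nil {
		return nil, errNilRegistrar
	}
	for _, name := range []string{
		"tracecore.receiver.errors_total",
		"tracecore.receiver.emissions_total",
	} {
		if err := r.Register(name); err != nil {
			return nil, fmt.Errorf("register %s: %w", name, err)
		}
	}
	return &countingTelemetry{errs: map[kind]int{}}, nil
}

// buildTelemetry is the factory seam: on failure it ticks initErrors and
// returns the noop so the receiver still starts.
func buildTelemetry(r registrar, initErrors *int) selfTelemetry {
	tel, err := newSelfTelemetry(r)
	if err != nil {
		*initErrors++
		return noopSelfTelemetry{}
	}
	return tel
}

type okRegistrar struct{}

func (okRegistrar) Register(string) error { return nil }

type failingRegistrar struct{}

func (failingRegistrar) Register(name string) error {
	return errors.New("synthetic failure: " + name)
}

func main() {
	var initErrors int
	tel := buildTelemetry(failingRegistrar{}, &initErrors)
	tel.IncError(kindPanic) // safe on the noop
	tel.SetDegraded(true)
	fmt.Printf("fallback type=%T init errors=%d\n", tel, initErrors)
}
```

The point of the seam is that a telemetry failure degrades observability but never blocks the receiver from starting, while the init-error tick keeps the failure visible.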
## Compatibility note
`go.mod` now pins
`go.opentelemetry.io/collector/{component,receiver,consumer} v1.59.0`
— the v1.x stable line aligned with `pdata v1.59.0` the repo was already
on.
No transitive-dep churn beyond zap 1.24.0 → 1.28.0 (upstream component
v1.59.0 requires it).
```release-notes
NONE
```
## Type-swap reference
This PR is the first-of-kind upstream-API port; 8 more receiver/exporter
ports (PR-F.2 series — clockreceiver, kernelevents, stdoutexporter,
k8sevents, containerstdout, otlphttp, pyspy, dcgm) inherit this exact
mapping. Use this table as reference when porting those components off
`internal/pipeline` + `internal/consumer`.
| Internal | Upstream |
|---|---|
| `internal/pipeline.Type` | `component.Type` |
| `internal/pipeline.ReceiverFactory` | `receiver.Factory` |
| `internal/pipeline.CreateSettings` | `receiver.Settings` (via `receivertest.NewNopSettings` in tests) |
| `internal/pipeline.Config` | `component.Config` |
| `internal/pipeline.Receiver` | `receiver.Logs` (`= interface{ component.Component }`) |
| `internal/consumer.Logs` | `consumer.Logs` |
| `*slog.Logger` | `*zap.Logger` |
| `internal/pipeline.MustNewType` | `component.MustNewType` |
| `internal/pipeline.MustNewID` | `component.NewIDWithName` |
## Deep-review cleanup (post-aec83be)
A follow-up commit applies five reference-pattern fixes so PR-F.2
inherits
the cleanest possible template:
1. Deleted dead `var Factory` + indirection wrapper — `NewFactory()` now
constructs the factory directly, mirroring upstream `otlpreceiver` /
`filelogreceiver`. The `tools/components-gen` driver that motivated the
package-var was deleted in PR-A2 #168.
2. Fixed misleading comment claiming `receiver.Logs` carries a
"LogsReceiver tag" — upstream defines `receiver.Logs` as
`interface { component.Component }`; the type identity is a
documentation marker only.
3. Switched tests to upstream
`receivertest.NewNopSettings(componentType())`
so test Settings auto-track upstream field additions (`BuildInfo`,
`TelemetrySettings`). A thin `testSettings()` wrapper pins the ID to
`nccl_fr/test` so selftel label assertions stay deterministic.
4. Renamed unexported `ncclFRReceiver` → `ncclfrReceiver` per Go acronym
convention (lowercased) and aligned with the planned PR-I.1b package
name `ncclfrreceiver`.
5. Appended this Type-swap reference section.
---------
Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
7 tasks
trilamsr added a commit that referenced this pull request May 31, 2026
#206)
## Summary
Deletes the three internal moats and the in-tree DCGM receiver that RFC-0013 §migration step 8 promised — the payoff for the wave-3 sibling-port PRs (#184/#185/#186/#187/#188/#193/#194/#196/#197). **Net: -12,482 LOC across 92 files (78 deletions, 14 modifications).**
### What deletes
| Path | LOC | Why safe now |
|---|---|---|
| `components/receivers/dcgm/` | 7,604 | cgo stub never shipped real code; #188's PR-B2-shaped dcgm sweep already removed the live port surface. |
| `pkg/dcgm/` | 922 | Only consumer was the deleted receiver. Bonus cleanup. |
| `internal/selftelemetry/` | 1,946 | Every consumer (containerstdout, clockreceiver, kernelevents, k8sevents, nccl_fr, dcgm, pyspy, stdoutexporter, otlphttp) ported onto receiver/exporter-scoped sibling `selftel.go` files. |
| `internal/telemetry/` | 1,991 | Probes flow through upstream `healthcheckextension`; MeterProvider via upstream `service.telemetry`. Only remaining consumers were `internal/selftelemetry/*_test.go` (deleted together) + one orphan clockreceiver test. |
| `components/receivers/clockreceiver/errors_integration_test.go` | 100 | Orphan from #185's PR-B1 clockreceiver port — bootstrapped via the deleted `selftelemetry.Receiver` interface but never migrated to the receiver-scoped sibling `selftel.go`. Covered behaviour ("errors_total surfaces on downstream failure") is now exercised through clockreceiver's sibling tests. |
### Pre-flight grep evidence (post-merge of origin/main)
```
$ grep -rn "tracecoreai/tracecore/internal/selftelemetry" --include="*.go" .
(zero matches)
$ grep -rn "tracecoreai/tracecore/internal/telemetry" --include="*.go" .
(zero matches)
$ grep -rn "tracecoreai/tracecore/components/receivers/dcgm" --include="*.go" .
$ grep -rn "tracecoreai/tracecore/pkg/dcgm" --include="*.go" .
(zero matches)
```
### Tooling
- Retire the `dcgm` build tag — `make build-tags` no longer vets `-tags dcgm` (kept as a hook for future build-tag-gated paths).
- `make bench-check` loop drops both deleted package rows (`internal/telemetry`, `components/receivers/dcgm`).
- `scripts/register-lint.sh` allowlist emptied (the two `internal/telemetry/{build_info,slo}.go` entries are gone with the package; allowlist comment notes the post-PR-F.1 state).
- `go.mod` direct deps shrink — `github.com/prometheus/client_golang` and `go.opentelemetry.io/otel/exporters/prometheus` drop to indirect (they were used by `internal/telemetry/server.go`).
### Chart toggles intentionally retained
Chart `receivers.dcgm` toggle + `templates/NOTES.txt` warning + `templates/_helpers.tpl` doc-comment list keep the `dcgm` symbol for the migration window. The toggle has been inert since PR-A2 — operators enabling `receivers.dcgm.enabled=true` already crashed at boot because the OCB binary doesn't register the factory. PR-K removes the toggle entirely alongside the chart-default flip from `clockreceiver` → `hostmetrics` and the v0.2.0 recipe migration.
### Doc sweep
- `internal/runtime/lifecycle/lifecycle.go` doc-comment: drop the dcgm pointer; flag containerstdout as the sole remaining in-tree consumer; reschedule the package itself for PR-F.2 deletion once containerstdout ports off the helper or PR-K.2 deletes the receiver.
- `docs/FAILURE-MODES.md` self-tel-surface rows rewired from `internal/telemetry/server_test.go::*` (deleted) to upstream-delegated wording.
- `docs/patterns/{README,pattern-{1,3,4,5}}.md` replay-test pointers updated — the in-tree `components/receivers/dcgm/pattern_replay_test.go` is gone; pattern replay now flows through `docs/integrations/prometheus-scrape.md` (PR-J's upstream `dcgm-exporter` recipe).
- `docs/README.md` per-component table: drop the deleted `internal/telemetry/{README,SECURITY}.md` rows + the deleted `components/receivers/dcgm/{README,RUNBOOK}.md` rows.
- `STYLE.md` vendor-SDK section: drop the `pkg/dcgm/` reference + the `//go:build dcgm` example; explicit cross-reference to PR-F.1 in the integration-test build-tag note.
- `CHANGELOG.md`: PR-F.1 landed entry under Unreleased; "Remaining v0.1.0 work" line updated to point at PR-F.2.
- `docs/rfcs/0013-distro-first-pivot.md` §migration step 8: PR-F entry replaced with the PR-F.1/PR-F.2 split + the explicit rationale (componentstatus travels with pipeline; pipeline is out of PR-F's scope per line 240's original framing).
### Out of scope (PR-F.2 follow-up)
- `internal/componentstatus/` — 5-line `ReportStatus` free function. Travels with `internal/pipeline` (its only non-test consumers are `internal/pipeline/runtime_test.go` + `internal/pipeline/pipelinetest/fixture_test.go`). Deletion lands when pipeline migrates to upstream `go.opentelemetry.io/collector/component/componentstatus`.
### Rationale links
- RFC-0013 §migration step 8 — the PR-F entry now codifies the F.1/F.2 split in this branch's RFC update.
- PR-B2 scope-discovery (#188) — established the "rename + slim, don't reshape" pattern for the dcgm sweep that retired the cgo path.
- Wave-3 PRs that unblocked selftelemetry deletion: #184 (nccl_fr), #185 (clockreceiver), #186 (kernelevents), #187 (stdoutexporter), #188 (dcgm), #193 (otlphttp), #194 (pyspy), #196 (k8sevents), #197 (containerstdout).
```release-notes
[CHANGE] internal/{selftelemetry,telemetry} packages deleted; components/receivers/dcgm + pkg/dcgm deleted. Operators using the v0.1.x in-tree `tracecore.*` self-telemetry metric names migrate per docs/migration/v0.1-to-v0.2.md. Third-party importers of internal/* (unlikely pre-1.0) lose the `selftelemetry.{Receiver,Exporter}` interfaces and the `telemetry.MeterProvider` wrapper; receiver/exporter authors now wire a receiver-scoped sibling `selftel.go` per the PR-B1 pattern.
```
## Test plan
- [x] `make verify` (lint + vet + tidy-check + mod-verify + license-check + generate-fixtures-check + build-tags + nccl-fr-rce-gate + register-lint + actionlint + zizmor + doc-check + no-autoupdate-check) — exit 0.
- [x] `go test ./...` — all green (29 packages).
- [x] `make build` (OCB) — `./_build/tracecore` produced.
- [x] `./_build/tracecore --version` — `tracecore version 0.1.0-m9-alpha`.
- [x] Pre-flight greps for all four deleted paths — zero external importers.
- [ ] CI green on PR (linux/race matrix, chart render, install-bench, zizmor, govulncheck).
- [ ] Operator verification that the chart's `dcgm` toggle remains inert post-merge (no behaviour change from main — already inert since PR-A2).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
5 tasks
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary
Deletes the seven `internal/*` packages that RFC-0013 §migration step 8 PR-F.2 promised once the upstream-port wave (#201/#202/#203/#204/#205/#207/#208/#209) cleared every external caller of the in-tree pipeline runtime. **Net: -6,888 LOC across 56 deleted files, +80 LOC across 14 modified files. 70 files total.** This is the final cut of RFC-0013 §migration step 8 PR-F.
## What deletes
| Path | LOC | Replacement |
|---|---|---|
| `internal/pipeline/` | 4,134 | `go.opentelemetry.io/collector/service` (OCB-generated `_build/main.go` consumes `builder-config.yaml`). |
| `internal/pipelinebuilder/` | 1,282 | Same — assembly is upstream `service`. |
| `internal/config/` | 718 | Upstream `confmap` providers (`file`, `yaml`, `env`). |
| `internal/consumer/` | 87 | Upstream `go.opentelemetry.io/collector/consumer`. |
| `internal/fanout/` | 366 | Upstream `internal/fanoutconsumer` (collector module). |
| `internal/componentstatus/` | 16 | Upstream `component/componentstatus.ReportStatus` (same free-function shape). |
| `internal/runtime/lifecycle/` | 505 | Per-receiver package-local `lifecycle.go` siblings — already ported during the PR-B1 wave (#184/#185/#186/#187/#194/#196/#197); the in-tree helper had no remaining non-test consumer after PR-F.1 + the wave-2 upstream-port PRs. `kernelevents/lifecycle.go` was inherited from k8sevents (#208). |
## Pre-flight grep evidence
```
$ grep -rn 'tracecoreai/tracecore/internal/(pipeline|consumer|pipelinebuilder|config|fanout|componentstatus|runtime/lifecycle)' --include='*.go' .
(zero matches)
```
## Tooling
- `.golangci.yml` `ignore-interface-regexps` repointed at upstream `consumer.{Metrics,Traces,Logs}` + `component.Component`. The in-tree-only same-package-error-wrap exemption stays — the STYLE rule applies regardless of which interface is forwarded.
- `.github/workflows/chaos.yml` drops the `chaos-pipeline-test` job (the in-tree `internal/pipeline/chaos_test.go` is gone; upstream `service` provides the equivalent panic-recovery contract). `harness-determinism` (failure-inject golden-SHA), `cpu-steal-mpstat`, `pattern-pod-evicted` jobs preserved.
- `.github/workflows/install-bench.yml` drops the `internal/{pipeline,runtime,selftelemetry}/**` path-filter rows.
- `go.mod` / `go.sum` unchanged.
## Doc sweep
- `CHANGELOG.md` Unreleased: PR-F.2 landed entry replacing the "PR-F.2 deferred" sentence; "Remaining v0.1.0 work" line updated; one dead `internal/pipeline/README.md` link in Foundation block rewritten as "deleted at v0.1.0".
- `docs/rfcs/0013-distro-first-pivot.md` §7 deletion table: both pipeline-internals and runtime/lifecycle rows updated from "v0.1.0 (audit first…)" / "v0.2.0 (with last consumer)" to "v0.1.0 (landed PR-F.2)". §migration step 8 reframed.
- `docs/FAILURE-MODES.md` Lifecycle / Data flow / Shutdown timing / Backend tables rewired from in-tree `internal/{config,pipeline,fanout}/*_test.go::TestName` pointers to upstream-delegated wording matching the pattern PR-A2 established.
- `docs/STRATEGY.md` "Post-RFC-0013 status" intro updated; "Stable interfaces in `internal/pipeline/`" graduation row rewritten to point at the upstream surface.
- `docs/migration/v0.1-to-v0.2.md` `internal/*` section status banner flipped from "deferred, still present in RC builds" to "landed, deleted in v0.2.0 builds".
- `MILESTONES.md` v0.1.0 deletions row extended with boot-path internals; M1 + M4b + M19 rubric details annotated with the PR-F.2 retirement.
- `README.md` Contributor row repointed at upstream `go.opentelemetry.io/collector` package docs.
- `AGENTS.md` "Self-telemetry internals" bullet split into "Self-tel internals" + "Pipeline / boot-path internals" with explicit deletion status.
- `docs/README.md` table row for `internal/pipeline/README.md` dropped.
- `components/receivers/kernelevents/README.md` lifecycle-sibling rationale updated to past-tense.
- `tools/failure-inject/README.md` "Testing locally" section drops the `-tags=chaos ./internal/pipeline/...` invocation.
## Sequencing
This PR is hard-gated on every upstream-port PR landing first:
- #201 nccl_fr (PR-B2)
- #202 stdoutexporter
- #203 pyspy
- #204 k8sevents
- #205 clockreceiver (PR-B3)
- #207 otlphttp
- #208 kernelevents
- #209 containerstdout
- #206 PR-F.1 (selftel / telemetry / dcgm)

All nine merged before this PR opened; this is the moat-deletion payoff. Remaining v0.1.0 work is PR-K (chart-default flip + `clockreceiver` + `stdoutexporter` + remaining receiver source deletions, coupled with test-fixture migration and the `telemetry:` values-key deprecation cycle).
## Test plan
- [x] `make check` — golangci-lint 0 issues, go vet clean, go mod verify ok.
- [x] `go build ./...` — clean.
- [x] `go test -count=1 ./...` — green (excluding the known `kernelevents/TestReceiver_SLIBudget` flake called out in #205's body, which only triggers under heavy parallel `go test ./...` load; passes standalone).
- [x] `grep` confirms zero non-internal callers of the deleted packages.
- [x] Doc-check pre-push hook passes after the CHANGELOG dead-link fix.
```release-notes
[CHANGE] internal/{pipeline,pipelinebuilder,config,consumer,fanout,componentstatus,runtime/lifecycle} packages deleted. The OCB-generated boot path off builder-config.yaml replaces them. Third-party importers of internal/* (unlikely pre-1.0; the packages live under internal/ and the Go compiler rejects external imports) lose the pipeline-assembly + lifecycle + config-loader surfaces; receiver authors now wire against upstream go.opentelemetry.io/collector/{component,receiver,consumer,pipeline} directly. See docs/migration/v0.1-to-v0.2.md "internal/* package deletion".
```
---------
Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
## Summary
- Port `components/receivers/nccl_fr` off `internal/selftelemetry` + `internal/runtime/lifecycle`. Helpers travel as siblings (`selftel.go` + `lifecycle.go`) co-located with the receiver, per RFC-0013 §migration PR-B1.
- Receiver-scoped meter is acquired from `set.Telemetry.MeterProvider`. Instrumentation scope name = the receiver's Go import path (`github.com/tracecoreai/tracecore/components/receivers/nccl_fr`); will follow the receiver to `module/receiver/ncclfrreceiver/` in PR-I.1.
- Metric names + label shape preserved (`tracecore.receiver.{errors_total,emissions_total,collection_latency_seconds,degraded_seconds_total,last_activity_unix_seconds}` with `component_id` + `kind`), so dashboards / alerts do not regress. The `cmd/tracecore` integration test that asserts these names by string still passes.
- Lifecycle sibling is slimmer than the internal helper: drops `Add()` + the post-Shutdown silent-no-op path because nccl_fr is single-source. `PanicCallback(any)` signature is preserved so the existing panic → `IncError(kindPanic)` + `SetDegraded(true)` wiring is untouched.

This is partial credit toward the `internal/selftelemetry` + `internal/runtime/lifecycle` deletes scheduled for PR-F. The package deletions wait on stdoutexporter, dcgm, kernelevents, and clockreceiver also porting off (or being deleted by) the wave.
## Why receiver-local scope (vs reusing the internal scope name)
The adversarial review flagged the scope-name choice as the key decision. Picking the receiver's import path means:
- `tracecore_receiver_*` series will appear under two `otel_scope_name` labels while old receivers still emit under `internal/selftelemetry`. Bounded — all other in-tree receivers either die (PR-K) or move to `module/receiver/*` (PR-I.1) within v0.2.0, and no operator alerts exist at v0.1.x that strip the scope label.
## Test plan
- `make check` (gofumpt, golangci-lint, go vet, mod verify) — 0 issues.
- `go test ./...` repo-wide — green, including `cmd/tracecore` integration tests that assert `tracecore_receiver_*` metric names.
- `selftel_test.go` (6 tests covering noop, nil-MP, errors_total + kind label, emissions_total, scope-name pin, init_errors_total + the factory-fallback seam), `lifecycle_test.go` (5 tests covering happy path, idempotent Start/Shutdown, panic recovery, shutdown-deadline).
- `grep "internal/selftelemetry\|internal/runtime/lifecycle" components/receivers/nccl_fr/*.go` — only doc-comment mentions remain (no imports).
## Follow-ups
- `internal/pipeline` + `internal/consumer` (blocked on PR-A2).
- `internal/{componentstatus,selftelemetry,runtime/lifecycle}` once every consumer has ported or been deleted.

Binding doc: docs/rfcs/0013-distro-first-pivot.md §migration PR-B1.
## Release notes