feat(pivot): PR-B1 — port nccl_fr off internal selftel + lifecycle#184

Merged
trilamsr merged 1 commit into main from pr-b1-nccl-fr-port on May 31, 2026

Conversation

@trilamsr trilamsr commented May 31, 2026

Contributor

Summary

  • Port components/receivers/nccl_fr off internal/selftelemetry + internal/runtime/lifecycle. Helpers travel as siblings (selftel.go + lifecycle.go) co-located with the receiver, per RFC-0013 §migration PR-B1.
  • Receiver-scoped meter is acquired from set.Telemetry.MeterProvider. Instrumentation scope name = the receiver's Go import path (github.com/tracecoreai/tracecore/components/receivers/nccl_fr); will follow the receiver to module/receiver/ncclfrreceiver/ in PR-I.1.
  • Metric names + label shape preserved (tracecore.receiver.{errors_total,emissions_total,collection_latency_seconds,degraded_seconds_total,last_activity_unix_seconds} with component_id + kind), so dashboards / alerts do not regress. The cmd/tracecore integration test that asserts these names by string still passes.
  • Lifecycle sibling is intentionally slimmer than the internal helper: drops Add() + the post-Shutdown silent-no-op path because nccl_fr is single-source. PanicCallback(any) signature is preserved so the existing panic → IncError(kindPanic) + SetDegraded(true) wiring is untouched.

This is partial credit toward the internal/selftelemetry + internal/runtime/lifecycle deletes scheduled for PR-F. Those package deletions wait until stdoutexporter, dcgm, kernelevents, and clockreceiver have also ported off the helpers or been deleted in the same wave.
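A minimal sketch of what a single-source lifecycle sibling of this shape could look like, using only the standard library. It is not the code in `lifecycle.go`: the `lifecycle` struct layout, `errAlreadyStarted`, and the two `demo*` helpers are assumptions made for illustration. Only the `Start` / `Shutdown` surface, the absence of `Add()`, and the `func(any)` panic callback shape come from the description above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// errAlreadyStarted is a hypothetical sentinel for a second Start call.
var errAlreadyStarted = errors.New("lifecycle: already started")

// lifecycle drives exactly one goroutine; there is deliberately no Add().
type lifecycle struct {
	mu      sync.Mutex
	cancel  context.CancelFunc
	done    chan struct{}
	onPanic func(any) // same shape as the PanicCallback(any) hook
}

// Start launches run once and routes a recovered panic to onPanic.
func (l *lifecycle) Start(run func(context.Context)) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.done != nil {
		return errAlreadyStarted
	}
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	l.cancel, l.done = cancel, done
	go func() {
		defer close(done) // runs last, after the recover below
		defer func() {
			if r := recover(); r != nil && l.onPanic != nil {
				l.onPanic(r)
			}
		}()
		run(ctx)
	}()
	return nil
}

// Shutdown cancels the goroutine and waits for it, or returns ctx.Err()
// if the deadline passes first. Calling it again is harmless.
func (l *lifecycle) Shutdown(ctx context.Context) error {
	l.mu.Lock()
	cancel, done := l.cancel, l.done
	l.mu.Unlock()
	if done == nil {
		return nil // never started
	}
	cancel()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demoPanic shows the panic value reaching the callback.
func demoPanic() string {
	var got string
	l := &lifecycle{onPanic: func(v any) { got = fmt.Sprint(v) }}
	_ = l.Start(func(context.Context) { panic("boom") })
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_ = l.Shutdown(ctx)
	return got
}

// demoDoubleStart reports whether a second Start returns the sentinel.
func demoDoubleStart() bool {
	l := &lifecycle{}
	_ = l.Start(func(ctx context.Context) { <-ctx.Done() })
	err := l.Start(func(context.Context) {})
	_ = l.Shutdown(context.Background())
	return errors.Is(err, errAlreadyStarted)
}

func main() {
	fmt.Println(demoPanic(), demoDoubleStart())
}
```

In the receiver, the callback passed as `onPanic` would be the place where the panic → IncError(kindPanic) + SetDegraded(true) wiring hooks in.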

Why receiver-local scope (vs reusing the internal scope name)

The adversarial review flagged the scope-name choice as the key decision. Picking the receiver's import path means:

  • Transitional fragmentation: tracecore_receiver_* series will appear under two otel_scope_name labels while old receivers still emit under internal/selftelemetry. Bounded — all other in-tree receivers either die (PR-K) or move to module/receiver/* (PR-I.1) within v0.2.0, and no operator alerts exist at v0.1.x that strip the scope label.
  • Aligns with OTel convention (scope = Go import path), which makes the PR-I.1 move re-stamp the scope cleanly.

Test plan

  • make check (gofumpt, golangci-lint, go vet, mod verify) — 0 issues.
  • go test ./... repo-wide — green, including cmd/tracecore integration tests that assert tracecore_receiver_* metric names.
  • New tests: selftel_test.go (6 tests covering noop, nil-MP, errors_total + kind label, emissions_total, scope-name pin, init_errors_total + the factory-fallback seam), lifecycle_test.go (5 tests covering happy path, idempotent Start/Shutdown, panic recovery, shutdown-deadline).
  • Adversarial review across 9 lenses (sibling-impl drift, lifecycle drift, test correctness, cardinality, doc-rot, ctx usage, race-safety, helper duplication, scope creep) — no findings.
  • grep "internal/selftelemetry\|internal/runtime/lifecycle" components/receivers/nccl_fr/*.go — only doc-comment mentions remain (no imports).
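The grep in the last item matches doc-comment mentions as well as imports, so its output has to be read by eye. A hedged alternative, sketched with the standard `go/parser` package, looks only at import declarations. The function names and the two sample sources are hypothetical; the banned suffixes are the two packages named above.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// bannedImports lists the internal packages the port is meant to drop.
var bannedImports = []string{
	"internal/selftelemetry",
	"internal/runtime/lifecycle",
}

// cleanSrc mentions a banned package in a comment only (hypothetical file).
const cleanSrc = `// Package nccl_fr replaces internal/selftelemetry with a sibling helper.
package nccl_fr

import "context"

var _ = context.Background
`

// dirtySrc still imports a banned package (hypothetical file).
const dirtySrc = `package nccl_fr

import "github.com/tracecoreai/tracecore/internal/selftelemetry"

var _ = selftelemetry.Kind("")
`

// findBannedImports parses only the import block of src and returns every
// import path that ends with a banned suffix. Comments are never inspected.
func findBannedImports(filename, src string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}
	var hits []string
	for _, imp := range f.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		for _, banned := range bannedImports {
			if strings.HasSuffix(path, banned) {
				hits = append(hits, path)
			}
		}
	}
	return hits, nil
}

// countBanned returns the number of banned imports, or -1 on a parse error.
func countBanned(src string) int {
	hits, err := findBannedImports("sample.go", src)
	if err != nil {
		return -1
	}
	return len(hits)
}

func main() {
	fmt.Println(countBanned(cleanSrc), countBanned(dirtySrc))
}
```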

Follow-ups

  • PR-B2: port nccl_fr off internal/pipeline + internal/consumer (blocked on PR-A2).
  • PR-F: delete internal/{componentstatus,selftelemetry,runtime/lifecycle} once every consumer has ported or been deleted.

Binding doc: docs/rfcs/0013-distro-first-pivot.md §migration PR-B1.

Release notes

[CHANGE] nccl_fr receiver migrates its self-telemetry + lifecycle helpers to co-located sibling files; emitted metric names + labels are unchanged but the OTel instrumentation scope is now the receiver's Go import path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) May 31, 2026 03:47
@trilamsr trilamsr merged commit b4b6e9b into main May 31, 2026
13 of 14 checks passed
@trilamsr trilamsr deleted the pr-b1-nccl-fr-port branch May 31, 2026 03:47
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

- Port `components/receivers/clockreceiver` off `internal/selftelemetry`
+ `internal/runtime/lifecycle`. Helpers travel as siblings (`selftel.go`
+ `lifecycle.go`) co-located with the receiver, per RFC-0013 §migration
PR-B (mechanical mirror of PR-B1 #184 for nccl_fr).
- Receiver-scoped meter is acquired from `set.Telemetry.MeterProvider`.
Instrumentation scope name = the receiver's Go import path
(`github.com/tracecoreai/tracecore/components/receivers/clockreceiver`).
- Metric names + label shape preserved
(`tracecore.receiver.{errors_total,emissions_total,collection_latency_seconds,degraded_seconds_total,last_activity_unix_seconds}`
with `component_id` + `kind`). The `cmd/tracecore` integration tests
that assert these names by string still pass.
- Lifecycle sibling is slimmer than the internal helper: drops `Add()` +
the post-Shutdown silent-no-op path because clockreceiver is
single-source (same shape as the nccl_fr sibling). `panicCallback(any)`
signature preserved so the panic → `IncError(kindPanic)` +
`SetDegraded(true)` wiring is untouched.
- `kind` enum is trimmed to the two values clockreceiver actually emits
(`downstream`, `panic`) — no speculative kinds.

## Why port-before-delete

clockreceiver is scheduled for deletion in PR-K, so a fair question is
"why port it at all?" Three reasons:

1. **Unblocks PR-F earlier.** The `internal/selftelemetry` +
`internal/runtime/lifecycle` deletes in PR-F are blocked on every
consumer porting off or being deleted. Porting clockreceiver here means
PR-F can land before PR-K, instead of waiting on the kernelevents /
stdoutexporter / dcgm porting wave plus the clockreceiver delete.
2. **Mechanical mirror of PR-B1.** This diff is a sibling-rename of PR
#184 — the review surface is the trimmed `kind` enum + the package-name
swap. No new design choices.
3. **Anchors the pattern.** clockreceiver is the "canonical example
receiver" per its package doc; even briefly carrying the wrong shape
would signal that the migration is optional.

Partial credit toward the `internal/selftelemetry` +
`internal/runtime/lifecycle` deletes scheduled for PR-F.

## Test plan

- [x] `make check` (gofumpt, golangci-lint, go vet, mod verify) — 0
issues.
- [x] `go test ./...` repo-wide — green for the clockreceiver and
`cmd/tracecore` packages, which assert `tracecore_receiver_*` metric
names. (kernelevents `TestReceiver_SLIBudget` shows a known p99-latency
overshoot under full-suite load; passes in isolation with `-count=3`;
unrelated to this PR.)
- [x] New tests: `selftel_test.go` (6 tests — noop safety, nil-MP,
errors_total + kind, emissions_total, scope-name pin, init_errors_total
+ factory-fallback seam), `lifecycle_test.go` (5 tests — happy path,
idempotent Start/Shutdown, panic recovery, shutdown-deadline).
- [x] 9-lens adversarial review (sibling-impl drift, lifecycle drift,
test correctness, cardinality, doc-rot, ctx usage, race-safety, helper
duplication, scope creep) — no findings. Bonus catch: `r.logger` now
falls back to `slog.Default()` when `set.Telemetry.Logger` is nil; the
original would have nil-dereferenced.
- [x] `grep "internal/selftelemetry\|internal/runtime/lifecycle"
components/receivers/clockreceiver/*.go` — only doc-comment mentions
remain (no imports).

## Follow-ups

- PR-F: delete
`internal/{componentstatus,selftelemetry,runtime/lifecycle}` once every
remaining consumer (stdoutexporter, dcgm, kernelevents, containerstdout)
has ported or been deleted.
- PR-K: delete clockreceiver itself (replaced by hostmetricsreceiver
heartbeat).

Binding doc:
[docs/rfcs/0013-distro-first-pivot.md](docs/rfcs/0013-distro-first-pivot.md)
§migration PR-B.

## Release notes

```release-notes
[CHANGE] clockreceiver migrates its self-telemetry + lifecycle helpers to co-located sibling files; emitted metric names + labels are unchanged but the OTel instrumentation scope is now the receiver's Go import path.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request May 31, 2026
Two cross-cut-reviewer findings on PR #188 before re-enabling
auto-merge:

1. KindParse was speculative — the dcgm cgo client returns typed
   values, never raw bytes, so no call site ever emitted it. Removed
   from the canonical Kind* const block in selftel.go, the
   canonicalKindByName resolver map in docs_parity_test.go, and
   the noop-safety test's IncError sweep in selftel_test.go. The
   AST-walking TestCanonicalKindByName_CoversConstBlock guard
   stays green because both halves move together.

2. Mirror PR #184's nccl_fr register-failure pattern: added a
   failingDcgmMP test seam (wraps a real MeterProvider but fails
   every `tracecore.receiver.*` instrument registration) plus
   TestSelfTelemetry_NewReceiver_RegisterFailureReturnsErr that
   asserts newSelfTelemetry surfaces a wrapped errSyntheticDcgmFailure
   rather than returning a partially-wired impl. Pins the criterion-6
   symmetry the cross-cut review flagged so a future refactor that
   reorders the constructor can't silently bypass the register-
   failure path on this receiver.

TestRecordInitError_NilProviderIsSafe was already present from the
initial port; no change needed.
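The register-failure contract in item 2 can be sketched without the OTel SDK. In the sketch below the `registrar` interface is a hypothetical stand-in for the part of a MeterProvider the constructor touches, and `errSyntheticFailure`, `okRegistrar`, and `failingRegistrar` are illustration names, not the identifiers in the dcgm tests. What it aims to show is the shape being pinned: the constructor returns nil plus a wrapped error on the first failed registration, never a partially wired value.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// registrar is a hypothetical slice of a MeterProvider: register one
// instrument by name.
type registrar interface {
	Register(name string) error
}

var errSyntheticFailure = errors.New("synthetic register failure")

// okRegistrar accepts everything and records the names it saw.
type okRegistrar struct{ names []string }

func (r *okRegistrar) Register(name string) error {
	r.names = append(r.names, name)
	return nil
}

// failingRegistrar wraps a working registrar but rejects every
// tracecore.receiver.* instrument.
type failingRegistrar struct{ inner registrar }

func (f failingRegistrar) Register(name string) error {
	if strings.HasPrefix(name, "tracecore.receiver.") {
		return errSyntheticFailure
	}
	return f.inner.Register(name)
}

type selfTelemetry struct{ instruments []string }

// newSelfTelemetry registers every instrument up front and gives back
// nil plus a wrapped error on the first failure.
func newSelfTelemetry(r registrar) (*selfTelemetry, error) {
	names := []string{
		"tracecore.receiver.errors_total",
		"tracecore.receiver.emissions_total",
	}
	st := &selfTelemetry{}
	for _, n := range names {
		if err := r.Register(n); err != nil {
			return nil, fmt.Errorf("register %s: %w", n, err)
		}
		st.instruments = append(st.instruments, n)
	}
	return st, nil
}

// demoFailure reports whether the failing seam yields nil + wrapped error.
func demoFailure() bool {
	st, err := newSelfTelemetry(failingRegistrar{inner: &okRegistrar{}})
	return st == nil && errors.Is(err, errSyntheticFailure)
}

// demoSuccess returns how many instruments a healthy registrar wires up.
func demoSuccess() int {
	st, err := newSelfTelemetry(&okRegistrar{})
	if err != nil {
		return -1
	}
	return len(st.instruments)
}

func main() {
	fmt.Println(demoFailure(), demoSuccess())
}
```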

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

RFC-0013 PR-B2: ports `components/receivers/kernelevents` off
`internal/selftelemetry` + `internal/runtime/lifecycle` to package-local
sibling types, following PR-B1 (#184) for `nccl_fr`. One more receiver
off the internal helpers; clears another blocker for PR-F (delete
`internal/selftelemetry` + `internal/runtime/lifecycle`).

## Why

`internal/selftelemetry` and `internal/runtime/lifecycle` are slated
for deletion in RFC-0013 PR-F. Every in-tree user must migrate first.
PR-B1 landed the pattern (`nccl_fr`); PR-B2 applies it to
`kernelevents`. Receivers landing post-PR-F will own their selftel +
lifecycle in their own submodule per OTel convention — this PR puts
kernelevents on that footing.

## Key difference from PR-B1

PR-B1 slimmed nccl_fr's sibling lifecycle by dropping `Add()` (single
source). kernelevents is multi-source (kmsg + journald) and registers
one driver goroutine per source under the same WaitGroup via
`Add()`, so this PR keeps `Add()` and ports it (with its post-Shutdown
and pre-Start refusal guards) from `internal/runtime/lifecycle`.
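A rough sketch of the `Add()` guards described above, standard library only. The struct layout and the `demoAdd` helper are assumptions, and returning a bool from `Add` is an illustration choice so the refusals are visible; the PR text describes them as no-ops. The point it tries to show is that the state check and `wg.Add` share the mutex that `Shutdown` takes, so a late `Add` cannot race the wait.

```go
package main

import (
	"fmt"
	"sync"
)

// lifecycle is a multi-source sketch: Start opens the gate, Add registers
// extra goroutines under the same WaitGroup, Shutdown closes the gate.
type lifecycle struct {
	mu       sync.Mutex
	wg       sync.WaitGroup
	started  bool
	shutdown bool
	stop     chan struct{}
	onPanic  func(any)
}

func newLifecycle(onPanic func(any)) *lifecycle {
	return &lifecycle{stop: make(chan struct{}), onPanic: onPanic}
}

// Start marks the lifecycle as running.
func (l *lifecycle) Start() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.started = true
}

// Add reports whether fn was launched. It refuses before Start and after
// Shutdown, and does the check and wg.Add under one lock.
func (l *lifecycle) Add(fn func(stop <-chan struct{})) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if !l.started || l.shutdown {
		return false
	}
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()
		defer func() {
			if r := recover(); r != nil && l.onPanic != nil {
				l.onPanic(r)
			}
		}()
		fn(l.stop)
	}()
	return true
}

// Shutdown is idempotent: only the first call closes stop.
func (l *lifecycle) Shutdown() {
	l.mu.Lock()
	first := !l.shutdown
	l.shutdown = true
	l.mu.Unlock()
	if first {
		close(l.stop)
	}
	l.wg.Wait()
}

// demoAdd tries Add before Start, while running, and after Shutdown.
func demoAdd() string {
	l := newLifecycle(nil)
	wait := func(stop <-chan struct{}) { <-stop }
	before := l.Add(wait)
	l.Start()
	during := l.Add(wait)
	l.Shutdown()
	after := l.Add(wait)
	return fmt.Sprintf("%t %t %t", before, during, after)
}

func main() {
	fmt.Println(demoAdd())
}
```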

## What changed

- New `components/receivers/kernelevents/selftel.go` — local
  `selfTelemetry` interface, `kind` type, noop impl, real OTel-backed
  impl, `recordInitError`. Metric names + label shape preserved
  (`tracecore.receiver.errors_total{kind,component_id}` + siblings);
  scope name is the receiver's Go import path (OTel convention).
  Receiver-local kinds (`kmsg_oversized`, `kmsg_overflow`,
  `journalctl_crash`) sit alongside the canonical mirror set
  (`parse`, `downstream`, `panic`, `cardinality`); merges what used to
  live in `kinds.go` (deleted).
- New `components/receivers/kernelevents/lifecycle.go` — local
  `lifecycle` type with `Start` / `Shutdown` / `Add`. TOCTOU-safe
  Add()/Shutdown via shared mutex; idempotent Shutdown that stashes
  the first error so deadline misses aren't swallowed; panic recovery
  routes through the same `onPanic` callback for both Start'd and
  Add'd goroutines.
- New `selftel_test.go` + `lifecycle_test.go` — TDD-first, no
  `internal/telemetry` dep, OTel `sdkmetric.NewManualReader` for
  metric assertions. Covers noop safety, nil-MP error, errors_total +
  kind + component_id, scope-name pin, emissions monotonicity +
  negative-discard, init_errors_total, recordInitError nil-safety,
  Start/Shutdown happy path, Start-twice sentinel, idempotent
  Shutdown, panic callback fires, shutdown-deadline returns ctx err,
  Add() registers under same WaitGroup, Add'd panic fires callback,
  Add-before-Start no-op, Add-after-Shutdown no-op.
- Rewires `kernelevents.go`, `source.go`, `kmsg.go`, `journald.go`,
  `factory.go`, `export_test.go`, `nullsource_test.go`,
  `source_template.go.example`, README — drop the `internal/*`
  imports, switch to the local `selfTelemetry` / `lifecycle` /
  `kind` symbols. `FakeTelemetry.ErrorKinds()` now returns `[]string`
  (was `[]selftelemetry.Kind`) — only consumer is in-package tests.
- `runbook_kinds_test.go` AST gate rewritten to resolve local `kind`
  const decls (was: selector-walk for `selftelemetry.Kind...`);
  expected set unchanged so a future regression that drops a kind
  still fails the gate.
- `internal/runtime/lifecycle/lifecycle.go` doc comment updated:
  kernelevents removed from the "Used by" list; dcgm is the last
  in-tree user, noted alongside the deletion plan.
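For the AST gate item above, a hedged sketch of how local `kind` const decls could be resolved with `go/ast`. The sample source and the constant identifiers in it are hypothetical; only the `kind` type name and the string values `parse`, `downstream`, and `kmsg_oversized` are taken from the description. The real `runbook_kinds_test.go` walker may differ.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
	"strconv"
	"strings"
)

// sampleSrc is a hypothetical stand-in for a receiver's selftel.go.
const sampleSrc = `package kernelevents

type kind string

const (
	kindParse         kind = "parse"
	kindDownstream    kind = "downstream"
	kindKmsgOversized kind = "kmsg_oversized"
)

const unrelated = "ignored"
`

// kindValues returns the sorted string values of every constant declared
// with the explicit type name typeName.
func kindValues(src, typeName string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "selftel.go", src, 0)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, decl := range f.Decls {
		gd, ok := decl.(*ast.GenDecl)
		if !ok || gd.Tok != token.CONST {
			continue
		}
		for _, spec := range gd.Specs {
			vs := spec.(*ast.ValueSpec)
			id, ok := vs.Type.(*ast.Ident)
			if !ok || id.Name != typeName {
				continue // untyped or differently typed constant
			}
			for _, v := range vs.Values {
				lit, ok := v.(*ast.BasicLit)
				if !ok || lit.Kind != token.STRING {
					continue
				}
				s, err := strconv.Unquote(lit.Value)
				if err != nil {
					return nil, err
				}
				out = append(out, s)
			}
		}
	}
	sort.Strings(out)
	return out, nil
}

// demoKinds joins the resolved values so they are easy to compare.
func demoKinds() string {
	vals, err := kindValues(sampleSrc, "kind")
	if err != nil {
		return "error: " + err.Error()
	}
	return strings.Join(vals, ",")
}

func main() {
	fmt.Println(demoKinds())
}
```

A gate built this way compares the resolved set against a pinned expected set, so dropping a kind from the const block fails the test.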

## Root cause

Not a bug fix — this is a structural migration to unblock PR-F.
kernelevents-specific requirement: multi-source receiver needs `Add()`
preserved across the migration (PR-B1's pattern dropped it for the
single-source case). No workaround; full port.

## Test plan

- [x] `make check` — 0 issues (golangci-lint + vet + mod verify)
- [x] `go test ./...` — entire repo green
- [x] `go test -race -run
"^(TestSelfTelemetry|TestLifecycle|TestRecordInit|TestSourceInterface|TestKmsgSource)"
./components/receivers/kernelevents/` — race-clean
- [x] `TestRUNBOOK_KindsMatchEmitted_PinsExpectedSet` still pins the
same kind set after the AST walker rewrite
- [x] Verified `kernelevents.FakeTelemetry` + helpers have no
out-of-package consumers

### Known pre-existing flake (NOT caused by this PR)

`TestReceiver_SLIBudget` (bench_test.go) flakes when p99 emit latency
exceeds 5ms — reproduced on clean `main` at 12.9ms p99 without these
changes. Tracking separately; not blocking.

## Release notes

```release-notes
[CHANGE] kernelevents: receiver self-telemetry + lifecycle helpers
moved from internal/selftelemetry + internal/runtime/lifecycle to
package-local siblings (RFC-0013 PR-B2). Metric names, label shape,
and observable behavior are unchanged. The OTel instrumentation
scope name moves from
`github.com/tracecoreai/tracecore/internal/selftelemetry` to
`github.com/tracecoreai/tracecore/components/receivers/kernelevents`,
matching OTel convention; dashboards that pinned the old scope name
need updating.
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

Port `components/exporters/stdoutexporter` off `internal/selftelemetry`.
This is the exporter-side half of RFC-0013 §migration PR-B — sibling to
the receiver-side PR-B1 (#184) that ported
`components/receivers/nccl_fr`.

Same shape as PR-B1: component-scoped self-telemetry helpers travel as
siblings (`selftel.go` + `selftel_test.go`) co-located with the
component, scope-name = the component's Go import path, metric names +
label shape preserved bit-for-bit so dashboards / alerts do not regress.

## Root cause

Not a bug — a planned dependency severance.

v0.1.x stdoutexporter imports `internal/selftelemetry` for the
`Exporter` interface, the `NewNoopExporter` / `NewExporter`
constructors, the `Kind` type, and `RecordInitError`. RFC-0013 PR-F
deletes `internal/selftelemetry` entirely (the runtime that consumes it
dies in the same wave). Every component carrying that import has to
migrate to a sibling helper before PR-F can land. PR-B1 did the
receiver; this PR does the exporter.

## Changes

- `components/exporters/stdoutexporter/selftel.go` (new): sibling
`selfExporter` interface + `selfExporterImpl` + `recordInitError`. Emits
`tracecore.exporter.calls_total{result,kind,component_id}` on a meter
acquired from `set.Telemetry.MeterProvider`. Scope =
`github.com/tracecoreai/tracecore/components/exporters/stdoutexporter`
(OTel convention).
- `components/exporters/stdoutexporter/selftel_test.go` (new): 7 tests —
noop safety, nil-MP error sentinel, calls_total emission +
`{result,kind,component_id}` label shape, scope-name standard,
`init_errors_total` tick + nil-MP safety, factory noop-fallback wiring
with synthetic register failure, and a compile-time sibling-types guard.
- `components/exporters/stdoutexporter/stdoutexporter.go` (rewired): use
sibling types; drop `internal/selftelemetry` import; drop
`(*stdoutExporter).SelfExporter()`.

## ExporterCarrier removal — rationale

PR-B1 (nccl_fr receiver) didn't expose any selftelemetry types to the
runtime — it was a clean port. stdoutexporter is different: v0.1.x
exposed `SelfExporter() selftelemetry.Exporter` so
`cmd/tracecore/collect.collectFailureRateReaders` could feed
`tracecore.exporter.failure_rate`. This PR drops that contract:

- The runtime's reader-collection path silently skips components that
don't implement `ExporterCarrier` — documented "no per-exporter signal"
degraded mode.
- stdoutexporter is the canonical debug / example exporter (writes JSON
lines to stdout). Operators don't alert on its failure_rate. Real
backends in `components/exporters/otlphttp` carry that contract instead.
- `tracecore_exporter_failure_rate` still surfaces in scrape via the SLO
observable gauge (reports 0 with no readers registered) — the M2
acceptance check in `cmd/tracecore/integration_telemetry_test.go` still
passes.
- `tracecore_exporter_calls_total` continues to surface because the
sibling impl emits it on `set.Telemetry.MeterProvider` directly, just
under a new scope.
- PR-F deletes the entire `internal/selftelemetry` package, so the
`ExporterCarrier` contract evaporates regardless.

Rationale documented inline in `stdoutexporter.go` and the
`selfExporter` docstring so a six-months-cold reader knows why the
carrier is missing.
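The "silently skips components that don't implement `ExporterCarrier`" behavior is the usual Go optional-interface pattern. A minimal sketch follows; the method name `FailureRateReader`, the `failureRateReader` type, and both exporter structs are hypothetical, since the PR text does not show the real signatures.

```go
package main

import "fmt"

// failureRateReader is a hypothetical reader shape.
type failureRateReader func() float64

// exporterCarrier is the optional contract a component may implement.
type exporterCarrier interface {
	FailureRateReader() failureRateReader
}

type component interface{ Name() string }

// debugExporter implements only component, like the ported stdoutexporter.
type debugExporter struct{}

func (debugExporter) Name() string { return "stdoutexporter" }

// backendExporter also opts in to the carrier contract.
type backendExporter struct{ rate float64 }

func (backendExporter) Name() string { return "otlphttp" }

func (b backendExporter) FailureRateReader() failureRateReader {
	return func() float64 { return b.rate }
}

// collectReaders keeps readers from components that opt in and skips the
// rest without reporting an error.
func collectReaders(cs []component) map[string]failureRateReader {
	out := make(map[string]failureRateReader)
	for _, c := range cs {
		if carrier, ok := c.(exporterCarrier); ok {
			out[c.Name()] = carrier.FailureRateReader()
		}
	}
	return out
}

func demoComponents() []component {
	return []component{debugExporter{}, backendExporter{rate: 0.25}}
}

// demoCollected returns how many readers were registered.
func demoCollected() int { return len(collectReaders(demoComponents())) }

// demoRate reads the one registered failure rate.
func demoRate() float64 { return collectReaders(demoComponents())["otlphttp"]() }

func main() {
	fmt.Println(demoCollected(), demoRate())
}
```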

## Test plan

- [x] `make check` (gofumpt, golangci-lint, go vet, go mod verify) —
green
- [x] `go test ./components/exporters/stdoutexporter/... -count=1` — 13
tests pass (7 new selftel + 6 existing)
- [x] `go test -race ./components/exporters/stdoutexporter/... -count=1`
— green
- [x] `go test ./cmd/tracecore/ -run
TestIntegration_TelemetrySurface_EndToEnd -count=1` — passes (asserts
both `tracecore_exporter_calls_total` and
`tracecore_exporter_failure_rate` appear in scrape)
- [x] `go test ./... -count=1 -short` — full repo green; no name
regression in `cmd/tracecore`, `internal/telemetry`, or
`internal/selftelemetry` (which still has its own tests until PR-F)
- [x] Scope-name pin: assertion against
`github.com/tracecoreai/tracecore/components/exporters/stdoutexporter`
in selftel_test.go
- [x] Sibling-types compile-time guard: `asSelfExporter` helper forces
compile break if the type ever moves back to `internal/selftelemetry`

## Pattern continuity

References `#184` (PR-B1, nccl_fr) for the receiver sibling pattern.
Test helpers (`newTestMeterProvider`, `collectRM`, `scopeOf`, `kvMatch`,
`dumpNames`, `failingExporterMP`) mirror the nccl_fr equivalents so an
M8+ exporter author can read either pair and infer the convention.

## Release notes

NONE: this is a pivot-internal port. Metric names + labels are
unchanged on the scrape surface. Removing `SelfExporter()` (an exported
method, but on the unexported `stdoutExporter` type) is not observable to
any v0.1.x operator: only `cmd/tracecore`'s reader-collection path read
it, and that path's "skip on absent" branch is documented behavior.



```release-notes
NONE
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

Mirrors PR-B1's nccl_fr sibling pattern (#184) for
`components/receivers/dcgm`. Adds two package-local files — `selftel.go`
and `lifecycle.go` — and rewires the receiver + every test off
`internal/selftelemetry` and `internal/runtime/lifecycle`. Both internal
packages are slated for deletion in RFC-0013 PR-F; this PR removes dcgm
as a blocker.

- `selftel.go` owns the `Kind` type, the canonical `Kind*` const block,
the receiver-local `kindWatch` / `kindMIG`, the `selfTelemetry`
interface + OTel-backed impl, and `recordInitError`. Scope name pinned
to the receiver's Go import path per the PR-B1 standard.
- `lifecycle.go` is a single-source slim variant of the internal helper
(no `Add()` since dcgm doesn't run auxiliary watchers). Exports
`ErrAlreadyStarted` so the double-Start contract test continues to
compile via `errors.Is`.
- The dcgm-specific bits PR-B1 didn't cover: receiver-local Kind
constants layered on the canonical set (same const block, no separate
import), explicit `SetDegraded(true|false)` transition path exercised by
`degraded_seconds_total` accumulation across cycles, and the
`WithSelfTelemetry` test seam that now accepts the package-local
interface.

## Why port before the planned delete

dcgm is scheduled for removal in PR-F. The port still earns its keep
because:

- It collapses the PR-F critical path from two PRs (dcgm-delete →
internal-pkg-delete) to one (PR-F can land both atomically once dcgm is
no longer a dependent).
- The work is a mechanical mirror of PR-B1 — `selftel.go` +
`lifecycle.go` are near-line-for-line copies with the dcgm-specific
delta isolated to the const block and the SetDegraded contract.
- Per `feedback_no_bloat.md`, the only justification for adding code
that's about to be deleted is "it unblocks a delete." That's exactly
this PR's posture.

## Root cause

RFC-0013's PR-B-series unblocks PR-F by decoupling every `components/*`
receiver from `internal/*` packages that should not have receiver-level
dependents. Each receiver owns its own self-telemetry surface +
lifecycle helper; `internal/selftelemetry` +
`internal/runtime/lifecycle` become deletable in PR-F. Not a workaround
— the port is the actual fix.

## TDD coverage added in `selftel_test.go`

- Noop safety across every canonical Kind + the receiver-local
`kindWatch` / `kindMIG`.
- `errNilMeterProvider` sentinel returned (not noop-substituted) when MP
is nil.
- `tracecore.receiver.errors_total` partitioned by `kind` with
`component_id` label, for both canonical kinds AND local kinds (separate
test).
- `tracecore.receiver.emissions_total` monotonic + drops negative
values.
- OTel scope name pinned to
`github.com/tracecoreai/tracecore/components/receivers/dcgm`.
- `degraded_seconds_total` accumulates across multiple degrade→recover
cycles (the dcgm-specific contract).
- `init_errors_total` ticks with `kind=receiver` /
`reason=instrument_register`; nil MP is silently safe.
- AST-walk parity test that asserts `docs_parity_test.go`'s
hand-maintained `canonicalKindByName` matches the const block in
`selftel.go` — so a future `KindFoo` addition without a paired map entry
fails at the moment of drift.
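The `degraded_seconds_total` contract (accumulate across several degrade→recover cycles, count nothing while healthy) can be sketched with an injected clock. `degradedTracker` and `demoCycles` are hypothetical names, the sketch is not goroutine-safe, and the real implementation reports through an OTel instrument rather than a `Seconds()` method.

```go
package main

import (
	"fmt"
	"time"
)

// degradedTracker accumulates time spent degraded. The clock is injected
// so the example is deterministic.
type degradedTracker struct {
	now      func() time.Time
	degraded bool
	since    time.Time
	total    time.Duration
}

// SetDegraded records a transition; repeating the current state is a no-op.
func (d *degradedTracker) SetDegraded(v bool) {
	if v == d.degraded {
		return
	}
	t := d.now()
	if v {
		d.since = t
	} else {
		d.total += t.Sub(d.since)
	}
	d.degraded = v
}

// Seconds returns the accumulated total, including an open interval.
func (d *degradedTracker) Seconds() float64 {
	total := d.total
	if d.degraded {
		total += d.now().Sub(d.since)
	}
	return total.Seconds()
}

// demoCycles runs two degraded intervals (3s and 2s) separated by 10s of
// healthy time and returns the accumulated seconds.
func demoCycles() float64 {
	clock := time.Unix(0, 0)
	d := &degradedTracker{now: func() time.Time { return clock }}
	advance := func(s int) { clock = clock.Add(time.Duration(s) * time.Second) }

	d.SetDegraded(true)
	advance(3)
	d.SetDegraded(false)
	advance(10) // healthy time is not counted
	d.SetDegraded(true)
	advance(2)
	d.SetDegraded(false)
	return d.Seconds()
}

func main() {
	fmt.Println(demoCycles())
}
```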

## Test plan

- [x] `make check` (gofmt + tidy-check + golangci-lint + go vet +
mod-verify) → clean.
- [x] `go test ./...` → clean.
- [x] `go test -race ./components/receivers/dcgm/...` → clean.
- [x] `go build ./components/receivers/dcgm/...` → clean.
- [x] New parity test `TestCanonicalKindByName_CoversConstBlock`
self-validates against the const block.
- [x] `TestRUNBOOK_KindsMatchEmitted` continues to pass — the AST walker
resolves both the local-kind idents (`kindWatch`, `kindMIG`) and the
now-bare canonical `KindFoo` idents.
- [x] `TestReceiver_UsesLifecycleHelper` updated to assert the new
`*lifecycle` field-type shape (Ident inside StarExpr) — a revert to the
deleted `internal/runtime/lifecycle.Lifecycle` would fail here.
- [x] `TestReceiver_M2WiringFromMeterProvider` +
`TestReceiver_RecordsInitErrorWhenConstructorFails` continue to exercise
the end-to-end Prometheus scrape path against the real
`internal/telemetry` MeterProvider — noop-fallback observability
preserved.

## Release notes

```release-notes
NONE: internal refactor only. Receiver behavior + metric names + label
shape unchanged. Operators see no scrape-side diff.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
…189)

## Summary

RFC-0013 PR-A2 (sequencing gate for PR-B2 / PR-F / PR-I): retire the
hand-wired `./cmd/tracecore` entry point and adopt the OpenTelemetry
Collector Builder (OCB) output at `./_build/tracecore` as the canonical
binary. After this lands, all receivers register through OCB's generated
`otelcol.Factories` instead of the bespoke
`cmd/tracecore/components.go`.

Big diff (-3,869 / +388 across 53 files) because PR-A2 is the
load-bearing pivot point; PR-B2 / PR-F / PR-I that follow can land
surgically.

```release-notes
CLI surface change — `tracecore` now uses the upstream OCB-generated CLI:
- `tracecore collect --config=…` → `tracecore --config=…`
  (collect was default; OCB main runs the collector by default)
- `tracecore receivers list` → `tracecore components`
  (now shows receivers + processors + exporters + extensions + connectors)
- `tracecore debug dump` → removed (no OCB equivalent; use `tracecore
  components` + the live config when filing issues)
- `tracecore failure-inject {nccl-hang,pod-evict}` → use the standalone
  `tools/failure-inject` binary (already ships xid, nccl-hang,
  pod-evict, cpu-steal subcommands)
- `--log.format=text` / `--shutdown.drain-budget=…` / `--version-short`
  → removed; OCB upstream uses `--feature-gates` + `--set` flags

Chart-shape changes — operator-visible:
- Default pipeline flips from clockreceiver→stdoutexporter (in-tree,
  not registered by OCB) to hostmetrics→debug (both upstream
  OCB-bundled). Fresh install on a no-GPU cluster boots and emits
  load-average metrics immediately, same as before.
- `telemetry.listen` + `telemetry.paths.{metrics,healthz,readyz}` →
  `telemetry.metricsListen` + `telemetry.healthListen` +
  `telemetry.healthPath`. The legacy single-listener block is gone
  because upstream `service.telemetry` and `healthcheckextension` are
  two separate components, each with its own listener. Probes hit
  `:13133/` (healthcheckextension default) instead of `:8888/healthz`.
  Prometheus scrape port (8888) is unchanged.
- Self-telemetry metric names rename `tracecore_*` → `otelcol_*`
  (upstream vocabulary). Dashboards on any `tracecore_receiver_*`,
  `tracecore_exporter_*`, `tracecore_queue_*`, `tracecore_component_*`,
  or `tracecore_build_info` must be rewritten — see
  `docs/migration/v0.1-to-v0.2.md` for the exact map.
```

## Breaking changes — orphan components (PR-A2 → PR-J/K bridge)

The OCB-assembled binary registers only the components in
`builder-config.yaml`: 6 receivers, 4 exporters, 3 extensions,
4 processors. The chart's per-component toggles for the legacy in-tree
set survive this PR so the values shape doesn't break for operators
that pin them, but **enabling any of the following in chart values
will cause the pod to fail at startup with an "unknown factory" error
until PR-B2 / PR-J / PR-K rewire them**:

| Component | Kind | Replacement (planned) |
|---|---|---|
| `clockreceiver` | receiver | `hostmetrics` (PR-E shipped; now default) |
| `containerstdout` | receiver | `filelogreceiver` + container stanza + `file_storage` extension (PR-J) |
| `dcgm` | receiver | `dcgm-exporter` DaemonSet + `prometheusreceiver` (PR-J) |
| `k8sevents` | receiver | `k8sobjectsreceiver` + OTTL `k8s.event.hint` transform (PR-J) |
| `kernelevents` | receiver | `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL Xid transform (PR-J) |
| `nccl_fr` | receiver | In-repo Go submodule via OCB `gomod:` (PR-B2 + PR-I) |
| `pyspy` | receiver | Deferred until OTel Profiles GA |
| `stdoutexporter` | exporter | `debug` (OCB-bundled; now default) |
| `otlphttp` (in-tree clone) | exporter | `otlphttpexporter` (OCB-bundled; same `otlphttp` name in chart values, same field shape: `endpoint`, `compression`, `headers`, `tls.*`, `timeout`, `retry_on_failure`, `sending_queue`; pass-through render so any upstream field works) |

To verify what's actually registered in the binary you're running:
`./_build/tracecore components`.

The chart's `NOTES.txt` surfaces a WARNING when an operator enables
any of these, and the chart-render CI workflow now runs `tracecore
validate` against the default + one-receiver-on fixtures so a chart
edit that emits a non-OCB key trips CI before reaching `helm install`.

## What landed

### Deletions (3,032 LOC across 22 source + 7 test files)

- `cmd/tracecore/` entire tree: `main.go`, `collect.go`, `validate.go`,
`debug.go`, `receivers.go`, `signals.go`, `failure_inject.go`,
`openflags_{linux,other}.go`,
`receiver_variants{,_dcgm_cgo,_dcgm_stub}.go`, `components.go`, + every
`_test.go` (`collect_test`, `debug_test`,
`failure_inject{,_linux}_test`, `integration_test`,
`integration_telemetry_test`, `main_test`, `receivers_test`)
- `components.yaml` + `tools/components-gen/{main,main_test}.go` —
superseded by `builder-config.yaml`; OCB owns codegen now
- `components/receivers/kernelevents/runbook_test.go` — depended on the
deleted `tracecore debug dump` subcommand; kernelevents itself is
scheduled for deletion in PR-K

### Build path swap

| File | Change |
|---|---|
| `Makefile` | `build` target now runs OCB (was: legacy `go build ./cmd/tracecore`); dropped `generate`, `generate-check`, `run`, legacy `-ldflags -X` version injection; `coverage` `-coverpkg` drops `./cmd/...` |
| `install/kubernetes/tracecore/Dockerfile` | Builds via OCB (`make build` then copy `_build/tracecore`) |
| `install/kubernetes/tracecore/templates/daemonset.yaml` | `args` drops the `collect` subcommand; probes hit the new `health` port (13133) at `healthPath` |
| `install/kubernetes/tracecore/templates/_helpers.tpl` | `renderedConfig` emits upstream OTel shape (`service.telemetry.metrics.address` + `extensions.health_check` + `service.extensions: [health_check]`) instead of the legacy single-listener `telemetry:` top-level block |
| `install/kubernetes/tracecore/values.yaml` | Default pipeline flipped to `hostmetrics → debug`; in-tree-only toggles kept with explicit doc-comments naming PR-J/K as their migration owner |
| `.goreleaser.yaml` | Switched to `builder: prebuilt` against `./_build/{Os}-{Arch}/tracecore`; release.yml gains a per-platform pre-build step |
| `.ko.yaml` | Builds from inside `./_build/` (the OCB submodule) via `KO_CONFIG_PATH=../.ko.yaml`; `main: .` |
| `.github/workflows/ci.yml` | `package` job runs OCB; old `build-ocb` drift gate replaced by `smoke-test-binary` job consuming the package artefact |
| `.github/workflows/release.yml` | New "Pre-build OCB binaries" step before goreleaser; ko-publish step runs from `cd ./_build/` so OCB submodule resolves |
| `.github/workflows/chart.yml` | Restored the `tracecore validate` gate against the default + one-receiver-on chart renders; path triggers swap `cmd/tracecore/**` → `builder-config.yaml` |
| `.github/workflows/install-bench.yml` | Path triggers swap `cmd/tracecore/**` → `builder-config.yaml` |
| `builder-config.yaml` | `dist.version` bumped to `0.1.0-m9-alpha` to match `Chart.yaml` `appVersion`; `chart-appversion-check.sh` now reads it as source of truth |
| `scripts/chart-appversion-check.sh` | Read `dist.version` from `builder-config.yaml` (was: `internal/version/version.go`) |
| `scripts/smoke.sh` | Rewritten for OCB binary: hostmetrics → debug config, expects upstream lifecycle log lines |
| `scripts/validator-recipe.sh` | `BIN` default now `./_build/tracecore` |
| `scripts/{doc-check,no-autoupdate-check}.sh` | Drop `cmd` from scan paths |

### New integration seam

- `internal/integration/ocb_scrape_test.go`: spawns `_build/tracecore`
against a hostmetrics → debug config, polls the upstream `:NNNN/metrics`
surface, asserts both `otelcol_process_uptime` and
`otelcol_receiver_accepted_metric_points` are present, then SIGTERMs the
subprocess and asserts clean exit. The test skips when
`_build/tracecore` is absent so a fresh `git clone` + `go test ./...`
stays green; `make build` is the prereq. This is the regression gate for
the chart's operator-facing self-telemetry contract (RFC-0013 §3): if a
future upstream OCB release renames either metric, the chart's
`service.telemetry.metrics.address` advertisement breaks downstream
dashboards silently — this test fires first.
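The assertion step of that scrape test can be sketched as follows. This is a minimal illustration, not the code in `ocb_scrape_test.go`: the helper name `missingMetrics` is hypothetical, and it assumes the metrics surface returns Prometheus text exposition and that a prefix match on the sample name is an acceptable presence check (upstream may append unit or `_total` suffixes).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics (hypothetical helper) returns every name in want that has no
// sample line in a Prometheus text-exposition body. Comment lines such as
// "# HELP" and "# TYPE" are skipped, so a metric that is only declared but
// never sampled still counts as missing.
func missingMetrics(body string, want []string) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The sample name ends at the first '{' (labels) or space (value).
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		names = append(names, name)
	}

	var missing []string
	for _, w := range want {
		found := false
		for _, n := range names {
			// Prefix match is an assumption; see the lead-in.
			if strings.HasPrefix(n, w) {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	// Illustrative scrape body; real output carries many more series.
	body := "# TYPE otelcol_process_uptime counter\n" +
		"otelcol_process_uptime{service_name=\"tracecore\"} 1.2\n" +
		"otelcol_receiver_accepted_metric_points{receiver=\"hostmetrics\"} 42\n"

	want := []string{
		"otelcol_process_uptime",
		"otelcol_receiver_accepted_metric_points",
	}
	fmt.Println(missingMetrics(body, want)) // prints "[]" when both are present
}
```

Returning the missing names rather than a boolean lets the test failure message say which metric upstream renamed.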

### Integration recipe migration

- `docs/integrations/examples/honeycomb.yaml` + `otel-backend.yaml`:
`clockreceiver` → `hostmetrics` loadscraper (the OCB-supported
equivalent per RFC-0013 PR-E). The other two recipes carry
`pending-rfc-0013-pr-a` markers and are still skipped by
`validator-recipe.sh`.

### Doc rot fixes

- `docs/FAILURE-MODES.md`: rerouted 8 entries that referenced deleted
in-tree tests to their upstream OCB owners.
- `docs/FLAKY-TESTS.md`: moved the two in-tree integration flakes to
Resolved.
- `STYLE.md`: rewrote repo-layout + component-registration + CLI +
build-release sections around OCB.
- `PRINCIPLES.md`: dropped concrete-example reference to deleted file.
- `install/kubernetes/tracecore/README.md`: ko local-build steps now run
from inside `./_build/`.
- `docs/rfcs/0013-distro-first-pivot.md`: PR-A2 entry rewritten as
landed.
- `docs/migration/v0.1-to-v0.2.md`: added rows for the self-telemetry
metric-name rename (`tracecore_*` → `otelcol_*`) and the `telemetry.*`
chart values key rename.

## Sequencing constraints honored

- `components/receivers/{clockreceiver, containerstdout, dcgm,
k8sevents, kernelevents, nccl_fr, pyspy}` survive as
orphan-but-compiling code until PR-K deletes them along with the
chart-fixture migration.
- `internal/{pipeline, selftelemetry, telemetry, componentstatus,
pipelinebuilder, consumer, fanout, runtime}` survive until PR-F (after
PR-B1 lifted nccl_fr off `internal/selftelemetry` in #184).
- `tracecore validate` gate in chart workflow is **restored** on the
default + one-receiver-on fixtures (was temporarily disabled in the
first push of this PR; the chart's renderedConfig template migration was
completed in the same PR).

## Sequencing gate satisfied

PR-B2 (`nccl_fr` import swap to upstream OCB types), PR-F (`internal/*`
deletion), and PR-I (Go submodule extraction) are unblocked. The legacy
boot path is gone; the OCB-driven boot path is live.

## Test plan

- [x] `make build` → produces `./_build/tracecore` binary via OCB
- [x] `./_build/tracecore --version` reports `0.1.0-m9-alpha` matching
`Chart.yaml`
- [x] `./_build/tracecore components` lists 6 receivers + 4 exporters +
3 extensions + 4 processors (the `builder-config.yaml` inventory)
- [x] `./_build/tracecore validate --config=<rendered chart default>`
exits 0
- [x] `./_build/tracecore validate --config=<rendered chart
one-receiver-on fixture>` exits 0
- [x] `make smoke` passes (hostmetrics → debug, 1.5s window, clean
shutdown)
- [x] `make check` passes
- [x] `go test ./internal/integration/...` passes (new OCB scrape test;
~1.2s)
- [x] `helm lint install/kubernetes/tracecore` clean (1 chart, 0 failed;
icon advisory only)
- [x] `helm template demo install/kubernetes/tracecore --show-only
templates/daemonset.yaml` renders `args: [--config=…]` + two ports
(`telemetry`, `health`) + probes hitting `health` port at `/`
- [x] `helm template demo install/kubernetes/tracecore --show-only
templates/configmap.yaml | yq '.data["config.yaml"]'` renders upstream
OTel shape: `service.telemetry.metrics.address`,
`extensions.health_check`, `service.extensions: [health_check]`
- [x] `conftest test --policy
install/kubernetes/tracecore/policies/conftest/tracecore.rego
/tmp/chart-render.yaml` — 51 tests pass
- [ ] CI: `package` job builds amd64 + arm64 OCB binaries
- [ ] CI: `smoke-test-binary` job runs `--version` + `components` on the
package artefact
- [ ] CI: `chart` workflow lints + templates + validate + yq + conftest
pass
- [ ] CI: `install-bench` workflow's kind-cluster install still rolls
out (bench values updated to drop the obsolete `stdoutexporter`
reference and align `otlphttp.endpoint` with the upstream
`otlphttpexporter` schema — pass-through render so all upstream fields
work without chart changes)

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 31, 2026
…block) (#196)

## Summary

RFC-0013 PR-F unblock: ports `components/receivers/k8sevents` off
`internal/selftelemetry` + `internal/runtime/lifecycle` to package-local
sibling types, mirroring PR-B1 (#184, nccl_fr) and PR-B2 (#187,
kernelevents). One more receiver off the internal helpers; clears
another blocker for PR-F's deletion of those internal packages.

## Why

`internal/selftelemetry` and `internal/runtime/lifecycle` are slated
for deletion in RFC-0013 PR-F. Every in-tree user must migrate first.
PR-B1 landed the pattern (`nccl_fr`); PR-B2 applied it to
`kernelevents`; this PR applies it to `k8sevents`. Receivers landing
post-PR-F will own their selftel + lifecycle in their own submodule per
OTel convention — this PR puts k8sevents on that footing.

## Pattern choice (multi-source, keeps `Add()`)

k8sevents is multi-source (Events informer + Node informer driver
goroutines both join the receiver's lifecycle), so this PR follows
the PR-B2 kernelevents sibling rather than the slimmer PR-B1 nccl_fr
sibling — keeps `lifecycle.Add()` (with its post-Shutdown and
pre-Start refusal guards, plus the mutex-guarded TOCTOU-safe
`Add`/`Shutdown` interleave) so the SharedInformerFactory driver
goroutine joins the same WaitGroup as the run-loop goroutine.

## What changed

- **New** `components/receivers/k8sevents/selftel.go` — local
  `selfTelemetry` interface + noop + OTel-backed implementations +
  `recordInitError` fallback ticker. Instrumentation scope name pinned
  to k8sevents' Go import path per OTel convention (PR-I will move it
  to the submodule path when k8sevents goes external). Metric names
  (`tracecore.receiver.errors_total{kind,component_id}` and siblings)
  preserved so dashboards/alerts don't regress.
- **New** `components/receivers/k8sevents/lifecycle.go` — local
  `lifecycle` helper (Start + Add + Shutdown + panic-recovery +
  TOCTOU-safe mutex-guarded `closed`/`Add` interleave).
- **New** `selftel_test.go` + `lifecycle_test.go` — TDD-first, no
  `internal/telemetry` dep. Pin: noop safety across every kind,
  nil-MeterProvider error, errors_total kind+component_id labels,
  scope-name = receiver Go import path, init_errors_total tick,
  factory noop fallback, Start/Shutdown/Add contracts, Add
  refusal-modes, TOCTOU concurrent Add-during-Shutdown stress.
- `factory.go`, `receiver.go` — drop internal imports, wire local
  `selfTelemetry` + `lifecycle`. `r.lc` retyped from
  `*lifecycle.Lifecycle` to `*lifecycle`; `r.telemetry` retyped from
  `selftelemetry.Receiver` to `selfTelemetry`. Exported `KindWatch`
  / `KindBackpressureDrop` / `KindNode*` constants retyped from
  `selftelemetry.Kind` to package-local `kind` alias (load-bearing
  for the `K8sEventsReceiverDegraded` alert label values, which stay
  byte-identical). Added re-exported `KindParse` / `KindDownstream`
  / `KindCardinality` / `KindPanic` so external `_test` packages can
  still partition assertions on canonical kinds without depending on
  the deleted internal package.
- `export_test.go` — test-only `recordingTel` replaced by exported
  `CapturingTelemetry` (mirrors `selftelemetry.CapturingReceiver`,
  trimmed to the k8sevents surface). External `_test` packages get
  a stable shim that survives the internal/selftelemetry deletion.
- `receiver_test.go`, `node_failure_modes_test.go` — drop
  `internal/selftelemetry` import; redirect through
  `k8sevents.CapturingTelemetry`. Per-test bodies otherwise
  unchanged (kept the `recordingTel` symbol as a thin wrapper).

## Verification

- `make check` clean (golangci-lint 0 issues, vet clean, mod-verify)
- `make verify` clean (incl. doc-check + alert-check + chart-check
  + no-autoupdate)
- `go test -race -count=1 ./...` — entire repo green (incl. k8sevents
  + kernelevents + every internal package, fixture, tool)
- `go test -race -count=1 ./components/receivers/k8sevents/...` —
  all green; lifecycle TOCTOU stress test passes deterministically
  under `-race`
- Goroutine leaks: pre-existing `TestReceiver_GoleakNoLeakAfterShutdown`
  continues to pass after the lifecycle rewire

## Out of scope (deferred)

- PR-F itself (deleting `internal/selftelemetry` +
  `internal/runtime/lifecycle`) is gated on the other 3 wave-1
  consumers (pyspy, otlphttp, containerstdout) landing their ports.
  Counting this PR (k8sevents), that makes four ports in flight.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
…k) (#194)

## Summary

Ports `components/receivers/pyspy` off `internal/selftelemetry` +
`internal/runtime/lifecycle` using the sibling co-location pattern
established by PR #184 (nccl_fr reference impl). `selftel.go` +
`lifecycle.go` now live next to the receiver source.

Unblocks **RFC-0013 PR-F** deletion of those internal moats.

## What changed

- `selftel.go` (new) — receiver-scoped `selfTelemetry` interface +
`selfTelemetryImpl` (OTel-backed) + `noopSelfTelemetry` +
`recordInitError`. Metric names + label shape preserved 1:1 with
`internal/selftelemetry`:
  - `tracecore.receiver.errors_total{kind,component_id}`
  - `tracecore.receiver.emissions_total{component_id}`
  - `tracecore.receiver.collection_latency_seconds{component_id}`
    (12-bucket boundaries, matches v0.1.x)
  - `tracecore.receiver.degraded_seconds_total{component_id}` (observable
    counter)
  - `tracecore.receiver.last_activity_unix_seconds{component_id}`
    (observable gauge)
  - `tracecore.selftelemetry.init_errors_total{kind,component_id,reason}`
    (factory-fallback signal)
- `lifecycle.go` (new) — slim single-source variant (no `Add()`). pyspy
fans its three goroutines (scan loop, trigger cadence driver, trigger
run loop) via its own `sync.WaitGroup` inside `runAll`, so lifecycle
only needs `Start` + `Shutdown`.
- `kinds.go` — `kind` is now a receiver-local type. `kindPanic` is
declared locally (replaces `selftelemetry.KindPanic`).
- `factory.go` — uses local `newSelfTelemetry` / `recordInitError` /
`reasonInstrumentRegister`.
- `pyspy.go` — fields/types switched to local sibling.
- `pyspy_test.go` — uses sibling-local `fakeTelemetry` captor (replaces
`selftelemetry.CapturingReceiver`).
- `selftel_test.go` (new, RED-first TDD) — 7 tests covering noop safety,
nil-MP, errors_total kind+component_id, scope-name pin,
init_errors_total, nil-MP safety on recordInitError, factory fallback
path.
- `fake_telemetry_test.go` (new) — test-only captor implementing the
`selfTelemetry` interface; mirrors the kernelevents `export_test.go`
pattern.

## Scope-name standard

Instrumentation scope pinned to the receiver's Go import path:

```
github.com/tracecoreai/tracecore/components/receivers/pyspy
```

When the receiver moves to `module/receiver/pyspyreceiver/` in PR-I.1,
the scope name moves with it (matches OTel convention).

## Lifecycle pattern decision

pyspy uses **single-source** lifecycle (like nccl_fr), NOT multi-source
(like kernelevents). The receiver's `runAll` callback owns a local
`sync.WaitGroup` and fans into three goroutines from within the single
`lifecycle.Start(ctx, r.runAll)` call. No `Add()` needed → dropped from
the slim sibling.

## Tests

```
$ go test ./components/receivers/pyspy/ -count=1
ok  	github.com/tracecoreai/tracecore/components/receivers/pyspy	0.735s

$ go test -race ./components/receivers/pyspy/ -count=1
ok  	github.com/tracecoreai/tracecore/components/receivers/pyspy	1.745s

$ make check
0 issues.
```

13 targeted tests pass (7 new selftel + 6 existing receiver lifecycle
tests).

## Wave context

pyspy is 1 of 4 remaining unported consumers of `internal/selftelemetry`
+ `internal/runtime/lifecycle`. Sibling agents (otlphttp /
containerstdout / k8sevents) are landing in parallel; this PR's scope is
the pyspy receiver only — zero file overlap with the parallel work.

## Test plan

- [x] `go test ./components/receivers/pyspy/ -count=1` — green
- [x] `go test -race ./components/receivers/pyspy/ -count=1` — green
- [x] `make check` — 0 lint issues, vet clean, modules verified
- [x] Scope name asserted at
`github.com/tracecoreai/tracecore/components/receivers/pyspy`
- [x] Factory fallback ticks `tracecore.selftelemetry.init_errors_total`
on instrument-register failure
- [x] Noop hot-path methods never panic; nil-MP rejected with sentinel
error
- [x] Existing receiver tests (Start_EmptyUDSDir, Start_NonExistent,
Start_TopLevelDisabled, Shutdown_HonorsContextBudget,
Shutdown_Idempotent) still pass against the sibling

```release-notes
NONE
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

- Port `components/receivers/nccl_fr` off the v0.1.x internal facades
(`internal/pipeline`, `internal/consumer`, `internal/runtime/lifecycle`)
onto upstream
`go.opentelemetry.io/collector/{component,receiver,consumer}`
  v1.59.0 — the canonical types the OCB-generated `_build/main.go`
  already consumes for all third-party receivers.
- Factory is now `receiver.NewFactory(componentType,
createDefaultConfig,
  receiver.WithLogs(createLogs, component.StabilityLevelBeta))` instead
of a hand-rolled struct implementing
`internal/pipeline.ReceiverFactory`.
  Stability level (`Beta`) preserved across the swap so OCB-surfaced
  metadata doesn't regress.
- Receiver struct renamed to `ncclFRReceiver` (avoids collision with the
  upstream `receiver` package name) and implements `receiver.Logs` via
  a `Start(ctx, component.Host) error` / `Shutdown(ctx) error` pair —
  the `pipeline.ComponentState` embed was dropped (upstream
  `component.Component` carries no equivalent mixin; the lifecycle
  bookkeeping the receiver actually needs lives in the sibling
  `lifecycle.go` helper added in PR-B1 #184).
- Logger swapped from `*slog.Logger` → upstream's `*zap.Logger`
  (the type carried in `component.TelemetrySettings.Logger`). All log
  call sites converted to `zap.String/Int64/Duration/Error` fields;
  log messages and fields are byte-for-byte preserved so operator
  alerting on log content does not regress.

## Hard gate

PR-I.1 (submodule extraction to `module/receiver/ncclfrreceiver/`)
requires
`grep -rE 'internal/(pipeline|consumer|runtime/lifecycle)' components/receivers/nccl_fr/`
to return zero hits. This PR clears it:

```
$ grep -rn 'internal/pipeline\|internal/consumer\|internal/runtime/lifecycle' components/receivers/nccl_fr/*.go
(no matches)
```

(Two comment-only mentions remain in `lifecycle.go` and `factory.go` —
historical context for the v0.1.x → v0.2.0 migration, not imports.)
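Because a text grep also matches comments, an import-aware variant of the gate is one way to separate the two cases. The sketch below is an assumption-laden illustration, not part of this PR: `forbiddenImports` is a hypothetical helper built on the standard `go/parser` package, and it inspects only import declarations.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// forbiddenImports (hypothetical helper) parses only the import block of a Go
// source file and returns each import path that ends in, or passes through,
// one of the forbidden path segments. Mentions in comments are never reported.
func forbiddenImports(filename, src string, forbidden []string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return nil, err
	}

	var hits []string
	for _, imp := range f.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return nil, err
		}
		for _, bad := range forbidden {
			// Match whole path segments so "internal/pipeline" does not
			// also flag "internal/pipelinebuilder".
			if strings.HasSuffix(path, "/"+bad) || strings.Contains(path, "/"+bad+"/") {
				hits = append(hits, path)
				break
			}
		}
	}
	return hits, nil
}

func main() {
	// Illustrative source: one comment mention, one real forbidden import.
	const src = `package nccl_fr

// Historical note: this package used to import internal/pipeline.
import (
	"context"

	"github.com/tracecoreai/tracecore/internal/consumer"
)
`
	hits, err := forbiddenImports("factory.go", src, []string{
		"internal/pipeline",
		"internal/consumer",
		"internal/runtime/lifecycle",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(hits) // [github.com/tracecoreai/tracecore/internal/consumer]
}
```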

## Predecessor

PR-B1 #184 (merged 2026-05-30) ported the **self-telemetry + lifecycle**
helpers into the package as siblings. This PR (PR-B2) handles the
**pipeline + consumer + factory** layer — the last remaining
`internal/*` imports.

## Test plan

- [x] `go build ./...` — green
- [x] `go test ./components/receivers/nccl_fr/... -race` — 12 tests pass
- [x] `go test ./...` — green except pre-existing flake in
      `components/receivers/kernelevents/TestReceiver_SLIBudget`
      (verified flaky on stashed/PR-B2-not-applied tree)
- [x] `make check` — golangci-lint + go vet + go mod verify — green
- [x] Scope-name pin (`TestSelfTelemetry_ScopeNameIsReceiverImportPath`)
still asserts
`github.com/tracecoreai/tracecore/components/receivers/nccl_fr`
- [x] Factory fallback contract
(`TestFactory_FallsBackToNoopWhenMeterFails`)
      still surfaces `tracecore.selftelemetry.init_errors_total` when
every `tracecore.receiver.*` instrument registration is synthetically
failed

## Compatibility note

`go.mod` now pins
`go.opentelemetry.io/collector/{component,receiver,consumer} v1.59.0`
— the v1.x stable line aligned with `pdata v1.59.0` the repo was already
on.
No transitive-dep churn beyond zap 1.24.0 → 1.28.0 (upstream component
v1.59.0 requires it).

```release-notes
NONE
```


## Type-swap reference

This PR is the first-of-kind upstream-API port; 7 more receiver/exporter
ports (PR-F.2 series — clockreceiver, kernelevents, stdoutexporter,
k8sevents, containerstdout, otlphttp, pyspy, dcgm) inherit this exact
mapping. Use this table as reference when porting those components off
`internal/pipeline` + `internal/consumer`.

| Internal | Upstream |
|---|---|
| `internal/pipeline.Type` | `component.Type` |
| `internal/pipeline.ReceiverFactory` | `receiver.Factory` |
| `internal/pipeline.CreateSettings` | `receiver.Settings` (via `receivertest.NewNopSettings` in tests) |
| `internal/pipeline.Config` | `component.Config` |
| `internal/pipeline.Receiver` | `receiver.Logs` (`= interface{ component.Component }`) |
| `internal/consumer.Logs` | `consumer.Logs` |
| `*slog.Logger` | `*zap.Logger` |
| `internal/pipeline.MustNewType` | `component.MustNewType` |
| `internal/pipeline.MustNewID` | `component.NewIDWithName` |

## Deep-review cleanup (post-aec83be)

A follow-up commit applies five reference-pattern fixes so PR-F.2
inherits
the cleanest possible template:

1. Deleted dead `var Factory` + indirection wrapper — `NewFactory()` now
   constructs the factory directly, mirroring upstream `otlpreceiver` /
`filelogreceiver`. The `tools/components-gen` driver that motivated the
   package-var was deleted in PR-A2 #168.
2. Fixed misleading comment claiming `receiver.Logs` carries a
   "LogsReceiver tag" — upstream defines `receiver.Logs` as
   `interface { component.Component }`; the type identity is a
   documentation marker only.
3. Switched tests to upstream
`receivertest.NewNopSettings(componentType())`
   so test Settings auto-track upstream field additions (`BuildInfo`,
   `TelemetrySettings`). A thin `testSettings()` wrapper pins the ID to
   `nccl_fr/test` so selftel label assertions stay deterministic.
4. Renamed unexported `ncclFRReceiver` → `ncclfrReceiver` per Go acronym
   convention (lowercased) and aligned with the planned PR-I.1b package
   name `ncclfrreceiver`.
5. Appended this Type-swap reference section.

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
#206)

## Summary

Deletes the three internal moats and the in-tree DCGM receiver that
RFC-0013 §migration step 8 promised — the payoff for the wave-3
sibling-port PRs (#184/#185/#186/#187/#188/#193/#194/#196/#197).

**Net: -12,482 LOC across 92 files (78 deletions, 14 modifications).**

### What deletes

| Path | LOC | Why safe now |
|---|---|---|
| `components/receivers/dcgm/` | 7,604 | cgo stub never shipped real code; #188's PR-B2-shaped dcgm sweep already removed the live port surface. |
| `pkg/dcgm/` | 922 | Only consumer was the deleted receiver. Bonus cleanup. |
| `internal/selftelemetry/` | 1,946 | Every consumer (containerstdout, clockreceiver, kernelevents, k8sevents, nccl_fr, dcgm, pyspy, stdoutexporter, otlphttp) ported onto receiver/exporter-scoped sibling `selftel.go` files. |
| `internal/telemetry/` | 1,991 | Probes flow through upstream `healthcheckextension`; MeterProvider via upstream `service.telemetry`. Only remaining consumers were `internal/selftelemetry/*_test.go` (deleted together) + one orphan clockreceiver test. |
| `components/receivers/clockreceiver/errors_integration_test.go` | 100 | Orphan from #185's PR-B1 clockreceiver port — bootstrapped via the deleted `selftelemetry.Receiver` interface but never migrated to the receiver-scoped sibling `selftel.go`. Covered behaviour ("errors_total surfaces on downstream failure") is now exercised through clockreceiver's sibling tests. |

### Pre-flight grep evidence (post-merge of origin/main)

```
$ grep -rn "tracecoreai/tracecore/internal/selftelemetry" --include="*.go" .
(zero matches)

$ grep -rn "tracecoreai/tracecore/internal/telemetry" --include="*.go" .
(zero matches)

$ grep -rn "tracecoreai/tracecore/components/receivers/dcgm" --include="*.go" .
$ grep -rn "tracecoreai/tracecore/pkg/dcgm" --include="*.go" .
(zero matches)
```

### Tooling

- Retire the `dcgm` build tag — `make build-tags` no longer vets `-tags
dcgm` (kept as a hook for future build-tag-gated paths).
- `make bench-check` loop drops both deleted package rows
(`internal/telemetry`, `components/receivers/dcgm`).
- `scripts/register-lint.sh` allowlist emptied (the two
`internal/telemetry/{build_info,slo}.go` entries are gone with the
package; allowlist comment notes the post-PR-F.1 state).
- `go.mod` direct deps shrink — `github.com/prometheus/client_golang`
and `go.opentelemetry.io/otel/exporters/prometheus` drop to indirect
(they were used by `internal/telemetry/server.go`).

### Chart toggles intentionally retained

Chart `receivers.dcgm` toggle + `templates/NOTES.txt` warning +
`templates/_helpers.tpl` doc-comment list keep the `dcgm` symbol for the
migration window. The toggle has been inert since PR-A2 — operators
enabling `receivers.dcgm.enabled=true` already crashed at boot because
the OCB binary doesn't register the factory. PR-K removes the toggle
entirely alongside the chart-default flip from `clockreceiver` →
`hostmetrics` and the v0.2.0 recipe migration.

### Doc sweep

- `internal/runtime/lifecycle/lifecycle.go` doc-comment: drop the dcgm
pointer; flag containerstdout as the sole remaining in-tree consumer;
reschedule the package itself for PR-F.2 deletion once containerstdout
ports off the helper or PR-K.2 deletes the receiver.
- `docs/FAILURE-MODES.md` self-tel-surface rows rewired from
`internal/telemetry/server_test.go::*` (deleted) to upstream-delegated
wording.
- `docs/patterns/{README,pattern-{1,3,4,5}}.md` replay-test pointers
updated — the in-tree `components/receivers/dcgm/pattern_replay_test.go`
is gone; pattern replay now flows through
`docs/integrations/prometheus-scrape.md` (PR-J's upstream
`dcgm-exporter` recipe).
- `docs/README.md` per-component table: drop the deleted
`internal/telemetry/{README,SECURITY}.md` rows + the deleted
`components/receivers/dcgm/{README,RUNBOOK}.md` rows.
- `STYLE.md` vendor-SDK section: drop the `pkg/dcgm/` reference + the
`//go:build dcgm` example; explicit cross-reference to PR-F.1 in the
integration-test build-tag note.
- `CHANGELOG.md`: PR-F.1 landed entry under Unreleased; "Remaining
v0.1.0 work" line updated to point at PR-F.2.
- `docs/rfcs/0013-distro-first-pivot.md` §migration step 8: PR-F entry
replaced with the PR-F.1/PR-F.2 split + the explicit rationale
(componentstatus travels with pipeline; pipeline is out of PR-F's scope
per line 240's original framing).

### Out of scope (PR-F.2 follow-up)

- `internal/componentstatus/` — 5-line `ReportStatus` free function.
Travels with `internal/pipeline` (its only non-test consumers are
`internal/pipeline/runtime_test.go` +
`internal/pipeline/pipelinetest/fixture_test.go`). Deletion lands when
pipeline migrates to upstream
`go.opentelemetry.io/collector/component/componentstatus`.

### Rationale links

- RFC-0013 §migration step 8 — the PR-F entry now codifies the F.1/F.2
split in this branch's RFC update.
- PR-B2 scope-discovery (#188) — established the "rename + slim, don't
reshape" pattern for the dcgm sweep that retired the cgo path.
- Wave-3 PRs that unblocked selftelemetry deletion: #184 (nccl_fr), #185
(clockreceiver), #186 (kernelevents), #187 (stdoutexporter), #188
(dcgm), #193 (otlphttp), #194 (pyspy), #196 (k8sevents), #197
(containerstdout).

```release-notes
[CHANGE] internal/{selftelemetry,telemetry} packages deleted; components/receivers/dcgm + pkg/dcgm deleted. Operators using the v0.1.x in-tree `tracecore.*` self-telemetry metric names migrate per docs/migration/v0.1-to-v0.2.md. Third-party importers of internal/* (unlikely pre-1.0) lose the `selftelemetry.{Receiver,Exporter}` interfaces and the `telemetry.MeterProvider` wrapper; receiver/exporter authors now wire a receiver-scoped sibling `selftel.go` per the PR-B1 pattern.
```

## Test plan

- [x] `make verify` (lint + vet + tidy-check + mod-verify +
license-check + generate-fixtures-check + build-tags + nccl-fr-rce-gate
+ register-lint + actionlint + zizmor + doc-check + no-autoupdate-check)
— exit 0.
- [x] `go test ./...` — all green (29 packages).
- [x] `make build` (OCB) — `./_build/tracecore` produced.
- [x] `./_build/tracecore --version` — `tracecore version
0.1.0-m9-alpha`.
- [x] Pre-flight greps for all four deleted paths — zero external
importers.
- [ ] CI green on PR (linux/race matrix, chart render, install-bench,
zizmor, govulncheck).
- [ ] Operator verification that the chart's `dcgm` toggle remains inert
post-merge (no behaviour change from main — already inert since PR-A2).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

Deletes the seven `internal/*` packages that RFC-0013 §migration step 8
PR-F.2 promised once the upstream-port wave
(#201/#202/#203/#204/#205/#207/#208/#209) cleared every external caller
of the in-tree pipeline runtime.

**Net: -6,888 LOC across 56 deleted files, +80 LOC across 14 modified
files. 70 files total.** This is the final cut of RFC-0013 §migration
step 8 PR-F.

## What deletes

| Path | LOC | Replacement |
|---|---|---|
| `internal/pipeline/` | 4,134 | `go.opentelemetry.io/collector/service` (OCB-generated `_build/main.go` consumes `builder-config.yaml`). |
| `internal/pipelinebuilder/` | 1,282 | Same — assembly is upstream `service`. |
| `internal/config/` | 718 | Upstream `confmap` providers (`file`, `yaml`, `env`). |
| `internal/consumer/` | 87 | Upstream `go.opentelemetry.io/collector/consumer`. |
| `internal/fanout/` | 366 | Upstream `internal/fanoutconsumer` (collector module). |
| `internal/componentstatus/` | 16 | Upstream `component/componentstatus.ReportStatus` (same free-function shape). |
| `internal/runtime/lifecycle/` | 505 | Per-receiver package-local `lifecycle.go` siblings — already ported during the PR-B1 wave (#184/#185/#186/#187/#194/#196/#197); the in-tree helper had no remaining non-test consumer after PR-F.1 + the wave-2 upstream-port PRs. `kernelevents/lifecycle.go` was inherited from k8sevents (#208). |

## Pre-flight grep evidence

```
$ grep -rn 'tracecoreai/tracecore/internal/(pipeline|consumer|pipelinebuilder|config|fanout|componentstatus|runtime/lifecycle)' --include='*.go' .
(zero matches)
```

## Tooling

- `.golangci.yml` `ignore-interface-regexps` repointed at upstream
`consumer.{Metrics,Traces,Logs}` + `component.Component`. The
in-tree-only same-package-error-wrap exemption stays — the STYLE rule
applies regardless of which interface is forwarded.
- `.github/workflows/chaos.yml` drops the `chaos-pipeline-test` job (the
in-tree `internal/pipeline/chaos_test.go` is gone; upstream `service`
provides the equivalent panic-recovery contract). `harness-determinism`
(failure-inject golden-SHA), `cpu-steal-mpstat`, `pattern-pod-evicted`
jobs preserved.
- `.github/workflows/install-bench.yml` drops the
`internal/{pipeline,runtime,selftelemetry}/**` path-filter rows.
- `go.mod` / `go.sum` unchanged.

## Doc sweep

- `CHANGELOG.md` Unreleased: PR-F.2 landed entry replacing the "PR-F.2
deferred" sentence; "Remaining v0.1.0 work" line updated; one dead
`internal/pipeline/README.md` link in Foundation block rewritten as
"deleted at v0.1.0".
- `docs/rfcs/0013-distro-first-pivot.md` §7 deletion table: both
pipeline-internals and runtime/lifecycle rows updated from "v0.1.0
(audit first…)" / "v0.2.0 (with last consumer)" to "v0.1.0 (landed
PR-F.2)". §migration step 8 reframed.
- `docs/FAILURE-MODES.md` Lifecycle / Data flow / Shutdown timing /
Backend tables rewired from in-tree
`internal/{config,pipeline,fanout}/*_test.go::TestName` pointers to
upstream-delegated wording matching the pattern PR-A2 established.
- `docs/STRATEGY.md` "Post-RFC-0013 status" intro updated; "Stable
interfaces in `internal/pipeline/`" graduation row rewritten to point at
the upstream surface.
- `docs/migration/v0.1-to-v0.2.md` `internal/*` section status banner
flipped from "deferred, still present in RC builds" to "landed, deleted
in v0.2.0 builds".
- `MILESTONES.md` v0.1.0 deletions row extended with boot-path
internals; M1 + M4b + M19 rubric details annotated with the PR-F.2
retirement.
- `README.md` Contributor row repointed at upstream
`go.opentelemetry.io/collector` package docs.
- `AGENTS.md` "Self-telemetry internals" bullet split into "Self-tel
internals" + "Pipeline / boot-path internals" with explicit deletion
status.
- `docs/README.md` table row for `internal/pipeline/README.md` dropped.
- `components/receivers/kernelevents/README.md` lifecycle-sibling
rationale updated to past-tense.
- `tools/failure-inject/README.md` "Testing locally" section drops the
`-tags=chaos ./internal/pipeline/...` invocation.

## Sequencing

This PR is hard-gated on every upstream-port PR landing first:

- #201 nccl_fr (PR-B2)
- #202 stdoutexporter
- #203 pyspy
- #204 k8sevents
- #205 clockreceiver (PR-B3)
- #207 otlphttp
- #208 kernelevents
- #209 containerstdout
- #206 PR-F.1 (selftel / telemetry / dcgm)

All nine merged before this PR opened; this is the moat-deletion payoff.
Remaining v0.1.0 work is PR-K (chart-default flip + `clockreceiver` +
`stdoutexporter` + remaining receiver source deletions, coupled with
test-fixture migration and the `telemetry:` values-key deprecation
cycle).

## Test plan

- [x] `make check` — golangci-lint 0 issues, go vet clean, go mod verify
ok.
- [x] `go build ./...` — clean.
- [x] `go test -count=1 ./...` — green (excluding the known
`kernelevents/TestReceiver_SLIBudget` flake called out in #205's body,
which only triggers under heavy parallel `go test ./...` load; passes
standalone).
- [x] `grep` confirms zero non-internal callers of the deleted packages.
- [x] Doc-check pre-push hook passes after the CHANGELOG dead-link fix.

```release-notes
[CHANGE] internal/{pipeline,pipelinebuilder,config,consumer,fanout,componentstatus,runtime/lifecycle} packages deleted. The OCB-generated boot path off builder-config.yaml replaces them. Third-party importers of internal/* (unlikely pre-1.0; the packages live under internal/ and the Go compiler rejects external imports) lose the pipeline-assembly + lifecycle + config-loader surfaces; receiver authors now wire against upstream go.opentelemetry.io/collector/{component,receiver,consumer,pipeline} directly. See docs/migration/v0.1-to-v0.2.md "internal/* package deletion".
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>