Merged
7 changes: 4 additions & 3 deletions CHANGELOG.md
@@ -6,18 +6,19 @@ User-visible changes are documented here. Format: [Keep a Changelog](https://kee

Pre-alpha. **Distribution-first pivot adopted ([RFC-0013](docs/rfcs/0013-distro-first-pivot.md))** - binary now assembled via the OpenTelemetry Collector Builder (OCB) from upstream + contrib components plus a thin in-repo Go submodule at `module/` (path `github.com/tracecoreai/tracecore/module`) containing only the moat (NCCL FlightRecorder receiver, OTTL processors with windowed semantics, pattern detectors). The M1 in-tree pipeline runtime + factory-based assembly is queued for deletion at v0.1.0 in favor of the OCB-generated boot path; the canonical `clockreceiver` + `stdoutexporter` examples ship for one PR cycle and then exit. Targeting v0.1.0 / v0.2.0 / v0.3.0 release boundaries per RFC-0013 §4.

Pivot landed across three waves of PRs:
Pivot landed across four waves of PRs:
- Wave 1 (#166 RFC doc accepted, #168 delete kueue + kineto receivers, #169 pre-PR-A drift sweep + Helm security tighten, #170 containerstdout deletion explicit in §7, #171 PR-A OCB skeleton + `builder-config.yaml` + `make build-ocb`, #172 dedup gate execution, #173 rename check tiers + add PR body-artifact guard, #174 PR-C release pipeline → goreleaser stack + RFC supersession + top-level doc alignment, #175 wave-1 self-review fixes + delete archive folder).
- Wave 2 (#176 PR-D image build → ko + `_build/` walker fix + PR-B reframe as side-effect of binary swap, #177 build-ocb CI gate, #178 post-wave-2 drift sweep, #179 v0.1→v0.2 migration guide skeleton).
- Wave 3 (PR-E: bench heartbeat swap `clockreceiver` → `hostmetricsreceiver`; PR-J: ship the four receiver-side recipes that replace the deleted in-tree receivers — filelog+container, journald+filelog+OTTL, k8sobjects+transform, prometheusreceiver).
- Wave 3 — PR-B1-shape sibling ports (each receiver/exporter gets its own `selftel.go` + `lifecycle.go` siblings off `internal/selftelemetry` + `internal/runtime/lifecycle`) plus support infra: #180 PR-E bench heartbeat swap `clockreceiver` → `hostmetricsreceiver`; #181 RFC-0013 §migration rescope (PR-A2/B1/B2/I.1/I.2 sub-sequencing + in-repo submodule); #182 PR-G RFC-0004 archive + stale-path sweep; #183 PR-H PRINCIPLES + CONTRIBUTING pivot alignment; #184 nccl_fr (PR-B1 canonical); #185 clockreceiver; #186 stdoutexporter; #187 kernelevents; #188 dcgm; #189 PR-A2 OCB-generated `main` swap + delete `cmd/tracecore/` tree; #190 install-bench → OCB binary; #191 PR-L migration-guide body; #192 doc-rot sweep post-A2; #193 otlphttp; #194 pyspy; #196 k8sevents.
- Wave 4 - PR-B2-shape sibling ports (mechanical import swap to upstream `go.opentelemetry.io/collector/{component,receiver,consumer,pipeline}`) plus support infra: #195 PR-J four upstream-receiver recipes; #197 PR-F precursor containerstdout off selftel+lc; #198 golangci-lint stale-PID concurrency fix; #199 RFC §migration PR-I/K sub-slicing amendment + PR-B2 gate; #200 PR-N pyspy capability-surface guide (security-posture migration); #201 PR-B2 nccl_fr upstream (canonical); #202 stdoutexporter upstream; #203 pyspy upstream; #206 PR-F.1 (delete `internal/{selftelemetry,telemetry}` + `components/receivers/dcgm/` + `pkg/dcgm/`); #208 kernelevents upstream; #209 containerstdout upstream; #210 lifecycle TOCTOU concurrent-Add race-window test hardening (kernelevents + k8sevents). Remaining open: #204 (k8sevents upstream), #205 (PR-B3 clockreceiver upstream), #207 (otlphttp upstream); these three gate PR-F.2.

**PR-F.1 landed: `components/receivers/dcgm/` + `pkg/dcgm/` + `internal/selftelemetry/` + `internal/telemetry/` deleted; one orphan clockreceiver integration test deleted.** Net deletion across the four moats RFC-0013 §migration step 8 promised. Deletes:
- `components/receivers/dcgm/` + `pkg/dcgm/` — cgo stub never shipped real code; live ports removed in #188's PR-B2-shaped dcgm sweep; kueue + kineto already deleted in #168.
- `internal/selftelemetry/` — every consumer (containerstdout, clockreceiver, kernelevents, k8sevents, nccl_fr, dcgm, pyspy, stdoutexporter, otlphttp) ported onto receiver/exporter-scoped sibling `selftel.go` files in wave-3 of the pivot (#184/#185/#186/#187/#188/#193/#194/#196/#197). The 5-method `selftelemetry.Receiver` and 1-method `selftelemetry.Exporter` interfaces (and the `Kind` canonical-set enum) leave the tree.
- `internal/telemetry/` — was the in-tree `MeterProvider` + probe-server (`/metrics`, `/healthz`, `/readyz`) wrapper. Probes now flow through the upstream `healthcheckextension`; meter-provider is upstream `service.telemetry`. Only remaining consumers were `internal/selftelemetry/*_test.go` (deleted together with selftelemetry) and one orphan clockreceiver integration test.
- `components/receivers/clockreceiver/errors_integration_test.go` — orphan integration test from #185's PR-B1 clockreceiver port; bootstrapped via the now-deleted `selftelemetry.Receiver` interface but never migrated to the receiver-scoped sibling `selftel.go`. The covered behaviour ("errors_total surfaces on downstream failure") is now exercised through clockreceiver's sibling tests.

PR-F.2 (deferred): `internal/componentstatus/` (5-line `ReportStatus` free function) travels with `internal/pipeline` — its only non-test consumers are `internal/pipeline/runtime_test.go` + `internal/pipeline/pipelinetest/fixture_test.go`. Deletion lands when pipeline migrates to upstream `go.opentelemetry.io/collector/component/componentstatus`.
PR-F.2 (deferred — pending three open ports): Delete `internal/{componentstatus,pipeline,pipelinebuilder,consumer,fanout,runtime/lifecycle}`. Gated on the last three pipeline+consumer-importing receivers landing — k8sevents (#204), clockreceiver (#205), otlphttp (#207) — all three open as of this entry, all three following the PR-B2 (#201) shape. Once they merge, the entire `internal/*` runtime bundle has zero non-test consumers and drops in a single cut. The `clockreceiver` source deletion stays in PR-K (chart + values-keys deprecation cycle) — PR-F.2 only deletes `internal/*` packages, not the canonical-example receivers themselves.

Build-tag `dcgm` retired (`make build-tags` no longer vets `-tags dcgm`). `make bench-check` loop drops both deleted package rows (dcgm + internal/telemetry). `scripts/register-lint.sh` allowlist emptied (the two `internal/telemetry/{build_info,slo}.go` entries are gone with the package). Chart `receivers.dcgm` toggle + `_helpers.tpl` doc-list + `NOTES.txt` warning retained until PR-K removes them outright (the toggle is already inert: setting `receivers.dcgm.enabled=true` has crashed the binary at boot since PR-A2). `internal/runtime/lifecycle/` doc-comment updated. `docs/FAILURE-MODES.md` self-tel-surface rows rewired to upstream-delegated wording. `docs/patterns/{README,pattern-{1,3,4,5}}.md` replay-test pointers updated.

6 changes: 3 additions & 3 deletions MILESTONES.md
@@ -108,7 +108,7 @@ Every milestone, in every lane, satisfies all seven principles below. Depth live
### M1. Pipeline runtime & component contract

- **Status:** ☑ delivered (PRs #12 + #13)
- **Status (RFC-0013):** DELETED at v0.1.0 (pipeline boot path) - replaced by OCB-generated `main.go` from `builder-config.yaml`. `internal/pipeline/`, `internal/pipelinebuilder/`, `internal/config/`, `internal/consumer/`, `internal/fanout/` audited and folded into the upstream lifecycle per RFC-0013 §7 (kept only if a custom receiver/processor depends on a non-replaceable abstraction). `internal/runtime/lifecycle/` deletes at v0.2.0 with its last consumer. The bundled `components/receivers/clockreceiver/` and `components/exporters/stdoutexporter/` canonical examples are queued for deletion at v0.1.0; `clockreceiver` replaced by `telemetrygeneratorreceiver`.
- **Status (RFC-0013):** DELETED at v0.1.0 (pipeline boot path) - replaced by OCB-generated `main.go` from `builder-config.yaml`. **PR-A2 landed (#189)**: `cmd/tracecore/` deleted (3,032 LOC across 14 source + 7 test files); the OCB binary at `./_build/tracecore` is the canonical entry point. **PR-F.1 landed (#206)**: `internal/selftelemetry/` + `internal/telemetry/` deleted; every receiver/exporter now carries its own `selftel.go` + `lifecycle.go` siblings (PR-B1-shape sibling ports: #184/#185/#186/#187/#188/#193/#194/#196/#197). **PR-F.2 deferred**: `internal/{componentstatus,pipeline,pipelinebuilder,consumer,fanout,runtime/lifecycle}` drop together once the last three pipeline+consumer-importing receivers land (#204 k8sevents, #205 clockreceiver, #207 otlphttp; all PR-B2-shape ports off canonical #201). `internal/config/` retained (still load-bearing for `tracecore validate`). The bundled `components/receivers/clockreceiver/` and `components/exporters/stdoutexporter/` canonical examples are queued for deletion at v0.2.0 (PR-K.2); `clockreceiver` is replaced by `hostmetricsreceiver` (loadscraper @ 1s) per PR-E unblocking (#180), because the originally planned `telemetrygeneratorreceiver` does not exist in opentelemetry-collector-contrib at any tag.
- **Depends on:** none (foundational)
- **Reference:** [RFC-0003](docs/rfcs/0003-pipeline-runtime-and-component-contract.md). Contract documented in [`internal/pipeline/README.md`](internal/pipeline/README.md).

@@ -126,7 +126,7 @@ Every milestone, in every lane, satisfies all seven principles below. Depth live
### M2. Self-telemetry surface

- **Status:** ☑ delivered (PR #17)
- **Status (RFC-0013):** DELETED at v0.1.0 - replaced by upstream `go.opentelemetry.io/collector/component/componentstatus` + `service/telemetry` + standard `otelcol_*` metrics. `internal/componentstatus`, `internal/selftelemetry`, `internal/telemetry` removed per RFC-0013 §7. M2 carry-forward divergence list closes via adoption, not via further in-tree work.
- **Status (RFC-0013):** DELETED at v0.1.0 - replaced by upstream `go.opentelemetry.io/collector/component/componentstatus` + `service/telemetry` + standard `otelcol_*` metrics. **PR-F.1 landed (#206)**: `internal/selftelemetry/` + `internal/telemetry/` deleted; probes flow through upstream `healthcheckextension`, meter-provider via upstream `service.telemetry`. **PR-F.2 deferred**: `internal/componentstatus` deletes alongside `internal/pipeline` (its only non-test consumers) once the last three PR-B2-shape ports land (#204 / #205 / #207). M2 carry-forward divergence list closes via adoption, not via further in-tree work.
- **Depends on:** M1
- **Reference:** [RFC-0006](docs/rfcs/0006-self-telemetry-surface.md).
- **Carry-forward:** see [`docs/followups/M2.md`](docs/followups/M2.md) (pprof endpoint, queue impl, restart mechanism, OTLP push reader, MetricsLevel knob, histogram tuning, per-role CreateSettings split, TracerProvider field).
@@ -587,7 +587,7 @@ Lane 6 covers NVIDIA-side device telemetry (DCGM), NCCL collective diagnostics (
### M8. DCGM receiver - cgo client + hardware integration (carry-forward)

- **Status:** ⧗ (alpha scaffold shipped in PR #18; cgo client + hardware integration carry-forward pending)
- **Status (RFC-0013):** DELETED at v0.1.0 - cgo client path never shipped, and the stub adds no value once `dcgm-exporter` + `prometheusreceiver` is the supported path. NVIDIA's 1st-party `dcgm-exporter` covers every metric in the RFC-0005 set; cross-vendor `gpu.vendor` resource attribute lands via OTTL transform over Prometheus output (RFC-0013 §3, upstream-contribution target to OTel `hw.*` semconv per §5). Replacement applies to AMD (`ROCm/device-metrics-exporter`), Intel (`intel/xpumanager`), and Habana (Habana Prometheus Metric Exporter) on the same recipe shape.
- **Status (RFC-0013):** DELETED at v0.1.0 — **landed in PR-F.1 (#206)**: `components/receivers/dcgm/` + `pkg/dcgm/` removed (cgo client path never shipped real code; live ports removed in #188's PR-B2-shaped dcgm sweep). Replaced by `dcgm-exporter` + `prometheusreceiver` per `docs/integrations/prometheus-scrape.md` (PR-J, #195). NVIDIA's 1st-party `dcgm-exporter` covers every metric in the RFC-0005 set; cross-vendor `gpu.vendor` resource attribute lands via OTTL transform over Prometheus output (RFC-0013 §3, upstream-contribution target to OTel `hw.*` semconv per §5). Replacement applies to AMD (`ROCm/device-metrics-exporter`), Intel (`intel/xpumanager`), and Habana (Habana Prometheus Metric Exporter) on the same recipe shape. Chart `receivers.dcgm` toggle + `_helpers.tpl` doc-list + `NOTES.txt` warning retained until PR-K.3 (toggle is inert post-PR-A2: enabling it crashes the OCB binary at boot with "unknown factory").
- **Depends on:** M1
- **Reference:** [RFC-0005](docs/rfcs/0005-dcgm-receiver-scope.md)
- **Hardware:** Linux + NVIDIA GPU host with `nv-hostengine` reachable; driver R580 LTSB + DCGM 4.4.x reference (per [endoflife.date/nvidia](https://endoflife.date/nvidia) - R580 active support ends 2026-08-04, refresh LTSB pin within Q3 2026; DCGM 4.4.2 is current core release per [NVIDIA/DCGM tags](https://github.com/NVIDIA/DCGM/tags))
20 changes: 14 additions & 6 deletions docs/migration/v0.1-to-v0.2.md
@@ -130,7 +130,7 @@ CI workflows changed path triggers from `cmd/tracecore/**` to `builder-config.ya

## `internal/*` package deletion (PR-F)

> **Status:** PR-F not yet open at the time of this guide's first publish. The packages listed below are still present in v0.2.0 RC builds and will be deleted in PR-F before v0.2.0 GA.
> **Status:** PR-F.1 landed (#206) — `internal/selftelemetry/` and `internal/telemetry/` are already gone in current main. PR-F.2 (deletes `internal/{componentstatus,pipeline,pipelinebuilder,consumer,fanout,runtime/lifecycle}`) is gated on three open ports: #204 (k8sevents), #205 (clockreceiver), #207 (otlphttp). Once those land, the remaining `internal/*` runtime packages drop in a single cut before v0.2.0 GA.

Several internal Go packages were load-bearing only for the deleted `cmd/tracecore` boot path and the in-tree receivers/exporters. Third-party Go importers (unlikely in OSS pre-1.0; the packages live under `internal/` and the Go compiler rejects external imports) lose:
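The parenthetical above relies on Go's internal-package rule: a package whose import path contains an `internal` element may only be imported from within the tree rooted at the parent of that `internal` directory. The following is a simplified model of that check, not the Go toolchain's implementation. The helper names `internalParent` and `canImport` are hypothetical, the root module path `github.com/tracecoreai/tracecore` is an assumption, and `example.com/thirdparty/app` is a made-up external importer.

```go
package main

import (
	"fmt"
	"strings"
)

// internalParent returns the import-path prefix that owns an internal
// package, or ok=false when the path has no "internal" element.
// Simplification: the standard library's root-level "internal/..." tree is
// not modeled.
func internalParent(importPath string) (parent string, ok bool) {
	if strings.HasSuffix(importPath, "/internal") {
		return strings.TrimSuffix(importPath, "/internal"), true
	}
	// Use the last "internal" element, so nested internal trees are scoped
	// to their nearest parent.
	if i := strings.LastIndex(importPath, "/internal/"); i >= 0 {
		return importPath[:i], true
	}
	return "", false
}

// canImport reports whether the package at importer may import importPath
// under the simplified rule above.
func canImport(importer, importPath string) bool {
	parent, ok := internalParent(importPath)
	if !ok {
		return true // not an internal package; no restriction applies
	}
	return importer == parent || strings.HasPrefix(importer, parent+"/")
}

func main() {
	// Assumed module path; the real value lives in the repository's go.mod.
	const pkg = "github.com/tracecoreai/tracecore/internal/pipeline"
	inTree := "github.com/tracecoreai/tracecore/components/receivers/clockreceiver"
	external := "example.com/thirdparty/app" // hypothetical third-party importer

	fmt.Println(canImport(inTree, pkg))
	fmt.Println(canImport(external, pkg))
}
```

Under this model the in-tree receiver is allowed and the external importer is rejected, which is why the deletions affect only code inside the repository.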

@@ -199,8 +199,16 @@ Charts are pinned by `--version` (the chart-package version from `Chart.yaml`),

## Open items (fill in as PRs land)

- [ ] PR-I (in-repo Go submodule extraction at `module/`) — link
- [x] PR-J (ship recipes for filelog + journald + k8sobjects + prometheus) — [`docs/integrations/{filelog-container,journald-kernel,k8sobjects-events,prometheus-scrape}.md`](../integrations/)
- [ ] PR-K (delete in-tree receivers) — link
- [ ] PR-L (this guide, full body) — link
- [ ] PR-E unblocking decision (heartbeat replacement) — link
- [x] PR-A2 (switch entrypoint to OCB-generated main + delete `cmd/tracecore/`) — [#189](https://github.com/TraceCoreAI/tracecore/pull/189)
- [x] PR-E (heartbeat replacement decision) — `hostmetricsreceiver` (loadscraper @ 1s); [#180](https://github.com/TraceCoreAI/tracecore/pull/180). The originally-planned `telemetrygeneratorreceiver` does not exist in opentelemetry-collector-contrib (contrib issues #41687 + #43657 both closed `not_planned`).
- [x] PR-F.1 (delete `internal/{selftelemetry,telemetry}` + `components/receivers/dcgm/` + `pkg/dcgm/`) — [#206](https://github.com/TraceCoreAI/tracecore/pull/206)
- [ ] PR-F.2 (delete `internal/{componentstatus,pipeline,pipelinebuilder,consumer,fanout,runtime/lifecycle}`) — gated on #204 / #205 / #207
- [ ] PR-I.1a (scaffold `module/` Go submodule + `go.work` + `replaces:`) — in flight
- [ ] PR-I.1b (`git mv` nccl_fr → `module/receiver/ncclfrreceiver`) — gate satisfied by [#201](https://github.com/TraceCoreAI/tracecore/pull/201); waits on PR-I.1a
- [ ] PR-I.2 (`rankjoinprocessor` + `patterndetectorprocessor`) — gated on PR-K.1
- [x] PR-J (ship recipes for filelog + journald + k8sobjects + prometheus) — [`docs/integrations/{filelog-container,journald-kernel,k8sobjects-events,prometheus-scrape}.md`](../integrations/); [#195](https://github.com/TraceCoreAI/tracecore/pull/195)
- [ ] PR-K.1 (sever patterns-lib k8sevents dep) — in flight
- [ ] PR-K.2 (delete in-tree receivers + migrate ~86 test fixtures + delete `tools/failure-inject/xidgen/`) — link
- [ ] PR-K.3 (chart cleanup + values-keys `NOTES.txt` deprecation + values delete after one-minor window) — link
- [x] PR-L (this guide, skeleton + body) — skeleton [#179](https://github.com/TraceCoreAI/tracecore/pull/179), body [#191](https://github.com/TraceCoreAI/tracecore/pull/191); living document
- [x] PR-N (pyspy capability-surface security note, ahead of v0.3.0) — [`docs/migration/v0.2-to-v0.3.md`](v0.2-to-v0.3.md); [#200](https://github.com/TraceCoreAI/tracecore/pull/200)