Merged
11 changes: 7 additions & 4 deletions CHANGELOG.md
@@ -6,11 +6,14 @@ User-visible changes are documented here. Format: [Keep a Changelog](https://kee

Pre-alpha. **Distribution-first pivot adopted ([RFC-0013](docs/rfcs/0013-distro-first-pivot.md))** - binary now assembled via the OpenTelemetry Collector Builder (OCB) from upstream + contrib components plus a thin `tracecoreai/tracecore-components` module containing only the moat (NCCL FlightRecorder receiver, OTTL processors with windowed semantics, pattern detectors). The M1 in-tree pipeline runtime + factory-based assembly is queued for deletion at v0.1.0 in favor of the OCB-generated boot path; the canonical `clockreceiver` + `stdoutexporter` examples ship for one PR cycle and then exit. Targeting v0.1.0 / v0.2.0 / v0.3.0 release boundaries per RFC-0013 §4.

Pivot landed across two waves of PRs:
Pivot landed across three waves of PRs:
- Wave 1 (#166 RFC doc accepted, #168 delete kueue + kineto receivers, #169 pre-PR-A drift sweep + Helm security tighten, #170 containerstdout deletion explicit in §7, #171 PR-A OCB skeleton + `builder-config.yaml` + `make build-ocb`, #172 dedup gate execution, #173 rename check tiers + add PR body-artifact guard, #174 PR-C release pipeline → goreleaser stack + RFC supersession + top-level doc alignment, #175 wave-1 self-review fixes + delete archive folder).
- Wave 2 (#176 PR-D image build → ko + `_build/` walker fix + PR-B reframe as side-effect of binary swap).
- Wave 2 (#176 PR-D image build → ko + `_build/` walker fix + PR-B reframe as side-effect of binary swap, #177 build-ocb CI gate, #178 post-wave-2 drift sweep, #179 v0.1→v0.2 migration guide skeleton).
- Wave 3 (PR-E: bench heartbeat swap `clockreceiver` → `hostmetricsreceiver`).

Remaining v0.1.0 work: PR-E (clockreceiver → telemetrygenerator) currently BLOCKED — upstream `telemetrygeneratorreceiver` doesn't exist at any version. PR-F (delete `internal/{componentstatus,selftelemetry,telemetry}` + `components/receivers/{clockreceiver,dcgm,kueue}`) deferred — chart default pipeline hardwires the to-be-deleted receivers, so deletion happens together with the v0.2.0 recipe migration (PR-K) to avoid an interim chart break.
**PR-E unblocked.** Original RFC-0013 §migration plan named `telemetrygeneratorreceiver` as the upstream replacement for `clockreceiver`. Verified 2026-05-30: the receiver does not exist in `opentelemetry-collector-contrib` at any tag from v0.95.0 through v0.130.0; two community proposals (contrib issues #41687 and #43657) were closed `not_planned`. Replacement landed on `hostmetricsreceiver` (loadscraper @ 1s) — an upstream OCB-bundled receiver that emits 3 low-cardinality series (`system.cpu.load_average.{1m,5m,15m}`) at the cadence the bench's pass condition needs (first parseable JSON line at the sink — see `bench/install/run.sh`). This PR adds `hostmetricsreceiver` to `builder-config.yaml`, adds a `receivers.hostmetrics` opt-in block to the chart values (default disabled — chart default stays `clockreceiver` this release), and flips `bench/install/tracecore-values.yaml` to enable hostmetrics + disable clockreceiver. RFC-0013 §migration PR-E + §4 + §7 deletion table updated. Chart-default flip from `clockreceiver` to `hostmetrics` + source-deletion of `components/receivers/clockreceiver/` are deferred to PR-K (in-tree-receiver deletion wave) so the values-keys migration ships together with `NOTES.txt` deprecation warnings and the coordinated migration of ~92 in-tree test-fixture references in one cut rather than two operator-visible changes.

Remaining v0.1.0 work: PR-F (delete `internal/{componentstatus,selftelemetry,telemetry}` + `components/receivers/{dcgm,kueue}`) deferred — chart default pipeline hardwires the to-be-deleted receivers, so deletion happens together with the v0.2.0 recipe migration (PR-K) to avoid an interim chart break. `clockreceiver` source deletion also part of PR-K per PR-E rationale.

**PR-B reframed: self-tel metric rename (`tracecore.*` → `otelcol_*`) is a side-effect of the binary swap, not a caller rewrite.** Investigation found that `service/telemetry` + `componentstatus` upstream APIs are not drop-in replacements for the `IncError`/`IncEmissions`/`ObserveLatency`/`SetDegraded`/`MarkActivity` surface that `internal/selftelemetry/` provides today — the standard `otelcol_*` metrics RFC-0013 §2 promises are emitted by upstream `receiver/scraperhelper`, `exporter/exporterhelper`, and the OCB-generated pipeline runtime, NOT by `componentstatus` (which is a status-event surface). The rename therefore arrives automatically once PR-A's OCB binary boots with upstream receivers and PR-F deletes the in-tree receivers; no caller rewrite is needed in between. RFC-0013 §migration PR-B is collapsed into PR-F; the standalone PR-B step is documentation-only and lives in this CHANGELOG entry.

@@ -21,7 +24,7 @@ Remaining v0.1.0 work: PR-E (clockreceiver → telemetrygenerator) currently BLO
- **Adopt > build posture replaces in-tree receivers for GPU telemetry, container stdout, kernel events, K8s events, Kueue, Python profiling, heartbeat, self-telemetry, release pipeline, and image publish.** Adoption matrix lives in [RFC-0013 §2](docs/rfcs/0013-distro-first-pivot.md#2-adoption-matrix). Vendors: NVIDIA (`dcgm-exporter`), AMD (`ROCm/device-metrics-exporter`), Intel (`intel/xpumanager`), Habana (Habana Prometheus Metric Exporter) - all scraped via upstream `prometheusreceiver`. CNCF: `filelogreceiver` + container stanza + `file_storage`; `journaldreceiver`; `k8sobjectsreceiver`; `telemetrygeneratorreceiver`. CNCF Profiles: `parca-agent` via OTLP profiles sink. Self-telemetry: upstream `componentstatus` + `service/telemetry` + standard `otelcol_*` metrics.
- **Customer-stable telemetry contracts preserved across the pivot** via the OTTL `transform` processor in the bundled Helm-chart recipe ([RFC-0013 §3](docs/rfcs/0013-distro-first-pivot.md#3-customer-stable-telemetry-contracts)). Stable surfaces: `k8s.event.hint` 11-entry enum (pod_evicted, mount_failure, backoff, oom_killed, node_unhealthy, schedule_failure, create_failure, volume_attach_failure, container_status_unknown, node_pressure, image_pull_failure); `kernelevents.xid` (NVRM Xid code); `gpu.id` (PCI BDF); `gpu.vendor` (nvidia | amd | intel | habana - upstream-contribution target to OTel `hw.*` semconv); `gen_ai.training.rank` and `gen_ai.training.job_id` (cross-receiver join keys); NCCL FlightRecorder span schema; pattern detector outputs (M17/M18/M19). Operator alerts written against these survive the receiver swap.
- **Deletions scheduled** (RFC-0013 §7):
- **v0.1.0:** `components/receivers/clockreceiver/` (→ `telemetrygeneratorreceiver`), `components/receivers/dcgm/` (cgo stub never shipped real path; → `dcgm-exporter` + `prometheusreceiver` recipe), `components/receivers/kueue/` (never shipped; → `prometheusreceiver` recipe), `internal/componentstatus/`, `internal/selftelemetry/`, `internal/telemetry/`. Hand-rolled `.github/workflows/release.yml` rewritten onto the goreleaser stack (prior workflow preserved in git history). Operator-visible breaks: self-tel metric rename `tracecore.*` → `otelcol_*`; release-artifact provenance shape change (documented once).
- **v0.1.0:** bench heartbeat swap `clockreceiver` → `hostmetricsreceiver` (PR-E; source survives until PR-K per coupled test-fixture migration), `components/receivers/dcgm/` (cgo stub never shipped real path; → `dcgm-exporter` + `prometheusreceiver` recipe), `components/receivers/kueue/` (never shipped; → `prometheusreceiver` recipe), `internal/componentstatus/`, `internal/selftelemetry/`, `internal/telemetry/`. Hand-rolled `.github/workflows/release.yml` rewritten onto the goreleaser stack (prior workflow preserved in git history). Operator-visible breaks: self-tel metric rename `tracecore.*` → `otelcol_*`; release-artifact provenance shape change (documented once).
- **v0.2.0:** `components/receivers/kernelevents/` (→ `journaldreceiver` + `filelogreceiver` + OTTL Xid transform), `components/receivers/k8sevents/` (→ `k8sobjectsreceiver` + OTTL `k8s.event.hint` transform), `components/receivers/kineto/` (deferred; re-eval at OTel Profiles GA), plus `.github/workflows/kernelevents-integration.yml`. Operator-visible breaks: ALL recipe-side receiver swaps, batched into one migration guide; Helm values keys map old→new for one minor release with `NOTES.txt` deprecation warning.
- **v0.3.0:** `components/receivers/pyspy/` (→ `parca-agent` via separate chart), `python/tracecore_pyspy/`, `tools/pyspy-lint/`, `.github/workflows/{pyspy-integration,python-publish}.yml`. Operator-visible breaks: PyPI helper deleted; security posture changes (CAP_SYS_PTRACE → CAP_SYS_ADMIN/BPF - operator review window).
- **Upstream contributions become first-class policy.** Tracecore patches upstream first; forks only when upstream rejects ([RFC-0013 §5](docs/rfcs/0013-distro-first-pivot.md#5-upstream-contribution-policy)). When a contribution is in-flight, tracecore ships against a `replace` directive in `go.mod` pointing at the contribution branch; the replace is removed when the upstream tag lands. Likely contribution slots opened by the pivot: `k8sobjectsreceiver` (`k8s.event.hint` derived attribute), `filelogreceiver` / container stanza (PyTorch rank + dataloader-timing presets), `journaldreceiver` (`_TRACE_ID`/`_SPAN_ID` propagation), cross-vendor `gpu.vendor` semconv extension, OTel Profiles Kineto adapter, OCB reproducibility flags, `telemetrygeneratorreceiver` rate-limit knobs.
15 changes: 11 additions & 4 deletions bench/install/README.md
@@ -53,8 +53,13 @@ for the JSON Schema. Each row carries:
- `install_seconds`; `helm install` return time
- `first_data_seconds`; first OTLP byte at the sink (tick-aliased,
see `clockreceiver_interval_seconds`)
- `clockreceiver_interval_seconds`; receiver tick period, for
tick-alias correction across runs with different intervals
- `clockreceiver_interval_seconds`; heartbeat-receiver emit period,
for tick-alias correction across runs with different intervals.
**Note (RFC-0013 PR-E, 2026-05-30):** the bench heartbeat source
is now `hostmetricsreceiver` (loadscraper @ 1s); the field name
is preserved for schema-v1 stability. Schema v2 will rename it to
`heartbeat_interval_seconds` alongside the PR-K chart-default
flip.
- `poll_interval_ms`; sink-side polling cadence (noise floor for
`first_data_seconds`)
- envelope fields per the shared schema
@@ -67,8 +72,10 @@ for the JSON Schema. Each row carries:
`components/exporters/otlphttp/otlphttp_test.go`), but install-bench
validates the metrics wire path only. Adding traces+logs to the
bench is tracked in `docs/FOLLOWUPS.md`.
- **First-data is tick-aliased.** The clockreceiver fires on a 1 s
interval by default; `first_data_seconds` includes up to one full
- **First-data is tick-aliased.** The bench heartbeat source emits
on a 1 s interval (hostmetricsreceiver loadscraper as of PR-E;
was clockreceiver pre-PR-E — chart default remains clockreceiver
this release); `first_data_seconds` includes up to one full
tick of wait. Subtract `clockreceiver_interval_seconds` for a lower
bound on the pipeline-startup latency.
- **No multi-arch yet.** ubuntu-latest only. arm64 GHA runners were
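The tick-alias correction described in the README hunk above can be sketched in a few lines. This is a minimal illustration and not part of the bench: the function name `startup_latency_bounds` is hypothetical, and the only inputs assumed are the `first_data_seconds` and `clockreceiver_interval_seconds` fields of a bench row. Because the heartbeat wait is at most one full tick, subtracting the interval gives a lower bound on startup latency rather than an exact value.

```python
from typing import Tuple


def startup_latency_bounds(first_data_seconds: float,
                           interval_seconds: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds on pipeline-startup latency, in seconds.

    Hypothetical helper, not part of the bench tooling.
    first_data_seconds: time until the first data reached the sink.
    interval_seconds:   heartbeat emit period (the bench row's
                        clockreceiver_interval_seconds field).
    """
    if first_data_seconds < 0 or interval_seconds < 0:
        raise ValueError("durations must be non-negative")
    # The measured value includes between zero and one full tick of wait,
    # so the true startup latency lies between these two values.
    lower = max(0.0, first_data_seconds - interval_seconds)
    upper = first_data_seconds
    return lower, upper


if __name__ == "__main__":
    # Example row: first data after 3.5 s with a 1 s heartbeat.
    print(startup_latency_bounds(3.5, 1.0))
```

Reporting both bounds keeps runs with different heartbeat intervals comparable without implying more precision than the tick period allows.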
50 changes: 33 additions & 17 deletions bench/install/tracecore-values.yaml
@@ -6,27 +6,43 @@
# alongside this bench) so the rendered pipeline wires otlphttp
# automatically. No free-form `config:` override needed.
#
# RFC-0013 note: the `clockreceiver` and `stdoutexporter` values
# keys below refer to the v0.0.x in-tree components. At v0.1.0 they
# map to recipe equivalents per RFC-0013 §7 (Deletion list):
# clockreceiver -> telemetrygeneratorreceiver (OCB-bundled) [BLOCKED]
# stdoutexporter -> debugexporter (OCB-bundled)
# PR-E status (2026-05-30): the clockreceiver -> telemetrygeneratorreceiver
# swap is DEFERRED. The receiver does not exist in
# opentelemetry-collector-contrib at v0.110.0 or main (verified against
# the GH tree API; receiver/telemetrygeneratorreceiver path 404s on every
# tag from v0.95.0 through v0.130.0). The bench keeps using the in-tree
# clockreceiver until PR-K (v0.2.0 recipe migration) deletes it — at which point
# the bench will switch to a non-clock load source (likely
# hostmetricsreceiver on a 1s scrape, or an OCB-bundled equivalent if one
# lands upstream). See builder-config.yaml TODO(RFC-0013 PR-E) block.
# The chart's compat map will keep these values keys working for one
# minor with a `NOTES.txt` deprecation warning per RFC-0013 §8.
# RFC-0013 PR-E (2026-05-30): bench load source is `hostmetrics`
# (loadscraper @ 1s) — an upstream OTel-contrib receiver bundled by
# OCB. Replaces the legacy in-tree `clockreceiver` here because the
# distro-first pivot's intent is "no custom receiver where upstream
# satisfies." hostmetrics' loadscraper emits 3 low-cardinality
# series (system.cpu.load_average.{1m,5m,15m}) at the cadence the
# bench's pass condition needs (first parseable JSON line at the
# sink — see bench/install/run.sh).
#
# The originally-planned `telemetrygeneratorreceiver` does NOT
# exist in opentelemetry-collector-contrib at any tag (verified
# 2026-05-30; contrib issues #41687 and #43657 both closed
# `not_planned`). Re-evaluation trigger: a new generator-shaped
# receiver landing in contrib.
#
# Scope deferral: chart default stays `clockreceiver` this release;
# default-flip + values-keys migration ship together in PR-K
# (in-tree-receiver deletion wave) with NOTES.txt deprecation
# warnings — one coordinated cut rather than two operator-visible
# changes. `components/receivers/clockreceiver/` source also
# survives until PR-K because it doubles as the canonical example
# receiver across cmd/tracecore/*_test.go + internal/pipeline +
# internal/selftelemetry fixtures (~92 references audited).
image:
  repository: tracecore
  tag: bench
  pullPolicy: Never

receivers:
  clockreceiver:
    enabled: false
  hostmetrics:
    enabled: true
    collection_interval: 1s
    scrapers:
      load: {}

exporters:
  stdoutexporter:
    enabled: false
@@ -36,5 +52,5 @@ exporters:

pipelines:
  metrics:
    receivers: [clockreceiver]
    receivers: [hostmetrics]
    exporters: [otlphttp]
19 changes: 6 additions & 13 deletions builder-config.yaml
@@ -11,19 +11,12 @@ receivers:
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/journaldreceiver v0.110.0
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/k8sobjectsreceiver v0.110.0
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/prometheusreceiver v0.110.0
# TODO(RFC-0013 PR-E): swap clockreceiver -> telemetrygeneratorreceiver.
# BLOCKER (verified 2026-05-30 against opentelemetry-collector-contrib
# v0.110.0 + main): receiver does NOT exist in the OTel contrib repo at
# any path (receiver/telemetrygeneratorreceiver, loadgenreceiver,
# mockreceiver, dummyreceiver all 404). RFC-0013 §1's example shape
# referenced it speculatively — the receiver was never landed upstream.
# Decision: keep clockreceiver alive in cmd/tracecore/components.go
# legacy boot path until either (a) an upstream replacement lands and we
# bump OCB to that release, or (b) PR-F's deletion ships and the bench
# rewires to a different load source (e.g. hostmetricsreceiver on a
# short scrape interval, or otlpreceiver fed by a sibling loader pod).
# Re-evaluate on every OCB version bump. Tracked in RFC-0013 OQ
# follow-up (cannot edit RFC inline per PR-E constraint).
# RFC-0013 PR-E (2026-05-30): hostmetricsreceiver replaces the
# planned-but-nonexistent telemetrygeneratorreceiver as the bench
# heartbeat source. Two upstream proposals (contrib #41687, #43657)
# closed `not_planned`; re-evaluation trigger is a generator-shaped
# receiver landing in contrib at any future tag.
- gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.110.0

processors:
- gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.110.0
2 changes: 1 addition & 1 deletion docs/migration/v0.1-to-v0.2.md
@@ -18,7 +18,7 @@ v0.2.0 completes the RFC-0013 receiver swap. The in-tree custom receivers for ke
| GPU telemetry (NVIDIA) | `dcgm` (in-tree, cgo stub) | `dcgm-exporter` DaemonSet + `prometheusreceiver` | Deploy `dcgm-exporter` via its own chart; chart adds a `gpu.nvidia.recipe: prometheus` toggle to wire the scrape. |
| GPU telemetry (AMD / Intel / Habana) | Not shipped | `ROCm/device-metrics-exporter` / `intel/xpumanager` / Habana Prometheus Metric Exporter, all scraped via `prometheusreceiver` | New capability; opt-in via `gpu.<vendor>.recipe: prometheus`. |
| Kueue scheduler metrics | `kueue` (in-tree, never shipped) | `prometheusreceiver` recipe with bearer-token + TLS | Opt-in via `kueue.recipe: prometheus`. |
| Heartbeat / install-bench primitive | `clockreceiver` (in-tree) | `telemetrygeneratorreceiver` — **BLOCKED**: receiver does not exist upstream at any version | TBD. Current plan: `hostmetricsreceiver` (1s scrape) as the load source. Track [GitHub issue / RFC OQ followup]. |
| Heartbeat / install-bench primitive | `clockreceiver` (in-tree, chart default) | `hostmetricsreceiver` (loadscraper @ 1s, upstream OCB-bundled) | v0.1.x bench already swapped (PR-E). v0.2.0 flips the chart default — set `receivers.hostmetrics.enabled: true` + `receivers.clockreceiver.enabled: false` if you want to track the new default before the chart-default flip; otherwise no action until v0.2.0. `NOTES.txt` will surface a deprecation warning for one minor after the flip. |
| Kineto profiler | `kineto` (in-tree, deferred) | Deferred until OTel Profiles GA | No action; re-evaluation when contrib ships `pprofreceiver`. |
| `tracecoreai/tracecore-components` module | Lives inside this repo | Separate Go module pulled via OCB `gomod:` | No operator action. Module split is internal. |
| Helm values keys | Per-receiver `<name>.*` | Per-receiver `<name>.recipe: <upstream\|legacy>` + per-recipe stanzas | One-minor compat. Migrate by setting `.recipe: upstream` per receiver. |
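For operators who want to track the new heartbeat source before the chart-default flip, the override named in the heartbeat row of the table can be written as a values fragment. This is a sketch assembled from the keys in `bench/install/tracecore-values.yaml` in this change; the nesting shown is an assumption about the chart's values layout, and `collection_interval` plus the `load` scraper mirror the bench settings rather than a documented chart default.

```yaml
# Assumed layout, mirroring bench/install/tracecore-values.yaml from this PR.
receivers:
  clockreceiver:
    enabled: false        # turn off the legacy in-tree heartbeat
  hostmetrics:
    enabled: true         # opt in to the upstream receiver
    collection_interval: 1s
    scrapers:
      load: {}            # loadscraper: system.cpu.load_average.{1m,5m,15m}

pipelines:
  metrics:
    receivers: [hostmetrics]
```

The pipeline entry has to change together with the receiver toggles; otherwise the metrics pipeline would still reference the disabled `clockreceiver`.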