feat(ocb): PR-A2 — switch tracecore entrypoint to OCB-generated main #189
RFC-0013 PR-A2 (sequencing gate for PR-B2 / PR-F / PR-I): retire the
hand-wired ./cmd/tracecore entry point and adopt the OpenTelemetry
Collector Builder (OCB) output at ./_build/tracecore as the canonical
binary. After this lands, all receivers register through OCB's
generated otelcol.Factories instead of the bespoke
cmd/tracecore/components.go.
Deletions (3,032 LOC across 22 source + 7 test files):
- cmd/tracecore/ entire tree (main, collect, validate, debug,
receivers, signals, failure_inject, openflags, receiver_variants)
- components.yaml + tools/components-gen/ (superseded by
builder-config.yaml; OCB owns codegen)
- components/receivers/kernelevents/runbook_test.go (depended on the
deleted `tracecore debug dump` subcommand; kernelevents itself is
scheduled for deletion in PR-K)
Build path swap:
- Makefile: `build` target now runs OCB (was: legacy `go build
./cmd/tracecore`); dropped `generate`, `generate-check`, `run`,
legacy -ldflags -X version injection; coverage -coverpkg drops
./cmd/...
- install/kubernetes/tracecore/Dockerfile: builds via OCB (`make
build` then copy `_build/tracecore`)
- install/kubernetes/tracecore/templates/daemonset.yaml: args drop the
`collect` subcommand (OCB main runs the collector as default)
- .goreleaser.yaml: switched to `builder: prebuilt` against
./_build/{Os}-{Arch}/tracecore; release.yml gains a per-platform
pre-build step before invoking goreleaser
- .ko.yaml: builds from inside ./_build/ (the OCB submodule) via
KO_CONFIG_PATH=../.ko.yaml; main: .
- .github/workflows/ci.yml: package job runs OCB; build-ocb drift gate
replaced by smoke-test-binary job consuming the package artefact
- builder-config.yaml: dist.version bumped to 0.1.0-m9-alpha to match
Chart.yaml appVersion (chart-appversion-check.sh now reads it as the
source of truth, replacing internal/version/version.go)
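
The version-parity rule in the last bullet (builder-config.yaml `dist.version` must equal Chart.yaml `appVersion`) can be sketched in Go. This is a minimal illustration, not the contents of `chart-appversion-check.sh`: `scalarUnder` is a hypothetical helper, the sample documents are invented apart from the `0.1.0-m9-alpha` value, and the line scanner assumes flat hand-written YAML rather than parsing YAML properly.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// scalarUnder returns the value of `key:` found under the top-level
// `section:` mapping. Pass section "" to read a top-level key.
// Assumption: simple, flat YAML; this is a line scanner, not a YAML parser.
func scalarUnder(doc, section, key string) (string, bool) {
	inSection := section == ""
	sc := bufio.NewScanner(strings.NewReader(doc))
	for sc.Scan() {
		line := sc.Text()
		trimmed := strings.TrimSpace(line)
		if trimmed == "" || strings.HasPrefix(trimmed, "#") {
			continue
		}
		topLevel := line[0] != ' ' && line[0] != '\t'
		if section != "" && topLevel {
			// A new top-level key opens or closes the section we care about.
			inSection = trimmed == section+":"
			continue
		}
		if !inSection || (section == "" && !topLevel) {
			continue
		}
		if strings.HasPrefix(trimmed, key+":") {
			rest := strings.TrimSpace(strings.TrimPrefix(trimmed, key+":"))
			return strings.Trim(rest, `"'`), true
		}
	}
	return "", false
}

func main() {
	// Hypothetical file contents; only the version string comes from the PR text.
	builderConfig := "dist:\n  name: tracecore\n  version: 0.1.0-m9-alpha\n"
	chart := "apiVersion: v2\nname: tracecore\nappVersion: \"0.1.0-m9-alpha\"\n"

	dist, okDist := scalarUnder(builderConfig, "dist", "version")
	app, okApp := scalarUnder(chart, "", "appVersion")
	if !okDist || !okApp || dist != app {
		fmt.Printf("version drift: dist.version=%q appVersion=%q\n", dist, app)
		return
	}
	fmt.Println("versions agree:", dist)
}
```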
Sequencing constraints honored:
- components/receivers/{clockreceiver, containerstdout, dcgm,
k8sevents, kernelevents, nccl_fr, pyspy} survive as
orphan-but-compiling code until PR-K deletes them along with the
chart-fixture migration
- internal/{pipeline, selftelemetry, telemetry, componentstatus,
pipelinebuilder, consumer, fanout, runtime} survive until PR-F
- chart workflow's `tracecore validate` step temporarily disabled —
the chart's renderedConfig still emits the legacy `telemetry:`
top-level key and references clockreceiver/pyspy/containerstdout;
PR-K reinstates the gate once the chart shape migrates
Operator surface changes (release-notes block):
- `tracecore collect --config=…` → `tracecore --config=…` (collect was
the default subcommand; OCB main runs the collector by default)
- `tracecore --log.format=text` / `--shutdown.drain-budget=…` /
`--version-short` → removed; OCB uses upstream `--feature-gates` +
`--set` flag surface, version reads from binary metadata
- `tracecore receivers list` → `tracecore components` (shows
receivers/processors/exporters/extensions/connectors)
- `tracecore debug dump` → removed (OCB has no equivalent; operators
filing issues use `tracecore components` + the live config)
- `tracecore failure-inject {nccl-hang,pod-evict}` → replaced by the
standalone `tools/failure-inject` binary (already shipped with
xid, nccl-hang, pod-evict, cpu-steal subcommands)
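
For operators rewriting scripts, the surface changes listed above can be summarised as a small translation function. This is only a sketch: `translateLegacyArgs` is a hypothetical helper that ships with neither tracecore nor OCB, the config path is invented, and matching the three removed flags by name is an assumption drawn from the list above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// translateLegacyArgs rewrites a pre-PR-A2 tracecore argument list into its
// OCB-era equivalent, or returns an error for surfaces listed as removed.
func translateLegacyArgs(args []string) ([]string, error) {
	if len(args) == 0 {
		return args, nil
	}
	switch {
	case args[0] == "collect":
		// `collect` was the default subcommand; the OCB main runs the collector by default.
		args = args[1:]
	case len(args) >= 2 && args[0] == "receivers" && args[1] == "list":
		return append([]string{"components"}, args[2:]...), nil
	case len(args) >= 2 && args[0] == "debug" && args[1] == "dump":
		return nil, errors.New("debug dump was removed; use `tracecore components` plus the live config")
	case args[0] == "failure-inject":
		return nil, errors.New("failure-inject moved to the standalone tools/failure-inject binary")
	}
	// Flags the release notes list as removed (assumption: matched by name).
	removed := []string{"--log.format", "--shutdown.drain-budget", "--version-short"}
	out := make([]string, 0, len(args))
	for _, a := range args {
		for _, r := range removed {
			if a == r || strings.HasPrefix(a, r+"=") {
				return nil, fmt.Errorf("flag %s was removed", r)
			}
		}
		out = append(out, a)
	}
	return out, nil
}

func main() {
	examples := [][]string{
		{"collect", "--config=/etc/tracecore/config.yaml"}, // hypothetical path
		{"receivers", "list"},
		{"debug", "dump"},
	}
	for _, legacy := range examples {
		got, err := translateLegacyArgs(legacy)
		fmt.Println(legacy, "->", got, err)
	}
}
```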
Doc-rot fixes:
- docs/FAILURE-MODES.md: rerouted 8 entries that referenced deleted
cmd/tracecore tests to upstream OCB owners; legacy contracts
(signal handling, multi-instance components, empty-config WARN)
now flow through service.New / otelcol.Collector
- docs/FLAKY-TESTS.md: moved the two cmd/tracecore.TestIntegration_*
flakes to Resolved (deleted with the legacy entry point)
- docs/integrations/examples/{honeycomb,otel-backend}.yaml: clockreceiver
→ hostmetrics loadscraper (OCB-supported); pending-rfc-0013-pr-a
recipes unchanged
- STYLE.md: repo layout + component-registration + CLI + build-release
sections rewritten around OCB
- PRINCIPLES.md: dropped concrete-example reference to deleted file
- scripts/doc-check.sh, scripts/no-autoupdate-check.sh: drop `cmd`
from scan paths
- scripts/chart-appversion-check.sh: read dist.version from
builder-config.yaml instead of internal/version/version.go
- scripts/smoke.sh: rewritten for OCB binary (hostmetrics → debug
config; expects upstream lifecycle log lines)
- scripts/validator-recipe.sh: BIN default now ./_build/tracecore
Verification:
- `make build` (OCB) → produces ./_build/tracecore binary
- `./_build/tracecore --version` reports 0.1.0-m9-alpha matching Chart.yaml
- `./_build/tracecore components` lists 6 receivers, 4 exporters,
3 extensions, 4 processors (the builder-config.yaml inventory)
- `make smoke` passes (hostmetrics → debug, 1.5s window, clean shutdown)
- `make check`, `make verify` pass (license, fmt, lint, vet, tidy,
build-tags, register-lint, actionlint, zizmor, doc-check,
chart-appversion-check, no-autoupdate-check, nccl-fr-rce-gate)
- `go test -race ./...` passes (skipping TestReceiver_SLIBudget +
TestSLIBudget_WarmupDiscardIsLoadBearing — pre-existing macOS p99
flakes already excluded from `make test-extras-race`)
- `helm lint install/kubernetes/tracecore` clean
- `helm template` renders daemonset with `args: [--config=…]`
(no longer `[collect, --config=…]`)
Sequencing gate: PR-B2 / PR-F / PR-I are unblocked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Adversarial cross-cut review found 4 blockers; auto-merge disabled until fixes land. Fixer agent in flight. Summary:

- 🔴 F1: chart defaults break helm install. values.yaml still defaults clockreceiver (not in OCB binary). validate-CI gate deleted in same PR so this regression has no seam.
- 🔴 F2: 9 in-tree components silently orphaned. Release-notes block doesn't disclose breadth. Bench otlphttp config needs schema verification against upstream otlphttpexporter.
- 🔴 F3: metric-name contract deleted, no replacement test.
- 🔴 F4: chart probes break.

Fixes incoming.
## Root cause

PR #180 (`chore(pivot): PR-E unblock — bench heartbeat to hostmetricsreceiver`) enabled the `hostmetrics` receiver in `bench/install/tracecore-values.yaml` and added it to `builder-config.yaml`, but the install-bench `Dockerfile` was still building from `./cmd/tracecore`. The generated `cmd/tracecore/components.go` only registers in-tree receivers; `hostmetricsreceiver` is upstream OTel-contrib and is only bundled by the OCB-assembled binary at `_build/tracecore`.

The daemonset pod failed config load with `unknown component type hostmetrics`, `kubectl rollout status` timed out at 5 m, and `bash -e` aborted `run.sh` before any diagnostics fired, so CI showed a bare red with no actionable log.

`install-bench` has been red on `main` since 2026-05-31T02:28:40Z and on every PR opened after #180.

**Affected PRs (open as of this writing)**: #186, #187, #188, #189.

## Fix

Switch `install/kubernetes/tracecore/Dockerfile` to build via OCB:

```
make build-ocb # generates ./_build/{main.go,go.mod,...} + compiles
cd _build && go build . # re-link with CGO_ENABLED=0 -trimpath -ldflags "-s -w"
```

The re-link with our flags guarantees the static binary the distroless base can exec; OCB's intermediate compile uses its own defaults. The final image still uses `gcr.io/distroless/static-debian12:nonroot` at the same pinned digest.

This is a **tactical bridge**: PR-A2 (#189) makes `_build/tracecore` the canonical binary for all builds. Once that lands, the in-tree `cmd/tracecore` path retires entirely (RFC-0013 PR-F) and this Dockerfile change becomes the new normal across every image, not just install-bench.

### Why not the alternatives?

- **Revert hostmetrics in bench values** → walks back PR #180's pivot intent ("no custom receiver where upstream satisfies"); the legacy `clockreceiver` is on its way out in PR-K.
- **Add hostmetricsreceiver to `cmd/tracecore/components.go`** → diverges the in-tree component list from the OCB-managed one; the whole point of PR-A2 is to delete that divergence.

## Bonus: surface root cause on rollout-status failure

`bench/install/run.sh` had post-deadline diagnostics for the first-data path, but the rollout-status path (the actual failure mode of this regression) just exited via `set -e`. Added `dump_failure_diagnostics()` (pod state, `kubectl describe`, current + previous container logs, rendered config) wired to both failure paths; the refactor eliminates the duplicated tracecore-pod spelunking that lived inline. Future regressions surface root cause in the CI log without re-running.

## Verification

```
$ make check
… 0 issues, all modules verified
$ docker build -f install/kubernetes/tracecore/Dockerfile -t tracecore:bench-test .
… exporting to image done
$ docker run --rm tracecore:bench-test components | grep -E "hostmetrics|otlp"
- name: hostmetrics
  module: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.110.0
- name: otlp
  module: go.opentelemetry.io/collector/receiver/otlpreceiver v0.110.0
- name: otlphttp
  module: go.opentelemetry.io/collector/exporter/otlphttpexporter v0.110.0
```

End-to-end install-bench (kind cluster + helm install) runs on this PR via the workflow itself.

## Cost

Docker build stage adds ~100 s (OCB compile inside Alpine). Bench Docker rebuild only fires on chart/bench/builder-config changes, which is acceptable.

```release-notes
[CI] install-bench Dockerfile now builds via OpenTelemetry Collector Builder so the bench daemonset can load hostmetricsreceiver; also dumps pod state, logs, and rendered config on rollout-status failure. Unblocks every PR opened after #180.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
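
The failure mode in the root-cause section (a config naming a component the binary never registered) can be illustrated with a toy registry. This is a sketch under stated assumptions: `registry` and `unknownTypes` are hypothetical names, the component sets are abbreviated examples, and the real collector resolves component types through its generated factories, not through a map like this.

```go
package main

import (
	"fmt"
	"sort"
)

// registry models the set of component types compiled into one binary.
type registry map[string]bool

// unknownTypes returns, sorted, the configured component types that the
// registry does not contain. A non-empty result corresponds to the
// "unknown component type" config-load failure described above.
func (r registry) unknownTypes(configured []string) []string {
	var missing []string
	for _, t := range configured {
		if !r[t] {
			missing = append(missing, t)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Illustrative subsets only, not the full inventories.
	inTree := registry{"clockreceiver": true, "pyspy": true}
	ocb := registry{"hostmetrics": true, "otlp": true, "otlphttp": true}

	benchConfig := []string{"hostmetrics"}
	fmt.Println("legacy binary, unknown types:", inTree.unknownTypes(benchConfig))
	fmt.Println("OCB binary, unknown types:", ocb.unknownTypes(benchConfig))
}
```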
Adversarial review of PR #189 flagged 4 blockers that the initial swap deferred to PR-K. Each one had the chart's renderedConfig template emitting a key the OCB binary does not recognise, meaning `helm install` would crash-loop on a fresh cluster.

Root cause: the legacy single-listener `telemetry:` block + `clockreceiver` + `stdoutexporter` references survived the cmd/tracecore deletion but the OCB binary registers only what `builder-config.yaml` lists.

Blocker fixes:

1. Chart defaults flipped to OCB-supported shape:
   - clockreceiver→false / hostmetrics→true (was the other way)
   - stdoutexporter→false / debug→true (was: stdoutexporter)
   - Default pipeline: hostmetrics → debug
   - Legacy `telemetry:` top-level block replaced by upstream `service.telemetry.metrics.address` + `healthcheckextension` on a separate port. The chart's `telemetry.enabled` knob still drives both surfaces.
   - In-tree-only toggles (clockreceiver/dcgm/kernelevents/pyspy/containerstdout/stdoutexporter) kept with explicit doc-comments naming PR-J/K as their migration owner; NOTES.txt WARNs if any is enabled.

2. PR body Breaking-changes section now enumerates every orphan component and its planned replacement. Bench values verified against upstream otlphttpexporter schema (already compatible; the chart renders the `otlphttp.*` block pass-through).

3. New regression seam: `internal/integration/ocb_scrape_test.go` spawns _build/tracecore against hostmetrics→debug, polls :NNNN/metrics, asserts both `otelcol_process_uptime` and `otelcol_receiver_accepted_metric_points` are present, then SIGTERMs and asserts clean exit. Catches an upstream metric-name rename before it ships and silently breaks dashboards. Migration doc gains the tracecore_* → otelcol_* mapping row.

4. DaemonSet probes now hit the healthcheckextension on a dedicated `health` port (default :13133) at `healthPath` (default /); the OCB binary doesn't serve /healthz or /readyz, so the previous probes would 404 forever.

Two listener ports (`telemetry`, `health`) replace one because upstream `service.telemetry` and `healthcheckextension` are separate components that each bind their own listener inside the collector process.

Chart-render CI workflow gets back the `tracecore validate` gate on the default + one-receiver-on fixtures; a chart edit that re-introduces an unknown key now fails at PR-time, not at `helm install` time.

Verified locally:
- helm lint clean (1 chart, 0 failed)
- helm template default → tracecore validate exits 0
- helm template one-receiver-on fixture → tracecore validate exits 0
- conftest: 51 tests pass against rendered chart
- make check passes (fmt, tidy, lint, vet, mod-verify)
- go test ./... passes including new integration test (race+verbose)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
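
The assertion at the heart of the regression seam in blocker fix 3 (both contract metrics appear in the scrape) can be sketched as follows. This is not the contents of `ocb_scrape_test.go`: `sampleNames` and `missingMetrics` are hypothetical helpers, the sample scrape body and its labels are invented, and matching by name prefix (to tolerate suffixes such as `_total`) is an assumption about how lenient such a check should be.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// contractMetrics are the two metric names the commit message says the test asserts.
var contractMetrics = []string{
	"otelcol_process_uptime",
	"otelcol_receiver_accepted_metric_points",
}

// sampleNames extracts the metric name of every sample line in a
// Prometheus text-format body, skipping blank and comment lines.
func sampleNames(body string) []string {
	var names []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		names = append(names, line[:end])
	}
	return names
}

// missingMetrics returns the wanted names for which no sample name has
// that prefix (assumption: prefix match, so a `_total` suffix still counts).
func missingMetrics(body string, want []string) []string {
	seen := sampleNames(body)
	var missing []string
	for _, w := range want {
		found := false
		for _, name := range seen {
			if strings.HasPrefix(name, w) {
				found = true
				break
			}
		}
		if !found {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	// Invented scrape body for illustration.
	body := "# TYPE otelcol_process_uptime counter\n" +
		"otelcol_process_uptime{service_name=\"tracecore\"} 1.5\n" +
		"otelcol_receiver_accepted_metric_points{receiver=\"hostmetrics\"} 12\n"
	fmt.Println("missing contract metrics:", missingMetrics(body, contractMetrics))
}
```

A real test would fetch the body over HTTP from the telemetry listener and fail when the returned slice is non-empty.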
Make license-check is a pre-push hook gate; the previous commit landed the new test file without the SPDX-License-Identifier line the repo standard requires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Two-file conflict from main landings since last sync:

- install/kubernetes/tracecore/Dockerfile: both sides build the OCB-generated _build/tracecore binary. Took PR-A2's canonical `make build` invocation (the OCB target) and kept #190's defense-in-depth re-link inside ./_build/ with distroless flags (CGO_ENABLED=0, -trimpath, -s -w) so the resulting binary is guaranteed-static for distroless/static-debian12.
- docs/migration/v0.1-to-v0.2.md: PR-A2 added two self-telemetry rows (otelcol_* rename + telemetry.listen split), #186 added a stdoutexporter row. All three rows belong; kept all three.

values.yaml + .github/workflows/chart.yml had no actual conflicts after fetch; they must have auto-merged in the prior sync.

Gates: make check / go test ./... / helm lint / helm template / make build / ./_build/tracecore validate against rendered chart: all green.

Signed-off-by: Tri Lam <tri@maydow.com>
CI Build linux/arm64 failed with "exec format error" because `go run go.opentelemetry.io/collector/cmd/builder` built the builder tool itself under the surrounding GOOS=linux GOARCH=arm64, then tried to exec the arm64 binary on the amd64 runner. Split the Makefile build target into (i) `go install` with GOOS=/GOARCH= (host arch) into _build/.tools, then (ii) run the builder with the target GOOS/GOARCH preserved -- the builder's inner `go build` inherits those and produces a cross-arch binary. Verified locally on darwin/arm64 host: `GOOS=linux GOARCH=arm64 make build` now emits ELF aarch64.

verify-static / generate-check failed because PR-A2 deleted tools/components-gen/ and the cmd/tracecore/components.go generation target; the workflow step still referenced `make generate-check` which no longer exists. OCB owns component registration via builder-config.yaml now, so the gate is dead. Removed it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
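
The two-step split described above hinges on building the tool for the host while leaving the target variables in place for the real build. A minimal Go sketch of that environment handling follows; `hostToolEnv` is a hypothetical helper (the actual fix lives in the Makefile and sets the variables to empty values), and the sample environment is invented.

```go
package main

import (
	"fmt"
	"strings"
)

// hostToolEnv returns a copy of env without GOOS and GOARCH, so that a
// `go install` of a build tool run with this environment targets the host
// platform instead of the cross-compilation target.
func hostToolEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		if strings.HasPrefix(kv, "GOOS=") || strings.HasPrefix(kv, "GOARCH=") {
			continue
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	// Invented environment of a cross-compiling CI job.
	ciEnv := []string{"PATH=/usr/bin", "GOOS=linux", "GOARCH=arm64", "CGO_ENABLED=0"}

	fmt.Println("step (i), tool install env:", hostToolEnv(ciEnv))
	fmt.Println("step (ii), builder run env:", ciEnv) // target GOOS/GOARCH preserved
}
```

With `os/exec`, the first slice would be assigned to `Cmd.Env` for the tool install and the untouched environment used when running the builder.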
Signed-off-by: Tri Lam <tri@maydow.com>
## Summary

PR-A2 (#189) deleted `cmd/tracecore/` (3,032 LOC across 14 source + 7 test files), `components.yaml`, `tools/components-gen/`, and the `Makefile` `generate` + `generate-check` + `run` targets. Adversarial review of #189 surfaced 8 docs still pointing at those deleted seams as if they were live. This PR sweeps the live-instruction hits and adds dated post-pivot footnotes to historical-narrative hits.

Root cause: PR-A2 prioritized atomic deletion of the build surface + sequencing-gate satisfaction; documentation cleanup was deferred to a follow-up sweep (this PR). Not a workaround; the deleted surfaces are not coming back, and the docs now reflect the new OCB-generated state.

## Files touched (8, all docs / CI fixture)

| File | Change | Reasoning |
|---|---|---|
| `AGENTS.md` | Drop `generate-check` from the `make ci` gate roster (one of seven gates listed). | Live instruction: gate no longer exists. |
| `docs/notes/stacked-pr-workflow.md` | Rewrite "Codegen-aware conflict resolution" lesson against the surviving codegen seams (`_build/tracecore/main.go` via `make build`, parser goldens via `make generate-fixtures`); soften the `make ci`-discipline lesson's `generate-check` reference to past-tense + dated. | Live instructions for stacked-PR contributors. |
| `docs/notes/2026-05-19-m14-autonomous-run.md` | Add a one-line post-A2 footnote to the codegen-aware-rebase lesson's Anchor line. Body preserved. | Historical retro: top of file already carried RFC-0013 status note; just stamping the specific anchor. |
| `docs/notes/autonomous-feature-flow.md` | Drop `generate-check` from two inline gate lists; generalize "conflicts in generated files" bullet to post-A2 seams. | Live workflow template. |
| `docs/research/baselines.md` | Drop `generate-check` from the `make ci` wall-clock fold list with a dated parenthetical. | Live measurement methodology context. |
| `docs/research/m15-container-stdout.md` | Prepend a post-pivot blockquote footnote to the "components.go is generated" subsection. Body preserved. | Research narrative: top of file lacked an explicit pointer to the deleted seam. |
| `docs/FAILURE-MODES.md` | Rewrite the two self-telemetry rows that named `cmd/tracecore/runCollect` + `telemetry.listen` to reference upstream `service.telemetry.metrics.address` bind path and `healthcheckextension` probe surface respectively. | Live operator-facing failure-mode inventory. |
| `install/kubernetes/tracecore/ci/all-receivers-off-values.yaml` | Fix fixture header to state `chart.yml` skips `tracecore validate` on this fixture (RFC-0013 PR-A2 invariant) instead of claiming it exits 0. | Live CI fixture comment. |

## Intentionally NOT touched

- **`MILESTONES.md`** (11 refs to `cmd/tracecore/components.go` + `components.yaml` across §Lane structure + per-milestone narrative). The top of MILESTONES.md carries explicit `**Status (RFC-0013):** DELETED at v0.X.0 - replaced by <upstream>` banners per block. A cold reader hits the supersession banner before the milestone-narrative ref. Per `feedback_no_bloat` x historical-record lens: leave.
- **`docs/research/m16-kueue.md`** (1 ref) and **`docs/research/m5-m6-research.md`** (2 refs). Both already carry top-of-doc `**Status (2026-05-22):** … superseded by RFC-0013 …` banners. Cold reader is served.
- **`tracecore_*` self-telemetry metric vocabulary** (~150 refs across `components/`, `internal/selftelemetry`, `internal/telemetry`, runbooks, alerts, RFCs). These are NOT stale: the in-tree `internal/selftelemetry` + `internal/telemetry` packages still emit those names (orphan-but-compiling per RFC-0013 §7; deletion ships under PR-F/PR-K). The migration to upstream `otelcol_*` is documented in `docs/migration/v0.1-to-v0.2.md`. Customer-facing rename is a single coordinated cut under PR-K, not piecewise.

## Test plan

- [x] `make doc-check`: 507 markdown links resolve, 45 test references verified, banned-phrase + comment-noise gates clean.
- [x] `make check`: `fmt` + `tidy-check` + `lint` + `vet` + `mod-verify` all green.
- [x] `git grep -rn` for residual `cmd/tracecore` / `generate-check` refs confirms remaining hits are all historical-narrative (already covered by per-file supersession banners) or live in-tree code that ships under future PR-F/PR-K.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
## Summary

RFC-0013 PR-L: expand `docs/migration/v0.1-to-v0.2.md` from the PR-179 skeleton (plus the metric-name + chart-values rows landed inline in PR-A2 / #189) into a comprehensive v0.1.x → v0.2.0 cutover guide. Every post-wave-2 landing now has a corresponding operator-facing migration row.

```release-notes
NONE
```

## What landed

| Section added / expanded | Source of truth |
|---|---|
| **CLI surface**: table covering every removed subcommand (`collect`, `receivers list`, `debug dump`, `failure-inject`) and removed flag (`--log.format=text`, `--shutdown.drain-budget`, `--version-short`) + their upstream replacements | PR #189 release-notes block + the deleted `cmd/tracecore/` tree |
| **Helm chart values**: `telemetry.listen` + `telemetry.paths.*` → `telemetry.metricsListen` + `telemetry.healthListen` + `telemetry.healthPath` with default-port values | `install/kubernetes/tracecore/values.yaml` HEAD |
| **Probes**: `/healthz` + `/readyz` on `:8888` → `healthcheckextension` at `:13133/` | `install/kubernetes/tracecore/templates/daemonset.yaml` HEAD |
| **Default pipeline**: `clockreceiver → stdoutexporter` → `hostmetrics → debug` snippet | `values.yaml` `pipelines:` block HEAD |
| **Orphan components table**: all 9 (clockreceiver, containerstdout, dcgm, k8sevents, kernelevents, nccl_fr, pyspy, otlphttp, stdoutexporter) mapped to upstream replacement + PR-J recipe | `components/receivers/`, `components/exporters/` directory inventory + RFC-0013 §2 adoption matrix |
| **Self-telemetry metric vocabulary**: `tracecore_*` → `otelcol_*` for receiver / exporter / queue / component-status / build-info families with per-signal split | upstream OCB instrumentation conventions + `internal/integration/ocb_scrape_test.go` contract metrics |
| **`stdoutexporter` failure-rate gap**: debugexporter pins `otelcol_exporter_send_failed_*` at zero; debug-only pipelines lose the signal | upstream `debugexporter` semantics |
| **Build / CI changes**: Makefile, output path (`./_build/tracecore`), source tree, smoke, image build, release pipeline, version source | Makefile + `.goreleaser.yaml` + `.ko.yaml` + workflows HEAD |
| **`internal/*` package deletion** (PR-F, in flight): per-package public-surface migration map for `selftelemetry`, `runtime/lifecycle`, `componentstatus`, `telemetry`, `pipeline`, `pipelinebuilder`, `consumer`, `fanout` | RFC-0013 PR-F scope + `internal/` tree inventory |
| **Reproducibility note**: `0.1.0-m9-alpha` hardcoded in `builder-config.yaml dist.version`; cross-ref to `docs/reproducibility.md` workaround | `builder-config.yaml` HEAD |
| **Verification**: adds probe smoke test + `tracecore components` parity check against the rendered config | new section |
| **Rollback**: recipe-toggle path is not available for the deleted set; pin chart + image at v0.1.x | corrects the v0.1.x-era rollback prose |

Closes RFC-0013 PR-L. Open follow-ups (PR-I in-repo submodule, PR-J upstream recipes, PR-K in-tree-receiver delete, PR-F internal/* delete) are referenced inline in the guide so the next agent picking up any of them lands the corresponding doc update in the same PR.

## Adversarial pre-review notes

- Verified component counts (`builder-config.yaml`: 6 receivers, 4 exporters, 3 extensions, 4 processors) against `awk` count of `gomod:` lines.
- Verified `hostmetrics` is the default in `values.yaml` (enabled: true, loadscraper, 1s).
- Verified `cmd/tracecore`, `tools/components-gen`, `components.yaml` all deleted from HEAD (`git ls-files` returns empty).
- Verified `internal/integration/ocb_scrape_test.go` is present and asserts the two contract metrics named in the guide.
- Verified the daemonset.yaml probes wire `port: health` (not `port: telemetry`) at `healthPath`.
- Verified no broken markdown links: all 5 outbound links resolve (`builder-config.yaml`, `ocb_scrape_test.go`, RFC-0013 §3, RFC-0013 §migration, in-doc anchor).
- One non-blocking observation: `install/kubernetes/tracecore/README.md` still references `/healthz` + `/readyz` on three lines (chart-doc rot from PR-A2 that didn't sweep the README). Out of scope for PR-L; flagging for the next chart-doc sweep.

## Test plan

- [x] `make doc-check` passes (banned-phrase lint, link resolution, test-name parity, all 15 sub-checks green)
- [x] Pre-commit hook (golangci-lint, go vet, go mod verify, DCO + AI trailer) passes
- [x] Pre-push hook (no-autoupdate-check) passes
- [ ] CI: `chart`, `ci`, `install-bench` workflows do not gate on this file; only `doc-check` matters for a docs-only PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
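
The probe migration recorded in the guide (legacy `/healthz` + `/readyz` replaced by the health check endpoint at `/`) can be illustrated with a stub server. This is a sketch only: `probe` and `newHealthStub` are hypothetical names, the stub stands in for the real `healthcheckextension` rather than using it, and the success rule (status 200 to 399) mirrors how Kubernetes HTTP probes judge a response.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// probe issues an HTTP GET and reports success for any status in [200, 400),
// the same range a kubelet httpGet probe treats as healthy.
func probe(baseURL, path string) (bool, error) {
	resp, err := http.Get(baseURL + path)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 400, nil
}

// newHealthStub starts a local server that answers 200 only on "/",
// standing in for a health endpoint that does not serve /healthz or /readyz.
func newHealthStub() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/" {
			http.NotFound(w, r)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newHealthStub()
	defer srv.Close()

	for _, path := range []string{"/", "/healthz", "/readyz"} {
		ok, err := probe(srv.URL, path)
		fmt.Printf("probe %-8s healthy=%v err=%v\n", path, ok, err)
	}
}
```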
Signed-off-by: Tri Lam <tri@maydow.com>
## Summary

Reconcile the four pivot-tracking docs (`docs/rfcs/0013-distro-first-pivot.md`, `CHANGELOG.md`, `MILESTONES.md`, `docs/migration/v0.1-to-v0.2.md`) with the wave-3 (PR-B1-shape sibling ports) and wave-4 (PR-B2-shape upstream-only ports + PR-F.1 + PR-J + PR-L + PR-N) landings. Pure doc sweep: no code or config touched.

## What changed

### `docs/rfcs/0013-distro-first-pivot.md` §migration

PR sequence rows updated with PR-number citations and landed markers:

- **PR-A2** (landed, #189, 2026-05-30)
- **PR-B2** (landed, #201): also enumerates sibling-receiver follow-ups under PR-B2 to dispel the slug collision with #188's PR-B2-labelled dcgm port: stdoutexporter (#202), pyspy (#203), kernelevents (#208), containerstdout (#209)
- **PR-F.1** (landed): fleshed-out delete list (`internal/{selftelemetry,telemetry}` + `components/receivers/dcgm/` + `pkg/dcgm/` + one orphan clockreceiver integration test)
- **PR-F.2** re-scoped: now deletes the whole `internal/{componentstatus,pipeline,pipelinebuilder,consumer,fanout,runtime/lifecycle}` bundle in one cut once the last three pipeline+consumer-importing receivers land (#204 k8sevents, #205 clockreceiver, #207 otlphttp). Per the import-graph state, `internal/componentstatus`'s only non-test consumer is `internal/pipeline`, so they delete together
- **PR-G** (landed, #182), **PR-H** (landed, #183)
- **PR-I.1a** (in flight; scaffold agent), **PR-I.1b** (pre-staged; gate satisfied by #201)
- **PR-J** (landed, #195): kept existing marker
- **PR-K.1** (in flight; separate agent landing)
- **PR-L** (landed, skeleton #179 + body #191): flagged as living document
- **PR-N** (landed, #200): shipped at v0.1.0 ahead of v0.3.0 as a doc-only update at `docs/migration/v0.2-to-v0.3.md`

### `CHANGELOG.md` [Unreleased]

- Restructured the pivot wave list as **four waves** (was three). Wave 3 enumerates PR-B1-shape sibling ports + support infra (#180-#194/#196). Wave 4 enumerates PR-B2-shape upstream-only ports + PR-J (#195) + PR-F.1 (#206) + PR-N (#200) + lint/TOCTOU hardening (#198/#210).
- Tightened the PR-F.2 deferred note to point at the three open ports (#204/#205/#207) as the gate.

### `MILESTONES.md`

- **M1** (pipeline runtime): status row now cites PR-A2 (#189), PR-F.1 (#206), PR-F.2 gate (#204/#205/#207), PR-E (#180), retains `internal/config/` (still load-bearing for `tracecore validate`).
- **M2** (self-telemetry): status row now cites PR-F.1 (#206); flags `internal/componentstatus` as travelling with `internal/pipeline` in PR-F.2.
- **M8** (DCGM receiver): status flipped to *landed-and-replaced*: cites PR-F.1 (#206) deletion + PR-J (#195) `docs/integrations/prometheus-scrape.md` recipe. Notes the inert chart toggle retention until PR-K.3.

### `docs/migration/v0.1-to-v0.2.md`

- §`internal/*` package deletion (PR-F) status flips from "not yet open" to "PR-F.1 landed (#206), PR-F.2 gated on three open ports".
- Open-items checklist expanded from 5 to 13 entries; it tracks every PR letter the migration guide cares about (A2 / E / F.1 / F.2 / I.1a-c / J / K.1-3 / L / N) with PR numbers and links.

## Why now

Tracking docs accumulated drift across wave-3 + wave-4 because every sibling-port PR (and the support-infra PRs around them) updated the bottom of `CHANGELOG.md` but did not always touch the upstream sequencing section in RFC-0013. Per memory rule `[Keeping this document current]`: status drift is a review blocker. This PR is the consolidated catch-up; future port PRs include their RFC-row flip in-PR.

## What this PR does NOT change

- No code, no config, no YAML, no chart; only the four tracking docs.
- No new doc gates added; existing gates pass.
- No files other than the four named docs are modified.

## Test plan

- [x] `bash scripts/doc-check.sh` clean (33 test refs, 528 links resolve, comment-noise diff gate clean vs `origin/main`, all 13 gates green).
- [x] Pre-commit hook (`commitlint` 72-char subject limit + DCO + AI-trailer gates) passed.
- [x] Pre-push hook (`make ci-fast` equivalent: `golangci-lint`, `go vet`, `go mod verify`, `no-autoupdate-check`, `doc-check.sh`) passed on second attempt after `git fetch origin main` populated the worktree's `origin/main` ref. The first push failed because the worktree previously tracked the (gone) `pr-a2-ocb-main-swap` branch, so `doc-check.sh`'s comment-noise diff-scope gate exited 128 on the missing ref. Root cause fixed by the fetch; not a workaround.
- [ ] CI green on this branch.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Summary
RFC-0013 PR-A2 (sequencing gate for PR-B2 / PR-F / PR-I): retire the hand-wired `./cmd/tracecore` entry point and adopt the OpenTelemetry Collector Builder (OCB) output at `./_build/tracecore` as the canonical binary. After this lands, all receivers register through OCB's generated `otelcol.Factories` instead of the bespoke `cmd/tracecore/components.go`.

Big diff (-3,869 / +388 across 53 files) because PR-A2 is the load-bearing pivot point; PR-B2 / PR-F / PR-I that follow can land surgically.
Breaking changes — orphan components (PR-A2 → PR-J/K bridge)
The OCB-assembled binary registers only the components in `builder-config.yaml`: 6 receivers, 4 exporters, 3 extensions, 4 processors. The chart's per-component toggles for the legacy in-tree set survive this PR so the values shape doesn't break for operators that pin them, but enabling any of the following in chart values will cause the pod to fail at startup with an "unknown factory" error until PR-B2 / PR-J / PR-K rewire them:

| Orphan component | Planned replacement |
|---|---|
| `clockreceiver` | `hostmetrics` (PR-E shipped; now default) |
| `containerstdout` | `filelogreceiver` + container stanza + `file_storage` extension (PR-J) |
| `dcgm` | `dcgm-exporter` DaemonSet + `prometheusreceiver` (PR-J) |
| `k8sevents` | `k8sobjectsreceiver` + OTTL `k8s.event.hint` transform (PR-J) |
| `kernelevents` | `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL Xid transform (PR-J) |
| `nccl_fr` | `gomod:` (PR-B2 + PR-I) |
| `pyspy` | |
| `stdoutexporter` | `debug` (OCB-bundled; now default) |
| `otlphttp` (in-tree clone) | `otlphttpexporter` (OCB-bundled; same `otlphttp` name in chart values, same field shape: `endpoint`, `compression`, `headers`, `tls.*`, `timeout`, `retry_on_failure`, `sending_queue`; pass-through render so any upstream field works) |

To verify what's actually registered in the binary you're running: `./_build/tracecore components`.

The chart's `NOTES.txt` surfaces a WARNING when an operator enables any of these, and the chart-render CI workflow now runs `tracecore validate` against the default + one-receiver-on fixtures so a chart edit that emits a non-OCB key trips CI before reaching `helm install`.

What landed
Deletions (3,032 LOC across 22 source + 7 test files)
- `cmd/tracecore/` entire tree: `main.go`, `collect.go`, `validate.go`, `debug.go`, `receivers.go`, `signals.go`, `failure_inject.go`, `openflags_{linux,other}.go`, `receiver_variants{,_dcgm_cgo,_dcgm_stub}.go`, `components.go`, + every `_test.go` (`collect_test`, `debug_test`, `failure_inject{,_linux}_test`, `integration_test`, `integration_telemetry_test`, `main_test`, `receivers_test`)
- `components.yaml` + `tools/components-gen/{main,main_test}.go`: superseded by `builder-config.yaml`; OCB owns codegen now
- `components/receivers/kernelevents/runbook_test.go`: depended on the deleted `tracecore debug dump` subcommand; kernelevents itself is scheduled for deletion in PR-K

Build path swap
| File | Change |
|---|---|
| `Makefile` | `build` target now runs OCB (was: legacy `go build ./cmd/tracecore`); dropped `generate`, `generate-check`, `run`, legacy `-ldflags -X` version injection; `coverage` `-coverpkg` drops `./cmd/...` |
| `install/kubernetes/tracecore/Dockerfile` | Builds via OCB (`make build` then copy `_build/tracecore`) |
| `install/kubernetes/tracecore/templates/daemonset.yaml` | `args` drops the `collect` subcommand; probes hit the new `health` port (13133) at `healthPath` |
| `install/kubernetes/tracecore/templates/_helpers.tpl` | `renderedConfig` emits upstream OTel shape (`service.telemetry.metrics.address` + `extensions.health_check` + `service.extensions: [health_check]`) instead of the legacy single-listener `telemetry:` top-level block |
| `install/kubernetes/tracecore/values.yaml` | Default pipeline `hostmetrics → debug`; in-tree-only toggles kept with explicit doc-comments naming PR-J/K as their migration owner |
| `.goreleaser.yaml` | Switched to `builder: prebuilt` against `./_build/{Os}-{Arch}/tracecore`; release.yml gains a per-platform pre-build step |
| `.ko.yaml` | Builds from inside `./_build/` (the OCB submodule) via `KO_CONFIG_PATH=../.ko.yaml`; `main: .` |
| `.github/workflows/ci.yml` | `package` job runs OCB; old `build-ocb` drift gate replaced by `smoke-test-binary` job consuming the package artefact |
| `.github/workflows/release.yml` | `cd ./_build/` so OCB submodule resolves |
| `.github/workflows/chart.yml` | `tracecore validate` gate against the default + one-receiver-on chart renders; path triggers swap `cmd/tracecore/**` → `builder-config.yaml` |
| `.github/workflows/install-bench.yml` | `cmd/tracecore/**` → `builder-config.yaml` |
| `builder-config.yaml` | `dist.version` bumped to `0.1.0-m9-alpha` to match `Chart.yaml` `appVersion`; `chart-appversion-check.sh` now reads it as source of truth |
| `scripts/chart-appversion-check.sh` | Reads `dist.version` from `builder-config.yaml` (was: `internal/version/version.go`) |
| `scripts/smoke.sh` | Rewritten for OCB binary (hostmetrics → debug config; expects upstream lifecycle log lines) |
| `scripts/validator-recipe.sh` | `BIN` default now `./_build/tracecore` |
| `scripts/{doc-check,no-autoupdate-check}.sh` | Drop `cmd` from scan paths |

New integration seam
`internal/integration/ocb_scrape_test.go`: spawns `_build/tracecore` against a hostmetrics → debug config, polls the upstream `:NNNN/metrics` surface, asserts both `otelcol_process_uptime` and `otelcol_receiver_accepted_metric_points` are present, then SIGTERMs the subprocess and asserts clean exit. The test skips when `_build/tracecore` is absent so a fresh `git clone` + `go test ./...` stays green; `make build` is the prereq. This is the regression gate for the chart's operator-facing self-telemetry contract (RFC-0013 §3): if a future upstream OCB release renames either metric, the chart's `service.telemetry.metrics.address` advertisement breaks downstream dashboards silently; this test fires first.

Integration recipe migration
`docs/integrations/examples/honeycomb.yaml` + `otel-backend.yaml`: `clockreceiver` → `hostmetrics` `load` scraper (the OCB-supported equivalent per RFC-0013 PR-E). The other two recipes carry `pending-rfc-0013-pr-a` markers and are still skipped by `validator-recipe.sh`.

Doc rot fixes
- `docs/FAILURE-MODES.md`: rerouted 8 entries that referenced deleted in-tree tests to their upstream OCB owners.
- `docs/FLAKY-TESTS.md`: moved the two in-tree integration flakes to Resolved.
- `STYLE.md`: rewrote repo-layout + component-registration + CLI + build-release sections around OCB.
- `PRINCIPLES.md`: dropped concrete-example reference to deleted file.
- `install/kubernetes/tracecore/README.md`: ko local-build steps now run from inside `./_build/`.
- `docs/rfcs/0013-distro-first-pivot.md`: PR-A2 entry rewritten as landed.
- `docs/migration/v0.1-to-v0.2.md`: added rows for the self-telemetry metric-name rename (`tracecore_*` → `otelcol_*`) and the `telemetry.*` chart values key rename.

Sequencing constraints honored
- `components/receivers/{clockreceiver, containerstdout, dcgm, k8sevents, kernelevents, nccl_fr, pyspy}` survive as orphan-but-compiling code until PR-K deletes them along with the chart-fixture migration.
- `internal/{pipeline, selftelemetry, telemetry, componentstatus, pipelinebuilder, consumer, fanout, runtime}` survive until PR-F (after PR-B1 lifted nccl_fr off `internal/selftelemetry` in #184).
- `tracecore validate` gate in chart workflow is restored on the default + one-receiver-on fixtures (was temporarily disabled in the first push of this PR; the chart's `renderedConfig` template migration was completed in the same PR).

Sequencing gate satisfied
PR-B2 (`nccl_fr` import swap to upstream OCB types), PR-F (`internal/*` deletion), and PR-I (Go submodule extraction) are unblocked. The legacy boot path is gone; the OCB-driven boot path is live.

Test plan
- `make build` → produces `./_build/tracecore` binary via OCB
- `./_build/tracecore --version` reports `0.1.0-m9-alpha` matching `Chart.yaml`
- `./_build/tracecore components` lists 6 receivers + 4 exporters + 3 extensions + 4 processors (the `builder-config.yaml` inventory)
- `./_build/tracecore validate --config=<rendered chart default>` exits 0
- `./_build/tracecore validate --config=<rendered chart one-receiver-on fixture>` exits 0
- `make smoke` passes (hostmetrics → debug, 1.5s window, clean shutdown)
- `make check` passes
- `go test ./internal/integration/...` passes (new OCB scrape test; ~1.2s)
- `helm lint install/kubernetes/tracecore` clean (1 chart, 0 failed; icon advisory only)
- `helm template demo install/kubernetes/tracecore --show-only templates/daemonset.yaml` renders `args: [--config=…]` + two ports (`telemetry`, `health`) + probes hitting `health` port at `/`
- `helm template demo install/kubernetes/tracecore --show-only templates/configmap.yaml | yq '.data["config.yaml"]'` renders upstream OTel shape: `service.telemetry.metrics.address`, `extensions.health_check`, `service.extensions: [health_check]`
- `conftest test --policy install/kubernetes/tracecore/policies/conftest/tracecore.rego /tmp/chart-render.yaml`: 51 tests pass
- `package` job builds amd64 + arm64 OCB binaries
- `smoke-test-binary` job runs `--version` + `components` on the package artefact
- `chart` workflow lints + templates + validate + yq + conftest pass
- `install-bench` workflow's kind-cluster install still rolls out (bench values updated to drop the obsolete `stdoutexporter` reference and align `otlphttp.endpoint` with the upstream `otlphttpexporter` schema; pass-through render so all upstream fields work without chart changes)
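
For reviewers who want to reason about the scrape gate described under "New integration seam" without opening the test, the assertion step can be sketched roughly as below. This is not the code in `internal/integration/ocb_scrape_test.go`. The helper name `missingMetrics`, the prefix-matching rule (so that a suffixed sample name would still count), and the sample payload are all assumptions made for illustration. Spawning the binary, polling, and the SIGTERM check are left out.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// missingMetrics is a hypothetical helper: it returns the entries of required
// that never appear as a sample-name prefix in a Prometheus text-exposition body.
func missingMetrics(body string, required []string) []string {
	seen := make(map[string]bool, len(required))
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The sample name ends at the first '{' (labels) or space (value).
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		for _, want := range required {
			// Assumption: prefix match, so unit or _total suffixes still count.
			if strings.HasPrefix(name, want) {
				seen[want] = true
			}
		}
	}
	var missing []string
	for _, want := range required {
		if !seen[want] {
			missing = append(missing, want)
		}
	}
	return missing
}

func main() {
	// Hypothetical scrape payload, not captured from a real run.
	body := `# TYPE otelcol_process_uptime counter
otelcol_process_uptime{service_name="tracecore"} 1.2
otelcol_receiver_accepted_metric_points{receiver="hostmetrics"} 42
`
	required := []string{
		"otelcol_process_uptime",
		"otelcol_receiver_accepted_metric_points",
	}
	if missing := missingMetrics(body, required); len(missing) > 0 {
		fmt.Println("self-telemetry contract broken, missing:", missing)
		return
	}
	fmt.Println("both contract metrics present")
}
```

A test built this way fails with the name of the metric that disappeared, which is the signal needed if an upstream rename lands.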