feat(ocb): PR-A2 — switch tracecore entrypoint to OCB-generated main#189

Merged: trilamsr merged 8 commits into main from pr-a2-ocb-main-swap on May 31, 2026

@trilamsr trilamsr commented May 31, 2026

Summary

RFC-0013 PR-A2 (sequencing gate for PR-B2 / PR-F / PR-I): retire the hand-wired ./cmd/tracecore entry point and adopt the OpenTelemetry Collector Builder (OCB) output at ./_build/tracecore as the canonical binary. After this lands, all receivers register through OCB's generated otelcol.Factories instead of the bespoke cmd/tracecore/components.go.

Big diff (-3,869 / +388 across 53 files) because PR-A2 is the load-bearing pivot point; PR-B2 / PR-F / PR-I that follow can land surgically.

CLI surface change — `tracecore` now uses the upstream OCB-generated CLI:
- `tracecore collect --config=…` → `tracecore --config=…`
  (collect was default; OCB main runs the collector by default)
- `tracecore receivers list` → `tracecore components`
  (now shows receivers + processors + exporters + extensions + connectors)
- `tracecore debug dump` → removed (no OCB equivalent; use `tracecore
  components` + the live config when filing issues)
- `tracecore failure-inject {nccl-hang,pod-evict}` → use the standalone
  `tools/failure-inject` binary (already ships xid, nccl-hang,
  pod-evict, cpu-steal subcommands)
- `--log.format=text` / `--shutdown.drain-budget=…` / `--version-short`
  → removed; OCB upstream uses `--feature-gates` + `--set` flags
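
For wrapper scripts or runbooks that still pass the legacy arguments, the mapping above can be sketched as a small translation helper. This is an illustrative sketch only: `TranslateArgs` and the sample config path are hypothetical names that do not exist in the repo, and the rules encode nothing beyond the list above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// removedFlags lists the legacy flags the CLI-surface list marks as removed.
var removedFlags = []string{"--log.format", "--shutdown.drain-budget", "--version-short"}

// TranslateArgs (hypothetical helper) rewrites a legacy tracecore argument
// list, without the program name, into its post-PR-A2 form. It returns an
// error when the invocation has no replacement on the OCB-generated CLI.
func TranslateArgs(args []string) ([]string, error) {
	for _, a := range args {
		for _, f := range removedFlags {
			if a == f || strings.HasPrefix(a, f+"=") {
				return nil, fmt.Errorf("flag %s was removed", f)
			}
		}
	}
	switch {
	case len(args) >= 1 && args[0] == "collect":
		// collect was the default subcommand; the OCB main runs the collector by default.
		return args[1:], nil
	case len(args) >= 2 && args[0] == "receivers" && args[1] == "list":
		return append([]string{"components"}, args[2:]...), nil
	case len(args) >= 2 && args[0] == "debug" && args[1] == "dump":
		return nil, errors.New("debug dump was removed; use `components` plus the live config")
	case len(args) >= 1 && args[0] == "failure-inject":
		return nil, errors.New("failure-inject moved to the standalone tools/failure-inject binary")
	}
	// Anything else is passed through unchanged.
	return args, nil
}

func main() {
	legacyCalls := [][]string{
		{"collect", "--config=/etc/tracecore/config.yaml"}, // path is a placeholder
		{"receivers", "list"},
		{"debug", "dump"},
	}
	for _, legacy := range legacyCalls {
		out, err := TranslateArgs(legacy)
		fmt.Println(legacy, "=>", out, err)
	}
}
```

Returning an error for the removed surfaces, rather than silently dropping them, keeps a stale automation script from appearing to succeed.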

Chart-shape changes — operator-visible:
- Default pipeline flips from clockreceiver→stdoutexporter (in-tree,
  not registered by OCB) to hostmetrics→debug (both upstream
  OCB-bundled). Fresh install on a no-GPU cluster boots and emits
  load-average metrics immediately, same as before.
- `telemetry.listen` + `telemetry.paths.{metrics,healthz,readyz}` →
  `telemetry.metricsListen` + `telemetry.healthListen` +
  `telemetry.healthPath`. The legacy single-listener block is gone
  because upstream `service.telemetry` and `healthcheckextension` are
  two separate components, each with its own listener. Probes hit
  `:13133/` (healthcheckextension default) instead of `:8888/healthz`.
  Prometheus scrape port (8888) is unchanged.
- Self-telemetry metric names rename `tracecore_*` → `otelcol_*`
  (upstream vocabulary). Dashboards on any `tracecore_receiver_*`,
  `tracecore_exporter_*`, `tracecore_queue_*`, `tracecore_component_*`,
  or `tracecore_build_info` must be rewritten — see
  `docs/migration/v0.1-to-v0.2.md` for the exact map.
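
Before applying the migration map, it can help to find which dashboard queries still reference the retired prefix. The sketch below is one possible way to do that. `FindLegacyMetrics` is a hypothetical helper and the metric names in the sample query are placeholders. It only flags `tracecore_*` identifiers and does not attempt the rename, because the exact old-to-new map lives in the migration doc.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// legacyMetric matches identifiers that start with the retired tracecore_ prefix.
var legacyMetric = regexp.MustCompile(`\btracecore_[a-zA-Z0-9_]+`)

// FindLegacyMetrics (hypothetical helper) returns the distinct tracecore_*
// names referenced in a query string, sorted alphabetically.
func FindLegacyMetrics(query string) []string {
	seen := map[string]bool{}
	for _, m := range legacyMetric.FindAllString(query, -1) {
		seen[m] = true
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Placeholder query; the metric names are illustrative, not taken from the repo.
	q := `rate(tracecore_receiver_accepted_total[5m]) / rate(tracecore_exporter_sent_total[5m])`
	for _, name := range FindLegacyMetrics(q) {
		fmt.Println("needs rewrite:", name)
	}
}
```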

Breaking changes — orphan components (PR-A2 → PR-J/K bridge)

The OCB-assembled binary registers only the components in
builder-config.yaml: 6 receivers, 4 exporters, 3 extensions,
4 processors. The chart's per-component toggles for the legacy in-tree
set survive this PR so the values shape doesn't break for operators
that pin them, but enabling any of the following in chart values
will cause the pod to fail at startup with an "unknown factory" error
until PR-B2 / PR-J / PR-K rewire them:

| Component | Kind | Replacement (planned) |
|---|---|---|
| clockreceiver | receiver | hostmetrics (PR-E shipped; now default) |
| containerstdout | receiver | filelogreceiver + container stanza + file_storage extension (PR-J) |
| dcgm | receiver | dcgm-exporter DaemonSet + prometheusreceiver (PR-J) |
| k8sevents | receiver | k8sobjectsreceiver + OTTL k8s.event.hint transform (PR-J) |
| kernelevents | receiver | journaldreceiver + filelogreceiver (kmsg) + OTTL Xid transform (PR-J) |
| nccl_fr | receiver | In-repo Go submodule via OCB `gomod:` (PR-B2 + PR-I) |
| pyspy | receiver | Deferred until OTel Profiles GA |
| stdoutexporter | exporter | debug (OCB-bundled; now default) |
| otlphttp (in-tree clone) | exporter | otlphttpexporter (OCB-bundled; same otlphttp name in chart values, same field shape: endpoint, compression, headers, tls.*, timeout, retry_on_failure, sending_queue; pass-through render so any upstream field works) |

To verify what's actually registered in the binary you're running, run
`./_build/tracecore components`.

The chart's NOTES.txt surfaces a WARNING when an operator enables
any of these, and the chart-render CI workflow now runs tracecore validate against the default + one-receiver-on fixtures so a chart
edit that emits a non-OCB key trips CI before reaching helm install.

What landed

Deletions (3,032 LOC across 22 source + 7 test files)

  • cmd/tracecore/ entire tree: main.go, collect.go, validate.go, debug.go, receivers.go, signals.go, failure_inject.go, openflags_{linux,other}.go, receiver_variants{,_dcgm_cgo,_dcgm_stub}.go, components.go, + every _test.go (collect_test, debug_test, failure_inject{,_linux}_test, integration_test, integration_telemetry_test, main_test, receivers_test)
  • components.yaml + tools/components-gen/{main,main_test}.go — superseded by builder-config.yaml; OCB owns codegen now
  • components/receivers/kernelevents/runbook_test.go — depended on the deleted tracecore debug dump subcommand; kernelevents itself is scheduled for deletion in PR-K

Build path swap

| File | Change |
|---|---|
| `Makefile` | build target now runs OCB (was: legacy go build ./cmd/tracecore); dropped generate, generate-check, run, legacy -ldflags -X version injection; coverage -coverpkg drops ./cmd/... |
| `install/kubernetes/tracecore/Dockerfile` | Builds via OCB (make build then copy _build/tracecore) |
| `install/kubernetes/tracecore/templates/daemonset.yaml` | args drops the collect subcommand; probes hit the new health port (13133) at healthPath |
| `install/kubernetes/tracecore/templates/_helpers.tpl` | renderedConfig emits upstream OTel shape (service.telemetry.metrics.address + extensions.health_check + service.extensions: [health_check]) instead of the legacy single-listener telemetry: top-level block |
| `install/kubernetes/tracecore/values.yaml` | Default pipeline flipped to hostmetrics → debug; in-tree-only toggles kept with explicit doc-comments naming PR-J/K as their migration owner |
| `.goreleaser.yaml` | Switched to builder: prebuilt against ./_build/{Os}-{Arch}/tracecore; release.yml gains a per-platform pre-build step |
| `.ko.yaml` | Builds from inside ./_build/ (the OCB submodule) via KO_CONFIG_PATH=../.ko.yaml; main: . |
| `.github/workflows/ci.yml` | package job runs OCB; old build-ocb drift gate replaced by smoke-test-binary job consuming the package artefact |
| `.github/workflows/release.yml` | New "Pre-build OCB binaries" step before goreleaser; ko-publish step runs from cd ./_build/ so OCB submodule resolves |
| `.github/workflows/chart.yml` | Restored the tracecore validate gate against the default + one-receiver-on chart renders; path triggers swap `cmd/tracecore/**` → `builder-config.yaml` |
| `.github/workflows/install-bench.yml` | Path triggers swap `cmd/tracecore/**` → `builder-config.yaml` |
| `builder-config.yaml` | dist.version bumped to 0.1.0-m9-alpha to match Chart.yaml appVersion; chart-appversion-check.sh now reads it as source of truth |
| `scripts/chart-appversion-check.sh` | Read dist.version from builder-config.yaml (was: internal/version/version.go) |
| `scripts/smoke.sh` | Rewritten for OCB binary: hostmetrics → debug config, expects upstream lifecycle log lines |
| `scripts/validator-recipe.sh` | BIN default now ./_build/tracecore |
| `scripts/{doc-check,no-autoupdate-check}.sh` | Drop cmd from scan paths |

New integration seam

  • internal/integration/ocb_scrape_test.go: spawns _build/tracecore against a hostmetrics → debug config, polls the upstream :NNNN/metrics surface, asserts both otelcol_process_uptime and otelcol_receiver_accepted_metric_points are present, then SIGTERMs the subprocess and asserts clean exit. The test skips when _build/tracecore is absent so a fresh git clone + go test ./... stays green; make build is the prereq. This is the regression gate for the chart's operator-facing self-telemetry contract (RFC-0013 §3): if a future upstream OCB release renames either metric, the chart's service.telemetry.metrics.address advertisement breaks downstream dashboards silently — this test fires first.
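
The assertion half of that seam can be sketched as follows. This is not the code in `ocb_scrape_test.go`; `MissingMetrics` is a hypothetical helper, and tolerating a `_total` suffix is an assumption on my part (Prometheus exporters commonly append it to counters), not something the PR states. The sketch takes scraped text-exposition output and reports which contract metrics never appear.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// MissingMetrics (hypothetical helper) scans Prometheus text-exposition
// output and returns the entries of want that never appear as the metric
// name of a sample line. A name also counts as present when it shows up
// with a "_total" suffix (assumption, see lead-in).
func MissingMetrics(exposition string, want []string) []string {
	present := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		// The metric name ends at the first '{' (labels) or space (value).
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		for _, w := range want {
			if name == w || name == w+"_total" {
				present[w] = true
			}
		}
	}
	var missing []string
	for _, w := range want {
		if !present[w] {
			missing = append(missing, w)
		}
	}
	return missing
}

func main() {
	// Hand-written sample body; a real test would read it from the metrics endpoint.
	body := "# TYPE otelcol_process_uptime counter\n" +
		"otelcol_process_uptime{service_name=\"tracecore\"} 1.5\n"
	contract := []string{"otelcol_process_uptime", "otelcol_receiver_accepted_metric_points"}
	fmt.Println("missing:", MissingMetrics(body, contract))
}
```

Checking metric names on sample lines, rather than substring-matching the whole body, avoids a false pass when a name only appears inside a HELP comment.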

Integration recipe migration

  • docs/integrations/examples/honeycomb.yaml + otel-backend.yaml: clockreceiver → hostmetrics loadscraper (the OCB-supported equivalent per RFC-0013 PR-E). The other two recipes carry pending-rfc-0013-pr-a markers and are still skipped by validator-recipe.sh.

Doc rot fixes

  • docs/FAILURE-MODES.md: rerouted 8 entries that referenced deleted in-tree tests to their upstream OCB owners.
  • docs/FLAKY-TESTS.md: moved the two in-tree integration flakes to Resolved.
  • STYLE.md: rewrote repo-layout + component-registration + CLI + build-release sections around OCB.
  • PRINCIPLES.md: dropped concrete-example reference to deleted file.
  • install/kubernetes/tracecore/README.md: ko local-build steps now run from inside ./_build/.
  • docs/rfcs/0013-distro-first-pivot.md: PR-A2 entry rewritten as landed.
  • docs/migration/v0.1-to-v0.2.md: added rows for the self-telemetry metric-name rename (tracecore_* → otelcol_*) and the telemetry.* chart values key rename.

Sequencing constraints honored

  • components/receivers/{clockreceiver, containerstdout, dcgm, k8sevents, kernelevents, nccl_fr, pyspy} survive as orphan-but-compiling code until PR-K deletes them along with the chart-fixture migration.
  • internal/{pipeline, selftelemetry, telemetry, componentstatus, pipelinebuilder, consumer, fanout, runtime} survive until PR-F (after PR-B1 lifted nccl_fr off internal/selftelemetry in feat(pivot): PR-B1 — port nccl_fr off internal selftel + lifecycle #184).
  • tracecore validate gate in chart workflow is restored on the default + one-receiver-on fixtures (was temporarily disabled in the first push of this PR; the chart's renderedConfig template migration was completed in the same PR).

Sequencing gate satisfied

PR-B2 (nccl_fr import swap to upstream OCB types), PR-F (internal/* deletion), and PR-I (Go submodule extraction) are unblocked. The legacy boot path is gone; the OCB-driven boot path is live.

Test plan

  • make build → produces ./_build/tracecore binary via OCB
  • ./_build/tracecore --version reports 0.1.0-m9-alpha matching Chart.yaml
  • ./_build/tracecore components lists 6 receivers + 4 exporters + 3 extensions + 4 processors (the builder-config.yaml inventory)
  • ./_build/tracecore validate --config=<rendered chart default> exits 0
  • ./_build/tracecore validate --config=<rendered chart one-receiver-on fixture> exits 0
  • make smoke passes (hostmetrics → debug, 1.5s window, clean shutdown)
  • make check passes
  • go test ./internal/integration/... passes (new OCB scrape test; ~1.2s)
  • helm lint install/kubernetes/tracecore clean (1 chart, 0 failed; icon advisory only)
  • helm template demo install/kubernetes/tracecore --show-only templates/daemonset.yaml renders args: [--config=…] + two ports (telemetry, health) + probes hitting health port at /
  • helm template demo install/kubernetes/tracecore --show-only templates/configmap.yaml | yq '.data["config.yaml"]' renders upstream OTel shape: service.telemetry.metrics.address, extensions.health_check, service.extensions: [health_check]
  • conftest test --policy install/kubernetes/tracecore/policies/conftest/tracecore.rego /tmp/chart-render.yaml — 51 tests pass
  • CI: package job builds amd64 + arm64 OCB binaries
  • CI: smoke-test-binary job runs --version + components on the package artefact
  • CI: chart workflow lints + templates + validate + yq + conftest pass
  • CI: install-bench workflow's kind-cluster install still rolls out (bench values updated to drop the obsolete stdoutexporter reference and align otlphttp.endpoint with the upstream otlphttpexporter schema — pass-through render so all upstream fields work without chart changes)

RFC-0013 PR-A2 (sequencing gate for PR-B2 / PR-F / PR-I): retire the
hand-wired ./cmd/tracecore entry point and adopt the OpenTelemetry
Collector Builder (OCB) output at ./_build/tracecore as the canonical
binary. After this lands, all receivers register through OCB's
generated otelcol.Factories instead of the bespoke
cmd/tracecore/components.go.

Deletions (3,032 LOC across 22 source + 7 test files):
- cmd/tracecore/ entire tree (main, collect, validate, debug,
  receivers, signals, failure_inject, openflags, receiver_variants)
- components.yaml + tools/components-gen/ (superseded by
  builder-config.yaml; OCB owns codegen)
- components/receivers/kernelevents/runbook_test.go (depended on the
  deleted `tracecore debug dump` subcommand; kernelevents itself is
  scheduled for deletion in PR-K)

Build path swap:
- Makefile: `build` target now runs OCB (was: legacy `go build
  ./cmd/tracecore`); dropped `generate`, `generate-check`, `run`,
  legacy -ldflags -X version injection; coverage -coverpkg drops
  ./cmd/...
- install/kubernetes/tracecore/Dockerfile: builds via OCB (`make
  build` then copy `_build/tracecore`)
- install/kubernetes/tracecore/templates/daemonset.yaml: args drop the
  `collect` subcommand (OCB main runs the collector as default)
- .goreleaser.yaml: switched to `builder: prebuilt` against
  ./_build/{Os}-{Arch}/tracecore; release.yml gains a per-platform
  pre-build step before invoking goreleaser
- .ko.yaml: builds from inside ./_build/ (the OCB submodule) via
  KO_CONFIG_PATH=../.ko.yaml; main: .
- .github/workflows/ci.yml: package job runs OCB; build-ocb drift gate
  replaced by smoke-test-binary job consuming the package artefact
- builder-config.yaml: dist.version bumped to 0.1.0-m9-alpha to match
  Chart.yaml appVersion (chart-appversion-check.sh now reads it as the
  source of truth, replacing internal/version/version.go)

Sequencing constraints honored:
- components/receivers/{clockreceiver, containerstdout, dcgm,
  k8sevents, kernelevents, nccl_fr, pyspy} survive as
  orphan-but-compiling code until PR-K deletes them along with the
  chart-fixture migration
- internal/{pipeline, selftelemetry, telemetry, componentstatus,
  pipelinebuilder, consumer, fanout, runtime} survive until PR-F
- chart workflow's `tracecore validate` step temporarily disabled —
  the chart's renderedConfig still emits the legacy `telemetry:`
  top-level key and references clockreceiver/pyspy/containerstdout;
  PR-K reinstates the gate once the chart shape migrates

Operator surface changes (release-notes block):
- `tracecore collect --config=…` → `tracecore --config=…` (collect was
  the default subcommand; OCB main runs the collector by default)
- `tracecore --log.format=text` / `--shutdown.drain-budget=…` /
  `--version-short` → removed; OCB uses upstream `--feature-gates` +
  `--set` flag surface, version reads from binary metadata
- `tracecore receivers list` → `tracecore components` (shows
  receivers/processors/exporters/extensions/connectors)
- `tracecore debug dump` → removed (OCB has no equivalent; operators
  filing issues use `tracecore components` + the live config)
- `tracecore failure-inject {nccl-hang,pod-evict}` → replaced by the
  standalone `tools/failure-inject` binary (already shipped with
  xid, nccl-hang, pod-evict, cpu-steal subcommands)

Doc-rot fixes:
- docs/FAILURE-MODES.md: rerouted 8 entries that referenced deleted
  cmd/tracecore tests to upstream OCB owners; legacy contracts
  (signal handling, multi-instance components, empty-config WARN)
  now flow through service.New / otelcol.Collector
- docs/FLAKY-TESTS.md: moved the two cmd/tracecore.TestIntegration_*
  flakes to Resolved (deleted with the legacy entry point)
- docs/integrations/examples/{honeycomb,otel-backend}.yaml: clockreceiver
  → hostmetrics loadscraper (OCB-supported); pending-rfc-0013-pr-a
  recipes unchanged
- STYLE.md: repo layout + component-registration + CLI + build-release
  sections rewritten around OCB
- PRINCIPLES.md: dropped concrete-example reference to deleted file
- scripts/doc-check.sh, scripts/no-autoupdate-check.sh: drop `cmd`
  from scan paths
- scripts/chart-appversion-check.sh: read dist.version from
  builder-config.yaml instead of internal/version/version.go
- scripts/smoke.sh: rewritten for OCB binary (hostmetrics → debug
  config; expects upstream lifecycle log lines)
- scripts/validator-recipe.sh: BIN default now ./_build/tracecore

Verification:
- `make build` (OCB) → produces ./_build/tracecore binary
- `./_build/tracecore --version` reports 0.1.0-m9-alpha matching Chart.yaml
- `./_build/tracecore components` lists 6 receivers, 4 exporters,
  3 extensions, 4 processors (the builder-config.yaml inventory)
- `make smoke` passes (hostmetrics → debug, 1.5s window, clean shutdown)
- `make check`, `make verify` pass (license, fmt, lint, vet, tidy,
  build-tags, register-lint, actionlint, zizmor, doc-check,
  chart-appversion-check, no-autoupdate-check, nccl-fr-rce-gate)
- `go test -race ./...` passes (skipping TestReceiver_SLIBudget +
  TestSLIBudget_WarmupDiscardIsLoadBearing — pre-existing macOS p99
  flakes already excluded from `make test-extras-race`)
- `helm lint install/kubernetes/tracecore` clean
- `helm template` renders daemonset with `args: [--config=…]`
  (no longer `[collect, --config=…]`)

Sequencing gate: PR-B2 / PR-F / PR-I are unblocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) May 31, 2026 04:30
@trilamsr trilamsr disabled auto-merge May 31, 2026 04:30
@trilamsr trilamsr enabled auto-merge (squash) May 31, 2026 04:31
@trilamsr trilamsr disabled auto-merge May 31, 2026 04:36
@trilamsr

Adversarial cross-cut review found 4 blockers — auto-merge disabled until fixes land. Fixer agent in flight. Summary:

🔴 F1: chart defaults break helm install. values.yaml still defaults clockreceiver (not in OCB binary). validate-CI gate deleted in same PR so this regression has no seam.

🔴 F2: 9 in-tree components silently orphaned. Release-notes block doesn't disclose breadth. Bench otlphttp config needs schema verification against upstream otlphttpexporter.

🔴 F3: metric-name contract deleted, no replacement test. tracecore_* integration test removed. New OCB binary emits otelcol_* vocabulary. Operators upgrading will see blank dashboards with no migration note.

🔴 F4: chart probes break. /healthz + /readyz not served by OCB binary; healthcheckextension is bundled but not wired in configmap template.

Fixes incoming.

trilamsr added a commit that referenced this pull request May 31, 2026
## Root cause

PR #180 (`chore(pivot): PR-E unblock — bench heartbeat to
hostmetricsreceiver`) enabled the `hostmetrics` receiver in
`bench/install/tracecore-values.yaml` and added it to
`builder-config.yaml`, but the install-bench `Dockerfile` was still
building from `./cmd/tracecore`. The generated
`cmd/tracecore/components.go` only registers in-tree receivers —
`hostmetricsreceiver` is upstream OTel-contrib and is only bundled by
the OCB-assembled binary at `_build/tracecore`. The daemonset pod failed
config load with `unknown component type hostmetrics`, `kubectl rollout
status` timed out at 5 m, and `bash -e` aborted `run.sh` before any
diagnostics fired — so CI showed a bare red with no actionable log.

`install-bench` has been red on `main` since 2026-05-31T02:28:40Z and on
every PR opened after #180.

**Affected PRs (open as of this writing)**: #186, #187, #188, #189.

## Fix

Switch `install/kubernetes/tracecore/Dockerfile` to build via OCB:

```
make build-ocb   # generates ./_build/{main.go,go.mod,...} + compiles
# re-link as a static binary for the distroless base
cd _build && CGO_ENABLED=0 go build -trimpath -ldflags "-s -w" .
```

The re-link with our flags guarantees the static binary the distroless
base can exec; OCB's intermediate compile uses its own defaults. The
final image still uses `gcr.io/distroless/static-debian12:nonroot` at
the same pinned digest.

This is a **tactical bridge**: PR-A2 (#189) makes `_build/tracecore` the
canonical binary for all builds. Once that lands, the in-tree
`cmd/tracecore` path retires entirely (RFC-0013 PR-F) and this
Dockerfile change becomes the new normal across every image, not just
install-bench.

### Why not the alternatives?

- **Revert hostmetrics in bench values** → walks back PR #180's pivot
intent ("no custom receiver where upstream satisfies"); the legacy
`clockreceiver` is on its way out in PR-K.
- **Add hostmetricsreceiver to `cmd/tracecore/components.go`** →
diverges the in-tree component list from the OCB-managed one; the whole
point of PR-A2 is to delete that divergence.

## Bonus: surface root cause on rollout-status failure

`bench/install/run.sh` had post-deadline diagnostics for the first-data
path, but the rollout-status path (the actual failure mode of this
regression) just exited via `set -e`. Added `dump_failure_diagnostics()`
(pod state, `kubectl describe`, current + previous container logs,
rendered config) wired to both failure paths; refactor eliminates the
duplicated tracecore-pod spelunking that lived inline. Future
regressions surface root cause in the CI log without re-running.

## Verification

```
$ make check
… 0 issues, all modules verified

$ docker build -f install/kubernetes/tracecore/Dockerfile -t tracecore:bench-test .
… exporting to image done

$ docker run --rm tracecore:bench-test components | grep -E "hostmetrics|otlp"
    - name: hostmetrics
      module: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.110.0
    - name: otlp
      module: go.opentelemetry.io/collector/receiver/otlpreceiver v0.110.0
    - name: otlphttp
      module: go.opentelemetry.io/collector/exporter/otlphttpexporter v0.110.0
```

End-to-end install-bench (kind cluster + helm install) runs on this PR
via the workflow itself.

## Cost

Docker build stage adds ~100 s (OCB compile inside Alpine). Bench Docker
rebuild only fires on chart/bench/builder-config changes — acceptable.

```release-notes
[CI] install-bench Dockerfile now builds via OpenTelemetry Collector Builder so the bench daemonset can load hostmetricsreceiver; also dumps pod state, logs, and rendered config on rollout-status failure. Unblocks every PR opened after #180.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Tri Lam and others added 5 commits May 30, 2026 21:56
Adversarial review of PR #189 flagged 4 blockers that the initial swap
deferred to PR-K. Each one had the chart's renderedConfig template
emitting a key the OCB binary does not recognise — meaning `helm
install` would crash-loop on a fresh cluster. Root cause: the legacy
single-listener `telemetry:` block + `clockreceiver` + `stdoutexporter`
references survived the cmd/tracecore deletion but the OCB binary
registers only what `builder-config.yaml` lists.

Blocker fixes:

1. Chart defaults flipped to OCB-supported shape:
   - clockreceiver→false / hostmetrics→true (was the other way)
   - stdoutexporter→false / debug→true (was: stdoutexporter)
   - Default pipeline: hostmetrics → debug
   - Legacy `telemetry:` top-level block replaced by upstream
     `service.telemetry.metrics.address` + `healthcheckextension` on
     a separate port. The chart's `telemetry.enabled` knob still
     drives both surfaces.
   - In-tree-only toggles (clockreceiver/dcgm/kernelevents/pyspy/
     containerstdout/stdoutexporter) kept with explicit doc-comments
     naming PR-J/K as their migration owner; NOTES.txt WARNs if any
     is enabled.

2. PR body Breaking-changes section now enumerates every orphan
   component and its planned replacement. Bench values verified
   against upstream otlphttpexporter schema (already compatible — the
   chart renders the `otlphttp.*` block pass-through).

3. New regression seam: `internal/integration/ocb_scrape_test.go`
   spawns _build/tracecore against hostmetrics→debug, polls
   :NNNN/metrics, asserts both `otelcol_process_uptime` and
   `otelcol_receiver_accepted_metric_points` are present, then
   SIGTERMs and asserts clean exit. Catches an upstream metric-name
   rename before it ships and silently breaks dashboards.
   Migration doc gains the tracecore_* → otelcol_* mapping row.

4. DaemonSet probes now hit the healthcheckextension on a dedicated
   `health` port (default :13133) at `healthPath` (default /); the
   OCB binary doesn't serve /healthz /readyz so the previous probes
   would 404 forever. Two listener ports (`telemetry`, `health`)
   replace one because upstream `service.telemetry` and
   `healthcheckextension` are two separate components, each with its
   own listener inside the same collector process.

Chart-render CI workflow gets back the `tracecore validate` gate on
the default + one-receiver-on fixtures; a chart edit that re-introduces
an unknown key now fails at PR-time, not at `helm install` time.

Verified locally:
- helm lint clean (1 chart, 0 failed)
- helm template default → tracecore validate exits 0
- helm template one-receiver-on fixture → tracecore validate exits 0
- conftest: 51 tests pass against rendered chart
- make check passes (fmt, tidy, lint, vet, mod-verify)
- go test ./... passes including new integration test (race+verbose)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
`make license-check` is a pre-push hook gate; the previous commit
landed the new test file without the SPDX-License-Identifier line
the repo standard requires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
Two-file conflict from main landings since last sync:

- install/kubernetes/tracecore/Dockerfile: both sides build the
  OCB-generated _build/tracecore binary. Took PR-A2's canonical
  `make build` invocation (the OCB target) and kept #190's
  defense-in-depth re-link inside ./_build/ with distroless flags
  (CGO_ENABLED=0, -trimpath, -s -w) so the resulting binary is
  guaranteed-static for distroless/static-debian12.
- docs/migration/v0.1-to-v0.2.md: PR-A2 added two self-telemetry
  rows (otelcol_* rename + telemetry.listen split), #186 added a
  stdoutexporter row. All three rows belong; kept all three.

values.yaml + .github/workflows/chart.yml had no actual conflicts
after fetch — must have auto-merged in the prior sync.

Gates: make check / go test ./... / helm lint / helm template /
make build / ./_build/tracecore validate against rendered chart —
all green.

Signed-off-by: Tri Lam <tri@maydow.com>
CI Build linux/arm64 failed with "exec format error" because `go run
go.opentelemetry.io/collector/cmd/builder` built the builder tool itself
under the surrounding GOOS=linux GOARCH=arm64, then tried to exec the
arm64 binary on the amd64 runner. Split the Makefile build target into
(i) `go install` with GOOS=/GOARCH= (host arch) into _build/.tools, then
(ii) run the builder with the target GOOS/GOARCH preserved -- the
builder's inner `go build` inherits those and produces a cross-arch
binary. Verified locally on darwin/arm64 host: `GOOS=linux GOARCH=arm64
make build` now emits ELF aarch64.
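
The two-step split can be illustrated with a small environment filter. The real fix lives in the Makefile; `HostEnv` below is a hypothetical Go helper that only shows the idea. Step (i) builds the builder tool with GOOS/GOARCH stripped so it runs on the host, and step (ii) runs that tool with the target values left in place.

```go
package main

import (
	"fmt"
	"strings"
)

// HostEnv (hypothetical helper) returns env without GOOS/GOARCH entries, so a
// tool compiled under it targets the machine doing the build.
func HostEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		if strings.HasPrefix(kv, "GOOS=") || strings.HasPrefix(kv, "GOARCH=") {
			continue
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	// Environment as a cross-compiling CI job would see it (sample values).
	target := []string{"PATH=/usr/bin", "GOOS=linux", "GOARCH=arm64"}

	// Step (i): install the builder for the host, without the target settings.
	fmt.Println("install builder with:", HostEnv(target))

	// Step (ii): run the builder with the target preserved; its inner
	// `go build` inherits GOOS/GOARCH and emits the cross-arch binary.
	fmt.Println("run builder with:    ", target)
}
```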

verify-static / generate-check failed because PR-A2 deleted
tools/components-gen/ and the cmd/tracecore/components.go generation
target; the workflow step still referenced `make generate-check` which
no longer exists. OCB owns component registration via builder-config.yaml
now, so the gate is dead. Removed it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) May 31, 2026 05:37
@trilamsr trilamsr merged commit 8473a1a into main May 31, 2026
14 checks passed
@trilamsr trilamsr deleted the pr-a2-ocb-main-swap branch May 31, 2026 05:48
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

PR-A2 (#189) deleted `cmd/tracecore/` (3,032 LOC across 14 source + 7
test files), `components.yaml`, `tools/components-gen/`, and the
`Makefile` `generate` + `generate-check` + `run` targets. Adversarial
review of #189 surfaced 8 docs still pointing at those deleted seams as
if they were live. This PR sweeps the live-instruction hits and adds
dated post-pivot footnotes to historical-narrative hits.

Root cause: PR-A2 prioritized atomic deletion of the build surface +
sequencing-gate satisfaction; documentation cleanup was deferred to a
follow-up sweep (this PR). Not a workaround — the deleted surfaces are
not coming back; the docs now reflect the new OCB-generated state.

## Files touched (8, all docs / CI fixture)

| File | Change | Reasoning |
|---|---|---|
| `AGENTS.md` | Drop `generate-check` from the `make ci` gate roster (one of seven gates listed). | Live instruction: gate no longer exists. |
| `docs/notes/stacked-pr-workflow.md` | Rewrite "Codegen-aware conflict resolution" lesson against the surviving codegen seams (`_build/tracecore/main.go` via `make build`, parser goldens via `make generate-fixtures`); soften the `make ci`-discipline lesson's `generate-check` reference to past-tense + dated. | Live instructions for stacked-PR contributors. |
| `docs/notes/2026-05-19-m14-autonomous-run.md` | Add a one-line post-A2 footnote to the codegen-aware-rebase lesson's Anchor line. Body preserved. | Historical retro: top of file already carried RFC-0013 status note; just stamping the specific anchor. |
| `docs/notes/autonomous-feature-flow.md` | Drop `generate-check` from two inline gate lists; generalize "conflicts in generated files" bullet to post-A2 seams. | Live workflow template. |
| `docs/research/baselines.md` | Drop `generate-check` from the `make ci` wall-clock fold list with a dated parenthetical. | Live measurement methodology context. |
| `docs/research/m15-container-stdout.md` | Prepend a post-pivot blockquote footnote to the "components.go is generated" subsection. Body preserved. | Research narrative: top of file lacked an explicit pointer to the deleted seam. |
| `docs/FAILURE-MODES.md` | Rewrite the two self-telemetry rows that named `cmd/tracecore/runCollect` + `telemetry.listen` to reference upstream `service.telemetry.metrics.address` bind path and `healthcheckextension` probe surface respectively. | Live operator-facing failure-mode inventory. |
| `install/kubernetes/tracecore/ci/all-receivers-off-values.yaml` | Fix fixture header to state `chart.yml` skips `tracecore validate` on this fixture (RFC-0013 PR-A2 invariant) instead of claiming it exits 0. | Live CI fixture comment. |

## Intentionally NOT touched

- **`MILESTONES.md`** (11 refs to `cmd/tracecore/components.go` +
`components.yaml` across §Lane structure + per-milestone narrative). The
top of MILESTONES.md carries explicit `**Status (RFC-0013):** DELETED at
v0.X.0 - replaced by <upstream>` banners per block. A cold reader hits
the supersession banner before the milestone-narrative ref. Per
`feedback_no_bloat` x historical-record lens: leave.
- **`docs/research/m16-kueue.md`** (1 ref) and
**`docs/research/m5-m6-research.md`** (2 refs). Both already carry
top-of-doc `**Status (2026-05-22):** … superseded by RFC-0013 …`
banners. Cold reader is served.
- **`tracecore_*` self-telemetry metric vocabulary** (~150 refs across
`components/`, `internal/selftelemetry`, `internal/telemetry`, runbooks,
alerts, RFCs). These are NOT stale: the in-tree `internal/selftelemetry`
+ `internal/telemetry` packages still emit those names
(orphan-but-compiling per RFC-0013 §7; deletion ships under PR-F/PR-K).
The migration to upstream `otelcol_*` is documented in
`docs/migration/v0.1-to-v0.2.md`. Customer-facing rename is a single
coordinated cut under PR-K, not piecewise.

## Test plan

- [x] `make doc-check` — 507 markdown links resolve, 45 test references
verified, banned-phrase + comment-noise gates clean.
- [x] `make check` — `fmt` + `tidy-check` + `lint` + `vet` +
`mod-verify` all green.
- [x] `git grep -rn` for residual `cmd/tracecore` / `generate-check`
refs confirms remaining hits are all historical-narrative (already
covered by per-file supersession banners) or live in-tree code that
ships under future PR-F/PR-K.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

RFC-0013 PR-L — expand `docs/migration/v0.1-to-v0.2.md` from the PR-179
skeleton (plus the metric-name + chart-values rows landed inline in
PR-A2 / #189) into a comprehensive v0.1.x → v0.2.0 cutover guide. Every
post-wave-2 landing now has a corresponding operator-facing migration
row.

```release-notes
NONE
```

## What landed

| Section added / expanded | Source of truth |
|---|---|
| **CLI surface** — table covering every removed subcommand (`collect`, `receivers list`, `debug dump`, `failure-inject`) and removed flag (`--log.format=text`, `--shutdown.drain-budget`, `--version-short`) + their upstream replacements | PR #189 release-notes block + the deleted `cmd/tracecore/` tree |
| **Helm chart values** — `telemetry.listen` + `telemetry.paths.*` → `telemetry.metricsListen` + `telemetry.healthListen` + `telemetry.healthPath` with default-port values | `install/kubernetes/tracecore/values.yaml` HEAD |
| **Probes** — `/healthz` + `/readyz` on `:8888` → `healthcheckextension` at `:13133/` | `install/kubernetes/tracecore/templates/daemonset.yaml` HEAD |
| **Default pipeline** — `clockreceiver → stdoutexporter` → `hostmetrics → debug` snippet | `values.yaml` `pipelines:` block HEAD |
| **Orphan components table** — all 9 (clockreceiver, containerstdout, dcgm, k8sevents, kernelevents, nccl_fr, pyspy, otlphttp, stdoutexporter) mapped to upstream replacement + PR-J recipe | `components/receivers/`, `components/exporters/` directory inventory + RFC-0013 §2 adoption matrix |
| **Self-telemetry metric vocabulary** — `tracecore_*` → `otelcol_*` for receiver / exporter / queue / component-status / build-info families with per-signal split | upstream OCB instrumentation conventions + `internal/integration/ocb_scrape_test.go` contract metrics |
| **`stdoutexporter` failure-rate gap** — debugexporter pins `otelcol_exporter_send_failed_*` at zero; debug-only pipelines lose the signal | upstream `debugexporter` semantics |
| **Build / CI changes** — Makefile, output path (`./_build/tracecore`), source tree, smoke, image build, release pipeline, version source | Makefile + `.goreleaser.yaml` + `.ko.yaml` + workflows HEAD |
| **`internal/*` package deletion** (PR-F, in flight) — per-package public-surface migration map for `selftelemetry`, `runtime/lifecycle`, `componentstatus`, `telemetry`, `pipeline`, `pipelinebuilder`, `consumer`, `fanout` | RFC-0013 PR-F scope + `internal/` tree inventory |
| **Reproducibility note** — `0.1.0-m9-alpha` hardcoded in `builder-config.yaml dist.version`; cross-ref to `docs/reproducibility.md` workaround | `builder-config.yaml` HEAD |
| **Verification** — adds probe smoke test + `tracecore components` parity check against the rendered config | new section |
| **Rollback** — recipe-toggle path is not available for the deleted set; pin chart + image at v0.1.x | corrects the v0.1.x-era rollback prose |
Closes RFC-0013 PR-L. Open follow-ups (PR-I in-repo submodule, PR-J
upstream recipes, PR-K in-tree-receiver delete, PR-F internal/* delete)
are referenced inline in the guide so the next agent picking up any of
them lands the corresponding doc update in the same PR.

## Adversarial pre-review notes

- Verified component counts (`builder-config.yaml`: 6 receivers, 4
exporters, 3 extensions, 4 processors) against `awk` count of `gomod:`
lines.
- Verified `hostmetrics` is the default in `values.yaml` (enabled: true,
loadscraper, 1s).
- Verified `cmd/tracecore`, `tools/components-gen`, `components.yaml`
all deleted from HEAD (`git ls-files` returns empty).
- Verified `internal/integration/ocb_scrape_test.go` is present and
asserts the two contract metrics named in the guide.
- Verified the daemonset.yaml probes wire `port: health` (not `port:
telemetry`) at `healthPath`.
- Verified no broken markdown links — all 5 outbound links resolve
(`builder-config.yaml`, `ocb_scrape_test.go`, RFC-0013 §3, RFC-0013
§migration, in-doc anchor).
- One non-blocking observation: `install/kubernetes/tracecore/README.md`
still references `/healthz` + `/readyz` on three lines (chart-doc rot
from PR-A2 that didn't sweep the README). Out of scope for PR-L;
flagging for the next chart-doc sweep.
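
The component-count check in the first bullet was done with `awk`; an equivalent line-based tally could be sketched in Go as follows. This is an illustration only: `countGomod` is a hypothetical helper, the embedded manifest is a made-up sample rather than the repo's real `builder-config.yaml`, and the `v0.0.0` versions are placeholders. It is a plain line scan, not a YAML parser, so it assumes the usual OCB manifest layout of top-level section keys followed by `- gomod:` list items.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleManifest is a made-up stand-in for builder-config.yaml.
// Module versions are placeholders, not the versions the repo pins.
const sampleManifest = `
dist:
  name: tracecore
receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.0.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.0.0
exporters:
  - gomod: go.opentelemetry.io/collector/exporter/debugexporter v0.0.0
`

// countGomod tallies `gomod:` entries under each top-level key.
// A top-level key is any non-blank, non-comment line that does not start
// with whitespace or a list dash.
func countGomod(manifest string) map[string]int {
	counts := map[string]int{}
	section := ""
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		line := sc.Text()
		trimmed := strings.TrimSpace(line)
		if trimmed == "" || strings.HasPrefix(trimmed, "#") {
			continue
		}
		if line[0] != ' ' && line[0] != '\t' && line[0] != '-' {
			// New top-level section: keep the text before the first colon.
			section = strings.SplitN(trimmed, ":", 2)[0]
			continue
		}
		if strings.HasPrefix(strings.TrimPrefix(trimmed, "- "), "gomod:") {
			counts[section]++
		}
	}
	return counts
}

func main() {
	counts := countGomod(sampleManifest)
	for _, key := range []string{"receivers", "exporters", "extensions", "processors"} {
		fmt.Printf("%s: %d\n", key, counts[key])
	}
}
```

Run against the real manifest, the per-section totals are what would be compared with the counts quoted in the guide.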

## Test plan

- [x] `make doc-check` passes (banned-phrase lint, link resolution,
test-name parity, all 15 sub-checks green)
- [x] Pre-commit hook (golangci-lint, go vet, go mod verify, DCO + AI
trailer) passes
- [x] Pre-push hook (no-autoupdate-check) passes
- [ ] CI: `chart`, `ci`, `install-bench` workflows do not gate on this
file; only `doc-check` matters for a docs-only PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request May 31, 2026
Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

Reconcile the four pivot-tracking docs
(`docs/rfcs/0013-distro-first-pivot.md`, `CHANGELOG.md`,
`MILESTONES.md`, `docs/migration/v0.1-to-v0.2.md`) with the wave-3
(PR-B1-shape sibling ports) and wave-4 (PR-B2-shape upstream-only ports
+ PR-F.1 + PR-J + PR-L + PR-N) landings. Pure doc sweep — no code or
config touched.

## What changed

### `docs/rfcs/0013-distro-first-pivot.md` §migration

PR sequence rows updated with PR-number citations and landed markers:

- **PR-A2** (landed, #189, 2026-05-30)
- **PR-B2** (landed, #201) — also enumerates sibling-receiver follow-ups
under PR-B2 to dispel the slug collision with #188's PR-B2-labelled dcgm
port: stdoutexporter (#202), pyspy (#203), kernelevents (#208),
containerstdout (#209)
- **PR-F.1** (landed) — fleshed-out delete list
(`internal/{selftelemetry,telemetry}` + `components/receivers/dcgm/` +
`pkg/dcgm/` + one orphan clockreceiver integration test)
- **PR-F.2** re-scoped — now deletes the whole
`internal/{componentstatus,pipeline,pipelinebuilder,consumer,fanout,runtime/lifecycle}`
bundle in one cut once the last three pipeline+consumer-importing
receivers land (#204 k8sevents, #205 clockreceiver, #207 otlphttp). Per
the import-graph state — `internal/componentstatus`'s only non-test
consumer is `internal/pipeline`, so they delete together
- **PR-G** (landed, #182), **PR-H** (landed, #183)
- **PR-I.1a** (in flight — scaffold agent), **PR-I.1b** (pre-staged;
gate satisfied by #201)
- **PR-J** (landed, #195) — kept existing marker
- **PR-K.1** (in flight — separate agent landing)
- **PR-L** (landed, skeleton #179 + body #191) — flagged as living
document
- **PR-N** (landed, #200) — shipped at v0.1.0 ahead of v0.3.0 as a
doc-only update at `docs/migration/v0.2-to-v0.3.md`

### `CHANGELOG.md` [Unreleased]

- Restructured the pivot wave list as **four waves** (was three). Wave 3
enumerates PR-B1-shape sibling ports + support infra (#180-#194/#196).
Wave 4 enumerates PR-B2-shape upstream-only ports + PR-J (#195) + PR-F.1
(#206) + PR-N (#200) + lint/TOCTOU hardening (#198/#210).
- Tightened the PR-F.2 deferred note to point at the three open ports
(#204/#205/#207) as the gate.

### `MILESTONES.md`

- **M1** (pipeline runtime) — status row now cites PR-A2 (#189), PR-F.1
(#206), PR-F.2 gate (#204/#205/#207), PR-E (#180), retains
`internal/config/` (still load-bearing for `tracecore validate`).
- **M2** (self-telemetry) — status row now cites PR-F.1 (#206); flags
`internal/componentstatus` as travelling with `internal/pipeline` in
PR-F.2.
- **M8** (DCGM receiver) — status flipped to *landed-and-replaced*:
cites PR-F.1 (#206) deletion + PR-J (#195)
`docs/integrations/prometheus-scrape.md` recipe. Notes the inert chart
toggle retention until PR-K.3.

### `docs/migration/v0.1-to-v0.2.md`

- §`internal/*` package deletion (PR-F) status flips from "not yet open"
to "PR-F.1 landed (#206), PR-F.2 gated on three open ports".
- Open-items checklist expanded from 5 to 13 entries — tracks every PR
letter the migration guide cares about (A2 / E / F.1 / F.2 / I.1a-c / J
/ K.1-3 / L / N) with PR numbers and links.

## Why now

Tracking docs accumulated drift across wave-3 + wave-4 because every
sibling-port PR (and the support-infra PRs around them) updated the
bottom of `CHANGELOG.md` but did not always touch the upstream
sequencing section in RFC-0013. Per memory rule `[Keeping this document
current]`: status drift is a review blocker. This PR is the consolidated
catch-up; future port PRs include their RFC-row flip in-PR.

## What this PR does NOT change

- No code, no config, no YAML, no chart — only the four tracking docs.
- No new doc gates added; existing gates pass.
- No files other than the four named docs are modified.

## Test plan

- [x] `bash scripts/doc-check.sh` clean (33 test refs, 528 links
resolve, comment-noise diff gate clean vs `origin/main`, all 13 gates
green).
- [x] Pre-commit hook (`commitlint` 72-char subject limit + DCO +
AI-trailer gates) passed.
- [x] Pre-push hook (`make ci-fast` equivalent: `golangci-lint`, `go
vet`, `go mod verify`, `no-autoupdate-check`, `doc-check.sh`) passed on
second attempt after `git fetch origin main` populated the worktree's
`origin/main` ref — first push failed because the worktree previously
tracked the (gone) `pr-a2-ocb-main-swap` branch, so `doc-check.sh`'s
comment-noise diff-scope gate exited 128 on the missing ref. Root cause
fixed by the fetch; not a workaround.
- [ ] CI green on this branch.
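
The pre-push failure described above came down to a missing `origin/main` remote-tracking ref. A guard of the following shape could make that condition explicit before any diff-scoped gate runs. This is a sketch under assumptions, not what `doc-check.sh` does: `gitRunner`, `realGit`, and `ensureRemoteRef` are hypothetical names, and the only commands used are `git rev-parse --verify --quiet` and the `git fetch origin main` from the note above.

```go
package main

import (
	"fmt"
	"os/exec"
)

// gitRunner runs one git subcommand and reports whether it exited with status 0.
type gitRunner func(args ...string) bool

// realGit shells out to the git binary in the current working directory.
func realGit(args ...string) bool {
	return exec.Command("git", args...).Run() == nil
}

// ensureRemoteRef fetches the branch only when its remote-tracking ref is
// missing locally, then re-checks that the ref now resolves.
func ensureRemoteRef(git gitRunner, remote, branch string) error {
	ref := "refs/remotes/" + remote + "/" + branch
	if git("rev-parse", "--verify", "--quiet", ref) {
		return nil // ref already present; nothing to fetch
	}
	if !git("fetch", remote, branch) {
		return fmt.Errorf("git fetch %s %s failed", remote, branch)
	}
	if !git("rev-parse", "--verify", "--quiet", ref) {
		return fmt.Errorf("%s still missing after fetch", ref)
	}
	return nil
}

func main() {
	if err := ensureRemoteRef(realGit, "origin", "main"); err != nil {
		fmt.Println("cannot diff against origin/main:", err)
		return
	}
	fmt.Println("origin/main is available")
}
```

Passing the runner as a parameter keeps the check testable without a repository; the fetch is skipped entirely when the ref already resolves.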

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>