diff --git a/docs/migration/v0.1-to-v0.2.md b/docs/migration/v0.1-to-v0.2.md index d619410d..7ba0e171 100644 --- a/docs/migration/v0.1-to-v0.2.md +++ b/docs/migration/v0.1-to-v0.2.md @@ -1,61 +1,198 @@ # Migration: v0.1.x → v0.2.0 -> **Status:** skeleton. Concrete migration steps fill in as v0.2.0 PRs land per [RFC-0013 §migration](../rfcs/0013-distro-first-pivot.md#migration--rollout) (PR-I through PR-L). - -This guide tells operators how to move from a `v0.1.x` deployment to `v0.2.0`. Every operator-visible break gets a row; everything not listed below is unchanged. +This guide tells operators how to move from a `v0.1.x` deployment to `v0.2.0`. Every operator-visible break gets a row; everything not listed below is unchanged. Sections below mirror [RFC-0013 §migration](../rfcs/0013-distro-first-pivot.md#migration--rollout) (PR-A through PR-L). ## TL;DR -v0.2.0 completes the RFC-0013 receiver swap. The in-tree custom receivers for kernel events, kubernetes events, kueue scheduler metrics, GPU telemetry (NVIDIA/AMD/Intel/Habana), and the heartbeat clock are deleted from the tracecore binary and replaced by upstream OpenTelemetry receivers wired through the bundled Helm-chart recipe. Your alerts on the [customer-stable telemetry contracts](../rfcs/0013-distro-first-pivot.md#3-customer-stable-telemetry-contracts) survive the swap; your `values.yaml` keys map old→new for one minor with a `NOTES.txt` deprecation warning per RFC-0013 §8. +v0.2.0 retires the hand-rolled `cmd/tracecore` entry point and ships the OpenTelemetry Collector Builder (OCB) binary at `./_build/tracecore`. The custom receivers for kernel events, kubernetes events, kueue scheduler metrics, GPU telemetry (NVIDIA/AMD/Intel/Habana), container stdout, and the heartbeat clock are deleted from the binary and replaced by upstream OpenTelemetry receivers wired through the bundled Helm-chart recipe. 
Self-telemetry metric names rename `tracecore_*` → `otelcol_*`; kubelet probes move from `:8888/healthz` to `:13133/`. Your alerts on the [customer-stable telemetry contracts](../rfcs/0013-distro-first-pivot.md#3-customer-stable-telemetry-contracts) (`k8s.event.hint`, `kernelevents.xid`, `gpu.id/vendor`, `gen_ai.training.*`) survive without edit. + +## CLI surface + +The `tracecore` binary is now upstream OCB main. Flag and subcommand shape changed: + +| v0.1.x | v0.2.0 | Notes | +|---|---|---| +| `tracecore collect --config=…` | `tracecore --config=…` | OCB main runs the collector by default; no subcommand. | +| `tracecore validate --config=…` | `tracecore validate --config=…` | Unchanged. Used by `helm template … \| tracecore validate` in CI. | +| `tracecore receivers list` | `tracecore components` | Now enumerates receivers + processors + exporters + extensions + connectors. | +| `tracecore debug dump` | *(removed)* | No OCB equivalent. Capture `tracecore components` + the live config when filing issues. | +| `tracecore failure-inject xid` | `failure-inject xid` | Moved to the standalone `tools/failure-inject` binary (already shipped at v0.1.x; the `tracecore failure-inject` shim is gone). | +| `tracecore failure-inject {nccl-hang, pod-evict, cpu-steal}` | `failure-inject {nccl-hang, pod-evict, cpu-steal}` | Same — standalone binary. | +| `--log.format=text` | *(removed)* | OCB upstream emits structured logs; configure via `service.telemetry.logs.encoding` in the config file. | +| `--shutdown.drain-budget=…` | *(removed)* | OCB upstream shutdown is driven by `signal.NotifyContext`; tune via container `terminationGracePeriodSeconds`. | +| `--version-short` | `--version` | Upstream prints `tracecore version `; the short/long distinction is gone. | +| *(no upstream equivalent)* | `--feature-gates=…` / `--set=…` | New OCB-native flags for component-level toggles and config overlays. 
| + + ## Helm chart values + +### Self-telemetry block + +The legacy single-listener `telemetry:` block (one port serving `/metrics` + `/healthz` + `/readyz`) is gone; OCB doesn't recognise it. Upstream `service.telemetry.metrics` and the `healthcheckextension` are two separate HTTP servers inside the same collector process, so the chart exposes two listener ports. + +| v0.1.x | v0.2.0 default | Maps to | +|---|---|---| +| `telemetry.listen: "0.0.0.0:8888"` | `telemetry.metricsListen: "0.0.0.0:8888"` | `service.telemetry.metrics.address` | +| `telemetry.paths.metrics: /metrics` | *(implicit, always `/metrics`)* | upstream `service.telemetry` server only serves `/metrics` | +| `telemetry.paths.healthz: /healthz` | `telemetry.healthPath: /` | `extensions.health_check.path` (default `/`) | +| `telemetry.paths.readyz: /readyz` | *(folded into `healthPath`)* | `healthcheckextension` has no separate readiness route; the single `/` endpoint covers both probe types | +| *(implicit, same port as metrics)* | `telemetry.healthListen: "0.0.0.0:13133"` | `extensions.health_check.endpoint`; `:13133` is the upstream extension default | + +Setting `telemetry.enabled: true` (the default) wires both surfaces. `telemetry.enabled: false` drops both ports and omits kubelet probes. + +### Probes + +The DaemonSet's `livenessProbe` and `readinessProbe` now hit the `health` named port (`:13133`) at `telemetry.healthPath` (default `/`), not `:8888/healthz` and `:8888/readyz`. Probe timing values (`initialDelaySeconds`, `periodSeconds`, `failureThreshold`) under `probes.{liveness,readiness}` are unchanged. -## Breaking changes +If you maintain a custom DaemonSet patch that hard-codes probe URLs, update both probes to point at `port: health` and the `/` path. The Prometheus scrape port (`:8888`) is unchanged. -| What | v0.1.x | v0.2.0 | Action | +### Default pipeline + +The chart default pipeline flips from `clockreceiver → stdoutexporter` (in-tree, not registered by OCB) to `hostmetrics → debug` (both upstream OCB-bundled). 
Fresh install on a no-GPU cluster boots and emits load-average metrics immediately — same as before, different metric names. + +```yaml +# v0.2.0 default +pipelines: + metrics: + receivers: [hostmetrics] + exporters: [debug] +``` + +### Orphan in-tree components + +The OCB-assembled binary registers only the components listed in [`builder-config.yaml`](../../builder-config.yaml): 6 receivers, 4 exporters, 3 extensions, 4 processors. The chart's per-component toggles for the legacy in-tree set survive this release so the values shape doesn't break for operators that pin them, but **enabling any of the following in chart values fails the chart-render CI gate (`tracecore validate`) and would fail the pod at startup with an "unknown factory" error**: + +| Component (chart values key) | Kind | Upstream replacement | Operator action in v0.2.0 | |---|---|---|---| -| Kernel events receiver | `kernelevents` (in-tree) | `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL Xid transform | Chart compat map keeps the `kernelevents.*` values keys for one minor with a deprecation warning. To opt into the new path immediately, set `kernelevents.recipe: upstream`. | -| K8s events receiver | `k8sevents` (in-tree) | `k8sobjectsreceiver` + OTTL `k8s.event.hint` transform | Chart compat map keeps `k8sevents.*` values keys; opt-in via `k8sevents.recipe: upstream`. | -| Container stdout receiver | `containerstdout` (in-tree) | `filelogreceiver` + container stanza + `file_storage` extension | Chart compat map keeps `containerstdout.*` keys; opt-in via `containerstdout.recipe: upstream`. | -| GPU telemetry (NVIDIA) | `dcgm` (in-tree, cgo stub) | `dcgm-exporter` DaemonSet + `prometheusreceiver` | Deploy `dcgm-exporter` via its own chart; chart adds a `gpu.nvidia.recipe: prometheus` toggle to wire the scrape. 
| -| GPU telemetry (AMD / Intel / Habana) | Not shipped | `ROCm/device-metrics-exporter` / `intel/xpumanager` / Habana Prometheus Metric Exporter, all scraped via `prometheusreceiver` | New capability; opt-in via `gpu..recipe: prometheus`. | -| Kueue scheduler metrics | `kueue` (in-tree, never shipped) | `prometheusreceiver` recipe with bearer-token + TLS | Opt-in via `kueue.recipe: prometheus`. | -| Heartbeat / install-bench primitive | `clockreceiver` (in-tree, chart default) | `hostmetricsreceiver` (loadscraper @ 1s, upstream OCB-bundled) | v0.1.x bench already swapped (PR-E). v0.2.0 flips the chart default — set `receivers.hostmetrics.enabled: true` + `receivers.clockreceiver.enabled: false` if you want to track the new default before the chart-default flip; otherwise no action until v0.2.0. `NOTES.txt` will surface a deprecation warning for one minor after the flip. | -| Kineto profiler | `kineto` (in-tree, deferred) | Deferred until OTel Profiles GA | No action; re-evaluation when contrib ships `pprofreceiver`. | -| Moat components (nccl_fr + pattern engine) | Currently in `components/receivers/nccl_fr/` + `internal/synthesis/patterns/` under the single repo-root `go.mod` | Will live in an in-repo Go submodule at `module/` (path `github.com/tracecoreai/tracecore/module`) pulled via OCB `gomod:` + dev-loop `replaces: ./module` | No operator action. Submodule split is internal — same repo, same fork, same CI; OCB builds via `gomod:` like any other upstream module. | -| Helm values keys | Per-receiver `.*` | Per-receiver `.recipe: ` + per-recipe stanzas | One-minor compat. Migrate by setting `.recipe: upstream` per receiver. 
| -| Self-telemetry metric names | `tracecore_receiver_*`, `tracecore_exporter_*`, `tracecore_queue_*`, `tracecore_component_*`, `tracecore_build_info` (in-tree selftel surface, exposed at `:8888/metrics`) | Upstream `otelcol_*` vocabulary (`otelcol_receiver_accepted_metric_points`, `otelcol_exporter_sent_metric_points`, `otelcol_process_uptime`, `otelcol_scraper_scraped_metric_points`, etc.) exposed at the same `:8888/metrics` endpoint via `service.telemetry.metrics.address` | Dashboards + alerts referencing any `tracecore_*` self-telemetry metric must be rewritten to the `otelcol_*` equivalent. The upstream names are stable across collector releases (RFC-0013 §3 customer-stable contract); see `internal/integration/ocb_scrape_test.go` for the exact set the chart commits to keep available. | -| Self-telemetry chart values keys | `telemetry.listen` (single port serving /metrics + /healthz + /readyz) + `telemetry.paths.{metrics,healthz,readyz}` | `telemetry.metricsListen` (Prometheus metrics; upstream `service.telemetry.metrics.address`) + `telemetry.healthListen` (kubelet probes; upstream `healthcheckextension` endpoint) + `telemetry.healthPath` (extension path, default `/`) | Two listener ports replace one because the upstream `service.telemetry` server does not serve health routes and `healthcheckextension` does not serve `/metrics`. Probe paths in the DaemonSet template now hit the `health` port (default `:13133`) at `healthPath`. Operators using the chart's structured `telemetry.*` keys must rename `telemetry.listen` → `telemetry.metricsListen` and drop `telemetry.paths.*`. The Prometheus scrape port (default 8888) is unchanged. | -| `stdoutexporter` self-telemetry | Contributed to `tracecore_exporter_failure_rate` via `selftelemetry.ExporterCarrier` | Emits `tracecore.exporter.calls_total` only; no `failure_rate` contribution | stdoutexporter no longer contributes to **tracecore_exporter_failure_rate**. Switch to a real backend exporter (otlphttp, etc.) 
if per-exporter failure-rate alerts matter. | +| `receivers.clockreceiver` | receiver | `hostmetricsreceiver` (loadscraper) | Already the default. Remove the `clockreceiver` block; the chart no-ops if you leave `clockreceiver.enabled: false`. | +| `receivers.containerstdout` | receiver | `filelogreceiver` + container stanza + `file_storage` extension | Leave `containerstdout.enabled: false` (default). The upstream recipe lands in PR-J; until then, pin v0.1.x if you need pod-log collection. | +| `receivers.dcgm` | receiver | `dcgm-exporter` DaemonSet + `prometheusreceiver` | Leave `dcgm.enabled: false` (default). The PR-J recipe wires `prometheusreceiver` against an out-of-band `dcgm-exporter` DaemonSet; until then, deploy `dcgm-exporter` manually and scrape it from your own Prometheus. | +| `receivers.k8sevents` | receiver | `k8sobjectsreceiver` + OTTL `k8s.event.hint` transform | Leave `k8sevents.enabled: false` (default). The PR-J recipe ships the OTTL transform that preserves the 11-entry `k8s.event.hint` enum (RFC-0013 §3 contract); until then, pin v0.1.x if you alert on `k8s.event.hint`. | +| `receivers.kernelevents` | receiver | `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL Xid transform | Leave `kernelevents.enabled: false` (default). The PR-J recipe ships the OTTL transform that keeps `kernelevents.xid` populated; until then, pin v0.1.x if you alert on Xid codes. | +| `receivers.nccl_fr` | receiver | In-repo Go submodule via OCB `gomod:` (PR-I) + `replaces: ./module` | No operator action; the receiver ships in `module/receiver/ncclfrreceiver` and OCB pulls it like any upstream module. | +| `receivers.pyspy` | receiver | Deferred until OTel Profiles GA | Leave `pyspy.enabled: false` (default). No upstream replacement exists today; the toggle survives until contrib ships `pprofreceiver`. | +| `exporters.stdoutexporter` | exporter | `debugexporter` (OCB-bundled, chart default) | Replace `exporters.stdoutexporter` with `exporters.debug` in pipelines. 
The debug exporter writes to pod stdout, same observation channel. | +| `exporters.otlphttp` (in-tree clone) | exporter | `otlphttpexporter` (OCB-bundled) | Same chart key (`otlphttp`), same field shape — `endpoint`, `compression`, `headers`, `tls.*`, `timeout`, `retry_on_failure`, `sending_queue` pass through to the upstream exporter without translation. | + +To verify what's actually registered in the binary you're running: + +```bash +./_build/tracecore components +``` + +The chart does not yet ship a per-receiver `recipe:` switch — that mechanism arrives in [RFC-0013 PR-J](../rfcs/0013-distro-first-pivot.md#migration--rollout) along with the upstream-recipe templates. Until PR-J lands, the migration path for the in-tree receivers other than `clockreceiver`/`stdoutexporter` is: pin v0.1.x → wait for PR-J → cut over to the upstream-recipe values shape in one minor. + +## Self-telemetry metric vocabulary + +The OCB binary's self-telemetry surface uses upstream `otelcol_*` vocabulary instead of the v0.1.x `tracecore_*` family. The Prometheus scrape endpoint is unchanged (`http://:8888/metrics`); the metric NAMES on that endpoint changed. + +| v0.1.x metric | v0.2.0 metric | Notes | +|---|---|---| +| `tracecore_receiver_emissions_total{receiver}` | `otelcol_receiver_accepted_metric_points{receiver}` / `_log_records` / `_spans` | One name per signal type. Sum across the three for a per-receiver total. | +| `tracecore_receiver_emission_failures_total{receiver}` | `otelcol_receiver_refused_metric_points{receiver}` / `_log_records` / `_spans` | Same per-signal split. | +| `tracecore_exporter_emissions_total{exporter}` | `otelcol_exporter_sent_metric_points{exporter}` / `_log_records` / `_spans` | | +| `tracecore_exporter_failure_rate{exporter}` | `otelcol_exporter_send_failed_metric_points{exporter}` / `_log_records` / `_spans` | Rate is now a counter, not a gauge — wrap in `rate()` in PromQL. 
| +| `tracecore_queue_depth{exporter}` | `otelcol_exporter_queue_size{exporter}` | | +| `tracecore_queue_capacity{exporter}` | `otelcol_exporter_queue_capacity{exporter}` | | +| `tracecore_component_status{component, status}` | `otelcol_process_uptime` + per-component log events on the `otelcol.component.status` event channel | OTel collector exposes component status via the `componentstatus` event stream, not a dedicated metric; PromQL alerts on `tracecore_component_status` should switch to alerting on the absence of `otelcol_process_uptime` increase or on log-based events. | +| `tracecore_build_info{version, commit}` | `otelcol_process_uptime` (presence) + `--version` CLI | Build-info gauge is gone. Use a Kubernetes-side `up{}` join against the image tag for the same dashboard signal. | +| *(none)* | `otelcol_process_runtime_total_alloc_bytes`, `otelcol_process_runtime_heap_alloc_bytes`, `otelcol_process_memory_rss`, `otelcol_process_cpu_seconds` | New process-level signals from upstream `service.telemetry`. | + +The two metrics the chart commits to keeping available across OCB version bumps are: + +- `otelcol_process_uptime` — emitted by `service.telemetry` at startup; presence proves the binary is OCB-assembled and the self-tel server is wired through `service.telemetry.metrics.address`. +- `otelcol_receiver_accepted_metric_points` — emitted by the receiver helper once the first scrape lands; presence proves end-to-end pipeline liveness. + +[`internal/integration/ocb_scrape_test.go`](../../internal/integration/ocb_scrape_test.go) (`TestOCBScrape_UpstreamMetricVocabulary`) is the regression gate: an upstream rename of either metric fails this test before it can ship. + +### `stdoutexporter` failure-rate gap + +In v0.1.x, the in-tree `stdoutexporter` contributed to the aggregate `tracecore_exporter_failure_rate` gauge via `selftelemetry.ExporterCarrier`. 
In v0.2.0, the upstream `debugexporter` (which replaces `stdoutexporter` as the chart default) writes to stdout and effectively never fails — the `otelcol_exporter_send_failed_*` counter stays pinned at zero. + +Operators running a pipeline that fans out to ONLY the debug/stdout exporter have no meaningful per-exporter failure-rate signal. Add a real backend exporter (`otlphttp`, `datadog`, `clickhouse`) to recover the signal; the debug exporter is intended for development and `kubectl logs` inspection, not for steady-state production observability. + +## Build / CI changes + +The build path changed. If you pull from source: + +| Concern | v0.1.x | v0.2.0 | +|---|---|---| +| Build entry | `go build ./cmd/tracecore` | `make build` (runs `go run go.opentelemetry.io/collector/cmd/builder@v0.110.0 --config=builder-config.yaml`) | +| Output path | `./tracecore` (CWD) | `./_build/tracecore` | +| Source tree | `cmd/tracecore/` (~20 .go files) | Gone. The binary `main.go` is OCB-generated under `_build/`. | +| Component registration | `cmd/tracecore/components.go` + `components.yaml` + `tools/components-gen/` | `builder-config.yaml` (OCB inventory). `tools/components-gen` and `components.yaml` deleted. | +| Make targets removed | — | `generate`, `generate-check`, `run` | +| Smoke test | `scripts/smoke.sh` against `./tracecore collect` | `scripts/smoke.sh` against `./_build/tracecore --config=` | +| Container image | `cmd/tracecore` → `ko` | `_build/tracecore` → `ko` (build runs from inside `./_build/`) | +| Release | `goreleaser` `builder: go` | `goreleaser` `builder: prebuilt` against per-platform OCB output | +| CI version source | `internal/version/version.go` | `builder-config.yaml` `dist.version` (read by `scripts/chart-appversion-check.sh`) | + +CI workflows changed path triggers from `cmd/tracecore/**` to `builder-config.yaml`. 
The `build-ocb` drift gate (introduced in v0.1.x as a tripwire) is now redundant and was replaced by a `smoke-test-binary` job that consumes the CI `package` job's artefact. + +## `internal/*` package deletion (PR-F) + +> **Status:** PR-F was not yet open when this guide was first published. The packages listed below are still present in v0.2.0 RC builds and will be deleted in PR-F before v0.2.0 GA. + +Several internal Go packages were load-bearing only for the deleted `cmd/tracecore` boot path and the in-tree receivers/exporters. External importers are unlikely pre-1.0 and effectively limited to forks: the packages live under `internal/`, so the Go compiler rejects imports from outside the module. Any such importer loses: + +| Package | Public surface | Migration | +|---|---|---| +| `internal/selftelemetry` | `Kind` (canonical failure-reason enum), `CanonicalKinds()`, `CanonicalKindsByName()`, `IsCanonicalKind()`, `Exporter`, `FailureRateReader`, `ExporterCarrier`, `Receiver`, `NewExporter`, `NewReceiver` | Switch to upstream `go.opentelemetry.io/collector/component/componentstatus` for status reporting and to the upstream receiver/exporter helper packages for counter emission. The `Kind` enum has no upstream equivalent; report status through OTel's `componentstatus.Event` with severity-style fields instead. | +| `internal/runtime/lifecycle` | `Lifecycle`, `New`, `Start`, `Shutdown`, `Add`, `PanicCallback` | Switch to upstream `go.opentelemetry.io/collector/component.Host` lifecycle, driven by OCB's generated `service.New(...)`. The bespoke `Lifecycle.Add` worker registration pattern has no direct upstream analogue; restructure auxiliary goroutines as OTel extensions. | +| `internal/componentstatus` | `*` (publish/subscribe helpers around the legacy `cmd/tracecore` event bus) | Switch to upstream `go.opentelemetry.io/collector/component/componentstatus` directly. 
| +| `internal/telemetry` | `ServerConfig`, `Paths`, `Server`, `NewServer`, `MeterProvider`, `NewMeterProvider`, `WindowedRate`, `AggregateSLOSource`, `ExporterRegistry`, `SLOSource` | Self-telemetry HTTP server is replaced by `service.telemetry.metrics.address`; `MeterProvider` is replaced by the upstream collector's internal meter provider. The probe-server `Server` (paths `/healthz` / `/readyz`) is replaced by `healthcheckextension`. | +| `internal/pipeline`, `internal/pipelinebuilder`, `internal/consumer`, `internal/fanout` | Pipeline assembly helpers | Replaced wholesale by upstream `go.opentelemetry.io/collector/service.New(...)` driven by `builder-config.yaml`. | + +PR-F will collapse the `internal/` tree to: `safe`, `config`, `synthesis`, `version`, `sli`, `integration` (the OCB scrape test). Everything else listed above goes away. + +## Reproducibility note + +The OCB-generated binary reads `0.1.0-m9-alpha` from `builder-config.yaml` `dist.version`. Until [RFC-0013 PR-D follow-up](../rfcs/0013-distro-first-pivot.md#migration--rollout) lands an ldflags injection step, `tracecore --version` prints the same string regardless of git SHA or release tag. This is a known temporary limitation; `docs/reproducibility.md` documents the workaround (for now, cross-reference the image digest against `Chart.yaml` `appVersion`). ## What's NOT changing These are stable across the cut per [RFC-0013 §3](../rfcs/0013-distro-first-pivot.md#3-customer-stable-telemetry-contracts). 
If your alerts/dashboards consume these, they survive without edit: -- `k8s.event.hint` 11-entry enum (pod_evicted, mount_failure, backoff, oom_killed, node_unhealthy, schedule_failure, create_failure, volume_attach_failure, container_status_unknown, node_pressure, image_pull_failure) +- `k8s.event.hint` 11-entry enum (`pod_evicted`, `mount_failure`, `backoff`, `oom_killed`, `node_unhealthy`, `schedule_failure`, `create_failure`, `volume_attach_failure`, `container_status_unknown`, `node_pressure`, `image_pull_failure`) - `kernelevents.xid` (NVRM Xid code) - `gpu.id` (PCI BDF) - `gpu.vendor` (`nvidia` / `amd` / `intel` / `habana`) - `gen_ai.training.rank`, `gen_ai.training.job_id` (cross-receiver join keys) - NCCL FlightRecorder span schema - Pattern detector outputs (M17 / M18 / M19) -- Chart `image.repository`, container ports, RBAC shape -- Self-telemetry metric names (`otelcol_*`; PR-D wave-1 + PR-A binary swap delivered the rename automatically) +- Chart `image.repository`, container ports, RBAC shape (unchanged for the default pipeline; `containerstdout` opt-in still requires the ClusterRole + root pod context) +- Prometheus self-tel scrape port (`:8888`) ## Verification -1. **Before upgrading**, snapshot existing alert queries and confirm they reference attributes from the "What's NOT changing" list. If any alert reads a `tracecore.*` metric name or a `kernelevents.` that doesn't appear above, raise it before cutover — the recipe transform may not cover it. -2. **After upgrading**, run `kubectl rollout status ds/tracecore` and inspect `_build/tracecore components` against the chart-rendered config: every receiver referenced in the rendered config must enumerate in the binary. -3. **Pilot for one day** with traffic mirror to both old and new pipelines before retiring the v0.1.x deploy. +1. **Before upgrading**, snapshot existing alert queries and confirm they reference attributes from the "What's NOT changing" list. 
Any alert reading a `tracecore_*` self-telemetry metric name needs rewriting per the [vocabulary table](#self-telemetry-metric-vocabulary) above; any alert reading `kernelevents.` not in the contract list should be raised with the maintainers before cutover. +2. **After upgrading**, compare the chart-rendered config against the binary's registered factories: -## Rollback + ```bash + helm template release install/kubernetes/tracecore \ + --show-only templates/configmap.yaml | yq '.data["config.yaml"]' > /tmp/rendered.yaml + ./_build/tracecore validate --config=/tmp/rendered.yaml + ./_build/tracecore components + ``` + Every receiver/exporter/extension referenced in the rendered config must appear in the output of `tracecore components`. + +3. **Probe smoke test**, once the pod is `Running`: + + ```bash + kubectl -n tracecore-system port-forward ds/tracecore 13133:13133 8888:8888 + curl -s http://127.0.0.1:13133/ # health: 200 OK + curl -s http://127.0.0.1:8888/metrics | grep otelcol_process_uptime # presence proves OCB self-tel wired + ``` -The chart's `recipe` values default to `legacy` for one minor release. Setting all `.recipe: legacy` rolls back to the v0.1.x in-tree receivers — but only while the v0.2.x binary still bundles them via `replace` directive. By v0.3.0 the legacy receivers are gone for good. +4. **Pilot for one day** with traffic mirrored to both old and new pipelines before retiring the v0.1.x deploy. + +## Rollback 
If the v0.2.0 deploy fails health checks, pin the chart and image at the last v0.1.x tag (`v0.1.0-m1` at time of writing; substitute the latest `v0.1.x` tag from `git tag -l 'v0.1.*'` if a newer one has shipped): -## Open items (fill in as PRs land) +```bash +helm upgrade tracecore install/kubernetes/tracecore \ + --version \ + --set image.tag=0.1.0-m1 +``` -- [ ] PR-I (in-repo Go submodule extraction at `module/`) — link -- [ ] PR-J (ship recipes for filelog + journald + k8sobjects + prometheus) — link -- [ ] PR-K (delete in-tree receivers) — link -- [ ] PR-L (this guide, full body) — link -- [ ] PR-E unblocking decision (heartbeat replacement) — link +Charts are pinned by `--version` (the chart-package version from `Chart.yaml`), not by `appVersion`; the image tag pins the binary independently. No data is mutated on upgrade.
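
Returning to the self-telemetry rename covered under "Self-telemetry metric vocabulary": the pre-upgrade audit of alert queries can be partly automated. The sketch below is not part of the tracecore tooling; `RENAMES`, `per_signal`, and `audit` are hypothetical names introduced here for illustration. The mapping is transcribed from the vocabulary table, with the `_log_records` and `_spans` shorthand expanded to full metric names on the assumption that they share the prefix of the first name in each row. Metrics without a one-to-one replacement (`tracecore_component_status`, `tracecore_build_info`) come back with an empty suggestion list and need a manual rewrite.

```python
import re


def per_signal(prefix):
    # Expand a per-signal family into its three names (assumed expansion of
    # the table's "_log_records" / "_spans" shorthand).
    return [prefix + suffix for suffix in ("_metric_points", "_log_records", "_spans")]


# Transcribed from the vocabulary table in this guide; not an official mapping.
RENAMES = {
    "tracecore_receiver_emissions_total": per_signal("otelcol_receiver_accepted"),
    "tracecore_receiver_emission_failures_total": per_signal("otelcol_receiver_refused"),
    "tracecore_exporter_emissions_total": per_signal("otelcol_exporter_sent"),
    "tracecore_exporter_failure_rate": per_signal("otelcol_exporter_send_failed"),
    "tracecore_queue_depth": ["otelcol_exporter_queue_size"],
    "tracecore_queue_capacity": ["otelcol_exporter_queue_capacity"],
}

# Matches any v0.1.x self-telemetry metric name inside a query or rule file.
METRIC_RE = re.compile(r"\btracecore_[a-z0-9_]+")


def audit(text):
    """Return (line_number, old_name, suggestions) for each tracecore_* metric.

    An empty suggestions list means the table has no direct replacement and
    the query needs a manual rewrite.
    """
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name in METRIC_RE.findall(line):
            findings.append((line_no, name, RENAMES.get(name, [])))
    return findings


if __name__ == "__main__":
    # Hypothetical alert expressions, for demonstration only.
    rules = (
        'expr: tracecore_queue_depth{exporter="otlphttp"} > 100\n'
        "expr: tracecore_build_info == 1\n"
    )
    for line_no, old, new in audit(rules):
        print(line_no, old, "->", ", ".join(new) or "manual rewrite needed")
```

The script only reports candidates and does not rewrite queries, because the gauge-to-counter change for `tracecore_exporter_failure_rate` means a plain name substitution would produce a query that parses but alerts on the wrong quantity.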