docs(migration): PR-L — flesh v0.1.x→v0.2.0 guide body #191

Merged: trilamsr merged 2 commits into main from pr-l-migration-guide-body on May 31, 2026

trilamsr commented on May 31, 2026

Summary

RFC-0013 PR-L — expand docs/migration/v0.1-to-v0.2.md from the PR-179 skeleton (plus the metric-name + chart-values rows landed inline in PR-A2 / #189) into a comprehensive v0.1.x → v0.2.0 cutover guide. Every post-wave-2 landing now has a corresponding operator-facing migration row.

Release notes: NONE

What landed

| Section added / expanded | Source of truth |
|---|---|
| **CLI surface**: table covering every removed subcommand (`collect`, `receivers list`, `debug dump`, `failure-inject`) and removed flag (`--log.format=text`, `--shutdown.drain-budget`, `--version-short`) + their upstream replacements | PR #189 release-notes block + the deleted `cmd/tracecore/` tree |
| **Helm chart values**: `telemetry.listen` + `telemetry.paths.*` → `telemetry.metricsListen` + `telemetry.healthListen` + `telemetry.healthPath` with default-port values | `install/kubernetes/tracecore/values.yaml` HEAD |
| **Probes**: `/healthz` + `/readyz` on `:8888` → `healthcheckextension` at `:13133/` | `install/kubernetes/tracecore/templates/daemonset.yaml` HEAD |
| **Default pipeline**: `clockreceiver → stdoutexporter` → `hostmetrics → debug` snippet | `values.yaml` `pipelines:` block HEAD |
| **Orphan components table**: all 9 (`clockreceiver`, `containerstdout`, `dcgm`, `k8sevents`, `kernelevents`, `nccl_fr`, `pyspy`, `otlphttp`, `stdoutexporter`) mapped to upstream replacement + PR-J recipe | `components/receivers/`, `components/exporters/` directory inventory + RFC-0013 §2 adoption matrix |
| **Self-telemetry metric vocabulary**: `tracecore_*` → `otelcol_*` for receiver / exporter / queue / component-status / build-info families with per-signal split | upstream OCB instrumentation conventions + `internal/integration/ocb_scrape_test.go` contract metrics |
| **`stdoutexporter` failure-rate gap**: `debugexporter` pins `otelcol_exporter_send_failed_*` at zero; debug-only pipelines lose the signal | upstream `debugexporter` semantics |
| **Build / CI changes**: Makefile, output path (`./_build/tracecore`), source tree, smoke, image build, release pipeline, version source | Makefile + `.goreleaser.yaml` + `.ko.yaml` + workflows HEAD |
| **`internal/*` package deletion (PR-F, in flight)**: per-package public-surface migration map for `selftelemetry`, `runtime/lifecycle`, `componentstatus`, `telemetry`, `pipeline`, `pipelinebuilder`, `consumer`, `fanout` | RFC-0013 PR-F scope + `internal/` tree inventory |
| **Reproducibility note**: `0.1.0-m9-alpha` hardcoded in `builder-config.yaml` `dist.version`; cross-ref to `docs/reproducibility.md` workaround | `builder-config.yaml` HEAD |
| **Verification**: adds probe smoke test + `tracecore components` parity check against the rendered config | new section |
| **Rollback**: recipe-toggle path is not available for the deleted set; pin chart + image at v0.1.x | corrects the v0.1.x-era rollback prose |

Closes RFC-0013 PR-L. Open follow-ups (PR-I in-repo submodule, PR-J upstream recipes, PR-K in-tree-receiver delete, PR-F internal/* delete) are referenced inline in the guide so the next agent picking up any of them lands the corresponding doc update in the same PR.

Adversarial pre-review notes

  • Verified component counts (builder-config.yaml: 6 receivers, 4 exporters, 3 extensions, 4 processors) against awk count of gomod: lines.
  • Verified hostmetrics is the default in values.yaml (enabled: true, loadscraper, 1s).
  • Verified cmd/tracecore, tools/components-gen, components.yaml all deleted from HEAD (git ls-files returns empty).
  • Verified internal/integration/ocb_scrape_test.go is present and asserts the two contract metrics named in the guide.
  • Verified the daemonset.yaml probes wire port: health (not port: telemetry) at healthPath.
  • Verified no broken markdown links — all 5 outbound links resolve (builder-config.yaml, ocb_scrape_test.go, RFC-0013 §3, RFC-0013 §migration, in-doc anchor).
  • One non-blocking observation: install/kubernetes/tracecore/README.md still references /healthz + /readyz on three lines (chart-doc rot from PR-A2 that didn't sweep the README). Out of scope for PR-L; flagging for the next chart-doc sweep.
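
The component-count check in the first bullet can be reproduced with a short awk pass. The following is only a sketch, not the command actually used for this PR: it assumes `builder-config.yaml` follows the usual OCB manifest layout (top-level `receivers:` / `exporters:` / `extensions:` / `processors:` keys in column 0, each followed by indented `- gomod:` entries), and the function name `count_gomod` is hypothetical.

```shell
# count_gomod FILE: print "<section> <count>" for every top-level YAML key
# that has at least one indented "gomod:" entry beneath it.
# Assumption: top-level keys start in column 0, module entries are indented.
count_gomod() {
  awk '
    /^[A-Za-z_]+:/ {              # a top-level key opens a new section
      section = $1
      sub(/:.*/, "", section)
      next
    }
    /^[ \t]+-?[ \t]*gomod:/ {     # a module entry inside the current section
      if (!(section in count)) order[++n] = section
      count[section]++
    }
    END {                         # report sections in file order
      for (i = 1; i <= n; i++) print order[i], count[order[i]]
    }
  ' "$1"
}
```

Running `count_gomod builder-config.yaml` and comparing the output against the counts quoted above (6 receivers, 4 exporters, 3 extensions, 4 processors) would be one way to repeat the check. Commented-out entries are ignored because the pattern requires `gomod:` to follow only whitespace and an optional list dash.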

Test plan

  • make doc-check passes (banned-phrase lint, link resolution, test-name parity, all 15 sub-checks green)
  • Pre-commit hook (golangci-lint, go vet, go mod verify, DCO + AI trailer) passes
  • Pre-push hook (no-autoupdate-check) passes
  • CI: chart, ci, install-bench workflows do not gate on this file; only doc-check matters for a docs-only PR

Expand the migration guide from the PR-179 skeleton (and the PR-A2
metric-name / chart-values rows landed inline) into a comprehensive
v0.1.x → v0.2.0 cutover guide. Every post-wave-2 landing now has an
operator-facing migration row:

- CLI surface table — every removed subcommand and flag from PR-A2,
  with the upstream replacement.
- Self-telemetry metric vocabulary table — `tracecore_*` → `otelcol_*`
  for receiver / exporter / queue / component-status / build-info
  families. Names the two contract metrics `ocb_scrape_test` pins.
- Helm chart values rename — `telemetry.listen` + `telemetry.paths.*`
  → `telemetry.metricsListen` + `telemetry.healthListen` +
  `telemetry.healthPath`. Default-port values inline.
- Probes — `:8888/healthz` + `:8888/readyz` → `:13133/`.
- Default pipeline — `clockreceiver→stdoutexporter` →
  `hostmetrics→debug` snippet.
- Orphan components — all 9 (7 receivers + 2 exporters) mapped to
  their upstream replacement + PR-J recipe.
- `stdoutexporter` failure-rate gap — debugexporter pins
  send_failed_* at zero, so debug-only pipelines lose the signal.
- Build / CI changes — Makefile, output path, source tree, smoke,
  image build, release pipeline, version source.
- `internal/*` package deletion (PR-F, in flight) — selftelemetry,
  lifecycle, componentstatus, telemetry public-surface migration map.
- Reproducibility note — `0.1.0-m9-alpha` hardcoded; cross-ref to
  `docs/reproducibility.md` workaround.
- Verification — adds probe smoke test + `tracecore components` parity
  check against the rendered config.
- Rollback — recipe-toggle path doesn't exist for the deleted set;
  pin chart + image at v0.1.x.

`make doc-check` clean (banned-phrase, link resolution, test-name
parity, etc.).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr

Code review

Found 2 issues:

  1. The orphan-components table and the closing paragraph point operators at <receiver>.recipe: upstream (lines 67, 69, 70, 82), but no recipe key exists in the chart today. install/kubernetes/tracecore/values.schema.json is additionalProperties: false at the chart root and on every receiver block, so an operator who runs --set receivers.containerstdout.recipe=upstream gets schema-rejected by helm install --dry-run before anything else fires. This is a forward-looking design from RFC-0013 that PR-J will land — the guide should either gate the advice on PR-J landing or reword to describe what operators do in v0.2.0 today (pin chart and image at v0.1.x, wait for PR-J).

|---|---|---|---|
| `receivers.clockreceiver` | receiver | `hostmetricsreceiver` (loadscraper) | Already the default. Remove the `clockreceiver` block; the chart no-ops if you leave `clockreceiver.enabled: false`. |
| `receivers.containerstdout` | receiver | `filelogreceiver` + container stanza + `file_storage` extension | Set `containerstdout.recipe: upstream` once PR-J ships; the chart still flips pod `runAsUser=0` and creates the RBAC ClusterRole for `/var/log/pods` access. |
| `receivers.dcgm` | receiver | `dcgm-exporter` DaemonSet + `prometheusreceiver` | Deploy `dcgm-exporter` via NVIDIA's chart; set `gpu.nvidia.recipe: prometheus` to wire the scrape. |
| `receivers.k8sevents` | receiver | `k8sobjectsreceiver` + OTTL `k8s.event.hint` transform | Set `k8sevents.recipe: upstream`; the OTTL transform preserves the 11-entry `k8s.event.hint` enum (RFC-0013 §3 contract). |
| `receivers.kernelevents` | receiver | `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL Xid transform | Set `kernelevents.recipe: upstream`; OTTL transform keeps `kernelevents.xid` attribute populated. |
| `receivers.nccl_fr` | receiver | In-repo Go submodule via OCB `gomod:` (PR-I) + `replaces: ./module` | No operator action; the receiver ships in `module/receiver/ncclfrreceiver` and OCB pulls it like any upstream module. |
| `receivers.pyspy` | receiver | Deferred until OTel Profiles GA | Receiver toggle survives until contrib ships `pprofreceiver`. No replacement to wire today. |
| `exporters.stdoutexporter` | exporter | `debugexporter` (OCB-bundled, chart default) | Replace `exporters.stdoutexporter` with `exporters.debug` in pipelines. The debug exporter writes to pod stdout, same observation channel. |
| `exporters.otlphttp` (in-tree clone) | exporter | `otlphttpexporter` (OCB-bundled) | Same chart key (`otlphttp`), same field shape — `endpoint`, `compression`, `headers`, `tls.*`, `timeout`, `retry_on_failure`, `sending_queue` pass through to the upstream exporter without translation. |
To verify what's actually registered in the binary you're running:
```bash
./_build/tracecore components
```
Until PR-J ships the upstream recipes, the migration path for the in-tree receivers other than `clockreceiver`/`stdoutexporter` is: pin v0.1.x → wait for PR-J → cut over with `<receiver>.recipe: upstream` in one minor.

  2. Rollback section recommends pinning image.tag: 0.1.0-m8-alpha and chart appVersion: 0.1.0-m8-alpha, but no v0.1.0-m8-alpha release exists. gh release list shows the only v0.1.x tag is v0.1.0-m1; the chart's current appVersion is 0.1.0-m9-alpha (unreleased, lives only on main). Operators following the guide would helm install --set image.tag=0.1.0-m8-alpha and get ImagePullBackOff. Suggest pointing at v0.1.0-m1 or the actual last published release tag, and either dropping the chart appVersion line (charts are pinned by --version, not appVersion) or naming the matching chart-package version.

## Rollback
The OCB binary does not bundle the in-tree receivers anymore (`builder-config.yaml` does not list them; OCB regeneration would be required to add them back). Recipe-toggle rollback is not available for the deleted set. If the v0.2.0 deploy fails health checks, pin the chart and image at the prior v0.1.x version (`image.tag: 0.1.0-m8-alpha`, chart `appVersion: 0.1.0-m8-alpha`). No data is mutated on upgrade.
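
The failure mode in the second issue (pinning a tag that was never published) can be caught before `helm install` by checking the pin against the list of published tags. This is a minimal sketch under stated assumptions: `tag_is_published` is a hypothetical helper, and it reads one tag name per line on stdin so it stays independent of whichever tool produces the list.

```shell
# tag_is_published TAG: succeed only if TAG appears as a whole line on stdin.
# -F matches the tag literally, -x requires a full-line match (so v0.1.0-m1
# does not match v0.1.0-m10), -q suppresses output.
tag_is_published() {
  grep -Fxq -- "$1"
}
```

One possible invocation, assuming the remote is named `origin`: `git ls-remote --tags --refs origin | sed 's|.*refs/tags/||' | tag_is_published v0.1.0-m1`. The match is exact, so a pin written without the leading `v` would need to be compared against whatever naming the image registry actually uses.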

🤖 Generated with Claude Code

Drop the fictional `<receiver>.recipe: upstream` advice from the
orphan-components table and the closing paragraph — that switch lands
in PR-J, not v0.2.0, so today the only operator action for those
receivers is to leave them disabled and pin v0.1.x if they need the
signal. Rewrite the rollback section to point at the actual `v0.1.0-m1`
tag instead of the non-existent `0.1.0-m8-alpha`, and drop the chart
`appVersion` pin — charts are pinned by `--version` (chart-package
version), not by `appVersion`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) May 31, 2026 06:10
@trilamsr trilamsr merged commit 1dc839d into main May 31, 2026
11 checks passed
@trilamsr trilamsr deleted the pr-l-migration-guide-body branch May 31, 2026 06:17
trilamsr added a commit that referenced this pull request May 31, 2026
…de (#200)

## Summary

Adds `docs/migration/v0.2-to-v0.3.md` covering the v0.3.0
security-posture migration per [RFC-0013
§migration](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0013-distro-first-pivot.md#migration--rollout).
The single operator-visible break at v0.3.0 is the Python-profiling
story: the cooperative `pyspy` receiver (zero capabilities added;
in-process `faulthandler.dump_traceback` over UDS per
[RFC-0009](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0009-pyspy-receiver-scope.md))
is deleted in PR-M, and the replacement `parca-agent` (eBPF) requires
`CAP_SYS_ADMIN` (or root) + `hostPID: true` + a BTF-enabled kernel ≥5.3.

The guide names:

- The exact upstream capability requirement (`root` or `CAP_SYS_ADMIN`
per
[parca-dev/parca-agent](https://www.parca.dev/docs/parca-agent-security))
and why `CAP_BPF`/`CAP_PERFMON` is **not** yet a documented narrower
alternative (conservative grant remains `CAP_SYS_ADMIN`).
- Failure shapes by **syscall + errno** (`bpf(BPF_PROG_LOAD,…)` →
`EPERM`, `perf_event_open(…)` → `EACCES`,
`open("/sys/kernel/btf/vmlinux",…)` → `ENOENT`, etc.) — the stable
surface across parca-agent versions, not paraphrased agent log strings.
- A minimum-grant container `SecurityContext` snippet (DaemonSet-shaped,
`add: [SYS_ADMIN]` not `privileged: true`) with explicit disclaimers
about `readOnlyRootFilesystem` (deferred to upstream manifest
verification) and PSS interactions.
- A clean rollback path (pin v0.2.x chart + image; `tracecore-pyspy`
PyPI helper remains installable one minor past v0.3.0).
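
The two host-side prerequisites named above (kernel ≥5.3 and a readable `/sys/kernel/btf/vmlinux`) lend themselves to a preflight check before scheduling the eBPF agent on a node. The sketch below is illustrative only: `kernel_at_least` and `ebpf_preflight` are hypothetical names, the comparison looks at major.minor only, and it does not verify capabilities, `hostPID`, or anything else the guide covers.

```shell
# kernel_at_least MIN RELEASE: succeed if RELEASE (as printed by `uname -r`)
# is at least MIN, comparing major.minor only.
kernel_at_least() {
  min_major=${1%%.*}
  min_minor=${1#*.}; min_minor=${min_minor%%.*}
  rel_major=${2%%.*}
  rel_rest=${2#*.}
  rel_minor=${rel_rest%%[!0-9]*}   # drop ".patch-suffix" after the minor
  [ "$rel_major" -gt "$min_major" ] ||
    { [ "$rel_major" -eq "$min_major" ] && [ "$rel_minor" -ge "$min_minor" ]; }
}

# ebpf_preflight [BTF_PATH]: check kernel version and BTF availability.
# BTF_PATH defaults to the path named in the failure-shape list.
ebpf_preflight() {
  btf=${1:-/sys/kernel/btf/vmlinux}
  kernel_at_least 5.3 "$(uname -r)" ||
    { echo "kernel $(uname -r) is older than 5.3" >&2; return 1; }
  [ -r "$btf" ] ||
    { echo "no readable BTF at $btf" >&2; return 1; }
}
```

A missing BTF file here corresponds to the `open("/sys/kernel/btf/vmlinux",…)` → `ENOENT` failure shape; the `EPERM` and `EACCES` shapes depend on the pod's granted capabilities and are out of reach of a host-only check like this one.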

### Why this PR exists despite cooperative pyspy needing zero capabilities

tracecore's `pyspy` receiver does **not** use `CAP_SYS_PTRACE`. Per
[RFC-0009 §Safety
properties](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0009-pyspy-receiver-scope.md#proposal)
and the chart's `conftest` policy (`add: []` asserted in `chart.yml`),
the cooperative design walks Python frames in-process via
`faulthandler.dump_traceback` and ships them over UDS — no `ptrace`, no
`process_vm_readv`. The security-posture change at v0.3.0 is the
**delta** from the cooperative path (zero capabilities, tracecore pod)
to the eBPF path (`CAP_SYS_ADMIN`, separate `parca-agent` pod). PR-N
documents that delta.

### Why now (not deferred with PR-M)

parca-agent research confirms the OTel Profiles signal is still
**Alpha** (Mar 2026 release) and parca-agent has no OTLP profiles
exporter yet. That means operators upgrading to v0.3.0 will run
parca-agent **alongside** tracecore for at least one more minor — they
need the security-posture delta documented before PR-M cuts the
receiver. PR-N landing ahead of PR-M gives operators an evaluation
window.

### Drift fix

`docs/migration/v0.1-to-v0.2.md`'s `pyspy` row claimed "Deferred until
OTel Profiles GA. No upstream replacement exists today; the toggle
survives until contrib ships `pprofreceiver`." This contradicted
RFC-0013, which has named `parca-agent` (separate DaemonSet) as the
replacement since the pivot landed. Updated the row to forward-reference
the new guide and removed the stale "no upstream replacement" claim.
Root cause: the v0.1-to-v0.2.md skeleton (PR #179, then fleshed in PR
#191) predated the RFC-0013 §2 adoption-matrix line that explicitly maps
`components/receivers/pyspy/` → `parca-agent` at v0.3.0. Fixed at the
row, not paved over.

## Test plan

- [x] `make doc-check` — 510 markdown links resolve to on-disk files;
banned-phrase lint clean across 109 markdown files; new file's RFC +
pyspy README/RUNBOOK cross-links verified.
- [x] `make check` — `golangci-lint run ./...` 0 issues; `go vet ./...`
clean; `go mod verify` clean.
- [x] Pre-commit hooks (no-autoupdate-check, license-check) green at
push time.
- [ ] Reviewer sanity-check the SecurityContext YAML snippet renders as
valid Kubernetes `apps/v1.DaemonSet` (apiVersion + spec.template
structure).
- [ ] Reviewer sanity-check the failure-mode table syscall + errno
columns match Linux man-page conventions (`bpf(2)`,
`perf_event_open(2)`).

```release-notes
NONE
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

Reconcile the four pivot-tracking docs
(`docs/rfcs/0013-distro-first-pivot.md`, `CHANGELOG.md`,
`MILESTONES.md`, `docs/migration/v0.1-to-v0.2.md`) with the wave-3
(PR-B1-shape sibling ports) and wave-4 (PR-B2-shape upstream-only ports
+ PR-F.1 + PR-J + PR-L + PR-N) landings. Pure doc sweep — no code or
config touched.

## What changed

### `docs/rfcs/0013-distro-first-pivot.md` §migration

PR sequence rows updated with PR-number citations and landed markers:

- **PR-A2** (landed, #189, 2026-05-30)
- **PR-B2** (landed, #201) — also enumerates sibling-receiver follow-ups
under PR-B2 to dispel the slug collision with #188's PR-B2-labelled dcgm
port: stdoutexporter (#202), pyspy (#203), kernelevents (#208),
containerstdout (#209)
- **PR-F.1** (landed) — fleshed-out delete list
(`internal/{selftelemetry,telemetry}` + `components/receivers/dcgm/` +
`pkg/dcgm/` + one orphan clockreceiver integration test)
- **PR-F.2** re-scoped — now deletes the whole
`internal/{componentstatus,pipeline,pipelinebuilder,consumer,fanout,runtime/lifecycle}`
bundle in one cut once the last three pipeline+consumer-importing
receivers land (#204 k8sevents, #205 clockreceiver, #207 otlphttp). Per
the import graph, `internal/componentstatus`'s only non-test consumer is
`internal/pipeline`, so the two packages are deleted together
- **PR-G** (landed, #182), **PR-H** (landed, #183)
- **PR-I.1a** (in flight — scaffold agent), **PR-I.1b** (pre-staged;
gate satisfied by #201)
- **PR-J** (landed, #195) — kept existing marker
- **PR-K.1** (in flight — separate agent landing)
- **PR-L** (landed, skeleton #179 + body #191) — flagged as living
document
- **PR-N** (landed, #200) — shipped at v0.1.0 ahead of v0.3.0 as a
doc-only update at `docs/migration/v0.2-to-v0.3.md`

### `CHANGELOG.md` [Unreleased]

- Restructured the pivot wave list as **four waves** (was three). Wave 3
enumerates PR-B1-shape sibling ports + support infra (#180-#194/#196).
Wave 4 enumerates PR-B2-shape upstream-only ports + PR-J (#195) + PR-F.1
(#206) + PR-N (#200) + lint/TOCTOU hardening (#198/#210).
- Tightened the PR-F.2 deferred note to point at the three open ports
(#204/#205/#207) as the gate.

### `MILESTONES.md`

- **M1** (pipeline runtime) — status row now cites PR-A2 (#189), PR-F.1
(#206), PR-F.2 gate (#204/#205/#207), PR-E (#180), retains
`internal/config/` (still load-bearing for `tracecore validate`).
- **M2** (self-telemetry) — status row now cites PR-F.1 (#206); flags
`internal/componentstatus` as travelling with `internal/pipeline` in
PR-F.2.
- **M8** (DCGM receiver) — status flipped to *landed-and-replaced*:
cites PR-F.1 (#206) deletion + PR-J (#195)
`docs/integrations/prometheus-scrape.md` recipe. Notes the inert chart
toggle retention until PR-K.3.

### `docs/migration/v0.1-to-v0.2.md`

- §`internal/*` package deletion (PR-F) status flips from "not yet open"
to "PR-F.1 landed (#206), PR-F.2 gated on three open ports".
- Open-items checklist expanded from 5 to 13 entries — tracks every PR
letter the migration guide cares about (A2 / E / F.1 / F.2 / I.1a-c / J
/ K.1-3 / L / N) with PR numbers and links.

## Why now

Tracking docs accumulated drift across wave-3 + wave-4 because every
sibling-port PR (and the support-infra PRs around them) updated the
bottom of `CHANGELOG.md` but did not always touch the upstream
sequencing section in RFC-0013. Per memory rule `[Keeping this document
current]`: status drift is a review blocker. This PR is the consolidated
catch-up; future port PRs include their RFC-row flip in-PR.

## What this PR does NOT change

- No code, no config, no YAML, no chart — only the four tracking docs.
- No new doc gates added; existing gates pass.
- No files other than the four named docs are modified.

## Test plan

- [x] `bash scripts/doc-check.sh` clean (33 test refs, 528 links
resolve, comment-noise diff gate clean vs `origin/main`, all 13 gates
green).
- [x] Pre-commit hook (`commitlint` 72-char subject limit + DCO +
AI-trailer gates) passed.
- [x] Pre-push hook (`make ci-fast` equivalent: `golangci-lint`, `go
vet`, `go mod verify`, `no-autoupdate-check`, `doc-check.sh`) passed on
second attempt after `git fetch origin main` populated the worktree's
`origin/main` ref — first push failed because the worktree previously
tracked the (gone) `pr-a2-ocb-main-swap` branch, so `doc-check.sh`'s
comment-noise diff-scope gate exited 128 on the missing ref. Root cause
fixed by the fetch; not a workaround.
- [ ] CI green on this branch.
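
The pre-push failure described above (a diff-scope gate exiting 128 because `origin/main` was missing in the worktree) can be guarded against by confirming the ref exists before running the gate. This is a sketch, not part of `doc-check.sh`: `ref_exists` and `ensure_remote_ref` are hypothetical helpers, and the fetch is assumed to populate the remote-tracking ref as it did in the case reported here.

```shell
# ref_exists REF: succeed silently if REF resolves in the current repository.
ref_exists() {
  git rev-parse --verify --quiet "$1" >/dev/null
}

# ensure_remote_ref REMOTE BRANCH: fetch BRANCH from REMOTE only when the
# remote-tracking ref refs/remotes/REMOTE/BRANCH is missing locally.
ensure_remote_ref() {
  ref_exists "refs/remotes/$1/$2" || git fetch "$1" "$2"
}
```

For example, `ensure_remote_ref origin main && bash scripts/doc-check.sh` would run the gate only once the comparison ref is present.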

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>