diff --git a/docs/migration/v0.2-to-v0.3.md b/docs/migration/v0.2-to-v0.3.md
index 93cba6cb..764962df 100644
--- a/docs/migration/v0.2-to-v0.3.md
+++ b/docs/migration/v0.2-to-v0.3.md
@@ -1,22 +1,28 @@
 # Migration: v0.2.x → v0.3.0
 
-This guide tells operators how to move from a `v0.2.x` deployment to `v0.3.0`. The single operator-visible break this release is the Python-profiling story: tracecore's in-tree `pyspy` receiver and its `tracecore-pyspy` PyPI helper are deleted, and the upstream-recipe replacement (`parca-agent`) changes the SecurityContext budget. Everything else is unchanged from `v0.2.x` — see [`v0.1-to-v0.2.md`](v0.1-to-v0.2.md) for the prior cut's surface.
+This guide tells operators how to move from a `v0.2.x` deployment to `v0.3.0`. **The Python-profiling story is unchanged in v0.3.0**: the cooperative `pyspy` receiver and its `tracecore-pyspy` PyPI helper ship as in `v0.2.x`, with the same zero-capability posture. PR-M (delete pyspy + ship `parca-agent` recipe) has been **deferred to v0.4.0+** per [#222](https://github.com/TraceCoreAI/tracecore/issues/222). Everything else is unchanged from `v0.2.x` — see [`v0.1-to-v0.2.md`](v0.1-to-v0.2.md) for the prior cut's surface.
+
+This guide remains in the v0.2→v0.3 lane because the security-posture work (PR-N) landed at v0.3.0 as **operator preparation material** for the eventual v0.4.0+ cutover. The CAP_SYS_PTRACE → CAP_SYS_ADMIN/CAP_BPF migration is a forward-looking reference: it tells operators what the eBPF-profiler future will require so they can budget cluster policy, kernel versions, and PSS exceptions ahead of time.
 
 ## TL;DR
 
-The cooperative `pyspy` receiver (Phase 2 design: in-process Python helper + UDS + `faulthandler.dump_traceback`, zero capabilities added per [RFC-0009 §Safety properties](../rfcs/0009-pyspy-receiver-scope.md)) is deleted in PR-M. Operators who want Python stack sampling deploy `parca-agent` as a separate DaemonSet via the upstream chart. `parca-agent` is an eBPF profiler and **requires `CAP_SYS_ADMIN` (or root)** on its pod — a strictly larger capability budget than the zero-capability cooperative pyspy needed. The tracecore DaemonSet's capability set is unchanged (still `drop: [ALL]`, `add: []`); the new capability lives on the `parca-agent` pod, deployed and governed separately by the operator.
+**v0.3.0 actual behaviour.** The cooperative `pyspy` receiver (Phase 2 design: in-process Python helper + UDS + `faulthandler.dump_traceback`, zero capabilities added per [RFC-0009 §Safety properties](../rfcs/0009-pyspy-receiver-scope.md)) **ships as-is in v0.3.0**. No deletion, no PyPI yank, no chart-values key removal, no SecurityContext change on the tracecore DaemonSet — operator action required to upgrade from v0.2.x to v0.3.0 is *zero* on the profiling surface.
+
+**v0.4.0+ planned behaviour (PR-M, deferred).** When PR-M lands, the cooperative pyspy receiver and `tracecore-pyspy` PyPI helper are deleted, and operators who want Python stack sampling deploy `parca-agent` as a separate DaemonSet via the upstream chart. `parca-agent` is an eBPF profiler and **requires `CAP_SYS_ADMIN` (or root)** on its pod — a strictly larger capability budget than the zero-capability cooperative pyspy needs. The tracecore DaemonSet's capability set will remain unchanged (`drop: [ALL]`, `add: []`); the new capability lives on the `parca-agent` pod, deployed and governed separately by the operator.
+
+**Re-evaluation triggers** (per [#222](https://github.com/TraceCoreAI/tracecore/issues/222)): PR-M unblocks when (1) OTel Profiles reaches Beta and the `service.profilesSupport` feature-gate is removed, **and** (2) parca-agent gains OTLP export (or PR-M is re-scoped to an "otelcol-ebpf-profiler sibling distro" pattern). Neither condition is met at v0.3.0 cut.
 
-This guide names the new capability surface, the kernel requirement, the failure modes operators will see if the SecurityContext is too restrictive, and a minimum-grant SecurityContext snippet. The cooperative-pyspy design rationale (why tracecore avoided `CAP_SYS_PTRACE` for two minor releases) is in [RFC-0009 §Alternatives](../rfcs/0009-pyspy-receiver-scope.md#alternatives-considered); the deletion rationale is in [RFC-0013 §Adoption matrix](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix).
+The remainder of this guide names the eventual capability surface, the kernel requirement, the failure modes operators will see if the SecurityContext is too restrictive, and a minimum-grant SecurityContext snippet — all material that operators planning v0.4.0+ upgrades should pre-evaluate today. The cooperative-pyspy design rationale (why tracecore avoided `CAP_SYS_PTRACE` for two minor releases) is in [RFC-0009 §Alternatives](../rfcs/0009-pyspy-receiver-scope.md#alternatives-considered); the deferral rationale is in [#222](https://github.com/TraceCoreAI/tracecore/issues/222).
 
-## Why the security posture changes
+## Why the security posture will change (at v0.4.0+)
 
-Cooperative pyspy (v0.1.x, v0.2.x) walked Python frames **inside** the workload process via `faulthandler.dump_traceback`, then shipped the rendered output over a per-process Unix domain socket. No memory of another process was read; no `ptrace`, no `process_vm_readv`, no signal. The tradeoff was operator-side: every workload had to `pip install tracecore-pyspy` and call `attach()` once at startup, and the helper only worked against cooperating CPython interpreters.
+Cooperative pyspy (v0.1.x, v0.2.x, **and v0.3.0**) walks Python frames **inside** the workload process via `faulthandler.dump_traceback`, then ships the rendered output over a per-process Unix domain socket. No memory of another process is read; no `ptrace`, no `process_vm_readv`, no signal. The tradeoff is operator-side: every workload has to `pip install tracecore-pyspy` and call `attach()` once at startup, and the helper only works against cooperating CPython interpreters.
 
-`parca-agent` (v0.3.0+) walks frames **out-of-process** via eBPF programs attached to the kernel's perf-events subsystem. The eBPF approach removes the workload-side cooperation requirement (any binary the kernel can sample is in scope, including non-Python runtimes), and removes the per-language helper-distribution problem. The cost is privilege: loading eBPF programs requires `CAP_SYS_ADMIN` (or root), and reading symbolized stacks from kernel + user space requires that the agent see the global PID namespace (`hostPID: true`) and the on-disk binaries of every workload it samples.
+`parca-agent` (v0.4.0+ when PR-M lands) walks frames **out-of-process** via eBPF programs attached to the kernel's perf-events subsystem. The eBPF approach removes the workload-side cooperation requirement (any binary the kernel can sample is in scope, including non-Python runtimes), and removes the per-language helper-distribution problem. The cost is privilege: loading eBPF programs requires `CAP_SYS_ADMIN` (or root), and reading symbolized stacks from kernel + user space requires that the agent see the global PID namespace (`hostPID: true`) and the on-disk binaries of every workload it samples.
 
-The change is a tradeoff, not a regression: tracecore preserves the cooperative path through end-of-life at v0.3.0 specifically so operators with restricted-tier Pod Security Standards have one release to evaluate whether the eBPF capability cost is acceptable for their cluster.
+The change is a tradeoff, not a regression: tracecore preserves the cooperative path past v0.3.0 to give operators with restricted-tier Pod Security Standards multiple releases to evaluate whether the eBPF capability cost is acceptable for their cluster.
 
-## What `parca-agent` requires
+## What `parca-agent` will require (forward-looking, v0.4.0+)
 
 | Requirement | Value | Source |
 |---|---|---|
@@ -28,7 +34,7 @@ The change is a tradeoff, not a regression: tracecore preserves the cooperative
 
 **On `CAP_BPF` / `CAP_PERFMON`.** Linux kernel 5.8 split `CAP_SYS_ADMIN`'s BPF surface into the narrower `CAP_BPF` (load BPF programs and maps) + `CAP_PERFMON` (open perf events) capabilities. In principle a profiler that uses only BPF + perf-events can run with `add: [BPF, PERFMON]` instead of `add: [SYS_ADMIN]`. **Upstream `parca-agent` does not document support for this narrower set today** (per its security docs, the requirement is `root` or `CAP_SYS_ADMIN`); operators interested in the narrower split should track [parca-dev/parca-agent#3115](https://github.com/parca-dev/parca-agent/issues) (CAP_BPF/CAP_PERFMON tracking) and validate against their kernel before relying on it. The conservative grant remains `CAP_SYS_ADMIN`.
 
-## What tracecore's pod still requires
+## What tracecore's pod still requires (v0.3.0)
 
 **Unchanged from v0.2.x.** The tracecore DaemonSet's container SecurityContext is still:
 
@@ -41,11 +47,11 @@ containerSecurityContext:
     add: []
 ```
 
-The chart's conftest policy (`install/kubernetes/tracecore/policy/`) still rejects any capability addition — there is no v0.3.0 operator path that puts `CAP_SYS_ADMIN` on the tracecore pod itself. All new capability surface lives on the **separate** `parca-agent` DaemonSet.
+The chart's conftest policy (`install/kubernetes/tracecore/policy/`) still rejects any capability addition — there is no operator path (v0.3.0 *or* v0.4.0+) that puts `CAP_SYS_ADMIN` on the tracecore pod itself. When PR-M lands, all new capability surface will live on the **separate** `parca-agent` DaemonSet.
 
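The `CAP_BPF` / `CAP_PERFMON` note above depends on whether a node's kernel is at least 5.8. A minimal sketch of that comparison follows, assuming a hypothetical helper name `kernel_at_least` (not part of tracecore or parca-agent) and release strings that begin with `major.minor`, as `uname -r` normally prints. It checks the kernel version only and says nothing about whether `parca-agent` itself supports the narrower grant.

```bash
#!/bin/sh
# Hypothetical helper: succeeds when RELEASE is at least REQUIRED.
# Assumes both arguments start with "major.minor" (e.g. 5.8, 5.15.0-91-generic).
# Usage: kernel_at_least REQUIRED RELEASE
kernel_at_least() {
    _req_major=${1%%.*}
    _req_rest=${1#*.}
    _req_minor=${_req_rest%%[!0-9]*}
    _rel_major=${2%%.*}
    _rel_rest=${2#*.}
    # Keep only the leading digits of the minor field ("15.0-91-generic" -> "15").
    _rel_minor=${_rel_rest%%[!0-9]*}

    if [ "$_rel_major" -gt "$_req_major" ]; then return 0; fi
    if [ "$_rel_major" -lt "$_req_major" ]; then return 1; fi
    [ "$_rel_minor" -ge "$_req_minor" ]
}

# Example: report on the machine this runs on.
if kernel_at_least 5.8 "$(uname -r)"; then
    echo "kernel has the CAP_BPF / CAP_PERFMON split"
else
    echo "kernel predates 5.8: only CAP_SYS_ADMIN (or root) can load BPF programs"
fi
```

Run on a node (or in a node debug pod), this only answers the kernel-side question. The conservative `CAP_SYS_ADMIN` grant described above still applies until upstream documents the narrower set.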
-## Minimum-grant `parca-agent` SecurityContext
+## Minimum-grant `parca-agent` SecurityContext (forward-looking)
 
-A starting point for operators who want to deploy `parca-agent` alongside tracecore. Place the agent in its own namespace; do not co-locate it in the tracecore pod.
+A starting point for operators who want to plan ahead for the v0.4.0+ `parca-agent` deployment alongside tracecore. Place the agent in its own namespace; do not co-locate it in the tracecore pod. **Do not deploy this in v0.3.0** — `parca-agent` is not part of v0.3.0's recipe set; the cooperative pyspy receiver is still the supported path.
 
 ```yaml
 apiVersion: apps/v1
@@ -98,9 +104,9 @@ This is a **starting point**, not the upstream-recommended manifest. Pull the ca
 1. **Pod Security Standards.** `hostPID: true` and `add: [SYS_ADMIN]` both violate **baseline** PSS (and therefore restricted). Clusters with namespace labels `pod-security.kubernetes.io/enforce: baseline` (or restricted) must place `parca-agent` in an exempted namespace.
 2. **OPA / Kyverno cluster policies.** Custom admission policies that ban capability additions, `hostPID`, or host-path mounts must add a `parca-agent`-namespace exception.
 
-## Failure modes when capabilities are missing
+## Failure modes when capabilities are missing (forward-looking, v0.4.0+)
 
-These are the kernel-level failure shapes operators will see in `kubectl logs ds/parca-agent` when the SecurityContext is too restrictive. The agent's exact log strings vary by parca-agent version; the **errno / syscall** column is the stable surface — grep for the syscall name + errno code rather than the prose string.
+These are the kernel-level failure shapes operators will see in `kubectl logs ds/parca-agent` when PR-M has landed and the SecurityContext is too restrictive. The agent's exact log strings vary by parca-agent version; the **errno / syscall** column is the stable surface — grep for the syscall name + errno code rather than the prose string.
 
 | Failure shape | Underlying syscall + errno | Root cause | Remediation |
 |---|---|---|---|
@@ -112,30 +118,22 @@ These are the kernel-level failure shapes operators will see in `kubectl logs ds
 
 When triaging a real failure, capture the agent's full log (`kubectl logs --previous` for crash loops) and check it against [parca-dev/parca-agent/issues](https://github.com/parca-dev/parca-agent/issues) — operator failures outside the patterns above are upstream concerns, not tracecore concerns.
 
-## Helper / receiver removal checklist
+## Helper / receiver removal checklist (forward-looking, v0.4.0+)
 
-The following artefacts are gone at v0.3.0. Any operator config or CI workflow that references them fails fast (chart-render rejects unknown receiver keys; `pip install` fails on the deleted PyPI package).
+**Nothing to remove in v0.3.0.** This checklist is the *eventual* artefact removal once PR-M lands at v0.4.0+. Operators should not act on it at the v0.2.x → v0.3.0 upgrade; it is here so config / CI / alerting owners can stage the eventual cleanup ahead of time.
 
-| Artefact | Action required |
+| Artefact | Action required (at v0.4.0+, not v0.3.0) |
 |---|---|
-| Chart values key `receivers.pyspy.*` | Remove the block. Chart-render in v0.3.0 emits a `NOTES.txt` deprecation warning for one minor; v0.4.0 removes the key entirely. |
-| `pip install tracecore-pyspy` in workload images | Remove from `Dockerfile` / `requirements.txt`. The PyPI package is yanked at v0.3.0; rebuilds will fail with `No matching distribution`. |
+| Chart values key `receivers.pyspy.*` | Remove the block. The chart will emit a `NOTES.txt` deprecation warning for one minor before the values key is removed. |
+| `pip install tracecore-pyspy` in workload images | Remove from `Dockerfile` / `requirements.txt`. The PyPI package will be yanked when PR-M lands; rebuilds will then fail with `No matching distribution`. |
 | Workload-side `from tracecore_pyspy import attach; attach()` calls | Delete the import and call. No-op replacement — `parca-agent` requires zero workload code changes. |
 | Per-pod `/var/run/tracecore/pyspy/` `emptyDir` volume | Remove from your Pod spec. Was only needed for the UDS rendezvous. |
 | Alerts on `tracecore_receiver_errors_total{component="pyspy",kind=…}` | Delete. No corresponding metric in `parca-agent`; pivot to `parca_agent_*` self-metrics if you alert on profiler health. |
-| Pre-merge CI hooks for `tools/pyspy-lint` | Delete. The symbol-table lint guarded the cooperative receiver's "no out-of-process memory reads" property; it has no purpose once the receiver is gone. |
+| Pre-merge CI hooks for `tools/pyspy-lint` | Delete. The symbol-table lint guards the cooperative receiver's "no out-of-process memory reads" property; it has no purpose once the receiver is gone. |
 
 ## Verification
 
-1. **Before upgrading**, confirm parca-agent is deployable on at least one canary node:
-
-   ```bash
-   # Verify kernel BTF on a canary node
-   kubectl debug node/ -it --image=busybox -- ls -la /host/sys/kernel/btf/vmlinux
-   # Expect: file exists. If missing, kernel upgrade required before v0.3.0 cutover.
-   ```
-
-2. **After upgrading**, verify the tracecore pod's SecurityContext is unchanged:
+1. **After upgrading to v0.3.0**, verify the tracecore pod's SecurityContext is unchanged:
 
    ```bash
    kubectl -n tracecore-system get ds tracecore -o yaml \
@@ -143,17 +141,19 @@ The following artefacts are gone at v0.3.0. Any operator config or CI workflow t
    # Expect: capabilities.drop == [ALL], capabilities.add == [] or null.
    ```
 
-3. **Verify parca-agent boot** (in its own namespace):
+2. **Cooperative pyspy still works in v0.3.0.** No re-deploy of the helper is required; existing `tracecore-pyspy` `attach()` calls and `receivers.pyspy.*` chart-values keys continue to function. The receiver remains registered in v0.3.0's OCB binary.
+
+3. **Forward-looking: BTF check for the eventual parca-agent migration.** Operators planning the v0.4.0+ upgrade can confirm kernel BTF availability on canary nodes ahead of time:
 
    ```bash
-   kubectl -n parca logs ds/parca-agent --tail=50 \
-     | grep -E 'started|listening|attached'
-   # Expect: "started" line. EPERM / BTF errors per the table above indicate misconfiguration.
+   # Verify kernel BTF on a canary node (v0.4.0+ prerequisite)
+   kubectl debug node/ -it --image=busybox -- ls -la /host/sys/kernel/btf/vmlinux
+   # Expect: file exists. If missing, kernel upgrade required before the v0.4.0+ cutover.
    ```
 
 ## Rollback
 
-The cooperative pyspy receiver is **not** registered in v0.3.0's OCB binary (per `builder-config.yaml`). Recipe-toggle rollback is not available. If parca-agent doesn't meet your security or compatibility budget, pin the chart and image at the last v0.2.x tag (`v0.2.0-…`; substitute the latest `v0.2.x` tag from `git tag -l 'v0.2.*'`) and keep running the cooperative receiver:
+There is no profiling-surface rollback required for the v0.2.x → v0.3.0 upgrade — the cooperative pyspy receiver is still registered in v0.3.0's OCB binary, the `tracecore-pyspy` PyPI package is still installable, and the chart's `receivers.pyspy.*` values key is still honoured. If a v0.3.0 upgrade introduces an unrelated regression, the standard chart-pin rollback applies:
 
 ```bash
 helm upgrade tracecore install/kubernetes/tracecore \
@@ -161,15 +161,16 @@ helm upgrade tracecore install/kubernetes/tracecore \
   --set image.tag=
 ```
 
-The cooperative receiver's PyPI helper (`tracecore-pyspy`) remains installable from PyPI's archive for one minor release after v0.3.0 cuts; pin `tracecore-pyspy==0.1.0` in your workload `requirements.txt`. PyPI yank happens at v0.4.0.
+(Substitute the latest `v0.2.x` tag from `git tag -l 'v0.2.*'`.)
 
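Picking the latest `v0.2.x` tag by eye gets error-prone once patch numbers reach two digits, because plain lexical order puts `v0.2.10` before `v0.2.9`. A small sketch of a version-aware pick follows, assuming a hypothetical helper name `latest_v02_tag` and a `sort` that supports `-V` (GNU coreutils does). Tags with pre-release suffixes are not treated specially and may not order as intended.

```bash
#!/bin/sh
# Hypothetical helper: reads tag names on stdin, keeps those starting with
# "v0.2.", and prints the highest one by version order (not lexical order).
latest_v02_tag() {
    grep -E '^v0\.2\.' | sort -V | tail -n 1
}

# Usage inside a tracecore checkout:
#   git tag -l 'v0.2.*' | latest_v02_tag
```

Whether the chart version and image tag match the git tag verbatim is an assumption to check against your registry before pinning.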
 ## References
 
-- [RFC-0013 §Adoption matrix](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix) — why pyspy is deleted in favour of parca-agent
-- [RFC-0013 §Migration / rollout](../rfcs/0013-distro-first-pivot.md#migration--rollout) — PR-M and PR-N sequencing
-- [RFC-0009 §Safety properties](../rfcs/0009-pyspy-receiver-scope.md#proposal) — historical record of the cooperative receiver's zero-capability design
-- [`components/receivers/pyspy/README.md`](../../components/receivers/pyspy/README.md) — cooperative receiver's user-facing docs (carries the v0.3.0 deletion banner)
-- [`components/receivers/pyspy/RUNBOOK.md`](../../components/receivers/pyspy/RUNBOOK.md) — per-kind operator triage for the cooperative receiver (preserved for operators still on v0.2.x)
+- [#222: PR-M deferral memo](https://github.com/TraceCoreAI/tracecore/issues/222) — current PR-M status + re-evaluation triggers (OTel Profiles → Beta, parca-agent OTLP export)
+- [RFC-0013 §Adoption matrix](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix) — why pyspy is on the eventual deletion path in favour of parca-agent (note: timing in the RFC predates the #222 deferral)
+- [RFC-0013 §Migration / rollout](../rfcs/0013-distro-first-pivot.md#migration--rollout) — original PR-M and PR-N sequencing (supersede with #222 for current timeline)
+- [RFC-0009 §Safety properties](../rfcs/0009-pyspy-receiver-scope.md#proposal) — design record of the cooperative receiver's zero-capability posture (still in force at v0.3.0)
+- [`components/receivers/pyspy/README.md`](../../components/receivers/pyspy/README.md) — cooperative receiver's user-facing docs (the receiver ships in v0.3.0)
+- [`components/receivers/pyspy/RUNBOOK.md`](../../components/receivers/pyspy/RUNBOOK.md) — per-kind operator triage for the cooperative receiver
 - [Parca Agent / Requirements](https://github.com/parca-dev/parca-agent#requirements)
 - [Parca Agent / Security](https://www.parca.dev/docs/parca-agent-security)
 - [Linux Yama LSM (`ptrace_scope`)](https://docs.kernel.org/admin-guide/LSM/Yama.html) — relevant for operators evaluating in-cluster debugging policy alongside eBPF profiling