Merged
81 changes: 41 additions & 40 deletions docs/migration/v0.2-to-v0.3.md
@@ -1,22 +1,28 @@
# Migration: v0.2.x → v0.3.0

This guide tells operators how to move from a `v0.2.x` deployment to `v0.3.0`. The single operator-visible break this release is the Python-profiling story: tracecore's in-tree `pyspy` receiver and its `tracecore-pyspy` PyPI helper are deleted, and the upstream-recipe replacement (`parca-agent`) changes the SecurityContext budget. Everything else is unchanged from `v0.2.x` — see [`v0.1-to-v0.2.md`](v0.1-to-v0.2.md) for the prior cut's surface.
This guide tells operators how to move from a `v0.2.x` deployment to `v0.3.0`. **The Python-profiling story is unchanged in v0.3.0**: the cooperative `pyspy` receiver and its `tracecore-pyspy` PyPI helper ship as in `v0.2.x`, with the same zero-capability posture. PR-M (delete pyspy + ship `parca-agent` recipe) has been **deferred to v0.4.0+** per [#222](https://github.com/TraceCoreAI/tracecore/issues/222). Everything else is unchanged from `v0.2.x` — see [`v0.1-to-v0.2.md`](v0.1-to-v0.2.md) for the prior cut's surface.

This guide remains in the v0.2→v0.3 lane because the security-posture work (PR-N) landed at v0.3.0 as **operator preparation material** for the eventual v0.4.0+ cutover. The capability migration it describes, from the zero-capability cooperative receiver to `CAP_SYS_ADMIN` (or, if upstream later supports it, `CAP_BPF` + `CAP_PERFMON`), is a forward-looking reference: it tells operators what the eBPF-profiler future will require so they can budget cluster policy, kernel versions, and PSS exceptions ahead of time.

## TL;DR

The cooperative `pyspy` receiver (Phase 2 design: in-process Python helper + UDS + `faulthandler.dump_traceback`, zero capabilities added per [RFC-0009 §Safety properties](../rfcs/0009-pyspy-receiver-scope.md)) is deleted in PR-M. Operators who want Python stack sampling deploy `parca-agent` as a separate DaemonSet via the upstream chart. `parca-agent` is an eBPF profiler and **requires `CAP_SYS_ADMIN` (or root)** on its pod — a strictly larger capability budget than the zero-capability cooperative pyspy needed. The tracecore DaemonSet's capability set is unchanged (still `drop: [ALL]`, `add: []`); the new capability lives on the `parca-agent` pod, deployed and governed separately by the operator.
**v0.3.0 actual behaviour.** The cooperative `pyspy` receiver (Phase 2 design: in-process Python helper + UDS + `faulthandler.dump_traceback`, zero capabilities added per [RFC-0009 §Safety properties](../rfcs/0009-pyspy-receiver-scope.md)) **ships as-is in v0.3.0**. No deletion, no PyPI yank, no chart-values key removal, no SecurityContext change on the tracecore DaemonSet — operator action required to upgrade from v0.2.x to v0.3.0 is *zero* on the profiling surface.

**v0.4.0+ planned behaviour (PR-M, deferred).** When PR-M lands, the cooperative pyspy receiver and `tracecore-pyspy` PyPI helper are deleted, and operators who want Python stack sampling deploy `parca-agent` as a separate DaemonSet via the upstream chart. `parca-agent` is an eBPF profiler and **requires `CAP_SYS_ADMIN` (or root)** on its pod — a strictly larger capability budget than the zero-capability cooperative pyspy needs. The tracecore DaemonSet's capability set will remain unchanged (`drop: [ALL]`, `add: []`); the new capability lives on the `parca-agent` pod, deployed and governed separately by the operator.

**Re-evaluation triggers** (per [#222](https://github.com/TraceCoreAI/tracecore/issues/222)): PR-M unblocks when (1) OTel Profiles reaches Beta and the `service.profilesSupport` feature-gate is removed, **and** (2) parca-agent gains OTLP export (or PR-M is re-scoped to an "otelcol-ebpf-profiler sibling distro" pattern). Neither condition is met at v0.3.0 cut.

This guide names the new capability surface, the kernel requirement, the failure modes operators will see if the SecurityContext is too restrictive, and a minimum-grant SecurityContext snippet. The cooperative-pyspy design rationale (why tracecore avoided `CAP_SYS_PTRACE` for two minor releases) is in [RFC-0009 §Alternatives](../rfcs/0009-pyspy-receiver-scope.md#alternatives-considered); the deletion rationale is in [RFC-0013 §Adoption matrix](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix).
The remainder of this guide names the eventual capability surface, the kernel requirement, the failure modes operators will see if the SecurityContext is too restrictive, and a minimum-grant SecurityContext snippet — all material that operators planning v0.4.0+ upgrades should pre-evaluate today. The cooperative-pyspy design rationale (why tracecore avoided `CAP_SYS_PTRACE` for two minor releases) is in [RFC-0009 §Alternatives](../rfcs/0009-pyspy-receiver-scope.md#alternatives-considered); the deferral rationale is in [#222](https://github.com/TraceCoreAI/tracecore/issues/222).

## Why the security posture changes
## Why the security posture will change (at v0.4.0+)

Cooperative pyspy (v0.1.x, v0.2.x) walked Python frames **inside** the workload process via `faulthandler.dump_traceback`, then shipped the rendered output over a per-process Unix domain socket. No memory of another process was read; no `ptrace`, no `process_vm_readv`, no signal. The tradeoff was operator-side: every workload had to `pip install tracecore-pyspy` and call `attach()` once at startup, and the helper only worked against cooperating CPython interpreters.
Cooperative pyspy (v0.1.x, v0.2.x, **and v0.3.0**) walks Python frames **inside** the workload process via `faulthandler.dump_traceback`, then ships the rendered output over a per-process Unix domain socket. No memory of another process is read; no `ptrace`, no `process_vm_readv`, no signal. The tradeoff is operator-side: every workload has to `pip install tracecore-pyspy` and call `attach()` once at startup, and the helper only works against cooperating CPython interpreters.
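To make the in-process model concrete, here is a minimal sketch of the general technique: render the interpreter's own stacks with the standard-library `faulthandler` module and ship the text over a Unix domain socket. It is not the `tracecore-pyspy` implementation; the function names `render_stacks` and `ship` and the socket path passed in are hypothetical, chosen only for illustration.

```python
import faulthandler
import socket
import tempfile


def render_stacks() -> bytes:
    """Render the stacks of all threads in *this* process as text.

    faulthandler writes straight to a file descriptor, so a real
    temporary file is used rather than an in-memory buffer.
    """
    with tempfile.TemporaryFile() as fh:
        faulthandler.dump_traceback(file=fh, all_threads=True)
        fh.seek(0)
        return fh.read()


def ship(sock_path: str, payload: bytes) -> None:
    """Send the rendered stacks to a listener on a Unix domain socket.

    sock_path is a hypothetical rendezvous path supplied by the caller.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as conn:
        conn.connect(sock_path)
        conn.sendall(payload)
```

Everything here runs inside the workload: the process inspects only its own interpreter state, which is why this style of collection needs no added capability and no access to another process's memory.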

`parca-agent` (v0.3.0+) walks frames **out-of-process** via eBPF programs attached to the kernel's perf-events subsystem. The eBPF approach removes the workload-side cooperation requirement (any binary the kernel can sample is in scope, including non-Python runtimes), and removes the per-language helper-distribution problem. The cost is privilege: loading eBPF programs requires `CAP_SYS_ADMIN` (or root), and reading symbolized stacks from kernel + user space requires that the agent see the global PID namespace (`hostPID: true`) and the on-disk binaries of every workload it samples.
`parca-agent` (v0.4.0+ when PR-M lands) walks frames **out-of-process** via eBPF programs attached to the kernel's perf-events subsystem. The eBPF approach removes the workload-side cooperation requirement (any binary the kernel can sample is in scope, including non-Python runtimes), and removes the per-language helper-distribution problem. The cost is privilege: loading eBPF programs requires `CAP_SYS_ADMIN` (or root), and reading symbolized stacks from kernel + user space requires that the agent see the global PID namespace (`hostPID: true`) and the on-disk binaries of every workload it samples.

The change is a tradeoff, not a regression: tracecore preserves the cooperative path through end-of-life at v0.3.0 specifically so operators with restricted-tier Pod Security Standards have one release to evaluate whether the eBPF capability cost is acceptable for their cluster.
The change is a tradeoff, not a regression: tracecore preserves the cooperative path past v0.3.0 to give operators with restricted-tier Pod Security Standards multiple releases to evaluate whether the eBPF capability cost is acceptable for their cluster.

## What `parca-agent` requires
## What `parca-agent` will require (forward-looking, v0.4.0+)

| Requirement | Value | Source |
|---|---|---|
@@ -28,7 +34,7 @@ The change is a tradeoff, not a regression: tracecore preserves the cooperative

**On `CAP_BPF` / `CAP_PERFMON`.** Linux kernel 5.8 split `CAP_SYS_ADMIN`'s BPF surface into the narrower `CAP_BPF` (load BPF programs and maps) + `CAP_PERFMON` (open perf events) capabilities. In principle a profiler that uses only BPF + perf-events can run with `add: [BPF, PERFMON]` instead of `add: [SYS_ADMIN]`. **Upstream `parca-agent` does not document support for this narrower set today** (per its security docs, the requirement is `root` or `CAP_SYS_ADMIN`); operators interested in the narrower split should track [parca-dev/parca-agent#3115](https://github.com/parca-dev/parca-agent/issues) (CAP_BPF/CAP_PERFMON tracking) and validate against their kernel before relying on it. The conservative grant remains `CAP_SYS_ADMIN`.
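When evaluating which grant a container actually received, it can help to decode the effective capability mask the kernel reports in `/proc/<pid>/status`. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and the bit positions are those defined in the Linux `capability.h` header (`CAP_SYS_PTRACE` = 19, `CAP_SYS_ADMIN` = 21, `CAP_PERFMON` = 38, `CAP_BPF` = 39). It says nothing about which set `parca-agent` accepts; it only shows what a process holds.

```python
from typing import Set

# Bit positions from the Linux uapi capability header.
CAP_BITS = {
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
    "CAP_PERFMON": 38,
    "CAP_BPF": 39,
}


def decode_caps(cap_hex: str) -> Set[str]:
    """Return the capabilities of interest set in a hex mask such as CapEff."""
    mask = int(cap_hex, 16)
    return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}


def effective_caps(status_text: str) -> Set[str]:
    """Extract and decode the CapEff line from /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split(":", 1)[1].strip())
    raise ValueError("no CapEff line found")


if __name__ == "__main__":
    # Inspect the current process (Linux only).
    with open("/proc/self/status", "r", encoding="utf-8") as fh:
        print(sorted(effective_caps(fh.read())))
```

Run inside the container under test, a zero-capability pod prints an empty list, while a pod granted `add: [SYS_ADMIN]` reports `CAP_SYS_ADMIN`.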

## What tracecore's pod still requires
## What tracecore's pod still requires (v0.3.0)

**Unchanged from v0.2.x.** The tracecore DaemonSet's container SecurityContext is still:

@@ -41,11 +47,11 @@ containerSecurityContext:
add: []
```

The chart's conftest policy (`install/kubernetes/tracecore/policy/`) still rejects any capability addition — there is no v0.3.0 operator path that puts `CAP_SYS_ADMIN` on the tracecore pod itself. All new capability surface lives on the **separate** `parca-agent` DaemonSet.
The chart's conftest policy (`install/kubernetes/tracecore/policy/`) still rejects any capability addition — there is no operator path (v0.3.0 *or* v0.4.0+) that puts `CAP_SYS_ADMIN` on the tracecore pod itself. When PR-M lands, all new capability surface will live on the **separate** `parca-agent` DaemonSet.

## Minimum-grant `parca-agent` SecurityContext
## Minimum-grant `parca-agent` SecurityContext (forward-looking)

A starting point for operators who want to deploy `parca-agent` alongside tracecore. Place the agent in its own namespace; do not co-locate it in the tracecore pod.
A starting point for operators who want to plan ahead for the v0.4.0+ `parca-agent` deployment alongside tracecore. Place the agent in its own namespace; do not co-locate it in the tracecore pod. **Do not deploy this in v0.3.0** — `parca-agent` is not part of v0.3.0's recipe set; the cooperative pyspy receiver is still the supported path.

```yaml
apiVersion: apps/v1
@@ -98,9 +104,9 @@ This is a **starting point**, not the upstream-recommended manifest. Pull the ca
1. **Pod Security Standards.** `hostPID: true` and `add: [SYS_ADMIN]` both violate **baseline** PSS (and therefore restricted). Clusters with namespace labels `pod-security.kubernetes.io/enforce: baseline` (or restricted) must place `parca-agent` in an exempted namespace.
2. **OPA / Kyverno cluster policies.** Custom admission policies that ban capability additions, `hostPID`, or host-path mounts must add a `parca-agent`-namespace exception.

## Failure modes when capabilities are missing
## Failure modes when capabilities are missing (forward-looking, v0.4.0+)

These are the kernel-level failure shapes operators will see in `kubectl logs ds/parca-agent` when the SecurityContext is too restrictive. The agent's exact log strings vary by parca-agent version; the **errno / syscall** column is the stable surface — grep for the syscall name + errno code rather than the prose string.
These are the kernel-level failure shapes operators will see in `kubectl logs ds/parca-agent` when PR-M has landed and the SecurityContext is too restrictive. The agent's exact log strings vary by parca-agent version; the **errno / syscall** column is the stable surface — grep for the syscall name + errno code rather than the prose string.

| Failure shape | Underlying syscall + errno | Root cause | Remediation |
|---|---|---|---|
@@ -112,64 +118,59 @@ These are the kernel-level failure shapes operators will see in `kubectl logs ds

When triaging a real failure, capture the agent's full log (`kubectl logs --previous` for crash loops) and check it against [parca-dev/parca-agent/issues](https://github.com/parca-dev/parca-agent/issues) — operator failures outside the patterns above are upstream concerns, not tracecore concerns.

## Helper / receiver removal checklist
## Helper / receiver removal checklist (forward-looking, v0.4.0+)

The following artefacts are gone at v0.3.0. Any operator config or CI workflow that references them fails fast (chart-render rejects unknown receiver keys; `pip install` fails on the deleted PyPI package).
**Nothing to remove in v0.3.0.** This checklist is the *eventual* artefact removal once PR-M lands at v0.4.0+. Operators should not act on it at the v0.2.x → v0.3.0 upgrade; it is here so config / CI / alerting owners can stage the eventual cleanup ahead of time.

| Artefact | Action required |
| Artefact | Action required (at v0.4.0+, not v0.3.0) |
|---|---|
| Chart values key `receivers.pyspy.*` | Remove the block. Chart-render in v0.3.0 emits a `NOTES.txt` deprecation warning for one minor; v0.4.0 removes the key entirely. |
| `pip install tracecore-pyspy` in workload images | Remove from `Dockerfile` / `requirements.txt`. The PyPI package is yanked at v0.3.0; rebuilds will fail with `No matching distribution`. |
| Chart values key `receivers.pyspy.*` | Remove the block. The chart will emit a `NOTES.txt` deprecation warning for one minor before the values key is removed. |
| `pip install tracecore-pyspy` in workload images | Remove from `Dockerfile` / `requirements.txt`. The PyPI package will be yanked when PR-M lands; rebuilds will then fail with `No matching distribution`. |
| Workload-side `from tracecore_pyspy import attach; attach()` calls | Delete the import and call. No-op replacement — `parca-agent` requires zero workload code changes. |
| Per-pod `/var/run/tracecore/pyspy/` `emptyDir` volume | Remove from your Pod spec. Was only needed for the UDS rendezvous. |
| Alerts on `tracecore_receiver_errors_total{component="pyspy",kind=…}` | Delete. No corresponding metric in `parca-agent`; pivot to `parca_agent_*` self-metrics if you alert on profiler health. |
| Pre-merge CI hooks for `tools/pyspy-lint` | Delete. The symbol-table lint guarded the cooperative receiver's "no out-of-process memory reads" property; it has no purpose once the receiver is gone. |
| Pre-merge CI hooks for `tools/pyspy-lint` | Delete. The symbol-table lint guards the cooperative receiver's "no out-of-process memory reads" property; it has no purpose once the receiver is gone. |
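To stage the eventual cleanup, config and CI owners may want an inventory of where the artefacts in the table are referenced today. The following is a minimal sketch, not a tracecore tool: `MARKERS` and `scan_tree` are hypothetical names, and the search is a plain substring match on the literal strings from the checklist. A nested `receivers:` / `pyspy:` block in a values file is not detected by substring matching and still needs a manual check.

```python
import os
from typing import Dict, List, Tuple

# Literal strings taken from the checklist above.
MARKERS = [
    "tracecore-pyspy",            # PyPI package name in Dockerfile / requirements.txt
    "tracecore_pyspy",            # Python import in workload code
    "/var/run/tracecore/pyspy",   # emptyDir mount used for the UDS rendezvous
    'component="pyspy"',          # alert expressions on receiver error metrics
    "tools/pyspy-lint",           # pre-merge CI hook
]


def scan_tree(root: str) -> Dict[str, List[Tuple[str, int]]]:
    """Map each marker to the (path, line number) pairs where it appears."""
    hits: Dict[str, List[Tuple[str, int]]] = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip VCS metadata; prune in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    for lineno, line in enumerate(fh, 1):
                        for marker in MARKERS:
                            if marker in line:
                                hits.setdefault(marker, []).append((path, lineno))
            except OSError:
                continue  # unreadable file: ignore and keep scanning
    return hits


if __name__ == "__main__":
    for marker, places in scan_tree(".").items():
        for path, lineno in places:
            print(f"{marker}\t{path}:{lineno}")
```

The output is an inventory only. Per the note above the table, nothing on it should be removed at the v0.2.x → v0.3.0 upgrade.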

## Verification

1. **Before upgrading**, confirm parca-agent is deployable on at least one canary node:

```bash
# Verify kernel BTF on a canary node
kubectl debug node/<canary-node> -it --image=busybox -- ls -la /host/sys/kernel/btf/vmlinux
# Expect: file exists. If missing, kernel upgrade required before v0.3.0 cutover.
```

2. **After upgrading**, verify the tracecore pod's SecurityContext is unchanged:
1. **After upgrading to v0.3.0**, verify the tracecore pod's SecurityContext is unchanged:

```bash
kubectl -n tracecore-system get ds tracecore -o yaml \
| yq '.spec.template.spec.containers[0].securityContext'
# Expect: capabilities.drop == [ALL], capabilities.add == [] or null.
```

3. **Verify parca-agent boot** (in its own namespace):
2. **Cooperative pyspy still works in v0.3.0.** No re-deploy of the helper is required; existing `tracecore-pyspy` `attach()` calls and `receivers.pyspy.*` chart-values keys continue to function. The receiver remains registered in v0.3.0's OCB binary.

3. **Forward-looking: BTF check for the eventual parca-agent migration.** Operators planning the v0.4.0+ upgrade can confirm kernel BTF availability on canary nodes ahead of time:

```bash
kubectl -n parca logs ds/parca-agent --tail=50 \
| grep -E 'started|listening|attached'
# Expect: "started" line. EPERM / BTF errors per the table above indicate misconfiguration.
# Verify kernel BTF on a canary node (v0.4.0+ prerequisite)
kubectl debug node/<canary-node> -it --image=busybox -- ls -la /host/sys/kernel/btf/vmlinux
# Expect: file exists. If missing, kernel upgrade required before the v0.4.0+ cutover.
```
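The SecurityContext expectation in step 1 (`capabilities.drop == [ALL]`, `capabilities.add == []` or null) can also be asserted programmatically, for example in a post-upgrade CI job. The sketch below assumes `kubectl` is on the PATH and reuses the namespace and DaemonSet name from the command in step 1; `check_caps` and `fetch_security_context` are hypothetical helper names.

```python
import json
import subprocess
from typing import List, Optional


def check_caps(security_context: Optional[dict]) -> List[str]:
    """Return a list of deviations from the zero-capability posture."""
    problems: List[str] = []
    caps = (security_context or {}).get("capabilities") or {}
    drop = caps.get("drop") or []
    add = caps.get("add") or []
    if drop != ["ALL"]:
        problems.append(f"capabilities.drop is {drop}, expected ['ALL']")
    if add:
        problems.append(f"capabilities.add is {add}, expected empty")
    return problems


def fetch_security_context(namespace: str = "tracecore-system",
                           name: str = "tracecore") -> Optional[dict]:
    """Read the first container's securityContext from the DaemonSet."""
    out = subprocess.run(
        ["kubectl", "-n", namespace, "get", "ds", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    spec = json.loads(out)["spec"]["template"]["spec"]
    return spec["containers"][0].get("securityContext")


if __name__ == "__main__":
    issues = check_caps(fetch_security_context())
    for issue in issues:
        print("FAIL:", issue)
    raise SystemExit(1 if issues else 0)
```

A non-zero exit status means the tracecore pod's capability set differs from the posture this guide describes as unchanged.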

## Rollback

The cooperative pyspy receiver is **not** registered in v0.3.0's OCB binary (per `builder-config.yaml`). Recipe-toggle rollback is not available. If parca-agent doesn't meet your security or compatibility budget, pin the chart and image at the last v0.2.x tag (`v0.2.0-…`; substitute the latest `v0.2.x` tag from `git tag -l 'v0.2.*'`) and keep running the cooperative receiver:
There is no profiling-surface rollback required for the v0.2.x → v0.3.0 upgrade — the cooperative pyspy receiver is still registered in v0.3.0's OCB binary, the `tracecore-pyspy` PyPI package is still installable, and the chart's `receivers.pyspy.*` values key is still honoured. If a v0.3.0 upgrade introduces an unrelated regression, the standard chart-pin rollback applies:

```bash
helm upgrade tracecore install/kubernetes/tracecore \
--version <chart-package version matching v0.2.x> \
--set image.tag=<v0.2.x binary tag>
```

The cooperative receiver's PyPI helper (`tracecore-pyspy`) remains installable from PyPI's archive for one minor release after v0.3.0 cuts; pin `tracecore-pyspy==0.1.0` in your workload `requirements.txt`. PyPI yank happens at v0.4.0.
(Substitute the latest `v0.2.x` tag from `git tag -l 'v0.2.*'`.)

## References

- [RFC-0013 §Adoption matrix](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix) — why pyspy is deleted in favour of parca-agent
- [RFC-0013 §Migration / rollout](../rfcs/0013-distro-first-pivot.md#migration--rollout) — PR-M and PR-N sequencing
- [RFC-0009 §Safety properties](../rfcs/0009-pyspy-receiver-scope.md#proposal) — historical record of the cooperative receiver's zero-capability design
- [`components/receivers/pyspy/README.md`](../../components/receivers/pyspy/README.md) — cooperative receiver's user-facing docs (carries the v0.3.0 deletion banner)
- [`components/receivers/pyspy/RUNBOOK.md`](../../components/receivers/pyspy/RUNBOOK.md) — per-kind operator triage for the cooperative receiver (preserved for operators still on v0.2.x)
- [#222: PR-M deferral memo](https://github.com/TraceCoreAI/tracecore/issues/222) — current PR-M status + re-evaluation triggers (OTel Profiles → Beta, parca-agent OTLP export)
- [RFC-0013 §Adoption matrix](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix) — why pyspy is on the eventual deletion path in favour of parca-agent (note: timing in the RFC predates the #222 deferral)
- [RFC-0013 §Migration / rollout](../rfcs/0013-distro-first-pivot.md#migration--rollout) — original PR-M and PR-N sequencing (superseded by #222 for the current timeline)
- [RFC-0009 §Safety properties](../rfcs/0009-pyspy-receiver-scope.md#proposal) — design record of the cooperative receiver's zero-capability posture (still in force at v0.3.0)
- [`components/receivers/pyspy/README.md`](../../components/receivers/pyspy/README.md) — cooperative receiver's user-facing docs (the receiver ships in v0.3.0)
- [`components/receivers/pyspy/RUNBOOK.md`](../../components/receivers/pyspy/RUNBOOK.md) — per-kind operator triage for the cooperative receiver
- [Parca Agent / Requirements](https://github.com/parca-dev/parca-agent#requirements)
- [Parca Agent / Security](https://www.parca.dev/docs/parca-agent-security)
- [Linux Yama LSM (`ptrace_scope`)](https://docs.kernel.org/admin-guide/LSM/Yama.html) — relevant for operators evaluating in-cluster debugging policy alongside eBPF profiling