From 83adce2298d7610d4a0d0dc9617516105c5a9d8d Mon Sep 17 00:00:00 2001
From: Tri Lam <tri@maydow.com>
Date: Sat, 30 May 2026 23:46:30 -0700
Subject: [PATCH] =?UTF-8?q?docs(security):=20PR-N=20=E2=80=94=20pyspy=20ca?=
 =?UTF-8?q?pability=20surface=20+=20SecurityContext=20guide?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Adds docs/migration/v0.2-to-v0.3.md covering the v0.3.0 security-posture
migration per RFC-0013 §migration. Cooperative pyspy (zero capabilities,
in-process faulthandler) is deleted at v0.3.0; operators who want Python
profiling deploy parca-agent, which requires CAP_SYS_ADMIN (or root) +
hostPID + BTF-enabled kernel.

The guide names the exact capability surface, kernel requirement, kernel
syscall + errno failure shapes (not paraphrased agent log strings),
minimum-grant SecurityContext snippet, and rollback path. Conservative
on CAP_BPF/CAP_PERFMON — upstream parca-agent does not document the
narrower split today.

Updates docs/migration/v0.1-to-v0.2.md pyspy row to forward-reference
the new guide (was claiming "no upstream replacement exists today" —
RFC-0013 names parca-agent at v0.3.0).

Updates docs/README.md to index the migration/ subdirectory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
---
 docs/README.md                 |   1 +
 docs/migration/v0.1-to-v0.2.md |   2 +-
 docs/migration/v0.2-to-v0.3.md | 176 +++++++++++++++++++++++++++++++++
 3 files changed, 178 insertions(+), 1 deletion(-)
 create mode 100644 docs/migration/v0.2-to-v0.3.md

diff --git a/docs/README.md b/docs/README.md
index 78b8b4b5..7d8f5330 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -36,6 +36,7 @@ Legend: 👤 operator · 🛠️ contributor · 🏛️ maintainer · 🌐 exter
 | [examples/](examples/) | 👤 | Reference operator artifacts (Prometheus alerts, Grafana dashboard, with-telemetry config). |
 | [followups/](followups/) | 🏛️ | Per-milestone follow-up shards + cross-cutting `_needs-prod-data` / `_needs-gpu` buckets. See [followups/README.md](followups/README.md) for filing convention. |
 | [integrations/](integrations/) | 👤 | Validated recipes for shipping tracecore output to specific backends. See per-recipe rows below. |
+| [migration/](migration/) | 👤 | Per-minor-release upgrade guides covering every operator-visible break. One file per release boundary. |
 | [notes/](notes/) | 🛠️ 🏛️ | Working notes on process, CI, PR workflow, reviews, conftest, autonomous-run logs. See [notes/README.md](notes/README.md). |
 
 ## Integrations
diff --git a/docs/migration/v0.1-to-v0.2.md b/docs/migration/v0.1-to-v0.2.md
index 7ba0e171..d7e6496c 100644
--- a/docs/migration/v0.1-to-v0.2.md
+++ b/docs/migration/v0.1-to-v0.2.md
@@ -69,7 +69,7 @@ The OCB-assembled binary registers only the components listed in [`builder-confi
 | `receivers.k8sevents` | receiver | `k8sobjectsreceiver` + OTTL `k8s.event.hint` transform | Leave `k8sevents.enabled: false` (default). The PR-J recipe ships the OTTL transform that preserves the 11-entry `k8s.event.hint` enum (RFC-0013 §3 contract); until then, pin v0.1.x if you alert on `k8s.event.hint`. |
 | `receivers.kernelevents` | receiver | `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL Xid transform | Leave `kernelevents.enabled: false` (default). The PR-J recipe ships the OTTL transform that keeps `kernelevents.xid` populated; until then, pin v0.1.x if you alert on Xid codes. |
 | `receivers.nccl_fr` | receiver | In-repo Go submodule via OCB `gomod:` (PR-I) + `replaces: ./module` | No operator action; the receiver ships in `module/receiver/ncclfrreceiver` and OCB pulls it like any upstream module. |
-| `receivers.pyspy` | receiver | Deferred until OTel Profiles GA | Leave `pyspy.enabled: false` (default). No upstream replacement exists today; the toggle survives until contrib ships `pprofreceiver`. |
+| `receivers.pyspy` | receiver | `parca-agent` (separate DaemonSet, eBPF) at v0.3.0 | Leave `pyspy.enabled: false` (default) through v0.2.x. At v0.3.0 the receiver + PyPI helper are deleted; the security-posture migration (zero-capability cooperative receiver → `parca-agent`'s `CAP_SYS_ADMIN` on a separate pod) is documented in [`v0.2-to-v0.3.md`](v0.2-to-v0.3.md). |
 | `exporters.stdoutexporter` | exporter | `debugexporter` (OCB-bundled, chart default) | Replace `exporters.stdoutexporter` with `exporters.debug` in pipelines. The debug exporter writes to pod stdout, same observation channel. |
 | `exporters.otlphttp` (in-tree clone) | exporter | `otlphttpexporter` (OCB-bundled) | Same chart key (`otlphttp`), same field shape — `endpoint`, `compression`, `headers`, `tls.*`, `timeout`, `retry_on_failure`, `sending_queue` pass through to the upstream exporter without translation. |
 
diff --git a/docs/migration/v0.2-to-v0.3.md b/docs/migration/v0.2-to-v0.3.md
new file mode 100644
index 00000000..93cba6cb
--- /dev/null
+++ b/docs/migration/v0.2-to-v0.3.md
@@ -0,0 +1,176 @@
+# Migration: v0.2.x → v0.3.0
+
+This guide tells operators how to move from a `v0.2.x` deployment to `v0.3.0`. The single operator-visible break in this release is Python profiling: tracecore's in-tree `pyspy` receiver and its `tracecore-pyspy` PyPI helper are deleted, and the upstream-recipe replacement (`parca-agent`) changes the SecurityContext budget. Everything else is unchanged from `v0.2.x`; see [`v0.1-to-v0.2.md`](v0.1-to-v0.2.md) for the prior cut's surface.
+
+## TL;DR
+
+The cooperative `pyspy` receiver (Phase 2 design: in-process Python helper + UDS + `faulthandler.dump_traceback`, zero capabilities added per [RFC-0009 §Safety properties](../rfcs/0009-pyspy-receiver-scope.md)) is deleted in PR-M. Operators who want Python stack sampling deploy `parca-agent` as a separate DaemonSet via the upstream chart. `parca-agent` is an eBPF profiler and **requires `CAP_SYS_ADMIN` (or root)** on its pod, a strictly larger capability budget than cooperative pyspy, which added no capabilities. The tracecore DaemonSet's capability set is unchanged (still `drop: [ALL]`, `add: []`); the new capability lives on the `parca-agent` pod, which the operator deploys and governs separately.
+
+This guide names the new capability surface, the kernel requirement, the failure modes operators will see if the SecurityContext is too restrictive, and a minimum-grant SecurityContext snippet. The cooperative-pyspy design rationale (why tracecore avoided `CAP_SYS_PTRACE` for two minor releases) is in [RFC-0009 §Alternatives](../rfcs/0009-pyspy-receiver-scope.md#alternatives-considered); the deletion rationale is in [RFC-0013 §Adoption matrix](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix).
+
+## Why the security posture changes
+
+Cooperative pyspy (v0.1.x, v0.2.x) walked Python frames **inside** the workload process via `faulthandler.dump_traceback`, then shipped the rendered output over a per-process Unix domain socket. No memory of another process was read; no `ptrace`, no `process_vm_readv`, no signal. The tradeoff was operator-side: every workload had to `pip install tracecore-pyspy` and call `attach()` once at startup, and the helper only worked against cooperating CPython interpreters.
+
+`parca-agent` (v0.3.0+) walks frames **out-of-process** via eBPF programs attached to the kernel's perf-events subsystem. The eBPF approach removes the workload-side cooperation requirement (any binary the kernel can sample is in scope, including non-Python runtimes), and removes the per-language helper-distribution problem. The cost is privilege: loading eBPF programs requires `CAP_SYS_ADMIN` (or root), and reading symbolized stacks from kernel + user space requires that the agent see the global PID namespace (`hostPID: true`) and the on-disk binaries of every workload it samples.
+
+The change is a tradeoff, not a regression: tracecore kept the cooperative path through v0.2.x, up to its end-of-life at v0.3.0, specifically so that operators with restricted-tier Pod Security Standards have one release to evaluate whether the eBPF capability cost is acceptable for their cluster.
+
+## What `parca-agent` requires
+
+| Requirement | Value | Source |
+|---|---|---|
+| Linux kernel | ≥ 5.3 with BTF (`CONFIG_DEBUG_INFO_BTF=y`) | [Parca Agent docs / Requirements](https://github.com/parca-dev/parca-agent#requirements) |
+| User | `root` **OR** `CAP_SYS_ADMIN` (no narrower split documented upstream) | [Parca Agent docs / Security](https://www.parca.dev/docs/parca-agent-security) |
+| Pod-level | `hostPID: true` (cross-namespace process visibility for symbolization) | upstream DaemonSet manifest |
+| Volumes | `/sys` (BPF FS, perf-events), `/proc` (process discovery), `/run` (BPF map persistence), host filesystem for symbol resolution | upstream DaemonSet manifest |
+| Container | Privileged **OR** `add: [SYS_ADMIN]` on top of `drop: [ALL]` | upstream documentation |
+
+**On `CAP_BPF` / `CAP_PERFMON`.** Linux kernel 5.8 split `CAP_SYS_ADMIN`'s BPF surface into the narrower `CAP_BPF` (load BPF programs and maps) + `CAP_PERFMON` (open perf events) capabilities. In principle a profiler that uses only BPF + perf-events can run with `add: [BPF, PERFMON]` instead of `add: [SYS_ADMIN]`. **Upstream `parca-agent` does not document support for this narrower set today** (per its security docs, the requirement is `root` or `CAP_SYS_ADMIN`); operators interested in the narrower split should track [parca-dev/parca-agent#3115](https://github.com/parca-dev/parca-agent/issues) (CAP_BPF/CAP_PERFMON tracking) and validate against their kernel before relying on it. The conservative grant remains `CAP_SYS_ADMIN`.
+
+## What tracecore's pod still requires
+
+**Unchanged from v0.2.x.** The tracecore DaemonSet's container SecurityContext is still:
+
+```yaml
+containerSecurityContext:
+  allowPrivilegeEscalation: false
+  readOnlyRootFilesystem: true
+  capabilities:
+    drop: [ALL]
+    add: []
+```
+
+The chart's conftest policy (`install/kubernetes/tracecore/policy/`) still rejects any capability addition — there is no v0.3.0 operator path that puts `CAP_SYS_ADMIN` on the tracecore pod itself. All new capability surface lives on the **separate** `parca-agent` DaemonSet.
+
+## Minimum-grant `parca-agent` SecurityContext
+
+A starting point for operators who want to deploy `parca-agent` alongside tracecore. Place the agent in its own namespace; do not co-locate it in the tracecore pod.
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: parca-agent
+  namespace: parca
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/name: parca-agent
+  template:
+    metadata:
+      labels:
+        app.kubernetes.io/name: parca-agent
+    spec:
+      # Cross-namespace PID visibility required for symbolization.
+      hostPID: true
+      serviceAccountName: parca-agent
+      containers:
+        - name: parca-agent
+          image: ghcr.io/parca-dev/parca-agent:v0.46.0
+          securityContext:
+            # Minimum-grant: drop all, add the one capability the
+            # eBPF + perf-events path requires. Avoid `privileged:
+            # true`: the explicit capability add is narrower and
+            # easier to justify as a documented policy exception,
+            # while `privileged` grants the union of all
+            # capabilities + device access.
+            allowPrivilegeEscalation: false
+            # `readOnlyRootFilesystem` is desirable but not asserted
+            # against the upstream agent here; verify against
+            # parca-dev/parca-agent/deploy/ for the current
+            # writable-path set before enabling.
+            capabilities:
+              drop: [ALL]
+              add: [SYS_ADMIN]
+          volumeMounts:
+            - { name: sys,  mountPath: /sys,  readOnly: false }
+            - { name: proc, mountPath: /host/proc, readOnly: true }
+            - { name: run,  mountPath: /run }
+      volumes:
+        - { name: sys,  hostPath: { path: /sys,  type: Directory } }
+        - { name: proc, hostPath: { path: /proc, type: Directory } }
+        - { name: run,  hostPath: { path: /run,  type: Directory } }
+```
+
+This is a **starting point**, not the upstream-recommended manifest. Pull the canonical deployment from [parca-dev/parca-agent/deploy/](https://github.com/parca-dev/parca-agent/tree/main/deploy) and adapt to your cluster's PSS tier. Two cluster-policy interactions to verify before rollout:
+
+1. **Pod Security Standards.** `hostPID: true`, `add: [SYS_ADMIN]`, and the `hostPath` volumes each violate **baseline** PSS (and therefore restricted). Namespaces labelled `pod-security.kubernetes.io/enforce: baseline` (or `restricted`) reject the pod, so `parca-agent` must run in an exempted namespace.
+2. **OPA / Kyverno cluster policies.** Custom admission policies that ban capability additions, `hostPID`, or host-path mounts must add a `parca-agent`-namespace exception.
+
+## Failure modes when capabilities are missing
+
+These are the kernel-level failure shapes operators will see in `kubectl logs ds/parca-agent` when the SecurityContext is too restrictive. The agent's exact log strings vary by parca-agent version; the **errno / syscall** column is the stable surface — grep for the syscall name + errno code rather than the prose string.
+
+| Failure shape | Underlying syscall + errno | Root cause | Remediation |
+|---|---|---|---|
+| BPF program load fails at startup | `bpf(BPF_PROG_LOAD, …)` → `EPERM` | Container missing `CAP_SYS_ADMIN`. The kernel rejects BPF program load from an unprivileged process. | Add `capabilities.add: [SYS_ADMIN]` to the container `securityContext`, or set `securityContext.privileged: true`. |
+| Perf event open fails | `perf_event_open(…)` → `EACCES` or `EPERM` | Container has `CAP_SYS_ADMIN` but the kernel's `kernel.perf_event_paranoid` sysctl is `>= 2`, blocking unprivileged perf measurements. (`CAP_SYS_ADMIN` bypasses this on most kernels; some hardened distros require explicit `CAP_PERFMON` even with admin.) | Either lower `kernel.perf_event_paranoid` to `1` on the node (sysctl, requires node-level access), or upgrade kernel to ≥5.8 and add `CAP_PERFMON` to the container. |
+| BTF discovery fails at startup | `open("/sys/kernel/btf/vmlinux", …)` → `ENOENT` | Kernel is missing `CONFIG_DEBUG_INFO_BTF=y`. Most distro kernels ≥5.3 ship BTF; minimal / Alpine / older RHEL kernels may not. | Upgrade to a BTF-enabled kernel (`ls /sys/kernel/btf/vmlinux` on the node confirms), or pin nodes with a known-good kernel (Ubuntu 22.04+, RHEL 9+, Amazon Linux 2023, GKE / EKS / AKS managed images). |
+| BPF FS unavailable | `mount("bpf", "/sys/fs/bpf", "bpf", …)` → `EPERM` or `EACCES` | Container missing `CAP_SYS_ADMIN`, OR `/sys` host-path mount is `readOnly: true`, OR the node has no BPF FS available. | Ensure `CAP_SYS_ADMIN` is granted, the `/sys` mount is `readOnly: false`, and `mount \| grep bpf` on the node returns a `bpf` line. |
+| Workload PIDs not discoverable | `readdir("/proc")` returns only PIDs in the agent's own PID namespace | Pod is missing `hostPID: true`, so the agent's `/proc` view excludes workload PIDs in other PID namespaces. | Set `hostPID: true` on the pod spec. |
+
+When triaging a real failure, capture the agent's full log (`kubectl logs --previous` for crash loops) and check it against [parca-dev/parca-agent/issues](https://github.com/parca-dev/parca-agent/issues) — operator failures outside the patterns above are upstream concerns, not tracecore concerns.
+
+## Helper / receiver removal checklist
+
+The following artefacts are gone at v0.3.0. Operator config and CI workflows that still reference them must be cleaned up; the table notes, for each artefact, the required action and any known failure behaviour.
+
+| Artefact | Action required |
+|---|---|
+| Chart values key `receivers.pyspy.*` | Remove the block. Chart-render in v0.3.0 emits a `NOTES.txt` deprecation warning for one minor; v0.4.0 removes the key entirely. |
+| `pip install tracecore-pyspy` in workload images | Remove from `Dockerfile` / `requirements.txt`. Once the PyPI package is yanked (see [Rollback](#rollback) for the schedule), unpinned installs fail with `No matching distribution`. |
+| Workload-side `from tracecore_pyspy import attach; attach()` calls | Delete the import and call. No replacement call is needed: `parca-agent` requires zero workload code changes. |
+| Per-pod `/var/run/tracecore/pyspy/` `emptyDir` volume | Remove from your Pod spec. Was only needed for the UDS rendezvous. |
+| Alerts on `tracecore_receiver_errors_total{component="pyspy",kind=…}` | Delete. No corresponding metric in `parca-agent`; pivot to `parca_agent_*` self-metrics if you alert on profiler health. |
+| Pre-merge CI hooks for `tools/pyspy-lint` | Delete. The symbol-table lint guarded the cooperative receiver's "no out-of-process memory reads" property; it has no purpose once the receiver is gone. |
+
+## Verification
+
+1. **Before upgrading**, confirm parca-agent is deployable on at least one canary node:
+
+   ```bash
+   # Verify kernel BTF on a canary node
+   kubectl debug node/<canary-node> -it --image=busybox -- ls -la /host/sys/kernel/btf/vmlinux
+   # Expect: file exists. If missing, kernel upgrade required before v0.3.0 cutover.
+   ```
+
+2. **After upgrading**, verify the tracecore pod's SecurityContext is unchanged:
+
+   ```bash
+   kubectl -n tracecore-system get ds tracecore -o yaml \
+     | yq '.spec.template.spec.containers[0].securityContext'
+   # Expect: capabilities.drop == [ALL], capabilities.add == [] or null.
+   ```
+
+3. **Verify parca-agent boot** (in its own namespace):
+
+   ```bash
+   kubectl -n parca logs ds/parca-agent --tail=50 \
+     | grep -E 'started|listening|attached'
+   # Expect: a startup line (exact wording varies by agent version). EPERM / BTF errors per the table above indicate misconfiguration.
+   ```
+
+## Rollback
+
+The cooperative pyspy receiver is **not** registered in v0.3.0's OCB binary (per `builder-config.yaml`). Recipe-toggle rollback is not available. If parca-agent doesn't meet your security or compatibility budget, pin the chart and image at the last v0.2.x tag (`v0.2.0-…`; substitute the latest `v0.2.x` tag from `git tag -l 'v0.2.*'`) and keep running the cooperative receiver:
+
+```bash
+helm upgrade tracecore install/kubernetes/tracecore \
+  --version <chart-package version matching v0.2.x> \
+  --set image.tag=<v0.2.x binary tag>
+```
+
+The cooperative receiver's PyPI helper (`tracecore-pyspy`) remains installable from PyPI's archive for one minor release after v0.3.0 cuts; pin `tracecore-pyspy==0.1.0` in your workload `requirements.txt`. PyPI yank happens at v0.4.0.
+
+## References
+
+- [RFC-0013 §Adoption matrix](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix) — why pyspy is deleted in favour of parca-agent
+- [RFC-0013 §Migration / rollout](../rfcs/0013-distro-first-pivot.md#migration--rollout) — PR-M and PR-N sequencing
+- [RFC-0009 §Safety properties](../rfcs/0009-pyspy-receiver-scope.md#proposal) — historical record of the cooperative receiver's zero-capability design
+- [`components/receivers/pyspy/README.md`](../../components/receivers/pyspy/README.md) — cooperative receiver's user-facing docs (carries the v0.3.0 deletion banner)
+- [`components/receivers/pyspy/RUNBOOK.md`](../../components/receivers/pyspy/RUNBOOK.md) — per-kind operator triage for the cooperative receiver (preserved for operators still on v0.2.x)
+- [Parca Agent / Requirements](https://github.com/parca-dev/parca-agent#requirements)
+- [Parca Agent / Security](https://www.parca.dev/docs/parca-agent-security)
+- [Linux Yama LSM (`ptrace_scope`)](https://docs.kernel.org/admin-guide/LSM/Yama.html) — relevant for operators evaluating in-cluster debugging policy alongside eBPF profiling
+- [Kubernetes Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/) — baseline / restricted / privileged tier definitions