docs(security): PR-N — pyspy capability surface + SecurityContext guide#200

Merged: trilamsr merged 1 commit into main from pr-n-pyspy-security-posture on May 31, 2026

@trilamsr trilamsr commented May 31, 2026

Contributor

Summary

Adds docs/migration/v0.2-to-v0.3.md covering the v0.3.0 security-posture migration per RFC-0013 §migration. The single operator-visible break at v0.3.0 is the Python-profiling story: the cooperative pyspy receiver (zero capabilities added; in-process faulthandler.dump_traceback over UDS per RFC-0009) is deleted in PR-M, and the replacement parca-agent (eBPF) requires CAP_SYS_ADMIN (or root) + hostPID: true + a BTF-enabled kernel ≥5.3.
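The kernel side of these requirements can be checked before deploying anything. The following is a minimal sketch, not part of the guide: it assumes the kernel release string has the usual `major.minor...` shape and that BTF is exposed at `/sys/kernel/btf/vmlinux`. The function names `kernel_at_least` and `preflight` are hypothetical, and the check does not cover capabilities or `hostPID`.

```python
import os
import re


def kernel_at_least(release: str, minimum: tuple = (5, 3)) -> bool:
    """Return True if a release string such as '5.15.0-91-generic' is >= minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def preflight(release: str, btf_path: str = "/sys/kernel/btf/vmlinux") -> list:
    """Collect human-readable problems; an empty list means the kernel checks passed."""
    problems = []
    if not kernel_at_least(release):
        problems.append(f"kernel {release} is older than 5.3")
    if not os.path.exists(btf_path):
        problems.append(f"{btf_path} is missing (kernel likely built without BTF)")
    return problems
```

On a Linux node this could be called as `preflight(os.uname().release)`.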

The guide names:

  • The exact upstream capability requirement (root or CAP_SYS_ADMIN per parca-dev/parca-agent) and why CAP_BPF/CAP_PERFMON is not yet a documented narrower alternative (conservative grant remains CAP_SYS_ADMIN).
  • Failure shapes by syscall + errno (bpf(BPF_PROG_LOAD,…) → EPERM, perf_event_open(…) → EACCES, open("/sys/kernel/btf/vmlinux",…) → ENOENT, etc.): the stable surface across parca-agent versions, not paraphrased agent log strings.
  • A minimum-grant container SecurityContext snippet (DaemonSet-shaped, add: [SYS_ADMIN] not privileged: true) with explicit disclaimers about readOnlyRootFilesystem (deferred to upstream manifest verification) and PSS interactions.
  • A clean rollback path (pin v0.2.x chart + image; tracecore-pyspy PyPI helper remains installable one minor past v0.3.0).
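The shape of the minimum grant can be expressed as a small structural check. This is a sketch under stated assumptions, not the snippet shipped in the guide: the manifest is modelled as a plain Python dict, the container name and image in `EXAMPLE` are placeholders, and `check_daemonset` is a hypothetical helper. Only the field paths (`spec.template.spec.hostPID`, `securityContext.capabilities.add`, `securityContext.privileged`) are standard Kubernetes.

```python
def check_daemonset(manifest: dict) -> list:
    """Return a list of deviations from the minimum-grant shape described above."""
    problems = []
    if manifest.get("apiVersion") != "apps/v1" or manifest.get("kind") != "DaemonSet":
        problems.append("expected an apps/v1 DaemonSet")
    pod = manifest.get("spec", {}).get("template", {}).get("spec", {})
    if pod.get("hostPID") is not True:
        problems.append("hostPID must be true")
    for container in pod.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})
        if sc.get("privileged"):
            problems.append(f"{name}: privileged: true grants more than SYS_ADMIN")
        if "SYS_ADMIN" not in sc.get("capabilities", {}).get("add", []):
            problems.append(f"{name}: capabilities.add is missing SYS_ADMIN")
    return problems


# Placeholder manifest; the name and image are illustrative only.
EXAMPLE = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "spec": {"template": {"spec": {
        "hostPID": True,
        "containers": [{
            "name": "parca-agent",
            "image": "example.invalid/parca-agent:TAG",
            "securityContext": {"capabilities": {"add": ["SYS_ADMIN"]}},
        }],
    }}},
}
```

`check_daemonset(EXAMPLE)` returns an empty list. Switching the container to `privileged: true` or dropping `hostPID` produces one problem each.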

Why this PR exists despite cooperative pyspy needing zero capabilities

tracecore's pyspy receiver does not use CAP_SYS_PTRACE. Per RFC-0009 §Safety properties and the chart's conftest policy (add: [] asserted in chart.yml), the cooperative design walks Python frames in-process via faulthandler.dump_traceback and ships them over UDS — no ptrace, no process_vm_readv. The security-posture change at v0.3.0 is the delta from the cooperative path (zero capabilities, tracecore pod) to the eBPF path (CAP_SYS_ADMIN, separate parca-agent pod). PR-N documents that delta.
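The cooperative mechanism can be illustrated with a minimal sketch. It is not tracecore's receiver or its wire format (RFC-0009 defines those). It only shows that a Python process can write its own frames to a Unix domain socket with `faulthandler.dump_traceback`, without ptrace and without any added capability. The `socketpair` stand-in for the UDS and the function name `dump_own_frames` are assumptions made for illustration.

```python
import faulthandler
import socket


def dump_own_frames() -> str:
    """Dump this process's Python frames into one end of a socket pair and read them back."""
    # socketpair() yields two connected AF_UNIX stream sockets on Linux and macOS.
    sender, receiver = socket.socketpair()
    try:
        # faulthandler accepts a raw file descriptor (Python 3.5+) and walks frames in-process.
        faulthandler.dump_traceback(file=sender.fileno(), all_threads=True)
        sender.close()  # signal EOF so the read loop below terminates
        chunks = []
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks).decode("utf-8", "replace")
    finally:
        sender.close()
        receiver.close()
```

In a real deployment the reading end would live in a different process. The point of the sketch is that the profiled process does the frame walk itself, which is why no capability is needed.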

Why now (not deferred with PR-M)

parca-agent research confirms the OTel Profiles signal is still Alpha (Mar 2026 release) and parca-agent has no OTLP profiles exporter yet. That means operators upgrading to v0.3.0 will run parca-agent alongside tracecore for at least one more minor — they need the security-posture delta documented before PR-M cuts the receiver. PR-N landing ahead of PR-M gives operators an evaluation window.

Drift fix

docs/migration/v0.1-to-v0.2.md's pyspy row claimed "Deferred until OTel Profiles GA. No upstream replacement exists today; the toggle survives until contrib ships pprofreceiver." This contradicted RFC-0013, which has named parca-agent (separate DaemonSet) as the replacement since the pivot landed. Updated the row to forward-reference the new guide and removed the stale "no upstream replacement" claim. Root cause: the v0.1-to-v0.2.md skeleton (PR #179, then fleshed out in PR #191) predated the RFC-0013 §2 adoption-matrix line that explicitly maps components/receivers/pyspy/ → parca-agent at v0.3.0. Fixed at the row, not paved over.

Test plan

  • make doc-check: 510 markdown links resolve to on-disk files; banned-phrase lint clean across 109 markdown files; new file's RFC + pyspy README/RUNBOOK cross-links verified.
  • make check: golangci-lint run ./... reports 0 issues; go vet ./... clean; go mod verify clean.
  • Pre-commit hooks (no-autoupdate-check, license-check) green at push time.
  • Reviewer sanity-check the SecurityContext YAML snippet renders as valid Kubernetes apps/v1.DaemonSet (apiVersion + spec.template structure).
  • Reviewer sanity-check the failure-mode table syscall + errno columns match Linux man-page conventions (bpf(2), perf_event_open(2)).
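For the errno sanity-check, the three failure shapes named in the summary can be written as a lookup keyed on syscall and errno. This is a sketch only: the cause strings are a reading of bpf(2), perf_event_open(2), and open(2), not text from the guide's table, and `FAILURE_SHAPES` and `triage` are hypothetical names.

```python
import errno

# (syscall, errno) -> likely cause. Only the three shapes named in the PR summary are listed.
FAILURE_SHAPES = {
    ("bpf", errno.EPERM):
        "insufficient privilege to load the eBPF program (CAP_SYS_ADMIN or root missing)",
    ("perf_event_open", errno.EACCES):
        "perf event requires more privilege (capability or perf_event_paranoid setting)",
    ("open:/sys/kernel/btf/vmlinux", errno.ENOENT):
        "kernel does not expose BTF at that path",
}


def triage(syscall: str, err: int) -> str:
    """Render a failure as 'syscall -> ERRNAME: cause', using the symbolic errno name."""
    name = errno.errorcode.get(err, str(err))
    cause = FAILURE_SHAPES.get((syscall, err), "unrecognised failure shape")
    return f"{syscall} -> {name}: {cause}"
```

Keying on the syscall and the symbolic errno, not on log text, mirrors the guide's stated reason for documenting failures this way: the pair is stable across agent versions.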
Release notes: NONE

Adds docs/migration/v0.2-to-v0.3.md covering the v0.3.0 security-posture
migration per RFC-0013 §migration. Cooperative pyspy (zero capabilities,
in-process faulthandler) is deleted at v0.3.0; operators who want Python
profiling deploy parca-agent, which requires CAP_SYS_ADMIN (or root) +
hostPID + BTF-enabled kernel.

The guide names the exact capability surface, kernel requirement, kernel
syscall + errno failure shapes (not paraphrased agent log strings),
minimum-grant SecurityContext snippet, and rollback path. Conservative
on CAP_BPF/CAP_PERFMON — upstream parca-agent does not document the
narrower split today.

Updates docs/migration/v0.1-to-v0.2.md pyspy row to forward-reference
the new guide (was claiming "no upstream replacement exists today" —
RFC-0013 names parca-agent at v0.3.0).

Updates docs/README.md to index the migration/ subdirectory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) May 31, 2026 06:48
@trilamsr trilamsr merged commit 08893aa into main May 31, 2026
11 checks passed
@trilamsr trilamsr deleted the pr-n-pyspy-security-posture branch May 31, 2026 06:54
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

Reconcile the four pivot-tracking docs
(`docs/rfcs/0013-distro-first-pivot.md`, `CHANGELOG.md`,
`MILESTONES.md`, `docs/migration/v0.1-to-v0.2.md`) with the wave-3
(PR-B1-shape sibling ports) and wave-4 (PR-B2-shape upstream-only ports
+ PR-F.1 + PR-J + PR-L + PR-N) landings. Pure doc sweep — no code or
config touched.

## What changed

### `docs/rfcs/0013-distro-first-pivot.md` §migration

PR sequence rows updated with PR-number citations and landed markers:

- **PR-A2** (landed, #189, 2026-05-30)
- **PR-B2** (landed, #201) — also enumerates sibling-receiver follow-ups
under PR-B2 to dispel the slug collision with #188's PR-B2-labelled dcgm
port: stdoutexporter (#202), pyspy (#203), kernelevents (#208),
containerstdout (#209)
- **PR-F.1** (landed) — fleshed-out delete list
(`internal/{selftelemetry,telemetry}` + `components/receivers/dcgm/` +
`pkg/dcgm/` + one orphan clockreceiver integration test)
- **PR-F.2** re-scoped — now deletes the whole
`internal/{componentstatus,pipeline,pipelinebuilder,consumer,fanout,runtime/lifecycle}`
bundle in one cut once the last three pipeline+consumer-importing
receivers land (#204 k8sevents, #205 clockreceiver, #207 otlphttp). Per
the import-graph state — `internal/componentstatus`'s only non-test
consumer is `internal/pipeline`, so they delete together
- **PR-G** (landed, #182), **PR-H** (landed, #183)
- **PR-I.1a** (in flight — scaffold agent), **PR-I.1b** (pre-staged;
gate satisfied by #201)
- **PR-J** (landed, #195) — kept existing marker
- **PR-K.1** (in flight — separate agent landing)
- **PR-L** (landed, skeleton #179 + body #191) — flagged as living
document
- **PR-N** (landed, #200) — shipped at v0.1.0 ahead of v0.3.0 as a
doc-only update at `docs/migration/v0.2-to-v0.3.md`

### `CHANGELOG.md` [Unreleased]

- Restructured the pivot wave list as **four waves** (was three). Wave 3
enumerates PR-B1-shape sibling ports + support infra (#180-#194/#196).
Wave 4 enumerates PR-B2-shape upstream-only ports + PR-J (#195) + PR-F.1
(#206) + PR-N (#200) + lint/TOCTOU hardening (#198/#210).
- Tightened the PR-F.2 deferred note to point at the three open ports
(#204/#205/#207) as the gate.

### `MILESTONES.md`

- **M1** (pipeline runtime) — status row now cites PR-A2 (#189), PR-F.1
(#206), PR-F.2 gate (#204/#205/#207), PR-E (#180), retains
`internal/config/` (still load-bearing for `tracecore validate`).
- **M2** (self-telemetry) — status row now cites PR-F.1 (#206); flags
`internal/componentstatus` as travelling with `internal/pipeline` in
PR-F.2.
- **M8** (DCGM receiver) — status flipped to *landed-and-replaced*:
cites PR-F.1 (#206) deletion + PR-J (#195)
`docs/integrations/prometheus-scrape.md` recipe. Notes the inert chart
toggle retention until PR-K.3.

### `docs/migration/v0.1-to-v0.2.md`

- §`internal/*` package deletion (PR-F) status flips from "not yet open"
to "PR-F.1 landed (#206), PR-F.2 gated on three open ports".
- Open-items checklist expanded from 5 to 13 entries — tracks every PR
letter the migration guide cares about (A2 / E / F.1 / F.2 / I.1a-c / J
/ K.1-3 / L / N) with PR numbers and links.

## Why now

Tracking docs accumulated drift across wave-3 + wave-4 because every
sibling-port PR (and the support-infra PRs around them) updated the
bottom of `CHANGELOG.md` but did not always touch the upstream
sequencing section in RFC-0013. Per memory rule `[Keeping this document
current]`: status drift is a review blocker. This PR is the consolidated
catch-up; future port PRs include their RFC-row flip in-PR.

## What this PR does NOT change

- No code, no config, no YAML, no chart — only the four tracking docs.
- No new doc gates added; existing gates pass.
- No PRs other than the four named docs are modified.

## Test plan

- [x] `bash scripts/doc-check.sh` clean (33 test refs, 528 links
resolve, comment-noise diff gate clean vs `origin/main`, all 13 gates
green).
- [x] Pre-commit hook (`commitlint` 72-char subject limit + DCO +
AI-trailer gates) passed.
- [x] Pre-push hook (`make ci-fast` equivalent: `golangci-lint`, `go
vet`, `go mod verify`, `no-autoupdate-check`, `doc-check.sh`) passed on
second attempt after `git fetch origin main` populated the worktree's
`origin/main` ref — first push failed because the worktree previously
tracked the (gone) `pr-a2-ocb-main-swap` branch, so `doc-check.sh`'s
comment-noise diff-scope gate exited 128 on the missing ref. Root cause
fixed by the fetch; not a workaround.
- [ ] CI green on this branch.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026

## Summary

PR-N (`docs/migration/v0.2-to-v0.3.md`) landed at #200 assuming PR-M
(delete pyspy + ship `parca-agent` recipe) would cut at v0.3.0, which
made the CAP_SYS_PTRACE → CAP_SYS_ADMIN/CAP_BPF migration a v0.3.0
break. #222 subsequently deferred PR-M to v0.4.0+ (triggers: OTel
Profiles → Beta + feature-gate removed AND parca-agent gains OTLP
export). The migration doc now contradicts reality —
`components/receivers/pyspy/` still ships in v0.3.0's OCB binary,
`tracecore-pyspy` stays on PyPI, and the chart's `receivers.pyspy.*` key
is still honoured. This PR reframes the timeline throughout: pyspy STAYS
in v0.3.0 with unchanged zero-capability posture, the CAP migration is
preserved as forward-looking operator preparation material for v0.4.0+,
and #222 is named as the source of truth for re-evaluation triggers. The
CAP migration content itself (the `parca-agent` requirements table,
minimum-grant SecurityContext, failure-mode table, removal checklist) is
retained because operators planning v0.4.0+ upgrades still need it; only
the timeline framing changes.

## Test plan

- [x] `make check` — green (fmt, tidy-check, lint, vet, mod-verify).
- [x] `make doc-check` — green (501 markdown links resolve,
banned-phrase lint clean across 106 files, comment-noise diff gate clean
vs origin/main, no rot-prone reference drift).
- [ ] Reviewer confirms the reframed timeline matches #222's deferral
memo (PR-M unblocks on OTel Profiles → Beta + parca-agent OTLP export,
neither met at v0.3.0 cut).
- [ ] Reviewer confirms the CAP migration content (parca-agent
requirements, minimum-grant SecurityContext, failure-mode table) is
preserved verbatim — only framing prose changed.

## Noticed but out-of-scope

- `components/receivers/pyspy/README.md` still carries a 2026-05-22
banner saying the receiver is "Scheduled for deletion at v0.3.0 per
RFC-0013 §7". Same staleness pattern as the migration doc, but kept out
of this PR's scope per the reconcile-only contract. Worth a follow-up
sweep that re-times the README + RUNBOOK banners against #222.
- RFC-0013 §Migration / rollout still carries the original PR-M / PR-N
sequencing pre-deferral. References section now flags this with a
"supersede with #222" note, but the RFC itself is untouched.

## Release notes

```release-notes
- docs(migration): reframe docs/migration/v0.2-to-v0.3.md to reflect PR-M deferral to v0.4.0+ (per #222) — pyspy receiver and tracecore-pyspy PyPI helper continue to ship in v0.3.0 with unchanged zero-capability posture; the CAP_SYS_PTRACE → CAP_SYS_ADMIN/CAP_BPF migration content is retained as forward-looking operator preparation for the v0.4.0+ cutover.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>