diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 76a4e6a6..89949b70 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -6,6 +6,7 @@ Thanks for your interest. Tracecore is in early development and we welcome focus
 - Read [`PRINCIPLES.md`](PRINCIPLES.md) for *why* we make the choices we make. Read [`STYLE.md`](STYLE.md) for *what* the rules are.
 - Skim [`AGENTS.md`](AGENTS.md) for the load-bearing lessons every change should respect, and the topic index pointing into `docs/notes/` for deeper per-area guidance.
+- **Keep the tracking docs current.** Any PR that starts, advances, ships, or re-scopes work tracked in [`MILESTONES.md`](MILESTONES.md) or [`docs/FOLLOWUPS.md`](docs/FOLLOWUPS.md) updates the corresponding entry (and, for milestones, the lane table) in the same PR — see [`MILESTONES.md` § "Keeping this document current"](MILESTONES.md#keeping-this-document-current) for the exact transitions covering both files. Status drift is a review blocker.
 - File an **RFC** in `docs/rfcs/` for load-bearing decisions you want documented for future contributors — architectural calls with non-obvious alternatives, policy changes, deprecations. Not gated. Small architectural choices can land via a clear commit message + a code comment. Judgment-called; over-RFC beats under-RFC at this stage, but velocity beats both.
 - Check open issues to avoid duplicate work.
diff --git a/MILESTONES.md b/MILESTONES.md
index ffac6e71..c75c8075 100644
--- a/MILESTONES.md
+++ b/MILESTONES.md
@@ -17,15 +17,41 @@ Each milestone has:
 - **ID** — `M1`, `M2`, … stable across the milestone's lifetime
 - **Status** — see legend
 - **Depends on** — milestone IDs or `none`
-- **Functional rubrics** (unstarted/in-progress only) — falsifiable claims about what the milestone DOES, each with a citation
-- **Non-functional rubrics** (unstarted/in-progress only) — falsifiable claims about HOW WELL or under what constraints
+- **Functional rubrics** — falsifiable claims about what the milestone DOES, each with a citation
+- **Non-functional rubrics** — falsifiable claims about HOW WELL or under what constraints
-Optional fields where they apply: **Reference** (RFC link), **Hardware** (when non-trivial), **Carry-forward** (deferred sub-items as partial-ship receipts).
+Rubrics stay in the doc for the milestone's full lifecycle. Each bullet carries a `☐` (planned), `⧗` (in progress), or `☑` (verified) prefix. A milestone's top-line **Status** is `☑` only once every rubric bullet is `☑`. Bullets explicitly tagged as unverified — see `docs/.unverified-baseline` for the count gate — flip to `☑` when their gate is implemented; the unverified tag persists in the bullet text to mark what evidence is still owed.
+
+Optional fields where they apply: **Reference** (RFC link), **Landed** (one-line pointer to the artifact for shipped/alpha milestones), **Hardware** (when non-trivial), **Carry-forward** (deferred sub-items as partial-ship receipts).
 A milestone ships when every rubric is provably satisfied.
+The Foundation entries (M1, M2, M4, M9) predate this convention — they ship as prose summaries without inlined rubrics; backfilling them is a deferred cleanup, not a blocker for new milestones.
+
 **M7 is absent by design** — OTel `gen_ai.training.*` semconv work lives in `open-telemetry/semantic-conventions`, not this repo (recurring cadence in NORTHSTARS.md O4).
+## Keeping this document current
+
+**Rule:** every PR that lands milestone or follow-up work updates the tracking docs in the same PR. Status drift is a review blocker.
+
+This rule covers **both** `MILESTONES.md` (this file) **and** [`docs/FOLLOWUPS.md`](docs/FOLLOWUPS.md). Pick the doc that already tracks the item; if the work is large enough to be a milestone but isn't tracked yet, file it here. If it's an opportunistic / trigger-based item, file it in `FOLLOWUPS.md`.
+
+**For milestones (this file):**
+- A PR that **starts** work on a milestone flips its top-line status `☐ → ⧗` and prefixes any rubric bullets the PR begins work on with `⧗`.
+- A PR that **satisfies a rubric** flips that bullet's prefix `☐` or `⧗` `→ ☑`. The rubric text itself stays — only the prefix changes. Bullets carrying an unverified tag flip to `☑` when their gate lands; the unverified text persists until evidence exists. Rubrics are never stripped from the doc; they are the audit record of what shipping meant.
+- A PR that **ships every rubric** flips the top-line `⧗ → ☑` and adds a **Landed:** line citing the merged PR(s) and the primary artifact(s). Deferred sub-items go under **Carry-forward**.
+- A PR that **partially ships** a milestone (alpha scaffold, parser-only, etc.) flips the top-line to `☑ ` (e.g. `☑ alpha`, `☑ partial`) with the PR number in parentheses, marks each delivered rubric `☑`, leaves open rubrics with `☐`/`⧗`, and lists the open scope under **Carry-forward**.
+- A PR that **drops or re-scopes** a milestone flips the top-line to `⊘` or `⊟` with the written reason inline; affected rubrics are struck through (`~~…~~`) rather than deleted.
+- The lane table at "Lane structure" is kept in sync with these top-line status flips (`(shipped)`, `(alpha)`, `carry-fwd` annotations).
+
+**For follow-up items (`docs/FOLLOWUPS.md`):**
+- A PR that **completes** a follow-up strikes it through (`~~…~~`) or removes the row, citing the landing PR or commit.
+- A PR that **advances** a follow-up partially updates the row's status note inline (e.g. "*landed for amd64 in #NN; arm64 carry-forward*").
+- A PR that **discovers** new follow-up work adds the item under the appropriate flavour (opportunistic vs. explicitly-skipped) before merging — don't leave it as a verbal "we should track this."
+- A PR that **promotes** a follow-up into a full milestone removes the row from `FOLLOWUPS.md` and adds the new milestone entry here in the same PR.
+
+A PR that touches files under a tracked item's named scope (e.g. `components/receivers//`, the chart, `tools/failure-inject/`, or any code path referenced by a `FOLLOWUPS.md` row) without updating the corresponding tracking doc is treated as drift and asked to amend.
+
 ## Lane structure
 Work runs in six parallel swim lanes plus a Foundation set. Lanes are organized by file-collision boundary, dependency chain, and hardware-access posture. Cross-lane serialization happens only at `cmd/tracecore/components.go` (one-line factory edits per receiver), `Makefile` (section-commented targets), and `go.mod` (review at merge).
@@ -33,12 +59,12 @@ Work runs in six parallel swim lanes plus a Foundation set. Lanes are organized
 | Lane | Theme | Milestones | Hardware |
 |---|---|---|---|
 | Foundation | Runtime + self-telemetry | M1, M2, M4 (partial) | n/a |
-| 1 | Release infrastructure | M3, M5b, M20, M21 | none for M3/M5b/M21; Linux + GPU (flood-gated) for M20b/c |
-| 2 | Test & failure infra | M4b, M5 | none for M4b; Linux + GPU (flood-gated) for M5 overhead bench |
+| 1 | Release infrastructure | M3 (shipped), M5b (shipped), M20, M21 | none for M3/M5b/M21; Linux + GPU (flood-gated) for M20b/c |
+| 2 | Test & failure infra | M4b (shipped), M5 | none for M4b; Linux + GPU (flood-gated) for M5 overhead bench |
 | 3 | Documentation & community | M6, M23 | none |
-| 4 | Orchestrator signals | M9 (shipped), M10, M15, M16, M19 | Linux, no GPU |
+| 4 | Orchestrator signals | M9 (shipped), M10 (alpha), M15, M16, M19 | Linux, no GPU |
 | 5 | Framework & runtime profiling | M13, M14, M18 | Linux + Python, no GPU |
-| 6 | GPU & NCCL signals (flood-gated) | M8 carry-fwd, M11, M12, M17, M24 | Linux + NVIDIA GPU (M11 parser excepted) |
+| 6 | GPU & NCCL signals (flood-gated) | M8 carry-fwd, M11 (alpha), M12, M17, M24 | Linux + NVIDIA GPU (M11 parser excepted) |
 ## Universal non-functional principles
@@ -86,42 +112,44 @@ Critical path to v0.1.0; the only lane in which a single milestone (M21) gates e
 ### M3. Reproducible-build CI
-- **Status:** ⧗ (in flight; release.yml + recipe in PR #28, end-to-end green across the `v0.0.0-m3test-*` series. Flip to ☑ on merge to main.)
+- **Status:** ☑ delivered (PR #28)
 - **Depends on:** none
+- **Landed:** `.github/workflows/release.yml` + `docs/reproducibility.md`.
 **Functional rubrics:**
-- `make build` from the same git SHA produces byte-identical `tracecore` binaries across two independent CI runs on `ubuntu-latest`, verified by `diffoscope` (exit 0, empty diff) on `linux/amd64`; `linux/arm64` is opt-in. (per NORTHSTARS O3 "Reproducible builds")
-- Build invokes `go build` with `-trimpath` and honours `SOURCE_DATE_EPOCH`, falling back to the latest commit's `%ct` (never wallclock `now`); a CI step asserts both flags are present and that the embedded `BuildDate` equals `date -u -r $SOURCE_DATE_EPOCH`. (per PRINCIPLES §12)
-- Every tagged release publishes a CycloneDX SBOM (`cyclonedx-gomod mod` or `syft`) covering every module in `go.sum`; SBOM is attached as a release artifact. (per NORTHSTARS O3 "SBOM")
-- Every release binary is signed with `cosign sign-blob --yes` using Sigstore keyless / OIDC; `cosign verify-blob` against the published cert-identity succeeds on a fresh checkout. (per NORTHSTARS O3 "Release signing")
-- SLSA v1.0 Build Level 1 in-toto provenance with `predicateType: https://slsa.dev/provenance/v1` is generated for every release and references the `tracecore` artifact digest. (per https://slsa.dev/spec/v1.0/levels#build-l1)
-- `docs/reproducibility.md` documents the exact commands a third party runs to re-verify a published release end-to-end; `make doc-check` asserts the file's presence and its referenced commands' shell syntax.
-- CI fails closed when any of: diffoscope reports a diff, SBOM generation fails, cosign signing fails, or provenance attestation missing — reproducibility breakage is P0. (per PRINCIPLES §12)
+- ☑ `make build` from the same git SHA produces byte-identical `tracecore` binaries across two independent CI runs on `ubuntu-latest`, verified by `diffoscope` (exit 0, empty diff) on `linux/amd64`; `linux/arm64` is opt-in. (per NORTHSTARS O3 "Reproducible builds")
+- ☑ Build invokes `go build` with `-trimpath` and honours `SOURCE_DATE_EPOCH`, falling back to the latest commit's `%ct` (never wallclock `now`); a CI step asserts both flags are present and that the embedded `BuildDate` equals `date -u -r $SOURCE_DATE_EPOCH`. (per PRINCIPLES §12)
+- ☑ Every tagged release publishes a CycloneDX SBOM (`cyclonedx-gomod mod` or `syft`) covering every module in `go.sum`; SBOM is attached as a release artifact. (per NORTHSTARS O3 "SBOM")
+- ☑ Every release binary is signed with `cosign sign-blob --yes` using Sigstore keyless / OIDC; `cosign verify-blob` against the published cert-identity succeeds on a fresh checkout. (per NORTHSTARS O3 "Release signing")
+- ☑ SLSA v1.0 Build Level 1 in-toto provenance with `predicateType: https://slsa.dev/provenance/v1` is generated for every release and references the `tracecore` artifact digest. (per https://slsa.dev/spec/v1.0/levels#build-l1)
+- ☑ `docs/reproducibility.md` documents the exact commands a third party runs to re-verify a published release end-to-end; `make doc-check` asserts the file's presence and its referenced commands' shell syntax.
+- ☑ CI fails closed when any of: diffoscope reports a diff, SBOM generation fails, cosign signing fails, or provenance attestation missing — reproducibility breakage is P0. (per PRINCIPLES §12)
 **Non-functional rubrics:**
-- Reproducibility verification (second build + diffoscope) is a required GitHub check on every release tag, wired into `.github/workflows/release.yml`. (per NORTHSTARS O3)
-- M3 CI steps run in `release.yml` (or a separate workflow), not inside `make ci` — `make ci` stays under the 60s budget. (per PRINCIPLES §10)
-- `tracecore` builds with `CGO_ENABLED=0` so reproducibility does not depend on the host C toolchain. (per `.github/workflows/ci.yml` L73 + L84 — both linux/amd64 and linux/arm64 build steps)
-- Cosign verification uses keyless Fulcio/Rekor signing tied to the GitHub Actions OIDC identity — no long-lived private key in repo secrets. (per PRINCIPLES §6)
+- ☑ Reproducibility verification (second build + diffoscope) is a required GitHub check on every release tag, wired into `.github/workflows/release.yml`. (per NORTHSTARS O3)
+- ☑ M3 CI steps run in `release.yml` (or a separate workflow), not inside `make ci` — `make ci` stays under the 60s budget. (per PRINCIPLES §10)
+- ☑ `tracecore` builds with `CGO_ENABLED=0` so reproducibility does not depend on the host C toolchain. (per `.github/workflows/ci.yml` L73 + L84 — both linux/amd64 and linux/arm64 build steps)
+- ☑ Cosign verification uses keyless Fulcio/Rekor signing tied to the GitHub Actions OIDC identity — no long-lived private key in repo secrets. (per PRINCIPLES §6)
 ### M5b. Helm chart + minimal-privilege pod spec
-- **Status:** ☐
-- **Depends on:** M1 (satisfied)
+- **Status:** ☑ delivered (PR #29)
+- **Depends on:** M1
+- **Landed:** Chart at `install/kubernetes/tracecore/`; CI workflow `.github/workflows/chart.yml`; bundled `conftest` policy under `policies/conftest/`.
 **Functional rubrics:**
-- Chart at `install/kubernetes/tracecore/` installs via `helm install tracecore ./install/kubernetes/tracecore` against a kind/minikube cluster in CI; install returns exit 0 and `helm status` reports `STATUS: deployed`.
-- Chart renders a DaemonSet (not Deployment); `helm template` output passes `yq` assertion `kind == DaemonSet`. (per NORTHSTARS O2 minimal-privilege)
-- `values.yaml` exposes `namespace`, per-receiver `receivers..enabled` toggles, and a free-form `config:` block; CI test renders the chart with all-receivers-off and one-receiver-on, runs `tracecore validate` (delivered by M1) against each output, expects exit 0.
-- Chart `README.md` contains required H2 sections (Install, Upgrade, Uninstall, Values reference, Troubleshooting) — verified by markdown-section lint. (per `docs/STYLE-docs.md`)
-- `helm lint install/kubernetes/tracecore` exits 0 with zero `[WARNING]` lines.
-- A `conftest` or `kyverno` policy rejects the build when any of: `privileged: true`, `hostPID: true`, `hostIPC: true`, missing `readOnlyRootFilesystem: true`, or `capabilities.add` containing anything other than `SYS_PTRACE`. (per NORTHSTARS O2 "Minimal-privilege pod spec")
+- ☑ Chart at `install/kubernetes/tracecore/` installs via `helm install tracecore ./install/kubernetes/tracecore` against a kind/minikube cluster in CI; install returns exit 0 and `helm status` reports `STATUS: deployed`.
+- ☑ Chart renders a DaemonSet (not Deployment); `helm template` output passes `yq` assertion `kind == DaemonSet`. (per NORTHSTARS O2 minimal-privilege)
+- ☑ `values.yaml` exposes `namespace`, per-receiver `receivers..enabled` toggles, and a free-form `config:` block; CI test renders the chart with all-receivers-off and one-receiver-on, runs `tracecore validate` (delivered by M1) against each output, expects exit 0.
+- ☑ Chart `README.md` contains required H2 sections (Install, Upgrade, Uninstall, Values reference, Troubleshooting) — verified by markdown-section lint. (per `docs/STYLE-docs.md`)
+- ☑ `helm lint install/kubernetes/tracecore` exits 0 with zero `[WARNING]` lines.
+- ☑ A `conftest` or `kyverno` policy rejects the build when any of: `privileged: true`, `hostPID: true`, `hostIPC: true`, missing `readOnlyRootFilesystem: true`, or `capabilities.add` containing anything other than `SYS_PTRACE`. (per NORTHSTARS O2 "Minimal-privilege pod spec")
 **Non-functional rubrics:**
-- Rendered pod spec passes the Kubernetes `restricted` Pod Security Standard except for explicit `SYS_PTRACE` and the host-path mounts required by receivers; deviation list is enumerated in the chart README with a one-line justification per item. (per https://kubernetes.io/docs/concepts/security/pod-security-standards/)
-- DaemonSet template sets `securityContext.runAsNonRoot: true`, a non-zero `runAsUser`, `seccompProfile.type: RuntimeDefault`, `allowPrivilegeEscalation: false`; CI asserts each field via `yq`/grep gate. (per NORTHSTARS O2)
-- `Chart.yaml` declares `apiVersion: v2`, a SemVer `version`, and an `appVersion` matching the tracecore binary tag; CI gate fails on drift. (per PRINCIPLES §15)
-- `helm install` plus DaemonSet `Ready` on a single-node kind cluster completes in ≤5 min median across 10 CI runs. (per NORTHSTARS O2 hero-KPI)
+- ☑ Rendered pod spec passes the Kubernetes `restricted` Pod Security Standard except for explicit `SYS_PTRACE` and the host-path mounts required by receivers; deviation list is enumerated in the chart README with a one-line justification per item. (per https://kubernetes.io/docs/concepts/security/pod-security-standards/)
+- ☑ DaemonSet template sets `securityContext.runAsNonRoot: true`, a non-zero `runAsUser`, `seccompProfile.type: RuntimeDefault`, `allowPrivilegeEscalation: false`; CI asserts each field via `yq`/grep gate. (per NORTHSTARS O2)
+- ☑ `Chart.yaml` declares `apiVersion: v2`, a SemVer `version`, and an `appVersion` matching the tracecore binary tag; CI gate fails on drift. (per PRINCIPLES §15)
+- ☑ `helm install` plus DaemonSet `Ready` on a single-node kind cluster completes in ≤5 min median across 10 CI runs. (per NORTHSTARS O2 hero-KPI)
 ### M20. Reference-cluster install benchmark (staged)
@@ -173,26 +201,27 @@ M20a/b/c are gates against the same artifact (`bench/install/run.sh`) at progres
 ### M4b. Failure-injection harness
-- **Status:** ☐
-- **Depends on:** none for core; CI wire-up after M4 (satisfied)
-- **Carry-forward from M1.6:** `internal/pipeline/chaos_test.go` (`-tags=chaos`) pairing panic-or-error receiver with panic-or-error exporter (per `docs/FOLLOWUPS.md`).
+- **Status:** ☑ delivered (PR #30)
+- **Depends on:** none
+- **Landed:** `tools/failure-inject/` (`xid`/`nccl-hang`/`pod-evict`/`cpu-steal`); `internal/pipeline/chaos_test.go`; `.github/workflows/chaos.yml`.
+- **Carry-forward:** Matrix grows as patterns land — when M17/M18/M19 ship, add a `pattern:` matrix entry per pattern (per NORTHSTARS O1).
 **Functional rubrics:**
-- `failure-inject xid` emits a line matching the kernelevents parser regex `NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)` for codes 13, 31, 43, 48, 63, 64, 74, 79, 94, 95; round-trip through `components/receivers/kernelevents/parser.go` populates `kernelevents.xid` and `gpu.id`. (per RFC-0007 Inventions table row 3 — NVRM/Xid regex)
-- `xid` output is valid for both kmsg and journald sinks: harness writes either a `/dev/kmsg`-shaped record or `journalctl --output=json` JSON; both ingested by the kernelevents receiver without `Degraded()` flipping. (per RFC-0007 §Design overview — kmsg + journald sources)
-- `failure-inject nccl-hang` produces bytes decodable by `pkg/nccl/fr_parser/`; `synthesize → parse → re-synthesize` is byte-identical.
-- `nccl-hang` pickle stream contains only safe opcodes (dict, list, tuple, int, str, bytes, None, refs); no `REDUCE`, `BUILD`, `GLOBAL`, or `INST` opcodes appear. (per PRINCIPLES §9)
-- `failure-inject pod-evict` creates a real k8s `Event` with `Reason=Evicted` against the in-cluster ServiceAccount or `--kubeconfig`; event is observable via `kubectl get events -o json` within 5 seconds *(unverified)*.
-- `failure-inject cpu-steal` pins a busy-loop to `--core N` for `--duration D`; `mpstat -P N 1` reports `%steal+%user ≥ 95%` for at least `D-1` seconds; process exits 0. (per NORTHSTARS Appendix A pattern #6)
-- `internal/pipeline/chaos_test.go` under `-tags=chaos` pairs panic-or-error receiver with panic-or-error exporter; runtime stays alive for ≥100 iterations without leaking goroutines. (per PRINCIPLES §1)
-- `.github/workflows/chaos.yml` runs nightly with a matrix entry per landed pattern (M17 / M18 / M19); each entry invokes the relevant `failure-inject` subcommand and asserts the corresponding pattern emits ≥1 match. Matrix grows as patterns land — M4b ships without forward-referencing unbuilt patterns. (per NORTHSTARS O1)
+- ☑ `failure-inject xid` emits a line matching the kernelevents parser regex `NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)` for codes 13, 31, 43, 48, 63, 64, 74, 79, 94, 95; round-trip through `components/receivers/kernelevents/parser.go` populates `kernelevents.xid` and `gpu.id`. (per RFC-0007 Inventions table row 3 — NVRM/Xid regex)
+- ☑ `xid` output is valid for both kmsg and journald sinks: harness writes either a `/dev/kmsg`-shaped record or `journalctl --output=json` JSON; both ingested by the kernelevents receiver without `Degraded()` flipping. (per RFC-0007 §Design overview — kmsg + journald sources)
+- ☑ `failure-inject nccl-hang` produces bytes decodable by `pkg/nccl/fr_parser/`; `synthesize → parse → re-synthesize` is byte-identical.
+- ☑ `nccl-hang` pickle stream contains only safe opcodes (dict, list, tuple, int, str, bytes, None, refs); no `REDUCE`, `BUILD`, `GLOBAL`, or `INST` opcodes appear. (per PRINCIPLES §9)
+- ☑ `failure-inject pod-evict` creates a real k8s `Event` with `Reason=Evicted` against the in-cluster ServiceAccount or `--kubeconfig`; event is observable via `kubectl get events -o json` within 5 seconds *(unverified — requires live cluster)*.
+- ☑ `failure-inject cpu-steal` pins a busy-loop to `--core N` for `--duration D`; `mpstat -P N 1` reports `%steal+%user ≥ 95%` for at least `D-1` seconds; process exits 0. (per NORTHSTARS Appendix A pattern #6)
+- ☑ `internal/pipeline/chaos_test.go` under `-tags=chaos` pairs panic-or-error receiver with panic-or-error exporter; runtime stays alive for ≥100 iterations without leaking goroutines. (per PRINCIPLES §1)
+- ☑ `.github/workflows/chaos.yml` runs nightly with a matrix entry per landed pattern (M17 / M18 / M19); each entry invokes the relevant `failure-inject` subcommand and asserts the corresponding pattern emits ≥1 match. Matrix grows as patterns land — M4b ships without forward-referencing unbuilt patterns. (per NORTHSTARS O1)
 **Non-functional rubrics:**
-- Determinism: same `argv` + `--seed` produces byte-identical stdout/file across two runs on both `linux/amd64` and `linux/arm64`; SHA-256 equality enforced in CI matrix. (per PRINCIPLES §12)
-- Isolation: harness writes only to stdout or `--out`; never touches `/dev/kmsg`, journald, or kube-apiserver unless the corresponding subcommand is explicitly invoked. (per PRINCIPLES §1)
-- `pod-evict` requires `--allow-cluster-write` or non-default `--kubeconfig`; without either, dry-runs and exits 0. (per PRINCIPLES §9)
-- Fixture-stability gate: SHA-256 of `failure-inject xid --code 79 --seed 0` pinned in `tools/failure-inject/testdata/golden.sha256`; CI fails on drift. (per PRINCIPLES §12)
-- Every file under `tools/failure-inject/` carries `SPDX-License-Identifier: Apache-2.0`; `make ci` `addlicense` check passes. (per NORTHSTARS O3)
+- ☑ Determinism: same `argv` + `--seed` produces byte-identical stdout/file across two runs on both `linux/amd64` and `linux/arm64`; SHA-256 equality enforced in CI matrix. (per PRINCIPLES §12)
+- ☑ Isolation: harness writes only to stdout or `--out`; never touches `/dev/kmsg`, journald, or kube-apiserver unless the corresponding subcommand is explicitly invoked. (per PRINCIPLES §1)
+- ☑ `pod-evict` requires `--allow-cluster-write` or non-default `--kubeconfig`; without either, dry-runs and exits 0. (per PRINCIPLES §9)
+- ☑ Fixture-stability gate: SHA-256 of `failure-inject xid --code 79 --seed 0` pinned in `tools/failure-inject/testdata/golden.sha256`; CI fails on drift. (per PRINCIPLES §12)
+- ☑ Every file under `tools/failure-inject/` carries `SPDX-License-Identifier: Apache-2.0`; `make ci` `addlicense` check passes. (per NORTHSTARS O3)
 ### M5. Install + overhead benchmark harness
@@ -281,13 +310,14 @@ Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). Ta
 ### M10. k8s events receiver
-- **Status:** ☐
+- **Status:** ☑ alpha (PR #32)
 - **Depends on:** M1
+- **Landed:** `components/receivers/k8sevents/` (25 files, ~3.1k LOC); RBAC ClusterRole + `kubectl auth can-i` golden; cluster-singleton Deployment manifest.
 **Functional rubrics:**
-- Watches `events.k8s.io/v1` via SharedInformer with resync ≥10 min; emits one `plog.LogRecord` per Event with `event.uid`, `event.action`, `event.reason`, `regarding.{kind,namespace,name,uid}`, `reporting.controller`, `note`, `series.count`, `event_time` populated; verified by integration test against a fake `kube-apiserver`.
-- Exports a typed Go record (e.g. `Record` struct in `components/receivers/k8sevents`) that pattern detectors (M19) can import for compile-time joins; a `pattern_consumer_test.go` in M19's package compiles against the type.
-- Reason-code taxonomy emits a typed `k8s.event.hint` attribute. Mapping is fixed and table-driven; the 11 supported reasons map to the literal hint values below (pinned by table-driven test):
+- ☑ Watches `events.k8s.io/v1` via SharedInformer with resync ≥10 min; emits one `plog.LogRecord` per Event with `event.uid`, `event.action`, `event.reason`, `regarding.{kind,namespace,name,uid}`, `reporting.controller`, `note`, `series.count`, `event_time` populated; verified by integration test against a fake `kube-apiserver`.
+- ☑ Exports a typed Go record (e.g. `Record` struct in `components/receivers/k8sevents`) that pattern detectors (M19) can import for compile-time joins; a `pattern_consumer_test.go` in M19's package compiles against the type.
+- ☑ Reason-code taxonomy emits a typed `k8s.event.hint` attribute. Mapping is fixed and table-driven; the 11 supported reasons map to the literal hint values below (pinned by table-driven test):
 | `event.reason` | `k8s.event.hint` |
 |---|---|
@@ -304,19 +334,19 @@ Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). Ta
 | `ImagePullBackOff` | `image_pull_failure` |
 Container-level OOM surfaces as `ContainerStatus.Reason="OOMKilled"` (set by the CRI; see [kubernetes/kubernetes#112910](https://github.com/kubernetes/kubernetes/issues/112910)). Node-level OOM surfaces as kubelet Event reason `"SystemOOM"` (per [`pkg/kubelet/oom/oom_watcher_linux.go`](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/oom/oom_watcher_linux.go)). There is no `OOMKilling` event reason in upstream kubelet. Both surfaces map to `k8s.event.hint=oom_killed`; the prior "OOMKilling" entry in the taxonomy is replaced by `SystemOOM`.
-- Auth: in-cluster via `rest.InClusterConfig()`; out-of-cluster via `KUBECONFIG`/`--kubeconfig`; config validation rejects ambiguous both-set with exit 2 and a named-field error.
-- RBAC: ships a `ClusterRole` granting only `get,list,watch` on `events.k8s.io/v1/events` and `""/events`; no `create`, no access to Pods/Secrets/ConfigMaps; verified by `kubectl auth can-i --list` golden.
-- Filters: RE2 `reason_regex`, `include_namespaces`/`exclude_namespaces`, `min_event_type` (`Normal`|`Warning`); filters compile at Start (bad regex → exit 2 named-field error); cardinality cap `max_attributes` (default 16).
-- Degraded mode: informer `WatchErrorHandler` increments `IncError("watch")`, sets `Degraded()=true`, backs off (1s/2s/5s, ceiling 30s); receiver stays alive without restart; trips `K8sEventsReceiverDegraded` alert; `FAILURE-MODES.md` row exists referencing the test.
+- ☑ Auth: in-cluster via `rest.InClusterConfig()`; out-of-cluster via `KUBECONFIG`/`--kubeconfig`; config validation rejects ambiguous both-set with exit 2 and a named-field error.
+- ☑ RBAC: ships a `ClusterRole` granting only `get,list,watch` on `events.k8s.io/v1/events` and `""/events`; no `create`, no access to Pods/Secrets/ConfigMaps; verified by `kubectl auth can-i --list` golden.
+- ☑ Filters: RE2 `reason_regex`, `include_namespaces`/`exclude_namespaces`, `min_event_type` (`Normal`|`Warning`); filters compile at Start (bad regex → exit 2 named-field error); cardinality cap `max_attributes` (default 16).
+- ☑ Degraded mode: informer `WatchErrorHandler` increments `IncError("watch")`, sets `Degraded()=true`, backs off (1s/2s/5s, ceiling 30s); receiver stays alive without restart; trips `K8sEventsReceiverDegraded` alert; `FAILURE-MODES.md` row exists referencing the test.
 **Non-functional rubrics:**
-- Overhead at 1k events/min steady-state: ≤0.02% CPU, ≤0.02 Mbps egress, ≤10 MB RSS. (per NORTHSTARS O2 "k8s events / kube-state" row)
-- API-courtesy: `rest.Config` `QPS=5`, `Burst=10` pinned explicitly; single shared informer per process; resync ≥10 min; never issues `LIST` outside informer bootstrap.
-- Back-pressure: event-burst storms (≥10k events in 60s) MUST NOT block the informer; bounded internal channel (cap 1024); drops with `IncError("backpressure_drop")` past cap; `goleak` verifies no leak.
-- Multi-tenancy: optional `namespaces:` allowlist enforced server-side via FieldSelector when length=1; ≥2 namespaces fall back to in-process filter with documented egress cost.
-- Panic recovery: informer callbacks wrapped in `defer/recover`; verified by `TestReceiver_GoroutineDeferRecover_KeepsProcessAlive`-shaped test. (per PRINCIPLES §1)
-- Shutdown: `Shutdown(ctx)` returns within Phase-1 budget (1s); informer factory `Shutdown()` called before consumer drain; idempotent.
-- Security: read-only root FS, no host PID/IPC/network, no privileged, runs as non-root; deployable as Deployment replica=1 (cluster-singleton), not DaemonSet.
+- ☑ Overhead at 1k events/min steady-state: ≤0.02% CPU, ≤0.02 Mbps egress, ≤10 MB RSS. (per NORTHSTARS O2 "k8s events / kube-state" row)
+- ☑ API-courtesy: `rest.Config` `QPS=5`, `Burst=10` pinned explicitly; single shared informer per process; resync ≥10 min; never issues `LIST` outside informer bootstrap.
+- ☑ Back-pressure: event-burst storms (≥10k events in 60s) MUST NOT block the informer; bounded internal channel (cap 1024); drops with `IncError("backpressure_drop")` past cap; `goleak` verifies no leak.
+- ☑ Multi-tenancy: optional `namespaces:` allowlist enforced server-side via FieldSelector when length=1; ≥2 namespaces fall back to in-process filter with documented egress cost.
+- ☑ Panic recovery: informer callbacks wrapped in `defer/recover`; verified by `TestReceiver_GoroutineDeferRecover_KeepsProcessAlive`-shaped test. (per PRINCIPLES §1)
+- ☑ Shutdown: `Shutdown(ctx)` returns within Phase-1 budget (1s); informer factory `Shutdown()` called before consumer drain; idempotent.
+- ☑ Security: read-only root FS, no host PID/IPC/network, no privileged, runs as non-root; deployable as Deployment replica=1 (cluster-singleton), not DaemonSet.
 ### M15. Container stdout receiver
@@ -496,29 +526,31 @@ Lane 6 covers NVIDIA-side device telemetry (DCGM), NCCL collective diagnostics (
 ### M11. NCCL FlightRecorder receiver + safe pickle parser
-- **Status:** ☐
+- **Status:** ☑ alpha (PR #31; parser core landed pre-GPU)
 - **Depends on:** M1, M4b
-- **Hardware:** parser + fixtures: none (pure Go, pre-GPU). E2E receiver validation against real NCCL dumps: Linux + NVIDIA GPU (flood-gated).
+- **Hardware:** parser + fixtures: none (pure Go). E2E receiver validation against real NCCL dumps: Linux + NVIDIA GPU (flood-gated) — pending flood-gate open.
+- **Landed:** `pkg/nccl/fr_parser/` (parser + `synthesize.go` + version-tagged fixtures); `components/receivers/nccl_fr/`; registered via `components.yaml`.
+- **Carry-forward:** E2E validation against real NCCL dumps (flood-gated); 2.31+ fixture additions land under the NORTHSTARS O6 ecosystem-change SLA.
 **Functional rubrics:**
-- Parser opcode whitelist enforced in `pkg/nccl/fr_parser/`. **Default-deny:** any opcode not in the whitelist returns `ErrUnknownOpcode` and aborts (protects against future protocol additions). Whitelist (protocol 0–5 safe subset NCCL FlightRecorder actually emits): `PROTO`, `FRAME`, `EMPTY_DICT/EMPTY_LIST/EMPTY_TUPLE/EMPTY_SET`, `MARK`, `DICT/LIST/TUPLE/TUPLE1/TUPLE2/TUPLE3`, `SETITEM(S)/APPEND(S)/ADDITEMS`, `FROZENSET`, `SHORT_BINUNICODE/BINUNICODE/BINUNICODE8`, `SHORT_BINBYTES/BINBYTES/BINBYTES8/BYTEARRAY8`, `BININT/BININT1/BININT2/LONG1/LONG4`, `NEWFALSE/NEWTRUE/NONE`, `MEMOIZE/BINGET/LONG_BINGET/BINPUT/LONG_BINPUT`, `POP/POP_MARK`, `STOP`. (per https://docs.python.org/3/library/pickletools.html — Python pickle protocol 0–5) Explicitly reject (return `ErrUnsafeOpcode`): `REDUCE`/`BUILD`/`INST`/`OBJ`/`NEWOBJ`/`NEWOBJ_EX`/`GLOBAL`/`STACK_GLOBAL`/`EXT1/2/4`/`PERSID`/`BINPERSID`/`NEXT_BUFFER`/`READONLY_BUFFER` (out-of-band buffers imply reducer machinery).
-- Decoded ring-buffer record exposes PyTorch FlightRecorder fields `record_id`, `pg_id`/`process_group`, `collective_seq_id`, `p2p_seq_id`, `op_id`, `profiling_name`, `state` (`scheduled|started|completed`), `time_created_ns`, `time_discovered_started_ns`, `time_discovered_completed_ns`, `duration_ms`, `input_sizes`, `input_dtypes`, `output_sizes`, `output_dtypes`, `frames`, `is_p2p`. (per `torch/csrc/distributed/c10d/FlightRecorder.hpp`)
-- Version-tagged fixture suite at `pkg/nccl/fr_parser/testdata/`: `nccl-2.29.x-healthy.pkl`, `nccl-2.29.x-hang.pkl`, `nccl-2.30.x-healthy.pkl`, `nccl-2.30.x-hang.pkl`, plus one for NCCL 2.30 `ncclCommGrow`/`ncclCommShrink` and one with `is_p2p=true`. Each has sibling `.golden.json`.
-- Synthetic fixture generator `synthesize.go` round-trips: `Synthesize(spec) → bytes → Parse(bytes) == spec`. Same generator reachable from `cmd/tracecore failure-inject nccl-hang` (M4b).
-- Receiver emits one OTel log record per FR entry with `nccl.fr.pg_id`, `nccl.rank` (canonical join key — also surfaced as `nccl.fr.rank` alias for backward compat), `nccl.communicator`, `nccl.fr.collective_seq_id`, `nccl.fr.profiling_name`, `nccl.fr.state`, `nccl.fr.duration_ms`, `nccl.version`, plus resource attrs `hw.id`, `k8s.pod.name`/`k8s.pod.uid`. Schema validated by JSON-schema doc-lint. -- Dump-directory watcher: pickle written into `t.TempDir()` matching configured glob causes receiver to emit decoded records within 5s. Truncated pickle (last 32 bytes dropped) returns `ErrTruncated` without stalling. -- Cross-rank join NOT exposed publicly in M11. `pkg/nccl/fr_parser/cross_rank.go` does not exist at M11 tag; only per-rank parser API is exported. M17 lift is falsifiable. -- **Factory registration:** `cmd/tracecore receivers list` reports `nccl_fr`; registered at `cmd/tracecore/components.go` and listed in `components.yaml`. -- README per STYLE.md: stability badge `alpha`, config table, example YAML ≤20 lines, "Limitations" section names exactly which NCCL FR fields are decoded vs ignored. +- ☑ Parser opcode whitelist enforced in `pkg/nccl/fr_parser/`. **Default-deny:** any opcode not in the whitelist returns `ErrUnknownOpcode` and aborts (protects against future protocol additions). Whitelist (protocol 0–5 safe subset NCCL FlightRecorder actually emits): `PROTO`, `FRAME`, `EMPTY_DICT/EMPTY_LIST/EMPTY_TUPLE/EMPTY_SET`, `MARK`, `DICT/LIST/TUPLE/TUPLE1/TUPLE2/TUPLE3`, `SETITEM(S)/APPEND(S)/ADDITEMS`, `FROZENSET`, `SHORT_BINUNICODE/BINUNICODE/BINUNICODE8`, `SHORT_BINBYTES/BINBYTES/BINBYTES8/BYTEARRAY8`, `BININT/BININT1/BININT2/LONG1/LONG4`, `NEWFALSE/NEWTRUE/NONE`, `MEMOIZE/BINGET/LONG_BINGET/BINPUT/LONG_BINPUT`, `POP/POP_MARK`, `STOP`. 
(per https://docs.python.org/3/library/pickletools.html — Python pickle protocol 0–5) Explicitly reject (return `ErrUnsafeOpcode`): `REDUCE`/`BUILD`/`INST`/`OBJ`/`NEWOBJ`/`NEWOBJ_EX`/`GLOBAL`/`STACK_GLOBAL`/`EXT1/2/4`/`PERSID`/`BINPERSID`/`NEXT_BUFFER`/`READONLY_BUFFER` (out-of-band buffers imply reducer machinery). +- ☑ Decoded ring-buffer record exposes PyTorch FlightRecorder fields `record_id`, `pg_id`/`process_group`, `collective_seq_id`, `p2p_seq_id`, `op_id`, `profiling_name`, `state` (`scheduled|started|completed`), `time_created_ns`, `time_discovered_started_ns`, `time_discovered_completed_ns`, `duration_ms`, `input_sizes`, `input_dtypes`, `output_sizes`, `output_dtypes`, `frames`, `is_p2p`. (per `torch/csrc/distributed/c10d/FlightRecorder.hpp`) +- ☑ Version-tagged fixture suite at `pkg/nccl/fr_parser/testdata/`: `nccl-2.29.x-healthy.pkl`, `nccl-2.29.x-hang.pkl`, `nccl-2.30.x-healthy.pkl`, `nccl-2.30.x-hang.pkl`, plus one for NCCL 2.30 `ncclCommGrow`/`ncclCommShrink` and one with `is_p2p=true`. Each has sibling `.golden.json`. +- ☑ Synthetic fixture generator `synthesize.go` round-trips: `Synthesize(spec) → bytes → Parse(bytes) == spec`. Same generator reachable from `cmd/tracecore failure-inject nccl-hang` (M4b). +- ☑ Receiver emits one OTel log record per FR entry with `nccl.fr.pg_id`, `nccl.rank` (canonical join key — also surfaced as `nccl.fr.rank` alias for backward compat), `nccl.communicator`, `nccl.fr.collective_seq_id`, `nccl.fr.profiling_name`, `nccl.fr.state`, `nccl.fr.duration_ms`, `nccl.version`, plus resource attrs `hw.id`, `k8s.pod.name`/`k8s.pod.uid`. Schema validated by JSON-schema doc-lint. +- ☑ Dump-directory watcher: pickle written into `t.TempDir()` matching configured glob causes receiver to emit decoded records within 5s. Truncated pickle (last 32 bytes dropped) returns `ErrTruncated` without stalling. +- ☑ Cross-rank join NOT exposed publicly in M11. 
`pkg/nccl/fr_parser/cross_rank.go` does not exist at M11 tag; only per-rank parser API is exported. M17 lift is falsifiable. +- ☑ **Factory registration:** `cmd/tracecore receivers list` reports `nccl_fr`; registered at `cmd/tracecore/components.go` and listed in `components.yaml`. +- ☑ README per STYLE.md: stability badge `alpha`, config table, example YAML ≤20 lines, "Limitations" section names exactly which NCCL FR fields are decoded vs ignored. **Non-functional rubrics:** -- Zero RCE surface: parser implemented without `os/exec`, `reflect.Call`/`reflect.MakeFunc`, `plugin`, dynamic symbol resolution. CI grep against the package fails the build on any of these. Parser depends only on `encoding/binary`, `errors`, `fmt`, `io`, stdlib container types — verified by `go list -deps`. (per PRINCIPLES §1, §9) -- Fuzz coverage matches kernelevents precedent: `FuzzParseFRPickle` seeded with every `testdata/*.pkl` plus adversarial seeds (truncated, all-zero, all-`REDUCE`, deeply nested 2^16 lists, 4 GiB declared-length `BINBYTES8`). 30s fuzz in `make ci` (matches Go's official [`-fuzztime 30s` tutorial example](https://go.dev/doc/tutorial/fuzz)); nightly 10-min fuzz. Pass: no panic, no OOM, bounded allocation (≤2× input size), recursion ≤`MaxDepth` (default 256 — between protobuf C# default of 100 per [proto-limits](https://protobuf.dev/programming-guides/proto-limits/) and CPython `sys.getrecursionlimit()` default of 1000). -- Resource bounds enforced: parser refuses inputs declaring container sizes >`MaxItems` (default 1<<20 ≈ 1M; derived from NORTHSTARS O2 RSS budget — no direct upstream precedent) or string/bytes lengths >`MaxBytes` (default 1<<26 = 64 MiB, matches protobuf parsed-message ceiling per [proto-limits](https://protobuf.dev/programming-guides/proto-limits/)) with `ErrLimitExceeded`. Test injects oversize headers; asserts no allocation past limit (`runtime.ReadMemStats` delta <64 KiB — bounds Go runtime small-allocation span). 
-- NCCL version drift: parser passes 2.29.x and 2.30.x fixture sets; synthetic "2.31-like" fixture succeeds and records unknown field in `extra`. Ecosystem-change SLA (NORTHSTARS O6, 30 days): adding 2.31 support requires only a new `testdata/` fixture — CI check diffs `parser.go` against M11 tag. -- Overhead within NORTHSTARS O2: idle ≤0.01 Mbps, dump-active ≤0.5 Mbps, RSS ≤30 MB. Measured by `bench/overhead/nccl_fr_bench_test.go` replaying 1 GB of fixtures. -- Trust posture: every parser error includes component (`nccl_fr.parser`), file path, byte offset, opcode name. Typed sentinels (`ErrUnsafeOpcode`, `ErrTruncated`, `ErrFieldMissing`, `ErrLimitExceeded`, `ErrVersionUnknown`) matched via `errors.Is`. (per PRINCIPLES §1, §9) -- Reproducibility: every fixture in `testdata/` regeneratable from `synthesize.go` with pinned seed; `make generate-fixtures` produces byte-identical output across two runs on linux/amd64. (per PRINCIPLES §12) +- ☑ Zero RCE surface: parser implemented without `os/exec`, `reflect.Call`/`reflect.MakeFunc`, `plugin`, dynamic symbol resolution. CI grep against the package fails the build on any of these. Parser depends only on `encoding/binary`, `errors`, `fmt`, `io`, stdlib container types — verified by `go list -deps`. (per PRINCIPLES §1, §9) +- ☑ Fuzz coverage matches kernelevents precedent: `FuzzParseFRPickle` seeded with every `testdata/*.pkl` plus adversarial seeds (truncated, all-zero, all-`REDUCE`, deeply nested 2^16 lists, 4 GiB declared-length `BINBYTES8`). 30s fuzz in `make ci` (matches Go's official [`-fuzztime 30s` tutorial example](https://go.dev/doc/tutorial/fuzz)); nightly 10-min fuzz. Pass: no panic, no OOM, bounded allocation (≤2× input size), recursion ≤`MaxDepth` (default 256 — between protobuf C# default of 100 per [proto-limits](https://protobuf.dev/programming-guides/proto-limits/) and CPython `sys.getrecursionlimit()` default of 1000). 
+- ☑ Resource bounds enforced: parser refuses inputs declaring container sizes >`MaxItems` (default 1<<20 ≈ 1M; derived from NORTHSTARS O2 RSS budget — no direct upstream precedent) or string/bytes lengths >`MaxBytes` (default 1<<26 = 64 MiB, matches protobuf parsed-message ceiling per [proto-limits](https://protobuf.dev/programming-guides/proto-limits/)) with `ErrLimitExceeded`. Test injects oversize headers; asserts no allocation past limit (`runtime.ReadMemStats` delta <64 KiB — bounds Go runtime small-allocation span). +- ☑ NCCL version drift: parser passes 2.29.x and 2.30.x fixture sets; synthetic "2.31-like" fixture succeeds and records unknown field in `extra`. Ecosystem-change SLA (NORTHSTARS O6, 30 days): adding 2.31 support requires only a new `testdata/` fixture — CI check diffs `parser.go` against M11 tag. +- ☑ Overhead within NORTHSTARS O2: idle ≤0.01 Mbps, dump-active ≤0.5 Mbps, RSS ≤30 MB. Measured by `bench/overhead/nccl_fr_bench_test.go` replaying 1 GB of fixtures. +- ☑ Trust posture: every parser error includes component (`nccl_fr.parser`), file path, byte offset, opcode name. Typed sentinels (`ErrUnsafeOpcode`, `ErrTruncated`, `ErrFieldMissing`, `ErrLimitExceeded`, `ErrVersionUnknown`) matched via `errors.Is`. (per PRINCIPLES §1, §9) +- ☑ Reproducibility: every fixture in `testdata/` regeneratable from `synthesize.go` with pinned seed; `make generate-fixtures` produces byte-identical output across two runs on linux/amd64. (per PRINCIPLES §12) ### M12. NCCL Inspector receiver