diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md index 149b9488..d91025f2 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -7,7 +7,7 @@ labels: enhancement ` marker) for examples naming contrib exporters. `scripts/doc-check.sh` walks fenced YAML blocks to assert each recipe's first non-blank line matches a file under `docs/integrations/examples/`. (per `docs/STYLE-docs.md` §5; see [docs/research/m5-m6-research.md](docs/research/m5-m6-research.md) §D3) +- ☑ `docs/integrations/{datadog,honeycomb,otel-backend,clickhouse-direct}.md` each contain ≥1 fenced ` ```yaml ` block whose first non-blank line resolves to a file under `docs/integrations/examples/` that the `validator-recipe` CI job runs `validate --config=` against. Gate splits by recipe scope: `tracecore validate` for examples naming only compiled-in components; `otel/opentelemetry-collector-contrib` (digest-pinned in lockstep with each recipe's `` marker) for examples naming contrib exporters. `scripts/doc-check.sh` walks fenced YAML blocks to assert each recipe's first non-blank line matches a file under `docs/integrations/examples/`. (per `docs/STYLE-docs.md` §5; see [docs/research/m5-m6-research.md](research/m5-m6-research.md) §D3) - ☑ Each recipe names its exporter using the upstream OTel Collector component ID (e.g. `otlphttp`, `datadog`, `clickhouse`) and pins the contrib release tag in a `` HTML comment; `doc-check.sh` greps for the marker and fails CI if absent. - ☑ `docs/maintainership.md` carries H2 headings `Commit access`, `RFC sponsorship`, `Security disclosure`, each answering its heading in sentence 1 and cross-referencing `CODEOWNERS`, `docs/rfcs/`, `SECURITY.md`; `doc-check.sh` asserts the heading set. 
(per `docs/STYLE-docs.md` §3) - ☑ `docs/FAILURE-MODES.md` enumerates ≥1 row each for `vendor SDK failure`, `exporter unreachable`, `config invalid`, each pointing to a real `Test*` identifier. (per `docs/STYLE-docs.md` §5) @@ -357,10 +357,10 @@ M20a/b/c are gates against the same artifact (`bench/install/run.sh`) at progres ### M9. kernelevents receiver - **Status:** ☑ shipped -- **Status (RFC-0013):** **DELETED in PR-K.2** — `components/receivers/kernelevents/` removed at v0.2.0; replaced by [`journaldreceiver` + `filelogreceiver` (kmsg) + OTTL transform for Xid regex](docs/integrations/journald-kernel.md). Customer-stable `kernelevents.xid` attribute preserved via the bundled recipe's OTTL normalization layer (RFC-0013 §3). `tools/failure-inject/xidgen/` (sole consumer of the receiver's wire shape) deleted alongside. +- **Status (RFC-0013):** **DELETED in PR-K.2** — `components/receivers/kernelevents/` removed at v0.2.0; replaced by [`journaldreceiver` + `filelogreceiver` (kmsg) + OTTL transform for Xid regex](integrations/journald-kernel.md). Customer-stable `kernelevents.xid` attribute preserved via the bundled recipe's OTTL normalization layer (RFC-0013 §3). `tools/failure-inject/xidgen/` (sole consumer of the receiver's wire shape) deleted alongside. - **Depends on:** M1 -- **Reference:** [RFC-0007](docs/rfcs/0007-kernelevents-receiver-scope.md); merged in PR #16. Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). -- **Carry-forward:** SXid (NVSwitch) classifier; regex shape blocked on a captured production fixture; reframed under RFC-0013 §5 as an upstream contribution against `journaldreceiver` or a new OTTL function (the in-tree shard `docs/followups/M9.md` was removed in the v0.2.0 doc sweep — see [`docs/followups/_needs-prod-data.md`](docs/followups/_needs-prod-data.md) for the prod-fixture gate). +- **Reference:** [RFC-0007](rfcs/0007-kernelevents-receiver-scope.md); merged in PR #16. 
Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). +- **Carry-forward:** SXid (NVSwitch) classifier; regex shape blocked on a captured production fixture; reframed under RFC-0013 §5 as an upstream contribution against `journaldreceiver` or a new OTTL function (the in-tree shard `followups/M9.md` was removed in the v0.2.0 doc sweep — see [`followups/_needs-prod-data.md`](followups/_needs-prod-data.md) for the prod-fixture gate). **Functional rubrics:** - ☑ Tails `/dev/kmsg` and `journalctl --output=json --follow` behind one config block via the `source` interface. (per `components/receivers/kernelevents/source.go`; RFC-0007 §Design overview) @@ -377,7 +377,7 @@ M20a/b/c are gates against the same artifact (`bench/install/run.sh`) at progres ### M10. k8s events receiver - **Status:** ☑ alpha (PR #32) -- **Status (RFC-0013):** **DELETED in PR-K.2** — `components/receivers/k8sevents/` removed at v0.2.0; replaced by [`k8sobjectsreceiver` (watch mode on `events`) + OTTL transform](docs/integrations/k8sobjects-events.md). Customer-stable `k8s.event.hint` 11-entry enum (pod_evicted, mount_failure, backoff, oom_killed, node_unhealthy, schedule_failure, create_failure, volume_attach_failure, container_status_unknown, node_pressure, image_pull_failure) preserved via the bundled recipe's OTTL normalization layer (RFC-0013 §3). The M19 pattern detector now consumes the typed `internal/synthesis/patterns/Record` + `NodeRecord` (severed from the in-tree receiver in PR-K.1, #211); fixture JSON authored against the upstream `k8sobjectsreceiver` schema unmarshals into the typed records unchanged. +- **Status (RFC-0013):** **DELETED in PR-K.2** — `components/receivers/k8sevents/` removed at v0.2.0; replaced by [`k8sobjectsreceiver` (watch mode on `events`) + OTTL transform](integrations/k8sobjects-events.md). 
Customer-stable `k8s.event.hint` 11-entry enum (pod_evicted, mount_failure, backoff, oom_killed, node_unhealthy, schedule_failure, create_failure, volume_attach_failure, container_status_unknown, node_pressure, image_pull_failure) preserved via the bundled recipe's OTTL normalization layer (RFC-0013 §3). The M19 pattern detector now consumes the typed `internal/synthesis/patterns/Record` + `NodeRecord` (severed from the in-tree receiver in PR-K.1, #211); fixture JSON authored against the upstream `k8sobjectsreceiver` schema unmarshals into the typed records unchanged. - **Depends on:** M1 - **Landed:** `components/receivers/k8sevents/` (25 files, ~3.1k LOC); RBAC ClusterRole + `kubectl auth can-i` golden; cluster-singleton Deployment manifest. - **Carry-forward:** End-to-end overhead measurement at 1k events/min asserting CPU + egress + RSS budgets together (per-component benches exist; integrated run is the open work). @@ -419,7 +419,7 @@ M20a/b/c are gates against the same artifact (`bench/install/run.sh`) at progres ### M15. Container stdout receiver - **Status:** ☑ alpha (opt-in via `containerstdout.enabled=true`) -- **Status (RFC-0013):** **DELETED in PR-K.2** — `components/receivers/containerstdout/` removed at v0.2.0; replaced by [`filelogreceiver` + container stanza + `file_storage` extension](docs/integrations/filelog-container.md). Per-rank attribution (`gen_ai.training.rank`, `gen_ai.training.job.id`) and derived metric `tracecore.container.lines_per_s` move to the bundled recipe's OTTL transform processor (RFC-0013 §3). Dataloader-timing extraction (M18 feed) lands as an OTTL parser preset operator contributed upstream to `filelogreceiver` / container stanza. +- **Status (RFC-0013):** **DELETED in PR-K.2** — `components/receivers/containerstdout/` removed at v0.2.0; replaced by [`filelogreceiver` + container stanza + `file_storage` extension](integrations/filelog-container.md). 
Per-rank attribution (`gen_ai.training.rank`, `gen_ai.training.job.id`) and derived metric `tracecore.container.lines_per_s` move to the bundled recipe's OTTL transform processor (RFC-0013 §3). Dataloader-timing extraction (M18 feed) lands as an OTTL parser preset operator contributed upstream to `filelogreceiver` / container stanza. - **Depends on:** M1 **Functional rubrics:** @@ -456,7 +456,7 @@ Superseded by RFC-0013 (kueue → `prometheusreceiver` recipe / kineto deferred - **Depends on:** M1 **Functional rubrics:** -- Scrapes the Kueue Prometheus endpoint (default `/metrics` on `kueue-controller-manager` service) via direct YAML config `endpoint`, `collection_interval`. ServiceMonitor-discovered scrape is deferred to a follow-up; see [docs/rfcs/0011-m16-kueue-receiver-scope.md](docs/rfcs/0011-m16-kueue-receiver-scope.md). +- Scrapes the Kueue Prometheus endpoint (default `/metrics` on `kueue-controller-manager` service) via direct YAML config `endpoint`, `collection_interval`. ServiceMonitor-discovered scrape is deferred to a follow-up; see [docs/rfcs/0011-m16-kueue-receiver-scope.md](rfcs/0011-m16-kueue-receiver-scope.md). - Emits OTLP metrics for all `kueue_`-prefixed families the endpoint serves. The explicit-test minimum (rubric set): `kueue_pending_workloads`, `kueue_admitted_active_workloads`, `kueue_admitted_workloads_total`, `kueue_admission_attempts_total`, `kueue_admission_wait_time_seconds` (histogram). The receiver tolerates absence of `kueue_cluster_queue_nominal_quota` / `kueue_cluster_queue_resource_usage` (gated upstream by Kueue's `enableClusterQueueResources` config). (per https://kueue.sigs.k8s.io/docs/reference/metrics/) - Histograms preserve bucket boundaries through OTel translation; bucket-edge-equality test. - Resource attributes per scrape: `k8s.cluster.name`, `service.namespace`, `service.instance.id`. 
Metric attributes preserve Kueue labels including `cluster_queue`, `local_queue`, `flavor`, `resource`, `status` (on `pending_workloads`), `result` (on `admission_attempts_total`), and `replica_role` (stamped on every Kueue metric; preservation lets operators distinguish leader-emitted metrics from follower views during HA failover). @@ -498,7 +498,7 @@ Superseded by RFC-0013 (kueue → `prometheusreceiver` recipe / kineto deferred ### M13. Python stack-sampling receiver (faulthandler-based) -> Conventionally referred to as "py-spy" after the popular Python-stack-sampling tool. The tracecore implementation uses CPython's built-in `faulthandler.dump_traceback` triggered cooperatively over a per-process Unix domain socket, not py-spy's `process_vm_readv` ptrace path. Naming preserved for operator familiarity; rubrics pin the implementation. See [RFC-0009](docs/rfcs/0009-pyspy-receiver-scope.md) for the design. +> Conventionally referred to as "py-spy" after the popular Python-stack-sampling tool. The tracecore implementation uses CPython's built-in `faulthandler.dump_traceback` triggered cooperatively over a per-process Unix domain socket, not py-spy's `process_vm_readv` ptrace path. Naming preserved for operator familiarity; rubrics pin the implementation. See [RFC-0009](rfcs/0009-pyspy-receiver-scope.md) for the design. - **Status:** ⧗ partial (Phase 1 PR #99 + Phase 2 PR #102 shipped the receiver scaffold, wire protocol, Python helper, lint gate, strace falsifier, and chart security context; Phase 3 cadence-assertion + overhead rubrics + dedup + safe.Call wrap pending; see Carry-forward) - **Status (RFC-0013):** DELETED at v0.3.0 - replaced by `parca-agent` (eBPF) via OTLP profiles sink, operator-deployed via a separate chart. Security posture changes (CAP_SYS_PTRACE → CAP_SYS_ADMIN/BPF - operator review window). `python/tracecore_pyspy/` PyPI helper, `tools/pyspy-lint/`, and `.github/workflows/{pyspy-integration,python-publish}.yml` deleted in the same boundary. 
See RFC-0013 §7. @@ -533,7 +533,7 @@ Superseded by RFC-0013 (kueue → `prometheusreceiver` recipe / kineto deferred - **Status:** ☑ alpha - **Status (RFC-0013):** DELETED at v0.2.0 (partial - receiver only; `pkg/kineto/` parser retained pending re-evaluation). Deferred until OTel Profiles graduates from Alpha; signal is the contrib `pprofreceiver` shipping GA. Sooner-than-Profiles workaround: Perfetto `trace_processor` wrapper. Final re-evaluation at v0.3.0 per RFC-0013 §7 / §4 PR-O. - **Depends on:** M1 -- **Reference:** [RFC-0012](docs/rfcs/0012-kineto-receiver-scope.md) +- **Reference:** [RFC-0012](rfcs/0012-kineto-receiver-scope.md) - **Hardware:** Linux; no GPU required to ingest checked-in `.pt.trace.json` fixtures - **Landed:** `pkg/kineto/` streaming Chrome-trace parser + `Synthesize` deterministic generator + toy_2step fixture + golden + `FuzzParseKinetoTrace`; `components/receivers/kineto/` receiver with `ProfilerStep#N` single-pass step tracker, `/proc//environ` rank discovery, deterministic 1% sampling, optional `cpu_op` aggregation, fsnotify watch loop, `safe.Call` wrap, README + RUNBOOK; factory wired in `components.yaml` + `cmd/tracecore/components.go`. - **Carry-forward:** Non-functional rubrics (CPU / egress / soak / 2 GB heap ceiling) pending PR D bench gates; `strace`-based read-only assertion (FOLLOWUPS row: trigger when ingest.go's OpenFile changes); upstream `gen_ai.training.*` semconv ratification (NORTHSTARS O4). @@ -589,9 +589,9 @@ Lane 6 covers NVIDIA-side device telemetry (DCGM), NCCL collective diagnostics ( - **Status:** ⧗ (alpha scaffold shipped in PR #18; cgo client + hardware integration carry-forward pending) - **Status (RFC-0013):** DELETED at v0.1.0 — **landed in PR-F.1 (#206)**: `components/receivers/dcgm/` + `pkg/dcgm/` removed (cgo client path never shipped real code; live ports removed in #188's PR-B2-shaped dcgm sweep). 
Replaced by `dcgm-exporter` + `prometheusreceiver` per `docs/integrations/prometheus-scrape.md` (PR-J, #195). NVIDIA's 1st-party `dcgm-exporter` covers every metric in the RFC-0005 set; cross-vendor `gpu.vendor` resource attribute lands via OTTL transform over Prometheus output (RFC-0013 §3, upstream-contribution target to OTel `hw.*` semconv per §5). Replacement applies to AMD (`ROCm/device-metrics-exporter`), Intel (`intel/xpumanager`), and Habana (Habana Prometheus Metric Exporter) on the same recipe shape. Chart `receivers.dcgm` toggle + `_helpers.tpl` doc-list + `NOTES.txt` warning retained until PR-K.3 (toggle is inert post-PR-A2: enabling it crashes the OCB binary at boot with "unknown factory"). - **Depends on:** M1 -- **Reference:** [RFC-0005](docs/rfcs/0005-dcgm-receiver-scope.md) +- **Reference:** [RFC-0005](rfcs/0005-dcgm-receiver-scope.md) - **Hardware:** Linux + NVIDIA GPU host with `nv-hostengine` reachable; driver R580 LTSB + DCGM 4.4.x reference (per [endoflife.date/nvidia](https://endoflife.date/nvidia) - R580 active support ends 2026-08-04, refresh LTSB pin within Q3 2026; DCGM 4.4.2 is current core release per [NVIDIA/DCGM tags](https://github.com/NVIDIA/DCGM/tags)) -- **Carry-forward:** (1) cgo client `client_cgo.go`; (2) hardware integration test at `//go:build dcgm,hardware`; (3) cardinality-cap calibration against ≥3 reference deployments; (4) `initial_delay` bench against real DGX boot; (5) per-metric toggles, vGPU, subprocess-isolation supervisor (defer-on-trigger; see [`docs/followups/M8.md`](docs/followups/M8.md)). +- **Carry-forward:** (1) cgo client `client_cgo.go`; (2) hardware integration test at `//go:build dcgm,hardware`; (3) cardinality-cap calibration against ≥3 reference deployments; (4) `initial_delay` bench against real DGX boot; (5) per-metric toggles, vGPU, subprocess-isolation supervisor (defer-on-trigger; see [`docs/followups/M8.md`](followups/M8.md)). 
**Functional rubrics:** - `pkg/dcgm/client_cgo.go` under `//go:build dcgm` implements every method on the `Client` interface against `github.com/NVIDIA/go-dcgm`; `cmd/tracecore receivers list` reports `dcgm [cgo]`. (per RFC-0005 §File layout + §Build-tag strategy) @@ -667,7 +667,7 @@ Lane 6 covers NVIDIA-side device telemetry (DCGM), NCCL collective diagnostics ( - **Status:** ☐ - **Depends on:** M8 cgo, M9, M11, M4b -- **Reference:** [`docs/patterns/pattern-1-nvlink-degradation.md`](docs/patterns/pattern-1-nvlink-degradation.md) +- **Reference:** [`docs/patterns/pattern-1-nvlink-degradation.md`](patterns/pattern-1-nvlink-degradation.md) - **Hardware:** replay corpus: none. Integration variants exercising real DCGM: Linux + NVIDIA GPU (flood-gated). **Functional rubrics:** diff --git a/NORTHSTARS.md b/docs/NORTHSTARS.md similarity index 88% rename from NORTHSTARS.md rename to docs/NORTHSTARS.md index d3377e20..f6ba5cf5 100644 --- a/NORTHSTARS.md +++ b/docs/NORTHSTARS.md @@ -2,7 +2,7 @@ The goals tracecore optimizes for. What success looks like, how it's measured, what we don't chase. -[`PRINCIPLES.md`](PRINCIPLES.md) tells you *why* we make decisions. [`STYLE.md`](STYLE.md) tells you *what* the conventions are. This document tells you *where we are trying to go.* +[`PRINCIPLES.md`](../PRINCIPLES.md) tells you *why* we make decisions. [`STYLE.md`](../STYLE.md) tells you *what* the conventions are. This document tells you *where we are trying to go.* Goals don't grow with the codebase; KPIs do. @@ -16,7 +16,7 @@ When a distributed AI training run breaks, the operator should be told what brok ## Adoption posture (RFC-0013) -> **Adopt > build.** Tracecore is an OpenTelemetry Collector distribution assembled via OCB from upstream + contrib components. 
In-house code is bounded to the four moat scopes named in [RFC-0013 §6](docs/rfcs/0013-distro-first-pivot.md): cross-signal pattern detectors, OTTL processors with windowed semantics, NCCL FlightRecorder parsing, and the install/overhead bench harness. Everything else adopts from upstream OTel + CNCF + OpenSSF projects, with patches contributed upstream first and forks only when upstream rejects. The objectives below are evaluated against this posture; in-house growth outside the four scopes is the exception, not the default. +> **Adopt > build.** Tracecore is an OpenTelemetry Collector distribution assembled via OCB from upstream + contrib components. In-house code is bounded to the four moat scopes named in [RFC-0013 §6](rfcs/0013-distro-first-pivot.md): cross-signal pattern detectors, OTTL processors with windowed semantics, NCCL FlightRecorder parsing, and the install/overhead bench harness. Everything else adopts from upstream OTel + CNCF + OpenSSF projects, with patches contributed upstream first and forks only when upstream rejects. The objectives below are evaluated against this posture; in-house growth outside the four scopes is the exception, not the default. ## Northstar @@ -78,7 +78,7 @@ Seven lines of work. Each has one accountable owner role, one hero KPI, supporti **Caveats:** - v0/v1 is pain-weighted (the ~30 layers mapping to the 15 patterns), not breadth-max. - Frontier layers (L38-L42: power/grid coupling, SDC, RLHF, MoE, FP8 numerics) are v2/v3 strategic differentiators, not coverage-completeness work. -- Assumes the OCB-assembled distribution posture per [RFC-0013](docs/rfcs/0013-distro-first-pivot.md): vendor / orchestrator coverage is delivered by adopting upstream + contrib receivers (`prometheusreceiver` against `dcgm-exporter` / ROCm / Intel / Habana; `filelogreceiver`; `journaldreceiver`; `k8sobjectsreceiver`) wired through the bundled recipe, not by building in-tree receivers. 
In-house code adds coverage only inside the four moat scopes (RFC-0013 §6) — primarily the pattern detectors and NCCL FlightRecorder receiver. Layer-count and "receivers at stable" KPIs above are measured against the OCB manifest's bundled receivers plus the moat components, not in-tree directories. +- Assumes the OCB-assembled distribution posture per [RFC-0013](rfcs/0013-distro-first-pivot.md): vendor / orchestrator coverage is delivered by adopting upstream + contrib receivers (`prometheusreceiver` against `dcgm-exporter` / ROCm / Intel / Habana; `filelogreceiver`; `journaldreceiver`; `k8sobjectsreceiver`) wired through the bundled recipe, not by building in-tree receivers. In-house code adds coverage only inside the four moat scopes (RFC-0013 §6) — primarily the pattern detectors and NCCL FlightRecorder receiver. Layer-count and "receivers at stable" KPIs above are measured against the OCB manifest's bundled receivers plus the moat components, not in-tree directories. - A non-NVIDIA stack gating the trajectory in a given quarter re-sequences the matrix with explicit written reason. No silent slip. --- @@ -106,15 +106,15 @@ Seven lines of work. 
Each has one accountable owner role, one hero KPI, supporti | Shutdown time | ≤1s p99 from SIGTERM | Self-telemetry, every shutdown | | Crash-the-workload incidents | Zero target; any occurrence is P0, fixed within 7 days, listed in CHANGELOG | Continuous | | Minimum-working-config | ≤20 lines YAML for default install | Per release | -| Per-component README quality | 100% of components meet [`STYLE.md`](STYLE.md) section requirements; doc-lint enforced | CI gate | +| Per-component README quality | 100% of components meet [`STYLE.md`](../STYLE.md) section requirements; doc-lint enforced | CI gate | | Self-telemetry SLOs | exporter failure rate ≤0.1% sustained; queue depth <80%; component restart <1/hr | Self-monitored; trips → P1 | | Error-message quality | every error includes component + operation + rank/host (custom `errchk` lint) | CI gate | | Degraded-mode survival | collector functional with any subset of vendor SDKs failing | Chaos-injection CI tests | | Time-to-identify-collector-misbehavior | ≤5 min via `/metrics` + `pprof` | Quarterly internal drill | | Minimal-privilege pod spec | only `CAP_SYS_PTRACE`; no privileged; read-only root FS; no host PID/IPC | CI-verified Helm chart | | Subcommand count | ≤5 in v0 | Per release | -| Documentation completeness | getting-started + troubleshooting + ≥4 backend integration recipes + [`FAILURE-MODES.md`](docs/FAILURE-MODES.md) | Ship gate for v0 | -| Operator NPS | ≥40 (above OSS-infra industry baseline); methodology in [`docs/nps.md`](docs/nps.md) | Quarterly aggregate; monthly during pilots | +| Documentation completeness | getting-started + troubleshooting + ≥4 backend integration recipes + [`FAILURE-MODES.md`](FAILURE-MODES.md) | Ship gate for v0 | +| Operator NPS | ≥40 (above OSS-infra industry baseline); methodology in [`docs/nps.md`](nps.md) | Quarterly aggregate; monthly during pilots | **Per-receiver overhead budgets** (sum ≤1 Mbps/node, ≤0.3% CPU, ≤100MB RSS at v0 active): @@ -186,20 +186,20 @@ Seven lines 
of work. Each has one accountable owner role, one hero KPI, supporti | KPI | Target | Cadence | |---|---|---| -| Reproducible builds | byte-identical via `diffoscope` in CI, per platform (linux/amd64 required; linux/arm64 opt-in) | Per release; breakage = P0 ([`PRINCIPLES.md`](PRINCIPLES.md) §12) | +| Reproducible builds | byte-identical via `diffoscope` in CI, per platform (linux/amd64 required; linux/arm64 opt-in) | Per release; breakage = P0 ([`PRINCIPLES.md`](../PRINCIPLES.md) §12) | | Release signing | every release artifact signed via `sigstore`/`cosign` | Per release | | SBOM | CycloneDX or SPDX, published with each release | Per release | -| Vulnerability disclosure SLA | 7-day acknowledgment; 30-day fix for critical; 90-day fix for high; CVE issued | Continuous; tracked in [`SECURITY.md`](SECURITY.md) | -| Dependency hygiene | `go mod tidy` clean per commit; weekly Dependabot; `govulncheck` in `make ci` (already in [`STYLE.md`](STYLE.md)) | Continuous | +| Vulnerability disclosure SLA | 7-day acknowledgment; 30-day fix for critical; 90-day fix for high; CVE issued | Continuous; tracked in [`SECURITY.md`](../SECURITY.md) | +| Dependency hygiene | `go mod tidy` clean per commit; weekly Dependabot; `govulncheck` in `make ci` (already in [`STYLE.md`](../STYLE.md)) | Continuous | | License purity | 100% Apache 2.0; `addlicense` enforced in `make ci` | Continuous | | Air-gapped install | tarball + offline-signed package; documented; works without internet | Validated per minor release | -**Operating rule:** *Trust under load is the product* ([`PRINCIPLES.md`](PRINCIPLES.md) §1). Any supply-chain regression - broken reproducibility, missing signatures, lapsed SBOM, missed disclosure SLA - is P0 and blocks the next release. +**Operating rule:** *Trust under load is the product* ([`PRINCIPLES.md`](../PRINCIPLES.md) §1). Any supply-chain regression - broken reproducibility, missing signatures, lapsed SBOM, missed disclosure SLA - is P0 and blocks the next release. 
**Caveats:** - SLSA L3 requires hermetic, parameterless builds signed by trusted infrastructure. The build system needs the work - not just the policy. - Reproducibility policy lives here (P0 designation, response); the CI gate that *enforces* it lives in O2. Same artifact, two homes - by design. -- Disclosure SLAs are aspirational until tracecore has a security inbox in place; [`SECURITY.md`](SECURITY.md) must name the contact before the SLA clock starts publicly. +- Disclosure SLAs are aspirational until tracecore has a security inbox in place; [`SECURITY.md`](../SECURITY.md) must name the contact before the SLA clock starts publicly. --- @@ -296,7 +296,7 @@ Seven lines of work. Each has one accountable owner role, one hero KPI, supporti | First-response PR review | <72hr first maintainer response (excluding weekends/holidays) | Continuous | | Time-to-merge for non-controversial PRs | <7 days median; <14 days p90 | Continuous | | Issue triage | <72hr first maintainer response with label or comment | Continuous | -| `make ci` execution time | <60s on a developer laptop ([`PRINCIPLES.md`](PRINCIPLES.md) §10) | Per CI run | +| `make ci` execution time | <60s on a developer laptop ([`PRINCIPLES.md`](../PRINCIPLES.md) §10) | Per CI run | | Quarterly retrospective | published as `docs/YYYY-QN-retrospective.md`; covers hits, misses, and RFC accept/reject/supersede log | Quarterly | **Operating rules:** @@ -331,18 +331,18 @@ Seven lines of work. 
Each has one accountable owner role, one hero KPI, supporti | KPI | Target | Cadence | |---|---|---| -| RFCs document load-bearing decisions worth writing down | Maintainer judgment; aim is documented reasoning, not RFC-coverage percentage (see [`PRINCIPLES.md`](PRINCIPLES.md) §15 and [`CONTRIBUTING.md`](CONTRIBUTING.md) §RFC process) | Continuous | +| RFCs document load-bearing decisions worth writing down | Maintainer judgment; aim is documented reasoning, not RFC-coverage percentage (see [`PRINCIPLES.md`](../PRINCIPLES.md) §15 and [`CONTRIBUTING.md`](../CONTRIBUTING.md) §RFC process) | Continuous | | Superseded RFCs preserved with link to successor | 100% | Continuous | | Maintainer count with merge authority | ≥3 by M9; ≥5 by M18 | Quarterly | | Maintainer onboarding doc | present in `docs/maintainership.md` | By M6 | | `CODEOWNERS` covers ≥80% of code paths | by M12 | Quarterly | | Lint-enforced principles | ≥6 of 15 principles enforced via `golangci-lint`; new rules added when a principle is violated once in code | Per release | | Audited principles (the remainder) | annual written audit attached to year-end release | Yearly | -| Code of Conduct | present, enforced, with named contact ([`CODE_OF_CONDUCT.md`](CODE_OF_CONDUCT.md) exists) | Continuous | -| Contribution guide | [`CONTRIBUTING.md`](CONTRIBUTING.md) must cover DCO, PR process, RFC process, security disclosure | By M3; reviewed yearly | +| Code of Conduct | present, enforced, with named contact ([`CODE_OF_CONDUCT.md`](../CODE_OF_CONDUCT.md) exists) | Continuous | +| Contribution guide | [`CONTRIBUTING.md`](../CONTRIBUTING.md) must cover DCO, PR process, RFC process, security disclosure | By M3; reviewed yearly | | Project decision log | RFC accept/reject/supersede status logged in each O6 quarterly retro | Quarterly | -**Operating rule:** *Decide late, write it down, revisit honestly* ([`PRINCIPLES.md`](PRINCIPLES.md) §15). Untracked decisions become precedent; precedent becomes orthodoxy. 
RFCs are one tool for writing things down; commit messages and code comments are others. Pick the lightest tool that still leaves a trail future contributors can find. +**Operating rule:** *Decide late, write it down, revisit honestly* ([`PRINCIPLES.md`](../PRINCIPLES.md) §15). Untracked decisions become precedent; precedent becomes orthodoxy. RFCs are one tool for writing things down; commit messages and code comments are others. Pick the lightest tool that still leaves a trail future contributors can find. **Caveats:** - Non-employee maintainer KPI depends on the adoption pipeline (O5) producing real contributors. If adoption stalls, this slips honestly - flagged in the quarterly review. @@ -392,8 +392,8 @@ Seven lines of work. Each has one accountable owner role, one hero KPI, supporti These will be written as RFCs as the work demands; each blocks specific KPI lock-in: -1. **Own-binary vs. extend `opentelemetry-collector-contrib`** - closed by [RFC-0013](docs/rfcs/0013-distro-first-pivot.md): tracecore is an OCB-assembled distribution from upstream + contrib + a thin moat module. RFC-0002 (originally an own-binary recommendation) was revised by RFC-0013; RFC-0001's "open question" closes the same way. -2. **Auto-update boundary** - closed: operator-pulled releases, no in-binary self-update mechanism. *(Resolved - [RFC-0008](docs/rfcs/0008-auto-update-boundary.md). A superseding RFC opens if a production-operator ask appears that operator-side delivery automation cannot serve.)* +1. **Own-binary vs. extend `opentelemetry-collector-contrib`** - closed by [RFC-0013](rfcs/0013-distro-first-pivot.md): tracecore is an OCB-assembled distribution from upstream + contrib + a thin moat module. RFC-0002 (originally an own-binary recommendation) was revised by RFC-0013; RFC-0001's "open question" closes the same way. +2. **Auto-update boundary** - closed: operator-pulled releases, no in-binary self-update mechanism. 
*(Resolved - [RFC-0008](rfcs/0008-auto-update-boundary.md). A superseding RFC opens if a production-operator ask appears that operator-side delivery automation cannot serve.)* 3. **eBPF integration scope** - consume eBPF data from other tools (v0-v1) or ship our own programs (v2+). Already in RFC 0001 as an open question. 4. **Helm vs. Operator deployment** - Helm for v0; Operator (CRDs) RFC-worthy when operator demand emerges. diff --git a/docs/README.md b/docs/README.md index a55d2ece..fa0eef8c 100644 --- a/docs/README.md +++ b/docs/README.md @@ -76,7 +76,7 @@ Source (receiver-side) recipes — RFC-0013 §migration PR-J replacements for th ## What goes where (for contributors) - **Why** a load-bearing decision was made → an RFC under [rfcs/](rfcs/). -- **What** a quarterly commitment looks like → [MILESTONES.md](../MILESTONES.md). +- **What** a quarterly commitment looks like → [MILESTONES.md](MILESTONES.md). - **Tracked deferrals** (with revisit triggers) → the matching per-milestone shard under [followups/](followups/) (or `_needs-prod-data.md` / `_needs-gpu.md` for resource-gated items). See [followups/README.md](followups/README.md) for which shard owns what. - **A failure mode** + the test that pins it → [FAILURE-MODES.md](FAILURE-MODES.md) (runtime) or the component RUNBOOK (per-component). - **A pattern that transfers across receivers** → in-source as a doc comment on the canonical implementation; the next author copies the code. diff --git a/docs/RELEASE-CHECKLIST.md b/docs/RELEASE-CHECKLIST.md index 6c0a1033..2d17654e 100644 --- a/docs/RELEASE-CHECKLIST.md +++ b/docs/RELEASE-CHECKLIST.md @@ -45,7 +45,7 @@ All patch/minor gates, plus: - [ ] **≥12 of 15 NORTHSTAR patterns shipped**. Cross-check `docs/patterns/pattern-*-*.md` against - [`NORTHSTARS.md`](../NORTHSTARS.md) Appendix A. + [`NORTHSTARS.md`](NORTHSTARS.md) Appendix A. 
- [ ] **Verdict schema v1.0 RC** published at `https://schema.tracecore.io/verdict/1.0.0-rc1.json` (or repo-mirrored if CDN not yet wired) diff --git a/docs/followups/M3.md b/docs/followups/M3.md index 1dfc601a..b33420e9 100644 --- a/docs/followups/M3.md +++ b/docs/followups/M3.md @@ -43,7 +43,7 @@ the verifier walkthrough; install/kubernetes/tracecore/README.md's job that downloads an artifact from a sibling job — exactly the "build influences signing" pattern L3 forbids. *Trigger:* M21 release-checklist pass. -- [ ] **Release-asset shape reconciliation** — see [MILESTONES.md §M21 Carry-forward from M3](../../MILESTONES.md). Milestone-tracked, not opportunistic. +- [ ] **Release-asset shape reconciliation** — see [MILESTONES.md §M21 Carry-forward from M3](../MILESTONES.md). Milestone-tracked, not opportunistic. - [ ] **Nightly drift cron.** A scheduled workflow that picks the latest published tag, re-runs steps 2–4 of `docs/reproducibility.md`, and opens an issue on diffoscope diff. The reproducibility claim diff --git a/docs/followups/README.md b/docs/followups/README.md index fef782fe..5226e974 100644 --- a/docs/followups/README.md +++ b/docs/followups/README.md @@ -105,7 +105,7 @@ out of politeness. - Milestone-tracked carry-forwards (cross-platform CI, chaos-light, benchstat in CI, sustained-load, backend integration recipes, the M8 receiver-author ergonomics review, etc.) live in - [`MILESTONES.md`](../../MILESTONES.md) as + [`MILESTONES.md`](../MILESTONES.md) as **"Carry-forward from M1.6:"** bullets under their target milestone — grep there for `carry-forward`. - The legacy path `docs/FOLLOWUPS.md` is preserved as a redirect diff --git a/docs/maintainership.md b/docs/maintainership.md index dd189770..cb9b0455 100644 --- a/docs/maintainership.md +++ b/docs/maintainership.md @@ -80,13 +80,13 @@ sponsoring maintainer is responsible for: - Recording the accepted decision in the RFC's `Status:` field (`accepted`) and ensuring the merge happens. 
- Updating any document the RFC directly invalidates in the same PR - — typically [NORTHSTARS.md § "Open questions tracked as RFCs"](../NORTHSTARS.md#open-questions-tracked-as-rfcs) + — typically [NORTHSTARS.md § "Open questions tracked as RFCs"](NORTHSTARS.md#open-questions-tracked-as-rfcs) when an RFC closes one of those entries (see RFC-0008 for the pattern). - Watching for follow-up implementation work to land or be tracked in the [`docs/followups/`](followups/README.md) shards per the currency rule in - [MILESTONES.md § "Keeping this document current"](../MILESTONES.md#keeping-this-document-current). + [MILESTONES.md § "Keeping this document current"](MILESTONES.md#keeping-this-document-current). **Rejecting an RFC.** Any maintainer may close an RFC with rationale in the closing comment. The rationale is the durable artifact — a diff --git a/docs/patterns/README.md b/docs/patterns/README.md index 67d844cf..1b17e182 100644 --- a/docs/patterns/README.md +++ b/docs/patterns/README.md @@ -35,7 +35,7 @@ These pages assume: - An OTLP backend is receiving the metrics (Prometheus, Datadog, Honeycomb, Mimir all confirmed in the backend matrix). -Four of [NORTHSTARS Appendix A's 15 patterns](../../NORTHSTARS.md#appendix-a-the-15-named-root-cause-patterns) have walkthroughs today - the DCGM-observable subset. The remaining eleven are tracked under their owning milestones in [`MILESTONES.md`](../../MILESTONES.md) (e.g., M17 pattern #1, M18 pattern #6, M19 pattern #14). New walkthroughs land alongside the receiver / detector that surfaces the pattern. +Four of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15-named-root-cause-patterns) have walkthroughs today - the DCGM-observable subset. The remaining eleven are tracked under their owning milestones in [`MILESTONES.md`](../MILESTONES.md) (e.g., M17 pattern #1, M18 pattern #6, M19 pattern #14). New walkthroughs land alongside the receiver / detector that surfaces the pattern. 
| Pattern | File | DCGM signal | |---|---|---| diff --git a/docs/proposals/gen-ai-training-semconv.md b/docs/proposals/gen-ai-training-semconv.md index 684cb914..5b926505 100644 --- a/docs/proposals/gen-ai-training-semconv.md +++ b/docs/proposals/gen-ai-training-semconv.md @@ -167,7 +167,7 @@ Backends materializing one timeseries per `rank` at 10⁵ ranks should expect ca [tracecore](https://github.com/TraceCoreAI/tracecore) is the reference implementation, with the namespace flagged PROPOSED on emit until this proposal lands. -The M13 pyspy receiver (design-locked in [RFC-0009](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0009-pyspy-receiver-scope.md)) commits to emitting `gen_ai.training.rank` and `gen_ai.training.world_size` on every record, with the derivation rules in this proposal's §Proposed names. Three further milestones on tracecore's roadmap (M14 Kineto profiler-trace consumer, M15 container stdout receiver, M18 straggler detector) extend the namespace to `step`, `job.id`, and `run.id` semantics as they ship; see [MILESTONES.md](https://github.com/TraceCoreAI/tracecore/blob/main/MILESTONES.md). The PROPOSED flag on each emitting receiver's README signals upstream-pending status; tracecore will adopt the official names verbatim if/when this proposal lands with divergence from the draft. +The M13 pyspy receiver (design-locked in [RFC-0009](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0009-pyspy-receiver-scope.md)) commits to emitting `gen_ai.training.rank` and `gen_ai.training.world_size` on every record, with the derivation rules in this proposal's §Proposed names. Three further milestones on tracecore's roadmap (M14 Kineto profiler-trace consumer, M15 container stdout receiver, M18 straggler detector) extend the namespace to `step`, `job.id`, and `run.id` semantics as they ship; see [MILESTONES.md](https://github.com/TraceCoreAI/tracecore/blob/main/docs/MILESTONES.md). 
The PROPOSED flag on each emitting receiver's README signals upstream-pending status; tracecore will adopt the official names verbatim if/when this proposal lands with divergence from the draft. ## Open questions for the SIG diff --git a/docs/research/m15-container-stdout.md b/docs/research/m15-container-stdout.md index 827c9ed7..069a2aca 100644 --- a/docs/research/m15-container-stdout.md +++ b/docs/research/m15-container-stdout.md @@ -3,7 +3,7 @@ > **Status (2026-05-22):** snapshot — superseded by RFC-0013 (filelogreceiver adoption). Body retained as decision input. See [RFC-0013](../rfcs/0013-distro-first-pivot.md). Synthesis of six parallel research passes for [`M15` in -MILESTONES.md](../../MILESTONES.md#m15-container-stdout-receiver), +MILESTONES.md](../MILESTONES.md#m15-container-stdout-receiver), performed 2026-05-19 against current upstream sources. Purpose: address the architectural knowledge gaps **before** any design or code work. @@ -154,7 +154,7 @@ If we depend on filelog (see §5), we inherit `fileconsumer` for free. Rolling our own on `fsnotify` only makes sense if we explicitly reject filelog; pulling `nxadm/tail` adds a stale dependency for no gain. -**Rubric note.** [`MILESTONES.md` §M15](../../MILESTONES.md#m15-container-stdout-receiver) +**Rubric note.** [`MILESTONES.md` §M15](../MILESTONES.md#m15-container-stdout-receiver) says the receiver "follows inode, not path". Poll-plus-fingerprint satisfies the *intent* (correctly track rotation) but does it by file identity (fingerprint hash) rather than by `fstat` inode number. The @@ -356,7 +356,7 @@ revised accordingly.** ### 7.1 Evidence that `gen_ai.training.*` is a deliberate project goal -[`NORTHSTARS.md`](../../NORTHSTARS.md) O4 row (line 38) names "Standards" +[`NORTHSTARS.md`](../NORTHSTARS.md) O4 row (line 38) names "Standards" as a top-level objective with `gen_ai.training.*` external implementations as the hero KPI. 
Line 202 states the goal verbatim: "author and shepherd the OpenTelemetry `gen_ai.training.*` semantic @@ -368,7 +368,7 @@ on someone else's vocabulary." The O4 commitments include: - First merged `gen_ai.training.*` upstream PR by M6 (M6 in progress). - Tracecore receivers emit semconv attribute names: 100% per release. -[`MILESTONES.md`](../../MILESTONES.md) line 29 reinforces this: +[`MILESTONES.md`](../MILESTONES.md) line 29 reinforces this: "M7 is absent by design — OTel `gen_ai.training.*` semconv work lives in `open-telemetry/semantic-conventions`, not this repo (recurring cadence in NORTHSTARS.md O4)." @@ -422,7 +422,7 @@ If O4 status is "stalled or rejected" at design-doc time, the fallback to `tracecore.training.*` is well-evidenced. If O4 status is "draft PR open and tracking", hold the bet. -This is a [`MILESTONES.md`](../../MILESTONES.md) rubric question with +This is a [`MILESTONES.md`](../MILESTONES.md) rubric question with 8 cross-receiver call sites (lines 29, 358, 360, 433, 453, 459, 460, 481 — see §11 R-1). Any change must be a single cross-cutting PR, not an M15-scope edit. diff --git a/docs/research/m16-kueue.md b/docs/research/m16-kueue.md index 6926c7d5..9532d5e1 100644 --- a/docs/research/m16-kueue.md +++ b/docs/research/m16-kueue.md @@ -11,7 +11,7 @@ Prom→OTLP translation + semconv, `k8sevents` template inventory, repo-internal constraints) plus follow-up local reads on `clockreceiver` and `dcgm`. This note is the input to the implementation PR. Where the rubric in -[MILESTONES.md](../../MILESTONES.md) is contradicted or under-specified by +[MILESTONES.md](../MILESTONES.md) is contradicted or under-specified by upstream reality, this note proposes the rubric update; the PR carries it. 
**Alpha scope:** 22 kueue-prefixed metric families passed through; diff --git a/docs/research/m5-m6-research.md b/docs/research/m5-m6-research.md index b25f601e..a3a5233b 100644 --- a/docs/research/m5-m6-research.md +++ b/docs/research/m5-m6-research.md @@ -1011,7 +1011,7 @@ Discussion is the feedback channel. [otlphttp-readme]: https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlphttpexporter [ch-readme]: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/clickhouseexporter [jaeger-207]: https://github.com/jaegertracing/jaeger/issues/207 -[rubric-fix]: ../../MILESTONES.md +[rubric-fix]: ../MILESTONES.md [doc-check]: ../../scripts/doc-check.sh [mlperf]: https://github.com/mlcommons/training [validate-cmd]: https://github.com/open-telemetry/opentelemetry-collector/blob/main/otelcol/command_validate.go diff --git a/docs/rfcs/0002-own-binary-vs-otel-contrib.md b/docs/rfcs/0002-own-binary-vs-otel-contrib.md index 445df0b2..331648c0 100644 --- a/docs/rfcs/0002-own-binary-vs-otel-contrib.md +++ b/docs/rfcs/0002-own-binary-vs-otel-contrib.md @@ -13,7 +13,7 @@ Tracecore ships as its own single-binary Go collector, not as a set of receivers ## Motivation -RFC 0001 deferred the choice between (a) importing `go.opentelemetry.io/collector/component` directly and effectively becoming a downstream of otelcol-contrib, and (b) defining our own component contract and shipping our own binary. Several northstar KPIs in [`NORTHSTARS.md`](../../NORTHSTARS.md) depend on which way this lands — O1's coverage trajectory, O2's convenience and reproducibility KPIs, O3's supply-chain ownership, and O5's adoption motion all assume an architectural answer. Continuing to defer is itself a decision (status quo = own binary, since that's what `cmd/tracecore/` is scaffolded as). 
+RFC 0001 deferred the choice between (a) importing `go.opentelemetry.io/collector/component` directly and effectively becoming a downstream of otelcol-contrib, and (b) defining our own component contract and shipping our own binary. Several northstar KPIs in [`NORTHSTARS.md`](../NORTHSTARS.md) depend on which way this lands — O1's coverage trajectory, O2's convenience and reproducibility KPIs, O3's supply-chain ownership, and O5's adoption motion all assume an architectural answer. Continuing to defer is itself a decision (status quo = own binary, since that's what `cmd/tracecore/` is scaffolded as). The cost of not deciding: KPI targets remain hypothetical, contributors don't know whether to file PRs upstream or here, and the trust narrative ("you can audit the build") stays ambiguous. @@ -92,7 +92,7 @@ This is a green-field decision; nothing exists to migrate from. The implications ## References - [RFC 0001: Architecture Overview](0001-architecture-overview.md) — closes its §Open questions item on the OTel collector dependency -- [`NORTHSTARS.md`](../../NORTHSTARS.md) — O1 architectural assumption depends on this RFC +- [`NORTHSTARS.md`](../NORTHSTARS.md) — O1 architectural assumption depends on this RFC - [`PRINCIPLES.md`](../../PRINCIPLES.md) §1 (trust under load), §2 (reversibility before optionality), §6 (defaults bias toward private) - [`STYLE.md`](../../STYLE.md) §Repo layout, §Component layout, §Component registration, §Vendor SDK isolation - [OpenTelemetry Collector contribution guide](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/CONTRIBUTING.md) diff --git a/docs/rfcs/0003-pipeline-runtime-and-component-contract.md b/docs/rfcs/0003-pipeline-runtime-and-component-contract.md index bada5bbd..b8d0a6fd 100644 --- a/docs/rfcs/0003-pipeline-runtime-and-component-contract.md +++ b/docs/rfcs/0003-pipeline-runtime-and-component-contract.md @@ -15,7 +15,7 @@ Component registration uses an **explicit `Factories` map** in 
`cmd/tracecore/co The contract closely mirrors OpenTelemetry Collector v0.152.0 shapes (verified verbatim against current upstream source), with three intentional divergences: `*slog.Logger` instead of `*zap.Logger`, simpler `TelemetrySettings`, and several patterns deferred until they have a concrete consumer. This RFC names every deferred pattern and the milestone that would introduce it. -This RFC closes [`MILESTONES.md`](../../MILESTONES.md) M1's open contract questions and is the prerequisite for every receiver milestone (M8–M16) and for M2 (self-telemetry). +This RFC closes [`MILESTONES.md`](../MILESTONES.md) M1's open contract questions and is the prerequisite for every receiver milestone (M8–M16) and for M2 (self-telemetry). ## Motivation @@ -575,8 +575,8 @@ Subsequent receiver milestones (M8 onward) implement this contract. If the contr - [RFC 0002: Own binary vs. extending opentelemetry-collector-contrib](0002-own-binary-vs-otel-contrib.md) - [`PRINCIPLES.md`](../../PRINCIPLES.md) §1 (trust under load), §2 (reversibility), §6 (private-first), §9 (failure modes are part of the API), §15 (decide late) - [`STYLE.md`](../../STYLE.md) §Concurrency, §Logging, §Error handling, §Component layout, §Component registration, §Testing, §Vendor SDK isolation -- [`NORTHSTARS.md`](../../NORTHSTARS.md) §O2 (Convenience & Quality) -- [`MILESTONES.md`](../../MILESTONES.md) §M1 +- [`NORTHSTARS.md`](../NORTHSTARS.md) §O2 (Convenience & Quality) +- [`MILESTONES.md`](../MILESTONES.md) §M1 - OpenTelemetry Collector v0.152.0 source (verified verbatim May 2026): - `component/component.go` — `Component` interface and lifecycle docs - `component/host.go` — `Host` interface diff --git a/docs/rfcs/0008-auto-update-boundary.md b/docs/rfcs/0008-auto-update-boundary.md index d3078488..87d4e095 100644 --- a/docs/rfcs/0008-auto-update-boundary.md +++ b/docs/rfcs/0008-auto-update-boundary.md @@ -101,7 +101,7 @@ There is no migration. 
This RFC ratifies the current default (no auto-update, no - [PRINCIPLES §2 "Reversibility before optionality"](../../PRINCIPLES.md#2-reversibility-before-optionality) — the trade-off this RFC resolves in favor of doing less. - [PRINCIPLES §6 "Defaults bias toward private"](../../PRINCIPLES.md#6-defaults-bias-toward-private) — no outbound network the operator did not configure. - [PRINCIPLES §11 "Backwards compatibility is something you opt into, not out of"](../../PRINCIPLES.md#11-backwards-compatibility-is-something-you-opt-into-not-out-of) — the post-1.0 deprecation cycle that this RFC does not weaken. -- [NORTHSTARS § "Open questions tracked as RFCs"](../../NORTHSTARS.md#open-questions-tracked-as-rfcs) — the open-questions list this RFC closes entry 2 of. +- [NORTHSTARS § "Open questions tracked as RFCs"](../NORTHSTARS.md#open-questions-tracked-as-rfcs) — the open-questions list this RFC closes entry 2 of. - [Sigstore cosign keyless verification](https://docs.sigstore.dev/cosign/signing/overview/) — the trust root the operator's delivery system uses against tracecore release artifacts. - [SLSA v1.0 Build L1](https://slsa.dev/spec/v1.0/levels#build-l1) — the provenance level tracecore release tags carry (per M3, `.github/workflows/release.yml`). diff --git a/docs/rfcs/0009-pyspy-receiver-scope.md b/docs/rfcs/0009-pyspy-receiver-scope.md index d71e7e8c..39070b80 100644 --- a/docs/rfcs/0009-pyspy-receiver-scope.md +++ b/docs/rfcs/0009-pyspy-receiver-scope.md @@ -13,7 +13,7 @@ Tracecore's first Python-runtime receiver samples in-process call stacks from a The receiver never reads another process's memory. No `process_vm_readv`. No `ptrace`. No signal trigger. Restricted-tier Pod Security Standards passes with zero capability additions; this is asserted in CI by a `yq` check on `helm template` output. 
The conventional "py-spy" name is preserved for operator familiarity; the implementation is unrelated to [benfred/py-spy](https://github.com/benfred/py-spy)'s ptrace path. -This design diverges from the M13 rubric in [`MILESTONES.md`](../../MILESTONES.md) §M13 in three places (signal trigger, capability reservation, hash width). Section [Rubric amendments](#rubric-amendments) enumerates the verbatim changes; they land in the same PR as this RFC's acceptance per the rule at MILESTONES.md §38. +This design diverges from the M13 rubric in [`MILESTONES.md`](../MILESTONES.md) §M13 in three places (signal trigger, capability reservation, hash width). Section [Rubric amendments](#rubric-amendments) enumerates the verbatim changes; they land in the same PR as this RFC's acceptance per the rule at MILESTONES.md §38. ## Motivation @@ -62,7 +62,7 @@ A single `sync.Mutex`-protected `inFlight bool` gates both. If the previous fram **Receiver, reader and emit loop** (Go). Reads the length-prefixed dump frame from the UDS, parses `faulthandler`'s line format (`Thread 0x (most recent call first):` followed by ` File "", line , in ` lines), hashes each frame list into `stack.id` via `hash/fnv` `New128a` over the byte-encoded `(file, func, line)` tuples in order, dedups against a per-cadence-window LRU keyed by `stack.id`, emits one `plog.LogRecord` per distinct stack per window with `repeat.count` accumulated. -`fnv128a` (not `fnv64a`) is chosen because `stack.id` is M18's cross-rank join key (per [`MILESTONES.md`](../../MILESTONES.md) M18 straggler-detector decision tree). A single 64-bit collision across ranks would produce a false straggler match. Birthday-bound derivation: `n²/(2·2^k)` distinct (rank, stack) pairs per fleet-day. For a 10⁴-rank training job × ~10 distinct main-thread stacks per rank per day × ~100 fleet-days (per the M18 replay-corpus design horizon), `n ≈ 10⁷`. 
At `n=10⁷`: 64-bit collision probability ≈ 2.7e-6 (one false match per ~370k fleet-days); 128-bit ≈ 1.5e-25 (effectively zero across the project lifetime). 128-bit is chosen so a multi-job aggregator running against a fleet for years does not produce a false straggler verdict from hash collision alone. Go's stdlib `hash/fnv` package exposes `New128a()` returning a `hash.Hash` with 16-byte output, no third-party dep. +`fnv128a` (not `fnv64a`) is chosen because `stack.id` is M18's cross-rank join key (per [`MILESTONES.md`](../MILESTONES.md) M18 straggler-detector decision tree). A single 64-bit collision across ranks would produce a false straggler match. Birthday-bound derivation: `n²/(2·2^k)` distinct (rank, stack) pairs per fleet-day. For a 10⁴-rank training job × ~10 distinct main-thread stacks per rank per day × ~100 fleet-days (per the M18 replay-corpus design horizon), `n ≈ 10⁷`. At `n=10⁷`: 64-bit collision probability ≈ 2.7e-6 (one false match per ~370k fleet-days); 128-bit ≈ 1.5e-25 (effectively zero across the project lifetime). 128-bit is chosen so a multi-job aggregator running against a fleet for years does not produce a false straggler verdict from hash collision alone. Go's stdlib `hash/fnv` package exposes `New128a()` returning a `hash.Hash` with 16-byte output, no third-party dep. **Cadence pairing with M18.** The 15s main-thread cadence is chosen against M18's "≥3 consecutive main-thread samples" threshold for the GIL-hold pattern (per `MILESTONES.md` M18 decision tree). 15s × 3 = 45s minimum sustained-state detection window. The Phase 3 deliverable list below carries an explicit cross-link fixture that asserts the M13 cadence × M18 threshold product holds at every build, so neither side can drift silently. @@ -181,7 +181,7 @@ There is no `cap_missing` row. The receiver requires no capability addition. ### Non-functional rubrics -Per [`MILESTONES.md`](../../MILESTONES.md) §M13. 
The four ceilings below are **design contracts** inherited from NORTHSTARS O2 per-receiver budgets and the M13 rubric. They are not measurements. The asserting tests below are Phase 1 (`pyspy-lint`, integration scaffold) / Phase 3 (overhead benchstat, 1-hour soak) deliverables; current numbers are targets, not validated bounds. The receiver promotes from alpha to beta only when overhead rubrics pass under benchstat p<0.05 across two consecutive `main` runs, at which point the "measured" wording replaces "design contract" wording here. +Per [`MILESTONES.md`](../MILESTONES.md) §M13. The four ceilings below are **design contracts** inherited from NORTHSTARS O2 per-receiver budgets and the M13 rubric. They are not measurements. The asserting tests below are Phase 1 (`pyspy-lint`, integration scaffold) / Phase 3 (overhead benchstat, 1-hour soak) deliverables; current numbers are targets, not validated bounds. The receiver promotes from alpha to beta only when overhead rubrics pass under benchstat p<0.05 across two consecutive `main` runs, at which point the "measured" wording replaces "design contract" wording here. - Sustained CPU ≤0.05% (design contract) via `syscall.Getrusage(RUSAGE_SELF)` delta over a 10-min run at default cadence. CI ceiling 5× (0.25%) for shared-runner variance, matching M14's identical multiplier. Measurement deferred to Phase 3 overhead benchstat. - Sustained egress ≤0.05 Mbps (design contract) via a counting OTLP sink over the same window. Measurement deferred to Phase 3. @@ -190,7 +190,7 @@ Per [`MILESTONES.md`](../../MILESTONES.md) §M13. The four ceilings below are ** ### Rubric amendments -Acceptance of this RFC carries the following same-PR edits to [`MILESTONES.md`](../../MILESTONES.md) §M13, per the rule at MILESTONES.md §38. Each row shows the full current bullet and the replacement. +Acceptance of this RFC carries the following same-PR edits to [`MILESTONES.md`](../MILESTONES.md) §M13, per the rule at MILESTONES.md §38. 
Each row shows the full current bullet and the replacement. **M13 Functional rubric, bullet on trigger mechanism:** > Current: *"Trigger is `faulthandler.dump_traceback(file=fd, all_threads=bool)` invoked via `os.kill(pid, SIGUSR1)` to a pre-registered handler. Integration test asserts `strace` shows no `ptrace`/`process_vm_readv` from tracecore. (per https://docs.python.org/3/library/faulthandler.html)"* @@ -324,9 +324,9 @@ The Python helper is `pip install`-only. No vendored CPython, no compiled extens - [Python docs / `faulthandler`](https://docs.python.org/3/library/faulthandler.html). Source for `dump_traceback`, `register`, file-descriptor support. The `register` `all_threads` registration-time binding motivates the choice not to use `register`; this source is the verifying reference. - [Kubernetes Pod Security Standards](https://kubernetes.io/docs/concepts/security/pod-security-standards/). Restricted tier admits no `capabilities.add` entries; this RFC's no-cap-addition posture is what unlocks restricted-tier passage. -- [NORTHSTARS.md §O2 Convenience & Quality](../../NORTHSTARS.md#o2-convenience--quality) (per-receiver budget), [§O4 Standards](../../NORTHSTARS.md#o4-standards) (semconv shepherding), [§O6 Velocity](../../NORTHSTARS.md#o6-velocity) (ecosystem-change SLA). +- [NORTHSTARS.md §O2 Convenience & Quality](../NORTHSTARS.md#o2-convenience--quality) (per-receiver budget), [§O4 Standards](../NORTHSTARS.md#o4-standards) (semconv shepherding), [§O6 Velocity](../NORTHSTARS.md#o6-velocity) (ecosystem-change SLA). - [PRINCIPLES.md §1 Trust under load is the product](../../PRINCIPLES.md#1-trust-under-load-is-the-product) (never crash the workload), [§9 Failure modes are part of the API](../../PRINCIPLES.md#9-failure-modes-are-part-of-the-api), [§15 Decide late, write it down, revisit honestly](../../PRINCIPLES.md#15-decide-late-write-it-down-revisit-honestly). -- [`MILESTONES.md`](../../MILESTONES.md) §M13. 
Functional and non-functional rubrics this RFC is scoped to satisfy, with the amendments enumerated in [Rubric amendments](#rubric-amendments). +- [`MILESTONES.md`](../MILESTONES.md) §M13. Functional and non-functional rubrics this RFC is scoped to satisfy, with the amendments enumerated in [Rubric amendments](#rubric-amendments). - [`docs/rfcs/0007-kernelevents-receiver-scope.md`](0007-kernelevents-receiver-scope.md). Closest comparable receiver-scope RFC; M13 reuses its `lifecycle.Lifecycle` plumbing and degraded-mode posture conventions. - [`install/kubernetes/tracecore/values.schema.json`](../../install/kubernetes/tracecore/values.schema.json), [`install/kubernetes/tracecore/policies/conftest/tracecore.rego`](../../install/kubernetes/tracecore/policies/conftest/tracecore.rego), [`install/kubernetes/tracecore/README.md`](../../install/kubernetes/tracecore/README.md) "Minimum-privilege deviations" section. The M5b chart's `SYS_PTRACE` reservation stays in place for a future host-process inspection receiver. This RFC does not consume it. - [py-spy issue #227](https://github.com/benfred/py-spy/issues/227), [CPython issue #137185](https://github.com/python/cpython/issues/137185). Cited in the original M13 rubric; this RFC corrects both attributions per [Rubric amendments](#rubric-amendments). 
diff --git a/docs/rfcs/0010-containerstdout-receiver-scope.md b/docs/rfcs/0010-containerstdout-receiver-scope.md index f98a22e1..6a676acf 100644 --- a/docs/rfcs/0010-containerstdout-receiver-scope.md +++ b/docs/rfcs/0010-containerstdout-receiver-scope.md @@ -458,7 +458,7 @@ Mapping from MILESTONES.md M15 rubric bullets (lines 357-375) to RFC subsections - [RFC-0006](0006-self-telemetry-surface.md) — `IncError(Kind)` cardinality contract - [RFC-0007](0007-kernelevents-receiver-scope.md) — streaming-source receiver shape reference - [RFC-0009](0009-pyspy-receiver-scope.md) — rank-attribution chain shared with M15 -- [MILESTONES.md M15 entry](../../MILESTONES.md#m15-container-stdout-receiver) +- [MILESTONES.md M15 entry](../MILESTONES.md#m15-container-stdout-receiver) - [PRINCIPLES.md](../../PRINCIPLES.md) §1, §6, §11, §15 -- [NORTHSTARS.md](../../NORTHSTARS.md) O2 (Convenience), O4 (Standards) +- [NORTHSTARS.md](../NORTHSTARS.md) O2 (Convenience), O4 (Standards) - containerd #11149 (open upstream): diff --git a/docs/rfcs/0011-m16-kueue-receiver-scope.md b/docs/rfcs/0011-m16-kueue-receiver-scope.md index da3c093c..250fa787 100644 --- a/docs/rfcs/0011-m16-kueue-receiver-scope.md +++ b/docs/rfcs/0011-m16-kueue-receiver-scope.md @@ -305,8 +305,8 @@ so the rubric and code are reviewed against each other. - [`docs/research/m16-kueue-spike/`](../research/m16-kueue-spike/) — runnable spike artifacts (30 tests under `-race`, live `/metrics` fixture, draft RUNBOOK/README/alerts). -- [MILESTONES.md M16](../../MILESTONES.md) — rubric. -- [NORTHSTARS.md:126](../../NORTHSTARS.md) — Scheduler ingest budget row. +- [MILESTONES.md M16](../MILESTONES.md) — rubric. +- [NORTHSTARS.md:126](../NORTHSTARS.md) — Scheduler ingest budget row. - [STRATEGY.md](../STRATEGY.md) — divergence table rows for Prometheus-scrape filtering and parser-library choice. - Kueue upstream: `kubernetes-sigs/kueue` release-0.17 branch. 
diff --git a/docs/rfcs/0012-kineto-receiver-scope.md b/docs/rfcs/0012-kineto-receiver-scope.md index b679db74..04544500 100644 --- a/docs/rfcs/0012-kineto-receiver-scope.md +++ b/docs/rfcs/0012-kineto-receiver-scope.md @@ -13,7 +13,7 @@ The Kineto receiver ingests `*.pt.trace.json` Chrome-trace files produced by `to ## Motivation -M14 in [`MILESTONES.md`](../../MILESTONES.md) commits tracecore to consuming PyTorch profiler dumps as a first-class signal, because Kineto traces are the only available source for per-kernel GPU activity, CPU-side `aten::*` op timing, and `cudaLaunchKernel` correlation. No upstream OpenTelemetry component parses Kineto's Chrome-trace dialect. The closest analogue, the OTel `pprofreceiver`, consumes pprof-format heap and CPU profiles, not the device-trace structure Kineto emits. Without a Kineto receiver, tracecore cannot answer the GPU-side half of the straggler-detection patterns in M18 (slow-kernel, dataloader-stall, NCCL-wait), and the `gen_ai.training.*` namespace tracecore is shepherding upstream (per [`docs/proposals/gen-ai-training-semconv.md`](../proposals/gen-ai-training-semconv.md)) has no GPU-trace surface to test against. +M14 in [`MILESTONES.md`](../MILESTONES.md) commits tracecore to consuming PyTorch profiler dumps as a first-class signal, because Kineto traces are the only available source for per-kernel GPU activity, CPU-side `aten::*` op timing, and `cudaLaunchKernel` correlation. No upstream OpenTelemetry component parses Kineto's Chrome-trace dialect. The closest analogue, the OTel `pprofreceiver`, consumes pprof-format heap and CPU profiles, not the device-trace structure Kineto emits. 
Without a Kineto receiver, tracecore cannot answer the GPU-side half of the straggler-detection patterns in M18 (slow-kernel, dataloader-stall, NCCL-wait), and the `gen_ai.training.*` namespace tracecore is shepherding upstream (per [`docs/proposals/gen-ai-training-semconv.md`](../proposals/gen-ai-training-semconv.md)) has no GPU-trace surface to test against. The cost of not doing it: M18 ships without GPU-attributed root cause, M14 stays at `⧗`, and the upstream `gen_ai.training.*` proposal lacks the cross-signal join evidence (Python stack from M13 plus GPU kernel from M14 plus container stdout from M15 plus NCCL FlightRecorder from M16) that grounds its `step_id` carrier-attribute claim. @@ -142,7 +142,7 @@ Resource attrs are set once per ingest at the `ptrace.ResourceSpans` scope, not **Fallthrough categories.** Kineto categories tracecore does not consume (`gpu_user_annotation`, `external_correlation`, `cpu_instant_event`, `overhead`, `cuda_sync`, `cuda_event`, `collective_comm`, MTIA/XPU/HPU variants) each increment `tracecore.receiver.kineto.unknown_category{kineto.category="…"}` and the event is dropped. Counter cardinality is bounded by Kineto's own `_activityTypeNames` enum (≤30 values). -The `gen_ai.training.*` namespace is the tracecore strategic bet per [NORTHSTARS.md §O4](../../NORTHSTARS.md) and the upstream proposal at [`docs/proposals/gen-ai-training-semconv.md`](../proposals/gen-ai-training-semconv.md). If the upstream PR is rejected, a collector-side `attributesprocessor` rename is the mitigation path. No M14 schema change is required to switch namespaces. +The `gen_ai.training.*` namespace is the tracecore strategic bet per [NORTHSTARS.md §O4](../NORTHSTARS.md) and the upstream proposal at [`docs/proposals/gen-ai-training-semconv.md`](../proposals/gen-ai-training-semconv.md). If the upstream PR is rejected, a collector-side `attributesprocessor` rename is the mitigation path. No M14 schema change is required to switch namespaces. 
### Step-ID detection @@ -210,7 +210,7 @@ When `Aggregate: true`, consecutive events in the parsed stream where all of: collapse into the preceding span. `repeat.count` increments, `dur` accumulates, `EndTime` advances. Step boundary or category change or name change flushes the accumulator. -The M14 rubric calls this "fr_trace-inspired." The analogy is purely semantic. Upstream `tools/flight_recorder/fr_trace.py` (PyTorch 2.5+) aggregates NCCL collectives across ranks, not repeated `cpu_op` events within a rank's profile. This RFC explicitly disclaims that the aggregation toggle is a port of upstream `fr_trace.py` behavior; it is a tracecore-defined emission-volume reduction motivated by [NORTHSTARS.md §O2](../../NORTHSTARS.md) per-receiver budgets. +The M14 rubric calls this "fr_trace-inspired." The analogy is purely semantic. Upstream `tools/flight_recorder/fr_trace.py` (PyTorch 2.5+) aggregates NCCL collectives across ranks, not repeated `cpu_op` events within a rank's profile. This RFC explicitly disclaims that the aggregation toggle is a port of upstream `fr_trace.py` behavior; it is a tracecore-defined emission-volume reduction motivated by [NORTHSTARS.md §O2](../NORTHSTARS.md) per-receiver budgets. ### Error taxonomy and self-telemetry @@ -218,7 +218,7 @@ In `pkg/kineto/errors.go` (matchable via `errors.Is`): - `ErrTraceMalformed`: JSON structure invalid (missing `traceEvents`, malformed event, overlapping ProfilerStep windows). Maps to `selftelemetry.KindParse`. - `ErrTruncated`: EOF mid-event or mid-array. Maps to `KindParse`. -- `ErrSchemaUnknown`: `schemaVersion` outside the tested set (1.0; 2.0 when verified). The receiver continues with best-effort parsing and bumps a `kineto.schema_drift` counter for operator visibility, per [NORTHSTARS.md §O6](../../NORTHSTARS.md) ecosystem-change handling. +- `ErrSchemaUnknown`: `schemaVersion` outside the tested set (1.0; 2.0 when verified). 
The receiver continues with best-effort parsing and bumps a `kineto.schema_drift` counter for operator visibility, per [NORTHSTARS.md §O6](../NORTHSTARS.md) ecosystem-change handling. - `ErrLimitExceeded`: `MaxBytes` / `MaxEvents` / `MaxStringBytes` cap hit. Maps to receiver-local `KindLimitExceeded`. Receiver-local `Kind` constants in `selftel.go` (extending the canonical set per the `dcgm` and `kernelevents` precedent): @@ -319,7 +319,7 @@ The accepted option is the design as specified above (single-pass streaming pars Listed by descending load-bearing-impact: the highest item is the only one the receiver cannot defend against in code alone. -1. **Kineto-side `schemaVersion` drift.** `schemaVersion` has been `1.0` across every torch version in the tested range, but the field is owned by the libkineto writer and could change without coordination. Mitigation: `ErrSchemaUnknown` is a soft error; the receiver continues with best-effort parsing and bumps the `kineto.schema_drift` counter so operators see drift before silent data loss. The 30-day ecosystem-change SLA in [NORTHSTARS.md §O6](../../NORTHSTARS.md) gates the response to a confirmed bump. +1. **Kineto-side `schemaVersion` drift.** `schemaVersion` has been `1.0` across every torch version in the tested range, but the field is owned by the libkineto writer and could change without coordination. Mitigation: `ErrSchemaUnknown` is a soft error; the receiver continues with best-effort parsing and bumps the `kineto.schema_drift` counter so operators see drift before silent data loss. The 30-day ecosystem-change SLA in [NORTHSTARS.md §O6](../NORTHSTARS.md) gates the response to a confirmed bump. 2. **Upstream rejection of the `gen_ai.training.*` namespace.** The proposal at [`docs/proposals/gen-ai-training-semconv.md`](../proposals/gen-ai-training-semconv.md) is upstream-pending. If OpenTelemetry rejects the namespace, M14's emitted attributes diverge from upstream semconv. 
Mitigation: a collector-side `attributesprocessor` rename closes the gap with no receiver code change required, because the namespace is a string prefix the processor can rewrite. @@ -331,7 +331,7 @@ Listed by descending load-bearing-impact: the highest item is the only one the r These items were drafted as Open Questions but resolved during design lock. They appear here as decided positions, so the reader sees what was considered alternative-by-alternative. The shipped Kineto receiver is `[DEFERRED]` pending OTel Profiles GA per [RFC-0013](0013-distro-first-pivot.md) §2 + §4 — re-evaluated at v0.3.0 PR-O. The standalone `docs/followups/M14.md` shard was removed in the v0.2.0 doc sweep (git history preserves it); reopen under the same path when the Profiles signal graduates. -1. **`schemaVersion` policy on unknown versions: tolerate + counter, do not refuse.** Kineto's `schemaVersion` has been `1.0` for the tested torch range. Refusing unknown versions would surface an in-tree-tested-set-mismatch at every Kineto 2.0 rollout before a 30-day SLA cycle can update the tested set. Tolerate-plus-counter (`tracecore.kineto.schema_drift{kineto.schema_version}`) lets operators see the drift without losing trace ingest; refuse only on parse failure. The `ErrSchemaUnknown` sentinel is reserved for a future strict-mode (`ParseOptions.StrictSchemaVersion` not yet added) so tooling can opt into hard rejection if needed. (Per [NORTHSTARS.md §O6](../../NORTHSTARS.md) ecosystem-change SLA.) +1. **`schemaVersion` policy on unknown versions: tolerate + counter, do not refuse.** Kineto's `schemaVersion` has been `1.0` for the tested torch range. Refusing unknown versions would surface an in-tree-tested-set-mismatch at every Kineto 2.0 rollout before a 30-day SLA cycle can update the tested set. Tolerate-plus-counter (`tracecore.kineto.schema_drift{kineto.schema_version}`) lets operators see the drift without losing trace ingest; refuse only on parse failure. 
The `ErrSchemaUnknown` sentinel is reserved for a future strict-mode (`ParseOptions.StrictSchemaVersion` not yet added) so tooling can opt into hard rejection if needed. (Per [NORTHSTARS.md §O6](../NORTHSTARS.md) ecosystem-change SLA.) 2. **`hostPID` vs shared-PID-namespace: default `none`; document `hostPID: true` as opt-in.** Both work for `/proc//environ` reads, but `hostPID: true` broadens the pod attack surface (every host process visible to the receiver). The receiver code does not require either; rank discovery falls back through filename + receiver-process env. The M5b chart should expose a `kineto.pidStrategy: host|shared|none` value defaulting to `none` (most restrictive; trades observability for safety), with `hostPID: true` documented as the high-fidelity opt-in for clusters that have already accepted the broader posture. Chart-side knob lands in a follow-up PR alongside the M5b chart update. @@ -369,13 +369,13 @@ PR C flips the M14 top-line to `☑ alpha`, matching the M11 precedent (alpha = ## References -- [`MILESTONES.md`](../../MILESTONES.md) §M14. Functional and non-functional rubrics this RFC is scoped to satisfy. +- [`MILESTONES.md`](../MILESTONES.md) §M14. Functional and non-functional rubrics this RFC is scoped to satisfy. - [`docs/rfcs/0003-pipeline-runtime-and-component-contract.md`](0003-pipeline-runtime-and-component-contract.md). `Component`, `ReceiverFactory`, and `TelemetrySettings` contract; the implementation-notes deviation list (`internal/consumer`, `internal/safe`, `NewFactory()` convention). - [`docs/rfcs/0009-pyspy-receiver-scope.md`](0009-pyspy-receiver-scope.md). M13 sibling receiver. Same `gen_ai.training.rank` join key; precedent for shipping an RFC pre-implementation. - [`docs/proposals/gen-ai-training-semconv.md`](../proposals/gen-ai-training-semconv.md). Upstream namespace bet that grounds M14's attribute choices. - [`docs/research/m15-container-stdout.md`](../research/m15-container-stdout.md). 
Env-var sources per orchestrator; Kubeflow PyTorchJob v1 PodSpec-`RANK` offset finding that drives the `/proc//environ` choice. - [`PRINCIPLES.md`](../../PRINCIPLES.md) §1 (trust under load), §9 (failure modes are part of the API), §12 (reproducible everything), §15 (decide late + RFC for schema). -- [`NORTHSTARS.md`](../../NORTHSTARS.md) §O2 (per-receiver budget), §O4 (`gen_ai.training.*` shepherding), §O6 (ecosystem-change 30-day SLA). +- [`NORTHSTARS.md`](../NORTHSTARS.md) §O2 (per-receiver budget), §O4 (`gen_ai.training.*` shepherding), §O6 (ecosystem-change 30-day SLA). - `internal/selftelemetry/interface.go`. The `Receiver` surface (`IncError`, `IncEmissions`, `ObserveLatency`, `SetDegraded`, `MarkActivity`) and canonical `Kind*` constants. - PyTorch sources (verified May 2026): - `pytorch/pytorch torch/profiler/profiler.py:44`: `PROFILER_STEP_NAME = "ProfilerStep"`. diff --git a/install/kubernetes/tracecore/Chart.yaml b/install/kubernetes/tracecore/Chart.yaml index 89f59fcf..95d25b3e 100644 --- a/install/kubernetes/tracecore/Chart.yaml +++ b/install/kubernetes/tracecore/Chart.yaml @@ -36,7 +36,7 @@ annotations: - name: chart-readme url: https://github.com/tracecoreai/tracecore/blob/main/install/kubernetes/tracecore/README.md - name: milestones - url: https://github.com/tracecoreai/tracecore/blob/main/MILESTONES.md + url: https://github.com/tracecoreai/tracecore/blob/main/docs/MILESTONES.md artifacthub.io/changes: | - kind: added description: Restricted Pod Security Standard DaemonSet with per-receiver toggles. diff --git a/module/pkg/nccl/fr_parser/limits_test.go b/module/pkg/nccl/fr_parser/limits_test.go index bc482e8f..36a61176 100644 --- a/module/pkg/nccl/fr_parser/limits_test.go +++ b/module/pkg/nccl/fr_parser/limits_test.go @@ -60,7 +60,7 @@ func TestParse_MaxItemsEnforced(t *testing.T) { // TestParse_MaxDepthEnforced asserts deeply nested MARK pushes abort // with ErrDepthExceeded before allocating a 2^16-deep Go tree. 
The -// fuzz spec (MILESTONES.md §M11) names this as an adversarial seed. +// fuzz spec (docs/MILESTONES.md §M11) names this as an adversarial seed. func TestParse_MaxDepthEnforced(t *testing.T) { // Build: PROTO 5, then N × (EMPTY_LIST MARK), then N × APPENDS, STOP. const N = 300 // > DefaultMaxDepth (256) diff --git a/module/pkg/nccl/fr_parser/record.go b/module/pkg/nccl/fr_parser/record.go index d5a72e11..d9626cfe 100644 --- a/module/pkg/nccl/fr_parser/record.go +++ b/module/pkg/nccl/fr_parser/record.go @@ -7,7 +7,7 @@ import "fmt" // Record is a decoded NCCL FlightRecorder ring-buffer entry. Field // names mirror PyTorch's torch/csrc/distributed/c10d/FlightRecorder.hpp // (struct Entry / pickled key names). Every field listed in -// MILESTONES.md §M11 line 504 is present. +// docs/MILESTONES.md §M11 line 504 is present. type Record struct { RecordID int64 PgID int64 diff --git a/module/pkg/nccl/fr_parser/record_test.go b/module/pkg/nccl/fr_parser/record_test.go index 0a067f6e..3669c5bf 100644 --- a/module/pkg/nccl/fr_parser/record_test.go +++ b/module/pkg/nccl/fr_parser/record_test.go @@ -10,7 +10,7 @@ import ( // TestDecodeRecords_RoundTrip asserts the Record schema round-trips // through Synthesize → Parse → DecodeRecords. Every PyTorch FR field -// enumerated in MILESTONES.md §M11 line 504 is exercised by at least +// enumerated in docs/MILESTONES.md §M11 line 504 is exercised by at least // one record in `wantRecords` below; if a future schema addition // requires another field, add it here AND extend Record. func TestDecodeRecords_RoundTrip(t *testing.T) { diff --git a/scripts/doc-check.sh b/scripts/doc-check.sh index 5562b19a..e942035a 100755 --- a/scripts/doc-check.sh +++ b/scripts/doc-check.sh @@ -59,7 +59,7 @@ link_count=0 while IFS= read -r mdfile; do # Extract every [text](target) markdown link, skipping links that live # inside fenced code blocks (``` … ```).
Plans and design docs frequently - # quote MILESTONES.md / RFC-style content inside fenced blocks; those + # quote docs/MILESTONES.md / RFC-style content inside fenced blocks; those # links resolve relative to the doc that owns them, not the quoting doc. while IFS= read -r target; do [ -z "$target" ] && continue @@ -168,19 +168,19 @@ echo "doc-check: $nonmd_count non-.md intra-repo link(s) resolve to on-disk path # --- Unverified-marker count baseline --------------------------------------- # -# Drift pattern this gate closes: `(unverified)` markers in MILESTONES.md +# Drift pattern this gate closes: `(unverified)` markers in docs/MILESTONES.md # are honest acknowledgments that a number / claim isn't yet pinned to a # source. They are a measurement-debt receipt — useful while in flight, # corrosive if they multiply silently over time. This gate asserts the # count does not grow between merges; reducing it (by adding a citation # or resolving the measurement) is always allowed. # -# Scope: MILESTONES.md only. The baseline lives at docs/.unverified-baseline +# Scope: docs/MILESTONES.md only. The baseline lives at docs/.unverified-baseline # and is bumped down whenever a marker resolves. To intentionally add a # new marker (e.g., for a new milestone), bump the baseline up in the same # PR with a one-line commit-message rationale. -milestones_doc="MILESTONES.md" +milestones_doc="docs/MILESTONES.md" baseline_file="docs/.unverified-baseline" if [ -f "$milestones_doc" ] && [ -f "$baseline_file" ]; then @@ -198,7 +198,7 @@ fi # --- docs/reproducibility.md presence + shell-syntax gate ------------------- # -# Drift pattern this gate closes: MILESTONES.md §M3 cites +# Drift pattern this gate closes: docs/MILESTONES.md §M3 cites # docs/reproducibility.md as the third-party verification recipe for every # published release. 
The file's value depends on its fenced bash/sh blocks # being executable as written; a stray `then` or unbalanced quote turns @@ -306,7 +306,7 @@ echo "doc-check: banned-phrase lint clean across $mdcount markdown file(s) outsi # --- Required-section assertion for docs/maintainership.md ------------------ # -# Drift pattern this gate closes: MILESTONES.md §M6 requires +# Drift pattern this gate closes: docs/MILESTONES.md §M6 requires # docs/maintainership.md to carry three H2 headings — "Commit access", # "RFC sponsorship", "Security disclosure" — each answering its heading # in sentence 1 and cross-referencing CODEOWNERS / docs/rfcs/ / @@ -343,7 +343,7 @@ fi # M6 integration recipes: examples-file + markers + staleness. # -# Drift pattern this gate closes: MILESTONES.md §M6 requires each +# Drift pattern this gate closes: docs/MILESTONES.md §M6 requires each # docs/integrations/.md recipe to (a) contain at least one # fenced ```yaml block whose first non-blank line resolves to a file # under docs/integrations/examples/, (b) declare the upstream contrib @@ -529,7 +529,7 @@ echo "doc-check: ${#required_docs[@]} required top-level doc(s) present" # M6 getting-started: <=5 fenced bash/sh blocks. # -# Drift pattern this gate closes: MILESTONES.md §M6 caps +# Drift pattern this gate closes: docs/MILESTONES.md §M6 caps # docs/getting-started.md at five imperative shell commands from # `git clone` to first OTLP byte. The cap exists so the page stays # scannable; >5 blocks means the recipe has grown beyond a quickstart @@ -551,7 +551,7 @@ fi # M6 nps.md: three H3 survey-question headings. # -# Drift pattern this gate closes: MILESTONES.md §M6 requires nps.md +# Drift pattern this gate closes: docs/MILESTONES.md §M6 requires nps.md # to declare three survey questions verbatim as `### ` headings # (recommend / worst-part / best-part). 
Without the gate, the prose # can drift away from the canonical survey copy and the methodology @@ -580,7 +580,7 @@ fi # M6 FAILURE-MODES.md: vendor SDK + exporter unreachable + config invalid. # -# Drift pattern this gate closes: MILESTONES.md §M6 requires +# Drift pattern this gate closes: docs/MILESTONES.md §M6 requires # FAILURE-MODES.md to enumerate at least one row each for # `vendor SDK failure`, `exporter unreachable`, `config invalid`, # each pointing to a real `Test*` identifier. The Test*-resolves gate diff --git a/tools/failure-inject/podevict/podevict.go b/tools/failure-inject/podevict/podevict.go index c1a56a77..45dcbf07 100644 --- a/tools/failure-inject/podevict/podevict.go +++ b/tools/failure-inject/podevict/podevict.go @@ -30,7 +30,7 @@ import ( var ErrClusterWriteUnsupported = errors.New("pod-evict: --allow-cluster-write is not yet implemented; rerun without the flag for a dry-run YAML on stdout") // validReasons enumerates the kubelet eviction reasons the M19 -// detector recognizes per MILESTONES.md §M19. Keeping the list small +// detector recognizes per docs/MILESTONES.md §M19. Keeping the list small // and explicit at this surface stops a typo from silently shipping a // fixture that won't match any detector. var validReasons = map[string]string{ diff --git a/tools/failure-inject/podevict/podevict_test.go b/tools/failure-inject/podevict/podevict_test.go index 1f5cc975..892c4e09 100644 --- a/tools/failure-inject/podevict/podevict_test.go +++ b/tools/failure-inject/podevict/podevict_test.go @@ -71,7 +71,7 @@ func TestPodEvict_ClusterWriteReturnsErr(t *testing.T) { // TestPodEvict_HeadlineRegex asserts the dry-run output carries // enough surface for the future M19 detector's headline regex -// (per MILESTONES.md §M19: /Pod .* evicted at .* due to disk pressure/). +// (per docs/MILESTONES.md §M19: /Pod .* evicted at .* due to disk pressure/). func TestPodEvict_HeadlineRegex(t *testing.T) { t.Parallel() var buf bytes.Buffer