Merged
6 changes: 6 additions & 0 deletions .github/workflows/kernelevents-integration.yml
@@ -11,11 +11,17 @@ on:
paths:
- 'components/receivers/kernelevents/**'
- 'internal/runtime/lifecycle/**'
- 'cmd/tracecore/**'
- 'internal/pipeline/**'
- 'internal/selftelemetry/**'
- '.github/workflows/kernelevents-integration.yml'
pull_request:
paths:
- 'components/receivers/kernelevents/**'
- 'internal/runtime/lifecycle/**'
- 'cmd/tracecore/**'
- 'internal/pipeline/**'
- 'internal/selftelemetry/**'
- '.github/workflows/kernelevents-integration.yml'

permissions:
6 changes: 6 additions & 0 deletions .github/workflows/pyspy-integration.yml
@@ -21,11 +21,17 @@ on:
paths:
- 'components/receivers/pyspy/**'
- 'internal/runtime/lifecycle/**'
- 'cmd/tracecore/**'
- 'internal/pipeline/**'
- 'internal/selftelemetry/**'
- '.github/workflows/pyspy-integration.yml'
pull_request:
paths:
- 'components/receivers/pyspy/**'
- 'internal/runtime/lifecycle/**'
- 'cmd/tracecore/**'
- 'internal/pipeline/**'
- 'internal/selftelemetry/**'
- '.github/workflows/pyspy-integration.yml'

permissions:
59 changes: 49 additions & 10 deletions MILESTONES.md
@@ -86,22 +86,52 @@ Every milestone, in every lane, satisfies all seven principles below. Depth live

- **Status:** ☑ delivered (PRs #12 + #13)
- **Depends on:** none (foundational)
- **Reference:** [RFC-0003](docs/rfcs/0003-pipeline-runtime-and-component-contract.md) — Component / Host / Factory contracts, two-phase shutdown (1s ingest + 10s drain), push-based consumer interfaces, factory map generated from `components.yaml`, operator-UX guarantees (config `file:line:col` errors, `safe.Call` for vendor SDKs, empty-pipeline boot, first-data log line, `pipelinetest.New(t)` test fixture, `tracecore validate` subcommand). Contract documented in [`internal/pipeline/README.md`](internal/pipeline/README.md).
- **Reference:** [RFC-0003](docs/rfcs/0003-pipeline-runtime-and-component-contract.md). Contract documented in [`internal/pipeline/README.md`](internal/pipeline/README.md).

**Functional rubrics:**
- ☑ Component / Host / Factory contracts and per-signal factory methods land per RFC-0003 §`Component` interface / §`Host` interface / §Per-signal factory methods; `internal/pipeline` package implements both.
- ☑ Two-phase shutdown (1 s ingest stop + 10 s drain) per RFC-0003 §Lifecycle; deadline-bounded `Runtime.Shutdown` in `internal/pipeline/runtime.go`.
- ☑ Push-based `Consumer` interfaces and `pdata` types per RFC-0003 §`Consumer` interfaces; `internal/consumer` package.
- ☑ Factory map generated from `components.yaml` via `tools/components-gen` per RFC-0003 §Factory registration and components-gen; output at `cmd/tracecore/components.go`.
- ☑ Vendor SDK panic wrapping via `safe.Call` per RFC-0003 §Vendor SDK panic wrapping; `internal/runtime/safe`.
- ☑ Operator UX: config errors carry `file:line:col`, empty-pipeline boot, first-data log line, `pipelinetest.New(t)` test fixture, `tracecore validate` subcommand; all per RFC-0003 §Operator UX patterns + §Tests + §CLI integration.

**Non-functional rubrics:**
- ☑ Component contract documented in `internal/pipeline/README.md` so receiver authors have a single canonical reference. (per RFC-0003 §Tests)
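
The two-phase shutdown rubric above (1 s ingest stop, then 10 s drain) can be sketched roughly as follows. This is a minimal illustration, not the code in `internal/pipeline/runtime.go`: the `stage` type and the `runPhase`, `shutdown`, and `demo` names are hypothetical, and the only values taken from the rubric are the two budgets.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Budgets quoted in the M1 rubric.
const (
	ingestBudget = 1 * time.Second
	drainBudget  = 10 * time.Second
)

// stage is a hypothetical shutdown step. A well-behaved stage returns
// promptly once ctx is done.
type stage func(ctx context.Context) error

// runPhase gives one stage its own deadline. The buffered channel lets the
// stage goroutine finish its send even if the deadline fired first.
func runPhase(parent context.Context, budget time.Duration, s stage) error {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	done := make(chan error, 1)
	go func() { done <- s(ctx) }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// shutdown stops ingest first, then drains. The drain phase always runs,
// even when the ingest stop overran its budget, and both errors are kept.
func shutdown(ctx context.Context, ingest, drain time.Duration, stopIngest, drainQueues stage) error {
	errIngest := runPhase(ctx, ingest, stopIngest)
	errDrain := runPhase(ctx, drain, drainQueues)
	return errors.Join(errIngest, errDrain)
}

// demo uses scaled-down budgets so it finishes quickly. When stuck is true
// the drain stage only returns once its deadline expires.
func demo(stuck bool) error {
	stopIngest := func(context.Context) error { return nil }
	drainQueues := func(ctx context.Context) error {
		if stuck {
			<-ctx.Done()
			return ctx.Err()
		}
		return nil
	}
	return shutdown(context.Background(), 10*time.Millisecond, 50*time.Millisecond, stopIngest, drainQueues)
}

func main() {
	noop := func(context.Context) error { return nil }
	fmt.Println("clean:", shutdown(context.Background(), ingestBudget, drainBudget, noop, noop))
	fmt.Println("stuck drain:", demo(true))
}
```

Giving each phase its own timeout keeps the worst-case shutdown time bounded by the sum of the two budgets, which is one way to read "deadline-bounded" in the rubric.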

### M2. Self-telemetry surface

- **Status:** ☑ delivered (PR #17)
- **Depends on:** M1
- **Reference:** [RFC-0006](docs/rfcs/0006-self-telemetry-surface.md) — `/metrics` + `/healthz` + `/readyz` on a configurable port; `selftelemetry.Receiver` interface injected via `TelemetrySettings.MeterProvider`; O2 SLO gauges (`exporter.failure_rate`, `queue.depth_ratio`, `component.restart_count_per_hour`); three OTel divergences closed (Host.ReportStatus, CreateSettings BuildInfo, TelemetrySettings MeterProvider). SLO thresholds NOT wired to `/readyz` — RFC-0006 chose degraded ≠ not-ready so k8s doesn't evict on transient backend issues; operators alert via Prometheus rules on the SLO gauges.
- **Carry-forward:** see [`docs/FOLLOWUPS.md`](docs/FOLLOWUPS.md) "M2 still-deferred items" (pprof endpoint, queue impl, restart mechanism, OTLP push reader, MetricsLevel knob, histogram tuning, per-role CreateSettings split, TracerProvider field).
- **Reference:** [RFC-0006](docs/rfcs/0006-self-telemetry-surface.md).
- **Carry-forward:** see [`docs/followups/M2.md`](docs/followups/M2.md) (pprof endpoint, queue impl, restart mechanism, OTLP push reader, MetricsLevel knob, histogram tuning, per-role CreateSettings split, TracerProvider field).

**Functional rubrics:**
- ☑ `/metrics` + `/healthz` + `/readyz` HTTP endpoints exposed on a configurable port per RFC-0006 §Operator surface; `internal/telemetry` ships the listener.
- ☑ `selftelemetry.Receiver` interface injected via `TelemetrySettings.MeterProvider` per RFC-0006 §Producer surface; `internal/selftelemetry/receiver_impl.go`.
- ☑ O2 SLO gauges shipped (`exporter.failure_rate`, `queue.depth_ratio`, `component.restart_count_per_hour`) per RFC-0006 §Architecture + NORTHSTARS O2; emitted from `internal/telemetry/slo.go`.
- ☑ Three OTel collector divergences closed: `Host.ReportStatus`, `CreateSettings.BuildInfo`, `TelemetrySettings.MeterProvider`; per RFC-0006 §Divergences from OTel collector v0.152.0.

**Non-functional rubrics:**
- ☑ SLO thresholds intentionally NOT wired into `/readyz`: RFC-0006 chose "degraded ≠ not-ready" so kubelet does not evict on transient backend issues; operators alert via Prometheus rules on the SLO gauges. (per RFC-0006 §Operator surface)
- ☑ Self-metric names carry a deprecation policy so renames don't silently break operator dashboards. (per RFC-0006 §Deprecation policy for self-metric names)
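
The "degraded ≠ not-ready" rubric above can be illustrated with a small sketch. It assumes a hypothetical `health` type with two independent flags; the real listener in `internal/telemetry` may be structured differently. The point shown is only that `/readyz` reflects startup state and deliberately ignores the SLO-degraded flag.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// health is a hypothetical holder for two independent signals.
type health struct {
	started  atomic.Bool // pipeline finished booting
	degraded atomic.Bool // some SLO gauge is past its threshold
}

// mux wires the two probe endpoints named in the rubric.
func (h *health) mux() *http.ServeMux {
	m := http.NewServeMux()
	m.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	m.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		if !h.started.Load() {
			http.Error(w, "starting", http.StatusServiceUnavailable)
			return
		}
		// h.degraded is intentionally not consulted here: a degraded
		// backend should raise an alert, not fail the readiness probe.
		fmt.Fprintln(w, "ready")
	})
	return m
}

// probe issues an in-memory GET and returns the HTTP status code.
func probe(h *health, path string) int {
	rec := httptest.NewRecorder()
	h.mux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := &health{}
	fmt.Println("before start:", probe(h, "/readyz"))
	h.started.Store(true)
	h.degraded.Store(true)
	fmt.Println("started but degraded:", probe(h, "/readyz"))
}
```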

### M4. Lint and test harness (partial)

- **Status:** ☑ partial (stock lint shipped; custom analyzers carried to Lane 2)
- **Depends on:** M1
- **Reference:** `golangci-lint` config with `errcheck`/`gofumpt`/`govet`/`revive`/`depguard`; license-header check via `addlicense`; `make ci` <60s on a dev laptop (per PRINCIPLES.md §10); `make doc-check` verifies test-name references in docs; `make alert-check` verifies RUNBOOK ↔ alerts.yaml parity.
- **Reference:** no RFC; convention is `.golangci.yml` + `Makefile` + `scripts/`.
- **Carry-forward to Lane 2:** custom `tools/errchk` (fmt.Errorf-must-include-component) and `tools/doclint` (component README 7-section enforcer) — defer until a real drift incident motivates the cost.

**Functional rubrics:**
- ☑ `golangci-lint` configured with `errcheck` / `gofumpt` / `govet` / `revive` / `depguard`; gate runs in CI `verify-lint` job. (per `.golangci.yml`)
- ☑ License-header check via `addlicense` runs in `make ci` and rejects un-headered Go files. (per `Makefile` `license-check` target)
- ☑ `make doc-check` verifies every Test/Fuzz/Benchmark identifier referenced in `docs/**.md` exists in the source tree. (per `scripts/doc-check.sh`)
- ☑ `make alert-check` verifies RUNBOOK ↔ `prometheus-alerts.example.yaml` parity per component. (per `scripts/alert-check.sh`)

**Non-functional rubrics:**
- ☑ `make ci` completes in <60 s on a dev laptop so the inner loop stays fast. (per PRINCIPLES.md §10)
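
The `make doc-check` rubric above describes a check that every Test/Fuzz/Benchmark identifier mentioned in the docs exists in the source tree. The shipped implementation is the shell script `scripts/doc-check.sh`; the Go sketch below only re-expresses the idea. The regular expression and the `missing` helper are assumptions made for illustration, and the set of defined names is passed in rather than discovered from disk.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// testName matches identifiers shaped like TestFoo, FuzzBar, BenchmarkBaz.
// Assumption: the prefix is followed by an upper-case letter, so ordinary
// words such as "Testing" are not picked up.
var testName = regexp.MustCompile(`\b(?:Test|Fuzz|Benchmark)[A-Z][A-Za-z0-9_]*\b`)

// missing returns the identifiers referenced in doc that are absent from
// defined, de-duplicated and sorted.
func missing(doc string, defined map[string]bool) []string {
	seen := map[string]bool{}
	var out []string
	for _, name := range testName.FindAllString(doc, -1) {
		if !defined[name] && !seen[name] {
			seen[name] = true
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Both identifiers below are made up for the example.
	doc := "Covered by `TestShutdownDrains` and `BenchmarkIngest`. Testing notes follow."
	defined := map[string]bool{"TestShutdownDrains": true}
	fmt.Println(missing(doc, defined))
}
```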

---

## Lane 1 — Release infrastructure
@@ -177,7 +207,7 @@ M20a/b/c are gates against the same artifact (`bench/install/run.sh`) at progres
- **Status:** ☐
- **Depends on:** M3, M5b, M6, ≥3 receivers at alpha (M8 partial; M10/M13/M15/M16 from Lanes 4–5; M11/M12 from Lane 6 if flood gate open)
- **NORTHSTARS coupling:** NORTHSTARS.md O1 targets 3 patterns covered at M6/v0. If the flood gate has not opened by M21, only M19 (pattern #14, GPU-independent) is guaranteed; M17 (pattern #1) and M18 (pattern #6, build-time coupled to M17's `cross_rank.go`) are at risk. Either the flood gate opens before M21 or NORTHSTARS O1 is explicitly relaxed in M21's release notes with written reason — silent divergence is a blocker per PRINCIPLES §15.
- **Carry-forward from M3:** asset-shape reconciliation owed at the v0.1.0 cut. M3's `release.yml` publishes raw `tracecore_<tag>_linux_amd64` (not the `.tar.gz` line 156 names), `*.cosign.bundle` (not the detached `*.sig` line 156 names), and `*.intoto.jsonl` (file is a Sigstore bundle JSON, not in-toto JSONL - extension is the de-facto convention but misleads sniff-by-extension tooling). M21 decides: keep raw-binary + bundle, switch to tar.gz + detached `.sig`, and pick a stable name for the Sigstore-bundle artifact. The hardening backlog (SLSA L3, build-env sanitization, CycloneDX `mod`→`app`, cosign / `gh attestation` flag tightening, nightly drift cron, repo tag-protection on `v*`, CI Actions linter, github-actions Dependabot, Rekor log-index in release notes) lives in `docs/FOLLOWUPS.md` "M3 release-pipeline hardening (post-PR #28)".
- **Carry-forward from M3:** asset-shape reconciliation owed at the v0.1.0 cut. M3's `release.yml` publishes raw `tracecore_<tag>_linux_amd64` (not the `.tar.gz` line 156 names), `*.cosign.bundle` (not the detached `*.sig` line 156 names), and `*.intoto.jsonl` (file is a Sigstore bundle JSON, not in-toto JSONL; extension is the de-facto convention but misleads sniff-by-extension tooling). M21 decides: keep raw-binary + bundle, switch to tar.gz + detached `.sig`, and pick a stable name for the Sigstore-bundle artifact. The hardening backlog (SLSA L3, build-env sanitization, CycloneDX `mod`→`app`, cosign / `gh attestation` flag tightening, nightly drift cron, repo tag-protection on `v*`, CI Actions linter, github-actions Dependabot, Rekor log-index in release notes) lives in [`docs/followups/M3.md`](docs/followups/M3.md) "M3 release-pipeline hardening (post-PR #28)".

**Functional rubrics:**
- Annotated git tag `v0.1.0` exists; `git describe --exact-match v0.1.0` resolves; tag is signed and `git tag -v v0.1.0` succeeds. (per PRINCIPLES §14)
@@ -236,7 +266,7 @@ M20a/b/c are gates against the same artifact (`bench/install/run.sh`) at progres
- `make bench` runs both benchmarks locally on a host that meets each one's hardware requirement; exits 0 iff both result files produced.
- Install benchmark runs scheduled nightly against `main` regardless of overhead-bench availability. (per NORTHSTARS O2 operating rule #1)
- Convenience-regression CI gate fails with `convenience-regression` label when install-time median crosses 5 min (install bench) or overhead median crosses 0.3% (overhead bench) vs previous green run. (per NORTHSTARS O2)
- `benchstat` wired into CI for M1.6 micro-benchmarks as non-gating delta print against `main`. (per `docs/FOLLOWUPS.md`)
- `benchstat` wired into CI for M1.6 micro-benchmarks as non-gating delta print against `main`. (per [`docs/followups/opportunistic.md`](docs/followups/opportunistic.md))
- Sustained-load benchmark exists: 1000 metrics/sec × 24h memory profile captured and archived. (per NORTHSTARS O2 RSS budget)
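
The convenience-regression gate in the rubrics above compares a median against a fixed limit (5 min install time, 0.3% overhead). A minimal sketch of that comparison follows. The `median` and `crossed` helpers are hypothetical, and reading "crosses ... vs previous green run" as "previous run at or below the limit, current run above it" is an assumption; the rubric does not spell out the exact rule.

```go
package main

import (
	"fmt"
	"sort"
)

// Limits quoted in the rubric.
const (
	installLimitSeconds = 300.0 // 5 min
	overheadLimitPct    = 0.3
)

// median returns the middle value of xs, or the mean of the two middle
// values when len(xs) is even. It returns 0 for an empty slice.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	s := append([]float64(nil), xs...) // copy so the caller's slice is untouched
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// crossed reports whether the current median is above limit while the
// previous green run's median was at or below it.
func crossed(prevGreen, current []float64, limit float64) bool {
	return median(prevGreen) <= limit && median(current) > limit
}

func main() {
	// Sample numbers are invented for the example.
	prevInstall := []float64{240, 250, 260}
	curInstall := []float64{290, 310, 330}
	prevOverhead := []float64{0.10, 0.12, 0.11}
	curOverhead := []float64{0.11, 0.13, 0.12}
	if crossed(prevInstall, curInstall, installLimitSeconds) ||
		crossed(prevOverhead, curOverhead, overheadLimitPct) {
		fmt.Println("label: convenience-regression")
	}
}
```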

**Non-functional rubrics:**
@@ -302,11 +332,20 @@ M20a/b/c are gates against the same artifact (`bench/install/run.sh`) at progres

- **Status:** ☑ shipped
- **Depends on:** M1
- **Reference:** [RFC-0007](docs/rfcs/0007-kernelevents-receiver-scope.md); merged in PR #16.
- **Reference:** [RFC-0007](docs/rfcs/0007-kernelevents-receiver-scope.md); merged in PR #16. Alpha unified-source logs receiver covering L2 + L9 (kernel + system events).
- **Carry-forward:** SXid (NVSwitch) classifier; regex shape blocked on a captured production fixture; tracked in [`docs/followups/M9.md`](docs/followups/M9.md).

Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). Tails `/dev/kmsg` and `journalctl --output=json --follow` behind one config block via the `source` interface in `components/receivers/kernelevents/source.go`; lifecycle (cancel, WaitGroup, panic recovery, channel ownership) lives in `internal/runtime/lifecycle.Lifecycle`. NVRM-prefixed Xid extraction populates `kernelevents.xid` and `gpu.id` (PCI BDF); RE2 filters compile at Start (DoS-safe); trace context propagated from journald `_TRACE_ID`/`_SPAN_ID`. Overhead at ≤0.02% CPU and ≤10 MB RSS verified via `bench_test.go` and Linux `Getrusage` in `cpu_linux_test.go`. Non-Linux builds ship as immediately-degraded stubs.
**Functional rubrics:**
- ☑ Tails `/dev/kmsg` and `journalctl --output=json --follow` behind one config block via the `source` interface. (per `components/receivers/kernelevents/source.go`; RFC-0007 §Design overview)
- ☑ Lifecycle (cancel, WaitGroup, panic recovery, channel ownership) lives in `internal/runtime/lifecycle.Lifecycle`, not inlined per receiver. (per RFC-0007 §Design overview)
- ☑ NVRM-prefixed Xid extraction populates `kernelevents.xid` and `gpu.id` (PCI BDF) on emitted log records. (per RFC-0007 §Design overview)
- ☑ RE2 `reason_regex` and source-filter regexes compile at Start (DoS-safe; a bad regex fails Validate with exit 2). (per RFC-0007 §Config schema)
- ☑ Trace context propagated from journald `_TRACE_ID` / `_SPAN_ID` onto emitted records. (per RFC-0007 §Design overview)
- ☑ Non-Linux builds ship as immediately-degraded stubs (no panic on boot; `Degraded()=true` from Start). (per RFC-0007 §Design overview)

- **Carry-forward:** SXid (NVSwitch) classifier — regex shape blocked on a captured production fixture; tracked in [`docs/FOLLOWUPS.md`](docs/FOLLOWUPS.md).
**Non-functional rubrics:**
- ☑ Overhead ≤0.02% CPU and ≤10 MB RSS, verified via `bench_test.go` and Linux `Getrusage` in `cpu_linux_test.go`. (per NORTHSTARS O2 per-receiver budget; RFC-0007 §Stability + deprecation policy)
- ☑ Stability + deprecation policy: receiver-local kinds documented; attribute renames require dual-emit cycle. (per RFC-0007 §Stability + deprecation policy)
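
The Xid extraction rubric above can be sketched as a single regular expression over the kernel message. The pattern below assumes the commonly seen NVIDIA driver line shape `NVRM: Xid (PCI:<domain>:<bus>:<device>): <code>, ...`; the receiver's actual pattern, and whether it stores the code as a string or an integer, are not stated here, so `extractXid` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
)

// xidLine captures the PCI address and the numeric Xid code.
// Assumption: the address is domain:bus:device in hex, as in "0000:3b:00".
var xidLine = regexp.MustCompile(
	`NVRM: Xid \(PCI:([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2})\): (\d+)`)

// extractXid returns the two attributes named in the rubric, or ok=false
// when msg is not an NVRM Xid line.
func extractXid(msg string) (attrs map[string]string, ok bool) {
	m := xidLine.FindStringSubmatch(msg)
	if m == nil {
		return nil, false
	}
	return map[string]string{
		"gpu.id":           m[1],
		"kernelevents.xid": m[2],
	}, true
}

func main() {
	msg := "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
	attrs, ok := extractXid(msg)
	fmt.Println(ok, attrs["kernelevents.xid"], attrs["gpu.id"])
}
```

Go's `regexp` package is RE2-based, so matching time is linear in the input length, which is in line with the DoS-safety note on the regex rubric.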

### M10. k8s events receiver

@@ -510,7 +549,7 @@ Lane 6 covers NVIDIA-side device telemetry (DCGM), NCCL collective diagnostics (
- **Depends on:** M1
- **Reference:** [RFC-0005](docs/rfcs/0005-dcgm-receiver-scope.md)
- **Hardware:** Linux + NVIDIA GPU host with `nv-hostengine` reachable; driver R580 LTSB + DCGM 4.4.x reference (per [endoflife.date/nvidia](https://endoflife.date/nvidia) — R580 active support ends 2026-08-04, refresh LTSB pin within Q3 2026; DCGM 4.4.2 is current core release per [NVIDIA/DCGM tags](https://github.com/NVIDIA/DCGM/tags))
- **Carry-forward:** (1) cgo client `client_cgo.go`; (2) hardware integration test at `//go:build dcgm,hardware`; (3) cardinality-cap calibration against ≥3 reference deployments; (4) `initial_delay` bench against real DGX boot; (5) per-metric toggles, vGPU, subprocess-isolation supervisor (defer-on-trigger; see `docs/FOLLOWUPS.md`).
- **Carry-forward:** (1) cgo client `client_cgo.go`; (2) hardware integration test at `//go:build dcgm,hardware`; (3) cardinality-cap calibration against ≥3 reference deployments; (4) `initial_delay` bench against real DGX boot; (5) per-metric toggles, vGPU, subprocess-isolation supervisor (defer-on-trigger; see [`docs/followups/M8.md`](docs/followups/M8.md)).

**Functional rubrics:**
- `pkg/dcgm/client_cgo.go` under `//go:build dcgm` implements every method on the `Client` interface against `github.com/NVIDIA/go-dcgm`; `cmd/tracecore receivers list` reports `dcgm [cgo]`. (per RFC-0005 §File layout + §Build-tag strategy)
14 changes: 8 additions & 6 deletions docs/followups/M3.md
@@ -192,12 +192,14 @@ script with exit 1 and a diff message naming both flag sets. -->
`docs/**.md` actually exists.~~ *Shipped:* `scripts/doc-check.sh`
header reads "verify every Test*/Fuzz*/Benchmark* name referenced
in docs"; wired into `make doc-check` and `make ci`.
- [ ] **Backfill Foundation milestone rubrics (M1, M2, M4, M9).**
Those entries predate the per-rubric `☑` convention adopted in
- [x] **Backfill Foundation milestone rubrics (M1, M2, M4, M9).**
~~Those entries predate the per-rubric `☑` convention adopted in
PR #53 and ship as prose-only delivery summaries. Restore their
original functional + non-functional rubric blocks from the
respective RFCs (RFC-0003 for M1, RFC-0006 for M2, no RFC for
M4 lint harness, RFC-0007 for M9) with `☑` prefixes. *Target:*
opportunistic; the missing audit trail is mild because each is
already shipped, but consistency across the doc helps future
readers grep for "shipped under what rubric set".
M4 lint harness, RFC-0007 for M9) with `☑` prefixes.~~ *Shipped:*
MILESTONES.md M1, M2, M4, M9 now each carry **Functional rubrics**
+ **Non-functional rubrics** blocks with `☑` bullets citing
RFC sections / shipped paths. Future readers can grep
"shipped under what rubric set" symmetrically across all
milestones.
25 changes: 11 additions & 14 deletions docs/followups/otlphttp.md
@@ -179,23 +179,20 @@ each item correctly.
P2 caught a 30s vs 10s drift in this PR. *Source:* P2
R-default-parity. *Trigger:* second doc-vs-code drift
incident.
- [ ] **Workflow paths trigger extends to substrate code.**
install-bench workflow now includes `cmd/tracecore/**` and
- [x] **Workflow paths trigger extends to substrate code.**
~~install-bench workflow now includes `cmd/tracecore/**` and
`internal/pipeline/**` (P3 fix). Other workflows should
audit their paths filters analogously. *Source:* P3-Rev1
audit their paths filters analogously.~~ *Source:* P3-Rev1
#10. *Audit (2026-05-20):* `chart.yml` and `install-bench.yml`
both include the substrate (`cmd/tracecore/**`,
`internal/**`); ✅ substrate-aware. `kernelevents-integration.yml`
and `pyspy-integration.yml` cover only
`components/receivers/<name>/**` + `internal/runtime/lifecycle/**`;
a `cmd/tracecore` factory wiring or `internal/pipeline`
contract change can land without re-running these integration
jobs. `chaos.yml` covers `tools/failure-inject/**` +
`internal/synthesis/**` only; substrate-coupling is indirect,
acceptable. *Remaining:* tighten the two integration workflows
to add `cmd/tracecore/**` + `internal/selftelemetry/**` +
`internal/pipeline/**`. *Trigger:* now (audit-driven; remaining
change is a 6-line YAML edit per workflow).
`internal/**`); ✅ substrate-aware. `chaos.yml` covers
`tools/failure-inject/**` + `internal/synthesis/**` only;
substrate-coupling is indirect, acceptable. *Shipped:*
`kernelevents-integration.yml` and `pyspy-integration.yml`
now include `cmd/tracecore/**` + `internal/pipeline/**` +
`internal/selftelemetry/**` in both push and pull_request
`paths:` filters, so a factory-wiring or pipeline-contract
change re-runs the two integration suites.
- [ ] **MILESTONES.md status flips bundled with delivery PRs.**
CONTRIBUTING L10 review blocker. This PR delivers M20a +
parts of M5 rubric but did not flip the checkboxes.
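
The effect of the widened `paths:` filters in `kernelevents-integration.yml` can be illustrated with a rough sketch. It treats a pattern ending in `/**` as a directory prefix and anything else as an exact path, which is a simplification of GitHub's glob rules rather than a faithful implementation. The `matches` and `triggers` helpers and the changed file name `cmd/tracecore/main.go` are hypothetical; the two filter lists are copied from the workflow diff.

```go
package main

import (
	"fmt"
	"strings"
)

// before and after mirror the kernelevents-integration.yml paths lists
// on either side of this change.
var before = []string{
	"components/receivers/kernelevents/**",
	"internal/runtime/lifecycle/**",
	".github/workflows/kernelevents-integration.yml",
}

var after = []string{
	"components/receivers/kernelevents/**",
	"internal/runtime/lifecycle/**",
	"cmd/tracecore/**",
	"internal/pipeline/**",
	"internal/selftelemetry/**",
	".github/workflows/kernelevents-integration.yml",
}

// matches is a simplified stand-in for GitHub's path filter: "dir/**"
// matches anything under dir/, every other pattern must match exactly.
func matches(pattern, path string) bool {
	if strings.HasSuffix(pattern, "/**") {
		return strings.HasPrefix(path, strings.TrimSuffix(pattern, "**"))
	}
	return pattern == path
}

// triggers reports whether any changed file matches any filter.
func triggers(filters, changed []string) bool {
	for _, path := range changed {
		for _, f := range filters {
			if matches(f, path) {
				return true
			}
		}
	}
	return false
}

func main() {
	changed := []string{"cmd/tracecore/main.go"} // hypothetical changed file
	fmt.Println("old filters trigger:", triggers(before, changed))
	fmt.Println("new filters trigger:", triggers(after, changed))
}
```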