From db883efea38b714eecc64050f495af04da0a0224 Mon Sep 17 00:00:00 2001
From: Tri Lam
Date: Sat, 30 May 2026 02:36:28 -0700
Subject: [PATCH 1/2] chore(pivot): pre-PR-A drift sweep + Helm security tighten
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Doc + Helm + small code-quality wins before PR-A (OCB skeleton). All
edits are zero-blocker, no OCB or recipe dependency.

Doc drift (10 files):
- README quickstart: components.yaml is current; builder-config.yaml
  at v0.1.0
- CONTRIBUTING: drop clockreceiver/dcgm walkthrough refs; route via
  RFC-0013 §6 + tracecore-components
- STYLE: mark components.yaml + components-gen superseded by RFC-0013
- STRATEGY: reconcile divergence row (metric names adopt otelcol_*)
- FAILURE-MODES: tracecore_* -> otelcol_*; flag deprecated-receiver rows
- getting-started: stale metric flagged as pre-v0.1.0 placeholder
- FLAKY-TESTS: clockreceiver test row marked scheduled-for-deletion
- docs/README: receiver index annotated per RFC-0013 §7
- MILESTONES: M5/M9/M13/M14/M16 lane rows marked obsolete-post-RFC-0013

Helm + chart (5 files):
- Chart.yaml: appVersion sync comment updated for post-PR-A target
- install README: pre-OCB build instructions flagged transitional
- values.schema.json: additionalProperties:false on pod/container
  securityContext (catches typo overrides)
- daemonset.yaml: downward-API env now gated on
  containerstdout/k8s_events/kernelevents enabled flags; fail-guard
  when containerstdout.enabled=true but rbac.create=false
- containerstdout-rbac.yaml: TODO(RFC-0013) on cluster-wide Node get
  (least-priv scoping defer to OCB-era RBAC refactor)
- conftest tracecore.rego: hostPath path validator for
  containerstdout-pod-logs volume

Code quality (3 files):
- internal/runtime/lifecycle/lifecycle.go: WARN log
  runtime.NumGoroutine() on Shutdown leak
- cmd/tracecore/integration_test.go: goleak.Find() runs
  unconditionally (was gated on test-suite exit=0)
- components/receivers/nccl_fr/factory.go: early cfg.Validate() in
  factory (mirrors containerstdout pattern)

Skipped (justified):
- Self-tel wiring extract: surviving receivers (containerstdout,
  nccl_fr) have divergent patterns; helper would impose a forced
  abstraction. Reassess post-PR-F.
- Containerstdout API surface trim: every exported symbol referenced
  by black-box _test.go files in same dir. Unexporting forces test
  rewrites for an M15 alpha. Defer to post-PR-A when test scaffold
  gets rewritten anyway.

LOC delta: +74 / -53. make ci green. Refs RFC-0013.

Signed-off-by: Tri Lam
---
 .gitignore                                  |  1 +
 CONTRIBUTING.md                             |  9 +++------
 MILESTONES.md                               |  6 +++---
 README.md                                   |  8 +++-----
 STYLE.md                                    |  1 +
 cmd/tracecore/integration_test.go           | 12 +++++------
 components/receivers/nccl_fr/factory.go     |  3 +++
 docs/FAILURE-MODES.md                       | 20 +++++++++----------
 docs/FLAKY-TESTS.md                         |  2 +-
 docs/README.md                              | 18 ++++++++---------
 docs/STRATEGY.md                            |  2 +-
 docs/getting-started.md                     |  8 ++++----
 install/kubernetes/tracecore/Chart.yaml     |  8 ++++----
 install/kubernetes/tracecore/README.md      |  4 +++-
 .../policies/conftest/tracecore.rego        | 10 ++++++++++
 .../templates/containerstdout-rbac.yaml     |  1 +
 .../tracecore/templates/daemonset.yaml      |  5 +++++
 .../kubernetes/tracecore/values.schema.json |  4 ++--
 internal/runtime/lifecycle/lifecycle.go     |  6 +++++-
 19 files changed, 75 insertions(+), 53 deletions(-)

diff --git a/.gitignore b/.gitignore
index fb4188ad..c0e62d09 100644
--- a/.gitignore
+++ b/.gitignore
@@ -47,3 +47,4 @@ go.work.sum
 
 # Catch-all for `.local.` overrides (direnv, dev configs, etc.)
 *.local.*
+.claude/scheduled_tasks.lock
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index f03f6bf7..a2c009e8 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -113,8 +113,8 @@ kind of receiver you're building.
 
 | Shape | When to copy | Reference |
 |---|---|---|
-| **Trivial ticker receiver** (no external SDK, no streaming source) | Simple metric source that ticks and emits | [`components/receivers/clockreceiver/`](components/receivers/clockreceiver/) |
-| **Vendor-SDK receiver** (NVML, NCCL, AMD-ROCm) | External SDK with cgo + stub split, blocking Connect, watch/scrape loop | [`components/receivers/dcgm/`](components/receivers/dcgm/) |
+| **Trivial ticker receiver** (no external SDK, no streaming source) | Simple metric source that ticks and emits | see RFC-0013 §6 (the moat) and the (forthcoming) `tracecoreai/tracecore-components` repo for new component templates |
+| **Vendor-SDK receiver** (NVML, NCCL, AMD-ROCm) | External SDK with cgo + stub split, blocking Connect, watch/scrape loop | see RFC-0013 §6 (the moat) and the (forthcoming) `tracecoreai/tracecore-components` repo for new component templates |
 | **Streaming-source logs receiver** (kmsg, journald, eBPF, ETW) | Continuously-readable FD or subprocess emitting log/event records | [`components/receivers/kernelevents/`](components/receivers/kernelevents/) |
 
 Before writing code, read the canonical implementation that matches
@@ -158,10 +158,7 @@ ticker / streaming-source receiver):
 7. **`components.yaml` row** + `make generate` - regenerates
    `cmd/tracecore/components.go`. `make ci` runs `generate-check`
    to refuse a stale generated file.
-8. **Docs-parity test** - mirror
-   `components/receivers/dcgm/docs_parity_test.go` for the README
-   ↔ example_config ↔ RUNBOOK ↔ IncError-kinds invariants. The
-   AST walker (`extractEmittedKinds`) is reusable.
+8. **Docs-parity test** - see RFC-0013 §6 (the moat) and the (forthcoming) `tracecoreai/tracecore-components` repo for new component templates covering the README ↔ example_config ↔ RUNBOOK ↔ IncError-kinds invariants.
 9. **`example-daemonset.yaml`** (if k8s is a target) - pin the shape
    with a parse test that asserts probes + security context
    defaults (mirror dcgm's `TestExampleDaemonset_…`).
diff --git a/MILESTONES.md b/MILESTONES.md
index e3bf8d72..42913146 100644
--- a/MILESTONES.md
+++ b/MILESTONES.md
@@ -81,10 +81,10 @@ Work runs in six parallel swim lanes plus a Foundation set. Lanes are organized
 |---|---|---|---|
 | Foundation | Runtime + self-telemetry | M1, M2, M4 (partial) | n/a |
 | 1 | Release infrastructure | M3 (shipped), M5b (shipped), M20, M21 | none for M3/M5b/M21; Linux + GPU (flood-gated) for M20b/c |
-| 2 | Test & failure infra | M4b (shipped), M5 | none for M4b; Linux + GPU (flood-gated) for M5 overhead bench |
+| 2 | Test & failure infra | M4b (shipped), M5 - obsolete post-RFC-0013 | none for M4b; Linux + GPU (flood-gated) for M5 overhead bench |
 | 3 | Documentation & community | M6, M23 | none |
-| 4 | Orchestrator signals | M9 (shipped), M10 (alpha), M15, M16, M19 | Linux, no GPU |
-| 5 | Framework & runtime profiling | M13, M14, M18 | Linux + Python, no GPU |
+| 4 | Orchestrator signals | M9 (shipped) - obsolete post-RFC-0013, M10 (alpha), M15, M16 - obsolete post-RFC-0013, M19 | Linux, no GPU |
+| 5 | Framework & runtime profiling | M13 - obsolete post-RFC-0013, M14 - obsolete post-RFC-0013, M18 | Linux + Python, no GPU |
 | 6 | GPU & NCCL signals (flood-gated) | M8 carry-fwd, M11 (alpha), M12, M17, M24 | Linux + NVIDIA GPU (M11 parser excepted) |
 
 ## Universal non-functional principles
diff --git a/README.md b/README.md
index c5e8520f..cc9cef24 100644
--- a/README.md
+++ b/README.md
@@ -44,12 +44,10 @@ What's safe to deploy today, what's still shipping. Honest read at HEAD; check [
 ## Quickstart
 
 ```sh
-# 1. Build the binary via the OpenTelemetry Collector Builder (OCB).
-#    builder-config.yaml pins upstream + contrib component versions to
-#    a single release cycle and pulls in the tracecore-components
-#    moat (ncclfrreceiver, rankjoinprocessor, patterndetectorprocessor).
+# 1. Build the binary. Today the build manifest is `components.yaml`;
+#    `builder-config.yaml` (OCB) lands at v0.1.0 per RFC-0013 §1.
 go mod download
-make build # delegates to: builder --config=builder-config.yaml
+make build
 
 # 2. Write a minimal config: telemetrygeneratorreceiver emitting a
 #    heartbeat metric every second into the debug exporter.
diff --git a/STYLE.md b/STYLE.md
index 6141665e..f5f73d97 100644
--- a/STYLE.md
+++ b/STYLE.md
@@ -93,6 +93,7 @@ apologetics, sentinel discipline). The summary:
 
 - **Explicit factory map**, not `init()` side effects.
 - Factories registered in `cmd/tracecore/components.go` by importing each component and calling its `NewFactory()`. The file is **generated** from `components.yaml` by `tools/components-gen`; `make generate` rewrites it after a `components.yaml` edit.
+- Note: components.yaml + components-gen are superseded by RFC-0013 §1 (OCB manifest) and will be deleted at PR-A.
 - Banned: `import _ "..."` for registration side effects.
 
 Example shape:
diff --git a/cmd/tracecore/integration_test.go b/cmd/tracecore/integration_test.go
index 040f4bcd..f514072b 100644
--- a/cmd/tracecore/integration_test.go
+++ b/cmd/tracecore/integration_test.go
@@ -51,14 +51,14 @@ func TestMain(m *testing.M) {
 	if p, err := binaryPathOnce(); err == nil && p != "" {
 		_ = os.RemoveAll(filepath.Dir(p))
 	}
-	// After tests pass, check for leaked goroutines. A receiver that
+	// After tests run, check for leaked goroutines. A receiver that
 	// ignores ctx and leaves its tick goroutine running past Shutdown
 	// is a contract violation `go test -race` would not catch — goleak
-	// does. Skip the check on failing tests; failures may legitimately
-	// leak partial state.
-	if code == 0 {
-		if err := goleak.Find(); err != nil {
-			fmt.Fprintln(os.Stderr, "TestMain: goroutine leak detected:", err)
+	// does. Run unconditionally: a failing unrelated test must not
+	// mask a leak we care about diagnosing.
+	if err := goleak.Find(); err != nil {
+		fmt.Fprintln(os.Stderr, "TestMain: goroutine leak detected:", err)
+		if code == 0 {
 			code = 1
 		}
 	}
diff --git a/components/receivers/nccl_fr/factory.go b/components/receivers/nccl_fr/factory.go
index c25376cd..c823ef50 100644
--- a/components/receivers/nccl_fr/factory.go
+++ b/components/receivers/nccl_fr/factory.go
@@ -39,6 +39,9 @@ func (*factory) CreateLogs(ctx context.Context, set pipeline.CreateSettings, cfg
 	if !ok {
 		return nil, fmt.Errorf("nccl_fr: unexpected config type %T", cfg)
 	}
+	if err := c.Validate(); err != nil {
+		return nil, fmt.Errorf("nccl_fr: %w", err)
+	}
 	r := newReceiver(set, c, next)
 	if set.Telemetry.MeterProvider != nil {
 		if rt, err := selftelemetry.NewReceiver(set.ID, set.Telemetry.MeterProvider); err == nil {
diff --git a/docs/FAILURE-MODES.md b/docs/FAILURE-MODES.md
index a0bfc14d..74a20fe9 100644
--- a/docs/FAILURE-MODES.md
+++ b/docs/FAILURE-MODES.md
@@ -26,14 +26,14 @@ For SREs landing here via the Prometheus alert payload rather than `runbook_url`
 | `KernelEventsHighParseErrorRate` | recipe: kernel-events transform | `journaldreceiver` + `filelogreceiver` + OTTL |
 | `K8sEventsReceiverDegraded` | recipe: k8s-events hint mapping | `k8sobjectsreceiver` + OTTL |
 | `K8sEventsBackpressureDrops` | recipe: k8s-events hint mapping | `k8sobjectsreceiver` + OTTL |
-| `ContainerStdoutDegraded` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutRotationStalled` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutBackpressure` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutCursorWriteFailed` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutWatchFlap` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutCardinalityFingerprint` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutCardinalityAttribution` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutCardinalityRateLimit` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutDegraded` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutRotationStalled` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutBackpressure` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutCursorWriteFailed` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutWatchFlap` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutCardinalityFingerprint` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutCardinalityAttribution` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutCardinalityRateLimit` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
 
 Per-alert `runbook_url` is wired in each recipe's bundled
 `prometheus-alerts.example.yaml`; this table is the doc-side
@@ -95,7 +95,7 @@ existing alerts survive the swap from in-tree receiver to upstream + OTTL.
 
 | Scenario | Behaviour | Test |
 |---|---|---|
-| 🟢 Exporter unreachable (network error mid-send) | `otlphttp` retries on retryable HTTP status codes (429/502/503/504) and on network errors with exponential backoff; final error propagates to the receiver as a `Permanent` or `Retryable` `kind`, surfaced via `tracecore_exporter_calls_total{outcome="error"}`. | `components/exporters/otlphttp/otlphttp_test.go::TestExporter_RetriesOnNetworkError` |
+| 🟢 Exporter unreachable (network error mid-send) | `otlphttp` retries on retryable HTTP status codes (429/502/503/504) and on network errors with exponential backoff; final error propagates to the receiver as a `Permanent` or `Retryable` `kind`, surfaced via `otelcol_exporter_calls_total{outcome="error"}` (post-RFC-0013 naming). | `components/exporters/otlphttp/otlphttp_test.go::TestExporter_RetriesOnNetworkError` |
 | 🟢 Vendor SDK failure (`dcgm-exporter` unreachable at Start) | `prometheusreceiver` records the scrape failure and emits `up=0`; the pipeline continues without the scrape target's contribution rather than failing the whole binary. Source: upstream `prometheusreceiver` scraping `dcgm-exporter` per the bundled recipe. | recipe-level alert `DCGMReceiverDegraded`; see `tracecore-recipes` chart. |
 | 🟢 Config invalid (unknown top-level field) | Loader returns a `file:line:column` error citing the offending key; `tracecore validate` and `tracecore collect` both exit 2 (EX_DATAERR) before any I/O. | `internal/config/load_test.go::TestLoad_UnknownTopLevelField_LineNumberedError` |
 | 🟢 Config invalid (bad exporter endpoint) | `otlphttp` rejects non-http/https schemes at validate time with `otlphttp: endpoint: scheme must be http or https`; exit 2. | `components/exporters/otlphttp/otlphttp_test.go::TestConfig_Validate_RejectsNonHTTPSchemes` |
@@ -114,7 +114,7 @@ existing alerts survive the swap from in-tree receiver to upstream + OTTL.
 | 🟢 `Server.Shutdown` without `Server.Start` | No-op. Returns nil. Mirrors `Component.Shutdown` idempotency. | `internal/telemetry/server_test.go::TestServer_ShutdownIsIdempotent` |
 | 🟡 `Server.Shutdown` exceeds `ShutdownBudget` (800ms) | http.Server cancels in-flight requests; returns within budget. Leaves headroom in PRINCIPLES §1 1s overall budget. | `internal/telemetry/server_test.go::TestServer_ShutdownWithin1s` |
 | 🟢 Repeated `Server.Start`/`Server.Shutdown` cycles | Listener fd is closed each Shutdown; no leak. `goleak` in TestMain catches regressions. | `internal/telemetry/server_test.go::TestServer_ShutdownIsIdempotent` (covers the cycle) |
-| 🔴 `/metrics` handler panics during scrape | promhttp catches the panic internally (its default behaviour) and returns 500. Our handler chain doesn't add recovery middleware in M2. **Carry-forward from M2:** dedicated panic-recovery + dedicated metric `tracecore.telemetry.scrape_panics_total`. | - (no current test; promhttp default is the only safety net) |
+| 🔴 `/metrics` handler panics during scrape | promhttp catches the panic internally (its default behaviour) and returns 500. Our handler chain doesn't add recovery middleware in M2. **Carry-forward from M2:** dedicated panic-recovery + dedicated metric `otelcol.telemetry.scrape_panics_total`. | - (no current test; promhttp default is the only safety net) |
 
 ## Operator quick reference
 
diff --git a/docs/FLAKY-TESTS.md b/docs/FLAKY-TESTS.md
index f24670f4..61301db6 100644
--- a/docs/FLAKY-TESTS.md
+++ b/docs/FLAKY-TESTS.md
@@ -6,7 +6,7 @@ proceed. Don't burn an iteration chasing a known flake.
 
 ## Active
 
-### `cmd/tracecore.TestIntegration_ClockreceiverToStdoutexporter`
+### `cmd/tracecore.TestIntegration_ClockreceiverToStdoutexporter` - test target scheduled for deletion per RFC-0013
 
 - **First seen flaking:** 2026-05-14 (M8 worktree, iteration 1)
 - **Platform:** darwin/arm64 (macOS dev laptop), `-race` enabled.
diff --git a/docs/README.md b/docs/README.md
index 8ba8f7d3..78b8b4b5 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -51,17 +51,17 @@ Legend: 👤 operator · 🛠️ contributor · 🏛️ maintainer · 🌐 exter
 
 | Path | Audience | Purpose |
 |---|---|---|
-| [components/receivers/dcgm/README.md](../components/receivers/dcgm/README.md) | 👤 🛠️ | DCGM receiver config, deployment notes, cardinality budget. |
-| [components/receivers/dcgm/RUNBOOK.md](../components/receivers/dcgm/RUNBOOK.md) | 👤 | Operator playbook keyed by alert + failure mode inventory. |
-| [components/receivers/kernelevents/README.md](../components/receivers/kernelevents/README.md) | 👤 🛠️ | kmsg + journald receiver config, event taxonomy, caveats. |
-| [components/receivers/kernelevents/RUNBOOK.md](../components/receivers/kernelevents/RUNBOOK.md) | 👤 | Operator playbook + failure mode inventory. |
-| [components/receivers/clockreceiver/README.md](../components/receivers/clockreceiver/README.md) | 🛠️ | Canonical test receiver; reference shape for new receivers. |
-| [components/receivers/k8sevents/README.md](../components/receivers/k8sevents/README.md) | 👤 🛠️ | Kubernetes events receiver config and event taxonomy. |
-| [components/receivers/k8sevents/RUNBOOK.md](../components/receivers/k8sevents/RUNBOOK.md) | 👤 | Operator playbook + failure mode inventory. |
+| [components/receivers/dcgm/README.md](../components/receivers/dcgm/README.md) | 👤 🛠️ | DCGM receiver config, deployment notes, cardinality budget. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/dcgm/RUNBOOK.md](../components/receivers/dcgm/RUNBOOK.md) | 👤 | Operator playbook keyed by alert + failure mode inventory. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/kernelevents/README.md](../components/receivers/kernelevents/README.md) | 👤 🛠️ | kmsg + journald receiver config, event taxonomy, caveats. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/kernelevents/RUNBOOK.md](../components/receivers/kernelevents/RUNBOOK.md) | 👤 | Operator playbook + failure mode inventory. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/clockreceiver/README.md](../components/receivers/clockreceiver/README.md) | 🛠️ | Canonical test receiver; reference shape for new receivers. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/k8sevents/README.md](../components/receivers/k8sevents/README.md) | 👤 🛠️ | Kubernetes events receiver config and event taxonomy. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/k8sevents/RUNBOOK.md](../components/receivers/k8sevents/RUNBOOK.md) | 👤 | Operator playbook + failure mode inventory. - scheduled for deletion per RFC-0013 §7 |
 | [components/receivers/nccl_fr/README.md](../components/receivers/nccl_fr/README.md) | 👤 🛠️ | NCCL FlightRecorder receiver + safe-pickle parser scope. |
 | [components/receivers/nccl_fr/RUNBOOK.md](../components/receivers/nccl_fr/RUNBOOK.md) | 👤 | Operator playbook + per-kind triage (incl. pickle deny-boundary). |
-| [components/receivers/pyspy/README.md](../components/receivers/pyspy/README.md) | 👤 🛠️ | On-demand Python stack-sampling receiver (faulthandler-based). |
-| [components/receivers/pyspy/RUNBOOK.md](../components/receivers/pyspy/RUNBOOK.md) | 👤 | Operator playbook + per-kind triage (RFC-0009 degraded modes). |
+| [components/receivers/pyspy/README.md](../components/receivers/pyspy/README.md) | 👤 🛠️ | On-demand Python stack-sampling receiver (faulthandler-based). - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/pyspy/RUNBOOK.md](../components/receivers/pyspy/RUNBOOK.md) | 👤 | Operator playbook + per-kind triage (RFC-0009 degraded modes). - scheduled for deletion per RFC-0013 §7 |
 | [components/exporters/otlphttp/README.md](../components/exporters/otlphttp/README.md) | 👤 🛠️ | OTLP/HTTP exporter - production sink to an OTel collector or backend. |
 | [internal/telemetry/README.md](../internal/telemetry/README.md) | 👤 🛠️ | Self-telemetry surface contract: `/metrics`, `/healthz`, `/readyz`. |
 | [internal/telemetry/SECURITY.md](../internal/telemetry/SECURITY.md) | 👤 | Security model for the self-telemetry endpoints. |
diff --git a/docs/STRATEGY.md b/docs/STRATEGY.md
index 88343aff..f146a9db 100644
--- a/docs/STRATEGY.md
+++ b/docs/STRATEGY.md
@@ -59,7 +59,7 @@ Current accepted divergences:
 | `CreateSettings` shape | `{ID, Telemetry, BuildInfo, _ struct{}}` shared across roles | `receiver.Settings` / `exporter.Settings` / `processor.Settings` with `{ID, TelemetrySettings (embedded), BuildInfo, _ struct{}}` | M2 added `BuildInfo` and the unkeyed-init guard; per-role split deferred to first cross-role config divergence. | **done** (M2) for BuildInfo + guard; **post-v1** for per-role split |
 | `TelemetrySettings` shape | `{Logger *slog.Logger, MeterProvider, Resource, _ struct{}}` | `{Logger *zap.Logger, TracerProvider, MeterProvider, Resource, _ struct{}}` | M2 added MeterProvider + guard; slog vs zap is a pre-existing permanent divergence; TracerProvider deferred to tracing milestone. | **done** (M2) for MeterProvider; **post-v1** for TracerProvider |
 | Self-telemetry HTTP default bind | `localhost:8888` | `0.0.0.0:8888` | Security tiebreaker: pre-1.0 tracecore favours safe-by-default. Operators override via `telemetry.listen`. | permanent |
-| Self-telemetry metric names | `tracecore.receiver.*` | `otelcol_*` | Separate surface; operators KNOW they're scraping tracecore. | permanent |
+| Self-telemetry metric names | `tracecore.receiver.*` | `otelcol_*` | Separate surface; operators KNOW they're scraping tracecore. | superseded by RFC-0013 §3 (tracecore adopts upstream `otelcol_*` at v0.1.0) |
 | `componentstatus` package | `internal/componentstatus` (in-tree) | `go.opentelemetry.io/collector/component/componentstatus` (external module) | Avoids pulling the OTel collector component module just for this one fn. Revisit at M22 OTel-compat sweep. | permanent (revisit M22) |
 | Factory interface decomposition | Three monolithic `ReceiverFactory` / `ProcessorFactory` / `ExporterFactory` | Base `component.Factory` embedded into per-role factories, plus `XStability()` per signal | Stability tracking deferred to post-v1.0 (RFC-0003 §"Deferred"); decomposition adds no value without it. | **post-v1** (when stability tracking lands) |
 | Sentinel error name | `ErrSignalNotSupported` | `pipeline.ErrSignalNotSupported` | Aligned in M1.6 Phase-19 audit (was `ErrSignalUnsupported`). | **done** (M1.6) |
diff --git a/docs/getting-started.md b/docs/getting-started.md
index 417513fc..ef1e8512 100644
--- a/docs/getting-started.md
+++ b/docs/getting-started.md
@@ -108,7 +108,7 @@ stable across the pivot per
 | `kernelevents.xid` | log attribute | NVRM Xid code (integer). |
 | `gpu.id` | log + metric attribute | PCI BDF for the GPU at fault. |
 | `gpu.vendor` | resource attribute | `nvidia` \| `amd` \| `intel` \| `habana`, normalized by OTTL. |
-| `tracecore.container.lines_per_s` | metric | Per-rank line rate (15s window). |
+| `tracecore.container.lines_per_s` | metric | Per-rank line rate (15s window). (pre-v0.1.0 placeholder; final name follows RFC-0013 §3) |
 | `gen_ai.training.rank` | resource attribute | Cross-receiver join key. |
 | `gen_ai.training.job_id` | resource attribute | Cross-receiver join key. |
 | NCCL FlightRecorder span schema | trace | `ncclfrreceiver` (retained per RFC-0013 §6). |
@@ -122,9 +122,9 @@ them unless you explicitly override.
 `service/telemetry` exposes Prometheus metrics, a health-check
 endpoint via the `healthcheck` extension, and zpages via the `zpages`
 extension. The default chart values enable all three on
-`localhost`. Standard `otelcol_*` metric names apply; alerts should
-not assume the legacy `tracecore_*` prefix (renamed per
-[RFC-0013 §Migration PR-B](rfcs/0013-distro-first-pivot.md#migration--rollout)).
+`localhost`. Standard `otelcol_*` metric names apply per
+[RFC-0013 §3](rfcs/0013-distro-first-pivot.md#3-customer-stable-contracts);
+alerts must not assume the legacy `tracecore_*` prefix.
 
 ## Add a real receiver
 
diff --git a/install/kubernetes/tracecore/Chart.yaml b/install/kubernetes/tracecore/Chart.yaml
index 738012f9..245288a0 100644
--- a/install/kubernetes/tracecore/Chart.yaml
+++ b/install/kubernetes/tracecore/Chart.yaml
@@ -5,10 +5,10 @@ type: application
 # version: chart-package version. Independent from appVersion; bumped on
 # any chart change. Pre-1.0.0 while the install surface evolves.
 version: 0.1.0
-# appVersion: the tracecore binary release the chart's default image
-# tag points at. Bumped in lockstep with internal/version/version.go
-# Version; drift between the two is gated by
-# scripts/chart-appversion-check.sh (wired into `make doc-check`).
+# appVersion tracks the tracecore binary version. Pre-PR-A: synced via
+# scripts/chart-appversion-check.sh against internal/version/version.go.
+# Post-PR-A: sync target becomes builder-config.yaml's `dist.version`
+# field per RFC-0013 §1.
 appVersion: "0.1.0-m9-alpha"
 kubeVersion: ">=1.28.0-0"
 home: https://github.com/tracecoreai/tracecore
diff --git a/install/kubernetes/tracecore/README.md b/install/kubernetes/tracecore/README.md
index 3d31d31b..78a9f402 100644
--- a/install/kubernetes/tracecore/README.md
+++ b/install/kubernetes/tracecore/README.md
@@ -229,7 +229,9 @@ grace (~45s) serializes the rollout. Override with
 clusters must mirror the image to an internal registry and set
 `--set image.repository=/tracecore` (+ optional
 `imagePullSecrets`). For local evaluation against an unreleased SHA,
-build the image with the in-tree `Dockerfile` and load it directly:
+build the image with the in-tree `Dockerfile` and load it directly.
+Note: build path pivots to OCB-assembled binary at v0.1.0 per RFC-0013.
+The instructions below are pre-pivot.
 `make build && docker buildx build --platform=linux/amd64 --build-arg
 BINARY_PATH=tracecore -t tracecore:dev . && kind load docker-image
 tracecore:dev`; then install with
diff --git a/install/kubernetes/tracecore/policies/conftest/tracecore.rego b/install/kubernetes/tracecore/policies/conftest/tracecore.rego
index 796de71d..0696f114 100644
--- a/install/kubernetes/tracecore/policies/conftest/tracecore.rego
+++ b/install/kubernetes/tracecore/policies/conftest/tracecore.rego
@@ -172,6 +172,16 @@ deny contains msg if {
 	msg := sprintf("%s/%s enables containerstdout but missing hostPath volume 'containerstdout-pod-logs' (/var/log/pods); the tailer cannot read CRI symlinks without it", [input.kind, input.metadata.name])
 }
 
+# containerstdout-pod-logs hostPath must point at /var/log/pods. Any
+# other path defeats the CRI symlink resolution contract and risks
+# tailing an attacker-controlled directory.
+deny contains msg if {
+	some vol in object.get(pod_spec, "volumes", [])
+	vol.name == "containerstdout-pod-logs"
+	vol.hostPath.path != "/var/log/pods"
+	msg := sprintf("containerstdout-pod-logs volume must mount /var/log/pods, got %q", [vol.hostPath.path])
+}
+
 # Required hostPath: cursor directory. Without it cursor writes go to
 # the read-only rootfs and every checkpoint increments
 # KindCursorWriteFailed — verified by TestFailure_CursorWriteFailedReadOnlyFs.
diff --git a/install/kubernetes/tracecore/templates/containerstdout-rbac.yaml b/install/kubernetes/tracecore/templates/containerstdout-rbac.yaml
index 26792a8b..4f342676 100644
--- a/install/kubernetes/tracecore/templates/containerstdout-rbac.yaml
+++ b/install/kubernetes/tracecore/templates/containerstdout-rbac.yaml
@@ -40,6 +40,7 @@ rules:
   # the downward API and reads only that node's record.
   - apiGroups: [""]
     resources: ["nodes"]
+    # TODO(RFC-0013): scope to per-node Node via aggregator pattern when refactoring RBAC for OCB swap
     verbs: ["get"]
 ---
 apiVersion: rbac.authorization.k8s.io/v1
diff --git a/install/kubernetes/tracecore/templates/daemonset.yaml b/install/kubernetes/tracecore/templates/daemonset.yaml
index 2114e915..16524538 100644
--- a/install/kubernetes/tracecore/templates/daemonset.yaml
+++ b/install/kubernetes/tracecore/templates/daemonset.yaml
@@ -26,6 +26,9 @@ spec:
       # apiserver client and needs the projected SA token mount.
       # Otherwise honour the explicit values.yaml knob (defaults
       # false because the alpha receiver set is API-server-free).
+      {{- if and .Values.receivers.containerstdout.enabled (not .Values.receivers.containerstdout.rbac.create) }}
+      {{- fail "containerstdout.enabled=true requires containerstdout.rbac.create=true; otherwise the ServiceAccount token is mounted without backing ClusterRole" }}
+      {{- end }}
       automountServiceAccountToken: {{ or .Values.serviceAccount.automount .Values.receivers.containerstdout.enabled }}
       {{- with .Values.imagePullSecrets }}
       imagePullSecrets: {{- toYaml . | nindent 8 }}
@@ -73,6 +76,7 @@ spec:
           securityContext: {{- toYaml .Values.containerSecurityContext | nindent 12 }}
           # Downward-API env vars for receivers that stamp pod/node
           # context onto emitted attributes.
+          {{- if or .Values.receivers.containerstdout.enabled .Values.receivers.k8s_events.enabled .Values.receivers.kernelevents.enabled }}
           env:
             - name: K8S_POD_NAME
               valueFrom:
@@ -86,6 +90,7 @@ spec:
               valueFrom:
                 fieldRef:
                   fieldPath: spec.nodeName
+          {{- end }}
           {{- if .Values.telemetry.enabled }}
           {{- $listen := .Values.telemetry.listen -}}
           {{- $port := regexReplaceAll ".*:" $listen "" }}
diff --git a/install/kubernetes/tracecore/values.schema.json b/install/kubernetes/tracecore/values.schema.json
index afd1cf8e..ad04a74b 100644
--- a/install/kubernetes/tracecore/values.schema.json
+++ b/install/kubernetes/tracecore/values.schema.json
@@ -56,7 +56,7 @@
 
     "podSecurityContext": {
       "type": "object",
-      "additionalProperties": true,
+      "additionalProperties": false,
       "properties": {
         "runAsNonRoot": { "type": "boolean" },
         "runAsUser": { "type": "integer", "minimum": 1 },
@@ -76,7 +76,7 @@
 
     "containerSecurityContext": {
       "type": "object",
-      "additionalProperties": true,
+      "additionalProperties": false,
       "properties": {
         "allowPrivilegeEscalation": { "type": "boolean", "const": false },
         "readOnlyRootFilesystem": { "type": "boolean", "const": true },
diff --git a/internal/runtime/lifecycle/lifecycle.go b/internal/runtime/lifecycle/lifecycle.go
index 730e4ce4..f5c41eb6 100644
--- a/internal/runtime/lifecycle/lifecycle.go
+++ b/internal/runtime/lifecycle/lifecycle.go
@@ -21,6 +21,7 @@ import (
 	"errors"
 	"fmt"
 	"log/slog"
+	"runtime"
 	"sync"
 	"sync/atomic"
 )
@@ -155,7 +156,10 @@ func (l *Lifecycle) Shutdown(ctx context.Context) error {
 	case <-done:
 		return nil
 	case <-ctx.Done():
-		l.logger.Warn("lifecycle: shutdown deadline elapsed before goroutine exited")
+		// NumGoroutine is process-wide, not lifecycle-local; surfacing it
+		// here lets operators eyeball whether the leak is plausibly ours.
+		l.logger.Warn("lifecycle: shutdown deadline elapsed before goroutine exited",
+			"process_goroutines", runtime.NumGoroutine())
 		err := fmt.Errorf("lifecycle shutdown: %w", ctx.Err())
 		l.mu.Lock()
 		l.shutdownErr = err

From f8324e5fb6e06d5b615e3914bf86a87ed2315a40 Mon Sep 17 00:00:00 2001
From: Tri Lam
Date: Sat, 30 May 2026 02:48:36 -0700
Subject: [PATCH 2/2] fix(helm): drop nonexistent k8s_events.enabled ref in daemonset gate

Root cause: my downward-API env gate referenced
.Values.receivers.k8s_events.enabled but values.yaml has no k8s_events
block (k8s_events receiver factory exists but never got a values.yaml
knob; receiver runs with factory defaults). Result: helm lint + helm
template both failed with "nil pointer evaluating interface {}.enabled".
Both render and install-bench CI jobs surfaced this.

Fix: tighten the gate to (containerstdout || kernelevents) only.
k8s_events receiver does not consume downward-API env vars in the
current factory wiring, so excluding it is correct.

Verified: alpine/helm:3.16.4 lint clean post-fix.

Signed-off-by: Tri Lam
---
 install/kubernetes/tracecore/templates/daemonset.yaml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/install/kubernetes/tracecore/templates/daemonset.yaml b/install/kubernetes/tracecore/templates/daemonset.yaml
index 16524538..ee720c44 100644
--- a/install/kubernetes/tracecore/templates/daemonset.yaml
+++ b/install/kubernetes/tracecore/templates/daemonset.yaml
@@ -76,7 +76,7 @@ spec:
           securityContext: {{- toYaml .Values.containerSecurityContext | nindent 12 }}
           # Downward-API env vars for receivers that stamp pod/node
           # context onto emitted attributes.
-          {{- if or .Values.receivers.containerstdout.enabled .Values.receivers.k8s_events.enabled .Values.receivers.kernelevents.enabled }}
+          {{- if or .Values.receivers.containerstdout.enabled .Values.receivers.kernelevents.enabled }}
           env:
             - name: K8S_POD_NAME
               valueFrom: