Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -47,3 +47,4 @@ go.work.sum

# Catch-all for `<thing>.local.<ext>` overrides (direnv, dev configs, etc.)
*.local.*
+.claude/scheduled_tasks.lock
9 changes: 3 additions & 6 deletions CONTRIBUTING.md
@@ -113,8 +113,8 @@ kind of receiver you're building.

| Shape | When to copy | Reference |
|---|---|---|
-| **Trivial ticker receiver** (no external SDK, no streaming source) | Simple metric source that ticks and emits | [`components/receivers/clockreceiver/`](components/receivers/clockreceiver/) |
-| **Vendor-SDK receiver** (NVML, NCCL, AMD-ROCm) | External SDK with cgo + stub split, blocking Connect, watch/scrape loop | [`components/receivers/dcgm/`](components/receivers/dcgm/) |
+| **Trivial ticker receiver** (no external SDK, no streaming source) | Simple metric source that ticks and emits | see RFC-0013 §6 (the moat) and the (forthcoming) `tracecoreai/tracecore-components` repo for new component templates |
+| **Vendor-SDK receiver** (NVML, NCCL, AMD-ROCm) | External SDK with cgo + stub split, blocking Connect, watch/scrape loop | see RFC-0013 §6 (the moat) and the (forthcoming) `tracecoreai/tracecore-components` repo for new component templates |
| **Streaming-source logs receiver** (kmsg, journald, eBPF, ETW) | Continuously-readable FD or subprocess emitting log/event records | [`components/receivers/kernelevents/`](components/receivers/kernelevents/) |

Before writing code, read the canonical implementation that matches
@@ -158,10 +158,7 @@ ticker / streaming-source receiver):
7. **`components.yaml` row** + `make generate` - regenerates
`cmd/tracecore/components.go`. `make ci` runs `generate-check`
to refuse a stale generated file.
-8. **Docs-parity test** - mirror
-`components/receivers/dcgm/docs_parity_test.go` for the README
-↔ example_config ↔ RUNBOOK ↔ IncError-kinds invariants. The
-AST walker (`extractEmittedKinds`) is reusable.
+8. **Docs-parity test** - see RFC-0013 §6 (the moat) and the (forthcoming) `tracecoreai/tracecore-components` repo for new component templates covering the README ↔ example_config ↔ RUNBOOK ↔ IncError-kinds invariants.
9. **`example-daemonset.yaml`** (if k8s is a target) - pin the
shape with a parse test that asserts probes + security context
defaults (mirror dcgm's `TestExampleDaemonset_…`).
6 changes: 3 additions & 3 deletions MILESTONES.md
@@ -81,10 +81,10 @@ Work runs in six parallel swim lanes plus a Foundation set. Lanes are organized
|---|---|---|---|
| Foundation | Runtime + self-telemetry | M1, M2, M4 (partial) | n/a |
| 1 | Release infrastructure | M3 (shipped), M5b (shipped), M20, M21 | none for M3/M5b/M21; Linux + GPU (flood-gated) for M20b/c |
-| 2 | Test & failure infra | M4b (shipped), M5 | none for M4b; Linux + GPU (flood-gated) for M5 overhead bench |
+| 2 | Test & failure infra | M4b (shipped), M5 - obsolete post-RFC-0013 | none for M4b; Linux + GPU (flood-gated) for M5 overhead bench |
| 3 | Documentation & community | M6, M23 | none |
-| 4 | Orchestrator signals | M9 (shipped), M10 (alpha), M15, M16, M19 | Linux, no GPU |
-| 5 | Framework & runtime profiling | M13, M14, M18 | Linux + Python, no GPU |
+| 4 | Orchestrator signals | M9 (shipped) - obsolete post-RFC-0013, M10 (alpha), M15, M16 - obsolete post-RFC-0013, M19 | Linux, no GPU |
+| 5 | Framework & runtime profiling | M13 - obsolete post-RFC-0013, M14 - obsolete post-RFC-0013, M18 | Linux + Python, no GPU |
| 6 | GPU & NCCL signals (flood-gated) | M8 carry-fwd, M11 (alpha), M12, M17, M24 | Linux + NVIDIA GPU (M11 parser excepted) |

## Universal non-functional principles
8 changes: 3 additions & 5 deletions README.md
@@ -44,12 +44,10 @@ What's safe to deploy today, what's still shipping. Honest read at HEAD; check [
## Quickstart

```sh
-# 1. Build the binary via the OpenTelemetry Collector Builder (OCB).
-# builder-config.yaml pins upstream + contrib component versions to
-# a single release cycle and pulls in the tracecore-components
-# moat (ncclfrreceiver, rankjoinprocessor, patterndetectorprocessor).
+# 1. Build the binary. Today the build manifest is `components.yaml`;
+# `builder-config.yaml` (OCB) lands at v0.1.0 per RFC-0013 §1.
go mod download
-make build # delegates to: builder --config=builder-config.yaml
+make build

# 2. Write a minimal config: telemetrygeneratorreceiver emitting a
# heartbeat metric every second into the debug exporter.
1 change: 1 addition & 0 deletions STYLE.md
@@ -93,6 +93,7 @@ apologetics, sentinel discipline). The summary:

- **Explicit factory map**, not `init()` side effects.
- Factories registered in `cmd/tracecore/components.go` by importing each component and calling its `NewFactory()`. The file is **generated** from `components.yaml` by `tools/components-gen`; `make generate` rewrites it after a `components.yaml` edit.
+- Note: components.yaml + components-gen are superseded by RFC-0013 §1 (OCB manifest) and will be deleted at PR-A.
- Banned: `import _ "..."` for registration side effects.

Example shape:
12 changes: 6 additions & 6 deletions cmd/tracecore/integration_test.go
@@ -51,14 +51,14 @@ func TestMain(m *testing.M) {
if p, err := binaryPathOnce(); err == nil && p != "" {
_ = os.RemoveAll(filepath.Dir(p))
}
-// After tests pass, check for leaked goroutines. A receiver that
+// After tests run, check for leaked goroutines. A receiver that
// ignores ctx and leaves its tick goroutine running past Shutdown
// is a contract violation `go test -race` would not catch — goleak
-// does. Skip the check on failing tests; failures may legitimately
-// leak partial state.
-if code == 0 {
-if err := goleak.Find(); err != nil {
-fmt.Fprintln(os.Stderr, "TestMain: goroutine leak detected:", err)
+// does. Run unconditionally: a failing unrelated test must not
+// mask a leak we care about diagnosing.
+if err := goleak.Find(); err != nil {
+fmt.Fprintln(os.Stderr, "TestMain: goroutine leak detected:", err)
+if code == 0 {
code = 1
}
}
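
The control-flow change in this hunk can be sketched in isolation. This is a minimal sketch, not the repository's code: `foldLeakCheck`, `noLeak`, and `oneLeak` are hypothetical names, and the `findLeaks` parameter stands in for `goleak.Find` so that the example runs with the standard library alone. It shows the leak check running regardless of the test result, while a non-zero exit code from a failing test is preserved.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// foldLeakCheck (hypothetical name) mirrors the new TestMain flow: the leak
// check always runs, and it only changes the exit code when the tests passed.
// findLeaks is an assumed stand-in for goleak.Find.
func foldLeakCheck(code int, findLeaks func() error) int {
	if err := findLeaks(); err != nil {
		fmt.Fprintln(os.Stderr, "TestMain: goroutine leak detected:", err)
		if code == 0 {
			code = 1
		}
	}
	return code
}

// noLeak is an illustrative detector that finds nothing.
func noLeak() error { return nil }

// oneLeak is an illustrative detector that always reports a leak.
func oneLeak() error { return errors.New("1 goroutine still running") }

func main() {
	fmt.Println(foldLeakCheck(0, noLeak))  // tests passed, no leak
	fmt.Println(foldLeakCheck(0, oneLeak)) // tests passed, leak found
	fmt.Println(foldLeakCheck(2, oneLeak)) // tests failed, leak still reported
}
```

With these inputs the three calls return 0, 1, and 2. The leak is reported on stderr in the last two cases, but it does not overwrite the exit code of an already failing run.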
3 changes: 3 additions & 0 deletions components/receivers/nccl_fr/factory.go
@@ -39,6 +39,9 @@ func (*factory) CreateLogs(ctx context.Context, set pipeline.CreateSettings, cfg
if !ok {
return nil, fmt.Errorf("nccl_fr: unexpected config type %T", cfg)
}
+if err := c.Validate(); err != nil {
+return nil, fmt.Errorf("nccl_fr: %w", err)
+}
r := newReceiver(set, c, next)
if set.Telemetry.MeterProvider != nil {
if rt, err := selftelemetry.NewReceiver(set.ID, set.Telemetry.MeterProvider); err == nil {
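
The added guard rejects an invalid config before any receiver state is built, and wraps the error with `%w` so callers can still match the underlying cause. A minimal sketch of that pattern follows. The `Config` type with a single `DumpDir` field, the `errEmptyDumpDir` sentinel, and the `createLogs` helper are all hypothetical; the real nccl_fr config fields are not shown in this diff.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical stand-in for the receiver config.
type Config struct {
	DumpDir string // assumed field, for illustration only
}

// errEmptyDumpDir is an illustrative sentinel error.
var errEmptyDumpDir = errors.New("dump_dir must not be empty")

// Validate reports whether the config is usable.
func (c *Config) Validate() error {
	if c.DumpDir == "" {
		return errEmptyDumpDir
	}
	return nil
}

// createLogs (hypothetical) follows the order in the hunk: type-assert the
// config, validate it, and only then construct anything.
func createLogs(cfg any) (string, error) {
	c, ok := cfg.(*Config)
	if !ok {
		return "", fmt.Errorf("nccl_fr: unexpected config type %T", cfg)
	}
	if err := c.Validate(); err != nil {
		// %w keeps the cause available to errors.Is / errors.As.
		return "", fmt.Errorf("nccl_fr: %w", err)
	}
	return "receiver for " + c.DumpDir, nil
}

func main() {
	_, err := createLogs(&Config{})
	fmt.Println(err)
	fmt.Println(errors.Is(err, errEmptyDumpDir))

	name, err := createLogs(&Config{DumpDir: "/tmp/fr"})
	fmt.Println(name, err)
}
```

Validating inside the factory means a caller that constructs the config programmatically, bypassing whatever loader normally validates it, still gets a prefixed error instead of a half-built receiver.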
20 changes: 10 additions & 10 deletions docs/FAILURE-MODES.md
@@ -26,14 +26,14 @@ For SREs landing here via the Prometheus alert payload rather than `runbook_url`
| `KernelEventsHighParseErrorRate` | recipe: kernel-events transform | `journaldreceiver` + `filelogreceiver` + OTTL |
| `K8sEventsReceiverDegraded` | recipe: k8s-events hint mapping | `k8sobjectsreceiver` + OTTL |
| `K8sEventsBackpressureDrops` | recipe: k8s-events hint mapping | `k8sobjectsreceiver` + OTTL |
-| `ContainerStdoutDegraded` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutRotationStalled` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutBackpressure` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutCursorWriteFailed` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutWatchFlap` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutCardinalityFingerprint` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutCardinalityAttribution` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
-| `ContainerStdoutCardinalityRateLimit` | in-tree `containerstdout` (until v0.2.0) | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutDegraded` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutRotationStalled` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutBackpressure` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutCursorWriteFailed` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutWatchFlap` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutCardinalityFingerprint` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutCardinalityAttribution` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |
+| `ContainerStdoutCardinalityRateLimit` | in-tree `containerstdout` (until v0.2.0) - scheduled for deletion per RFC-0013 | `filelogreceiver` + container stanza + `file_storage` |

Per-alert `runbook_url` is wired in each recipe's bundled
`prometheus-alerts.example.yaml`; this table is the doc-side
@@ -95,7 +95,7 @@ existing alerts survive the swap from in-tree receiver to upstream + OTTL.

| Scenario | Behaviour | Test |
|---|---|---|
-| 🟢 Exporter unreachable (network error mid-send) | `otlphttp` retries on retryable HTTP status codes (429/502/503/504) and on network errors with exponential backoff; final error propagates to the receiver as a `Permanent` or `Retryable` `kind`, surfaced via `tracecore_exporter_calls_total{outcome="error"}`. | `components/exporters/otlphttp/otlphttp_test.go::TestExporter_RetriesOnNetworkError` |
+| 🟢 Exporter unreachable (network error mid-send) | `otlphttp` retries on retryable HTTP status codes (429/502/503/504) and on network errors with exponential backoff; final error propagates to the receiver as a `Permanent` or `Retryable` `kind`, surfaced via `otelcol_exporter_calls_total{outcome="error"}` (post-RFC-0013 naming). | `components/exporters/otlphttp/otlphttp_test.go::TestExporter_RetriesOnNetworkError` |
| 🟢 Vendor SDK failure (`dcgm-exporter` unreachable at Start) | `prometheusreceiver` records the scrape failure and emits `up=0`; the pipeline continues without the scrape target's contribution rather than failing the whole binary. Source: upstream `prometheusreceiver` scraping `dcgm-exporter` per the bundled recipe. | recipe-level alert `DCGMReceiverDegraded`; see `tracecore-recipes` chart. |
| 🟢 Config invalid (unknown top-level field) | Loader returns a `file:line:column` error citing the offending key; `tracecore validate` and `tracecore collect` both exit 2 (EX_DATAERR) before any I/O. | `internal/config/load_test.go::TestLoad_UnknownTopLevelField_LineNumberedError` |
| 🟢 Config invalid (bad exporter endpoint) | `otlphttp` rejects non-http/https schemes at validate time with `otlphttp: endpoint: scheme must be http or https`; exit 2. | `components/exporters/otlphttp/otlphttp_test.go::TestConfig_Validate_RejectsNonHTTPSchemes` |
@@ -114,7 +114,7 @@ existing alerts survive the swap from in-tree receiver to upstream + OTTL.
| 🟢 `Server.Shutdown` without `Server.Start` | No-op. Returns nil. Mirrors `Component.Shutdown` idempotency. | `internal/telemetry/server_test.go::TestServer_ShutdownIsIdempotent` |
| 🟡 `Server.Shutdown` exceeds `ShutdownBudget` (800ms) | http.Server cancels in-flight requests; returns within budget. Leaves headroom in PRINCIPLES §1 1s overall budget. | `internal/telemetry/server_test.go::TestServer_ShutdownWithin1s` |
| 🟢 Repeated `Server.Start`/`Server.Shutdown` cycles | Listener fd is closed each Shutdown; no leak. `goleak` in TestMain catches regressions. | `internal/telemetry/server_test.go::TestServer_ShutdownIsIdempotent` (covers the cycle) |
-| 🔴 `/metrics` handler panics during scrape | promhttp catches the panic internally (its default behaviour) and returns 500. Our handler chain doesn't add recovery middleware in M2. **Carry-forward from M2:** dedicated panic-recovery + dedicated metric `tracecore.telemetry.scrape_panics_total`. | - (no current test; promhttp default is the only safety net) |
+| 🔴 `/metrics` handler panics during scrape | promhttp catches the panic internally (its default behaviour) and returns 500. Our handler chain doesn't add recovery middleware in M2. **Carry-forward from M2:** dedicated panic-recovery + dedicated metric `otelcol.telemetry.scrape_panics_total`. | - (no current test; promhttp default is the only safety net) |
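
The exporter-unreachable row above describes retrying on a fixed set of HTTP status codes with exponential backoff. A rough sketch of that classification and delay schedule follows. `retryable`, `backoffMillis`, the 100 ms base, and the 1 s cap are illustrative assumptions, not the `otlphttp` exporter's actual names or defaults, and the sketch leaves out jitter and network-error handling.

```go
package main

import (
	"fmt"
	"net/http"
)

// retryable reports whether a response status is one of the codes the table
// lists as retryable (429/502/503/504). Hypothetical helper name.
func retryable(status int) bool {
	switch status {
	case http.StatusTooManyRequests, // 429
		http.StatusBadGateway, // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout: // 504
		return true
	}
	return false
}

// backoffMillis doubles the delay once per attempt, starting at baseMs and
// never exceeding capMs. Base and cap values are assumptions of this sketch.
func backoffMillis(attempt, baseMs, capMs int) int {
	d := baseMs
	for i := 0; i < attempt && d < capMs; i++ {
		d *= 2
	}
	if d > capMs {
		d = capMs
	}
	return d
}

func main() {
	for _, status := range []int{429, 500, 503} {
		fmt.Println(status, retryable(status))
	}
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, backoffMillis(attempt, 100, 1000))
	}
}
```

Capping the delay keeps the total retry time bounded, which matters when the exporter also has to respect an overall shutdown budget.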

## Operator quick reference

2 changes: 1 addition & 1 deletion docs/FLAKY-TESTS.md
@@ -6,7 +6,7 @@ proceed. Don't burn an iteration chasing a known flake.

## Active

-### `cmd/tracecore.TestIntegration_ClockreceiverToStdoutexporter`
+### `cmd/tracecore.TestIntegration_ClockreceiverToStdoutexporter` - test target scheduled for deletion per RFC-0013

- **First seen flaking:** 2026-05-14 (M8 worktree, iteration 1)
- **Platform:** darwin/arm64 (macOS dev laptop), `-race` enabled.
18 changes: 9 additions & 9 deletions docs/README.md
@@ -51,17 +51,17 @@ Legend: 👤 operator · 🛠️ contributor · 🏛️ maintainer · 🌐 exter

| Path | Audience | Purpose |
|---|---|---|
-| [components/receivers/dcgm/README.md](../components/receivers/dcgm/README.md) | 👤 🛠️ | DCGM receiver config, deployment notes, cardinality budget. |
-| [components/receivers/dcgm/RUNBOOK.md](../components/receivers/dcgm/RUNBOOK.md) | 👤 | Operator playbook keyed by alert + failure mode inventory. |
-| [components/receivers/kernelevents/README.md](../components/receivers/kernelevents/README.md) | 👤 🛠️ | kmsg + journald receiver config, event taxonomy, caveats. |
-| [components/receivers/kernelevents/RUNBOOK.md](../components/receivers/kernelevents/RUNBOOK.md) | 👤 | Operator playbook + failure mode inventory. |
-| [components/receivers/clockreceiver/README.md](../components/receivers/clockreceiver/README.md) | 🛠️ | Canonical test receiver; reference shape for new receivers. |
-| [components/receivers/k8sevents/README.md](../components/receivers/k8sevents/README.md) | 👤 🛠️ | Kubernetes events receiver config and event taxonomy. |
-| [components/receivers/k8sevents/RUNBOOK.md](../components/receivers/k8sevents/RUNBOOK.md) | 👤 | Operator playbook + failure mode inventory. |
+| [components/receivers/dcgm/README.md](../components/receivers/dcgm/README.md) | 👤 🛠️ | DCGM receiver config, deployment notes, cardinality budget. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/dcgm/RUNBOOK.md](../components/receivers/dcgm/RUNBOOK.md) | 👤 | Operator playbook keyed by alert + failure mode inventory. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/kernelevents/README.md](../components/receivers/kernelevents/README.md) | 👤 🛠️ | kmsg + journald receiver config, event taxonomy, caveats. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/kernelevents/RUNBOOK.md](../components/receivers/kernelevents/RUNBOOK.md) | 👤 | Operator playbook + failure mode inventory. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/clockreceiver/README.md](../components/receivers/clockreceiver/README.md) | 🛠️ | Canonical test receiver; reference shape for new receivers. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/k8sevents/README.md](../components/receivers/k8sevents/README.md) | 👤 🛠️ | Kubernetes events receiver config and event taxonomy. - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/k8sevents/RUNBOOK.md](../components/receivers/k8sevents/RUNBOOK.md) | 👤 | Operator playbook + failure mode inventory. - scheduled for deletion per RFC-0013 §7 |
| [components/receivers/nccl_fr/README.md](../components/receivers/nccl_fr/README.md) | 👤 🛠️ | NCCL FlightRecorder receiver + safe-pickle parser scope. |
| [components/receivers/nccl_fr/RUNBOOK.md](../components/receivers/nccl_fr/RUNBOOK.md) | 👤 | Operator playbook + per-kind triage (incl. pickle deny-boundary). |
-| [components/receivers/pyspy/README.md](../components/receivers/pyspy/README.md) | 👤 🛠️ | On-demand Python stack-sampling receiver (faulthandler-based). |
-| [components/receivers/pyspy/RUNBOOK.md](../components/receivers/pyspy/RUNBOOK.md) | 👤 | Operator playbook + per-kind triage (RFC-0009 degraded modes). |
+| [components/receivers/pyspy/README.md](../components/receivers/pyspy/README.md) | 👤 🛠️ | On-demand Python stack-sampling receiver (faulthandler-based). - scheduled for deletion per RFC-0013 §7 |
+| [components/receivers/pyspy/RUNBOOK.md](../components/receivers/pyspy/RUNBOOK.md) | 👤 | Operator playbook + per-kind triage (RFC-0009 degraded modes). - scheduled for deletion per RFC-0013 §7 |
| [components/exporters/otlphttp/README.md](../components/exporters/otlphttp/README.md) | 👤 🛠️ | OTLP/HTTP exporter - production sink to an OTel collector or backend. |
| [internal/telemetry/README.md](../internal/telemetry/README.md) | 👤 🛠️ | Self-telemetry surface contract: `/metrics`, `/healthz`, `/readyz`. |
| [internal/telemetry/SECURITY.md](../internal/telemetry/SECURITY.md) | 👤 | Security model for the self-telemetry endpoints. |
2 changes: 1 addition & 1 deletion docs/STRATEGY.md
@@ -59,7 +59,7 @@ Current accepted divergences:
| `CreateSettings` shape | `{ID, Telemetry, BuildInfo, _ struct{}}` shared across roles | `receiver.Settings` / `exporter.Settings` / `processor.Settings` with `{ID, TelemetrySettings (embedded), BuildInfo, _ struct{}}` | M2 added `BuildInfo` and the unkeyed-init guard; per-role split deferred to first cross-role config divergence. | **done** (M2) for BuildInfo + guard; **post-v1** for per-role split |
| `TelemetrySettings` shape | `{Logger *slog.Logger, MeterProvider, Resource, _ struct{}}` | `{Logger *zap.Logger, TracerProvider, MeterProvider, Resource, _ struct{}}` | M2 added MeterProvider + guard; slog vs zap is a pre-existing permanent divergence; TracerProvider deferred to tracing milestone. | **done** (M2) for MeterProvider; **post-v1** for TracerProvider |
| Self-telemetry HTTP default bind | `localhost:8888` | `0.0.0.0:8888` | Security tiebreaker: pre-1.0 tracecore favours safe-by-default. Operators override via `telemetry.listen`. | permanent |
-| Self-telemetry metric names | `tracecore.receiver.*` | `otelcol_*` | Separate surface; operators KNOW they're scraping tracecore. | permanent |
+| Self-telemetry metric names | `tracecore.receiver.*` | `otelcol_*` | Separate surface; operators KNOW they're scraping tracecore. | superseded by RFC-0013 §3 (tracecore adopts upstream `otelcol_*` at v0.1.0) |
| `componentstatus` package | `internal/componentstatus` (in-tree) | `go.opentelemetry.io/collector/component/componentstatus` (external module) | Avoids pulling the OTel collector component module just for this one fn. Revisit at M22 OTel-compat sweep. | permanent (revisit M22) |
| Factory interface decomposition | Three monolithic `ReceiverFactory` / `ProcessorFactory` / `ExporterFactory` | Base `component.Factory` embedded into per-role factories, plus `XStability()` per signal | Stability tracking deferred to post-v1.0 (RFC-0003 §"Deferred"); decomposition adds no value without it. | **post-v1** (when stability tracking lands) |
| Sentinel error name | `ErrSignalNotSupported` | `pipeline.ErrSignalNotSupported` | Aligned in M1.6 Phase-19 audit (was `ErrSignalUnsupported`). | **done** (M1.6) |