diff --git a/CHANGELOG.md b/CHANGELOG.md index 826d7472..2f9dcb09 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -13,6 +13,8 @@ Pivot landed across four waves of PRs: - Wave 4 — PR-B2-shape sibling ports (mechanical import swap to upstream `go.opentelemetry.io/collector/{component,receiver,consumer,pipeline}`; lands together with #195 PR-J recipes, #199 RFC §migration amendment for PR-I/K sub-slicing, #200 PR-N security-posture migration, #198 lint concurrency fix, #210 TOCTOU race-window test hardening): #195 PR-J four upstream-receiver recipes; #197 PR-F precursor containerstdout off selftel+lc; #198 golangci-lint stale-PID fix; #199 RFC §migration PR-I/K amendment + PR-B2 gate; #200 PR-N pyspy capability-surface guide; #201 PR-B2 nccl_fr upstream (canonical); #202 stdoutexporter upstream; #203 pyspy upstream; #204 k8sevents upstream; #205 PR-B3 clockreceiver upstream; #206 PR-F.1 (delete `internal/{selftelemetry,telemetry}` + `components/receivers/dcgm/` + `pkg/dcgm/`); #207 otlphttp upstream; #208 kernelevents upstream; #209 containerstdout upstream; #210 lifecycle TOCTOU concurrent-Add race hardening (kernelevents + k8sevents); #211 PR-K.1 sever patterns lib + replay runner from k8sevents; #212 wave-3/4 docs sweep; #213 clockreceiver convention restore. - Wave 5 — receiver-source deletions (this PR + follow-ups): PR-K.2 (this PR) deletes `components/receivers/{clockreceiver,kernelevents,k8sevents,containerstdout}/` + `tools/failure-inject/xidgen/` + the `containerstdout-on-values.yaml` chart fixture; PR-K.3 (next) clears the chart values keys + DaemonSet template refs + `NOTES.txt` deprecation warnings. PR-F.2's `internal/{componentstatus,pipeline,pipelinebuilder,consumer,fanout,runtime/lifecycle}` cut is now gated only on the surviving consumers (`nccl_fr`, `pyspy`, `otlphttp`, `stdoutexporter`, `internal/pipeline/chaos_test.go`) landing on upstream-only. 
+**RFC-0013 namespace alignment landed: in-tree component self-telemetry renamed `tracecore.*` → `otelcol.<receiver|exporter>.<component>.*`.** The eight surviving in-tree components (`clockreceiver`, `containerstdout`, `k8sevents`, `kernelevents`, `nccl_fr`, `pyspy`, `otlphttp`, `stdoutexporter`) each emit through their own per-component MeterProvider; instrument names now match the upstream `otelcol_<receiver|exporter>_<component>_<instrument>` convention (e.g. `otelcol_receiver_containerstdout_errors_total`, `otelcol_exporter_otlphttp_calls_total`). Label shape preserved (`component_id`, `kind`, `result` unchanged). Per-component scope name unchanged (still the Go import path) — the receiver-scoped meter cannot collide with the OCB pipeline-runtime's own `otelcol_*` namespace. Three `prometheus-alerts.example.yaml` files (containerstdout, k8sevents, kernelevents) + `docs/examples/prometheus-alerts.example.yaml` rewritten against the new namespace. Migration table added to `docs/migration/v0.1-to-v0.2.md` under "In-tree receiver / exporter namespace alignment" with the rename matrix, per-component `<component>` substitutions, and a PromQL diff recipe. + **PR-F.1 landed: `components/receivers/dcgm/` + `pkg/dcgm/` + `internal/selftelemetry/` + `internal/telemetry/` deleted; one orphan clockreceiver integration test deleted.** Net deletion across the four moats RFC-0013 §migration step 8 promised. Deletes: - `components/receivers/dcgm/` + `pkg/dcgm/` — cgo stub never shipped real code; live ports removed in #188's PR-B2-shaped dcgm sweep; kueue + kineto already deleted in #168. - `internal/selftelemetry/` — every consumer (containerstdout, clockreceiver, kernelevents, k8sevents, nccl_fr, dcgm, pyspy, stdoutexporter, otlphttp) ported onto receiver/exporter-scoped sibling `selftel.go` files in wave-3 of the pivot (#184/#185/#186/#187/#188/#193/#194/#196/#197). The 5-method `selftelemetry.Receiver` and 1-method `selftelemetry.Exporter` interfaces (and the `Kind` canonical-set enum) leave the tree. 
diff --git a/components/exporters/otlphttp/README.md b/components/exporters/otlphttp/README.md index 8a1e556b..726c85f8 100644 --- a/components/exporters/otlphttp/README.md +++ b/components/exporters/otlphttp/README.md @@ -60,7 +60,7 @@ Operator-supplied `headers` are sent verbatim on every outgoing request; useful ## Self-telemetry labels -The exporter increments `tracecore.exporter.calls_total{result, kind, component_id}` on every Consume*. The `kind` values emitted by otlphttp are exporter-local low-cardinality strings declared in [`selftel.go`](selftel.go) (sibling-scoped, package-local — see RFC-0013 §migration PR-B1; the `internal/selftelemetry` canonical set is being deleted in PR-F): +The exporter increments `otelcol.exporter.otlphttp.calls_total{result, kind, component_id}` on every Consume* (Prometheus scrape renders this as `otelcol_exporter_otlphttp_calls_total`). The `kind` values emitted by otlphttp are exporter-local low-cardinality strings declared in [`selftel.go`](selftel.go) (sibling-scoped, package-local — see RFC-0013 §migration v0.1.0 namespace alignment; the `internal/selftelemetry` canonical set was deleted in PR-F.1): | `kind` | When | Operator first-step | |---|---|---| @@ -75,7 +75,7 @@ Operator dashboards split by `kind` to triage: ## Partial-success responses -A 200 OK response with a populated `partial_success` body is treated as a full success in v0.1.0; the response body is NOT decoded. The OTLP spec marks client handling of `partial_success` as OPTIONAL. Operators who need rejected-count reporting can scrape `tracecore.exporter.calls_total{result=success}` and compare against upstream input counts via their backend. +A 200 OK response with a populated `partial_success` body is treated as a full success in v0.1.0; the response body is NOT decoded. The OTLP spec marks client handling of `partial_success` as OPTIONAL. 
Operators who need rejected-count reporting can scrape `otelcol_exporter_otlphttp_calls_total{result="success"}` and compare against upstream input counts via their backend. ## Signals supported diff --git a/components/exporters/otlphttp/otlphttp.go b/components/exporters/otlphttp/otlphttp.go index 2574edc5..4232a6bf 100644 --- a/components/exporters/otlphttp/otlphttp.go +++ b/components/exporters/otlphttp/otlphttp.go @@ -312,34 +312,27 @@ func buildUserAgent(bi component.BuildInfo) string { // newSelfTelemetry wires the per-exporter self-telemetry handle. // Returns a no-op when the MeterProvider is absent or instrument // registration fails; the register-failure path also ticks -// `tracecore.selftelemetry.init_errors_total` via recordInitError so -// operators can alert on > 0. Mirrors the nccl_fr sibling — same wire -// shape, no internal/selftelemetry import. +// `otelcol.selftelemetry.init_errors_total` via recordInitError so +// operators can alert on > 0. Mirrors the stdoutexporter sibling — +// same wire shape, no internal/selftelemetry import. // // NOTE on ExporterCarrier removal: // // v0.1.x otlphttp exposed `SelfExporter() selftelemetry.Exporter` so // the runtime's reader-collection path could feed -// `tracecore.exporter.failure_rate`. The PR-B1 sibling port dropped -// the `selftelemetry.ExporterCarrier` implementation: +// `tracecore.exporter.failure_rate`. RFC-0013 PR-A2 deleted the +// `cmd/tracecore` hand-wired entry point and PR-F.1 deleted +// `internal/selftelemetry` entirely — the carrier surface has no +// remaining consumer. // -// - The runtime path that consumed ExporterCarrier (`cmd/tracecore` -// in v0.1.x) silently skipped components that didn't implement -// it. There is no current production caller in this tree; the -// v0.1.x ConsumeCarrier was the only consumer and PR-F deletes it. -// - `tracecore_exporter_failure_rate` still appears in scrape via the -// SLO observable gauge (reports 0 with no readers registered). 
-// - `tracecore.exporter.calls_total{result,kind,component_id}` +// - `otelcol.exporter.otlphttp.calls_total{result,kind,component_id}` // continues to surface because the sibling impl emits it on // `set.MeterProvider` directly — dashboards / alerts keyed on the -// calls_total counter do not regress. -// - PR-F deletes `internal/selftelemetry` entirely, so the contract -// evaporates regardless. Removing now keeps the sibling -// import-graph clean and matches the stdoutexporter precedent. -// -// The per-exporter failure_rate gauge feed is the documented gap; the -// runtime degrades to the "no per-exporter signal" mode in line with -// the v0.1.x contract. +// calls_total counter rate-derive failure via +// PromQL `rate(otelcol_exporter_otlphttp_calls_total{result="error"}[5m])`. +// - The per-exporter failure_rate gauge feed is intentionally +// dropped; the v0.1.x SLO observable gauge contract is replaced +// by the upstream OCB pipeline-runtime counters. func newSelfTelemetry(ctx context.Context, set exporter.Settings, logger *zap.Logger) selfExporter { if set.MeterProvider == nil { logger.Warn("otlphttp: no MeterProvider; self-telemetry using noop") @@ -604,7 +597,7 @@ func (e *otlpExporter) doOnce(ctx context.Context, endpoint string, body []byte, // spec marks client handling as OPTIONAL. v1 treats all 200s // as full success without parsing the body. Operators who // want partial-success reporting can scrape the - // `tracecore.exporter.calls_total` counter, which still + // `otelcol.exporter.otlphttp.calls_total` counter, which still // counts the 200 as a success here. 
return false, retryHint{}, nil case isRetryableStatus(resp.StatusCode): diff --git a/components/exporters/otlphttp/selftel.go b/components/exporters/otlphttp/selftel.go index 4c3baf9d..62e68056 100644 --- a/components/exporters/otlphttp/selftel.go +++ b/components/exporters/otlphttp/selftel.go @@ -1,11 +1,14 @@ // SPDX-License-Identifier: Apache-2.0 // Exporter-scoped self-telemetry surface. Replaces the v0.1.x -// dependency on `internal/selftelemetry`, which is slated for deletion -// in RFC-0013 PR-F. Metric names + label shape are preserved -// (`tracecore.exporter.calls_total{result,kind,component_id}`) so -// dashboards / alerts don't regress. The instrumentation scope name is -// THIS exporter's Go import path — when the exporter moves under +// dependency on `internal/selftelemetry`. Metric names follow the +// upstream OTel collector `otelcol_<receiver|exporter>_<component>_<instrument>` +// convention per RFC-0013 §migration v0.1.0 namespace alignment: +// `otelcol.exporter.otlphttp.calls_total{result,kind,component_id}` +// (Prometheus exporter renders the dots as underscores). Label shape +// is preserved (`component_id`) so multi-instance disambiguation in +// dashboards is unchanged from v0.1.x. The instrumentation scope name +// is THIS exporter's Go import path — when the exporter moves under // `module/` in PR-I, the scope name moves with it, matching OTel // convention. // @@ -69,16 +72,13 @@ var errNilMeterProvider = errors.New("otlphttp: MeterProvider is nil") // removal in otlphttp.go). // // Why drop FailureRateReader / ExporterCarrier: -// - The runtime path that consumed ExporterCarrier (`cmd/tracecore` -// in v0.1.x) silently skipped components that didn't implement -// it — the documented "no per-exporter signal" degraded mode. The -// production-runtime caller has no current consumer in this tree. -// - `tracecore_exporter_failure_rate` still appears in scrape via the -// SLO observable gauge (reports 0 with no readers). 
-// - PR-F deletes `internal/selftelemetry` entirely, so any code that -// referenced ExporterCarrier here would have to be removed then -// anyway. Removing now keeps the sibling import-graph clean and -// matches the stdoutexporter precedent. +// - The runtime path that consumed ExporterCarrier +// (`cmd/tracecore.collect.collectFailureRateReaders` in v0.1.x) +// was deleted by RFC-0013 PR-A2 along with the hand-wired entry +// point, so the carrier interface has no remaining consumer. +// - `internal/selftelemetry` (which owned the carrier) was deleted +// by RFC-0013 PR-F.1. Operators rate-derive failure rate via +// PromQL `rate(otelcol_exporter_otlphttp_calls_total{result="error"}[5m])`. type selfExporter interface { IncCallSuccess() IncCallFailure(k kind) @@ -95,7 +95,7 @@ func (noopSelfExporter) IncCallFailure(kind) {} var _ selfExporter = noopSelfExporter{} // newSelfExporter returns a real selfExporter backed by an OTel counter -// `tracecore.exporter.calls_total{result, kind, component_id}` acquired +// `otelcol.exporter.otlphttp.calls_total{result, kind, component_id}` acquired // from mp. The component's id is attached as the `component_id` label // on every emission. Metric name + label shape preserved from the // v0.1.x internal selftelemetry package so dashboards / alerts don't @@ -107,7 +107,7 @@ func newSelfExporter(id component.ID, mp metric.MeterProvider) (selfExporter, er meter := mp.Meter(instrumentationScope) calls, err := meter.Int64Counter( - "tracecore.exporter.calls_total", + "otelcol.exporter.otlphttp.calls_total", metric.WithDescription("Exporter Consume* calls partitioned by result"), ) if err != nil { @@ -145,7 +145,7 @@ func (e *selfExporterImpl) IncCallFailure(k kind) { )) } -// recordInitError ticks tracecore.selftelemetry.init_errors_total when +// recordInitError ticks otelcol.selftelemetry.init_errors_total when // exporter wiring falls back to noop telemetry. 
Operators alert on // `> 0` to learn that self-telemetry isn't really plugged in. Panics // from a broken MeterProvider are swallowed — recordInitError IS the @@ -158,7 +158,7 @@ func recordInitError(ctx context.Context, mp metric.MeterProvider, kindLabel, co } meter := mp.Meter(instrumentationScope) c, err := meter.Int64Counter( - "tracecore.selftelemetry.init_errors_total", + "otelcol.selftelemetry.init_errors_total", metric.WithDescription("Counter of self-telemetry construction failures that fell back to the noop implementation."), ) if err != nil { diff --git a/components/exporters/otlphttp/selftel_test.go b/components/exporters/otlphttp/selftel_test.go index a6b2d480..f70b789e 100644 --- a/components/exporters/otlphttp/selftel_test.go +++ b/components/exporters/otlphttp/selftel_test.go @@ -43,7 +43,7 @@ func collectRM(t *testing.T, rdr *sdkmetric.ManualReader) metricdata.ResourceMet } // findInstrument returns the first metricdata.Metrics whose Name matches the -// supplied OTel-dot name (e.g. "tracecore.exporter.calls_total"). Returns +// supplied OTel-dot name (e.g. "otelcol.exporter.otlphttp.calls_total"). Returns // (nil, false) if absent. Scope-agnostic: walks all scope metrics. func findInstrument(rm metricdata.ResourceMetrics, name string) (metricdata.Metrics, bool) { for _, sm := range rm.ScopeMetrics { @@ -115,7 +115,7 @@ func TestOtlphttp_NewExporter_NilProviderErrors(t *testing.T) { // M2 metric contract for exporters. After IncCallSuccess() ×2 + // IncCallFailure(kindMarshal) ×1 + IncCallFailure(kindIO) ×1 + // IncCallFailure(kindDownstream) ×1, the ManualReader collects -// tracecore.exporter.calls_total with datapoints partitioned by result and +// otelcol.exporter.otlphttp.calls_total with datapoints partitioned by result and // (for failures) kind, labeled with the component_id. A regression that // drops the kind label, the component_id label, the result label, or the // metric-name prefix fails here. 
@@ -132,9 +132,9 @@ func TestOtlphttp_EmitsCallsTotal_WithResultKindAndComponentID(t *testing.T) { se.IncCallFailure(kindDownstream) rm := collectRM(t, rdr) - m, ok := findInstrument(rm, "tracecore.exporter.calls_total") + m, ok := findInstrument(rm, "otelcol.exporter.otlphttp.calls_total") if !ok { - t.Fatalf("metric tracecore.exporter.calls_total absent; have: %s", dumpNames(rm)) + t.Fatalf("metric otelcol.exporter.otlphttp.calls_total absent; have: %s", dumpNames(rm)) } sum, ok := m.Data.(metricdata.Sum[int64]) if !ok { @@ -180,7 +180,7 @@ func TestOtlphttp_ScopeNameIsExporterImportPath(t *testing.T) { } se.IncCallSuccess() rm := collectRM(t, rdr) - scope, ok := scopeOf(rm, "tracecore.exporter.calls_total") + scope, ok := scopeOf(rm, "otelcol.exporter.otlphttp.calls_total") if !ok { t.Fatalf("calls_total absent") } @@ -192,7 +192,7 @@ func TestOtlphttp_ScopeNameIsExporterImportPath(t *testing.T) { // TestOtlphttp_RecordInitError_TicksInitErrorsCounter pins: when factory wiring // fails (newSelfExporter returns an error), recordInitError surfaces a -// tracecore.selftelemetry.init_errors_total tick with kind="exporter", +// otelcol.selftelemetry.init_errors_total tick with kind="exporter", // the component_id label, and reason="instrument_register". This is the // only signal that an exporter fell back to noop telemetry; dropping the // recordInitError call must fail this test. 
@@ -201,7 +201,7 @@ func TestOtlphttp_RecordInitError_TicksInitErrorsCounter(t *testing.T) { recordInitError(context.Background(), mp, "exporter", testID().String(), reasonInstrumentRegister) rm := collectRM(t, rdr) - m, ok := findInstrument(rm, "tracecore.selftelemetry.init_errors_total") + m, ok := findInstrument(rm, "otelcol.selftelemetry.init_errors_total") if !ok { t.Fatalf("init_errors_total absent; have: %s", dumpNames(rm)) } @@ -240,11 +240,11 @@ func TestOtlphttp_RecordInitError_NilProviderIsSafe(t *testing.T) { // TestOtlphttp_FallsBackToNoopWhenMeterFails pins the factory // observability contract end-to-end: when newSelfExporter returns an -// error (synthetic register failure for every tracecore.exporter.* +// error (synthetic register failure for every otelcol.exporter.otlphttp.* // instrument), the factory MUST (1) leave the exporter with a working // noop telemetry field (no nil, no panic on hot-path calls), AND (2) -// tick tracecore.selftelemetry.init_errors_total via recordInitError. -// Mirrors the nccl_fr sibling test seam. +// tick otelcol.selftelemetry.init_errors_total via recordInitError. +// Mirrors the stdoutexporter sibling test seam. 
func TestOtlphttp_FallsBackToNoopWhenMeterFails(t *testing.T) { mp, rdr := newTestMeterProvider(t) failing := &failingExporterMP{real: mp} @@ -268,12 +268,12 @@ func TestOtlphttp_FallsBackToNoopWhenMeterFails(t *testing.T) { exp.telemetry.IncCallFailure(kindIO) rm := collectRM(t, rdr) - if m, ok := findInstrument(rm, "tracecore.exporter.calls_total"); ok { + if m, ok := findInstrument(rm, "otelcol.exporter.otlphttp.calls_total"); ok { if sum, ok := m.Data.(metricdata.Sum[int64]); ok && len(sum.DataPoints) > 0 { t.Errorf("noop fallback leaked Inc* into calls_total datapoints: %v", sum.DataPoints) } } - m, ok := findInstrument(rm, "tracecore.selftelemetry.init_errors_total") + m, ok := findInstrument(rm, "otelcol.selftelemetry.init_errors_total") if !ok { t.Fatalf("init_errors_total absent after factory fallback; have: %s", dumpNames(rm)) } @@ -385,9 +385,9 @@ func dumpNames(rm metricdata.ResourceMetrics) string { } // failingExporterMP wraps a real MeterProvider but fails every instrument -// registration whose name starts with "tracecore.exporter.". Mirrors the -// stdoutexporter sibling test seam so a future refactor that reorders the -// newSelfExporter constructor doesn't silently bypass coverage. +// registration whose name starts with "otelcol.exporter.otlphttp.". +// Mirrors the stdoutexporter sibling test seam so a future refactor that +// reorders the newSelfExporter constructor doesn't silently bypass coverage. type failingExporterMP struct { embedded.MeterProvider real metric.MeterProvider @@ -401,7 +401,7 @@ type failingExporterMeter struct { metric.Meter } -const exporterInstrumentPrefix = "tracecore.exporter." +const exporterInstrumentPrefix = "otelcol.exporter.otlphttp." 
var errSyntheticExporterFailure = errors.New("synthetic: exporter instrument registration failed") diff --git a/components/exporters/stdoutexporter/selftel.go b/components/exporters/stdoutexporter/selftel.go index 4e33c8e0..16db8224 100644 --- a/components/exporters/stdoutexporter/selftel.go +++ b/components/exporters/stdoutexporter/selftel.go @@ -1,11 +1,14 @@ // SPDX-License-Identifier: Apache-2.0 // Exporter-scoped self-telemetry surface. Replaces the v0.1.x -// dependency on `internal/selftelemetry`, which is slated for deletion -// in RFC-0013 PR-F. Metric names + label shape are preserved -// (`tracecore.exporter.calls_total{result,kind,component_id}`) so -// dashboards / alerts don't regress. The instrumentation scope name is -// THIS exporter's Go import path — when the exporter moves under +// dependency on `internal/selftelemetry`. Metric names follow the +// upstream OTel collector `otelcol_<receiver|exporter>_<component>_<instrument>` +// convention per RFC-0013 §migration v0.1.0 namespace alignment: +// `otelcol.exporter.stdoutexporter.calls_total{result,kind,component_id}` +// (Prometheus exporter renders the dots as underscores). Label shape +// is preserved (`component_id`) so multi-instance disambiguation in +// dashboards is unchanged from v0.1.x. The instrumentation scope name +// is THIS exporter's Go import path — when the exporter moves under // `module/` in PR-I, the scope name moves with it, matching OTel // convention. // @@ -66,17 +69,12 @@ var errNilMeterProvider = errors.New("stdoutexporter: MeterProvider is nil") // - stdoutexporter is the canonical debug / example exporter; it // writes JSON lines to a configured io.Writer (stdout in // production). Operators don't alert on its failure_rate. -// - The runtime path that consumes ExporterCarrier -// (`cmd/tracecore/collect.collectFailureRateReaders`) silently -// skips components that don't implement it — the documented -// "no per-exporter signal" degraded mode. 
-// - `tracecore_exporter_failure_rate` still appears in scrape -// (observable gauge fires from the SLO source with 0 when the -// registry is empty), so the M2 acceptance check in -// `cmd/tracecore/integration_telemetry_test.go` still passes. -// - PR-F deletes `internal/selftelemetry` entirely, so any code -// that referenced ExporterCarrier here would have to be removed -// then anyway. Removing now keeps the sibling import-graph clean. +// - `internal/selftelemetry` (which owned the ExporterCarrier +// interface) was deleted in RFC-0013 PR-F.1, so any code that +// referenced it had to be removed anyway. The +// `otelcol_exporter_stdoutexporter_calls_total` counter is the +// only surfaced signal — operators rate-derive failure rate via +// PromQL `rate(calls_total{result="error"}[5m])` instead. type selfExporter interface { IncCallSuccess() IncCallFailure(k kind) @@ -93,7 +91,7 @@ func (noopSelfExporter) IncCallFailure(kind) {} var _ selfExporter = noopSelfExporter{} // newSelfExporter returns a real selfExporter backed by an OTel counter -// `tracecore.exporter.calls_total{result, kind, component_id}` acquired +// `otelcol.exporter.stdoutexporter.calls_total{result, kind, component_id}` acquired // from mp. The component's id is attached as the `component_id` label // on every emission. 
Metric name + label shape preserved from the // v0.1.x internal selftelemetry package so dashboards / alerts don't @@ -105,7 +103,7 @@ func newSelfExporter(id component.ID, mp metric.MeterProvider) (selfExporter, er meter := mp.Meter(instrumentationScope) calls, err := meter.Int64Counter( - "tracecore.exporter.calls_total", + "otelcol.exporter.stdoutexporter.calls_total", metric.WithDescription("Exporter Consume* calls partitioned by result"), ) if err != nil { @@ -143,7 +141,7 @@ func (e *selfExporterImpl) IncCallFailure(k kind) { )) } -// recordInitError ticks tracecore.selftelemetry.init_errors_total when +// recordInitError ticks otelcol.selftelemetry.init_errors_total when // exporter wiring falls back to noop telemetry. Operators alert on // `> 0` to learn that self-telemetry isn't really plugged in. Panics // from a broken MeterProvider are swallowed — recordInitError IS the @@ -156,7 +154,7 @@ func recordInitError(ctx context.Context, mp metric.MeterProvider, kindLabel, co } meter := mp.Meter(instrumentationScope) c, err := meter.Int64Counter( - "tracecore.selftelemetry.init_errors_total", + "otelcol.selftelemetry.init_errors_total", metric.WithDescription("Counter of self-telemetry construction failures that fell back to the noop implementation."), ) if err != nil { diff --git a/components/exporters/stdoutexporter/selftel_test.go b/components/exporters/stdoutexporter/selftel_test.go index aebacb96..606d81a6 100644 --- a/components/exporters/stdoutexporter/selftel_test.go +++ b/components/exporters/stdoutexporter/selftel_test.go @@ -44,7 +44,7 @@ func collectRM(t *testing.T, rdr *sdkmetric.ManualReader) metricdata.ResourceMet } // findInstrument returns the first metricdata.Metrics whose Name matches the -// supplied OTel-dot name (e.g. "tracecore.exporter.calls_total"). Returns +// supplied OTel-dot name (e.g. "otelcol.exporter.stdoutexporter.calls_total"). Returns // (nil, false) if absent. Scope-agnostic: walks all scope metrics. 
func findInstrument(rm metricdata.ResourceMetrics, name string) (metricdata.Metrics, bool) { for _, sm := range rm.ScopeMetrics { @@ -114,7 +114,7 @@ func TestSelfTelemetry_NewExporter_NilProviderErrors(t *testing.T) { // TestSelfTelemetry_EmitsCallsTotal_WithResultKindAndComponentID pins the // M2 metric contract for exporters. After IncCallSuccess() ×2 + // IncCallFailure(kindMarshal) ×1 + IncCallFailure(kindIO) ×1, the -// ManualReader collects tracecore.exporter.calls_total with datapoints +// ManualReader collects otelcol.exporter.stdoutexporter.calls_total with datapoints // partitioned by result and (for failures) kind, labeled with the // component_id. A regression that drops the kind label, the component_id // label, the result label, or the metric-name prefix fails here. @@ -130,9 +130,9 @@ func TestSelfTelemetry_EmitsCallsTotal_WithResultKindAndComponentID(t *testing.T se.IncCallFailure(kindIO) rm := collectRM(t, rdr) - m, ok := findInstrument(rm, "tracecore.exporter.calls_total") + m, ok := findInstrument(rm, "otelcol.exporter.stdoutexporter.calls_total") if !ok { - t.Fatalf("metric tracecore.exporter.calls_total absent; have: %s", dumpNames(rm)) + t.Fatalf("metric otelcol.exporter.stdoutexporter.calls_total absent; have: %s", dumpNames(rm)) } sum, ok := m.Data.(metricdata.Sum[int64]) if !ok { @@ -177,7 +177,7 @@ func TestSelfTelemetry_ScopeNameIsExporterImportPath(t *testing.T) { } se.IncCallSuccess() rm := collectRM(t, rdr) - scope, ok := scopeOf(rm, "tracecore.exporter.calls_total") + scope, ok := scopeOf(rm, "otelcol.exporter.stdoutexporter.calls_total") if !ok { t.Fatalf("calls_total absent") } @@ -189,7 +189,7 @@ func TestSelfTelemetry_ScopeNameIsExporterImportPath(t *testing.T) { // TestRecordInitError_TicksInitErrorsCounter pins: when factory wiring // fails (newSelfExporter returns an error), recordInitError surfaces a -// tracecore.selftelemetry.init_errors_total tick with kind="exporter", +// otelcol.selftelemetry.init_errors_total tick with 
kind="exporter", // the component_id label, and reason="instrument_register". This is the // only signal that an exporter fell back to noop telemetry; dropping the // recordInitError call must fail this test. @@ -198,7 +198,7 @@ func TestRecordInitError_TicksInitErrorsCounter(t *testing.T) { recordInitError(context.Background(), mp, "exporter", testID().String(), reasonInstrumentRegister) rm := collectRM(t, rdr) - m, ok := findInstrument(rm, "tracecore.selftelemetry.init_errors_total") + m, ok := findInstrument(rm, "otelcol.selftelemetry.init_errors_total") if !ok { t.Fatalf("init_errors_total absent; have: %s", dumpNames(rm)) } @@ -237,10 +237,10 @@ func TestRecordInitError_NilProviderIsSafe(t *testing.T) { // TestFactory_FallsBackToNoopWhenMeterFails pins the factory // observability contract end-to-end: when newSelfExporter returns an -// error (synthetic register failure for every tracecore.exporter.* +// error (synthetic register failure for every otelcol.exporter.stdoutexporter.* // instrument), the factory MUST (1) leave the exporter with a working // noop telemetry field (no nil, no panic on hot-path calls), AND (2) -// tick tracecore.selftelemetry.init_errors_total via recordInitError. +// tick otelcol.selftelemetry.init_errors_total via recordInitError. // Mirrors the nccl_fr sibling test seam. 
func TestFactory_FallsBackToNoopWhenMeterFails(t *testing.T) { mp, rdr := newTestMeterProvider(t) @@ -266,12 +266,12 @@ func TestFactory_FallsBackToNoopWhenMeterFails(t *testing.T) { exp.telemetry.IncCallFailure(kindIO) rm := collectRM(t, rdr) - if m, ok := findInstrument(rm, "tracecore.exporter.calls_total"); ok { + if m, ok := findInstrument(rm, "otelcol.exporter.stdoutexporter.calls_total"); ok { if sum, ok := m.Data.(metricdata.Sum[int64]); ok && len(sum.DataPoints) > 0 { t.Errorf("noop fallback leaked Inc* into calls_total datapoints: %v", sum.DataPoints) } } - m, ok := findInstrument(rm, "tracecore.selftelemetry.init_errors_total") + m, ok := findInstrument(rm, "otelcol.selftelemetry.init_errors_total") if !ok { t.Fatalf("init_errors_total absent after factory fallback; have: %s", dumpNames(rm)) } @@ -366,9 +366,9 @@ func dumpNames(rm metricdata.ResourceMetrics) string { } // failingExporterMP wraps a real MeterProvider but fails every instrument -// registration whose name starts with "tracecore.exporter.". Mirrors the -// nccl_fr sibling test seam so a future refactor that reorders the -// newSelfExporter constructor doesn't silently bypass coverage. +// registration whose name starts with "otelcol.exporter.stdoutexporter.". +// Mirrors the nccl_fr sibling test seam so a future refactor that reorders +// the newSelfExporter constructor doesn't silently bypass coverage. type failingExporterMP struct { embedded.MeterProvider real metric.MeterProvider @@ -382,7 +382,7 @@ type failingExporterMeter struct { metric.Meter } -const exporterInstrumentPrefix = "tracecore.exporter." +const exporterInstrumentPrefix = "otelcol.exporter.stdoutexporter." 
var errSyntheticExporterFailure = errors.New("synthetic: exporter instrument registration failed") diff --git a/components/exporters/stdoutexporter/stdoutexporter.go b/components/exporters/stdoutexporter/stdoutexporter.go index 559d4920..4e91fa23 100644 --- a/components/exporters/stdoutexporter/stdoutexporter.go +++ b/components/exporters/stdoutexporter/stdoutexporter.go @@ -63,23 +63,15 @@ func newExporter(cfg *Config) *stdoutExporter { // // v0.1.x stdoutexporter exposed `SelfExporter() selftelemetry.Exporter` // so `cmd/tracecore/collect.collectFailureRateReaders` could feed -// `tracecore.exporter.failure_rate`. The PR-B1 sibling port dropped -// the `selftelemetry.ExporterCarrier` implementation; this port (off -// `internal/pipeline` + `internal/consumer` onto upstream) preserves -// that drop. The rationale that justified PR-B1's drop still holds: +// `tracecore.exporter.failure_rate`. RFC-0013 PR-A2 deleted the +// `cmd/tracecore` hand-wired entry point and PR-F.1 deleted +// `internal/selftelemetry` entirely — the carrier surface has no +// remaining consumer. Operators rate-derive failure rate via PromQL +// `rate(otelcol_exporter_stdoutexporter_calls_total{result="error"}[5m])` +// against the post-RFC-0013 namespace-aligned counter. // -// - The runtime's reader-collection path silently skips components -// that don't implement the carrier (documented behavior). -// - stdoutexporter is the canonical debug / example exporter; -// operators don't alert on its failure_rate. Real backends in -// `components/exporters/otlphttp` carry that contract. -// - `tracecore_exporter_failure_rate` still appears in scrape via -// the SLO observable gauge (reports 0 with no readers). -// - PR-F deletes `internal/selftelemetry` entirely, so the contract -// evaporates regardless. -// -// `tracecore.exporter.calls_total` continues to surface because the -// sibling impl emits it on `set.MeterProvider` directly. 
+// `otelcol.exporter.stdoutexporter.calls_total` continues to surface +// because the sibling impl emits it on `set.MeterProvider` directly. // Start is a no-op — stdoutexporter has no goroutines, no // connections, and no resources to acquire at pipeline-start time. @@ -106,7 +98,7 @@ func (e *stdoutExporter) ConsumeMetrics(_ context.Context, md pmetric.Metrics) e if md.MetricCount() == 0 { // Empty payloads still count as successful Consume calls — // the contract was fulfilled, just with zero work. Counting - // them keeps `tracecore_exporter_calls_total` consistent + // them keeps `otelcol_exporter_stdoutexporter_calls_total` consistent // with the operator's "calls received" intuition. e.telemetry.IncCallSuccess() return nil diff --git a/components/receivers/nccl_fr/README.md b/components/receivers/nccl_fr/README.md index a9fa2e84..a37798e3 100644 --- a/components/receivers/nccl_fr/README.md +++ b/components/receivers/nccl_fr/README.md @@ -122,8 +122,10 @@ exists, not from the bytes within it. - **`DumpDir` missing:** the receiver logs one Warn with operator hints, then goes silent. After 30s of continuous failure it flips to degraded - operators alert on the - `tracecore.receiver.degraded_seconds_total{component_id="nccl_fr/…"}` - metric. `/readyz` does NOT flip on transient receiver-degraded + `otelcol.receiver.ncclfr.degraded_seconds_total{component_id="nccl_fr/…"}` + metric (Prometheus scrape renders this as + `otelcol_receiver_ncclfr_degraded_seconds_total`). `/readyz` does + NOT flip on transient receiver-degraded events by project policy (see `internal/telemetry/README.md`). - **Truncated pickle (in-progress write):** logged at Info; retried on next mtime change. 
diff --git a/components/receivers/nccl_fr/selftel.go b/components/receivers/nccl_fr/selftel.go index 13c207d3..45ad01b1 100644 --- a/components/receivers/nccl_fr/selftel.go +++ b/components/receivers/nccl_fr/selftel.go @@ -1,13 +1,20 @@ // SPDX-License-Identifier: Apache-2.0 // Receiver-scoped self-telemetry surface. Replaces the v0.1.x -// dependency on `internal/selftelemetry`, which is slated for deletion -// in RFC-0013 PR-F. Metric names + label shape are preserved -// (`tracecore.receiver.errors_total{kind,component_id}` and siblings) -// so dashboards / alerts don't regress. The instrumentation scope name -// is THIS receiver's Go import path — when the receiver moves to -// `module/receiver/ncclfrreceiver/` in PR-I.1, the scope name moves -// with it, matching OTel convention. +// dependency on `internal/selftelemetry`, which was deleted in +// RFC-0013 PR-F.1. Metric names follow the upstream OTel collector +// `otelcol___` convention per RFC-0013 +// §migration v0.1.0 namespace alignment: instruments register as +// `otelcol.receiver.ncclfr.errors_total{kind,component_id}` (OTel-dot +// form; the Prometheus exporter renders this as +// `otelcol_receiver_ncclfr_errors_total`). Label shape is preserved +// (`component_id` still partitions per-instance) so multi-instance +// disambiguation in dashboards is unchanged from v0.1.x. The +// instrumentation scope name is THIS receiver's Go import path — when +// the receiver moves to `module/receiver/ncclfrreceiver/` in PR-I.1, +// the scope name moves with it, matching OTel convention. Operators +// migrating from v0.1.x dashboards rename `tracecore_receiver_*` → +// `otelcol_receiver_ncclfr_*` per docs/migration/v0.1-to-v0.2.md. package ncclfr @@ -82,8 +89,10 @@ var _ selfTelemetry = noopSelfTelemetry{} // newSelfTelemetry returns a real selfTelemetry backed by OTel metric // instruments acquired from mp. The component's id is attached as the // `component_id` label on every emission. 
Registers the same five -// instruments the v0.1.x internal selftelemetry package registered, so -// scraped metric names + label shape are unchanged. +// instruments the v0.1.x internal selftelemetry package registered; +// the OTel-dot prefix changed from `tracecore.receiver.*` to +// `otelcol.receiver.ncclfr.*` per RFC-0013 namespace alignment, label +// shape is unchanged. func newSelfTelemetry(id component.ID, mp metric.MeterProvider) (selfTelemetry, error) { if mp == nil { return nil, errNilMeterProvider @@ -92,21 +101,21 @@ func newSelfTelemetry(id component.ID, mp metric.MeterProvider) (selfTelemetry, attrSet := attribute.NewSet(attribute.String("component_id", id.String())) errsCtr, err := meter.Int64Counter( - "tracecore.receiver.errors_total", + "otelcol.receiver.ncclfr.errors_total", metric.WithDescription("Errors observed by a receiver, partitioned by kind"), ) if err != nil { return nil, fmt.Errorf("errors_total counter: %w", err) } emissionsCtr, err := meter.Int64Counter( - "tracecore.receiver.emissions_total", + "otelcol.receiver.ncclfr.emissions_total", metric.WithDescription("Data points / events emitted by a receiver"), ) if err != nil { return nil, fmt.Errorf("emissions_total counter: %w", err) } latencyHist, err := meter.Float64Histogram( - "tracecore.receiver.collection_latency_seconds", + "otelcol.receiver.ncclfr.collection_latency_seconds", metric.WithDescription("Receiver collection cycle latency in seconds"), metric.WithUnit("s"), // Bucket boundaries chosen for sub-millisecond dump-poll cycles @@ -133,7 +142,7 @@ func newSelfTelemetry(id component.ID, mp metric.MeterProvider) (selfTelemetry, st.activityUnix.Store(time.Now().Unix()) if _, err := meter.Float64ObservableCounter( - "tracecore.receiver.degraded_seconds_total", + "otelcol.receiver.ncclfr.degraded_seconds_total", metric.WithDescription("Cumulative seconds the receiver has been in the degraded state"), metric.WithUnit("s"), metric.WithFloat64Callback(func(_ context.Context, obs 
metric.Float64Observer) error { @@ -145,7 +154,7 @@ func newSelfTelemetry(id component.ID, mp metric.MeterProvider) (selfTelemetry, } if _, err := meter.Int64ObservableGauge( - "tracecore.receiver.last_activity_unix_seconds", + "otelcol.receiver.ncclfr.last_activity_unix_seconds", metric.WithDescription("Unix-second timestamp of the receiver's last successful activity"), metric.WithInt64Callback(func(_ context.Context, obs metric.Int64Observer) error { obs.Observe(st.activityUnix.Load(), metric.WithAttributeSet(attrSet)) @@ -233,7 +242,7 @@ func (s *selfTelemetryImpl) degradedTotalSeconds() float64 { return acc.Seconds() } -// recordInitError ticks tracecore.selftelemetry.init_errors_total when +// recordInitError ticks otelcol.selftelemetry.init_errors_total when // receiver wiring falls back to noop telemetry. Operators alert on // `> 0` to learn that self-telemetry isn't really plugged in. Panics // from a broken MeterProvider are swallowed — recordInitError IS the @@ -246,7 +255,7 @@ func recordInitError(ctx context.Context, mp metric.MeterProvider, kindLabel, co } meter := mp.Meter(instrumentationScope) c, err := meter.Int64Counter( - "tracecore.selftelemetry.init_errors_total", + "otelcol.selftelemetry.init_errors_total", metric.WithDescription("Counter of self-telemetry construction failures that fell back to the noop implementation."), ) if err != nil { diff --git a/components/receivers/nccl_fr/selftel_test.go b/components/receivers/nccl_fr/selftel_test.go index 24d2e246..d36b999f 100644 --- a/components/receivers/nccl_fr/selftel_test.go +++ b/components/receivers/nccl_fr/selftel_test.go @@ -40,8 +40,8 @@ func collect(t *testing.T, rdr *sdkmetric.ManualReader) metricdata.ResourceMetri } // findInstrument returns the first metricdata.Metrics whose Name matches the -// supplied OTel-dot name (e.g. "tracecore.receiver.errors_total"). Returns -// (nil, false) if absent. Scope-agnostic: walks all scope metrics. +// supplied OTel-dot name (e.g. 
"otelcol.receiver.ncclfr.errors_total"). +// Returns (nil, false) if absent. Scope-agnostic: walks all scope metrics. func findInstrument(rm metricdata.ResourceMetrics, name string) (metricdata.Metrics, bool) { for _, sm := range rm.ScopeMetrics { for _, m := range sm.Metrics { @@ -115,7 +115,7 @@ func TestSelfTelemetry_NewReceiver_NilProviderErrors(t *testing.T) { // TestSelfTelemetry_EmitsErrorsTotal_WithKindAndComponentID pins the M2 // metric contract. After IncError(kindEnumerate) ×2 + IncError(kindParse) ×1, -// the ManualReader collects tracecore.receiver.errors_total with +// the ManualReader collects otelcol.receiver.ncclfr.errors_total with // datapoints partitioned by kind and labeled with the component_id. A // regression that drops the kind label, the component_id label, or the // metric-name prefix fails here. @@ -130,9 +130,9 @@ func TestSelfTelemetry_EmitsErrorsTotal_WithKindAndComponentID(t *testing.T) { st.IncError(kindParse) rm := collect(t, rdr) - m, ok := findInstrument(rm, "tracecore.receiver.errors_total") + m, ok := findInstrument(rm, "otelcol.receiver.ncclfr.errors_total") if !ok { - t.Fatalf("metric tracecore.receiver.errors_total absent; have: %s", dumpNames(rm)) + t.Fatalf("metric otelcol.receiver.ncclfr.errors_total absent; have: %s", dumpNames(rm)) } sum, ok := m.Data.(metricdata.Sum[int64]) if !ok { @@ -177,9 +177,9 @@ func TestSelfTelemetry_EmitsEmissionsTotal(t *testing.T) { st.IncEmissions(5) st.IncEmissions(-1) rm := collect(t, rdr) - m, ok := findInstrument(rm, "tracecore.receiver.emissions_total") + m, ok := findInstrument(rm, "otelcol.receiver.ncclfr.emissions_total") if !ok { - t.Fatalf("metric tracecore.receiver.emissions_total absent; have: %s", dumpNames(rm)) + t.Fatalf("metric otelcol.receiver.ncclfr.emissions_total absent; have: %s", dumpNames(rm)) } sum, ok := m.Data.(metricdata.Sum[int64]) if !ok { @@ -205,7 +205,7 @@ func TestSelfTelemetry_ScopeNameIsReceiverImportPath(t *testing.T) { } st.IncEmissions(1) rm := 
collect(t, rdr) - scope, ok := scopeOf(rm, "tracecore.receiver.emissions_total") + scope, ok := scopeOf(rm, "otelcol.receiver.ncclfr.emissions_total") if !ok { t.Fatalf("emissions_total absent") } @@ -217,7 +217,7 @@ func TestSelfTelemetry_ScopeNameIsReceiverImportPath(t *testing.T) { // TestRecordInitError_TicksInitErrorsCounter pins: when factory wiring // fails (NewReceiver returns an error), recordInitError surfaces a -// tracecore.selftelemetry.init_errors_total tick with kind="receiver", +// otelcol.selftelemetry.init_errors_total tick with kind="receiver", // the component_id label, and reason="instrument_register". This is the // only signal that a receiver fell back to noop telemetry; dropping the // recordInitError call must fail this test. @@ -226,7 +226,7 @@ func TestRecordInitError_TicksInitErrorsCounter(t *testing.T) { recordInitError(context.Background(), mp, "receiver", testSettings().ID.String(), reasonInstrumentRegister) rm := collect(t, rdr) - m, ok := findInstrument(rm, "tracecore.selftelemetry.init_errors_total") + m, ok := findInstrument(rm, "otelcol.selftelemetry.init_errors_total") if !ok { t.Fatalf("init_errors_total absent; have: %s", dumpNames(rm)) } @@ -265,10 +265,10 @@ func TestRecordInitError_NilProviderIsSafe(t *testing.T) { // TestFactory_FallsBackToNoopWhenMeterFails pins the factory // observability contract end-to-end: when newSelfTelemetry returns an -// error (synthetic register failure for every tracecore.receiver.* +// error (synthetic register failure for every otelcol.receiver.ncclfr.* // instrument), the factory MUST (1) leave the receiver with a working // noop telemetry field (no nil, no panic on hot-path calls), AND (2) -// tick tracecore.selftelemetry.init_errors_total via recordInitError. +// tick otelcol.selftelemetry.init_errors_total via recordInitError. // This is the regression seam the dcgm sibling test pins for that // receiver; nccl_fr needs the same guarantee. 
func TestFactory_FallsBackToNoopWhenMeterFails(t *testing.T) { @@ -294,12 +294,12 @@ func TestFactory_FallsBackToNoopWhenMeterFails(t *testing.T) { recv.telemetry.IncError(kindEnumerate) rm := collect(t, rdr) - if m, ok := findInstrument(rm, "tracecore.receiver.errors_total"); ok { + if m, ok := findInstrument(rm, "otelcol.receiver.ncclfr.errors_total"); ok { if sum, ok := m.Data.(metricdata.Sum[int64]); ok && len(sum.DataPoints) > 0 { t.Errorf("noop fallback leaked IncError into errors_total datapoints: %v", sum.DataPoints) } } - m, ok := findInstrument(rm, "tracecore.selftelemetry.init_errors_total") + m, ok := findInstrument(rm, "otelcol.selftelemetry.init_errors_total") if !ok { t.Fatalf("init_errors_total absent after factory fallback; have: %s", dumpNames(rm)) } @@ -323,8 +323,8 @@ func dumpNames(rm metricdata.ResourceMetrics) string { } // failingReceiverMP wraps a real MeterProvider but fails every instrument -// registration whose name starts with "tracecore.receiver.". Mirrors the -// dcgm sibling test seam so a future refactor that reorders the +// registration whose name starts with "otelcol.receiver.ncclfr.". Mirrors +// the dcgm sibling test seam so a future refactor that reorders the // newSelfTelemetry constructor doesn't silently bypass coverage. type failingReceiverMP struct { embedded.MeterProvider @@ -339,7 +339,7 @@ type failingReceiverMeter struct { metric.Meter } -const receiverInstrumentPrefix = "tracecore.receiver." +const receiverInstrumentPrefix = "otelcol.receiver.ncclfr." var errSyntheticReceiverFailure = errors.New("synthetic: receiver instrument registration failed") diff --git a/components/receivers/pyspy/selftel.go b/components/receivers/pyspy/selftel.go index 19fb1b95..5350b770 100644 --- a/components/receivers/pyspy/selftel.go +++ b/components/receivers/pyspy/selftel.go @@ -1,13 +1,16 @@ // SPDX-License-Identifier: Apache-2.0 // Receiver-scoped self-telemetry surface. 
Replaces the v0.1.x -// dependency on `internal/selftelemetry`, which is slated for deletion -// in RFC-0013 PR-F. Metric names + label shape are preserved -// (`tracecore.receiver.errors_total{kind,component_id}` and siblings) -// so dashboards / alerts don't regress. The instrumentation scope name -// is THIS receiver's Go import path — when the receiver moves to -// `module/receiver/pyspyreceiver/` in PR-I.1, the scope name moves -// with it, matching OTel convention. +// dependency on `internal/selftelemetry`. Metric names follow the +// upstream OTel collector `otelcol___` +// convention per RFC-0013 §migration v0.1.0 namespace alignment: +// `otelcol.receiver.pyspy.errors_total{kind,component_id}` and +// siblings (Prometheus exporter renders the dots as underscores). +// Label shape is preserved (`component_id`) so multi-instance +// disambiguation in dashboards is unchanged from v0.1.x. The +// instrumentation scope name is THIS receiver's Go import path — +// when the receiver moves to `module/receiver/pyspyreceiver/` in +// PR-I.1, the scope name moves with it, matching OTel convention. 
package pyspy @@ -78,21 +81,21 @@ func newSelfTelemetry(id component.ID, mp metric.MeterProvider) (selfTelemetry, attrSet := attribute.NewSet(attribute.String("component_id", id.String())) errsCtr, err := meter.Int64Counter( - "tracecore.receiver.errors_total", + "otelcol.receiver.pyspy.errors_total", metric.WithDescription("Errors observed by a receiver, partitioned by kind"), ) if err != nil { return nil, fmt.Errorf("errors_total counter: %w", err) } emissionsCtr, err := meter.Int64Counter( - "tracecore.receiver.emissions_total", + "otelcol.receiver.pyspy.emissions_total", metric.WithDescription("Data points / events emitted by a receiver"), ) if err != nil { return nil, fmt.Errorf("emissions_total counter: %w", err) } latencyHist, err := meter.Float64Histogram( - "tracecore.receiver.collection_latency_seconds", + "otelcol.receiver.pyspy.collection_latency_seconds", metric.WithDescription("Receiver collection cycle latency in seconds"), metric.WithUnit("s"), // Bucket boundaries chosen for sub-millisecond dump-poll cycles @@ -119,7 +122,7 @@ func newSelfTelemetry(id component.ID, mp metric.MeterProvider) (selfTelemetry, st.activityUnix.Store(time.Now().Unix()) if _, err := meter.Float64ObservableCounter( - "tracecore.receiver.degraded_seconds_total", + "otelcol.receiver.pyspy.degraded_seconds_total", metric.WithDescription("Cumulative seconds the receiver has been in the degraded state"), metric.WithUnit("s"), metric.WithFloat64Callback(func(_ context.Context, obs metric.Float64Observer) error { @@ -131,7 +134,7 @@ func newSelfTelemetry(id component.ID, mp metric.MeterProvider) (selfTelemetry, } if _, err := meter.Int64ObservableGauge( - "tracecore.receiver.last_activity_unix_seconds", + "otelcol.receiver.pyspy.last_activity_unix_seconds", metric.WithDescription("Unix-second timestamp of the receiver's last successful activity"), metric.WithInt64Callback(func(_ context.Context, obs metric.Int64Observer) error { obs.Observe(st.activityUnix.Load(), 
metric.WithAttributeSet(attrSet)) @@ -219,7 +222,7 @@ func (s *selfTelemetryImpl) degradedTotalSeconds() float64 { return acc.Seconds() } -// recordInitError ticks tracecore.selftelemetry.init_errors_total when +// recordInitError ticks otelcol.selftelemetry.init_errors_total when // receiver wiring falls back to noop telemetry. Operators alert on // `> 0` to learn that self-telemetry isn't really plugged in. Panics // from a broken MeterProvider are swallowed — recordInitError IS the @@ -232,7 +235,7 @@ func recordInitError(ctx context.Context, mp metric.MeterProvider, kindLabel, co } meter := mp.Meter(instrumentationScope) c, err := meter.Int64Counter( - "tracecore.selftelemetry.init_errors_total", + "otelcol.selftelemetry.init_errors_total", metric.WithDescription("Counter of self-telemetry construction failures that fell back to the noop implementation."), ) if err != nil { diff --git a/components/receivers/pyspy/selftel_test.go b/components/receivers/pyspy/selftel_test.go index 10be6be5..818d79d1 100644 --- a/components/receivers/pyspy/selftel_test.go +++ b/components/receivers/pyspy/selftel_test.go @@ -41,7 +41,7 @@ func collectMetrics(t *testing.T, rdr *sdkmetric.ManualReader) metricdata.Resour } // findInstrument returns the first metricdata.Metrics whose Name matches the -// supplied OTel-dot name (e.g. "tracecore.receiver.errors_total"). Returns +// supplied OTel-dot name (e.g. "otelcol.receiver.pyspy.errors_total"). Returns // (nil, false) if absent. Scope-agnostic: walks all scope metrics. func findInstrument(rm metricdata.ResourceMetrics, name string) (metricdata.Metrics, bool) { for _, sm := range rm.ScopeMetrics { @@ -115,7 +115,7 @@ func TestPyspy_NewReceiver_NilProviderErrors(t *testing.T) { // TestPyspy_EmitsErrorsTotal_WithKindAndComponentID pins the M2 // metric contract. 
After IncError(kindTargetGone) ×2 + IncError(kindParseError) ×1, -// the ManualReader collects tracecore.receiver.errors_total with +// the ManualReader collects otelcol.receiver.pyspy.errors_total with // datapoints partitioned by kind and labeled with the component_id. A // regression that drops the kind label, the component_id label, or the // metric-name prefix fails here. @@ -130,9 +130,9 @@ func TestPyspy_EmitsErrorsTotal_WithKindAndComponentID(t *testing.T) { st.IncError(kindParseError) rm := collectMetrics(t, rdr) - m, ok := findInstrument(rm, "tracecore.receiver.errors_total") + m, ok := findInstrument(rm, "otelcol.receiver.pyspy.errors_total") if !ok { - t.Fatalf("metric tracecore.receiver.errors_total absent; have: %s", dumpNames(rm)) + t.Fatalf("metric otelcol.receiver.pyspy.errors_total absent; have: %s", dumpNames(rm)) } sum, ok := m.Data.(metricdata.Sum[int64]) if !ok { @@ -175,7 +175,7 @@ func TestPyspy_ScopeNameIsReceiverImportPath(t *testing.T) { } st.IncEmissions(1) rm := collectMetrics(t, rdr) - scope, ok := scopeOf(rm, "tracecore.receiver.emissions_total") + scope, ok := scopeOf(rm, "otelcol.receiver.pyspy.emissions_total") if !ok { t.Fatalf("emissions_total absent") } @@ -187,7 +187,7 @@ func TestPyspy_ScopeNameIsReceiverImportPath(t *testing.T) { // TestPyspy_RecordInitError_TicksInitErrorsCounter pins: when factory wiring // fails (newSelfTelemetry returns an error), recordInitError surfaces a -// tracecore.selftelemetry.init_errors_total tick with kind="receiver", +// otelcol.selftelemetry.init_errors_total tick with kind="receiver", // the component_id label, and reason="instrument_register". This is the // only signal that a receiver fell back to noop telemetry; dropping the // recordInitError call must fail this test. 
@@ -196,7 +196,7 @@ func TestPyspy_RecordInitError_TicksInitErrorsCounter(t *testing.T) { recordInitError(context.Background(), mp, "receiver", testID().String(), reasonInstrumentRegister) rm := collectMetrics(t, rdr) - m, ok := findInstrument(rm, "tracecore.selftelemetry.init_errors_total") + m, ok := findInstrument(rm, "otelcol.selftelemetry.init_errors_total") if !ok { t.Fatalf("init_errors_total absent; have: %s", dumpNames(rm)) } @@ -235,10 +235,10 @@ func TestPyspy_RecordInitError_NilProviderIsSafe(t *testing.T) { // TestPyspy_FallsBackToNoopWhenMeterFails pins the factory // observability contract end-to-end: when newSelfTelemetry returns an -// error (synthetic register failure for every tracecore.receiver.* +// error (synthetic register failure for every otelcol.receiver.pyspy.* // instrument), the factory MUST (1) leave the receiver with a working // noop telemetry field (no nil, no panic on hot-path calls), AND (2) -// tick tracecore.selftelemetry.init_errors_total via recordInitError. +// tick otelcol.selftelemetry.init_errors_total via recordInitError. 
func TestPyspy_FallsBackToNoopWhenMeterFails(t *testing.T) { mp, rdr := newTestMeterProvider(t) failing := &failingReceiverMP{real: mp} @@ -262,12 +262,12 @@ func TestPyspy_FallsBackToNoopWhenMeterFails(t *testing.T) { recv.telemetry.IncError(kindTargetGone) rm := collectMetrics(t, rdr) - if m, ok := findInstrument(rm, "tracecore.receiver.errors_total"); ok { + if m, ok := findInstrument(rm, "otelcol.receiver.pyspy.errors_total"); ok { if sum, ok := m.Data.(metricdata.Sum[int64]); ok && len(sum.DataPoints) > 0 { t.Errorf("noop fallback leaked IncError into errors_total datapoints: %v", sum.DataPoints) } } - m, ok := findInstrument(rm, "tracecore.selftelemetry.init_errors_total") + m, ok := findInstrument(rm, "otelcol.selftelemetry.init_errors_total") if !ok { t.Fatalf("init_errors_total absent after factory fallback; have: %s", dumpNames(rm)) } @@ -295,8 +295,8 @@ func dumpNames(rm metricdata.ResourceMetrics) string { } // failingReceiverMP wraps a real MeterProvider but fails every instrument -// registration whose name starts with "tracecore.receiver.". Mirrors the -// nccl_fr sibling test seam so a future refactor that reorders the +// registration whose name starts with "otelcol.receiver.pyspy.". Mirrors +// the nccl_fr sibling test seam so a future refactor that reorders the // newSelfTelemetry constructor doesn't silently bypass coverage. type failingReceiverMP struct { embedded.MeterProvider @@ -311,7 +311,7 @@ type failingReceiverMeter struct { metric.Meter } -const receiverInstrumentPrefix = "tracecore.receiver." +const receiverInstrumentPrefix = "otelcol.receiver.pyspy." 
var errSyntheticReceiverFailure = errors.New("synthetic: receiver instrument registration failed") diff --git a/docs/examples/prometheus-alerts.example.yaml b/docs/examples/prometheus-alerts.example.yaml index 6132671d..d521eac8 100644 --- a/docs/examples/prometheus-alerts.example.yaml +++ b/docs/examples/prometheus-alerts.example.yaml @@ -1,40 +1,62 @@ -# Starter Prometheus alert rules for tracecore's M2 self-telemetry -# surface. Five alerts cover the operationally meaningful failure -# modes. Adjust thresholds + `for:` durations to match your SLOs; -# the defaults here are deliberately conservative. +# Starter Prometheus alert rules for tracecore's RFC-0013 namespace- +# aligned self-telemetry surface. Four alerts cover the operationally +# meaningful failure modes that the in-tree receivers + exporters emit +# directly via their own MeterProvider (per-component +# `otelcol...` instruments). Adjust thresholds + +# `for:` durations to match your SLOs; the defaults here are +# deliberately conservative. # # Wire by including under `rule_files:` in prometheus.yml, or convert # to a PrometheusRule CRD for prometheus-operator setups. # -# RFC-0013 migration note: the `tracecore_*` metric names below are -# replaced by standard `otelcol_*` names at v0.1.0 per RFC-0013 §4 -# (v0.1.0 operator-visible breaks). The rules will need migration via -# the v0.2.0 OTTL normalization layer (RFC-0013 §3) - do not rewrite -# the rules in this PR; that work is scoped to the v0.2.0 cut. +# Per-receiver alert rules also live under +# `components/receivers//prometheus-alerts.example.yaml` with +# kind-specific severities (rotation_stalled, cursor_write_failed, +# etc.). The rules in this file are receiver-agnostic — they target +# the regex `otelcol_receiver_.*_(errors|degraded|emissions)_*` so a +# new receiver inherits coverage on first scrape. +# +# RFC-0013 namespace alignment (v0.1.0): instrument names changed from +# `tracecore.*` → `otelcol...*`. 
See +# `docs/migration/v0.1-to-v0.2.md` for the full rename table. + groups: - name: tracecore.self-telemetry interval: 30s rules: - # Exporter failure rate sustained — operators' #1 page-worthy - # signal. The gauge is a rolling 60s window; alerting above 0.01 - # over 5 minutes means a real upstream is rejecting pushes. - - alert: TracecoreExporterFailureRateHigh - expr: tracecore_exporter_failure_rate > 0.01 + # Exporter call-failure rate sustained — operators' #1 page-worthy + # signal. The counter is per-exporter; sum across exporters or + # filter by `exporter=~"otlphttp|stdoutexporter"`. Threshold of + # 0.01 (1%) over 5m means a real backend is rejecting pushes. + - alert: TracecoreExporterCallFailureRateHigh + expr: | + ( + sum by (component_id) ( + rate({__name__=~"otelcol_exporter_.*_calls_total", result="failure"}[5m]) + ) + / + clamp_min(sum by (component_id) ( + rate({__name__=~"otelcol_exporter_.*_calls_total"}[5m]) + ), 0.001) + ) > 0.01 for: 5m labels: severity: warning annotations: - summary: tracecore exporter failure rate exceeds 1% (5m sustained) + summary: tracecore exporter call failure rate exceeds 1% (5m sustained) description: | - tracecore_exporter_failure_rate is {{ $value }} on - {{ $labels.instance }} for the last 5 minutes. Inspect - tracecore_exporter_calls_total{result="failure"} to see - which exporter is failing. + Exporter {{ $labels.component_id }} on {{ $labels.instance }} + is failing ≥1% of Consume* calls over the last 5 minutes. + Inspect the per-exporter + `otelcol_exporter__calls_total{result="failure"}` + stream broken out by `kind` (marshal / io / downstream) + to localize the cause. # Receiver stuck in degraded state — accumulating # degraded-seconds while no recovery transition fires. 
- alert: TracecoreReceiverDegraded - expr: rate(tracecore_receiver_degraded_seconds_total[5m]) > 0.05 + expr: | + rate({__name__=~"otelcol_receiver_.*_degraded_seconds_total"}[5m]) > 0.05 for: 5m labels: severity: warning @@ -48,24 +70,25 @@ groups: # Stale activity — a receiver that hasn't reported activity in # 5 minutes. Possible deadlock or stuck upstream. - alert: TracecoreReceiverNoActivity - expr: (time() - tracecore_receiver_last_activity_unix_seconds) > 300 + expr: | + (time() - {__name__=~"otelcol_receiver_.*_last_activity_unix_seconds"}) > 300 for: 1m labels: severity: warning annotations: summary: tracecore receiver {{ $labels.component_id }} has been silent for >5 minutes description: | - tracecore_receiver_last_activity_unix_seconds is + The last_activity_unix_seconds gauge is {{ $value | humanizeDuration }} behind wall-clock on {{ $labels.instance }} for component - {{ $labels.component_id }}. The receiver may be - wedged or its upstream may have disappeared. + {{ $labels.component_id }}. The receiver may be wedged or + its upstream may have disappeared. # Self-telemetry construction silently fell back to noop. # Operators see no per-component metrics from this binary even # though the surface is up; investigate the binary's log. - alert: TracecoreSelftelemetryInitErrors - expr: increase(tracecore_selftelemetry_init_errors_total[10m]) > 0 + expr: increase(otelcol_selftelemetry_init_errors_total[10m]) > 0 labels: severity: warning annotations: @@ -74,20 +97,5 @@ groups: Component {{ $labels.component_id }} (kind={{ $labels.kind }}) fell back to the noop selftelemetry impl on {{ $labels.instance }}; reason={{ $labels.reason }}. - Per-component metrics from this component are absent - from the scrape. - - # Build identity informational — fires on a join against an - # external "blessed version" set. Operators redefine the - # expression to match their fleet management approach; the - # alert ships as scaffolding. 
- - alert: TracecoreBuildIdentityKnown - expr: tracecore_build_info == 1 - labels: - severity: info - annotations: - summary: tracecore {{ $labels.version }} ({{ $labels.revision }}) on {{ $labels.instance }} - description: | - Informational. Replace with a real version-drift alert - once your fleet defines a "current" tracecore version - (e.g., `tracecore_build_info{version!="v0.2.0"}`). + Per-component metrics from this component are absent from + the scrape. diff --git a/docs/migration/v0.1-to-v0.2.md b/docs/migration/v0.1-to-v0.2.md index 25ec510e..cb0c3b04 100644 --- a/docs/migration/v0.1-to-v0.2.md +++ b/docs/migration/v0.1-to-v0.2.md @@ -104,6 +104,44 @@ The two metrics the chart commits to keeping available across OCB version bumps [`internal/integration/ocb_scrape_test.go`](../../internal/integration/ocb_scrape_test.go) (`TestOCBScrape_UpstreamMetricVocabulary`) is the regression gate: an upstream rename of either metric fails this test before it can ship. +### In-tree receiver / exporter namespace alignment (RFC-0013 v0.1.0) + +The four surviving in-tree components (`nccl_fr`, `pyspy`, `otlphttp`, `stdoutexporter`) emit their own self-telemetry via a per-component MeterProvider. Their instrument names changed from the v0.1.x `tracecore.*` family to the upstream `otelcol___` convention so the in-tree namespace does not collide with the OCB pipeline-runtime's own `otelcol_*` family. The four legacy in-tree receivers (`clockreceiver`, `containerstdout`, `k8sevents`, `kernelevents`) and the in-tree boot-path internals were deleted in RFC-0013 PR-K.2 / PR-F.2 — see the "Orphan in-tree components" table above for their upstream-receiver replacements. + +The Prometheus scrape endpoint is unchanged (`:8888/metrics`); the metric NAMES on that endpoint changed. 
+
+| v0.1.x metric (Prom underscore form) | v0.1.0 metric (post-rename) |
+|---|---|
+| `tracecore_receiver_errors_total{component_id=~"<component>/.*",kind}` | `otelcol_receiver_<component>_errors_total{component_id,kind}` |
+| `tracecore_receiver_emissions_total{component_id=~"<component>/.*"}` | `otelcol_receiver_<component>_emissions_total{component_id}` |
+| `tracecore_receiver_collection_latency_seconds{component_id=~"<component>/.*"}` | `otelcol_receiver_<component>_collection_latency_seconds{component_id}` |
+| `tracecore_receiver_degraded_seconds_total{component_id=~"<component>/.*"}` | `otelcol_receiver_<component>_degraded_seconds_total{component_id}` |
+| `tracecore_receiver_last_activity_unix_seconds{component_id=~"<component>/.*"}` | `otelcol_receiver_<component>_last_activity_unix_seconds{component_id}` |
+| `tracecore_exporter_calls_total{component_id=~"<component>/.*",result,kind}` | `otelcol_exporter_<component>_calls_total{component_id,result,kind}` |
+| `tracecore_selftelemetry_init_errors_total{kind,component_id,reason}` | `otelcol_selftelemetry_init_errors_total{kind,component_id,reason}` |
+
+Where `<component>` is the OCB component name without underscores. Per-component substitutions:
+
+| Component | `<component>` |
+|---|---|
+| `components/receivers/nccl_fr` | `ncclfr` (note: underscore stripped) |
+| `components/receivers/pyspy` | `pyspy` |
+| `components/exporters/otlphttp` | `otlphttp` |
+| `components/exporters/stdoutexporter` | `stdoutexporter` |
+
+**Label shape is preserved.** `component_id` continues to partition per-instance (e.g. `ncclfr/default`); the `kind` label values are unchanged (`watch`, `parse`, etc.). Dashboards and alerts that filtered on `kind` need only the metric-name rename, not a label-selector rewrite.
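The rename matrix and per-component substitutions above can be sketched as a small Go helper, which may be useful when rewriting dashboard or alert definitions in bulk. This is only an illustration under stated assumptions: the function name `renameLegacyMetric` and the `renamedSuffixes` table are hypothetical and do not exist in the repository, and the "strip underscores from the directory name" rule is inferred from the four components listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// renamedSuffixes lists, per component kind, the metric suffixes that the
// rename matrix covers. Hypothetical table, transcribed from the migration doc.
var renamedSuffixes = map[string][]string{
	"receiver": {
		"errors_total",
		"emissions_total",
		"collection_latency_seconds",
		"degraded_seconds_total",
		"last_activity_unix_seconds",
	},
	"exporter": {"calls_total"},
}

// renameLegacyMetric maps a v0.1.x Prometheus-form metric name to its
// namespace-aligned name. component is the in-tree directory name
// (for example "nccl_fr"). It returns false for names outside the matrix.
func renameLegacyMetric(legacy, component string) (string, bool) {
	// The init-errors counter is shared and carries no component segment.
	if legacy == "tracecore_selftelemetry_init_errors_total" {
		return "otelcol_selftelemetry_init_errors_total", true
	}
	// Assumption: the component segment is the directory name minus underscores.
	comp := strings.ReplaceAll(component, "_", "")
	for kind, suffixes := range renamedSuffixes {
		prefix := "tracecore_" + kind + "_"
		if !strings.HasPrefix(legacy, prefix) {
			continue
		}
		suffix := strings.TrimPrefix(legacy, prefix)
		for _, s := range suffixes {
			if s == suffix {
				return "otelcol_" + kind + "_" + comp + "_" + suffix, true
			}
		}
	}
	return "", false
}

func main() {
	for _, legacy := range []string{
		"tracecore_receiver_errors_total",
		"tracecore_selftelemetry_init_errors_total",
		"tracecore_exporter_failure_rate", // not in the matrix
	} {
		renamed, ok := renameLegacyMetric(legacy, "nccl_fr")
		fmt.Println(legacy, "->", renamed, ok)
	}
}
```

The helper deliberately rejects names outside the matrix (such as the removed `tracecore_exporter_failure_rate` gauge) rather than guessing a target, since those have no one-to-one replacement. Label selectors are left untouched, matching the note that label shape is preserved.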
+ +**Migration recipe for PromQL rules:** + +```diff +- rate(tracecore_exporter_calls_total{component_id=~"otlphttp/.*", result="error"}[5m]) ++ rate(otelcol_exporter_otlphttp_calls_total{component_id=~"otlphttp/.*", result="error"}[5m]) +``` + +For OTLP-emitted dashboards consuming the OTel-dot form, the prefix changed from `tracecore.receiver.*` to `otelcol.receiver..*` (dots, not underscores; backend converts at scrape). + +The receiver-agnostic starter at `docs/examples/prometheus-alerts.example.yaml` has been rewritten against the new namespace using regex matchers (`{__name__=~"otelcol_receiver_.*_errors_total"}`) so a future in-tree receiver inherits coverage on first scrape. + ### `stdoutexporter` failure-rate gap In v0.1.x, the in-tree `stdoutexporter` contributed to the aggregate `tracecore_exporter_failure_rate` gauge via `selftelemetry.ExporterCarrier`. In v0.2.0, the upstream `debugexporter` (which replaces `stdoutexporter` as the chart default) writes to stdout and effectively never fails — the `otelcol_exporter_send_failed_*` counter stays pinned at zero.