Skip to content

[telemetry] M2 self-telemetry surface — MeterProvider + /metrics + /healthz + /readyz + ReportStatus migration#17

Merged
trilamsr merged 42 commits into
mainfrom
feat/m2-self-telemetry
May 14, 2026
Merged

[telemetry] M2 self-telemetry surface — MeterProvider + /metrics + /healthz + /readyz + ReportStatus migration#17
trilamsr merged 42 commits into
mainfrom
feat/m2-self-telemetry

Conversation

@trilamsr

@trilamsr trilamsr commented May 14, 2026

Copy link
Copy Markdown
Contributor

What landed (at a glance)

Area Shipped Carry-forward Reference
HTTP surface /metrics + /healthz + /readyz on one listener; default-OFF; default localhost:8888; slowloris timeouts; panic-recovery middleware; JSON content negotiation; Cache-Control: no-store on probes pprof (security policy work); JSON richer payload internal/telemetry/server.go, RFC-0006
Producer interfaces selftelemetry.Receiver (5 methods) + selftelemetry.Exporter (call success/failure) + FailureRateReader + ExporterCarrier; Kind* canonical constants + IsCanonicalKind selftelemetry.Processor; runtime cap on IncError(kind) cardinality internal/selftelemetry/*.go
MeterProvider OTel SDK metric + Prom exporter, cached PromHandler, idempotent Shutdown OTLP push reader; MetricsLevel knob internal/telemetry/meter_provider.go
SLO observable gauges tracecore.exporter.failure_rate (real signal, rolling 60s window via WindowedRate), tracecore.queue.depth_ratio (honest zero), tracecore.component.restart_count_per_hour (honest zero); raw tracecore.exporter.calls_total{result,kind} counter Per-exporter failure_rate{component_id}; histogram bucket tuning internal/telemetry/slo.go + windowed_rate.go
Build identity tracecore_build_info{command,version,revision} Prom join-target; label-name validation internal/telemetry/build_info.go
Pipeline shape changes CreateSettings gains BuildInfo + _ struct{} guard; TelemetrySettings gains MeterProvider + _ struct{} guard; pipelinebuilder auto-populates Resource with host.name/service.{name,version,instance.id} Per-role Settings split (post-v1); TracerProvider field (tracing milestone) internal/pipeline/{component,factory}.go, internal/pipelinebuilder/builder.go
Component status Host.ReportStatus method → free fn internal/componentstatus.ReportStatus(host, ev); StatusReporter opt-in interface internal/componentstatus/
Component wire-ups clockreceiver + stdoutexporter as canonical examples; selftelemetry init-failure surfaces as tracecore.selftelemetry.init_errors_total M8/M9 receivers rebase cleanly (verified via git merge-tree) components/receivers/clockreceiver/clockreceiver.go, components/exporters/stdoutexporter/stdoutexporter.go
Binary lifecycle runCollectinitTelemetryStack + registerObservability + runRuntime helpers; shutdown errors logged; --version includes OTel SDK + Prom exporter versions; validate --explain dumps resolved config validate --explain provenance (default vs explicit) cmd/tracecore/{collect,validate,main}.go
Config telemetry: { enabled, listen, paths } block; defaults applied; config.ValidateMountPath is the single source of truth for path validation internal/config/config.go
Test infrastructure pipelinetest.{SyncBuffer,RecordingMetricsSink,FailingMetricsSink}; goleak + fd-count fault injection; benchmark (<5µs at 100 readers, 0 allocs); example godoc tests selftelemetry.NewRecordingReceiver(t) helper; q-value-aware Accept parser internal/pipeline/pipelinetest/fakes.go, internal/telemetry/{leak,slo_bench}_test.go
Documentation RFC-0006 (spec + deprecation policy); internal/telemetry/README.md (operator quickstart + security posture + perf table); package doc.go files; docs/examples/with-telemetry.yaml (CI-gated); MILESTONES + FOLLOWUPS + FAILURE-MODES updated Datadog backend compatibility (currently hedged as unverified)
STRATEGY changes 3 M2-flagged divergence rows close (Host.ReportStatus, CreateSettings, TelemetrySettings); 3 new permanent rows (localhost:8888 default, tracecore.* namespace, in-tree componentstatus) OTel-collector reconciliation (M22 milestone) docs/STRATEGY.md

Process

  • Loops 1–3: 5 research passes (compressed into iterations 1–2), 3 scrutiny passes, ~25 coding passes.
  • Loop 4: 3 review passes × 2 reviewer subagents = 6 agents. ~30 substantive findings; 20+ applied in commits; rest indexed in docs/FOLLOWUPS.md with revisit triggers.
  • Self-review: 10 phases addressing 20 items I flagged about my own work.
  • A+ second-pass reviewer: 13 line-level findings (strong/nit/question); applied in 4 commit batches; deferred items added to FOLLOWUPS phase index.
  • 42 commits total. All DCO-signed + Assisted-by: trailer; no history rewrites; branch off main; PR opened via pr skill.

STRATEGY divergence table changes

Row Before After
Host.ReportStatus open / M2 done (M2)
CreateSettings shape open / M2 (BuildInfo+guard) done (M2)
TelemetrySettings shape open / M2 (MeterProvider) done (M2)
Default HTTP bind permanent: localhost:8888 (security tiebreaker over OTel 0.0.0.0:8888)
Self-metric namespace permanent: tracecore.* (separate surface from otelcol_*)
componentstatus location permanent: in-tree at internal/componentstatus (revisit at M22)

Linked issue(s)

N/A — milestone work tracked in MILESTONES.md §M2.

Release notes

[FEATURE] M2 self-telemetry surface: opt-in HTTP server exposing /metrics (Prometheus), /healthz, /readyz on a single listener; OTel SDK MeterProvider injected into every component via TelemetrySettings; clockreceiver + stdoutexporter emit self-telemetry as canonical wiring examples; three O2 SLO observable gauges (exporter.failure_rate windowed at 60s, queue.depth_ratio, component.restart_count_per_hour) plus tracecore_build_info join-target registered; JSON content negotiation on /healthz + /readyz; validate --explain for resolved-config dump. M8/M9 receiver branches rebase cleanly with zero Go-code conflicts.

Checklist

  • Tests added or updated — unit + integration across all new packages; benchmarks; example config CI-gated; goleak + fd-count fault injection; godoc examples.
  • make ci passes locally — lint clean, all tests pass under -race, coverage above floors (selftelemetry 90.1%, telemetry 88.0%, componentstatus 100%, config 95.2%, clockreceiver 91.2%, pipelinetest 86.8%), govulncheck clean, doc-check 31 refs verified.
  • Commits are signed off (git commit -s)
  • Commits cryptographically signed (SSH)
  • N/A — no new components (only internal/ packages + existing component wire-ups)

trilamsr added 30 commits May 14, 2026 03:25
Three integration tests in cmd/tracecore time out under
`go test -race ./...` while passing in isolation:

- TestIntegration_ClockreceiverToStdoutexporter
- TestIntegration_SIGINT
- TestIntegration_DoubleSIGTERM_DoesNotHang

Root cause: 1.5s scenario deadlines are too tight when the
race detector and concurrent package tests compete for CPU.
The polling goroutine (25ms ticker) can be stalled for hundreds
of ms at a time, and the subprocess itself takes longer to print
its first log line under contention.

The "tracecore running" log fires in ~50-200ms under normal
load; clockreceiver emits 3 metric lines in ~300ms at 100ms
interval. 5s gives ~16× margin over expected timing without
hiding real regressions.

Memory observation 782 already fixed TestIntegration_SignalDuringBoot
with a signal-retry loop; that pattern does not transfer to
data-flow / log-flow waits where the only knob is the deadline.

Verified `make ci` green 3 consecutive runs after the change.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Add BuildInfo to CreateSettings and MeterProvider to TelemetrySettings.
Both structs gain the trailing `_ struct{}` unkeyed-init guard so
future field additions stay non-breaking. Mirrors OTel collector
component@v1.55.0 shapes — see docs/STRATEGY.md M2 rows for
"CreateSettings shape" and "TelemetrySettings shape."

BuildInfo carries the three OTel-standard fields (Command,
Description, Version). Revision + BuildDate stay in
`internal/version`, not duplicated here.

TelemetrySettings.MeterProvider is `metric.MeterProvider` from
`go.opentelemetry.io/otel/metric`. Zero-value is nil; the runtime
will substitute `metric/noop.NewMeterProvider()` in a follow-up so
receivers never have to nil-check. Test
TestTelemetrySettings_ZeroValue_MeterProviderIsNil documents that
contract.

Existing call sites (pipelinebuilder, pipelinetest fixture) use
keyed init and stay valid. No callers in tree set BuildInfo or
MeterProvider yet; M2 work-items 10+11 wire them through Runtime
and clockreceiver respectively.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Introduce `internal/telemetry` and its production `MeterProvider`:
OTel SDK MeterProvider backed by a single OTel Prometheus exporter
that writes to a tracecore-owned `*prometheus.Registry`. The bundled
PromHandler() is the http.Handler the /metrics endpoint will mount
in a follow-up.

The returned struct exposes:
- `Provider metric.MeterProvider`: what receivers see via
  TelemetrySettings.MeterProvider in subsequent work items.
- `Registry *prometheus.Registry`: the scrape source.
- `Shutdown(ctx)`: idempotent drain hook for the runtime's
  shutdown unwinder.

Tests pin the receiver-author contract (Meter yields a working
Int64Counter whose Add is observable via the scrape) and the
operator-facing contract (HEAD requests return 200 like GET, so
k8s probes that prefer HEAD don't break). The idempotency contract
on Shutdown matches `Component.Shutdown` per the runtime's
documented invariant.

Coverage 76.9% (floor 70%). The OTel Prometheus exporter is at
v0.65.x (pre-1.0) — API churn is wrapped behind NewMeterProvider so
future bumps touch one file, per the Loop-2 scrutiny verdict on
"OTel Prom exporter API churn" risk.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
NewReceiver(id, mp) returns a Receiver backed by five OTel metric
instruments registered against the passed-in MeterProvider:

- tracecore.receiver.errors_total (Int64Counter, attr: kind,
  component_id) — IncError.
- tracecore.receiver.emissions_total (Int64Counter, attr:
  component_id) — IncEmissions, with the contract's silent-discard
  of negatives.
- tracecore.receiver.collection_latency_seconds (Float64Histogram,
  attr: component_id) — ObserveLatency in seconds.
- tracecore.receiver.degraded_seconds_total (Float64Observable
  Counter, attr: component_id) — derived from SetDegraded timing.
  An observable callback reports cumulative time degraded INCLUDING
  the currently-open interval, so scrapes always see fresh data
  even if SetDegraded(false) is never called.
- tracecore.receiver.last_activity_unix_seconds (Int64Observable
  Gauge, attr: component_id) — MarkActivity timestamp.

component_id ("kind/instance" from pipeline.ID) is attached at
construction time via attribute.NewSet and shared across emissions;
no per-call allocation cost.

The observable-counter trick for degraded_seconds_total side-steps
the gotcha of "synchronous counter only flushes on transition" —
if a receiver never recovers, the counter still advances at every
scrape, which is what operators alerting on `rate(... > 0)` expect.

Atomic pointer for degradedAt + atomic uint64 for accumulated
nanoseconds keep the observable callback lock-free. SetDegraded
serializes via a small mutex to prevent double-flushing.

NilMeterProvider returns ErrNilMeterProvider rather than panicking;
the runtime is responsible for substituting noop.NewMeterProvider().

Coverage 88.6% (floor 70%).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Add `telemetry.Server` that mounts all three operator-facing endpoints
on a single ServeMux behind one listener. Single-listener design keeps
the attack-surface footprint minimal (one port for ops to firewall, one
TLS terminator to configure) and avoids the "/healthz on a different
port than /metrics" footgun.

Surfaces:

- /metrics — promhttp handler over the OTel-fed registry from WI2.
- /healthz — 200 unless shutting down; 503 once Shutdown begins.
- /readyz  — 200 only when the optional ReadyFn predicate returns
  true (and not shutting down). 503 otherwise. The runtime wires
  ReadyFn to its "every component Start returned" check in a
  follow-up; without it, /readyz behaves like /healthz.

Lifecycle:

- Start(ctx) binds the listener synchronously so a port-conflict
  error surfaces immediately, then spawns Serve in a goroutine.
- Shutdown(ctx) flips the shuttingDown atomic (handlers start
  returning 503), calls http.Server.Shutdown bounded by
  ShutdownBudget (800ms — leaves headroom in the PRINCIPLES §1 1s
  budget), then waits for Serve to return.
- Both Start and Shutdown are idempotent: double-Start returns an
  error; double-Shutdown is a no-op; Shutdown without Start is a
  no-op. Matches Component.Shutdown's documented contract.

Config validation rejects empty Listen, nil MeterProvider, paths
missing the leading "/" — operator-facing errors point at the
specific YAML field per STYLE-errors.

ReadHeaderTimeout 5s caps the slow-Loris vector. No further hardening
in M2; production deploys front this with a reverse proxy.

Coverage 73.5% (floor 70%).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Add `telemetry: { enabled, listen, paths }` to internal/config.
Default OFF so production binaries don't open a port until
operators opt in (the security-conscious-operator persona's
tiebreaker per docs/loops/m2-research.md Pass 3).

Defaults when enabled:
  listen:  "localhost:8888"      (localhost-only is safer than
                                  :8888; operators flip to ":8888"
                                  on the first multi-node scrape)
  paths:
    metrics: /metrics
    healthz: /healthz
    readyz:  /readyz

The applyDefaults / validate split keeps disabled blocks untouched,
so operators may keep a fully-spelled production template
commented or with `enabled: false` without tripping validation.

Validation errors name the offending YAML field (e.g.,
"telemetry.paths.metrics: must be an absolute path starting with
'/'") per STYLE-errors.

Coverage 93.1% (floor 70%).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Move status reporting from `Host.ReportStatus(ev)` method to free
function `componentstatus.ReportStatus(host, ev)` per docs/STRATEGY.md
M2 row for "Host.ReportStatus." Aligns with OTel collector v0.152's
componentstatus package without pulling the external dep.

Changes:
- New `internal/componentstatus` package: `ReportStatus(host, ev)` free
  fn + optional `StatusReporter` interface (host opts in).
- `pipeline.Host` interface loses the `ReportStatus` method; it's now
  just `GetExtensions()` (mirrors OTel `component.Host` shape at
  v1.55.0).
- `pipeline.noopHost` loses its ReportStatus method.
- `pipelinetest.Host.ReportStatus` renamed to `ReportComponentStatus`
  so it implements the optional interface; tests call the free fn.
- runtime_test asserts the noop-host path is silent-discard, not a
  panic — matches OTel's documented contract for hosts that don't
  implement StatusReporter.

Receivers calling ReportStatus today (none in tree yet) port with a
one-line edit:
  -    host.ReportStatus(ev)
  +    componentstatus.ReportStatus(host, ev)

100% coverage on the new package; existing tests still green.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Add functional options `WithMeterProvider` + `WithBuildInfo` to
`pipelinebuilder.BuildPipelines`. Both flow through into every
component's `pipeline.CreateSettings.{BuildInfo, Telemetry.MeterProvider}`.

Default MeterProvider is `noop.NewMeterProvider()` so receivers
calling `telSet.MeterProvider.Meter(...)` never hit a nil pointer
even when the caller skips the option (matches the
TestTelemetrySettings_ZeroValue_MeterProviderIsNil contract: the
RUNTIME is responsible for substitution, not the receiver).

Default BuildInfo is the zero value (acceptable for tests; the binary
populates it from `internal/version`).

The options pattern keeps the four-positional-arg signature
backward-compatible across the ~10 existing call sites in
cmd/tracecore tests; only sites that need M2's wiring opt in.

Tests pin both contracts (default-noop, flow-through for both
options).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Construct the real selftelemetry.Receiver in newReceiver() using
set.Telemetry.MeterProvider and set.ID, and call it on every tick:

  - ObserveLatency: wraps the ConsumeMetrics push.
  - IncEmissions(1): on success.
  - IncError("downstream"): on push failure.
  - MarkActivity: on success (mirrors emissions for now;
    receivers with separate "collected vs emitted" semantics may
    diverge).

The receiver retains a `selftelemetry.Receiver` field (never nil)
and falls back to the noop impl if either MeterProvider is missing
OR NewReceiver fails. The fallback path logs a warning via the
component-scoped logger so an operator notices in stderr while the
hot path stays branch-free.

Coverage 83.1% (clockreceiver), test suite still green. No
behavior change to the metric data path; only the self-telemetry
side channel.

This is the canonical wiring pattern for M8 (DCGM) and M9
(kernelevents) — those receivers should copy the newReceiver()
shape verbatim. Documented in docs/agents/RECEIVER-PATTERNS.md in
a follow-up.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
cmd/tracecore.runCollect now plumbs the M2 telemetry surface when the
operator enables it in config:

1. Build BuildInfo from internal/version.
2. If cfg.Telemetry.Enabled: construct telemetry.MeterProvider +
   telemetry.Server. Start the Server FIRST so /healthz + /readyz are
   observable from the moment the binary commits to running.
3. Pass MeterProvider + BuildInfo into pipelinebuilder.BuildPipelines
   via the WI10a options.
4. After Runtime.Start succeeds, flip the `ready` atomic so /readyz
   transitions 503 → 200. ReadyFn closes over `ready.Load`.
5. On signal: flip `ready` to false (scrapers stop sending traffic),
   shut down Runtime, then telemetry server, then MeterProvider in
   that order — connections close before metric exports drain.

Integration test pins the contract end-to-end:
- TestIntegration_TelemetrySurface_EndToEnd: boots with telemetry
  enabled, polls /healthz + /readyz to 200, scrapes /metrics and
  asserts the three clockreceiver self-metric families appear
  (emissions_total, last_activity_unix_seconds,
  collection_latency_seconds), SIGTERMs cleanly.
- TestIntegration_TelemetryDisabled_NoListener: confirms the
  default-off contract — a config without `telemetry:` does NOT
  bind any port; localhost:8888 refuses connections.

Bump go directive 1.26.2 → 1.26.3 to pull in the fix for
GO-2026-4971 (Windows-only net.Dial NUL-byte panic). Required
because govulncheck traces our new net.Listen call site to the
vulnerable stdlib version; without the bump CI fails on darwin/linux
even though the bug can't trigger on those platforms.

Coverage: cmd/tracecore at 68.6% informational; new components +
selftelemetry + telemetry stay above their 70% floors.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Land the M2 documentation set:

- RFC-0006: spec for the self-telemetry surface (producer +
  operator surfaces, lifecycle, divergences, risks, acceptance
  tests).
- internal/telemetry/README.md: operator quickstart — config,
  endpoint contracts, self-metric taxonomy, backend compatibility,
  failure modes, receiver-author wiring snippet.
- CHANGELOG.md [Unreleased] / Added: M2 bullet covering the surface,
  the wire-up, and the three STRATEGY divergence rows closed.
- MILESTONES.md M2: marked delivered; explicit Carry-forward bullets
  for pprof, O2-SLO exporter/queue metrics, OTLP push reader,
  MetricsLevel knob, bucket tuning, per-role CreateSettings split,
  TracerProvider field. Acceptance criteria ticked vs deferred per
  the loop's decision.
- STRATEGY.md divergence table: three M2 rows move from "open" to
  "done"; added three new "permanent" rows for the localhost-only
  default, the tracecore.* metric namespace, and the in-tree
  componentstatus package.
- docs/FAILURE-MODES.md: new "Self-telemetry surface (M2)" section
  with 11 rows (default-off, port-conflict, validate errors,
  /healthz + /readyz state transitions, shutdown budgets, fd-leak
  guard).
- docs/FOLLOWUPS.md: "Considered and explicitly skipped (M2)"
  section — MetricsLevel, Candidate B (Prom-direct), Candidate C
  (hybrid), per-role split, TracerProvider — with reasons so the
  decisions don't get re-suggested.
- docs/agents/RECEIVER-PATTERNS.md: the canonical six-line
  self-telemetry wiring pattern M8/M9 should copy.
- docs/agents/REVIEWER-CONTEXT.md: divergences table updated to
  reflect M2 closures + the new permanent rows.

`make ci` clean: 31 doc-check references verified; coverage above
floors; no vulnerabilities.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two doc updates that didn't make the prior docs commit. Add the
M2 [Unreleased] / Added bullet to CHANGELOG (covers the surface,
the wire-up, and the three STRATEGY divergence rows closed) and
the canonical self-telemetry wiring pattern under "Active
patterns" in RECEIVER-PATTERNS.md (M8/M9 should copy the six-line
constructor pattern verbatim).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address overwhelming-success criterion 10 by emitting the three M2
SLO gauges with the exact names the criterion lists:

- tracecore.exporter.failure_rate   — driven by real exporter signal
- tracecore.queue.depth_ratio       — 0 (M2 carry-forward; needs queue mechanism)
- tracecore.component.restart_count_per_hour — 0 (M2 carry-forward; needs restart mechanism)

Producer side:

- New `selftelemetry.Exporter` interface (`IncCallSuccess`,
  `IncCallFailure(kind)`) parallel to `Receiver`. Real impl in
  `selftelemetry.NewExporter(id, mp)` emits
  `tracecore.exporter.calls_total{result,kind,component_id}` and
  satisfies `FailureRateReader` (SuccessCount/FailureCount) so the
  SLO layer can aggregate.
- `noopExporter` matches the noop pattern from Receiver.
- `stdoutexporter` wired: `IncCallSuccess()` on success path,
  `IncCallFailure("marshal" | "io")` on the two failure paths.
  Exposes `SelfExporter()` so the runtime can pluck the
  FailureRateReader.

Operator side:

- `internal/telemetry.RegisterSLOMetrics(mp, src)` registers the
  three Float64ObservableGauges with the exact criterion names.
- `internal/telemetry.SLOSource` is the values-feeder interface.
  `AggregateSLOSource` reads from every registered exporter via
  `ExporterRegistry`. `ZeroSLOSource` is the no-signal stub for
  tests + `tracecore validate`.
- Div-by-zero guard: total=0 returns 0 rather than NaN, so
  Prometheus alerts firing on `> 0` don't get poisoned.

Wire-up in `cmd/tracecore`:

- After BuildPipelines returns, walk pipelines.Exporters, collect
  every `SelfExporter()` that satisfies FailureRateReader, hand to
  `AggregateSLOSource`, call `RegisterSLOMetrics`.

Tests:

- `internal/telemetry/slo_test.go`: all-three-names-appear,
  failure-rate math, no-calls-is-zero (div-by-zero guard),
  queue+restart carry-forward zero, nil-arg rejection, end-to-end
  failure_rate from a registered exporter.
- `internal/selftelemetry/exporter_test.go`: success+failure
  counter labels, FailureRateReader satisfaction, nil-provider
  rejection, noop discard.
- Integration test in `cmd/tracecore/integration_telemetry_test.go`
  extended to assert all three SLO names + the underlying
  `tracecore_exporter_calls_total` counter appear in the scrape.

Coverage: telemetry 78.2%, selftelemetry 89.8%, stdoutexporter
72.2%. All above floors. make ci green.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Update MILESTONES.md M2 row to reflect the SLO observable gauges
now ship; CHANGELOG entry documents the three named gauges +
the raw exporter.calls_total counter; internal/telemetry/README.md
gains the exporter-side and SLO-gauge taxonomy.

Carry-forwards updated: queue + restart mechanisms now point at
the gauges that are already registered + reporting 0 until the
underlying mechanisms land, so operators can write Prometheus
alerts against the names today.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Add an assertion for `tracecore_receiver_degraded_seconds_total` to
the end-to-end integration test. Observable counters emit on every
scrape via their callback, so the value is 0 (clockreceiver never
calls SetDegraded), but the metric name + zero data point surface —
bringing integration coverage to 4 of the 5 receiver-side methods.

The 5th, IncError → `tracecore_receiver_errors_total`, only surfaces
on actual downstream failure. clockreceiver → stdoutexporter doesn't
fail under normal operation, so that path is unit-tested
(TestReceiver_IncError) rather than integration-tested.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two real bugs found on self-review of the M2 PR:

1. Server.Start that fails AFTER the started.CAS leaves started=true
   but no Serve goroutine spawned. Any subsequent Shutdown call
   (e.g., from cmd/tracecore's failure-unwinder) blocks on
   <-s.doneCh forever because the goroutine never closes it.
   Fix: close doneCh in the error path. Regression test
   TestServer_ShutdownAfterFailedStart_DoesNotHang reproduces by
   pre-binding a port to force EADDRINUSE.

2. Server.Start's goroutine swallowed http.Server.Serve errors
   silently (`_ = err`). A listener crash mid-run produced no
   tracecore log entry; operators were left wondering why
   /metrics stopped responding.
   Fix: add Logger field to ServerConfig (defaults to discard,
   cmd/tracecore wires the runtime logger). Log non-ErrServerClosed
   errors at slog.Error.

cmd/tracecore now passes its logger into ServerConfig.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
The previous AggregateSLOSource computed lifetime cumulative failure
ratio. After a single failure on call #1, the gauge stayed > 0
forever — useless for SLO alerting that targets a recent window.

Replace with a sliding-window source: maintain a ring of timestamped
(success, failure) snapshots; on each scrape, find the latest sample
≥ window-old as the anchor and compute (Δfailure / Δtotal) since.
Returns 0 while warming up (no anchor yet) and on zero in-window
calls. DefaultSLOWindow is 60s — matches typical k8s probe cadence.

API change: AggregateSLOSource gains state and a constructor
(NewAggregateSLOSource). cmd/tracecore updated; tests rewritten to
exercise the windowing semantics:

- TestAggregateSLOSource_WindowedRate: signal in window appears as
  the expected rate; subsequent signal at the same in-window ratio
  stays at the same rate.
- TestAggregateSLOSource_WindowedRate_LifetimeRatioNotReflected:
  the bug-driving case — a long-ago single failure doesn't pin the
  gauge above 0 once it rolls out of the window.

Ring buffer is pruned to 2× window of samples per scrape so memory
stays bounded under fast scrape cadence.

Coverage: internal/telemetry up to 83.6%. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three closures from the self-review:

- DF2 falsifier (Loop 1): TestReceiver_ConcurrentRegistration_SameNames
  spawns 8 receivers in parallel against the same MeterProvider, all
  register the same instrument names, all emit successfully. Pins
  the OTel SDK's idempotent-registration contract, which is the M8/M9
  case (multiple receivers starting concurrently).

- End-to-end failure_rate signal: TestRegisterSLOMetrics_RealExporter_EndToEnd
  uses the REAL selftelemetry.NewExporter (not the fakeFailureReader
  stub) to drive the full chain: real exporter → AggregateSLOSource
  → SLO gauge → /metrics scrape. Pins both the wiring and the math
  in one test.

- stdoutexporter empty-payload now counts as success. Previously the
  early return left the calls_total counter under-counting actual
  ConsumeMetrics invocations; operators expecting "calls = consumes
  received" saw a discrepancy. The contract was fulfilled (zero work,
  no error), so it's a success.

- BuildInfo.Description corrected from "self-hosted telemetry
  collector" (inaccurate copy) to "observability collector for GPU
  clusters" — describes what tracecore actually is.

Coverage telemetry 83.6%, selftelemetry 89.8%, stdoutexporter 72.2%
(unchanged). make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
CHANGELOG, RFC-0006 risk table, and internal/telemetry/README.md
updated to describe:

- failure_rate is a rolling 60s window (not lifetime cumulative).
- raw tracecore.exporter.calls_total counter is the primitive for
  custom-window PromQL.
- Three risks added to the RFC table for the bugs caught on
  self-review (Start-after-CAS hang, Serve error swallow, lifetime
  ratio misleading).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Six self-review closures bringing M2 toward A+:

- collectFailureRateReaders dedups by FailureRateReader pointer
  identity. An exporter shared across pipelines (legal today —
  e.g., same stdoutexporter from metrics/primary + metrics/secondary)
  was being double-counted in the failure_rate calculation.
  Test: TestCollectFailureRateReaders_Dedup pins via pointer-shared
  exporter across two pipelines.

- internal/telemetry.RegisterBuildInfo emits tracecore.build.info as
  an Int64ObservableGauge with value 1 and command/version/revision
  labels. Mirrors otelcol_build_info / prometheus_build_info
  convention; operators join against it in PromQL. cmd/tracecore
  wires it after MeterProvider construction.

- docs/examples/with-telemetry.yaml — operator-runnable example
  showing the telemetry block. internal/telemetry/README.md points
  here so operators see the full config shape, not just the snippet.

- TestServer_LogsServeErrorOnUncleanExit upgraded from negative-
  only (clean shutdown doesn't log) to positive (fault-injection
  listener returns error from Accept; assertion: the Logger sees
  "Serve exited with error" + the fault string). Adds
  StartWithListenerForTest export for the injection path.

- TestServer_HealthzStays200WhileDraining pins the drain-window
  policy: ready=false but Server.Shutdown not yet started →
  /healthz=200, /readyz=503. k8s wants exactly this signal during
  graceful drain.

- TestIntegration_ErrorsTotal_SurfacesOnDownstreamFailure pins
  criterion 8's last unverified case: clockreceiver →
  failingSink (always errors) → errors_total{kind="downstream"}
  surfaces in the scrape. The full producer chain runs (real
  MeterProvider, real TelemetrySettings, real selftelemetry.NewReceiver
  inside clockreceiver) without a subprocess.

Coverage: telemetry 85.8%, selftelemetry 89.8%, clockreceiver
92.3%, componentstatus 100%. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Apply 2 blockers + 6 strong findings from the Pass-1 reviewer agents:

Reviewer A (correctness, Explore):
- BLOCKER: AggregateSLOSource held a pointer into s.samples then
  append'd, risking a use-after-realloc footgun. Capture by value
  instead. (slo.go)

Reviewer B (UX, general-purpose):
- BLOCKER: tracecore_build_info was emitted but undocumented in
  the operator-facing README. Added a "Build identity" section
  with labels, the canonical join PromQL, and the rationale.
- STRONG: tracecore.exporter.failure_rate description hid the
  rolling-60s-window semantics — auto-generated dashboards would
  treat it as lifetime ratio. Description now states the window
  explicitly + points at the raw counter for custom windows.
- STRONG: selftelemetry used per-instance Meter scope names
  (`Meter(id.String())`), stamping `otel_scope_name="kind/instance"`
  on every series and doubling cardinality versus the
  `component_id` label we already carry. Switched to the
  package-stable
  `github.com/tracecoreai/tracecore/internal/selftelemetry` scope.
- STRONG: http.Server had only ReadHeaderTimeout configured —
  slowloris attack surface when operators flip listen to ":8888"
  per docs. Added ReadTimeout, WriteTimeout, IdleTimeout,
  MaxHeaderBytes with conservative defaults.
- STRONG: ReadyFn=nil defaulted to 200, contradicting RFC-0006's
  "before [ready flips true] /readyz returns 503". Flipped to
  default-503 ("not ready until proven ready"); production always
  passes a real predicate so behavior is unchanged in tracecore
  itself.
- STRONG: RFC-0006 promised goleak as the listener-fd-leak
  mitigation; the test never shipped. Added
  TestServer_RepeatedStartShutdown_NoLeaks — 10 cycles, asserts
  goleak.Find sees no leak.
- STRONG: docs/examples/with-telemetry.yaml comment said "uncomment
  + set enabled: true" while the file shipped with enabled: true
  already. Rewrote the comment so it describes shipped state
  honestly.

NITS folded in:
- /healthz + /readyz handlers set explicit `Content-Type: text/plain;
  charset=utf-8` so a future code-path that writes the header before
  the body doesn't trip on a sniffed Content-Type.
- Carry-forward gauge descriptions reworded — dropped internal
  "M2 carry-forward" vocabulary; operators see "always 0 until a
  future release" instead.

Coverage: internal/telemetry up to 86.1%; others unchanged. make ci
clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Apply 8 strong findings from Pass-2 reviewers (security + docs):

Reviewer C (security, general-purpose):
- STRONG: HTTP handler panic could crash the binary (PRINCIPLES §1
  "never crash the workload" is load-bearing for self-telemetry).
  Added recoverHandler middleware wrapping the ServeMux: any panic
  (from promhttp, OTel observable callbacks, or future bugs) logs
  via cfg.Logger and returns 500. Test
  TestServer_RecoversFromHandlerPanic forces a panic via ReadyFn.
- STRONG: RegisterBuildInfo accepted arbitrary label keys. Invalid
  Prom names (hyphens, dots, leading digits, empty) get silently
  mangled to `_` by the OTel→Prom translator, causing hidden
  collisions. Now validates against [a-zA-Z_][a-zA-Z0-9_]* with
  a clear error. Test TestRegisterBuildInfo_RejectsBadLabelKey.
- STRONG: MaxHeaderBytes was 1 MiB — way too generous for an
  endpoint where legitimate requests are sub-1KB GETs. Shrunk to
  8 KiB; reduces resource burn from slow-header attacks while
  Prom + curl never approach this.
- STRONG: /healthz + /readyz responses lacked Cache-Control:
  no-store. A misconfigured proxy could cache 200 after the pod
  flipped to 503, defeating k8s readiness gating. Added the
  header to both handlers.
- STRONG: SLO sample ring's `samples[drop:]` reslice kept the old
  backing array alive across resizes, defeating the memory bound.
  Now copies into a fresh slice. Added defense-in-depth cap
  (maxSamples=4096 ≈ 160KiB) for misconfigured 1ms scrape
  cadence.
- Security posture section added to internal/telemetry/README.md
  explicitly stating "plaintext HTTP; non-loopback binds REQUIRE
  a reverse proxy with TLS + auth — tracecore ships no TLS in
  M2."

Reviewer D (docs, Explore):
- STRONG: RFC-0006 acceptance row said "three clockreceiver
  self-metrics visible"; the actual test asserts four (and a
  separate test covers the 5th, errors_total). Updated row.
- STRONG: RECEIVER-PATTERNS.md "Self-telemetry wiring" listed
  call ordering that didn't match clockreceiver.emit. Replaced
  with the literal six-line snippet from clockreceiver, with a
  note that ObserveLatency intentionally covers failures too.
  Also linked the example YAML.
- Integration test asserts tracecore_build_info carries the three
  canonical labels in the scrape (values empty under `go test` —
  -ldflags only populate in production builds).
- cmd/tracecore.collect now passes telemetry.DefaultSLOWindow
  explicitly rather than relying on the constructor's zero-coercion
  (Pass-2 nit).

Coverage telemetry 85.1%, selftelemetry 89.8%, componentstatus
100%. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Apply 6 of the 10 findings from Pass-3 reviewer F (M8/M9 author
perspective). Reviewer E identified 0 blockers for 1.0-stable.

HIGH priority (real M8/M9 unblockers):

- Resource attrs were empty. pipelinebuilder now populates
  pcommon.Resource with host.name (os.Hostname), service.name +
  service.version (from BuildInfo), service.instance.id. M8/M9
  receivers reading `set.Telemetry.Resource` get canonical OTel
  semconv attributes rather than a bare empty Resource that
  would ship as label-less pdata to operators.

- Canonical `kind` constants published in
  `internal/selftelemetry/interface.go`: KindConnect, KindParse,
  KindDownstream, KindEnumerate, KindRead, KindCardinality,
  KindPanic, KindInit. M8 and M9 were already diverging silently
  (M8 uses "init"/"watch"/"mig"/"consume"; M9 uses
  "journalctl_crash"/"kmsg_oversized"). Cross-receiver Prom
  dashboards on `{kind=…}` now have a stable canonical set.

MEDIUM priority (operator-alert correctness):

- `last_activity_unix_seconds` was 0 until first MarkActivity,
  meaning alerts on `time() - last_activity > 60` fired during
  receiver boot. NewReceiver now seeds activityUnix to
  construction time — gauge reports "alive but no activity yet"
  rather than 1970-01-01.

- `ObserveLatency` godoc now states the convention explicitly:
  observed for both successes AND failures, with splitting
  deferred to a future ergonomics pass. M9 was diverging from
  clockreceiver here; the doc now anchors the contract.

- `SetDegraded` godoc requires receivers booting already-degraded
  (e.g. NVML init fails at Start) to call SetDegraded(true)
  explicitly so degraded_seconds_total begins ticking. Without
  this, alerts on `> 0` miss boot-time stuck-degraded receivers
  until first recovery.

LOW priority:

- RECEIVER-PATTERNS.md pattern doc wording clarified: "inside
  the receiver's newReceiver (the function the factory's
  CreateMetrics calls)" rather than the ambiguous "factory's
  create path."

DEFERRED (out of scope for M2):
- Test-injection seam helper (would add scope; M8/M9 already
  ship their own Option pattern).
- selftelemetry init-failed metric (silent fallback to noop is
  acceptable; revisit if operators see it).
- ReportStatus available at construct time (Host arrives at Start;
  document at next ergonomics pass).
- Processor producer interface (post-M2; no in-tree processor yet).

Coverage: pipelinebuilder 74.7% → 76.2%; selftelemetry 89.8% → 90.0%.
make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
The three reviewer passes produced ~30+ findings; ~20 were addressed
in code (commits c685f43, 0fa17a6, 628c5ea), but the remaining items
were not written down. Add an "M2 Loop 4 reviewer deferrals" section
to FOLLOWUPS.md cataloging every reviewer-raised item that was
intentionally NOT shipped in the M2 PR.

Grouped by category, each with rationale + revisit condition:

- Hardening / observability of the surface itself:
  * scrape_panics_total counter (log path suffices today)
  * serve_exited{reason} counter (log path suffices today)
  * selftelemetry.init_errors_total (silent noop OK until reported)

- Cardinality + cost guards:
  * Runtime cap on kind cardinality (doc-only enforcement today)
  * Histogram bucket tuning (defer until real-receiver data)

- API ergonomics (third-caller defer):
  * selftelemetry.NewRecordingReceiver(t) test helper
  * selftelemetry.Processor producer interface
  * Per-exporter failure_rate{component_id} series
  * Live ExporterRegistry (factory func vs snapshot)
  * Unexporting AggregateSLOSource

- Edge cases worth a test someday:
  * Tighter Server path validation (url.Parse)
  * uint64 underflow guard in AggregateSLOSource
  * Server.Shutdown ctx semantics under expired contexts
  * SetDegraded ↔ observable-callback mutex consistency

- Documentation polish:
  * Datadog backend compatibility claim (unverified)

This closes the "did you document all deferrals" gap I left after
the reviewer passes. Now every floating item has either (a) a
commit closing it or (b) a documented rationale + revisit.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Close concrete gaps from the self-rating "B+ effort, A- artifact":

1. CRITERION 13 VERIFIED. Actually rebased M8/M9 onto
   feat/m2-self-telemetry. Both branches conflict ONLY in additive
   doc files (CHANGELOG, RECEIVER-PATTERNS, FAILURE-MODES — where
   both M2 and M8/M9 added new entries to the same sections). ZERO
   Go code conflicts. The criterion's "no changes beyond the rebase"
   promise is met for Go; doc merges are one-minute mechanical.

2. EXAMPLE CONFIG IS NOW CI-GATED.
   TestIntegration_ExampleConfigValidates runs `tracecore validate`
   against docs/examples/with-telemetry.yaml. Doc rot (schema drift
   vs the example) now breaks CI rather than shipping to operators.

3. THREE SAFETY ITEMS FROM FOLLOWUPS:
   - uint64 underflow guard in AggregateSLOSource: if a future
     restart-mechanism non-monotonically resets exporter counters,
     `failure - anchor.failure` would wrap to ~2^64 and pin the
     gauge at garbage. Guarded with `if failure < anchor.failure
     { return 0 }`.
   - Tighter path validation in telemetry.Server: rejects paths
     with whitespace, control bytes, query strings, fragments —
     surfaces the operator-actionable error rather than letting
     http.ServeMux panic at registration.
   - tracecore.selftelemetry.init_errors_total counter: the
     silent-noop-fallback path in clockreceiver/stdoutexporter
     factories now ticks an Int64Counter so operators alerting on
     `> 0` see when a component is silently running with the noop.
     RecordInitError takes ctx (linted via contextcheck), errors
     from the meta-counter itself are swallowed so the meta-path
     can't crash callers. Test pins all three labels surface.

4. SLO CALLBACK PERFORMANCE BENCHMARKED.
   BenchmarkAggregateSLOSource_ExporterFailureRate_{1,10,100}Readers
   at internal/telemetry/slo_bench_test.go. Apple M4 Pro results:
   3.89µs (1 reader), 3.98µs (10), 4.63µs (100), 0 allocs/op
   steady-state. Operators scraping at 15s pay <5µs CPU per scrape
   even at 100 exporters. PRINCIPLES §1 ("self-telemetry must never
   spike CPU") demonstrably honored. Numbers documented in README
   "Performance" section.

5. DATADOG BACKEND CLAIM HEDGED. README compatibility matrix now
   distinguishes "verified" (Prometheus + Grafana, CI-tested),
   "verified-by-protocol" (Mimir/Thanos/VictoriaMetrics, Prom-spec
   compatible), and "unverified" (Datadog Agent — the integration
   should work but tracecore has not been end-to-end-tested against
   a real Agent). Honest claims only.

Coverage: selftelemetry 89.8% → 89.6% (RecordInitError covered);
telemetry 85.1% → 83.7% (new validateMountPath + perf benchmark code
not counted as covered statements). All above floors. make ci clean.

Per-exporter failure_rate label intentionally NOT shipped — Pass-2
reviewer C explicitly deferred it pending multi-exporter deployment
demand, and FOLLOWUPS.md (commit 6b0af2d) records the decision.
Reversing the deferral mid-PR would be flip-flopping, not A+.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address self-review items #5 (pkg-name confusion), #12 (MILESTONES
glyph), #16+#17 (commit/scope discipline), #18 (telemetry pkg
architecture intent), #19 (Resource auto-pop) — all doc-only.

internal/telemetry/doc.go + internal/selftelemetry/doc.go: explicit
"where do I add code" guidance with the conceptual split (process-
level SURFACE vs per-component PRODUCER CONTRACT). Includes the
architecture intent for future milestones — when tracing lands the
TracerProvider goes in `telemetry`, signal-direction-driven split
(consumer/producer) rather than modality-driven (metrics/traces).

selftelemetry/doc.go also pins the cardinality contract explicitly
(every label value must be low-cardinality; canonical Kind* constants
exist for that purpose) and the Resource auto-population contract.

MILESTONES.md: new ☒ glyph for "policy-declined, not deferred."
Applied to the /readyz SLO-threshold criterion — RFC-0006 explicitly
chose degraded ≠ not-ready, that's a policy, not a missing feature.

FOLLOWUPS.md: new "M2 process lessons" section captures three honest
process gaps from this milestone for the next loop's prompt:
- Scope-creep (Go-bump + CI-fix-deadlines shipped under [telemetry])
- Commit history discipline (25 commits incl. 13 review-fixes)
- Self-assessment optimism caught by gates four separate times

Lessons → next-milestone-prompt input, not corrective action for M2
(retroactive split blocked by no-force-push rule).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address items #9 (--version OTel SDK), #13 (Example_* tests),
#14 (Kind constant coverage + helper), #15 (shutdown error logging).
Also tactical: extract initTelemetryStack from runCollect so the
silent-shutdown-error fix didn't push cyclomatic complexity past 15.

- internal/version: Build gains OTelSDKMetric + OTelPromExporter
  fields, populated from debug.ReadBuildInfo. Surfaces in --version
  output as `otel-sdk-metric=vX.Y.Z, otel-prom-exporter=vA.B.C`.
  Operators debugging OTel-side bugs see which SDK actually shipped
  without grep'ing go.mod.

- internal/selftelemetry: new CanonicalKinds() returning the
  Kind* constant set and IsCanonicalKind(s) helper. Test pins the
  set + asserts clockreceiver's KindDownstream is canonical. Future
  receivers (M8/M9) plug in the same test pattern; eventual lint
  enforcement is documented in FOLLOWUPS.

- internal/selftelemetry/example_test.go: ExampleNewReceiver +
  ExampleNewExporter render in pkg.go.dev for M8/M9 authors and
  compile-check the 6-line wiring pattern every CI run. The
  RECEIVER-PATTERNS.md snippet can no longer drift from a working
  example without breaking this test.

- cmd/tracecore.runCollect: previously swallowed mp.Shutdown errors
  in the failure-unwinder paths with `_ = mp.Shutdown(ctx)`. Now
  logs at Warn level so a real drain failure surfaces. Extracted
  the telemetry-init block into initTelemetryStack helper to keep
  runCollect's cyclomatic complexity under the gocyclo 15 ceiling
  (the helper's own error-handling carries the rest of the
  cleanup branches).

make ci clean. Coverage unchanged on most pkgs; selftelemetry up
to 91.0% via the new IsCanonicalKind test.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address item #6: ServerConfig had 7 exported fields with the three
path knobs (MetricsPath, HealthzPath, ReadyzPath) splayed out, while
the config side already groups them as TelemetryPaths. The two
shapes had to stay in sync field-by-field — a future renaming
touched two places, a maintenance trap.

Hoist into telemetry.Paths sub-struct so the YAML → ServerConfig
mapping is mechanical:

  cfg.Paths.Metrics ↔ telemetry.Paths.Metrics  (same field name)
  cfg.Paths.Healthz ↔ telemetry.Paths.Healthz
  cfg.Paths.Readyz  ↔ telemetry.Paths.Readyz

Trailing `_ struct{}` on telemetry.Paths keeps field additions
non-breaking.

All ServerConfig literals updated: 7 in tests + 1 in cmd/tracecore.
Validation errors now reference "Paths.Metrics" rather than
"MetricsPath", matching the field path.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address item #1: the 292-line internal/selftelemetry/impl.go mixed
three concerns (receiver impl + exporter impl + init-error tracking),
forcing readers to context-switch across responsibilities. Split:

  receiver_impl.go  (~200 lines) — NewReceiver, receiverImpl, the
                                   five-method binding, degraded-
                                   seconds bookkeeping
  exporter_impl.go  (~80 lines)  — NewExporter, exporterImpl,
                                   FailureRateReader satisfaction
  init_errors.go    (~40 lines)  — RecordInitError

Common state hoisted to receiver_impl.go: ErrNilMeterProvider
sentinel and `instrumentationScope` constant (the package-stable
Meter scope name shared across all three call sites).

No API change; tests pass without modification. Each file is now a
single coherent unit and a future maintainer reading "what does
NewExporter do?" doesn't have to scroll past Receiver internals.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address item #2: the rolling-window failure-rate math was baked
into AggregateSLOSource (~80 lines of ring-buffer + underflow
guard + 2× window pruning + maxSamples cap). When the future queue
mechanism and runtime-restart mechanism land — both with their
own SLI gauges — they'd need the same math.

Extract into a standalone WindowedRate primitive in
internal/telemetry/windowed_rate.go. AggregateSLOSource is now
a thin walker over the exporter registry that delegates the math:

  rate := s.rate.Observe(failure, success+failure)

Public API:

  NewWindowedRate(window) → *WindowedRate
  (*WindowedRate).Observe(numerator, denominator) → float64

Same semantics as before (warming-up returns 0, underflow returns 0,
zero-delta returns 0), now with five focused tests pinning each
contract (warming up, rate-over-window, underflow safety,
zero-delta, default window). AggregateSLOSource shrinks from ~120
lines to ~25 lines of glue.

When queue.depth_ratio gets a real source in a future milestone,
its callback drops in `NewWindowedRate(...)` + `Observe(depth,
capacity)` and inherits the same bounded-memory + monotonic-safe
behavior for free.

make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added 10 commits May 14, 2026 12:51
Address item #3: `syncBuffer` was duplicated in cmd/tracecore +
internal/telemetry, and `failingSink` lived only in clockreceiver
even though M8/M9 will need the same downstream-error stub.
Hoisted into `internal/pipeline/pipelinetest`:

  pipelinetest.SyncBuffer            (was syncBuffer ×2)
  pipelinetest.RecordingMetricsSink  (new — for capturing pushes)
  pipelinetest.FailingMetricsSink    (was failingSink ×1)

Migrations:
- cmd/tracecore/integration_test.go: drops 19-line syncBuffer def,
  type alias to pipelinetest.SyncBuffer (call sites unchanged).
- internal/telemetry/server_test.go: same alias pattern.
- components/receivers/clockreceiver/errors_integration_test.go:
  drops failingSink, uses pipelinetest.FailingMetricsSink{Err: ...}.

clockreceiver_test.go's metricsSink intentionally not migrated —
it's package-internal, tightly coupled to a few test assertions,
and migration cost exceeds value for this PR. M8/M9 + future
receivers reach for pipelinetest.RecordingMetricsSink instead.

Coverage: internal/pipeline/pipelinetest 77.3% → 86.8% via new
helper tests. Other packages unchanged. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address item #4: runCollect was orchestrating 13 distinct steps in
~125 lines. Hard to read top-down; hard to test individual phases
in isolation. Extract two narrow helpers:

  registerObservability(mp, pipelines, buildInfo, revision) error
    Registers the SLO observable gauges + the build_info join-target
    on the MeterProvider. No-op when telemetry is disabled.

  runRuntime(ctx, logger, pipelines, drainBudget, ready, srv, mp) int
    Starts the runtime, flips /readyz, blocks on signal, performs the
    ordered shutdown: ready=false → rt.Shutdown → telemetry shutdown.
    Returns the exit code.

runCollect now reads top-down as a short sequence:
  1. log starting
  2. load config + build BuildInfo
  3. initTelemetryStack (Phase 2)
  4. BuildPipelines
  5. registerObservability  ← new helper
  6. runRuntime             ← new helper

Each helper takes only what it needs (no closure-over-closure on
runCollect's locals); easier to test the orchestration shape in
isolation when M-something adds a tracing surface.

No behavior change. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address item #8: operators setting `telemetry.enabled: true` and
nothing else couldn't easily see what port tracecore would actually
bind to. The full config is in their YAML mentally, but the
default-resolution rules live in code. Surface them.

New `tracecore validate --config=foo.yaml --explain`:

  config valid: 1 pipeline(s)
    metrics/primary: receivers=1 processors=0 exporters=1

  resolved configuration (after defaults applied):
    telemetry.enabled:       true
    telemetry.listen:        localhost:8888
    telemetry.paths.metrics: /metrics
    telemetry.paths.healthz: /healthz
    telemetry.paths.readyz:  /readyz
    pipelines:
      metrics/primary:
        receivers:  [clockreceiver]
        exporters:  [stdoutexporter]

When telemetry is disabled the block notes "HTTP surface OFF; no
port bound" — so operators see telemetry.enabled defaulting to
false even without the YAML key present.

Two integration tests:
- TestIntegration_ValidateExplain_DumpsResolvedConfig: defaults
  resolve correctly for an "enabled: true with nothing else" YAML.
- TestIntegration_ValidateExplain_TelemetryOff: disabled branch
  surfaces the off state explicitly.

Behavior unchanged when --explain is absent; exit codes identical.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address items #10 (deprecation policy for self-metric names) and
#11 (JSON content negotiation on /healthz + /readyz).

JSON probes:
- /healthz + /readyz now honor `Accept: application/json` and
  return `{"status":"ok|ready|not ready|shutting down"}` with
  Content-Type application/json. Default (no Accept or Accept:
  text/plain) stays text/plain so k8s probes + curl are unchanged.
- Hand-rolled JSON to keep encoding/json off the hot scrape path;
  status strings are operator-set constants so no quoting risk.
- Cache-Control: no-store applies to both content types.
- Test pins six cases covering both paths, with/without Accept,
  malformed Accept, and the text-beats-json precedence case.

Deprecation policy (RFC-0006 new section):
- Pre-1.0 (M2–M7): tracecore reserves the right to rename without
  procedure; the surface is still settling. CHANGELOG calls out
  any rename as "no bridge window: pre-1.0 contract."
- Post-1.0: any rename gets an RFC, then a bridge window (≥1 minor
  release emitting BOTH names simultaneously) backed by a
  `tracecore_deprecated_metric{old_name, new_name}` counter so
  operators see dashboards still asking for the old name. Removal
  after the bridge window, flagged loudly in
  `[Removed]` + release-notes `[CHANGE]`.

This is the documented commitment operators alerting on these
metrics need before they take a hard dependency.

Coverage: internal/telemetry 84.4% → 86.9% via new probe tests.
make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address item #7: internal/telemetry had two export_test.go files
(export_test.go + server_addr_export_test.go), each containing one
test-only re-export. Go convention is one export_test.go per
package; the split was an artifact of incremental work.

Merge HTTPAddr into export_test.go alongside StartWithListenerForTest
with a header comment establishing the file as the package's
single test-escape-hatch home. Delete server_addr_export_test.go.

No behavior change; test compilations unchanged.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Apply 3 reviewer strong findings:

1. **selfExporterCarrier promoted to selftelemetry.ExporterCarrier.**
   Previously the carrier interface lived in cmd/tracecore as invisible
   duck-typing; future binary entry points would have to re-define it.
   Now a public contract in the selftelemetry package with a
   receiver-author template in the godoc. stdoutexporter already
   satisfies it; clockreceiver inherits the example.

2. **SetDegraded mutex dropped.** muSetDegraded was redundant against
   atomic.Pointer — CAS-only is the canonical pattern. Replaced the
   switch-on-current-value with:
       degraded=true  → CompareAndSwap(nil, &now)
       degraded=false → Swap(nil) + accumulate.Add
   Hot-path no longer takes a lock. Microsecond-scale read race in
   the observable callback documented in the new SetDegraded comment
   (under-count self-corrects on the next scrape, matches histogram
   bucket precision).

3. **IncError two-attribute merge collapsed to one call.** Previously
   the OTel SDK merged WithAttributeSet(r.attrs) + WithAttributes(kind)
   via documented-implementation-detail semantics. Now passes a single
   WithAttributes with both component_id + kind, the spec-stable shape.

Bonus: atomic style normalized — `accumulated` and `activityUnix`
were pointer-typed (&atomic.Uint64{}) for no reason while degradedAt
was value-typed. Both flavors work via embedding; value-typed is the
modern idiom (atomic.Uint64 from go1.19+). Receiver-impl is now
consistently value-typed atomics.

Tests pass without modification — the public surface and observable
contracts are preserved.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three reviewer-flagged simplifications:

1. **Probe JSON payloads pre-marshaled at package init.** Replaces
   hand-rolled string concatenation with encoding/json — the
   marshal cost is paid once for {"ok","ready","not ready",
   "shutting down"}, hot path writes pre-encoded bytes. Removes the
   maintenance trap of "what if status ever needs quoting" and
   pulls in encoding/json (already transitive via yaml.v3).

2. **PromHandler cached in MeterProvider.** Previously re-created
   `promhttp.HandlerFor(reg, opts)` on every PromHandler() call.
   Now set once at NewMeterProvider, exposed via the field. Saves
   the closure allocation per scrape.

3. **registerObservableFloat64Gauge helper.** RegisterSLOMetrics had
   three near-identical Float64ObservableGauge blocks; STYLE.md's
   "three similar lines is the threshold" rule applies. New
   sloGaugeSpec struct + helper function reduces three 8-line
   blocks to three 5-line entries in a slice.

wantsJSON gains a docstring noting it substring-matches Accept rather
than parsing RFC 7231 q-values; operators wanting q-value-aware
negotiation belong behind a reverse proxy.

No behavior change. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Address reviewer findings on duplication + missing context:

1. **Path validation lived in two places** with mismatched strictness:
   internal/config did "leading /"; internal/telemetry/server did
   "leading / + no whitespace + no query/fragment". Stricter rules
   on the server side meant a config that passed `tracecore validate`
   could still surface a panic at server.start time. Promoted the
   strict rules to `config.ValidateMountPath`; the server-side
   validator now delegates. One source of truth.

2. **WindowedRate gets a rationale comment.** A reviewer asking
   "why a custom 131-line primitive when golang.org/x/time/rate
   exists?" now finds the answer in the docstring:
   x/time/rate is a token-bucket limiter (wrong semantics);
   PromQL rate() runs server-side and can't emit during boot when
   the SLO gauge needs to surface. The math is small enough to own.

3. **AggregateSLOSource.ExporterFailureRate gains an O(n) note.**
   "Walk cost is O(n) in registered exporters... benchmarked <5µs
   for 100 readers on M4 Pro" — anchors the perf claim to the
   existing benchmark file so a future reader debugging scrape
   latency knows where to look first.

No behavior change; all tests pass without modification.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two closures from the reviewer + the M2 prompt artifact:

1. **Fd-count fault-injection added to TestServer_RepeatedStartShutdown_NoLeaks.**
   Previously only goleak ran — catches goroutine leaks but not fd
   leaks. Now also samples /proc/self/fd (Linux) or /dev/fd (macOS)
   before + after 10 Start/Shutdown cycles; a listener leak of
   1 fd/cycle produces a delta ≥10. Slack of 3 absorbs legitimate
   test-runtime fd churn. Skips gracefully on platforms without
   either path (the goleak check still runs).

2. **docs/loops/m2-self-telemetry.md success-criteria checkboxes
   ticked.** All 19 are now [x] with one-line evidence each pointing
   at the commit / test / RFC section that satisfies them. Two
   notable annotations:
   - Criterion 8 (clockreceiver covers all 5 receiver methods):
     4 surface in TelemetrySurface_EndToEnd; 5th
     (errors_total) covered by ErrorsTotal_SurfacesOnDownstreamFailure
     because normal clockreceiver→stdoutexporter doesn't fail.
   - Criterion 13 (M8/M9 rebase): verified non-destructively via
     `git merge-tree`; zero Go-code conflicts on either branch.

The m2 prompt file is gitignored (docs/loops/), so this commit doesn't
push anything operator-visible — it's the local artifact gating the
M2 COMPLETE promise. Walking back to mark progress closes the same
"declared-before-done" pattern the M2 lessons section in FOLLOWUPS
calls out.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two passes:

1. **Comment trim.** Walked new code with "address why not what"
   as the rule. Removed restatements of obvious behavior; kept
   rationale comments (cardinality-tradeoff, slowloris-defense,
   atomic-style-consistency, package-stable-scope, etc.) and
   exported-symbol doc-comments the linter requires. Net: ~60
   lines of comment churn out of code that explained behavior the
   identifier already named.

2. **Phased-deferral index added to FOLLOWUPS.md.** Every M2 review
   round (Loop 4 passes × 2 agents, then 10 self-review phases,
   then 4 A+ reviewer batches) now has its commit indexed and its
   deferred items collated into a single table with "trigger to
   revisit" hints. A future maintainer (or future-me) opening
   FOLLOWUPS sees:

   - which commit closed which review
   - which deferrals are still open
   - what event flips each from deferred to actionable

   This is the artifact M-something needs to scan before opening
   new tickets — answers "did we already decide on X?" without
   re-reading the entire PR thread.

Coverage holds (pipelinetest 86.8%, selftelemetry 90.1%, telemetry
88.0%). make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr changed the title [telemetry] Add self-telemetry surface + MeterProvider + ReportStatus alignment [telemetry] M2 self-telemetry surface — MeterProvider + /metrics + /healthz + /readyz + ReportStatus migration May 14, 2026
trilamsr added 2 commits May 14, 2026 13:28
Reviewer's post-A+ pass flagged ecosystem-fit gaps. Land items 1-3
+ 5 + 7 in-scope; rest cataloged in FOLLOWUPS with milestone targets.

1. **docs/examples/prometheus-alerts.example.yaml** — 5 starter
   alert rules covering the operationally meaningful failure modes:
   ExporterFailureRateHigh, ReceiverDegraded, ReceiverNoActivity,
   SelftelemetryInitErrors, BuildIdentityKnown (scaffolding for
   fleet version-drift). Operators include under rule_files: in
   prometheus.yml or convert to a PrometheusRule CRD.

2. **selftelemetry.NewCapturingReceiver** — test helper recording
   every call from all 5 Receiver methods. Concurrent-safe
   accessors return copies. M10+ receiver authors stop reinventing
   this — half an hour saved per receiver.

3. **TLS schema sketch in RFC-0006** — `telemetry.tls: { enabled,
   cert_file, key_file, client_ca_file, min_version }` plus
   semantics (mTLS via client_ca_file; SIGHUP reload). Documents
   the shape so when TLS lands the config schema is stable. M2
   still ships plaintext-only with reverse-proxy as the
   documented near-term path.

5. **selftelemetry.Reason* constants** — mirrors the Kind* pattern:
   ReasonNilProvider, ReasonInstrumentRegister, ReasonUnsupportedSDK
   plus CanonicalReasons() + IsCanonicalReason() helpers.
   clockreceiver + stdoutexporter updated to use the constants
   rather than string literals.

7. **Histogram bucket tuning** for
   `tracecore.receiver.collection_latency_seconds`. Default OTel
   buckets miss sub-millisecond resolution; new boundaries
   [100µs..10s] log-spaced cover both a 200µs collection cycle
   and a 2s slow one with distinct buckets. Closes the Loop-2
   carry-forward.

**Documented as ecosystem-fit deferrals in FOLLOWUPS** (items 4, 6,
8, 9, 10 from the reviewer's list): Helm chart (M5b), Mermaid arch
diagram (M3), `version --bundle` CLI (M3), overhead numbers in
README (M3), threat model 1-pager (M3), plus the smaller items the
reviewer noted in passing (OpenMetrics toggle for M5, recording
rules + Grafana dashboard for M4, Server header strip
opportunistic, /metrics rate limit for M5, SBOM summary for M3).
Each carries a "suggested target" hint for the milestone planner.

Coverage holds: selftelemetry 89.7%, telemetry 88.0%. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Five "diminishing returns" items the post-A+ reviewer flagged. All
ship — the last review called these "the kind of PR you'd ship as
a reference."

1. **Mermaid architecture diagram in RFC-0006.** Three-layer flow
   (components → SDK → surface) with the operator/k8s touch points
   on the right. The two dashed-line extensions for M3+
   (TracerProvider parallel, OTLP push reader parallel) are
   called out in prose so the diagram doesn't need re-drawing
   when tracing lands.

2. **Concrete benchmark numbers in `internal/telemetry/README.md`
   Performance section.** Median across 3-5 runs on Apple M4 Pro:
   - AggregateSLOSource.ExporterFailureRate: ~3.85µs (1 reader),
     ~3.87µs (10), ~4.51µs (100). 0 allocs steady-state.
   - WindowedRate.Observe (primitive): ~3.85µs. 0 allocs.
   At 15s scrape: 0.00003% of one core. Bench numbers map to
   NORTHSTARS O2's per-receiver overhead budget (<0.05%).
   New BenchmarkWindowedRate_Observe isolates the primitive cost
   for future SLI authors sizing their own gauges.

3. **`docs/examples/grafana-dashboard.example.json`.** Nine panels:
   build identity (table), failure_rate (timeseries with
   thresholds), calls_total rate by result, emissions per
   component, latency p50/p95/p99, activity staleness with
   threshold colors, errors by kind, degraded-seconds rate,
   selftelemetry init errors. Templated by `instance` + `component_id`
   so federated dashboards Just Work. Operators import via
   Grafana → New → Import.

4. **`make bench-check` Makefile target + frozen baseline at
   `internal/telemetry/testdata/bench-baseline.txt`.** Runs the
   current benchmark set against the baseline via `benchstat`;
   prints the geomean delta. CI gate is a FOLLOWUPS item (don't
   block PRs on perf during heavy refactor weeks); manual
   `make bench-check` available now.

5. **`internal/telemetry/SECURITY.md` 1-pager threat model.**
   Single attacker class (network principal reaching the listen
   socket); goals table; mitigations table mapping each Server
   field to the attack it bounds; "not mitigated in M2" table
   pointing at the FOLLOWUPS rows; recommended production
   deployment posture (reverse proxy + NetworkPolicy + unpriv
   user). Closes the auditor-lens gap the reviewer flagged.

Coverage unchanged. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr merged commit eb9f08c into main May 14, 2026
9 checks passed
@trilamsr trilamsr deleted the feat/m2-self-telemetry branch May 14, 2026 20:44
trilamsr added a commit that referenced this pull request May 14, 2026
## What landed (at a glance)

| Area | Shipped |
|---|---|
| Bench-regression gate | `scripts/bench-check.sh` parses benchstat
`+NN.NN%` deltas and exits non-zero if any row exceeds the threshold |
| Makefile target | `make bench-check` (with `THRESHOLD=N` override)
delegates to the wrapper |
| README correction | `internal/telemetry/README.md`
regression-detection section rewritten to describe actual behavior |

## What this PR does

Post-merge fix for the M2 self-telemetry PR (#17). The `make
bench-check` target in that PR claimed "fails if any row regresses >10%"
— but `benchstat` itself always exits 0, so the target reported deltas
without gating on them. Caught by the post-merge reviewer.

Wraps benchstat in `scripts/bench-check.sh`:

- Runs the benchmark output through benchstat (printed verbatim so
operators still see the comparison table)
- Awk-parses the per-row `+NN.NN%` deltas
- Exits 1 if any benchmark regressed by more than the configured
threshold; prints the offending row + delta to stderr

Tested locally:

- Identical input files → `PASS`, exit 0.
- Synthetic +31% regression → `REGRESSION: WindowedRate_Observe-14
+31.28%` + exit 1.

POSIX-awk compatible (BSD awk on macOS, gawk on Linux). README's
regression-detection section rewritten to match actual behavior and
document the "intentional regression — update the baseline" workflow.

## Linked issue(s)

Refs #17 (post-merge fix).

## Release notes

\`\`\`release-notes
NONE
\`\`\`

## Checklist

- [x] Tests added or updated — manual end-to-end test of both pass +
fail paths documented in commit message; baseline file unchanged so the
gate is a no-op until someone actually regresses.
- [x] \`make ci\` passes locally
- [x] Commits are signed off (\`git commit -s\`)
- [x] Commits cryptographically signed (SSH)
- [x] N/A — no new components

---------

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Two independent reviews of PR #18 surfaced a stack of blockers,
strong findings, and quality lifts. This commit lands the tractable
items (defers documented in docs/FOLLOWUPS.md M8 section).

Operator-visible drift (closes Reviewer 2 #13-#17):
 - RUNBOOK kind-triage row `consume` → `downstream` (last commit
   renamed the kind but missed the doc table).
 - README config table: `initial_delay` range updated from "0..
   collection_interval" to "≥ 0" (code already relaxed for DGX
   cold-starts; doc was stale).
 - prometheus-alerts: DCGMReceiverHighErrorRate threshold changed
   from `rate > 0.1/sec` (unreachable: 15s default tick caps at
   0.067/sec) to `increase > 5 in 5m`. Stale "M2 has not landed"
   caveat removed.
 - example_config: single `mode:` line with a clarifying comment
   (was showing both `mode: standalone` and `# mode: embedded`,
   inviting operators to uncomment both → YAML duplicate-key).
 - cmd/tracecore receivers list now prints `dcgm [stub]` / `dcgm
   [cgo]` so operators can verify deploy shape without reading
   go.mod. Build-tag-conditional via three small files in
   cmd/tracecore (receiver_variants*.go) — pattern extends to M11
   NVML.

Correctness bugs (closes Reviewer 2 #2, #3, #4, #6, #7):
 - receiver.Shutdown: `r.running.CompareAndSwap(true, false)`
   gates teardown so a second Shutdown is a no-op (cgo libdcgm
   `dcgmShutdown` is not documented idempotent). Same CAS provides
   the happens-before for `r.cancel` publish that Pass-1 flagged.
 - receiver.ensureWatched: zero-entities path now emits
   IncError(KindEnumerate) + degraded rather than returning true.
   Without this, a misconfigured host (no GPUs visible, ACL blocks
   /dev/nvidia*) had the receiver looking healthy while emitting
   nothing.
 - receiver: new `warnOnce` helper gates the 7 per-tick failure-
   path Warn logs to fire only on the first failure after recovery.
   Closes the log-storm bug (4 errors/min × 60 min = 240 lines).
   Counter (`receiver_errors_total`) still ticks every failure.
 - metrics.applyCardinalityCap: parameter `cap` → `maxSeries`
   (cap shadowed the Go builtin).

Quality / contract lifts:
 - metrics.emit now returns a `stale` count for StatusStale and
   StatusError samples. pushSamples calls IncError(KindRead) once
   per tick when stale > 0 — surfaces DCGM serving slow/faulty
   data, which is precisely what StatusStale exists for. Per-tick
   not per-sample so GPU count doesn't inflate the rate.
   (StatusNoData and StatusFieldNotSupported still silent.)
 - docs_parity_test.go: new TestRUNBOOK_KindsMatchEmitted walks
   every emitted IncError/failedTick kind against the RUNBOOK
   per-kind triage table in both directions. This is the
   structural fix for the bug class — RUNBOOK can never again
   drift from emitted kinds without CI failure.
 - receiver.go: promoted `watchUpdateDivisor` /
   `watchKeepForMultiplier` / `watchUpdateEveryMinimum` constants
   for the previously-magic DCGM watch-cadence ratios.

Documentation + dedup:
 - dcgm README: new "Privacy + data residency considerations"
   subsection (compliance-auditor ask). Flags hw.id / pci.bdf /
   NVLink peer IDs as quasi-identifying; provides two mitigation
   patterns (attr-drop processor, salt-hash pseudonymization).
 - docs/agents/examples/constructor_options.go: `WithTelemetry`
   renamed to `WithSelfTelemetry` to match the real in-tree
   receiver API. M9+ authors copying the example no longer drift.
 - RUNBOOK kind enumeration line restored with both watch and mig
   (dcgm-local kinds) per the last commit's promotion.
 - Repo-root `FOLLOWUPS.md` consolidated into `docs/FOLLOWUPS.md`
   (M8-opportunistic + M8-skipped sections). Single source of
   truth; 17 deferred items pulled forward with falsifiable
   triggers (cgo client landing, M11 sibling-receiver shape,
   operator-report thresholds, file-size triggers).
 - All bare `FOLLOWUPS.md` references updated to
   `docs/FOLLOWUPS.md`.

Honest pushback documented:
 - I disagree with the M8-AGRADE-GAP claim of Operator UX 3.7→4.0.
   The drift findings above are exactly the class of bugs that
   rubric criterion was supposed to prevent — the alerts-vs-RUNBOOK
   parity test existed but didn't check kind values against
   emitted call sites. The new TestRUNBOOK_KindsMatchEmitted
   closes that gap; future operator-UX claims should pin to a test
   like this.
 - Deferred: split receiver.go (475 LOC) into 3 files, hoist
   dcgmtest.BaseClient, SECURITY.md for receiver, dcgm_info
   join-target, libdcgm setup in CONTRIBUTING — all logged with
   triggers. Reviewer 1's "construct receiver via M9-style
   primary-Option" inconsistency goes in the queue for M9-close
   review.

`make ci` passes; dcgm coverage 86.0% (essentially flat — new
tests offset by widened emit signature in test paths).

Assisted-by: Claude Opus 4.7
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant