[general] M1.6 hardening: lint + coverage + panic recovery + pipelinebuilder by trilamsr · Pull Request #13 · TraceCoreAI/tracecore

trilamsr · 2026-05-14T04:08:36Z

What this PR does

43 atomic commits closing M1.5's self-review compromises, adding pre-M2 hardening, processing three rounds of external code review, and tracking everything we deferred so nothing leaks back into chat history.

Major surfaces

Area	Commits
Quality gates wired into `make ci`	8
Pipeline runtime + contract changes	11
Operator UX	5
Code consolidation (dedup + split)	5
Documentation (RFCs, READMEs, style guides)	8
Code review closures (3 rounds)	6

Quality gates (CI now enforces)

Commit	Change
`edd54a5`	`docs/STYLE-errors.md` + tree-wide audit + `splitName` stutter fix
`25d7675`	Enable 6 stricter linters (wrapcheck, goconst, nilerr, prealloc, bodyclose, forcetypeassert)
`048f7b0`	`make coverage-check` (internal/* ≥70%, components/* ≥60%) + runtime `Start`-after-`Shutdown` fix
`85b6080`	Fuzz targets for `ParsePipelineID`, `Load`, `splitName`
`34e61a3`	Baseline benchmarks for runtime, fanout, first-data
`93bd8cc`	`govulncheck` in `make ci`
`ceb6b30`	`gocyclo -over 15` gate; `buildSignalPipeline` split into helpers
`2a1dcae`	`goleak` in TestMain + fixed leaked signal-handler goroutine
`78c6508`	`make doc-check` (asserts every test name referenced in rot-prone docs exists)

Pipeline runtime + contract

Commit	Change
`3226a4e`	`pipeline.NewRuntime(pipelines, opts...)` replaces `Settings` struct
`420b788`	`WrapSafeMetrics/Traces/Logs` at every cross-stage Consume hop
`c574f4d`	3× `buildXPipeline` → 1× `buildSignalPipeline[C]` (-109 LOC)
`ad15390`	3× per-signal fanout → 1× `genericFanout[T,C]` (coverage 70%→100%)
`f4966d3`	safe/firstdata: shared base + per-signal embedding
`c7dc11a`	Collapse `signalOps` create+assert closures
`a93fc25`	`sync.OnceValues` integration-test binary cache
`f39aeda`	Double-SIGTERM force-exits; `handleShutdownSignals` extracted + unit-tested
`8be84a7`	`ErrSignalUnsupported`→`ErrSignalNotSupported`; multi-instance regression test
`5643a4a`	`Host.ReportStatus` deprecation + receiver defer/recover example test
`37e1dc0`	Move pipeline assembly into `internal/pipelinebuilder` (sibling, avoids cycle)

Operator UX

Commit	Change
`7f0ae78`	`tracecore validate` subcommand; exit 2 on bad config
`2be6ea0`	`--help` examples + exit-code table; `tracecore vX.Y (sha=..., go=..., built=..., platform=...)`
`0ddcdea`	`--version-short` for scripting; empty-pipeline boot now WARN-visible
`3ffb27d`	Integration: SIGINT, boot-time signal, `validate→collect` roundtrip, `--version` end-to-end
`55c7c3c`	coverage-check slash-guard against malformed profiles

Documentation

Commit	Change
`edd54a5`	`docs/STYLE-errors.md` (≤80 lines)
`5e26f55`	Processor + exporter author quickstarts
`edc1826`	Root README quickstart (copy-paste runnable) + ASCII pipeline diagram
`a3b7a66`	RFC-0003 §"M2 TelemetrySettings extension preview"
`622f2e8`	RFC-0004 Path A confirmed (`Factory` + `NewFactory()` bridge)
`e876810`	Pitfall: receiver goroutine panic recovery recipe
`58621fc`	`docs/FAILURE-MODES.md` inventory (handled / bounded / not-handled, every row linked to a test)
`6359fd7`	`docs/STRATEGY.md` Resolution column + registry-of-registries write-down

Code review closures (3 rounds)

Commit	Round
`189f85d`	v1 polish: default arm, empty-profile guard, wrapcheck doc note
`4928393`	Weak items: nps.md banner, plugin-prevention hard rule, Makefile coverage comment
`72b10a1`	v2: 14 strong-agree items closed
`a663545`	Trim superfluous prose + roadmap-track future work
`78c6508`	v3: SignalDuringBoot signaled flag, validate ctx-comment, doc-check

Real bugs found and fixed

Runtime concurrency bug (048f7b0): Start racing with Shutdown would launch components nobody would tear down. Coverage-instrumentation timing surfaced it.
Leaked signal-handler goroutine (2a1dcae): handleShutdownSignals blocked on <-sigCh forever after run() returned. goleak surfaced it.
splitName stutter (edd54a5): operator's bad input appeared twice in error messages.

Future work — tracked, not lost

MILESTONES.md — 5 milestone-tracked "Carry-forward from M1.6:" bullets (M3 cross-platform CI, M4b chaos-light, M5 benchstat + sustained-load, M6 finalize FAILURE-MODES + nps + backend recipes + doc-check, M8 receiver ergo review + network failures).
docs/FOLLOWUPS.md — 8 open items without fixed milestones (opportunistic + trigger-based) + 3 explicitly considered-and-skipped items with reasoning (DRY wrap helper, signalConsumer generic constraint, inode-based cache check). Pinned "Next up": make doc-check ✓ already shipped this round; squash-merge Assisted-by: verification (observable only post-merge).

Linked issue(s)

Refs internal M1 milestone (post-RFC-0003 / RFC-0004 hardening). N/A — no GitHub issue.

Release notes

[ENHANCEMENT] M1.6 hardening: stricter lint + per-package coverage gate + govulncheck + goleak + doc-check in CI; pipeline runtime variadic-Options configurable, recovers panics in Consume* hot path, double-SIGTERM force-exits; new `tracecore validate` subcommand and `--version-short` for scripting; pipeline assembly extracted to `internal/pipelinebuilder`; OTel v0.152.0 divergence table refreshed with Resolution targets; 3 real bugs fixed (Start/Shutdown race, signal-handler leak, error-message stutter); future work tracked in docs/FOLLOWUPS.md and MILESTONES.md "Carry-forward from M1.6" bullets.

Checklist

Tests added or updated
make ci passes locally
Commits are signed off (git commit -s)
Commits cryptographically signed (SSH-based)
N/A — no new components in this PR (clockreceiver/stdoutexporter shipped in M1.5)

Adds docs/STYLE-errors.md codifying seven rules for tracecore error messages: operator-actionable wording, no stuttering, %w-wrapped chains, no apologetics, %q-quoted operator input, sparing sentinels, typed errors only when callers need fields. Includes a deferral note on error codes (additive Coded interface seam at 1.0; do not implement now). Audit of every fmt.Errorf and errors.New in production code surfaced one rule-2 stutter: splitName included "component name %q" while every caller wraps with "pipeline X: <role> %q: %w", quoting the operator's input twice. Fixed by making splitName's error self-contained. RED: TestBuildPipelines_MultiSlashName_NoStutter pinned the stutter (strings.Count == 2 for the bad name). GREEN: same test passes (count == 1) after splitName no longer embeds the input string. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Adds bodyclose, forcetypeassert, goconst, nilerr, prealloc, wrapcheck to .golangci.yml (gocritic, unconvert, errorlint were already on). wrapcheck was the noisy one — 12 initial violations, all internal pass-through forwarders (Consume* chains, ComponentState lifecycle) or stdlib calls that genuinely cannot fail (bytes.Buffer.Write to in-memory buffer, context.Err()). Wrapping those would create the stutter STYLE-errors rule 2 forbids. Configured wrapcheck to ignore same-module packages and the relevant interfaces. Config schema verified via `golangci-lint config verify`. The schema for golangci-lint v2 wrapcheck uses kebab-case keys (ignore-sigs, ignore-package-globs, ignore-interface-regexps), not the camelCase shown in upstream wrapcheck README. RED: `make lint` after enabling — 12 wrapcheck violations. GREEN: after wrapcheck tuning, `make ci` reports 0 issues. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Phase 3 of M1.6: coverage threshold gate, scoped to the policy STYLE.md sets — internal/* ≥70%, components/* ≥60%, cmd/ informational, tools/ not policed. New surface: - `make coverage` runs go test with -coverprofile + -covermode=atomic against ./cmd/... ./components/... ./internal/... (tools/ excluded; it has its own tests but no coverage gate). - `make coverage-check` parses coverage.out via the small Go program in tools/coverage-check and fails the build if any policed package falls below threshold. - Wired into `make ci` between `test` and `build`. - Baselines recorded in docs/research/baselines.md; internal/fanout is at 70.4% (0.4% margin), explicitly noted as the package to watch during Phase 6's fan-out refactor. Runtime bug surfaced by coverage instrumentation: TestRuntime_ConcurrentStartShutdown_NoLostComponents fails under coverage timing — if Shutdown wins lifecycleMu first, it observes empty started* slices, releases the lock, and Start then runs and launches Components nobody will tear down. Fixed by setting a `shutdown` flag in Shutdown (under lifecycleMu) and returning nil from Start when shutdown is already true. Concurrent boot + signal becomes a clean no-op rather than a race-dependent leak. RED: TestRuntime_ConcurrentStartShutdown_NoLostComponents fails 3/3 runs under `go test -race -cover`. GREEN: same test passes 5/5 after the runtime fix; `make ci` clean with all 9 policed packages above threshold. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Replaces pipeline.NewRuntime(Settings, []Pipeline) with the variadic form pipeline.NewRuntime([]Pipeline, ...Option). The Settings struct becomes the unexported settings type; configuration flows through WithLogger, WithDrainBudget, WithHost. Why: M2 will add MeterProvider and BuildInfo fields. With a struct literal, that's a breaking change for every caller. With variadic options, M2 ships pipeline.WithMeterProvider(...) additively — every existing call site keeps working unchanged. API: rt := pipeline.NewRuntime(pipelines, pipeline.WithLogger(logger), pipeline.WithDrainBudget(5*time.Second), ) Options apply in order; later wins. New test TestRuntime_VariadicOptions_ApplyInOrder pins this so M2 can layer build-default + caller-override without proliferating constructors. Touched files: - internal/pipeline/runtime.go — Option type + 3 With* constructors - internal/pipeline/runtime_test.go — 17 call sites migrated + new test - internal/pipeline/firstdata_test.go — 1 call site migrated - internal/pipeline/contract_test.go — 1 call site migrated - cmd/tracecore/main.go — production call site migrated - internal/pipeline/pipelinetest/doc.go — drop stale NewRuntime reference - docs/rfcs/0003-... — implementation-notes entry RED: TestRuntime_VariadicOptions_ApplyInOrder fails to compile (Option / WithLogger / WithDrainBudget undefined). GREEN: same test passes; `make ci` clean; all 9 policed packages above coverage threshold; internal/pipeline coverage held at 85%. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Replaces buildMetricsPipeline / buildTracesPipeline / buildLogsPipeline (three near-identical ~95-line functions) with a single generic buildSignalPipeline[C any] + three small per-signal signalOps[C] values. The generic algorithm — exporters → fanout → reversed processors → first-data wrap → receivers — now lives in exactly one place. The signal-specific glue (which Create* method, fanout constructor, first-data wrapper, consumer-interface assertion) is the only thing per-signal, captured in metricsOps / tracesOps / logsOps. Net: -212 / +103 in cmd/tracecore/main.go (109 lines removed). The characterization test TestBuildPipelines_MultiSignal_ErrorsCiteCorrectSignal pins that each signal path still surfaces ErrSignalUnsupported with its own pipeline ID in the error — guarding against the worst kind of refactor regression (a logs error routed through the traces builder). ctx is the first parameter of buildSignalPipeline (revive's context-as-argument rule); ops is second, matching how the dispatcher reads at the call site. RED: characterization test passes against the three-function code. GREEN: same test passes after consolidation; `make ci` clean; all 9 policed packages above coverage thresholds; cmd/tracecore coverage rose 39.8% → 57.2% (informational; same tests exercise more of the smaller surface). Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Three near-identical files (metrics.go / traces.go / logs.go, 87+71+71 LOC) replaced with one genericFanout[T any, C any] in a new fanout.go plus three thin per-signal wrappers (~46 LOC each). The cloning algorithm lives once; per-signal files now hold only the pdata bridge funcs (new/copyTo/isReadOnly/markReadOnly), the ConsumeX method that forwards to consumeAll, and the public NewMetrics/NewTraces/NewLogs constructor. Shared cloning-contract test table (TestFanoutContract_{Metrics, Traces,Logs}) exercises 6 scenarios × 3 signals = 18 sub-tests through one generic runCloningContract helper. Lifts traces+logs coverage to parity with metrics — fanout coverage rose 70.4% → 100%. The pdata payload types (pmetric.Metrics, ptrace.Traces, plog.Logs) don't share a Go interface for CopyTo/IsReadOnly/MarkReadOnly in upstream pdata, so the generic uses a pdataOps[T] callback struct to bridge them. Each callback is one line. RED: TestFanoutContract_{Metrics,Traces,Logs} pass against the pre-refactor code (characterization). GREEN: same tests + all pre-existing per-signal tests pass after the generic consolidation; `make ci` clean. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

buildBinary used to call `go build` for every test invocation. With the integration test plus the new TestBuildBinary_CachesAcrossCalls plus `go test -count=N` re-runs, that was a ~2s wait per test. Replaces the per-test build with a package-level sync.OnceValues so the binary is built at most once per `go test` process. The binary lives under os.MkdirTemp so it survives across tests; TestMain cleans the directory after m.Run() returns. Measured (go test -count=2): TestBuildBinary_CachesAcrossCalls: 0.21s first, 0.00s second TestIntegration_ClockreceiverToStdoutexporter: 0.53s first, 0.33s second Total wall-clock: 2.485s; previously each invocation paid the build. RED: TestBuildBinary_CachesAcrossCalls fails on the pre-cache code (each call to buildBinary produces a new t.TempDir path, so require.Equal on paths fails). GREEN: same test passes after the sync.OnceValues cache; the cached binary's mtime is stable across calls, proving no rebuild. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Adds three Go-native fuzz targets covering the operator-input surfaces — the only places tracecore parses untrusted-ish strings: - internal/config/fuzz_test.go::FuzzLoad — writes the fuzz bytes to a t.TempDir file and pushes through config.Load. 5 seeds: empty, valid, malformed, multi-doc, binary. - internal/config/fuzz_test.go::FuzzParsePipelineID — 6 seeds: metrics, metrics/primary, traces/, //bad, badsignal/x, empty. - cmd/tracecore/fuzz_test.go::FuzzSplitName — 6 seeds covering the YAML component-reference name parser. Property under test: no panic, regardless of input. Each fuzzer ran clean for 3 seconds of mutation (370k/265k/508 execs respectively across 14 workers) with zero crashes. Interesting corpora live in the user's go-build cache, not the repo tree. RED-then-GREEN: each target's seed corpus IS the RED — running it the first time would fail loudly on any panic in the parser. All 6+6+5 seeds pass, demonstrating the panic-avoidance contract holds across the curated inputs. Future loops can drive these targets with `go test -fuzz=Fuzz... -fuzztime=...` for longer mutation runs. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Phase 9 decision: keep the dual-export pattern. Each component ships: var Factory pipeline.<X>Factory = &factory{} func NewFactory() pipeline.<X>Factory { return Factory } Path A (keep both, document) over Path B (extend manifest with a factoryVar flag, drop the bridge). Reasoning recorded in docs/rfcs/0004-clockreceiver-stdoutexporter.md "Implementation notes": 1. OTel components export NewFactory() — STRATEGY.md's load-bearing principle says mirror OTel unless we have a written reason to diverge. We don't. 2. The bridge is 3 lines per component. The generator complexity to support both paren-and-no-paren shapes would be more code than the bridge eliminates. 3. Receivers get both ergonomics: Factory for in-tree no-allocation reference, NewFactory() for codegen / future plugin surfaces. Touched: - docs/rfcs/0004-...: new "Implementation notes" section, sub-section "NewFactory() bridge to package-var Factory — Path A confirmed". - tools/components-gen/main.go: doc comment on the FactoryFunc field points at the RFC entry so future contributors see the rationale before they propose Path B again. No code-behaviour change; the decision is durable. M8+ contributors who copy clockreceiver's shape inherit the pattern. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

A panicking processor or exporter used to crash the binary mid-flight. M2 will hot-path components for self-metric counters; a buggy counter increment must not bring down the collector. New `internal/pipeline/saferun.go` exposes three wrap functions — WrapSafeMetrics / WrapSafeTraces / WrapSafeLogs — that wrap a consumer so a panic inside its Consume* is recovered, logged at Error with the component ID, and surfaced as an ordinary error to the caller. Capabilities() is a transparent pass-through so the fan-out cloning decisions stay correct. The wrapper is process- stable: a panicking call does not poison subsequent pushes. Wired in at the pipeline-assembly seam in cmd/tracecore. buildSignalPipeline: after every processor and exporter is created, the consumer-view is wrapped before being used as the upstream's `next` (or before being fed to the fanout). One signalOps callback (wrapSafe) per signal, mirroring how wrapFirstData is plumbed. Receivers don't get wrapped — they push into next, they don't get pushed into. The same goes for the fanout: it is tracecore code, not third-party, and we don't recover from our own bugs. Quarantine semantics (suspend a component that panics repeatedly) deliberately deferred to the M2 self-telemetry RFC. M1.6 ships the cheap behaviour: every push to a panicking component fails its own push but the next tick gets retried. RED: TestWrapSafe_PanicsRecovered_PerSignal fails to compile against pre-Phase-10 code (WrapSafe{Metrics,Traces,Logs} undefined). GREEN: same tests pass; `make ci` clean; internal/pipeline coverage rose 85.2% → 86.3% on the new saferun.go's 100%-covered surface. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Adds three *_bench_test.go files covering the most-trafficked hot-path code paths in tracecore: internal/pipeline/bench_test.go BenchmarkRuntime_StartShutdown (N=10, N=100 components) BenchmarkWrapFirstDataMetrics_ConsumeMetrics internal/fanout/bench_test.go BenchmarkFanout_NewMetrics (5 readonly+mutable mixes) internal/config/bench_test.go BenchmarkParsePipelineID Numbers recorded in docs/research/baselines.md (Apple M4 Pro, darwin/arm64, Go 1.26.2). Highlights: WrapFirstDataMetrics after firstdata fires: 3.5 ns/op, 0 allocs Fanout fast path (single readonly): 1.1 ns/op, 0 allocs ParsePipelineID: 84 ns/op, 0 allocs Runtime_StartShutdown sub-linear (10→100 = 2.4× not 10×) thanks to parallel Phase 1 The harness is the deliverable; M1.6 doesn't gate on these numbers. M5 will introduce regression gating per MILESTONES.md. Future phases that touch these code paths can compare quantitatively. Run via: go test -bench=. -benchmem -count=1 ./internal/... Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Adds govulncheck between coverage-check and build in the ci recipe. The standalone `make govulncheck` target was already present; this phase makes it a hard gate so a CVE in a called code path fails the build instead of waiting for an out-of-band scan. Current state at gate-on: 0 called vulnerabilities. govulncheck's symbol-level analysis flags 2 indirect-but-uncalled findings in imported packages and 6 in transitively-required modules — these don't fail the build because our code doesn't reach them. If a future bump pulls a CVE into a called path, ci will fail loudly. RED: the ci target doesn't include govulncheck (existing state). GREEN: `make ci` now runs govulncheck after coverage-check and before build; exit 0 with current dep set; would exit 1 on any reachable vulnerability. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

The first SIGINT/SIGTERM still triggers graceful shutdown via ctx cancellation. The second one fires os.Exit(exitFailure) so the operator can break out when graceful shutdown is taking longer than they expect. Previously the second signal was silently absorbed by the original NotifyContext handler. Replaces signal.NotifyContext with a direct signal.Notify on a buffered channel (cap=2) so two signals delivered back-to-back are both captured. The two-stage loop lives in handleShutdownSignals, extracted so it's unit-testable via a controlled channel — the integration test for the second-signal path can't deterministically race a fast-shutdown pipeline (clockreceiver+stdoutexporter complete shutdown in <1ms, beating the force-exit goroutine). Tests: - TestHandleShutdownSignals_TwoSignals_FiresForceExit pins the contract directly: feed two signals into the channel, assert cancel() and forceExit() are both invoked in order. - TestHandleShutdownSignals_ChannelClosed_ReturnsCleanly pins the shutdown path used when main exits naturally. - TestIntegration_DoubleSIGTERM_DoesNotHang verifies end-to-end that the second SIGTERM doesn't hang the process within 500ms; the exit code is intentionally not asserted because fast shutdowns may legitimately win the race. `--help` text now documents the two-signal protocol. RED: TestHandleShutdownSignals_TwoSignals_FiresForceExit doesn't compile against pre-Phase-13 code (handleShutdownSignals doesn't exist). GREEN: same test passes; `make ci` clean. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

New \`tracecore validate --config=<path>\` subcommand: loads the config, builds pipelines, prints a terse summary, exits 0. On any config error returns exit 2 (sysexits.h EX_DATAERR) — same code \`collect\` returns, so operators can use validate as a pre-flight dry run with the same shell-script semantics. Summary format (stderr): config valid: N pipeline(s) metrics/primary: receivers=1 processors=0 exporters=1 ... Tests cover the three contract paths: - good config → exit 0 + summary names every pipeline - bad config (unknown YAML field) → exit 2 + error names field - missing --config flag → exit 64 (usage error, kingpin path) RED: TestRun_Validate_GoodConfig_ExitsOK reports "unexpected validate" (kingpin doesn't know the subcommand). GREEN: same test passes; \`make ci\` clean. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Adds gocyclo to .golangci.yml with min-complexity=15 so every PR is auto-audited. Initial run flagged one function: buildSignalPipeline at 18. Refactored buildSignalPipeline into a small orchestrator plus three helpers — buildExporters, buildProcessors, buildReceivers — each in the 7-9 range. The orchestrator reads top-to-bottom as the pipeline assembly's natural shape: exporters → fanout → processors (reversed) → first-data → receivers No behaviour change; existing tests pass unchanged. cmd/tracecore test coverage held at 57.7% (informational; this section of the file was already well-covered by the buildPipelines tests). Audit baseline recorded in docs/research/baselines.md. RED: `make lint` reports cyclomatic complexity 18 on buildSignalPipeline (> 15). GREEN: all functions <= 15 after the helper split; `make ci` clean. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

\`tracecore --help\` now includes a Signals section (already present), plus new Exit codes and Examples sections so operators can read the contract end-to-end without leaving the terminal. Every flag's Help() text gained a one-line description + an example value where useful (--log.level, --log.format, --config, --shutdown.drain-budget). \`tracecore --version\` switched to a parseable k=v form: tracecore v0.0.0-dev (sha=abc1234, go=go1.26.2, built=2026-..., platform=darwin/arm64) Missing build metadata (no ldflags, no VCS info) drops the k=v pair silently, so \`go run\` still emits a sensible line. Tests: - TestRun_HelpOutput_HasExamplesAndExitCodes pins the help text contract (Signals / Exit codes / Examples / runnable example invocations). - TestRun_VersionOutput_HasStablePrefix pins the stable "tracecore v" prefix and the "go=" k=v entry. - TestBuild_StringFormat updated for the new format. To capture kingpin's --help / --version output in tests, run() now installs: - app.UsageWriter(stderr) / ErrorWriter(stderr) so output goes to the caller's writer instead of os.Stdout. - app.Terminate(...) that captures the exit code rather than calling os.Exit directly; run() returns the captured code so the operator-visible exit semantics are preserved. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Mirrors "Writing a receiver" with two new sections in internal/pipeline/README.md: ## Writing a processor (≤180 words; pointer to fakeProcessor in contract_test.go:82 as the canonical referent until M8+ ships one) ## Writing an exporter (≤200 words; pointer to stdoutexporter with explicit file:line citations) Each section follows the same template: shape, factory pattern, capabilities, pitfalls (each pitfall tied to a specific file:line when an in-tree example exists). M8+ contributors writing the second or third component in each category get a copy-able shape without reverse-engineering the existing one. Pitfalls codify lessons that surfaced during M1.5 review: - processor: don't capture `next` at config time, no panic guards (handled by saferun), no manual MarkReadOnly - exporter: synchronize shared writers, no blocking Shutdown, bounded retries against ctx No code changes; `make ci` still clean. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Root README: - Status now reflects M1.6 shipped state (was "not yet usable"). - New Quickstart section: build, write demo.yaml via heredoc, validate via the `tracecore validate` subcommand, then `collect`. Every line is copy-paste-runnable; verified end-to-end against the freshly-built binary. - Documentation list now links to docs/STRATEGY.md, docs/STYLE-errors.md, and internal/pipeline/README.md so a new reader can navigate from the front door to the runtime quickstart without guessing. internal/pipeline/README.md: - New "Architecture at a glance" section with an ASCII diagram of one pipeline: receivers (parallel) → WrapFirstData → processors (data-flow order, built last-first) → fanout → exporters (serial-LIFO drain). Each seam in the diagram links to the file that implements it. No code changes. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

The original RFC §TelemetrySettings noted MeterProvider was "deferred until M2". Phase 19 makes that preview concrete so M2 contributors land on a clean, pre-thought shape: type TelemetrySettings struct { Logger *slog.Logger Resource pcommon.Resource MeterProvider metric.MeterProvider // M2 — components emit self-metrics BuildInfo version.Build // M2 — attribution on emitted data _ struct{} // unkeyed-literal guard } The note lives in the Implementation notes section alongside the existing Phase 4 (variadic Options) entry, so the connection is explicit: Options are additive, the unkeyed-literal guard catches forgotten threading, and M2 ships `pipeline.WithMeterProvider(...)` / `pipeline.WithBuildInfo(...)` without breaking call sites. Doc-only. No code change; `make ci` clean. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

The three pairs of per-signal wrappers (safe{Metrics,Traces,Logs} and firstData{Metrics,Traces,Logs}) repeated the same boilerplate: hold \`next C\`, pass through Capabilities(), and forward Consume<Sig> with some additional work. Phase 6 had already de-duplicated this exact shape in internal/fanout via a generic + per-signal callback. This commit applies the same pattern to saferun.go and firstdata.go. internal/pipeline/saferun.go safeBase[T, C capable] with safeCall(ctx, payload) running invoke under defer/recover. safeMetrics/Traces/Logs embed it and define only their public Consume<Sig> method. internal/pipeline/firstdata.go firstDataBase[T, C capable] with fireAndForward(ctx, payload) running invoke after the once-only "pipeline first data" log. firstDataMetrics/Traces/Logs embed it and define only their public Consume<Sig> method. capable is the local constraint interface — anything with Capabilities() consumer.Capabilities, which is every consumer.X interface in this codebase. Behaviour preserved: existing TestWrapSafe_* + TestFirstData* tests pass unchanged. Error / log message text aligned slightly — the panic log now uses key-value attrs ("component", "op", "panic") instead of folding op into the message string, but the error returned to the caller keeps the "component %s: panic in %s: %v" shape. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Before: each signalOps had 8 closures including createX (3) + asXConsumer (2) + a consumerName string used solely to format the "does not implement <interface>" error. After: createExporter and createProcessor each return (component, consumer.C, error) and own the type assertion inline. The asExporterConsumer / asProcessorConsumer / consumerName fields are gone. The builder-bug error ("does not implement consumer.X") is now a self-contained string the per-signal closure produces; the buildExporters / buildProcessors call sites wrap it with the usual "pipeline X: <role> Y: %w" prefix. Net: ~30 lines removed from cmd/tracecore/main.go; one closure form per role across signals instead of two-pass create-then-assert. Behaviour preserved — the existing TestBuildPipelines_MultiSignal_ErrorsCiteCorrectSignal, TestBuildPipelines_UndeclaredComponentRef, and other build-side tests pass unchanged. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

parseProfile previously did `pkg := file[:strings.LastIndex(file, "/")]` — if the line's file field had no slash, LastIndex returns -1 and the slice expression panics with index-out-of-range. Go coverage profiles always include a package prefix in practice, but the slice was a latent crash on malformed input. Now: capture the index, skip the line if it's negative. One extra branch; no behaviour change on well-formed profiles. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Three test-first improvements driven by post-M1.6 self-review: 1. tools/coverage-check now has unit tests (previously zero): - happy-path parsing (multi-package aggregation) - empty profile (mode-only header) - missing mode header is rejected - non-existent profile path returns error, not panic - slashless file path is skipped (pins the M1.6 slash guard) - thresholdFor table covers internal/components/other prefixes 2. \`tracecore --version-short\` emits just the bare version string for script consumption (\`VER=$(tracecore --version-short)\`). Implemented as an argv scan that short-circuits before kingpin runs subcommand validation, so it works even though collect's --config is Required. Test: TestRun_VersionShort_BareVersionForScripting. 3. Empty-pipeline boot now logs "no pipelines configured" at WARN (was INFO). Operators running with default log-level filters (warn+) would have missed an INFO line in a typo'd-config scenario that looks indistinguishable from a successful run. Test: TestRunCollect_EmptyConfig_NoPipelinesIsWarn. RED-first on each: the version-short and warn-level tests fail against pre-fix code; the coverage-check tests document existing behaviour and pass against the M1.6-era slash guard. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

The single 720-line cmd/tracecore/main.go grew through M1.6 to hold five distinct concerns: CLI wiring, signal handling, subcommand implementations (collect, validate), pipeline assembly, and the signal-generic ops vars. Reading any one required scrolling past the others. Split (preserving package main; all symbols remain unexported and inter-callable): main.go 165 CLI entry, kingpin wiring, exit constants, helpDescription, newLogger signals.go 25 handleShutdownSignals collect.go 72 runCollect validate.go 41 runValidate build.go 334 buildPipelines, buildSignalPipeline, the three build<Role> helpers, splitName, componentSet, resolveComponent, hasFactories, hasOperatorIntent signalops.go 134 signalOps struct + metricsOps/tracesOps/logsOps No behaviour change. \`make ci\` clean; all tests pass; internal/ pipeline coverage rose to 89.5%, version package to 90.5% due to cached test paths re-running across the freshly-built binary. Reviewability win: an M8 receiver author looking at how pipelines are assembled opens build.go. An operator-facing CLI change touches main.go. The signal-generic plumbing has its own file. Each file is under 350 lines. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Adds goleak coverage to two packages where lifecycle goroutines live: - internal/pipeline: new TestMain wrapping goleak.VerifyTestMain. - cmd/tracecore: existing TestMain now calls goleak.Find() after m.Run() returns (and after binary-cache cleanup), failing the test run on any leftover goroutine. The check immediately caught a real leak: run()'s sigCh-reading goroutine (handleShutdownSignals) blocks on <-sigCh and never wakes when run() returns, because we only signal.Stop() the channel — we never close() it. In production this is moot (process exits, all goroutines die); in tests, every TestRun_* invocation accumulated a leaked goroutine. Fix: close(sigCh) AFTER signal.Stop() in the run() defer. Stop prevents the runtime from sending to the channel before we close, so we don't panic on send-to-closed-channel. handleShutdownSignals already handles the closed-channel case (its <-sigCh, !ok branch). This was a knowledge-gap closure from the post-M1 self-assessment: "go test -race detects races, not leaks; we have no goleak coverage." Adding it surfaced the leak the next iteration after we noticed. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Adds TestMetrics_NestedPayload_DeepCloneIsolation, a realistic-shape payload (2 ResourceMetrics × 2 ScopeMetrics × 1 Gauge × 2 DataPoints with attribute maps) pushed through a mutating + readonly fanout pair. Asserts that a mutator adding a key to its data-point attribute maps does NOT leak into the readonly consumer's clone — verifying pdata's CopyTo semantics at every nesting level (RM, SM, Metric, Gauge, DataPoint, pcommon.Map). This closes the post-M1 knowledge gap "we test single-resource payloads; real receivers push deeply nested pdata and we never verified clone isolation at the attribute-map level." Result: isolation holds; the existing fanout cloning algorithm is correct for production-shaped payloads. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Two doc-only gap closures from the post-M1 self-assessment: 1. \`tracecore validate --help\` now says the output is human- readable and may change between releases. Operators tempted to parse "config valid: N pipeline(s)" for scripting see the warning directly. A future --format=json would be the stable surface. 2. internal/pipeline/README.md's receiver-author Pitfalls list now covers goroutine-internal panic recovery: saferun catches panics inside ConsumeMetrics/Traces/Logs, but a panic in the receiver's own goroutine (before reaching next.Consume*) is NOT caught and crashes the process. Includes the defer/recover snippet receivers should copy. No code change. \`make ci\` clean. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Three closures from the OTel v0.152.0 divergence audit run as background research alongside the gap-closing loop: 1. **Rename** pipeline.ErrSignalUnsupported → pipeline.ErrSignalNotSupported to match OTel's exact spelling. Pre-1.0 breaking change with low blast radius (8 in-tree call sites). The error message text is unchanged ("signal not supported by this component"), so operator logs don't drift. 2. **Multi-instance regression test** (TestBuildPipelines_MultiInstance_SameTypeTwoInstances) — pins that two instances of the same factory (clockreceiver/fast + clockreceiver/slow) coexist with independent IDs and distinct goroutines. Smoke-verified live; locking with this test. 3. **STRATEGY.md divergence table** expanded with six newly- identified drifts: - Host.ReportStatus location (method vs componentstatus free fn, align in M2) - CreateSettings shape (missing BuildInfo + unkeyed-init guard, M2 work) - TelemetrySettings shape (missing TracerProvider/MeterProvider, M2 work) - Factory interface decomposition (intentional flattening; OTel adds XStability() we defer to post-v1.0) - ErrSignalNotSupported naming (now aligned) - Consumer-seam panic recovery (intentional tracecore addition; OTel has none — documented as deliberate divergence) - "pipeline first data" log (intentional tracecore UX addition) Also adds an anti-list of "we're identical here" findings so future contributors don't re-audit the agreed points. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Two follow-ups from the OTel-audit gap closure: 1. Doc-comment Host.ReportStatus with a Deprecated note pointing at the M2 migration to OTel's componentstatus.ReportStatus(host, ev) free-function pattern. Receivers can call it today; the rename will be a one-line edit per call site once M2 lands. No code change yet — premature churn pre-1.0 with no actual callers. 2. TestReceiver_GoroutineDeferRecover_KeepsProcessAlive — exercises the documented recipe from internal/pipeline/README.md's Pitfalls list. A receiver whose internal goroutine defers + recovers its own panic must (a) not crash the process and (b) let the runtime shut down cleanly. Without the test, the recipe was unverified; with it, the README has a tested example to point at. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Three non-blocking suggestions from PR #13 review: 1. buildSignalPipeline switch gains a default arm (build.go) so a fourth Signal value added to internal/pipeline can't silently produce zero pipelines. Currently unreachable — ParsePipelineID rejects unknown signals — defense in depth. 2. coverage-check fails loudly when the profile has zero measurable packages (tools/coverage-check/main.go). Prevents the gate from silently becoming a no-op if `make coverage`'s ./... glob gets narrowed in a future tooling change. 3. STYLE-errors.md rule 2 now has a callout explaining why wrapcheck's ignore-package-globs excludes the entire module (in-module errors share context with callers; wrapping stutters). Reminds future contributors that wrapcheck still polices stdlib / third-party returns. The fourth suggestion (constructor helpers for safeBase/firstDataBase) deferred: current field-assignment pattern works and a fourth wrapper family isn't imminent. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Self-review surfaced two more deferrals: - FAILURE-MODES.md references test names by string; rename rot is invisible until a reader follows a dead link. A make doc-check step that asserts existence would close this. ~30 minutes when next touching the Makefile. - The pipelinebuilder package's test surface is split (cmd/tracecore integration coverage + pipelinebuilder's own processor-path test). A one-paragraph doc.go answering "where do I test buildPipelines?" removes the ambiguity for the next contributor. Both belong in FOLLOWUPS.md so they survive past chat. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Three weak items from the fresh-eyes self-review that were cheap enough to close in-PR: 1. nps.md gets a "stub, pre-pilot" banner up top. The doc was over-prescriptive — committing to specific NPS methodology before talking to a single operator is shaping-the-question, not asking-the-question. The banner makes it clear nothing is binding until pilots run. 2. STRATEGY.md "Each component package owns its own Factory var" section gets a hard rule: PRs introducing centralized Register APIs / global factory maps / plugin loaders MUST be rejected without an accepted RFC, burden of proof on the proposer. Was hopeful-policy; now a concrete gate. 3. Makefile coverage target's tools/ exclusion gets an explicit comment. Was a silent behavior; now an in-line "if a future tool needs coverage tracking, add it to both lists" note. Plus one item tracked in FOLLOWUPS.md (not fixed in this PR): - Verify squash-merge preserves Assisted-by trailers after the PR actually lands on main. The disclosure-discipline gap is real but only observable post-merge. Items still NOT closed by this PR (intentional, tracked): - FAILURE-MODES.md doc-rot guard (make doc-check) — FOLLOWUPS - pipelinebuilder doc.go test-surface note — FOLLOWUPS - Cross-platform CI — FOLLOWUPS (M3) - Backend integration recipes — FOLLOWUPS (M14-16) - PR size itself — structural, can't fix mid-PR Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Strong-agree items from the v2 self-review, batched: builder.go: - Reframe "this build" error → "no component factories registered" (library-shaped phrasing in an exported function) - Reframe "unsupported signal %v (this build has no assembly path)" → "no assembly path registered for signal %v" - Nil default-config guard in resolveComponent (defense in depth, matches the default-arm philosophy) - Drop dangling "(UX #4)" reference; explain the criterion inline pipelinebuilder/doc.go (new): - Explains why the package is a sibling, not nested under internal/pipeline (cycle) - Documents the test-split convention so contributors aren't ambiguous about "where do I test buildPipelines?" builder_test.go: - mustYAML panics on empty input rather than silently returning a zero yaml.Node coverage-check: - Dedup key built from regex captures (file + range explicitly) instead of fragile raw-line slicing. The regex now has 4 capture groups so a future format tweak surfaces as a regex mismatch. integration_test.go: - SignalDuringBoot: replace the 5ms-sleep race with a 500ms retry loop that signals until accepted (or the deadline expires). Resolves the Linux-CI flake risk flagged in the review. validate.go: - Rename `stderr` param to `out` (the function doesn't care which writer it is) - ctx.Err() check between Load and BuildPipelines so a SIGTERM during a slow validate bails cleanly FAILURE-MODES.md: - TestLoad_MultiDocYAML_Rejected → TestLoad_MultiDocumentYAML_Rejected (test name was wrong) - "Bad YAML" row now references a specific test, not just the file - "SIGINT during boot" now references TestIntegration_SignalDuringBoot - "Phase-1 timeout" row: no dedicated test exists; row clarified to reference the indirect coverage rather than fabricate a name - "First-data instrumentation" now references a specific test - "Receiver-internal goroutine panic" row: clarify that the linked test exercises the *recipe*, not a crash assertion (negative case isn't tractable from inside the panicking process) - New row: stdout pipe broken mid-write — present failure mode in stdoutexporter today, was missing from the inventory FOLLOWUPS.md (restructure): - New "Next up" pinned section (doc-check, cross-platform CI, squash-merge trailer verification) - "Considered and explicitly skipped" section with rationale for: the DRY wrap helper, the signalConsumer generic constraint, the TestBuildBinary inode check - New rows: signalops trigger refinement (non-mechanical-diff is the real trigger), Factories.Validate() method, validate --json, make register-lint, make ci wall-clock trend, coverage trend - Ticked: internal/pipelinebuilder doc.go (landed in this commit) One review item attempted and reverted with documentation: - signalConsumer generic constraint on signalOps[C] silently dropped pipelinebuilder coverage from 74%→43% under -coverpkg. Root cause not nailed (suspected generic instantiation interacting oddly with coverage attribution); FOLLOWUPS records the diagnosis in the skipped section. Verified: internal/safe.Call exists as the README claims (review item #25). Items deferred to FOLLOWUPS (with explicit rationale): DRY wrap helper, signalConsumer constraint, inode check, register-lint, Factories.Validate, validate --json, OSS-Fuzz integration, benchstat in CI, wall-clock trend, coverage trend. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Two passes addressing the user's "remove superfluous documentation" and "put important-for-the-future items into a roadmap" feedback: Trimmed (remove what-isn't-there unless necessary): - docs/FAILURE-MODES.md: drop the "Things tracecore deliberately does NOT handle (yet)" prose section. The 🔴 rows in the inventory tables already capture the same content; the bottom section was duplicate-with-future-speculation. Intro paragraph also tightened. - docs/nps.md: trim the "Status: stub, pre-pilot" banner from a paragraph to two lines. - docs/STYLE-errors.md: drop the "Deferred — no error codes yet" section. Speculative; the deferral fact lives in RFC-0003. - internal/pipelinebuilder/doc.go: tighten test-split note from three paragraphs to one. - internal/pipeline/README.md: trim the "Deferred from M1" section to a single sentence pointing at RFC-0003. Roadmap-tracked (use the phrase "Carry-forward from M1.6"): The user asked: "anything that would be important for the future it's good to put into a roadmap." MILESTONES.md is that roadmap. Added "Carry-forward from M1.6:" bullets under the affected milestones (M3, M4b, M5, M6, M8) for items that have concrete milestone targets — cross-platform CI, chaos-light test, benchstat in CI, sustained-load, backend integration recipes, doc-check, the M8 receiver-author ergonomics review. docs/FOLLOWUPS.md restructured to keep only the two things that DON'T fit a milestone: - opportunistic / trigger-based items (no fixed date) - considered-and-explicitly-skipped (anti-rot) Cross-link added in FOLLOWUPS preamble so a reader looking for the M3 carry-forward grep "carry-forward" in MILESTONES.md. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Closes the four actionable items from PR-13 v3 review: A. TestIntegration_SignalDuringBoot retry loop: - Track `signaled` bool; fail with a meaningful message if all 250 attempts return errors (was: misleading "process hung" via the cmd.Wait deadline) - Detect os.ErrProcessDone and break early (process exited before we delivered the signal — nothing to wait for) B. validate.go ctx-cancel exit-code comment: - runValidate returns exitFailure on ctx-cancel; runCollect returns exitOK. Inconsistency was intentional: validate's job is to PROVE config validity; cancelled validate hasn't proved anything and shouldn't claim success. Comment makes this intentional asymmetry visible. C. Pointer comment at top of signalops.go warning future contributors away from adding the signalConsumer generic constraint. The PR-13 v2 attempt dropped coverage 74%→43% under -coverpkg; the comment + FOLLOWUPS entry should stop the next person from rediscovering it. D. New `make doc-check` (scripts/doc-check.sh): verifies every Test/Fuzz/Benchmark identifier referenced in rot-prone docs (docs/FAILURE-MODES.md + internal/pipeline/README.md) corresponds to a real function in the test tree. Scoped tight on purpose — RFCs intentionally reference deleted-test names as historical record. Wired into `make ci` between govulncheck and build. Plus two minor v3 review tweaks: - internal/pipelinebuilder/doc.go gains an "As of 2026-05" date so future readers can judge staleness of its "where the tests live" guidance (review item #4). - FOLLOWUPS "considered and skipped" DRY-wrap entry now says "Revisit if: never — this is a style commitment" (review #12). Current state: doc-check finds 23 valid references across 2 docs; all CI gates pass. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

## What this PR does Closes the one FOLLOWUPS item from PR #13 that could only be observed post-merge: whether GitHub's squash-merge preserves the per-commit `Assisted-by:` trailers. **Observed on squash commit `a57883f`: it does not.** The audit trail survives via the PR description (embedded in the squash commit body, listing every source SHA), but the trailer-level disclosure AGENTS.md describes is gone. Documents the finding + flags two governance paths that aren't being chosen unilaterally: - Update AGENTS.md to treat the PR-description SHA list as canonical squash-merge disclosure, or - Switch AI-assisted PRs to merge-commit / rebase-merge so per-commit trailers survive. Tag `v0.1.0-m1` was also created on `a57883f` (dev-preview, not a release artifact) — pushed alongside this branch but separately. ## Linked issue(s) Refs PR #13 (post-merge observation). N/A — no GitHub issue. ## Release notes ```release-notes NONE ``` ## Checklist - [x] N/A — docs-only change - [x] `make ci` passes locally - [x] Commits are signed off (`git commit -s`) - [x] Commits cryptographically signed (SSH-based, see `scripts/setup-signing.sh`) - [x] N/A — no new components in this PR Signed-off-by: tree <tree@lumalabs.ai> Signed-off-by: Tri Lam <trilamsr@gmail.com>

Address items #9 (--version OTel SDK), #13 (Example_* tests), #14 (Kind constant coverage + helper), #15 (shutdown error logging). Also tactical: extract initTelemetryStack from runCollect so the silent-shutdown-error fix didn't push cyclomatic complexity past 15. - internal/version: Build gains OTelSDKMetric + OTelPromExporter fields, populated from debug.ReadBuildInfo. Surfaces in --version output as `otel-sdk-metric=vX.Y.Z, otel-prom-exporter=vA.B.C`. Operators debugging OTel-side bugs see which SDK actually shipped without grep'ing go.mod. - internal/selftelemetry: new CanonicalKinds() returning the Kind* constant set and IsCanonicalKind(s) helper. Test pins the set + asserts clockreceiver's KindDownstream is canonical. Future receivers (M8/M9) plug in the same test pattern; eventual lint enforcement is documented in FOLLOWUPS. - internal/selftelemetry/example_test.go: ExampleNewReceiver + ExampleNewExporter render in pkg.go.dev for M8/M9 authors and compile-check the 6-line wiring pattern every CI run. The RECEIVER-PATTERNS.md snippet can no longer drift from a working example without breaking this test. - cmd/tracecore.runCollect: previously swallowed mp.Shutdown errors in the failure-unwinder paths with `_ = mp.Shutdown(ctx)`. Now logs at Warn level so a real drain failure surfaces. Extracted the telemetry-init block into initTelemetryStack helper to keep runCollect's cyclomatic complexity under the gocyclo 15 ceiling (the helper's own error-handling carries the rest of the cleanup branches). make ci clean. Coverage unchanged on most pkgs; selftelemetry up to 91.0% via the new IsCanonicalKind test. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>

Two independent reviews of PR #18 surfaced a stack of blockers, strong findings, and quality lifts. This commit lands the tractable items (defers documented in docs/FOLLOWUPS.md M8 section). Operator-visible drift (closes Reviewer 2 #13-#17): - RUNBOOK kind-triage row `consume` → `downstream` (last commit renamed the kind but missed the doc table). - README config table: `initial_delay` range updated from "0.. collection_interval" to "≥ 0" (code already relaxed for DGX cold-starts; doc was stale). - prometheus-alerts: DCGMReceiverHighErrorRate threshold changed from `rate > 0.1/sec` (unreachable: 15s default tick caps at 0.067/sec) to `increase > 5 in 5m`. Stale "M2 has not landed" caveat removed. - example_config: single `mode:` line with a clarifying comment (was showing both `mode: standalone` and `# mode: embedded`, inviting operators to uncomment both → YAML duplicate-key). - cmd/tracecore receivers list now prints `dcgm [stub]` / `dcgm [cgo]` so operators can verify deploy shape without reading go.mod. Build-tag-conditional via three small files in cmd/tracecore (receiver_variants*.go) — pattern extends to M11 NVML. Correctness bugs (closes Reviewer 2 #2, #3, #4, #6, #7): - receiver.Shutdown: `r.running.CompareAndSwap(true, false)` gates teardown so a second Shutdown is a no-op (cgo libdcgm `dcgmShutdown` is not documented idempotent). Same CAS provides the happens-before for `r.cancel` publish that Pass-1 flagged. - receiver.ensureWatched: zero-entities path now emits IncError(KindEnumerate) + degraded rather than returning true. Without this, a misconfigured host (no GPUs visible, ACL blocks /dev/nvidia*) had the receiver looking healthy while emitting nothing. - receiver: new `warnOnce` helper gates the 7 per-tick failure- path Warn logs to fire only on the first failure after recovery. Closes the log-storm bug (4 errors/min × 60 min = 240 lines). Counter (`receiver_errors_total`) still ticks every failure. - metrics.applyCardinalityCap: parameter `cap` → `maxSeries` (cap shadowed the Go builtin). Quality / contract lifts: - metrics.emit now returns a `stale` count for StatusStale and StatusError samples. pushSamples calls IncError(KindRead) once per tick when stale > 0 — surfaces DCGM serving slow/faulty data, which is precisely what StatusStale exists for. Per-tick not per-sample so GPU count doesn't inflate the rate. (StatusNoData and StatusFieldNotSupported still silent.) - docs_parity_test.go: new TestRUNBOOK_KindsMatchEmitted walks every emitted IncError/failedTick kind against the RUNBOOK per-kind triage table in both directions. This is the structural fix for the bug class — RUNBOOK can never again drift from emitted kinds without CI failure. - receiver.go: promoted `watchUpdateDivisor` / `watchKeepForMultiplier` / `watchUpdateEveryMinimum` constants for the previously-magic DCGM watch-cadence ratios. Documentation + dedup: - dcgm README: new "Privacy + data residency considerations" subsection (compliance-auditor ask). Flags hw.id / pci.bdf / NVLink peer IDs as quasi-identifying; provides two mitigation patterns (attr-drop processor, salt-hash pseudonymization). - docs/agents/examples/constructor_options.go: `WithTelemetry` renamed to `WithSelfTelemetry` to match the real in-tree receiver API. M9+ authors copying the example no longer drift. - RUNBOOK kind enumeration line restored with both watch and mig (dcgm-local kinds) per the last commit's promotion. - Repo-root `FOLLOWUPS.md` consolidated into `docs/FOLLOWUPS.md` (M8-opportunistic + M8-skipped sections). Single source of truth; 17 deferred items pulled forward with falsifiable triggers (cgo client landing, M11 sibling-receiver shape, operator-report thresholds, file-size triggers). - All bare `FOLLOWUPS.md` references updated to `docs/FOLLOWUPS.md`. Honest pushback documented: - I disagree with the M8-AGRADE-GAP claim of Operator UX 3.7→4.0. The drift findings above are exactly the class of bugs that rubric criterion was supposed to prevent — the alerts-vs-RUNBOOK parity test existed but didn't check kind values against emitted call sites. The new TestRUNBOOK_KindsMatchEmitted closes that gap; future operator-UX claims should pin to a test like this. - Deferred: split receiver.go (475 LOC) into 3 files, hoist dcgmtest.BaseClient, SECURITY.md for receiver, dcgm_info join-target, libdcgm setup in CONTRIBUTING — all logged with triggers. Reviewer 1's "construct receiver via M9-style primary-Option" inconsistency goes in the queue for M9-close review. `make ci` passes; dcgm coverage 86.0% (essentially flat — new tests offset by widened emit signature in test paths). Assisted-by: Claude Opus 4.7 Signed-off-by: Tri Lam <trilamsr@gmail.com>

## Summary - Strip review-process artifacts (reviewer tags, PR refs) and section-banner comments from 5 files. - Collapse per-field / per-const / per-method comment blocks whose bodies just restated identifier names. - Keep every load-bearing WHY (TOCTOU ordering, adversarial-input bounds, retry short-circuit, gzip threshold, classifyKind partition, cardinality discipline). - Net: **+118 / −357 lines**, no behavior change. ## Why these, not others Six-months-cold-reader test: would a reader without this conversation's context need the comment to avoid a footgun? If no, delete. | File | What went | Why it failed the test | |---|---|---| | `pkg/nccl/fr_parser/parser.go` | `// --- section ---` banners; per-field struct prose | Banners decay silently as the file grows; field names already self-document. Adversarial-input WHY preserved on the struct. | | `components/exporters/otlphttp/otlphttp.go` | Reviewer P3-Rev2-F1, S1/F9, C4/V1/V2, V14/m3, P2.10, M-5, M7 tags | Process coordinates that mean nothing post-merge; git blame + PR threads are the right home. Underlying constraints kept. | | `internal/pipelinebuilder/signalops.go` | "PR #13 v2 tried it" attribution | The rule (don't add the constraint) survives; docs/FOLLOWUPS.md already preserves the history pointer. | | `components/receivers/pyspy/kinds.go` | 6-line blocks per Kind const (× 12) | RUNBOOK.md owns operator semantics. Two truths drift; const names are self-documenting. | | `internal/selftelemetry/interface.go` | Per-method one-liner docs whose bodies restated method names; cardinality warning duplicated on every method | DRY for prose too. Cardinality discipline stated once on `Kind`. | ## Release notes ```release-notes NONE ``` ## Test plan - [x] `go build ./...` - [x] `go vet ./...` - [x] `go test ./pkg/nccl/fr_parser/... ./components/exporters/otlphttp/... ./internal/pipelinebuilder/... ./internal/selftelemetry/... ./components/receivers/pyspy/...` — all green 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Eight 1-page pattern-design specs covering #2 IB link flap, #7 dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout, #10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss spike NaN, #13 silent data corruption. Each carries the standard detector-design shape (symptom, layers, signal sources, evaluation rule, verdict attrs, edge cases, status, open questions) so the next contributor can write a TDD red test directly off the spec. Status: all 8 marked planned. #10 already has issue #303; the spec frames the design alongside. NORTHSTARS Appendix A gains a Spec column; docs/README + patterns README link the new specs. Signed-off-by: Tri Lam <tri@maydow.com>

## Summary Ships pattern #13 (silent data corruption) end-to-end per the spec at `docs/patterns/13-silent-data-corruption.md` — the hardest pattern in the 15-set, where the run completes, loss looks normal, but downstream eval shows degraded model quality. - **Library**: `SilentDataCorruptionDetector` in `module/pkg/patterns/silent_data_corruption.go` consumes `EvalAccuracyRecord` + `SDCCounterRecord` typed projections and emits `SilentDataCorruptionVerdict`. Discriminator follows the spec's evaluation rule: - `accuracy_drop >= AccuracyDropThreshold` AND same-job `hw.gpu.sdc.*` counter rose during the job window → `kind=vendor_signaled` (full confidence) - `accuracy_drop >= AccuracyDropThreshold * 2` alone → `kind=accuracy_only` (partial confidence) - sub-2x-threshold without vendor signal → no verdict (conservative-by-default noise band) - **False-positive guards** (spec §"Edge cases"): dataset-checksum mismatch suppresses verdicts; checkpoint-step mismatch suppresses verdicts. Both guards are opt-in (only fire when both sides set the field). - **Wiring**: `patterndetectorprocessor` projects `gen_ai.training.eval_accuracy*` + `hw.gpu.sdc.*` log records, runs the detector, emits the verdict with promoted scalars per issue #270. - **Schema**: `module/pkg/patterns/testdata/silent_data_corruption_verdict.schema.json` pins the verdict wire format; 13 schema-drift falsifiers + conformance round-trip for both branches. - **Docs**: spec status flipped to shipped; `docs/patterns/README.md` table updated; `docs/ATTRIBUTES.md` registers the 13 new customer-stable attribute keys (`hw.gpu.sdc.{delta,kind}`, `gen_ai.training.eval_accuracy.*`, `gen_ai.training.eval_set.*`, `gen_ai.training.checkpoint.*`, `gen_ai.training.job.{start,end}_unix_nano`, `tracecore.alert.silent_data_corruption.*`) and adds the `silent_data_corruption` row to the per-pattern matrix. Verdict remains explicitly advisory per spec §"Detector evaluation rule" — both branches route the operator to a same-recipe + same-seed re-run on different hardware as the disambiguator, because SDC repro is non-deterministic. ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -count=1` — green (19 SDC library tests incl. 13 schema falsifiers, 7 wiring tests). - [x] `cd module && go vet ./...` — clean. - [x] `golangci-lint run ./...` — 0 issues (via pre-commit hook). - [x] `attribute-namespace-check` — 82/82 unique attribute literals documented (via pre-commit hook). - [x] Full module `go build ./...` — clean. ## Integration gaps (upstream-blocked, documented in-place) - `gen_ai.training.eval_accuracy` upstream framework instrumentation (spec §"Open questions" 5) — projection tested against hand-crafted records; real-world input arrives once the OTTL recipe lands. - `hw.gpu.sdc.*` OTTL recipe for DCGM SDC catcher / row-remap / AMD ECC counters — blocked on RFC-0014 PR-B (metrics→logs pattern). Cross-vendor support depends on `hw.gpu.sdc.*` semconv that doesn't exist yet (spec §"Open questions" 2). - Baseline-accuracy provenance + cross-run state are v0.3+ product questions (spec §"Open questions" 1, 4) — the detector accepts operator-stamped baselines today. ```release-notes Add pattern #13 (silent data corruption) detector. Surfaces suspected SDC when an eval-accuracy regression (`gen_ai.training.eval_accuracy` vs baseline) crosses the configured threshold; a same-job `hw.gpu.sdc.*` counter rise during the job window flips the verdict to high confidence (`kind=vendor_signaled`). The verdict is advisory — both branches recommend a same-recipe re-run on different hardware as the disambiguator. New config knobs: `sdc_accuracy_drop_threshold` (default `0.005`) and `sdc_accuracy_only_multiplier` (default `2.0`). ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>

The patterndetector ships 11 detectors with 14 time-bounded knobs, but the join shape varies across patterns and the rationale lived only in code comments + PR review threads. Operators tuning windows had to read source per detector. Audit finding: five distinct shapes are load-bearing (chosen by the causal physics of each signal), not bugs: - One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect. - Asymmetric two-sided (#11): pre-stall covers concurrent-start checkpoints; post-stall covers OTTL-bridge logger latency. - Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window could be cause OR consequence. - Job-window bounded (#13): SDC counter rise must fall in the bounded eval-cycle's owning job; no operator knob is meaningful. - Trailing-window rate / freshness (#2 #4 #8): rolling window anchored at `now` or the most-recent record. Decision: document the existing reality, do not converge. Forcing every detector to the asymmetric two-knob form would silently zero one leg for the one-sided detectors (footgun on clock skew) and would not apply to #13 at all. Adds: - 'Why this correlation shape' section in docs/patterns/07, 11, 13 (the three shapes the issue called out by name). - 'Correlation-window semantics' table in docs/patterns/README.md covering ALL 11 detectors with the predicate, anchor, and shape rationale, plus cross-links to the per-pattern sections. No code changes; no detector behavior changes. Closes #367. Signed-off-by: Tri Lam <tri@maydow.com>

## Summary Three v1 detectors use three different correlation-window shapes; the rationale lived only in code comments and review threads. Operators tuning windows hit these without warning. - pattern #7 (dataloader_hang): one-sided look-back — strict cause→symptom - pattern #11 (checkpointer_hang): asymmetric (-30s, +60s) — log/record race - pattern #13 (silent_data_corruption): symmetric job window — order-free attribution ## Decision Document the existing reality. Each shape matches a distinct physical event-ordering. A unified asymmetric two-knob form (issue suggestion) would silently zero one leg for #7 and not apply to #13 — operators would tune knobs with no physical meaning for their pattern. ## Changes - "Why this correlation shape" subsection in each of `docs/patterns/07-dataloader-hang.md`, `11-checkpointer-hang.md`, `13-silent-data-corruption.md` - "Correlation-window semantics" comparison table in `docs/patterns/README.md` cross-linking the three rationales - No code changes; no detector behavior changes ## Test plan - [x] `golangci-lint`, `go vet`, `attribute-namespace-check` (pre-push hook) — green - [x] Each pattern doc cross-links to the comparison table - [x] Each comparison-table row links back to the per-pattern rationale section Closes #367. ```release-notes docs: document correlation-window semantics for patterns #7, #11, #13 — three shapes (one-sided, asymmetric, symmetric) match distinct physical event-orderings; no behavior change. ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>

Pre-fix, scripts/doc-check.sh:103 exited 0 whenever docs/FAILURE-MODES.md carried no Test*/Fuzz*/Benchmark* references (its current state on main, grep count = 0). The early-exit silently bypassed every gate below it: link-rot (.md + non-.md), banned-phrase lint, (unverified) baseline, reproducibility shell-syntax, maintainer- ship H2, M6 integration recipes (markers + 180d staleness), example placeholder lint, README integration index, required-doc presence, getting-started block cap, nps H3, FAILURE-MODES labels, and the comment-noise diff gate. Root cause: a57883f (#13) shipped doc-check.sh with one gate (the Go-test parity check), so `[ -z "$referenced" ] && exit 0` was correct. Subsequent PRs appended ~14 gates below without noticing they'd be dead-coded whenever FAILURE-MODES dropped its Test* refs. PR #459 worked around by placing the new YAML gate *above* line 99. Fix: scope the skip to the parity block only (if/else, not exit). Update the comment in the trust-posture preamble to record the bug. Surfaced + fixed: - docs/FOLLOWUPS.md: dead `followups/otlphttp.md` link. Shard never shipped on main; folded into the "deleted post-v0.2.0" prose. - docs/cut-criteria.yaml.md (2x) + install/kubernetes/tracecore/ README.md: "production-grade" banned-phrase hits reworded. - scripts/doc-check.sh: rescope getting-started.md block cap to the `## Walkthrough` H2 section only (M6 intent — quickstart commands, not Helm + air-gap installs that landed after the cap was set). Mutation-verified post-fix: - inject dead .md link → link-rot gate exits 1, names file + target - inject "blazing-fast" + "rock-solid" → banned-phrase gate exits 1 - remove tested-against marker from datadog.md → recipe gate exits 1 Gate count: 2 status lines pre-fix → 16 post-fix on clean tree. `make doc-check` exits 0 on clean tree. Closes #460. Signed-off-by: Tri Lam <tree@lumalabs.ai>

Reviewer B flagged the PR body's "2 → 16" gate count as inaccurate and asked whether the test plan had been re-run against post-#453 state. Resolution recorded here: Gate count, empirically verified via `grep -c '^doc-check: '` on `make doc-check` output on a clean tree: Pre-fix on main (after #459 landed) 3 status lines Post-fix this PR (after rebase onto origin) 17 status lines The PR body's "2 → 16" was correct at PR open (before #459 merged). After rebasing onto origin/main, the YAML cross-link gate from #459 adds one more enforced gate (it was placed above the early-exit line specifically to work around the bug this PR fixes, so it ran on the pre-fix baseline too). The "14 gates below that line" framing in the release-notes block is correct for either base: it counts gates that the early-exit was hiding, which is invariant across the rebase. Test plan re-run post-rebase: `make doc-check` exits 0 on clean tree, 17 status lines emitted, no mutation regressions. Refresh the script header docstring to record the empirical count and reflect that doc-check now covers 17 drift patterns, not just the original Go-test-name parity check it shipped with in #13. No functional change to gate logic. Signed-off-by: Tri Lam <tree@lumalabs.ai>

…460) (#466) ## Summary Closes #460. The `exit 0` on `scripts/doc-check.sh` ran unconditionally whenever `docs/FAILURE-MODES.md` carried no `Test*`/`Fuzz*`/`Benchmark*` identifiers (its current state on `main` — `grep -c` = 0), silently bypassing every gate below it. Fix scopes the skip to the Go-test parity block only (if/else, not `exit`), then surfaces and fixes the dead refs the gates were supposed to be catching. ## Root cause Commit a57883f (#13) shipped `doc-check.sh` with one gate — the Go-test name parity check — so `[ -z "$referenced" ] && exit 0` was correct then. PRs #28, #56, #115, #131, #144, #149, #195, #234, #241, #443, #455, #459 (and others) appended gates **below** that line without recognising they'd become dead code whenever `FAILURE-MODES.md` lost its `Test*` references. PR #459 worked around the bug by placing its new YAML gate *above* line 99 and tracked the root cause separately as #460. ## What surfaced Once `exit 0` was removed, three real issues fired: 1. **Dead `.md` link**: `docs/FOLLOWUPS.md` → `followups/otlphttp.md`. The shard was never committed to `main`'s ancestry. Folded into the existing "Shards deleted post-v0.2.0 as fully resolved-via-pivot" prose block (sibling treatment to M9, M14, M16). 2. **Banned-phrase hits** (3x `production-grade`): reworded in `docs/cut-criteria.yaml.md` (2x) and `install/kubernetes/tracecore/README.md` (1x) to falsifiable language. 3. **`docs/getting-started.md` block cap**: 7 fenced bash/sh blocks. The M6 cap of 5 was set for the quickstart only — `## Install via Helm` and `## Air-gapped install` are alternate deployment paths that landed post-M6 and aren't part of the quickstart budget. Rescoped the gate to count blocks inside the `## Walkthrough` H2 section only (1 block, well under cap). ## Gate count Empirically verified via `grep -c '^doc-check: '` on `make doc-check` output on a clean tree: | State | Status lines emitted | Gates the early-exit was hiding | |---|---|---| | Pre-fix on `main` (post-#459) | 3 (trust-posture, YAML cross-link, parity-skip) | 14 | | Post-fix this PR (post-rebase) | 17 | 0 | The "14 gates hidden" number is invariant across the rebase: it counts gates placed below the early-exit line. The "3 → 17" total reflects post-#459 reality on `main`; pre-#459 baseline was "2 → 16" (the figure originally in this PR body), and #459 itself worked around the bug by placing its YAML gate above line 99. ## Mutation tests Each gate below the original early-exit was confirmed to fire post-fix: | Mutation | Gate expected to fire | Exit code post-mutation | Exit code post-restore | |---|---|---|---| | Inject `[bad](nonexistent-ghost.md)` into `docs/FOLLOWUPS.md` | markdown link-rot | 1 | 0 | | Append `blazing-fast` + `rock-solid` to `docs/getting-started.md` | banned-phrase lint | 1 | 0 | | Delete `` from `docs/integrations/datadog.md` | M6 recipe markers | 1 | 0 | ## Test plan - [x] `make doc-check` exits 0 on clean tree (re-run post-rebase onto origin/main; 17 status lines) - [x] 3 mutation tests above each toggle exit 1 → 0 across mutate / restore - [x] Pre-push hooks green: golangci-lint (0 issues), `go vet ./...`, `go mod verify`, `attribute-namespace-check` (100 attrs, all documented), `register-lint`, `actionlint`, `zizmor`, `deprecation-check`, `no-autoupdate-check` - [x] Rebased onto current `origin/main` (includes #459, #461, #462, #456); no conflicts; gate count re-verified empirically post-rebase - [x] No changes to gates above line 99 (the trust-posture callout + YAML cross-link gate from #459 still run and emit unchanged status lines) ## Self-grade **A+** — root cause named in commit body (a57883f #13 with one gate; gates appended below without exit-path awareness); 3 mutation tests (success criteria required 1–2); rescoped the getting-started gate to match M6 intent rather than papering over the surfaced overflow; the `[ -z "$referenced" ]` legitimate skip is preserved via if/else (not `:` no-op, which would have left the `defined=` / `orphans=` block running on empty input); gate count corrected empirically post-rebase per reviewer B feedback. ```release-notes - fix(ci): `scripts/doc-check.sh` no longer exits 0 at the Go-test parity gate when `docs/FAILURE-MODES.md` carries no `Test*` references. 14 gates below that line (link-rot, banned-phrase, M6 recipe markers, etc.) are now actually enforced on every `make doc-check` invocation. Closes #460. ``` --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>

## Summary Wave-end audit flagged the patterndetectorprocessor fanout site as an unmet refactor: `ConsumeLogs` hand-rolled dispatch for every shipped detector (12 today: 7 inline + 5 wrapped), so adding pattern #13 required editing the fanout body — not registering a new entry. Past the rule-of-three by 9x. This PR introduces a minimal Detector registry seam: - `module/pkg/patterns/detector.go`: new `Detector` interface (`PatternID() string`) + `Registered` slice that pins all 12 detector pointers. Each `*Detector` struct gets a one-line `PatternID()` method. - `module/pkg/patterns/detector_test.go`: `TestRegistered_PinsAllPatterns` (exact PatternID set + count), `TestRegistered_UniquePatternIDs`, `TestRegistered_NonEmptyPatternIDs`. Drift gate — accidental drops fail in CI. - `patterndetector.go`: introduces `detectorRunners []detectorRunner` closure list iterated by `ConsumeLogs`. `ConsumeLogs` body drops from ~77 lines to 12. Adding pattern #13 = one append to `Registered` + one append to `detectorRunners`, no fanout-site edit. ### Design decision: metadata-only interface The `Detector` interface is intentionally `PatternID() string` only — not a uniform `Evaluate` method. Each detector's Evaluate signature is intrinsically heterogeneous (different input record shapes — events+nodeConds, ncclRecs, xidRecs+events, etc. — and different verdict types). A uniform Evaluate would force a lossy `any`-typed contract that the typed test suite has been fighting for 12 patterns. The closure-per-detector approach keeps the typed Evaluate calls at their concrete-runner sites while letting the registry pin identity + iteration. ### Behavior preservation - Same telemetry vocabulary: PodEvicted and IBLinkFlap still `IncVerdict` with `string(v.Confidence)` (they gate on partial); the other 5 inline detectors still pass `""`. The 5 wrapped runners still tick inside their own helpers (unchanged). - Same emission order: `detectorRunners` is declared in the legacy emission order. - Same partial-confidence gating: `emitPodEvicted` and `emitIBLinkFlap` preserve the `!emitPartial` skip. ### Test plan - [x] `cd module && go build ./...` clean - [x] `cd module && go test ./pkg/patterns/` green (incl. 3 new pin tests) - [x] `cd module && go test ./processor/patterndetectorprocessor/` green except pre-existing #497 (`TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressure`, fixed in Lane J) - [x] `make lint` clean (0 issues) - [x] `make vet`, `go mod verify`, attribute-namespace-check all green (pre-push hook) ### LoC delta - +321 / -79 across 3 files. - `ConsumeLogs` body: 77 → 12 lines. - Growth is in: registry plumbing (164 lines, mostly comments + the pin tests), runner closures (one per detector). The seam earns its bytes — adding pattern #N is now O(append) instead of O(edit-fanout). ### Closes-the-loop Closes wave-end-audit next-wave item #2 (pattern registry seam). ```release-notes - refactor(patterns): introduce `patterns.Detector` interface + `patterns.Registered` slice. The patterndetectorprocessor now iterates a registry-driven runner list instead of hand-rolled fanout — adding a new pattern is one append, not a processor edit. Behavior-preserving; no operator-facing change. ``` --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr added 30 commits May 13, 2026 16:34

trilamsr added 5 commits May 13, 2026 21:52

trilamsr changed the title ~~[general] M1.6 hardening: lint, coverage, panic recovery, OTel audit~~ [general] M1.6 hardening: lint + coverage + panic recovery + pipelinebuilder May 14, 2026

trilamsr merged commit a57883f into main May 14, 2026
12 checks passed

trilamsr deleted the feat/m1.6-hardening branch May 14, 2026 09:00

trilamsr mentioned this pull request May 14, 2026

[docs] FOLLOWUPS: close squash-merge-trailer item with observation #14

Merged

5 tasks

trilamsr mentioned this pull request May 20, 2026

[chore] comments: trim verbose/PR-ref/banner noise #128

Merged

3 tasks

trilamsr mentioned this pull request Jun 1, 2026

feat(sdc): pattern-13 silent data corruption detector #344

Merged

5 tasks

This was referenced Jun 1, 2026

feat(schema): publish Verdict v1.0-rc1 envelope schema #351

Merged

feat(sdk): publish Python + Go verdict-consumption SDKs #356

Merged

docs(patterns): correlation-window semantics rationale (#367) #388

Closed

trilamsr mentioned this pull request Jun 1, 2026

docs(patterns): correlation-window semantics rationale (#367) #405

Merged

3 tasks

This was referenced Jun 2, 2026

test(replay): Path B corpora for 4 missing detectors (#428) #439

Merged

fix(doc-check): drop unconditional exit 0 hiding 14 gates downstream (#460) #466

Merged

trilamsr mentioned this pull request Jun 4, 2026

refactor(patterns): introduce Detector registry seam #520

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[general] M1.6 hardening: lint + coverage + panic recovery + pipelinebuilder#13

[general] M1.6 hardening: lint + coverage + panic recovery + pipelinebuilder#13
trilamsr merged 43 commits into
mainfrom
feat/m1.6-hardening

trilamsr commented May 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

trilamsr commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

Major surfaces

Quality gates (CI now enforces)

Pipeline runtime + contract

Operator UX

Documentation

Code review closures (3 rounds)

Real bugs found and fixed

Future work — tracked, not lost

Linked issue(s)

Release notes

Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

trilamsr commented May 14, 2026 •

edited

Loading