Skip to content

[general] M1.6 hardening: lint + coverage + panic recovery + pipelinebuilder#13

Merged
trilamsr merged 43 commits into
mainfrom
feat/m1.6-hardening
May 14, 2026
Merged

[general] M1.6 hardening: lint + coverage + panic recovery + pipelinebuilder#13
trilamsr merged 43 commits into
mainfrom
feat/m1.6-hardening

Conversation

@trilamsr

@trilamsr trilamsr commented May 14, 2026

Copy link
Copy Markdown
Contributor

What this PR does

43 atomic commits closing M1.5's self-review compromises, adding pre-M2 hardening, processing three rounds of external code review, and tracking everything we deferred so nothing leaks back into chat history.

Major surfaces

Area Commits
Quality gates wired into make ci 8
Pipeline runtime + contract changes 11
Operator UX 5
Code consolidation (dedup + split) 5
Documentation (RFCs, READMEs, style guides) 8
Code review closures (3 rounds) 6

Quality gates (CI now enforces)

Commit Change
edd54a5 docs/STYLE-errors.md + tree-wide audit + splitName stutter fix
25d7675 Enable 6 stricter linters (wrapcheck, goconst, nilerr, prealloc, bodyclose, forcetypeassert)
048f7b0 make coverage-check (internal/* ≥70%, components/* ≥60%) + runtime Start-after-Shutdown fix
85b6080 Fuzz targets for ParsePipelineID, Load, splitName
34e61a3 Baseline benchmarks for runtime, fanout, first-data
93bd8cc govulncheck in make ci
ceb6b30 gocyclo -over 15 gate; buildSignalPipeline split into helpers
2a1dcae goleak in TestMain + fixed leaked signal-handler goroutine
78c6508 make doc-check (asserts every test name referenced in rot-prone docs exists)

Pipeline runtime + contract

Commit Change
3226a4e pipeline.NewRuntime(pipelines, opts...) replaces Settings struct
420b788 WrapSafeMetrics/Traces/Logs at every cross-stage Consume hop
c574f4d buildXPipeline → 1× buildSignalPipeline[C] (-109 LOC)
ad15390 3× per-signal fanout → 1× genericFanout[T,C] (coverage 70%→100%)
f4966d3 safe/firstdata: shared base + per-signal embedding
c7dc11a Collapse signalOps create+assert closures
a93fc25 sync.OnceValues integration-test binary cache
f39aeda Double-SIGTERM force-exits; handleShutdownSignals extracted + unit-tested
8be84a7 ErrSignalUnsupportedErrSignalNotSupported; multi-instance regression test
5643a4a Host.ReportStatus deprecation + receiver defer/recover example test
37e1dc0 Move pipeline assembly into internal/pipelinebuilder (sibling, avoids cycle)

Operator UX

Commit Change
7f0ae78 tracecore validate subcommand; exit 2 on bad config
2be6ea0 --help examples + exit-code table; tracecore vX.Y (sha=..., go=..., built=..., platform=...)
0ddcdea --version-short for scripting; empty-pipeline boot now WARN-visible
3ffb27d Integration: SIGINT, boot-time signal, validate→collect roundtrip, --version end-to-end
55c7c3c coverage-check slash-guard against malformed profiles

Documentation

Commit Change
edd54a5 docs/STYLE-errors.md (≤80 lines)
5e26f55 Processor + exporter author quickstarts
edc1826 Root README quickstart (copy-paste runnable) + ASCII pipeline diagram
a3b7a66 RFC-0003 §"M2 TelemetrySettings extension preview"
622f2e8 RFC-0004 Path A confirmed (Factory + NewFactory() bridge)
e876810 Pitfall: receiver goroutine panic recovery recipe
58621fc docs/FAILURE-MODES.md inventory (handled / bounded / not-handled, every row linked to a test)
6359fd7 docs/STRATEGY.md Resolution column + registry-of-registries write-down

Code review closures (3 rounds)

Commit Round
189f85d v1 polish: default arm, empty-profile guard, wrapcheck doc note
4928393 Weak items: nps.md banner, plugin-prevention hard rule, Makefile coverage comment
72b10a1 v2: 14 strong-agree items closed
a663545 Trim superfluous prose + roadmap-track future work
78c6508 v3: SignalDuringBoot signaled flag, validate ctx-comment, doc-check

Real bugs found and fixed

  • Runtime concurrency bug (048f7b0): Start racing with Shutdown would launch components nobody would tear down. Coverage-instrumentation timing surfaced it.
  • Leaked signal-handler goroutine (2a1dcae): handleShutdownSignals blocked on <-sigCh forever after run() returned. goleak surfaced it.
  • splitName stutter (edd54a5): operator's bad input appeared twice in error messages.

Future work — tracked, not lost

  • MILESTONES.md — 5 milestone-tracked "Carry-forward from M1.6:" bullets (M3 cross-platform CI, M4b chaos-light, M5 benchstat + sustained-load, M6 finalize FAILURE-MODES + nps + backend recipes + doc-check, M8 receiver ergo review + network failures).
  • docs/FOLLOWUPS.md — 8 open items without fixed milestones (opportunistic + trigger-based) + 3 explicitly considered-and-skipped items with reasoning (DRY wrap helper, signalConsumer generic constraint, inode-based cache check). Pinned "Next up": make doc-check ✓ already shipped this round; squash-merge Assisted-by: verification (observable only post-merge).

Linked issue(s)

Refs internal M1 milestone (post-RFC-0003 / RFC-0004 hardening). N/A — no GitHub issue.

Release notes

[ENHANCEMENT] M1.6 hardening: stricter lint + per-package coverage gate + govulncheck + goleak + doc-check in CI; pipeline runtime variadic-Options configurable, recovers panics in Consume* hot path, double-SIGTERM force-exits; new `tracecore validate` subcommand and `--version-short` for scripting; pipeline assembly extracted to `internal/pipelinebuilder`; OTel v0.152.0 divergence table refreshed with Resolution targets; 3 real bugs fixed (Start/Shutdown race, signal-handler leak, error-message stutter); future work tracked in docs/FOLLOWUPS.md and MILESTONES.md "Carry-forward from M1.6" bullets.

Checklist

  • Tests added or updated
  • make ci passes locally
  • Commits are signed off (git commit -s)
  • Commits cryptographically signed (SSH-based)
  • N/A — no new components in this PR (clockreceiver/stdoutexporter shipped in M1.5)

trilamsr added 30 commits May 13, 2026 16:34
Adds docs/STYLE-errors.md codifying seven rules for tracecore error
messages: operator-actionable wording, no stuttering, %w-wrapped
chains, no apologetics, %q-quoted operator input, sparing sentinels,
typed errors only when callers need fields. Includes a deferral note
on error codes (additive Coded interface seam at 1.0; do not implement
now).

Audit of every fmt.Errorf and errors.New in production code surfaced
one rule-2 stutter: splitName included "component name %q" while every
caller wraps with "pipeline X: <role> %q: %w", quoting the operator's
input twice. Fixed by making splitName's error self-contained.

RED: TestBuildPipelines_MultiSlashName_NoStutter pinned the stutter
(strings.Count == 2 for the bad name).
GREEN: same test passes (count == 1) after splitName no longer
embeds the input string.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds bodyclose, forcetypeassert, goconst, nilerr, prealloc, wrapcheck
to .golangci.yml (gocritic, unconvert, errorlint were already on).

wrapcheck was the noisy one — 12 initial violations, all internal
pass-through forwarders (Consume* chains, ComponentState lifecycle)
or stdlib calls that genuinely cannot fail (bytes.Buffer.Write to
in-memory buffer, context.Err()). Wrapping those would create the
stutter STYLE-errors rule 2 forbids. Configured wrapcheck to ignore
same-module packages and the relevant interfaces. Config schema
verified via `golangci-lint config verify`.

The schema for golangci-lint v2 wrapcheck uses kebab-case keys
(ignore-sigs, ignore-package-globs, ignore-interface-regexps), not
the camelCase shown in upstream wrapcheck README.

RED: `make lint` after enabling — 12 wrapcheck violations.
GREEN: after wrapcheck tuning, `make ci` reports 0 issues.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Phase 3 of M1.6: coverage threshold gate, scoped to the policy STYLE.md
sets — internal/* ≥70%, components/* ≥60%, cmd/ informational, tools/
not policed.

New surface:
- `make coverage` runs go test with -coverprofile + -covermode=atomic
  against ./cmd/... ./components/... ./internal/... (tools/ excluded;
  it has its own tests but no coverage gate).
- `make coverage-check` parses coverage.out via the small Go program
  in tools/coverage-check and fails the build if any policed package
  falls below threshold.
- Wired into `make ci` between `test` and `build`.
- Baselines recorded in docs/research/baselines.md; internal/fanout
  is at 70.4% (0.4% margin), explicitly noted as the package to watch
  during Phase 6's fan-out refactor.

Runtime bug surfaced by coverage instrumentation:
TestRuntime_ConcurrentStartShutdown_NoLostComponents fails under
coverage timing — if Shutdown wins lifecycleMu first, it observes
empty started* slices, releases the lock, and Start then runs and
launches Components nobody will tear down. Fixed by setting a
`shutdown` flag in Shutdown (under lifecycleMu) and returning nil
from Start when shutdown is already true. Concurrent boot + signal
becomes a clean no-op rather than a race-dependent leak.

RED: TestRuntime_ConcurrentStartShutdown_NoLostComponents fails
  3/3 runs under `go test -race -cover`.
GREEN: same test passes 5/5 after the runtime fix; `make ci` clean
  with all 9 policed packages above threshold.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Replaces pipeline.NewRuntime(Settings, []Pipeline) with the variadic
form pipeline.NewRuntime([]Pipeline, ...Option). The Settings struct
becomes the unexported settings type; configuration flows through
WithLogger, WithDrainBudget, WithHost.

Why: M2 will add MeterProvider and BuildInfo fields. With a struct
literal, that's a breaking change for every caller. With variadic
options, M2 ships pipeline.WithMeterProvider(...) additively — every
existing call site keeps working unchanged.

API:
  rt := pipeline.NewRuntime(pipelines,
      pipeline.WithLogger(logger),
      pipeline.WithDrainBudget(5*time.Second),
  )

Options apply in order; later wins. New test
TestRuntime_VariadicOptions_ApplyInOrder pins this so M2 can layer
build-default + caller-override without proliferating constructors.

Touched files:
- internal/pipeline/runtime.go         — Option type + 3 With* constructors
- internal/pipeline/runtime_test.go    — 17 call sites migrated + new test
- internal/pipeline/firstdata_test.go  — 1 call site migrated
- internal/pipeline/contract_test.go   — 1 call site migrated
- cmd/tracecore/main.go                — production call site migrated
- internal/pipeline/pipelinetest/doc.go — drop stale NewRuntime reference
- docs/rfcs/0003-...                   — implementation-notes entry

RED: TestRuntime_VariadicOptions_ApplyInOrder fails to compile
  (Option / WithLogger / WithDrainBudget undefined).
GREEN: same test passes; `make ci` clean; all 9 policed packages
  above coverage threshold; internal/pipeline coverage held at 85%.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Replaces buildMetricsPipeline / buildTracesPipeline / buildLogsPipeline
(three near-identical ~95-line functions) with a single generic
buildSignalPipeline[C any] + three small per-signal signalOps[C] values.

The generic algorithm — exporters → fanout → reversed processors →
first-data wrap → receivers — now lives in exactly one place. The
signal-specific glue (which Create* method, fanout constructor,
first-data wrapper, consumer-interface assertion) is the only thing
per-signal, captured in metricsOps / tracesOps / logsOps.

Net: -212 / +103 in cmd/tracecore/main.go (109 lines removed). The
characterization test TestBuildPipelines_MultiSignal_ErrorsCiteCorrectSignal
pins that each signal path still surfaces ErrSignalUnsupported with
its own pipeline ID in the error — guarding against the worst kind
of refactor regression (a logs error routed through the traces
builder).

ctx is the first parameter of buildSignalPipeline (revive's
context-as-argument rule); ops is second, matching how the dispatcher
reads at the call site.

RED: characterization test passes against the three-function code.
GREEN: same test passes after consolidation; `make ci` clean; all 9
  policed packages above coverage thresholds; cmd/tracecore coverage
  rose 39.8% → 57.2% (informational; same tests exercise more of the
  smaller surface).

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three near-identical files (metrics.go / traces.go / logs.go,
87+71+71 LOC) replaced with one genericFanout[T any, C any] in a new
fanout.go plus three thin per-signal wrappers (~46 LOC each). The
cloning algorithm lives once; per-signal files now hold only the
pdata bridge funcs (new/copyTo/isReadOnly/markReadOnly), the
ConsumeX method that forwards to consumeAll, and the public
NewMetrics/NewTraces/NewLogs constructor.

Shared cloning-contract test table (TestFanoutContract_{Metrics,
Traces,Logs}) exercises 6 scenarios × 3 signals = 18 sub-tests
through one generic runCloningContract helper. Lifts traces+logs
coverage to parity with metrics — fanout coverage rose 70.4% → 100%.

The pdata payload types (pmetric.Metrics, ptrace.Traces, plog.Logs)
don't share a Go interface for CopyTo/IsReadOnly/MarkReadOnly in
upstream pdata, so the generic uses a pdataOps[T] callback struct
to bridge them. Each callback is one line.

RED: TestFanoutContract_{Metrics,Traces,Logs} pass against the
  pre-refactor code (characterization).
GREEN: same tests + all pre-existing per-signal tests pass after
  the generic consolidation; `make ci` clean.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
buildBinary used to call `go build` for every test invocation. With
the integration test plus the new TestBuildBinary_CachesAcrossCalls
plus `go test -count=N` re-runs, that was a ~2s wait per test.

Replaces the per-test build with a package-level sync.OnceValues so
the binary is built at most once per `go test` process. The binary
lives under os.MkdirTemp so it survives across tests; TestMain cleans
the directory after m.Run() returns.

Measured (go test -count=2):
  TestBuildBinary_CachesAcrossCalls: 0.21s first, 0.00s second
  TestIntegration_ClockreceiverToStdoutexporter: 0.53s first, 0.33s second
Total wall-clock: 2.485s; previously each invocation paid the build.

RED: TestBuildBinary_CachesAcrossCalls fails on the pre-cache code
  (each call to buildBinary produces a new t.TempDir path, so
  require.Equal on paths fails).
GREEN: same test passes after the sync.OnceValues cache; the cached
  binary's mtime is stable across calls, proving no rebuild.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds three Go-native fuzz targets covering the operator-input
surfaces — the only places tracecore parses untrusted-ish strings:

- internal/config/fuzz_test.go::FuzzLoad — writes the fuzz bytes to
  a t.TempDir file and pushes through config.Load. 5 seeds: empty,
  valid, malformed, multi-doc, binary.
- internal/config/fuzz_test.go::FuzzParsePipelineID — 6 seeds:
  metrics, metrics/primary, traces/, //bad, badsignal/x, empty.
- cmd/tracecore/fuzz_test.go::FuzzSplitName — 6 seeds covering the
  YAML component-reference name parser.

Property under test: no panic, regardless of input. Each fuzzer ran
clean for 3 seconds of mutation (370k/265k/508 execs respectively
across 14 workers) with zero crashes. Interesting corpora live in
the user's go-build cache, not the repo tree.

RED-then-GREEN: each target's seed corpus IS the RED — running it
the first time would fail loudly on any panic in the parser. All
6+6+5 seeds pass, demonstrating the panic-avoidance contract
holds across the curated inputs.

Future loops can drive these targets with `go test -fuzz=Fuzz...
-fuzztime=...` for longer mutation runs.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Phase 9 decision: keep the dual-export pattern. Each component ships:
  var Factory pipeline.<X>Factory = &factory{}
  func NewFactory() pipeline.<X>Factory { return Factory }

Path A (keep both, document) over Path B (extend manifest with a
factoryVar flag, drop the bridge). Reasoning recorded in
docs/rfcs/0004-clockreceiver-stdoutexporter.md "Implementation notes":

1. OTel components export NewFactory() — STRATEGY.md's load-bearing
   principle says mirror OTel unless we have a written reason to
   diverge. We don't.
2. The bridge is 3 lines per component. The generator complexity to
   support both paren-and-no-paren shapes would be more code than the
   bridge eliminates.
3. Receivers get both ergonomics: Factory for in-tree no-allocation
   reference, NewFactory() for codegen / future plugin surfaces.

Touched:
- docs/rfcs/0004-...: new "Implementation notes" section, sub-section
  "NewFactory() bridge to package-var Factory — Path A confirmed".
- tools/components-gen/main.go: doc comment on the FactoryFunc field
  points at the RFC entry so future contributors see the rationale
  before they propose Path B again.

No code-behaviour change; the decision is durable. M8+ contributors
who copy clockreceiver's shape inherit the pattern.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
A panicking processor or exporter used to crash the binary mid-flight.
M2 will hot-path components for self-metric counters; a buggy counter
increment must not bring down the collector.

New `internal/pipeline/saferun.go` exposes three wrap functions —
WrapSafeMetrics / WrapSafeTraces / WrapSafeLogs — that wrap a
consumer so a panic inside its Consume* is recovered, logged at
Error with the component ID, and surfaced as an ordinary error to
the caller. Capabilities() is a transparent pass-through so the
fan-out cloning decisions stay correct. The wrapper is process-
stable: a panicking call does not poison subsequent pushes.

Wired in at the pipeline-assembly seam in cmd/tracecore.
buildSignalPipeline: after every processor and exporter is created,
the consumer-view is wrapped before being used as the upstream's
`next` (or before being fed to the fanout). One signalOps callback
(wrapSafe) per signal, mirroring how wrapFirstData is plumbed.

Receivers don't get wrapped — they push into next, they don't get
pushed into. The same goes for the fanout: it is tracecore code,
not third-party, and we don't recover from our own bugs.

Quarantine semantics (suspend a component that panics repeatedly)
deliberately deferred to the M2 self-telemetry RFC. M1.6 ships the
cheap behaviour: every push to a panicking component fails its
own push but the next tick gets retried.

RED: TestWrapSafe_PanicsRecovered_PerSignal fails to compile against
  pre-Phase-10 code (WrapSafe{Metrics,Traces,Logs} undefined).
GREEN: same tests pass; `make ci` clean; internal/pipeline coverage
  rose 85.2% → 86.3% on the new saferun.go's 100%-covered surface.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds three *_bench_test.go files covering the most-trafficked
hot-path code paths in tracecore:

  internal/pipeline/bench_test.go
    BenchmarkRuntime_StartShutdown          (N=10, N=100 components)
    BenchmarkWrapFirstDataMetrics_ConsumeMetrics
  internal/fanout/bench_test.go
    BenchmarkFanout_NewMetrics              (5 readonly+mutable mixes)
  internal/config/bench_test.go
    BenchmarkParsePipelineID

Numbers recorded in docs/research/baselines.md (Apple M4 Pro,
darwin/arm64, Go 1.26.2). Highlights:

  WrapFirstDataMetrics after firstdata fires: 3.5 ns/op, 0 allocs
  Fanout fast path (single readonly): 1.1 ns/op, 0 allocs
  ParsePipelineID: 84 ns/op, 0 allocs
  Runtime_StartShutdown sub-linear (10→100 = 2.4× not 10×) thanks
    to parallel Phase 1

The harness is the deliverable; M1.6 doesn't gate on these numbers.
M5 will introduce regression gating per MILESTONES.md. Future
phases that touch these code paths can compare quantitatively.

Run via:
  go test -bench=. -benchmem -count=1 ./internal/...

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds govulncheck between coverage-check and build in the ci recipe.
The standalone `make govulncheck` target was already present; this
phase makes it a hard gate so a CVE in a called code path fails the
build instead of waiting for an out-of-band scan.

Current state at gate-on: 0 called vulnerabilities. govulncheck's
symbol-level analysis flags 2 indirect-but-uncalled findings in
imported packages and 6 in transitively-required modules — these
don't fail the build because our code doesn't reach them. If a
future bump pulls a CVE into a called path, ci will fail loudly.

RED: the ci target doesn't include govulncheck (existing state).
GREEN: `make ci` now runs govulncheck after coverage-check and
  before build; exit 0 with current dep set; would exit 1 on any
  reachable vulnerability.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
The first SIGINT/SIGTERM still triggers graceful shutdown via ctx
cancellation. The second one fires os.Exit(exitFailure) so the
operator can break out when graceful shutdown is taking longer than
they expect. Previously the second signal was silently absorbed by
the original NotifyContext handler.

Replaces signal.NotifyContext with a direct signal.Notify on a
buffered channel (cap=2) so two signals delivered back-to-back are
both captured. The two-stage loop lives in handleShutdownSignals,
extracted so it's unit-testable via a controlled channel — the
integration test for the second-signal path can't deterministically
race a fast-shutdown pipeline (clockreceiver+stdoutexporter complete
shutdown in <1ms, beating the force-exit goroutine).

Tests:
- TestHandleShutdownSignals_TwoSignals_FiresForceExit pins the
  contract directly: feed two signals into the channel, assert
  cancel() and forceExit() are both invoked in order.
- TestHandleShutdownSignals_ChannelClosed_ReturnsCleanly pins the
  shutdown path used when main exits naturally.
- TestIntegration_DoubleSIGTERM_DoesNotHang verifies end-to-end
  that the second SIGTERM doesn't hang the process within 500ms;
  the exit code is intentionally not asserted because fast
  shutdowns may legitimately win the race.

`--help` text now documents the two-signal protocol.

RED: TestHandleShutdownSignals_TwoSignals_FiresForceExit doesn't
  compile against pre-Phase-13 code (handleShutdownSignals
  doesn't exist).
GREEN: same test passes; `make ci` clean.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
New \`tracecore validate --config=<path>\` subcommand: loads the
config, builds pipelines, prints a terse summary, exits 0. On any
config error returns exit 2 (sysexits.h EX_DATAERR) — same code
\`collect\` returns, so operators can use validate as a pre-flight
dry run with the same shell-script semantics.

Summary format (stderr):

  config valid: N pipeline(s)
    metrics/primary: receivers=1 processors=0 exporters=1
    ...

Tests cover the three contract paths:
- good config → exit 0 + summary names every pipeline
- bad config (unknown YAML field) → exit 2 + error names field
- missing --config flag → exit 64 (usage error, kingpin path)

RED: TestRun_Validate_GoodConfig_ExitsOK reports
  "unexpected validate" (kingpin doesn't know the subcommand).
GREEN: same test passes; \`make ci\` clean.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds gocyclo to .golangci.yml with min-complexity=15 so every PR is
auto-audited. Initial run flagged one function:
buildSignalPipeline at 18.

Refactored buildSignalPipeline into a small orchestrator plus three
helpers — buildExporters, buildProcessors, buildReceivers — each in
the 7-9 range. The orchestrator reads top-to-bottom as the pipeline
assembly's natural shape:

  exporters → fanout → processors (reversed) → first-data → receivers

No behaviour change; existing tests pass unchanged. cmd/tracecore
test coverage held at 57.7% (informational; this section of the
file was already well-covered by the buildPipelines tests).

Audit baseline recorded in docs/research/baselines.md.

RED: `make lint` reports cyclomatic complexity 18 on
  buildSignalPipeline (> 15).
GREEN: all functions <= 15 after the helper split; `make ci` clean.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
\`tracecore --help\` now includes a Signals section (already present),
plus new Exit codes and Examples sections so operators can read the
contract end-to-end without leaving the terminal. Every flag's Help()
text gained a one-line description + an example value where useful
(--log.level, --log.format, --config, --shutdown.drain-budget).

\`tracecore --version\` switched to a parseable k=v form:

  tracecore v0.0.0-dev (sha=abc1234, go=go1.26.2, built=2026-..., platform=darwin/arm64)

Missing build metadata (no ldflags, no VCS info) drops the k=v pair
silently, so \`go run\` still emits a sensible line.

Tests:
- TestRun_HelpOutput_HasExamplesAndExitCodes pins the help text
  contract (Signals / Exit codes / Examples / runnable example
  invocations).
- TestRun_VersionOutput_HasStablePrefix pins the stable
  "tracecore v" prefix and the "go=" k=v entry.
- TestBuild_StringFormat updated for the new format.

To capture kingpin's --help / --version output in tests, run()
now installs:
- app.UsageWriter(stderr) / ErrorWriter(stderr) so output goes to
  the caller's writer instead of os.Stdout.
- app.Terminate(...) that captures the exit code rather than
  calling os.Exit directly; run() returns the captured code so the
  operator-visible exit semantics are preserved.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Mirrors "Writing a receiver" with two new sections in
internal/pipeline/README.md:

  ## Writing a processor   (≤180 words; pointer to fakeProcessor in
                            contract_test.go:82 as the canonical
                            referent until M8+ ships one)
  ## Writing an exporter   (≤200 words; pointer to stdoutexporter
                            with explicit file:line citations)

Each section follows the same template: shape, factory pattern,
capabilities, pitfalls (each pitfall tied to a specific file:line
when an in-tree example exists). M8+ contributors writing the second
or third component in each category get a copy-able shape without
reverse-engineering the existing one.

Pitfalls codify lessons that surfaced during M1.5 review:
- processor: don't capture `next` at config time, no panic guards
  (handled by saferun), no manual MarkReadOnly
- exporter: synchronize shared writers, no blocking Shutdown, bounded
  retries against ctx

No code changes; `make ci` still clean.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Root README:
- Status now reflects M1.6 shipped state (was "not yet usable").
- New Quickstart section: build, write demo.yaml via heredoc,
  validate via the `tracecore validate` subcommand, then `collect`.
  Every line is copy-paste-runnable; verified end-to-end against
  the freshly-built binary.
- Documentation list now links to docs/STRATEGY.md,
  docs/STYLE-errors.md, and internal/pipeline/README.md so a new
  reader can navigate from the front door to the runtime quickstart
  without guessing.

internal/pipeline/README.md:
- New "Architecture at a glance" section with an ASCII diagram of
  one pipeline: receivers (parallel) → WrapFirstData → processors
  (data-flow order, built last-first) → fanout → exporters
  (serial-LIFO drain). Each seam in the diagram links to the file
  that implements it.

No code changes.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
The original RFC §TelemetrySettings noted MeterProvider was "deferred
until M2". Phase 19 makes that preview concrete so M2 contributors
land on a clean, pre-thought shape:

  type TelemetrySettings struct {
      Logger        *slog.Logger
      Resource      pcommon.Resource
      MeterProvider metric.MeterProvider  // M2 — components emit self-metrics
      BuildInfo     version.Build          // M2 — attribution on emitted data
      _             struct{}               // unkeyed-literal guard
  }

The note lives in the Implementation notes section alongside the
existing Phase 4 (variadic Options) entry, so the connection is
explicit: Options are additive, the unkeyed-literal guard catches
forgotten threading, and M2 ships `pipeline.WithMeterProvider(...)` /
`pipeline.WithBuildInfo(...)` without breaking call sites.

Doc-only. No code change; `make ci` clean.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
The three pairs of per-signal wrappers (safe{Metrics,Traces,Logs} and
firstData{Metrics,Traces,Logs}) repeated the same boilerplate: hold
\`next C\`, pass through Capabilities(), and forward Consume<Sig> with
some additional work. Phase 6 had already de-duplicated this exact
shape in internal/fanout via a generic + per-signal callback. This
commit applies the same pattern to saferun.go and firstdata.go.

  internal/pipeline/saferun.go
    safeBase[T, C capable] with safeCall(ctx, payload) running invoke
    under defer/recover. safeMetrics/Traces/Logs embed it and define
    only their public Consume<Sig> method.

  internal/pipeline/firstdata.go
    firstDataBase[T, C capable] with fireAndForward(ctx, payload)
    running invoke after the once-only "pipeline first data" log.
    firstDataMetrics/Traces/Logs embed it and define only their
    public Consume<Sig> method.

  capable is the local constraint interface — anything with
  Capabilities() consumer.Capabilities, which is every consumer.X
  interface in this codebase.

Behaviour preserved: existing TestWrapSafe_* + TestFirstData* tests
pass unchanged. Error / log message text aligned slightly — the panic
log now uses key-value attrs ("component", "op", "panic") instead of
folding op into the message string, but the error returned to the
caller keeps the "component %s: panic in %s: %v" shape.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Before: each signalOps had 8 closures including createX (3) +
asXConsumer (2) + a consumerName string used solely to format the
"does not implement <interface>" error.

After: createExporter and createProcessor each return
(component, consumer.C, error) and own the type assertion inline.
The asExporterConsumer / asProcessorConsumer / consumerName fields
are gone. The builder-bug error ("does not implement consumer.X")
is now a self-contained string the per-signal closure produces; the
buildExporters / buildProcessors call sites wrap it with the usual
"pipeline X: <role> Y: %w" prefix.

Net: ~30 lines removed from cmd/tracecore/main.go; one closure form
per role across signals instead of two-pass create-then-assert.
Behaviour preserved — the existing
TestBuildPipelines_MultiSignal_ErrorsCiteCorrectSignal,
TestBuildPipelines_UndeclaredComponentRef, and other build-side
tests pass unchanged.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
parseProfile previously did `pkg := file[:strings.LastIndex(file,
"/")]` — if the line's file field had no slash, LastIndex returns
-1 and the slice expression panics with index-out-of-range. Go
coverage profiles always include a package prefix in practice, but
the slice was a latent crash on malformed input.

Now: capture the index, skip the line if it's negative. One extra
branch; no behaviour change on well-formed profiles.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three test-first improvements driven by post-M1.6 self-review:

1. tools/coverage-check now has unit tests (previously zero):
   - happy-path parsing (multi-package aggregation)
   - empty profile (mode-only header)
   - missing mode header is rejected
   - non-existent profile path returns error, not panic
   - slashless file path is skipped (pins the M1.6 slash guard)
   - thresholdFor table covers internal/components/other prefixes

2. \`tracecore --version-short\` emits just the bare version string for
   script consumption (\`VER=$(tracecore --version-short)\`). Implemented
   as an argv scan that short-circuits before kingpin runs subcommand
   validation, so it works even though collect's --config is Required.
   Test: TestRun_VersionShort_BareVersionForScripting.

3. Empty-pipeline boot now logs "no pipelines configured" at WARN
   (was INFO). Operators running with default log-level filters
   (warn+) would have missed an INFO line in a typo'd-config scenario
   that looks indistinguishable from a successful run. Test:
   TestRunCollect_EmptyConfig_NoPipelinesIsWarn.

RED-first on each: the version-short and warn-level tests fail
against pre-fix code; the coverage-check tests document existing
behaviour and pass against the M1.6-era slash guard.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
The single 720-line cmd/tracecore/main.go grew through M1.6 to hold
five distinct concerns: CLI wiring, signal handling, subcommand
implementations (collect, validate), pipeline assembly, and the
signal-generic ops vars. Reading any one required scrolling past
the others.

Split (preserving package main; all symbols remain unexported and
inter-callable):

  main.go        165  CLI entry, kingpin wiring, exit constants,
                      helpDescription, newLogger
  signals.go      25  handleShutdownSignals
  collect.go      72  runCollect
  validate.go     41  runValidate
  build.go       334  buildPipelines, buildSignalPipeline, the three
                      build<Role> helpers, splitName, componentSet,
                      resolveComponent, hasFactories, hasOperatorIntent
  signalops.go   134  signalOps struct + metricsOps/tracesOps/logsOps

No behaviour change. \`make ci\` clean; all tests pass; internal/
pipeline coverage rose to 89.5%, version package to 90.5% due to
cached test paths re-running across the freshly-built binary.

Reviewability win: an M8 receiver author looking at how pipelines
are assembled opens build.go. An operator-facing CLI change touches
main.go. The signal-generic plumbing has its own file. Each file is
under 350 lines.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds goleak coverage to two packages where lifecycle goroutines live:
- internal/pipeline: new TestMain wrapping goleak.VerifyTestMain.
- cmd/tracecore: existing TestMain now calls goleak.Find() after
  m.Run() returns (and after binary-cache cleanup), failing the
  test run on any leftover goroutine.

The check immediately caught a real leak: run()'s sigCh-reading
goroutine (handleShutdownSignals) blocks on <-sigCh and never wakes
when run() returns, because we only signal.Stop() the channel — we
never close() it. In production this is moot (process exits, all
goroutines die); in tests, every TestRun_* invocation accumulated a
leaked goroutine.

Fix: close(sigCh) AFTER signal.Stop() in the run() defer. Stop
prevents the runtime from sending to the channel before we close,
so we don't panic on send-to-closed-channel. handleShutdownSignals
already handles the closed-channel case (its <-sigCh, !ok branch).

This was a knowledge-gap closure from the post-M1 self-assessment:
"go test -race detects races, not leaks; we have no goleak coverage."
Adding it surfaced the leak the next iteration after we noticed.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds TestMetrics_NestedPayload_DeepCloneIsolation, a realistic-shape
payload (2 ResourceMetrics × 2 ScopeMetrics × 1 Gauge × 2 DataPoints
with attribute maps) pushed through a mutating + readonly fanout
pair. Asserts that a mutator adding a key to its data-point attribute
maps does NOT leak into the readonly consumer's clone — verifying
pdata's CopyTo semantics at every nesting level (RM, SM, Metric,
Gauge, DataPoint, pcommon.Map).

This closes the post-M1 knowledge gap "we test single-resource
payloads; real receivers push deeply nested pdata and we never
verified clone isolation at the attribute-map level." Result:
isolation holds; the existing fanout cloning algorithm is correct
for production-shaped payloads.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two doc-only gap closures from the post-M1 self-assessment:

1. \`tracecore validate --help\` now says the output is human-
   readable and may change between releases. Operators tempted to
   parse "config valid: N pipeline(s)" for scripting see the warning
   directly. A future --format=json would be the stable surface.

2. internal/pipeline/README.md's receiver-author Pitfalls list now
   covers goroutine-internal panic recovery: saferun catches panics
   inside ConsumeMetrics/Traces/Logs, but a panic in the receiver's
   own goroutine (before reaching next.Consume*) is NOT caught and
   crashes the process. Includes the defer/recover snippet receivers
   should copy.

No code change. \`make ci\` clean.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three closures from the OTel v0.152.0 divergence audit run as
background research alongside the gap-closing loop:

1. **Rename** pipeline.ErrSignalUnsupported → pipeline.ErrSignalNotSupported
   to match OTel's exact spelling. Pre-1.0 breaking change with low
   blast radius (8 in-tree call sites). The error message text is
   unchanged ("signal not supported by this component"), so operator
   logs don't drift.

2. **Multi-instance regression test**
   (TestBuildPipelines_MultiInstance_SameTypeTwoInstances) — pins
   that two instances of the same factory (clockreceiver/fast +
   clockreceiver/slow) coexist with independent IDs and distinct
   goroutines. Smoke-verified live; locking with this test.

3. **STRATEGY.md divergence table** expanded with six newly-
   identified drifts:
   - Host.ReportStatus location (method vs componentstatus free fn,
     align in M2)
   - CreateSettings shape (missing BuildInfo + unkeyed-init guard,
     M2 work)
   - TelemetrySettings shape (missing TracerProvider/MeterProvider,
     M2 work)
   - Factory interface decomposition (intentional flattening; OTel
     adds XStability() we defer to post-v1.0)
   - ErrSignalNotSupported naming (now aligned)
   - Consumer-seam panic recovery (intentional tracecore addition;
     OTel has none — documented as deliberate divergence)
   - "pipeline first data" log (intentional tracecore UX addition)

   Also adds an anti-list of "we're identical here" findings so
   future contributors don't re-audit the agreed points.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two follow-ups from the OTel-audit gap closure:

1. Doc-comment Host.ReportStatus with a Deprecated note pointing at
   the M2 migration to OTel's componentstatus.ReportStatus(host, ev)
   free-function pattern. Receivers can call it today; the rename
   will be a one-line edit per call site once M2 lands. No code
   change yet — premature churn pre-1.0 with no actual callers.

2. TestReceiver_GoroutineDeferRecover_KeepsProcessAlive — exercises
   the documented recipe from internal/pipeline/README.md's Pitfalls
   list. A receiver whose internal goroutine defers + recovers its
   own panic must (a) not crash the process and (b) let the runtime
   shut down cleanly. Without the test, the recipe was unverified;
   with it, the README has a tested example to point at.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three non-blocking suggestions from PR #13 review:

1. buildSignalPipeline switch gains a default arm (build.go) so a
   fourth Signal value added to internal/pipeline can't silently
   produce zero pipelines. Currently unreachable — ParsePipelineID
   rejects unknown signals — defense in depth.

2. coverage-check fails loudly when the profile has zero measurable
   packages (tools/coverage-check/main.go). Prevents the gate from
   silently becoming a no-op if `make coverage`'s ./... glob gets
   narrowed in a future tooling change.

3. STYLE-errors.md rule 2 now has a callout explaining why
   wrapcheck's ignore-package-globs excludes the entire module
   (in-module errors share context with callers; wrapping stutters).
   Reminds future contributors that wrapcheck still polices stdlib
   / third-party returns.

The fourth suggestion (constructor helpers for safeBase/firstDataBase)
deferred: current field-assignment pattern works and a fourth
wrapper family isn't imminent.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added 5 commits May 13, 2026 21:52
Self-review surfaced two more deferrals:

- FAILURE-MODES.md references test names by string; rename rot is
  invisible until a reader follows a dead link. A make doc-check
  step that asserts existence would close this. ~30 minutes when
  next touching the Makefile.

- The pipelinebuilder package's test surface is split (cmd/tracecore
  integration coverage + pipelinebuilder's own processor-path test).
  A one-paragraph doc.go answering "where do I test buildPipelines?"
  removes the ambiguity for the next contributor.

Both belong in FOLLOWUPS.md so they survive past chat.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three weak items from the fresh-eyes self-review that were cheap
enough to close in-PR:

1. nps.md gets a "stub, pre-pilot" banner up top. The doc was
   over-prescriptive — committing to specific NPS methodology before
   talking to a single operator is shaping-the-question, not
   asking-the-question. The banner makes it clear nothing is binding
   until pilots run.

2. STRATEGY.md "Each component package owns its own Factory var"
   section gets a hard rule: PRs introducing centralized Register
   APIs / global factory maps / plugin loaders MUST be rejected
   without an accepted RFC, burden of proof on the proposer. Was
   hopeful-policy; now a concrete gate.

3. Makefile coverage target's tools/ exclusion gets an explicit
   comment. Was a silent behavior; now an in-line "if a future tool
   needs coverage tracking, add it to both lists" note.

Plus one item tracked in FOLLOWUPS.md (not fixed in this PR):
- Verify squash-merge preserves Assisted-by trailers after the PR
  actually lands on main. The disclosure-discipline gap is real but
  only observable post-merge.

Items still NOT closed by this PR (intentional, tracked):
- FAILURE-MODES.md doc-rot guard (make doc-check) — FOLLOWUPS
- pipelinebuilder doc.go test-surface note — FOLLOWUPS
- Cross-platform CI — FOLLOWUPS (M3)
- Backend integration recipes — FOLLOWUPS (M14-16)
- PR size itself — structural, can't fix mid-PR

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Strong-agree items from the v2 self-review, batched:

builder.go:
- Reframe "this build" error → "no component factories registered"
  (library-shaped phrasing in an exported function)
- Reframe "unsupported signal %v (this build has no assembly path)"
  → "no assembly path registered for signal %v"
- Nil default-config guard in resolveComponent (defense in depth,
  matches the default-arm philosophy)
- Drop dangling "(UX #4)" reference; explain the criterion inline

pipelinebuilder/doc.go (new):
- Explains why the package is a sibling, not nested under
  internal/pipeline (cycle)
- Documents the test-split convention so contributors aren't
  ambiguous about "where do I test buildPipelines?"

builder_test.go:
- mustYAML panics on empty input rather than silently returning a
  zero yaml.Node

coverage-check:
- Dedup key built from regex captures (file + range explicitly)
  instead of fragile raw-line slicing. The regex now has 4 capture
  groups so a future format tweak surfaces as a regex mismatch.

integration_test.go:
- SignalDuringBoot: replace the 5ms-sleep race with a 500ms retry
  loop that signals until accepted (or the deadline expires).
  Resolves the Linux-CI flake risk flagged in the review.

validate.go:
- Rename `stderr` param to `out` (the function doesn't care which
  writer it is)
- ctx.Err() check between Load and BuildPipelines so a SIGTERM
  during a slow validate bails cleanly

FAILURE-MODES.md:
- TestLoad_MultiDocYAML_Rejected → TestLoad_MultiDocumentYAML_Rejected
  (test name was wrong)
- "Bad YAML" row now references a specific test, not just the file
- "SIGINT during boot" now references TestIntegration_SignalDuringBoot
- "Phase-1 timeout" row: no dedicated test exists; row clarified to
  reference the indirect coverage rather than fabricate a name
- "First-data instrumentation" now references a specific test
- "Receiver-internal goroutine panic" row: clarify that the linked
  test exercises the *recipe*, not a crash assertion (negative case
  isn't tractable from inside the panicking process)
- New row: stdout pipe broken mid-write — present failure mode in
  stdoutexporter today, was missing from the inventory

FOLLOWUPS.md (restructure):
- New "Next up" pinned section (doc-check, cross-platform CI,
  squash-merge trailer verification)
- "Considered and explicitly skipped" section with rationale for:
  the DRY wrap helper, the signalConsumer generic constraint, the
  TestBuildBinary inode check
- New rows: signalops trigger refinement (non-mechanical-diff is
  the real trigger), Factories.Validate() method, validate --json,
  make register-lint, make ci wall-clock trend, coverage trend
- Ticked: internal/pipelinebuilder doc.go (landed in this commit)

One review item attempted and reverted with documentation:
- signalConsumer generic constraint on signalOps[C] silently
  dropped pipelinebuilder coverage from 74%→43% under -coverpkg.
  Root cause not nailed (suspected generic instantiation
  interacting oddly with coverage attribution); FOLLOWUPS records
  the diagnosis in the skipped section.

Verified: internal/safe.Call exists as the README claims (review
item #25).

Items deferred to FOLLOWUPS (with explicit rationale): DRY wrap
helper, signalConsumer constraint, inode check, register-lint,
Factories.Validate, validate --json, OSS-Fuzz integration,
benchstat in CI, wall-clock trend, coverage trend.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two passes addressing the user's "remove superfluous documentation"
and "put important-for-the-future items into a roadmap" feedback:

Trimmed (remove what-isn't-there unless necessary):

- docs/FAILURE-MODES.md: drop the "Things tracecore deliberately
  does NOT handle (yet)" prose section. The 🔴 rows in the inventory
  tables already capture the same content; the bottom section was
  duplicate-with-future-speculation. Intro paragraph also tightened.
- docs/nps.md: trim the "Status: stub, pre-pilot" banner from a
  paragraph to two lines.
- docs/STYLE-errors.md: drop the "Deferred — no error codes yet"
  section. Speculative; the deferral fact lives in RFC-0003.
- internal/pipelinebuilder/doc.go: tighten test-split note from
  three paragraphs to one.
- internal/pipeline/README.md: trim the "Deferred from M1" section
  to a single sentence pointing at RFC-0003.

Roadmap-tracked (use the phrase "Carry-forward from M1.6"):

The user asked: "anything that would be important for the future
it's good to put into a roadmap." MILESTONES.md is that roadmap.
Added "Carry-forward from M1.6:" bullets under the affected
milestones (M3, M4b, M5, M6, M8) for items that have concrete
milestone targets — cross-platform CI, chaos-light test, benchstat
in CI, sustained-load, backend integration recipes, doc-check, the
M8 receiver-author ergonomics review.

docs/FOLLOWUPS.md restructured to keep only the two things that
DON'T fit a milestone:
- opportunistic / trigger-based items (no fixed date)
- considered-and-explicitly-skipped (anti-rot)

Cross-link added in FOLLOWUPS preamble so a reader looking for
the M3 carry-forward grep "carry-forward" in MILESTONES.md.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Closes the four actionable items from PR-13 v3 review:

A. TestIntegration_SignalDuringBoot retry loop:
   - Track `signaled` bool; fail with a meaningful message if all
     250 attempts return errors (was: misleading "process hung" via
     the cmd.Wait deadline)
   - Detect os.ErrProcessDone and break early (process exited
     before we delivered the signal — nothing to wait for)

B. validate.go ctx-cancel exit-code comment:
   - runValidate returns exitFailure on ctx-cancel; runCollect
     returns exitOK. Inconsistency was intentional: validate's job
     is to PROVE config validity; cancelled validate hasn't proved
     anything and shouldn't claim success. Comment makes this
     intentional asymmetry visible.

C. Pointer comment at top of signalops.go warning future
   contributors away from adding the signalConsumer generic
   constraint. The PR-13 v2 attempt dropped coverage 74%→43%
   under -coverpkg; the comment + FOLLOWUPS entry should stop the
   next person from rediscovering it.

D. New `make doc-check` (scripts/doc-check.sh): verifies every
   Test/Fuzz/Benchmark identifier referenced in rot-prone docs
   (docs/FAILURE-MODES.md + internal/pipeline/README.md)
   corresponds to a real function in the test tree. Scoped tight
   on purpose — RFCs intentionally reference deleted-test names
   as historical record. Wired into `make ci` between govulncheck
   and build.

Plus two minor v3 review tweaks:
- internal/pipelinebuilder/doc.go gains an "As of 2026-05" date so
  future readers can judge staleness of its "where the tests live"
  guidance (review item #4).
- FOLLOWUPS "considered and skipped" DRY-wrap entry now says
  "Revisit if: never — this is a style commitment" (review #12).

Current state: doc-check finds 23 valid references across 2 docs;
all CI gates pass.

Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr changed the title [general] M1.6 hardening: lint, coverage, panic recovery, OTel audit [general] M1.6 hardening: lint + coverage + panic recovery + pipelinebuilder May 14, 2026
@trilamsr trilamsr merged commit a57883f into main May 14, 2026
12 checks passed
@trilamsr trilamsr deleted the feat/m1.6-hardening branch May 14, 2026 09:00
trilamsr added a commit that referenced this pull request May 14, 2026
## What this PR does

Closes the one FOLLOWUPS item from PR #13 that could only be observed
post-merge: whether GitHub's squash-merge preserves the per-commit
`Assisted-by:` trailers. **Observed on squash commit `a57883f`: it
does not.** The audit trail survives via the PR description (embedded
in the squash commit body, listing every source SHA), but the
trailer-level disclosure AGENTS.md describes is gone.

Documents the finding + flags two governance paths that aren't being
chosen unilaterally:
- Update AGENTS.md to treat the PR-description SHA list as canonical
  squash-merge disclosure, or
- Switch AI-assisted PRs to merge-commit / rebase-merge so per-commit
  trailers survive.

Tag `v0.1.0-m1` was also created on `a57883f` (dev-preview, not a
release artifact) — pushed alongside this branch but separately.

## Linked issue(s)

Refs PR #13 (post-merge observation). N/A — no GitHub issue.

## Release notes

```release-notes
NONE
```

## Checklist

- [x] N/A — docs-only change
- [x] `make ci` passes locally
- [x] Commits are signed off (`git commit -s`)
- [x] Commits cryptographically signed (SSH-based, see
`scripts/setup-signing.sh`)
- [x] N/A — no new components in this PR

Signed-off-by: tree <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Address items #9 (--version OTel SDK), #13 (Example_* tests),
#14 (Kind constant coverage + helper), #15 (shutdown error logging).
Also tactical: extract initTelemetryStack from runCollect so the
silent-shutdown-error fix didn't push cyclomatic complexity past 15.

- internal/version: Build gains OTelSDKMetric + OTelPromExporter
  fields, populated from debug.ReadBuildInfo. Surfaces in --version
  output as `otel-sdk-metric=vX.Y.Z, otel-prom-exporter=vA.B.C`.
  Operators debugging OTel-side bugs see which SDK actually shipped
  without grep'ing go.mod.

- internal/selftelemetry: new CanonicalKinds() returning the
  Kind* constant set and IsCanonicalKind(s) helper. Test pins the
  set + asserts clockreceiver's KindDownstream is canonical. Future
  receivers (M8/M9) plug in the same test pattern; eventual lint
  enforcement is documented in FOLLOWUPS.

- internal/selftelemetry/example_test.go: ExampleNewReceiver +
  ExampleNewExporter render in pkg.go.dev for M8/M9 authors and
  compile-check the 6-line wiring pattern every CI run. The
  RECEIVER-PATTERNS.md snippet can no longer drift from a working
  example without breaking this test.

- cmd/tracecore.runCollect: previously swallowed mp.Shutdown errors
  in the failure-unwinder paths with `_ = mp.Shutdown(ctx)`. Now
  logs at Warn level so a real drain failure surfaces. Extracted
  the telemetry-init block into initTelemetryStack helper to keep
  runCollect's cyclomatic complexity under the gocyclo 15 ceiling
  (the helper's own error-handling carries the rest of the
  cleanup branches).

make ci clean. Coverage unchanged on most pkgs; selftelemetry up
to 91.0% via the new IsCanonicalKind test.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Two independent reviews of PR #18 surfaced a stack of blockers,
strong findings, and quality lifts. This commit lands the tractable
items (defers documented in docs/FOLLOWUPS.md M8 section).

Operator-visible drift (closes Reviewer 2 #13-#17):
 - RUNBOOK kind-triage row `consume` → `downstream` (last commit
   renamed the kind but missed the doc table).
 - README config table: `initial_delay` range updated from "0..
   collection_interval" to "≥ 0" (code already relaxed for DGX
   cold-starts; doc was stale).
 - prometheus-alerts: DCGMReceiverHighErrorRate threshold changed
   from `rate > 0.1/sec` (unreachable: 15s default tick caps at
   0.067/sec) to `increase > 5 in 5m`. Stale "M2 has not landed"
   caveat removed.
 - example_config: single `mode:` line with a clarifying comment
   (was showing both `mode: standalone` and `# mode: embedded`,
   inviting operators to uncomment both → YAML duplicate-key).
 - cmd/tracecore receivers list now prints `dcgm [stub]` / `dcgm
   [cgo]` so operators can verify deploy shape without reading
   go.mod. Build-tag-conditional via three small files in
   cmd/tracecore (receiver_variants*.go) — pattern extends to M11
   NVML.

Correctness bugs (closes Reviewer 2 #2, #3, #4, #6, #7):
 - receiver.Shutdown: `r.running.CompareAndSwap(true, false)`
   gates teardown so a second Shutdown is a no-op (cgo libdcgm
   `dcgmShutdown` is not documented idempotent). Same CAS provides
   the happens-before for `r.cancel` publish that Pass-1 flagged.
 - receiver.ensureWatched: zero-entities path now emits
   IncError(KindEnumerate) + degraded rather than returning true.
   Without this, a misconfigured host (no GPUs visible, ACL blocks
   /dev/nvidia*) had the receiver looking healthy while emitting
   nothing.
 - receiver: new `warnOnce` helper gates the 7 per-tick failure-
   path Warn logs to fire only on the first failure after recovery.
   Closes the log-storm bug (4 errors/min × 60 min = 240 lines).
   Counter (`receiver_errors_total`) still ticks every failure.
 - metrics.applyCardinalityCap: parameter `cap` → `maxSeries`
   (cap shadowed the Go builtin).

Quality / contract lifts:
 - metrics.emit now returns a `stale` count for StatusStale and
   StatusError samples. pushSamples calls IncError(KindRead) once
   per tick when stale > 0 — surfaces DCGM serving slow/faulty
   data, which is precisely what StatusStale exists for. Per-tick
   not per-sample so GPU count doesn't inflate the rate.
   (StatusNoData and StatusFieldNotSupported still silent.)
 - docs_parity_test.go: new TestRUNBOOK_KindsMatchEmitted walks
   every emitted IncError/failedTick kind against the RUNBOOK
   per-kind triage table in both directions. This is the
   structural fix for the bug class — RUNBOOK can never again
   drift from emitted kinds without CI failure.
 - receiver.go: promoted `watchUpdateDivisor` /
   `watchKeepForMultiplier` / `watchUpdateEveryMinimum` constants
   for the previously-magic DCGM watch-cadence ratios.

Documentation + dedup:
 - dcgm README: new "Privacy + data residency considerations"
   subsection (compliance-auditor ask). Flags hw.id / pci.bdf /
   NVLink peer IDs as quasi-identifying; provides two mitigation
   patterns (attr-drop processor, salt-hash pseudonymization).
 - docs/agents/examples/constructor_options.go: `WithTelemetry`
   renamed to `WithSelfTelemetry` to match the real in-tree
   receiver API. M9+ authors copying the example no longer drift.
 - RUNBOOK kind enumeration line restored with both watch and mig
   (dcgm-local kinds) per the last commit's promotion.
 - Repo-root `FOLLOWUPS.md` consolidated into `docs/FOLLOWUPS.md`
   (M8-opportunistic + M8-skipped sections). Single source of
   truth; 17 deferred items pulled forward with falsifiable
   triggers (cgo client landing, M11 sibling-receiver shape,
   operator-report thresholds, file-size triggers).
 - All bare `FOLLOWUPS.md` references updated to
   `docs/FOLLOWUPS.md`.

Honest pushback documented:
 - I disagree with the M8-AGRADE-GAP claim of Operator UX 3.7→4.0.
   The drift findings above are exactly the class of bugs that
   rubric criterion was supposed to prevent — the alerts-vs-RUNBOOK
   parity test existed but didn't check kind values against
   emitted call sites. The new TestRUNBOOK_KindsMatchEmitted
   closes that gap; future operator-UX claims should pin to a test
   like this.
 - Deferred: split receiver.go (475 LOC) into 3 files, hoist
   dcgmtest.BaseClient, SECURITY.md for receiver, dcgm_info
   join-target, libdcgm setup in CONTRIBUTING — all logged with
   triggers. Reviewer 1's "construct receiver via M9-style
   primary-Option" inconsistency goes in the queue for M9-close
   review.

`make ci` passes; dcgm coverage 86.0% (essentially flat — new
tests offset by widened emit signature in test paths).

Assisted-by: Claude Opus 4.7
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 20, 2026
## Summary

- Strip review-process artifacts (reviewer tags, PR refs) and
section-banner comments from 5 files.
- Collapse per-field / per-const / per-method comment blocks whose
bodies just restated identifier names.
- Keep every load-bearing WHY (TOCTOU ordering, adversarial-input
bounds, retry short-circuit, gzip threshold, classifyKind partition,
cardinality discipline).
- Net: **+118 / −357 lines**, no behavior change.

## Why these, not others

Six-months-cold-reader test: would a reader without this conversation's
context need the comment to avoid a footgun? If no, delete.

| File | What went | Why it failed the test |
|---|---|---|
| `pkg/nccl/fr_parser/parser.go` | `// --- section ---` banners;
per-field struct prose | Banners decay silently as the file grows; field
names already self-document. Adversarial-input WHY preserved on the
struct. |
| `components/exporters/otlphttp/otlphttp.go` | Reviewer P3-Rev2-F1,
S1/F9, C4/V1/V2, V14/m3, P2.10, M-5, M7 tags | Process coordinates that
mean nothing post-merge; git blame + PR threads are the right home.
Underlying constraints kept. |
| `internal/pipelinebuilder/signalops.go` | "PR #13 v2 tried it"
attribution | The rule (don't add the constraint) survives;
docs/FOLLOWUPS.md already preserves the history pointer. |
| `components/receivers/pyspy/kinds.go` | 6-line blocks per Kind const
(× 12) | RUNBOOK.md owns operator semantics. Two truths drift; const
names are self-documenting. |
| `internal/selftelemetry/interface.go` | Per-method one-liner docs
whose bodies restated method names; cardinality warning duplicated on
every method | DRY for prose too. Cardinality discipline stated once on
`Kind`. |

## Release notes

```release-notes
NONE
```

## Test plan

- [x] `go build ./...`
- [x] `go vet ./...`
- [x] `go test ./pkg/nccl/fr_parser/...
./components/exporters/otlphttp/... ./internal/pipelinebuilder/...
./internal/selftelemetry/... ./components/receivers/pyspy/...` — all
green

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Eight 1-page pattern-design specs covering #2 IB link flap, #7
dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout,
#10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss
spike NaN, #13 silent data corruption. Each carries the standard
detector-design shape (symptom, layers, signal sources, evaluation
rule, verdict attrs, edge cases, status, open questions) so the next
contributor can write a TDD red test directly off the spec.

Status: all 8 marked planned. #10 already has issue #303; the spec
frames the design alongside.

NORTHSTARS Appendix A gains a Spec column; docs/README + patterns
README link the new specs.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Ships pattern #13 (silent data corruption) end-to-end per the spec at
`docs/patterns/13-silent-data-corruption.md` — the hardest pattern in
the 15-set, where the run completes, loss looks normal, but downstream
eval shows degraded model quality.

- **Library**: `SilentDataCorruptionDetector` in
`module/pkg/patterns/silent_data_corruption.go` consumes
`EvalAccuracyRecord` + `SDCCounterRecord` typed projections and emits
`SilentDataCorruptionVerdict`. Discriminator follows the spec's
evaluation rule:
- `accuracy_drop >= AccuracyDropThreshold` AND same-job `hw.gpu.sdc.*`
counter rose during the job window → `kind=vendor_signaled` (full
confidence)
- `accuracy_drop >= AccuracyDropThreshold * 2` alone →
`kind=accuracy_only` (partial confidence)
- sub-2x-threshold without vendor signal → no verdict
(conservative-by-default noise band)
- **False-positive guards** (spec §"Edge cases"): dataset-checksum
mismatch suppresses verdicts; checkpoint-step mismatch suppresses
verdicts. Both guards are opt-in (only fire when both sides set the
field).
- **Wiring**: `patterndetectorprocessor` projects
`gen_ai.training.eval_accuracy*` + `hw.gpu.sdc.*` log records, runs the
detector, emits the verdict with promoted scalars per issue #270.
- **Schema**:
`module/pkg/patterns/testdata/silent_data_corruption_verdict.schema.json`
pins the verdict wire format; 13 schema-drift falsifiers + conformance
round-trip for both branches.
- **Docs**: spec status flipped to shipped; `docs/patterns/README.md`
table updated; `docs/ATTRIBUTES.md` registers the 13 new customer-stable
attribute keys (`hw.gpu.sdc.{delta,kind}`,
`gen_ai.training.eval_accuracy.*`, `gen_ai.training.eval_set.*`,
`gen_ai.training.checkpoint.*`,
`gen_ai.training.job.{start,end}_unix_nano`,
`tracecore.alert.silent_data_corruption.*`) and adds the
`silent_data_corruption` row to the per-pattern matrix.

Verdict remains explicitly advisory per spec §"Detector evaluation rule"
— both branches route the operator to a same-recipe + same-seed re-run
on different hardware as the disambiguator, because SDC repro is
non-deterministic.

## Test plan

- [x] `cd module && go test ./pkg/patterns/...
./processor/patterndetectorprocessor/... -count=1` — green (19 SDC
library tests incl. 13 schema falsifiers, 7 wiring tests).
- [x] `cd module && go vet ./...` — clean.
- [x] `golangci-lint run ./...` — 0 issues (via pre-commit hook).
- [x] `attribute-namespace-check` — 82/82 unique attribute literals
documented (via pre-commit hook).
- [x] Full module `go build ./...` — clean.

## Integration gaps (upstream-blocked, documented in-place)

- `gen_ai.training.eval_accuracy` upstream framework instrumentation
(spec §"Open questions" 5) — projection tested against hand-crafted
records; real-world input arrives once the OTTL recipe lands.
- `hw.gpu.sdc.*` OTTL recipe for DCGM SDC catcher / row-remap / AMD ECC
counters — blocked on RFC-0014 PR-B (metrics→logs pattern). Cross-vendor
support depends on `hw.gpu.sdc.*` semconv that doesn't exist yet (spec
§"Open questions" 2).
- Baseline-accuracy provenance + cross-run state are v0.3+ product
questions (spec §"Open questions" 1, 4) — the detector accepts
operator-stamped baselines today.

```release-notes
Add pattern #13 (silent data corruption) detector. Surfaces suspected SDC when an eval-accuracy regression (`gen_ai.training.eval_accuracy` vs baseline) crosses the configured threshold; a same-job `hw.gpu.sdc.*` counter rise during the job window flips the verdict to high confidence (`kind=vendor_signaled`). The verdict is advisory — both branches recommend a same-recipe re-run on different hardware as the disambiguator. New config knobs: `sdc_accuracy_drop_threshold` (default `0.005`) and `sdc_accuracy_only_multiplier` (default `2.0`).
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but
the join shape varies across patterns and the rationale lived only
in code comments + PR review threads. Operators tuning windows had to
read source per detector.

Audit finding: five distinct shapes are load-bearing (chosen by the
causal physics of each signal), not bugs:

- One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect.
- Asymmetric two-sided (#11): pre-stall covers concurrent-start
  checkpoints; post-stall covers OTTL-bridge logger latency.
- Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window
  could be cause OR consequence.
- Job-window bounded (#13): SDC counter rise must fall in the
  bounded eval-cycle's owning job; no operator knob is meaningful.
- Trailing-window rate / freshness (#2 #4 #8): rolling window
  anchored at `now` or the most-recent record.

Decision: document the existing reality, do not converge. Forcing
every detector to the asymmetric two-knob form would silently
zero one leg for the one-sided detectors (footgun on clock skew)
and would not apply to #13 at all.

Adds:
- 'Why this correlation shape' section in docs/patterns/07, 11, 13
  (the three shapes the issue called out by name).
- 'Correlation-window semantics' table in docs/patterns/README.md
  covering ALL 11 detectors with the predicate, anchor, and shape
  rationale, plus cross-links to the per-pattern sections.

No code changes; no detector behavior changes.

Closes #367.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Three v1 detectors use three different correlation-window shapes; the
rationale lived only in code comments and review threads. Operators
tuning windows hit these without warning.

- pattern #7 (dataloader_hang): one-sided look-back — strict
cause→symptom
- pattern #11 (checkpointer_hang): asymmetric (-30s, +60s) — log/record
race
- pattern #13 (silent_data_corruption): symmetric job window —
order-free attribution

## Decision

Document the existing reality. Each shape matches a distinct physical
event-ordering. A unified asymmetric two-knob form (issue suggestion)
would silently zero one leg for #7 and not apply to #13 — operators
would tune knobs with no physical meaning for their pattern.

## Changes

- "Why this correlation shape" subsection in each of
`docs/patterns/07-dataloader-hang.md`, `11-checkpointer-hang.md`,
`13-silent-data-corruption.md`
- "Correlation-window semantics" comparison table in
`docs/patterns/README.md` cross-linking the three rationales
- No code changes; no detector behavior changes

## Test plan

- [x] `golangci-lint`, `go vet`, `attribute-namespace-check` (pre-push
hook) — green
- [x] Each pattern doc cross-links to the comparison table
- [x] Each comparison-table row links back to the per-pattern rationale
section

Closes #367.

```release-notes
docs: document correlation-window semantics for patterns #7, #11, #13 — three shapes (one-sided, asymmetric, symmetric) match distinct physical event-orderings; no behavior change.
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 2, 2026
Pre-fix, scripts/doc-check.sh:103 exited 0 whenever
docs/FAILURE-MODES.md carried no Test*/Fuzz*/Benchmark* references
(its current state on main, grep count = 0). The early-exit silently
bypassed every gate below it: link-rot (.md + non-.md), banned-phrase
lint, (unverified) baseline, reproducibility shell-syntax, maintainer-
ship H2, M6 integration recipes (markers + 180d staleness), example
placeholder lint, README integration index, required-doc presence,
getting-started block cap, nps H3, FAILURE-MODES labels, and the
comment-noise diff gate.

Root cause: a57883f (#13) shipped doc-check.sh with one gate (the
Go-test parity check), so `[ -z "$referenced" ] && exit 0` was
correct. Subsequent PRs appended ~14 gates below without noticing
they'd be dead-coded whenever FAILURE-MODES dropped its Test* refs.
PR #459 worked around by placing the new YAML gate *above* line 99.

Fix: scope the skip to the parity block only (if/else, not exit).
Update the comment in the trust-posture preamble to record the bug.

Surfaced + fixed:
- docs/FOLLOWUPS.md: dead `followups/otlphttp.md` link. Shard never
  shipped on main; folded into the "deleted post-v0.2.0" prose.
- docs/cut-criteria.yaml.md (2x) + install/kubernetes/tracecore/
  README.md: "production-grade" banned-phrase hits reworded.
- scripts/doc-check.sh: rescope getting-started.md block cap to the
  `## Walkthrough` H2 section only (M6 intent — quickstart commands,
  not Helm + air-gap installs that landed after the cap was set).

Mutation-verified post-fix:
- inject dead .md link → link-rot gate exits 1, names file + target
- inject "blazing-fast" + "rock-solid" → banned-phrase gate exits 1
- remove tested-against marker from datadog.md → recipe gate exits 1

Gate count: 2 status lines pre-fix → 16 post-fix on clean tree.
`make doc-check` exits 0 on clean tree.

Closes #460.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
Reviewer B flagged the PR body's "2 → 16" gate count as inaccurate
and asked whether the test plan had been re-run against post-#453
state. Resolution recorded here:

Gate count, empirically verified via `grep -c '^doc-check: '` on
`make doc-check` output on a clean tree:

  Pre-fix on main (after #459 landed)           3 status lines
  Post-fix this PR (after rebase onto origin)  17 status lines

The PR body's "2 → 16" was correct at PR open (before #459 merged).
After rebasing onto origin/main, the YAML cross-link gate from #459
adds one more enforced gate (it was placed above the early-exit line
specifically to work around the bug this PR fixes, so it ran on the
pre-fix baseline too). The "14 gates below that line" framing in
the release-notes block is correct for either base: it counts gates
that the early-exit was hiding, which is invariant across the rebase.

Test plan re-run post-rebase: `make doc-check` exits 0 on clean tree,
17 status lines emitted, no mutation regressions.

Refresh the script header docstring to record the empirical count
and reflect that doc-check now covers 17 drift patterns, not just
the original Go-test-name parity check it shipped with in #13.

No functional change to gate logic.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…460) (#466)

## Summary

Closes #460. The `exit 0` on `scripts/doc-check.sh` ran unconditionally
whenever `docs/FAILURE-MODES.md` carried no `Test*`/`Fuzz*`/`Benchmark*`
identifiers (its current state on `main` — `grep -c` = 0), silently
bypassing every gate below it. Fix scopes the skip to the Go-test parity
block only (if/else, not `exit`), then surfaces and fixes the dead refs
the gates were supposed to be catching.

## Root cause

Commit a57883f (#13) shipped `doc-check.sh` with one gate — the Go-test
name parity check — so `[ -z "$referenced" ] && exit 0` was correct
then. PRs #28, #56, #115, #131, #144, #149, #195, #234, #241, #443,
#455, #459 (and others) appended gates **below** that line without
recognising they'd become dead code whenever `FAILURE-MODES.md` lost its
`Test*` references. PR #459 worked around the bug by placing its new
YAML gate *above* line 99 and tracked the root cause separately as #460.

## What surfaced

Once `exit 0` was removed, three real issues fired:

1. **Dead `.md` link**: `docs/FOLLOWUPS.md` → `followups/otlphttp.md`.
The shard was never committed to `main`'s ancestry. Folded into the
existing "Shards deleted post-v0.2.0 as fully resolved-via-pivot" prose
block (sibling treatment to M9, M14, M16).
2. **Banned-phrase hits** (3x `production-grade`): reworded in
`docs/cut-criteria.yaml.md` (2x) and
`install/kubernetes/tracecore/README.md` (1x) to falsifiable language.
3. **`docs/getting-started.md` block cap**: 7 fenced bash/sh blocks. The
M6 cap of 5 was set for the quickstart only — `## Install via Helm` and
`## Air-gapped install` are alternate deployment paths that landed
post-M6 and aren't part of the quickstart budget. Rescoped the gate to
count blocks inside the `## Walkthrough` H2 section only (1 block, well
under cap).

## Gate count

Empirically verified via `grep -c '^doc-check: '` on `make doc-check`
output on a clean tree:

| State | Status lines emitted | Gates the early-exit was hiding |
|---|---|---|
| Pre-fix on `main` (post-#459) | 3 (trust-posture, YAML cross-link,
parity-skip) | 14 |
| Post-fix this PR (post-rebase) | 17 | 0 |

The "14 gates hidden" number is invariant across the rebase: it counts
gates placed below the early-exit line. The "3 → 17" total reflects
post-#459 reality on `main`; pre-#459 baseline was "2 → 16" (the figure
originally in this PR body), and #459 itself worked around the bug by
placing its YAML gate above line 99.

## Mutation tests

Each gate below the original early-exit was confirmed to fire post-fix:

| Mutation | Gate expected to fire | Exit code post-mutation | Exit code
post-restore |
|---|---|---|---|
| Inject `[bad](nonexistent-ghost.md)` into `docs/FOLLOWUPS.md` |
markdown link-rot | 1 | 0 |
| Append `blazing-fast` + `rock-solid` to `docs/getting-started.md` |
banned-phrase lint | 1 | 0 |
| Delete `<!-- tested-against: ... -->` from
`docs/integrations/datadog.md` | M6 recipe markers | 1 | 0 |

## Test plan

- [x] `make doc-check` exits 0 on clean tree (re-run post-rebase onto
origin/main; 17 status lines)
- [x] 3 mutation tests above each toggle exit 1 → 0 across mutate /
restore
- [x] Pre-push hooks green: golangci-lint (0 issues), `go vet ./...`,
`go mod verify`, `attribute-namespace-check` (100 attrs, all
documented), `register-lint`, `actionlint`, `zizmor`,
`deprecation-check`, `no-autoupdate-check`
- [x] Rebased onto current `origin/main` (includes #459, #461, #462,
#456); no conflicts; gate count re-verified empirically post-rebase
- [x] No changes to gates above line 99 (the trust-posture callout +
YAML cross-link gate from #459 still run and emit unchanged status
lines)

## Self-grade

**A+** — root cause named in commit body (a57883f #13 with one gate;
gates appended below without exit-path awareness); 3 mutation tests
(success criteria required 1–2); rescoped the getting-started gate to
match M6 intent rather than papering over the surfaced overflow; the `[
-z "$referenced" ]` legitimate skip is preserved via if/else (not `:`
no-op, which would have left the `defined=` / `orphans=` block running
on empty input); gate count corrected empirically post-rebase per
reviewer B feedback.

```release-notes
- fix(ci): `scripts/doc-check.sh` no longer exits 0 at the Go-test parity gate when `docs/FAILURE-MODES.md` carries no `Test*` references. 14 gates below that line (link-rot, banned-phrase, M6 recipe markers, etc.) are now actually enforced on every `make doc-check` invocation. Closes #460.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 4, 2026
## Summary

Wave-end audit flagged the patterndetectorprocessor fanout site as an
unmet refactor: `ConsumeLogs` hand-rolled dispatch for every shipped
detector (12 today: 7 inline + 5 wrapped), so adding pattern #13
required editing the fanout body — not registering a new entry. Past the
rule-of-three by 9x.

This PR introduces a minimal Detector registry seam:

- `module/pkg/patterns/detector.go`: new `Detector` interface
(`PatternID() string`) + `Registered` slice that pins all 12 detector
pointers. Each `*Detector` struct gets a one-line `PatternID()` method.
- `module/pkg/patterns/detector_test.go`:
`TestRegistered_PinsAllPatterns` (exact PatternID set + count),
`TestRegistered_UniquePatternIDs`, `TestRegistered_NonEmptyPatternIDs`.
Drift gate — accidental drops fail in CI.
- `patterndetector.go`: introduces `detectorRunners []detectorRunner`
closure list iterated by `ConsumeLogs`. `ConsumeLogs` body drops from
~77 lines to 12. Adding pattern #13 = one append to `Registered` + one
append to `detectorRunners`, no fanout-site edit.

### Design decision: metadata-only interface

The `Detector` interface is intentionally `PatternID() string` only —
not a uniform `Evaluate` method. Each detector's Evaluate signature is
intrinsically heterogeneous (different input record shapes —
events+nodeConds, ncclRecs, xidRecs+events, etc. — and different verdict
types). A uniform Evaluate would force a lossy `any`-typed contract that
the typed test suite has been fighting for 12 patterns. The
closure-per-detector approach keeps the typed Evaluate calls at their
concrete-runner sites while letting the registry pin identity +
iteration.

### Behavior preservation

- Same telemetry vocabulary: PodEvicted and IBLinkFlap still
`IncVerdict` with `string(v.Confidence)` (they gate on partial); the
other 5 inline detectors still pass `""`. The 5 wrapped runners still
tick inside their own helpers (unchanged).
- Same emission order: `detectorRunners` is declared in the legacy
emission order.
- Same partial-confidence gating: `emitPodEvicted` and `emitIBLinkFlap`
preserve the `!emitPartial` skip.

### Test plan

- [x] `cd module && go build ./...` clean
- [x] `cd module && go test ./pkg/patterns/` green (incl. 3 new pin
tests)
- [x] `cd module && go test ./processor/patterndetectorprocessor/` green
except pre-existing #497
(`TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressure`,
fixed in Lane J)
- [x] `make lint` clean (0 issues)
- [x] `make vet`, `go mod verify`, attribute-namespace-check all green
(pre-push hook)

### LoC delta

- +321 / -79 across 3 files.
- `ConsumeLogs` body: 77 → 12 lines.
- Growth is in: registry plumbing (164 lines, mostly comments + the pin
tests), runner closures (one per detector). The seam earns its bytes —
adding pattern #N is now O(append) instead of O(edit-fanout).

### Closes-the-loop

Closes wave-end-audit next-wave item #2 (pattern registry seam).

```release-notes
- refactor(patterns): introduce `patterns.Detector` interface + `patterns.Registered` slice. The patterndetectorprocessor now iterates a registry-driven runner list instead of hand-rolled fanout — adding a new pattern is one append, not a processor edit. Behavior-preserving; no operator-facing change.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant