[general] M1.6 hardening: lint + coverage + panic recovery + pipelinebuilder#13
Merged
Conversation
Adds docs/STYLE-errors.md codifying seven rules for tracecore error messages: operator-actionable wording, no stuttering, %w-wrapped chains, no apologetics, %q-quoted operator input, sparing sentinels, typed errors only when callers need fields. Includes a deferral note on error codes (additive Coded interface seam at 1.0; do not implement now). Audit of every fmt.Errorf and errors.New in production code surfaced one rule-2 stutter: splitName included "component name %q" while every caller wraps with "pipeline X: <role> %q: %w", quoting the operator's input twice. Fixed by making splitName's error self-contained. RED: TestBuildPipelines_MultiSlashName_NoStutter pinned the stutter (strings.Count == 2 for the bad name). GREEN: same test passes (count == 1) after splitName no longer embeds the input string. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds bodyclose, forcetypeassert, goconst, nilerr, prealloc, wrapcheck to .golangci.yml (gocritic, unconvert, errorlint were already on). wrapcheck was the noisy one — 12 initial violations, all internal pass-through forwarders (Consume* chains, ComponentState lifecycle) or stdlib calls that genuinely cannot fail (bytes.Buffer.Write to in-memory buffer, context.Err()). Wrapping those would create the stutter STYLE-errors rule 2 forbids. Configured wrapcheck to ignore same-module packages and the relevant interfaces. Config schema verified via `golangci-lint config verify`. The schema for golangci-lint v2 wrapcheck uses kebab-case keys (ignore-sigs, ignore-package-globs, ignore-interface-regexps), not the camelCase shown in upstream wrapcheck README. RED: `make lint` after enabling — 12 wrapcheck violations. GREEN: after wrapcheck tuning, `make ci` reports 0 issues. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Phase 3 of M1.6: coverage threshold gate, scoped to the policy STYLE.md sets — internal/* ≥70%, components/* ≥60%, cmd/ informational, tools/ not policed. New surface: - `make coverage` runs go test with -coverprofile + -covermode=atomic against ./cmd/... ./components/... ./internal/... (tools/ excluded; it has its own tests but no coverage gate). - `make coverage-check` parses coverage.out via the small Go program in tools/coverage-check and fails the build if any policed package falls below threshold. - Wired into `make ci` between `test` and `build`. - Baselines recorded in docs/research/baselines.md; internal/fanout is at 70.4% (0.4% margin), explicitly noted as the package to watch during Phase 6's fan-out refactor. Runtime bug surfaced by coverage instrumentation: TestRuntime_ConcurrentStartShutdown_NoLostComponents fails under coverage timing — if Shutdown wins lifecycleMu first, it observes empty started* slices, releases the lock, and Start then runs and launches Components nobody will tear down. Fixed by setting a `shutdown` flag in Shutdown (under lifecycleMu) and returning nil from Start when shutdown is already true. Concurrent boot + signal becomes a clean no-op rather than a race-dependent leak. RED: TestRuntime_ConcurrentStartShutdown_NoLostComponents fails 3/3 runs under `go test -race -cover`. GREEN: same test passes 5/5 after the runtime fix; `make ci` clean with all 9 policed packages above threshold. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Replaces pipeline.NewRuntime(Settings, []Pipeline) with the variadic
form pipeline.NewRuntime([]Pipeline, ...Option). The Settings struct
becomes the unexported settings type; configuration flows through
WithLogger, WithDrainBudget, WithHost.
Why: M2 will add MeterProvider and BuildInfo fields. With a struct
literal, that's a breaking change for every caller. With variadic
options, M2 ships pipeline.WithMeterProvider(...) additively — every
existing call site keeps working unchanged.
API:
rt := pipeline.NewRuntime(pipelines,
pipeline.WithLogger(logger),
pipeline.WithDrainBudget(5*time.Second),
)
Options apply in order; later wins. New test
TestRuntime_VariadicOptions_ApplyInOrder pins this so M2 can layer
build-default + caller-override without proliferating constructors.
Touched files:
- internal/pipeline/runtime.go — Option type + 3 With* constructors
- internal/pipeline/runtime_test.go — 17 call sites migrated + new test
- internal/pipeline/firstdata_test.go — 1 call site migrated
- internal/pipeline/contract_test.go — 1 call site migrated
- cmd/tracecore/main.go — production call site migrated
- internal/pipeline/pipelinetest/doc.go — drop stale NewRuntime reference
- docs/rfcs/0003-... — implementation-notes entry
RED: TestRuntime_VariadicOptions_ApplyInOrder fails to compile
(Option / WithLogger / WithDrainBudget undefined).
GREEN: same test passes; `make ci` clean; all 9 policed packages
above coverage threshold; internal/pipeline coverage held at 85%.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Replaces buildMetricsPipeline / buildTracesPipeline / buildLogsPipeline (three near-identical ~95-line functions) with a single generic buildSignalPipeline[C any] + three small per-signal signalOps[C] values. The generic algorithm — exporters → fanout → reversed processors → first-data wrap → receivers — now lives in exactly one place. The signal-specific glue (which Create* method, fanout constructor, first-data wrapper, consumer-interface assertion) is the only thing per-signal, captured in metricsOps / tracesOps / logsOps. Net: -212 / +103 in cmd/tracecore/main.go (109 lines removed). The characterization test TestBuildPipelines_MultiSignal_ErrorsCiteCorrectSignal pins that each signal path still surfaces ErrSignalUnsupported with its own pipeline ID in the error — guarding against the worst kind of refactor regression (a logs error routed through the traces builder). ctx is the first parameter of buildSignalPipeline (revive's context-as-argument rule); ops is second, matching how the dispatcher reads at the call site. RED: characterization test passes against the three-function code. GREEN: same test passes after consolidation; `make ci` clean; all 9 policed packages above coverage thresholds; cmd/tracecore coverage rose 39.8% → 57.2% (informational; same tests exercise more of the smaller surface). Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three near-identical files (metrics.go / traces.go / logs.go,
87+71+71 LOC) replaced with one genericFanout[T any, C any] in a new
fanout.go plus three thin per-signal wrappers (~46 LOC each). The
cloning algorithm lives once; per-signal files now hold only the
pdata bridge funcs (new/copyTo/isReadOnly/markReadOnly), the
ConsumeX method that forwards to consumeAll, and the public
NewMetrics/NewTraces/NewLogs constructor.
Shared cloning-contract test table (TestFanoutContract_{Metrics,
Traces,Logs}) exercises 6 scenarios × 3 signals = 18 sub-tests
through one generic runCloningContract helper. Lifts traces+logs
coverage to parity with metrics — fanout coverage rose 70.4% → 100%.
The pdata payload types (pmetric.Metrics, ptrace.Traces, plog.Logs)
don't share a Go interface for CopyTo/IsReadOnly/MarkReadOnly in
upstream pdata, so the generic uses a pdataOps[T] callback struct
to bridge them. Each callback is one line.
RED: TestFanoutContract_{Metrics,Traces,Logs} pass against the
pre-refactor code (characterization).
GREEN: same tests + all pre-existing per-signal tests pass after
the generic consolidation; `make ci` clean.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
buildBinary used to call `go build` for every test invocation. With the integration test plus the new TestBuildBinary_CachesAcrossCalls plus `go test -count=N` re-runs, that was a ~2s wait per test. Replaces the per-test build with a package-level sync.OnceValues so the binary is built at most once per `go test` process. The binary lives under os.MkdirTemp so it survives across tests; TestMain cleans the directory after m.Run() returns. Measured (go test -count=2): TestBuildBinary_CachesAcrossCalls: 0.21s first, 0.00s second TestIntegration_ClockreceiverToStdoutexporter: 0.53s first, 0.33s second Total wall-clock: 2.485s; previously each invocation paid the build. RED: TestBuildBinary_CachesAcrossCalls fails on the pre-cache code (each call to buildBinary produces a new t.TempDir path, so require.Equal on paths fails). GREEN: same test passes after the sync.OnceValues cache; the cached binary's mtime is stable across calls, proving no rebuild. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds three Go-native fuzz targets covering the operator-input surfaces — the only places tracecore parses untrusted-ish strings: - internal/config/fuzz_test.go::FuzzLoad — writes the fuzz bytes to a t.TempDir file and pushes through config.Load. 5 seeds: empty, valid, malformed, multi-doc, binary. - internal/config/fuzz_test.go::FuzzParsePipelineID — 6 seeds: metrics, metrics/primary, traces/, //bad, badsignal/x, empty. - cmd/tracecore/fuzz_test.go::FuzzSplitName — 6 seeds covering the YAML component-reference name parser. Property under test: no panic, regardless of input. Each fuzzer ran clean for 3 seconds of mutation (370k/265k/508 execs respectively across 14 workers) with zero crashes. Interesting corpora live in the user's go-build cache, not the repo tree. RED-then-GREEN: each target's seed corpus IS the RED — running it the first time would fail loudly on any panic in the parser. All 6+6+5 seeds pass, demonstrating the panic-avoidance contract holds across the curated inputs. Future loops can drive these targets with `go test -fuzz=Fuzz... -fuzztime=...` for longer mutation runs. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Phase 9 decision: keep the dual-export pattern. Each component ships:
var Factory pipeline.<X>Factory = &factory{}
func NewFactory() pipeline.<X>Factory { return Factory }
Path A (keep both, document) over Path B (extend manifest with a
factoryVar flag, drop the bridge). Reasoning recorded in
docs/rfcs/0004-clockreceiver-stdoutexporter.md "Implementation notes":
1. OTel components export NewFactory() — STRATEGY.md's load-bearing
principle says mirror OTel unless we have a written reason to
diverge. We don't.
2. The bridge is 3 lines per component. The generator complexity to
support both paren-and-no-paren shapes would be more code than the
bridge eliminates.
3. Receivers get both ergonomics: Factory for in-tree no-allocation
reference, NewFactory() for codegen / future plugin surfaces.
Touched:
- docs/rfcs/0004-...: new "Implementation notes" section, sub-section
"NewFactory() bridge to package-var Factory — Path A confirmed".
- tools/components-gen/main.go: doc comment on the FactoryFunc field
points at the RFC entry so future contributors see the rationale
before they propose Path B again.
No code-behaviour change; the decision is durable. M8+ contributors
who copy clockreceiver's shape inherit the pattern.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
A panicking processor or exporter used to crash the binary mid-flight.
M2 will hot-path components for self-metric counters; a buggy counter
increment must not bring down the collector.
New `internal/pipeline/saferun.go` exposes three wrap functions —
WrapSafeMetrics / WrapSafeTraces / WrapSafeLogs — that wrap a
consumer so a panic inside its Consume* is recovered, logged at
Error with the component ID, and surfaced as an ordinary error to
the caller. Capabilities() is a transparent pass-through so the
fan-out cloning decisions stay correct. The wrapper is process-
stable: a panicking call does not poison subsequent pushes.
Wired in at the pipeline-assembly seam in cmd/tracecore.
buildSignalPipeline: after every processor and exporter is created,
the consumer-view is wrapped before being used as the upstream's
`next` (or before being fed to the fanout). One signalOps callback
(wrapSafe) per signal, mirroring how wrapFirstData is plumbed.
Receivers don't get wrapped — they push into next, they don't get
pushed into. The same goes for the fanout: it is tracecore code,
not third-party, and we don't recover from our own bugs.
Quarantine semantics (suspend a component that panics repeatedly)
deliberately deferred to the M2 self-telemetry RFC. M1.6 ships the
cheap behaviour: every push to a panicking component fails its
own push but the next tick gets retried.
RED: TestWrapSafe_PanicsRecovered_PerSignal fails to compile against
pre-Phase-10 code (WrapSafe{Metrics,Traces,Logs} undefined).
GREEN: same tests pass; `make ci` clean; internal/pipeline coverage
rose 85.2% → 86.3% on the new saferun.go's 100%-covered surface.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds three *_bench_test.go files covering the most-trafficked
hot-path code paths in tracecore:
internal/pipeline/bench_test.go
BenchmarkRuntime_StartShutdown (N=10, N=100 components)
BenchmarkWrapFirstDataMetrics_ConsumeMetrics
internal/fanout/bench_test.go
BenchmarkFanout_NewMetrics (5 readonly+mutable mixes)
internal/config/bench_test.go
BenchmarkParsePipelineID
Numbers recorded in docs/research/baselines.md (Apple M4 Pro,
darwin/arm64, Go 1.26.2). Highlights:
WrapFirstDataMetrics after firstdata fires: 3.5 ns/op, 0 allocs
Fanout fast path (single readonly): 1.1 ns/op, 0 allocs
ParsePipelineID: 84 ns/op, 0 allocs
Runtime_StartShutdown sub-linear (10→100 = 2.4× not 10×) thanks
to parallel Phase 1
The harness is the deliverable; M1.6 doesn't gate on these numbers.
M5 will introduce regression gating per MILESTONES.md. Future
phases that touch these code paths can compare quantitatively.
Run via:
go test -bench=. -benchmem -count=1 ./internal/...
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds govulncheck between coverage-check and build in the ci recipe. The standalone `make govulncheck` target was already present; this phase makes it a hard gate so a CVE in a called code path fails the build instead of waiting for an out-of-band scan. Current state at gate-on: 0 called vulnerabilities. govulncheck's symbol-level analysis flags 2 indirect-but-uncalled findings in imported packages and 6 in transitively-required modules — these don't fail the build because our code doesn't reach them. If a future bump pulls a CVE into a called path, ci will fail loudly. RED: the ci target doesn't include govulncheck (existing state). GREEN: `make ci` now runs govulncheck after coverage-check and before build; exit 0 with current dep set; would exit 1 on any reachable vulnerability. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
The first SIGINT/SIGTERM still triggers graceful shutdown via ctx cancellation. The second one fires os.Exit(exitFailure) so the operator can break out when graceful shutdown is taking longer than they expect. Previously the second signal was silently absorbed by the original NotifyContext handler. Replaces signal.NotifyContext with a direct signal.Notify on a buffered channel (cap=2) so two signals delivered back-to-back are both captured. The two-stage loop lives in handleShutdownSignals, extracted so it's unit-testable via a controlled channel — the integration test for the second-signal path can't deterministically race a fast-shutdown pipeline (clockreceiver+stdoutexporter complete shutdown in <1ms, beating the force-exit goroutine). Tests: - TestHandleShutdownSignals_TwoSignals_FiresForceExit pins the contract directly: feed two signals into the channel, assert cancel() and forceExit() are both invoked in order. - TestHandleShutdownSignals_ChannelClosed_ReturnsCleanly pins the shutdown path used when main exits naturally. - TestIntegration_DoubleSIGTERM_DoesNotHang verifies end-to-end that the second SIGTERM doesn't hang the process within 500ms; the exit code is intentionally not asserted because fast shutdowns may legitimately win the race. `--help` text now documents the two-signal protocol. RED: TestHandleShutdownSignals_TwoSignals_FiresForceExit doesn't compile against pre-Phase-13 code (handleShutdownSignals doesn't exist). GREEN: same test passes; `make ci` clean. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
New \`tracecore validate --config=<path>\` subcommand: loads the
config, builds pipelines, prints a terse summary, exits 0. On any
config error returns exit 2 (sysexits.h EX_DATAERR) — same code
\`collect\` returns, so operators can use validate as a pre-flight
dry run with the same shell-script semantics.
Summary format (stderr):
config valid: N pipeline(s)
metrics/primary: receivers=1 processors=0 exporters=1
...
Tests cover the three contract paths:
- good config → exit 0 + summary names every pipeline
- bad config (unknown YAML field) → exit 2 + error names field
- missing --config flag → exit 64 (usage error, kingpin path)
RED: TestRun_Validate_GoodConfig_ExitsOK reports
"unexpected validate" (kingpin doesn't know the subcommand).
GREEN: same test passes; \`make ci\` clean.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds gocyclo to .golangci.yml with min-complexity=15 so every PR is auto-audited. Initial run flagged one function: buildSignalPipeline at 18. Refactored buildSignalPipeline into a small orchestrator plus three helpers — buildExporters, buildProcessors, buildReceivers — each in the 7-9 range. The orchestrator reads top-to-bottom as the pipeline assembly's natural shape: exporters → fanout → processors (reversed) → first-data → receivers No behaviour change; existing tests pass unchanged. cmd/tracecore test coverage held at 57.7% (informational; this section of the file was already well-covered by the buildPipelines tests). Audit baseline recorded in docs/research/baselines.md. RED: `make lint` reports cyclomatic complexity 18 on buildSignalPipeline (> 15). GREEN: all functions <= 15 after the helper split; `make ci` clean. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
\`tracecore --help\` now includes a Signals section (already present), plus new Exit codes and Examples sections so operators can read the contract end-to-end without leaving the terminal. Every flag's Help() text gained a one-line description + an example value where useful (--log.level, --log.format, --config, --shutdown.drain-budget). \`tracecore --version\` switched to a parseable k=v form: tracecore v0.0.0-dev (sha=abc1234, go=go1.26.2, built=2026-..., platform=darwin/arm64) Missing build metadata (no ldflags, no VCS info) drops the k=v pair silently, so \`go run\` still emits a sensible line. Tests: - TestRun_HelpOutput_HasExamplesAndExitCodes pins the help text contract (Signals / Exit codes / Examples / runnable example invocations). - TestRun_VersionOutput_HasStablePrefix pins the stable "tracecore v" prefix and the "go=" k=v entry. - TestBuild_StringFormat updated for the new format. To capture kingpin's --help / --version output in tests, run() now installs: - app.UsageWriter(stderr) / ErrorWriter(stderr) so output goes to the caller's writer instead of os.Stdout. - app.Terminate(...) that captures the exit code rather than calling os.Exit directly; run() returns the captured code so the operator-visible exit semantics are preserved. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Mirrors "Writing a receiver" with two new sections in
internal/pipeline/README.md:
## Writing a processor (≤180 words; pointer to fakeProcessor in
contract_test.go:82 as the canonical
referent until M8+ ships one)
## Writing an exporter (≤200 words; pointer to stdoutexporter
with explicit file:line citations)
Each section follows the same template: shape, factory pattern,
capabilities, pitfalls (each pitfall tied to a specific file:line
when an in-tree example exists). M8+ contributors writing the second
or third component in each category get a copy-able shape without
reverse-engineering the existing one.
Pitfalls codify lessons that surfaced during M1.5 review:
- processor: don't capture `next` at config time, no panic guards
(handled by saferun), no manual MarkReadOnly
- exporter: synchronize shared writers, no blocking Shutdown, bounded
retries against ctx
No code changes; `make ci` still clean.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Root README: - Status now reflects M1.6 shipped state (was "not yet usable"). - New Quickstart section: build, write demo.yaml via heredoc, validate via the `tracecore validate` subcommand, then `collect`. Every line is copy-paste-runnable; verified end-to-end against the freshly-built binary. - Documentation list now links to docs/STRATEGY.md, docs/STYLE-errors.md, and internal/pipeline/README.md so a new reader can navigate from the front door to the runtime quickstart without guessing. internal/pipeline/README.md: - New "Architecture at a glance" section with an ASCII diagram of one pipeline: receivers (parallel) → WrapFirstData → processors (data-flow order, built last-first) → fanout → exporters (serial-LIFO drain). Each seam in the diagram links to the file that implements it. No code changes. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
The original RFC §TelemetrySettings noted MeterProvider was "deferred
until M2". Phase 19 makes that preview concrete so M2 contributors
land on a clean, pre-thought shape:
type TelemetrySettings struct {
Logger *slog.Logger
Resource pcommon.Resource
MeterProvider metric.MeterProvider // M2 — components emit self-metrics
BuildInfo version.Build // M2 — attribution on emitted data
_ struct{} // unkeyed-literal guard
}
The note lives in the Implementation notes section alongside the
existing Phase 4 (variadic Options) entry, so the connection is
explicit: Options are additive, the unkeyed-literal guard catches
forgotten threading, and M2 ships `pipeline.WithMeterProvider(...)` /
`pipeline.WithBuildInfo(...)` without breaking call sites.
Doc-only. No code change; `make ci` clean.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
The three pairs of per-signal wrappers (safe{Metrics,Traces,Logs} and
firstData{Metrics,Traces,Logs}) repeated the same boilerplate: hold
\`next C\`, pass through Capabilities(), and forward Consume<Sig> with
some additional work. Phase 6 had already de-duplicated this exact
shape in internal/fanout via a generic + per-signal callback. This
commit applies the same pattern to saferun.go and firstdata.go.
internal/pipeline/saferun.go
safeBase[T, C capable] with safeCall(ctx, payload) running invoke
under defer/recover. safeMetrics/Traces/Logs embed it and define
only their public Consume<Sig> method.
internal/pipeline/firstdata.go
firstDataBase[T, C capable] with fireAndForward(ctx, payload)
running invoke after the once-only "pipeline first data" log.
firstDataMetrics/Traces/Logs embed it and define only their
public Consume<Sig> method.
capable is the local constraint interface — anything with
Capabilities() consumer.Capabilities, which is every consumer.X
interface in this codebase.
Behaviour preserved: existing TestWrapSafe_* + TestFirstData* tests
pass unchanged. Error / log message text aligned slightly — the panic
log now uses key-value attrs ("component", "op", "panic") instead of
folding op into the message string, but the error returned to the
caller keeps the "component %s: panic in %s: %v" shape.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Before: each signalOps had 8 closures including createX (3) +
asXConsumer (2) + a consumerName string used solely to format the
"does not implement <interface>" error.
After: createExporter and createProcessor each return
(component, consumer.C, error) and own the type assertion inline.
The asExporterConsumer / asProcessorConsumer / consumerName fields
are gone. The builder-bug error ("does not implement consumer.X")
is now a self-contained string the per-signal closure produces; the
buildExporters / buildProcessors call sites wrap it with the usual
"pipeline X: <role> Y: %w" prefix.
Net: ~30 lines removed from cmd/tracecore/main.go; one closure form
per role across signals instead of two-pass create-then-assert.
Behaviour preserved — the existing
TestBuildPipelines_MultiSignal_ErrorsCiteCorrectSignal,
TestBuildPipelines_UndeclaredComponentRef, and other build-side
tests pass unchanged.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
parseProfile previously did `pkg := file[:strings.LastIndex(file, "/")]` — if the line's file field had no slash, LastIndex returns -1 and the slice expression panics with index-out-of-range. Go coverage profiles always include a package prefix in practice, but the slice was a latent crash on malformed input. Now: capture the index, skip the line if it's negative. One extra branch; no behaviour change on well-formed profiles. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three test-first improvements driven by post-M1.6 self-review: 1. tools/coverage-check now has unit tests (previously zero): - happy-path parsing (multi-package aggregation) - empty profile (mode-only header) - missing mode header is rejected - non-existent profile path returns error, not panic - slashless file path is skipped (pins the M1.6 slash guard) - thresholdFor table covers internal/components/other prefixes 2. \`tracecore --version-short\` emits just the bare version string for script consumption (\`VER=$(tracecore --version-short)\`). Implemented as an argv scan that short-circuits before kingpin runs subcommand validation, so it works even though collect's --config is Required. Test: TestRun_VersionShort_BareVersionForScripting. 3. Empty-pipeline boot now logs "no pipelines configured" at WARN (was INFO). Operators running with default log-level filters (warn+) would have missed an INFO line in a typo'd-config scenario that looks indistinguishable from a successful run. Test: TestRunCollect_EmptyConfig_NoPipelinesIsWarn. RED-first on each: the version-short and warn-level tests fail against pre-fix code; the coverage-check tests document existing behaviour and pass against the M1.6-era slash guard. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
The single 720-line cmd/tracecore/main.go grew through M1.6 to hold
five distinct concerns: CLI wiring, signal handling, subcommand
implementations (collect, validate), pipeline assembly, and the
signal-generic ops vars. Reading any one required scrolling past
the others.
Split (preserving package main; all symbols remain unexported and
inter-callable):
main.go 165 CLI entry, kingpin wiring, exit constants,
helpDescription, newLogger
signals.go 25 handleShutdownSignals
collect.go 72 runCollect
validate.go 41 runValidate
build.go 334 buildPipelines, buildSignalPipeline, the three
build<Role> helpers, splitName, componentSet,
resolveComponent, hasFactories, hasOperatorIntent
signalops.go 134 signalOps struct + metricsOps/tracesOps/logsOps
No behaviour change. \`make ci\` clean; all tests pass; internal/
pipeline coverage rose to 89.5%, version package to 90.5% due to
cached test paths re-running across the freshly-built binary.
Reviewability win: an M8 receiver author looking at how pipelines
are assembled opens build.go. An operator-facing CLI change touches
main.go. The signal-generic plumbing has its own file. Each file is
under 350 lines.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds goleak coverage to two packages where lifecycle goroutines live: - internal/pipeline: new TestMain wrapping goleak.VerifyTestMain. - cmd/tracecore: existing TestMain now calls goleak.Find() after m.Run() returns (and after binary-cache cleanup), failing the test run on any leftover goroutine. The check immediately caught a real leak: run()'s sigCh-reading goroutine (handleShutdownSignals) blocks on <-sigCh and never wakes when run() returns, because we only signal.Stop() the channel — we never close() it. In production this is moot (process exits, all goroutines die); in tests, every TestRun_* invocation accumulated a leaked goroutine. Fix: close(sigCh) AFTER signal.Stop() in the run() defer. Stop prevents the runtime from sending to the channel before we close, so we don't panic on send-to-closed-channel. handleShutdownSignals already handles the closed-channel case (its <-sigCh, !ok branch). This was a knowledge-gap closure from the post-M1 self-assessment: "go test -race detects races, not leaks; we have no goleak coverage." Adding it surfaced the leak the next iteration after we noticed. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Adds TestMetrics_NestedPayload_DeepCloneIsolation, a realistic-shape payload (2 ResourceMetrics × 2 ScopeMetrics × 1 Gauge × 2 DataPoints with attribute maps) pushed through a mutating + readonly fanout pair. Asserts that a mutator adding a key to its data-point attribute maps does NOT leak into the readonly consumer's clone — verifying pdata's CopyTo semantics at every nesting level (RM, SM, Metric, Gauge, DataPoint, pcommon.Map). This closes the post-M1 knowledge gap "we test single-resource payloads; real receivers push deeply nested pdata and we never verified clone isolation at the attribute-map level." Result: isolation holds; the existing fanout cloning algorithm is correct for production-shaped payloads. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two doc-only gap closures from the post-M1 self-assessment: 1. \`tracecore validate --help\` now says the output is human- readable and may change between releases. Operators tempted to parse "config valid: N pipeline(s)" for scripting see the warning directly. A future --format=json would be the stable surface. 2. internal/pipeline/README.md's receiver-author Pitfalls list now covers goroutine-internal panic recovery: saferun catches panics inside ConsumeMetrics/Traces/Logs, but a panic in the receiver's own goroutine (before reaching next.Consume*) is NOT caught and crashes the process. Includes the defer/recover snippet receivers should copy. No code change. \`make ci\` clean. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three closures from the OTel v0.152.0 divergence audit run as
background research alongside the gap-closing loop:
1. **Rename** pipeline.ErrSignalUnsupported → pipeline.ErrSignalNotSupported
to match OTel's exact spelling. Pre-1.0 breaking change with low
blast radius (8 in-tree call sites). The error message text is
unchanged ("signal not supported by this component"), so operator
logs don't drift.
2. **Multi-instance regression test**
(TestBuildPipelines_MultiInstance_SameTypeTwoInstances) — pins
that two instances of the same factory (clockreceiver/fast +
clockreceiver/slow) coexist with independent IDs and distinct
goroutines. Smoke-verified live; locking with this test.
3. **STRATEGY.md divergence table** expanded with six newly-
identified drifts:
- Host.ReportStatus location (method vs componentstatus free fn,
align in M2)
- CreateSettings shape (missing BuildInfo + unkeyed-init guard,
M2 work)
- TelemetrySettings shape (missing TracerProvider/MeterProvider,
M2 work)
- Factory interface decomposition (intentional flattening; OTel
adds XStability() we defer to post-v1.0)
- ErrSignalNotSupported naming (now aligned)
- Consumer-seam panic recovery (intentional tracecore addition;
OTel has none — documented as deliberate divergence)
- "pipeline first data" log (intentional tracecore UX addition)
Also adds an anti-list of "we're identical here" findings so
future contributors don't re-audit the agreed points.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two follow-ups from the OTel-audit gap closure: 1. Doc-comment Host.ReportStatus with a Deprecated note pointing at the M2 migration to OTel's componentstatus.ReportStatus(host, ev) free-function pattern. Receivers can call it today; the rename will be a one-line edit per call site once M2 lands. No code change yet — premature churn pre-1.0 with no actual callers. 2. TestReceiver_GoroutineDeferRecover_KeepsProcessAlive — exercises the documented recipe from internal/pipeline/README.md's Pitfalls list. A receiver whose internal goroutine defers + recovers its own panic must (a) not crash the process and (b) let the runtime shut down cleanly. Without the test, the recipe was unverified; with it, the README has a tested example to point at. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three non-blocking suggestions from PR #13 review: 1. buildSignalPipeline switch gains a default arm (build.go) so a fourth Signal value added to internal/pipeline can't silently produce zero pipelines. Currently unreachable — ParsePipelineID rejects unknown signals — defense in depth. 2. coverage-check fails loudly when the profile has zero measurable packages (tools/coverage-check/main.go). Prevents the gate from silently becoming a no-op if `make coverage`'s ./... glob gets narrowed in a future tooling change. 3. STYLE-errors.md rule 2 now has a callout explaining why wrapcheck's ignore-package-globs excludes the entire module (in-module errors share context with callers; wrapping stutters). Reminds future contributors that wrapcheck still polices stdlib / third-party returns. The fourth suggestion (constructor helpers for safeBase/firstDataBase) deferred: current field-assignment pattern works and a fourth wrapper family isn't imminent. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Self-review surfaced two more deferrals: - FAILURE-MODES.md references test names by string; rename rot is invisible until a reader follows a dead link. A make doc-check step that asserts existence would close this. ~30 minutes when next touching the Makefile. - The pipelinebuilder package's test surface is split (cmd/tracecore integration coverage + pipelinebuilder's own processor-path test). A one-paragraph doc.go answering "where do I test buildPipelines?" removes the ambiguity for the next contributor. Both belong in FOLLOWUPS.md so they survive past chat. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three weak items from the fresh-eyes self-review that were cheap enough to close in-PR: 1. nps.md gets a "stub, pre-pilot" banner up top. The doc was over-prescriptive — committing to specific NPS methodology before talking to a single operator is shaping-the-question, not asking-the-question. The banner makes it clear nothing is binding until pilots run. 2. STRATEGY.md "Each component package owns its own Factory var" section gets a hard rule: PRs introducing centralized Register APIs / global factory maps / plugin loaders MUST be rejected without an accepted RFC, burden of proof on the proposer. Was hopeful-policy; now a concrete gate. 3. Makefile coverage target's tools/ exclusion gets an explicit comment. Was a silent behavior; now an in-line "if a future tool needs coverage tracking, add it to both lists" note. Plus one item tracked in FOLLOWUPS.md (not fixed in this PR): - Verify squash-merge preserves Assisted-by trailers after the PR actually lands on main. The disclosure-discipline gap is real but only observable post-merge. Items still NOT closed by this PR (intentional, tracked): - FAILURE-MODES.md doc-rot guard (make doc-check) — FOLLOWUPS - pipelinebuilder doc.go test-surface note — FOLLOWUPS - Cross-platform CI — FOLLOWUPS (M3) - Backend integration recipes — FOLLOWUPS (M14-16) - PR size itself — structural, can't fix mid-PR Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Strong-agree items from the v2 self-review, batched: builder.go: - Reframe "this build" error → "no component factories registered" (library-shaped phrasing in an exported function) - Reframe "unsupported signal %v (this build has no assembly path)" → "no assembly path registered for signal %v" - Nil default-config guard in resolveComponent (defense in depth, matches the default-arm philosophy) - Drop dangling "(UX #4)" reference; explain the criterion inline pipelinebuilder/doc.go (new): - Explains why the package is a sibling, not nested under internal/pipeline (cycle) - Documents the test-split convention so contributors aren't ambiguous about "where do I test buildPipelines?" builder_test.go: - mustYAML panics on empty input rather than silently returning a zero yaml.Node coverage-check: - Dedup key built from regex captures (file + range explicitly) instead of fragile raw-line slicing. The regex now has 4 capture groups so a future format tweak surfaces as a regex mismatch. integration_test.go: - SignalDuringBoot: replace the 5ms-sleep race with a 500ms retry loop that signals until accepted (or the deadline expires). Resolves the Linux-CI flake risk flagged in the review. validate.go: - Rename `stderr` param to `out` (the function doesn't care which writer it is) - ctx.Err() check between Load and BuildPipelines so a SIGTERM during a slow validate bails cleanly FAILURE-MODES.md: - TestLoad_MultiDocYAML_Rejected → TestLoad_MultiDocumentYAML_Rejected (test name was wrong) - "Bad YAML" row now references a specific test, not just the file - "SIGINT during boot" now references TestIntegration_SignalDuringBoot - "Phase-1 timeout" row: no dedicated test exists; row clarified to reference the indirect coverage rather than fabricate a name - "First-data instrumentation" now references a specific test - "Receiver-internal goroutine panic" row: clarify that the linked test exercises the *recipe*, not a crash assertion (negative case isn't tractable from inside the panicking process) - New row: stdout pipe broken mid-write — present failure mode in stdoutexporter today, was missing from the inventory FOLLOWUPS.md (restructure): - New "Next up" pinned section (doc-check, cross-platform CI, squash-merge trailer verification) - "Considered and explicitly skipped" section with rationale for: the DRY wrap helper, the signalConsumer generic constraint, the TestBuildBinary inode check - New rows: signalops trigger refinement (non-mechanical-diff is the real trigger), Factories.Validate() method, validate --json, make register-lint, make ci wall-clock trend, coverage trend - Ticked: internal/pipelinebuilder doc.go (landed in this commit) One review item attempted and reverted with documentation: - signalConsumer generic constraint on signalOps[C] silently dropped pipelinebuilder coverage from 74%→43% under -coverpkg. Root cause not nailed (suspected generic instantiation interacting oddly with coverage attribution); FOLLOWUPS records the diagnosis in the skipped section. Verified: internal/safe.Call exists as the README claims (review item #25). Items deferred to FOLLOWUPS (with explicit rationale): DRY wrap helper, signalConsumer constraint, inode check, register-lint, Factories.Validate, validate --json, OSS-Fuzz integration, benchstat in CI, wall-clock trend, coverage trend. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two passes addressing the user's "remove superfluous documentation" and "put important-for-the-future items into a roadmap" feedback: Trimmed (remove what-isn't-there unless necessary): - docs/FAILURE-MODES.md: drop the "Things tracecore deliberately does NOT handle (yet)" prose section. The 🔴 rows in the inventory tables already capture the same content; the bottom section was duplicate-with-future-speculation. Intro paragraph also tightened. - docs/nps.md: trim the "Status: stub, pre-pilot" banner from a paragraph to two lines. - docs/STYLE-errors.md: drop the "Deferred — no error codes yet" section. Speculative; the deferral fact lives in RFC-0003. - internal/pipelinebuilder/doc.go: tighten test-split note from three paragraphs to one. - internal/pipeline/README.md: trim the "Deferred from M1" section to a single sentence pointing at RFC-0003. Roadmap-tracked (use the phrase "Carry-forward from M1.6"): The user asked: "anything that would be important for the future it's good to put into a roadmap." MILESTONES.md is that roadmap. Added "Carry-forward from M1.6:" bullets under the affected milestones (M3, M4b, M5, M6, M8) for items that have concrete milestone targets — cross-platform CI, chaos-light test, benchstat in CI, sustained-load, backend integration recipes, doc-check, the M8 receiver-author ergonomics review. docs/FOLLOWUPS.md restructured to keep only the two things that DON'T fit a milestone: - opportunistic / trigger-based items (no fixed date) - considered-and-explicitly-skipped (anti-rot) Cross-link added in FOLLOWUPS preamble so a reader looking for the M3 carry-forward grep "carry-forward" in MILESTONES.md. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Closes the four actionable items from PR-13 v3 review:
A. TestIntegration_SignalDuringBoot retry loop:
- Track `signaled` bool; fail with a meaningful message if all
250 attempts return errors (was: misleading "process hung" via
the cmd.Wait deadline)
- Detect os.ErrProcessDone and break early (process exited
before we delivered the signal — nothing to wait for)
B. validate.go ctx-cancel exit-code comment:
- runValidate returns exitFailure on ctx-cancel; runCollect
returns exitOK. Inconsistency was intentional: validate's job
is to PROVE config validity; cancelled validate hasn't proved
anything and shouldn't claim success. Comment makes this
intentional asymmetry visible.
C. Pointer comment at top of signalops.go warning future
contributors away from adding the signalConsumer generic
constraint. The PR-13 v2 attempt dropped coverage 74%→43%
under -coverpkg; the comment + FOLLOWUPS entry should stop the
next person from rediscovering it.
D. New `make doc-check` (scripts/doc-check.sh): verifies every
Test/Fuzz/Benchmark identifier referenced in rot-prone docs
(docs/FAILURE-MODES.md + internal/pipeline/README.md)
corresponds to a real function in the test tree. Scoped tight
on purpose — RFCs intentionally reference deleted-test names
as historical record. Wired into `make ci` between govulncheck
and build.
Plus two minor v3 review tweaks:
- internal/pipelinebuilder/doc.go gains an "As of 2026-05" date so
future readers can judge staleness of its "where the tests live"
guidance (review item #4).
- FOLLOWUPS "considered and skipped" DRY-wrap entry now says
"Revisit if: never — this is a style commitment" (review #12).
Current state: doc-check finds 23 valid references across 2 docs;
all CI gates pass.
Signed-off-by: tree <tree@lumalabs.ai>
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
5 tasks
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
## What this PR does Closes the one FOLLOWUPS item from PR #13 that could only be observed post-merge: whether GitHub's squash-merge preserves the per-commit `Assisted-by:` trailers. **Observed on squash commit `a57883f`: it does not.** The audit trail survives via the PR description (embedded in the squash commit body, listing every source SHA), but the trailer-level disclosure AGENTS.md describes is gone. Documents the finding + flags two governance paths that aren't being chosen unilaterally: - Update AGENTS.md to treat the PR-description SHA list as canonical squash-merge disclosure, or - Switch AI-assisted PRs to merge-commit / rebase-merge so per-commit trailers survive. Tag `v0.1.0-m1` was also created on `a57883f` (dev-preview, not a release artifact) — pushed alongside this branch but separately. ## Linked issue(s) Refs PR #13 (post-merge observation). N/A — no GitHub issue. ## Release notes ```release-notes NONE ``` ## Checklist - [x] N/A — docs-only change - [x] `make ci` passes locally - [x] Commits are signed off (`git commit -s`) - [x] Commits cryptographically signed (SSH-based, see `scripts/setup-signing.sh`) - [x] N/A — no new components in this PR Signed-off-by: tree <tree@lumalabs.ai> Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Address items #9 (--version OTel SDK), #13 (Example_* tests), #14 (Kind constant coverage + helper), #15 (shutdown error logging). Also tactical: extract initTelemetryStack from runCollect so the silent-shutdown-error fix didn't push cyclomatic complexity past 15. - internal/version: Build gains OTelSDKMetric + OTelPromExporter fields, populated from debug.ReadBuildInfo. Surfaces in --version output as `otel-sdk-metric=vX.Y.Z, otel-prom-exporter=vA.B.C`. Operators debugging OTel-side bugs see which SDK actually shipped without grep'ing go.mod. - internal/selftelemetry: new CanonicalKinds() returning the Kind* constant set and IsCanonicalKind(s) helper. Test pins the set + asserts clockreceiver's KindDownstream is canonical. Future receivers (M8/M9) plug in the same test pattern; eventual lint enforcement is documented in FOLLOWUPS. - internal/selftelemetry/example_test.go: ExampleNewReceiver + ExampleNewExporter render in pkg.go.dev for M8/M9 authors and compile-check the 6-line wiring pattern every CI run. The RECEIVER-PATTERNS.md snippet can no longer drift from a working example without breaking this test. - cmd/tracecore.runCollect: previously swallowed mp.Shutdown errors in the failure-unwinder paths with `_ = mp.Shutdown(ctx)`. Now logs at Warn level so a real drain failure surfaces. Extracted the telemetry-init block into initTelemetryStack helper to keep runCollect's cyclomatic complexity under the gocyclo 15 ceiling (the helper's own error-handling carries the rest of the cleanup branches). make ci clean. Coverage unchanged on most pkgs; selftelemetry up to 91.0% via the new IsCanonicalKind test. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Two independent reviews of PR #18 surfaced a stack of blockers, strong findings, and quality lifts. This commit lands the tractable items (defers documented in docs/FOLLOWUPS.md M8 section). Operator-visible drift (closes Reviewer 2 #13-#17): - RUNBOOK kind-triage row `consume` → `downstream` (last commit renamed the kind but missed the doc table). - README config table: `initial_delay` range updated from "0.. collection_interval" to "≥ 0" (code already relaxed for DGX cold-starts; doc was stale). - prometheus-alerts: DCGMReceiverHighErrorRate threshold changed from `rate > 0.1/sec` (unreachable: 15s default tick caps at 0.067/sec) to `increase > 5 in 5m`. Stale "M2 has not landed" caveat removed. - example_config: single `mode:` line with a clarifying comment (was showing both `mode: standalone` and `# mode: embedded`, inviting operators to uncomment both → YAML duplicate-key). - cmd/tracecore receivers list now prints `dcgm [stub]` / `dcgm [cgo]` so operators can verify deploy shape without reading go.mod. Build-tag-conditional via three small files in cmd/tracecore (receiver_variants*.go) — pattern extends to M11 NVML. Correctness bugs (closes Reviewer 2 #2, #3, #4, #6, #7): - receiver.Shutdown: `r.running.CompareAndSwap(true, false)` gates teardown so a second Shutdown is a no-op (cgo libdcgm `dcgmShutdown` is not documented idempotent). Same CAS provides the happens-before for `r.cancel` publish that Pass-1 flagged. - receiver.ensureWatched: zero-entities path now emits IncError(KindEnumerate) + degraded rather than returning true. Without this, a misconfigured host (no GPUs visible, ACL blocks /dev/nvidia*) had the receiver looking healthy while emitting nothing. - receiver: new `warnOnce` helper gates the 7 per-tick failure- path Warn logs to fire only on the first failure after recovery. Closes the log-storm bug (4 errors/min × 60 min = 240 lines). Counter (`receiver_errors_total`) still ticks every failure. - metrics.applyCardinalityCap: parameter `cap` → `maxSeries` (cap shadowed the Go builtin). Quality / contract lifts: - metrics.emit now returns a `stale` count for StatusStale and StatusError samples. pushSamples calls IncError(KindRead) once per tick when stale > 0 — surfaces DCGM serving slow/faulty data, which is precisely what StatusStale exists for. Per-tick not per-sample so GPU count doesn't inflate the rate. (StatusNoData and StatusFieldNotSupported still silent.) - docs_parity_test.go: new TestRUNBOOK_KindsMatchEmitted walks every emitted IncError/failedTick kind against the RUNBOOK per-kind triage table in both directions. This is the structural fix for the bug class — RUNBOOK can never again drift from emitted kinds without CI failure. - receiver.go: promoted `watchUpdateDivisor` / `watchKeepForMultiplier` / `watchUpdateEveryMinimum` constants for the previously-magic DCGM watch-cadence ratios. Documentation + dedup: - dcgm README: new "Privacy + data residency considerations" subsection (compliance-auditor ask). Flags hw.id / pci.bdf / NVLink peer IDs as quasi-identifying; provides two mitigation patterns (attr-drop processor, salt-hash pseudonymization). - docs/agents/examples/constructor_options.go: `WithTelemetry` renamed to `WithSelfTelemetry` to match the real in-tree receiver API. M9+ authors copying the example no longer drift. - RUNBOOK kind enumeration line restored with both watch and mig (dcgm-local kinds) per the last commit's promotion. - Repo-root `FOLLOWUPS.md` consolidated into `docs/FOLLOWUPS.md` (M8-opportunistic + M8-skipped sections). Single source of truth; 17 deferred items pulled forward with falsifiable triggers (cgo client landing, M11 sibling-receiver shape, operator-report thresholds, file-size triggers). - All bare `FOLLOWUPS.md` references updated to `docs/FOLLOWUPS.md`. Honest pushback documented: - I disagree with the M8-AGRADE-GAP claim of Operator UX 3.7→4.0. The drift findings above are exactly the class of bugs that rubric criterion was supposed to prevent — the alerts-vs-RUNBOOK parity test existed but didn't check kind values against emitted call sites. The new TestRUNBOOK_KindsMatchEmitted closes that gap; future operator-UX claims should pin to a test like this. - Deferred: split receiver.go (475 LOC) into 3 files, hoist dcgmtest.BaseClient, SECURITY.md for receiver, dcgm_info join-target, libdcgm setup in CONTRIBUTING — all logged with triggers. Reviewer 1's "construct receiver via M9-style primary-Option" inconsistency goes in the queue for M9-close review. `make ci` passes; dcgm coverage 86.0% (essentially flat — new tests offset by widened emit signature in test paths). Assisted-by: Claude Opus 4.7 Signed-off-by: Tri Lam <trilamsr@gmail.com>
3 tasks
trilamsr
added a commit
that referenced
this pull request
May 20, 2026
## Summary - Strip review-process artifacts (reviewer tags, PR refs) and section-banner comments from 5 files. - Collapse per-field / per-const / per-method comment blocks whose bodies just restated identifier names. - Keep every load-bearing WHY (TOCTOU ordering, adversarial-input bounds, retry short-circuit, gzip threshold, classifyKind partition, cardinality discipline). - Net: **+118 / −357 lines**, no behavior change. ## Why these, not others Six-months-cold-reader test: would a reader without this conversation's context need the comment to avoid a footgun? If no, delete. | File | What went | Why it failed the test | |---|---|---| | `pkg/nccl/fr_parser/parser.go` | `// --- section ---` banners; per-field struct prose | Banners decay silently as the file grows; field names already self-document. Adversarial-input WHY preserved on the struct. | | `components/exporters/otlphttp/otlphttp.go` | Reviewer P3-Rev2-F1, S1/F9, C4/V1/V2, V14/m3, P2.10, M-5, M7 tags | Process coordinates that mean nothing post-merge; git blame + PR threads are the right home. Underlying constraints kept. | | `internal/pipelinebuilder/signalops.go` | "PR #13 v2 tried it" attribution | The rule (don't add the constraint) survives; docs/FOLLOWUPS.md already preserves the history pointer. | | `components/receivers/pyspy/kinds.go` | 6-line blocks per Kind const (× 12) | RUNBOOK.md owns operator semantics. Two truths drift; const names are self-documenting. | | `internal/selftelemetry/interface.go` | Per-method one-liner docs whose bodies restated method names; cardinality warning duplicated on every method | DRY for prose too. Cardinality discipline stated once on `Kind`. | ## Release notes ```release-notes NONE ``` ## Test plan - [x] `go build ./...` - [x] `go vet ./...` - [x] `go test ./pkg/nccl/fr_parser/... ./components/exporters/otlphttp/... ./internal/pipelinebuilder/... ./internal/selftelemetry/... ./components/receivers/pyspy/...` — all green 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Eight 1-page pattern-design specs covering #2 IB link flap, #7 dataloader hang, #8 NCCL timeout no-HW, #9 NCCL bootstrap timeout, #10 CUDA OOM deceptive allocator, #11 checkpointer hang, #12 loss spike NaN, #13 silent data corruption. Each carries the standard detector-design shape (symptom, layers, signal sources, evaluation rule, verdict attrs, edge cases, status, open questions) so the next contributor can write a TDD red test directly off the spec. Status: all 8 marked planned. #10 already has issue #303; the spec frames the design alongside. NORTHSTARS Appendix A gains a Spec column; docs/README + patterns README link the new specs. Signed-off-by: Tri Lam <tri@maydow.com>
5 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Ships pattern #13 (silent data corruption) end-to-end per the spec at `docs/patterns/13-silent-data-corruption.md` — the hardest pattern in the 15-set, where the run completes, loss looks normal, but downstream eval shows degraded model quality. - **Library**: `SilentDataCorruptionDetector` in `module/pkg/patterns/silent_data_corruption.go` consumes `EvalAccuracyRecord` + `SDCCounterRecord` typed projections and emits `SilentDataCorruptionVerdict`. Discriminator follows the spec's evaluation rule: - `accuracy_drop >= AccuracyDropThreshold` AND same-job `hw.gpu.sdc.*` counter rose during the job window → `kind=vendor_signaled` (full confidence) - `accuracy_drop >= AccuracyDropThreshold * 2` alone → `kind=accuracy_only` (partial confidence) - sub-2x-threshold without vendor signal → no verdict (conservative-by-default noise band) - **False-positive guards** (spec §"Edge cases"): dataset-checksum mismatch suppresses verdicts; checkpoint-step mismatch suppresses verdicts. Both guards are opt-in (only fire when both sides set the field). - **Wiring**: `patterndetectorprocessor` projects `gen_ai.training.eval_accuracy*` + `hw.gpu.sdc.*` log records, runs the detector, emits the verdict with promoted scalars per issue #270. - **Schema**: `module/pkg/patterns/testdata/silent_data_corruption_verdict.schema.json` pins the verdict wire format; 13 schema-drift falsifiers + conformance round-trip for both branches. - **Docs**: spec status flipped to shipped; `docs/patterns/README.md` table updated; `docs/ATTRIBUTES.md` registers the 13 new customer-stable attribute keys (`hw.gpu.sdc.{delta,kind}`, `gen_ai.training.eval_accuracy.*`, `gen_ai.training.eval_set.*`, `gen_ai.training.checkpoint.*`, `gen_ai.training.job.{start,end}_unix_nano`, `tracecore.alert.silent_data_corruption.*`) and adds the `silent_data_corruption` row to the per-pattern matrix. Verdict remains explicitly advisory per spec §"Detector evaluation rule" — both branches route the operator to a same-recipe + same-seed re-run on different hardware as the disambiguator, because SDC repro is non-deterministic. ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -count=1` — green (19 SDC library tests incl. 13 schema falsifiers, 7 wiring tests). - [x] `cd module && go vet ./...` — clean. - [x] `golangci-lint run ./...` — 0 issues (via pre-commit hook). - [x] `attribute-namespace-check` — 82/82 unique attribute literals documented (via pre-commit hook). - [x] Full module `go build ./...` — clean. ## Integration gaps (upstream-blocked, documented in-place) - `gen_ai.training.eval_accuracy` upstream framework instrumentation (spec §"Open questions" 5) — projection tested against hand-crafted records; real-world input arrives once the OTTL recipe lands. - `hw.gpu.sdc.*` OTTL recipe for DCGM SDC catcher / row-remap / AMD ECC counters — blocked on RFC-0014 PR-B (metrics→logs pattern). Cross-vendor support depends on `hw.gpu.sdc.*` semconv that doesn't exist yet (spec §"Open questions" 2). - Baseline-accuracy provenance + cross-run state are v0.3+ product questions (spec §"Open questions" 1, 4) — the detector accepts operator-stamped baselines today. ```release-notes Add pattern #13 (silent data corruption) detector. Surfaces suspected SDC when an eval-accuracy regression (`gen_ai.training.eval_accuracy` vs baseline) crosses the configured threshold; a same-job `hw.gpu.sdc.*` counter rise during the job window flips the verdict to high confidence (`kind=vendor_signaled`). The verdict is advisory — both branches recommend a same-recipe re-run on different hardware as the disambiguator. New config knobs: `sdc_accuracy_drop_threshold` (default `0.005`) and `sdc_accuracy_only_multiplier` (default `2.0`). ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but the join shape varies across patterns and the rationale lived only in code comments + PR review threads. Operators tuning windows had to read source per detector. Audit finding: five distinct shapes are load-bearing (chosen by the causal physics of each signal), not bugs: - One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect. - Asymmetric two-sided (#11): pre-stall covers concurrent-start checkpoints; post-stall covers OTTL-bridge logger latency. - Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window could be cause OR consequence. - Job-window bounded (#13): SDC counter rise must fall in the bounded eval-cycle's owning job; no operator knob is meaningful. - Trailing-window rate / freshness (#2 #4 #8): rolling window anchored at `now` or the most-recent record. Decision: document the existing reality, do not converge. Forcing every detector to the asymmetric two-knob form would silently zero one leg for the one-sided detectors (footgun on clock skew) and would not apply to #13 at all. Adds: - 'Why this correlation shape' section in docs/patterns/07, 11, 13 (the three shapes the issue called out by name). - 'Correlation-window semantics' table in docs/patterns/README.md covering ALL 11 detectors with the predicate, anchor, and shape rationale, plus cross-links to the per-pattern sections. No code changes; no detector behavior changes. Closes #367. Signed-off-by: Tri Lam <tri@maydow.com>
3 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Three v1 detectors use three different correlation-window shapes; the rationale lived only in code comments and review threads. Operators tuning windows hit these without warning. - pattern #7 (dataloader_hang): one-sided look-back — strict cause→symptom - pattern #11 (checkpointer_hang): asymmetric (-30s, +60s) — log/record race - pattern #13 (silent_data_corruption): symmetric job window — order-free attribution ## Decision Document the existing reality. Each shape matches a distinct physical event-ordering. A unified asymmetric two-knob form (issue suggestion) would silently zero one leg for #7 and not apply to #13 — operators would tune knobs with no physical meaning for their pattern. ## Changes - "Why this correlation shape" subsection in each of `docs/patterns/07-dataloader-hang.md`, `11-checkpointer-hang.md`, `13-silent-data-corruption.md` - "Correlation-window semantics" comparison table in `docs/patterns/README.md` cross-linking the three rationales - No code changes; no detector behavior changes ## Test plan - [x] `golangci-lint`, `go vet`, `attribute-namespace-check` (pre-push hook) — green - [x] Each pattern doc cross-links to the comparison table - [x] Each comparison-table row links back to the per-pattern rationale section Closes #367. ```release-notes docs: document correlation-window semantics for patterns #7, #11, #13 — three shapes (one-sided, asymmetric, symmetric) match distinct physical event-orderings; no behavior change. ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 2, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 2, 2026
Pre-fix, scripts/doc-check.sh:103 exited 0 whenever docs/FAILURE-MODES.md carried no Test*/Fuzz*/Benchmark* references (its current state on main, grep count = 0). The early-exit silently bypassed every gate below it: link-rot (.md + non-.md), banned-phrase lint, (unverified) baseline, reproducibility shell-syntax, maintainer- ship H2, M6 integration recipes (markers + 180d staleness), example placeholder lint, README integration index, required-doc presence, getting-started block cap, nps H3, FAILURE-MODES labels, and the comment-noise diff gate. Root cause: a57883f (#13) shipped doc-check.sh with one gate (the Go-test parity check), so `[ -z "$referenced" ] && exit 0` was correct. Subsequent PRs appended ~14 gates below without noticing they'd be dead-coded whenever FAILURE-MODES dropped its Test* refs. PR #459 worked around by placing the new YAML gate *above* line 99. Fix: scope the skip to the parity block only (if/else, not exit). Update the comment in the trust-posture preamble to record the bug. Surfaced + fixed: - docs/FOLLOWUPS.md: dead `followups/otlphttp.md` link. Shard never shipped on main; folded into the "deleted post-v0.2.0" prose. - docs/cut-criteria.yaml.md (2x) + install/kubernetes/tracecore/ README.md: "production-grade" banned-phrase hits reworded. - scripts/doc-check.sh: rescope getting-started.md block cap to the `## Walkthrough` H2 section only (M6 intent — quickstart commands, not Helm + air-gap installs that landed after the cap was set). Mutation-verified post-fix: - inject dead .md link → link-rot gate exits 1, names file + target - inject "blazing-fast" + "rock-solid" → banned-phrase gate exits 1 - remove tested-against marker from datadog.md → recipe gate exits 1 Gate count: 2 status lines pre-fix → 16 post-fix on clean tree. `make doc-check` exits 0 on clean tree. Closes #460. Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr
added a commit
that referenced
this pull request
Jun 2, 2026
Reviewer B flagged the PR body's "2 → 16" gate count as inaccurate and asked whether the test plan had been re-run against post-#453 state. Resolution recorded here: Gate count, empirically verified via `grep -c '^doc-check: '` on `make doc-check` output on a clean tree: Pre-fix on main (after #459 landed) 3 status lines Post-fix this PR (after rebase onto origin) 17 status lines The PR body's "2 → 16" was correct at PR open (before #459 merged). After rebasing onto origin/main, the YAML cross-link gate from #459 adds one more enforced gate (it was placed above the early-exit line specifically to work around the bug this PR fixes, so it ran on the pre-fix baseline too). The "14 gates below that line" framing in the release-notes block is correct for either base: it counts gates that the early-exit was hiding, which is invariant across the rebase. Test plan re-run post-rebase: `make doc-check` exits 0 on clean tree, 17 status lines emitted, no mutation regressions. Refresh the script header docstring to record the empirical count and reflect that doc-check now covers 17 drift patterns, not just the original Go-test-name parity check it shipped with in #13. No functional change to gate logic. Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr
added a commit
that referenced
this pull request
Jun 2, 2026
…460) (#466) ## Summary Closes #460. The `exit 0` on `scripts/doc-check.sh` ran unconditionally whenever `docs/FAILURE-MODES.md` carried no `Test*`/`Fuzz*`/`Benchmark*` identifiers (its current state on `main` — `grep -c` = 0), silently bypassing every gate below it. Fix scopes the skip to the Go-test parity block only (if/else, not `exit`), then surfaces and fixes the dead refs the gates were supposed to be catching. ## Root cause Commit a57883f (#13) shipped `doc-check.sh` with one gate — the Go-test name parity check — so `[ -z "$referenced" ] && exit 0` was correct then. PRs #28, #56, #115, #131, #144, #149, #195, #234, #241, #443, #455, #459 (and others) appended gates **below** that line without recognising they'd become dead code whenever `FAILURE-MODES.md` lost its `Test*` references. PR #459 worked around the bug by placing its new YAML gate *above* line 99 and tracked the root cause separately as #460. ## What surfaced Once `exit 0` was removed, three real issues fired: 1. **Dead `.md` link**: `docs/FOLLOWUPS.md` → `followups/otlphttp.md`. The shard was never committed to `main`'s ancestry. Folded into the existing "Shards deleted post-v0.2.0 as fully resolved-via-pivot" prose block (sibling treatment to M9, M14, M16). 2. **Banned-phrase hits** (3x `production-grade`): reworded in `docs/cut-criteria.yaml.md` (2x) and `install/kubernetes/tracecore/README.md` (1x) to falsifiable language. 3. **`docs/getting-started.md` block cap**: 7 fenced bash/sh blocks. The M6 cap of 5 was set for the quickstart only — `## Install via Helm` and `## Air-gapped install` are alternate deployment paths that landed post-M6 and aren't part of the quickstart budget. Rescoped the gate to count blocks inside the `## Walkthrough` H2 section only (1 block, well under cap). ## Gate count Empirically verified via `grep -c '^doc-check: '` on `make doc-check` output on a clean tree: | State | Status lines emitted | Gates the early-exit was hiding | |---|---|---| | Pre-fix on `main` (post-#459) | 3 (trust-posture, YAML cross-link, parity-skip) | 14 | | Post-fix this PR (post-rebase) | 17 | 0 | The "14 gates hidden" number is invariant across the rebase: it counts gates placed below the early-exit line. The "3 → 17" total reflects post-#459 reality on `main`; pre-#459 baseline was "2 → 16" (the figure originally in this PR body), and #459 itself worked around the bug by placing its YAML gate above line 99. ## Mutation tests Each gate below the original early-exit was confirmed to fire post-fix: | Mutation | Gate expected to fire | Exit code post-mutation | Exit code post-restore | |---|---|---|---| | Inject `[bad](nonexistent-ghost.md)` into `docs/FOLLOWUPS.md` | markdown link-rot | 1 | 0 | | Append `blazing-fast` + `rock-solid` to `docs/getting-started.md` | banned-phrase lint | 1 | 0 | | Delete `<!-- tested-against: ... -->` from `docs/integrations/datadog.md` | M6 recipe markers | 1 | 0 | ## Test plan - [x] `make doc-check` exits 0 on clean tree (re-run post-rebase onto origin/main; 17 status lines) - [x] 3 mutation tests above each toggle exit 1 → 0 across mutate / restore - [x] Pre-push hooks green: golangci-lint (0 issues), `go vet ./...`, `go mod verify`, `attribute-namespace-check` (100 attrs, all documented), `register-lint`, `actionlint`, `zizmor`, `deprecation-check`, `no-autoupdate-check` - [x] Rebased onto current `origin/main` (includes #459, #461, #462, #456); no conflicts; gate count re-verified empirically post-rebase - [x] No changes to gates above line 99 (the trust-posture callout + YAML cross-link gate from #459 still run and emit unchanged status lines) ## Self-grade **A+** — root cause named in commit body (a57883f #13 with one gate; gates appended below without exit-path awareness); 3 mutation tests (success criteria required 1–2); rescoped the getting-started gate to match M6 intent rather than papering over the surfaced overflow; the `[ -z "$referenced" ]` legitimate skip is preserved via if/else (not `:` no-op, which would have left the `defined=` / `orphans=` block running on empty input); gate count corrected empirically post-rebase per reviewer B feedback. ```release-notes - fix(ci): `scripts/doc-check.sh` no longer exits 0 at the Go-test parity gate when `docs/FAILURE-MODES.md` carries no `Test*` references. 14 gates below that line (link-rot, banned-phrase, M6 recipe markers, etc.) are now actually enforced on every `make doc-check` invocation. Closes #460. ``` --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
5 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 4, 2026
## Summary Wave-end audit flagged the patterndetectorprocessor fanout site as an unmet refactor: `ConsumeLogs` hand-rolled dispatch for every shipped detector (12 today: 7 inline + 5 wrapped), so adding pattern #13 required editing the fanout body — not registering a new entry. Past the rule-of-three by 9x. This PR introduces a minimal Detector registry seam: - `module/pkg/patterns/detector.go`: new `Detector` interface (`PatternID() string`) + `Registered` slice that pins all 12 detector pointers. Each `*Detector` struct gets a one-line `PatternID()` method. - `module/pkg/patterns/detector_test.go`: `TestRegistered_PinsAllPatterns` (exact PatternID set + count), `TestRegistered_UniquePatternIDs`, `TestRegistered_NonEmptyPatternIDs`. Drift gate — accidental drops fail in CI. - `patterndetector.go`: introduces `detectorRunners []detectorRunner` closure list iterated by `ConsumeLogs`. `ConsumeLogs` body drops from ~77 lines to 12. Adding pattern #13 = one append to `Registered` + one append to `detectorRunners`, no fanout-site edit. ### Design decision: metadata-only interface The `Detector` interface is intentionally `PatternID() string` only — not a uniform `Evaluate` method. Each detector's Evaluate signature is intrinsically heterogeneous (different input record shapes — events+nodeConds, ncclRecs, xidRecs+events, etc. — and different verdict types). A uniform Evaluate would force a lossy `any`-typed contract that the typed test suite has been fighting for 12 patterns. The closure-per-detector approach keeps the typed Evaluate calls at their concrete-runner sites while letting the registry pin identity + iteration. ### Behavior preservation - Same telemetry vocabulary: PodEvicted and IBLinkFlap still `IncVerdict` with `string(v.Confidence)` (they gate on partial); the other 5 inline detectors still pass `""`. The 5 wrapped runners still tick inside their own helpers (unchanged). - Same emission order: `detectorRunners` is declared in the legacy emission order. - Same partial-confidence gating: `emitPodEvicted` and `emitIBLinkFlap` preserve the `!emitPartial` skip. ### Test plan - [x] `cd module && go build ./...` clean - [x] `cd module && go test ./pkg/patterns/` green (incl. 3 new pin tests) - [x] `cd module && go test ./processor/patterndetectorprocessor/` green except pre-existing #497 (`TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressure`, fixed in Lane J) - [x] `make lint` clean (0 issues) - [x] `make vet`, `go mod verify`, attribute-namespace-check all green (pre-push hook) ### LoC delta - +321 / -79 across 3 files. - `ConsumeLogs` body: 77 → 12 lines. - Growth is in: registry plumbing (164 lines, mostly comments + the pin tests), runner closures (one per detector). The seam earns its bytes — adding pattern #N is now O(append) instead of O(edit-fanout). ### Closes-the-loop Closes wave-end-audit next-wave item #2 (pattern registry seam). ```release-notes - refactor(patterns): introduce `patterns.Detector` interface + `patterns.Registered` slice. The patterndetectorprocessor now iterates a registry-driven runner list instead of hand-rolled fanout — adding a new pattern is one append, not a processor edit. Behavior-preserving; no operator-facing change. ``` --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
43 atomic commits closing M1.5's self-review compromises, adding pre-M2 hardening, processing three rounds of external code review, and tracking everything we deferred so nothing leaks back into chat history.
Major surfaces
make ciQuality gates (CI now enforces)
edd54a5docs/STYLE-errors.md+ tree-wide audit +splitNamestutter fix25d7675048f7b0make coverage-check(internal/* ≥70%, components/* ≥60%) + runtimeStart-after-Shutdownfix85b6080ParsePipelineID,Load,splitName34e61a393bd8ccgovulncheckinmake ciceb6b30gocyclo -over 15gate;buildSignalPipelinesplit into helpers2a1dcaegoleakin TestMain + fixed leaked signal-handler goroutine78c6508make doc-check(asserts every test name referenced in rot-prone docs exists)Pipeline runtime + contract
3226a4epipeline.NewRuntime(pipelines, opts...)replacesSettingsstruct420b788WrapSafeMetrics/Traces/Logsat every cross-stage Consume hopc574f4dbuildXPipeline→ 1×buildSignalPipeline[C](-109 LOC)ad15390genericFanout[T,C](coverage 70%→100%)f4966d3c7dc11asignalOpscreate+assert closuresa93fc25sync.OnceValuesintegration-test binary cachef39aedahandleShutdownSignalsextracted + unit-tested8be84a7ErrSignalUnsupported→ErrSignalNotSupported; multi-instance regression test5643a4aHost.ReportStatusdeprecation + receiver defer/recover example test37e1dc0internal/pipelinebuilder(sibling, avoids cycle)Operator UX
7f0ae78tracecore validatesubcommand; exit 2 on bad config2be6ea0--helpexamples + exit-code table;tracecore vX.Y (sha=..., go=..., built=..., platform=...)0ddcdea--version-shortfor scripting; empty-pipeline boot now WARN-visible3ffb27dvalidate→collectroundtrip,--versionend-to-end55c7c3cDocumentation
edd54a5docs/STYLE-errors.md(≤80 lines)5e26f55edc1826a3b7a66622f2e8Factory+NewFactory()bridge)e87681058621fcdocs/FAILURE-MODES.mdinventory (handled / bounded / not-handled, every row linked to a test)6359fd7docs/STRATEGY.mdResolution column + registry-of-registries write-downCode review closures (3 rounds)
189f85d492839372b10a1a66354578c6508Real bugs found and fixed
048f7b0):Startracing withShutdownwould launch components nobody would tear down. Coverage-instrumentation timing surfaced it.2a1dcae):handleShutdownSignalsblocked on<-sigChforever afterrun()returned. goleak surfaced it.splitNamestutter (edd54a5): operator's bad input appeared twice in error messages.Future work — tracked, not lost
MILESTONES.md— 5 milestone-tracked "Carry-forward from M1.6:" bullets (M3 cross-platform CI, M4b chaos-light, M5 benchstat + sustained-load, M6 finalize FAILURE-MODES + nps + backend recipes + doc-check, M8 receiver ergo review + network failures).docs/FOLLOWUPS.md— 8 open items without fixed milestones (opportunistic + trigger-based) + 3 explicitly considered-and-skipped items with reasoning (DRY wrap helper,signalConsumergeneric constraint, inode-based cache check). Pinned "Next up":make doc-check✓ already shipped this round; squash-mergeAssisted-by:verification (observable only post-merge).Linked issue(s)
Refs internal M1 milestone (post-RFC-0003 / RFC-0004 hardening). N/A — no GitHub issue.
Release notes
Checklist
make cipasses locallygit commit -s)