Skip to content

Add NORTHSTARS, RFC-0002, and Q1 MILESTONES#3

Merged
trilamsr merged 3 commits into
mainfrom
docs/q1-planning
May 13, 2026
Merged

Add NORTHSTARS, RFC-0002, and Q1 MILESTONES#3
trilamsr merged 3 commits into
mainfrom
docs/q1-planning

Conversation

@trilamsr

@trilamsr trilamsr commented May 13, 2026

Copy link
Copy Markdown
Contributor

What this PR does

Adds three project-planning artifacts that together set the goals tracecore optimizes for and what we commit to ship this quarter:

  • RFC-0002 closes RFC 0001's open question on collector binary architecture: own binary, with receivers structured to be upstreamable to opentelemetry-collector-contrib later.
  • NORTHSTARS.md names seven strategic objectives, hero KPIs, operating rules, inter-objective coupling, what we explicitly do not chase, and the 15 named root-cause patterns (Appendix A) the collector is built to root-cause end-to-end.
  • MILESTONES.md is the Q1 commitment list — 23 milestones across foundation, receivers, and outcomes; ≥80% target hit-rate; in-repo scope only. Standards-body and external community work are NORTHSTARS cadences, not Q1 milestones.

Reviewer test plan:

  • Read RFC-0002; confirm the architectural call matches direction.
  • Read NORTHSTARS.md; verify the seven objectives, KPIs, and operating rules.
  • Read MILESTONES.md; verify acceptance criteria are falsifiable.
  • Confirm cross-references resolve (NORTHSTARS → RFC-0002; MILESTONES → NORTHSTARS).

Linked issue(s)

N/A — initial planning artifacts; no linked issue.

Release notes

NONE

Checklist

  • Tests added or updated — N/A (docs-only PR)
  • make ci passes locally — N/A (docs-only PR, no Go changes)
  • Commits are signed off (git commit -s)
  • For new components, follows the layout required by STYLE.md — N/A (no new components)

trilamsr added 3 commits May 13, 2026 02:44
Closes RFC 0001's open question on whether to import
go.opentelemetry.io/collector/component directly or ship a
standalone collector binary.

Commits to: tracecore is its own Go binary. Receivers live
in this repo and are structured so they could be upstreamed
to opentelemetry-collector-contrib later, but the primary
distribution is our own build.

Why own-binary over upstream-extension:
- Trust narrative requires we own the build chain end-to-end
  (PRINCIPLES.md §1).
- Release cadence is faster than upstream's; tracecore needs
  point-release cadence aligned with NCCL/PyTorch upstream
  changes, not OTel collector cadence.
- Convenience-first defaults for training observability can't
  live in a general-purpose collector.
- Vendor SDK isolation policy (STYLE.md §Vendor SDK isolation)
  needs per-receiver build-tags that upstream's contribution
  model doesn't accommodate asymmetrically.

Reversibility preserved by mirroring upstream's component
interface shape in internal/pipeline; migration to upstream's
contract is mechanical if we ever reverse.

Three open follow-on questions deferred to later RFCs:
when individual receivers get upstreamed, our pdata-import
exposure, and whether general (non-training) receivers ever
belong here.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
A goals doc that complements PRINCIPLES.md (why) and STYLE.md
(what conventions). Names the northstar, seven strategic
objectives with hero KPIs, supporting KPIs, operating rules,
and named caveats.

The northstar: make tracecore the default OSS collector running
on every distributed AI training cluster — the most convenient,
lowest-friction way to get cross-vendor training observability.

Seven objectives:
- O1 Coverage — root-cause-pattern coverage (15 named patterns
  enumerated in Appendix A)
- O2 Convenience & quality — install-to-first-data on a named
  reference cluster
- O3 Trust & supply chain — SLSA Build Level achieved and
  maintained
- O4 Standards — external implementations of gen_ai.training.*
  OpenTelemetry semantic conventions
- O5 Distribution & community — unique organizations in
  production
- O6 Velocity — quarterly ship-commitment hit-rate
- O7 Project governance — non-employee maintainers active

Each objective has one accountable owner role, hero KPI,
supporting KPIs, operating rules, and named caveats. The doc
also enumerates inter-objective coupling, what we explicitly
do not chase (closed-source synthesis, inference observability,
GPU cost dashboards, experiment tracking, auto-phone-home), and
the open questions tracked as RFCs.

Goals don't grow with the codebase; KPIs do.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
The quarterly ship-commitment artifact per NORTHSTARS.md §O6.
23 milestones across foundation, receivers, and outcomes;
target hit-rate ≥80%; in-repo scope only.

Foundation (Weeks 1-4): pipeline runtime, self-telemetry,
reproducible-build CI, lint/test harness, failure-injection
harness, benchmark harness, Helm chart skeleton, docs scaffold.

Receivers (Weeks 3-10): DCGM, dmesg/journald, k8s events, NCCL
FlightRecorder (with a safe pickle parser that refuses code
execution), NCCL Inspector, py-spy continuous, PyTorch Kineto,
container stdout, Kueue scheduler.

Outcomes (Weeks 5-12): three root-cause patterns covered
end-to-end (pod-evicted → stragglers → NVLink degradation;
staggered by data-injection difficulty), staged reference-
cluster install benchmark (Helm install → first DCGM datapoint
→ ≤5 min median), first v0.1.0 release with a public
failure-trace issue template so operators can contribute
regression-test fixtures.

Scope is in-repo work only. Standards-body engagement (O4) and
external community outreach are recurring cadences in
NORTHSTARS, not Q1 milestones — they don't produce commits to
this repo.

Misses get documented with reason and re-planned in the
end-of-quarter retro, not silently dropped. Hits and misses
both remain in MILESTONES.md as the public record.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
@github-advanced-security

Copy link
Copy Markdown

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

@trilamsr trilamsr merged commit 5df282c into main May 13, 2026
9 of 11 checks passed
@trilamsr trilamsr deleted the docs/q1-planning branch May 13, 2026 10:09
trilamsr added a commit that referenced this pull request May 14, 2026
Pushes toward the prompt's self-eval gate by closing every gap
that does NOT genuinely require Linux + libdcgm at build time.

- metrics.go::fieldEmitters grows from 6 to all 13 metric
  families: + hw.gpu.io (PCIe Tx/Rx), hw.energy, hw.gpu.nvlink.io
  (per-link Tx/Rx), hw.gpu.clock.frequency (sm/memory/video
  domains), hw.gpu.xid.errors. ECC aggregate counters keep their
  dedicated drop tier. The receiver-side pipeline is now complete
  for every metric in the README's design table; only the SOURCE
  of samples remains gated on the cgo client.

- pkg/dcgm/types.go: new well-known FieldID constants for SM /
  memory / video clock (100/101/102), NVLink L0 Tx/Rx
  (1040/1041), throttle reasons bitmask (112). RHS will switch to
  go-dcgm constants when client_cgo.go lands.

- components/receivers/dcgm/integration_hardware_test.go:
  //go:build dcgm,hardware skeleton. Skips with a clear reason
  when DCGM is unreachable; runs end-to-end against a real GPU on
  a Linux host where both build tags are active. Hardware
  reviewers have the test to fill in; macOS CI doesn't run it.

- emit_bench_test.go: BenchmarkEmit_TypicalScrape pins the
  per-scrape cost at 37 microseconds for 8 GPUs x 12 fields.
  At 15s collection_interval that's 0.00025 percent CPU --
  three orders of magnitude under the 0.05% O2 budget.

- resetSession() helper extracted from ensureConnected + scrape
  so the connection-loss state-reset doesn't drift between two
  call sites. Closes Loop-4 P3 nit on duplicated reset logic.

- docs/agents/RECEIVER-PATTERNS.md: new "Pattern selection" table
  -- five source-type rows with constructor / lifecycle / pattern
  reference per row -- so M9 (streaming/subprocess), M10 (failure-
  triggered), M11 (vendor-SDK like dcgm) authors know which shape
  fits their work. Closes Loop-4 P3 question on the doc gap.

- FOLLOWUPS.md created at repo root (was referenced repeatedly,
  never written): 4 opportunistic items, 4 "considered and
  explicitly skipped" items with Revisit-if predicates.

- README.md: metric table no longer split into "emitted vs
  deferred" -- the table is the truth, and a single paragraph
  notes that the data SOURCE waits on the cgo client.

- Smoke-tested the binary: `tracecore collect --config=
  example_config.yaml` boots, logs "dcgm receiver started",
  attempts Connect, fails with "dcgm: SDK unavailable", enters
  degraded mode (reason=init), shuts down within the 1s budget.
  End-to-end happy path verified on this host.

Self-eval criterion #3 (metric set) lifts from 3 to 4. Criterion
#2 (cgo wrapper) stays at 3 because client_cgo.go itself is the
only remaining gate -- a Linux GPU host is required to compile
the cgo bindings. The MILESTONES Carry-forward bullet commits to
that work.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Address item #3: `syncBuffer` was duplicated in cmd/tracecore +
internal/telemetry, and `failingSink` lived only in clockreceiver
even though M8/M9 will need the same downstream-error stub.
Hoisted into `internal/pipeline/pipelinetest`:

  pipelinetest.SyncBuffer            (was syncBuffer ×2)
  pipelinetest.RecordingMetricsSink  (new — for capturing pushes)
  pipelinetest.FailingMetricsSink    (was failingSink ×1)

Migrations:
- cmd/tracecore/integration_test.go: drops 19-line syncBuffer def,
  type alias to pipelinetest.SyncBuffer (call sites unchanged).
- internal/telemetry/server_test.go: same alias pattern.
- components/receivers/clockreceiver/errors_integration_test.go:
  drops failingSink, uses pipelinetest.FailingMetricsSink{Err: ...}.

clockreceiver_test.go's metricsSink intentionally not migrated —
it's package-internal, tightly coupled to a few test assertions,
and migration cost exceeds value for this PR. M8/M9 + future
receivers reach for pipelinetest.RecordingMetricsSink instead.

Coverage: internal/pipeline/pipelinetest 77.3% → 86.8% via new
helper tests. Other packages unchanged. make ci clean.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Closes 4 of 7 new A+ criteria from the recursive self-review:

#1 — e2e-otelcontrib now verifies the collector PARSED the record,
not just that it accepted bytes. Workflow rewritten to docker-run
otelcol-contrib with a custom config (file + debug exporters,
detailed verbosity). After the e2e POST, the bash step greps
/tmp/otelout/logs.json for the canonical body, the kernelevents.xid
attribute, and the gpu.id attribute. Empty file or missing
attributes → workflow fails.

#2 — TestIntegration_KmsgWriteReadBehavioral (//go:build linux)
writes a synthetic <6>NVRM Xid 79 line to /dev/kmsg, uses a marker
string in a regex_filter to isolate from ring-buffer noise, then
asserts the receiver emits a plog.LogRecord with kernelevents.xid=79
+ gpu.id=0000:65:00.0 within 3s. A regression in parse/build/emit
fails this on Linux CI.

#3 — prometheus_alerts_test.go validates the alert YAML structure
(every group has interval, every rule has expr/severity/summary/
description) AND cross-references the metric + label-filter names
against the receiver's actual SelfTelemetry surface. A typo in
the alert would silently never fire; this catches it before merge.

#5 — runbook_test.go executes the RUNBOOK's "First 15 minutes"
step 1 (`tracecore validate --config=...`) and step 2
(`tracecore debug dump`) as real commands. Documentation rot
becomes a test failure, not a silent SRE-time discovery.

#4 — sustained_test.go (`//go:build sustained`) feeds 1000
events/sec for 5 minutes (300k records), samples heap every 30s,
asserts ≤10 MiB growth and p99 emit latency tail bounded. New
`sustained-load` workflow job runs it on push-to-main + schedule
(not PR — 5 minutes is too slow for the inner loop).

The seventh criterion (two-week soak + external operator) requires
elapsed time + a human; nothing in-session can close it.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Two independent reviews of PR #18 surfaced a stack of blockers,
strong findings, and quality lifts. This commit lands the tractable
items (defers documented in docs/FOLLOWUPS.md M8 section).

Operator-visible drift (closes Reviewer 2 #13-#17):
 - RUNBOOK kind-triage row `consume` → `downstream` (last commit
   renamed the kind but missed the doc table).
 - README config table: `initial_delay` range updated from "0..
   collection_interval" to "≥ 0" (code already relaxed for DGX
   cold-starts; doc was stale).
 - prometheus-alerts: DCGMReceiverHighErrorRate threshold changed
   from `rate > 0.1/sec` (unreachable: 15s default tick caps at
   0.067/sec) to `increase > 5 in 5m`. Stale "M2 has not landed"
   caveat removed.
 - example_config: single `mode:` line with a clarifying comment
   (was showing both `mode: standalone` and `# mode: embedded`,
   inviting operators to uncomment both → YAML duplicate-key).
 - cmd/tracecore receivers list now prints `dcgm [stub]` / `dcgm
   [cgo]` so operators can verify deploy shape without reading
   go.mod. Build-tag-conditional via three small files in
   cmd/tracecore (receiver_variants*.go) — pattern extends to M11
   NVML.

Correctness bugs (closes Reviewer 2 #2, #3, #4, #6, #7):
 - receiver.Shutdown: `r.running.CompareAndSwap(true, false)`
   gates teardown so a second Shutdown is a no-op (cgo libdcgm
   `dcgmShutdown` is not documented idempotent). Same CAS provides
   the happens-before for `r.cancel` publish that Pass-1 flagged.
 - receiver.ensureWatched: zero-entities path now emits
   IncError(KindEnumerate) + degraded rather than returning true.
   Without this, a misconfigured host (no GPUs visible, ACL blocks
   /dev/nvidia*) had the receiver looking healthy while emitting
   nothing.
 - receiver: new `warnOnce` helper gates the 7 per-tick failure-
   path Warn logs to fire only on the first failure after recovery.
   Closes the log-storm bug (4 errors/min × 60 min = 240 lines).
   Counter (`receiver_errors_total`) still ticks every failure.
 - metrics.applyCardinalityCap: parameter `cap` → `maxSeries`
   (cap shadowed the Go builtin).

Quality / contract lifts:
 - metrics.emit now returns a `stale` count for StatusStale and
   StatusError samples. pushSamples calls IncError(KindRead) once
   per tick when stale > 0 — surfaces DCGM serving slow/faulty
   data, which is precisely what StatusStale exists for. Per-tick
   not per-sample so GPU count doesn't inflate the rate.
   (StatusNoData and StatusFieldNotSupported still silent.)
 - docs_parity_test.go: new TestRUNBOOK_KindsMatchEmitted walks
   every emitted IncError/failedTick kind against the RUNBOOK
   per-kind triage table in both directions. This is the
   structural fix for the bug class — RUNBOOK can never again
   drift from emitted kinds without CI failure.
 - receiver.go: promoted `watchUpdateDivisor` /
   `watchKeepForMultiplier` / `watchUpdateEveryMinimum` constants
   for the previously-magic DCGM watch-cadence ratios.

Documentation + dedup:
 - dcgm README: new "Privacy + data residency considerations"
   subsection (compliance-auditor ask). Flags hw.id / pci.bdf /
   NVLink peer IDs as quasi-identifying; provides two mitigation
   patterns (attr-drop processor, salt-hash pseudonymization).
 - docs/agents/examples/constructor_options.go: `WithTelemetry`
   renamed to `WithSelfTelemetry` to match the real in-tree
   receiver API. M9+ authors copying the example no longer drift.
 - RUNBOOK kind enumeration line restored with both watch and mig
   (dcgm-local kinds) per the last commit's promotion.
 - Repo-root `FOLLOWUPS.md` consolidated into `docs/FOLLOWUPS.md`
   (M8-opportunistic + M8-skipped sections). Single source of
   truth; 17 deferred items pulled forward with falsifiable
   triggers (cgo client landing, M11 sibling-receiver shape,
   operator-report thresholds, file-size triggers).
 - All bare `FOLLOWUPS.md` references updated to
   `docs/FOLLOWUPS.md`.

Honest pushback documented:
 - I disagree with the M8-AGRADE-GAP claim of Operator UX 3.7→4.0.
   The drift findings above are exactly the class of bugs that
   rubric criterion was supposed to prevent — the alerts-vs-RUNBOOK
   parity test existed but didn't check kind values against
   emitted call sites. The new TestRUNBOOK_KindsMatchEmitted
   closes that gap; future operator-UX claims should pin to a test
   like this.
 - Deferred: split receiver.go (475 LOC) into 3 files, hoist
   dcgmtest.BaseClient, SECURITY.md for receiver, dcgm_info
   join-target, libdcgm setup in CONTRIBUTING — all logged with
   triggers. Reviewer 1's "construct receiver via M9-style
   primary-Option" inconsistency goes in the queue for M9-close
   review.

`make ci` passes; dcgm coverage 86.0% (essentially flat — new
tests offset by widened emit signature in test paths).

Assisted-by: Claude Opus 4.7
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Closes the R1 #3 / S4 gap: README + RUNBOOK reference deploying
the receiver as a DaemonSet multiple times, but no manifest
shipped. Operators landing on the docs had to derive the
hostPath + CAP_SYSLOG + securityContext shape from prose.

components/receivers/kernelevents/example-daemonset.yaml:
- ConfigMap with a minimal `receivers: kernelevents` config and
  `service.pipelines.logs/kernel.receivers: [kernelevents]`
  (operator-empty `exporters: []` so the file is paste-and-run
  without forcing an exporter dependency).
- DaemonSet with the exact securityContext shape the README's
  Container realities section prescribes: runAsNonRoot,
  drop ALL caps, add SYSLOG, readOnlyRootFilesystem, seccomp
  RuntimeDefault.
- hostPath mount of /dev/kmsg with `type: CharDevice` (validates
  on apply; without the type, a `/dev/kmsg` that's missing on
  the node would create silently as a regular file).
- updateStrategy rollingUpdate maxUnavailable: 1 — one pod at a
  time keeps kmsg coverage continuous across the fleet.
- hostPID: false is required for the kernel namespaced-kmsg
  feature; we mount /dev/kmsg explicitly so the receiver still
  sees host-wide events.
- tolerations: operator-Exists so the DaemonSet runs everywhere
  including the control plane; commented out priorityClass and
  the journald-host-mount block so they're discoverable but not
  forced.
- automountServiceAccountToken: false because the receiver
  doesn't talk to the k8s API.

Validated locally with `kubectl apply --dry-run=client -f`
against kubernetes 1.28+. NOT yet validated on a live cluster —
operators should also run `kubectl apply --dry-run=server`
first to catch policy-engine (Kyverno, Gatekeeper, PSA)
objections their fleet may have.

Cross-references:
- README "Container realities" section points readers here.
- CHANGELOG row remains accurate (no change needed).

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Test tag #3 confirmed the reproducibility fix lands: both build digests
matched (f744dd584b5e…). The step still tripped because
diffoscope-minimal in ubuntu-latest's apt repo doesn't recognise
--exit-code; the assertion was returning 2 (argparse error) instead of
0. The default exit status (0 = no diff, 1 = diff) carries the same
signal, so just drop the flag.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Applying phase-4 reviewer findings on the skill at
.claude/skills/pr-review-loop/SKILL.md. This commit is a test
invocation of the skill's own protocol on the PR shipping the skill.

Findings addressed:

- BLOCKER P1.1: branch precondition `feat/m<X>-*` rejected this very
  branch (worktree-docs+loops-pr-review). Skill self-failed on its
  own PR. Widened to any non-main branch with optional milestone
  parsing; added `(no-milestone)` fallback when the branch doesn't
  match the feature-branch convention. SKILL.md lines 26-32, 56-63.

- BLOCKER P1.2: PR #33 body described the deleted doc artifact, not
  the shipped skill. Per feedback_update_pr_body_when_premise_breaks.
  Updated via gh pr edit (separate from this commit).

- CONCERN P1.3: skill cited `MILESTONES.md §M<X>` but the actual
  heading form is `### M<X>.` (zero §M references in MILESTONES.md
  body). Reviewer subagents told to read §M8 would fail to anchor.
  6 occurrences updated to `MILESTONES.md "### M<X>."`.

- CONCERN P1.4: single-quoted heredoc makes placeholder resolution
  the agent's responsibility, but skill didn't say so explicitly.
  Added clarifier in auto-detect block (SKILL.md:55-63).

- REJECTED: reviewer-1 claim that `/docs/loops/` is still in
  .gitignore. False positive — verified absent via
  `grep 'docs/loops' .gitignore` returning no match. Was removed in
  commit a7fcc82. Per feedback_audit_finding_triage.

- DEFERRED to FOLLOWUPS:
  - AI-vocab grep gate exclusion for .claude/skills/ vs actual CI
    (no enforced CI gate found; convention-only)
  - 5-phase rigidity vs feedback_narrow_pr_scope (2-cycle cap) and
    feedback_anti_bureaucracy — needs adaptive-phase design
  - TDD escape-hatch for doc/comment-only changes — design tradeoff
  - .claude/ralph-loop.local.md scaffolding / gitignore policy
  - --max-iterations 10 citation against ralph-loop plugin docs
  - Skill description vocab leak into Claude Code skill picker UI

Edge cases enumerated (phase-1 protocol requires >=5):
  1. Branch with no open PR yet — skill's precondition #3 catches
     this; aborts with "no open PR." Acceptable behavior.
  2. Branch with multiple PRs — gh pr view defaults to first.
     Documented behavior. No action.
  3. Detached HEAD — `git branch --show-current` returns empty;
     precondition catches via empty-branch check.
  4. First-time invocation, no .claude/ralph-loop.local.md — agent
     creates on first append. Acceptable.
  5. Re-invocation after partial completion (some [review-pass-N-*]
     commits exist) — agent counts existing commits and picks up at
     next phase. Acceptable; relies on agent counting.

Hard proof for each applied change: git diff shows the line-level
changes; before/after grep verifies §M, branch-regex, placeholder
clarifier.

This is phase 1 of 5. Subsequent phases (stakeholder lenses,
adversarial, A+ aspiration, simplification) operate on the now-
updated skill.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…+ threat-root trace on go-mod-verify

Phase-4 A+ aspiration review (2 fresh subagents; both graded A,
diverged on which gates to apply). Validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P4.1 (aplus-1 #2, also P2.6) | aplus | operator | CONCERN | A workflow_dispatch run with `inputs.tag` set but `github.ref` ≠ refs/tags/$INPUT_TAG passes Build and fails the OIDC smoke check 15-30 minutes later. Operator wastes runner time and sees the misuse late. | New "Verify dispatch ref matches tag (pre-flight)" step exit-1s within seconds with the documented workaround. | Reviewer noted the smoke check already enforces this — but at job-end, not at job-start. Fail-fast IS the load-bearing property. | n/a — workflow YAML, actionlint clean | yes — P4-aplus-1 | applied this commit; closes P2.6 deferral. |
| P4.4 (aplus-2 #2) | aplus | repo-long-term | NIT | go-mod-verify comment says "defense in depth against a compromised GOPROXY mirror" but doesn't name the trust root or the orthogonal threat (a poisoned go.sum itself). | Comment now states "Trust root: the go.sum at this tag commit" and cross-references the tag-protection FOLLOWUPS entry. | A future maintainer might over-attribute the protection. | n/a | no | applied this commit |

Rejected/deferred:

- P4.2 (aplus-1 #4) — Structured diff lint for release.yml ↔
  docs/reproducibility.md. DEFER to FOLLOWUPS.md (real value, but
  manual review caught both drift directions in Phases 2 + 3;
  automate when next edit happens).
- P4.3 (aplus-1 #6) — Release artifact manifest validation before
  upload. REJECT. Per anti-bureaucracy: reviewer concedes `needs:`
  dependency already gates malformed artifacts from reaching the
  release job. Adding defensive validation against a CI-bug
  scenario is bloat.
- P4.5 (aplus-1 #3) — docs/SUPPLY-CHAIN-IDENTITY.md consolidated
  reference. DEFER to FOLLOWUPS.md; ~30-min write, scope creep
  beyond release.yml. M21 release-checklist is the natural trigger.
- P4.6 (aplus-1 #5, aplus-2 #3) — Formal threat-model document +
  M21 alignment narrative. DEFER to M21.
- P4.7 (aplus-2 #5) — Cross-link health lint. Duplicate of P4.2;
  same deferral.

Reproducibility:
  $ make actionlint zizmor   # exit 0
  $ grep -A1 "workflow_dispatch with inputs.tag" .github/workflows/release.yml
    # pre-flight gate present

Letter-grade outcome:
  Reviewer #1 starting: A → A+ via criteria 2, 4, 6 (we applied 2 + threat-model comment)
  Reviewer #2 starting: A → APPROVED-AS-IS (already strong)
  After this commit: A on the falsifiable axis (one operator-UX gate
  + one comment clarification), with the broader doc/lint work
  scoped to follow-ups.

Beneficiary: operator. The pre-flight gate cites a specific
operator-facing surface (15-30 minute waste on workflow_dispatch
misuse) and turns it into a seconds-fast named error.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…IT applied; 13 deferred

8 parallel stakeholder-lens reviews against the design-locked draft.
6/8 lenses returned CONCERNS-REQUIRE-FIX; 1 returned BLOCKER-REQUIRES-RESOLUTION (researcher).

Pushback (severity-prefixed, applied inline unless noted):

BLOCKER:
- P2.R-1 (researcher) — §Attribute namespace's blanket "gen_ai-only, no dual-emit" contradicts MILESTONES.md M15 rubric L360 + M18 rubric L481, both of which compile-time-pin tracecore.training.data_time_s / iter_time_s (M18 via ErrStragglerInputMissing in internal/synthesis/patterns/straggler.go). APPLIED: wire-attribute table carves the two timing attributes as a documented exception until upstream proposal lands, then sibling cross-receiver PR.

CONCERN (multi-lens — escalated):
- P2.multi-2 (sre,adopter,security) — /var/log/containers mount + fsGroup + runAsNonRoot missing per M15 rubric L373. DEFERRED to FOLLOWUPS (§Pod manifest subsection in Phase-2 PR).
- P2.multi-3 (operator,sre,adopter) — KindCardinality covers both fingerprint and attribution LRUs; on-call can't distinguish. APPLIED: split into KindFingerprintCardinality + KindAttributionCardinality; renamed config to `attribution.lru_cap` vs `egress_rate_limit.lru_cap`.
- P2.multi-4 (contributor,maintainer,adopter,researcher) — "makes funded" ungrammatical. APPLIED: replaced with "the drafted upstream proposal already absorbs".
- P2.multi-6 (contributor,adopter) — dataloader_regex shipped as default without marking placeholder per rubric L360 "operators MUST override". APPLIED: inline comment block.
- P2.multi-7 (operator,sre) — RUNBOOK.md + prometheus-alerts.example.yaml + FAILURE-MODES.md rows not named as Phase-4 deliverables. DEFERRED.

CONCERN (single-lens, applied):
- P2.R-2 (researcher) — HintEvicted does not compile; actual constant is k8sevents.HintPodEvicted at hint.go:24. APPLIED: typo fixed + evictionMatchWindow defined as eviction_match_window default 5s in Configuration surface.
- P2.R-4 (researcher) — JobID source-priority chain missing. APPLIED: JOB_ID → TORCHELASTIC_RUN_ID → SLURM_JOB_ID → k8s label fallback (per rubric L358 + research §13.13).
- P2.sre-1 — ClusterRole named tracecore-containerstdout-clusterrole breaks k8sevents convention. APPLIED: renamed to tracecore-containerstdout (no -clusterrole suffix).
- P2.contrib-1 — SchemaURLv0/SchemaURL collapse inverts k8sevents anti-pattern warning. APPLIED: rewrote snippet to match k8sevents/record.go:90-107 line-for-line with the warning comment preserved.
- P2.maint-3 — "§Functional rubric #3" cross-doc reference points outside the RFC. APPLIED: fully-qualified to "MILESTONES.md M15 functional rubric bullet 3".
- P2.sec-4 — cursor file mode unspecified. APPLIED: cursor.file_mode: 0600 + dir_mode: 0700 in Configuration.
- P2.sec-5 — FieldSelector defense-in-depth missing. APPLIED: Validate rejects empty FieldSelector; runtime drops mismatched-nodeName events with KindWatch.
- P2.R-7 — named-capture (?P<name>…) allowance not stated in validate-at-load list. APPLIED.

CONCERN (single-lens, deferred to FOLLOWUPS):
- P2.perf-1/2/3/4 — §Performance subsection (alloc budget, channel sizing rationale, regex literal-prefilter, attribution-lookup concurrency). DEFERRED as consolidated FOLLOWUPS entry.
- P2.sec-2 — log-body secret redaction posture. DEFERRED (alpha-stage posture acceptable per sibling-receiver pattern).
- P2.sec-3 — vendor-tree CVE tracking. DEFERRED.
- P2.sre-2 — RBAC opt-out path when rank_source: downward_api. DEFERRED.
- P2.sre-5 — Rollback subsection in §Migration. DEFERRED.
- P2.maint-1/2 overlap with P2.R-2/R-4 (already applied).

NIT (deferred or skipped):
- P2.perf-5 — cursor flush cadence. DEFERRED.
- P2.op-3 — covered by P2.multi-3 (applied).
- P2.op-4 — chart smoke test asserts NODE_NAME. DEFERRED.
- P2.op-5 — M16 reversal recheck. DEFERRED.
- P2.contrib-3 — FOLLOWUPS governance-row falsifying-check. DEFERRED.
- P2.contrib-4 — covered by P2.multi-6.
- P2.maint-5 — forward-pointer to FOLLOWUPS. DEFERRED.
- P2.maint-6 — vendor/ path collision with go mod vendor. DEFERRED.
- P2.R-8 — SHA-pin manifest URL. DEFERRED.

EXPLICITLY-SKIPPED:
- P2.multi-5 / P2.adopt-3 / P2.maint-5 / P2.R-5 (cross-RFC label tracecore.io/rank vs gen_ai.training.io/rank). Phase 1 deferred to FOLLOWUPS; Phase 2 escalation contradicted by: design-locked DRAFT is the appropriate place to defer cross-receiver reconciliation pre-v1.0 (PRINCIPLES §11 backward-compat opt-in pre-v1.0); label fallback only fires when env discovery already failed, in which case rank is "unknown" regardless of label name. Phase 1 defer stands.

Validation-cycle stats:
- Findings rejected during contradict: 1 (P2.multi-5 namespace-rename; defended by pre-v1.0 + fallback-path-only argument)
- Findings whose hard-proof did not reproduce: 0

TDD discipline: N/A — RFC is prose. Reproducibility: make doc-check clean.

Rubric additions accepted (binding):
- Alpha-receiver RFCs MUST contain §Operator-surfaces section (RUNBOOK + alerts + FAILURE-MODES rows).
- RFCs with typed-record schemas MUST include wire-attribute name table cross-checked against MILESTONES.
- RFCs naming sibling-package symbols MUST verify compile-resolve at HEAD.

Discovered constraints:
- Pattern: alpha receivers consistently drop one of {RUNBOOK row, alerts row, FAILURE-MODES row}. Promote rubric.
- Pattern: cross-RFC join-key inconsistencies tend to defer to "whichever PR lands first"; design-locked status should exclude this class.

Per-lens final verdicts:
- Performance:    CONCERNS-REQUIRE-FIX (deferred)
- SRE/Infra:      CONCERNS-REQUIRE-FIX (deferred/applied mix)
- Maintainer:     CONCERNS-REQUIRE-FIX (mostly applied)
- Contributor:    CONCERNS-REQUIRE-FIX (applied)
- Operator/User:  CONCERNS-REQUIRE-FIX (deferred)
- Adopter:        CONCERNS-REQUIRE-FIX (overlapped with SRE/Security)
- Security:       CONCERNS-REQUIRE-FIX (mostly deferred)
- Researcher:     BLOCKER-REQUIRES-RESOLUTION (resolved inline)

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 31, 2026
)

## Summary

Four amendments to `docs/rfcs/0013-distro-first-pivot.md` (plus a
one-line sweep across 6 companion docs) per the scope-review findings
staged before PR-I.1 / PR-K / PR-M code work begins. Pre-stages each
decision in the RFC so the autonomous code PRs don't escalate
mid-flight.

## Root cause

#181 (RFC-0013 PR-I in-repo submodule rescope) was incomplete:

1. Sweep missed 6 companion docs still pointing at the original-design
external `tracecoreai/tracecore-components` repo.
2. §7 listed 3 GitHub workflows for deletion that were already removed
pre-RFC and 1 issue template (`component-bug-dcgm.yml`) that was already
removed pre-RFC.
3. PR-K was a single 4-receiver-delete-plus-chart-migration mega-PR with
no decoupling of the `internal/synthesis/patterns/` k8sevents dep break,
which is on PR-I.2's critical path.
4. PR-I.1 conflated the `module/go.mod` scaffolding with the `git mv` +
package rename, blocking PR-I.1a from landing without PR-B2 even though
the scaffolding step has no nccl_fr dep.

Mid-flight discovery during merge cycle: merged commit #188
(`feat(pivot): PR-B2 — port dcgm off internal selftel + lifecycle`)
reused the `PR-B2` slug for a PR-B1-shape dcgm port (which is moot since
dcgm is deleted entirely in PR-F), creating a naming collision against
the canonical PR-B2 defined in the RFC — the nccl_fr
`internal/{pipeline,consumer,runtime/lifecycle}` → upstream port that
hard-gates the PR-I.1b `git mv`.

## Amendments

1. **§6/§7 sweep miss (Amendment 1)**: Remove surviving
`tracecoreai/tracecore-components` external-repo references across
`docs/getting-started.md`, `docs/followups/M11.md`,
`docs/followups/M19.md`, `docs/FOLLOWUPS.md`,
`docs/rfcs/0003-pipeline-runtime-and-component-contract.md`,
`AGENTS.md`. All re-pointed at `github.com/tracecoreai/tracecore/module`
per RFC-0013 §6. Verified zero surviving stale refs.
2. **§7 nonexistent workflow entries (Amendment 2)**: Collapse
`pyspy-integration.yml`, `python-publish.yml`,
`kernelevents-integration.yml` deletion rows into one row marked
"already removed pre-RFC". `component-bug-dcgm.yml` also already
removed. Only `component-bug-kernelevents.yml` survives for PR-K. §4
v0.3.0 row + PR-M slug cleaned for consistency.
3. **§migration PR-K sub-slice (Amendment 3)**:
- **PR-K.1** — sever `internal/synthesis/patterns/` from
`components/receivers/k8sevents` via local model types in
`internal/synthesis/patterns/model.go`. No deletions. **Unblocks
PR-I.2.**
- **PR-K.2** — delete
`components/receivers/{clockreceiver,kernelevents,k8sevents,containerstdout}`
+ migrate ~86 test fixtures + delete `tools/failure-inject/xidgen/` +
keep `tools/failure-inject/ncclhang/`.
- **PR-K.3** — chart cleanup: flip `containerstdout-on-values.yaml` to
filelog+container-stanza, delete `containerstdout-rbac.yaml`, delete
`.github/ISSUE_TEMPLATE/component-bug-kernelevents.yml`, ship
`NOTES.txt` deprecation + values-key removal.
4. **§migration PR-I sub-slice + PR-B2 promotion (Amendment 4)**:
- **PR-B2** reframed as hard gate for PR-I.1b: port
`components/receivers/nccl_fr` off
`internal/{pipeline,consumer,runtime/lifecycle}` to upstream
`go.opentelemetry.io/collector/{component,receiver,consumer,pipeline}`.
Slug-collision note added re: merged #188.
- **PR-I.1a** — `module/go.mod` + root `go.work` + `builder-config.yaml`
`replaces:` skeleton. No file movement. Tag `module/v0.0.1` (genesis
tag, validates the tagging contract).
- **PR-I.1b** — `git mv components/receivers/nccl_fr →
module/receiver/ncclfrreceiver` + `git mv pkg/nccl/fr_parser →
module/pkg/nccl/fr_parser` + rename Go package `nccl_fr` →
`ncclfrreceiver` + update all importers. Hard-gated on PR-B2. No new
tag; next bump is `module/v0.1.0` at PR-I.2.
- **PR-I.2** — `rankjoinprocessor` + `patterndetectorprocessor` net-new.
Hard-gated on PR-K.1. Tag `module/v0.1.0` (first version pinned in
`builder-config.yaml` for v0.2.0).

Also: PR-J marked `(landed, #195)` with note that recipe docs landed but
chart-values compat map follows in PR-K.3.

## Adversarial review (5 lenses, inline)

- **(a) PR slug internal consistency**: PR-I.1b ↔ PR-B2 ↔ PR-I.2 ↔
PR-K.1 bidirectional gates all match. PR-J landed marker consistent with
#195. §4 v0.2.0 row has pre-existing drift (mentions dcgm+kueue in
v0.2.0 when PR-F+#168 already deleted them in v0.1.0) — out of these 4
amendments' scope; flag for follow-up.
- **(b) PR-B2 hard-gate naming**: tightened from "Hard gate for PR-I.1"
to "Hard gate for PR-I.1b" — accurate because PR-I.1a is
scaffolding-only with no file movement.
- **(c) Sub-PR numbering collision**: #188 explicitly addressed in
slug-collision note. #185/#186/#187/#193/#194/#196 are PR-B1-shape ports
for non-nccl receivers (no PR-slug label in their commits), no
collision.
- **(d) Stale external-repo refs**: `grep -rn
"tracecoreai/tracecore-components" docs/ AGENTS.md README.md` returns
zero hits post-amendment.
- **(e) Cross-reference link integrity**:
`docs/migration/v0.1-to-v0.2.md` references `#migration--rollout` and
`#3-customer-stable-telemetry-contracts`; both anchors preserved (§
headers unchanged). `make doc-check` confirms 526 markdown links
resolve.

## Test plan

- [x] `make doc-check` — 526 markdown links resolve, 0 stale refs,
banned-phrase lint clean, alert-check + chart-appversion gates green.
- [x] Pre-push hooks: golangci-lint clean, go vet clean, go mod verify
clean.
- [ ] CI doc-check + actionlint + zizmor gates pass on PR.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
…ORTHSTAR pattern #3) (#255)

## Summary

Third real detector in `module/pkg/patterns/` — closes v0.3.0 NORTHSTAR
O1 ("3 patterns at v0.1.0") alongside `pod_evicted` (#14) and
`nccl_hang` (#15).

**Pattern**: a GPU Xid kernel event (carrying the customer-stable
`kernelevents.xid` attribute) followed within a configurable window
(default 60s) by a pod eviction on the same node → one verdict per (Xid,
evicted_pod) tuple. Operator-actionable: "Xid 79 on gpu-node-0001 →
training/job-rank-3 evicted 10s later" with remediation pinning the node
to drain and the pod to reschedule.

**Detector** (`module/pkg/patterns/xid_correlation.go`):
- `XidCorrelationDetector` zero-value usable; `CorrelationWindow`
overrideable (defaults to `DefaultXidCorrelationWindow = 60s`).
- `Evaluate(xids, events)` joins on `(node, time-within-window)` with
the most-recent Xid winning the proximate-cause attribution.
- Causal-direction guard: eviction BEFORE Xid does not correlate
(mirrors pod_evicted's future-transition exclusion).
- Deterministic output sorted by `(eviction_time, EventUID)`.

**Verdict shape** — single vs multi-layer decision: structurally two
evidence surfaces (kernel_event + pod_event in causal order in the
evidence trail), but emission rule is "both layers joined or no
verdict". Xid-without-eviction is already operator-visible via the raw
`kernelevents.xid` telemetry; an Xid-only verdict would duplicate that
signal. So `Confidence` / `MissingLayers` are omitted (dead-fields
discipline from PR #250 review). Mirrors the `nccl_hang` shape.

**Multi-pod decision**: one verdict PER evicted pod (not one verdict
listing all). Per-pod is the operator-actionable shape — each verdict's
`Remediation` pins a specific pod to drain/recreate so alert routing
fans out by pod owner; collapsing would force operators to parse a list
and lose per-pod routing.

**Schema** (`testdata/xid_correlation_verdict.schema.json`):
`additionalProperties:false`, `pattern.id` const="16",
`evidence_trail.kind` enum=`["kernel_event","pod_event"]`, `minItems:2`
(both layers required). 7-row drift-rejection battery covers
re-introduced `confidence`/`missing_layers`, evidence-minimum violation,
kind-enum, and `pattern.id` numeric vs string drift.

**Wiring** (`patterndetectorprocessor`):
- New `Config.XidCorrelationWindow` YAML field (default 60s, floor 1s),
validated alongside the existing window fields.
- `collectInputs` returns a fourth typed slice (`[]patterns.XidRecord`)
built by `projectXidRecord`, which gates on `kernelevents.xid` and reads
the host node from the resource attribute `k8s.node.name` (the standard
k8sattributes / resourcedetection stamp on a DaemonSet — same pattern
`projectNodeCondition` uses). Per-record `k8s.node.name` fallback for
non-DaemonSet emitters.
- `appendXidCorrelationVerdict` mirrors `appendNCCLHangVerdict`'s wire
format — broken-out scalar attrs plus full `pattern.verdict_json`.

**Recipe alignment**: `docs/integrations/journald-kernel.md`
§"Customer-stable attribute mapping" already documents
`kernelevents.xid` (int) as the OTTL-stamped surface. No recipe update
needed; the existing pipeline emits exactly what the detector consumes.
`gpu.id` (PCI BDF) is also stamped but intentionally not projected — the
detector doesn't use it yet, so it stays available on the raw record
without entering the pattern library as a dead field.

## Test plan

- [x] `cd module && go test ./pkg/patterns/... ./processor/... -race
-count=1` — green
- [x] `make build` — collector binary compiles
- [x] `make check` (golangci-lint + go vet + go mod verify) — 0 issues
- [x] TDD red-test-first: both `xid_correlation_test.go` files failed to
compile/assert before their impl landed
- [x] Detector test cases: positive Xid 79 → eviction, no-eviction
negative, cross-node negative, out-of-window edge (61s default),
multi-pod per Xid (3 verdicts), window-configurable, pre-Xid eviction
excluded, non-evicted-hint ignored, deterministic order,
most-recent-Xid-wins
- [x] Schema-conformance + 7-row drift-rejection battery
- [x] Wiring tests: positive verdict emission, healthy non-emission,
window override
- [x] `TestConfig_Validate` extended with sub-1s xid_correlation_window
floor case
- [x] `TestFactory_Surface` pins `DefaultXidCorrelationWindow` (and the
previously-unpinned `DefaultNCCLHangThreshold`) on the factory's default
config

```release-notes
feat(patterns): add xid_correlation detector (v0.3.0 NORTHSTAR pattern #3) — correlates GPU Xid kernel events (`kernelevents.xid` attribute per RFC-0013 §3) with downstream pod evictions on the same node within a configurable window (default 60s). Emits one verdict per evicted pod so alert routing can fan out per pod owner. Wired into patterndetectorprocessor; configurable via `xid_correlation_window` YAML field.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
…ctor wiring (PR-A for #260) (#267)

## Summary

PR-A for [issue
#260](#260) — the
recipe-side half. Two commits:

1. `docs(adr): metrics-to-logs converter for pattern input (Option A)` —
records the architectural decision and the upstream-blocker analysis
under a new `docs/adrs/` directory.
2. `docs(recipe): extend prometheus-scrape with DCGM to hw.* OTTL
transform` — adds `transform/dcgm_to_hw_semconv` (a second OTTL pass
after `transform/gpu_vendor`) and documents every per-pattern projection
in `docs/integrations/prometheus-scrape.md`.

### Architectural decision (issue #260 §Proposed plan step 2)

**Option A** — extend `patterndetectorprocessor` with
`processor.WithMetrics` alongside the existing `WithLogs` (deferred to
PR-B; this PR ships only the wire-format contract PR-B will consume).

Per `[[adopt-over-build]]` Option B (OTTL metrics-to-logs converter) was
the preferred starting point. It's blocked upstream: at OTel contrib
v0.130 (the release pinned in `builder-config.yaml`) the
`transformprocessor` signal contexts are sealed per-signal —
`metric_statements` can only reference `resource | scope | metric |
datapoint` paths, never `log` ([README @ v0.130.0 §
Config](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config)).
Surveying every contrib connector at v0.130 also turns up zero
metrics→logs primitives (`countconnector`, `signaltometricsconnector`,
`spanmetrics`, `exceptions`, `servicegraph` all output metrics;
`routing`/`failover`/`roundrobin`/`otlpjson` are signal-preserving;
`datadog` and `grafanacloud` are vendor-specific). The full survey +
reasoning is in
[ADR-0001](docs/adrs/0001-metrics-to-logs-pattern-input.md).

Option C (routing connector + sibling) is strictly heavier than A with
the same end state — routing preserves signal type, so it cannot bridge
metrics→logs either.

Trade-off accepted: doubled processor surface inside
`patterndetectorprocessor` (one `ConsumeLogs`, one `ConsumeMetrics`).
Mitigation: the metrics path is a thin projection on top of the same
`module/pkg/patterns/` library; verdict emission stays log-based via a
connector. Upstream-contribution slot opened — a
`metricthresholdconnector` (inverse of `signaltometricsconnector`) would
let us collapse Option A back to Option B without operator-visible
churn. Tracked under RFC-0013 §5.

### Per-pattern OTTL projection (this PR)

| Pattern | Raw DCGM input | OTel output | Trigger-statement shape (PR-B
consumes this) |
|---|---|---|---|
| #1 NVLink | `DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES` |
`hw.gpu.nvlink.io` (Counter, `By`) + `hw.gpu.nvlink.link={N}` +
`network.io.direction={transmit\|receive}` | `rate(hw.gpu.nvlink.io[5m])
< 0.5 × per-GPU median over the per-`hw.id` link set, for 10m` |
| #3 HBM ECC | `DCGM_FI_DEV_ECC_{SBE,DBE}_{VOL,AGG}_TOTAL` | `hw.errors`
(Counter, `{error}`) + `error.{type,subtype,persistence}` |
`increase(hw.errors{error.type="uncorrected",
error.persistence="volatile"}[5m]) > 0` |
| #4 thermal |
`DCGM_FI_DEV_{THERMAL,POWER,SYNC_BOOST,BOARD_LIMIT,LOW_UTIL}_VIOLATION`
| `hw.gpu.throttle.duration` (Counter, `s`) + `hw.gpu.throttle.reason` |
`increase(hw.gpu.throttle.duration{reason="thermal"}[5m]) > 30s, for
10m` |
| #5 PCIe | `DCGM_FI_PROF_PCIE_{TX,RX}_BYTES` | `hw.gpu.io` (Counter,
`By`) + `network.io.direction` + `hw.gpu.pci.bdf` resource attr |
`rate(hw.gpu.io[5m]) < 0.3 × host median, for 15m` |

Resource attribution: `UUID` or `gpu_uuid` → `hw.id`; `gpu` or `GPU` →
`hw.gpu.index`; `hw.type="gpu"` stamped on every DCGM series. Dual-label
fallback handles both legacy and modern dcgm-exporter releases.

### What's NOT in this PR

The third commit originally planned by issue #260 — `feat(recipe):
metrics→logs OTTL alert emission` — is dropped because OTTL cannot emit
log records from metric streams at v0.130 (root cause cited above, full
analysis in ADR-0001). The fallback is Option A, which requires
modifying `patterndetectorprocessor` source — out of scope for a
recipe-only PR. The alert-emission half lands in PR-B alongside the
NVLink detector code.

## Test plan

- [x] `./_build/tracecore validate
--config=docs/integrations/examples/prometheus-scrape.yaml` exits 0.
- [x] `bash scripts/doc-check.sh` clean (test-name parity, markdown link
integrity, banned-phrase lint, integration-recipe rubric all pass).
- [x] `bash scripts/validator-recipe.sh` validates 6 recipes (skips 2
requires-linux / requires-k8s-cluster — CI ubuntu runner exercises those
paths).
- [x] `make check` (golangci-lint, go vet, mod-verify) clean.
- [ ] Adversarial reviewer check that the OTTL `set(metric.name, ...)`
identity-conflict caveat (recipe markdown § "Identity-conflict caveat")
matches the operator's downstream backend's deduplication semantics for
the four projected metric families.
- [ ] CI runner exercises the requires-linux / requires-k8s recipes that
this PR's host skipped.

## Followup

- **PR-B (issue #260, blocked by this PR's merge)** — extend
`patterndetectorprocessor` with `processor.WithMetrics`; land
`module/pkg/patterns/nvlink_degradation.go`; wire verdict emission
through a sibling logs pipeline via a connector. Decision rationale
baked into ADR-0001 so PR-B doesn't relitigate.
- **Upstream contribution (v0.3 cycle, RFC-0013 §5)** — propose a
`metricthresholdconnector` (or `signaltologsconnector`) to OTel-contrib.
If/when it lands, the recipe can collapse back to Option B without
operator-visible churn — the `hw.gpu.*` wire-format contract this PR
ships is the load-bearing customer surface, not the upstream component
selection.
- **dcgm-exporter throttle-reason mapping verification** — the pattern
#4 mapping table assumes modern dcgm-exporter `DCGM_FI_DEV_*_VIOLATION`
series names. If your dcgm-exporter release exposes the throttle bitmask
as `DCGM_FI_DEV_CLOCKS_EVENT_REASONS` instead (per the semconv proposal
Open Question #1), extend the OTTL block to expand the bitmask into
discrete `hw.gpu.throttle.duration` datapoints — out of this recipe's
scope; tracked as a follow-up under the same milestone.

```release-notes
docs(recipe): The `prometheus-scrape` recipe gains a second OTTL transform pass that projects raw `dcgm-exporter` `DCGM_FI_*` series into the customer-stable `hw.gpu.*` / `hw.errors` namespace (semconv proposal). Four pattern families covered — NVLink, HBM ECC, thermal throttle, PCIe — using upstream OTTL functions only (`set`, `IsMatch`, `ExtractPatterns`, `Int`). The wire-format contract this transform produces is the input PR-B's NVLink detector will consume; the in-tree component change for the detector itself lands in a follow-up PR per ADR-0001.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
…#3) (#274)

## Summary

Ships pattern #3 (uncorrectable HBM ECC) per
`docs/patterns/pattern-3-hbm-ecc.md`. Two layers required for a verdict
— a single layer alone emits nothing.

- **Layer 1 (DCGM):** uncorrected double-bit volatile ECC counter delta
on a GPU. Discriminator: `error.type=uncorrected` AND
`error.subtype=double_bit` AND `error.persistence=volatile` AND `Delta
>= threshold`.
- **Layer 2 (kernel):** Xid 48 / 63 / 64 on the SAME `gpu.id` (PCI BDF)
within 5min default window. Causality flows forward — Xid must be at or
after the ECC delta.
- Join key: `gpu.id`. Same-node is implied but not required for the join
(the GPU identifier is the proximate hardware-fault key).

### Why two layers required

- ECC delta alone: operators already see the rising counter on
dashboards; pattern's value is the kernel-level confirmation that the
GPU is failing.
- Xid alone: may fire `xid_correlation` if a pod evicts on the same node
— that's a node-scoped pattern, distinct from this GPU-scoped one.
Operators get raw `kernelevents.xid` telemetry already.

### What's new

- `module/pkg/patterns/hbm_ecc.go` — `HBMECCDetector` library type,
`HBMECCRecord` projection, `HBMECCVerdict` output. `PatternIDHBMECC =
"3"`, `EvidenceKindHBM = "hw_error"`.
- `module/pkg/patterns/testdata/hbm_ecc_verdict.schema.json` — JSON
schema with 10 drift falsifiers.
- `XidRecord` extended with optional `GPUID` (PCI BDF from the
journald-kernel OTTL recipe per RFC-0013 §3). `xid_correlation` ignores
it; `hbm_ecc` requires it.
- Processor wiring: `projectHBMECCRecord` + `appendHBMECCVerdict` mirror
the xid_correlation shape. Config grows `hbm_ecc_window` (default 5min)
+ `hbm_ecc_delta_threshold` (default 1).

### Follow-up gap (not a workaround)

The pattern doc names `hw.errors` as a Counter on the metrics pipeline;
the patterndetectorprocessor consumes log records. The detector wires up
against `hw.errors.delta`-shaped log records but no OTTL recipe today
derives that log shape from the upstream metric. Tracked in #273 —
needed for end-to-end firing in a real deployment; the detector library
is the v0.3 moat and ships independently per RFC-0013.

## Test plan

- [x] `cd module && go test ./... -race -count=1` — all green
- [x] `make build` — exit 0
- [x] `make check` — golangci-lint 0 issues, go vet clean, go mod
verified
- [x] Window-edge fenced both sides: `EdgeOutOfWindow` (5min+1s, no
fire) + `EdgeAtWindowBoundary` (exactly 5min, must fire)
- [x] Negative paths: ECC alone, Xid alone, wrong Xid code, corrected
ECC, cross-GPU, pre-ECC Xid (causal direction)
- [x] Schema drift battery: 10 falsifiers (extra field, confidence
re-added, pattern.id numeric, pattern.id wrong value, xid_code string,
ecc_delta negative, ecc_delta zero, evidence_kind outside enum,
evidence_trail under min, gpu_id empty)
- [x] Wiring: emits-verdict, xid-alone, ecc-alone, out-of-window,
cross-gpu, window+threshold configurable

```release-notes
feat(patterns): hbm_ecc detector (NORTHSTAR pattern #3) — emits a verdict when an uncorrected double-bit volatile DCGM ECC counter rise joins an Xid 48/63/64 on the same gpu.id within the configurable window (default 5min). Two new YAML config keys: hbm_ecc_window, hbm_ecc_delta_threshold.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Sweep of broken doc anchors and stale `tracecore_*` self-telemetry
metric names that drifted after the v0.2.0 OCB pivot.

Three independent root causes:

1. **Anchor typos** — 8 cross-links to RFC-0013 used
`#migration--rollout` (double dash) and one used
`#3-customer-stable-contracts` (truncated). Replaced with the canonical
slugs `#migration-rollout` and `#3-customer-stable-telemetry-contracts`.
2. **Stale RUNBOOK metric vocabulary** —
`module/receiver/ncclfrreceiver/RUNBOOK.md` still referenced v0.1.x
names (`tracecore_receiver_errors_total`,
`tracecore_exporter_calls_total`,
`tracecore_receiver_degraded_seconds_total`). The v0.2.0 OCB binary
emits `otelcol_receiver_ncclfr_*` and `otelcol_exporter_<name>_*` per
the in-tree namespace alignment table in
`docs/migration/v0.1-to-v0.2.md`. Also flipped `result="fail"` →
`result="error"` to match the migration table's worked example.
3. **Orphan attribute row** — `tracecore.container.lines_per_s` in
`docs/ATTRIBUTES.md` carried a "recipe TBD" emitter, but the
`containerstdout` receiver was deleted at v0.2.0 (RFC-0013 PR-K.2 / §7)
and no replacement emitter exists. Dropped the row.

## Files touched

- `docs/getting-started.md` — `#3-customer-stable-contracts` →
`#3-customer-stable-telemetry-contracts`
- `docs/ATTRIBUTES.md` — drop `tracecore.container.lines_per_s` row
-
`docs/integrations/{k8sobjects-events,filelog-container,journald-kernel}.md`
— anchor fix
- `docs/migration/v0.1-to-v0.2.md` — anchor fix (2 occurrences)
- `docs/migration/v0.2-to-v0.3.md` — anchor fix
- `docs/notes/ci.md` — anchor fix
- `docs/reproducibility.md` — anchor fix
- `module/receiver/ncclfrreceiver/RUNBOOK.md` — 4 metric-name rewrites +
migration-doc citation

## Why

Anchor fragments silently 404 in GitHub-rendered docs (no broken-link CI
gate for fragment-only mismatches). Stale RUNBOOK metric names point
operators at series that no longer exist on the `:8888/metrics`
endpoint, blocking incident response. The orphan ATTRIBUTES row implied
a recipe was in flight when in fact the upstream emitter was deleted two
minors ago.

## Test plan

- [x] `git grep '#migration--rollout' docs/` returns 0 in-scope hits
(the one remaining occurrence is an intra-doc link in
`docs/rfcs/0009-pyspy-receiver-scope.md`, outside this PR's scope).
- [x] `grep -n 'Customer-stable telemetry contracts'
docs/rfcs/0013-distro-first-pivot.md` confirms heading exists at L97.
- [x] RUNBOOK metric names cross-checked against
`docs/migration/v0.1-to-v0.2.md` lines 113-121 (in-tree namespace
alignment table) and lines 137-138 (worked PromQL migration recipe —
`result="error"`).
- [x] Pre-push hooks: `golangci-lint`, `go vet`, `go mod verify`,
`doc-check` (554 markdown links resolved on-disk), `alert-check`,
`chart-appversion-check`, `no-autoupdate-check` — all green.

```release-notes
docs: fix 9 broken cross-doc anchors (RFC-0013 §migration and §3 slug), rewrite stale v0.1.x `tracecore_*` self-telemetry metric names in `module/receiver/ncclfrreceiver/RUNBOOK.md` to the v0.2.0 `otelcol_receiver_ncclfr_*` / `otelcol_exporter_<name>_*` vocabulary, and drop the orphan `tracecore.container.lines_per_s` row from `docs/ATTRIBUTES.md` (no emitter — `containerstdout` deleted at v0.2.0 per RFC-0013 §7).
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

The `docs/adrs/` directory held one file
(`0001-metrics-to-logs-pattern-input.md`). A single-file decision-record
directory sitting parallel to `docs/rfcs/` (13 RFCs + README + template)
is taxonomy drift — operators and contributors had to learn two
near-identical conventions for "load-bearing architectural decision in
tree."

This PR collapses the split.

### What changed

- ADR-0001 → **RFC-0014**
(`docs/rfcs/0014-metrics-to-logs-pattern-input.md`). Content reformatted
to match the RFC template's section headings (Summary / Motivation /
Proposal / Alternatives / Open questions / Migration / References). The
substance (Option A vs Option B vs Option C analysis, the v0.130 contrib
survey, PR-A landed + PR-B pending sequencing) is preserved verbatim —
this is active design for issue #260 PR-B, not archaeology.
- `docs/adrs/` directory removed.
- 6 cross-references repointed at the new RFC path:
  - `docs/README.md` (subdirectories table — `adrs/` row removed)
- `docs/ATTRIBUTES.md` (2 spots: `tracecore.alert.pcie_rate_collapse.*`
row + "See also" link)
  - `docs/integrations/prometheus-scrape.md` (2 spots)
  - `docs/patterns/pattern-4-thermal-throttle.md`
  - `docs/patterns/pattern-5-pcie-aer.md`
- 5 source-code comments repointed
(`module/processor/patterndetectorprocessor/{patterndetector.go,thermal_throttle_test.go}`,
`module/pkg/patterns/{pcie_aer.go,thermal_throttle.go}`).
- `docs/rfcs/README.md` status-index gains an RFC-0014 row (`accepted`,
2026-05-31).

### Why convert (not delete)

ADR-0001 was evaluated against the "delete if RFC-0013 already covers
it" bias. It is not covered:

- RFC-0013 §5 mentions a `metricthresholdconnector` as a contribution
slot in one bullet. It does **not** evaluate Option A vs Option B vs
Option C, cite the contrib v0.130 evidence, or sequence the PR-A/PR-B
split for the metric-sourced detectors.
- ADR-0001 is the binding design contract for patterns #1 / #3 / #4 / #5
(4 of the next NORTHSTAR detectors). Source code (`pcie_aer.go`,
`thermal_throttle.go`, `patterndetector.go`) cites it as the reason for
the staged-but-quiet wire-up.

Delete would orphan 5 source comments and break the audit trail for an
active design decision. Convert keeps the record load-bearing without
preserving the parallel taxonomy.

### Verification

- `grep -rn "docs/adrs\|adrs/0001"` returns 0 hits.
- `grep -rn "ADR-0001\|ADR 0001"` returns 0 hits.
- `ls docs/ | grep -i adr` returns empty.
- pre-commit golangci-lint + go vet + go mod verify clean.

```release-notes
docs: collapse single-file `docs/adrs/` into `docs/rfcs/`. ADR-0001 (metrics-sourced pattern inputs) is promoted to RFC-0014 verbatim; cross-references across docs and module source repointed.
```

## Test plan

- [x] `grep -rn "docs/adrs"` returns 0 hits
- [x] `grep -rn "ADR-0001"` returns 0 hits
- [x] `docs/adrs/` directory removed
- [x] golangci-lint + go vet clean (pre-commit hook)
- [ ] CI green

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Extends prometheus-scrape.md with the bridge attribute contract for the
four metrics-derived patterns:

- pattern #1 NVLink (#260) — the `hw.gpu.nvlink.io` OTTL transform
  already lands in commit 0baa557; this PR closes #260's recipe-half.
- pattern #3 HBM ECC (#273) — `hw.errors.delta` + error.{type,
  subtype,persistence} + gpu.id contract.
- pattern #4 thermal throttle (#282) — `hw.gpu.throttle.duration.delta`
  in integer seconds + reason=thermal + gpu.id contract.
- pattern #5 PCIe AER Layer 2 (#284) — the `tracecore.alert.
  pcie_rate_collapse.*` namespace contract.

OTTL metrics->logs emission stays upstream-blocked at OTel-contrib
v0.130 (RFC-0014): no contrib processor or connector emits log records
from a metrics pipeline. The bridge contract documented here is the
load-bearing wire format any future emitter (an upstream
metricthresholdconnector OR the WithMetrics extension to
patterndetectorprocessor per RFC-0014 PR-B) MUST honor; the detector
projections at module/processor/patterndetectorprocessor/
patterndetector.go gate on this contract today.

last-verified marker bumped to 2026-06-01.

Closes #260. Closes #273. Closes #282. Refs #284 (Layer 1 closed
under #285 in a prior commit; Layer 2 contract documented here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #331. Replaces the single `pattern-pod-evicted` chaos job with a
matrix-strategy `pattern-detectors` job covering every detector named
in v1-rc1 cut-criterion §1: `hbm_ecc`, `thermal_throttle`, `pcie_aer`,
`pod_evicted`, `nccl_hang`, `xid_correlation`.

Each row runs the detector's positive + negative + schema `*_test.go`
prefix under `-race`:

- **Positive / Edge asserts** — verdict emitted with correct
  `pattern.id` within bounded time.
- **Negative / SchemaRejectsDrift asserts** — zero verdicts on
  baseline / confounder input (the false-positive gate the chaos
  matrix exists to enforce).

The `pod_evicted` row additionally runs the on-disk replay corpus under
`module/pkg/replay/pod_evicted/`; sibling rows skip that step until
their `replay/<pattern>/` fixtures land — tracked as Path A in
[`docs/v1-rc1-test-audit.md`](../blob/main/docs/v1-rc1-test-audit.md)
§4.

## Why this matrix shape, not new `failure-inject` modes

Audit §4 explicitly recommends Path A (matrix over existing
hermetic detector tests) for rc1 and defers Path B (per-pattern
`failure-inject` CLI modes) to v1.x. Path A reuses the
race-detector + golden-SHA shape already validated by
`pattern-pod-evicted` and costs ~60 LoC of workflow YAML instead of
five new CLI subcommands.

## Wall-time budget

Observed per-row runtime under `-race` (measured locally on this
worktree):

| Row | runtime |
|---|---|
| hbm_ecc | 1.35s |
| thermal_throttle | 1.33s |
| pcie_aer | 1.30s |
| pod_evicted (incl. replay corpus) | 1.35s + 1.34s |
| nccl_hang | 1.31s |
| xid_correlation | 1.32s |

Per-row `timeout-minutes: 5` is ~3× headroom; matrix parallelism
keeps total wall-time well inside the chaos-gate budget.

## Test plan

- [x] `actionlint .github/workflows/chaos.yml` — clean.
- [x] `python3 -c "import yaml;
yaml.safe_load(open('.github/workflows/chaos.yml'))"` — parses, 3 jobs
(`harness-determinism`, `cpu-steal-mpstat`, `pattern-detectors`), 6-row
matrix.
- [x] For every matrix row, `cd module && go test -race -count=1 -run
'<regex>' ./pkg/patterns/...` — all green (counts: HBMECC=16,
ThermalThrottle=17, PCIeAER=15, PodEvicted+PressureFromNote=12,
NCCLHang=10, XidCorrelation=14).
- [x] `cd module && go test -race -count=1 ./pkg/replay/...`
(pod_evicted corpus step) — green.
- [x] Pre-commit hook: golangci-lint + go vet + go mod verify +
attribute-namespace-check — all green.
- [ ] Workflow runs green on `main` after merge (verifiable only
post-merge — nightly cron + push-to-main).

```release-notes
ci: chaos.yml now matrix-tests all six shipped detectors (#14
pod_evicted, #8 nccl_hang, #3 hbm_ecc, #4 thermal_throttle, #5
pcie_aer, xid_correlation) instead of just pod_evicted. Closes the
chaos-coverage gap called out in v1-rc1-test-audit §4. No
operator-visible behavior change.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #337.

The CUDA OOM detector (`projectFBMemoryRecord` at
`module/processor/patterndetectorprocessor/cuda_oom.go:114`) gates on
`hw.gpu.memory.{free,total}` log-record attributes, but nothing in
the recipe layer produced them: `dcgm-exporter` emits
`DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` as Prometheus gauges and
no OTTL transform projected them onto the customer-stable namespace.
Detector compiled, never fired on a real install — sibling gap to
#273 (pattern #3), #282 (#4), #284 (#5).

This PR closes the gap on the metric-side projection and pins the
load-bearing log-shape contract the bridge layer MUST honor.

- `docs/integrations/examples/prometheus-scrape.yaml`:
  - `DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used` (Gauge, unit `By`)
  - `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free` (Gauge, unit `By`)
  - Identity-preserving rename only; `hw.gpu.memory.total = used+free`
    deferred to the bridge layer per the named upstream limit (see
    below).
- `docs/integrations/prometheus-scrape.md`:
  - New `### Pattern #10 — CUDA OOM (framebuffer)` metric-side
    projection section with raw-series → semconv table.
  - New `#### Pattern #10 — hw.gpu.memory.{free,total}` bridge-
    contract subsection with full log-record schema (yaml-shaped)
    matching what `projectFBMemoryRecord` reads, plus MIG caveat
    and unit-test cross-link.
  - Intro + bridge-contract header bumped to include pattern #10.
- `docs/patterns/10-cuda-oom-deceptive.md`:
  - Signal-source line links to the recipe sections.
  - Open Question #1 (`DCGM_FI_DEV_FB_*` OTTL extension) struck
    through; resolution recorded.
- `docs/ATTRIBUTES.md`:
  - `hw.gpu.memory.free` / `.total` rows updated to distinguish
    metric vs log shape and to cross-link to the recipe section.
  - New `hw.gpu.memory.used` row (now projected on the metrics
    pipeline by this PR — dashboard evidence context).

## Root cause + named upstream limit

**Root cause (fixed in this PR):** the prometheus-scrape OTTL
transform had no stanza projecting the DCGM FB series onto
`hw.gpu.memory.*`. The detector's projection gate could not be
satisfied on a real install. Fixed by adding the rename stanza in
`transform/dcgm_to_hw_semconv` (same processor the #1/#3/#4/#5
projections already live in — no new processor surface).

**Named upstream limit (NOT worked around — tracked):** OTel-contrib
`transformprocessor` v0.130 `metric_statements` cannot perform
cross-series arithmetic — there is no OTTL path to compute
`hw.gpu.memory.total = hw.gpu.memory.used + hw.gpu.memory.free`
on a metrics pipeline ([upstream
README](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config)).
Per RFC-0014 §Alternatives, the metrics→logs primitive does not
exist in contrib at v0.130 either. The total scalar lives at the
bridge layer (RFC-0014 PR-B `WithMetrics` extension to
`patterndetectorprocessor`, tracked under #260). The recipe pins
the load-bearing wire format the bridge MUST honor so PR-B lands
without a contract change.

## Adopt-over-build posture

Every new OTTL statement uses upstream functions only (`set`,
`==` equality). No new transformprocessor extension. Mirrors the
existing #1/#3/#4/#5 stanzas.

## Test plan

- [x] `make build` — clean
- [x] `./scripts/validator-recipe.sh` — 9 validated, 3 skipped
      (non-linux), 0 fail; prometheus-scrape example passes
      `tracecore validate`.
- [x] `./scripts/doc-check.sh` — 721 markdown links resolve, all
      new cross-links + anchor refs included; banned-phrase lint
      clean; recipe markers (`tested-against`, `last-verified`)
      present.
- [x] `./scripts/attribute-namespace-check.sh` — 67/67 attribute
      literals documented (no new undocumented attrs introduced).
- [x] golangci-lint, go vet, go mod verify via commit hook — clean.
- [ ] CI linux runner exercises journald + k8sobjects skip-paths
      we couldn't run locally (validator-recipe ubuntu job).

```release-notes
recipe(ottl): project DCGM `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` onto the customer-stable `hw.gpu.memory.{used,free}` namespace and pin the metrics-to-logs bridge log-shape spec the pattern #10 (CUDA OOM) detector consumes via `projectFBMemoryRecord`. `hw.gpu.memory.total = used + free` derivation is deferred to the RFC-0014 PR-B `WithMetrics` bridge layer because OTTL `transformprocessor` v0.130 has no cross-series arithmetic on metrics pipelines.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but
the join shape varies across patterns and the rationale lived only
in code comments + PR review threads. Operators tuning windows had to
read source per detector.

Audit finding: five distinct shapes are load-bearing (chosen by the
causal physics of each signal), not bugs:

- One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect.
- Asymmetric two-sided (#11): pre-stall covers concurrent-start
  checkpoints; post-stall covers OTTL-bridge logger latency.
- Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window
  could be cause OR consequence.
- Job-window bounded (#13): SDC counter rise must fall in the
  bounded eval-cycle's owning job; no operator knob is meaningful.
- Trailing-window rate / freshness (#2 #4 #8): rolling window
  anchored at `now` or the most-recent record.

Decision: document the existing reality, do not converge. Forcing
every detector to the asymmetric two-knob form would silently
zero one leg for the one-sided detectors (footgun on clock skew)
and would not apply to #13 at all.

Adds:
- 'Why this correlation shape' section in docs/patterns/07, 11, 13
  (the three shapes the issue called out by name).
- 'Correlation-window semantics' table in docs/patterns/README.md
  covering ALL 11 detectors with the predicate, anchor, and shape
  rationale, plus cross-links to the per-pattern sections.

No code changes; no detector behavior changes.

Closes #367.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Lands replay corpora for the seven pattern detectors that lacked one,
closing the Path-A test gap named in the v1-rc1 audit (#366). Each
pattern now ships `module/pkg/replay/<pattern>/{canonical,_negative,
_real_world}/` fixtures plus a `*_replay_test.go` runner that JSON-eqs
detector output against the on-disk golden.

Detectors covered: `hbm_ecc` (#3), `thermal_throttle` (#4),
`pcie_aer` (#5), `ib_link_flap` (#2), `nccl_hang` (#15), `cuda_oom`
(#10), `xid_correlation` (#16). `pod_evicted` (#14) corpus is
unchanged.

## Design

- Each detector takes a different input shape (e.g. `HBMECCRecord +
  XidRecord` for `hbm_ecc`, `ThermalThrottleRecord` for
  `thermal_throttle`, ...), so the existing `LoadFixturesUnder` helper
  (typed on `Record + NodeRecord`) cannot be reused. Each detector
  gets its own `*_replay_test.go` that inlines the per-detector JSON
  read; shared fixture-discovery and golden-assert helpers live in
  `helpers_test.go`.
- Two detectors (`nccl_hang`, `ib_link_flap`) take a `Now` reference.
  Tests pin `Now` to a fixed timestamp matching the fixture's
  `started_ns` so hang-age and flap-window inclusion stay
  deterministic across replay runs (otherwise wall-clock drift would
  silently flip the verdicts as the fixture aged).
- Goldens were generated from the live detectors via
  `UPDATE_REPLAY_GOLDEN=1 go test ./module/pkg/replay/...` and pinned;
  future drift in detector output (headline / remediation prose,
  evidence-trail UID shape, scalar-field rename) surfaces as a
  `JSONEq` diff against the fixture. Operators can also eyeball the
  golden to assert what they EXPECT vs what fires.
- Negative fixtures each exercise a distinct discriminator (wrong Xid
  code, single GPU, no AER, no eviction, completed state, single
  transition, no OOM log) so a regression in one false-positive guard
  lights up the corresponding row only.
- Flipped `run_corpus: true` on every row of the chaos.yml
  pattern-detectors matrix now that every detector has a corpus.

## Test plan

- [x] `make check` — clean
- [x] `go test -race -count=1 ./pkg/replay/...` — 35 tests pass
  (28 new + 7 pre-existing pod_evicted)
- [x] `go test -count=1 ./processor/patterndetectorprocessor/...` —
  unchanged, still green
- [x] `make verify` (pre-push) — clean
- [ ] CI: chaos.yml `pattern-detectors` matrix — 8 rows, each runs
  hermetic regex + replay-corpus step

Closes #366.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #393. Ships the metric-side OTTL projection from
`node_exporter --collector.infiniband`'s `node_infiniband_port_state_id`
Gauge onto the customer-stable `hw.network.ib.*` namespace
(`hw.network.ib.port.state` int + `hw.network.ib.device` +
`hw.network.ib.port.num`) so pattern #2's `IBLinkFlapDetector`
consumes the same vendor-neutral wire shape regardless of whether the
underlying source is node_exporter, a Mellanox exporter, or the
`mlx5_core` journald stream.

Detector library + processor wiring already shipped in #391 (closed
#300). Only the metric-side input recipe was missing — pattern #2 was
configured-but-quiet on real deployments. This PR closes that gap.

## Wire contract (node_exporter raw → hw.network.ib.*)

```
node_infiniband_port_state_id{device="mlx5_0", port="1"} = 4
                                                          (IBA phys_state ID)
            ↓ transform/ib_to_hw_semconv
Gauge metric "hw.network.ib.port.state" with datapoint attrs:
  hw.network.ib.device  = "mlx5_0"        (str, from `device` label)
  hw.network.ib.port.num = 1              (int, from Int(`port` label))
  value                  = 4              (int, the phys_state ID)
```

The future RFC-0014 PR-B metrics→logs bridge emitter (shared with
patterns #3/#4/#5/#10) will lift these three attributes onto a log
record at emit time. The bridge log-record schema for pattern #2 is
pinned in `docs/integrations/prometheus-scrape.md §Pattern #2 —
hw.network.ib.port.state (issue #393)` so PR-B has no per-pattern
reconstruction work to do.

The companion series `node_infiniband_state{state="<name>"}` (string
label) is intentionally NOT mapped — the detector
(`module/processor/patterndetectorprocessor/ib_link_flap.go`) compares
`state.Int()` against `patterns.IBPortState*` integer constants, so
the string variant would round-trip wrong.

## No detector code change required

The detector reads three attribute names off a log record:
`hw.network.ib.port.state`, `hw.network.ib.device`,
`hw.network.ib.port.num`. The recipe stanza stamps the exact same
three names on the metric datapoint. The wire format `port.Int()`
expects (the projector at
`module/processor/patterndetectorprocessor/ib_link_flap.go` line 39
calls `int(port.Int())`) is satisfied because the OTTL `Int()` cast
on the Prometheus `port` string label produces a pdata int Value.
Confirmed by the new `TestRecipe_IBLinkFlap_RoundTripFiresVerdict`
test.

## Root cause + scope

- **Root cause of #393**: missing metric-side OTTL stanza. Fixed.
- **Out of scope (separate blocker, tracked under #260 PR-B)**: the
  metrics→logs bridge emitter. Upstream-blocked at OTel-contrib v0.130
  — `transformprocessor`'s `metric_statements` cannot reference
  `log.*` paths and no contrib connector emits log records from a
  metrics pipeline (per

[RFC-0014](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0014-metrics-to-logs-pattern-input.md)).
  The recipe doc explicitly documents this gating relationship; PR-B
  is shared with patterns #3/#4/#5/#10 and lands the bridge once.

## Files changed

- `docs/integrations/examples/prometheus-scrape.yaml` — new
  `transform/ib_to_hw_semconv` processor; wired into the
  `metrics/scrape` pipeline. Validates with `./_build/tracecore
  validate` (exit 0).
- `docs/integrations/prometheus-scrape.md` — new "Pattern #2 —
  InfiniBand link flap" projection section, intro updated from "Two"
  to "Three OTTL transforms", and bridge log-record contract
  subsection added under the "Metrics-to-logs bridge contract"
  section.
- `docs/patterns/pattern-2-ib-link-flap.md` — deleted "Integration
  gap" section, replaced with "Integration recipe" pointing at the
  shipped stanza; updated "Why node_exporter sees it" prose to drop
  the "pending" hedge.
-
`module/processor/patterndetectorprocessor/ib_link_flap_recipe_test.go`
  — new file. `TestRecipe_IBLinkFlap_StanzaPinsWireContract` parses
  the example YAML and asserts every load-bearing token is present
  (source metric name, three `hw.network.ib.*` attrs, the `Int()`
  cast on the port label, the transform name, and the pipeline
  wiring). `TestRecipe_IBLinkFlap_RoundTripFiresVerdict` simulates
  the end-to-end path: builds `plog.Logs` with the exact attribute
  shape the recipe stamps and asserts the processor emits a flap
  verdict.

## Test plan

- [x] `./_build/tracecore validate
--config=docs/integrations/examples/prometheus-scrape.yaml` → exit 0
- [x] `bash scripts/validator-recipe.sh` → 9 validated, 3 skipped
(non-linux host)
- [x] `bash scripts/doc-check.sh` → clean (no orphan test refs)
- [x] `go test ./module/processor/patterndetectorprocessor/... -count=1`
→ PASS (incl. the two new tests + all 5 existing IB tests)
- [x] `go build ./...` and `go vet ./...` → clean
- [x] Pre-commit hooks: golangci-lint 0 issues, go mod verify,
attribute-namespace-check 100/100
- [x] Mutation-verified: dropping `hw.network.ib.port.state` from the
recipe yaml fails `TestRecipe_IBLinkFlap_StanzaPinsWireContract` with
the expected remediation message naming the missing identifier
- [ ] CI on the PR (waiting on push)

```release-notes
feat(recipe): InfiniBand link-flap OTTL stanza projecting node_exporter's `node_infiniband_port_state_id` onto the tracecore-canonical `hw.network.ib.*` namespace (`hw.network.ib.port.state` int + `hw.network.ib.device` + `hw.network.ib.port.num`). Pattern #2's `IBLinkFlapDetector` now has its metric-side input wired; metrics→logs bridge emitter remains gated on RFC-0014 PR-B (#260).
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…ard) (#477)

## Summary

Closes the `docs/MILESTONES.md` §M6 carry-forward: *"every fenced block
in `docs/getting-started.md` is exercised by `scripts/smoke.sh`"*.

The ≤5-count gate shipped with the M6 wave; the binding half was tracked
carry-forward because `smoke.sh` ran a parallel hand-written
hostmetrics→debug config rather than the doc's actual YAML.

## Root cause

Two scripts owned the "first OTLP byte" config — `smoke.sh` rendered one
inline, `docs/getting-started.md` carried another. They happened to
agree, but nothing forced them to. The carry-forward existed because the
binding was *correct by inspection*, not *correct by construction*.

The fix is to make the doc the single source: `smoke.sh` extracts the
YAML from `docs/getting-started.md`'s `## Walkthrough` heredoc at
runtime. If the doc grows a typo, a renamed receiver, or a different
scraper, `smoke.sh` exercises the change automatically. If the heredoc
disappears, the extractor fails loud with a named error.

## Changes

- `scripts/smoke.sh` — extracts the Walkthrough heredoc via a perl
one-liner, writes it to a tempfile, then runs `tracecore validate
--config=` + `tracecore --config=` against it (Walkthrough steps 3 + 4).
Lifecycle-log assertions retained, with `"Shutdown complete"` now
load-bearing against the doc's post-walkthrough prose.
- `scripts/doc-check.sh` — new gate (right after the existing ≤5-count
gate) asserts the smoke↔doc binding with four mutation-verified clauses:
Walkthrough scope, `"$BIN" validate --config=` invocation, `"$BIN"
--config=` run invocation, `docs/getting-started.md` path reference.
- `scripts/smoke_test.sh` — new mutation-verify harness mirroring the
gate at runtime, plus an inline mutant-doc test that proves the
extractor exits 1 and the wrapper emits the named error when the heredoc
is removed.
- `Makefile` — `make smoke` now also runs `smoke_test.sh`; wired into
`ci-full` alongside the existing `smoke-quickstart` target.
- `docs/MILESTONES.md` — §M6 status `⧗ partial` → `☑ delivered`;
getting-started rubric `⧗` → `☑`; carry-forward bullet rewritten
(remaining work is operator-config branch-protection only).

## Runtime

End-to-end `bash scripts/smoke.sh` on darwin/arm64: **~2.2s** (extract +
validate + 1.5s run window + lifecycle-log assertions). Well under the
120s ci-fast budget. No hardware required — uses the `hostmetrics` load
scraper, portable across linux/darwin/windows.

## Test plan

```release-notes
ci(smoke): scripts/smoke.sh now extracts its YAML config from docs/getting-started.md '## Walkthrough' instead of carrying a parallel hand-written config; doc-check.sh gates the doc↔smoke binding with four mutation-verified clauses. Closes the M6 carry-forward.
```

- [x] `bash scripts/smoke.sh` exits 0 on clean main (verified locally,
~2.2s).
- [x] `bash scripts/smoke_test.sh` all assertions pass.
- [x] `bash scripts/doc-check.sh` reports `scripts/smoke.sh binds to
docs/getting-started.md (M6: every block exercised by smoke.sh)`.
- [x] Mutation test #1: `sed -i 's/"$BIN" validate --config=/"$BIN" XXX
--config=/' scripts/smoke.sh` → doc-check exits 1 naming "validate
--config= invocation (Walkthrough step 3)".
- [x] Mutation test #2: `sed -i 's/"$BIN" --config=/"$BIN" XXX=/'
scripts/smoke.sh` → doc-check exits 1 naming "run invocation
(Walkthrough step 4)".
- [x] Mutation test #3: `sed -i 's/Walkthrough/Section/'
scripts/smoke.sh` → doc-check exits 1 naming "extraction scope lost".
- [x] Mutation test #4: `sed -i
's/docs/getting-started.md/docs/SOMEWHERE-ELSE.md/' scripts/smoke.sh` →
doc-check exits 1 naming "binding source missing".
- [x] Mutation test #5: getting-started.md with no `## Walkthrough`
heredoc → smoke.sh exits 1 with named error message (covered by
`smoke_test.sh`).
- [x] `make lint` 0 issues; `make vet` clean; `make doc-check` clean
(all 18 gates pass).
- [x] `make smoke` end-to-end including `smoke_test.sh` passes.

## Related

- Refs `docs/MILESTONES.md` §M6 (Documentation scaffold).
- Sibling #460 (`fix(doc-check): drop unconditional exit 0`) made this
carry-forward visible — before #460, the new gate would have been
silently skipped by the line-99 short-circuit.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants