Skip to content

Apply Wave-1 follow-ups: tests, automation, doc-truth#5

Merged
trilamsr merged 2 commits into
mainfrom
wave1-followup
May 13, 2026
Merged

Apply Wave-1 follow-ups: tests, automation, doc-truth#5
trilamsr merged 2 commits into
mainfrom
wave1-followup

Conversation

@trilamsr

Copy link
Copy Markdown
Contributor

What this PR does

Four small follow-ups on top of #4 (Wave-1 governance bootstrap). Each is independently small; they're bundled here because they share a theme (tightening the wave we just landed).

  1. dco-check_test.sh is now signing-safe. The test creates a temp git repo and commits to it. Without overriding commit.gpgsign locally, a contributor who's run scripts/setup-signing.sh (and thus has commit.gpgsign=true globally) would see the test commits try to sign with a key not present in the sandbox repo — a false negative in our own test suite. Fix is two git config lines.

  2. scripts/apply-branch-protection.sh applies the rules in .github/branch-protection.yml directly via gh api. Idempotent; safe to re-run. Replaces the manual-UI-clicks-then-pray pattern. After this PR merges, applying branch protection is one command instead of a series of GitHub UI navigations — and drift between the intent doc and live settings becomes a re-run away from being fixed.

  3. AGENTS.md documents the kernel-style Assisted-by: trailer format (<agent>:<model> [<tool>]) as preferred, with the simple form (<tool name>) still accepted. Aligns the doc with the format every commit on this branch and Wave 1: governance bootstrap (CODEOWNERS, DCO, signing) #4 already uses, and matches the Linux kernel's coding-assistants.rst convention.

  4. The "Aim for under 500 lines" guidance is removed from CONTRIBUTING.md and the matching bullet in AGENTS.md. Following a recent design discussion, we explicitly chose not to enforce a PR size cap (the friction it adds outweighs the discipline it provides for a small-team project). Replaced with: "split when the diff outgrows the concern, not when it crosses an arbitrary line count".

Depends on

#4. This branch is wave1-followup cut from wave1-governance; once #4 merges, this PR's diff is just the four follow-up changes.

Release notes

NONE

Checklist

  • Tests still pass (make ci, all *_test.sh suites)
  • Commits are signed off
  • If AI-assisted, Assisted-by: trailer present

trilamsr added 2 commits May 13, 2026 10:11
Four small follow-ups on the contribution-safeguards bundle.

1. dco-check_test.sh sets `commit.gpgsign=false` and
   `tag.gpgsign=false` in the sandbox repo before any commit.
   Without this, contributors who have run scripts/setup-signing.sh
   (and therefore have `commit.gpgsign=true` in their ~/.gitconfig)
   would see the test commits attempt to sign with a key not
   present in the temp repo's context — a false negative in our
   own test suite.

2. scripts/apply-branch-protection.sh applies the rules in
   .github/branch-protection.yml directly via `gh api`. Idempotent;
   safe to re-run. Replaces the manual-UI-clicks-then-pray pattern
   with one command. Drift between the intent doc and live settings
   becomes a re-run away from being fixed.

3. AGENTS.md now documents the kernel coding-assistants.rst trailer
   format (`Assisted-by: <agent>:<model> [<tool>]`) as preferred,
   with the simple form still accepted. Aligns the doc with the
   format every commit on this branch already uses.

4. The "Aim for under 500 lines" guidance is removed from
   CONTRIBUTING.md and the matching bullet in AGENTS.md. We
   explicitly chose not to enforce a PR size cap; the soft guidance
   was misleading. Replaced with "split when the diff outgrows the
   concern, not when it crosses an arbitrary line count".

Refs RFC-0003.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three small wording cuts after a fresh-eyes pass:

- .github/scripts/dco-check_test.sh: 4-line signing-disable comment
  collapsed to one. The "why" is non-obvious enough to keep one
  line; the other three were prose.
- scripts/apply-branch-protection.sh: dropped the "Future improvement:
  parse the YAML directly..." block. Forward-looking comments rot
  per PRINCIPLES §2; the drift hazard will surface when it bites.
  Header trimmed from 16 lines to 7.
- CONTRIBUTING.md: AI-assisted commits bullet now points at AGENTS.md
  for the trailer format instead of duplicating examples inline.

Net: -12 lines, no behavior change, all tests still green.

Refs RFC-0003.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr enabled auto-merge (squash) May 13, 2026 17:28
@trilamsr trilamsr merged commit 0c3ce85 into main May 13, 2026
9 checks passed
@trilamsr trilamsr deleted the wave1-followup branch May 13, 2026 17:29
trilamsr added a commit that referenced this pull request May 14, 2026
Address self-review items #5 (pkg-name confusion), #12 (MILESTONES
glyph), #16+#17 (commit/scope discipline), #18 (telemetry pkg
architecture intent), #19 (Resource auto-pop) — all doc-only.

internal/telemetry/doc.go + internal/selftelemetry/doc.go: explicit
"where do I add code" guidance with the conceptual split (process-
level SURFACE vs per-component PRODUCER CONTRACT). Includes the
architecture intent for future milestones — when tracing lands the
TracerProvider goes in `telemetry`, signal-direction-driven split
(consumer/producer) rather than modality-driven (metrics/traces).

selftelemetry/doc.go also pins the cardinality contract explicitly
(every label value must be low-cardinality; canonical Kind* constants
exist for that purpose) and the Resource auto-population contract.

MILESTONES.md: new ☒ glyph for "policy-declined, not deferred."
Applied to the /readyz SLO-threshold criterion — RFC-0006 explicitly
chose degraded ≠ not-ready, that's a policy, not a missing feature.

FOLLOWUPS.md: new "M2 process lessons" section captures three honest
process gaps from this milestone for the next loop's prompt:
- Scope-creep (Go-bump + CI-fix-deadlines shipped under [telemetry])
- Commit history discipline (25 commits incl. 13 review-fixes)
- Self-assessment optimism caught by gates four separate times

Lessons → next-milestone-prompt input, not corrective action for M2
(retroactive split blocked by no-force-push rule).

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 14, 2026
Closes 4 of 7 new A+ criteria from the recursive self-review:

#1 — e2e-otelcontrib now verifies the collector PARSED the record,
not just that it accepted bytes. Workflow rewritten to docker-run
otelcol-contrib with a custom config (file + debug exporters,
detailed verbosity). After the e2e POST, the bash step greps
/tmp/otelout/logs.json for the canonical body, the kernelevents.xid
attribute, and the gpu.id attribute. Empty file or missing
attributes → workflow fails.

#2 — TestIntegration_KmsgWriteReadBehavioral (//go:build linux)
writes a synthetic <6>NVRM Xid 79 line to /dev/kmsg, uses a marker
string in a regex_filter to isolate from ring-buffer noise, then
asserts the receiver emits a plog.LogRecord with kernelevents.xid=79
+ gpu.id=0000:65:00.0 within 3s. A regression in parse/build/emit
fails this on Linux CI.

#3 — prometheus_alerts_test.go validates the alert YAML structure
(every group has interval, every rule has expr/severity/summary/
description) AND cross-references the metric + label-filter names
against the receiver's actual SelfTelemetry surface. A typo in
the alert would silently never fire; this catches it before merge.

#5 — runbook_test.go executes the RUNBOOK's "First 15 minutes"
step 1 (`tracecore validate --config=...`) and step 2
(`tracecore debug dump`) as real commands. Documentation rot
becomes a test failure, not a silent SRE-time discovery.

#4 — sustained_test.go (`//go:build sustained`) feeds 1000
events/sec for 5 minutes (300k records), samples heap every 30s,
asserts ≤10 MiB growth and p99 emit latency tail bounded. New
`sustained-load` workflow job runs it on push-to-main + schedule
(not PR — 5 minutes is too slow for the inner loop).

The seventh criterion (two-week soak + external operator) requires
elapsed time + a human; nothing in-session can close it.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Brings the M8 DCGM receiver to this branch and reconciles divergent
decisions between the two parallel milestones.

Auto-merged: CHANGELOG.md, MILESTONES.md, FOLLOWUPS.md,
docs/FAILURE-MODES.md, docs/agents/RECEIVER-PATTERNS.md,
components/receivers/clockreceiver/clockreceiver.go.

Conflicts resolved:
- components.yaml + cmd/tracecore/components.go: keep all three
  receivers (clockreceiver, dcgm, kernelevents); regenerated via
  `make generate`.
- cmd/tracecore/main.go: keep both M9 `debug dump` and M8
  `receivers list` subcommands. Both surfaces have tests.
- Makefile .PHONY: unioned M9's test-extras-* targets with M8's
  smoke + build-tags targets.
- docs/agents/REVIEWER-CONTEXT.md: kept both the M9 streaming-
  receiver section and the M8 receiver-author-patterns section.

Divergent decisions identified and unified:

1. selftelemetry.Kind typed-enum (M8 introduced) — M9's
   IncError("string") call sites migrated to canonical Kind*
   constants (KindPanic, KindDownstream, KindParse, KindCardinality)
   and receiver-local typed constants (kindJournalctlCrash,
   kindKmsgOversized, kindKmsgOverflow) in a new
   `components/receivers/kernelevents/kinds.go`. Mirrors M8 dcgm's
   `kindWatch` / `kindMIG` shape.

2. CONTRIBUTING.md "Adding a receiver" — M8 added one section
   centred on dcgm; my earlier A+ pass added a second. Merged into
   a single "Adding a receiver or exporter" section with a 3-row
   shape-selection table (trivial / vendor-SDK / streaming-source)
   plus shared component-local conventions and the kernelevents-
   source sub-section.

3. Lifecycle helper canonical-vs-inline — declared
   `internal/runtime/lifecycle.Lifecycle` as the canonical pattern
   in docs/agents/RECEIVER-PATTERNS.md ("Canonical lifecycle
   helper") and documented the M8 DCGM divergence inline with
   rationale. Tracked the DCGM migration as
   docs/FOLLOWUPS.md row #5; not refactored here to avoid bloating
   this merge's scope.

4. Receiver-author skeletons — M8's `docs/agents/examples/*.go`
   (six general receiver patterns) and M9's
   `components/receivers/kernelevents/source_template.go.example`
   (kernelevents-source-specific) now cross-reference each other.
   The kernelevents template's reading order points authors at the
   general examples for cross-cutting patterns.

5. alert-check.sh wired into make doc-check (from M9) now also
   verifies the dcgm RUNBOOK ↔ prometheus-alerts.example.yaml
   parity automatically. doc-check now verifies 50 test references
   across 2 docs (was 44 across 2 pre-merge); alert-check 2 pairs.

All tests pass under make ci; coverage gates clean
(kernelevents 80.8%, dcgm 86.0%, lifecycle 96.3%).

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 15, 2026
Three test-correctness findings.

R1.B4 — gateReader Close race. `cascade_test.go::gateReader.Close`
used a `*bool` guard around `close(g.release)`; under -race, two
Close calls (deferred-defer + ctx-cancel watcher) could observe
the bool false and both close the channel → "close of closed
channel" panic in tests, or a race-detector trip. Fixed with
`*atomic.Bool` + CompareAndSwap so only the first Close fires the
release. Test passes under `-race -count=3`.

R1.S2 — TestPrometheusAlerts_ReferenceWiredCounters used
`require.Contains(body, metric)` on the raw YAML file, which
would be satisfied by a metric name appearing in a comment or
description. Refactored to walk the parsed
`doc.Groups[].Rules[].Expr` and assert metric/kind references
appear in actual alert expressions. Same test now also enforces
the R1 #4 follow-up: every rule must carry a `runbook_url`
annotation. Three alerts × runbook_url = three new structural
assertions.

#5 — drainStderr lacked panic recovery. Per PRINCIPLES §1, every
goroutine carries panic recovery; bounded-buffer Scanner +
slog.Warn aren't expected to panic, but a defensive recover
keeps a future refactor (custom formatter, sink swap) from
taking down the receiver via this side path. Defer added at the
top of `journald.go::drainStderr` mirroring the receiver's main
goroutine pattern; routes the recovered panic to
`IncError(KindPanic)` + structured log so it surfaces on the
operator's dashboard.

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…rd surface

## What this PR does

Lands M19 (pattern #14 pod-evicted) end-to-end against the rubric's
hermetic replay-test gate, plus the M10 NodeRecord surface the
detector consumes.

### Pattern detector (internal/synthesis/patterns/)

- `pod_evicted.go`: `PodEvictedDetector{}.Evaluate(events, nodeConds)`
  joins each Pod-Evicted Record against the most-recent per-node
  pressure transition within a 30s sliding window (configurable via
  `JoinWindow`). Returns one `PodEvictedVerdict` per evicted pod,
  sorted by event time for golden-test determinism.
- Headline matches the rubric's regex contract for the canonical
  disk-pressure fixture (`/Pod .* evicted at .* due to disk pressure/`);
  remediation matches `/relocate.*NVMe/`. Three other pressure roots
  (memory/pid/notready) get distinct operator-facing remediations.
- Missing-layer path: when no NodeRecord joins, verdict falls back to
  `confidence: partial` with `missing_layers: ["node_condition"]` and
  the pressure kind is parsed heuristically from the Event Note.
- `verdict.go`: `PodEvictedVerdict` + `EvidenceRef` + `Confidence`
  with JSON tags pinned by `testdata/verdict.schema.json` (additional-
  Properties:false; drift rejected in `verdict_schema_test.go`).
- Generic `Detector` interface deferred per PRINCIPLES §3 rule-of-three
  — M17 / M18 land as the second and third consumers.

### Replay runner + corpus (internal/synthesis/replay/)

- `runner.go`: `LoadFixture(dir)` / `LoadFixturesUnder(root)` walk a
  directory tree, load `manifest.json` + `events.json` +
  `node_conditions.json` + `golden.json`. Pure stdlib; hermetic by
  design per rubric L414.
- `runner_test.go`: drives every fixture under `pod_evicted/`,
  JSONEq-diffs detector output against `golden.json`. Asserts the
  canonical + three negative fixtures all exist as a structural gate.
- Canonical fixture: `pod_evicted/canonical/` (Evicted Pod on
  `gpu-node-0001` at T=10:00:00Z; DiskPressure transition at T-5s →
  full-confidence verdict).
- Negative fixtures under `_negative/`: `killing/`, `preempted/`,
  `failed_scheduling/` — each produces an empty verdict slice (rubric
  L407 FP-rate contract).

### M10 NodeRecord surface (components/receivers/k8sevents/)

- `node_record.go`: typed `NodeRecord` + `NodePressureKind` constants
  (memory/disk/pid/notready) + distinct `NodeSchemaURL` so downstream
  consumers version-gate the two surfaces independently. `HintForPressure`
  inversion: NotReady → HintNodeUnhealthy; rest → HintNodePressure.
- `node_emit.go`: `buildNodeLogRecord` analogous to `buildLogRecord`.
- `node_informer.go`: pure `convertNodeConditions(*corev1.Node)
  []NodeRecord` (only alarming statuses surface: True for the three
  pressures, False for Ready) + `nodeConditionDedup` that tracks
  (nodeUID, conditionType, transitionNano) so an unrelated Node-object
  update doesn't re-emit.
- `receiver.go`: Node SharedInformer registered when
  `Config.NodeConditions.Enabled=true` (default false — wider RBAC).
  `nodeRecords chan NodeRecord` joins the existing `events chan` in
  `run()`'s select; nil channel blocks when disabled.
- `rbac.yaml`: adds `core/v1.nodes get,list,watch`. `rbac.can-i.golden`
  updated to match (sorted verb-then-resource).
- `Record.ReportingInstance` added (kubelet sets to node name; M19's
  join key); `MaxAttributesFloor` bumped 9→10 to keep it intact.
- `pattern_consumer_test.go`: 4 new compile-gate tests pin the
  NodeRecord field walk, the four-kind exhaustiveness, the pressure→
  hint inversion, and `reporting.instance` wire-format pinning.
- `node_record_test.go`: mirrors `hint_test.go` for the
  conditionType→NodePressureKind taxonomy and the pressure→Hint map.

### CLI

- `tracecore failure-inject pod-evict --reason=... --seed=N --out=path`
  wraps `tools/failure-inject/podevict/podevict.go` for deterministic
  Event YAML generation. Closes FOLLOWUPS L806-815 carry-forward.
  `--allow-cluster-write` still returns `ErrClusterWriteUnsupported`
  (exit 70) — captured in M19's new carry-forward #5.

### Hygiene

- CHANGELOG entry under [Unreleased] / ### Added.
- MILESTONES §M19 flipped ☐ → ⧗ partial; rubric items annotated ☑/⧗
  with carry-forwards listed.
- FOLLOWUPS L806-815 closed; L927-934 (chaos.yml pattern_evicted
  matrix row) trigger-fired annotation added.

## Linked issue(s)

_No linked issue._

## Release notes

```release-notes
NONE
```

## Checklist

- [x] `make ci` exit 0 (verified twice; runtime budget under 60s)
- [x] All new unit + replay tests pass; no regressions across repo suite
- [x] JSON Schema gates verdict shape (additionalProperties:false)
- [x] Negative fixtures produce zero verdicts
- [x] RBAC golden regenerated; pattern_consumer compile-gate updated
- [x] Commits signed off

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…+ threat-root trace on go-mod-verify

Phase-4 A+ aspiration review (2 fresh subagents; both graded A,
diverged on which gates to apply). Validation cycle:

Findings table:

| ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action |
|----|------|-------------|----------|---------|-------|------------|------------|---------|--------|
| P4.1 (aplus-1 #2, also P2.6) | aplus | operator | CONCERN | A workflow_dispatch run with `inputs.tag` set but `github.ref` ≠ refs/tags/$INPUT_TAG passes Build and fails the OIDC smoke check 15-30 minutes later. Operator wastes runner time and sees the misuse late. | New "Verify dispatch ref matches tag (pre-flight)" step exit-1s within seconds with the documented workaround. | Reviewer noted the smoke check already enforces this — but at job-end, not at job-start. Fail-fast IS the load-bearing property. | n/a — workflow YAML, actionlint clean | yes — P4-aplus-1 | applied this commit; closes P2.6 deferral. |
| P4.4 (aplus-2 #2) | aplus | repo-long-term | NIT | go-mod-verify comment says "defense in depth against a compromised GOPROXY mirror" but doesn't name the trust root or the orthogonal threat (a poisoned go.sum itself). | Comment now states "Trust root: the go.sum at this tag commit" and cross-references the tag-protection FOLLOWUPS entry. | A future maintainer might over-attribute the protection. | n/a | no | applied this commit |

Rejected/deferred:

- P4.2 (aplus-1 #4) — Structured diff lint for release.yml ↔
  docs/reproducibility.md. DEFER to FOLLOWUPS.md (real value, but
  manual review caught both drift directions in Phases 2 + 3;
  automate when next edit happens).
- P4.3 (aplus-1 #6) — Release artifact manifest validation before
  upload. REJECT. Per anti-bureaucracy: reviewer concedes `needs:`
  dependency already gates malformed artifacts from reaching the
  release job. Adding defensive validation against a CI-bug
  scenario is bloat.
- P4.5 (aplus-1 #3) — docs/SUPPLY-CHAIN-IDENTITY.md consolidated
  reference. DEFER to FOLLOWUPS.md; ~30-min write, scope creep
  beyond release.yml. M21 release-checklist is the natural trigger.
- P4.6 (aplus-1 #5, aplus-2 #3) — Formal threat-model document +
  M21 alignment narrative. DEFER to M21.
- P4.7 (aplus-2 #5) — Cross-link health lint. Duplicate of P4.2;
  same deferral.

Reproducibility:
  $ make actionlint zizmor   # exit 0
  $ grep -A1 "workflow_dispatch with inputs.tag" .github/workflows/release.yml
    # pre-flight gate present

Letter-grade outcome:
  Reviewer #1 starting: A → A+ via criteria 2, 4, 6 (we applied 2 + threat-model comment)
  Reviewer #2 starting: A → APPROVED-AS-IS (already strong)
  After this commit: A on the falsifiable axis (one operator-UX gate
  + one comment clarification), with the broader doc/lint work
  scoped to follow-ups.

Beneficiary: operator. The pre-flight gate cites a specific
operator-facing surface (15-30 minute waste on workflow_dispatch
misuse) and turns it into a seconds-fast named error.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
…ctor wiring (PR-A for #260) (#267)

## Summary

PR-A for [issue
#260](#260) — the
recipe-side half. Two commits:

1. `docs(adr): metrics-to-logs converter for pattern input (Option A)` —
records the architectural decision and the upstream-blocker analysis
under a new `docs/adrs/` directory.
2. `docs(recipe): extend prometheus-scrape with DCGM to hw.* OTTL
transform` — adds `transform/dcgm_to_hw_semconv` (a second OTTL pass
after `transform/gpu_vendor`) and documents every per-pattern projection
in `docs/integrations/prometheus-scrape.md`.

### Architectural decision (issue #260 §Proposed plan step 2)

**Option A** — extend `patterndetectorprocessor` with
`processor.WithMetrics` alongside the existing `WithLogs` (deferred to
PR-B; this PR ships only the wire-format contract PR-B will consume).

Per `[[adopt-over-build]]` Option B (OTTL metrics-to-logs converter) was
the preferred starting point. It's blocked upstream: at OTel contrib
v0.130 (the release pinned in `builder-config.yaml`) the
`transformprocessor` signal contexts are sealed per-signal —
`metric_statements` can only reference `resource | scope | metric |
datapoint` paths, never `log` ([README @ v0.130.0 §
Config](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config)).
Surveying every contrib connector at v0.130 also turns up zero
metrics→logs primitives (`countconnector`, `signaltometricsconnector`,
`spanmetrics`, `exceptions`, `servicegraph` all output metrics;
`routing`/`failover`/`roundrobin`/`otlpjson` are signal-preserving;
`datadog` and `grafanacloud` are vendor-specific). The full survey +
reasoning is in
[ADR-0001](docs/adrs/0001-metrics-to-logs-pattern-input.md).

Option C (routing connector + sibling) is strictly heavier than A with
the same end state — routing preserves signal type, so it cannot bridge
metrics→logs either.

Trade-off accepted: doubled processor surface inside
`patterndetectorprocessor` (one `ConsumeLogs`, one `ConsumeMetrics`).
Mitigation: the metrics path is a thin projection on top of the same
`module/pkg/patterns/` library; verdict emission stays log-based via a
connector. Upstream-contribution slot opened — a
`metricthresholdconnector` (inverse of `signaltometricsconnector`) would
let us collapse Option A back to Option B without operator-visible
churn. Tracked under RFC-0013 §5.

### Per-pattern OTTL projection (this PR)

| Pattern | Raw DCGM input | OTel output | Trigger-statement shape (PR-B
consumes this) |
|---|---|---|---|
| #1 NVLink | `DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES` |
`hw.gpu.nvlink.io` (Counter, `By`) + `hw.gpu.nvlink.link={N}` +
`network.io.direction={transmit\|receive}` | `rate(hw.gpu.nvlink.io[5m])
< 0.5 × per-GPU median over the per-`hw.id` link set, for 10m` |
| #3 HBM ECC | `DCGM_FI_DEV_ECC_{SBE,DBE}_{VOL,AGG}_TOTAL` | `hw.errors`
(Counter, `{error}`) + `error.{type,subtype,persistence}` |
`increase(hw.errors{error.type="uncorrected",
error.persistence="volatile"}[5m]) > 0` |
| #4 thermal |
`DCGM_FI_DEV_{THERMAL,POWER,SYNC_BOOST,BOARD_LIMIT,LOW_UTIL}_VIOLATION`
| `hw.gpu.throttle.duration` (Counter, `s`) + `hw.gpu.throttle.reason` |
`increase(hw.gpu.throttle.duration{reason="thermal"}[5m]) > 30s, for
10m` |
| #5 PCIe | `DCGM_FI_PROF_PCIE_{TX,RX}_BYTES` | `hw.gpu.io` (Counter,
`By`) + `network.io.direction` + `hw.gpu.pci.bdf` resource attr |
`rate(hw.gpu.io[5m]) < 0.3 × host median, for 15m` |

Resource attribution: `UUID` or `gpu_uuid` → `hw.id`; `gpu` or `GPU` →
`hw.gpu.index`; `hw.type="gpu"` stamped on every DCGM series. Dual-label
fallback handles both legacy and modern dcgm-exporter releases.

### What's NOT in this PR

The third commit originally planned by issue #260 — `feat(recipe):
metrics→logs OTTL alert emission` — is dropped because OTTL cannot emit
log records from metric streams at v0.130 (root cause cited above, full
analysis in ADR-0001). The fallback is Option A, which requires
modifying `patterndetectorprocessor` source — out of scope for a
recipe-only PR. The alert-emission half lands in PR-B alongside the
NVLink detector code.

## Test plan

- [x] `./_build/tracecore validate
--config=docs/integrations/examples/prometheus-scrape.yaml` exits 0.
- [x] `bash scripts/doc-check.sh` clean (test-name parity, markdown link
integrity, banned-phrase lint, integration-recipe rubric all pass).
- [x] `bash scripts/validator-recipe.sh` validates 6 recipes (skips 2
requires-linux / requires-k8s-cluster — CI ubuntu runner exercises those
paths).
- [x] `make check` (golangci-lint, go vet, mod-verify) clean.
- [ ] Adversarial reviewer check that the OTTL `set(metric.name, ...)`
identity-conflict caveat (recipe markdown § "Identity-conflict caveat")
matches the operator's downstream backend's deduplication semantics for
the four projected metric families.
- [ ] CI runner exercises the requires-linux / requires-k8s recipes that
this PR's host skipped.

## Followup

- **PR-B (issue #260, blocked by this PR's merge)** — extend
`patterndetectorprocessor` with `processor.WithMetrics`; land
`module/pkg/patterns/nvlink_degradation.go`; wire verdict emission
through a sibling logs pipeline via a connector. Decision rationale
baked into ADR-0001 so PR-B doesn't relitigate.
- **Upstream contribution (v0.3 cycle, RFC-0013 §5)** — propose a
`metricthresholdconnector` (or `signaltologsconnector`) to OTel-contrib.
If/when it lands, the recipe can collapse back to Option B without
operator-visible churn — the `hw.gpu.*` wire-format contract this PR
ships is the load-bearing customer surface, not the upstream component
selection.
- **dcgm-exporter throttle-reason mapping verification** — the pattern
#4 mapping table assumes modern dcgm-exporter `DCGM_FI_DEV_*_VIOLATION`
series names. If your dcgm-exporter release exposes the throttle bitmask
as `DCGM_FI_DEV_CLOCKS_EVENT_REASONS` instead (per the semconv proposal
Open Question #1), extend the OTTL block to expand the bitmask into
discrete `hw.gpu.throttle.duration` datapoints — out of this recipe's
scope; tracked as a follow-up under the same milestone.

```release-notes
docs(recipe): The `prometheus-scrape` recipe gains a second OTTL transform pass that projects raw `dcgm-exporter` `DCGM_FI_*` series into the customer-stable `hw.gpu.*` / `hw.errors` namespace (semconv proposal). Four pattern families covered — NVLink, HBM ECC, thermal throttle, PCIe — using upstream OTTL functions only (`set`, `IsMatch`, `ExtractPatterns`, `Int`). The wire-format contract this transform produces is the input PR-B's NVLink detector will consume; the in-tree component change for the detector itself lands in a follow-up PR per ADR-0001.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
…AR pattern #4) (#281)

## Summary

Pattern #4 detector + processor wiring: thermal throttle cascade.
Detects when N+ GPUs on the same node show rising
`hw.gpu.throttle.duration{reason=thermal}` delta within a window —
cluster-rack thermal event.

**Layer decision**: single-layer (multi-GPU same-node cascade; no
cross-signal join). Mirrors `nccl_hang`'s no-Confidence shape.

**Defaults**:
- `Window`: 5min
- `ThrottleDeltaThreshold`: 30s (per pattern doc)
- `MinCascadeGPUs`: 4

**Integration gap**: detector ready; end-to-end firing blocked on PR-B
metrics→logs converter per ADR-0001. Mirrors HBM ECC pattern's gap at
#273.

Closes pattern #4 in the NORTHSTAR ladder (3/15 → 4/15 covered after
this lands; #5 in parallel PR).

## Test plan

- [x] `cd module && go test ./pkg/patterns/... ./processor/... -race
-count=1`
- [x] `make build && make check`
- [ ] CI gates (validator-recipe, install-bench, bench-check)

```release-notes
feat(patterns): thermal_throttle pattern detector (pattern #4) — detects multi-GPU same-node thermal cascade
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Ships NORTHSTAR pattern #5 per docs/patterns/pattern-5-pcie-aer.md.
Two layers required for a verdict; a single layer alone emits
nothing.

- Layer 1 (kernel AER): `PCIe Bus Error: severity=…, type=…` line
  on a specific `gpu.id` (PCI BDF) extracted by the journald-kernel
  OTTL recipe.
- Layer 2 (rate-collapse): per-GPU `hw.gpu.io` `BytesPerSecond` <=
  `(1 - RateDropThreshold) * BaselineBytesPerSecond` on the same
  BDF within the correlation window. Default threshold 0.5 (50%
  drop) and window 5min. Causality flows forward — the AER must
  precede the rate-collapse sample.

Why two layers required (no Confidence taxonomy, mirroring
xid_correlation + hbm_ecc):
- AER alone: corrected errors recovered — the link re-trained.
  Operators see raw AER telemetry on journald already.
- Rate-collapse alone: workload-natural Tx/Rx variance (a rank
  finished a comm phase, etc.). The AER is the hardware-fault
  confirmation that makes the join operator-actionable.

Join key is `gpu.id` (PCI BDF) per RFC-0013 §3, shared across the
dmesg AER preamble and the dcgm-exporter `hw.gpu.pci.bdf` resource
attribute on `hw.gpu.io`. Node is carried for the verdict prose
but not used in the join key — the BDF is the proximate hardware
identifier.

Defaults:
- CorrelationWindow: 5min — mirrors the spec's `rate(...[5m])`
  PromQL and typical post-AER link-renegotiation latency.
- RateDropThreshold: 0.5 — a one-generation PCIe downshift (Gen5
  → Gen3 or x16 → x8) lands well past this floor.

What's new:
- module/pkg/patterns/pcie_aer.go — PCIeAERDetector library type,
  PCIeAERRecord (AER kernel message), PCIeIORecord (hw.gpu.io
  rate sample with baseline), PCIeAERVerdict output.
  PatternIDPCIeAER = "5"; EvidenceKindAER = "pcie_aer";
  EvidenceKindPCIeIOCollapse = "hw_io_collapse".
- module/pkg/patterns/testdata/pcie_aer_verdict.schema.json — JSON
  schema with 10 drift falsifiers (extra field, confidence re-add,
  pattern.id numeric, pattern.id wrong value, severity outside
  enum, gpu_id empty, drop_ratio negative, drop_ratio over 1,
  evidence kind outside enum, evidence_trail under min).

Verdict struct carries promoted scalar fields (Severity, AERType,
DropRatio, GPUID, Node) so the processor wiring can stamp them as
top-level OTLP log attributes for dashboard table-aggregation per
PR #275's lesson. No dead fields — every struct field is read by
either the Evaluate path or the verdict-shape contract.

Integration end-to-end firing in a real deployment is blocked on
PR-B (issue #260): no OTTL recipe today derives the per-GPU
`tracecore.alert.pcie_rate_collapse`-shaped log record from
`hw.gpu.io`. The detector library is the v0.3 moat and ships
independently per ADR-0001.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
#5) (#286)

## Summary

Ships NORTHSTAR pattern #5 (PCIe AER + hw.gpu.io rate-collapse) per

[`docs/patterns/pattern-5-pcie-aer.md`](docs/patterns/pattern-5-pcie-aer.md).
Two layers required for a verdict (no Confidence taxonomy — mirroring
xid_correlation + hbm_ecc):

- **Layer 1 (kernel AER):** `PCIe Bus Error: severity=..., type=...`
line
  on a specific `gpu.id` (PCI BDF) extracted by the journald-kernel
  OTTL recipe.
- **Layer 2 (rate-collapse):** per-GPU `hw.gpu.io` `BytesPerSecond` <=
  `(1 - RateDropThreshold) * BaselineBytesPerSecond` on the same BDF
  within the correlation window. Default threshold 0.5 (50% drop) and
  window 5min. Causality is forward — the AER must precede the
  rate-collapse sample.

Join key: `gpu.id` (PCI BDF) per RFC-0013 §3, shared across the dmesg
AER preamble and the dcgm-exporter `hw.gpu.pci.bdf` resource attribute
on `hw.gpu.io`. Node is carried for the verdict prose only, not the
join key — the BDF is the proximate hardware identifier.

The verdict log record promotes operator-facing scalars onto top-level
OTLP attributes (`gpu.id`, `kernelevents.pcie_aer.severity`,
`kernelevents.pcie_aer.type`, `k8s.node.name`,
`tracecore.alert.pcie_rate_collapse.drop_ratio`) per the issue #270
scalar-promotion contract, via the `putStrIfSet` helper so an empty
upstream stamp doesn't silently match empty-filter dashboard queries.

## Integration gap (filed separately, not blocking this PR)

Per ADR-0001's intentional split — library + processor wiring first,
OTTL recipes second — end-to-end firing on a real cluster needs:

- #284 — OTTL recipe to emit `tracecore.alert.pcie_rate_collapse.*` log
  records from raw `hw.gpu.io` Counter samples (the metrics→logs PR-B
  side; same blocker class as #260 for pattern-1).
- #285 — `journald-kernel` recipe extension to extract
  `kernelevents.pcie_aer.severity` / `.type` / `gpu.id` from
  `PCIe Bus Error: ...` kernel lines (the existing NVRM Xid stanza is
  the proximate sibling — same file, same extraction shape).

Until both land, the detector is the v0.3 moat (pattern logic + tests)
and the processor wiring is configured-but-quiet (zero verdicts on
real input — projections find nothing). The detector library and
wiring ship independently per ADR-0001.

## What's new

- `module/pkg/patterns/pcie_aer.go` — `PCIeAERDetector` library type,
  `PCIeAERRecord` (AER kernel message), `PCIeIORecord` (hw.gpu.io rate
sample with baseline), `PCIeAERVerdict` output. `PatternIDPCIeAER =
"5"`;
  `EvidenceKindAER = "pcie_aer"`;
  `EvidenceKindPCIeIOCollapse = "hw_io_collapse"`.
- `module/pkg/patterns/testdata/pcie_aer_verdict.schema.json` — JSON
  schema with 10 drift falsifiers (extra field, confidence re-add,
  pattern.id numeric, pattern.id wrong value, severity outside enum,
  gpu_id empty, drop_ratio negative, drop_ratio over 1, evidence kind
  outside enum, evidence_trail under min).
- Config: `pcie_aer_window` (default 5min),
`pcie_aer_rate_drop_threshold`
  (default 0.5). `Validate` rejects sub-1s window and threshold outside
  [0, 1].
- `collectInputs` grows two new typed projections
(`projectPCIeAERRecord`,
  `projectPCIeIORecord`) ordered after `hbm_ecc` in the
  pod_event > node_condition > nccl_fr > xid > hbm_ecc > pcie_aer >
  pcie_io discriminator priority. PCIe AER is checked before PCIe IO
  because the kernel-line discriminator is a tighter regex match than
  the metric-derived bridge attribute.
- Wiring tests: emits-verdict, AER-alone (no fire), rate-collapse-alone
  (no fire), window-configurable, threshold-configurable,
  promoted-scalar attribute presence, Validate guard.

## Test plan

- [x] `cd module && go test ./pkg/patterns/...
./processor/patterndetectorprocessor/... -race -count=1` — green
- [x] `make build` — green (OCB compile clean against the new processor
wiring)
- [x] `make check` — green (golangci-lint, go vet, go mod verify)
- [x] `make doc-check` — green (comment-noise diff gate clean)
- [x] PCIeAERDetector library tests cover both-layers-emit, AER-alone
(no fire), rate-collapse-alone (no fire), window-respected (collapse
outside window suppressed), threshold-respected (drop under threshold
suppressed), schema-conformance, schema-drift falsifier battery
- [x] Wiring tests confirm the projections gate on the right
discriminators (severity + gpu.id; bridge attribute + gpu.id) and the
verdict log record carries the promoted-scalar attribute taxonomy

```release-notes
patterndetectorprocessor wires the PCIeAERDetector library and ships NORTHSTAR pattern #5 (PCIe AER + hw.gpu.io rate-collapse, join key gpu.id PCI BDF, default 5m correlation window, default 0.5 rate-drop threshold). Verdict log records carry pattern.* attrs plus promoted scalars (gpu.id, kernelevents.pcie_aer.severity, kernelevents.pcie_aer.type, k8s.node.name, tracecore.alert.pcie_rate_collapse.drop_ratio) for dashboard table-aggregation without parsing pattern.verdict_json. End-to-end firing on a real cluster requires the OTTL recipes tracked in issues #284 + #285; detector + wiring ship independently per ADR-0001.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

The `docs/adrs/` directory held one file
(`0001-metrics-to-logs-pattern-input.md`). A single-file decision-record
directory sitting parallel to `docs/rfcs/` (13 RFCs + README + template)
is taxonomy drift — operators and contributors had to learn two
near-identical conventions for "load-bearing architectural decision in
tree."

This PR collapses the split.

### What changed

- ADR-0001 → **RFC-0014**
(`docs/rfcs/0014-metrics-to-logs-pattern-input.md`). Content reformatted
to match the RFC template's section headings (Summary / Motivation /
Proposal / Alternatives / Open questions / Migration / References). The
substance (Option A vs Option B vs Option C analysis, the v0.130 contrib
survey, PR-A landed + PR-B pending sequencing) is preserved verbatim —
this is active design for issue #260 PR-B, not archaeology.
- `docs/adrs/` directory removed.
- 6 cross-references repointed at the new RFC path:
  - `docs/README.md` (subdirectories table — `adrs/` row removed)
- `docs/ATTRIBUTES.md` (2 spots: `tracecore.alert.pcie_rate_collapse.*`
row + "See also" link)
  - `docs/integrations/prometheus-scrape.md` (2 spots)
  - `docs/patterns/pattern-4-thermal-throttle.md`
  - `docs/patterns/pattern-5-pcie-aer.md`
- 5 source-code comments repointed
(`module/processor/patterndetectorprocessor/{patterndetector.go,thermal_throttle_test.go}`,
`module/pkg/patterns/{pcie_aer.go,thermal_throttle.go}`).
- `docs/rfcs/README.md` status-index gains an RFC-0014 row (`accepted`,
2026-05-31).

### Why convert (not delete)

ADR-0001 was evaluated against the "delete if RFC-0013 already covers
it" bias. It is not covered:

- RFC-0013 §5 mentions a `metricthresholdconnector` as a contribution
slot in one bullet. It does **not** evaluate Option A vs Option B vs
Option C, cite the contrib v0.130 evidence, or sequence the PR-A/PR-B
split for the metric-sourced detectors.
- ADR-0001 is the binding design contract for patterns #1 / #3 / #4 / #5
(4 of the next NORTHSTAR detectors). Source code (`pcie_aer.go`,
`thermal_throttle.go`, `patterndetector.go`) cites it as the reason for
the staged-but-quiet wire-up.

Delete would orphan 5 source comments and break the audit trail for an
active design decision. Convert keeps the record load-bearing without
preserving the parallel taxonomy.

### Verification

- `grep -rn "docs/adrs\|adrs/0001"` returns 0 hits.
- `grep -rn "ADR-0001\|ADR 0001"` returns 0 hits.
- `ls docs/ | grep -i adr` returns empty.
- pre-commit golangci-lint + go vet + go mod verify clean.

```release-notes
docs: collapse single-file `docs/adrs/` into `docs/rfcs/`. ADR-0001 (metrics-sourced pattern inputs) is promoted to RFC-0014 verbatim; cross-references across docs and module source repointed.
```

## Test plan

- [x] `grep -rn "docs/adrs"` returns 0 hits
- [x] `grep -rn "ADR-0001"` returns 0 hits
- [x] `docs/adrs/` directory removed
- [x] golangci-lint + go vet clean (pre-commit hook)
- [ ] CI green

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Extends prometheus-scrape.md with the bridge attribute contract for the
four metrics-derived patterns:

- pattern #1 NVLink (#260) — the `hw.gpu.nvlink.io` OTTL transform
  already lands in commit 0baa557; this PR closes #260's recipe-half.
- pattern #3 HBM ECC (#273) — `hw.errors.delta` + error.{type,
  subtype,persistence} + gpu.id contract.
- pattern #4 thermal throttle (#282) — `hw.gpu.throttle.duration.delta`
  in integer seconds + reason=thermal + gpu.id contract.
- pattern #5 PCIe AER Layer 2 (#284) — the `tracecore.alert.
  pcie_rate_collapse.*` namespace contract.

OTTL metrics->logs emission stays upstream-blocked at OTel-contrib
v0.130 (RFC-0014): no contrib processor or connector emits log records
from a metrics pipeline. The bridge contract documented here is the
load-bearing wire format any future emitter (an upstream
metricthresholdconnector OR the WithMetrics extension to
patterndetectorprocessor per RFC-0014 PR-B) MUST honor; the detector
projections at module/processor/patterndetectorprocessor/
patterndetector.go gate on this contract today.

last-verified marker bumped to 2026-06-01.

Closes #260. Closes #273. Closes #282. Refs #284 (Layer 1 closed
under #285 in a prior commit; Layer 2 contract documented here).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Reviewer cleanup pass on yellow-tier findings:

* Schema: tighten gen_ai.training.job_id to minLength:1 and document
  the fallback-grouping semantic explicitly — the field is OMITTED
  (not empty-string) on the namespace-only fallback path. Downstream
  consumers must treat ABSENCE as the explicit fallback signal, not
  silent exclusion. processor already uses putStrIfSet to suppress
  empty-string variants.

* Spec eval rule: clarify Pattern #8 (NCCL hang) vs Pattern #9 (NCCL
  bootstrap) trigger disjoint-ness — hang fires on PRESENCE of
  non-completed FR records mid-run; bootstrap fires on ABSENCE of
  any FR record past deadline. Both can fire on the same cohort
  during a heterogeneous bootstrap by design.

* Spec impl-notes: add note #5 documenting the min(ReadyAt) /
  max(ReadyAt) split with the late-joiner-masks-stuck-rank
  scenario as the rationale.

* Empty-Node skip comment in detector: previous comment claimed the
  skip biases toward false-negatives; in fact it STRENGTHENS the
  absence signal (rank stays "no FR seen" → counted as failed),
  biasing toward false-positives. Correct the directionality and
  call out the fallback-to-(namespace, rank) escape hatch.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #331. Replaces the single `pattern-pod-evicted` chaos job with a
matrix-strategy `pattern-detectors` job covering every detector named
in v1-rc1 cut-criterion §1: `hbm_ecc`, `thermal_throttle`, `pcie_aer`,
`pod_evicted`, `nccl_hang`, `xid_correlation`.

Each row runs the detector's positive + negative + schema `*_test.go`
prefix under `-race`:

- **Positive / Edge asserts** — verdict emitted with correct
  `pattern.id` within bounded time.
- **Negative / SchemaRejectsDrift asserts** — zero verdicts on
  baseline / confounder input (the false-positive gate the chaos
  matrix exists to enforce).

The `pod_evicted` row additionally runs the on-disk replay corpus under
`module/pkg/replay/pod_evicted/`; sibling rows skip that step until
their `replay/<pattern>/` fixtures land — tracked as Path A in
[`docs/v1-rc1-test-audit.md`](../blob/main/docs/v1-rc1-test-audit.md)
§4.

## Why this matrix shape, not new `failure-inject` modes

Audit §4 explicitly recommends Path A (matrix over existing
hermetic detector tests) for rc1 and defers Path B (per-pattern
`failure-inject` CLI modes) to v1.x. Path A reuses the
race-detector + golden-SHA shape already validated by
`pattern-pod-evicted` and costs ~60 LoC of workflow YAML instead of
five new CLI subcommands.

## Wall-time budget

Observed per-row runtime under `-race` (measured locally on this
worktree):

| Row | runtime |
|---|---|
| hbm_ecc | 1.35s |
| thermal_throttle | 1.33s |
| pcie_aer | 1.30s |
| pod_evicted (incl. replay corpus) | 1.35s + 1.34s |
| nccl_hang | 1.31s |
| xid_correlation | 1.32s |

Per-row `timeout-minutes: 5` is ~3× headroom; matrix parallelism
keeps total wall-time well inside the chaos-gate budget.

## Test plan

- [x] `actionlint .github/workflows/chaos.yml` — clean.
- [x] `python3 -c "import yaml;
yaml.safe_load(open('.github/workflows/chaos.yml'))"` — parses, 3 jobs
(`harness-determinism`, `cpu-steal-mpstat`, `pattern-detectors`), 6-row
matrix.
- [x] For every matrix row, `cd module && go test -race -count=1 -run
'<regex>' ./pkg/patterns/...` — all green (counts: HBMECC=16,
ThermalThrottle=17, PCIeAER=15, PodEvicted+PressureFromNote=12,
NCCLHang=10, XidCorrelation=14).
- [x] `cd module && go test -race -count=1 ./pkg/replay/...`
(pod_evicted corpus step) — green.
- [x] Pre-commit hook: golangci-lint + go vet + go mod verify +
attribute-namespace-check — all green.
- [ ] Workflow runs green on `main` after merge (verifiable only
post-merge — nightly cron + push-to-main).

```release-notes
ci: chaos.yml now matrix-tests all six shipped detectors (#14
pod_evicted, #8 nccl_hang, #3 hbm_ecc, #4 thermal_throttle, #5
pcie_aer, xid_correlation) instead of just pod_evicted. Closes the
chaos-coverage gap called out in v1-rc1-test-audit §4. No
operator-visible behavior change.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #337.

The CUDA OOM detector (`projectFBMemoryRecord` at
`module/processor/patterndetectorprocessor/cuda_oom.go:114`) gates on
`hw.gpu.memory.{free,total}` log-record attributes, but nothing in
the recipe layer produced them: `dcgm-exporter` emits
`DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` as Prometheus gauges and
no OTTL transform projected them onto the customer-stable namespace.
Detector compiled, never fired on a real install — sibling gap to
#273 (pattern #3), #282 (#4), #284 (#5).

This PR closes the gap on the metric-side projection and pins the
load-bearing log-shape contract the bridge layer MUST honor.

- `docs/integrations/examples/prometheus-scrape.yaml`:
  - `DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used` (Gauge, unit `By`)
  - `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free` (Gauge, unit `By`)
  - Identity-preserving rename only; `hw.gpu.memory.total = used+free`
    deferred to the bridge layer per the named upstream limit (see
    below).
- `docs/integrations/prometheus-scrape.md`:
  - New `### Pattern #10 — CUDA OOM (framebuffer)` metric-side
    projection section with raw-series → semconv table.
  - New `#### Pattern #10 — hw.gpu.memory.{free,total}` bridge-
    contract subsection with full log-record schema (yaml-shaped)
    matching what `projectFBMemoryRecord` reads, plus MIG caveat
    and unit-test cross-link.
  - Intro + bridge-contract header bumped to include pattern #10.
- `docs/patterns/10-cuda-oom-deceptive.md`:
  - Signal-source line links to the recipe sections.
  - Open Question #1 (`DCGM_FI_DEV_FB_*` OTTL extension) struck
    through; resolution recorded.
- `docs/ATTRIBUTES.md`:
  - `hw.gpu.memory.free` / `.total` rows updated to distinguish
    metric vs log shape and to cross-link to the recipe section.
  - New `hw.gpu.memory.used` row (now projected on the metrics
    pipeline by this PR — dashboard evidence context).

## Root cause + named upstream limit

**Root cause (fixed in this PR):** the prometheus-scrape OTTL
transform had no stanza projecting the DCGM FB series onto
`hw.gpu.memory.*`. The detector's projection gate could not be
satisfied on a real install. Fixed by adding the rename stanza in
`transform/dcgm_to_hw_semconv` (same processor the #1/#3/#4/#5
projections already live in — no new processor surface).

**Named upstream limit (NOT worked around — tracked):** OTel-contrib
`transformprocessor` v0.130 `metric_statements` cannot perform
cross-series arithmetic — there is no OTTL path to compute
`hw.gpu.memory.total = hw.gpu.memory.used + hw.gpu.memory.free`
on a metrics pipeline ([upstream
README](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config)).
Per RFC-0014 §Alternatives, the metrics→logs primitive does not
exist in contrib at v0.130 either. The total scalar lives at the
bridge layer (RFC-0014 PR-B `WithMetrics` extension to
`patterndetectorprocessor`, tracked under #260). The recipe pins
the load-bearing wire format the bridge MUST honor so PR-B lands
without a contract change.

## Adopt-over-build posture

Every new OTTL statement uses upstream functions only (`set`,
`==` equality). No new transformprocessor extension. Mirrors the
existing #1/#3/#4/#5 stanzas.

## Test plan

- [x] `make build` — clean
- [x] `./scripts/validator-recipe.sh` — 9 validated, 3 skipped
      (non-linux), 0 fail; prometheus-scrape example passes
      `tracecore validate`.
- [x] `./scripts/doc-check.sh` — 721 markdown links resolve, all
      new cross-links + anchor refs included; banned-phrase lint
      clean; recipe markers (`tested-against`, `last-verified`)
      present.
- [x] `./scripts/attribute-namespace-check.sh` — 67/67 attribute
      literals documented (no new undocumented attrs introduced).
- [x] golangci-lint, go vet, go mod verify via commit hook — clean.
- [ ] CI linux runner exercises journald + k8sobjects skip-paths
      we couldn't run locally (validator-recipe ubuntu job).

```release-notes
recipe(ottl): project DCGM `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` onto the customer-stable `hw.gpu.memory.{used,free}` namespace and pin the metrics-to-logs bridge log-shape spec the pattern #10 (CUDA OOM) detector consumes via `projectFBMemoryRecord`. `hw.gpu.memory.total = used + free` derivation is deferred to the RFC-0014 PR-B `WithMetrics` bridge layer because OTTL `transformprocessor` v0.130 has no cross-series arithmetic on metrics pipelines.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Bundle three post-wave-audit docs-cleanup issues
(`docs/v1-rc1-post-wave-audit.md` findings #5, #8, #15). One commit per
issue for clean revert / history. Fourth audit cleanup (#380)
intentionally deferred — precondition unmet.

## Per-issue scope + delta

### `chore(module): trim stale PR-I genesis-tag rationale` (closes #381)
Audit finding #8. `module/doc.go` referenced PR-I.1a/1b/2 as "Contents
land in…" and explained the genesis-tag proxy rationale. All three PRs
landed long ago; the module is alive. Trimmed to package-doc only.
- **Before:** 14 lines (8-line historical-artifact block).
- **After:** 6 lines (`package module` declaration only, retained for
proxy compat).
- `go list -m github.com/tracecoreai/tracecore/module` still resolves.

### `docs(rc1): sweep strikethrough resolved-in-issue blocks` (closes
#383)
Audit finding #15. The two specific examples called out by the issue:
- `docs/patterns/10-cuda-oom-deceptive.md:74` — `~~DCGM_FI_DEV_FB_* OTTL
recipe extension.~~` referenced closed #337 (verified via `gh issue view
337` → CLOSED).
- `docs/v1-rc1-simplification-audit.md:185-186` —
`~~components/exporters/otlphttp/~~` +
`~~components/exporters/stdoutexporter/~~` referenced closed #333 / #334
(both verified CLOSED).

Resolution text retained as cross-link history; only the strike-through
visual noise removed.

**Explicitly NOT touched:** `docs/followups/M*.md`,
`docs/MILESTONES.md`, `docs/rfcs/0003-*.md` strikethroughs — these are
documented convention markers per `MILESTONES.md:65,69` ("A PR that
completes a follow-up strikes it through (`~~…~~`)"). Sweeping them
would delete a load-bearing convention.

### `chore(config): document pattern-prefixed knob naming convention`
(closes #379)
Audit finding #5. `config.go` (540 lines) carries three knob styles. The
post-wave prefixed shape is right; renaming bare→prefixed pre-RC1 would
break every existing `values.yaml`. Take option 1 from the issue:
document the convention at top-of-file so future detectors follow it.
- **Before:** No top-of-file knob-naming guidance.
- **After:** 12-line `// Knob-naming convention` block before `package
patterndetectorprocessor`, naming bare-name knobs explicitly and
pointing forward to v2.0 nesting.

## Why #380 is NOT in this bundle

#380 sweeps 12 "future PR-B" comments referencing the RFC-0014
`WithMetrics` bridge. Its own acceptance criteria say "Closes with
PR-B's merge commit."

Precondition check today:
- `git log --all --grep=WithMetrics` → only wave-3 port commits, no
bridge.
- `docs/rfcs/0014-metrics-to-logs-pattern-input.md:3` → status
`accepted, blocks issue #260 PR-B`.
- `docs/integrations/prometheus-scrape.md:381` → "PR-B has NOT yet
shipped."

All 12 comments are factually accurate today. Sweeping prematurely would
either delete load-bearing forward-looking documentation, or replace it
with stale post-PR-B prose. Issue stays open as tracker per its design;
commented on issue noting the deferral.

## Verification

- `go build ./...` (module) — clean.
- `go vet ./...` — clean (via pre-commit).
- `go list -m github.com/tracecoreai/tracecore/module` — resolves.
- `golangci-lint run ./...` — 0 issues (via pre-commit).
- `attribute-namespace-check` — 100/100 (via pre-commit).
- `gh issue view 337 333 334` — all CLOSED (justifies #383 sweep).

## Test plan

- [x] Module builds + vets.
- [x] Referenced closed-issue states verified.
- [x] Strikethrough convention markers in `docs/followups/*` /
`MILESTONES.md` / RFC-0003 left intact (deliberate scope-limit).
- [x] `#380` deferred with on-issue rationale; not in `Closes` trailer.
- [ ] CI green.

```release-notes
NONE
```

Closes #383
Closes #381
Closes #379

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but
the join shape varies across patterns and the rationale lived only
in code comments + PR review threads. Operators tuning windows had to
read source per detector.

Audit finding: five distinct shapes are load-bearing (chosen by the
causal physics of each signal), not bugs:

- One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect.
- Asymmetric two-sided (#11): pre-stall covers concurrent-start
  checkpoints; post-stall covers OTTL-bridge logger latency.
- Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window
  could be cause OR consequence.
- Job-window bounded (#13): SDC counter rise must fall in the
  bounded eval-cycle's owning job; no operator knob is meaningful.
- Trailing-window rate / freshness (#2 #4 #8): rolling window
  anchored at `now` or the most-recent record.

Decision: document the existing reality, do not converge. Forcing
every detector to the asymmetric two-knob form would silently
zero one leg for the one-sided detectors (footgun on clock skew)
and would not apply to #13 at all.

Adds:
- 'Why this correlation shape' section in docs/patterns/07, 11, 13
  (the three shapes the issue called out by name).
- 'Correlation-window semantics' table in docs/patterns/README.md
  covering ALL 11 detectors with the predicate, anchor, and shape
  rationale, plus cross-links to the per-pattern sections.

No code changes; no detector behavior changes.

Closes #367.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Lands replay corpora for the seven pattern detectors that lacked one,
closing the Path-A test gap named in the v1-rc1 audit (#366). Each
pattern now ships `module/pkg/replay/<pattern>/{canonical,_negative,
_real_world}/` fixtures plus a `*_replay_test.go` runner that JSON-eqs
detector output against the on-disk golden.

Detectors covered: `hbm_ecc` (#3), `thermal_throttle` (#4),
`pcie_aer` (#5), `ib_link_flap` (#2), `nccl_hang` (#15), `cuda_oom`
(#10), `xid_correlation` (#16). `pod_evicted` (#14) corpus is
unchanged.

## Design

- Each detector takes a different input shape (e.g. `HBMECCRecord +
  XidRecord` for `hbm_ecc`, `ThermalThrottleRecord` for
  `thermal_throttle`, ...), so the existing `LoadFixturesUnder` helper
  (typed on `Record + NodeRecord`) cannot be reused. Each detector
  gets its own `*_replay_test.go` that inlines the per-detector JSON
  read; shared fixture-discovery and golden-assert helpers live in
  `helpers_test.go`.
- Two detectors (`nccl_hang`, `ib_link_flap`) take a `Now` reference.
  Tests pin `Now` to a fixed timestamp matching the fixture's
  `started_ns` so hang-age and flap-window inclusion stay
  deterministic across replay runs (otherwise wall-clock drift would
  silently flip the verdicts as the fixture aged).
- Goldens were generated from the live detectors via
  `UPDATE_REPLAY_GOLDEN=1 go test ./module/pkg/replay/...` and pinned;
  future drift in detector output (headline / remediation prose,
  evidence-trail UID shape, scalar-field rename) surfaces as a
  `JSONEq` diff against the fixture. Operators can also eyeball the
  golden to assert what they EXPECT vs what fires.
- Negative fixtures each exercise a distinct discriminator (wrong Xid
  code, single GPU, no AER, no eviction, completed state, single
  transition, no OOM log) so a regression in one false-positive guard
  lights up the corresponding row only.
- Flipped `run_corpus: true` on every row of the chaos.yml
  pattern-detectors matrix now that every detector has a corpus.

## Test plan

- [x] `make check` — clean
- [x] `go test -race -count=1 ./pkg/replay/...` — 35 tests pass
  (28 new + 7 pre-existing pod_evicted)
- [x] `go test -count=1 ./processor/patterndetectorprocessor/...` —
  unchanged, still green
- [x] `make verify` (pre-push) — clean
- [ ] CI: chaos.yml `pattern-detectors` matrix — 8 rows, each runs
  hermetic regex + replay-corpus step

Closes #366.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Closes #393. Ships the metric-side OTTL projection from
`node_exporter --collector.infiniband`'s `node_infiniband_port_state_id`
Gauge onto the customer-stable `hw.network.ib.*` namespace
(`hw.network.ib.port.state` int + `hw.network.ib.device` +
`hw.network.ib.port.num`) so pattern #2's `IBLinkFlapDetector`
consumes the same vendor-neutral wire shape regardless of whether the
underlying source is node_exporter, a Mellanox exporter, or the
`mlx5_core` journald stream.

Detector library + processor wiring already shipped in #391 (closed
#300). Only the metric-side input recipe was missing — pattern #2 was
configured-but-quiet on real deployments. This PR closes that gap.

## Wire contract (node_exporter raw → hw.network.ib.*)

```
node_infiniband_port_state_id{device="mlx5_0", port="1"} = 4
                                                          (IBA phys_state ID)
            ↓ transform/ib_to_hw_semconv
Gauge metric "hw.network.ib.port.state" with datapoint attrs:
  hw.network.ib.device  = "mlx5_0"        (str, from `device` label)
  hw.network.ib.port.num = 1              (int, from Int(`port` label))
  value                  = 4              (int, the phys_state ID)
```

The future RFC-0014 PR-B metrics→logs bridge emitter (shared with
patterns #3/#4/#5/#10) will lift these three attributes onto a log
record at emit time. The bridge log-record schema for pattern #2 is
pinned in `docs/integrations/prometheus-scrape.md §Pattern #2 —
hw.network.ib.port.state (issue #393)` so PR-B has no per-pattern
reconstruction work to do.

The companion series `node_infiniband_state{state="<name>"}` (string
label) is intentionally NOT mapped — the detector
(`module/processor/patterndetectorprocessor/ib_link_flap.go`) compares
`state.Int()` against `patterns.IBPortState*` integer constants, so
the string variant would round-trip wrong.

## No detector code change required

The detector reads three attribute names off a log record:
`hw.network.ib.port.state`, `hw.network.ib.device`,
`hw.network.ib.port.num`. The recipe stanza stamps the exact same
three names on the metric datapoint. The wire format `port.Int()`
expects (the projector at
`module/processor/patterndetectorprocessor/ib_link_flap.go` line 39
calls `int(port.Int())`) is satisfied because the OTTL `Int()` cast
on the Prometheus `port` string label produces a pdata int Value.
Confirmed by the new `TestRecipe_IBLinkFlap_RoundTripFiresVerdict`
test.

## Root cause + scope

- **Root cause of #393**: missing metric-side OTTL stanza. Fixed.
- **Out of scope (separate blocker, tracked under #260 PR-B)**: the
  metrics→logs bridge emitter. Upstream-blocked at OTel-contrib v0.130
  — `transformprocessor`'s `metric_statements` cannot reference
  `log.*` paths and no contrib connector emits log records from a
  metrics pipeline (per

[RFC-0014](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0014-metrics-to-logs-pattern-input.md)).
  The recipe doc explicitly documents this gating relationship; PR-B
  is shared with patterns #3/#4/#5/#10 and lands the bridge once.

## Files changed

- `docs/integrations/examples/prometheus-scrape.yaml` — new
  `transform/ib_to_hw_semconv` processor; wired into the
  `metrics/scrape` pipeline. Validates with `./_build/tracecore
  validate` (exit 0).
- `docs/integrations/prometheus-scrape.md` — new "Pattern #2 —
  InfiniBand link flap" projection section, intro updated from "Two"
  to "Three OTTL transforms", and bridge log-record contract
  subsection added under the "Metrics-to-logs bridge contract"
  section.
- `docs/patterns/pattern-2-ib-link-flap.md` — deleted "Integration
  gap" section, replaced with "Integration recipe" pointing at the
  shipped stanza; updated "Why node_exporter sees it" prose to drop
  the "pending" hedge.
-
`module/processor/patterndetectorprocessor/ib_link_flap_recipe_test.go`
  — new file. `TestRecipe_IBLinkFlap_StanzaPinsWireContract` parses
  the example YAML and asserts every load-bearing token is present
  (source metric name, three `hw.network.ib.*` attrs, the `Int()`
  cast on the port label, the transform name, and the pipeline
  wiring). `TestRecipe_IBLinkFlap_RoundTripFiresVerdict` simulates
  the end-to-end path: builds `plog.Logs` with the exact attribute
  shape the recipe stamps and asserts the processor emits a flap
  verdict.

## Test plan

- [x] `./_build/tracecore validate
--config=docs/integrations/examples/prometheus-scrape.yaml` → exit 0
- [x] `bash scripts/validator-recipe.sh` → 9 validated, 3 skipped
(non-linux host)
- [x] `bash scripts/doc-check.sh` → clean (no orphan test refs)
- [x] `go test ./module/processor/patterndetectorprocessor/... -count=1`
→ PASS (incl. the two new tests + all 5 existing IB tests)
- [x] `go build ./...` and `go vet ./...` → clean
- [x] Pre-commit hooks: golangci-lint 0 issues, go mod verify,
attribute-namespace-check 100/100
- [x] Mutation-verified: dropping `hw.network.ib.port.state` from the
recipe yaml fails `TestRecipe_IBLinkFlap_StanzaPinsWireContract` with
the expected remediation message naming the missing identifier
- [ ] CI on the PR (waiting on push)

```release-notes
feat(recipe): InfiniBand link-flap OTTL stanza projecting node_exporter's `node_infiniband_port_state_id` onto the tracecore-canonical `hw.network.ib.*` namespace (`hw.network.ib.port.state` int + `hw.network.ib.device` + `hw.network.ib.port.num`). Pattern #2's `IBLinkFlapDetector` now has its metric-side input wired; metrics→logs bridge emitter remains gated on RFC-0014 PR-B (#260).
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
…ard) (#477)

## Summary

Closes the `docs/MILESTONES.md` §M6 carry-forward: *"every fenced block
in `docs/getting-started.md` is exercised by `scripts/smoke.sh`"*.

The ≤5-count gate shipped with the M6 wave; the binding half was tracked
carry-forward because `smoke.sh` ran a parallel hand-written
hostmetrics→debug config rather than the doc's actual YAML.

## Root cause

Two scripts owned the "first OTLP byte" config — `smoke.sh` rendered one
inline, `docs/getting-started.md` carried another. They happened to
agree, but nothing forced them to. The carry-forward existed because the
binding was *correct by inspection*, not *correct by construction*.

The fix is to make the doc the single source: `smoke.sh` extracts the
YAML from `docs/getting-started.md`'s `## Walkthrough` heredoc at
runtime. If the doc grows a typo, a renamed receiver, or a different
scraper, `smoke.sh` exercises the change automatically. If the heredoc
disappears, the extractor fails loud with a named error.

## Changes

- `scripts/smoke.sh` — extracts the Walkthrough heredoc via a perl
one-liner, writes it to a tempfile, then runs `tracecore validate
--config=` + `tracecore --config=` against it (Walkthrough steps 3 + 4).
Lifecycle-log assertions retained, with `"Shutdown complete"` now
load-bearing against the doc's post-walkthrough prose.
- `scripts/doc-check.sh` — new gate (right after the existing ≤5-count
gate) asserts the smoke↔doc binding with four mutation-verified clauses:
Walkthrough scope, `"$BIN" validate --config=` invocation, `"$BIN"
--config=` run invocation, `docs/getting-started.md` path reference.
- `scripts/smoke_test.sh` — new mutation-verify harness mirroring the
gate at runtime, plus an inline mutant-doc test that proves the
extractor exits 1 and the wrapper emits the named error when the heredoc
is removed.
- `Makefile` — `make smoke` now also runs `smoke_test.sh`; wired into
`ci-full` alongside the existing `smoke-quickstart` target.
- `docs/MILESTONES.md` — §M6 status `⧗ partial` → `☑ delivered`;
getting-started rubric `⧗` → `☑`; carry-forward bullet rewritten
(remaining work is operator-config branch-protection only).

## Runtime

End-to-end `bash scripts/smoke.sh` on darwin/arm64: **~2.2s** (extract +
validate + 1.5s run window + lifecycle-log assertions). Well under the
120s ci-fast budget. No hardware required — uses the `hostmetrics` load
scraper, portable across linux/darwin/windows.

## Test plan

```release-notes
ci(smoke): scripts/smoke.sh now extracts its YAML config from docs/getting-started.md '## Walkthrough' instead of carrying a parallel hand-written config; doc-check.sh gates the doc↔smoke binding with four mutation-verified clauses. Closes the M6 carry-forward.
```

- [x] `bash scripts/smoke.sh` exits 0 on clean main (verified locally,
~2.2s).
- [x] `bash scripts/smoke_test.sh` all assertions pass.
- [x] `bash scripts/doc-check.sh` reports `scripts/smoke.sh binds to
docs/getting-started.md (M6: every block exercised by smoke.sh)`.
- [x] Mutation test #1: `sed -i 's/"$BIN" validate --config=/"$BIN" XXX
--config=/' scripts/smoke.sh` → doc-check exits 1 naming "validate
--config= invocation (Walkthrough step 3)".
- [x] Mutation test #2: `sed -i 's/"$BIN" --config=/"$BIN" XXX=/'
scripts/smoke.sh` → doc-check exits 1 naming "run invocation
(Walkthrough step 4)".
- [x] Mutation test #3: `sed -i 's/Walkthrough/Section/'
scripts/smoke.sh` → doc-check exits 1 naming "extraction scope lost".
- [x] Mutation test #4: `sed -i
's/docs/getting-started.md/docs/SOMEWHERE-ELSE.md/' scripts/smoke.sh` →
doc-check exits 1 naming "binding source missing".
- [x] Mutation test #5: getting-started.md with no `## Walkthrough`
heredoc → smoke.sh exits 1 with named error message (covered by
`smoke_test.sh`).
- [x] `make lint` 0 issues; `make vet` clean; `make doc-check` clean
(all 18 gates pass).
- [x] `make smoke` end-to-end including `smoke_test.sh` passes.

## Related

- Refs `docs/MILESTONES.md` §M6 (Documentation scaffold).
- Sibling #460 (`fix(doc-check): drop unconditional exit 0`) made this
carry-forward visible — before #460, the new gate would have been
silently skipped by the line-99 short-circuit.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant