Apply Wave-1 follow-ups: tests, automation, doc-truth#5
Merged
Conversation
Four small follow-ups on the contribution-safeguards bundle. 1. dco-check_test.sh sets `commit.gpgsign=false` and `tag.gpgsign=false` in the sandbox repo before any commit. Without this, contributors who have run scripts/setup-signing.sh (and therefore have `commit.gpgsign=true` in their ~/.gitconfig) would see the test commits attempt to sign with a key not present in the temp repo's context — a false negative in our own test suite. 2. scripts/apply-branch-protection.sh applies the rules in .github/branch-protection.yml directly via `gh api`. Idempotent; safe to re-run. Replaces the manual-UI-clicks-then-pray pattern with one command. Drift between the intent doc and live settings becomes a re-run away from being fixed. 3. AGENTS.md now documents the kernel coding-assistants.rst trailer format (`Assisted-by: <agent>:<model> [<tool>]`) as preferred, with the simple form still accepted. Aligns the doc with the format every commit on this branch already uses. 4. The "Aim for under 500 lines" guidance is removed from CONTRIBUTING.md and the matching bullet in AGENTS.md. We explicitly chose not to enforce a PR size cap; the soft guidance was misleading. Replaced with "split when the diff outgrows the concern, not when it crosses an arbitrary line count". Refs RFC-0003. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
Three small wording cuts after a fresh-eyes pass: - .github/scripts/dco-check_test.sh: 4-line signing-disable comment collapsed to one. The "why" is non-obvious enough to keep one line; the other three were prose. - scripts/apply-branch-protection.sh: dropped the "Future improvement: parse the YAML directly..." block. Forward-looking comments rot per PRINCIPLES §2; the drift hazard will surface when it bites. Header trimmed from 16 lines to 7. - CONTRIBUTING.md: AI-assisted commits bullet now points at AGENTS.md for the trailer format instead of duplicating examples inline. Net: -12 lines, no behavior change, all tests still green. Refs RFC-0003. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Address self-review items #5 (pkg-name confusion), #12 (MILESTONES glyph), #16+#17 (commit/scope discipline), #18 (telemetry pkg architecture intent), #19 (Resource auto-pop) — all doc-only. internal/telemetry/doc.go + internal/selftelemetry/doc.go: explicit "where do I add code" guidance with the conceptual split (process- level SURFACE vs per-component PRODUCER CONTRACT). Includes the architecture intent for future milestones — when tracing lands the TracerProvider goes in `telemetry`, signal-direction-driven split (consumer/producer) rather than modality-driven (metrics/traces). selftelemetry/doc.go also pins the cardinality contract explicitly (every label value must be low-cardinality; canonical Kind* constants exist for that purpose) and the Resource auto-population contract. MILESTONES.md: new ☒ glyph for "policy-declined, not deferred." Applied to the /readyz SLO-threshold criterion — RFC-0006 explicitly chose degraded ≠ not-ready, that's a policy, not a missing feature. FOLLOWUPS.md: new "M2 process lessons" section captures three honest process gaps from this milestone for the next loop's prompt: - Scope-creep (Go-bump + CI-fix-deadlines shipped under [telemetry]) - Commit history discipline (25 commits incl. 13 review-fixes) - Self-assessment optimism caught by gates four separate times Lessons → next-milestone-prompt input, not corrective action for M2 (retroactive split blocked by no-force-push rule). Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Closes 4 of 7 new A+ criteria from the recursive self-review: #1 — e2e-otelcontrib now verifies the collector PARSED the record, not just that it accepted bytes. Workflow rewritten to docker-run otelcol-contrib with a custom config (file + debug exporters, detailed verbosity). After the e2e POST, the bash step greps /tmp/otelout/logs.json for the canonical body, the kernelevents.xid attribute, and the gpu.id attribute. Empty file or missing attributes → workflow fails. #2 — TestIntegration_KmsgWriteReadBehavioral (//go:build linux) writes a synthetic <6>NVRM Xid 79 line to /dev/kmsg, uses a marker string in a regex_filter to isolate from ring-buffer noise, then asserts the receiver emits a plog.LogRecord with kernelevents.xid=79 + gpu.id=0000:65:00.0 within 3s. A regression in parse/build/emit fails this on Linux CI. #3 — prometheus_alerts_test.go validates the alert YAML structure (every group has interval, every rule has expr/severity/summary/ description) AND cross-references the metric + label-filter names against the receiver's actual SelfTelemetry surface. A typo in the alert would silently never fire; this catches it before merge. #5 — runbook_test.go executes the RUNBOOK's "First 15 minutes" step 1 (`tracecore validate --config=...`) and step 2 (`tracecore debug dump`) as real commands. Documentation rot becomes a test failure, not a silent SRE-time discovery. #4 — sustained_test.go (`//go:build sustained`) feeds 1000 events/sec for 5 minutes (300k records), samples heap every 30s, asserts ≤10 MiB growth and p99 emit latency tail bounded. New `sustained-load` workflow job runs it on push-to-main + schedule (not PR — 5 minutes is too slow for the inner loop). The seventh criterion (two-week soak + external operator) requires elapsed time + a human; nothing in-session can close it. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Brings the M8 DCGM receiver to this branch and reconciles divergent
decisions between the two parallel milestones.
Auto-merged: CHANGELOG.md, MILESTONES.md, FOLLOWUPS.md,
docs/FAILURE-MODES.md, docs/agents/RECEIVER-PATTERNS.md,
components/receivers/clockreceiver/clockreceiver.go.
Conflicts resolved:
- components.yaml + cmd/tracecore/components.go: keep all three
receivers (clockreceiver, dcgm, kernelevents); regenerated via
`make generate`.
- cmd/tracecore/main.go: keep both M9 `debug dump` and M8
`receivers list` subcommands. Both surfaces have tests.
- Makefile .PHONY: unioned M9's test-extras-* targets with M8's
smoke + build-tags targets.
- docs/agents/REVIEWER-CONTEXT.md: kept both the M9 streaming-
receiver section and the M8 receiver-author-patterns section.
Divergent decisions identified and unified:
1. selftelemetry.Kind typed-enum (M8 introduced) — M9's
IncError("string") call sites migrated to canonical Kind*
constants (KindPanic, KindDownstream, KindParse, KindCardinality)
and receiver-local typed constants (kindJournalctlCrash,
kindKmsgOversized, kindKmsgOverflow) in a new
`components/receivers/kernelevents/kinds.go`. Mirrors M8 dcgm's
`kindWatch` / `kindMIG` shape.
2. CONTRIBUTING.md "Adding a receiver" — M8 added one section
centred on dcgm; my earlier A+ pass added a second. Merged into
a single "Adding a receiver or exporter" section with a 3-row
shape-selection table (trivial / vendor-SDK / streaming-source)
plus shared component-local conventions and the kernelevents-
source sub-section.
3. Lifecycle helper canonical-vs-inline — declared
`internal/runtime/lifecycle.Lifecycle` as the canonical pattern
in docs/agents/RECEIVER-PATTERNS.md ("Canonical lifecycle
helper") and documented the M8 DCGM divergence inline with
rationale. Tracked the DCGM migration as
docs/FOLLOWUPS.md row #5; not refactored here to avoid bloating
this merge's scope.
4. Receiver-author skeletons — M8's `docs/agents/examples/*.go`
(six general receiver patterns) and M9's
`components/receivers/kernelevents/source_template.go.example`
(kernelevents-source-specific) now cross-reference each other.
The kernelevents template's reading order points authors at the
general examples for cross-cutting patterns.
5. alert-check.sh wired into make doc-check (from M9) now also
verifies the dcgm RUNBOOK ↔ prometheus-alerts.example.yaml
parity automatically. doc-check now verifies 50 test references
across 2 docs (was 44 across 2 pre-merge); alert-check 2 pairs.
All tests pass under make ci; coverage gates clean
(kernelevents 80.8%, dcgm 86.0%, lifecycle 96.3%).
Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Three test-correctness findings. R1.B4 — gateReader Close race. `cascade_test.go::gateReader.Close` used a `*bool` guard around `close(g.release)`; under -race, two Close calls (deferred-defer + ctx-cancel watcher) could observe the bool false and both close the channel → "close of closed channel" panic in tests, or a race-detector trip. Fixed with `*atomic.Bool` + CompareAndSwap so only the first Close fires the release. Test passes under `-race -count=3`. R1.S2 — TestPrometheusAlerts_ReferenceWiredCounters used `require.Contains(body, metric)` on the raw YAML file, which would be satisfied by a metric name appearing in a comment or description. Refactored to walk the parsed `doc.Groups[].Rules[].Expr` and assert metric/kind references appear in actual alert expressions. Same test now also enforces the R1 #4 follow-up: every rule must carry a `runbook_url` annotation. Three alerts × runbook_url = three new structural assertions. #5 — drainStderr lacked panic recovery. Per PRINCIPLES §1, every goroutine carries panic recovery; bounded-buffer Scanner + slog.Warn aren't expected to panic, but a defensive recover keeps a future refactor (custom formatter, sink swap) from taking down the receiver via this side path. Defer added at the top of `journald.go::drainStderr` mirroring the receiver's main goroutine pattern; routes the recovered panic to `IncError(KindPanic)` + structured log so it surfaces on the operator's dashboard. Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <trilamsr@gmail.com>
9 tasks
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…rd surface ## What this PR does Lands M19 (pattern #14 pod-evicted) end-to-end against the rubric's hermetic replay-test gate, plus the M10 NodeRecord surface the detector consumes. ### Pattern detector (internal/synthesis/patterns/) - `pod_evicted.go`: `PodEvictedDetector{}.Evaluate(events, nodeConds)` joins each Pod-Evicted Record against the most-recent per-node pressure transition within a 30s sliding window (configurable via `JoinWindow`). Returns one `PodEvictedVerdict` per evicted pod, sorted by event time for golden-test determinism. - Headline matches the rubric's regex contract for the canonical disk-pressure fixture (`/Pod .* evicted at .* due to disk pressure/`); remediation matches `/relocate.*NVMe/`. Three other pressure roots (memory/pid/notready) get distinct operator-facing remediations. - Missing-layer path: when no NodeRecord joins, verdict falls back to `confidence: partial` with `missing_layers: ["node_condition"]` and the pressure kind is parsed heuristically from the Event Note. - `verdict.go`: `PodEvictedVerdict` + `EvidenceRef` + `Confidence` with JSON tags pinned by `testdata/verdict.schema.json` (additional- Properties:false; drift rejected in `verdict_schema_test.go`). - Generic `Detector` interface deferred per PRINCIPLES §3 rule-of-three — M17 / M18 land as the second and third consumers. ### Replay runner + corpus (internal/synthesis/replay/) - `runner.go`: `LoadFixture(dir)` / `LoadFixturesUnder(root)` walk a directory tree, load `manifest.json` + `events.json` + `node_conditions.json` + `golden.json`. Pure stdlib; hermetic by design per rubric L414. - `runner_test.go`: drives every fixture under `pod_evicted/`, JSONEq-diffs detector output against `golden.json`. Asserts the canonical + three negative fixtures all exist as a structural gate. - Canonical fixture: `pod_evicted/canonical/` (Evicted Pod on `gpu-node-0001` at T=10:00:00Z; DiskPressure transition at T-5s → full-confidence verdict). - Negative fixtures under `_negative/`: `killing/`, `preempted/`, `failed_scheduling/` — each produces an empty verdict slice (rubric L407 FP-rate contract). ### M10 NodeRecord surface (components/receivers/k8sevents/) - `node_record.go`: typed `NodeRecord` + `NodePressureKind` constants (memory/disk/pid/notready) + distinct `NodeSchemaURL` so downstream consumers version-gate the two surfaces independently. `HintForPressure` inversion: NotReady → HintNodeUnhealthy; rest → HintNodePressure. - `node_emit.go`: `buildNodeLogRecord` analogous to `buildLogRecord`. - `node_informer.go`: pure `convertNodeConditions(*corev1.Node) []NodeRecord` (only alarming statuses surface: True for the three pressures, False for Ready) + `nodeConditionDedup` that tracks (nodeUID, conditionType, transitionNano) so an unrelated Node-object update doesn't re-emit. - `receiver.go`: Node SharedInformer registered when `Config.NodeConditions.Enabled=true` (default false — wider RBAC). `nodeRecords chan NodeRecord` joins the existing `events chan` in `run()`'s select; nil channel blocks when disabled. - `rbac.yaml`: adds `core/v1.nodes get,list,watch`. `rbac.can-i.golden` updated to match (sorted verb-then-resource). - `Record.ReportingInstance` added (kubelet sets to node name; M19's join key); `MaxAttributesFloor` bumped 9→10 to keep it intact. - `pattern_consumer_test.go`: 4 new compile-gate tests pin the NodeRecord field walk, the four-kind exhaustiveness, the pressure→ hint inversion, and `reporting.instance` wire-format pinning. - `node_record_test.go`: mirrors `hint_test.go` for the conditionType→NodePressureKind taxonomy and the pressure→Hint map. ### CLI - `tracecore failure-inject pod-evict --reason=... --seed=N --out=path` wraps `tools/failure-inject/podevict/podevict.go` for deterministic Event YAML generation. Closes FOLLOWUPS L806-815 carry-forward. `--allow-cluster-write` still returns `ErrClusterWriteUnsupported` (exit 70) — captured in M19's new carry-forward #5. ### Hygiene - CHANGELOG entry under [Unreleased] / ### Added. - MILESTONES §M19 flipped ☐ → ⧗ partial; rubric items annotated ☑/⧗ with carry-forwards listed. - FOLLOWUPS L806-815 closed; L927-934 (chaos.yml pattern_evicted matrix row) trigger-fired annotation added. ## Linked issue(s) _No linked issue._ ## Release notes ```release-notes NONE ``` ## Checklist - [x] `make ci` exit 0 (verified twice; runtime budget under 60s) - [x] All new unit + replay tests pass; no regressions across repo suite - [x] JSON Schema gates verdict shape (additionalProperties:false) - [x] Negative fixtures produce zero verdicts - [x] RBAC golden regenerated; pattern_consumer compile-gate updated - [x] Commits signed off Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…+ threat-root trace on go-mod-verify Phase-4 A+ aspiration review (2 fresh subagents; both graded A, diverged on which gates to apply). Validation cycle: Findings table: | ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action | |----|------|-------------|----------|---------|-------|------------|------------|---------|--------| | P4.1 (aplus-1 #2, also P2.6) | aplus | operator | CONCERN | A workflow_dispatch run with `inputs.tag` set but `github.ref` ≠ refs/tags/$INPUT_TAG passes Build and fails the OIDC smoke check 15-30 minutes later. Operator wastes runner time and sees the misuse late. | New "Verify dispatch ref matches tag (pre-flight)" step exit-1s within seconds with the documented workaround. | Reviewer noted the smoke check already enforces this — but at job-end, not at job-start. Fail-fast IS the load-bearing property. | n/a — workflow YAML, actionlint clean | yes — P4-aplus-1 | applied this commit; closes P2.6 deferral. | | P4.4 (aplus-2 #2) | aplus | repo-long-term | NIT | go-mod-verify comment says "defense in depth against a compromised GOPROXY mirror" but doesn't name the trust root or the orthogonal threat (a poisoned go.sum itself). | Comment now states "Trust root: the go.sum at this tag commit" and cross-references the tag-protection FOLLOWUPS entry. | A future maintainer might over-attribute the protection. | n/a | no | applied this commit | Rejected/deferred: - P4.2 (aplus-1 #4) — Structured diff lint for release.yml ↔ docs/reproducibility.md. DEFER to FOLLOWUPS.md (real value, but manual review caught both drift directions in Phases 2 + 3; automate when next edit happens). - P4.3 (aplus-1 #6) — Release artifact manifest validation before upload. REJECT. Per anti-bureaucracy: reviewer concedes `needs:` dependency already gates malformed artifacts from reaching the release job. Adding defensive validation against a CI-bug scenario is bloat. - P4.5 (aplus-1 #3) — docs/SUPPLY-CHAIN-IDENTITY.md consolidated reference. DEFER to FOLLOWUPS.md; ~30-min write, scope creep beyond release.yml. M21 release-checklist is the natural trigger. - P4.6 (aplus-1 #5, aplus-2 #3) — Formal threat-model document + M21 alignment narrative. DEFER to M21. - P4.7 (aplus-2 #5) — Cross-link health lint. Duplicate of P4.2; same deferral. Reproducibility: $ make actionlint zizmor # exit 0 $ grep -A1 "workflow_dispatch with inputs.tag" .github/workflows/release.yml # pre-flight gate present Letter-grade outcome: Reviewer #1 starting: A → A+ via criteria 2, 4, 6 (we applied 2 + threat-model comment) Reviewer #2 starting: A → APPROVED-AS-IS (already strong) After this commit: A on the falsifiable axis (one operator-UX gate + one comment clarification), with the broader doc/lint work scoped to follow-ups. Beneficiary: operator. The pre-flight gate cites a specific operator-facing surface (15-30 minute waste on workflow_dispatch misuse) and turns it into a seconds-fast named error. Signed-off-by: Tri Lam <tree@lumalabs.ai> Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
…ctor wiring (PR-A for #260) (#267) ## Summary PR-A for [issue #260](#260) — the recipe-side half. Two commits: 1. `docs(adr): metrics-to-logs converter for pattern input (Option A)` — records the architectural decision and the upstream-blocker analysis under a new `docs/adrs/` directory. 2. `docs(recipe): extend prometheus-scrape with DCGM to hw.* OTTL transform` — adds `transform/dcgm_to_hw_semconv` (a second OTTL pass after `transform/gpu_vendor`) and documents every per-pattern projection in `docs/integrations/prometheus-scrape.md`. ### Architectural decision (issue #260 §Proposed plan step 2) **Option A** — extend `patterndetectorprocessor` with `processor.WithMetrics` alongside the existing `WithLogs` (deferred to PR-B; this PR ships only the wire-format contract PR-B will consume). Per `[[adopt-over-build]]` Option B (OTTL metrics-to-logs converter) was the preferred starting point. It's blocked upstream: at OTel contrib v0.130 (the release pinned in `builder-config.yaml`) the `transformprocessor` signal contexts are sealed per-signal — `metric_statements` can only reference `resource | scope | metric | datapoint` paths, never `log` ([README @ v0.130.0 § Config](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config)). Surveying every contrib connector at v0.130 also turns up zero metrics→logs primitives (`countconnector`, `signaltometricsconnector`, `spanmetrics`, `exceptions`, `servicegraph` all output metrics; `routing`/`failover`/`roundrobin`/`otlpjson` are signal-preserving; `datadog` and `grafanacloud` are vendor-specific). The full survey + reasoning is in [ADR-0001](docs/adrs/0001-metrics-to-logs-pattern-input.md). Option C (routing connector + sibling) is strictly heavier than A with the same end state — routing preserves signal type, so it cannot bridge metrics→logs either. Trade-off accepted: doubled processor surface inside `patterndetectorprocessor` (one `ConsumeLogs`, one `ConsumeMetrics`). Mitigation: the metrics path is a thin projection on top of the same `module/pkg/patterns/` library; verdict emission stays log-based via a connector. Upstream-contribution slot opened — a `metricthresholdconnector` (inverse of `signaltometricsconnector`) would let us collapse Option A back to Option B without operator-visible churn. Tracked under RFC-0013 §5. ### Per-pattern OTTL projection (this PR) | Pattern | Raw DCGM input | OTel output | Trigger-statement shape (PR-B consumes this) | |---|---|---|---| | #1 NVLink | `DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES` | `hw.gpu.nvlink.io` (Counter, `By`) + `hw.gpu.nvlink.link={N}` + `network.io.direction={transmit\|receive}` | `rate(hw.gpu.nvlink.io[5m]) < 0.5 × per-GPU median over the per-`hw.id` link set, for 10m` | | #3 HBM ECC | `DCGM_FI_DEV_ECC_{SBE,DBE}_{VOL,AGG}_TOTAL` | `hw.errors` (Counter, `{error}`) + `error.{type,subtype,persistence}` | `increase(hw.errors{error.type="uncorrected", error.persistence="volatile"}[5m]) > 0` | | #4 thermal | `DCGM_FI_DEV_{THERMAL,POWER,SYNC_BOOST,BOARD_LIMIT,LOW_UTIL}_VIOLATION` | `hw.gpu.throttle.duration` (Counter, `s`) + `hw.gpu.throttle.reason` | `increase(hw.gpu.throttle.duration{reason="thermal"}[5m]) > 30s, for 10m` | | #5 PCIe | `DCGM_FI_PROF_PCIE_{TX,RX}_BYTES` | `hw.gpu.io` (Counter, `By`) + `network.io.direction` + `hw.gpu.pci.bdf` resource attr | `rate(hw.gpu.io[5m]) < 0.3 × host median, for 15m` | Resource attribution: `UUID` or `gpu_uuid` → `hw.id`; `gpu` or `GPU` → `hw.gpu.index`; `hw.type="gpu"` stamped on every DCGM series. Dual-label fallback handles both legacy and modern dcgm-exporter releases. ### What's NOT in this PR The third commit originally planned by issue #260 — `feat(recipe): metrics→logs OTTL alert emission` — is dropped because OTTL cannot emit log records from metric streams at v0.130 (root cause cited above, full analysis in ADR-0001). The fallback is Option A, which requires modifying `patterndetectorprocessor` source — out of scope for a recipe-only PR. The alert-emission half lands in PR-B alongside the NVLink detector code. ## Test plan - [x] `./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml` exits 0. - [x] `bash scripts/doc-check.sh` clean (test-name parity, markdown link integrity, banned-phrase lint, integration-recipe rubric all pass). - [x] `bash scripts/validator-recipe.sh` validates 6 recipes (skips 2 requires-linux / requires-k8s-cluster — CI ubuntu runner exercises those paths). - [x] `make check` (golangci-lint, go vet, mod-verify) clean. - [ ] Adversarial reviewer check that the OTTL `set(metric.name, ...)` identity-conflict caveat (recipe markdown § "Identity-conflict caveat") matches the operator's downstream backend's deduplication semantics for the four projected metric families. - [ ] CI runner exercises the requires-linux / requires-k8s recipes that this PR's host skipped. ## Followup - **PR-B (issue #260, blocked by this PR's merge)** — extend `patterndetectorprocessor` with `processor.WithMetrics`; land `module/pkg/patterns/nvlink_degradation.go`; wire verdict emission through a sibling logs pipeline via a connector. Decision rationale baked into ADR-0001 so PR-B doesn't relitigate. - **Upstream contribution (v0.3 cycle, RFC-0013 §5)** — propose a `metricthresholdconnector` (or `signaltologsconnector`) to OTel-contrib. If/when it lands, the recipe can collapse back to Option B without operator-visible churn — the `hw.gpu.*` wire-format contract this PR ships is the load-bearing customer surface, not the upstream component selection. - **dcgm-exporter throttle-reason mapping verification** — the pattern #4 mapping table assumes modern dcgm-exporter `DCGM_FI_DEV_*_VIOLATION` series names. If your dcgm-exporter release exposes the throttle bitmask as `DCGM_FI_DEV_CLOCKS_EVENT_REASONS` instead (per the semconv proposal Open Question #1), extend the OTTL block to expand the bitmask into discrete `hw.gpu.throttle.duration` datapoints — out of this recipe's scope; tracked as a follow-up under the same milestone. ```release-notes docs(recipe): The `prometheus-scrape` recipe gains a second OTTL transform pass that projects raw `dcgm-exporter` `DCGM_FI_*` series into the customer-stable `hw.gpu.*` / `hw.errors` namespace (semconv proposal). Four pattern families covered — NVLink, HBM ECC, thermal throttle, PCIe — using upstream OTTL functions only (`set`, `IsMatch`, `ExtractPatterns`, `Int`). The wire-format contract this transform produces is the input PR-B's NVLink detector will consume; the in-tree component change for the detector itself lands in a follow-up PR per ADR-0001. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
…AR pattern #4) (#281) ## Summary Pattern #4 detector + processor wiring: thermal throttle cascade. Detects when N+ GPUs on the same node show rising `hw.gpu.throttle.duration{reason=thermal}` delta within a window — cluster-rack thermal event. **Layer decision**: single-layer (multi-GPU same-node cascade; no cross-signal join). Mirrors `nccl_hang`'s no-Confidence shape. **Defaults**: - `Window`: 5min - `ThrottleDeltaThreshold`: 30s (per pattern doc) - `MinCascadeGPUs`: 4 **Integration gap**: detector ready; end-to-end firing blocked on PR-B metrics→logs converter per ADR-0001. Mirrors HBM ECC pattern's gap at #273. Closes pattern #4 in the NORTHSTAR ladder (3/15 → 4/15 covered after this lands; #5 in parallel PR). ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/... -race -count=1` - [x] `make build && make check` - [ ] CI gates (validator-recipe, install-bench, bench-check) ```release-notes feat(patterns): thermal_throttle pattern detector (pattern #4) — detects multi-GPU same-node thermal cascade ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Ships NORTHSTAR pattern #5 per docs/patterns/pattern-5-pcie-aer.md. Two layers required for a verdict; a single layer alone emits nothing. - Layer 1 (kernel AER): `PCIe Bus Error: severity=…, type=…` line on a specific `gpu.id` (PCI BDF) extracted by the journald-kernel OTTL recipe. - Layer 2 (rate-collapse): per-GPU `hw.gpu.io` `BytesPerSecond` <= `(1 - RateDropThreshold) * BaselineBytesPerSecond` on the same BDF within the correlation window. Default threshold 0.5 (50% drop) and window 5min. Causality flows forward — the AER must precede the rate-collapse sample. Why two layers required (no Confidence taxonomy, mirroring xid_correlation + hbm_ecc): - AER alone: corrected errors recovered — the link re-trained. Operators see raw AER telemetry on journald already. - Rate-collapse alone: workload-natural Tx/Rx variance (a rank finished a comm phase, etc.). The AER is the hardware-fault confirmation that makes the join operator-actionable. Join key is `gpu.id` (PCI BDF) per RFC-0013 §3, shared across the dmesg AER preamble and the dcgm-exporter `hw.gpu.pci.bdf` resource attribute on `hw.gpu.io`. Node is carried for the verdict prose but not used in the join key — the BDF is the proximate hardware identifier. Defaults: - CorrelationWindow: 5min — mirrors the spec's `rate(...[5m])` PromQL and typical post-AER link-renegotiation latency. - RateDropThreshold: 0.5 — a one-generation PCIe downshift (Gen5 → Gen3 or x16 → x8) lands well past this floor. What's new: - module/pkg/patterns/pcie_aer.go — PCIeAERDetector library type, PCIeAERRecord (AER kernel message), PCIeIORecord (hw.gpu.io rate sample with baseline), PCIeAERVerdict output. PatternIDPCIeAER = "5"; EvidenceKindAER = "pcie_aer"; EvidenceKindPCIeIOCollapse = "hw_io_collapse". - module/pkg/patterns/testdata/pcie_aer_verdict.schema.json — JSON schema with 10 drift falsifiers (extra field, confidence re-add, pattern.id numeric, pattern.id wrong value, severity outside enum, gpu_id empty, drop_ratio negative, drop_ratio over 1, evidence kind outside enum, evidence_trail under min). Verdict struct carries promoted scalar fields (Severity, AERType, DropRatio, GPUID, Node) so the processor wiring can stamp them as top-level OTLP log attributes for dashboard table-aggregation per PR #275's lesson. No dead fields — every struct field is read by either the Evaluate path or the verdict-shape contract. Integration end-to-end firing in a real deployment is blocked on PR-B (issue #260): no OTTL recipe today derives the per-GPU `tracecore.alert.pcie_rate_collapse`-shaped log record from `hw.gpu.io`. The detector library is the v0.3 moat and ships independently per ADR-0001. Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
#5) (#286) ## Summary Ships NORTHSTAR pattern #5 (PCIe AER + hw.gpu.io rate-collapse) per [`docs/patterns/pattern-5-pcie-aer.md`](docs/patterns/pattern-5-pcie-aer.md). Two layers required for a verdict (no Confidence taxonomy — mirroring xid_correlation + hbm_ecc): - **Layer 1 (kernel AER):** `PCIe Bus Error: severity=..., type=...` line on a specific `gpu.id` (PCI BDF) extracted by the journald-kernel OTTL recipe. - **Layer 2 (rate-collapse):** per-GPU `hw.gpu.io` `BytesPerSecond` <= `(1 - RateDropThreshold) * BaselineBytesPerSecond` on the same BDF within the correlation window. Default threshold 0.5 (50% drop) and window 5min. Causality is forward — the AER must precede the rate-collapse sample. Join key: `gpu.id` (PCI BDF) per RFC-0013 §3, shared across the dmesg AER preamble and the dcgm-exporter `hw.gpu.pci.bdf` resource attribute on `hw.gpu.io`. Node is carried for the verdict prose only, not the join key — the BDF is the proximate hardware identifier. The verdict log record promotes operator-facing scalars onto top-level OTLP attributes (`gpu.id`, `kernelevents.pcie_aer.severity`, `kernelevents.pcie_aer.type`, `k8s.node.name`, `tracecore.alert.pcie_rate_collapse.drop_ratio`) per the issue #270 scalar-promotion contract, via the `putStrIfSet` helper so an empty upstream stamp doesn't silently match empty-filter dashboard queries. ## Integration gap (filed separately, not blocking this PR) Per ADR-0001's intentional split — library + processor wiring first, OTTL recipes second — end-to-end firing on a real cluster needs: - #284 — OTTL recipe to emit `tracecore.alert.pcie_rate_collapse.*` log records from raw `hw.gpu.io` Counter samples (the metrics→logs PR-B side; same blocker class as #260 for pattern-1). - #285 — `journald-kernel` recipe extension to extract `kernelevents.pcie_aer.severity` / `.type` / `gpu.id` from `PCIe Bus Error: ...` kernel lines (the existing NVRM Xid stanza is the proximate sibling — same file, same extraction shape). Until both land, the detector is the v0.3 moat (pattern logic + tests) and the processor wiring is configured-but-quiet (zero verdicts on real input — projections find nothing). The detector library and wiring ship independently per ADR-0001. ## What's new - `module/pkg/patterns/pcie_aer.go` — `PCIeAERDetector` library type, `PCIeAERRecord` (AER kernel message), `PCIeIORecord` (hw.gpu.io rate sample with baseline), `PCIeAERVerdict` output. `PatternIDPCIeAER = "5"`; `EvidenceKindAER = "pcie_aer"`; `EvidenceKindPCIeIOCollapse = "hw_io_collapse"`. - `module/pkg/patterns/testdata/pcie_aer_verdict.schema.json` — JSON schema with 10 drift falsifiers (extra field, confidence re-add, pattern.id numeric, pattern.id wrong value, severity outside enum, gpu_id empty, drop_ratio negative, drop_ratio over 1, evidence kind outside enum, evidence_trail under min). - Config: `pcie_aer_window` (default 5min), `pcie_aer_rate_drop_threshold` (default 0.5). `Validate` rejects sub-1s window and threshold outside [0, 1]. - `collectInputs` grows two new typed projections (`projectPCIeAERRecord`, `projectPCIeIORecord`) ordered after `hbm_ecc` in the pod_event > node_condition > nccl_fr > xid > hbm_ecc > pcie_aer > pcie_io discriminator priority. PCIe AER is checked before PCIe IO because the kernel-line discriminator is a tighter regex match than the metric-derived bridge attribute. - Wiring tests: emits-verdict, AER-alone (no fire), rate-collapse-alone (no fire), window-configurable, threshold-configurable, promoted-scalar attribute presence, Validate guard. ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -race -count=1` — green - [x] `make build` — green (OCB compile clean against the new processor wiring) - [x] `make check` — green (golangci-lint, go vet, go mod verify) - [x] `make doc-check` — green (comment-noise diff gate clean) - [x] PCIeAERDetector library tests cover both-layers-emit, AER-alone (no fire), rate-collapse-alone (no fire), window-respected (collapse outside window suppressed), threshold-respected (drop under threshold suppressed), schema-conformance, schema-drift falsifier battery - [x] Wiring tests confirm the projections gate on the right discriminators (severity + gpu.id; bridge attribute + gpu.id) and the verdict log record carries the promoted-scalar attribute taxonomy ```release-notes patterndetectorprocessor wires the PCIeAERDetector library and ships NORTHSTAR pattern #5 (PCIe AER + hw.gpu.io rate-collapse, join key gpu.id PCI BDF, default 5m correlation window, default 0.5 rate-drop threshold). Verdict log records carry pattern.* attrs plus promoted scalars (gpu.id, kernelevents.pcie_aer.severity, kernelevents.pcie_aer.type, k8s.node.name, tracecore.alert.pcie_rate_collapse.drop_ratio) for dashboard table-aggregation without parsing pattern.verdict_json. End-to-end firing on a real cluster requires the OTTL recipes tracked in issues #284 + #285; detector + wiring ship independently per ADR-0001. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary The `docs/adrs/` directory held one file (`0001-metrics-to-logs-pattern-input.md`). A single-file decision-record directory sitting parallel to `docs/rfcs/` (13 RFCs + README + template) is taxonomy drift — operators and contributors had to learn two near-identical conventions for "load-bearing architectural decision in tree." This PR collapses the split. ### What changed - ADR-0001 → **RFC-0014** (`docs/rfcs/0014-metrics-to-logs-pattern-input.md`). Content reformatted to match the RFC template's section headings (Summary / Motivation / Proposal / Alternatives / Open questions / Migration / References). The substance (Option A vs Option B vs Option C analysis, the v0.130 contrib survey, PR-A landed + PR-B pending sequencing) is preserved verbatim — this is active design for issue #260 PR-B, not archaeology. - `docs/adrs/` directory removed. - 6 cross-references repointed at the new RFC path: - `docs/README.md` (subdirectories table — `adrs/` row removed) - `docs/ATTRIBUTES.md` (2 spots: `tracecore.alert.pcie_rate_collapse.*` row + "See also" link) - `docs/integrations/prometheus-scrape.md` (2 spots) - `docs/patterns/pattern-4-thermal-throttle.md` - `docs/patterns/pattern-5-pcie-aer.md` - 5 source-code comments repointed (`module/processor/patterndetectorprocessor/{patterndetector.go,thermal_throttle_test.go}`, `module/pkg/patterns/{pcie_aer.go,thermal_throttle.go}`). - `docs/rfcs/README.md` status-index gains an RFC-0014 row (`accepted`, 2026-05-31). ### Why convert (not delete) ADR-0001 was evaluated against the "delete if RFC-0013 already covers it" bias. It is not covered: - RFC-0013 §5 mentions a `metricthresholdconnector` as a contribution slot in one bullet. It does **not** evaluate Option A vs Option B vs Option C, cite the contrib v0.130 evidence, or sequence the PR-A/PR-B split for the metric-sourced detectors. - ADR-0001 is the binding design contract for patterns #1 / #3 / #4 / #5 (4 of the next NORTHSTAR detectors). Source code (`pcie_aer.go`, `thermal_throttle.go`, `patterndetector.go`) cites it as the reason for the staged-but-quiet wire-up. Delete would orphan 5 source comments and break the audit trail for an active design decision. Convert keeps the record load-bearing without preserving the parallel taxonomy. ### Verification - `grep -rn "docs/adrs\|adrs/0001"` returns 0 hits. - `grep -rn "ADR-0001\|ADR 0001"` returns 0 hits. - `ls docs/ | grep -i adr` returns empty. - pre-commit golangci-lint + go vet + go mod verify clean. ```release-notes docs: collapse single-file `docs/adrs/` into `docs/rfcs/`. ADR-0001 (metrics-sourced pattern inputs) is promoted to RFC-0014 verbatim; cross-references across docs and module source repointed. ``` ## Test plan - [x] `grep -rn "docs/adrs"` returns 0 hits - [x] `grep -rn "ADR-0001"` returns 0 hits - [x] `docs/adrs/` directory removed - [x] golangci-lint + go vet clean (pre-commit hook) - [ ] CI green Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Extends prometheus-scrape.md with the bridge attribute contract for the four metrics-derived patterns: - pattern #1 NVLink (#260) — the `hw.gpu.nvlink.io` OTTL transform already lands in commit 0baa557; this PR closes #260's recipe-half. - pattern #3 HBM ECC (#273) — `hw.errors.delta` + error.{type, subtype,persistence} + gpu.id contract. - pattern #4 thermal throttle (#282) — `hw.gpu.throttle.duration.delta` in integer seconds + reason=thermal + gpu.id contract. - pattern #5 PCIe AER Layer 2 (#284) — the `tracecore.alert. pcie_rate_collapse.*` namespace contract. OTTL metrics->logs emission stays upstream-blocked at OTel-contrib v0.130 (RFC-0014): no contrib processor or connector emits log records from a metrics pipeline. The bridge contract documented here is the load-bearing wire format any future emitter (an upstream metricthresholdconnector OR the WithMetrics extension to patterndetectorprocessor per RFC-0014 PR-B) MUST honor; the detector projections at module/processor/patterndetectorprocessor/ patterndetector.go gate on this contract today. last-verified marker bumped to 2026-06-01. Closes #260. Closes #273. Closes #282. Refs #284 (Layer 1 closed under #285 in a prior commit; Layer 2 contract documented here). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Reviewer cleanup pass on yellow-tier findings: * Schema: tighten gen_ai.training.job_id to minLength:1 and document the fallback-grouping semantic explicitly — the field is OMITTED (not empty-string) on the namespace-only fallback path. Downstream consumers must treat ABSENCE as the explicit fallback signal, not silent exclusion. processor already uses putStrIfSet to suppress empty-string variants. * Spec eval rule: clarify Pattern #8 (NCCL hang) vs Pattern #9 (NCCL bootstrap) trigger disjoint-ness — hang fires on PRESENCE of non-completed FR records mid-run; bootstrap fires on ABSENCE of any FR record past deadline. Both can fire on the same cohort during a heterogeneous bootstrap by design. * Spec impl-notes: add note #5 documenting the min(ReadyAt) / max(ReadyAt) split with the late-joiner-masks-stuck-rank scenario as the rationale. * Empty-Node skip comment in detector: previous comment claimed the skip biases toward false-negatives; in fact it STRENGTHENS the absence signal (rank stays "no FR seen" → counted as failed), biasing toward false-positives. Correct the directionality and call out the fallback-to-(namespace, rank) escape hatch. Signed-off-by: Tri Lam <tri@maydow.com>
5 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Closes #331. Replaces the single `pattern-pod-evicted` chaos job with a matrix-strategy `pattern-detectors` job covering every detector named in v1-rc1 cut-criterion §1: `hbm_ecc`, `thermal_throttle`, `pcie_aer`, `pod_evicted`, `nccl_hang`, `xid_correlation`. Each row runs the detector's positive + negative + schema `*_test.go` prefix under `-race`: - **Positive / Edge asserts** — verdict emitted with correct `pattern.id` within bounded time. - **Negative / SchemaRejectsDrift asserts** — zero verdicts on baseline / confounder input (the false-positive gate the chaos matrix exists to enforce). The `pod_evicted` row additionally runs the on-disk replay corpus under `module/pkg/replay/pod_evicted/`; sibling rows skip that step until their `replay/<pattern>/` fixtures land — tracked as Path A in [`docs/v1-rc1-test-audit.md`](../blob/main/docs/v1-rc1-test-audit.md) §4. ## Why this matrix shape, not new `failure-inject` modes Audit §4 explicitly recommends Path A (matrix over existing hermetic detector tests) for rc1 and defers Path B (per-pattern `failure-inject` CLI modes) to v1.x. Path A reuses the race-detector + golden-SHA shape already validated by `pattern-pod-evicted` and costs ~60 LoC of workflow YAML instead of five new CLI subcommands. ## Wall-time budget Observed per-row runtime under `-race` (measured locally on this worktree): | Row | runtime | |---|---| | hbm_ecc | 1.35s | | thermal_throttle | 1.33s | | pcie_aer | 1.30s | | pod_evicted (incl. replay corpus) | 1.35s + 1.34s | | nccl_hang | 1.31s | | xid_correlation | 1.32s | Per-row `timeout-minutes: 5` is ~3× headroom; matrix parallelism keeps total wall-time well inside the chaos-gate budget. ## Test plan - [x] `actionlint .github/workflows/chaos.yml` — clean. - [x] `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/chaos.yml'))"` — parses, 3 jobs (`harness-determinism`, `cpu-steal-mpstat`, `pattern-detectors`), 6-row matrix. - [x] For every matrix row, `cd module && go test -race -count=1 -run '<regex>' ./pkg/patterns/...` — all green (counts: HBMECC=16, ThermalThrottle=17, PCIeAER=15, PodEvicted+PressureFromNote=12, NCCLHang=10, XidCorrelation=14). - [x] `cd module && go test -race -count=1 ./pkg/replay/...` (pod_evicted corpus step) — green. - [x] Pre-commit hook: golangci-lint + go vet + go mod verify + attribute-namespace-check — all green. - [ ] Workflow runs green on `main` after merge (verifiable only post-merge — nightly cron + push-to-main). ```release-notes ci: chaos.yml now matrix-tests all six shipped detectors (#14 pod_evicted, #8 nccl_hang, #3 hbm_ecc, #4 thermal_throttle, #5 pcie_aer, xid_correlation) instead of just pod_evicted. Closes the chaos-coverage gap called out in v1-rc1-test-audit §4. No operator-visible behavior change. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Closes #337. The CUDA OOM detector (`projectFBMemoryRecord` at `module/processor/patterndetectorprocessor/cuda_oom.go:114`) gates on `hw.gpu.memory.{free,total}` log-record attributes, but nothing in the recipe layer produced them: `dcgm-exporter` emits `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` as Prometheus gauges and no OTTL transform projected them onto the customer-stable namespace. Detector compiled, never fired on a real install — sibling gap to #273 (pattern #3), #282 (#4), #284 (#5). This PR closes the gap on the metric-side projection and pins the load-bearing log-shape contract the bridge layer MUST honor. - `docs/integrations/examples/prometheus-scrape.yaml`: - `DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used` (Gauge, unit `By`) - `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free` (Gauge, unit `By`) - Identity-preserving rename only; `hw.gpu.memory.total = used+free` deferred to the bridge layer per the named upstream limit (see below). - `docs/integrations/prometheus-scrape.md`: - New `### Pattern #10 — CUDA OOM (framebuffer)` metric-side projection section with raw-series → semconv table. - New `#### Pattern #10 — hw.gpu.memory.{free,total}` bridge- contract subsection with full log-record schema (yaml-shaped) matching what `projectFBMemoryRecord` reads, plus MIG caveat and unit-test cross-link. - Intro + bridge-contract header bumped to include pattern #10. - `docs/patterns/10-cuda-oom-deceptive.md`: - Signal-source line links to the recipe sections. - Open Question #1 (`DCGM_FI_DEV_FB_*` OTTL extension) struck through; resolution recorded. - `docs/ATTRIBUTES.md`: - `hw.gpu.memory.free` / `.total` rows updated to distinguish metric vs log shape and to cross-link to the recipe section. - New `hw.gpu.memory.used` row (now projected on the metrics pipeline by this PR — dashboard evidence context). ## Root cause + named upstream limit **Root cause (fixed in this PR):** the prometheus-scrape OTTL transform had no stanza projecting the DCGM FB series onto `hw.gpu.memory.*`. The detector's projection gate could not be satisfied on a real install. Fixed by adding the rename stanza in `transform/dcgm_to_hw_semconv` (same processor the #1/#3/#4/#5 projections already live in — no new processor surface). **Named upstream limit (NOT worked around — tracked):** OTel-contrib `transformprocessor` v0.130 `metric_statements` cannot perform cross-series arithmetic — there is no OTTL path to compute `hw.gpu.memory.total = hw.gpu.memory.used + hw.gpu.memory.free` on a metrics pipeline ([upstream README](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config)). Per RFC-0014 §Alternatives, the metrics→logs primitive does not exist in contrib at v0.130 either. The total scalar lives at the bridge layer (RFC-0014 PR-B `WithMetrics` extension to `patterndetectorprocessor`, tracked under #260). The recipe pins the load-bearing wire format the bridge MUST honor so PR-B lands without a contract change. ## Adopt-over-build posture Every new OTTL statement uses upstream functions only (`set`, `==` equality). No new transformprocessor extension. Mirrors the existing #1/#3/#4/#5 stanzas. ## Test plan - [x] `make build` — clean - [x] `./scripts/validator-recipe.sh` — 9 validated, 3 skipped (non-linux), 0 fail; prometheus-scrape example passes `tracecore validate`. - [x] `./scripts/doc-check.sh` — 721 markdown links resolve, all new cross-links + anchor refs included; banned-phrase lint clean; recipe markers (`tested-against`, `last-verified`) present. - [x] `./scripts/attribute-namespace-check.sh` — 67/67 attribute literals documented (no new undocumented attrs introduced). - [x] golangci-lint, go vet, go mod verify via commit hook — clean. - [ ] CI linux runner exercises journald + k8sobjects skip-paths we couldn't run locally (validator-recipe ubuntu job). ```release-notes recipe(ottl): project DCGM `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` onto the customer-stable `hw.gpu.memory.{used,free}` namespace and pin the metrics-to-logs bridge log-shape spec the pattern #10 (CUDA OOM) detector consumes via `projectFBMemoryRecord`. `hw.gpu.memory.total = used + free` derivation is deferred to the RFC-0014 PR-B `WithMetrics` bridge layer because OTTL `transformprocessor` v0.130 has no cross-series arithmetic on metrics pipelines. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
chore(config): document pattern-prefixed knob naming convention (v1.0-rc1 cosmetic / v2.0 nest)
#379
Closed
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Bundle three post-wave-audit docs-cleanup issues (`docs/v1-rc1-post-wave-audit.md` findings #5, #8, #15). One commit per issue for clean revert / history. Fourth audit cleanup (#380) intentionally deferred — precondition unmet. ## Per-issue scope + delta ### `chore(module): trim stale PR-I genesis-tag rationale` (closes #381) Audit finding #8. `module/doc.go` referenced PR-I.1a/1b/2 as "Contents land in…" and explained the genesis-tag proxy rationale. All three PRs landed long ago; the module is alive. Trimmed to package-doc only. - **Before:** 14 lines (8-line historical-artifact block). - **After:** 6 lines (`package module` declaration only, retained for proxy compat). - `go list -m github.com/tracecoreai/tracecore/module` still resolves. ### `docs(rc1): sweep strikethrough resolved-in-issue blocks` (closes #383) Audit finding #15. The two specific examples called out by the issue: - `docs/patterns/10-cuda-oom-deceptive.md:74` — `~~DCGM_FI_DEV_FB_* OTTL recipe extension.~~` referenced closed #337 (verified via `gh issue view 337` → CLOSED). - `docs/v1-rc1-simplification-audit.md:185-186` — `~~components/exporters/otlphttp/~~` + `~~components/exporters/stdoutexporter/~~` referenced closed #333 / #334 (both verified CLOSED). Resolution text retained as cross-link history; only the strike-through visual noise removed. **Explicitly NOT touched:** `docs/followups/M*.md`, `docs/MILESTONES.md`, `docs/rfcs/0003-*.md` strikethroughs — these are documented convention markers per `MILESTONES.md:65,69` ("A PR that completes a follow-up strikes it through (`~~…~~`)"). Sweeping them would delete a load-bearing convention. ### `chore(config): document pattern-prefixed knob naming convention` (closes #379) Audit finding #5. `config.go` (540 lines) carries three knob styles. The post-wave prefixed shape is right; renaming bare→prefixed pre-RC1 would break every existing `values.yaml`. Take option 1 from the issue: document the convention at top-of-file so future detectors follow it. - **Before:** No top-of-file knob-naming guidance. - **After:** 12-line `// Knob-naming convention` block before `package patterndetectorprocessor`, naming bare-name knobs explicitly and pointing forward to v2.0 nesting. ## Why #380 is NOT in this bundle #380 sweeps 12 "future PR-B" comments referencing the RFC-0014 `WithMetrics` bridge. Its own acceptance criteria say "Closes with PR-B's merge commit." Precondition check today: - `git log --all --grep=WithMetrics` → only wave-3 port commits, no bridge. - `docs/rfcs/0014-metrics-to-logs-pattern-input.md:3` → status `accepted, blocks issue #260 PR-B`. - `docs/integrations/prometheus-scrape.md:381` → "PR-B has NOT yet shipped." All 12 comments are factually accurate today. Sweeping prematurely would either delete load-bearing forward-looking documentation, or replace it with stale post-PR-B prose. Issue stays open as tracker per its design; commented on issue noting the deferral. ## Verification - `go build ./...` (module) — clean. - `go vet ./...` — clean (via pre-commit). - `go list -m github.com/tracecoreai/tracecore/module` — resolves. - `golangci-lint run ./...` — 0 issues (via pre-commit). - `attribute-namespace-check` — 100/100 (via pre-commit). - `gh issue view 337 333 334` — all CLOSED (justifies #383 sweep). ## Test plan - [x] Module builds + vets. - [x] Referenced closed-issue states verified. - [x] Strikethrough convention markers in `docs/followups/*` / `MILESTONES.md` / RFC-0003 left intact (deliberate scope-limit). - [x] `#380` deferred with on-issue rationale; not in `Closes` trailer. - [ ] CI green. ```release-notes NONE ``` Closes #383 Closes #381 Closes #379 --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but the join shape varies across patterns and the rationale lived only in code comments + PR review threads. Operators tuning windows had to read source per detector. Audit finding: five distinct shapes are load-bearing (chosen by the causal physics of each signal), not bugs: - One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect. - Asymmetric two-sided (#11): pre-stall covers concurrent-start checkpoints; post-stall covers OTTL-bridge logger latency. - Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window could be cause OR consequence. - Job-window bounded (#13): SDC counter rise must fall in the bounded eval-cycle's owning job; no operator knob is meaningful. - Trailing-window rate / freshness (#2 #4 #8): rolling window anchored at `now` or the most-recent record. Decision: document the existing reality, do not converge. Forcing every detector to the asymmetric two-knob form would silently zero one leg for the one-sided detectors (footgun on clock skew) and would not apply to #13 at all. Adds: - 'Why this correlation shape' section in docs/patterns/07, 11, 13 (the three shapes the issue called out by name). - 'Correlation-window semantics' table in docs/patterns/README.md covering ALL 11 detectors with the predicate, anchor, and shape rationale, plus cross-links to the per-pattern sections. No code changes; no detector behavior changes. Closes #367. Signed-off-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Lands replay corpora for the seven pattern detectors that lacked one, closing the Path-A test gap named in the v1-rc1 audit (#366). Each pattern now ships `module/pkg/replay/<pattern>/{canonical,_negative, _real_world}/` fixtures plus a `*_replay_test.go` runner that JSON-eqs detector output against the on-disk golden. Detectors covered: `hbm_ecc` (#3), `thermal_throttle` (#4), `pcie_aer` (#5), `ib_link_flap` (#2), `nccl_hang` (#15), `cuda_oom` (#10), `xid_correlation` (#16). `pod_evicted` (#14) corpus is unchanged. ## Design - Each detector takes a different input shape (e.g. `HBMECCRecord + XidRecord` for `hbm_ecc`, `ThermalThrottleRecord` for `thermal_throttle`, ...), so the existing `LoadFixturesUnder` helper (typed on `Record + NodeRecord`) cannot be reused. Each detector gets its own `*_replay_test.go` that inlines the per-detector JSON read; shared fixture-discovery and golden-assert helpers live in `helpers_test.go`. - Two detectors (`nccl_hang`, `ib_link_flap`) take a `Now` reference. Tests pin `Now` to a fixed timestamp matching the fixture's `started_ns` so hang-age and flap-window inclusion stay deterministic across replay runs (otherwise wall-clock drift would silently flip the verdicts as the fixture aged). - Goldens were generated from the live detectors via `UPDATE_REPLAY_GOLDEN=1 go test ./module/pkg/replay/...` and pinned; future drift in detector output (headline / remediation prose, evidence-trail UID shape, scalar-field rename) surfaces as a `JSONEq` diff against the fixture. Operators can also eyeball the golden to assert what they EXPECT vs what fires. - Negative fixtures each exercise a distinct discriminator (wrong Xid code, single GPU, no AER, no eviction, completed state, single transition, no OOM log) so a regression in one false-positive guard lights up the corresponding row only. - Flipped `run_corpus: true` on every row of the chaos.yml pattern-detectors matrix now that every detector has a corpus. ## Test plan - [x] `make check` — clean - [x] `go test -race -count=1 ./pkg/replay/...` — 35 tests pass (28 new + 7 pre-existing pod_evicted) - [x] `go test -count=1 ./processor/patterndetectorprocessor/...` — unchanged, still green - [x] `make verify` (pre-push) — clean - [ ] CI: chaos.yml `pattern-detectors` matrix — 8 rows, each runs hermetic regex + replay-corpus step Closes #366. ```release-notes NONE ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
8 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Closes #393. Ships the metric-side OTTL projection from `node_exporter --collector.infiniband`'s `node_infiniband_port_state_id` Gauge onto the customer-stable `hw.network.ib.*` namespace (`hw.network.ib.port.state` int + `hw.network.ib.device` + `hw.network.ib.port.num`) so pattern #2's `IBLinkFlapDetector` consumes the same vendor-neutral wire shape regardless of whether the underlying source is node_exporter, a Mellanox exporter, or the `mlx5_core` journald stream. Detector library + processor wiring already shipped in #391 (closed #300). Only the metric-side input recipe was missing — pattern #2 was configured-but-quiet on real deployments. This PR closes that gap. ## Wire contract (node_exporter raw → hw.network.ib.*) ``` node_infiniband_port_state_id{device="mlx5_0", port="1"} = 4 (IBA phys_state ID) ↓ transform/ib_to_hw_semconv Gauge metric "hw.network.ib.port.state" with datapoint attrs: hw.network.ib.device = "mlx5_0" (str, from `device` label) hw.network.ib.port.num = 1 (int, from Int(`port` label)) value = 4 (int, the phys_state ID) ``` The future RFC-0014 PR-B metrics→logs bridge emitter (shared with patterns #3/#4/#5/#10) will lift these three attributes onto a log record at emit time. The bridge log-record schema for pattern #2 is pinned in `docs/integrations/prometheus-scrape.md §Pattern #2 — hw.network.ib.port.state (issue #393)` so PR-B has no per-pattern reconstruction work to do. The companion series `node_infiniband_state{state="<name>"}` (string label) is intentionally NOT mapped — the detector (`module/processor/patterndetectorprocessor/ib_link_flap.go`) compares `state.Int()` against `patterns.IBPortState*` integer constants, so the string variant would round-trip wrong. ## No detector code change required The detector reads three attribute names off a log record: `hw.network.ib.port.state`, `hw.network.ib.device`, `hw.network.ib.port.num`. The recipe stanza stamps the exact same three names on the metric datapoint. The wire format `port.Int()` expects (the projector at `module/processor/patterndetectorprocessor/ib_link_flap.go` line 39 calls `int(port.Int())`) is satisfied because the OTTL `Int()` cast on the Prometheus `port` string label produces a pdata int Value. Confirmed by the new `TestRecipe_IBLinkFlap_RoundTripFiresVerdict` test. ## Root cause + scope - **Root cause of #393**: missing metric-side OTTL stanza. Fixed. - **Out of scope (separate blocker, tracked under #260 PR-B)**: the metrics→logs bridge emitter. Upstream-blocked at OTel-contrib v0.130 — `transformprocessor`'s `metric_statements` cannot reference `log.*` paths and no contrib connector emits log records from a metrics pipeline (per [RFC-0014](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0014-metrics-to-logs-pattern-input.md)). The recipe doc explicitly documents this gating relationship; PR-B is shared with patterns #3/#4/#5/#10 and lands the bridge once. ## Files changed - `docs/integrations/examples/prometheus-scrape.yaml` — new `transform/ib_to_hw_semconv` processor; wired into the `metrics/scrape` pipeline. Validates with `./_build/tracecore validate` (exit 0). - `docs/integrations/prometheus-scrape.md` — new "Pattern #2 — InfiniBand link flap" projection section, intro updated from "Two" to "Three OTTL transforms", and bridge log-record contract subsection added under the "Metrics-to-logs bridge contract" section. - `docs/patterns/pattern-2-ib-link-flap.md` — deleted "Integration gap" section, replaced with "Integration recipe" pointing at the shipped stanza; updated "Why node_exporter sees it" prose to drop the "pending" hedge. - `module/processor/patterndetectorprocessor/ib_link_flap_recipe_test.go` — new file. `TestRecipe_IBLinkFlap_StanzaPinsWireContract` parses the example YAML and asserts every load-bearing token is present (source metric name, three `hw.network.ib.*` attrs, the `Int()` cast on the port label, the transform name, and the pipeline wiring). `TestRecipe_IBLinkFlap_RoundTripFiresVerdict` simulates the end-to-end path: builds `plog.Logs` with the exact attribute shape the recipe stamps and asserts the processor emits a flap verdict. ## Test plan - [x] `./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml` → exit 0 - [x] `bash scripts/validator-recipe.sh` → 9 validated, 3 skipped (non-linux host) - [x] `bash scripts/doc-check.sh` → clean (no orphan test refs) - [x] `go test ./module/processor/patterndetectorprocessor/... -count=1` → PASS (incl. the two new tests + all 5 existing IB tests) - [x] `go build ./...` and `go vet ./...` → clean - [x] Pre-commit hooks: golangci-lint 0 issues, go mod verify, attribute-namespace-check 100/100 - [x] Mutation-verified: dropping `hw.network.ib.port.state` from the recipe yaml fails `TestRecipe_IBLinkFlap_StanzaPinsWireContract` with the expected remediation message naming the missing identifier - [ ] CI on the PR (waiting on push) ```release-notes feat(recipe): InfiniBand link-flap OTTL stanza projecting node_exporter's `node_infiniband_port_state_id` onto the tracecore-canonical `hw.network.ib.*` namespace (`hw.network.ib.port.state` int + `hw.network.ib.device` + `hw.network.ib.port.num`). Pattern #2's `IBLinkFlapDetector` now has its metric-side input wired; metrics→logs bridge emitter remains gated on RFC-0014 PR-B (#260). ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
10 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 2, 2026
…ard) (#477) ## Summary Closes the `docs/MILESTONES.md` §M6 carry-forward: *"every fenced block in `docs/getting-started.md` is exercised by `scripts/smoke.sh`"*. The ≤5-count gate shipped with the M6 wave; the binding half was tracked carry-forward because `smoke.sh` ran a parallel hand-written hostmetrics→debug config rather than the doc's actual YAML. ## Root cause Two scripts owned the "first OTLP byte" config — `smoke.sh` rendered one inline, `docs/getting-started.md` carried another. They happened to agree, but nothing forced them to. The carry-forward existed because the binding was *correct by inspection*, not *correct by construction*. The fix is to make the doc the single source: `smoke.sh` extracts the YAML from `docs/getting-started.md`'s `## Walkthrough` heredoc at runtime. If the doc grows a typo, a renamed receiver, or a different scraper, `smoke.sh` exercises the change automatically. If the heredoc disappears, the extractor fails loud with a named error. ## Changes - `scripts/smoke.sh` — extracts the Walkthrough heredoc via a perl one-liner, writes it to a tempfile, then runs `tracecore validate --config=` + `tracecore --config=` against it (Walkthrough steps 3 + 4). Lifecycle-log assertions retained, with `"Shutdown complete"` now load-bearing against the doc's post-walkthrough prose. - `scripts/doc-check.sh` — new gate (right after the existing ≤5-count gate) asserts the smoke↔doc binding with four mutation-verified clauses: Walkthrough scope, `"$BIN" validate --config=` invocation, `"$BIN" --config=` run invocation, `docs/getting-started.md` path reference. - `scripts/smoke_test.sh` — new mutation-verify harness mirroring the gate at runtime, plus an inline mutant-doc test that proves the extractor exits 1 and the wrapper emits the named error when the heredoc is removed. - `Makefile` — `make smoke` now also runs `smoke_test.sh`; wired into `ci-full` alongside the existing `smoke-quickstart` target. - `docs/MILESTONES.md` — §M6 status `⧗ partial` → `☑ delivered`; getting-started rubric `⧗` → `☑`; carry-forward bullet rewritten (remaining work is operator-config branch-protection only). ## Runtime End-to-end `bash scripts/smoke.sh` on darwin/arm64: **~2.2s** (extract + validate + 1.5s run window + lifecycle-log assertions). Well under the 120s ci-fast budget. No hardware required — uses the `hostmetrics` load scraper, portable across linux/darwin/windows. ## Test plan ```release-notes ci(smoke): scripts/smoke.sh now extracts its YAML config from docs/getting-started.md '## Walkthrough' instead of carrying a parallel hand-written config; doc-check.sh gates the doc↔smoke binding with four mutation-verified clauses. Closes the M6 carry-forward. ``` - [x] `bash scripts/smoke.sh` exits 0 on clean main (verified locally, ~2.2s). - [x] `bash scripts/smoke_test.sh` all assertions pass. - [x] `bash scripts/doc-check.sh` reports `scripts/smoke.sh binds to docs/getting-started.md (M6: every block exercised by smoke.sh)`. - [x] Mutation test #1: `sed -i 's/"$BIN" validate --config=/"$BIN" XXX --config=/' scripts/smoke.sh` → doc-check exits 1 naming "validate --config= invocation (Walkthrough step 3)". - [x] Mutation test #2: `sed -i 's/"$BIN" --config=/"$BIN" XXX=/' scripts/smoke.sh` → doc-check exits 1 naming "run invocation (Walkthrough step 4)". - [x] Mutation test #3: `sed -i 's/Walkthrough/Section/' scripts/smoke.sh` → doc-check exits 1 naming "extraction scope lost". - [x] Mutation test #4: `sed -i 's/docs/getting-started.md/docs/SOMEWHERE-ELSE.md/' scripts/smoke.sh` → doc-check exits 1 naming "binding source missing". - [x] Mutation test #5: getting-started.md with no `## Walkthrough` heredoc → smoke.sh exits 1 with named error message (covered by `smoke_test.sh`). - [x] `make lint` 0 issues; `make vet` clean; `make doc-check` clean (all 18 gates pass). - [x] `make smoke` end-to-end including `smoke_test.sh` passes. ## Related - Refs `docs/MILESTONES.md` §M6 (Documentation scaffold). - Sibling #460 (`fix(doc-check): drop unconditional exit 0`) made this carry-forward visible — before #460, the new gate would have been silently skipped by the line-99 short-circuit. Signed-off-by: Tri Lam <tree@lumalabs.ai>
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Four small follow-ups on top of #4 (Wave-1 governance bootstrap). Each is independently small; they're bundled here because they share a theme (tightening the wave we just landed).
dco-check_test.shis now signing-safe. The test creates a temp git repo and commits to it. Without overridingcommit.gpgsignlocally, a contributor who's runscripts/setup-signing.sh(and thus hascommit.gpgsign=trueglobally) would see the test commits try to sign with a key not present in the sandbox repo — a false negative in our own test suite. Fix is twogit configlines.scripts/apply-branch-protection.shapplies the rules in.github/branch-protection.ymldirectly viagh api. Idempotent; safe to re-run. Replaces the manual-UI-clicks-then-pray pattern. After this PR merges, applying branch protection is one command instead of a series of GitHub UI navigations — and drift between the intent doc and live settings becomes a re-run away from being fixed.AGENTS.mddocuments the kernel-styleAssisted-by:trailer format (<agent>:<model> [<tool>]) as preferred, with the simple form (<tool name>) still accepted. Aligns the doc with the format every commit on this branch and Wave 1: governance bootstrap (CODEOWNERS, DCO, signing) #4 already uses, and matches the Linux kernel'scoding-assistants.rstconvention.The "Aim for under 500 lines" guidance is removed from
CONTRIBUTING.mdand the matching bullet inAGENTS.md. Following a recent design discussion, we explicitly chose not to enforce a PR size cap (the friction it adds outweighs the discipline it provides for a small-team project). Replaced with: "split when the diff outgrows the concern, not when it crosses an arbitrary line count".Depends on
#4. This branch is
wave1-followupcut fromwave1-governance; once #4 merges, this PR's diff is just the four follow-up changes.Release notes
Checklist
make ci, all*_test.shsuites)Assisted-by:trailer present