Add NORTHSTARS, RFC-0002, and Q1 MILESTONES#3
Merged
Conversation
Closes RFC 0001's open question on whether to import go.opentelemetry.io/collector/component directly or ship a standalone collector binary. Commits to: tracecore is its own Go binary. Receivers live in this repo and are structured so they could be upstreamed to opentelemetry-collector-contrib later, but the primary distribution is our own build. Why own-binary over upstream-extension: - Trust narrative requires we own the build chain end-to-end (PRINCIPLES.md §1). - Release cadence is faster than upstream's; tracecore needs point-release cadence aligned with NCCL/PyTorch upstream changes, not OTel collector cadence. - Convenience-first defaults for training observability can't live in a general-purpose collector. - Vendor SDK isolation policy (STYLE.md §Vendor SDK isolation) needs per-receiver build-tags that upstream's contribution model doesn't accommodate asymmetrically. Reversibility preserved by mirroring upstream's component interface shape in internal/pipeline; migration to upstream's contract is mechanical if we ever reverse. Three open follow-on questions deferred to later RFCs: when individual receivers get upstreamed, our pdata-import exposure, and whether general (non-training) receivers ever belong here. Signed-off-by: Tri Lam <trilamsr@gmail.com>
A goals doc that complements PRINCIPLES.md (why) and STYLE.md (what conventions). Names the northstar, seven strategic objectives with hero KPIs, supporting KPIs, operating rules, and named caveats. The northstar: make tracecore the default OSS collector running on every distributed AI training cluster — the most convenient, lowest-friction way to get cross-vendor training observability. Seven objectives: - O1 Coverage — root-cause-pattern coverage (15 named patterns enumerated in Appendix A) - O2 Convenience & quality — install-to-first-data on a named reference cluster - O3 Trust & supply chain — SLSA Build Level achieved and maintained - O4 Standards — external implementations of gen_ai.training.* OpenTelemetry semantic conventions - O5 Distribution & community — unique organizations in production - O6 Velocity — quarterly ship-commitment hit-rate - O7 Project governance — non-employee maintainers active Each objective has one accountable owner role, hero KPI, supporting KPIs, operating rules, and named caveats. The doc also enumerates inter-objective coupling, what we explicitly do not chase (closed-source synthesis, inference observability, GPU cost dashboards, experiment tracking, auto-phone-home), and the open questions tracked as RFCs. Goals don't grow with the codebase; KPIs do. Signed-off-by: Tri Lam <trilamsr@gmail.com>
The quarterly ship-commitment artifact per NORTHSTARS.md §O6. 23 milestones across foundation, receivers, and outcomes; target hit-rate ≥80%; in-repo scope only. Foundation (Weeks 1-4): pipeline runtime, self-telemetry, reproducible-build CI, lint/test harness, failure-injection harness, benchmark harness, Helm chart skeleton, docs scaffold. Receivers (Weeks 3-10): DCGM, dmesg/journald, k8s events, NCCL FlightRecorder (with a safe pickle parser that refuses code execution), NCCL Inspector, py-spy continuous, PyTorch Kineto, container stdout, Kueue scheduler. Outcomes (Weeks 5-12): three root-cause patterns covered end-to-end (pod-evicted → stragglers → NVLink degradation; staggered by data-injection difficulty), staged reference- cluster install benchmark (Helm install → first DCGM datapoint → ≤5 min median), first v0.1.0 release with a public failure-trace issue template so operators can contribute regression-test fixtures. Scope is in-repo work only. Standards-body engagement (O4) and external community outreach are recurring cadences in NORTHSTARS, not Q1 milestones — they don't produce commits to this repo. Misses get documented with reason and re-planned in the end-of-quarter retro, not silently dropped. Hits and misses both remain in MILESTONES.md as the public record. Signed-off-by: Tri Lam <trilamsr@gmail.com>
|
You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool. What Enabling Code Scanning Means:
For more information about GitHub Code Scanning, check out the documentation. |
5 tasks
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Pushes toward the prompt's self-eval gate by closing every gap that does NOT genuinely require Linux + libdcgm at build time. - metrics.go::fieldEmitters grows from 6 to all 13 metric families: + hw.gpu.io (PCIe Tx/Rx), hw.energy, hw.gpu.nvlink.io (per-link Tx/Rx), hw.gpu.clock.frequency (sm/memory/video domains), hw.gpu.xid.errors. ECC aggregate counters keep their dedicated drop tier. The receiver-side pipeline is now complete for every metric in the README's design table; only the SOURCE of samples remains gated on the cgo client. - pkg/dcgm/types.go: new well-known FieldID constants for SM / memory / video clock (100/101/102), NVLink L0 Tx/Rx (1040/1041), throttle reasons bitmask (112). RHS will switch to go-dcgm constants when client_cgo.go lands. - components/receivers/dcgm/integration_hardware_test.go: //go:build dcgm,hardware skeleton. Skips with a clear reason when DCGM is unreachable; runs end-to-end against a real GPU on a Linux host where both build tags are active. Hardware reviewers have the test to fill in; macOS CI doesn't run it. - emit_bench_test.go: BenchmarkEmit_TypicalScrape pins the per-scrape cost at 37 microseconds for 8 GPUs x 12 fields. At 15s collection_interval that's 0.00025 percent CPU -- three orders of magnitude under the 0.05% O2 budget. - resetSession() helper extracted from ensureConnected + scrape so the connection-loss state-reset doesn't drift between two call sites. Closes Loop-4 P3 nit on duplicated reset logic. - docs/agents/RECEIVER-PATTERNS.md: new "Pattern selection" table -- five source-type rows with constructor / lifecycle / pattern reference per row -- so M9 (streaming/subprocess), M10 (failure- triggered), M11 (vendor-SDK like dcgm) authors know which shape fits their work. Closes Loop-4 P3 question on the doc gap. - FOLLOWUPS.md created at repo root (was referenced repeatedly, never written): 4 opportunistic items, 4 "considered and explicitly skipped" items with Revisit-if predicates. - README.md: metric table no longer split into "emitted vs deferred" -- the table is the truth, and a single paragraph notes that the data SOURCE waits on the cgo client. - Smoke-tested the binary: `tracecore collect --config= example_config.yaml` boots, logs "dcgm receiver started", attempts Connect, fails with "dcgm: SDK unavailable", enters degraded mode (reason=init), shuts down within the 1s budget. End-to-end happy path verified on this host. Self-eval criterion #3 (metric set) lifts from 3 to 4. Criterion #2 (cgo wrapper) stays at 3 because client_cgo.go itself is the only remaining gate -- a Linux GPU host is required to compile the cgo bindings. The MILESTONES Carry-forward bullet commits to that work. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Address item #3: `syncBuffer` was duplicated in cmd/tracecore + internal/telemetry, and `failingSink` lived only in clockreceiver even though M8/M9 will need the same downstream-error stub. Hoisted into `internal/pipeline/pipelinetest`: pipelinetest.SyncBuffer (was syncBuffer ×2) pipelinetest.RecordingMetricsSink (new — for capturing pushes) pipelinetest.FailingMetricsSink (was failingSink ×1) Migrations: - cmd/tracecore/integration_test.go: drops 19-line syncBuffer def, type alias to pipelinetest.SyncBuffer (call sites unchanged). - internal/telemetry/server_test.go: same alias pattern. - components/receivers/clockreceiver/errors_integration_test.go: drops failingSink, uses pipelinetest.FailingMetricsSink{Err: ...}. clockreceiver_test.go's metricsSink intentionally not migrated — it's package-internal, tightly coupled to a few test assertions, and migration cost exceeds value for this PR. M8/M9 + future receivers reach for pipelinetest.RecordingMetricsSink instead. Coverage: internal/pipeline/pipelinetest 77.3% → 86.8% via new helper tests. Other packages unchanged. make ci clean. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Closes 4 of 7 new A+ criteria from the recursive self-review: #1 — e2e-otelcontrib now verifies the collector PARSED the record, not just that it accepted bytes. Workflow rewritten to docker-run otelcol-contrib with a custom config (file + debug exporters, detailed verbosity). After the e2e POST, the bash step greps /tmp/otelout/logs.json for the canonical body, the kernelevents.xid attribute, and the gpu.id attribute. Empty file or missing attributes → workflow fails. #2 — TestIntegration_KmsgWriteReadBehavioral (//go:build linux) writes a synthetic <6>NVRM Xid 79 line to /dev/kmsg, uses a marker string in a regex_filter to isolate from ring-buffer noise, then asserts the receiver emits a plog.LogRecord with kernelevents.xid=79 + gpu.id=0000:65:00.0 within 3s. A regression in parse/build/emit fails this on Linux CI. #3 — prometheus_alerts_test.go validates the alert YAML structure (every group has interval, every rule has expr/severity/summary/ description) AND cross-references the metric + label-filter names against the receiver's actual SelfTelemetry surface. A typo in the alert would silently never fire; this catches it before merge. #5 — runbook_test.go executes the RUNBOOK's "First 15 minutes" step 1 (`tracecore validate --config=...`) and step 2 (`tracecore debug dump`) as real commands. Documentation rot becomes a test failure, not a silent SRE-time discovery. #4 — sustained_test.go (`//go:build sustained`) feeds 1000 events/sec for 5 minutes (300k records), samples heap every 30s, asserts ≤10 MiB growth and p99 emit latency tail bounded. New `sustained-load` workflow job runs it on push-to-main + schedule (not PR — 5 minutes is too slow for the inner loop). The seventh criterion (two-week soak + external operator) requires elapsed time + a human; nothing in-session can close it. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Two independent reviews of PR #18 surfaced a stack of blockers, strong findings, and quality lifts. This commit lands the tractable items (defers documented in docs/FOLLOWUPS.md M8 section). Operator-visible drift (closes Reviewer 2 #13-#17): - RUNBOOK kind-triage row `consume` → `downstream` (last commit renamed the kind but missed the doc table). - README config table: `initial_delay` range updated from "0.. collection_interval" to "≥ 0" (code already relaxed for DGX cold-starts; doc was stale). - prometheus-alerts: DCGMReceiverHighErrorRate threshold changed from `rate > 0.1/sec` (unreachable: 15s default tick caps at 0.067/sec) to `increase > 5 in 5m`. Stale "M2 has not landed" caveat removed. - example_config: single `mode:` line with a clarifying comment (was showing both `mode: standalone` and `# mode: embedded`, inviting operators to uncomment both → YAML duplicate-key). - cmd/tracecore receivers list now prints `dcgm [stub]` / `dcgm [cgo]` so operators can verify deploy shape without reading go.mod. Build-tag-conditional via three small files in cmd/tracecore (receiver_variants*.go) — pattern extends to M11 NVML. Correctness bugs (closes Reviewer 2 #2, #3, #4, #6, #7): - receiver.Shutdown: `r.running.CompareAndSwap(true, false)` gates teardown so a second Shutdown is a no-op (cgo libdcgm `dcgmShutdown` is not documented idempotent). Same CAS provides the happens-before for `r.cancel` publish that Pass-1 flagged. - receiver.ensureWatched: zero-entities path now emits IncError(KindEnumerate) + degraded rather than returning true. Without this, a misconfigured host (no GPUs visible, ACL blocks /dev/nvidia*) had the receiver looking healthy while emitting nothing. - receiver: new `warnOnce` helper gates the 7 per-tick failure- path Warn logs to fire only on the first failure after recovery. Closes the log-storm bug (4 errors/min × 60 min = 240 lines). Counter (`receiver_errors_total`) still ticks every failure. - metrics.applyCardinalityCap: parameter `cap` → `maxSeries` (cap shadowed the Go builtin). Quality / contract lifts: - metrics.emit now returns a `stale` count for StatusStale and StatusError samples. pushSamples calls IncError(KindRead) once per tick when stale > 0 — surfaces DCGM serving slow/faulty data, which is precisely what StatusStale exists for. Per-tick not per-sample so GPU count doesn't inflate the rate. (StatusNoData and StatusFieldNotSupported still silent.) - docs_parity_test.go: new TestRUNBOOK_KindsMatchEmitted walks every emitted IncError/failedTick kind against the RUNBOOK per-kind triage table in both directions. This is the structural fix for the bug class — RUNBOOK can never again drift from emitted kinds without CI failure. - receiver.go: promoted `watchUpdateDivisor` / `watchKeepForMultiplier` / `watchUpdateEveryMinimum` constants for the previously-magic DCGM watch-cadence ratios. Documentation + dedup: - dcgm README: new "Privacy + data residency considerations" subsection (compliance-auditor ask). Flags hw.id / pci.bdf / NVLink peer IDs as quasi-identifying; provides two mitigation patterns (attr-drop processor, salt-hash pseudonymization). - docs/agents/examples/constructor_options.go: `WithTelemetry` renamed to `WithSelfTelemetry` to match the real in-tree receiver API. M9+ authors copying the example no longer drift. - RUNBOOK kind enumeration line restored with both watch and mig (dcgm-local kinds) per the last commit's promotion. - Repo-root `FOLLOWUPS.md` consolidated into `docs/FOLLOWUPS.md` (M8-opportunistic + M8-skipped sections). Single source of truth; 17 deferred items pulled forward with falsifiable triggers (cgo client landing, M11 sibling-receiver shape, operator-report thresholds, file-size triggers). - All bare `FOLLOWUPS.md` references updated to `docs/FOLLOWUPS.md`. Honest pushback documented: - I disagree with the M8-AGRADE-GAP claim of Operator UX 3.7→4.0. The drift findings above are exactly the class of bugs that rubric criterion was supposed to prevent — the alerts-vs-RUNBOOK parity test existed but didn't check kind values against emitted call sites. The new TestRUNBOOK_KindsMatchEmitted closes that gap; future operator-UX claims should pin to a test like this. - Deferred: split receiver.go (475 LOC) into 3 files, hoist dcgmtest.BaseClient, SECURITY.md for receiver, dcgm_info join-target, libdcgm setup in CONTRIBUTING — all logged with triggers. Reviewer 1's "construct receiver via M9-style primary-Option" inconsistency goes in the queue for M9-close review. `make ci` passes; dcgm coverage 86.0% (essentially flat — new tests offset by widened emit signature in test paths). Assisted-by: Claude Opus 4.7 Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Closes the R1 #3 / S4 gap: README + RUNBOOK reference deploying the receiver as a DaemonSet multiple times, but no manifest shipped. Operators landing on the docs had to derive the hostPath + CAP_SYSLOG + securityContext shape from prose. components/receivers/kernelevents/example-daemonset.yaml: - ConfigMap with a minimal `receivers: kernelevents` config and `service.pipelines.logs/kernel.receivers: [kernelevents]` (operator-empty `exporters: []` so the file is paste-and-run without forcing an exporter dependency). - DaemonSet with the exact securityContext shape the README's Container realities section prescribes: runAsNonRoot, drop ALL caps, add SYSLOG, readOnlyRootFilesystem, seccomp RuntimeDefault. - hostPath mount of /dev/kmsg with `type: CharDevice` (validates on apply; without the type, a `/dev/kmsg` that's missing on the node would create silently as a regular file). - updateStrategy rollingUpdate maxUnavailable: 1 — one pod at a time keeps kmsg coverage continuous across the fleet. - hostPID: false is required for the kernel namespaced-kmsg feature; we mount /dev/kmsg explicitly so the receiver still sees host-wide events. - tolerations: operator-Exists so the DaemonSet runs everywhere including the control plane; commented out priorityClass and the journald-host-mount block so they're discoverable but not forced. - automountServiceAccountToken: false because the receiver doesn't talk to the k8s API. Validated locally with `kubectl apply --dry-run=client -f` against kubernetes 1.28+. NOT yet validated on a live cluster — operators should also run `kubectl apply --dry-run=server` first to catch policy-engine (Kyverno, Gatekeeper, PSA) objections their fleet may have. Cross-references: - README "Container realities" section points readers here. - CHANGELOG row remains accurate (no change needed). Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <trilamsr@gmail.com>
7 tasks
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Test tag #3 confirmed the reproducibility fix lands: both build digests matched (f744dd584b5e…). The step still tripped because diffoscope-minimal in ubuntu-latest's apt repo doesn't recognise --exit-code; the assertion was returning 2 (argparse error) instead of 0. The default exit status (0 = no diff, 1 = diff) carries the same signal, so just drop the flag. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 15, 2026
Applying phase-4 reviewer findings on the skill at .claude/skills/pr-review-loop/SKILL.md. This commit is a test invocation of the skill's own protocol on the PR shipping the skill. Findings addressed: - BLOCKER P1.1: branch precondition `feat/m<X>-*` rejected this very branch (worktree-docs+loops-pr-review). Skill self-failed on its own PR. Widened to any non-main branch with optional milestone parsing; added `(no-milestone)` fallback when the branch doesn't match the feature-branch convention. SKILL.md lines 26-32, 56-63. - BLOCKER P1.2: PR #33 body described the deleted doc artifact, not the shipped skill. Per feedback_update_pr_body_when_premise_breaks. Updated via gh pr edit (separate from this commit). - CONCERN P1.3: skill cited `MILESTONES.md §M<X>` but the actual heading form is `### M<X>.` (zero §M references in MILESTONES.md body). Reviewer subagents told to read §M8 would fail to anchor. 6 occurrences updated to `MILESTONES.md "### M<X>."`. - CONCERN P1.4: single-quoted heredoc makes placeholder resolution the agent's responsibility, but skill didn't say so explicitly. Added clarifier in auto-detect block (SKILL.md:55-63). - REJECTED: reviewer-1 claim that `/docs/loops/` is still in .gitignore. False positive — verified absent via `grep 'docs/loops' .gitignore` returning no match. Was removed in commit a7fcc82. Per feedback_audit_finding_triage. - DEFERRED to FOLLOWUPS: - AI-vocab grep gate exclusion for .claude/skills/ vs actual CI (no enforced CI gate found; convention-only) - 5-phase rigidity vs feedback_narrow_pr_scope (2-cycle cap) and feedback_anti_bureaucracy — needs adaptive-phase design - TDD escape-hatch for doc/comment-only changes — design tradeoff - .claude/ralph-loop.local.md scaffolding / gitignore policy - --max-iterations 10 citation against ralph-loop plugin docs - Skill description vocab leak into Claude Code skill picker UI Edge cases enumerated (phase-1 protocol requires >=5): 1. Branch with no open PR yet — skill's precondition #3 catches this; aborts with "no open PR." Acceptable behavior. 2. Branch with multiple PRs — gh pr view defaults to first. Documented behavior. No action. 3. Detached HEAD — `git branch --show-current` returns empty; precondition catches via empty-branch check. 4. First-time invocation, no .claude/ralph-loop.local.md — agent creates on first append. Acceptable. 5. Re-invocation after partial completion (some [review-pass-N-*] commits exist) — agent counts existing commits and picks up at next phase. Acceptable; relies on agent counting. Hard proof for each applied change: git diff shows the line-level changes; before/after grep verifies §M, branch-regex, placeholder clarifier. This is phase 1 of 5. Subsequent phases (stakeholder lenses, adversarial, A+ aspiration, simplification) operate on the now- updated skill. Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…+ threat-root trace on go-mod-verify Phase-4 A+ aspiration review (2 fresh subagents; both graded A, diverged on which gates to apply). Validation cycle: Findings table: | ID | Lens | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+ | Action | |----|------|-------------|----------|---------|-------|------------|------------|---------|--------| | P4.1 (aplus-1 #2, also P2.6) | aplus | operator | CONCERN | A workflow_dispatch run with `inputs.tag` set but `github.ref` ≠ refs/tags/$INPUT_TAG passes Build and fails the OIDC smoke check 15-30 minutes later. Operator wastes runner time and sees the misuse late. | New "Verify dispatch ref matches tag (pre-flight)" step exit-1s within seconds with the documented workaround. | Reviewer noted the smoke check already enforces this — but at job-end, not at job-start. Fail-fast IS the load-bearing property. | n/a — workflow YAML, actionlint clean | yes — P4-aplus-1 | applied this commit; closes P2.6 deferral. | | P4.4 (aplus-2 #2) | aplus | repo-long-term | NIT | go-mod-verify comment says "defense in depth against a compromised GOPROXY mirror" but doesn't name the trust root or the orthogonal threat (a poisoned go.sum itself). | Comment now states "Trust root: the go.sum at this tag commit" and cross-references the tag-protection FOLLOWUPS entry. | A future maintainer might over-attribute the protection. | n/a | no | applied this commit | Rejected/deferred: - P4.2 (aplus-1 #4) — Structured diff lint for release.yml ↔ docs/reproducibility.md. DEFER to FOLLOWUPS.md (real value, but manual review caught both drift directions in Phases 2 + 3; automate when next edit happens). - P4.3 (aplus-1 #6) — Release artifact manifest validation before upload. REJECT. Per anti-bureaucracy: reviewer concedes `needs:` dependency already gates malformed artifacts from reaching the release job. Adding defensive validation against a CI-bug scenario is bloat. - P4.5 (aplus-1 #3) — docs/SUPPLY-CHAIN-IDENTITY.md consolidated reference. DEFER to FOLLOWUPS.md; ~30-min write, scope creep beyond release.yml. M21 release-checklist is the natural trigger. - P4.6 (aplus-1 #5, aplus-2 #3) — Formal threat-model document + M21 alignment narrative. DEFER to M21. - P4.7 (aplus-2 #5) — Cross-link health lint. Duplicate of P4.2; same deferral. Reproducibility: $ make actionlint zizmor # exit 0 $ grep -A1 "workflow_dispatch with inputs.tag" .github/workflows/release.yml # pre-flight gate present Letter-grade outcome: Reviewer #1 starting: A → A+ via criteria 2, 4, 6 (we applied 2 + threat-model comment) Reviewer #2 starting: A → APPROVED-AS-IS (already strong) After this commit: A on the falsifiable axis (one operator-UX gate + one comment clarification), with the broader doc/lint work scoped to follow-ups. Beneficiary: operator. The pre-flight gate cites a specific operator-facing surface (15-30 minute waste on workflow_dispatch misuse) and turns it into a seconds-fast named error. Signed-off-by: Tri Lam <tree@lumalabs.ai> Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…IT applied; 13 deferred 8 parallel stakeholder-lens reviews against the design-locked draft. 6/8 lenses returned CONCERNS-REQUIRE-FIX; 1 returned BLOCKER-REQUIRES-RESOLUTION (researcher). Pushback (severity-prefixed, applied inline unless noted): BLOCKER: - P2.R-1 (researcher) — §Attribute namespace's blanket "gen_ai-only, no dual-emit" contradicts MILESTONES.md M15 rubric L360 + M18 rubric L481, both of which compile-time-pin tracecore.training.data_time_s / iter_time_s (M18 via ErrStragglerInputMissing in internal/synthesis/patterns/straggler.go). APPLIED: wire-attribute table carves the two timing attributes as a documented exception until upstream proposal lands, then sibling cross-receiver PR. CONCERN (multi-lens — escalated): - P2.multi-2 (sre,adopter,security) — /var/log/containers mount + fsGroup + runAsNonRoot missing per M15 rubric L373. DEFERRED to FOLLOWUPS (§Pod manifest subsection in Phase-2 PR). - P2.multi-3 (operator,sre,adopter) — KindCardinality covers both fingerprint and attribution LRUs; on-call can't distinguish. APPLIED: split into KindFingerprintCardinality + KindAttributionCardinality; renamed config to `attribution.lru_cap` vs `egress_rate_limit.lru_cap`. - P2.multi-4 (contributor,maintainer,adopter,researcher) — "makes funded" ungrammatical. APPLIED: replaced with "the drafted upstream proposal already absorbs". - P2.multi-6 (contributor,adopter) — dataloader_regex shipped as default without marking placeholder per rubric L360 "operators MUST override". APPLIED: inline comment block. - P2.multi-7 (operator,sre) — RUNBOOK.md + prometheus-alerts.example.yaml + FAILURE-MODES.md rows not named as Phase-4 deliverables. DEFERRED. CONCERN (single-lens, applied): - P2.R-2 (researcher) — HintEvicted does not compile; actual constant is k8sevents.HintPodEvicted at hint.go:24. APPLIED: typo fixed + evictionMatchWindow defined as eviction_match_window default 5s in Configuration surface. - P2.R-4 (researcher) — JobID source-priority chain missing. APPLIED: JOB_ID → TORCHELASTIC_RUN_ID → SLURM_JOB_ID → k8s label fallback (per rubric L358 + research §13.13). - P2.sre-1 — ClusterRole named tracecore-containerstdout-clusterrole breaks k8sevents convention. APPLIED: renamed to tracecore-containerstdout (no -clusterrole suffix). - P2.contrib-1 — SchemaURLv0/SchemaURL collapse inverts k8sevents anti-pattern warning. APPLIED: rewrote snippet to match k8sevents/record.go:90-107 line-for-line with the warning comment preserved. - P2.maint-3 — "§Functional rubric #3" cross-doc reference points outside the RFC. APPLIED: fully-qualified to "MILESTONES.md M15 functional rubric bullet 3". - P2.sec-4 — cursor file mode unspecified. APPLIED: cursor.file_mode: 0600 + dir_mode: 0700 in Configuration. - P2.sec-5 — FieldSelector defense-in-depth missing. APPLIED: Validate rejects empty FieldSelector; runtime drops mismatched-nodeName events with KindWatch. - P2.R-7 — named-capture (?P<name>…) allowance not stated in validate-at-load list. APPLIED. CONCERN (single-lens, deferred to FOLLOWUPS): - P2.perf-1/2/3/4 — §Performance subsection (alloc budget, channel sizing rationale, regex literal-prefilter, attribution-lookup concurrency). DEFERRED as consolidated FOLLOWUPS entry. - P2.sec-2 — log-body secret redaction posture. DEFERRED (alpha-stage posture acceptable per sibling-receiver pattern). - P2.sec-3 — vendor-tree CVE tracking. DEFERRED. - P2.sre-2 — RBAC opt-out path when rank_source: downward_api. DEFERRED. - P2.sre-5 — Rollback subsection in §Migration. DEFERRED. - P2.maint-1/2 overlap with P2.R-2/R-4 (already applied). NIT (deferred or skipped): - P2.perf-5 — cursor flush cadence. DEFERRED. - P2.op-3 — covered by P2.multi-3 (applied). - P2.op-4 — chart smoke test asserts NODE_NAME. DEFERRED. - P2.op-5 — M16 reversal recheck. DEFERRED. - P2.contrib-3 — FOLLOWUPS governance-row falsifying-check. DEFERRED. - P2.contrib-4 — covered by P2.multi-6. - P2.maint-5 — forward-pointer to FOLLOWUPS. DEFERRED. - P2.maint-6 — vendor/ path collision with go mod vendor. DEFERRED. - P2.R-8 — SHA-pin manifest URL. DEFERRED. EXPLICITLY-SKIPPED: - P2.multi-5 / P2.adopt-3 / P2.maint-5 / P2.R-5 (cross-RFC label tracecore.io/rank vs gen_ai.training.io/rank). Phase 1 deferred to FOLLOWUPS; Phase 2 escalation contradicted by: design-locked DRAFT is the appropriate place to defer cross-receiver reconciliation pre-v1.0 (PRINCIPLES §11 backward-compat opt-in pre-v1.0); label fallback only fires when env discovery already failed, in which case rank is "unknown" regardless of label name. Phase 1 defer stands. Validation-cycle stats: - Findings rejected during contradict: 1 (P2.multi-5 namespace-rename; defended by pre-v1.0 + fallback-path-only argument) - Findings whose hard-proof did not reproduce: 0 TDD discipline: N/A — RFC is prose. Reproducibility: make doc-check clean. Rubric additions accepted (binding): - Alpha-receiver RFCs MUST contain §Operator-surfaces section (RUNBOOK + alerts + FAILURE-MODES rows). - RFCs with typed-record schemas MUST include wire-attribute name table cross-checked against MILESTONES. - RFCs naming sibling-package symbols MUST verify compile-resolve at HEAD. Discovered constraints: - Pattern: alpha receivers consistently drop one of {RUNBOOK row, alerts row, FAILURE-MODES row}. Promote rubric. - Pattern: cross-RFC join-key inconsistencies tend to defer to "whichever PR lands first"; design-locked status should exclude this class. Per-lens final verdicts: - Performance: CONCERNS-REQUIRE-FIX (deferred) - SRE/Infra: CONCERNS-REQUIRE-FIX (deferred/applied mix) - Maintainer: CONCERNS-REQUIRE-FIX (mostly applied) - Contributor: CONCERNS-REQUIRE-FIX (applied) - Operator/User: CONCERNS-REQUIRE-FIX (deferred) - Adopter: CONCERNS-REQUIRE-FIX (overlapped with SRE/Security) - Security: CONCERNS-REQUIRE-FIX (mostly deferred) - Researcher: BLOCKER-REQUIRES-RESOLUTION (resolved inline) Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 31, 2026
) ## Summary Four amendments to `docs/rfcs/0013-distro-first-pivot.md` (plus a one-line sweep across 6 companion docs) per the scope-review findings staged before PR-I.1 / PR-K / PR-M code work begins. Pre-stages each decision in the RFC so the autonomous code PRs don't escalate mid-flight. ## Root cause #181 (RFC-0013 PR-I in-repo submodule rescope) was incomplete: 1. Sweep missed 6 companion docs still pointing at the original-design external `tracecoreai/tracecore-components` repo. 2. §7 listed 3 GitHub workflows for deletion that were already removed pre-RFC and 1 issue template (`component-bug-dcgm.yml`) that was already removed pre-RFC. 3. PR-K was a single 4-receiver-delete-plus-chart-migration mega-PR with no decoupling of the `internal/synthesis/patterns/` k8sevents dep break, which is on PR-I.2's critical path. 4. PR-I.1 conflated the `module/go.mod` scaffolding with the `git mv` + package rename, blocking PR-I.1a from landing without PR-B2 even though the scaffolding step has no nccl_fr dep. Mid-flight discovery during merge cycle: merged commit #188 (`feat(pivot): PR-B2 — port dcgm off internal selftel + lifecycle`) reused the `PR-B2` slug for a PR-B1-shape dcgm port (which is moot since dcgm is deleted entirely in PR-F), creating a naming collision against the canonical PR-B2 defined in the RFC — the nccl_fr `internal/{pipeline,consumer,runtime/lifecycle}` → upstream port that hard-gates the PR-I.1b `git mv`. ## Amendments 1. **§6/§7 sweep miss (Amendment 1)**: Remove surviving `tracecoreai/tracecore-components` external-repo references across `docs/getting-started.md`, `docs/followups/M11.md`, `docs/followups/M19.md`, `docs/FOLLOWUPS.md`, `docs/rfcs/0003-pipeline-runtime-and-component-contract.md`, `AGENTS.md`. All re-pointed at `github.com/tracecoreai/tracecore/module` per RFC-0013 §6. Verified zero surviving stale refs. 2. **§7 nonexistent workflow entries (Amendment 2)**: Collapse `pyspy-integration.yml`, `python-publish.yml`, `kernelevents-integration.yml` deletion rows into one row marked "already removed pre-RFC". `component-bug-dcgm.yml` also already removed. Only `component-bug-kernelevents.yml` survives for PR-K. §4 v0.3.0 row + PR-M slug cleaned for consistency. 3. **§migration PR-K sub-slice (Amendment 3)**: - **PR-K.1** — sever `internal/synthesis/patterns/` from `components/receivers/k8sevents` via local model types in `internal/synthesis/patterns/model.go`. No deletions. **Unblocks PR-I.2.** - **PR-K.2** — delete `components/receivers/{clockreceiver,kernelevents,k8sevents,containerstdout}` + migrate ~86 test fixtures + delete `tools/failure-inject/xidgen/` + keep `tools/failure-inject/ncclhang/`. - **PR-K.3** — chart cleanup: flip `containerstdout-on-values.yaml` to filelog+container-stanza, delete `containerstdout-rbac.yaml`, delete `.github/ISSUE_TEMPLATE/component-bug-kernelevents.yml`, ship `NOTES.txt` deprecation + values-key removal. 4. **§migration PR-I sub-slice + PR-B2 promotion (Amendment 4)**: - **PR-B2** reframed as hard gate for PR-I.1b: port `components/receivers/nccl_fr` off `internal/{pipeline,consumer,runtime/lifecycle}` to upstream `go.opentelemetry.io/collector/{component,receiver,consumer,pipeline}`. Slug-collision note added re: merged #188. - **PR-I.1a** — `module/go.mod` + root `go.work` + `builder-config.yaml` `replaces:` skeleton. No file movement. Tag `module/v0.0.1` (genesis tag, validates the tagging contract). - **PR-I.1b** — `git mv components/receivers/nccl_fr → module/receiver/ncclfrreceiver` + `git mv pkg/nccl/fr_parser → module/pkg/nccl/fr_parser` + rename Go package `nccl_fr` → `ncclfrreceiver` + update all importers. Hard-gated on PR-B2. No new tag; next bump is `module/v0.1.0` at PR-I.2. - **PR-I.2** — `rankjoinprocessor` + `patterndetectorprocessor` net-new. Hard-gated on PR-K.1. Tag `module/v0.1.0` (first version pinned in `builder-config.yaml` for v0.2.0). Also: PR-J marked `(landed, #195)` with note that recipe docs landed but chart-values compat map follows in PR-K.3. ## Adversarial review (5 lenses, inline) - **(a) PR slug internal consistency**: PR-I.1b ↔ PR-B2 ↔ PR-I.2 ↔ PR-K.1 bidirectional gates all match. PR-J landed marker consistent with #195. §4 v0.2.0 row has pre-existing drift (mentions dcgm+kueue in v0.2.0 when PR-F+#168 already deleted them in v0.1.0) — out of these 4 amendments' scope; flag for follow-up. - **(b) PR-B2 hard-gate naming**: tightened from "Hard gate for PR-I.1" to "Hard gate for PR-I.1b" — accurate because PR-I.1a is scaffolding-only with no file movement. - **(c) Sub-PR numbering collision**: #188 explicitly addressed in slug-collision note. #185/#186/#187/#193/#194/#196 are PR-B1-shape ports for non-nccl receivers (no PR-slug label in their commits), no collision. - **(d) Stale external-repo refs**: `grep -rn "tracecoreai/tracecore-components" docs/ AGENTS.md README.md` returns zero hits post-amendment. - **(e) Cross-reference link integrity**: `docs/migration/v0.1-to-v0.2.md` references `#migration--rollout` and `#3-customer-stable-telemetry-contracts`; both anchors preserved (§ headers unchanged). `make doc-check` confirms 526 markdown links resolve. ## Test plan - [x] `make doc-check` — 526 markdown links resolve, 0 stale refs, banned-phrase lint clean, alert-check + chart-appversion gates green. - [x] Pre-push hooks: golangci-lint clean, go vet clean, go mod verify clean. - [ ] CI doc-check + actionlint + zizmor gates pass on PR. ```release-notes NONE ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
…ORTHSTAR pattern #3) (#255) ## Summary Third real detector in `module/pkg/patterns/` — closes v0.3.0 NORTHSTAR O1 ("3 patterns at v0.1.0") alongside `pod_evicted` (#14) and `nccl_hang` (#15). **Pattern**: a GPU Xid kernel event (carrying the customer-stable `kernelevents.xid` attribute) followed within a configurable window (default 60s) by a pod eviction on the same node → one verdict per (Xid, evicted_pod) tuple. Operator-actionable: "Xid 79 on gpu-node-0001 → training/job-rank-3 evicted 10s later" with remediation pinning the node to drain and the pod to reschedule. **Detector** (`module/pkg/patterns/xid_correlation.go`): - `XidCorrelationDetector` zero-value usable; `CorrelationWindow` overrideable (defaults to `DefaultXidCorrelationWindow = 60s`). - `Evaluate(xids, events)` joins on `(node, time-within-window)` with the most-recent Xid winning the proximate-cause attribution. - Causal-direction guard: eviction BEFORE Xid does not correlate (mirrors pod_evicted's future-transition exclusion). - Deterministic output sorted by `(eviction_time, EventUID)`. **Verdict shape** — single vs multi-layer decision: structurally two evidence surfaces (kernel_event + pod_event in causal order in the evidence trail), but emission rule is "both layers joined or no verdict". Xid-without-eviction is already operator-visible via the raw `kernelevents.xid` telemetry; an Xid-only verdict would duplicate that signal. So `Confidence` / `MissingLayers` are omitted (dead-fields discipline from PR #250 review). Mirrors the `nccl_hang` shape. **Multi-pod decision**: one verdict PER evicted pod (not one verdict listing all). Per-pod is the operator-actionable shape — each verdict's `Remediation` pins a specific pod to drain/recreate so alert routing fans out by pod owner; collapsing would force operators to parse a list and lose per-pod routing. **Schema** (`testdata/xid_correlation_verdict.schema.json`): `additionalProperties:false`, `pattern.id` const="16", `evidence_trail.kind` enum=`["kernel_event","pod_event"]`, `minItems:2` (both layers required). 7-row drift-rejection battery covers re-introduced `confidence`/`missing_layers`, evidence-minimum violation, kind-enum, and `pattern.id` numeric vs string drift. **Wiring** (`patterndetectorprocessor`): - New `Config.XidCorrelationWindow` YAML field (default 60s, floor 1s), validated alongside the existing window fields. - `collectInputs` returns a fourth typed slice (`[]patterns.XidRecord`) built by `projectXidRecord`, which gates on `kernelevents.xid` and reads the host node from the resource attribute `k8s.node.name` (the standard k8sattributes / resourcedetection stamp on a DaemonSet — same pattern `projectNodeCondition` uses). Per-record `k8s.node.name` fallback for non-DaemonSet emitters. - `appendXidCorrelationVerdict` mirrors `appendNCCLHangVerdict`'s wire format — broken-out scalar attrs plus full `pattern.verdict_json`. **Recipe alignment**: `docs/integrations/journald-kernel.md` §"Customer-stable attribute mapping" already documents `kernelevents.xid` (int) as the OTTL-stamped surface. No recipe update needed; the existing pipeline emits exactly what the detector consumes. `gpu.id` (PCI BDF) is also stamped but intentionally not projected — the detector doesn't use it yet, so it stays available on the raw record without entering the pattern library as a dead field. ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/... -race -count=1` — green - [x] `make build` — collector binary compiles - [x] `make check` (golangci-lint + go vet + go mod verify) — 0 issues - [x] TDD red-test-first: both `xid_correlation_test.go` files failed to compile/assert before their impl landed - [x] Detector test cases: positive Xid 79 → eviction, no-eviction negative, cross-node negative, out-of-window edge (61s default), multi-pod per Xid (3 verdicts), window-configurable, pre-Xid eviction excluded, non-evicted-hint ignored, deterministic order, most-recent-Xid-wins - [x] Schema-conformance + 7-row drift-rejection battery - [x] Wiring tests: positive verdict emission, healthy non-emission, window override - [x] `TestConfig_Validate` extended with sub-1s xid_correlation_window floor case - [x] `TestFactory_Surface` pins `DefaultXidCorrelationWindow` (and the previously-unpinned `DefaultNCCLHangThreshold`) on the factory's default config ```release-notes feat(patterns): add xid_correlation detector (v0.3.0 NORTHSTAR pattern #3) — correlates GPU Xid kernel events (`kernelevents.xid` attribute per RFC-0013 §3) with downstream pod evictions on the same node within a configurable window (default 60s). Emits one verdict per evicted pod so alert routing can fan out per pod owner. Wired into patterndetectorprocessor; configurable via `xid_correlation_window` YAML field. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
Closed
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
…ctor wiring (PR-A for #260) (#267) ## Summary PR-A for [issue #260](#260) — the recipe-side half. Two commits: 1. `docs(adr): metrics-to-logs converter for pattern input (Option A)` — records the architectural decision and the upstream-blocker analysis under a new `docs/adrs/` directory. 2. `docs(recipe): extend prometheus-scrape with DCGM to hw.* OTTL transform` — adds `transform/dcgm_to_hw_semconv` (a second OTTL pass after `transform/gpu_vendor`) and documents every per-pattern projection in `docs/integrations/prometheus-scrape.md`. ### Architectural decision (issue #260 §Proposed plan step 2) **Option A** — extend `patterndetectorprocessor` with `processor.WithMetrics` alongside the existing `WithLogs` (deferred to PR-B; this PR ships only the wire-format contract PR-B will consume). Per `[[adopt-over-build]]` Option B (OTTL metrics-to-logs converter) was the preferred starting point. It's blocked upstream: at OTel contrib v0.130 (the release pinned in `builder-config.yaml`) the `transformprocessor` signal contexts are sealed per-signal — `metric_statements` can only reference `resource | scope | metric | datapoint` paths, never `log` ([README @ v0.130.0 § Config](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config)). Surveying every contrib connector at v0.130 also turns up zero metrics→logs primitives (`countconnector`, `signaltometricsconnector`, `spanmetrics`, `exceptions`, `servicegraph` all output metrics; `routing`/`failover`/`roundrobin`/`otlpjson` are signal-preserving; `datadog` and `grafanacloud` are vendor-specific). The full survey + reasoning is in [ADR-0001](docs/adrs/0001-metrics-to-logs-pattern-input.md). Option C (routing connector + sibling) is strictly heavier than A with the same end state — routing preserves signal type, so it cannot bridge metrics→logs either. Trade-off accepted: doubled processor surface inside `patterndetectorprocessor` (one `ConsumeLogs`, one `ConsumeMetrics`). Mitigation: the metrics path is a thin projection on top of the same `module/pkg/patterns/` library; verdict emission stays log-based via a connector. Upstream-contribution slot opened — a `metricthresholdconnector` (inverse of `signaltometricsconnector`) would let us collapse Option A back to Option B without operator-visible churn. Tracked under RFC-0013 §5. ### Per-pattern OTTL projection (this PR) | Pattern | Raw DCGM input | OTel output | Trigger-statement shape (PR-B consumes this) | |---|---|---|---| | #1 NVLink | `DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES` | `hw.gpu.nvlink.io` (Counter, `By`) + `hw.gpu.nvlink.link={N}` + `network.io.direction={transmit\|receive}` | `rate(hw.gpu.nvlink.io[5m]) < 0.5 × per-GPU median over the per-`hw.id` link set, for 10m` | | #3 HBM ECC | `DCGM_FI_DEV_ECC_{SBE,DBE}_{VOL,AGG}_TOTAL` | `hw.errors` (Counter, `{error}`) + `error.{type,subtype,persistence}` | `increase(hw.errors{error.type="uncorrected", error.persistence="volatile"}[5m]) > 0` | | #4 thermal | `DCGM_FI_DEV_{THERMAL,POWER,SYNC_BOOST,BOARD_LIMIT,LOW_UTIL}_VIOLATION` | `hw.gpu.throttle.duration` (Counter, `s`) + `hw.gpu.throttle.reason` | `increase(hw.gpu.throttle.duration{reason="thermal"}[5m]) > 30s, for 10m` | | #5 PCIe | `DCGM_FI_PROF_PCIE_{TX,RX}_BYTES` | `hw.gpu.io` (Counter, `By`) + `network.io.direction` + `hw.gpu.pci.bdf` resource attr | `rate(hw.gpu.io[5m]) < 0.3 × host median, for 15m` | Resource attribution: `UUID` or `gpu_uuid` → `hw.id`; `gpu` or `GPU` → `hw.gpu.index`; `hw.type="gpu"` stamped on every DCGM series. Dual-label fallback handles both legacy and modern dcgm-exporter releases. ### What's NOT in this PR The third commit originally planned by issue #260 — `feat(recipe): metrics→logs OTTL alert emission` — is dropped because OTTL cannot emit log records from metric streams at v0.130 (root cause cited above, full analysis in ADR-0001). The fallback is Option A, which requires modifying `patterndetectorprocessor` source — out of scope for a recipe-only PR. The alert-emission half lands in PR-B alongside the NVLink detector code. ## Test plan - [x] `./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml` exits 0. - [x] `bash scripts/doc-check.sh` clean (test-name parity, markdown link integrity, banned-phrase lint, integration-recipe rubric all pass). - [x] `bash scripts/validator-recipe.sh` validates 6 recipes (skips 2 requires-linux / requires-k8s-cluster — CI ubuntu runner exercises those paths). - [x] `make check` (golangci-lint, go vet, mod-verify) clean. - [ ] Adversarial reviewer check that the OTTL `set(metric.name, ...)` identity-conflict caveat (recipe markdown § "Identity-conflict caveat") matches the operator's downstream backend's deduplication semantics for the four projected metric families. - [ ] CI runner exercises the requires-linux / requires-k8s recipes that this PR's host skipped. ## Followup - **PR-B (issue #260, blocked by this PR's merge)** — extend `patterndetectorprocessor` with `processor.WithMetrics`; land `module/pkg/patterns/nvlink_degradation.go`; wire verdict emission through a sibling logs pipeline via a connector. Decision rationale baked into ADR-0001 so PR-B doesn't relitigate. - **Upstream contribution (v0.3 cycle, RFC-0013 §5)** — propose a `metricthresholdconnector` (or `signaltologsconnector`) to OTel-contrib. If/when it lands, the recipe can collapse back to Option B without operator-visible churn — the `hw.gpu.*` wire-format contract this PR ships is the load-bearing customer surface, not the upstream component selection. - **dcgm-exporter throttle-reason mapping verification** — the pattern #4 mapping table assumes modern dcgm-exporter `DCGM_FI_DEV_*_VIOLATION` series names. If your dcgm-exporter release exposes the throttle bitmask as `DCGM_FI_DEV_CLOCKS_EVENT_REASONS` instead (per the semconv proposal Open Question #1), extend the OTTL block to expand the bitmask into discrete `hw.gpu.throttle.duration` datapoints — out of this recipe's scope; tracked as a follow-up under the same milestone. ```release-notes docs(recipe): The `prometheus-scrape` recipe gains a second OTTL transform pass that projects raw `dcgm-exporter` `DCGM_FI_*` series into the customer-stable `hw.gpu.*` / `hw.errors` namespace (semconv proposal). Four pattern families covered — NVLink, HBM ECC, thermal throttle, PCIe — using upstream OTTL functions only (`set`, `IsMatch`, `ExtractPatterns`, `Int`). The wire-format contract this transform produces is the input PR-B's NVLink detector will consume; the in-tree component change for the detector itself lands in a follow-up PR per ADR-0001. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
…#3) (#274) ## Summary Ships pattern #3 (uncorrectable HBM ECC) per `docs/patterns/pattern-3-hbm-ecc.md`. Two layers required for a verdict — a single layer alone emits nothing. - **Layer 1 (DCGM):** uncorrected double-bit volatile ECC counter delta on a GPU. Discriminator: `error.type=uncorrected` AND `error.subtype=double_bit` AND `error.persistence=volatile` AND `Delta >= threshold`. - **Layer 2 (kernel):** Xid 48 / 63 / 64 on the SAME `gpu.id` (PCI BDF) within 5min default window. Causality flows forward — Xid must be at or after the ECC delta. - Join key: `gpu.id`. Same-node is implied but not required for the join (the GPU identifier is the proximate hardware-fault key). ### Why two layers required - ECC delta alone: operators already see the rising counter on dashboards; pattern's value is the kernel-level confirmation that the GPU is failing. - Xid alone: may fire `xid_correlation` if a pod evicts on the same node — that's a node-scoped pattern, distinct from this GPU-scoped one. Operators get raw `kernelevents.xid` telemetry already. ### What's new - `module/pkg/patterns/hbm_ecc.go` — `HBMECCDetector` library type, `HBMECCRecord` projection, `HBMECCVerdict` output. `PatternIDHBMECC = "3"`, `EvidenceKindHBM = "hw_error"`. - `module/pkg/patterns/testdata/hbm_ecc_verdict.schema.json` — JSON schema with 10 drift falsifiers. - `XidRecord` extended with optional `GPUID` (PCI BDF from the journald-kernel OTTL recipe per RFC-0013 §3). `xid_correlation` ignores it; `hbm_ecc` requires it. - Processor wiring: `projectHBMECCRecord` + `appendHBMECCVerdict` mirror the xid_correlation shape. Config grows `hbm_ecc_window` (default 5min) + `hbm_ecc_delta_threshold` (default 1). ### Follow-up gap (not a workaround) The pattern doc names `hw.errors` as a Counter on the metrics pipeline; the patterndetectorprocessor consumes log records. The detector wires up against `hw.errors.delta`-shaped log records but no OTTL recipe today derives that log shape from the upstream metric. Tracked in #273 — needed for end-to-end firing in a real deployment; the detector library is the v0.3 moat and ships independently per RFC-0013. ## Test plan - [x] `cd module && go test ./... -race -count=1` — all green - [x] `make build` — exit 0 - [x] `make check` — golangci-lint 0 issues, go vet clean, go mod verified - [x] Window-edge fenced both sides: `EdgeOutOfWindow` (5min+1s, no fire) + `EdgeAtWindowBoundary` (exactly 5min, must fire) - [x] Negative paths: ECC alone, Xid alone, wrong Xid code, corrected ECC, cross-GPU, pre-ECC Xid (causal direction) - [x] Schema drift battery: 10 falsifiers (extra field, confidence re-added, pattern.id numeric, pattern.id wrong value, xid_code string, ecc_delta negative, ecc_delta zero, evidence_kind outside enum, evidence_trail under min, gpu_id empty) - [x] Wiring: emits-verdict, xid-alone, ecc-alone, out-of-window, cross-gpu, window+threshold configurable ```release-notes feat(patterns): hbm_ecc detector (NORTHSTAR pattern #3) — emits a verdict when an uncorrected double-bit volatile DCGM ECC counter rise joins an Xid 48/63/64 on the same gpu.id within the configurable window (default 5min). Two new YAML config keys: hbm_ecc_window, hbm_ecc_delta_threshold. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Sweep of broken doc anchors and stale `tracecore_*` self-telemetry metric names that drifted after the v0.2.0 OCB pivot. Three independent root causes: 1. **Anchor typos** — 8 cross-links to RFC-0013 used `#migration--rollout` (double dash) and one used `#3-customer-stable-contracts` (truncated). Replaced with the canonical slugs `#migration-rollout` and `#3-customer-stable-telemetry-contracts`. 2. **Stale RUNBOOK metric vocabulary** — `module/receiver/ncclfrreceiver/RUNBOOK.md` still referenced v0.1.x names (`tracecore_receiver_errors_total`, `tracecore_exporter_calls_total`, `tracecore_receiver_degraded_seconds_total`). The v0.2.0 OCB binary emits `otelcol_receiver_ncclfr_*` and `otelcol_exporter_<name>_*` per the in-tree namespace alignment table in `docs/migration/v0.1-to-v0.2.md`. Also flipped `result="fail"` → `result="error"` to match the migration table's worked example. 3. **Orphan attribute row** — `tracecore.container.lines_per_s` in `docs/ATTRIBUTES.md` carried a "recipe TBD" emitter, but the `containerstdout` receiver was deleted at v0.2.0 (RFC-0013 PR-K.2 / §7) and no replacement emitter exists. Dropped the row. ## Files touched - `docs/getting-started.md` — `#3-customer-stable-contracts` → `#3-customer-stable-telemetry-contracts` - `docs/ATTRIBUTES.md` — drop `tracecore.container.lines_per_s` row - `docs/integrations/{k8sobjects-events,filelog-container,journald-kernel}.md` — anchor fix - `docs/migration/v0.1-to-v0.2.md` — anchor fix (2 occurrences) - `docs/migration/v0.2-to-v0.3.md` — anchor fix - `docs/notes/ci.md` — anchor fix - `docs/reproducibility.md` — anchor fix - `module/receiver/ncclfrreceiver/RUNBOOK.md` — 4 metric-name rewrites + migration-doc citation ## Why Anchor fragments silently 404 in GitHub-rendered docs (no broken-link CI gate for fragment-only mismatches). Stale RUNBOOK metric names point operators at series that no longer exist on the `:8888/metrics` endpoint, blocking incident response. The orphan ATTRIBUTES row implied a recipe was in flight when in fact the upstream emitter was deleted two minors ago. ## Test plan - [x] `git grep '#migration--rollout' docs/` returns 0 in-scope hits (the one remaining occurrence is an intra-doc link in `docs/rfcs/0009-pyspy-receiver-scope.md`, outside this PR's scope). - [x] `grep -n 'Customer-stable telemetry contracts' docs/rfcs/0013-distro-first-pivot.md` confirms heading exists at L97. - [x] RUNBOOK metric names cross-checked against `docs/migration/v0.1-to-v0.2.md` lines 113-121 (in-tree namespace alignment table) and lines 137-138 (worked PromQL migration recipe — `result="error"`). - [x] Pre-push hooks: `golangci-lint`, `go vet`, `go mod verify`, `doc-check` (554 markdown links resolved on-disk), `alert-check`, `chart-appversion-check`, `no-autoupdate-check` — all green. ```release-notes docs: fix 9 broken cross-doc anchors (RFC-0013 §migration and §3 slug), rewrite stale v0.1.x `tracecore_*` self-telemetry metric names in `module/receiver/ncclfrreceiver/RUNBOOK.md` to the v0.2.0 `otelcol_receiver_ncclfr_*` / `otelcol_exporter_<name>_*` vocabulary, and drop the orphan `tracecore.container.lines_per_s` row from `docs/ATTRIBUTES.md` (no emitter — `containerstdout` deleted at v0.2.0 per RFC-0013 §7). ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary The `docs/adrs/` directory held one file (`0001-metrics-to-logs-pattern-input.md`). A single-file decision-record directory sitting parallel to `docs/rfcs/` (13 RFCs + README + template) is taxonomy drift — operators and contributors had to learn two near-identical conventions for "load-bearing architectural decision in tree." This PR collapses the split. ### What changed - ADR-0001 → **RFC-0014** (`docs/rfcs/0014-metrics-to-logs-pattern-input.md`). Content reformatted to match the RFC template's section headings (Summary / Motivation / Proposal / Alternatives / Open questions / Migration / References). The substance (Option A vs Option B vs Option C analysis, the v0.130 contrib survey, PR-A landed + PR-B pending sequencing) is preserved verbatim — this is active design for issue #260 PR-B, not archaeology. - `docs/adrs/` directory removed. - 6 cross-references repointed at the new RFC path: - `docs/README.md` (subdirectories table — `adrs/` row removed) - `docs/ATTRIBUTES.md` (2 spots: `tracecore.alert.pcie_rate_collapse.*` row + "See also" link) - `docs/integrations/prometheus-scrape.md` (2 spots) - `docs/patterns/pattern-4-thermal-throttle.md` - `docs/patterns/pattern-5-pcie-aer.md` - 5 source-code comments repointed (`module/processor/patterndetectorprocessor/{patterndetector.go,thermal_throttle_test.go}`, `module/pkg/patterns/{pcie_aer.go,thermal_throttle.go}`). - `docs/rfcs/README.md` status-index gains an RFC-0014 row (`accepted`, 2026-05-31). ### Why convert (not delete) ADR-0001 was evaluated against the "delete if RFC-0013 already covers it" bias. It is not covered: - RFC-0013 §5 mentions a `metricthresholdconnector` as a contribution slot in one bullet. It does **not** evaluate Option A vs Option B vs Option C, cite the contrib v0.130 evidence, or sequence the PR-A/PR-B split for the metric-sourced detectors. - ADR-0001 is the binding design contract for patterns #1 / #3 / #4 / #5 (4 of the next NORTHSTAR detectors). Source code (`pcie_aer.go`, `thermal_throttle.go`, `patterndetector.go`) cites it as the reason for the staged-but-quiet wire-up. Delete would orphan 5 source comments and break the audit trail for an active design decision. Convert keeps the record load-bearing without preserving the parallel taxonomy. ### Verification - `grep -rn "docs/adrs\|adrs/0001"` returns 0 hits. - `grep -rn "ADR-0001\|ADR 0001"` returns 0 hits. - `ls docs/ | grep -i adr` returns empty. - pre-commit golangci-lint + go vet + go mod verify clean. ```release-notes docs: collapse single-file `docs/adrs/` into `docs/rfcs/`. ADR-0001 (metrics-sourced pattern inputs) is promoted to RFC-0014 verbatim; cross-references across docs and module source repointed. ``` ## Test plan - [x] `grep -rn "docs/adrs"` returns 0 hits - [x] `grep -rn "ADR-0001"` returns 0 hits - [x] `docs/adrs/` directory removed - [x] golangci-lint + go vet clean (pre-commit hook) - [ ] CI green Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Extends prometheus-scrape.md with the bridge attribute contract for the four metrics-derived patterns: - pattern #1 NVLink (#260) — the `hw.gpu.nvlink.io` OTTL transform already lands in commit 0baa557; this PR closes #260's recipe-half. - pattern #3 HBM ECC (#273) — `hw.errors.delta` + error.{type, subtype,persistence} + gpu.id contract. - pattern #4 thermal throttle (#282) — `hw.gpu.throttle.duration.delta` in integer seconds + reason=thermal + gpu.id contract. - pattern #5 PCIe AER Layer 2 (#284) — the `tracecore.alert. pcie_rate_collapse.*` namespace contract. OTTL metrics->logs emission stays upstream-blocked at OTel-contrib v0.130 (RFC-0014): no contrib processor or connector emits log records from a metrics pipeline. The bridge contract documented here is the load-bearing wire format any future emitter (an upstream metricthresholdconnector OR the WithMetrics extension to patterndetectorprocessor per RFC-0014 PR-B) MUST honor; the detector projections at module/processor/patterndetectorprocessor/ patterndetector.go gate on this contract today. last-verified marker bumped to 2026-06-01. Closes #260. Closes #273. Closes #282. Refs #284 (Layer 1 closed under #285 in a prior commit; Layer 2 contract documented here). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Closes #331. Replaces the single `pattern-pod-evicted` chaos job with a matrix-strategy `pattern-detectors` job covering every detector named in v1-rc1 cut-criterion §1: `hbm_ecc`, `thermal_throttle`, `pcie_aer`, `pod_evicted`, `nccl_hang`, `xid_correlation`. Each row runs the detector's positive + negative + schema `*_test.go` prefix under `-race`: - **Positive / Edge asserts** — verdict emitted with correct `pattern.id` within bounded time. - **Negative / SchemaRejectsDrift asserts** — zero verdicts on baseline / confounder input (the false-positive gate the chaos matrix exists to enforce). The `pod_evicted` row additionally runs the on-disk replay corpus under `module/pkg/replay/pod_evicted/`; sibling rows skip that step until their `replay/<pattern>/` fixtures land — tracked as Path A in [`docs/v1-rc1-test-audit.md`](../blob/main/docs/v1-rc1-test-audit.md) §4. ## Why this matrix shape, not new `failure-inject` modes Audit §4 explicitly recommends Path A (matrix over existing hermetic detector tests) for rc1 and defers Path B (per-pattern `failure-inject` CLI modes) to v1.x. Path A reuses the race-detector + golden-SHA shape already validated by `pattern-pod-evicted` and costs ~60 LoC of workflow YAML instead of five new CLI subcommands. ## Wall-time budget Observed per-row runtime under `-race` (measured locally on this worktree): | Row | runtime | |---|---| | hbm_ecc | 1.35s | | thermal_throttle | 1.33s | | pcie_aer | 1.30s | | pod_evicted (incl. replay corpus) | 1.35s + 1.34s | | nccl_hang | 1.31s | | xid_correlation | 1.32s | Per-row `timeout-minutes: 5` is ~3× headroom; matrix parallelism keeps total wall-time well inside the chaos-gate budget. ## Test plan - [x] `actionlint .github/workflows/chaos.yml` — clean. - [x] `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/chaos.yml'))"` — parses, 3 jobs (`harness-determinism`, `cpu-steal-mpstat`, `pattern-detectors`), 6-row matrix. - [x] For every matrix row, `cd module && go test -race -count=1 -run '<regex>' ./pkg/patterns/...` — all green (counts: HBMECC=16, ThermalThrottle=17, PCIeAER=15, PodEvicted+PressureFromNote=12, NCCLHang=10, XidCorrelation=14). - [x] `cd module && go test -race -count=1 ./pkg/replay/...` (pod_evicted corpus step) — green. - [x] Pre-commit hook: golangci-lint + go vet + go mod verify + attribute-namespace-check — all green. - [ ] Workflow runs green on `main` after merge (verifiable only post-merge — nightly cron + push-to-main). ```release-notes ci: chaos.yml now matrix-tests all six shipped detectors (#14 pod_evicted, #8 nccl_hang, #3 hbm_ecc, #4 thermal_throttle, #5 pcie_aer, xid_correlation) instead of just pod_evicted. Closes the chaos-coverage gap called out in v1-rc1-test-audit §4. No operator-visible behavior change. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Closes #337. The CUDA OOM detector (`projectFBMemoryRecord` at `module/processor/patterndetectorprocessor/cuda_oom.go:114`) gates on `hw.gpu.memory.{free,total}` log-record attributes, but nothing in the recipe layer produced them: `dcgm-exporter` emits `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` as Prometheus gauges and no OTTL transform projected them onto the customer-stable namespace. Detector compiled, never fired on a real install — sibling gap to #273 (pattern #3), #282 (#4), #284 (#5). This PR closes the gap on the metric-side projection and pins the load-bearing log-shape contract the bridge layer MUST honor. - `docs/integrations/examples/prometheus-scrape.yaml`: - `DCGM_FI_DEV_FB_USED` → `hw.gpu.memory.used` (Gauge, unit `By`) - `DCGM_FI_DEV_FB_FREE` → `hw.gpu.memory.free` (Gauge, unit `By`) - Identity-preserving rename only; `hw.gpu.memory.total = used+free` deferred to the bridge layer per the named upstream limit (see below). - `docs/integrations/prometheus-scrape.md`: - New `### Pattern #10 — CUDA OOM (framebuffer)` metric-side projection section with raw-series → semconv table. - New `#### Pattern #10 — hw.gpu.memory.{free,total}` bridge- contract subsection with full log-record schema (yaml-shaped) matching what `projectFBMemoryRecord` reads, plus MIG caveat and unit-test cross-link. - Intro + bridge-contract header bumped to include pattern #10. - `docs/patterns/10-cuda-oom-deceptive.md`: - Signal-source line links to the recipe sections. - Open Question #1 (`DCGM_FI_DEV_FB_*` OTTL extension) struck through; resolution recorded. - `docs/ATTRIBUTES.md`: - `hw.gpu.memory.free` / `.total` rows updated to distinguish metric vs log shape and to cross-link to the recipe section. - New `hw.gpu.memory.used` row (now projected on the metrics pipeline by this PR — dashboard evidence context). ## Root cause + named upstream limit **Root cause (fixed in this PR):** the prometheus-scrape OTTL transform had no stanza projecting the DCGM FB series onto `hw.gpu.memory.*`. The detector's projection gate could not be satisfied on a real install. Fixed by adding the rename stanza in `transform/dcgm_to_hw_semconv` (same processor the #1/#3/#4/#5 projections already live in — no new processor surface). **Named upstream limit (NOT worked around — tracked):** OTel-contrib `transformprocessor` v0.130 `metric_statements` cannot perform cross-series arithmetic — there is no OTTL path to compute `hw.gpu.memory.total = hw.gpu.memory.used + hw.gpu.memory.free` on a metrics pipeline ([upstream README](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config)). Per RFC-0014 §Alternatives, the metrics→logs primitive does not exist in contrib at v0.130 either. The total scalar lives at the bridge layer (RFC-0014 PR-B `WithMetrics` extension to `patterndetectorprocessor`, tracked under #260). The recipe pins the load-bearing wire format the bridge MUST honor so PR-B lands without a contract change. ## Adopt-over-build posture Every new OTTL statement uses upstream functions only (`set`, `==` equality). No new transformprocessor extension. Mirrors the existing #1/#3/#4/#5 stanzas. ## Test plan - [x] `make build` — clean - [x] `./scripts/validator-recipe.sh` — 9 validated, 3 skipped (non-linux), 0 fail; prometheus-scrape example passes `tracecore validate`. - [x] `./scripts/doc-check.sh` — 721 markdown links resolve, all new cross-links + anchor refs included; banned-phrase lint clean; recipe markers (`tested-against`, `last-verified`) present. - [x] `./scripts/attribute-namespace-check.sh` — 67/67 attribute literals documented (no new undocumented attrs introduced). - [x] golangci-lint, go vet, go mod verify via commit hook — clean. - [ ] CI linux runner exercises journald + k8sobjects skip-paths we couldn't run locally (validator-recipe ubuntu job). ```release-notes recipe(ottl): project DCGM `DCGM_FI_DEV_FB_USED` / `DCGM_FI_DEV_FB_FREE` onto the customer-stable `hw.gpu.memory.{used,free}` namespace and pin the metrics-to-logs bridge log-shape spec the pattern #10 (CUDA OOM) detector consumes via `projectFBMemoryRecord`. `hw.gpu.memory.total = used + free` derivation is deferred to the RFC-0014 PR-B `WithMetrics` bridge layer because OTTL `transformprocessor` v0.130 has no cross-series arithmetic on metrics pipelines. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
The patterndetector ships 11 detectors with 14 time-bounded knobs, but the join shape varies across patterns and the rationale lived only in code comments + PR review threads. Operators tuning windows had to read source per detector. Audit finding: five distinct shapes are load-bearing (chosen by the causal physics of each signal), not bugs: - One-sided lookback (#1 #3 #5 #6 #7 #10): cause precedes effect. - Asymmetric two-sided (#11): pre-stall covers concurrent-start checkpoints; post-stall covers OTTL-bridge logger latency. - Symmetric two-sided (#9 CNI-event leg): cohort-ready ±window could be cause OR consequence. - Job-window bounded (#13): SDC counter rise must fall in the bounded eval-cycle's owning job; no operator knob is meaningful. - Trailing-window rate / freshness (#2 #4 #8): rolling window anchored at `now` or the most-recent record. Decision: document the existing reality, do not converge. Forcing every detector to the asymmetric two-knob form would silently zero one leg for the one-sided detectors (footgun on clock skew) and would not apply to #13 at all. Adds: - 'Why this correlation shape' section in docs/patterns/07, 11, 13 (the three shapes the issue called out by name). - 'Correlation-window semantics' table in docs/patterns/README.md covering ALL 11 detectors with the predicate, anchor, and shape rationale, plus cross-links to the per-pattern sections. No code changes; no detector behavior changes. Closes #367. Signed-off-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Lands replay corpora for the seven pattern detectors that lacked one, closing the Path-A test gap named in the v1-rc1 audit (#366). Each pattern now ships `module/pkg/replay/<pattern>/{canonical,_negative, _real_world}/` fixtures plus a `*_replay_test.go` runner that JSON-eqs detector output against the on-disk golden. Detectors covered: `hbm_ecc` (#3), `thermal_throttle` (#4), `pcie_aer` (#5), `ib_link_flap` (#2), `nccl_hang` (#15), `cuda_oom` (#10), `xid_correlation` (#16). `pod_evicted` (#14) corpus is unchanged. ## Design - Each detector takes a different input shape (e.g. `HBMECCRecord + XidRecord` for `hbm_ecc`, `ThermalThrottleRecord` for `thermal_throttle`, ...), so the existing `LoadFixturesUnder` helper (typed on `Record + NodeRecord`) cannot be reused. Each detector gets its own `*_replay_test.go` that inlines the per-detector JSON read; shared fixture-discovery and golden-assert helpers live in `helpers_test.go`. - Two detectors (`nccl_hang`, `ib_link_flap`) take a `Now` reference. Tests pin `Now` to a fixed timestamp matching the fixture's `started_ns` so hang-age and flap-window inclusion stay deterministic across replay runs (otherwise wall-clock drift would silently flip the verdicts as the fixture aged). - Goldens were generated from the live detectors via `UPDATE_REPLAY_GOLDEN=1 go test ./module/pkg/replay/...` and pinned; future drift in detector output (headline / remediation prose, evidence-trail UID shape, scalar-field rename) surfaces as a `JSONEq` diff against the fixture. Operators can also eyeball the golden to assert what they EXPECT vs what fires. - Negative fixtures each exercise a distinct discriminator (wrong Xid code, single GPU, no AER, no eviction, completed state, single transition, no OOM log) so a regression in one false-positive guard lights up the corresponding row only. - Flipped `run_corpus: true` on every row of the chaos.yml pattern-detectors matrix now that every detector has a corpus. ## Test plan - [x] `make check` — clean - [x] `go test -race -count=1 ./pkg/replay/...` — 35 tests pass (28 new + 7 pre-existing pod_evicted) - [x] `go test -count=1 ./processor/patterndetectorprocessor/...` — unchanged, still green - [x] `make verify` (pre-push) — clean - [ ] CI: chaos.yml `pattern-detectors` matrix — 8 rows, each runs hermetic regex + replay-corpus step Closes #366. ```release-notes NONE ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
8 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Closes #393. Ships the metric-side OTTL projection from `node_exporter --collector.infiniband`'s `node_infiniband_port_state_id` Gauge onto the customer-stable `hw.network.ib.*` namespace (`hw.network.ib.port.state` int + `hw.network.ib.device` + `hw.network.ib.port.num`) so pattern #2's `IBLinkFlapDetector` consumes the same vendor-neutral wire shape regardless of whether the underlying source is node_exporter, a Mellanox exporter, or the `mlx5_core` journald stream. Detector library + processor wiring already shipped in #391 (closed #300). Only the metric-side input recipe was missing — pattern #2 was configured-but-quiet on real deployments. This PR closes that gap. ## Wire contract (node_exporter raw → hw.network.ib.*) ``` node_infiniband_port_state_id{device="mlx5_0", port="1"} = 4 (IBA phys_state ID) ↓ transform/ib_to_hw_semconv Gauge metric "hw.network.ib.port.state" with datapoint attrs: hw.network.ib.device = "mlx5_0" (str, from `device` label) hw.network.ib.port.num = 1 (int, from Int(`port` label)) value = 4 (int, the phys_state ID) ``` The future RFC-0014 PR-B metrics→logs bridge emitter (shared with patterns #3/#4/#5/#10) will lift these three attributes onto a log record at emit time. The bridge log-record schema for pattern #2 is pinned in `docs/integrations/prometheus-scrape.md §Pattern #2 — hw.network.ib.port.state (issue #393)` so PR-B has no per-pattern reconstruction work to do. The companion series `node_infiniband_state{state="<name>"}` (string label) is intentionally NOT mapped — the detector (`module/processor/patterndetectorprocessor/ib_link_flap.go`) compares `state.Int()` against `patterns.IBPortState*` integer constants, so the string variant would round-trip wrong. ## No detector code change required The detector reads three attribute names off a log record: `hw.network.ib.port.state`, `hw.network.ib.device`, `hw.network.ib.port.num`. The recipe stanza stamps the exact same three names on the metric datapoint. The wire format `port.Int()` expects (the projector at `module/processor/patterndetectorprocessor/ib_link_flap.go` line 39 calls `int(port.Int())`) is satisfied because the OTTL `Int()` cast on the Prometheus `port` string label produces a pdata int Value. Confirmed by the new `TestRecipe_IBLinkFlap_RoundTripFiresVerdict` test. ## Root cause + scope - **Root cause of #393**: missing metric-side OTTL stanza. Fixed. - **Out of scope (separate blocker, tracked under #260 PR-B)**: the metrics→logs bridge emitter. Upstream-blocked at OTel-contrib v0.130 — `transformprocessor`'s `metric_statements` cannot reference `log.*` paths and no contrib connector emits log records from a metrics pipeline (per [RFC-0014](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0014-metrics-to-logs-pattern-input.md)). The recipe doc explicitly documents this gating relationship; PR-B is shared with patterns #3/#4/#5/#10 and lands the bridge once. ## Files changed - `docs/integrations/examples/prometheus-scrape.yaml` — new `transform/ib_to_hw_semconv` processor; wired into the `metrics/scrape` pipeline. Validates with `./_build/tracecore validate` (exit 0). - `docs/integrations/prometheus-scrape.md` — new "Pattern #2 — InfiniBand link flap" projection section, intro updated from "Two" to "Three OTTL transforms", and bridge log-record contract subsection added under the "Metrics-to-logs bridge contract" section. - `docs/patterns/pattern-2-ib-link-flap.md` — deleted "Integration gap" section, replaced with "Integration recipe" pointing at the shipped stanza; updated "Why node_exporter sees it" prose to drop the "pending" hedge. - `module/processor/patterndetectorprocessor/ib_link_flap_recipe_test.go` — new file. `TestRecipe_IBLinkFlap_StanzaPinsWireContract` parses the example YAML and asserts every load-bearing token is present (source metric name, three `hw.network.ib.*` attrs, the `Int()` cast on the port label, the transform name, and the pipeline wiring). `TestRecipe_IBLinkFlap_RoundTripFiresVerdict` simulates the end-to-end path: builds `plog.Logs` with the exact attribute shape the recipe stamps and asserts the processor emits a flap verdict. ## Test plan - [x] `./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml` → exit 0 - [x] `bash scripts/validator-recipe.sh` → 9 validated, 3 skipped (non-linux host) - [x] `bash scripts/doc-check.sh` → clean (no orphan test refs) - [x] `go test ./module/processor/patterndetectorprocessor/... -count=1` → PASS (incl. the two new tests + all 5 existing IB tests) - [x] `go build ./...` and `go vet ./...` → clean - [x] Pre-commit hooks: golangci-lint 0 issues, go mod verify, attribute-namespace-check 100/100 - [x] Mutation-verified: dropping `hw.network.ib.port.state` from the recipe yaml fails `TestRecipe_IBLinkFlap_StanzaPinsWireContract` with the expected remediation message naming the missing identifier - [ ] CI on the PR (waiting on push) ```release-notes feat(recipe): InfiniBand link-flap OTTL stanza projecting node_exporter's `node_infiniband_port_state_id` onto the tracecore-canonical `hw.network.ib.*` namespace (`hw.network.ib.port.state` int + `hw.network.ib.device` + `hw.network.ib.port.num`). Pattern #2's `IBLinkFlapDetector` now has its metric-side input wired; metrics→logs bridge emitter remains gated on RFC-0014 PR-B (#260). ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
10 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 2, 2026
…ard) (#477) ## Summary Closes the `docs/MILESTONES.md` §M6 carry-forward: *"every fenced block in `docs/getting-started.md` is exercised by `scripts/smoke.sh`"*. The ≤5-count gate shipped with the M6 wave; the binding half was tracked carry-forward because `smoke.sh` ran a parallel hand-written hostmetrics→debug config rather than the doc's actual YAML. ## Root cause Two scripts owned the "first OTLP byte" config — `smoke.sh` rendered one inline, `docs/getting-started.md` carried another. They happened to agree, but nothing forced them to. The carry-forward existed because the binding was *correct by inspection*, not *correct by construction*. The fix is to make the doc the single source: `smoke.sh` extracts the YAML from `docs/getting-started.md`'s `## Walkthrough` heredoc at runtime. If the doc grows a typo, a renamed receiver, or a different scraper, `smoke.sh` exercises the change automatically. If the heredoc disappears, the extractor fails loud with a named error. ## Changes - `scripts/smoke.sh` — extracts the Walkthrough heredoc via a perl one-liner, writes it to a tempfile, then runs `tracecore validate --config=` + `tracecore --config=` against it (Walkthrough steps 3 + 4). Lifecycle-log assertions retained, with `"Shutdown complete"` now load-bearing against the doc's post-walkthrough prose. - `scripts/doc-check.sh` — new gate (right after the existing ≤5-count gate) asserts the smoke↔doc binding with four mutation-verified clauses: Walkthrough scope, `"$BIN" validate --config=` invocation, `"$BIN" --config=` run invocation, `docs/getting-started.md` path reference. - `scripts/smoke_test.sh` — new mutation-verify harness mirroring the gate at runtime, plus an inline mutant-doc test that proves the extractor exits 1 and the wrapper emits the named error when the heredoc is removed. - `Makefile` — `make smoke` now also runs `smoke_test.sh`; wired into `ci-full` alongside the existing `smoke-quickstart` target. - `docs/MILESTONES.md` — §M6 status `⧗ partial` → `☑ delivered`; getting-started rubric `⧗` → `☑`; carry-forward bullet rewritten (remaining work is operator-config branch-protection only). ## Runtime End-to-end `bash scripts/smoke.sh` on darwin/arm64: **~2.2s** (extract + validate + 1.5s run window + lifecycle-log assertions). Well under the 120s ci-fast budget. No hardware required — uses the `hostmetrics` load scraper, portable across linux/darwin/windows. ## Test plan ```release-notes ci(smoke): scripts/smoke.sh now extracts its YAML config from docs/getting-started.md '## Walkthrough' instead of carrying a parallel hand-written config; doc-check.sh gates the doc↔smoke binding with four mutation-verified clauses. Closes the M6 carry-forward. ``` - [x] `bash scripts/smoke.sh` exits 0 on clean main (verified locally, ~2.2s). - [x] `bash scripts/smoke_test.sh` all assertions pass. - [x] `bash scripts/doc-check.sh` reports `scripts/smoke.sh binds to docs/getting-started.md (M6: every block exercised by smoke.sh)`. - [x] Mutation test #1: `sed -i 's/"$BIN" validate --config=/"$BIN" XXX --config=/' scripts/smoke.sh` → doc-check exits 1 naming "validate --config= invocation (Walkthrough step 3)". - [x] Mutation test #2: `sed -i 's/"$BIN" --config=/"$BIN" XXX=/' scripts/smoke.sh` → doc-check exits 1 naming "run invocation (Walkthrough step 4)". - [x] Mutation test #3: `sed -i 's/Walkthrough/Section/' scripts/smoke.sh` → doc-check exits 1 naming "extraction scope lost". - [x] Mutation test #4: `sed -i 's/docs/getting-started.md/docs/SOMEWHERE-ELSE.md/' scripts/smoke.sh` → doc-check exits 1 naming "binding source missing". - [x] Mutation test #5: getting-started.md with no `## Walkthrough` heredoc → smoke.sh exits 1 with named error message (covered by `smoke_test.sh`). - [x] `make lint` 0 issues; `make vet` clean; `make doc-check` clean (all 18 gates pass). - [x] `make smoke` end-to-end including `smoke_test.sh` passes. ## Related - Refs `docs/MILESTONES.md` §M6 (Documentation scaffold). - Sibling #460 (`fix(doc-check): drop unconditional exit 0`) made this carry-forward visible — before #460, the new gate would have been silently skipped by the line-99 short-circuit. Signed-off-by: Tri Lam <tree@lumalabs.ai>
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Adds three project-planning artifacts that together set the goals tracecore optimizes for and what we commit to ship this quarter:
opentelemetry-collector-contriblater.Reviewer test plan:
Linked issue(s)
N/A — initial planning artifacts; no linked issue.
Release notes
Checklist
make cipasses locally — N/A (docs-only PR, no Go changes)git commit -s)STYLE.md— N/A (no new components)