[docs] FOLLOWUPS: close squash-merge-trailer item with observation#14
Merged
Conversation
PR-13 (commit a57883f) merged via GitHub's squash-merge. Observed: the per-commit Assisted-by trailers on the 43 source commits did NOT survive as trailers on the squash commit. The squash body includes the PR description with every source SHA, so the audit trail is intact; the trailer-level disclosure AGENTS.md describes is not. Two governance paths, neither chosen here: - update AGENTS.md to treat the PR-description SHA list as canonical squash-merge disclosure - switch AI-assisted PRs to merge-commit / rebase-merge so per-commit trailers survive Flagging for governance review; not a code fix. Signed-off-by: tree <tree@lumalabs.ai> Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 14, 2026
Address items #9 (--version OTel SDK), #13 (Example_* tests), #14 (Kind constant coverage + helper), #15 (shutdown error logging). Also tactical: extract initTelemetryStack from runCollect so the silent-shutdown-error fix didn't push cyclomatic complexity past 15. - internal/version: Build gains OTelSDKMetric + OTelPromExporter fields, populated from debug.ReadBuildInfo. Surfaces in --version output as `otel-sdk-metric=vX.Y.Z, otel-prom-exporter=vA.B.C`. Operators debugging OTel-side bugs see which SDK actually shipped without grep'ing go.mod. - internal/selftelemetry: new CanonicalKinds() returning the Kind* constant set and IsCanonicalKind(s) helper. Test pins the set + asserts clockreceiver's KindDownstream is canonical. Future receivers (M8/M9) plug in the same test pattern; eventual lint enforcement is documented in FOLLOWUPS. - internal/selftelemetry/example_test.go: ExampleNewReceiver + ExampleNewExporter render in pkg.go.dev for M8/M9 authors and compile-check the 6-line wiring pattern every CI run. The RECEIVER-PATTERNS.md snippet can no longer drift from a working example without breaking this test. - cmd/tracecore.runCollect: previously swallowed mp.Shutdown errors in the failure-unwinder paths with `_ = mp.Shutdown(ctx)`. Now logs at Warn level so a real drain failure surfaces. Extracted the telemetry-init block into initTelemetryStack helper to keep runCollect's cyclomatic complexity under the gocyclo 15 ceiling (the helper's own error-handling carries the rest of the cleanup branches). make ci clean. Coverage unchanged on most pkgs; selftelemetry up to 91.0% via the new IsCanonicalKind test. Assisted-by: Anthropic:claude-opus-4-7 [Claude Code] Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…rd surface ## What this PR does Lands M19 (pattern #14 pod-evicted) end-to-end against the rubric's hermetic replay-test gate, plus the M10 NodeRecord surface the detector consumes. ### Pattern detector (internal/synthesis/patterns/) - `pod_evicted.go`: `PodEvictedDetector{}.Evaluate(events, nodeConds)` joins each Pod-Evicted Record against the most-recent per-node pressure transition within a 30s sliding window (configurable via `JoinWindow`). Returns one `PodEvictedVerdict` per evicted pod, sorted by event time for golden-test determinism. - Headline matches the rubric's regex contract for the canonical disk-pressure fixture (`/Pod .* evicted at .* due to disk pressure/`); remediation matches `/relocate.*NVMe/`. Three other pressure roots (memory/pid/notready) get distinct operator-facing remediations. - Missing-layer path: when no NodeRecord joins, verdict falls back to `confidence: partial` with `missing_layers: ["node_condition"]` and the pressure kind is parsed heuristically from the Event Note. - `verdict.go`: `PodEvictedVerdict` + `EvidenceRef` + `Confidence` with JSON tags pinned by `testdata/verdict.schema.json` (additional- Properties:false; drift rejected in `verdict_schema_test.go`). - Generic `Detector` interface deferred per PRINCIPLES §3 rule-of-three — M17 / M18 land as the second and third consumers. ### Replay runner + corpus (internal/synthesis/replay/) - `runner.go`: `LoadFixture(dir)` / `LoadFixturesUnder(root)` walk a directory tree, load `manifest.json` + `events.json` + `node_conditions.json` + `golden.json`. Pure stdlib; hermetic by design per rubric L414. - `runner_test.go`: drives every fixture under `pod_evicted/`, JSONEq-diffs detector output against `golden.json`. Asserts the canonical + three negative fixtures all exist as a structural gate. - Canonical fixture: `pod_evicted/canonical/` (Evicted Pod on `gpu-node-0001` at T=10:00:00Z; DiskPressure transition at T-5s → full-confidence verdict). - Negative fixtures under `_negative/`: `killing/`, `preempted/`, `failed_scheduling/` — each produces an empty verdict slice (rubric L407 FP-rate contract). ### M10 NodeRecord surface (components/receivers/k8sevents/) - `node_record.go`: typed `NodeRecord` + `NodePressureKind` constants (memory/disk/pid/notready) + distinct `NodeSchemaURL` so downstream consumers version-gate the two surfaces independently. `HintForPressure` inversion: NotReady → HintNodeUnhealthy; rest → HintNodePressure. - `node_emit.go`: `buildNodeLogRecord` analogous to `buildLogRecord`. - `node_informer.go`: pure `convertNodeConditions(*corev1.Node) []NodeRecord` (only alarming statuses surface: True for the three pressures, False for Ready) + `nodeConditionDedup` that tracks (nodeUID, conditionType, transitionNano) so an unrelated Node-object update doesn't re-emit. - `receiver.go`: Node SharedInformer registered when `Config.NodeConditions.Enabled=true` (default false — wider RBAC). `nodeRecords chan NodeRecord` joins the existing `events chan` in `run()`'s select; nil channel blocks when disabled. - `rbac.yaml`: adds `core/v1.nodes get,list,watch`. `rbac.can-i.golden` updated to match (sorted verb-then-resource). - `Record.ReportingInstance` added (kubelet sets to node name; M19's join key); `MaxAttributesFloor` bumped 9→10 to keep it intact. - `pattern_consumer_test.go`: 4 new compile-gate tests pin the NodeRecord field walk, the four-kind exhaustiveness, the pressure→ hint inversion, and `reporting.instance` wire-format pinning. - `node_record_test.go`: mirrors `hint_test.go` for the conditionType→NodePressureKind taxonomy and the pressure→Hint map. ### CLI - `tracecore failure-inject pod-evict --reason=... --seed=N --out=path` wraps `tools/failure-inject/podevict/podevict.go` for deterministic Event YAML generation. Closes FOLLOWUPS L806-815 carry-forward. `--allow-cluster-write` still returns `ErrClusterWriteUnsupported` (exit 70) — captured in M19's new carry-forward #5. ### Hygiene - CHANGELOG entry under [Unreleased] / ### Added. - MILESTONES §M19 flipped ☐ → ⧗ partial; rubric items annotated ☑/⧗ with carry-forwards listed. - FOLLOWUPS L806-815 closed; L927-934 (chaos.yml pattern_evicted matrix row) trigger-fired annotation added. ## Linked issue(s) _No linked issue._ ## Release notes ```release-notes NONE ``` ## Checklist - [x] `make ci` exit 0 (verified twice; runtime budget under 60s) - [x] All new unit + replay tests pass; no regressions across repo suite - [x] JSON Schema gates verdict shape (additionalProperties:false) - [x] Negative fixtures produce zero verdicts - [x] RBAC golden regenerated; pattern_consumer compile-gate updated - [x] Commits signed off Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…h error handler; defer 6 findings
Phase 2 dispatched 8 lens reviewers (performance, SRE/Infra,
maintainer, contributor, operator, adopter, security, researcher)
+ 1 mechanical compliance gate in parallel. Real bugs applied;
load-bearing-but-deferable findings logged in FOLLOWUPS.
## Pushback table
| ID | Lens(es) | Beneficiary | Severity | Finding | Proof | Contradict | TDD record | Rubric+? | Action |
|-----|----------|-------------|----------|---------|-------|------------|------------|----------|--------|
| P2.1 | security + mechanical | operator (DCO policy) | BLOCKER | 4 of 5 commits on branch carried `Co-Authored-By: Claude` + `🤖 Generated` trailers, violating AGENTS.md "Load-bearing lessons" forward-comply rule. | `git log origin/main..HEAD --pretty=%B \| grep -c 'Co-Authored-By:'` returned 4. | Rule is advisory / grandfathered? Falsified — AGENTS.md explicit forward-comply, commits dated 2026-05-18/19. | n/a (process). | yes (gate added) | applied — operator-authorized interactive rebase + force-push-with-lease. New gate at `git log origin/main..HEAD --pretty=%B \| grep -cE 'Co-Authored-By:\|🤖 Generated' == 0`. |
| P2.4 | researcher | repo-long-term + operator | BLOCKER | `node.name` / `node.uid` collide with OTel semconv canonical `k8s.node.name` / `k8s.node.uid`; package already uses `k8s.event.hint`, so the chosen `node.*` broke local convention. | Existing `AttrEventHint = "k8s.event.hint"` precedent; OTel semconv k8s resource attributes doc. | tracecore-private namespace per `NodeSchemaURL` scope? Partial — but two of five attrs have direct upstream canonicals (name, uid); the other three (`node.pressure.kind`, `node.condition.transition_at`, `node.condition.message`) ARE tracecore-private and remain in `node.*`. | RED: `TestPatternConsumer_NodeRecordTypeCompiles` updated to expect `k8s.node.name`; ran, failed with `expected k8s.node.name got node.name`. GREEN: `AttrNodeName` / `AttrNodeUID` constants flipped; test passes; integration test propagates via Go-name reference (no string callsite drift). | yes (semconv-compat rubric addition) | applied (this commit) |
| P2.2 | operator | operator (alert visibility) | BLOCKER | Node informer registered AddEventHandler but never SetWatchErrorHandler — when nodes RBAC is missing the watch fails silently, the K8sEventsReceiverDegraded alert never fires, `tracecore_receiver_errors_total{kind="watch"}` does not move. | `git diff origin/main -- receiver.go` showed only eventInformer carried SetWatchErrorHandler. | Symptom only visible without RBAC? Confirmed — that's the production failure mode the alert exists for. | Shared `onWatchError` handler now wired on the Node informer (mirrors eventInformer). The handler itself is already tested via `TriggerWatchError`. Integration coverage for "no-nodes-RBAC → KindWatch increments" needs apiserver-side watch-fail injection; deferred to FOLLOWUPS as Hard-proof falsifier with concrete reproducer command. | no | applied (this commit) |
| P2.3 | operator | operator (alert correctness) | CONCERN → deferred | `emitNode` unconditionally `SetDegraded(false)` — a successful node emit clears degraded state set by an Event-watch failure, masking the Event-side outage. | Receiver.go `run()` calls `SetDegraded(false)` on either-side emit success; degraded is shared receiver state. | Both-informer-broken case fires both error paths and re-sets degraded? True — but a healthy node informer + broken event informer is the lasting masked-failure case. | Falsifier sketch logged in FOLLOWUPS: `TriggerWatchError → inject NodeRecord → assert Degraded()=true`. | no | deferred (FOLLOWUPS, M21-audit trigger) |
| P2.6 | operator | operator (telemetry triage) | CONCERN → deferred | Self-telemetry counters `KindParse`/`KindPanic`/`KindBackpressureDrop` shared between event + node deliverers; operator cannot disambiguate via `kind` label. | `IncError(selftelemetry.KindPanic)` callsites in `deliver` and `deliverNode`. | Adopters look at the message body, not the kind label? Falsifier: RUNBOOK triage steps that say "filter by kind label" lose specificity. | Fix shape (add `KindNodeParse` etc.) logged in FOLLOWUPS. | no | deferred (FOLLOWUPS, M21-audit trigger) |
| A1 | adopter | adopter (first-touch) | CONCERN → deferred | Helm-installed chart has no RBAC templates; `helm install` → silent zero events. Chart README Troubleshooting section has 8 entries; none mention RBAC. | `ls install/kubernetes/tracecore/templates/` shows no ClusterRole. | Already in FOLLOWUPS? Yes, but trigger ("first operator reports") is post-failure; adopter lens raised it as a first-touch gap. | n/a | no | deferred (FOLLOWUPS — existing entry sufficient, Troubleshooting README enhancement is the smaller follow-up) |
| A2 | adopter | adopter | NIT → deferred | `failure-inject pod-evict` undocumented in `docs/` or `getting-started.md`. | `grep -rn failure-inject docs/` zero hits. | Chaos tool, not setup step? Defensible. | n/a | no | deferred (FOLLOWUPS, M21 docs-sweep trigger) |
| F4 | operator | operator | NIT → deferred | Exit-70 contract for `--allow-cluster-write` not in `--help` or RUNBOOK. | Kingpin description omits exit code; RUNBOOK has no failure-inject section. | n/a | n/a | no | deferred (FOLLOWUPS, M21 docs-sweep trigger) |
| Contrib-F1 | contributor | repo-long-term | NIT → deferred | `testdata/verdict.schema.json` is pattern-#14-specific (`pattern.id const: "14"`); M17/M18 cannot reuse the path. | File body L12. | rule-of-three: don't anticipate the 2nd schema? Counter: 1-line rename is cheap. Counter-counter: every M17/M18 PR can rename in its own first commit. Verdict: defer. | n/a | no | deferred (FOLLOWUPS, M17/M18 PR-start trigger) |
| Contrib-F3 | contributor | repo-long-term | NIT → deferred | No `docs/patterns/pattern-14-pod-evicted.md`; precedent broken silently. | `ls docs/patterns/` shows pattern-1, -3, -4, -5 but no -14. | M19 MILESTONES entry IS the spec doc? Partially — pre-existing pattern docs are research notes; this PR didn't write one. | n/a | no | deferred (FOLLOWUPS, M21 docs-sweep trigger) |
| Mtnr-M1 | maintainer | repo-long-term | NIT → deferred | `Manifest.Description`/`ExpectedTiming` are dead fields (no consumer reads them). | `grep -rn Manifest.Description` returns zero non-test hits. | YAGNI per §2; trim later. | n/a | no | explicitly-skipped (operator-prose value in manifest body justifies retention even if no Go reader yet) |
| Perf-1..5 | performance | repo-long-term | NIT → APPROVED | Detector hot-path allocation, RFC3339Nano formatting, dedup mutex contention, golden bytes retention, nil-channel select cost — all within v0 scale. | Reviewer's per-finding benchmark thresholds documented in their report. | n/a | n/a | no | explicitly-skipped — PRINCIPLES §10 / §2 |
## Verdicts
| Lens | Verdict |
|------|---------|
| Performance | APPROVED |
| Maintainer | APPROVED (nits) |
| Contributor | APPROVED (doc-followups) |
| Operator | CONCERNS-REQUIRE-FIX → applied (P2.2) + deferred (P2.3, P2.6, F4) |
| Adopter | CONCERNS → deferred (chart gap, doc gap) |
| Security | BLOCKER-REQUIRES-RESOLUTION → applied (P2.1 DCO trailer purge) |
| Researcher | CHANGES-REQUESTED → applied (P2.4 semconv align) |
## Rubric evolution
- [P2-researcher] semconv-compat — attributes with OTel canonical use the `k8s.*` prefix; private attrs keep project namespace.
- [P2-security] forward-comply DCO trailer gate — `git log origin/main..HEAD | grep -cE 'Co-Authored-By:|🤖 Generated' == 0`.
- [P2-operator] node-side watch errors must surface via SetWatchErrorHandler.
## Verification
`make ci` exit 0. Force-pushed origin via interactive rebase
(operator-authorized). Gate 1 (AI-vocab grep) clean. Gate 2 (DCO
trailer) now clean.
Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…rd surface (#75) ## Summary First tracecore pattern detector + the supporting receiver-side surface, with the operator-facing failure modes fully instrumented: - **`PodEvictedDetector`** joins Pod-Evicted Events against per-node pressure transitions inside a configurable sliding window (30s default) and emits `PodEvictedVerdict` with structured evidence trail, remediation prose pinned to the offending node, and `pattern.id="14"`. - **Hermetic replay corpus** under `internal/synthesis/replay/pod_evicted/`: one canonical disk-pressure fixture + three negatives (Killing, Preempted, FailedScheduling). Runner JSONEq-diffs detector output against per-fixture `golden.json`; runs inside `make ci` budget. - **`NodeRecord` surface on k8sevents** — typed record + `NodePressureKind` constants + opt-in Node SharedInformer behind `node_conditions.enabled` (default false). `core/v1.nodes` RBAC delta. Attribute names follow OTel semconv (`k8s.node.name`, `k8s.node.uid`). - **Per-source operability discipline** — `degradedEvent` / `degradedNode` atomic bits OR'd to the receiver-wide flag, so a successful Node emit cannot mask an Event-watch outage. Distinct `KindNodeWatch` / `KindNodeParse` / `KindNodePanic` / `KindNodeBackpressureDrop` labels on `tracecore_receiver_errors_total`. `K8sEventsBackpressureDrops` alert matches both kinds via regex. - **`tracecore failure-inject pod-evict`** CLI wraps the YAML generator with O_NOFOLLOW-guarded output, deterministic-by-seed, exit 70 sentinel for the live-cluster stub. - **`Record.ReportingInstance`** added so detectors can join Pod-Evicted events to nodes without inspecting the message body. `MaxAttributesFloor` bumped 9 → 10 (breaking, flagged under `### Changed` in CHANGELOG). - **Operator runbook section** for the Node-informer failure modes, triage by `kind` label, and the disable-in-a-hurry config knob. ## Test plan - [x] `make ci` exit 0 — full lint, vet, race-enabled tests, doc-check, alert-check, banned-phrase scan, govulncheck, deterministic build. - [x] `go test -race ./... -count=1` clean across the whole repo. - [x] `BenchmarkPodEvictedDetector_1kEventWindow` baseline: 865μs/op on Apple M4 Pro against the ≤1ms/eval budget. `.github/workflows/bench.yml` runs the same bench on GHA `ubuntu-latest` for a native Linux x86_64 datapoint after merge. - [x] JSON Schema rejects extra fields, out-of-enum confidence, empty evidence trail, numeric `pattern.id`, and unknown evidence kinds (five negative-coverage cases). - [x] Replay corpus round-trips deterministically; negative fixtures produce empty verdict slices. - [x] Node informer end-to-end through fake clientset: emits `NodeRecord` with the correct `NodeSchemaURL`, opt-in gate enforced. - [x] Node-watch RBAC denial via `PrependWatchReactor("nodes", forbidden)` increments `KindNodeWatch` and not `KindWatch`; removing the SetWatchErrorHandler line on the Node informer flips this test red. - [x] Event-watch failure followed by successful Node emit keeps the receiver-wide degraded flag true (alert silencing regression test). - [x] `tracecore failure-inject pod-evict --reason=DiskPressure --seed=42` produces byte-identical YAML; symlink targets refused with ELOOP on Linux. ## Scope and follow-ups Generic `Detector` interface deferred per the rule of three — M17/M18 land as the second and third consumers. `MILESTONES §M19` ships at ⧗ partial: **7/7 functional and 5/5 non-functional rubrics ☑**; the only remaining work flagged on the milestone is anonymized-real-world replay corpus (needs external capture). Eleven items captured in `docs/FOLLOWUPS.md` each with a concrete trigger — these cover the Helm chart RBAC gap, live-cluster detection wiring (M20+ scope), distro-patched kubelet variants, OTel processor integration, custom-controller NodeCondition Message variants, the `failure-inject pod-evict` core/v1 vs events.k8s.io/v1 API-version mismatch, and four smaller doc/refactor items. ```release-notes NONE ``` --------- Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr
added a commit
that referenced
this pull request
May 31, 2026
…2b) (#246) ## Summary Two net-new OTel processors land in the in-repo submodule under `module/processor/` (RFC-0013 §M19 / PR-I.2b): - **`rankjoinprocessor`** — 5s windowed cross-signal join. NCCL FR rank records + kubelet pod-evicted Events for the same pod cross-stamp `m19.joined`, `gen_ai.training.rank`, and `k8s.pod_evicted_at`. Same-batch joins stamp both records; across-batch joins stamp the eviction-side record (the rank record was already forwarded — the eviction-side stamp is the M19-canonical surface). Component type: `rankjoin`. - **`patterndetectorprocessor`** — wraps `module/pkg/patterns/` + `module/pkg/replay/`. Projects OTel log records into the typed pattern-library model, runs the pod-evicted detector, and appends one verdict log record per match. The verdict carries `pattern.id`, `pattern.confidence`, `pattern.headline`, `pattern.remediation`, and `pattern.verdict_json` (full evidence trail). Golden test against the canonical `pod_evicted/canonical` fixture pins the OTel-attrs → verdict round-trip. Component type: `patterndetector`. Closes #163 — M19 cross-signal join integration test now lives at `module/processor/rankjoinprocessor/rankjoin_test.go` and exercises the real 5s windowed join (not just the compile-pin the prior containerstdout test asserted). `builder-config.yaml` adds the two processor entries pointing at `github.com/tracecoreai/tracecore/module v0.2.0`. The `ncclfrreceiver` line moves from `v0.1.0` to `v0.2.0` in lockstep so OCB's MVS picks one submodule version per build. The submodule tag bump from `v0.1.0` to `v0.2.0` is **post-merge** via `git tag` — not part of this PR. `module/go.mod` gains `go.opentelemetry.io/collector/processor v0.115.0` (+ its `processortest` test dep). ### Adversarial review (self, fallback path) The orchestrator requested an `Agent`-tool review subagent. The `Agent` tool was not available in this session's tool surface; the spec explicitly authorizes self-adversarial review as a fallback. Findings reviewed against: - **Race on `nowFn`:** `-race` tag passed; tests overwrite `nowFn` sequentially with no concurrent reads. Production code reads `nowFn` under `p.mu.Lock()`. - **Across-batch pointer-mutation safety:** the rank/eviction buffers retain `plog.LogRecord` pointers ONLY for the in-batch case; `invalidateBufferPointers` at the tail of every `ConsumeLogs` call drops the pointers and flags the entries pointer-stale before forwarding downstream (the next consumer declared `MutatesData=true`). - **Memory bound:** `MaxBufferedRecords` (default 4096) caps both buffers with FIFO eviction at capacity. `evictExpired` runs at the start of every `ConsumeLogs` call so the buffer also drains on the time axis. - **Patterndetector across-call statelessness:** the detector evaluates records arriving in ONE `ConsumeLogs` call together. The README + RUNBOOK both call this out; cross-call buffering is on the v0.3 roadmap. The canonical fixture's events fit in one call so the golden test exercises the realistic shape. ```release-notes - new processor: `rankjoin` — windowed cross-signal join between NCCL FlightRecorder rank records and kubelet pod-evicted Events (default window 5s, configurable). M19 integration-test home. - new processor: `patterndetector` — wraps `module/pkg/patterns/`; emits one verdict log record per matched pattern. Today's set: pod-evicted (pattern #14). - bump: `github.com/tracecoreai/tracecore/module` baseline in `builder-config.yaml` moves from `v0.1.0` → `v0.2.0` to track the processor additions; submodule tag cuts post-merge. ``` ## Test plan - [x] `cd module && GOWORK=off go test -race -count=1 ./processor/rankjoinprocessor/... ./processor/patterndetectorprocessor/... ./pkg/...` - [x] root `GOWORK=off go test ./...` - [x] `make build` (OCB end-to-end, ./_build/tracecore registers nccl_fr + rankjoin + patterndetector) - [x] `docker run --rm -v $(pwd)/install/kubernetes/tracecore:/chart alpine/helm:3.16.4 lint /chart` - [x] `make check` (fmt, lint, vet, tidy-check, mod-verify) --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
Merged
9 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
…ORTHSTAR pattern #3) (#255) ## Summary Third real detector in `module/pkg/patterns/` — closes v0.3.0 NORTHSTAR O1 ("3 patterns at v0.1.0") alongside `pod_evicted` (#14) and `nccl_hang` (#15). **Pattern**: a GPU Xid kernel event (carrying the customer-stable `kernelevents.xid` attribute) followed within a configurable window (default 60s) by a pod eviction on the same node → one verdict per (Xid, evicted_pod) tuple. Operator-actionable: "Xid 79 on gpu-node-0001 → training/job-rank-3 evicted 10s later" with remediation pinning the node to drain and the pod to reschedule. **Detector** (`module/pkg/patterns/xid_correlation.go`): - `XidCorrelationDetector` zero-value usable; `CorrelationWindow` overrideable (defaults to `DefaultXidCorrelationWindow = 60s`). - `Evaluate(xids, events)` joins on `(node, time-within-window)` with the most-recent Xid winning the proximate-cause attribution. - Causal-direction guard: eviction BEFORE Xid does not correlate (mirrors pod_evicted's future-transition exclusion). - Deterministic output sorted by `(eviction_time, EventUID)`. **Verdict shape** — single vs multi-layer decision: structurally two evidence surfaces (kernel_event + pod_event in causal order in the evidence trail), but emission rule is "both layers joined or no verdict". Xid-without-eviction is already operator-visible via the raw `kernelevents.xid` telemetry; an Xid-only verdict would duplicate that signal. So `Confidence` / `MissingLayers` are omitted (dead-fields discipline from PR #250 review). Mirrors the `nccl_hang` shape. **Multi-pod decision**: one verdict PER evicted pod (not one verdict listing all). Per-pod is the operator-actionable shape — each verdict's `Remediation` pins a specific pod to drain/recreate so alert routing fans out by pod owner; collapsing would force operators to parse a list and lose per-pod routing. **Schema** (`testdata/xid_correlation_verdict.schema.json`): `additionalProperties:false`, `pattern.id` const="16", `evidence_trail.kind` enum=`["kernel_event","pod_event"]`, `minItems:2` (both layers required). 7-row drift-rejection battery covers re-introduced `confidence`/`missing_layers`, evidence-minimum violation, kind-enum, and `pattern.id` numeric vs string drift. **Wiring** (`patterndetectorprocessor`): - New `Config.XidCorrelationWindow` YAML field (default 60s, floor 1s), validated alongside the existing window fields. - `collectInputs` returns a fourth typed slice (`[]patterns.XidRecord`) built by `projectXidRecord`, which gates on `kernelevents.xid` and reads the host node from the resource attribute `k8s.node.name` (the standard k8sattributes / resourcedetection stamp on a DaemonSet — same pattern `projectNodeCondition` uses). Per-record `k8s.node.name` fallback for non-DaemonSet emitters. - `appendXidCorrelationVerdict` mirrors `appendNCCLHangVerdict`'s wire format — broken-out scalar attrs plus full `pattern.verdict_json`. **Recipe alignment**: `docs/integrations/journald-kernel.md` §"Customer-stable attribute mapping" already documents `kernelevents.xid` (int) as the OTTL-stamped surface. No recipe update needed; the existing pipeline emits exactly what the detector consumes. `gpu.id` (PCI BDF) is also stamped but intentionally not projected — the detector doesn't use it yet, so it stays available on the raw record without entering the pattern library as a dead field. ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/... -race -count=1` — green - [x] `make build` — collector binary compiles - [x] `make check` (golangci-lint + go vet + go mod verify) — 0 issues - [x] TDD red-test-first: both `xid_correlation_test.go` files failed to compile/assert before their impl landed - [x] Detector test cases: positive Xid 79 → eviction, no-eviction negative, cross-node negative, out-of-window edge (61s default), multi-pod per Xid (3 verdicts), window-configurable, pre-Xid eviction excluded, non-evicted-hint ignored, deterministic order, most-recent-Xid-wins - [x] Schema-conformance + 7-row drift-rejection battery - [x] Wiring tests: positive verdict emission, healthy non-emission, window override - [x] `TestConfig_Validate` extended with sub-1s xid_correlation_window floor case - [x] `TestFactory_Surface` pins `DefaultXidCorrelationWindow` (and the previously-unpinned `DefaultNCCLHangThreshold`) on the factory's default config ```release-notes feat(patterns): add xid_correlation detector (v0.3.0 NORTHSTAR pattern #3) — correlates GPU Xid kernel events (`kernelevents.xid` attribute per RFC-0013 §3) with downstream pod evictions on the same node within a configurable window (default 60s). Emits one verdict per evicted pod so alert routing can fan out per pod owner. Wired into patterndetectorprocessor; configurable via `xid_correlation_window` YAML field. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Adds install/kubernetes/tracecore/dashboards/patterns.json — an operator-facing Grafana 10+ dashboard for the three shipped pattern verdicts (pod_evicted #14, nccl_hang #15, xid_correlation #16) plus future-proof templating on pattern.id. Seven panels: 1. Verdict rate by pattern_id (timeseries, LogQL) 2. Top 10 evicted pods — pod_evicted (table, LogQL) 3. Top 10 hung NCCL collectives — nccl_hang (table, LogQL) 4. Top 10 Xid+eviction correlations — xid_correlation (table, LogQL) 5. Verdict count by node (bargauge, LogQL) 6. Confidence distribution full vs partial (piechart, LogQL) 7. patterndetector processor throughput (timeseries, PromQL) The verdict signal lives in OTLP log records (the processor's README explicitly defers self-tel metrics to v0.3 — tracked in #261), so the six verdict-derived panels query Loki via LogQL and the throughput panel queries Prometheus for the upstream-standard otelcol_processor_* items metrics. Two datasource template vars (prometheus_datasource, loki_datasource) keep both backends pluggable on import. Bundled .lint config waives target-promql-rule on the six LogQL panels — the dashboard-linter parses every target as PromQL irrespective of target.datasource.type and fails on the first | in a LogQL pipeline. The target-logql-rule still validates each query. dashboard-linter lint --strict --config .lint patterns.json → exit 0. json.load → 7 panels, 6 templating vars. README documents prerequisites (Prometheus + Loki datasources, the patterndetector processor wired into service.pipelines.logs), three install paths (manual upload, grafana-cli API, kube-prometheus-stack ConfigMap), the pattern → panel coverage matrix, the self-tel gap, and the lint-exclusion justification. Refs #261 Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Adds `install/kubernetes/tracecore/dashboards/patterns.json` — an operator-facing Grafana 10+ dashboard for the three shipped pattern verdicts (pod_evicted #14, nccl_hang #15, xid_correlation #16), with future-proof templating on `pattern.id` for M17/M18 detectors. Closes #280 (LogQL drift vs Loki native OTLP). Issue #270 (scalar attribute promotion) was closed by PR #275 upstream of this rebase. ### LogQL shape: Loki native OTLP structured metadata Verdict log records carry the verdict scalars as OTLP log-record attributes (PR #275): `pattern.id`, `pattern.confidence`, `k8s.pod.name`, `k8s.pod.namespace`, `k8s.node.name`, `k8s.event.reason`, `nccl.fr.pg_id`, `nccl.fr.collective_seq_id`, `nccl.fr.hanging_ranks_count`, `kernelevents.xid`. The Loki backend recipe (PR #278, `docs/integrations/loki.md`) sends these to Loki 3.0+'s native OTLP endpoint (`/otlp/v1/logs`) via the bundled `otlphttp` exporter. Loki's OTLP receiver lands log attributes as **structured metadata**, queryable as direct LogQL label filters — no `| json` parser stage, no `attributes_` prefix. Dots in attribute names normalize to underscores at the LogQL surface (per Loki upstream docs `docs/sources/shared/otel.md` "Format considerations"): - `pattern.id` → `pattern_id` - `pattern.confidence` → `pattern_confidence` - `k8s.pod.name` → `k8s_pod_name` - `nccl.fr.pg_id` → `nccl_fr_pg_id` - `kernelevents.xid` → `kernelevents_xid` All six verdict panels are written against this shape (e.g. `{job=~"$job"} | pattern_id="14" | k8s_node_name=~".+" [$__auto]`). README documents the Loki native-OTLP install path (cross-refs PR #278) and notes the Promtail/Alloy fallback for operators not on native OTLP (those need `| json pattern_id="attributes_pattern_id"` extraction; not shipped in-tree, fork the JSON). ### Self-telemetry counter still on v0.3 roadmap The spec proposed PromQL queries against `otelcol_processor_patterndetector_verdicts_emitted_total`. That metric does not exist yet — the patterndetectorprocessor README says *"No self-telemetry yet. Self-telemetry is on the v0.3 roadmap"*. Per the spec's constraint (*"DO NOT touch detector code or processor code"*), root-causing the missing metric is out of scope for this PR; the six verdict-derived panels query Loki via LogQL in the meantime. Tracked under #261. When that lands, the six LogQL panels swap to PromQL and the Loki dependency drops. Panel 7 (throughput) queries Prometheus against the upstream OTel-Collector standard `otelcol_processor_{incoming,outgoing}_items` which the collector emits for every processor automatically. ### Panels shipped (7) | # | Title | Datasource | Patterns covered | |---|---|---|---| | 1 | Verdict rate by pattern_id | Loki / LogQL | 14, 15, 16 (templated) | | 2 | Top 10 evicted pods | Loki / LogQL | 14 (pod_evicted) | | 3 | Top 10 hung NCCL collectives | Loki / LogQL | 15 (nccl_hang) | | 4 | Top 10 Xid+eviction correlations | Loki / LogQL | 16 (xid_correlation) | | 5 | Verdict count by node | Loki / LogQL | all (templated) | | 6 | Confidence distribution (full vs partial) | Loki / LogQL | all (templated) | | 7 | patterndetector processor throughput | Prom / PromQL | (pipeline liveness) | ### Templating vars (6) - `prometheus_datasource`, `loki_datasource` — datasource selectors (no hardcoded UIDs). - `job`, `instance` — linter-mandated PromQL matchers, populated from `otelcol_process_uptime`. - `cluster` — multi-cluster slice, populated from `otelcol_process_uptime`. - `pattern_id` — custom-options var seeded with the three shipped IDs (14/15/16). Extend in-place when new detectors land. ### Linter exclusion (justified) `.lint` config waives `target-promql-rule` on the six Loki panels. The dashboard-linter parses every target as PromQL irrespective of `target.datasource.type` and fails on the first `|` in a valid LogQL pipeline. `target-logql-rule` still validates each query as LogQL and passes. Full justification in `install/kubernetes/tracecore/dashboards/README.md` §"Linter exclusions" and inline in the `.lint` `reason:` block. Removable once issue #261 swaps the panels to PromQL. ## Test plan - [x] `dashboard-linter lint --strict --config .lint patterns.json` → exit 0 (built from source — `go install` rejects the linter's own go.mod replace directives; build steps in `install/kubernetes/tracecore/dashboards/README.md`). - [x] `python3 -c "import json; json.load(open('install/kubernetes/tracecore/dashboards/patterns.json'))"` → exit 0; 7 panels, 6 templating vars confirmed. - [x] `make check` (golangci-lint + vet + tidy-check + mod verify) → exit 0. - [x] `make verify` (license-check + doc-check + register-lint + actionlint + zizmor + no-autoupdate) → exit 0. - [x] LogQL shape verified against Loki upstream docs (`docs/sources/shared/otel.md` "Format considerations" — dots and special characters normalize to underscores; no `attributes_` prefix on the native OTLP surface) and `docs/sources/send-data/otel/native_otlp_vs_loki_exporter.md` query examples (`{service_name="auth"} | severity_text="INFO"`). - [x] Promoted scalar attrs verified against `module/processor/patterndetectorprocessor/patterndetector.go` `VerdictAttr*` constants (lines 25-88) and `appendVerdict` / `appendNCCLHangVerdict` / `appendXidCorrelationVerdict` emitters (lines 510-517+). - [ ] Manual smoke test against a live cluster with Loki 3.0+ native OTLP receiver (deferred — adversarial reviewer to verify panels render against actual OTLP-native Loki output before merge). ```release-notes Ship Grafana dashboard for tracecore's pattern verdicts: install/kubernetes/tracecore/dashboards/patterns.json. Seven panels cover the three shipped pattern detectors (pod_evicted #14, nccl_hang #15, xid_correlation #16) plus templated pattern.id for future M17/M18 patterns. LogQL queries target Loki 3.0+ native OTLP structured metadata (pairs with the Loki backend recipe). Includes README install guide (manual upload, grafana-cli, kube-prometheus-stack ConfigMap), Promtail/Alloy fallback notes, and pattern coverage matrix. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Closes #331. Replaces the single `pattern-pod-evicted` chaos job with a matrix-strategy `pattern-detectors` job covering every detector named in v1-rc1 cut-criterion §1: `hbm_ecc`, `thermal_throttle`, `pcie_aer`, `pod_evicted`, `nccl_hang`, `xid_correlation`. Each row runs the detector's positive + negative + schema `*_test.go` prefix under `-race`: - **Positive / Edge asserts** — verdict emitted with correct `pattern.id` within bounded time. - **Negative / SchemaRejectsDrift asserts** — zero verdicts on baseline / confounder input (the false-positive gate the chaos matrix exists to enforce). The `pod_evicted` row additionally runs the on-disk replay corpus under `module/pkg/replay/pod_evicted/`; sibling rows skip that step until their `replay/<pattern>/` fixtures land — tracked as Path A in [`docs/v1-rc1-test-audit.md`](../blob/main/docs/v1-rc1-test-audit.md) §4. ## Why this matrix shape, not new `failure-inject` modes Audit §4 explicitly recommends Path A (matrix over existing hermetic detector tests) for rc1 and defers Path B (per-pattern `failure-inject` CLI modes) to v1.x. Path A reuses the race-detector + golden-SHA shape already validated by `pattern-pod-evicted` and costs ~60 LoC of workflow YAML instead of five new CLI subcommands. ## Wall-time budget Observed per-row runtime under `-race` (measured locally on this worktree): | Row | runtime | |---|---| | hbm_ecc | 1.35s | | thermal_throttle | 1.33s | | pcie_aer | 1.30s | | pod_evicted (incl. replay corpus) | 1.35s + 1.34s | | nccl_hang | 1.31s | | xid_correlation | 1.32s | Per-row `timeout-minutes: 5` is ~3× headroom; matrix parallelism keeps total wall-time well inside the chaos-gate budget. ## Test plan - [x] `actionlint .github/workflows/chaos.yml` — clean. - [x] `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/chaos.yml'))"` — parses, 3 jobs (`harness-determinism`, `cpu-steal-mpstat`, `pattern-detectors`), 6-row matrix. - [x] For every matrix row, `cd module && go test -race -count=1 -run '<regex>' ./pkg/patterns/...` — all green (counts: HBMECC=16, ThermalThrottle=17, PCIeAER=15, PodEvicted+PressureFromNote=12, NCCLHang=10, XidCorrelation=14). - [x] `cd module && go test -race -count=1 ./pkg/replay/...` (pod_evicted corpus step) — green. - [x] Pre-commit hook: golangci-lint + go vet + go mod verify + attribute-namespace-check — all green. - [ ] Workflow runs green on `main` after merge (verifiable only post-merge — nightly cron + push-to-main). ```release-notes ci: chaos.yml now matrix-tests all six shipped detectors (#14 pod_evicted, #8 nccl_hang, #3 hbm_ecc, #4 thermal_throttle, #5 pcie_aer, xid_correlation) instead of just pod_evicted. Closes the chaos-coverage gap called out in v1-rc1-test-audit §4. No operator-visible behavior change. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Lands replay corpora for the seven pattern detectors that lacked one, closing the Path-A test gap named in the v1-rc1 audit (#366). Each pattern now ships `module/pkg/replay/<pattern>/{canonical,_negative, _real_world}/` fixtures plus a `*_replay_test.go` runner that JSON-eqs detector output against the on-disk golden. Detectors covered: `hbm_ecc` (#3), `thermal_throttle` (#4), `pcie_aer` (#5), `ib_link_flap` (#2), `nccl_hang` (#15), `cuda_oom` (#10), `xid_correlation` (#16). `pod_evicted` (#14) corpus is unchanged. ## Design - Each detector takes a different input shape (e.g. `HBMECCRecord + XidRecord` for `hbm_ecc`, `ThermalThrottleRecord` for `thermal_throttle`, ...), so the existing `LoadFixturesUnder` helper (typed on `Record + NodeRecord`) cannot be reused. Each detector gets its own `*_replay_test.go` that inlines the per-detector JSON read; shared fixture-discovery and golden-assert helpers live in `helpers_test.go`. - Two detectors (`nccl_hang`, `ib_link_flap`) take a `Now` reference. Tests pin `Now` to a fixed timestamp matching the fixture's `started_ns` so hang-age and flap-window inclusion stay deterministic across replay runs (otherwise wall-clock drift would silently flip the verdicts as the fixture aged). - Goldens were generated from the live detectors via `UPDATE_REPLAY_GOLDEN=1 go test ./module/pkg/replay/...` and pinned; future drift in detector output (headline / remediation prose, evidence-trail UID shape, scalar-field rename) surfaces as a `JSONEq` diff against the fixture. Operators can also eyeball the golden to assert what they EXPECT vs what fires. - Negative fixtures each exercise a distinct discriminator (wrong Xid code, single GPU, no AER, no eviction, completed state, single transition, no OOM log) so a regression in one false-positive guard lights up the corresponding row only. - Flipped `run_corpus: true` on every row of the chaos.yml pattern-detectors matrix now that every detector has a corpus. ## Test plan - [x] `make check` — clean - [x] `go test -race -count=1 ./pkg/replay/...` — 35 tests pass (28 new + 7 pre-existing pod_evicted) - [x] `go test -count=1 ./processor/patterndetectorprocessor/...` — unchanged, still green - [x] `make verify` (pre-push) — clean - [ ] CI: chaos.yml `pattern-detectors` matrix — 8 rows, each runs hermetic regex + replay-corpus step Closes #366. ```release-notes NONE ``` Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
3 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 4, 2026
Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr
added a commit
that referenced
this pull request
Jun 4, 2026
## Summary Closes the missing-pattern-docs gap in `docs/patterns/` for the two numerically-missing IDs (#6, #14) against NORTHSTARS Appendix A's 15-pattern roster. ## Investigation Surveyed `module/pkg/patterns/`, NORTHSTARS Appendix A, and MILESTONES M18/M19 to pin actual detector state per ID: | ID | NORTHSTARS Appendix A claim | Actual repo state | Decision | |---|---|---|---| | #6 Stragglers (M18) | `☑ in-tree detector` | **No detector exists** — no `straggler*.go`, no `data_time` aggregation, no replay fixture. Only `failure-inject cpu-steal` chaos hook (M4b) lands the symptom. MILESTONES line 237 acknowledges M18 is at-risk and build-time-coupled to M17's `cross_rank.go`. The Appendix A mark is aspirational, not current state. | **Option B** — README gap-note explicitly documenting the unimplemented state + linking the M18 milestone. Reconciliation of Appendix A's claim is left for the detector-shipping PR. | | #14 Pod evicted (M19) | `☑ in-tree detector` | Shipped — `patterns.PodEvictedDetector` in `module/pkg/patterns/pod_evicted.go`, processor wiring in `module/processor/patterndetectorprocessor/`, canonical replay fixture at `module/pkg/replay/pod_evicted/canonical/`. | **Option A** — backfill full operator walkthrough `14-pod-evicted-walkthrough.md` following the `02-ib-link-flap-walkthrough.md` template (Symptom → Signal → Query → Alert → Escalation → Replay → Detector status → Verdict shape → Integration recipe). | Also surfaced #15 (Image pull / `FailedMount` on restart) — present in Appendix A but absent from this directory. Added to the same "Reserved / unfilled" table for completeness so every Appendix A ID is accounted for somewhere in `docs/patterns/`. ## Changes - `docs/patterns/14-pod-evicted-walkthrough.md` (new, ~210 lines) — full operator walkthrough sourced from `module/pkg/patterns/pod_evicted.go` godoc, `verdict.go` `PodEvictedVerdict` struct, processor config knobs (`join_window`, `emit_partial_verdicts`), the `k8sobjectsreceiver` integration recipe, and the canonical replay fixture's headline regex contract. - `docs/patterns/README.md` — adds the #14 row to the Operator walkthroughs (shipped) table; rewrites the coverage prose (5 → 7 patterns walkthrough'd); adds a new "Reserved / unfilled" section documenting #6 and #15 explicitly so the numeric gaps in the existing tables are no longer silent. ## Test plan - [x] `make doc-check` green — all markdown links + anchors resolve, banned-phrase lint clean, walkthrough renders. - [x] Pre-commit hooks pass (`golangci-lint`, `go vet`, `go mod verify`, `attribute-namespace-check`, `deprecation-check`). - [x] Cross-checked every link in the new walkthrough resolves to a real on-disk file (`pod_evicted.go`, `pod_evicted_test.go`, `pod_evicted_bench_test.go`, `replay/pod_evicted/canonical/`, `k8sobjects-events.md`, `ATTRIBUTES.md`, `example_config.yaml`). ```release-notes docs(patterns): add operator walkthrough for pattern #14 (pod evicted); document pattern #6 + #15 gaps in the patterns README. ``` --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Closes the one FOLLOWUPS item from PR #13 that could only be observed
post-merge: whether GitHub's squash-merge preserves the per-commit
Assisted-by:trailers. Observed on squash commita57883f: itdoes not. The audit trail survives via the PR description (embedded
in the squash commit body, listing every source SHA), but the
trailer-level disclosure AGENTS.md describes is gone.
Documents the finding + flags two governance paths that aren't being
chosen unilaterally:
squash-merge disclosure, or
trailers survive.
Tag
v0.1.0-m1was also created ona57883f(dev-preview, not arelease artifact) — pushed alongside this branch but separately.
Linked issue(s)
Refs PR #13 (post-merge observation). N/A — no GitHub issue.
Release notes
Checklist
make cipasses locallygit commit -s)scripts/setup-signing.sh)