feat(patterns): xid_correlation detector + processor wiring (v0.3.0 NORTHSTAR pattern #3) by trilamsr · Pull Request #255 · TraceCoreAI/tracecore

trilamsr · 2026-06-01T02:55:55Z

Summary

Third real detector in module/pkg/patterns/ — closes v0.3.0 NORTHSTAR O1 ("3 patterns at v0.1.0") alongside pod_evicted (#14) and nccl_hang (#15).

Pattern: a GPU Xid kernel event (carrying the customer-stable kernelevents.xid attribute) followed within a configurable window (default 60s) by a pod eviction on the same node → one verdict per (Xid, evicted_pod) tuple. Operator-actionable: "Xid 79 on gpu-node-0001 → training/job-rank-3 evicted 10s later" with remediation pinning the node to drain and the pod to reschedule.

Detector (module/pkg/patterns/xid_correlation.go):

- XidCorrelationDetector is zero-value usable; CorrelationWindow is overridable (defaults to DefaultXidCorrelationWindow = 60s).
- Evaluate(xids, events) joins on (node, time-within-window), with the most recent Xid winning the proximate-cause attribution.
- Causal-direction guard: an eviction BEFORE the Xid does not correlate (mirrors pod_evicted's future-transition exclusion).
- Deterministic output sorted by (eviction_time, EventUID).

Verdict shape — single vs multi-layer decision: structurally two evidence surfaces (kernel_event + pod_event in causal order in the evidence trail), but emission rule is "both layers joined or no verdict". Xid-without-eviction is already operator-visible via the raw kernelevents.xid telemetry; an Xid-only verdict would duplicate that signal. So Confidence / MissingLayers are omitted (dead-fields discipline from PR #250 review). Mirrors the nccl_hang shape.

Multi-pod decision: one verdict PER evicted pod (not one verdict listing all). Per-pod is the operator-actionable shape — each verdict's Remediation pins a specific pod to drain/recreate so alert routing fans out by pod owner; collapsing would force operators to parse a list and lose per-pod routing.

Schema (testdata/xid_correlation_verdict.schema.json): additionalProperties:false, pattern.id const="16", evidence_trail.kind enum=["kernel_event","pod_event"], minItems:2 (both layers required). 7-row drift-rejection battery covers re-introduced confidence/missing_layers, evidence-minimum violation, kind-enum, and pattern.id numeric vs string drift.
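
To illustrate what the schema constraints reject, here is a rough Go analogue of the same checks. It is a sketch only: the struct layout (a nested `pattern` object and an `evidence_trail` array of objects with a `kind` field) is an assumption about the wire shape, and the real contract is whatever the JSON Schema file pins. `DisallowUnknownFields` plays the role of additionalProperties:false, and decoding `pattern.id` into a Go string rejects a numeric id.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// EvidenceRef, Pattern, and Verdict are hypothetical; field names are
// guessed from the constraint names quoted in the PR description.
type EvidenceRef struct {
	Kind string `json:"kind"`
}

type Pattern struct {
	ID string `json:"id"`
}

type Verdict struct {
	Pattern       Pattern       `json:"pattern"`
	EvidenceTrail []EvidenceRef `json:"evidence_trail"`
}

// DecodeStrict decodes a verdict and applies checks analogous to the
// schema: no extra fields, pattern.id == "16", at least two evidence
// entries, and only the two known evidence kinds.
func DecodeStrict(raw []byte) (Verdict, error) {
	var v Verdict
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields() // analogue of additionalProperties:false
	if err := dec.Decode(&v); err != nil {
		return Verdict{}, err
	}
	if v.Pattern.ID != "16" {
		return Verdict{}, errors.New(`pattern.id must be the string "16"`)
	}
	if len(v.EvidenceTrail) < 2 {
		return Verdict{}, errors.New("evidence_trail needs both layers")
	}
	for _, e := range v.EvidenceTrail {
		if e.Kind != "kernel_event" && e.Kind != "pod_event" {
			return Verdict{}, fmt.Errorf("unknown evidence kind %q", e.Kind)
		}
	}
	return v, nil
}

func main() {
	good := []byte(`{"pattern":{"id":"16"},"evidence_trail":[{"kind":"kernel_event"},{"kind":"pod_event"}]}`)
	drift := []byte(`{"pattern":{"id":"16"},"confidence":"full","evidence_trail":[{"kind":"kernel_event"},{"kind":"pod_event"}]}`)
	_, errGood := DecodeStrict(good)
	_, errDrift := DecodeStrict(drift)
	fmt.Println("good accepted:", errGood == nil)
	fmt.Println("re-introduced confidence rejected:", errDrift != nil)
}
```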

Wiring (patterndetectorprocessor):

- New Config.XidCorrelationWindow YAML field (default 60s, floor 1s), validated alongside the existing window fields.
- collectInputs returns a fourth typed slice ([]patterns.XidRecord) built by projectXidRecord, which gates on kernelevents.xid and reads the host node from the resource attribute k8s.node.name (the standard k8sattributes / resourcedetection stamp on a DaemonSet; projectNodeCondition uses the same pattern). A per-record k8s.node.name fallback covers non-DaemonSet emitters.
- appendXidCorrelationVerdict mirrors appendNCCLHangVerdict's wire format: broken-out scalar attrs plus full pattern.verdict_json.

Recipe alignment: docs/integrations/journald-kernel.md §"Customer-stable attribute mapping" already documents kernelevents.xid (int) as the OTTL-stamped surface. No recipe update needed; the existing pipeline emits exactly what the detector consumes. gpu.id (PCI BDF) is also stamped but intentionally not projected — the detector doesn't use it yet, so it stays available on the raw record without entering the pattern library as a dead field.

Test plan

- cd module && go test ./pkg/patterns/... ./processor/... -race -count=1: green
- make build: collector binary compiles
- make check (golangci-lint + go vet + go mod verify): 0 issues
- TDD red-test-first: both xid_correlation_test.go files failed to compile/assert before their impl landed
- Detector test cases: positive Xid 79 → eviction, no-eviction negative, cross-node negative, out-of-window edge (61s default), multi-pod per Xid (3 verdicts), window-configurable, pre-Xid eviction excluded, non-evicted-hint ignored, deterministic order, most-recent-Xid-wins
- Schema-conformance + 7-row drift-rejection battery
- Wiring tests: positive verdict emission, healthy non-emission, window override
- TestConfig_Validate extended with sub-1s xid_correlation_window floor case
- TestFactory_Surface pins DefaultXidCorrelationWindow (and the previously-unpinned DefaultNCCLHangThreshold) on the factory's default config

feat(patterns): add xid_correlation detector (v0.3.0 NORTHSTAR pattern #3) — correlates GPU Xid kernel events (`kernelevents.xid` attribute per RFC-0013 §3) with downstream pod evictions on the same node within a configurable window (default 60s). Emits one verdict per evicted pod so alert routing can fan out per pod owner. Wired into patterndetectorprocessor; configurable via `xid_correlation_window` YAML field.

Third real detector in module/pkg/patterns/ (v0.3.0 NORTHSTAR pattern #3). XidCorrelationDetector reads GPU Xid kernel events (carrying the customer-stable kernelevents.xid attribute per RFC-0013 §3) and Kubernetes Pod eviction records, then emits one XidCorrelationVerdict per (Xid → Pod-Evicted) join inside a configurable correlation window (default 60s).

Single-vs-multi-layer decision: structurally the pattern has TWO evidence surfaces (kernel_event + pod_event), so every verdict carries both refs in causal order. But the emission rule is "both layers joined or no verdict": there is no partial path because Xid-without-eviction is already operator-visible via the raw kernelevents.xid telemetry, and emitting an Xid-only verdict would duplicate that signal. Confidence + MissingLayers are therefore omitted from the verdict shape (dead-field discipline from PR #250 review). Output mirrors the nccl_hang verdict shape.

Multi-pod decision: one verdict PER evicted pod (not one verdict listing all pods). Per-pod is the operator-actionable shape: each verdict's Remediation pins a specific pod to drain/recreate so alert routing fans out by pod owner; collapsing into one verdict would force the operator to parse a list and lose per-pod routing.

Schema (testdata/xid_correlation_verdict.schema.json) pins the wire shape with additionalProperties:false, pattern.id const="16", evidence_trail.kind enum=["kernel_event","pod_event"], and minItems:2 (both layers required). 7-row drift-rejection battery covers confidence/missing_layers re-introduction, evidence-minimum, kind-enum, and pattern.id type drift.

Causal-direction guard: eviction BEFORE the Xid does NOT correlate, mirroring pod_evicted's future-transition exclusion. Same-node required; cross-node Xid + eviction emits zero verdicts.

Signed-off-by: Tri Lam <tri@maydow.com>

Surfaces the v0.3.0 NORTHSTAR pattern #3 detector via the patterndetectorprocessor data path. New Config.XidCorrelationWindow YAML field (yaml:"xid_correlation_window") defaults to 60s with a 1s floor, mirroring the JoinWindow / NCCLHangThreshold style.

Projection: collectInputs returns a fourth typed slice ([]patterns.XidRecord) built by projectXidRecord, which gates on the customer-stable `kernelevents.xid` attribute (RFC-0013 §3, the contract the journald-kernel OTTL recipe stamps) and reads the hosting node from resource-attr `k8s.node.name` (with a per-record fallback). The recipe's OTTL also stamps `gpu.id` (PCI BDF). We intentionally don't project it onto XidRecord today because the detector doesn't use it; the contract is preserved on the raw record and can be picked up later if a remediation needs it.

Emission: appendXidCorrelationVerdict mirrors appendNCCLHangVerdict's wire-format contract (broken-out scalar attrs pattern.id, pattern.headline, pattern.remediation, plus full pattern.verdict_json) so downstream consumers don't branch on pattern.id to find headline/remediation. No pattern.confidence emitted (the xid_correlation verdict shape has no Confidence field, per the detector commit's dead-fields decision).

Test plan: positive Xid → eviction emits one verdict; Xid alone emits none; XidCorrelationWindow surfaces the detector's window via YAML override; TestConfig_Validate gains a sub-1s xid floor case; TestFactory_Surface pins the new default constant.

Signed-off-by: Tri Lam <tri@maydow.com>

Per fresh-context review of #255: EdgeOutOfWindow pinned the strict-greater-than at 61s but no test fenced the at-boundary case. Adding EdgeAtWindowBoundary asserts eviction at exactly window after Xid MUST correlate — flipping mostRecentXidWithin from `>` to `>=` now fails a test. Signed-off-by: Tri Lam <tri@maydow.com>

NORTHSTAR pattern #4 (thermal_throttle): same-node cascade of >=4 GPUs each accumulating >=30s of hw.gpu.throttle.duration{reason="thermal"} within a 5min rolling window. Single-layer pattern: no cross-signal join. The cascade itself is the operator-actionable signal (spec escalation tier 2: HVAC/facilities); single-GPU throttling (tier 1: rack-airflow) is left to raw-counter dashboards since pattern #4's NORTHSTAR value is the facility-level signal.

- ThermalThrottleRecord model + ThermalThrottleDetector w/ Window, MinCascadeGPUs, ThrottleDeltaThreshold (defaults: 5min, 4, 30s, mirroring the spec's PromQL `[5m]` rate window + `> 30` threshold + escalation tier 2 "half the rack" cutoff for an 8-GPU node).
- Deterministic sort: verdicts by node, GPUIDs within each verdict.
- Per-GPU delta-summing inside the window so multi-scrape coverage (15s scrape intervals × 4 scrapes) clears the 30s bar correctly.
- Schema: testdata/thermal_throttle_verdict.schema.json w/ 10 drift-rejection falsifiers (additionalProperties:false, pattern.id const, gpu_count minimum, gpu_ids minItems, evidence_trail enum + minItems).
- Window-edge inclusivity fenced both sides (paired tests) to catch a future `<=` -> `<` flip (lesson from PR #255).

Layer decision: single-layer (multi-GPU same-node, no cross-signal join). Justification: the spec's alert is single-signal PromQL (`hw.gpu.throttle.duration{reason="thermal"}`); a Xid or pod-event join would not add information operators don't already have from the cascade size + node alone.

Integration: detector library only. The patterndetectorprocessor metrics path is deferred to PR-B (ADR-0001); until that lands the detector cannot fire end-to-end. Tracked in a follow-up issue mirroring the HBM ECC gap (#273).

Signed-off-by: Tri Lam <tri@maydow.com>

Pattern #10 - CUDA OOM, deceptive allocator - per NORTHSTARS Appendix A row #10 and the design spec at docs/patterns/10-cuda-oom-deceptive.md.

Detector evaluation rule
- per-OOM, look up most-recent same-GPU FB sample within CorrelationWindow (default 2min, forward-only - fb.Timestamp <= oom.Timestamp)
- if fb_free_ratio >= FBFreeFragmentationThreshold (default 0.05) -> kind=fragmentation (raise max_split_size_mb, empty_cache)
- if fb_free_ratio < threshold -> kind=true_oom (shrink batch, shard)
- if no FB sample joins -> kind=unknown, confidence=partial

Discriminator value
- fragmentation vs true-OOM is the operator's #1 question on a CUDA OOM
- without DCGM cross-check the operator retries with same batch, hits same OOM, wastes a slot
- partial-confidence verdict surfaces the OOM even when DCGM scrape lags, so the operator branches on concurrent pod_evicted / xid_correlation rather than silence

Files
- module/pkg/patterns/cuda_oom.go - detector + verdict + records
- module/processor/patterndetectorprocessor/cuda_oom.go - projections, collectCUDAOOMInputs, appendCUDAOOMVerdict, runCUDAOOMDetector
- module/processor/patterndetectorprocessor/cuda_oom_test.go - 7 wiring tests + 2 Validate guards
- module/processor/patterndetectorprocessor/example_config.yaml - cuda_oom_correlation_window + cuda_oom_fb_free_fragmentation_threshold knobs
- docs/ATTRIBUTES.md - hw.gpu.memory.{free,total} namespace entries

Scalar promotions per issue #270 contract: gpu.id, k8s.{pod,node}.*, cuda_oom.kind, cuda_oom.tried_alloc_bytes, cuda_oom.fb_free_bytes, cuda_oom.fb_free_ratio, pattern.confidence.

Window-edge fenced both sides per PR #255 lesson. Threshold-boundary fenced inclusive per same lesson. Most-recent-pre-OOM rule mirrors xid_correlation / pcie_aer / hbm_ecc.

Integration-gap follow-ups (tracked separately on PR body):
- DCGM_FI_DEV_FB_USED/FREE OTTL recipe extension (sibling to #273)
- filelogreceiver OTTL stanza for CUDA OOM regex parsing (sibling to #285)
- metrics-path on patterndetectorprocessor per ADR-0001 (PR-B)

Tests
- 17 detector tests in module/pkg/patterns/cuda_oom_test.go (filed in red commit, now green)
- 11 schema-drift falsifier sub-tests on CUDAOOMVerdict
- 7 wiring + 2 Validate tests in processor cuda_oom_test.go
- all 35 green; full ./pkg/patterns + ./processor/patterndetectorprocessor suites green with -race; make check + make build green

Refs #303

Signed-off-by: Tri Lam <tri@maydow.com>

Tri Lam added 3 commits May 31, 2026 19:49

trilamsr enabled auto-merge (squash) June 1, 2026 03:02

trilamsr merged commit e5fbf23 into main Jun 1, 2026
15 checks passed

trilamsr deleted the feat/xid-correlation-detector branch June 1, 2026 03:10

trilamsr mentioned this pull request Jun 1, 2026

A4: CUDA OOM detector + fragmentation-vs-true-OOM discriminator (pattern #10) #303

Closed

trilamsr merged 3 commits into main from feat/xid-correlation-detector
