feat(patterns): xid_correlation detector + processor wiring (v0.3.0 NORTHSTAR pattern #3) #255
Merged
added 3 commits
May 31, 2026 19:49
Third real detector in module/pkg/patterns/ (v0.3.0 NORTHSTAR pattern #3). XidCorrelationDetector reads GPU Xid kernel events (carrying the customer-stable kernelevents.xid attribute per RFC-0013 §3) and Kubernetes Pod eviction records, then emits one XidCorrelationVerdict per (Xid → Pod-Evicted) join inside a configurable correlation window (default 60s).

Single-vs-multi-layer decision: structurally the pattern has TWO evidence surfaces (kernel_event + pod_event), so every verdict carries both refs in causal order. But the emission rule is "both layers joined or no verdict": there is no partial path because Xid-without-eviction is already operator-visible via the raw kernelevents.xid telemetry, and emitting an Xid-only verdict would duplicate that signal. Confidence + MissingLayers are therefore omitted from the verdict shape (dead-field discipline from PR #250 review). Output mirrors the nccl_hang verdict shape.

Multi-pod decision: one verdict PER evicted pod (not one verdict listing all pods). Per-pod is the operator-actionable shape: each verdict's Remediation pins a specific pod to drain/recreate so alert routing fans out by pod owner; collapsing into one verdict would force the operator to parse a list and lose per-pod routing.

Schema (testdata/xid_correlation_verdict.schema.json) pins the wire shape with additionalProperties:false, pattern.id const="16", evidence_trail.kind enum=["kernel_event","pod_event"], and minItems:2 (both layers required). 7-row drift-rejection battery covers confidence/missing_layers re-introduction, evidence-minimum, kind-enum, and pattern.id type drift.

Causal-direction guard: eviction BEFORE the Xid does NOT correlate, mirroring pod_evicted's future-transition exclusion. Same-node required; cross-node Xid + eviction emits zero verdicts.

Signed-off-by: Tri Lam <tri@maydow.com>
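The join rules in this commit message (same node, eviction no earlier than the Xid, inside the window, most recent Xid wins, one verdict per evicted pod) can be sketched as follows. This is a minimal illustration, not the detector's code: the `XidEvent`, `Eviction`, and `Verdict` types, their fields, and `correlate` are hypothetical names, and treating an eviction at the same instant as the Xid as correlating is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// XidEvent, Eviction, and Verdict are hypothetical, simplified stand-ins
// for the detector's records; the real field names are not shown in the PR.
type XidEvent struct {
	Node string
	Xid  int
	At   time.Time
}

type Eviction struct {
	Node string
	Pod  string
	At   time.Time
}

type Verdict struct {
	Node  string
	Pod   string
	Xid   int
	After time.Duration // how long after the Xid the pod was evicted
}

// correlate emits one verdict per evicted pod whose node saw an Xid at or
// before the eviction and at most window earlier. When several Xids
// qualify, the most recent one wins the attribution.
func correlate(xids []XidEvent, evictions []Eviction, window time.Duration) []Verdict {
	var out []Verdict
	for _, ev := range evictions {
		var best *XidEvent
		for i := range xids {
			x := &xids[i]
			if x.Node != ev.Node {
				continue // same-node required
			}
			gap := ev.At.Sub(x.At)
			if gap < 0 || gap > window {
				continue // eviction before the Xid, or outside the window
			}
			if best == nil || x.At.After(best.At) {
				best = x
			}
		}
		if best != nil {
			out = append(out, Verdict{Node: ev.Node, Pod: ev.Pod, Xid: best.Xid, After: ev.At.Sub(best.At)})
		}
	}
	return out
}

// demoVerdicts runs the join on a small, made-up scenario.
func demoVerdicts() []Verdict {
	t0 := time.Date(2026, 5, 31, 12, 0, 0, 0, time.UTC)
	xids := []XidEvent{
		{Node: "gpu-node-0001", Xid: 48, At: t0.Add(-30 * time.Second)},
		{Node: "gpu-node-0001", Xid: 79, At: t0},
		{Node: "gpu-node-0002", Xid: 79, At: t0},
	}
	evictions := []Eviction{
		{Node: "gpu-node-0001", Pod: "training/job-rank-3", At: t0.Add(10 * time.Second)},
		{Node: "gpu-node-0003", Pod: "training/job-rank-7", At: t0.Add(10 * time.Second)}, // no Xid on this node
		{Node: "gpu-node-0002", Pod: "training/job-rank-5", At: t0.Add(-5 * time.Second)}, // evicted before the Xid
	}
	return correlate(xids, evictions, 60*time.Second)
}

func main() {
	for _, v := range demoVerdicts() {
		fmt.Printf("Xid %d on %s -> %s evicted %s later\n", v.Xid, v.Node, v.Pod, v.After)
	}
}
```

Only the first eviction produces a verdict here: the second has no Xid on its node, and the third precedes the Xid on its node.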
Surfaces the v0.3.0 NORTHSTAR pattern #3 detector via the patterndetectorprocessor data path. New Config.XidCorrelationWindow YAML field (yaml:"xid_correlation_window") defaults to 60s with a 1s floor, mirroring the JoinWindow / NCCLHangThreshold style.

Projection: collectInputs returns a fourth typed slice ([]patterns.XidRecord) built by projectXidRecord, which gates on the customer-stable `kernelevents.xid` attribute (RFC-0013 §3, the contract the journald-kernel OTTL recipe stamps) and reads the hosting node from resource-attr `k8s.node.name` (with a per-record fallback). The recipe's OTTL also stamps `gpu.id` (PCI BDF); we intentionally don't project it onto XidRecord today because the detector doesn't use it. The contract is preserved on the raw record and can be picked up later if a remediation needs it.

Emission: appendXidCorrelationVerdict mirrors appendNCCLHangVerdict's wire-format contract (broken-out scalar attrs pattern.id, pattern.headline, pattern.remediation, plus full pattern.verdict_json) so downstream consumers don't branch on pattern.id to find headline/remediation. No pattern.confidence emitted (the xid_correlation verdict shape has no Confidence field, per the detector commit's dead-fields decision).

Test plan: positive Xid → eviction emits one verdict; Xid alone emits none; XidCorrelationWindow surfaces the detector's window via YAML override; TestConfig_Validate gains a sub-1s xid floor case; TestFactory_Surface pins the new default constant.

Signed-off-by: Tri Lam <tri@maydow.com>
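The default-plus-floor behaviour of the new window field might look roughly like the sketch below. Only `Config.XidCorrelationWindow`, its YAML tag, and `DefaultXidCorrelationWindow` come from the PR; `minXidCorrelationWindow`, `defaultConfig`, `validateMillis`, and the error text are hypothetical, and the real `Validate` covers other fields too.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// DefaultXidCorrelationWindow is the 60s default named in the PR.
	DefaultXidCorrelationWindow = 60 * time.Second
	// minXidCorrelationWindow is a hypothetical name for the 1s floor.
	minXidCorrelationWindow = time.Second
)

// Config shows only the field this commit adds.
type Config struct {
	XidCorrelationWindow time.Duration `yaml:"xid_correlation_window"`
}

// defaultConfig stands in for the factory's default config.
func defaultConfig() Config {
	return Config{XidCorrelationWindow: DefaultXidCorrelationWindow}
}

// Validate rejects windows below the floor.
func (c Config) Validate() error {
	if c.XidCorrelationWindow < minXidCorrelationWindow {
		return fmt.Errorf("xid_correlation_window must be >= %s, got %s",
			minXidCorrelationWindow, c.XidCorrelationWindow)
	}
	return nil
}

// validateMillis is a demo helper that validates a window given in milliseconds.
func validateMillis(ms int) error {
	return Config{XidCorrelationWindow: time.Duration(ms) * time.Millisecond}.Validate()
}

func main() {
	fmt.Println("default:", defaultConfig().Validate())
	fmt.Println("999ms: ", validateMillis(999))
	fmt.Println("1000ms:", validateMillis(1000))
}
```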
Per fresh-context review of #255: EdgeOutOfWindow pinned the strict greater-than at 61s, but no test fenced the at-boundary case. The new EdgeAtWindowBoundary test asserts that an eviction at exactly the window length after the Xid MUST correlate, so flipping mostRecentXidWithin from `>` to `>=` now fails a test.

Signed-off-by: Tri Lam <tri@maydow.com>
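To illustrate why the boundary needs a fence on both sides, here is a small table-driven sketch. The case names `EdgeAtWindowBoundary` and `EdgeOutOfWindow` come from the commit message; the predicates, the `edgeCase` type, and `failedCases` are hypothetical and only model the comparison, not `mostRecentXidWithin` itself.

```go
package main

import (
	"fmt"
	"time"
)

// outOfWindowStrict models the intended rule: a gap is out of window only
// when it is strictly greater than the window.
func outOfWindowStrict(gap, window time.Duration) bool { return gap > window }

// outOfWindowFlipped models the regression the new test is meant to catch.
func outOfWindowFlipped(gap, window time.Duration) bool { return gap >= window }

type edgeCase struct {
	Name          string
	GapSeconds    int
	WantCorrelate bool
}

// edgeCases fences the 60s window from both sides.
func edgeCases() []edgeCase {
	return []edgeCase{
		{Name: "InsideWindow", GapSeconds: 59, WantCorrelate: true},
		{Name: "EdgeAtWindowBoundary", GapSeconds: 60, WantCorrelate: true},
		{Name: "EdgeOutOfWindow", GapSeconds: 61, WantCorrelate: false},
	}
}

// failedCases returns the names of the cases a given predicate gets wrong.
func failedCases(outOfWindow func(gap, window time.Duration) bool) []string {
	window := 60 * time.Second
	var failed []string
	for _, tc := range edgeCases() {
		gap := time.Duration(tc.GapSeconds) * time.Second
		if got := !outOfWindow(gap, window); got != tc.WantCorrelate {
			failed = append(failed, tc.Name)
		}
	}
	return failed
}

func main() {
	fmt.Println("strict  >  fails:", failedCases(outOfWindowStrict))
	fmt.Println("flipped >= fails:", failedCases(outOfWindowFlipped))
}
```

Without the at-boundary case, both predicates would pass the 59s and 61s cases and the flip would go unnoticed.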
trilamsr pushed a commit that referenced this pull request, Jun 1, 2026
NORTHSTAR pattern #4 (thermal_throttle): same-node cascade of >=4 GPUs each accumulating >=30s of hw.gpu.throttle.duration{reason="thermal"} within a 5min rolling window. Single-layer pattern: no cross-signal join. The cascade itself is the operator-actionable signal (spec escalation tier 2: HVAC/facilities); single-GPU throttling (tier 1: rack-airflow) is left to raw-counter dashboards since pattern #4's NORTHSTAR value is the facility-level signal.

- ThermalThrottleRecord model + ThermalThrottleDetector w/ Window, MinCascadeGPUs, ThrottleDeltaThreshold (defaults: 5min, 4, 30s mirroring the spec's PromQL `[5m]` rate window + `> 30` threshold + escalation tier 2 "half the rack" cutoff for an 8-GPU node).
- Deterministic sort: verdicts by node, GPUIDs within each verdict.
- Per-GPU delta-summing inside the window so multi-scrape coverage (15s scrape intervals × 4 scrapes) clears the 30s bar correctly.
- Schema: testdata/thermal_throttle_verdict.schema.json w/ 10 drift-rejection falsifiers (additionalProperties:false, pattern.id const, gpu_count minimum, gpu_ids minItems, evidence_trail enum + minItems).
- Window-edge inclusivity fenced both sides (paired tests) to catch a future `<=` -> `<` flip (lesson from PR #255).

Layer decision: single-layer (multi-GPU same-node, no cross-signal join). Justification: the spec's alert is single-signal PromQL (`hw.gpu.throttle.duration{reason="thermal"}`); a Xid or pod-event join would not add information operators don't already have from the cascade size + node alone.

Integration: detector library only. The patterndetectorprocessor metrics path is deferred to PR-B (ADR-0001); until that lands the detector cannot fire end-to-end. Tracked in a follow-up issue mirroring the HBM ECC gap (#273).

Signed-off-by: Tri Lam <tri@maydow.com>
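The cascade rule described in this commit (sum per-GPU throttle deltas inside the window, keep GPUs at or above the threshold, emit a verdict for nodes with enough such GPUs, sort deterministically) could be sketched like this. All type, field, and function names are hypothetical, and treating both window edges as inclusive is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// ThrottleSample is a hypothetical record: the thermal-throttle duration a
// GPU accrued since its previous scrape.
type ThrottleSample struct {
	Node  string
	GPUID string
	At    time.Time
	Delta time.Duration
}

// CascadeVerdict is a hypothetical, trimmed-down verdict.
type CascadeVerdict struct {
	Node   string
	GPUIDs []string
}

// detectCascades sums per-GPU deltas inside [now-window, now] and reports
// nodes where at least minGPUs GPUs each accrued minDelta or more.
func detectCascades(samples []ThrottleSample, now time.Time, window, minDelta time.Duration, minGPUs int) []CascadeVerdict {
	type key struct{ node, gpu string }
	sums := map[key]time.Duration{}
	start := now.Add(-window)
	for _, s := range samples {
		if s.At.Before(start) || s.At.After(now) {
			continue // outside the rolling window (edges assumed inclusive)
		}
		sums[key{s.Node, s.GPUID}] += s.Delta
	}
	byNode := map[string][]string{}
	for k, total := range sums {
		if total >= minDelta {
			byNode[k.node] = append(byNode[k.node], k.gpu)
		}
	}
	var out []CascadeVerdict
	for node, gpus := range byNode {
		if len(gpus) < minGPUs {
			continue // too few GPUs: left to raw-counter dashboards
		}
		sort.Strings(gpus)
		out = append(out, CascadeVerdict{Node: node, GPUIDs: gpus})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Node < out[j].Node })
	return out
}

// demoCascades builds a made-up scenario with the defaults from the commit.
func demoCascades() []CascadeVerdict {
	now := time.Date(2026, 6, 1, 12, 0, 0, 0, time.UTC)
	var samples []ThrottleSample
	// node-a: four GPUs, four 15s-spaced scrapes of 8s each (32s per GPU).
	for _, gpu := range []string{"gpu-3", "gpu-0", "gpu-2", "gpu-1"} {
		for i := 0; i < 4; i++ {
			at := now.Add(-time.Duration(i*15) * time.Second)
			samples = append(samples, ThrottleSample{Node: "node-a", GPUID: gpu, At: at, Delta: 8 * time.Second})
		}
	}
	// node-b: only three GPUs clear the bar, so no cascade.
	for _, gpu := range []string{"gpu-0", "gpu-1", "gpu-2"} {
		samples = append(samples, ThrottleSample{Node: "node-b", GPUID: gpu, At: now.Add(-time.Minute), Delta: 45 * time.Second})
	}
	// node-a gpu-4 throttled heavily, but before the window opened.
	samples = append(samples, ThrottleSample{Node: "node-a", GPUID: "gpu-4", At: now.Add(-10 * time.Minute), Delta: time.Minute})
	return detectCascades(samples, now, 5*time.Minute, 30*time.Second, 4)
}

func main() {
	for _, v := range demoCascades() {
		fmt.Printf("%s: thermal cascade across %d GPUs %v\n", v.Node, len(v.GPUIDs), v.GPUIDs)
	}
}
```

No single 8s scrape clears the 30s bar in this scenario; only the per-GPU sum across four scrapes does, which is the case the delta-summing bullet describes.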
trilamsr pushed a commit that referenced this pull request, Jun 1, 2026
Pattern #10 - CUDA OOM, deceptive allocator - per NORTHSTARS Appendix A row #10 and the design spec at docs/patterns/10-cuda-oom-deceptive.md.

Detector evaluation rule
- per-OOM, look up most-recent same-GPU FB sample within CorrelationWindow (default 2min, forward-only - fb.Timestamp <= oom.Timestamp)
- if fb_free_ratio >= FBFreeFragmentationThreshold (default 0.05) -> kind=fragmentation (raise max_split_size_mb, empty_cache)
- if fb_free_ratio < threshold -> kind=true_oom (shrink batch, shard)
- if no FB sample joins -> kind=unknown, confidence=partial

Discriminator value
- fragmentation vs true-OOM is the operator's #1 question on a CUDA OOM
- without DCGM cross-check the operator retries with same batch, hits same OOM, wastes a slot
- partial-confidence verdict surfaces the OOM even when DCGM scrape lags, so the operator branches on concurrent pod_evicted / xid_correlation rather than silence

Files
- module/pkg/patterns/cuda_oom.go - detector + verdict + records
- module/processor/patterndetectorprocessor/cuda_oom.go - projections, collectCUDAOOMInputs, appendCUDAOOMVerdict, runCUDAOOMDetector
- module/processor/patterndetectorprocessor/cuda_oom_test.go - 7 wiring tests + 2 Validate guards
- module/processor/patterndetectorprocessor/example_config.yaml - cuda_oom_correlation_window + cuda_oom_fb_free_fragmentation_threshold knobs
- docs/ATTRIBUTES.md - hw.gpu.memory.{free,total} namespace entries

Scalar promotions per issue #270 contract: gpu.id, k8s.{pod,node}.*, cuda_oom.kind, cuda_oom.tried_alloc_bytes, cuda_oom.fb_free_bytes, cuda_oom.fb_free_ratio, pattern.confidence. Window-edge fenced both sides per PR #255 lesson. Threshold-boundary fenced inclusive per same lesson. Most-recent-pre-OOM rule mirrors xid_correlation / pcie_aer / hbm_ecc.

Integration-gap follow-ups (tracked separately on PR body):
- DCGM_FI_DEV_FB_USED/FREE OTTL recipe extension (sibling to #273)
- filelogreceiver OTTL stanza for CUDA OOM regex parsing (sibling to #285)
- metrics-path on patterndetectorprocessor per ADR-0001 (PR-B)

Tests
- 17 detector tests in module/pkg/patterns/cuda_oom_test.go (filed in red commit, now green)
- 11 schema-drift falsifier sub-tests on CUDAOOMVerdict
- 7 wiring + 2 Validate tests in processor cuda_oom_test.go
- all 35 green; full ./pkg/patterns + ./processor/patterndetectorprocessor suites green with -race; make check + make build green

Refs #303

Signed-off-by: Tri Lam <tri@maydow.com>
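The evaluation rule in this commit message can be sketched as a single classification function. The types, field names, and `classify` are hypothetical; the `"full"` confidence label is an assumption (the commit only names `partial`), as is treating the window edge as inclusive.

```go
package main

import (
	"fmt"
	"time"
)

// OOMEvent, FBSample, and OOMVerdict are hypothetical, simplified records.
type OOMEvent struct {
	GPUID string
	At    time.Time
}

type FBSample struct {
	GPUID      string
	At         time.Time
	FreeBytes  uint64
	TotalBytes uint64
}

type OOMVerdict struct {
	Kind        string // "fragmentation", "true_oom", or "unknown"
	Confidence  string // "partial" per the commit; "full" is an assumed label
	FBFreeRatio float64
}

// classify joins an OOM to the most recent same-GPU framebuffer sample taken
// at or before the OOM and within window, then splits on the free ratio.
func classify(oom OOMEvent, samples []FBSample, window time.Duration, threshold float64) OOMVerdict {
	var best *FBSample
	for i := range samples {
		s := &samples[i]
		if s.GPUID != oom.GPUID || s.TotalBytes == 0 {
			continue
		}
		age := oom.At.Sub(s.At)
		if age < 0 || age > window {
			continue // sample after the OOM, or too old
		}
		if best == nil || s.At.After(best.At) {
			best = s
		}
	}
	if best == nil {
		return OOMVerdict{Kind: "unknown", Confidence: "partial"}
	}
	ratio := float64(best.FreeBytes) / float64(best.TotalBytes)
	kind := "true_oom"
	if ratio >= threshold { // threshold boundary is inclusive
		kind = "fragmentation"
	}
	return OOMVerdict{Kind: kind, Confidence: "full", FBFreeRatio: ratio}
}

// demoKinds classifies three made-up OOMs, one per GPU.
func demoKinds() []string {
	t0 := time.Date(2026, 6, 1, 12, 0, 0, 0, time.UTC)
	samples := []FBSample{
		{GPUID: "gpu-0", At: t0.Add(-30 * time.Second), FreeBytes: 16 << 30, TotalBytes: 80 << 30}, // 20% free
		{GPUID: "gpu-1", At: t0.Add(-30 * time.Second), FreeBytes: 1 << 29, TotalBytes: 80 << 30},  // under 1% free
		{GPUID: "gpu-2", At: t0.Add(-10 * time.Minute), FreeBytes: 16 << 30, TotalBytes: 80 << 30}, // older than the window
	}
	var kinds []string
	for _, gpu := range []string{"gpu-0", "gpu-1", "gpu-2"} {
		v := classify(OOMEvent{GPUID: gpu, At: t0}, samples, 2*time.Minute, 0.05)
		kinds = append(kinds, v.Kind)
	}
	return kinds
}

func main() {
	fmt.Println(demoKinds())
}
```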
Summary
Third real detector in `module/pkg/patterns/`: closes v0.3.0 NORTHSTAR O1 ("3 patterns at v0.1.0") alongside `pod_evicted` (#14) and `nccl_hang` (#15).

Pattern: a GPU Xid kernel event (carrying the customer-stable `kernelevents.xid` attribute) followed within a configurable window (default 60s) by a pod eviction on the same node → one verdict per (Xid, evicted_pod) tuple. Operator-actionable: "Xid 79 on gpu-node-0001 → training/job-rank-3 evicted 10s later" with remediation pinning the node to drain and the pod to reschedule.

Detector (`module/pkg/patterns/xid_correlation.go`):
- `XidCorrelationDetector` zero-value usable; `CorrelationWindow` overrideable (defaults to `DefaultXidCorrelationWindow = 60s`).
- `Evaluate(xids, events)` joins on `(node, time-within-window)` with the most-recent Xid winning the proximate-cause attribution.
- `(eviction_time, EventUID)`.

Verdict shape (single vs multi-layer decision): structurally two evidence surfaces (kernel_event + pod_event in causal order in the evidence trail), but emission rule is "both layers joined or no verdict". Xid-without-eviction is already operator-visible via the raw `kernelevents.xid` telemetry; an Xid-only verdict would duplicate that signal. So `Confidence`/`MissingLayers` are omitted (dead-fields discipline from PR #250 review). Mirrors the `nccl_hang` shape.

Multi-pod decision: one verdict PER evicted pod (not one verdict listing all). Per-pod is the operator-actionable shape: each verdict's `Remediation` pins a specific pod to drain/recreate so alert routing fans out by pod owner; collapsing would force operators to parse a list and lose per-pod routing.

Schema (`testdata/xid_correlation_verdict.schema.json`): `additionalProperties:false`, `pattern.id` const="16", `evidence_trail.kind` enum=["kernel_event","pod_event"], `minItems:2` (both layers required). 7-row drift-rejection battery covers re-introduced `confidence`/`missing_layers`, evidence-minimum violation, kind-enum, and `pattern.id` numeric vs string drift.

Wiring (`patterndetectorprocessor`):
- `Config.XidCorrelationWindow` YAML field (default 60s, floor 1s), validated alongside the existing window fields.
- `collectInputs` returns a fourth typed slice (`[]patterns.XidRecord`) built by `projectXidRecord`, which gates on `kernelevents.xid` and reads the host node from the resource attribute `k8s.node.name` (the standard k8sattributes / resourcedetection stamp on a DaemonSet, the same pattern `projectNodeCondition` uses). Per-record `k8s.node.name` fallback for non-DaemonSet emitters.
- `appendXidCorrelationVerdict` mirrors `appendNCCLHangVerdict`'s wire format: broken-out scalar attrs plus full `pattern.verdict_json`.

Recipe alignment: `docs/integrations/journald-kernel.md` §"Customer-stable attribute mapping" already documents `kernelevents.xid` (int) as the OTTL-stamped surface. No recipe update needed; the existing pipeline emits exactly what the detector consumes. `gpu.id` (PCI BDF) is also stamped but intentionally not projected: the detector doesn't use it yet, so it stays available on the raw record without entering the pattern library as a dead field.

Test plan
- `cd module && go test ./pkg/patterns/... ./processor/... -race -count=1`: green
- `make build`: collector binary compiles
- `make check` (golangci-lint + go vet + go mod verify): 0 issues
- `xid_correlation_test.go` files failed to compile/assert before their impl landed
- `TestConfig_Validate` extended with sub-1s xid_correlation_window floor case
- `TestFactory_Surface` pins `DefaultXidCorrelationWindow` (and the previously-unpinned `DefaultNCCLHangThreshold`) on the factory's default config
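As a rough illustration of the projection described under Wiring, the sketch below gates on `kernelevents.xid` and resolves the node from the resource attribute with a per-record fallback. It uses plain maps rather than the collector's pdata types, and the `XidRecord` fields, the function signature, and dropping records that carry no node are all assumptions; only the attribute keys and the `projectXidRecord` name come from the PR.

```go
package main

import "fmt"

// XidRecord is a hypothetical, trimmed-down projection target.
type XidRecord struct {
	Node string
	Xid  int64
}

// projectXidRecord returns a record only when the log record carries an
// integer kernelevents.xid attribute. The node comes from the resource
// attribute k8s.node.name, falling back to the same key on the record.
// Plain maps stand in for the collector's attribute maps in this sketch.
func projectXidRecord(resourceAttrs, recordAttrs map[string]any) (XidRecord, bool) {
	xid, ok := recordAttrs["kernelevents.xid"].(int64)
	if !ok {
		return XidRecord{}, false // not an Xid event
	}
	node, _ := resourceAttrs["k8s.node.name"].(string)
	if node == "" {
		node, _ = recordAttrs["k8s.node.name"].(string) // per-record fallback
	}
	if node == "" {
		return XidRecord{}, false // assumption: no node means no same-node join
	}
	return XidRecord{Node: node, Xid: xid}, true
}

func main() {
	// DaemonSet-style emitter: node stamped on the resource.
	rec, ok := projectXidRecord(
		map[string]any{"k8s.node.name": "gpu-node-0001"},
		map[string]any{"kernelevents.xid": int64(79), "gpu.id": "0000:3b:00.0"},
	)
	fmt.Println(rec, ok)

	// Non-DaemonSet emitter: node only on the record.
	rec, ok = projectXidRecord(
		map[string]any{},
		map[string]any{"kernelevents.xid": int64(48), "k8s.node.name": "gpu-node-0002"},
	)
	fmt.Println(rec, ok)
}
```

The `gpu.id` attribute is present on the first record but is deliberately ignored, matching the decision not to project it yet.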