feat(patterns): xid_correlation detector + processor wiring (v0.3.0 NORTHSTAR pattern #3) #255

Merged
trilamsr merged 3 commits into main from feat/xid-correlation-detector on Jun 1, 2026

@trilamsr trilamsr commented Jun 1, 2026

Summary

Third real detector in module/pkg/patterns/ — closes v0.3.0 NORTHSTAR O1 ("3 patterns at v0.1.0") alongside pod_evicted (#14) and nccl_hang (#15).

Pattern: a GPU Xid kernel event (carrying the customer-stable kernelevents.xid attribute) followed within a configurable window (default 60s) by a pod eviction on the same node → one verdict per (Xid, evicted_pod) tuple. Operator-actionable: "Xid 79 on gpu-node-0001 → training/job-rank-3 evicted 10s later" with remediation pinning the node to drain and the pod to reschedule.

Detector (module/pkg/patterns/xid_correlation.go):

  • XidCorrelationDetector is usable as a zero value; CorrelationWindow is overridable (defaults to DefaultXidCorrelationWindow = 60s).
  • Evaluate(xids, events) joins on (node, time-within-window) with the most-recent Xid winning the proximate-cause attribution.
  • Causal-direction guard: eviction BEFORE Xid does not correlate (mirrors pod_evicted's future-transition exclusion).
  • Deterministic output sorted by (eviction_time, EventUID).
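
The join rules in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in xid_correlation.go: the record and verdict field sets are assumptions, the `Evicted` flag stands in for whatever eviction hint the real detector reads, and treating an eviction at the same instant as the Xid as correlated is also an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// DefaultXidCorrelationWindow matches the 60s default named in this PR.
const DefaultXidCorrelationWindow = 60 * time.Second

// XidRecord and PodEvent are simplified stand-ins; the real field sets are assumptions.
type XidRecord struct {
	Node      string
	Xid       int
	Timestamp time.Time
}

type PodEvent struct {
	Node      string
	Pod       string
	EventUID  string
	Evicted   bool // hypothetical flag for "this event is an eviction"
	Timestamp time.Time
}

// Verdict pairs one evicted pod with the Xid blamed for it.
type Verdict struct {
	Xid      XidRecord
	Eviction PodEvent
}

type XidCorrelationDetector struct {
	CorrelationWindow time.Duration // zero means "use the default"
}

func (d XidCorrelationDetector) window() time.Duration {
	if d.CorrelationWindow <= 0 {
		return DefaultXidCorrelationWindow
	}
	return d.CorrelationWindow
}

// Evaluate emits one verdict per evicted pod that has a same-node Xid
// at or before the eviction and no more than one window earlier.
func (d XidCorrelationDetector) Evaluate(xids []XidRecord, events []PodEvent) []Verdict {
	var out []Verdict
	for _, ev := range events {
		if !ev.Evicted {
			continue
		}
		var best *XidRecord
		for i := range xids {
			x := &xids[i]
			delta := ev.Timestamp.Sub(x.Timestamp)
			// Same node, causal guard (delta < 0), inclusive window boundary.
			if x.Node != ev.Node || delta < 0 || delta > d.window() {
				continue
			}
			if best == nil || x.Timestamp.After(best.Timestamp) {
				best = x // most recent Xid wins the attribution
			}
		}
		if best != nil {
			out = append(out, Verdict{Xid: *best, Eviction: ev})
		}
	}
	// Deterministic order: eviction time first, then EventUID.
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i].Eviction, out[j].Eviction
		if !a.Timestamp.Equal(b.Timestamp) {
			return a.Timestamp.Before(b.Timestamp)
		}
		return a.EventUID < b.EventUID
	})
	return out
}

func main() {
	t0 := time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)
	xids := []XidRecord{{Node: "gpu-node-0001", Xid: 79, Timestamp: t0}}
	events := []PodEvent{{Node: "gpu-node-0001", Pod: "training/job-rank-3",
		EventUID: "u1", Evicted: true, Timestamp: t0.Add(10 * time.Second)}}
	for _, v := range (XidCorrelationDetector{}).Evaluate(xids, events) {
		fmt.Printf("Xid %d on %s -> %s evicted %s later\n", v.Xid.Xid, v.Xid.Node,
			v.Eviction.Pod, v.Eviction.Timestamp.Sub(v.Xid.Timestamp))
	}
}
```

Looping over evictions (rather than over Xids) is what makes "one verdict per evicted pod" fall out naturally: three pods evicted after one Xid yield three verdicts that share the same Xid.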

Verdict shape — single vs multi-layer decision: structurally two evidence surfaces (kernel_event + pod_event in causal order in the evidence trail), but emission rule is "both layers joined or no verdict". Xid-without-eviction is already operator-visible via the raw kernelevents.xid telemetry; an Xid-only verdict would duplicate that signal. So Confidence / MissingLayers are omitted (dead-fields discipline from PR #250 review). Mirrors the nccl_hang shape.

Multi-pod decision: one verdict PER evicted pod (not one verdict listing all). Per-pod is the operator-actionable shape — each verdict's Remediation pins a specific pod to drain/recreate so alert routing fans out by pod owner; collapsing would force operators to parse a list and lose per-pod routing.

Schema (testdata/xid_correlation_verdict.schema.json): additionalProperties:false, pattern.id const="16", evidence_trail.kind enum=["kernel_event","pod_event"], minItems:2 (both layers required). 7-row drift-rejection battery covers re-introduced confidence/missing_layers, evidence-minimum violation, kind-enum, and pattern.id numeric vs string drift.

Wiring (patterndetectorprocessor):

  • New Config.XidCorrelationWindow YAML field (default 60s, floor 1s), validated alongside the existing window fields.
  • collectInputs returns a fourth typed slice ([]patterns.XidRecord) built by projectXidRecord, which gates on kernelevents.xid and reads the host node from the resource attribute k8s.node.name (the standard k8sattributes / resourcedetection stamp on a DaemonSet — same pattern projectNodeCondition uses). Per-record k8s.node.name fallback for non-DaemonSet emitters.
  • appendXidCorrelationVerdict mirrors appendNCCLHangVerdict's wire format — broken-out scalar attrs plus full pattern.verdict_json.
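
A rough sketch of the two wiring pieces described above: the window floor check and the node-name lookup with its per-record fallback. Plain maps stand in for the collector's pdata types, `minXidCorrelationWindow`, `createDefaultConfig`, and `projectXid` are hypothetical names, and dropping a record that carries no node name at all is an assumption rather than confirmed behavior.

```go
package main

import (
	"fmt"
	"time"
)

const (
	DefaultXidCorrelationWindow = 60 * time.Second
	minXidCorrelationWindow     = 1 * time.Second // hypothetical name for the 1s floor
)

// Config shows only the new field; the real struct carries the other windows too.
type Config struct {
	XidCorrelationWindow time.Duration `yaml:"xid_correlation_window"`
}

// createDefaultConfig is a stand-in for the factory default pinned by TestFactory_Surface.
func createDefaultConfig() *Config {
	return &Config{XidCorrelationWindow: DefaultXidCorrelationWindow}
}

// Validate rejects windows below the floor.
func (c *Config) Validate() error {
	if c.XidCorrelationWindow < minXidCorrelationWindow {
		return fmt.Errorf("xid_correlation_window must be >= %s, got %s",
			minXidCorrelationWindow, c.XidCorrelationWindow)
	}
	return nil
}

// projectXid is a simplified take on projectXidRecord: it gates on
// kernelevents.xid, prefers the resource-level k8s.node.name, and falls
// back to a per-record attribute of the same name.
func projectXid(resource, record map[string]any) (node string, xid int64, ok bool) {
	xid, ok = record["kernelevents.xid"].(int64)
	if !ok {
		return "", 0, false // not an Xid record
	}
	node, _ = resource["k8s.node.name"].(string)
	if node == "" {
		node, _ = record["k8s.node.name"].(string) // non-DaemonSet emitters
	}
	if node == "" {
		return "", 0, false // assumption: no node means nothing to join on
	}
	return node, xid, true
}

func main() {
	cfg := createDefaultConfig()
	fmt.Println(cfg.XidCorrelationWindow, cfg.Validate())

	cfg.XidCorrelationWindow = 500 * time.Millisecond
	fmt.Println(cfg.Validate())

	node, xid, ok := projectXid(
		map[string]any{"k8s.node.name": "gpu-node-0001"},
		map[string]any{"kernelevents.xid": int64(79)},
	)
	fmt.Println(node, xid, ok)
}
```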

Recipe alignment: docs/integrations/journald-kernel.md §"Customer-stable attribute mapping" already documents kernelevents.xid (int) as the OTTL-stamped surface. No recipe update needed; the existing pipeline emits exactly what the detector consumes. gpu.id (PCI BDF) is also stamped but intentionally not projected — the detector doesn't use it yet, so it stays available on the raw record without entering the pattern library as a dead field.

Test plan

  • cd module && go test ./pkg/patterns/... ./processor/... -race -count=1 — green
  • make build — collector binary compiles
  • make check (golangci-lint + go vet + go mod verify) — 0 issues
  • TDD red-test-first: both xid_correlation_test.go files failed to compile/assert before their impl landed
  • Detector test cases: positive Xid 79 → eviction, no-eviction negative, cross-node negative, out-of-window edge (61s default), multi-pod per Xid (3 verdicts), window-configurable, pre-Xid eviction excluded, non-evicted-hint ignored, deterministic order, most-recent-Xid-wins
  • Schema-conformance + 7-row drift-rejection battery
  • Wiring tests: positive verdict emission, healthy non-emission, window override
  • TestConfig_Validate extended with sub-1s xid_correlation_window floor case
  • TestFactory_Surface pins DefaultXidCorrelationWindow (and the previously-unpinned DefaultNCCLHangThreshold) on the factory's default config
feat(patterns): add xid_correlation detector (v0.3.0 NORTHSTAR pattern #3) — correlates GPU Xid kernel events (`kernelevents.xid` attribute per RFC-0013 §3) with downstream pod evictions on the same node within a configurable window (default 60s). Emits one verdict per evicted pod so alert routing can fan out per pod owner. Wired into patterndetectorprocessor; configurable via `xid_correlation_window` YAML field.

Tri Lam added 3 commits May 31, 2026 19:49
Third real detector in module/pkg/patterns/ (v0.3.0 NORTHSTAR pattern
#3). XidCorrelationDetector reads GPU Xid kernel events (carrying
the customer-stable kernelevents.xid attribute per RFC-0013 §3) and
Kubernetes Pod eviction records, then emits one
XidCorrelationVerdict per (Xid → Pod-Evicted) join inside a
configurable correlation window (default 60s).

Single-vs-multi-layer decision: structurally the pattern has TWO
evidence surfaces (kernel_event + pod_event), so every verdict
carries both refs in causal order. But the emission rule is
"both layers joined or no verdict" — there is no partial path
because Xid-without-eviction is already operator-visible via the
raw kernelevents.xid telemetry, and emitting an Xid-only verdict
would duplicate that signal. Confidence + MissingLayers are
therefore omitted from the verdict shape (dead-field discipline
from PR #250 review). Output mirrors the nccl_hang verdict shape.

Multi-pod decision: one verdict PER evicted pod (not one verdict
listing all pods). Per-pod is the operator-actionable shape — each
verdict's Remediation pins a specific pod to drain/recreate so
alert routing fans out by pod owner; collapsing into one verdict
would force the operator to parse a list and lose per-pod routing.

Schema (testdata/xid_correlation_verdict.schema.json) pins the wire
shape with additionalProperties:false, pattern.id const="16",
evidence_trail.kind enum=["kernel_event","pod_event"], and minItems:2
(both layers required). 7-row drift-rejection battery covers
confidence/missing_layers re-introduction, evidence-minimum,
kind-enum, and pattern.id type drift.

Causal-direction guard: eviction BEFORE the Xid does NOT correlate —
mirrors pod_evicted's future-transition exclusion. Same-node
required; cross-node Xid + eviction emits zero verdicts.

Signed-off-by: Tri Lam <tri@maydow.com>

Surfaces the v0.3.0 NORTHSTAR pattern #3 detector via the
patterndetectorprocessor data path. New Config.XidCorrelationWindow
YAML field (yaml:"xid_correlation_window") defaults to 60s with a
1s floor, mirroring the JoinWindow / NCCLHangThreshold style.

Projection: collectInputs returns a fourth typed slice
([]patterns.XidRecord) built by projectXidRecord, which gates on
the customer-stable `kernelevents.xid` attribute (RFC-0013 §3, the
contract the journald-kernel OTTL recipe stamps) and reads the
hosting node from resource-attr `k8s.node.name` (with a per-record
fallback). The recipe's OTTL also stamps `gpu.id` (PCI BDF) — we
intentionally don't project it onto XidRecord today because the
detector doesn't use it; the contract is preserved on the raw
record and can be picked up later if a remediation needs it.

Emission: appendXidCorrelationVerdict mirrors appendNCCLHangVerdict's
wire-format contract — broken-out scalar attrs (pattern.id,
pattern.headline, pattern.remediation) plus full pattern.verdict_json
— so downstream consumers don't branch on pattern.id to find
headline/remediation. No pattern.confidence emitted (the
xid_correlation verdict shape has no Confidence field, per the
detector commit's dead-fields decision).

Test plan: positive Xid → eviction emits one verdict; Xid alone
emits none; XidCorrelationWindow surfaces the detector's window via
YAML override; TestConfig_Validate gains a sub-1s xid floor case;
TestFactory_Surface pins the new default constant.

Signed-off-by: Tri Lam <tri@maydow.com>

Per fresh-context review of #255: EdgeOutOfWindow pinned the
strict-greater-than exclusion at 61s, but no test fenced the
at-boundary case. The new EdgeAtWindowBoundary test asserts that an
eviction exactly one window after the Xid MUST correlate, so flipping
the comparison in mostRecentXidWithin from `>` to `>=` now fails a test.
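
To illustrate why the boundary needs its own test, here is a small sketch of the paired cases. `correlates` is a hypothetical helper that isolates the comparison; the real mostRecentXidWithin signature is not shown in this PR, so only the two test names come from the source.

```go
package main

import (
	"fmt"
	"time"
)

const defaultWindow = 60 * time.Second

// correlates is a hypothetical stand-in for the comparison inside
// mostRecentXidWithin; delta is eviction time minus Xid time.
func correlates(delta, window time.Duration) bool {
	if delta < 0 {
		return false // eviction precedes the Xid
	}
	if delta > window { // changing > to >= would also drop delta == window
		return false
	}
	return true
}

type boundaryCase struct {
	name  string
	delta time.Duration
	want  bool
}

// The two cases fence the edge from both sides.
var boundaryCases = []boundaryCase{
	{"EdgeAtWindowBoundary", defaultWindow, true},
	{"EdgeOutOfWindow", defaultWindow + time.Second, false},
}

func main() {
	for _, c := range boundaryCases {
		fmt.Printf("%s: got=%v want=%v\n", c.name, correlates(c.delta, defaultWindow), c.want)
	}
}
```

With only the 61s case, both `>` and `>=` produce the same result, so the mutation survives; the 60s case is the one that distinguishes them.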

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 03:02
@trilamsr trilamsr merged commit e5fbf23 into main Jun 1, 2026
15 checks passed
@trilamsr trilamsr deleted the feat/xid-correlation-detector branch June 1, 2026 03:10
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
NORTHSTAR pattern #4 (thermal_throttle): same-node cascade of >=4
GPUs each accumulating >=30s of hw.gpu.throttle.duration{reason=
thermal} within a 5min rolling window. Single-layer pattern — no
cross-signal join. The cascade itself is the operator-actionable
signal (spec escalation tier 2: HVAC/facilities); single-GPU
throttling (tier 1: rack-airflow) is left to raw-counter dashboards
since pattern #4's NORTHSTAR value is the facility-level signal.

- ThermalThrottleRecord model + ThermalThrottleDetector w/ Window,
  MinCascadeGPUs, ThrottleDeltaThreshold (defaults: 5min, 4, 30s
  mirroring the spec's PromQL `[5m]` rate window + `> 30` threshold
  + escalation tier 2 "half the rack" cutoff for an 8-GPU node).
- Deterministic sort: verdicts by node, GPUIDs within each verdict.
- Per-GPU delta-summing inside the window so multi-scrape coverage
  (15s scrape intervals × 4 scrapes) clears the 30s bar correctly.
- Schema: testdata/thermal_throttle_verdict.schema.json w/ 10
  drift-rejection falsifiers (additionalProperties:false, pattern.id
  const, gpu_count minimum, gpu_ids minItems, evidence_trail enum +
  minItems).
- Window-edge inclusivity fenced both sides (paired tests) to catch
  a future `<=` -> `<` flip (lesson from PR #255).
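
The cascade rule in the list above might look roughly like this. It is a sketch under several assumptions: the record fields, the `Default*` constant names, the verdict struct, and anchoring the rolling window at a caller-supplied `now` are all hypothetical; only the detector's three knobs and their default values come from the commit message.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Hypothetical constant names; the values are the defaults quoted above.
const (
	DefaultThermalWindow          = 5 * time.Minute
	DefaultMinCascadeGPUs         = 4
	DefaultThrottleDeltaThreshold = 30 * time.Second
)

// ThermalThrottleRecord is one scrape; Delta is the throttle time added
// since the previous scrape (an assumed shape).
type ThermalThrottleRecord struct {
	Node      string
	GPUID     string
	Timestamp time.Time
	Delta     time.Duration
}

type ThermalThrottleDetector struct {
	Window                 time.Duration
	MinCascadeGPUs         int
	ThrottleDeltaThreshold time.Duration
}

type ThermalVerdict struct {
	Node   string
	GPUIDs []string
}

func (d ThermalThrottleDetector) Evaluate(now time.Time, recs []ThermalThrottleRecord) []ThermalVerdict {
	window, minGPUs, threshold := d.Window, d.MinCascadeGPUs, d.ThrottleDeltaThreshold
	if window <= 0 {
		window = DefaultThermalWindow
	}
	if minGPUs <= 0 {
		minGPUs = DefaultMinCascadeGPUs
	}
	if threshold <= 0 {
		threshold = DefaultThrottleDeltaThreshold
	}
	// Sum per-GPU deltas inside the window; both window edges are inclusive.
	sums := map[string]map[string]time.Duration{}
	for _, r := range recs {
		age := now.Sub(r.Timestamp)
		if age < 0 || age > window {
			continue
		}
		if sums[r.Node] == nil {
			sums[r.Node] = map[string]time.Duration{}
		}
		sums[r.Node][r.GPUID] += r.Delta
	}
	var out []ThermalVerdict
	for node, perGPU := range sums {
		var ids []string
		for id, total := range perGPU {
			if total >= threshold {
				ids = append(ids, id)
			}
		}
		if len(ids) >= minGPUs {
			sort.Strings(ids) // GPU IDs sorted within each verdict
			out = append(out, ThermalVerdict{Node: node, GPUIDs: ids})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Node < out[j].Node })
	return out
}

func main() {
	now := time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)
	var recs []ThermalThrottleRecord
	for _, id := range []string{"gpu-0", "gpu-1", "gpu-2", "gpu-3"} {
		for s := 1; s <= 2; s++ { // two 15s scrapes sum to the 30s threshold
			recs = append(recs, ThermalThrottleRecord{Node: "gpu-node-0001", GPUID: id,
				Timestamp: now.Add(-time.Duration(s) * 15 * time.Second), Delta: 15 * time.Second})
		}
	}
	fmt.Println(ThermalThrottleDetector{}.Evaluate(now, recs))
}
```

Summing deltas per GPU before comparing against the threshold is what lets several short scrapes add up to a qualifying GPU; checking each scrape individually would never clear the 30s bar at a 15s scrape interval.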

Layer decision: single-layer (multi-GPU same-node, no cross-signal
join). Justification — the spec's alert is single-signal PromQL
(`hw.gpu.throttle.duration{reason="thermal"}`); a Xid or pod-event
join would not add information operators don't already have from
the cascade size + node alone.

Integration: detector library only. The patterndetectorprocessor
metrics path is deferred to PR-B (ADR-0001); until that lands the
detector cannot fire end-to-end. Tracked in a follow-up issue
mirroring the HBM ECC gap (#273).

Signed-off-by: Tri Lam <tri@maydow.com>

trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Pattern #10 - CUDA OOM, deceptive allocator - per NORTHSTARS Appendix A
row #10 and the design spec at docs/patterns/10-cuda-oom-deceptive.md.

Detector evaluation rule
- per-OOM, look up most-recent same-GPU FB sample within CorrelationWindow
  (default 2min, forward-only - fb.Timestamp <= oom.Timestamp)
- if fb_free_ratio >= FBFreeFragmentationThreshold (default 0.05) ->
  kind=fragmentation (raise max_split_size_mb, empty_cache)
- if fb_free_ratio < threshold -> kind=true_oom (shrink batch, shard)
- if no FB sample joins -> kind=unknown, confidence=partial
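
The evaluation rule above, sketched as a single function. The record types, constant names, and the "full" confidence label are assumptions made for illustration (the commit message only names `partial`); the window, threshold, and the three kinds come from the rule itself.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical constant names; the values are the defaults quoted above.
const (
	DefaultCUDAOOMCorrelationWindow     = 2 * time.Minute
	DefaultFBFreeFragmentationThreshold = 0.05
)

// FBSample is one framebuffer scrape; OOMRecord is one parsed CUDA OOM.
type FBSample struct {
	GPUID      string
	Timestamp  time.Time
	FreeBytes  uint64
	TotalBytes uint64
}

type OOMRecord struct {
	GPUID     string
	Timestamp time.Time
}

type CUDAOOMVerdict struct {
	Kind        string // "fragmentation", "true_oom", or "unknown"
	Confidence  string // "partial" when no FB sample joined; "full" is an assumed label
	FBFreeRatio float64
}

// classify joins the OOM to the most recent same-GPU sample taken at or
// before it (forward-only) and within the window, then applies the threshold.
func classify(oom OOMRecord, samples []FBSample, window time.Duration, threshold float64) CUDAOOMVerdict {
	var best *FBSample
	for i := range samples {
		s := &samples[i]
		age := oom.Timestamp.Sub(s.Timestamp)
		if s.GPUID != oom.GPUID || s.TotalBytes == 0 || age < 0 || age > window {
			continue
		}
		if best == nil || s.Timestamp.After(best.Timestamp) {
			best = s
		}
	}
	if best == nil {
		return CUDAOOMVerdict{Kind: "unknown", Confidence: "partial"}
	}
	ratio := float64(best.FreeBytes) / float64(best.TotalBytes)
	kind := "true_oom"
	if ratio >= threshold { // inclusive: exactly 5% free counts as fragmentation
		kind = "fragmentation"
	}
	return CUDAOOMVerdict{Kind: kind, Confidence: "full", FBFreeRatio: ratio}
}

func main() {
	t0 := time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)
	oom := OOMRecord{GPUID: "gpu-0", Timestamp: t0}
	samples := []FBSample{
		{GPUID: "gpu-0", Timestamp: t0.Add(-10 * time.Second), FreeBytes: 20 << 30, TotalBytes: 80 << 30},
	}
	v := classify(oom, samples, DefaultCUDAOOMCorrelationWindow, DefaultFBFreeFragmentationThreshold)
	fmt.Println(v.Kind, v.Confidence, v.FBFreeRatio)
}
```

Returning an `unknown` verdict instead of nothing when no sample joins mirrors the stated goal of surfacing the OOM even when the DCGM scrape lags.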

Discriminator value
- fragmentation vs true-OOM is the operator's #1 question on a CUDA OOM
- without DCGM cross-check the operator retries with same batch, hits
  same OOM, wastes a slot
- partial-confidence verdict surfaces the OOM even when DCGM scrape lags,
  so the operator branches on concurrent pod_evicted / xid_correlation
  rather than silence

Files
- module/pkg/patterns/cuda_oom.go - detector + verdict + records
- module/processor/patterndetectorprocessor/cuda_oom.go - projections,
  collectCUDAOOMInputs, appendCUDAOOMVerdict, runCUDAOOMDetector
- module/processor/patterndetectorprocessor/cuda_oom_test.go - 7 wiring
  tests + 2 Validate guards
- module/processor/patterndetectorprocessor/example_config.yaml -
  cuda_oom_correlation_window + cuda_oom_fb_free_fragmentation_threshold
  knobs
- docs/ATTRIBUTES.md - hw.gpu.memory.{free,total} namespace entries

Scalar promotions per issue #270 contract: gpu.id, k8s.{pod,node}.*,
cuda_oom.kind, cuda_oom.tried_alloc_bytes, cuda_oom.fb_free_bytes,
cuda_oom.fb_free_ratio, pattern.confidence.

Window-edge fenced both sides per PR #255 lesson. Threshold-boundary
fenced inclusive per same lesson. Most-recent-pre-OOM rule mirrors
xid_correlation / pcie_aer / hbm_ecc.

Integration-gap follow-ups (tracked separately on PR body):
- DCGM_FI_DEV_FB_USED/FREE OTTL recipe extension (sibling to #273)
- filelogreceiver OTTL stanza for CUDA OOM regex parsing (sibling to #285)
- metrics-path on patterndetectorprocessor per ADR-0001 (PR-B)

Tests
- 17 detector tests in module/pkg/patterns/cuda_oom_test.go (filed in
  red commit, now green)
- 11 schema-drift falsifier sub-tests on CUDAOOMVerdict
- 7 wiring + 2 Validate tests in processor cuda_oom_test.go
- all 35 green; full ./pkg/patterns + ./processor/patterndetectorprocessor
  suites green with -race; make check + make build green

Refs #303

Signed-off-by: Tri Lam <tri@maydow.com>