feat(patterns): nccl_hang detector + processor wiring (v0.3.0 NORTHSTAR pattern #2)#250
Merged
Conversation
added 3 commits
May 31, 2026 18:48
NCCLHangDetector reads cross-rank NCCL FlightRecorder records and emits one NCCLHangVerdict per (pg_id, collective_seq_id) where >=2 ranks have a non-completed record older than HangThreshold (default 5min, configurable). Zero-value usable; mirrors the PodEvictedDetector API. Output is sorted by (collective_seq_id, earliest_started_ns) for deterministic golden tests. The detector takes the latest record per (rank, pg, collective) so a later "completed" entry supersedes an earlier "started" in the same FR ring buffer. Schema-conformance: testdata/nccl_hang_verdict.schema.json pins the wire shape; drift-rejection test battery defends 5 schema constraints. TDD: red-test-first with 8 cases — positive 3-rank hang, all- completed negative, straggler-under-threshold edge, solo rank not-hang, threshold configurability, deterministic order, cross- collective non-join, later-completed supersedes started. Signed-off-by: Tri Lam <tri@maydow.com>
Adds the nccl_hang detector path alongside the existing pod_evicted flow. The processor now: - Projects log records carrying gen_ai.training.rank + nccl.fr.collective_seq_id into patterns.NCCLFRRecord values (rank fallback to nccl.rank / nccl.fr.rank for direct-from- receiver paths bypassing rankjoinprocessor). - Surfaces patterns.NCCLHangDetector with HangThreshold sourced from Config.NCCLHangThreshold (YAML: nccl_hang_threshold). Default 5min, floor 1s, validated. - Emits nccl_hang verdicts on the same wire format as pod_evicted (broken-out scalar attrs + pattern.verdict_json) so downstream consumers don't branch on pattern.id to read headline/remediation. TDD: 3 wiring tests cover the positive 3-rank hang path, the healthy all-completed negative, and threshold configurability. Signed-off-by: Tri Lam <tri@maydow.com>
Per fresh-context review of #250: the detector hardcoded Confidence to Full and never populated MissingLayers, yet the verdict struct and schema both advertised a Partial branch that never fires. Pod-evicted keeps the Partial path because node-condition layer can genuinely miss; nccl_hang has only one evidence layer (FR records) so the field is dead weight. Delete > add. TimeDiscoveredCompletedNs was projected from OTel attrs and stored on NCCLFRRecord but never read by the detector — State=='completed' is the sole supersession signal. Removed projection + field + schema prop + test fixtures. Schema drift-battery: replaced the now-impossible 'confidence_outside_enum' case with 'confidence_reintroduced' to pin that additionalProperties:false rejects re-adding the field. Tests + lint + vet + mod verify green. Signed-off-by: Tri Lam <tri@maydow.com>
Merged
9 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
…ORTHSTAR pattern #3) (#255) ## Summary Third real detector in `module/pkg/patterns/` — closes v0.3.0 NORTHSTAR O1 ("3 patterns at v0.1.0") alongside `pod_evicted` (#14) and `nccl_hang` (#15). **Pattern**: a GPU Xid kernel event (carrying the customer-stable `kernelevents.xid` attribute) followed within a configurable window (default 60s) by a pod eviction on the same node → one verdict per (Xid, evicted_pod) tuple. Operator-actionable: "Xid 79 on gpu-node-0001 → training/job-rank-3 evicted 10s later" with remediation pinning the node to drain and the pod to reschedule. **Detector** (`module/pkg/patterns/xid_correlation.go`): - `XidCorrelationDetector` zero-value usable; `CorrelationWindow` overrideable (defaults to `DefaultXidCorrelationWindow = 60s`). - `Evaluate(xids, events)` joins on `(node, time-within-window)` with the most-recent Xid winning the proximate-cause attribution. - Causal-direction guard: eviction BEFORE Xid does not correlate (mirrors pod_evicted's future-transition exclusion). - Deterministic output sorted by `(eviction_time, EventUID)`. **Verdict shape** — single vs multi-layer decision: structurally two evidence surfaces (kernel_event + pod_event in causal order in the evidence trail), but emission rule is "both layers joined or no verdict". Xid-without-eviction is already operator-visible via the raw `kernelevents.xid` telemetry; an Xid-only verdict would duplicate that signal. So `Confidence` / `MissingLayers` are omitted (dead-fields discipline from PR #250 review). Mirrors the `nccl_hang` shape. **Multi-pod decision**: one verdict PER evicted pod (not one verdict listing all). Per-pod is the operator-actionable shape — each verdict's `Remediation` pins a specific pod to drain/recreate so alert routing fans out by pod owner; collapsing would force operators to parse a list and lose per-pod routing. **Schema** (`testdata/xid_correlation_verdict.schema.json`): `additionalProperties:false`, `pattern.id` const="16", `evidence_trail.kind` enum=`["kernel_event","pod_event"]`, `minItems:2` (both layers required). 7-row drift-rejection battery covers re-introduced `confidence`/`missing_layers`, evidence-minimum violation, kind-enum, and `pattern.id` numeric vs string drift. **Wiring** (`patterndetectorprocessor`): - New `Config.XidCorrelationWindow` YAML field (default 60s, floor 1s), validated alongside the existing window fields. - `collectInputs` returns a fourth typed slice (`[]patterns.XidRecord`) built by `projectXidRecord`, which gates on `kernelevents.xid` and reads the host node from the resource attribute `k8s.node.name` (the standard k8sattributes / resourcedetection stamp on a DaemonSet — same pattern `projectNodeCondition` uses). Per-record `k8s.node.name` fallback for non-DaemonSet emitters. - `appendXidCorrelationVerdict` mirrors `appendNCCLHangVerdict`'s wire format — broken-out scalar attrs plus full `pattern.verdict_json`. **Recipe alignment**: `docs/integrations/journald-kernel.md` §"Customer-stable attribute mapping" already documents `kernelevents.xid` (int) as the OTTL-stamped surface. No recipe update needed; the existing pipeline emits exactly what the detector consumes. `gpu.id` (PCI BDF) is also stamped but intentionally not projected — the detector doesn't use it yet, so it stays available on the raw record without entering the pattern library as a dead field. ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/... -race -count=1` — green - [x] `make build` — collector binary compiles - [x] `make check` (golangci-lint + go vet + go mod verify) — 0 issues - [x] TDD red-test-first: both `xid_correlation_test.go` files failed to compile/assert before their impl landed - [x] Detector test cases: positive Xid 79 → eviction, no-eviction negative, cross-node negative, out-of-window edge (61s default), multi-pod per Xid (3 verdicts), window-configurable, pre-Xid eviction excluded, non-evicted-hint ignored, deterministic order, most-recent-Xid-wins - [x] Schema-conformance + 7-row drift-rejection battery - [x] Wiring tests: positive verdict emission, healthy non-emission, window override - [x] `TestConfig_Validate` extended with sub-1s xid_correlation_window floor case - [x] `TestFactory_Surface` pins `DefaultXidCorrelationWindow` (and the previously-unpinned `DefaultNCCLHangThreshold`) on the factory's default config ```release-notes feat(patterns): add xid_correlation detector (v0.3.0 NORTHSTAR pattern #3) — correlates GPU Xid kernel events (`kernelevents.xid` attribute per RFC-0013 §3) with downstream pod evictions on the same node within a configurable window (default 60s). Emits one verdict per evicted pod so alert routing can fan out per pod owner. Wired into patterndetectorprocessor; configurable via `xid_correlation_window` YAML field. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Second real detector in
module/pkg/patterns/(v0.3.0 NORTHSTAR pattern #2).NCCLHangDetectorreads cross-rank NCCL FlightRecorder records joined byrankjoinprocessorand emits oneNCCLHangVerdictper(pg_id, collective_seq_id)where >=2 ranks are stuck on the same collective pastHangThreshold(default 5min, configurable).module/pkg/patterns/nccl_hang.go): zero-value usable; mirrorsPodEvictedDetectorAPI shape (config-field struct,Evaluate()method, deterministic output sorted by(collective_seq_id, earliest_started_ns)). Takes latest record per(rank, pg, collective)so a later "completed" supersedes an earlier "started" in the same FR ring buffer.testdata/nccl_hang_verdict.schema.json): pins wire shape withadditionalProperties:false,pattern.idconst="15",hanging_ranksminItems:2. 5-row drift-rejection test battery.module/processor/patterndetectorprocessor/): newConfig.NCCLHangThresholdYAML field (default 5min, floor 1s);collectInputsextended to project log records carryinggen_ai.training.rank+nccl.fr.collective_seq_idinto typedNCCLFRRecordvalues; same wire format as pod_evicted (broken-out scalar attrs +pattern.verdict_json).Data shape verdict:
ncclfrreceiveralready stamps every field the detector needs (nccl.fr.pg_id,collective_seq_id,state,profiling_name,time_discovered_started_ns,time_discovered_completed_ns) andrankjoinprocessoraddsgen_ai.training.rank. Norankjoinprocessorextension required — the detector reasons over the latest state per(rank, collective)rather than needing a syntheticlast_completed_collective_id.Test plan
cd module && go test ./pkg/patterns/... ./processor/... -count=1— all greenmake build— collector binary compiles at v0.125make check(lint + vet + mod verify) — 0 issues