Skip to content

feat(patterns): nccl_hang detector + processor wiring (v0.3.0 NORTHSTAR pattern #2)#250

Merged
trilamsr merged 3 commits into
mainfrom
feat/nccl-hang-detector
Jun 1, 2026
Merged

feat(patterns): nccl_hang detector + processor wiring (v0.3.0 NORTHSTAR pattern #2)#250
trilamsr merged 3 commits into
mainfrom
feat/nccl-hang-detector

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Second real detector in module/pkg/patterns/ (v0.3.0 NORTHSTAR pattern #2). NCCLHangDetector reads cross-rank NCCL FlightRecorder records joined by rankjoinprocessor and emits one NCCLHangVerdict per (pg_id, collective_seq_id) where >=2 ranks are stuck on the same collective past HangThreshold (default 5min, configurable).

  • Detector (module/pkg/patterns/nccl_hang.go): zero-value usable; mirrors PodEvictedDetector API shape (config-field struct, Evaluate() method, deterministic output sorted by (collective_seq_id, earliest_started_ns)). Takes latest record per (rank, pg, collective) so a later "completed" supersedes an earlier "started" in the same FR ring buffer.
  • Schema (testdata/nccl_hang_verdict.schema.json): pins wire shape with additionalProperties:false, pattern.id const="15", hanging_ranks minItems:2. 5-row drift-rejection test battery.
  • Wiring (module/processor/patterndetectorprocessor/): new Config.NCCLHangThreshold YAML field (default 5min, floor 1s); collectInputs extended to project log records carrying gen_ai.training.rank + nccl.fr.collective_seq_id into typed NCCLFRRecord values; same wire format as pod_evicted (broken-out scalar attrs + pattern.verdict_json).

Data shape verdict: ncclfrreceiver already stamps every field the detector needs (nccl.fr.pg_id, collective_seq_id, state, profiling_name, time_discovered_started_ns, time_discovered_completed_ns) and rankjoinprocessor adds gen_ai.training.rank. No rankjoinprocessor extension required — the detector reasons over the latest state per (rank, collective) rather than needing a synthetic last_completed_collective_id.

Test plan

  • cd module && go test ./pkg/patterns/... ./processor/... -count=1 — all green
  • make build — collector binary compiles at v0.125
  • make check (lint + vet + mod verify) — 0 issues
  • TDD red-test-first discipline: each new test failed compile/assert before its impl landed
  • Detector test cases: positive 3-rank hang, all-completed negative, straggler-under-threshold edge, solo-rank non-hang, threshold configurability, deterministic order, cross-collective non-join, later-completed-supersedes-started
  • Schema-conformance + 5-row drift-rejection test battery
  • Wiring tests: positive verdict emission, healthy non-emission, threshold configurability
feat(patterns): add nccl_hang detector (v0.3.0 NORTHSTAR pattern #2) — detects collective hangs across NCCL ranks by reading FlightRecorder state past a configurable threshold (default 5min). Wired into patterndetectorprocessor alongside pod_evicted; configurable via `nccl_hang_threshold` YAML field.

Tri Lam added 3 commits May 31, 2026 18:48
NCCLHangDetector reads cross-rank NCCL FlightRecorder records and
emits one NCCLHangVerdict per (pg_id, collective_seq_id) where
>=2 ranks have a non-completed record older than HangThreshold
(default 5min, configurable). Zero-value usable; mirrors the
PodEvictedDetector API.

Output is sorted by (collective_seq_id, earliest_started_ns) for
deterministic golden tests. The detector takes the latest record
per (rank, pg, collective) so a later "completed" entry supersedes
an earlier "started" in the same FR ring buffer.

Schema-conformance: testdata/nccl_hang_verdict.schema.json pins
the wire shape; drift-rejection test battery defends 5 schema
constraints.

TDD: red-test-first with 8 cases — positive 3-rank hang, all-
completed negative, straggler-under-threshold edge, solo rank
not-hang, threshold configurability, deterministic order, cross-
collective non-join, later-completed supersedes started.

Signed-off-by: Tri Lam <tri@maydow.com>
Adds the nccl_hang detector path alongside the existing pod_evicted
flow. The processor now:

- Projects log records carrying gen_ai.training.rank +
  nccl.fr.collective_seq_id into patterns.NCCLFRRecord values
  (rank fallback to nccl.rank / nccl.fr.rank for direct-from-
  receiver paths bypassing rankjoinprocessor).
- Surfaces patterns.NCCLHangDetector with HangThreshold sourced
  from Config.NCCLHangThreshold (YAML: nccl_hang_threshold).
  Default 5min, floor 1s, validated.
- Emits nccl_hang verdicts on the same wire format as pod_evicted
  (broken-out scalar attrs + pattern.verdict_json) so downstream
  consumers don't branch on pattern.id to read headline/remediation.

TDD: 3 wiring tests cover the positive 3-rank hang path, the
healthy all-completed negative, and threshold configurability.

Signed-off-by: Tri Lam <tri@maydow.com>
Per fresh-context review of #250: the detector hardcoded Confidence
to Full and never populated MissingLayers, yet the verdict struct and
schema both advertised a Partial branch that never fires. Pod-evicted
keeps the Partial path because node-condition layer can genuinely
miss; nccl_hang has only one evidence layer (FR records) so the field
is dead weight. Delete > add.

TimeDiscoveredCompletedNs was projected from OTel attrs and stored on
NCCLFRRecord but never read by the detector — State=='completed' is
the sole supersession signal. Removed projection + field + schema
prop + test fixtures.

Schema drift-battery: replaced the now-impossible 'confidence_outside_enum'
case with 'confidence_reintroduced' to pin that additionalProperties:false
rejects re-adding the field.

Tests + lint + vet + mod verify green.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 02:06
@trilamsr trilamsr merged commit f89958d into main Jun 1, 2026
15 checks passed
@trilamsr trilamsr deleted the feat/nccl-hang-detector branch June 1, 2026 02:15
trilamsr added a commit that referenced this pull request Jun 1, 2026
…ORTHSTAR pattern #3) (#255)

## Summary

Third real detector in `module/pkg/patterns/` — closes v0.3.0 NORTHSTAR
O1 ("3 patterns at v0.1.0") alongside `pod_evicted` (#14) and
`nccl_hang` (#15).

**Pattern**: a GPU Xid kernel event (carrying the customer-stable
`kernelevents.xid` attribute) followed within a configurable window
(default 60s) by a pod eviction on the same node → one verdict per (Xid,
evicted_pod) tuple. Operator-actionable: "Xid 79 on gpu-node-0001 →
training/job-rank-3 evicted 10s later" with remediation pinning the node
to drain and the pod to reschedule.

**Detector** (`module/pkg/patterns/xid_correlation.go`):
- `XidCorrelationDetector` zero-value usable; `CorrelationWindow`
overrideable (defaults to `DefaultXidCorrelationWindow = 60s`).
- `Evaluate(xids, events)` joins on `(node, time-within-window)` with
the most-recent Xid winning the proximate-cause attribution.
- Causal-direction guard: eviction BEFORE Xid does not correlate
(mirrors pod_evicted's future-transition exclusion).
- Deterministic output sorted by `(eviction_time, EventUID)`.

**Verdict shape** — single vs multi-layer decision: structurally two
evidence surfaces (kernel_event + pod_event in causal order in the
evidence trail), but emission rule is "both layers joined or no
verdict". Xid-without-eviction is already operator-visible via the raw
`kernelevents.xid` telemetry; an Xid-only verdict would duplicate that
signal. So `Confidence` / `MissingLayers` are omitted (dead-fields
discipline from PR #250 review). Mirrors the `nccl_hang` shape.

**Multi-pod decision**: one verdict PER evicted pod (not one verdict
listing all). Per-pod is the operator-actionable shape — each verdict's
`Remediation` pins a specific pod to drain/recreate so alert routing
fans out by pod owner; collapsing would force operators to parse a list
and lose per-pod routing.

**Schema** (`testdata/xid_correlation_verdict.schema.json`):
`additionalProperties:false`, `pattern.id` const="16",
`evidence_trail.kind` enum=`["kernel_event","pod_event"]`, `minItems:2`
(both layers required). 7-row drift-rejection battery covers
re-introduced `confidence`/`missing_layers`, evidence-minimum violation,
kind-enum, and `pattern.id` numeric vs string drift.

**Wiring** (`patterndetectorprocessor`):
- New `Config.XidCorrelationWindow` YAML field (default 60s, floor 1s),
validated alongside the existing window fields.
- `collectInputs` returns a fourth typed slice (`[]patterns.XidRecord`)
built by `projectXidRecord`, which gates on `kernelevents.xid` and reads
the host node from the resource attribute `k8s.node.name` (the standard
k8sattributes / resourcedetection stamp on a DaemonSet — same pattern
`projectNodeCondition` uses). Per-record `k8s.node.name` fallback for
non-DaemonSet emitters.
- `appendXidCorrelationVerdict` mirrors `appendNCCLHangVerdict`'s wire
format — broken-out scalar attrs plus full `pattern.verdict_json`.

**Recipe alignment**: `docs/integrations/journald-kernel.md`
§"Customer-stable attribute mapping" already documents
`kernelevents.xid` (int) as the OTTL-stamped surface. No recipe update
needed; the existing pipeline emits exactly what the detector consumes.
`gpu.id` (PCI BDF) is also stamped but intentionally not projected — the
detector doesn't use it yet, so it stays available on the raw record
without entering the pattern library as a dead field.

## Test plan

- [x] `cd module && go test ./pkg/patterns/... ./processor/... -race
-count=1` — green
- [x] `make build` — collector binary compiles
- [x] `make check` (golangci-lint + go vet + go mod verify) — 0 issues
- [x] TDD red-test-first: both `xid_correlation_test.go` files failed to
compile/assert before their impl landed
- [x] Detector test cases: positive Xid 79 → eviction, no-eviction
negative, cross-node negative, out-of-window edge (61s default),
multi-pod per Xid (3 verdicts), window-configurable, pre-Xid eviction
excluded, non-evicted-hint ignored, deterministic order,
most-recent-Xid-wins
- [x] Schema-conformance + 7-row drift-rejection battery
- [x] Wiring tests: positive verdict emission, healthy non-emission,
window override
- [x] `TestConfig_Validate` extended with sub-1s xid_correlation_window
floor case
- [x] `TestFactory_Surface` pins `DefaultXidCorrelationWindow` (and the
previously-unpinned `DefaultNCCLHangThreshold`) on the factory's default
config

```release-notes
feat(patterns): add xid_correlation detector (v0.3.0 NORTHSTAR pattern #3) — correlates GPU Xid kernel events (`kernelevents.xid` attribute per RFC-0013 §3) with downstream pod evictions on the same node within a configurable window (default 60s). Emits one verdict per evicted pod so alert routing can fan out per pod owner. Wired into patterndetectorprocessor; configurable via `xid_correlation_window` YAML field.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant