
feat(patterns): hbm_ecc detector + processor wiring (NORTHSTAR pattern #3) #274

Merged: trilamsr merged 3 commits into main from feat/hbm-ecc-detector on Jun 1, 2026

@trilamsr trilamsr commented Jun 1, 2026


Summary

Ships pattern #3 (uncorrectable HBM ECC) per docs/patterns/pattern-3-hbm-ecc.md. Two layers required for a verdict — a single layer alone emits nothing.

  • Layer 1 (DCGM): uncorrected double-bit volatile ECC counter delta on a GPU. Discriminator: error.type=uncorrected AND error.subtype=double_bit AND error.persistence=volatile AND Delta >= threshold.
  • Layer 2 (kernel): Xid 48 / 63 / 64 on the SAME gpu.id (PCI BDF) within 5min default window. Causality flows forward — Xid must be at or after the ECC delta.
  • Join key: gpu.id. Same-node is implied but not required for the join (the GPU identifier is the proximate hardware-fault key).

Why two layers required

  • ECC delta alone: operators already see the rising counter on dashboards; pattern's value is the kernel-level confirmation that the GPU is failing.
  • Xid alone: may fire xid_correlation if a pod evicts on the same node — that's a node-scoped pattern, distinct from this GPU-scoped one. Operators get raw kernelevents.xid telemetry already.

What's new

  • module/pkg/patterns/hbm_ecc.go: HBMECCDetector library type, HBMECCRecord projection, HBMECCVerdict output. PatternIDHBMECC = "3", EvidenceKindHBM = "hw_error".
  • module/pkg/patterns/testdata/hbm_ecc_verdict.schema.json: JSON schema with 10 drift falsifiers.
  • XidRecord extended with optional GPUID (PCI BDF from the journald-kernel OTTL recipe per RFC-0013 §3). xid_correlation ignores it; hbm_ecc requires it.
  • Processor wiring: projectHBMECCRecord + appendHBMECCVerdict mirror the xid_correlation shape. Config grows hbm_ecc_window (default 5min) + hbm_ecc_delta_threshold (default 1).

Follow-up gap (not a workaround)

The pattern doc names hw.errors as a Counter on the metrics pipeline; the patterndetectorprocessor consumes log records. The detector wires up against hw.errors.delta-shaped log records but no OTTL recipe today derives that log shape from the upstream metric. Tracked in #273 — needed for end-to-end firing in a real deployment; the detector library is the v0.3 moat and ships independently per RFC-0013.

Test plan

  • cd module && go test ./... -race -count=1 — all green
  • make build — exit 0
  • make check — golangci-lint 0 issues, go vet clean, go mod verified
  • Window-edge fenced both sides: EdgeOutOfWindow (5min+1s, no fire) + EdgeAtWindowBoundary (exactly 5min, must fire)
  • Negative paths: ECC alone, Xid alone, wrong Xid code, corrected ECC, cross-GPU, pre-ECC Xid (causal direction)
  • Schema drift battery: 10 falsifiers (extra field, confidence re-added, pattern.id numeric, pattern.id wrong value, xid_code string, ecc_delta negative, ecc_delta zero, evidence_kind outside enum, evidence_trail under min, gpu_id empty)
  • Wiring: emits-verdict, xid-alone, ecc-alone, out-of-window, cross-gpu, window+threshold configurable
feat(patterns): hbm_ecc detector (NORTHSTAR pattern #3) — emits a verdict when an uncorrected double-bit volatile DCGM ECC counter rise joins an Xid 48/63/64 on the same gpu.id within the configurable window (default 5min). Two new YAML config keys: hbm_ecc_window, hbm_ecc_delta_threshold.

Tri Lam added 2 commits May 31, 2026 20:54
Adds the HBMECCDetector library type alongside pod_evicted /
nccl_hang / xid_correlation. Two-layer join: DCGM uncorrected
double-bit volatile ECC counter delta + Xid 48/63/64 kernel event
on the SAME gpu.id (PCI BDF) within a 5min default window.

Layer decision (single layer alone does not fire):
- ECC delta alone: operators see the raw counter trend on
  dashboards; pattern's value is the kernel-level confirmation.
- Xid alone: may fire xid_correlation if paired with a pod
  eviction; that is a node-scoped pattern, distinct from this
  GPU-scoped one. Operators get raw kernelevents.xid telemetry.

Discriminators:
- ECC: error.type=uncorrected AND error.subtype=double_bit AND
  error.persistence=volatile AND Delta >= ECCDeltaThreshold.
- Xid: code in {48, 63, 64} per docs/patterns/pattern-3-hbm-ecc.md.
- Join: same gpu.id, xid.Timestamp - ecc.Timestamp in [0, window].

Extends XidRecord with an optional GPUID field (PCI BDF; populated
by the journald-kernel OTTL recipe per RFC-0013 §3). xid_correlation
ignores it; hbm_ecc requires it.

PatternIDHBMECC = "3" matches the NORTHSTAR pattern number directly.
EvidenceKindHBM = "hw_error" mirrors the customer-stable hw.errors
metric name.

Window-edge contract is fenced on both sides: EdgeOutOfWindow
(window+1s, no fire) AND EdgeAtWindowBoundary (exactly window, must
fire). Schema drift battery has 10 falsifiers.

Signed-off-by: Tri Lam <tri@maydow.com>
Mirrors the xid_correlation wiring shape — collectInputs gains a
fifth typed projection (patterns.HBMECCRecord), ConsumeLogs runs
HBMECCDetector on (hbm_recs, xid_recs), and appendHBMECCVerdict
emits one verdict log record per match with the same broken-out
attrs + pattern.verdict_json contract.

projectHBMECCRecord gates on (hw.errors.delta + gpu.id); the
remaining DCGM taxonomy (error.type / error.subtype /
error.persistence / hw.gpu.index) is read into the typed record so
the detector's discriminator stays in the library, not the
projection.

projectXidRecord now also reads gpu.id (per the journald-kernel
OTTL recipe in docs/integrations/journald-kernel.md); the
xid_correlation detector ignores it but the hbm_ecc detector
requires it for the same-GPU join.

Config grows two YAML keys — hbm_ecc_window (default 5min, floor
1s) and hbm_ecc_delta_threshold (default 1, floor 0) — both
appended at the end of Config struct so a parallel pattern-#1
agent's diff stays mergeable.

Update pattern-3 doc footer with detector-implemented status; the
metrics→logs OTTL recipe deriving hw.errors.delta is tracked as
follow-up.

Signed-off-by: Tri Lam <tri@maydow.com>
Per fresh-context review of #274: HBMECCDetector.Now was declared
"reserved for future gates" but never read by Evaluate and never set
by any test. Peer detectors (xid_correlation, pod_evicted) don't
carry an equivalent. Apply [[no-bloat]] — delete > add; re-introduce
when the gate it claims to support actually exists.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 04:05
@trilamsr trilamsr merged commit a234c62 into main Jun 1, 2026
15 checks passed
@trilamsr trilamsr deleted the feat/hbm-ecc-detector branch June 1, 2026 04:13
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
PR #274 (hbm_ecc detector) landed during PR #264's rebase window.
Dashboard template options + README coverage matrix stale post-merge:

- pattern_id template: add '3 (hbm_ecc)' option
- README: promote pattern.id=3 from 'not shipped' to 'shipped' row;
  list Verdict rate (panel 1) + Verdict by node (panel 5) as the
  panels that cover it (no dedicated top-N table this round; can
  add in a follow-up when operator-facing HBM ECC fields stabilize)

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
panel 1 description listed only 14/15/16 as shipped; HBM ECC
(pattern_id=3) landed in #274. add it to the enumeration. README
matrix + template-var options already reflected the ship.

closes #287

Signed-off-by: Tri Lam <tri@maydow.com>