feat(patterns): hbm_ecc detector + processor wiring (NORTHSTAR pattern #3) #274
Merged
added 2 commits May 31, 2026 20:54
Adds the HBMECCDetector library type alongside pod_evicted /
nccl_hang / xid_correlation. Two-layer join: DCGM uncorrected
double-bit volatile ECC counter delta + Xid 48/63/64 kernel event
on the SAME gpu.id (PCI BDF) within a 5min default window.
Layer decision (single layer alone does not fire):
- ECC delta alone: operators see the raw counter trend on
dashboards; pattern's value is the kernel-level confirmation.
- Xid alone: may fire xid_correlation if paired with a pod
eviction; that is a node-scoped pattern, distinct from this
GPU-scoped one. Operators get raw kernelevents.xid telemetry.
Discriminators:
- ECC: error.type=uncorrected AND error.subtype=double_bit AND
error.persistence=volatile AND Delta >= ECCDeltaThreshold.
- Xid: code in {48, 63, 64} per docs/patterns/pattern-3-hbm-ecc.md.
- Join: same gpu.id, xid.Timestamp - ecc.Timestamp in [0, window].
Extends XidRecord with an optional GPUID field (PCI BDF; populated
by the journald-kernel OTTL recipe per RFC-0013 §3). xid_correlation
ignores it; hbm_ecc requires it.
PatternIDHBMECC = "3" matches the NORTHSTAR pattern number directly.
EvidenceKindHBM = "hw_error" mirrors the customer-stable hw.errors
metric name.
Window-edge contract is fenced on both sides: EdgeOutOfWindow
(window+1s, no fire) AND EdgeAtWindowBoundary (exactly window, must
fire). Schema drift battery has 10 falsifiers.
Signed-off-by: Tri Lam <tri@maydow.com>
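The discriminators and join rule in the commit message above can be sketched as a single predicate. This is a minimal illustration, not the code in hbm_ecc.go: the type names `ECCRecord` and `XidEvent`, their fields, and the functions `eccQualifies` and `joins` are hypothetical stand-ins for HBMECCRecord, XidRecord, and the detector's Evaluate path.

```go
package main

import (
	"fmt"
	"time"
)

// defaultWindow is the 5min default join window from the commit message.
const defaultWindow = 5 * time.Minute

// ECCRecord is a hypothetical stand-in for HBMECCRecord; field names are assumptions.
type ECCRecord struct {
	GPUID       string // gpu.id (PCI BDF)
	Type        string // error.type
	Subtype     string // error.subtype
	Persistence string // error.persistence
	Delta       int64  // hw.errors.delta
	Timestamp   time.Time
}

// XidEvent is a hypothetical stand-in for XidRecord with the optional GPUID field.
type XidEvent struct {
	GPUID     string
	Code      int
	Timestamp time.Time
}

// hbmXidCodes is the Xid set named in the commit message.
var hbmXidCodes = map[int]bool{48: true, 63: true, 64: true}

// eccQualifies applies the ECC-layer discriminator.
func eccQualifies(r ECCRecord, threshold int64) bool {
	return r.Type == "uncorrected" &&
		r.Subtype == "double_bit" &&
		r.Persistence == "volatile" &&
		r.Delta >= threshold
}

// joins reports whether both layers hold for the same gpu.id and the Xid
// lands in [0, window] after the ECC delta (both edges inclusive).
func joins(e ECCRecord, x XidEvent, window time.Duration, threshold int64) bool {
	if !eccQualifies(e, threshold) || !hbmXidCodes[x.Code] {
		return false
	}
	// An empty gpu.id never joins: the Xid side must carry the PCI BDF.
	if e.GPUID == "" || e.GPUID != x.GPUID {
		return false
	}
	d := x.Timestamp.Sub(e.Timestamp)
	return d >= 0 && d <= window
}

func main() {
	t0 := time.Date(2026, 5, 31, 12, 0, 0, 0, time.UTC)
	ecc := ECCRecord{
		GPUID: "0000:3b:00.0", Type: "uncorrected", Subtype: "double_bit",
		Persistence: "volatile", Delta: 1, Timestamp: t0,
	}
	atEdge := XidEvent{GPUID: ecc.GPUID, Code: 48, Timestamp: t0.Add(defaultWindow)}
	pastEdge := XidEvent{GPUID: ecc.GPUID, Code: 48, Timestamp: t0.Add(defaultWindow + time.Second)}
	otherGPU := XidEvent{GPUID: "0000:5e:00.0", Code: 63, Timestamp: t0.Add(time.Minute)}

	fmt.Println(joins(ecc, atEdge, defaultWindow, 1))   // true: exactly at the window boundary
	fmt.Println(joins(ecc, pastEdge, defaultWindow, 1)) // false: window+1s
	fmt.Println(joins(ecc, otherGPU, defaultWindow, 1)) // false: different gpu.id
}
```

The first two cases correspond to the EdgeAtWindowBoundary and EdgeOutOfWindow fences: the interval is closed on both ends, and an Xid that precedes the ECC delta yields a negative difference and does not join.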
Mirrors the xid_correlation wiring shape: collectInputs gains a fifth typed projection (patterns.HBMECCRecord), ConsumeLogs runs HBMECCDetector on (hbm_recs, xid_recs), and appendHBMECCVerdict emits one verdict log record per match with the same broken-out attrs + pattern.verdict_json contract.

projectHBMECCRecord gates on (hw.errors.delta + gpu.id); the remaining DCGM taxonomy (error.type / error.subtype / error.persistence / hw.gpu.index) is read into the typed record so the detector's discriminator stays in the library, not the projection.

projectXidRecord now also reads gpu.id (per the journald-kernel OTTL recipe in docs/integrations/journald-kernel.md); the xid_correlation detector ignores it, but the hbm_ecc detector requires it for the same-GPU join.

Config grows two YAML keys, hbm_ecc_window (default 5min, floor 1s) and hbm_ecc_delta_threshold (default 1, floor 0), both appended at the end of the Config struct so a parallel pattern-#1 agent's diff stays mergeable.

Updates the pattern-3 doc footer with detector-implemented status; the metrics→logs OTTL recipe deriving hw.errors.delta is tracked as a follow-up.

Signed-off-by: Tri Lam <tri@maydow.com>
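As a rough illustration of the two new config keys, the sketch below applies the stated defaults and floors. The commit message does not say whether a value below the floor is rejected or clamped, nor how an unset threshold is distinguished from an explicit 0; rejecting with an error and using a nil pointer for "unset" are assumptions here, and the `Config` fields and `resolve` method are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Defaults and floors as stated in the commit message.
const (
	defaultHBMECCWindow         = 5 * time.Minute
	minHBMECCWindow             = time.Second
	defaultHBMECCDeltaThreshold = int64(1)
	minHBMECCDeltaThreshold     = int64(0)
)

// Config is a hypothetical slice of the processor config; only the two
// hbm_ecc keys are shown.
type Config struct {
	HBMECCWindow         time.Duration // hbm_ecc_window; zero means unset (assumption)
	HBMECCDeltaThreshold *int64        // hbm_ecc_delta_threshold; nil means unset (assumption)
}

// resolve fills defaults for unset keys and rejects values below the floors.
func (c Config) resolve() (time.Duration, int64, error) {
	window := c.HBMECCWindow
	if window == 0 {
		window = defaultHBMECCWindow
	}
	if window < minHBMECCWindow {
		return 0, 0, errors.New("hbm_ecc_window must be >= 1s")
	}

	threshold := defaultHBMECCDeltaThreshold
	if c.HBMECCDeltaThreshold != nil {
		threshold = *c.HBMECCDeltaThreshold
	}
	if threshold < minHBMECCDeltaThreshold {
		return 0, 0, errors.New("hbm_ecc_delta_threshold must be >= 0")
	}
	return window, threshold, nil
}

func main() {
	w, th, err := Config{}.resolve()
	fmt.Println(w, th, err) // 5m0s 1 <nil>

	_, _, err = Config{HBMECCWindow: 500 * time.Millisecond}.resolve()
	fmt.Println(err) // hbm_ecc_window must be >= 1s
}
```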
Per fresh-context review of #274: HBMECCDetector.Now was declared "reserved for future gates" but was never read by Evaluate and never set by any test. Peer detectors (xid_correlation, pod_evicted) don't carry an equivalent. Apply [[no-bloat]] (delete > add); re-introduce it when the gate it claims to support actually exists.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request on Jun 1, 2026
PR #274 (hbm_ecc detector) landed during PR #264's rebase window. Dashboard template options + README coverage matrix are stale post-merge:
- pattern_id template: add '3 (hbm_ecc)' option
- README: promote pattern.id=3 from 'not shipped' to 'shipped' row; list Verdict rate (panel 1) + Verdict by node (panel 5) as the panels that cover it (no dedicated top-N table this round; can add in a follow-up when operator-facing HBM ECC fields stabilize)

Signed-off-by: Tri Lam <tri@maydow.com>
Summary
Ships pattern #3 (uncorrectable HBM ECC) per docs/patterns/pattern-3-hbm-ecc.md. Two layers are required for a verdict; a single layer alone emits nothing.
- ECC layer: error.type=uncorrected AND error.subtype=double_bit AND error.persistence=volatile AND Delta >= threshold.
- Xid layer: Xid 48/63/64 kernel event on the same gpu.id (PCI BDF) within the 5min default window. Causality flows forward: the Xid must be at or after the ECC delta.
- Join: same gpu.id. Same-node is implied but not required for the join (the GPU identifier is the proximate hardware-fault key).

Why two layers required
- ECC delta alone: operators already see the raw counter trend on dashboards; the pattern's value is the kernel-level confirmation.
- Xid alone: may fire xid_correlation if a pod evicts on the same node; that is a node-scoped pattern, distinct from this GPU-scoped one. Operators get raw kernelevents.xid telemetry already.

What's new
- module/pkg/patterns/hbm_ecc.go: HBMECCDetector library type, HBMECCRecord projection, HBMECCVerdict output. PatternIDHBMECC = "3", EvidenceKindHBM = "hw_error".
- module/pkg/patterns/testdata/hbm_ecc_verdict.schema.json: JSON schema with 10 drift falsifiers.
- XidRecord extended with optional GPUID (PCI BDF from the journald-kernel OTTL recipe per RFC-0013 §3). xid_correlation ignores it; hbm_ecc requires it.
- Processor wiring: projectHBMECCRecord + appendHBMECCVerdict mirror the xid_correlation shape. Config grows hbm_ecc_window (default 5min) + hbm_ecc_delta_threshold (default 1).

Follow-up gap (not a workaround)
The pattern doc names hw.errors as a Counter on the metrics pipeline; the patterndetector processor consumes log records. The detector wires up against hw.errors.delta-shaped log records, but no OTTL recipe today derives that log shape from the upstream metric. Tracked in #273; it is needed for end-to-end firing in a real deployment. The detector library is the v0.3 moat and ships independently per RFC-0013.

Test plan
- cd module && go test ./... -race -count=1: all green
- make build: exit 0
- make check: golangci-lint 0 issues, go vet clean, go mod verified
- Window-edge contract: EdgeOutOfWindow (5min+1s, no fire) + EdgeAtWindowBoundary (exactly 5min, must fire)
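To make the follow-up gap concrete, here is a hedged sketch of what turning a cumulative hw.errors counter into per-GPU delta values involves. It is plain Go for illustration only, not the OTTL recipe tracked in #273. The `sample` and `deltaTracker` names are hypothetical, and two behaviors are assumptions: the first sample for a GPU yields 0 (no baseline), and a decrease is treated as a counter reset whose new value is the delta.

```go
package main

import "fmt"

// sample is a hypothetical cumulative reading of the hw.errors counter for one GPU.
type sample struct {
	GPUID string // gpu.id (PCI BDF)
	Value int64  // cumulative count at scrape time
}

// deltaTracker remembers the last cumulative value seen per gpu.id.
type deltaTracker struct {
	last map[string]int64
}

func newDeltaTracker() *deltaTracker {
	return &deltaTracker{last: map[string]int64{}}
}

// observe returns the increase since the previous sample for the same GPU.
func (t *deltaTracker) observe(s sample) int64 {
	prev, seen := t.last[s.GPUID]
	t.last[s.GPUID] = s.Value
	switch {
	case !seen:
		return 0 // no baseline yet (assumption)
	case s.Value < prev:
		return s.Value // counter went backwards: treat as a reset (assumption)
	default:
		return s.Value - prev
	}
}

func main() {
	tr := newDeltaTracker()
	gpu := "0000:3b:00.0"
	for _, v := range []int64{0, 2, 2, 1} {
		fmt.Println(tr.observe(sample{GPUID: gpu, Value: v}))
	}
	// Prints 0, 2, 0, 1 on separate lines.
}
```

Only the samples with a delta at or above hbm_ecc_delta_threshold would matter to the detector; whether zero-delta records are emitted at all is a choice for the eventual recipe.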