feat(patterns): pcie_aer detector + processor wiring (NORTHSTAR pattern #5) #286
Merged
added 3 commits
May 31, 2026 21:58
Ships NORTHSTAR pattern #5 per docs/patterns/pattern-5-pcie-aer.md.

Two layers are required for a verdict; a single layer alone emits nothing.

- Layer 1 (kernel AER): a `PCIe Bus Error: severity=…, type=…` line on a specific `gpu.id` (PCI BDF), extracted by the journald-kernel OTTL recipe.
- Layer 2 (rate-collapse): per-GPU `hw.gpu.io` `BytesPerSecond` <= `(1 - RateDropThreshold) * BaselineBytesPerSecond` on the same BDF within the correlation window. Default threshold 0.5 (50% drop) and window 5min. Causality flows forward: the AER must precede the rate-collapse sample.

Why two layers are required (no Confidence taxonomy, mirroring xid_correlation + hbm_ecc):

- AER alone: corrected errors were recovered and the link re-trained. Operators already see raw AER telemetry on journald.
- Rate-collapse alone: workload-natural Tx/Rx variance (a rank finished a comm phase, etc.). The AER is the hardware-fault confirmation that makes the join operator-actionable.

Join key is `gpu.id` (PCI BDF) per RFC-0013 §3, shared across the dmesg AER preamble and the dcgm-exporter `hw.gpu.pci.bdf` resource attribute on `hw.gpu.io`. Node is carried for the verdict prose but is not part of the join key, because the BDF is the proximate hardware identifier.

Defaults:

- CorrelationWindow: 5min. Mirrors the spec's `rate(...[5m])` PromQL and typical post-AER link-renegotiation latency.
- RateDropThreshold: 0.5. A link downshift reaches this floor: x16 → x8 halves the available link bandwidth (a 50% drop), and Gen5 → Gen3, a two-generation drop, cuts it by about 75%.

What's new:

- module/pkg/patterns/pcie_aer.go: PCIeAERDetector library type, PCIeAERRecord (AER kernel message), PCIeIORecord (hw.gpu.io rate sample with baseline), PCIeAERVerdict output. PatternIDPCIeAER = "5"; EvidenceKindAER = "pcie_aer"; EvidenceKindPCIeIOCollapse = "hw_io_collapse".
- module/pkg/patterns/testdata/pcie_aer_verdict.schema.json: JSON schema with 10 drift falsifiers (extra field, confidence re-add, pattern.id numeric, pattern.id wrong value, severity outside enum, gpu_id empty, drop_ratio negative, drop_ratio over 1, evidence kind outside enum, evidence_trail under min).

Verdict struct carries promoted scalar fields (Severity, AERType, DropRatio, GPUID, Node) so the processor wiring can stamp them as top-level OTLP log attributes for dashboard table-aggregation per PR #275's lesson. No dead fields: every struct field is read by either the Evaluate path or the verdict-shape contract.

End-to-end firing in a real deployment is blocked on PR-B (issue #260): no OTTL recipe today derives the per-GPU `tracecore.alert.pcie_rate_collapse`-shaped log record from `hw.gpu.io`. The detector library is the v0.3 moat and ships independently per ADR-0001.

Signed-off-by: Tri Lam <tri@maydow.com>
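The two-layer rule described in this commit can be sketched as a small join. This is a minimal illustration, not the code in pcie_aer.go: the type names `AERRecord`, `IORecord`, `Verdict` and the function `correlate` are hypothetical stand-ins, and treating a zero lag as "AER precedes sample" is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// AERRecord and IORecord are hypothetical stand-ins for the PR's
// PCIeAERRecord and PCIeIORecord; the real field sets are not shown here.
type AERRecord struct {
	GPUID    string // PCI BDF, the join key
	Severity string
	Type     string
	At       time.Time
}

type IORecord struct {
	GPUID                  string
	BytesPerSecond         float64
	BaselineBytesPerSecond float64
	At                     time.Time
}

// Verdict holds the joined result for one GPU.
type Verdict struct {
	GPUID     string
	Severity  string
	AERType   string
	DropRatio float64
}

// correlate emits a verdict only when both layers agree on the same BDF:
// an AER record followed, within window, by a sample at or below
// (1 - threshold) * baseline. Either layer alone yields nothing.
func correlate(aers []AERRecord, ios []IORecord, window time.Duration, threshold float64) []Verdict {
	var out []Verdict
	for _, a := range aers {
		for _, s := range ios {
			if s.GPUID != a.GPUID || s.BaselineBytesPerSecond <= 0 {
				continue
			}
			lag := s.At.Sub(a.At)
			if lag < 0 || lag > window {
				continue // causality is forward: AER first, then collapse
			}
			if s.BytesPerSecond > (1-threshold)*s.BaselineBytesPerSecond {
				continue // rate has not collapsed far enough
			}
			out = append(out, Verdict{
				GPUID:     a.GPUID,
				Severity:  a.Severity,
				AERType:   a.Type,
				DropRatio: 1 - s.BytesPerSecond/s.BaselineBytesPerSecond,
			})
			break // at most one verdict per AER record in this sketch
		}
	}
	return out
}

func main() {
	t0 := time.Date(2026, 5, 31, 12, 0, 0, 0, time.UTC)
	aers := []AERRecord{{GPUID: "0000:3b:00.0", Severity: "Corrected", Type: "Physical Layer", At: t0}}
	ios := []IORecord{{GPUID: "0000:3b:00.0", BytesPerSecond: 4e9, BaselineBytesPerSecond: 16e9, At: t0.Add(2 * time.Minute)}}
	for _, v := range correlate(aers, ios, 5*time.Minute, 0.5) {
		fmt.Printf("%s severity=%s drop=%.2f\n", v.GPUID, v.Severity, v.DropRatio)
	}
}
```

Keying on the BDF alone keeps the join independent of node naming, which matches the note above that Node is carried only for the verdict prose.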
Wires the PCIeAERDetector library type into patterndetectorprocessor:
- Config: PCIeAERWindow (yaml: pcie_aer_window, default 5min),
  PCIeAERRateDropThreshold (yaml: pcie_aer_rate_drop_threshold,
  default 0.5). Validate rejects sub-1s window and threshold
  outside [0, 1].
- collectInputs grows two new typed projections behind tighter
  discriminators than the existing five:
  - projectPCIeAERRecord: gated on `kernelevents.pcie_aer.severity`
    AND `gpu.id`. Reads severity/type/gpu.id off log attrs, falls
    back to resource gpu.id when the journald-kernel OTTL stamps
    it on the resource.
  - projectPCIeIORecord: gated on
    `tracecore.alert.pcie_rate_collapse.bytes_per_second` AND
    `gpu.id`. Reads BytesPerSecond + Baseline + Direction off log
    attrs; the bridge attribute name is namespaced
    `tracecore.alert.pcie_rate_collapse.*` so downstream knows it's
    a tracecore-derived alert (vs a raw hw.gpu.io scrape sample).
- appendPCIeAERVerdict promotes (GPUID, Severity, AERType,
  DropRatio, Node) onto the verdict log record as top-level OTLP
  attributes per PR #275's lesson, so dashboards can
  table-aggregate without parsing pattern.verdict_json.
Names track OTel semconv (`gpu.id`, `k8s.node.name`) and
recipe-canonical keys (`kernelevents.pcie_aer.severity/.type`).
Wiring tests cover: emits-verdict, AER-alone (no fire),
rate-collapse-alone (no fire), window-configurable,
threshold-configurable, promoted-scalar attribute presence, and
the new Validate guard.
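The Validate guard described above (reject a sub-1s window, reject a threshold outside [0, 1]) might look roughly like the following. The struct layout, the `mapstructure` tags, and the error wording are assumptions for illustration; only the two yaml keys, their defaults, and the two bounds come from this commit message.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config models only the two PCIe AER knobs named in the commit message.
// Field tags and layout are assumed, not copied from the processor.
type Config struct {
	PCIeAERWindow            time.Duration `mapstructure:"pcie_aer_window"`
	PCIeAERRateDropThreshold float64       `mapstructure:"pcie_aer_rate_drop_threshold"`
}

// defaultConfig returns the defaults stated above: 5min window, 0.5 threshold.
func defaultConfig() Config {
	return Config{PCIeAERWindow: 5 * time.Minute, PCIeAERRateDropThreshold: 0.5}
}

// Validate rejects a window under one second and a threshold outside the
// closed interval [0, 1]. The negated range check also rejects NaN.
func (c Config) Validate() error {
	var errs []error
	if c.PCIeAERWindow < time.Second {
		errs = append(errs, fmt.Errorf("pcie_aer_window must be >= 1s, got %s", c.PCIeAERWindow))
	}
	t := c.PCIeAERRateDropThreshold
	if !(t >= 0 && t <= 1) {
		errs = append(errs, fmt.Errorf("pcie_aer_rate_drop_threshold must be in [0, 1], got %v", t))
	}
	return errors.Join(errs...) // nil when no check failed
}

func main() {
	fmt.Println(defaultConfig().Validate())
	bad := Config{PCIeAERWindow: 500 * time.Millisecond, PCIeAERRateDropThreshold: 1.5}
	fmt.Println(bad.Validate())
}
```

Collecting both failures with `errors.Join` (Go 1.20+) reports a bad window and a bad threshold in one pass instead of one per restart.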
Integration gap (filed separately, not blocking this PR):
1. The journald-kernel OTTL recipe extracts kernelevents.xid +
   gpu.id from `NVRM: Xid` lines but does NOT extract
   kernelevents.pcie_aer.* from `PCIe Bus Error: …` lines; it
   needs a sibling OTTL stanza in
   docs/integrations/journald-kernel.md.
2. The metrics→logs PCIe rate-collapse alert OTTL recipe is the
   PR-B side of issue #260 (per ADR-0001, the blocker for all
   metrics-sourced patterns).
Until both land, the detector is the v0.3 moat (pattern logic +
tests) and the wiring is configured-but-quiet (zero verdicts on
real input, because the projections find nothing).
Signed-off-by: Tri Lam <tri@maydow.com>
doc-check fail mode: `//.*\bPR\s*#\d+` regex blocks bare PR refs in source comments per STYLE.md (defaults to no comments; PR refs rot in long-lived files). Rephrased to issue-N where the trail still matters; otherwise stripped. Signed-off-by: Tri Lam <tri@maydow.com>
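For reference, the quoted pattern can be exercised directly with Go's `regexp` package, which supports `\b`, `\s`, and `\d`. This is only a sketch of what the regex matches; how `make doc-check` actually applies it (which files, which diff lines) is not shown in this PR, and the `flagged` helper is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// barePRRef is the pattern quoted in the commit message above.
var barePRRef = regexp.MustCompile(`//.*\bPR\s*#\d+`)

// flagged reports whether a source line carries a bare PR reference
// inside a line comment.
func flagged(line string) bool {
	return barePRRef.MatchString(line)
}

func main() {
	lines := []string{
		"// promoted per PR #275's lesson",
		"// promoted per issue-270",
		"x := 1 // see PR#12",
		`msg := "PR #9"`, // no comment marker, so not flagged
	}
	for _, l := range lines {
		fmt.Printf("%-36q flagged=%v\n", l, flagged(l))
	}
}
```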
94b844a to
46b8eb3
Summary
Ships NORTHSTAR pattern #5 (PCIe AER + hw.gpu.io rate-collapse) per `docs/patterns/pattern-5-pcie-aer.md`.

Two layers required for a verdict (no Confidence taxonomy, mirroring xid_correlation + hbm_ecc):
- `PCIe Bus Error: severity=..., type=...` line on a specific `gpu.id` (PCI BDF) extracted by the `journald-kernel` OTTL recipe.
- `hw.gpu.io` `BytesPerSecond` <= `(1 - RateDropThreshold) * BaselineBytesPerSecond` on the same BDF within the correlation window. Default threshold 0.5 (50% drop) and window 5min. Causality is forward: the AER must precede the rate-collapse sample.

Join key: `gpu.id` (PCI BDF) per RFC-0013 §3, shared across the dmesg AER preamble and the dcgm-exporter `hw.gpu.pci.bdf` resource attribute on `hw.gpu.io`. Node is carried for the verdict prose only, not the join key, because the BDF is the proximate hardware identifier.

The verdict log record promotes operator-facing scalars onto top-level OTLP attributes (`gpu.id`, `kernelevents.pcie_aer.severity`, `kernelevents.pcie_aer.type`, `k8s.node.name`, `tracecore.alert.pcie_rate_collapse.drop_ratio`) per the issue #270 scalar-promotion contract, via the `putStrIfSet` helper so an empty upstream stamp doesn't silently match empty-filter dashboard queries.
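A rough sketch of the scalar promotion, assuming a plain Go map in place of the OTLP attribute map. The helper name `putStrIfSet` and the five attribute keys come from the description above; its signature, the `Verdict` type, and the `promote` function are hypothetical.

```go
package main

import "fmt"

// Verdict is a hypothetical stand-in for PCIeAERVerdict, reduced to the
// promoted scalars listed above.
type Verdict struct {
	GPUID     string
	Severity  string
	AERType   string
	Node      string
	DropRatio float64
}

// putStrIfSet writes key only when val is non-empty, so a missing upstream
// stamp leaves the attribute absent instead of present-but-empty.
// The real helper's signature is not shown in the PR; this one is assumed.
func putStrIfSet(attrs map[string]any, key, val string) {
	if val != "" {
		attrs[key] = val
	}
}

// promote copies the operator-facing scalars onto top-level attributes.
func promote(attrs map[string]any, v Verdict) {
	putStrIfSet(attrs, "gpu.id", v.GPUID)
	putStrIfSet(attrs, "kernelevents.pcie_aer.severity", v.Severity)
	putStrIfSet(attrs, "kernelevents.pcie_aer.type", v.AERType)
	putStrIfSet(attrs, "k8s.node.name", v.Node)
	attrs["tracecore.alert.pcie_rate_collapse.drop_ratio"] = v.DropRatio
}

func main() {
	attrs := map[string]any{}
	// Node is left empty to show that no k8s.node.name key is written.
	promote(attrs, Verdict{GPUID: "0000:3b:00.0", Severity: "Corrected", AERType: "Physical Layer", DropRatio: 0.75})
	_, hasNode := attrs["k8s.node.name"]
	fmt.Println("attributes:", len(attrs), "node stamped:", hasNode)
}
```

Leaving the key absent means a dashboard filter such as `k8s.node.name = ""` cannot accidentally select verdicts whose node was simply never stamped.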
Integration gap (filed separately, not blocking this PR)
Per ADR-0001's intentional split (library + processor wiring first, OTTL recipes second), end-to-end firing on a real cluster needs:
- An OTTL recipe deriving `tracecore.alert.pcie_rate_collapse.*` log records from raw `hw.gpu.io` Counter samples (the metrics→logs PR-B side; same blocker class as #260, "Recipe extension: emit hw.gpu.nvlink.* + wire metrics path for pattern-1 detector", for pattern-1).
- A `journald-kernel` recipe extension to extract `kernelevents.pcie_aer.severity` / `.type` / `gpu.id` from `PCIe Bus Error: ...` kernel lines (the existing NVRM Xid stanza is the proximate sibling: same file, same extraction shape).

Until both land, the detector is the v0.3 moat (pattern logic + tests) and the processor wiring is configured-but-quiet (zero verdicts on real input, because the projections find nothing). The detector library and wiring ship independently per ADR-0001.
What's new
- `module/pkg/patterns/pcie_aer.go`: `PCIeAERDetector` library type, `PCIeAERRecord` (AER kernel message), `PCIeIORecord` (hw.gpu.io rate sample with baseline), `PCIeAERVerdict` output. `PatternIDPCIeAER = "5"`; `EvidenceKindAER = "pcie_aer"`; `EvidenceKindPCIeIOCollapse = "hw_io_collapse"`.
- `module/pkg/patterns/testdata/pcie_aer_verdict.schema.json`: JSON schema with 10 drift falsifiers (extra field, confidence re-add, pattern.id numeric, pattern.id wrong value, severity outside enum, gpu_id empty, drop_ratio negative, drop_ratio over 1, evidence kind outside enum, evidence_trail under min).
- Processor config: `pcie_aer_window` (default 5min), `pcie_aer_rate_drop_threshold` (default 0.5). `Validate` rejects sub-1s window and threshold outside `[0, 1]`.
- `collectInputs` grows two new typed projections (`projectPCIeAERRecord`, `projectPCIeIORecord`) ordered after `hbm_ecc` in the `pod_event > node_condition > nccl_fr > xid > hbm_ecc > pcie_aer > pcie_io` discriminator priority. PCIe AER is checked before PCIe IO because the kernel-line discriminator is a tighter regex match than the metric-derived bridge attribute.
- Wiring tests: emits-verdict, AER-alone (no fire), rate-collapse-alone (no fire), window-configurable, threshold-configurable, promoted-scalar attribute presence, `Validate` guard.
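The discriminator ordering for the two new projections can be illustrated with a small classifier. This is a simplified sketch: the `attrs` type, the `classify` function, and the string-valued attributes are assumptions (real OTLP attributes are typed), the five earlier discriminators are omitted because their gates are not shown here, and applying the resource fallback only on the AER path follows the commit message's wording.

```go
package main

import "fmt"

// attrs is a simplified attribute bag; real OTLP attributes are typed, and
// bytes_per_second would be numeric rather than a string.
type attrs map[string]string

const (
	keyGPUID       = "gpu.id"
	keyAERSeverity = "kernelevents.pcie_aer.severity"
	keyIOBytes     = "tracecore.alert.pcie_rate_collapse.bytes_per_second"
)

// classify returns which PCIe projection a log record would feed, or ""
// when neither gate matches. pcie_aer is tried before pcie_io.
func classify(logAttrs, resAttrs attrs) string {
	// The five earlier discriminators (pod_event through hbm_ecc) would be
	// tried first in the real priority chain.
	aerGPU := logAttrs[keyGPUID]
	if aerGPU == "" {
		aerGPU = resAttrs[keyGPUID] // resource fallback, AER path only
	}
	switch {
	case logAttrs[keyAERSeverity] != "" && aerGPU != "":
		return "pcie_aer"
	case logAttrs[keyIOBytes] != "" && logAttrs[keyGPUID] != "":
		return "pcie_io"
	default:
		return ""
	}
}

func main() {
	res := attrs{keyGPUID: "0000:3b:00.0"}
	fmt.Println(classify(attrs{keyAERSeverity: "Corrected"}, res))
	fmt.Println(classify(attrs{keyIOBytes: "4e9", keyGPUID: "0000:3b:00.0"}, nil))
	fmt.Println(classify(attrs{keyAERSeverity: "Corrected"}, nil) == "")
}
```

Requiring `gpu.id` in both gates means a record that cannot be joined on the BDF never reaches the detector, which is consistent with the configured-but-quiet behaviour described above.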
Test plan
- `cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -race -count=1`: green
- `make build`: green (OCB compile clean against the new processor wiring)
- `make check`: green (golangci-lint, go vet, go mod verify)
- `make doc-check`: green (comment-noise diff gate clean)