
feat(patterns): pcie_aer detector + processor wiring (NORTHSTAR pattern #5)#286

Merged
trilamsr merged 3 commits into main from feat/pcie-aer-detector on Jun 1, 2026

Conversation

@trilamsr trilamsr commented Jun 1, 2026

Contributor

Summary

Ships NORTHSTAR pattern #5 (PCIe AER + hw.gpu.io rate-collapse) per
docs/patterns/pattern-5-pcie-aer.md.
Two layers required for a verdict (no Confidence taxonomy — mirroring
xid_correlation + hbm_ecc):

  • Layer 1 (kernel AER): PCIe Bus Error: severity=..., type=... line
    on a specific gpu.id (PCI BDF) extracted by the journald-kernel
    OTTL recipe.
  • Layer 2 (rate-collapse): per-GPU hw.gpu.io BytesPerSecond <=
    (1 - RateDropThreshold) * BaselineBytesPerSecond on the same BDF
    within the correlation window. Default threshold 0.5 (50% drop) and
    window 5min. Causality is forward — the AER must precede the
    rate-collapse sample.
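
The Layer 2 predicate can be sketched in a few lines of Go. This is a minimal illustration rather than the detector's code: `dropRatio` and `isRateCollapse` are hypothetical helper names, the clamping to [0, 1] is an assumption taken from the schema's `drop_ratio` bounds, and the 32 GB/s baseline is made up.

```go
package main

import "fmt"

// dropRatio returns the fractional drop of the observed rate relative to the
// baseline, clamped to [0, 1]. The bool is false when the baseline is not
// positive, because no ratio can be computed. Hypothetical helper.
func dropRatio(bytesPerSecond, baseline float64) (float64, bool) {
	if baseline <= 0 {
		return 0, false
	}
	r := 1 - bytesPerSecond/baseline
	if r < 0 {
		r = 0 // rate rose above the baseline: no drop
	}
	if r > 1 {
		r = 1
	}
	return r, true
}

// isRateCollapse applies the Layer 2 predicate from the description:
// BytesPerSecond <= (1 - RateDropThreshold) * BaselineBytesPerSecond.
func isRateCollapse(bytesPerSecond, baseline, threshold float64) bool {
	if baseline <= 0 {
		return false
	}
	return bytesPerSecond <= (1-threshold)*baseline
}

func main() {
	const baseline = 32e9 // assumed baseline in bytes per second
	for _, bps := range []float64{30e9, 16e9, 8e9} {
		r, _ := dropRatio(bps, baseline)
		fmt.Printf("bps=%.0f drop=%.2f collapse=%v\n",
			bps, r, isRateCollapse(bps, baseline, 0.5))
	}
}
```

Because the comparison is `<=`, a sample at exactly half the baseline already counts as a collapse under the default threshold of 0.5.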

Join key: gpu.id (PCI BDF) per RFC-0013 §3, shared across the dmesg
AER preamble and the dcgm-exporter hw.gpu.pci.bdf resource attribute
on hw.gpu.io. Node is carried for the verdict prose only, not the
join key — the BDF is the proximate hardware identifier.
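
A rough sketch of the two-layer join, keyed on the BDF only, might look like the following. The `aerEvent`, `ioSample`, `match`, and `correlate` names are hypothetical stand-ins for the PR's record and detector types, whose real fields are not shown here; treating a sample at the same instant as the AER as "forward" is also an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// aerEvent and ioSample are hypothetical stand-ins for PCIeAERRecord and
// PCIeIORecord. Only the fields the join needs are modelled.
type aerEvent struct {
	GPUID string // PCI BDF, the join key
	At    time.Time
}

type ioSample struct {
	GPUID          string // PCI BDF, the join key
	BytesPerSecond float64
	Baseline       float64
	At             time.Time
}

type match struct {
	GPUID     string
	DropRatio float64
}

// correlate pairs each AER event with the first qualifying rate-collapse
// sample (in input order) on the same BDF inside the window. Samples that
// precede the AER are skipped, which encodes the forward-causality rule.
func correlate(aers []aerEvent, ios []ioSample, window time.Duration, threshold float64) []match {
	var out []match
	for _, a := range aers {
		for _, s := range ios {
			if s.GPUID != a.GPUID || s.Baseline <= 0 {
				continue
			}
			dt := s.At.Sub(a.At)
			if dt < 0 || dt > window {
				continue // before the AER, or too late to attribute to it
			}
			if s.BytesPerSecond <= (1-threshold)*s.Baseline {
				out = append(out, match{GPUID: a.GPUID, DropRatio: 1 - s.BytesPerSecond/s.Baseline})
				break // one match per AER event is enough for this sketch
			}
		}
	}
	return out
}

// demo builds a fixture: one GPU with an AER followed by a collapse, one GPU
// with a collapse but no AER, and one GPU whose collapse precedes its AER.
func demo() []match {
	t0 := time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)
	aers := []aerEvent{
		{GPUID: "0000:3b:00.0", At: t0},
		{GPUID: "0000:af:00.0", At: t0.Add(2 * time.Minute)},
	}
	ios := []ioSample{
		{GPUID: "0000:3b:00.0", BytesPerSecond: 8e9, Baseline: 32e9, At: t0.Add(90 * time.Second)},
		{GPUID: "0000:5e:00.0", BytesPerSecond: 1e9, Baseline: 32e9, At: t0.Add(time.Minute)},
		{GPUID: "0000:af:00.0", BytesPerSecond: 4e9, Baseline: 32e9, At: t0.Add(time.Minute)},
	}
	return correlate(aers, ios, 5*time.Minute, 0.5)
}

func main() {
	for _, m := range demo() {
		fmt.Printf("verdict gpu.id=%s drop_ratio=%.2f\n", m.GPUID, m.DropRatio)
	}
}
```

Only the first GPU in the fixture satisfies both layers; the other two illustrate the rate-collapse-alone and wrong-order cases that should stay silent.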

The verdict log record promotes operator-facing scalars onto top-level
OTLP attributes (gpu.id, kernelevents.pcie_aer.severity,
kernelevents.pcie_aer.type, k8s.node.name,
tracecore.alert.pcie_rate_collapse.drop_ratio) per the issue #270
scalar-promotion contract, via the putStrIfSet helper so an empty
upstream stamp doesn't silently match empty-filter dashboard queries.
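
The idea behind putStrIfSet can be shown with a plain map standing in for the OTLP attribute map. The signature below is an assumption for illustration; the real helper's signature is not part of this description.

```go
package main

import "fmt"

// putStrIfSet writes key=value only when value is non-empty. A plain map is
// used here as a stand-in for the OTLP attribute map (assumption).
func putStrIfSet(attrs map[string]string, key, value string) {
	if value == "" {
		return
	}
	attrs[key] = value
}

func main() {
	attrs := map[string]string{}
	putStrIfSet(attrs, "gpu.id", "0000:3b:00.0")
	putStrIfSet(attrs, "k8s.node.name", "") // upstream never stamped the node
	_, hasNode := attrs["k8s.node.name"]
	fmt.Println(len(attrs), hasNode)
}
```

Leaving the key absent, rather than present with an empty string, keeps a dashboard filter such as `k8s.node.name = ""` from matching verdicts whose node was simply never stamped.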

Integration gap (filed separately, not blocking this PR)

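Per ADR-0001's intentional split (library + processor wiring first,
OTTL recipes second), end-to-end firing on a real cluster needs the
two OTTL recipes tracked in issues #284 + #285:

  • A journald-kernel OTTL stanza that extracts kernelevents.pcie_aer.*
    from PCIe Bus Error: … lines.
  • The metrics→logs PCIe rate-collapse alert recipe that derives the
    tracecore.alert.pcie_rate_collapse-shaped log record from hw.gpu.io.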

Until both land, the detector is the v0.3 moat (pattern logic + tests)
and the processor wiring is configured-but-quiet (zero verdicts on
real input — projections find nothing). The detector library and
wiring ship independently per ADR-0001.

What's new

  • module/pkg/patterns/pcie_aer.go: PCIeAERDetector library type,
    PCIeAERRecord (AER kernel message), PCIeIORecord (hw.gpu.io rate
    sample with baseline), PCIeAERVerdict output. PatternIDPCIeAER = "5";
    EvidenceKindAER = "pcie_aer";
    EvidenceKindPCIeIOCollapse = "hw_io_collapse".
  • module/pkg/patterns/testdata/pcie_aer_verdict.schema.json — JSON
    schema with 10 drift falsifiers (extra field, confidence re-add,
    pattern.id numeric, pattern.id wrong value, severity outside enum,
    gpu_id empty, drop_ratio negative, drop_ratio over 1, evidence kind
    outside enum, evidence_trail under min).
  • Config: pcie_aer_window (default 5min), pcie_aer_rate_drop_threshold
    (default 0.5). Validate rejects sub-1s window and threshold outside
    [0, 1].
  • collectInputs grows two new typed projections (projectPCIeAERRecord,
    projectPCIeIORecord) ordered after hbm_ecc in the
    pod_event > node_condition > nccl_fr > xid > hbm_ecc > pcie_aer >
    pcie_io discriminator priority. PCIe AER is checked before PCIe IO
    because the kernel-line discriminator is a tighter regex match than
    the metric-derived bridge attribute.
  • Wiring tests: emits-verdict, AER-alone (no fire), rate-collapse-alone
    (no fire), window-configurable, threshold-configurable,
    promoted-scalar attribute presence, Validate guard.
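
The Validate guard described above could look roughly like this. The field names follow the commit message; the `mapstructure` tags, the `defaultConfig` helper, and the error wording are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Config mirrors the two settings named in this PR. Struct tags and error
// text are assumptions, not the processor's actual definitions.
type Config struct {
	PCIeAERWindow            time.Duration `mapstructure:"pcie_aer_window"`
	PCIeAERRateDropThreshold float64       `mapstructure:"pcie_aer_rate_drop_threshold"`
}

// defaultConfig returns the documented defaults (5min window, 0.5 threshold).
func defaultConfig() Config {
	return Config{PCIeAERWindow: 5 * time.Minute, PCIeAERRateDropThreshold: 0.5}
}

// Validate rejects a sub-1s window and a threshold outside [0, 1].
func (c Config) Validate() error {
	if c.PCIeAERWindow < time.Second {
		return errors.New("pcie_aer_window must be at least 1s")
	}
	// Written as a negated range check so that NaN is rejected as well.
	if t := c.PCIeAERRateDropThreshold; !(t >= 0 && t <= 1) {
		return errors.New("pcie_aer_rate_drop_threshold must be within [0, 1]")
	}
	return nil
}

func main() {
	good := defaultConfig()
	bad := defaultConfig()
	bad.PCIeAERRateDropThreshold = 1.5
	fmt.Println(good.Validate(), bad.Validate())
}
```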

Test plan

  • cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -race -count=1 — green
  • make build — green (OCB compile clean against the new processor wiring)
  • make check — green (golangci-lint, go vet, go mod verify)
  • make doc-check — green (comment-noise diff gate clean)
  • PCIeAERDetector library tests cover both-layers-emit, AER-alone (no fire), rate-collapse-alone (no fire), window-respected (collapse outside window suppressed), threshold-respected (drop under threshold suppressed), schema-conformance, schema-drift falsifier battery
  • Wiring tests confirm the projections gate on the right discriminators (severity + gpu.id; bridge attribute + gpu.id) and the verdict log record carries the promoted-scalar attribute taxonomy

patterndetectorprocessor wires the PCIeAERDetector library and ships NORTHSTAR pattern #5 (PCIe AER + hw.gpu.io rate-collapse, join key gpu.id PCI BDF, default 5m correlation window, default 0.5 rate-drop threshold). Verdict log records carry pattern.* attrs plus promoted scalars (gpu.id, kernelevents.pcie_aer.severity, kernelevents.pcie_aer.type, k8s.node.name, tracecore.alert.pcie_rate_collapse.drop_ratio) for dashboard table-aggregation without parsing pattern.verdict_json. End-to-end firing on a real cluster requires the OTTL recipes tracked in issues #284 + #285; detector + wiring ship independently per ADR-0001.

@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 04:55
Tri Lam added 3 commits May 31, 2026 21:58
Ships NORTHSTAR pattern #5 per docs/patterns/pattern-5-pcie-aer.md.
Two layers required for a verdict; a single layer alone emits
nothing.

- Layer 1 (kernel AER): `PCIe Bus Error: severity=…, type=…` line
  on a specific `gpu.id` (PCI BDF) extracted by the journald-kernel
  OTTL recipe.
- Layer 2 (rate-collapse): per-GPU `hw.gpu.io` `BytesPerSecond` <=
  `(1 - RateDropThreshold) * BaselineBytesPerSecond` on the same
  BDF within the correlation window. Default threshold 0.5 (50%
  drop) and window 5min. Causality flows forward — the AER must
  precede the rate-collapse sample.
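
For orientation, the fields Layer 1 depends on can be pulled out of a kernel line with a regular expression. The Go sketch below is illustrative only: the actual extraction happens in the journald-kernel OTTL recipe, and both the pattern and the sample line are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// aerLine captures the PCI BDF that prefixes the message plus the
// severity/type pair. The pattern is an assumption, not the recipe's.
var aerLine = regexp.MustCompile(
	`([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]): PCIe Bus Error: severity=([^,]+), type=([^,]+)`)

// aerFields holds the three values the detector joins and reports on.
type aerFields struct {
	GPUID, Severity, Type string
}

// parseAER returns the extracted fields, or false when the line is not an
// AER report.
func parseAER(line string) (aerFields, bool) {
	m := aerLine.FindStringSubmatch(line)
	if m == nil {
		return aerFields{}, false
	}
	return aerFields{GPUID: m[1], Severity: m[2], Type: m[3]}, true
}

func main() {
	// Made-up sample line for illustration.
	line := "pcieport 0000:3b:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)"
	f, ok := parseAER(line)
	fmt.Printf("%v %+v\n", ok, f)
}
```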

Why two layers required (no Confidence taxonomy, mirroring
xid_correlation + hbm_ecc):
- AER alone: corrected errors recovered — the link re-trained.
  Operators see raw AER telemetry on journald already.
- Rate-collapse alone: workload-natural Tx/Rx variance (a rank
  finished a comm phase, etc.). The AER is the hardware-fault
  confirmation that makes the join operator-actionable.

Join key is `gpu.id` (PCI BDF) per RFC-0013 §3, shared across the
dmesg AER preamble and the dcgm-exporter `hw.gpu.pci.bdf` resource
attribute on `hw.gpu.io`. Node is carried for the verdict prose
but not used in the join key — the BDF is the proximate hardware
identifier.

Defaults:
- CorrelationWindow: 5min — mirrors the spec's `rate(...[5m])`
  PromQL and typical post-AER link-renegotiation latency.
- RateDropThreshold: 0.5. One PCIe downshift step (a generation,
  Gen5 → Gen4, or half the lanes, x16 → x8) halves theoretical link
  bandwidth and reaches this floor; Gen5 → Gen3 lands well past it.

What's new:
- module/pkg/patterns/pcie_aer.go — PCIeAERDetector library type,
  PCIeAERRecord (AER kernel message), PCIeIORecord (hw.gpu.io
  rate sample with baseline), PCIeAERVerdict output.
  PatternIDPCIeAER = "5"; EvidenceKindAER = "pcie_aer";
  EvidenceKindPCIeIOCollapse = "hw_io_collapse".
- module/pkg/patterns/testdata/pcie_aer_verdict.schema.json — JSON
  schema with 10 drift falsifiers (extra field, confidence re-add,
  pattern.id numeric, pattern.id wrong value, severity outside
  enum, gpu_id empty, drop_ratio negative, drop_ratio over 1,
  evidence kind outside enum, evidence_trail under min).

Verdict struct carries promoted scalar fields (Severity, AERType,
DropRatio, GPUID, Node) so the processor wiring can stamp them as
top-level OTLP log attributes for dashboard table-aggregation per
PR #275's lesson. No dead fields — every struct field is read by
either the Evaluate path or the verdict-shape contract.

End-to-end firing in a real deployment is blocked on
PR-B (issue #260): no OTTL recipe today derives the per-GPU
`tracecore.alert.pcie_rate_collapse`-shaped log record from
`hw.gpu.io`. The detector library is the v0.3 moat and ships
independently per ADR-0001.

Signed-off-by: Tri Lam <tri@maydow.com>

Wires the PCIeAERDetector library type into patterndetectorprocessor:

- Config: PCIeAERWindow (yaml: pcie_aer_window, default 5min),
  PCIeAERRateDropThreshold (yaml: pcie_aer_rate_drop_threshold,
  default 0.5). Validate rejects sub-1s window and threshold
  outside [0, 1].
- collectInputs grows two new typed projections behind tighter
  discriminators than the existing five:
  - projectPCIeAERRecord — gate `kernelevents.pcie_aer.severity`
    AND `gpu.id`. Reads severity/type/gpu.id off log attrs, falls
    back to resource gpu.id when the journald-kernel OTTL stamps
    it on the resource.
  - projectPCIeIORecord — gate
    `tracecore.alert.pcie_rate_collapse.bytes_per_second` AND
    `gpu.id`. Reads BytesPerSecond + Baseline + Direction off log
    attrs; the bridge attribute name is namespaced
    `tracecore.alert.pcie_rate_collapse.*` so downstream knows it's
    a tracecore-derived alert (vs a raw hw.gpu.io scrape sample).
- appendPCIeAERVerdict promotes (GPUID, Severity, AERType,
  DropRatio, Node) onto the verdict log record as top-level OTLP
  attributes per PR #275's lesson, so dashboards can table-
  aggregate without parsing pattern.verdict_json.
  Names track OTel semconv (`gpu.id`, `k8s.node.name`) and
  recipe-canonical keys (`kernelevents.pcie_aer.severity/.type`).
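
The two projection gates can be sketched with plain string maps in place of OTLP attribute maps. Everything here is a hypothetical illustration: the `attrs` type, `classify`, and the string-typed rate value are stand-ins, and the resource fallback is applied only to the AER path because that is the only place the commit message mentions it.

```go
package main

import "fmt"

// attrs is a stand-in for an OTLP attribute map (assumption).
type attrs map[string]string

const (
	keyGPUID    = "gpu.id"
	keySeverity = "kernelevents.pcie_aer.severity"
	keyRateBPS  = "tracecore.alert.pcie_rate_collapse.bytes_per_second"
)

// gpuID prefers the log-record attribute and falls back to the resource.
func gpuID(logAttrs, resAttrs attrs) string {
	if v := logAttrs[keyGPUID]; v != "" {
		return v
	}
	return resAttrs[keyGPUID]
}

// classify reports which projection, if any, a record would feed. The AER
// gate is tested before the IO gate, matching the stated priority order.
func classify(logAttrs, resAttrs attrs) string {
	if logAttrs[keySeverity] != "" && gpuID(logAttrs, resAttrs) != "" {
		return "pcie_aer"
	}
	if _, ok := logAttrs[keyRateBPS]; ok && logAttrs[keyGPUID] != "" {
		return "pcie_io"
	}
	return "" // neither gate matched: the record is ignored by both projections
}

func main() {
	aer := attrs{keySeverity: "Corrected"}
	res := attrs{keyGPUID: "0000:3b:00.0"}
	io := attrs{keyRateBPS: "8e9", keyGPUID: "0000:3b:00.0"}
	fmt.Println(classify(aer, res), classify(io, nil), classify(aer, nil) == "")
}
```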

Wiring tests cover: emits-verdict, AER-alone (no fire),
rate-collapse-alone (no fire), window-configurable, threshold-
configurable, promoted-scalar attribute presence, and the new
Validate guard.

Integration gap (filed separately, not blocking this PR):

1. The journald-kernel OTTL recipe extracts kernelevents.xid +
   gpu.id from `NVRM: Xid` lines but does NOT extract
   kernelevents.pcie_aer.* from `PCIe Bus Error: …` lines —
   needs a sibling OTTL stanza in
   docs/integrations/journald-kernel.md.
2. The metrics→logs PCIe rate-collapse alert OTTL recipe is the
   PR-B side of issue #260 (the blocker, per ADR-0001, for all
   metrics-sourced patterns).

Until both land, the detector is the v0.3 moat (pattern logic +
tests) and the wiring is configured-but-quiet (zero verdicts on
real input — projections find nothing).

Signed-off-by: Tri Lam <tri@maydow.com>

doc-check fail mode: `//.*\bPR\s*#\d+` regex blocks bare PR refs in source comments per STYLE.md (defaults to no comments; PR refs rot in long-lived files). Rephrased to issue-N where the trail still matters; otherwise stripped.

Signed-off-by: Tri Lam <tri@maydow.com>
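
To see what the quoted doc-check pattern does and does not catch, it can be tried against a few sample lines with Go's `regexp` package. The `offendingLines` helper and the sample inputs are hypothetical; how `make doc-check` actually applies the pattern is not shown in this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// barePRRef is the pattern quoted in the commit message above.
var barePRRef = regexp.MustCompile(`//.*\bPR\s*#\d+`)

// offendingLines returns the 1-based numbers of lines whose line comment
// cites a PR number.
func offendingLines(lines []string) []int {
	var hits []int
	for i, l := range lines {
		if barePRRef.MatchString(l) {
			hits = append(hits, i+1)
		}
	}
	return hits
}

func main() {
	src := []string{
		`// promoted per PR #275's lesson`,
		`// promoted per the issue-270 contract`,
		`x := "PR #1" // a string literal, then an unrelated comment`,
	}
	fmt.Println(offendingLines(src))
}
```

Only the first sample line is flagged: the second has no PR reference, and in the third the `PR #1` text sits before the `//`, so the pattern never reaches it.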
@trilamsr trilamsr force-pushed the feat/pcie-aer-detector branch from 94b844a to 46b8eb3 Compare June 1, 2026 05:07
@trilamsr trilamsr merged commit ebf8f54 into main Jun 1, 2026
15 checks passed
@trilamsr trilamsr deleted the feat/pcie-aer-detector branch June 1, 2026 05:15