You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Pattern #5 (PCIe AER + hw.gpu.io rate-collapse) — the second NORTHSTAR
detector now wired into patterndetectorprocessor — needs a per-GPU
rate-collapse alert log record on its second input layer. The detector
library ships in this PR, but Layer 2 will find zero records on a real
cluster until this recipe lands. The detector is configured-but-quiet
on the metrics-derived layer until then.
( rate(hw_gpu_io_bytes_total[5m]) <=
(1 - 0.5) * quantile(0.5, rate(hw_gpu_io_bytes_total[5m])) by (k8s_node_name) )
on (gpu_id)
What's missing
OTTL transform — hw.gpu.io Counter → rate-collapse alert log. No
recipe today emits the bridge log shape tracecore.alert.pcie_rate_collapse.* from raw hw.gpu.io Counter
samples. The wiring projection
(projectPCIeIORecord) gates on tracecore.alert.pcie_rate_collapse.bytes_per_second + gpu.id — both attributes need to be stamped on the log record by
an OTTL recipe operating on the metrics→logs path.
Namespacing (tracecore.alert.pcie_rate_collapse.*) keeps the bridge
shape distinguishable downstream from raw hw.gpu.io scrape samples.
Why now
The pattern-5 detector + processor wiring + tests ship as the v0.3 moat
per ADR-0001's intentional split (library first, OTTL recipes second).
Integration end-to-end firing on a real cluster is gated on this issue.
Blocker
Pattern #5 (PCIe AER + hw.gpu.io rate-collapse) — the second NORTHSTAR
detector now wired into
patterndetectorprocessor— needs a per-GPUrate-collapse alert log record on its second input layer. The detector
library ships in this PR, but Layer 2 will find zero records on a real
cluster until this recipe lands. The detector is configured-but-quiet
on the metrics-derived layer until then.
Source-of-truth:
docs/patterns/pattern-5-pcie-aer.md.That doc specifies Layer 2 as a PromQL-style rate-collapse alert:
What's missing
OTTL transform —
hw.gpu.ioCounter → rate-collapse alert log. Norecipe today emits the bridge log shape
tracecore.alert.pcie_rate_collapse.*from rawhw.gpu.ioCountersamples. The wiring projection
(
projectPCIeIORecord) gates ontracecore.alert.pcie_rate_collapse.bytes_per_second+gpu.id— both attributes need to be stamped on the log record byan OTTL recipe operating on the metrics→logs path.
Metrics→logs path itself. Same blocker class as Recipe extension: emit hw.gpu.nvlink.* + wire metrics path for pattern-1 detector #260 — the
patterndetectorprocessoris logs-only today; the rate-collapsealert needs a rule-engine OTTL stanza (analogous to the
PromQL-alert-derived log emitter named in ADR-0001) sitting upstream.
Bridge attribute contract (recipe spec)
The recipe MUST emit one log record per offending GPU per evaluation
window, with these attributes:
tracecore.alert.pcie_rate_collapse.bytes_per_secondrate(hw.gpu.io[5m])tracecore.alert.pcie_rate_collapse.baseline_bytes_per_secondquantile(0.5, rate(...)) by (k8s.node.name)tracecore.alert.pcie_rate_collapse.directiontransmit/receivenetwork.io.directiononhw.gpu.iogpu.idhw.gpu.pci.bdfresource attr onhw.gpu.iok8s.node.nameNamespacing (
tracecore.alert.pcie_rate_collapse.*) keeps the bridgeshape distinguishable downstream from raw
hw.gpu.ioscrape samples.Why now
The pattern-5 detector + processor wiring + tests ship as the v0.3 moat
per ADR-0001's intentional split (library first, OTTL recipes second).
Integration end-to-end firing on a real cluster is gated on this issue.
Sibling tracking issues: #260 (pattern-1 NVLink), #273 (pattern-3 HBM
ECC
hw.errors), #282 (pattern-4 thermal throttle).