Skip to content

OTTL recipe: derive tracecore.alert.pcie_rate_collapse.* from hw.gpu.io counter (pattern #5 wiring) #284

Description

@trilamsr

Blocker

Pattern #5 (PCIe AER + hw.gpu.io rate-collapse) — the second NORTHSTAR
detector now wired into patterndetectorprocessor — needs a per-GPU
rate-collapse alert log record on its second input layer. The detector
library ships in this PR, but Layer 2 will find zero records on a real
cluster until this recipe lands. The detector is configured-but-quiet
on the metrics-derived layer until then.

Source-of-truth: docs/patterns/pattern-5-pcie-aer.md.
That doc specifies Layer 2 as a PromQL-style rate-collapse alert:

( rate(hw_gpu_io_bytes_total[5m]) <=
  (1 - 0.5) * quantile(0.5, rate(hw_gpu_io_bytes_total[5m])) by (k8s_node_name) )
  on (gpu_id)

What's missing

  1. OTTL transform — hw.gpu.io Counter → rate-collapse alert log. No
    recipe today emits the bridge log shape
    tracecore.alert.pcie_rate_collapse.* from raw hw.gpu.io Counter
    samples. The wiring projection
    (projectPCIeIORecord) gates on
    tracecore.alert.pcie_rate_collapse.bytes_per_second +
    gpu.id — both attributes need to be stamped on the log record by
    an OTTL recipe operating on the metrics→logs path.

  2. Metrics→logs path itself. Same blocker class as Recipe extension: emit hw.gpu.nvlink.* + wire metrics path for pattern-1 detector #260 — the
    patterndetectorprocessor is logs-only today; the rate-collapse
    alert needs a rule-engine OTTL stanza (analogous to the
    PromQL-alert-derived log emitter named in ADR-0001) sitting upstream.

Bridge attribute contract (recipe spec)

The recipe MUST emit one log record per offending GPU per evaluation
window, with these attributes:

Attribute Type Value Source
tracecore.alert.pcie_rate_collapse.bytes_per_second Double current per-GPU rate rate(hw.gpu.io[5m])
tracecore.alert.pcie_rate_collapse.baseline_bytes_per_second Double host-median or per-GPU pre-degradation quantile(0.5, rate(...)) by (k8s.node.name)
tracecore.alert.pcie_rate_collapse.direction String transmit / receive network.io.direction on hw.gpu.io
gpu.id String PCI BDF hw.gpu.pci.bdf resource attr on hw.gpu.io
k8s.node.name String node resource attr (stamped by k8sattributes)

Namespacing (tracecore.alert.pcie_rate_collapse.*) keeps the bridge
shape distinguishable downstream from raw hw.gpu.io scrape samples.

Why now

The pattern-5 detector + processor wiring + tests ship as the v0.3 moat
per ADR-0001's intentional split (library first, OTTL recipes second).
Integration end-to-end firing on a real cluster is gated on this issue.

Sibling tracking issues: #260 (pattern-1 NVLink), #273 (pattern-3 HBM
ECC hw.errors), #282 (pattern-4 thermal throttle).

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions