OTTL recipe: derive tracecore.alert.pcie_rate_collapse.* from hw.gpu.io counter (pattern #5 wiring)

## Blocker

Pattern #5 (PCIe AER + hw.gpu.io rate-collapse) — the second NORTHSTAR
detector now wired into `patterndetectorprocessor` — needs a per-GPU
rate-collapse alert log record on its second input layer. The detector
library ships in this PR, but Layer 2 will find zero records on a real
cluster until this recipe lands. The detector is configured-but-quiet
on the metrics-derived layer until then.

Source-of-truth: [`docs/patterns/pattern-5-pcie-aer.md`](../docs/patterns/pattern-5-pcie-aer.md).
That doc specifies Layer 2 as a PromQL-style rate-collapse alert:

```promql
( rate(hw_gpu_io_bytes_total[5m]) <=
  (1 - 0.5) * quantile(0.5, rate(hw_gpu_io_bytes_total[5m])) by (k8s_node_name) )
  on (gpu_id)
```

## What's missing

1. **OTTL transform — `hw.gpu.io` Counter → rate-collapse alert log.** No
   recipe today emits the bridge log shape
   `tracecore.alert.pcie_rate_collapse.*` from raw `hw.gpu.io` Counter
   samples. The wiring projection
   (`projectPCIeIORecord`) gates on
   `tracecore.alert.pcie_rate_collapse.bytes_per_second` +
   `gpu.id` — both attributes need to be stamped on the log record by
   an OTTL recipe operating on the metrics→logs path.

2. **Metrics→logs path itself.** Same blocker class as #260 — the
   `patterndetectorprocessor` is logs-only today; the rate-collapse
   alert needs a rule-engine OTTL stanza (analogous to the
   PromQL-alert-derived log emitter named in ADR-0001) sitting upstream.

## Bridge attribute contract (recipe spec)

The recipe MUST emit one log record per offending GPU per evaluation
window, with these attributes:

| Attribute | Type | Value | Source |
|---|---|---|---|
| `tracecore.alert.pcie_rate_collapse.bytes_per_second` | Double | current per-GPU rate | `rate(hw.gpu.io[5m])` |
| `tracecore.alert.pcie_rate_collapse.baseline_bytes_per_second` | Double | host-median or per-GPU pre-degradation | `quantile(0.5, rate(...)) by (k8s.node.name)` |
| `tracecore.alert.pcie_rate_collapse.direction` | String | `transmit` / `receive` | `network.io.direction` on `hw.gpu.io` |
| `gpu.id` | String | PCI BDF | `hw.gpu.pci.bdf` resource attr on `hw.gpu.io` |
| `k8s.node.name` | String | node | resource attr (stamped by k8sattributes) |

Namespacing (`tracecore.alert.pcie_rate_collapse.*`) keeps the bridge
shape distinguishable downstream from raw `hw.gpu.io` scrape samples.

## Why now

The pattern-5 detector + processor wiring + tests ship as the v0.3 moat
per ADR-0001's intentional split (library first, OTTL recipes second).
Integration end-to-end firing on a real cluster is gated on this issue.

Sibling tracking issues: #260 (pattern-1 NVLink), #273 (pattern-3 HBM
ECC `hw.errors`), #282 (pattern-4 thermal throttle).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

OTTL recipe: derive tracecore.alert.pcie_rate_collapse.* from hw.gpu.io counter (pattern #5 wiring) #284

Blocker

What's missing

Bridge attribute contract (recipe spec)

Why now

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Attribute	Type	Value	Source
`tracecore.alert.pcie_rate_collapse.bytes_per_second`	Double	current per-GPU rate	`rate(hw.gpu.io[5m])`
`tracecore.alert.pcie_rate_collapse.baseline_bytes_per_second`	Double	host-median or per-GPU pre-degradation	`quantile(0.5, rate(...)) by (k8s.node.name)`
`tracecore.alert.pcie_rate_collapse.direction`	String	`transmit` / `receive`	`network.io.direction` on `hw.gpu.io`
`gpu.id`	String	PCI BDF	`hw.gpu.pci.bdf` resource attr on `hw.gpu.io`
`k8s.node.name`	String	node	resource attr (stamped by k8sattributes)

Uh oh!

OTTL recipe: derive tracecore.alert.pcie_rate_collapse.* from hw.gpu.io counter (pattern #5 wiring) #284

Description

Blocker

What's missing

Bridge attribute contract (recipe spec)

Why now

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions