diff --git a/docs/README.md b/docs/README.md index e239b541..fefe079d 100644 --- a/docs/README.md +++ b/docs/README.md @@ -29,6 +29,7 @@ Legend: 👤 operator · 🛠️ contributor · 🏛️ maintainer · 🌐 exter | Path | Audience | Purpose | |---|---|---| | [rfcs/](rfcs/) | 🏛️ 🛠️ | Architecture decision records. See [rfcs/README.md](rfcs/README.md) for the status index. | +| [adrs/](adrs/) | 🛠️ 🏛️ | Narrowly-scoped architectural decisions that fit under an existing RFC. One file per decision; see [adrs/0001-metrics-to-logs-pattern-input.md](adrs/0001-metrics-to-logs-pattern-input.md) for the metrics-sourced pattern-input wiring. | | [patterns/](patterns/) | 🛠️ 👤 | Root-cause-pattern walkthroughs (NVLink degradation, HBM ECC, thermal throttle, PCIe AER). | | [proposals/](proposals/) | 🏛️ | Drafts pending upstream (semconv extensions, etc.). | | [research/](research/) | 🛠️ | Synthesized findings from reading external sources (OTel collector internals, benchmark baselines). | diff --git a/docs/adrs/0001-metrics-to-logs-pattern-input.md b/docs/adrs/0001-metrics-to-logs-pattern-input.md new file mode 100644 index 00000000..fe36274e --- /dev/null +++ b/docs/adrs/0001-metrics-to-logs-pattern-input.md @@ -0,0 +1,177 @@ +# ADR 0001 — Metrics-sourced pattern inputs feed `patterndetectorprocessor` via a sibling `WithMetrics` registration (Option A), NOT via an OTTL metrics-to-logs converter (Option B) + +- **Status:** accepted, blocks issue #260 PR-B +- **Date:** 2026-05-31 +- **Authors:** Tri Lam (@trilam) +- **Affects:** `module/processor/patterndetectorprocessor/`, `docs/integrations/prometheus-scrape.md`, patterns #1 (NVLink), #3 (HBM ECC), #4 (thermal throttle), #5 (PCIe AER) + +## Context + +`patterndetectorprocessor` (RFC-0013 §6.3, PR-I.2b) ships logs-only: +`processor.WithLogs(...)` in `factory.go`. The detectors landed so far — +`pod_evicted`, `nccl_hang`, `xid_correlation` — all consume log records +projected from K8s objects, NCCL FlightRecorder, and journald. 
+ +The next four NORTHSTAR detectors (patterns #1, #3, #4, #5) consume +**metric** signals scraped from `dcgm-exporter` via `prometheusreceiver` +(per RFC-0013 §2 adoption matrix). They need a way to feed metric +datapoints — `hw.gpu.nvlink.io`, `hw.errors`, `hw.gpu.throttle.duration`, +`hw.gpu.io` — into a pattern engine that today only reads `plog.Logs`. + +Issue [#260](https://github.com/tracecoreai/tracecore/issues/260) +frames the decision. Three options are on the table: + +- **Option A** — extend `patterndetectorprocessor` to register + `processor.WithMetrics` alongside the existing `WithLogs`. +- **Option B** — keep `patterndetectorprocessor` logs-only and pipe a + **metrics→logs converter** in front of it. Each pattern declares its + trigger as an OTTL statement that, when a metric crosses a + threshold or rate limit, emits a synthetic log record. The existing logs detector + consumes the synthetic record via the same attribute projection + contract it uses for kernel events. +- **Option C** — `routingconnector` splits the metrics signal off to a + sibling metrics-aware processor. + +## Decision + +**Option A.** We extend `patterndetectorprocessor` with a +`processor.WithMetrics` registration (separate factory entry, same +module, parallel `consumeMetrics` path). The metric-sourced detectors +read `pmetric.Metrics` directly and emit verdict log records onto a +sibling logs pipeline via a connector (`forward` or an in-tree +`patternverdict` connector — to be decided at PR-B time). + +The decision sequence is recorded here so the next PR (issue #260 +PR-B — the NVLink detector) doesn't relitigate the trade-off.
+ +## Architectural alternatives evaluated + +### Option B (preferred starting point per `[[adopt-over-build]]`) — **BLOCKED upstream** + +Option B was the recommended starting point because it adopts the +upstream `transformprocessor` already bundled in `builder-config.yaml`, +keeps `patterndetectorprocessor` shape-monomorphic (one signal in, +one signal out), and centralizes pattern triggers in declarative OTTL. + +**Why it's blocked.** OTel-contrib `transformprocessor` v0.130 (the +release pinned in `builder-config.yaml`) cannot emit log records from a +metrics pipeline. Verified against the upstream README: + +> Within each `` list, only certain OTTL Path prefixes can be used: +> +> | Signal | Path Prefix Values | +> |--------------------|------------------------------------------------| +> | trace_statements | `resource`, `scope`, `span`, and `spanevent` | +> | metric_statements | `resource`, `scope`, `metric`, and `datapoint` | +> | log_statements | `resource`, `scope`, and `log` | +> | profile_statements | `resource`, `scope`, and `profile` | +> +> This means, for example, that you cannot use the Path `span.attributes` +> within the `log_statements` configuration section. + +Source: [`processor/transformprocessor/README.md` § Config (v0.130.0)](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config). + +`metric_statements` cannot reference `log.*` paths. The processor's +pipeline binding (`processor.WithMetrics` vs `processor.WithLogs`) is +sealed per-signal — there is no "emit a log on the side" path in OTTL. + +We also surveyed the contrib **connector** family at v0.130 for a +metrics→logs primitive (a connector is the only OTel-contrib component +type allowed to change signal type): + +| Connector | Receiver pipeline types | Notes | +|---|---|---| +| `countconnector` | **metrics** only | Counts any-signal inputs into metrics; cannot emit logs. 
| +| `signaltometricsconnector` | metrics only | Same — output is metrics. | +| `routingconnector` | signal-preserving | logs→logs / metrics→metrics / traces→traces. | +| `failoverconnector`, `roundrobinconnector` | signal-preserving | Same. | +| `spanmetricsconnector`, `exceptionsconnector`, `servicegraphconnector` | traces → metrics | Wrong direction. | +| `datadogconnector`, `grafanacloudconnector` | vendor-specific | Out of scope. | +| `otlpjsonconnector` | signal-preserving | Same. | + +Source: [`open-telemetry/opentelemetry-collector-contrib/tree/v0.130.0/connector`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/v0.130.0/connector) (directory listing, 2026-05-31). + +**No upstream contrib component at v0.130 emits log records from a +metrics input.** A metric-rule-as-log-alert primitive would be a net-new +contrib component — an upstream contribution per RFC-0013 §5, not +recipe-shaped work tracecore can land in v0.3.0. + +### Option C — `routingconnector` + sibling processor + +Equivalent to Option A in operator surface (still need a metrics-aware +processor) but adds an extra component to the pipeline graph. Routing +preserves signal type — it cannot bridge metrics→logs either. Rejected: +strictly heavier than Option A with the same end state. + +### Option A (this ADR) + +Trade-offs accepted: + +- **Doubled processor surface.** `patterndetectorprocessor` now ships + both `ConsumeLogs` and `ConsumeMetrics` paths. Mitigation: the + metrics path is a thin projection layer (`pmetric.Metrics` → + `patterns.`) on top of the same `patterns/` library; + the verdict-emission shape stays log-based (verdicts go out on a + sibling logs pipeline via a connector, not as metrics). The two + consume paths share the verdict-emit code. +- **Per-signal join semantics divergence.** Logs are record-batched, + metrics are datapoint-batched within ResourceMetrics. 
The metrics + path will not pretend to be a log record at runtime; the projection + is explicit. The pattern library already separates per-input record + types (`patterns.Record`, `patterns.NodeRecord`, `patterns.XidRecord`, + `patterns.NCCLFRRecord`) — a new `patterns.GPUMetricRecord` family + follows the same shape. +- **`[[adopt-over-build]]` cost.** Option A is custom in-tree code we'd + rather not write. We accept it only because Option B is blocked + upstream. When OTel contrib ships a `metricthresholdconnector` (or + equivalent — emit a log when a metric rate/threshold crosses), the + recipe is re-evaluated and the metrics-path detectors can be + refactored onto the upstream primitive without changing the + customer-facing verdict-record contract. + +## Consequences + +1. **PR-A (this PR — issue #260)** ships only the DCGM → `hw.*` + namespace OTTL transform on the metric stream. The recipe stops + short of trying to emit log records from metrics, because that path + is unsupported at v0.130. The transform is independently valuable: + it normalizes raw `DCGM_FI_PROF_NVLINK_L*_TX_BYTES` (and the other + three families) into the customer-stable `hw.gpu.*` namespace + declared in `docs/proposals/semconv-hw-gpu-extensions.md` so PR-B's + detector can consume the customer-stable shape directly, regardless + of which architectural path PR-B implements. + +2. **PR-B (issue #260 follow-up, NOT in this PR)** extends + `patterndetectorprocessor` with `processor.WithMetrics`. Scope: + - `factory.go` registers `createMetrics` alongside `createLogs`. + - `patterndetector.go` adds `ConsumeMetrics(ctx, pmetric.Metrics)` + and a `collectMetricInputs` that mirrors `collectInputs` but + reads metric datapoints by `metric.name` + attribute gates. + - `pkg/patterns/nvlink_degradation.go` lands as the first + `WithMetrics`-driven detector. 
+ - Verdicts continue to land as log records on a sibling logs + pipeline via a connector wiring documented in the chart's + `renderedConfig` template. + +3. **Customer-facing recipe shape stays unchanged.** Operators do not + change their `prometheus-scrape.yaml` between PR-A and PR-B beyond + adding the metrics input to the `patterndetector` processor in + their pipeline config (one new line). The OTTL `hw.*` namespace + transform from PR-A is the load-bearing wire-format contract for + PR-B and any future GPU pattern. + +4. **Upstream contribution slot opened.** A + `metricthresholdconnector` (or `signaltologsconnector`) that emits + a log record when a metric datapoint matches a condition — the + inverse of `signaltometricsconnector` — is the missing primitive + that would let us collapse Option A back to Option B. Tracked as a + v0.3 follow-up under RFC-0013 §5 (upstream contribution policy). + +## References + +- Issue [#260](https://github.com/tracecoreai/tracecore/issues/260) — recipe extension + metrics-side detector plumbing. +- [`docs/rfcs/0013-distro-first-pivot.md`](../rfcs/0013-distro-first-pivot.md) §3 (customer-stable contracts), §5 (upstream contribution policy), §6 (the four in-house moat scopes). +- [`docs/patterns/pattern-1-nvlink-degradation.md`](../patterns/pattern-1-nvlink-degradation.md) — the first metrics-sourced pattern unblocked by this decision. +- [`docs/proposals/semconv-hw-gpu-extensions.md`](../proposals/semconv-hw-gpu-extensions.md) §3 — the `hw.gpu.nvlink.io` shape PR-A's OTTL transform emits. +- [transformprocessor README @ v0.130.0](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config) — the upstream signal-context table cited above. +- [opentelemetry-collector-contrib connector tree @ v0.130.0](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/v0.130.0/connector) — surveyed for any metrics→logs primitive. 
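The DCGM to `hw.*` projection that Consequence 1 assigns to PR-A can be modelled outside the collector to sanity-check the name mapping. The sketch below is a hypothetical Python model, not the OTTL recipe itself: the function name `project`, its return shape, and the use of `None` for "unit left unchanged" are assumptions made for illustration. The mapping rows are copied from the `transform/dcgm_to_hw_semconv` statements in this PR. Python's `(?P<name>...)` named-group syntax matches the Go regexp syntax that OTTL's `ExtractPatterns` relies on.

```python
import re

# Hypothetical model of the PR-A rename rules; the shipped recipe is OTTL.
NVLINK = re.compile(r"^DCGM_FI_PROF_NVLINK_L(?P<link>\d+)_(?P<dir>TX|RX)_BYTES$")
ECC = re.compile(r"^DCGM_FI_DEV_ECC_(?P<bits>SBE|DBE)_(?P<scope>VOL|AGG)_TOTAL$")
PCIE = re.compile(r"^DCGM_FI_PROF_PCIE_(?P<dir>TX|RX)_BYTES$")
DIRECTION = {"TX": "transmit", "RX": "receive"}
THROTTLE = {
    "DCGM_FI_DEV_THERMAL_VIOLATION": "thermal",
    "DCGM_FI_DEV_POWER_VIOLATION": "power",
    "DCGM_FI_DEV_SYNC_BOOST_VIOLATION": "sync_boost",
    "DCGM_FI_DEV_BOARD_LIMIT_VIOLATION": "hw_slowdown",
}


def project(name):
    """Return (metric_name, unit, added_datapoint_attrs) for a raw series name.

    Names the recipe does not recognize pass through untouched; the unit is
    reported as None to mean "left as scraped".
    """
    m = NVLINK.match(name)
    if m:
        return ("hw.gpu.nvlink.io", "By", {
            "hw.gpu.nvlink.link": int(m["link"]),  # integer-typed, as in the recipe
            "network.io.direction": DIRECTION[m["dir"]],
        })
    m = ECC.match(name)
    if m:
        double_bit = m["bits"] == "DBE"
        return ("hw.errors", "{error}", {
            "error.type": "uncorrected" if double_bit else "corrected",
            "error.subtype": "double_bit" if double_bit else "single_bit",
            "error.persistence": "volatile" if m["scope"] == "VOL" else "aggregate",
        })
    if name in THROTTLE:
        return ("hw.gpu.throttle.duration", "s",
                {"hw.gpu.throttle.reason": THROTTLE[name]})
    m = PCIE.match(name)
    if m:
        return ("hw.gpu.io", "By", {"network.io.direction": DIRECTION[m["dir"]]})
    return (name, None, {})
```

Calling `project` on all 36 NVLink names (links 0 to 17, both directions) yields a single metric name with 36 distinct attribute sets, which is the shape PR-B's detector is expected to group by. `DCGM_FI_DEV_LOW_UTIL_VIOLATION` falls through unchanged, mirroring the recipe's intentional exclusion.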
diff --git a/docs/integrations/examples/prometheus-scrape.yaml b/docs/integrations/examples/prometheus-scrape.yaml index 3845ac7f..2519bcfd 100644 --- a/docs/integrations/examples/prometheus-scrape.yaml +++ b/docs/integrations/examples/prometheus-scrape.yaml @@ -12,13 +12,18 @@ # targets like the Kueue control-plane) -> Prometheus-style scrape # at the configured interval -> OTTL transform normalizes # customer-stable resource attributes (`gpu.vendor`, `gpu.id`) -> -# otlphttpexporter to backend. The example below scrapes a -# dcgm-exporter DaemonSet at the conventional :9400/metrics -# endpoint. Replace the OTLP endpoint and dcgm-exporter target -# placeholders at deploy time; tracecore does not expand -# environment variables in YAML, so render the literals with a -# secret-injection tool (envsubst, External Secrets, -# sealed-secrets) before `helm install`. See +# OTTL transform projects raw DCGM `DCGM_FI_*` series into the +# customer-stable `hw.gpu.*` namespace (RFC-0013 §3 + +# docs/proposals/semconv-hw-gpu-extensions.md) so pattern detectors +# (#1 NVLink, #3 HBM ECC, #4 thermal throttle, #5 PCIe AER) read a +# stable wire format regardless of which vendor exporter is on +# the other side of the scrape -> otlphttpexporter to backend. +# The example below scrapes a dcgm-exporter DaemonSet at the +# conventional :9400/metrics endpoint. Replace the OTLP endpoint +# and dcgm-exporter target placeholders at deploy time; tracecore +# does not expand environment variables in YAML, so render the +# literals with a secret-injection tool (envsubst, External +# Secrets, sealed-secrets) before `helm install`. See # docs/integrations/prometheus-scrape.md. 
receivers: @@ -50,6 +55,120 @@ processors: - set(resource.attributes["gpu.vendor"], "amd") where IsMatch(metric.name, "^amdsmi_") - set(resource.attributes["gpu.vendor"], "intel") where IsMatch(metric.name, "^xpum_") - set(resource.attributes["gpu.vendor"], "habana") where IsMatch(metric.name, "^habanalabs_") + # Project the raw DCGM `DCGM_FI_*` namespace into the customer- + # stable `hw.gpu.*` namespace declared in + # docs/proposals/semconv-hw-gpu-extensions.md so downstream pattern + # detectors consume a vendor-neutral shape. Statements run in + # declared order; per-series attributes are stamped BEFORE the + # metric is renamed because OTTL `set(metric.name, ...)` is a + # whole-metric rename and the per-link / per-direction / per- + # reason attribute drives the datapoint's identity AFTER the + # rename. + transform/dcgm_to_hw_semconv: + metric_statements: + - context: datapoint + # Statements are single-quoted to prevent YAML from + # interpreting embedded `:` (inside regex character classes) + # as a map-key separator. Without the quotes the parser + # rejects the value as `type=string cannot be used as a + # Conf` at validate time. Mirrors the kmsg recipe's + # `transform/kmsg_xid` style. + statements: + # ---- Resource attribution: dcgm-exporter labels -> hw.* ---- + # dcgm-exporter v3+ exposes the GPU NVML UUID as either + # `UUID` (newer collectors) or `gpu_uuid` (legacy). Map + # either onto the customer-stable `hw.id` resource attr + # (RFC-0013 §3 + semconv `hw.id`). The `where` clauses + # are mutually exclusive so a series that carries both + # legacy and new labels resolves deterministically. + - 'set(resource.attributes["hw.id"], datapoint.attributes["UUID"]) where datapoint.attributes["UUID"] != nil' + - 'set(resource.attributes["hw.id"], datapoint.attributes["gpu_uuid"]) where datapoint.attributes["gpu_uuid"] != nil and resource.attributes["hw.id"] == nil' + # NVML index — same dual-label dance as `hw.id`. 
The index + # is volatile across reboots; `hw.id` is the durable join + # key. We carry both because dashboards key on whichever + # is most operator-friendly. + - 'set(resource.attributes["hw.gpu.index"], datapoint.attributes["gpu"]) where datapoint.attributes["gpu"] != nil' + - 'set(resource.attributes["hw.gpu.index"], datapoint.attributes["GPU"]) where datapoint.attributes["GPU"] != nil and resource.attributes["hw.gpu.index"] == nil' + # Constant resource attrs that complete the `hw.*` + # namespace contract (semconv hardware/common). `hw.type` + # gates downstream pattern detectors against + # non-GPU `hw.*` records (CPU, memory, NIC) that future + # recipes may share the pipeline with. + - 'set(resource.attributes["hw.type"], "gpu") where IsMatch(metric.name, "^DCGM_")' + + # ---- Pattern #1 (NVLink): per-link / per-direction Counter ---- + # Series shape (H100): DCGM_FI_PROF_NVLINK_L{0..17}_TX_BYTES + # and `..._RX_BYTES`. The link index is part of the metric + # NAME (not a label) because that is what dcgm-exporter + # emits today; ExtractPatterns lifts the index into a + # datapoint attribute so the downstream pattern engine can + # filter on `hw.gpu.nvlink.link` instead of regex'ing the + # metric name. Direction is encoded the same way — `_TX_` + # vs `_RX_` segment is lifted into the standard semconv + # `network.io.direction` attribute (semconv hardware § + # `hw.gpu.nvlink.io`). 
+ - 'set(datapoint.attributes["hw.gpu.nvlink.link"], Int(ExtractPatterns(metric.name, "^DCGM_FI_PROF_NVLINK_L(?P<link>\\d+)_(TX|RX)_BYTES$")["link"])) where IsMatch(metric.name, "^DCGM_FI_PROF_NVLINK_L\\d+_(TX|RX)_BYTES$")' + - 'set(datapoint.attributes["network.io.direction"], "transmit") where IsMatch(metric.name, "^DCGM_FI_PROF_NVLINK_L\\d+_TX_BYTES$")' + - 'set(datapoint.attributes["network.io.direction"], "receive") where IsMatch(metric.name, "^DCGM_FI_PROF_NVLINK_L\\d+_RX_BYTES$")' + - 'set(metric.unit, "By") where IsMatch(metric.name, "^DCGM_FI_PROF_NVLINK_L\\d+_(TX|RX)_BYTES$")' + - 'set(metric.name, "hw.gpu.nvlink.io") where IsMatch(metric.name, "^DCGM_FI_PROF_NVLINK_L\\d+_(TX|RX)_BYTES$")' + + # ---- Pattern #3 (HBM ECC): `hw.errors` Counter ---- + # DCGM exposes ECC counters as four series (SBE/DBE x + # volatile/aggregate). Each maps onto `hw.errors` with + # the canonical semconv `error.type` / `error.subtype` / + # `error.persistence` attributes (semconv hardware § + # `hw.errors`). The uncorrected/double_bit row is the + # one pattern #3 alerts on; the others are evidence + # context.
+ - 'set(datapoint.attributes["error.type"], "uncorrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_(VOL|AGG)_TOTAL$")' + - 'set(datapoint.attributes["error.subtype"], "double_bit") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_(VOL|AGG)_TOTAL$")' + - 'set(datapoint.attributes["error.type"], "corrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_SBE_(VOL|AGG)_TOTAL$")' + - 'set(datapoint.attributes["error.subtype"], "single_bit") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_SBE_(VOL|AGG)_TOTAL$")' + - 'set(datapoint.attributes["error.persistence"], "volatile") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_VOL_TOTAL$")' + - 'set(datapoint.attributes["error.persistence"], "aggregate") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_AGG_TOTAL$")' + - 'set(metric.unit, "{error}") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_(VOL|AGG)_TOTAL$")' + - 'set(metric.name, "hw.errors") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_(VOL|AGG)_TOTAL$")' + + # ---- Pattern #4 (thermal throttle): per-reason Counter ---- + # DCGM throttle-reasons surface as discrete `XID_ERRORS_*`- + # style series on modern dcgm-exporter (one counter per + # reason; see dcgm-exporter `default-counters.csv`). Map + # each onto `hw.gpu.throttle.duration` with the canonical + # `hw.gpu.throttle.reason` attribute (semconv proposal §2). + # The unit stays seconds (`s`) per the proposal. Pattern + # #4 alerts on the `thermal` reason; the others are + # diagnostic context. + # NOTE: `DCGM_FI_DEV_LOW_UTIL_VIOLATION` is intentionally + # excluded — the upstream semconv proposal's + # `hw.gpu.throttle.reason` enum has no value matching an + # "idle / low utilization" throttle. Mapping it to a value + # outside the proposal (e.g. `sw_slowdown`) would create + # forward-incompat drift once the SIG resolves the + # vocabulary. Operators who need it can add a row in their + # downstream chart values; revisit once the proposal lists + # an idle-throttle reason. 
+ - 'set(datapoint.attributes["hw.gpu.throttle.reason"], "thermal") where metric.name == "DCGM_FI_DEV_THERMAL_VIOLATION"' + - 'set(datapoint.attributes["hw.gpu.throttle.reason"], "power") where metric.name == "DCGM_FI_DEV_POWER_VIOLATION"' + - 'set(datapoint.attributes["hw.gpu.throttle.reason"], "sync_boost") where metric.name == "DCGM_FI_DEV_SYNC_BOOST_VIOLATION"' + - 'set(datapoint.attributes["hw.gpu.throttle.reason"], "hw_slowdown") where metric.name == "DCGM_FI_DEV_BOARD_LIMIT_VIOLATION"' + - 'set(metric.unit, "s") where IsMatch(metric.name, "^DCGM_FI_DEV_(THERMAL|POWER|SYNC_BOOST|BOARD_LIMIT)_VIOLATION$")' + - 'set(metric.name, "hw.gpu.throttle.duration") where IsMatch(metric.name, "^DCGM_FI_DEV_(THERMAL|POWER|SYNC_BOOST|BOARD_LIMIT)_VIOLATION$")' + + # ---- Pattern #5 (PCIe AER): per-direction Counter ---- + # DCGM's `DCGM_FI_PROF_PCIE_{TX,RX}_BYTES` map onto the + # generic semconv `hw.gpu.io` Counter (vendor-neutral + # PCIe / Infinity Fabric / Xe Link byte counter). PCI + # bus-device-function is lifted from the `pci_bus_id` / + # `device` label that dcgm-exporter stamps on each series; + # pattern #5's escalation matrix cross-references against + # dmesg AER lines on the same BDF. 
+ - 'set(datapoint.attributes["network.io.direction"], "transmit") where metric.name == "DCGM_FI_PROF_PCIE_TX_BYTES"' + - 'set(datapoint.attributes["network.io.direction"], "receive") where metric.name == "DCGM_FI_PROF_PCIE_RX_BYTES"' + - 'set(resource.attributes["hw.gpu.pci.bdf"], datapoint.attributes["pci_bus_id"]) where datapoint.attributes["pci_bus_id"] != nil and IsMatch(metric.name, "^DCGM_FI_PROF_PCIE_(TX|RX)_BYTES$")' + - 'set(metric.unit, "By") where IsMatch(metric.name, "^DCGM_FI_PROF_PCIE_(TX|RX)_BYTES$")' + - 'set(metric.name, "hw.gpu.io") where IsMatch(metric.name, "^DCGM_FI_PROF_PCIE_(TX|RX)_BYTES$")' + batch: send_batch_size: 8192 timeout: 10s @@ -64,5 +183,10 @@ service: pipelines: metrics/scrape: receivers: [prometheus] - processors: [transform/gpu_vendor, batch] + # `transform/gpu_vendor` stamps the cross-vendor tag FIRST so + # downstream alerts that filter on `gpu.vendor` survive even + # for DCGM series the next transform doesn't recognize. + # `transform/dcgm_to_hw_semconv` runs AFTER so its statements + # see the original `DCGM_FI_*` names and per-series labels. + processors: [transform/gpu_vendor, transform/dcgm_to_hw_semconv, batch] exporters: [otlphttp] diff --git a/docs/integrations/prometheus-scrape.md b/docs/integrations/prometheus-scrape.md index 9d8b7b40..c2459e83 100644 --- a/docs/integrations/prometheus-scrape.md +++ b/docs/integrations/prometheus-scrape.md @@ -1,5 +1,5 @@ - + # Prometheus scrape via `prometheusreceiver` @@ -11,9 +11,25 @@ NVIDIA `dcgm-exporter`, AMD `ROCm/device-metrics-exporter`, Intel `intel/xpumanager`, Habana Prometheus Metric Exporter — and for the Kueue scheduler's metrics endpoint. Replaces the in-tree `dcgm` and `kueue` receivers per RFC-0013 §7 (Deletion list — v0.1.0). -An OTTL `transform` processor stamps the customer-stable -`gpu.vendor` resource attribute (RFC-0013 §3) so dashboards survive -a future swap between vendor exporters. 
+ +Two OTTL `transform` processors run in series over the scraped +metrics: + +1. **`transform/gpu_vendor`** stamps the customer-stable + `gpu.vendor` resource attribute (RFC-0013 §3) so dashboards + survive a future swap between vendor exporters. +2. **`transform/dcgm_to_hw_semconv`** projects the raw + `DCGM_FI_*` namespace onto the customer-stable `hw.gpu.*` / + `hw.errors` namespace declared in + [docs/proposals/semconv-hw-gpu-extensions.md](../proposals/semconv-hw-gpu-extensions.md) + so the next-cycle pattern detectors (issue [#260](https://github.com/tracecoreai/tracecore/issues/260) + patterns #1 NVLink, #3 HBM ECC, #4 thermal throttle, #5 PCIe AER) + read one vendor-neutral wire format. Per + [docs/adrs/0001-metrics-to-logs-pattern-input.md](../adrs/0001-metrics-to-logs-pattern-input.md) + the verdict-emission half lands in a separate PR-B that extends + `patterndetectorprocessor` with `processor.WithMetrics`; the + transform below is the load-bearing wire-format contract that + PR-B consumes. ## Config @@ -127,6 +143,155 @@ The tag survives the [RFC-0013 §3](../rfcs/0013-distro-first-pivot.md#3-custome contract; existing dashboards keyed on `gpu.vendor` continue to work after a vendor swap. +## `DCGM_FI_*` → `hw.gpu.*` semconv projection + +The second OTTL transform (`transform/dcgm_to_hw_semconv` in the +example YAML) projects every load-bearing `DCGM_FI_*` series into +the customer-stable namespace from +[docs/proposals/semconv-hw-gpu-extensions.md](../proposals/semconv-hw-gpu-extensions.md). +The contract is one-direction: a downstream consumer reads only +`hw.gpu.*` / `hw.errors`, never the raw DCGM names. 
Per +[ADR-0001](../adrs/0001-metrics-to-logs-pattern-input.md) the +pattern detectors built on top of this namespace (issue [#260](https://github.com/tracecoreai/tracecore/issues/260) +PR-B) land as a `processor.WithMetrics` extension to +`patterndetectorprocessor` — not as an OTTL metrics-to-logs +emitter — because OTel-contrib `transformprocessor` v0.130 cannot +emit log records from a metrics pipeline. + +### Resource attribution + +dcgm-exporter stamps two cross-version label flavors per series: +`UUID` / `gpu_uuid` for the NVML UUID, and `gpu` / `GPU` for the +NVML index. The transform maps either onto the customer-stable +resource attribute: + +| dcgm-exporter label | Resource attribute | Notes | +|---|---|---| +| `UUID` or `gpu_uuid` | `hw.id` | NVML UUID; durable join key. The transform prefers `UUID` and falls back to `gpu_uuid` when only the legacy label is present. | +| `gpu` or `GPU` | `hw.gpu.index` | NVML index; volatile across reboots. Same dual-label preference. | +| (computed) | `hw.type` = `"gpu"` | Stamped on every `DCGM_*` series; gates downstream `hw.*` filters against future non-GPU `hw.*` sources. | +| `pci_bus_id` | `hw.gpu.pci.bdf` | PCI bus-device-function; lifted only on `DCGM_FI_PROF_PCIE_{TX,RX}_BYTES` series so pattern #5's escalation can cross-reference dmesg AER lines on the same BDF. | + +### Pattern #1 — NVLink degradation + +Per-link Tx/Rx Counter. 
Each `DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES` +series collapses into one metric name with the link index lifted +into a datapoint attribute via OTTL `ExtractPatterns`: + +| Raw DCGM series | OTel metric | Datapoint attributes (added) | +|---|---|---| +| `DCGM_FI_PROF_NVLINK_L{N}_TX_BYTES` (N ∈ 0..17) | `hw.gpu.nvlink.io` (Counter, unit `By`) | `hw.gpu.nvlink.link={N}`, `network.io.direction=transmit` | +| `DCGM_FI_PROF_NVLINK_L{N}_RX_BYTES` (N ∈ 0..17) | `hw.gpu.nvlink.io` (Counter, unit `By`) | `hw.gpu.nvlink.link={N}`, `network.io.direction=receive` | + +The link index lift uses `Int(ExtractPatterns(metric.name, +"^DCGM_FI_PROF_NVLINK_L(?P<link>\\d+)_(TX|RX)_BYTES$")["link"])` +so the resulting attribute is integer-typed (matches the semconv +proposal's `hw.gpu.nvlink.link: int`). Per-link decomposition is +the diagnostic-critical surface for +[pattern #1 silent NVLink degradation](../patterns/pattern-1-nvlink-degradation.md); +without it the alert query has no group-by axis. + +> **dcgm-exporter opt-in required.** The +> `DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES` field IDs (1040..1075) are +> **commented out** in dcgm-exporter's upstream `default-counters.csv`. +> Operators must mount a custom counters ConfigMap and pass it via the +> chart's `-m :` flag (or set +> `arguments[1]=-f=/etc/dcgm-exporter/custom-counters.csv` and a +> matching `extraVolumes` entry). Without this, the recipe compiles +> cleanly but emits zero `hw.gpu.nvlink.io` series, so pattern #1 will +> never fire. + +### Pattern #3 — Uncorrectable HBM ECC + +ECC counters expand into four series (correctable / uncorrectable +× volatile / aggregate). Pattern #3 alerts on the uncorrectable +volatile row; the rest are evidence context that the runbook +references.
+ +| Raw DCGM series | OTel metric | Datapoint attributes (added) | +|---|---|---| +| `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | `hw.errors` (Counter, unit `{error}`) | `error.type=uncorrected`, `error.subtype=double_bit`, `error.persistence=volatile` | +| `DCGM_FI_DEV_ECC_DBE_AGG_TOTAL` | `hw.errors` | `error.type=uncorrected`, `error.subtype=double_bit`, `error.persistence=aggregate` | +| `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` | `hw.errors` | `error.type=corrected`, `error.subtype=single_bit`, `error.persistence=volatile` | +| `DCGM_FI_DEV_ECC_SBE_AGG_TOTAL` | `hw.errors` | `error.type=corrected`, `error.subtype=single_bit`, `error.persistence=aggregate` | + +The attribute names match the semconv `hw.errors` shape (see +[hw common](https://opentelemetry.io/docs/specs/semconv/hardware/common/)). +[Pattern #3 doc](../patterns/pattern-3-hbm-ecc.md) consumes the +`error.persistence=volatile` row in its alert query. + +### Pattern #4 — Thermal throttle cascade + +Modern dcgm-exporter emits per-reason throttle counters as +discrete `DCGM_FI_DEV_*_VIOLATION` series. Each maps onto +`hw.gpu.throttle.duration` with a `hw.gpu.throttle.reason` +attribute (semconv proposal §2). + +| Raw DCGM series | OTel metric | Datapoint attributes (added) | +|---|---|---| +| `DCGM_FI_DEV_THERMAL_VIOLATION` | `hw.gpu.throttle.duration` (Counter, unit `s`) | `hw.gpu.throttle.reason=thermal` | +| `DCGM_FI_DEV_POWER_VIOLATION` | `hw.gpu.throttle.duration` | `hw.gpu.throttle.reason=power` | +| `DCGM_FI_DEV_SYNC_BOOST_VIOLATION` | `hw.gpu.throttle.duration` | `hw.gpu.throttle.reason=sync_boost` | +| `DCGM_FI_DEV_BOARD_LIMIT_VIOLATION` | `hw.gpu.throttle.duration` | `hw.gpu.throttle.reason=hw_slowdown` | + +`DCGM_FI_DEV_LOW_UTIL_VIOLATION` is intentionally **not** mapped: +the upstream semconv proposal's `hw.gpu.throttle.reason` enum +(`thermal`, `power`, `sync_boost`, `hw_slowdown`, `sw_thermal`, +`display_clock`, `app_clock_setting`) has no value for an +"idle / low-utilization" throttle. 
Mapping it to a value outside +the proposal would create forward-incompat drift once the SIG +resolves the vocabulary. Tracked at +[#272](https://github.com/TraceCoreAI/tracecore/issues/272) for the +upstream proposal extension. + +[Pattern #4 doc](../patterns/pattern-4-thermal-throttle.md) alerts +on the `reason=thermal` row; the other reasons are diagnostic +context (`power` correlates with PSU sag, `hw_slowdown` is the +"GPU has decided to clock itself down" hard signal). + +### Pattern #5 — PCIe AER cascade + +DCGM exposes per-direction PCIe byte counters whose **rate** +collapses when the link renegotiates to a lower generation / +width. Pattern #5 watches the rate divergence across the host's +GPU set. + +| Raw DCGM series | OTel metric | Datapoint attributes (added) | +|---|---|---| +| `DCGM_FI_PROF_PCIE_TX_BYTES` | `hw.gpu.io` (Counter, unit `By`) | `network.io.direction=transmit` | +| `DCGM_FI_PROF_PCIE_RX_BYTES` | `hw.gpu.io` (Counter, unit `By`) | `network.io.direction=receive` | + +The `pci_bus_id` label is lifted to the resource-level +`hw.gpu.pci.bdf` so pattern #5's escalation matrix can cross- +reference dmesg `PCIe Bus Error: Corrected` lines against the +same BDF without joining series. [Pattern #5 doc](../patterns/pattern-5-pcie-aer.md) +shows the divergence query in PromQL form. + +### `[[adopt-over-build]]` posture + +Every statement in `transform/dcgm_to_hw_semconv` uses upstream +OTTL functions only: `set`, `IsMatch`, `ExtractPatterns`, `Int`. +No new OTTL functions are introduced. If a future series cannot +be projected with the existing function set, the right response +is to propose the missing function upstream to OTel contrib — +not to ship a tracecore-specific OTTL extension. + +### Identity-conflict caveat + +OTTL `set(metric.name, ...)` renames a metric in place; the v0.130 +README warns that "Transformation of metrics have the potential to +affect the identity of a metric leading to an Identity Crisis." 
+For this recipe the conflict is intentional: 36 input series +(18 NVLink links × 2 directions) collapse into one output metric +named `hw.gpu.nvlink.io` with distinct attribute sets per +datapoint. Downstream OTel + Prometheus backends merge by +`(metric.name, attributes)` and produce the expected per-link / +per-direction series. If your backend rejects the merged shape, +follow the upstream guidance to apply the rename inside a +separate statement group from any other identity-affecting +operation; this recipe already isolates the rename inside its +own processor (`transform/dcgm_to_hw_semconv`). + ## Placeholders | Placeholder | What to fill in |