TraceCoreAI · trilamsr · Jun 1, 2026
diff --git a/docs/README.md b/docs/README.md
@@ -58,6 +58,7 @@ Source (receiver-side) recipes — RFC-0013 §migration PR-J replacements for th
 | [integrations/journald-kernel.md](integrations/journald-kernel.md) | 👤 | Kernel + systemd events via `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL transform preserving `kernelevents.xid` / `gpu.id`. Replaces `kernelevents`. |
 | [integrations/k8sobjects-events.md](integrations/k8sobjects-events.md) | 👤 | Kubernetes events via `k8sobjectsreceiver` + OTTL transform preserving the eleven-entry `k8s.event.hint` enum. Replaces `k8sevents`. |
 | [integrations/prometheus-scrape.md](integrations/prometheus-scrape.md) | 👤 | Generic Prometheus scrape via `prometheusreceiver` (dcgm-exporter, AMD/Intel/Habana exporters, Kueue) + OTTL `gpu.vendor` normalization. Replaces `dcgm` and `kueue`. |
+| [integrations/dcgm-exporter.md](integrations/dcgm-exporter.md) | 👤 | NVIDIA `dcgm-exporter` scraped via `prometheusreceiver` + OTTL rename of `DCGM_FI_*` to the `hw.gpu.*` namespace consumed by pattern detectors #1 / #3 / #4 / #5. NVIDIA-specific extension of `prometheus-scrape.md`. |
 
 ## Per-component docs
 

diff --git a/docs/integrations/dcgm-exporter.md b/docs/integrations/dcgm-exporter.md
@@ -0,0 +1,297 @@
+<!-- tested-against: tracecore -->
+<!-- last-verified: 2026-05-31 -->
+
+# NVIDIA dcgm-exporter scraped via `prometheusreceiver` + OTTL rename to `hw.gpu.*`
+
+Tracecore adopts NVIDIA's upstream
+[`dcgm-exporter`](https://github.com/NVIDIA/dcgm-exporter)
+(Apache-2.0) as the GPU-metrics source on NVIDIA fleets per
+[RFC-0013 §2 (Adoption matrix)](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix).
+This recipe is the install-companion to the pattern-detector docs
+([#1 NVLink degradation](../patterns/pattern-1-nvlink-degradation.md),
+[#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md),
+[#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md),
+[#5 PCIe AER](../patterns/pattern-5-pcie-aer.md))
+— each pattern asserts its input is "dcgm-exporter scraped via
+prometheusreceiver" and consumes the `hw.gpu.*` namespace; this recipe
+wires those two halves together.
+
+We do not fork or vendor dcgm-exporter. We install it from the
+upstream Helm chart, scrape it with the in-tree `prometheusreceiver`,
+and OTTL-rename the `DCGM_FI_*` metric names to the OTel
+hardware-semconv `hw.gpu.*` namespace plus the small set of tracecore
+extensions the pattern detectors depend on.
+
+For the generic Prometheus-scrape shape (Kueue, AMD, Intel, Habana
+exporters), see
+[`docs/integrations/prometheus-scrape.md`](prometheus-scrape.md).
+This recipe is the NVIDIA-specific extension that adds the
+metric-name rename layer.
+
+## Config
+
+```yaml
+# docs/integrations/examples/dcgm-exporter.yaml
+receivers:
+  prometheus/dcgm:
+    config:
+      scrape_configs:
+        - job_name: dcgm-exporter
+          scrape_interval: 15s
+          scrape_timeout: 10s
+          metrics_path: /metrics
+          fallback_scrape_protocol: PrometheusText1.0.0
+          static_configs:
+            - targets:
+                - REPLACE_WITH_DCGM_EXPORTER_TARGET
+
+processors:
+  transform/dcgm_to_hw:
+    metric_statements:
+      - context: resource
+        statements:
+          - set(attributes["gpu.vendor"], "nvidia")
+          - set(attributes["hw.type"], "gpu")
+      - context: datapoint
+        statements:
+          - set(attributes["hw.id"], attributes["UUID"]) where attributes["UUID"] != nil
+          - delete_key(attributes, "UUID") where attributes["hw.id"] != nil
+          - set(attributes["error.type"], "corrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_SBE_")
+          - set(attributes["error.type"], "uncorrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_")
+          - set(attributes["error.subtype"], "single_bit") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_SBE_")
+          - set(attributes["error.subtype"], "double_bit") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_")
+          - set(attributes["error.persistence"], "volatile") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_VOL_")
+          - set(attributes["error.persistence"], "aggregate") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_AGG_")
+          - set(attributes["hw.gpu.throttle.reason"], "thermal") where metric.name == "DCGM_FI_DEV_THERMAL_VIOLATION"
+          - set(attributes["hw.gpu.throttle.reason"], "power") where metric.name == "DCGM_FI_DEV_POWER_VIOLATION"
+          - set(attributes["hw.gpu.throttle.reason"], "sync_boost") where metric.name == "DCGM_FI_DEV_SYNC_BOOST_VIOLATION"
+          - set(attributes["network.io.direction"], "transmit") where metric.name == "DCGM_FI_PROF_PCIE_TX_BYTES"
+          - set(attributes["network.io.direction"], "receive") where metric.name == "DCGM_FI_PROF_PCIE_RX_BYTES"
+      - context: metric
+        statements:
+          - set(name, "hw.errors") where IsMatch(name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_(VOL|AGG)_TOTAL$")
+          - set(name, "hw.temperature") where name == "DCGM_FI_DEV_GPU_TEMP"
+          - set(unit, "Cel") where name == "hw.temperature"
+          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_THERMAL_VIOLATION"
+          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_POWER_VIOLATION"
+          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_SYNC_BOOST_VIOLATION"
+          - set(unit, "s") where name == "hw.gpu.throttle.duration"
+          - set(name, "hw.gpu.io") where name == "DCGM_FI_PROF_PCIE_TX_BYTES"
+          - set(name, "hw.gpu.io") where name == "DCGM_FI_PROF_PCIE_RX_BYTES"
+          - set(unit, "By") where name == "hw.gpu.io"
+  batch:
+    send_batch_size: 8192
+    timeout: 10s
+
+exporters:
+  otlphttp:
+    endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT
+    compression: gzip
+    timeout: 10s
+
+service:
+  pipelines:
+    metrics/dcgm:
+      receivers: [prometheus/dcgm]
+      processors: [transform/dcgm_to_hw, batch]
+      exporters: [otlphttp]
+```
+
+The ordering inside `metric_statements` is load-bearing: attribute
+writes that key off the **original** DCGM metric name run in the
+`datapoint` context **before** the `metric` context renames `name`
+to `hw.*`. Reversing the order leaves the rewritten attributes empty
+because the `IsMatch` test runs against the already-renamed
+`hw.errors` / `hw.gpu.io` strings.
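+
+For contrast, a minimal sketch of the ordering that breaks, trimmed to
+one ECC metric. It is illustrative only and is not part of the recipe:
+
+```yaml
+# Anti-example (do not deploy): the metric context runs first.
+processors:
+  transform/dcgm_to_hw:
+    metric_statements:
+      - context: metric
+        statements:
+          # The name becomes "hw.errors" here...
+          - set(name, "hw.errors") where name == "DCGM_FI_DEV_ECC_DBE_VOL_TOTAL"
+      - context: datapoint
+        statements:
+          # ...so this pattern no longer matches and error.type is never set.
+          - set(attributes["error.type"], "uncorrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_")
+```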
+
+Validate with the in-tree binary:
+
+```sh
+./_build/tracecore validate --config=docs/integrations/examples/dcgm-exporter.yaml
+```
+
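+Exit 0 means the config parses, the scrape-target URL is well-formed,
+and every OTTL statement type-checks against the metric / datapoint /
+resource contexts. Substitute the `REPLACE_*` placeholders first: per
+the [Failure modes](#failure-modes) table, the validator rejects an
+unrendered `REPLACE_WITH_DCGM_EXPORTER_TARGET`.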
+
+## Install dcgm-exporter
+
+dcgm-exporter ships as an Apache-2.0 Helm chart. No Pro / Enterprise
+features are required.
+
+```sh
+helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
+helm repo update
+helm upgrade --install dcgm-exporter gpu-helm-charts/dcgm-exporter \
+  --namespace gpu-operator --create-namespace \
+  --values dcgm-exporter-values.yaml
+```
+
+Minimal `dcgm-exporter-values.yaml` (the chart defaults cover
+everything else; we override only the keys we depend on):
+
+```yaml
+# Pin to GPU nodes only; the chart's default tolerations cover the
+# nvidia.com/gpu taint that GPU Operator applies. Adjust the
+# nodeSelector to match whatever label your node-feature-discovery /
+# GPU operator stamps.
+nodeSelector:
+  nvidia.com/gpu.present: "true"
+
+# We scrape dcgm-exporter directly from the co-located tracecore
+# DaemonSet at localhost:9400. The Service stays available for
+# operator-side `kubectl port-forward` debugging but no ServiceMonitor
+# is needed - tracecore is the scraper, not Prometheus Operator.
+serviceMonitor:
+  enabled: false
+
+service:
+  enable: true
+  type: ClusterIP
+  port: 9400
+
+# Use the default counter set. To trim cardinality, point at a
+# custom ConfigMap via `-m <namespace>:<configmap>` per the upstream
+# chart README; the recipe's OTTL transform only renames metrics
+# present in the scrape, so trimming dcgm-exporter is safe.
+arguments: ["-f", "/etc/dcgm-exporter/default-counters.csv"]
+```
+
+[Upstream chart values reference](https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/values.yaml).
+
+Confirm the pod is up and serving Prometheus-text on the conventional
+:9400 endpoint:
+
+```sh
+kubectl -n gpu-operator port-forward ds/dcgm-exporter 9400:9400 &
+curl -sS http://127.0.0.1:9400/metrics | head -20
+# # HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
+# # TYPE DCGM_FI_DEV_SM_CLOCK gauge
+# ...
+```
+
+The first line **must** start with `# HELP DCGM_` - that's what
+`prometheusreceiver` requires (see the
+[prometheus-scrape recipe Failure modes table](prometheus-scrape.md#failure-modes)).
+
+## Deployment shape
+
+Run tracecore as a **DaemonSet** on the same node selector as
+dcgm-exporter (`nvidia.com/gpu.present: "true"`). Each tracecore pod
+scrapes its node-local dcgm-exporter pod at `localhost:9400`. No
+cluster-wide service discovery, no cross-node scrape traffic. This is
+the same shape as the
+[generic prometheus-scrape recipe](prometheus-scrape.md#deployment-shape).
+
+Set `REPLACE_WITH_DCGM_EXPORTER_TARGET` to `localhost:9400` in the
+DaemonSet shape, or to the dcgm-exporter Service DNS
+(`dcgm-exporter.gpu-operator.svc:9400`) if you'd rather centralize
+scraping in a single tracecore Deployment.
+
+The dcgm-exporter chart's default tolerations cover the
+`nvidia.com/gpu: NoSchedule` taint NVIDIA GPU Operator applies. If
+your cluster uses a different taint, mirror it on both the
+dcgm-exporter chart values and the tracecore DaemonSet so the two
+pods co-schedule.
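+
+A minimal sketch of the scheduling fields on the tracecore DaemonSet,
+mirroring the selector and taint named above. The `app: tracecore`
+label and `REPLACE_WITH_TRACECORE_IMAGE` are hypothetical placeholders,
+and config mounting is omitted; treat it as a starting shape rather
+than a tested manifest:
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: tracecore
+  namespace: tracecore
+spec:
+  selector:
+    matchLabels:
+      app: tracecore                    # hypothetical label
+  template:
+    metadata:
+      labels:
+        app: tracecore
+    spec:
+      # Same selector as dcgm-exporter so the two pods co-schedule.
+      nodeSelector:
+        nvidia.com/gpu.present: "true"
+      # Mirror whatever taint your GPU nodes carry.
+      tolerations:
+        - key: nvidia.com/gpu
+          operator: Exists
+          effect: NoSchedule
+      containers:
+        - name: tracecore
+          image: REPLACE_WITH_TRACECORE_IMAGE   # hypothetical placeholder
+          # localhost:9400 reaches dcgm-exporter only if both pods share
+          # a network namespace; this sketch does not configure that.
+```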
+
+## Verify
+
+End-to-end check after `helm install` + `tracecore` rollout:
+
+```sh
+# 1. Confirm dcgm-exporter is shipping the canonical metric names
+kubectl -n gpu-operator port-forward ds/dcgm-exporter 9400:9400 &
+curl -sS http://127.0.0.1:9400/metrics \
+  | grep -E '^DCGM_FI_(DEV_GPU_TEMP|PROF_PCIE_TX_BYTES|DEV_ECC_DBE_VOL_TOTAL) '
+# DCGM_FI_DEV_GPU_TEMP{gpu="0", UUID="GPU-...", ...} 42
+# DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0", UUID="GPU-...", ...} 0
+# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="0", UUID="GPU-...", ...} 0
+```
+
+```sh
+# 2. Confirm tracecore is ingesting the scrape. Query tracecore's own
+#    self-telemetry endpoint and look for the receiver-accepted
+#    counter for the prometheus/dcgm receiver. The pattern tolerates
+#    the `_total` suffix and extra labels that some collector
+#    versions emit.
+kubectl -n tracecore port-forward ds/tracecore 8888:8888 &
+curl -sS http://127.0.0.1:8888/metrics \
+  | grep -E '^otelcol_receiver_accepted_metric_points(_total)?\{.*receiver="prometheus/dcgm"'
+```
+
+```sh
+# 3. End-to-end validate the rendered config (with REPLACE_* values
+#    substituted) before promoting to production.
+./_build/tracecore validate --config=/path/to/rendered.yaml
+```
+
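+If `otelcol_receiver_accepted_metric_points{receiver="prometheus/dcgm"}`
+is non-zero but the backend isn't seeing `hw.errors` / `hw.temperature`,
+the rename never ran: check that `transform/dcgm_to_hw` is listed in
+the pipeline's `processors` and that the `metric`-context statements
+are intact. If the renamed series arrive without their `error.*`
+attributes, the OTTL ordering drifted - see the load-bearing-ordering
+note above.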
+
+## DCGM_FI_* -> hw.* mapping
+
+The table below lists the renames the recipe applies and the pattern detector that consumes
+each. Canonical OTel hardware-semconv names ([spec](https://opentelemetry.io/docs/specs/semconv/hardware/))
+are marked **(semconv)**; tracecore extensions are marked **(ext)**
+and are tracked for upstream contribution alongside the cross-vendor
+`gpu.vendor` resource attribute (RFC-0013 §3 + §5).
+
+| DCGM metric | Tracecore name | Attributes added | Pattern consumer |
+|---|---|---|---|
+| `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=corrected`, `error.subtype=single_bit` **(ext)**, `error.persistence=volatile` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) precursor signal |
+| `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=uncorrected`, `error.subtype=double_bit` **(ext)**, `error.persistence=volatile` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) primary signal |
+| `DCGM_FI_DEV_ECC_SBE_AGG_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=corrected`, `error.subtype=single_bit` **(ext)**, `error.persistence=aggregate` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) cross-reboot persistence |
+| `DCGM_FI_DEV_ECC_DBE_AGG_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=uncorrected`, `error.subtype=double_bit` **(ext)**, `error.persistence=aggregate` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) cross-reboot persistence |
+| `DCGM_FI_DEV_GPU_TEMP` | `hw.temperature` **(semconv)** | `hw.type=gpu` | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) supporting context |
+| `DCGM_FI_DEV_THERMAL_VIOLATION` | `hw.gpu.throttle.duration` **(ext)** | `hw.gpu.throttle.reason=thermal` **(ext)** | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) primary signal |
+| `DCGM_FI_DEV_POWER_VIOLATION` | `hw.gpu.throttle.duration` **(ext)** | `hw.gpu.throttle.reason=power` **(ext)** | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) sibling reason |
+| `DCGM_FI_DEV_SYNC_BOOST_VIOLATION` | `hw.gpu.throttle.duration` **(ext)** | `hw.gpu.throttle.reason=sync_boost` **(ext)** | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) sibling reason |
+| `DCGM_FI_PROF_PCIE_TX_BYTES` | `hw.gpu.io` **(semconv)** | `network.io.direction=transmit` | [#5 PCIe AER](../patterns/pattern-5-pcie-aer.md) primary signal |
+| `DCGM_FI_PROF_PCIE_RX_BYTES` | `hw.gpu.io` **(semconv)** | `network.io.direction=receive` | [#5 PCIe AER](../patterns/pattern-5-pcie-aer.md) primary signal |
+| `DCGM_FI_DEV_NVLINK_BANDWIDTH_L{0..N}` | `hw.gpu.nvlink.io` **(ext)** | `hw.gpu.nvlink.link={0..N}` **(ext)**, `network.io.direction` | [#1 NVLink degradation](../patterns/pattern-1-nvlink-degradation.md) primary signal |
+
+**Pattern-detector wiring note for #1 NVLink:** the per-link
+`DCGM_FI_DEV_NVLINK_BANDWIDTH_L*` family is commented out by default
+in dcgm-exporter's `default-counters.csv`. Enable the per-link
+counters by supplying a custom counter ConfigMap that uncomments those
+rows and passing `-m <namespace>:<configmap>` to dcgm-exporter through
+the chart's `arguments` array. Without the per-link counters the #1
+detector falls back to the aggregate
+`DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL` and loses the per-link divergence
+signal. Note that the example config in [Config](#config) carries no
+rename statement for the NVLink family, so these series pass through
+under their `DCGM_FI_*` names unless you add one.
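+
+A sketch of such a ConfigMap and the matching chart values, assuming
+dcgm-exporter reads the counter CSV from a `metrics` key and that the
+ConfigMap replaces, rather than extends, the default list. The name
+`dcgm-nvlink-counters`, the help strings, and the sample rows are
+illustrative; carry over every default row you still need:
+
+```yaml
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: dcgm-nvlink-counters        # hypothetical name
+  namespace: gpu-operator
+data:
+  # Assumed key. Rows follow the default-counters.csv shape:
+  # <DCGM field>, <Prometheus type>, <help text>
+  metrics: |
+    DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
+    DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, NVLink bandwidth counter for link 0.
+    DCGM_FI_DEV_NVLINK_BANDWIDTH_L1, counter, NVLink bandwidth counter for link 1.
+---
+# Separate file: dcgm-exporter-values.yaml fragment (Helm values, not a manifest).
+arguments: ["-m", "gpu-operator:dcgm-nvlink-counters"]
+```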
+
+Resource attributes stamped on every resource by the transform:
+`gpu.vendor=nvidia`, `hw.type=gpu`. The per-datapoint `hw.id` is
+promoted from the dcgm-exporter `UUID` label (the GPU UUID, unique
+within the host), and the original `UUID` label is then dropped.
+`hw.gpu.pci.bdf`, referenced by pattern #5, is **not** populated by
+this recipe - `dcgm-exporter` does not currently export the PCI BDF as
+a label; the pattern doc's PCI BDF cross-reference relies on the
+journald AER receiver in the
+[journald-kernel recipe](journald-kernel.md) and joins on `hw.id`.
+
+## Placeholders
+
+| Placeholder | What to fill in |
+|---|---|
+| `REPLACE_WITH_OTLP_HTTP_ENDPOINT` | The OTLP/HTTP base URL of your sink. `/v1/metrics` is appended automatically per the OTLP/HTTP spec. |
+| `REPLACE_WITH_DCGM_EXPORTER_TARGET` | `localhost:9400` for a tracecore DaemonSet shape, or `dcgm-exporter.gpu-operator.svc:9400` for a centralized Deployment shape. The `:port` suffix is mandatory - `prometheusreceiver` rejects bare hostnames at validate. |
+
+Tracecore does not expand environment variables in YAML. Render the
+literals at deploy time via `envsubst`, a Helm template, or a
+Kubernetes secret-injection driver.
+
+## Failure modes
+
+| Symptom | First check |
+|---|---|
+| `hw.errors` series flow but `error.subtype` attribute empty | OTTL statement-ordering drift: the `datapoint`-context rewrites must run BEFORE the `metric`-context `set(name, "hw.errors")`. Restore the order in the example. |
+| `dcgm-exporter` pod stuck `CrashLoopBackOff` on a Hopper / H100 node | The chart default image often lags new GPU silicon. Bump the image tag to the matching `nvcr.io/nvidia/k8s/dcgm-exporter:<datacenter-gpu-manager-version>-<dcgm-exporter-version>` build that lists Hopper support. |
+| `prometheus/dcgm: address ... incorrect` at validate | `REPLACE_WITH_DCGM_EXPORTER_TARGET` was not rendered. The literal placeholder parses as a bare hostname with no `:port`, which the validator rejects. Render at deploy time. |
+| `DCGM_FI_DEV_NVLINK_BANDWIDTH_L0` absent from `/metrics` | The default `default-counters.csv` ships the per-link family commented out. Mount a custom counter ConfigMap and pass `-m <ns>:<cm>` via `arguments`. See the upstream README's "Changing metrics" section. |
+| `hw.id` empty | dcgm-exporter dropped the `UUID` label (it's emitted by default but a custom counter ConfigMap that overrides the label-set may exclude it). Confirm via `curl localhost:9400/metrics` that every series carries `UUID="GPU-..."`. |
+| Throttle metric stuck at the same value | DCGM exposes throttle durations as cumulative-counter nanoseconds. Use `rate()` / `increase()` against `hw_gpu_throttle_duration_seconds_total` in PromQL - not the raw value. The unit-rename to `s` is cosmetic; DCGM continues to emit nanoseconds and the OTTL transform does not divide. |
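+
+If a backend needs the value to agree with the `s` unit, one possible
+variant divides in the `datapoint` context before the rename. This is
+a sketch, not part of the recipe: it assumes the scraped points arrive
+as doubles and that the source unit is nanoseconds as stated in the
+table, and any detector threshold tuned against raw values would need
+the same rescaling:
+
+```yaml
+processors:
+  transform/dcgm_to_hw:
+    metric_statements:
+      - context: datapoint
+        statements:
+          # Assumed ns -> s conversion; must run before the metric-context rename.
+          - set(value_double, value_double / 1000000000.0) where IsMatch(metric.name, "^DCGM_FI_DEV_(THERMAL|POWER|SYNC_BOOST)_VIOLATION$")
+```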
+
+Upstream component docs:
+[`receiver/prometheusreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver),
+[`processor/transformprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor).
+Upstream dcgm-exporter:
+[NVIDIA/dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter)
+(Apache-2.0).