1 change: 1 addition & 0 deletions docs/README.md
@@ -58,6 +58,7 @@ Source (receiver-side) recipes — RFC-0013 §migration PR-J replacements for th
| [integrations/journald-kernel.md](integrations/journald-kernel.md) | 👤 | Kernel + systemd events via `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL transform preserving `kernelevents.xid` / `gpu.id`. Replaces `kernelevents`. |
| [integrations/k8sobjects-events.md](integrations/k8sobjects-events.md) | 👤 | Kubernetes events via `k8sobjectsreceiver` + OTTL transform preserving the eleven-entry `k8s.event.hint` enum. Replaces `k8sevents`. |
| [integrations/prometheus-scrape.md](integrations/prometheus-scrape.md) | 👤 | Generic Prometheus scrape via `prometheusreceiver` (dcgm-exporter, AMD/Intel/Habana exporters, Kueue) + OTTL `gpu.vendor` normalization. Replaces `dcgm` and `kueue`. |
| [integrations/dcgm-exporter.md](integrations/dcgm-exporter.md) | 👤 | NVIDIA `dcgm-exporter` scraped via `prometheusreceiver` + OTTL rename of `DCGM_FI_*` to the `hw.gpu.*` namespace consumed by pattern detectors #1 / #3 / #4 / #5. NVIDIA-specific extension of `prometheus-scrape.md`. |

## Per-component docs

297 changes: 297 additions & 0 deletions docs/integrations/dcgm-exporter.md
@@ -0,0 +1,297 @@
<!-- tested-against: tracecore -->
<!-- last-verified: 2026-05-31 -->

# NVIDIA dcgm-exporter scraped via `prometheusreceiver` + OTTL rename to `hw.gpu.*`

Tracecore adopts NVIDIA's upstream
[`dcgm-exporter`](https://github.com/NVIDIA/dcgm-exporter)
(Apache-2.0) as the GPU-metrics source on NVIDIA fleets per
[RFC-0013 §2 (Adoption matrix)](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix).
This recipe is the install companion to the pattern-detector docs
([#1 NVLink degradation](../patterns/pattern-1-nvlink-degradation.md),
[#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md),
[#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md),
[#5 PCIe AER](../patterns/pattern-5-pcie-aer.md)).
Each pattern asserts that its input is "dcgm-exporter scraped via
prometheusreceiver" and consumes the `hw.gpu.*` namespace; this recipe
wires those two halves together.

We do not fork or vendor dcgm-exporter. We install it from the
upstream Helm chart, scrape it with the in-tree `prometheusreceiver`,
and OTTL-rename the `DCGM_FI_*` metric names to the OTel
hardware-semconv `hw.gpu.*` namespace plus the small set of tracecore
extensions the pattern detectors depend on.

For the generic Prometheus-scrape shape (Kueue, AMD, Intel, Habana
exporters), see
[`docs/integrations/prometheus-scrape.md`](prometheus-scrape.md).
This recipe is the NVIDIA-specific extension that adds the
metric-name rename layer.

## Config

```yaml
# docs/integrations/examples/dcgm-exporter.yaml
receivers:
  prometheus/dcgm:
    config:
      scrape_configs:
        - job_name: dcgm-exporter
          scrape_interval: 15s
          scrape_timeout: 10s
          metrics_path: /metrics
          fallback_scrape_protocol: PrometheusText1.0.0
          static_configs:
            - targets:
                - REPLACE_WITH_DCGM_EXPORTER_TARGET

processors:
  transform/dcgm_to_hw:
    metric_statements:
      - context: resource
        statements:
          - set(attributes["gpu.vendor"], "nvidia")
          - set(attributes["hw.type"], "gpu")
      # Attribute rules match the original DCGM_FI_* names, so they must
      # run before the metric-context rename below.
      - context: datapoint
        statements:
          - set(attributes["hw.id"], attributes["UUID"]) where attributes["UUID"] != nil
          - delete_key(attributes, "UUID") where attributes["hw.id"] != nil
          - set(attributes["error.type"], "corrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_SBE_")
          - set(attributes["error.type"], "uncorrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_")
          - set(attributes["error.subtype"], "single_bit") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_SBE_")
          - set(attributes["error.subtype"], "double_bit") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_")
          # SBE / DBE are three characters, so the infix is spelled out
          # rather than matched with a two-character wildcard.
          - set(attributes["error.persistence"], "volatile") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_VOL_")
          - set(attributes["error.persistence"], "aggregate") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_AGG_")
          - set(attributes["hw.gpu.throttle.reason"], "thermal") where metric.name == "DCGM_FI_DEV_THERMAL_VIOLATION"
          - set(attributes["hw.gpu.throttle.reason"], "power") where metric.name == "DCGM_FI_DEV_POWER_VIOLATION"
          - set(attributes["hw.gpu.throttle.reason"], "sync_boost") where metric.name == "DCGM_FI_DEV_SYNC_BOOST_VIOLATION"
          - set(attributes["network.io.direction"], "transmit") where metric.name == "DCGM_FI_PROF_PCIE_TX_BYTES"
          - set(attributes["network.io.direction"], "receive") where metric.name == "DCGM_FI_PROF_PCIE_RX_BYTES"
      # Rename last: after this block the DCGM_FI_* names are gone.
      - context: metric
        statements:
          - set(name, "hw.errors") where IsMatch(name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_(VOL|AGG)_TOTAL$")
          - set(name, "hw.temperature") where name == "DCGM_FI_DEV_GPU_TEMP"
          - set(unit, "Cel") where name == "hw.temperature"
          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_THERMAL_VIOLATION"
          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_POWER_VIOLATION"
          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_SYNC_BOOST_VIOLATION"
          - set(unit, "s") where name == "hw.gpu.throttle.duration"
          - set(name, "hw.gpu.io") where name == "DCGM_FI_PROF_PCIE_TX_BYTES"
          - set(name, "hw.gpu.io") where name == "DCGM_FI_PROF_PCIE_RX_BYTES"
          - set(unit, "By") where name == "hw.gpu.io"
  batch:
    send_batch_size: 8192
    timeout: 10s

exporters:
  otlphttp:
    endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT
    compression: gzip
    timeout: 10s

service:
  pipelines:
    metrics/dcgm:
      receivers: [prometheus/dcgm]
      processors: [transform/dcgm_to_hw, batch]
      exporters: [otlphttp]
```

The ordering inside `metric_statements` is load-bearing: attribute
writes that key off the **original** DCGM metric name run in the
`datapoint` context **before** the `metric` context renames `name`
to `hw.*`. Reversing the order leaves the rewritten attributes empty
because the `IsMatch` test runs against the already-renamed
`hw.errors` / `hw.gpu.io` strings.
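
The ordering constraint can be reproduced outside the collector. The sketch below is only an illustration: it uses `grep -E` as a stand-in for OTTL's `IsMatch`, and the `is_match` helper is hypothetical, not a tracecore command. It tests one attribute rule's pattern against the metric name as it looks before and after the rename.

```sh
# Hypothetical helper: succeeds when the regex in $2 matches the name in $1,
# standing in for OTTL IsMatch(metric.name, pattern).
is_match() {
  printf '%s\n' "$1" | grep -Eq "$2"
}

original="DCGM_FI_DEV_ECC_SBE_VOL_TOTAL"
renamed="hw.errors"
pattern='^DCGM_FI_DEV_ECC_SBE_'

# Correct order: the attribute rule sees the original DCGM name.
if is_match "$original" "$pattern"; then
  before_rename="matched"
else
  before_rename="skipped"
fi

# Reversed order: the rename already ran, so the rule sees hw.errors.
if is_match "$renamed" "$pattern"; then
  after_rename="matched"
else
  after_rename="skipped"
fi

echo "attribute rule evaluated before rename: $before_rename"
echo "attribute rule evaluated after rename:  $after_rename"
```

With the rename applied first, the rule is skipped and `error.type` / `error.subtype` are never written.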

Validate with the in-tree binary:

```sh
./_build/tracecore validate --config=docs/integrations/examples/dcgm-exporter.yaml
```

Exit 0 means the config parses, the scrape-target URL is well-formed,
and every OTTL statement type-checks against the metric / datapoint /
resource contexts.

## Install dcgm-exporter

dcgm-exporter ships as an Apache-2.0 Helm chart. No Pro / Enterprise
features are required.

```sh
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm upgrade --install dcgm-exporter gpu-helm-charts/dcgm-exporter \
  --namespace gpu-operator --create-namespace \
  --values dcgm-exporter-values.yaml
```

Minimal `dcgm-exporter-values.yaml` (the chart defaults cover
everything else; we override only the keys we depend on):

```yaml
# Pin GPU nodes only; the chart's default tolerations cover the
# nvidia.com/gpu taint that GPU Operator applies. Adjust the
# nodeSelector to match whatever label your node-feature-discovery /
# GPU operator stamps.
nodeSelector:
  nvidia.com/gpu.present: "true"

# We scrape dcgm-exporter directly from the co-located tracecore
# DaemonSet at localhost:9400. The Service stays available for
# operator-side `kubectl port-forward` debugging but no ServiceMonitor
# is needed - tracecore is the scraper, not Prometheus Operator.
serviceMonitor:
  enabled: false

service:
  enable: true
  type: ClusterIP
  port: 9400

# Use the default counter set. To trim cardinality, point at a
# custom ConfigMap via `-m <namespace>:<configmap>` per the upstream
# chart README; the recipe's OTTL transform only renames metrics
# present in the scrape, so trimming dcgm-exporter is safe.
arguments: ["-f", "/etc/dcgm-exporter/default-counters.csv"]
```

[Upstream chart values reference](https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/values.yaml).

Confirm the pod is up and serving Prometheus-text on the conventional
:9400 endpoint:

```sh
kubectl -n gpu-operator port-forward ds/dcgm-exporter 9400:9400 &
curl -sS http://127.0.0.1:9400/metrics | head -20
# # HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# # TYPE DCGM_FI_DEV_SM_CLOCK gauge
# ...
```

The first line **must** start with `# HELP DCGM_` - that's what
`prometheusreceiver` requires (see the
[prometheus-scrape recipe Failure modes table](prometheus-scrape.md#failure-modes)).
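
The first-line requirement can be turned into a scripted check. This is a minimal sketch, assuming the payload arrives on stdin; the function name `first_line_is_dcgm_help` is hypothetical, and the canned sample reuses the two lines from the expected output above.

```sh
# Succeeds when the first line of a Prometheus-text payload on stdin
# starts with "# HELP DCGM_".
first_line_is_dcgm_help() {
  head -n 1 | grep -q '^# HELP DCGM_'
}

# Demo against a canned payload instead of a live endpoint.
sample='# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge'

if printf '%s\n' "$sample" | first_line_is_dcgm_help; then
  echo "payload looks like dcgm-exporter output"
else
  echo "unexpected first line"
fi

# Against the port-forward from the step above:
#   curl -sS http://127.0.0.1:9400/metrics | first_line_is_dcgm_help
```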

## Deployment shape

Run tracecore as a **DaemonSet** on the same node selector as
dcgm-exporter (`nvidia.com/gpu.present: "true"`). Each tracecore pod
scrapes its node-local dcgm-exporter pod at `localhost:9400`. No
cluster-wide service discovery, no cross-node scrape traffic. This is
the same shape as the
[generic prometheus-scrape recipe](prometheus-scrape.md#deployment-shape).

Set `REPLACE_WITH_DCGM_EXPORTER_TARGET` to `localhost:9400` in the
DaemonSet shape, or to the dcgm-exporter Service DNS
(`dcgm-exporter.gpu-operator.svc:9400`) if you'd rather centralize
scraping in a single tracecore Deployment.

The dcgm-exporter chart's default tolerations cover the
`nvidia.com/gpu: NoSchedule` taint NVIDIA GPU Operator applies. If
your cluster uses a different taint, mirror it on both the
dcgm-exporter chart values and the tracecore DaemonSet so the two
pods co-schedule.

## Verify

End-to-end check after `helm install` + `tracecore` rollout:

```sh
# 1. Confirm dcgm-exporter is shipping the canonical metric names
kubectl -n gpu-operator port-forward ds/dcgm-exporter 9400:9400 &
curl -sS http://127.0.0.1:9400/metrics \
  | grep -E '^DCGM_FI_(DEV_GPU_TEMP|PROF_PCIE_TX_BYTES|DEV_ECC_DBE_VOL_TOTAL) '
# DCGM_FI_DEV_GPU_TEMP{gpu="0", UUID="GPU-...", ...} 42
# DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0", UUID="GPU-...", ...} 0
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="0", UUID="GPU-...", ...} 0
```

```sh
# 2. Confirm tracecore is renaming them. Scrape tracecore's own
# self-telemetry endpoint and look for the receiver-accepted
# counter for the prometheus/dcgm receiver.
kubectl -n tracecore port-forward ds/tracecore 8888:8888 &
curl -sS http://127.0.0.1:8888/metrics \
  | grep -E '^otelcol_receiver_accepted_metric_points\{receiver="prometheus/dcgm"'
```

```sh
# 3. End-to-end validate the rendered config (with REPLACE_* values
# substituted) before promoting to production.
./_build/tracecore validate --config=/path/to/rendered.yaml
```

If `otelcol_receiver_accepted_metric_points{receiver="prometheus/dcgm"}`
is non-zero but the backend isn't seeing `hw.errors` / `hw.temperature`,
the OTTL ordering drifted - see the load-bearing-ordering note above.

## DCGM_FI_* -> hw.* mapping

The table below lists each rename the recipe applies and the pattern
detector that consumes it. Canonical OTel hardware-semconv names
([spec](https://opentelemetry.io/docs/specs/semconv/hardware/))
are marked **(semconv)**; tracecore extensions are marked **(ext)**
and are tracked for upstream contribution alongside the cross-vendor
`gpu.vendor` resource attribute (RFC-0013 §3 + §5).

| DCGM metric | Tracecore name | Attributes added | Pattern consumer |
|---|---|---|---|
| `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=corrected`, `error.subtype=single_bit` **(ext)**, `error.persistence=volatile` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) precursor signal |
| `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=uncorrected`, `error.subtype=double_bit` **(ext)**, `error.persistence=volatile` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) primary signal |
| `DCGM_FI_DEV_ECC_SBE_AGG_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=corrected`, `error.subtype=single_bit` **(ext)**, `error.persistence=aggregate` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) cross-reboot persistence |
| `DCGM_FI_DEV_ECC_DBE_AGG_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=uncorrected`, `error.subtype=double_bit` **(ext)**, `error.persistence=aggregate` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) cross-reboot persistence |
| `DCGM_FI_DEV_GPU_TEMP` | `hw.temperature` **(semconv)** | `hw.type=gpu` | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) supporting context |
| `DCGM_FI_DEV_THERMAL_VIOLATION` | `hw.gpu.throttle.duration` **(ext)** | `hw.gpu.throttle.reason=thermal` **(ext)** | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) primary signal |
| `DCGM_FI_DEV_POWER_VIOLATION` | `hw.gpu.throttle.duration` **(ext)** | `hw.gpu.throttle.reason=power` **(ext)** | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) sibling reason |
| `DCGM_FI_DEV_SYNC_BOOST_VIOLATION` | `hw.gpu.throttle.duration` **(ext)** | `hw.gpu.throttle.reason=sync_boost` **(ext)** | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) sibling reason |
| `DCGM_FI_PROF_PCIE_TX_BYTES` | `hw.gpu.io` **(semconv)** | `network.io.direction=transmit` | [#5 PCIe AER](../patterns/pattern-5-pcie-aer.md) primary signal |
| `DCGM_FI_PROF_PCIE_RX_BYTES` | `hw.gpu.io` **(semconv)** | `network.io.direction=receive` | [#5 PCIe AER](../patterns/pattern-5-pcie-aer.md) primary signal |
| `DCGM_FI_DEV_NVLINK_BANDWIDTH_L{0..N}` | `hw.gpu.nvlink.io` **(ext)** | `hw.gpu.nvlink.link={0..N}` **(ext)**, `network.io.direction` | [#1 NVLink degradation](../patterns/pattern-1-nvlink-degradation.md) primary signal |

**Pattern-detector wiring note for #1 NVLink:** the per-link
`DCGM_FI_DEV_NVLINK_BANDWIDTH_L*` family is commented out by default
in dcgm-exporter's `default-counters.csv`. Enable the per-link
counters by mounting a custom counter ConfigMap that uncomments those
rows; the chart's `arguments` array passes `-m <namespace>:<configmap>`
to dcgm-exporter. Without the per-link counters the #1 detector falls
back to the aggregate `DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL` and loses
the per-link divergence signal.

The transform stamps two resource attributes on every scraped
resource: `gpu.vendor=nvidia` and `hw.type=gpu`. On each datapoint it
promotes the dcgm-exporter `UUID` label (the GPU UUID, unique within
the host) to `hw.id` and then drops `UUID`. The `hw.gpu.pci.bdf`
attribute referenced by pattern #5 is **not** populated by this
recipe: `dcgm-exporter` does not currently export the PCI BDF as a
label. The pattern doc's PCI BDF cross-reference relies on the
journald AER receiver in the
[journald-kernel recipe](journald-kernel.md) and joins on `hw.id`.
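
Because `hw.id` depends entirely on the `UUID` label, it can help to confirm that every scraped series carries one. The sketch below assumes Prometheus-text input on stdin; `series_missing_uuid` is a hypothetical helper name, and the sample lines are illustrative rather than captured from a real GPU.

```sh
# Prints every sample line (non-comment, non-blank) on stdin that lacks a
# UUID="..." label. Exits 0 only when no such line exists.
series_missing_uuid() {
  awk '!/^#/ && NF > 0 && !/UUID="[^"]+"/ { print; bad = 1 } END { exit bad }'
}

# Illustrative payload: the second series has lost its UUID label.
sample='# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaaa"} 42
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 40'

if printf '%s\n' "$sample" | series_missing_uuid; then
  echo "every series carries a UUID label"
else
  echo "the series printed above would end up without hw.id"
fi

# Against a live exporter:
#   curl -sS http://127.0.0.1:9400/metrics | series_missing_uuid
```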

## Placeholders

| Placeholder | What to fill in |
|---|---|
| `REPLACE_WITH_OTLP_HTTP_ENDPOINT` | The OTLP/HTTP base URL of your sink. `/v1/metrics` is appended automatically per the OTLP/HTTP spec. |
| `REPLACE_WITH_DCGM_EXPORTER_TARGET` | `localhost:9400` for a tracecore DaemonSet shape, or `dcgm-exporter.gpu-operator.svc:9400` for a centralized Deployment shape. The `:port` suffix is mandatory - `prometheusreceiver` rejects bare hostnames at validate. |

Tracecore does not expand environment variables in YAML. Render the
literals at deploy time via `envsubst`, a Helm template, or a
Kubernetes secret-injection driver.
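
One way to render the literals is plain `sed`, since the placeholders are bare tokens rather than `${VAR}` references. This is a minimal sketch rather than a tracecore-provided tool: the `render_config` and `no_placeholders_left` helpers are hypothetical, and `https://otlp.example.com` stands in for your own sink.

```sh
# Hypothetical helper: reads a config template on stdin, substitutes the two
# placeholders from this recipe, and writes the rendered config to stdout.
# "|" is the sed delimiter so URL slashes need no escaping; values that
# themselves contain "|" or "&" would need escaping first.
render_config() {
  sed -e "s|REPLACE_WITH_DCGM_EXPORTER_TARGET|$1|g" \
      -e "s|REPLACE_WITH_OTLP_HTTP_ENDPOINT|$2|g"
}

# Succeeds only when no REPLACE_WITH_* token survives in the file named by $1.
no_placeholders_left() {
  ! grep -q 'REPLACE_WITH_' "$1"
}

# Usage (the endpoint value is a placeholder for your own sink):
#   render_config localhost:9400 https://otlp.example.com \
#     < docs/integrations/examples/dcgm-exporter.yaml > rendered.yaml
#   no_placeholders_left rendered.yaml \
#     && ./_build/tracecore validate --config=rendered.yaml
```

Checking for leftover tokens before calling `validate` catches a missed substitution early.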

## Failure modes

| Symptom | First check |
|---|---|
| `hw.errors` series flow but `error.subtype` attribute empty | OTTL statement-ordering drift: the `datapoint`-context rewrites must run BEFORE the `metric`-context `set(name, "hw.errors")`. Restore the order in the example. |
| `dcgm-exporter` pod stuck `CrashLoopBackOff` on a Hopper / H100 node | The chart default image often lags new GPU silicon. Bump the image tag to the matching `nvcr.io/nvidia/k8s/dcgm-exporter:<datacenter-gpu-manager-version>-<dcgm-exporter-version>` build that lists Hopper support. |
| `prometheus/dcgm: address ... incorrect` at validate | `REPLACE_WITH_DCGM_EXPORTER_TARGET` was not rendered. The validator rejects literal placeholders that look like hostnames. Render at deploy time. |
| `DCGM_FI_DEV_NVLINK_BANDWIDTH_L0` absent from `/metrics` | The default `default-counters.csv` ships the per-link family commented out. Mount a custom counter ConfigMap and pass `-m <ns>:<cm>` via `arguments`. See the upstream README's "Changing metrics" section. |
| `hw.id` empty | dcgm-exporter dropped the `UUID` label (it's emitted by default but a custom counter ConfigMap that overrides the label-set may exclude it). Confirm via `curl localhost:9400/metrics` that every series carries `UUID="GPU-..."`. |
| Throttle metric stuck at the same value | DCGM exposes throttle durations as cumulative-counter nanoseconds. Use `rate()` / `increase()` against `hw_gpu_throttle_duration_seconds_total` in PromQL - not the raw value. The unit-rename to `s` is cosmetic; DCGM continues to emit nanoseconds and the OTTL transform does not divide. |
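
The throttle row's unit caveat is easiest to see with numbers. The transform does not divide, so a consumer reading two successive raw values has to convert the delta itself. The sketch below assumes the nanosecond unit stated in the table; the helper name and the two sample counter readings are made up for illustration.

```sh
# Converts the delta between two cumulative nanosecond counter readings
# ($1 = previous, $2 = current) to seconds with millisecond precision.
ns_delta_to_seconds() {
  awk -v prev="$1" -v cur="$2" 'BEGIN { printf "%.3f\n", (cur - prev) / 1e9 }'
}

# Illustrative readings from two consecutive scrapes of one GPU.
prev_ns=1500000000
cur_ns=4000000000

echo "throttled for $(ns_delta_to_seconds "$prev_ns" "$cur_ns") s between scrapes"
```

The same division by 1e9 applies to the output of `rate()` or `increase()` when a dashboard needs true seconds.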

Upstream component docs:
[`receiver/prometheusreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver),
[`processor/transformprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor).
Upstream dcgm-exporter:
[NVIDIA/dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter)
(Apache-2.0).