diff --git a/docs/README.md b/docs/README.md
index e239b541..8389afe3 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -58,6 +58,7 @@ Source (receiver-side) recipes — RFC-0013 §migration PR-J replacements for th
 | [integrations/journald-kernel.md](integrations/journald-kernel.md) | 👤 | Kernel + systemd events via `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL transform preserving `kernelevents.xid` / `gpu.id`. Replaces `kernelevents`. |
 | [integrations/k8sobjects-events.md](integrations/k8sobjects-events.md) | 👤 | Kubernetes events via `k8sobjectsreceiver` + OTTL transform preserving the eleven-entry `k8s.event.hint` enum. Replaces `k8sevents`. |
 | [integrations/prometheus-scrape.md](integrations/prometheus-scrape.md) | 👤 | Generic Prometheus scrape via `prometheusreceiver` (dcgm-exporter, AMD/Intel/Habana exporters, Kueue) + OTTL `gpu.vendor` normalization. Replaces `dcgm` and `kueue`. |
+| [integrations/dcgm-exporter.md](integrations/dcgm-exporter.md) | 👤 | NVIDIA `dcgm-exporter` scraped via `prometheusreceiver` + OTTL rename of `DCGM_FI_*` to the `hw.gpu.*` namespace consumed by pattern detectors #1 / #3 / #4 / #5. NVIDIA-specific extension of `prometheus-scrape.md`. |
 
 ## Per-component docs
 
diff --git a/docs/integrations/dcgm-exporter.md b/docs/integrations/dcgm-exporter.md
new file mode 100644
index 00000000..bb0d6058
--- /dev/null
+++ b/docs/integrations/dcgm-exporter.md
@@ -0,0 +1,297 @@
+
+
+
+# NVIDIA dcgm-exporter scraped via `prometheusreceiver` + OTTL rename to `hw.gpu.*`
+
+Tracecore adopts NVIDIA's upstream
+[`dcgm-exporter`](https://github.com/NVIDIA/dcgm-exporter)
+(Apache-2.0) as the GPU-metrics source on NVIDIA fleets per
+[RFC-0013 §2 (Adoption matrix)](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix).
+This recipe is the install companion to the pattern-detector docs
+([#1 NVLink degradation](../patterns/pattern-1-nvlink-degradation.md),
+[#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md),
+[#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md),
+[#5 PCIe AER](../patterns/pattern-5-pcie-aer.md)).
+Each pattern asserts that its input is "dcgm-exporter scraped via
+prometheusreceiver" and consumes the `hw.gpu.*` namespace; this recipe
+wires those two halves together.
+
+We do not fork or vendor dcgm-exporter. We install it from the
+upstream Helm chart, scrape it with the in-tree `prometheusreceiver`,
+and OTTL-rename the `DCGM_FI_*` metric names to the OTel
+hardware-semconv `hw.gpu.*` namespace plus the small set of tracecore
+extensions the pattern detectors depend on.
+
+For the generic Prometheus-scrape shape (Kueue, AMD, Intel, Habana
+exporters), see
+[`docs/integrations/prometheus-scrape.md`](prometheus-scrape.md).
+This recipe is the NVIDIA-specific extension that adds the
+metric-name rename layer.
+
+## Config
+
+```yaml
+# docs/integrations/examples/dcgm-exporter.yaml
+receivers:
+  prometheus/dcgm:
+    config:
+      scrape_configs:
+        - job_name: dcgm-exporter
+          scrape_interval: 15s
+          scrape_timeout: 10s
+          metrics_path: /metrics
+          fallback_scrape_protocol: PrometheusText1.0.0
+          static_configs:
+            - targets:
+                - REPLACE_WITH_DCGM_EXPORTER_TARGET
+
+processors:
+  transform/dcgm_to_hw:
+    metric_statements:
+      - context: resource
+        statements:
+          - set(attributes["gpu.vendor"], "nvidia")
+          - set(attributes["hw.type"], "gpu")
+      - context: datapoint
+        statements:
+          - set(attributes["hw.id"], attributes["UUID"]) where attributes["UUID"] != nil
+          - delete_key(attributes, "UUID") where attributes["hw.id"] != nil
+          - set(attributes["error.type"], "corrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_SBE_")
+          - set(attributes["error.type"], "uncorrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_")
+          - set(attributes["error.subtype"], "single_bit") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_SBE_")
+          - set(attributes["error.subtype"], "double_bit") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_")
+          - set(attributes["error.persistence"], "volatile") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_.._VOL_")
+          - set(attributes["error.persistence"], "aggregate") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_.._AGG_")
+          - set(attributes["hw.gpu.throttle.reason"], "thermal") where metric.name == "DCGM_FI_DEV_THERMAL_VIOLATION"
+          - set(attributes["hw.gpu.throttle.reason"], "power") where metric.name == "DCGM_FI_DEV_POWER_VIOLATION"
+          - set(attributes["hw.gpu.throttle.reason"], "sync_boost") where metric.name == "DCGM_FI_DEV_SYNC_BOOST_VIOLATION"
+          - set(attributes["network.io.direction"], "transmit") where metric.name == "DCGM_FI_PROF_PCIE_TX_BYTES"
+          - set(attributes["network.io.direction"], "receive") where metric.name == "DCGM_FI_PROF_PCIE_RX_BYTES"
+      - context: metric
+        statements:
+          - set(name, "hw.errors") where IsMatch(name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_(VOL|AGG)_TOTAL$")
+          - set(name, "hw.temperature") where name == "DCGM_FI_DEV_GPU_TEMP"
+          - set(unit, "Cel") where name == "hw.temperature"
+          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_THERMAL_VIOLATION"
+          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_POWER_VIOLATION"
+          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_SYNC_BOOST_VIOLATION"
+          - set(unit, "s") where name == "hw.gpu.throttle.duration"
+          - set(name, "hw.gpu.io") where name == "DCGM_FI_PROF_PCIE_TX_BYTES"
+          - set(name, "hw.gpu.io") where name == "DCGM_FI_PROF_PCIE_RX_BYTES"
+          - set(unit, "By") where name == "hw.gpu.io"
+  batch:
+    send_batch_size: 8192
+    timeout: 10s
+
+exporters:
+  otlphttp:
+    endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT
+    compression: gzip
+    timeout: 10s
+
+service:
+  pipelines:
+    metrics/dcgm:
+      receivers: [prometheus/dcgm]
+      processors: [transform/dcgm_to_hw, batch]
+      exporters: [otlphttp]
+```
+
+The ordering inside `metric_statements` is load-bearing: attribute
+writes that key off the **original** DCGM metric name run in the
+`datapoint` context **before** the `metric` context renames `name`
+to `hw.*`. Reversing the order leaves the rewritten attributes empty
+because the `IsMatch` test runs against the already-renamed
+`hw.errors` / `hw.gpu.io` strings.
+
+Validate with the in-tree binary:
+
+```sh
+./_build/tracecore validate --config=docs/integrations/examples/dcgm-exporter.yaml
+```
+
+Exit 0 means the config parses, the scrape-target URL is well-formed,
+and every OTTL statement type-checks against the metric / datapoint /
+resource contexts.
+
+## Install dcgm-exporter
+
+dcgm-exporter ships as an Apache-2.0 Helm chart. No Pro / Enterprise
+features required.
+
+```sh
+helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
+helm repo update
+helm upgrade --install dcgm-exporter gpu-helm-charts/dcgm-exporter \
+  --namespace gpu-operator --create-namespace \
+  --values dcgm-exporter-values.yaml
+```
+
+Minimal `dcgm-exporter-values.yaml` (the chart defaults cover
+everything else; we override only the keys we depend on):
+
+```yaml
+# Pin GPU nodes only; the chart's default tolerations cover the
+# nvidia.com/gpu taint that GPU Operator applies. Adjust the
+# nodeSelector to match whatever label your node-feature-discovery /
+# GPU operator stamps.
+nodeSelector:
+  nvidia.com/gpu.present: "true"
+
+# We scrape dcgm-exporter directly from the co-located tracecore
+# DaemonSet at localhost:9400. The Service stays available for
+# operator-side `kubectl port-forward` debugging but no ServiceMonitor
+# is needed - tracecore is the scraper, not Prometheus Operator.
+serviceMonitor:
+  enabled: false
+
+service:
+  enable: true
+  type: ClusterIP
+  port: 9400
+
+# Use the default counter set. To trim cardinality, point at a
+# custom ConfigMap via `-m :` per the upstream
+# chart README; the recipe's OTTL transform only renames metrics
+# present in the scrape, so trimming dcgm-exporter is safe.
+arguments: ["-f", "/etc/dcgm-exporter/default-counters.csv"]
+```
+
+[Upstream chart values reference](https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/values.yaml).
+
+Confirm the pod is up and serving Prometheus-text on the conventional
+:9400 endpoint:
+
+```sh
+kubectl -n gpu-operator port-forward ds/dcgm-exporter 9400:9400 &
+curl -sS http://127.0.0.1:9400/metrics | head -20
+# # HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
+# # TYPE DCGM_FI_DEV_SM_CLOCK gauge
+# ...
+```
+
+The first line **must** start with `# HELP DCGM_` - that's what
+`prometheusreceiver` requires (see the
+[prometheus-scrape recipe Failure modes table](prometheus-scrape.md#failure-modes)).
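A scripted version of the first-line check above might look like the following Python sketch. The helper name `first_line_is_dcgm_help` and the sample payloads are illustrative assumptions, not part of tracecore or dcgm-exporter; in practice the text would be the body returned by the `curl` command shown above.

```python
def first_line_is_dcgm_help(exposition: str) -> bool:
    """Return True when the first non-empty line starts with '# HELP DCGM_'."""
    for line in exposition.splitlines():
        if line.strip():
            return line.startswith("# HELP DCGM_")
    # An empty body has no first line to check.
    return False


# Hypothetical sample payloads for illustration only.
healthy = (
    "# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).\n"
    "# TYPE DCGM_FI_DEV_SM_CLOCK gauge\n"
)
proxy_error = "<html>502 Bad Gateway</html>"

print(first_line_is_dcgm_help(healthy))
print(first_line_is_dcgm_help(proxy_error))
```

A check like this could gate a rollout script so that an HTML error page or an empty response from the port-forward is caught before the collector is pointed at the target.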
+
+## Deployment shape
+
+Run tracecore as a **DaemonSet** on the same node selector as
+dcgm-exporter (`nvidia.com/gpu.present: "true"`). Each tracecore pod
+scrapes its node-local dcgm-exporter pod at `localhost:9400`. No
+cluster-wide service discovery, no cross-node scrape traffic. This is
+the same shape as the
+[generic prometheus-scrape recipe](prometheus-scrape.md#deployment-shape).
+
+Set `REPLACE_WITH_DCGM_EXPORTER_TARGET` to `localhost:9400` in the
+DaemonSet shape, or to the dcgm-exporter Service DNS
+(`dcgm-exporter.gpu-operator.svc:9400`) if you'd rather centralize
+scraping in a single tracecore Deployment.
+
+The dcgm-exporter chart's default tolerations cover the
+`nvidia.com/gpu: NoSchedule` taint NVIDIA GPU Operator applies. If
+your cluster uses a different taint, mirror it on both the
+dcgm-exporter chart values and the tracecore DaemonSet so the two
+pods co-schedule.
+
+## Verify
+
+End-to-end check after `helm install` + `tracecore` rollout:
+
+```sh
+# 1. Confirm dcgm-exporter is shipping the canonical metric names.
+#    The pattern ends in \{ because every series carries a label set.
+kubectl -n gpu-operator port-forward ds/dcgm-exporter 9400:9400 &
+curl -sS http://127.0.0.1:9400/metrics \
+  | grep -E '^DCGM_FI_(DEV_GPU_TEMP|PROF_PCIE_TX_BYTES|DEV_ECC_DBE_VOL_TOTAL)\{'
+# DCGM_FI_DEV_GPU_TEMP{gpu="0", UUID="GPU-...", ...} 42
+# DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0", UUID="GPU-...", ...} 0
+# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL{gpu="0", UUID="GPU-...", ...} 0
+```
+
+```sh
+# 2. Confirm tracecore is accepting them. Scrape tracecore's own
+#    self-telemetry endpoint and look for the receiver-accepted
+#    counter for the prometheus/dcgm receiver.
+kubectl -n tracecore port-forward ds/tracecore 8888:8888 &
+curl -sS http://127.0.0.1:8888/metrics \
+  | grep -E '^otelcol_receiver_accepted_metric_points\{receiver="prometheus/dcgm"'
+```
+
+```sh
+# 3. End-to-end validate the rendered config (with REPLACE_* values
+#    substituted) before promoting to production.
+./_build/tracecore validate --config=/path/to/rendered.yaml
+```
+
+If `otelcol_receiver_accepted_metric_points{receiver="prometheus/dcgm"}`
+is non-zero but the backend isn't seeing `hw.errors` / `hw.temperature`,
+confirm that `transform/dcgm_to_hw` is listed in the `metrics/dcgm`
+pipeline. If the series arrive without their `error.*` attributes,
+the OTTL ordering drifted - see the load-bearing-ordering note above.
+
+## DCGM_FI_* -> hw.* mapping
+
+The renames the recipe applies and the pattern detector that consumes
+each. Canonical OTel hardware-semconv names ([spec](https://opentelemetry.io/docs/specs/semconv/hardware/))
+are marked **(semconv)**; tracecore extensions are marked **(ext)**
+and are tracked for upstream contribution alongside the cross-vendor
+`gpu.vendor` resource attribute (RFC-0013 §3 + §5).
+
+| DCGM metric | Tracecore name | Attributes added | Pattern consumer |
+|---|---|---|---|
+| `DCGM_FI_DEV_ECC_SBE_VOL_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=corrected`, `error.subtype=single_bit` **(ext)**, `error.persistence=volatile` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) precursor signal |
+| `DCGM_FI_DEV_ECC_DBE_VOL_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=uncorrected`, `error.subtype=double_bit` **(ext)**, `error.persistence=volatile` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) primary signal |
+| `DCGM_FI_DEV_ECC_SBE_AGG_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=corrected`, `error.subtype=single_bit` **(ext)**, `error.persistence=aggregate` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) cross-reboot persistence |
+| `DCGM_FI_DEV_ECC_DBE_AGG_TOTAL` | `hw.errors` **(semconv)** | `hw.type=gpu`, `error.type=uncorrected`, `error.subtype=double_bit` **(ext)**, `error.persistence=aggregate` **(ext)** | [#3 HBM ECC](../patterns/pattern-3-hbm-ecc.md) cross-reboot persistence |
+| `DCGM_FI_DEV_GPU_TEMP` | `hw.temperature` **(semconv)** | `hw.type=gpu` | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) supporting context |
+| `DCGM_FI_DEV_THERMAL_VIOLATION` | `hw.gpu.throttle.duration` **(ext)** | `hw.gpu.throttle.reason=thermal` **(ext)** | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) primary signal |
+| `DCGM_FI_DEV_POWER_VIOLATION` | `hw.gpu.throttle.duration` **(ext)** | `hw.gpu.throttle.reason=power` **(ext)** | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) sibling reason |
+| `DCGM_FI_DEV_SYNC_BOOST_VIOLATION` | `hw.gpu.throttle.duration` **(ext)** | `hw.gpu.throttle.reason=sync_boost` **(ext)** | [#4 thermal throttle](../patterns/pattern-4-thermal-throttle.md) sibling reason |
+| `DCGM_FI_PROF_PCIE_TX_BYTES` | `hw.gpu.io` **(semconv)** | `network.io.direction=transmit` | [#5 PCIe AER](../patterns/pattern-5-pcie-aer.md) primary signal |
+| `DCGM_FI_PROF_PCIE_RX_BYTES` | `hw.gpu.io` **(semconv)** | `network.io.direction=receive` | [#5 PCIe AER](../patterns/pattern-5-pcie-aer.md) primary signal |
+| `DCGM_FI_DEV_NVLINK_BANDWIDTH_L{0..N}` | `hw.gpu.nvlink.io` **(ext)** | `hw.gpu.nvlink.link={0..N}` **(ext)**, `network.io.direction` | [#1 NVLink degradation](../patterns/pattern-1-nvlink-degradation.md) primary signal |
+
+**Pattern-detector wiring note for #1 NVLink:** the per-link
+`DCGM_FI_DEV_NVLINK_BANDWIDTH_L*` family is commented out by default
+in dcgm-exporter's `default-counters.csv`. Enable the per-link
+counters by mounting a custom counter ConfigMap that uncomments those
+rows; the chart's `arguments` array passes `-m :`
+to dcgm-exporter. Without the per-link counters the #1 detector falls
+back to the aggregate `DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL` and loses
+the per-link divergence signal.
+
+Resource attributes stamped on every scraped resource by the transform:
+`gpu.vendor=nvidia`, `hw.type=gpu`. Per-datapoint
+`hw.id` is promoted from the dcgm-exporter `UUID` label (the GPU UUID,
+unique within the host). `hw.gpu.pci.bdf` referenced by pattern #5 is
+**not** populated by this recipe - `dcgm-exporter` does not currently
+export the PCI BDF as a label; the pattern doc's PCI BDF cross-reference
+relies on the journald AER receiver in the
+[journald-kernel recipe](journald-kernel.md) and joins on `hw.id`.
+
+## Placeholders
+
+| Placeholder | What to fill in |
+|---|---|
+| `REPLACE_WITH_OTLP_HTTP_ENDPOINT` | The OTLP/HTTP base URL of your sink. `/v1/metrics` is appended automatically per the OTLP/HTTP spec. |
+| `REPLACE_WITH_DCGM_EXPORTER_TARGET` | `localhost:9400` for a tracecore DaemonSet shape, or `dcgm-exporter.gpu-operator.svc:9400` for a centralized Deployment shape. The `:port` suffix is mandatory - `prometheusreceiver` rejects bare hostnames at validate. |
+
+Tracecore does not expand environment variables in YAML. Render the
+literals at deploy time via `envsubst`, a Helm template, or a
+Kubernetes secret-injection driver.
+
+## Failure modes
+
+| Symptom | First check |
+|---|---|
+| `hw.errors` series flow but `error.subtype` attribute empty | OTTL statement-ordering drift: the `datapoint`-context rewrites must run BEFORE the `metric`-context `set(name, "hw.errors")`. Restore the order in the example. |
+| `dcgm-exporter` pod stuck `CrashLoopBackOff` on a Hopper / H100 node | The chart default image often lags new GPU silicon. Bump the image tag to the matching `nvcr.io/nvidia/k8s/dcgm-exporter:-` build that lists Hopper support. |
+| `prometheus/dcgm: address ... incorrect` at validate | `REPLACE_WITH_DCGM_EXPORTER_TARGET` was not rendered. The validator rejects literal placeholders that look like hostnames. Render at deploy time. |
+| `DCGM_FI_DEV_NVLINK_BANDWIDTH_L0` absent from `/metrics` | The default `default-counters.csv` ships the per-link family commented out. Mount a custom counter ConfigMap and pass `-m :` via `arguments`. See the upstream README's "Changing metrics" section. |
+| `hw.id` empty | dcgm-exporter dropped the `UUID` label (it's emitted by default but a custom counter ConfigMap that overrides the label-set may exclude it). Confirm via `curl localhost:9400/metrics` that every series carries `UUID="GPU-..."`. |
+| Throttle metric stuck at the same value | DCGM exposes throttle durations as cumulative-counter nanoseconds. Use `rate()` / `increase()` against `hw_gpu_throttle_duration_seconds_total` in PromQL - not the raw value. The unit-rename to `s` is cosmetic; DCGM continues to emit nanoseconds and the OTTL transform does not divide. |
+
+Upstream component docs:
+[`receiver/prometheusreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver),
+[`processor/transformprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor).
+Upstream dcgm-exporter:
+[NVIDIA/dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter)
+(Apache-2.0).
diff --git a/docs/integrations/examples/dcgm-exporter.yaml b/docs/integrations/examples/dcgm-exporter.yaml
new file mode 100644
index 00000000..ccc21115
--- /dev/null
+++ b/docs/integrations/examples/dcgm-exporter.yaml
@@ -0,0 +1,119 @@
+# docs/integrations/examples/dcgm-exporter.yaml
+#
+# Scrape upstream NVIDIA dcgm-exporter via `prometheusreceiver`, then
+# OTTL-rename DCGM_FI_* metric names to the `hw.gpu.*` namespace the
+# tracecore pattern detectors (patterns #1, #3, #4, #5) consume. This
+# is the recipe-form of the v0.3.0 NVIDIA adoption shape per
+# RFC-0013 §2 (Adoption matrix); we ADOPT dcgm-exporter as the
+# metrics source, we don't fork or vendor it.
+#
+# Deployment shape: tracecore DaemonSet alongside dcgm-exporter
+# DaemonSet on GPU nodes -> Prometheus-style scrape every 15s ->
+# OTTL `transform` rewires metric names + stamps `hw.type=gpu`,
+# `gpu.vendor=nvidia` resource attrs -> otlphttpexporter to backend.
+# Pair-of-pods stays on a single GPU node; no cluster-wide service
+# discovery required.
+#
+# Renames applied here (DCGM_FI_* -> hw.gpu.*):
+#   DCGM_FI_DEV_ECC_SBE_VOL_TOTAL    -> hw.errors{error.type=corrected,error.subtype=single_bit,error.persistence=volatile}
+#   DCGM_FI_DEV_ECC_DBE_VOL_TOTAL    -> hw.errors{error.type=uncorrected,error.subtype=double_bit,error.persistence=volatile}
+#   DCGM_FI_DEV_ECC_SBE_AGG_TOTAL    -> hw.errors{error.type=corrected,error.subtype=single_bit,error.persistence=aggregate}
+#   DCGM_FI_DEV_ECC_DBE_AGG_TOTAL    -> hw.errors{error.type=uncorrected,error.subtype=double_bit,error.persistence=aggregate}
+#   DCGM_FI_DEV_GPU_TEMP             -> hw.temperature{hw.type=gpu}
+#   DCGM_FI_DEV_THERMAL_VIOLATION    -> hw.gpu.throttle.duration{hw.gpu.throttle.reason=thermal}
+#   DCGM_FI_DEV_POWER_VIOLATION      -> hw.gpu.throttle.duration{hw.gpu.throttle.reason=power}
+#   DCGM_FI_DEV_SYNC_BOOST_VIOLATION -> hw.gpu.throttle.duration{hw.gpu.throttle.reason=sync_boost}
+#   DCGM_FI_PROF_PCIE_TX_BYTES       -> hw.gpu.io{network.io.direction=transmit}
+#   DCGM_FI_PROF_PCIE_RX_BYTES       -> hw.gpu.io{network.io.direction=receive}
+#
+# Notes on naming drift: `hw.errors`, `hw.gpu.io`, `hw.temperature`
+# are canonical OTel hardware-semconv names. `hw.gpu.throttle.duration`,
+# `hw.gpu.nvlink.io`, and the `error.subtype`/`error.persistence`
+# attributes are tracecore extensions ahead of the OTel hw.* semconv
+# contribution (see docs/integrations/dcgm-exporter.md
+# §"DCGM_FI_* -> hw.* mapping"). Pattern detectors #1/#3/#4/#5 key on
+# these names verbatim.
+#
+# Tracecore does not expand env vars in YAML. Render placeholders at
+# deploy time via Helm, envsubst, or a CSI secret-injection driver.
+# See docs/integrations/dcgm-exporter.md.
+
+receivers:
+  prometheus/dcgm:
+    config:
+      scrape_configs:
+        - job_name: dcgm-exporter
+          scrape_interval: 15s
+          scrape_timeout: 10s
+          metrics_path: /metrics
+          fallback_scrape_protocol: PrometheusText1.0.0
+          static_configs:
+            - targets:
+                - REPLACE_WITH_DCGM_EXPORTER_TARGET
+
+processors:
+  # Rename DCGM_FI_* -> hw.gpu.* + stamp customer-stable resource
+  # attributes. The order matters: attribute writes that depend on
+  # the ORIGINAL DCGM metric name run BEFORE the `set(name, ...)`
+  # rewrite that replaces it.
+  transform/dcgm_to_hw:
+    metric_statements:
+      - context: resource
+        statements:
+          - set(attributes["gpu.vendor"], "nvidia")
+          - set(attributes["hw.type"], "gpu")
+      - context: datapoint
+        statements:
+          # Translate DCGM `UUID` label to OTel hw.id resource-shape attr.
+          # dcgm-exporter emits {gpu, UUID, device, modelName, Hostname,
+          # container, namespace, pod}; tracecore patterns key on hw.id
+          # (the GPU UUID) so we promote it and drop the upper-case label.
+          - set(attributes["hw.id"], attributes["UUID"]) where attributes["UUID"] != nil
+          - delete_key(attributes, "UUID") where attributes["hw.id"] != nil
+          # ECC: tag error.type / subtype / persistence FROM the source
+          # DCGM metric name BEFORE we rename it to hw.errors below.
+          - set(attributes["error.type"], "corrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_SBE_")
+          - set(attributes["error.type"], "uncorrected") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_")
+          - set(attributes["error.subtype"], "single_bit") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_SBE_")
+          - set(attributes["error.subtype"], "double_bit") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_DBE_")
+          - set(attributes["error.persistence"], "volatile") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_.._VOL_")
+          - set(attributes["error.persistence"], "aggregate") where IsMatch(metric.name, "^DCGM_FI_DEV_ECC_.._AGG_")
+          # Throttle reason from the source DCGM field name
+          - set(attributes["hw.gpu.throttle.reason"], "thermal") where metric.name == "DCGM_FI_DEV_THERMAL_VIOLATION"
+          - set(attributes["hw.gpu.throttle.reason"], "power") where metric.name == "DCGM_FI_DEV_POWER_VIOLATION"
+          - set(attributes["hw.gpu.throttle.reason"], "sync_boost") where metric.name == "DCGM_FI_DEV_SYNC_BOOST_VIOLATION"
+          # PCIe direction from the source DCGM field name
+          - set(attributes["network.io.direction"], "transmit") where metric.name == "DCGM_FI_PROF_PCIE_TX_BYTES"
+          - set(attributes["network.io.direction"], "receive") where metric.name == "DCGM_FI_PROF_PCIE_RX_BYTES"
+      - context: metric
+        statements:
+          # Now rewrite metric names. Attributes set above survive the
+          # rename because they live on datapoints, not on the metric
+          # descriptor.
+          - set(name, "hw.errors") where IsMatch(name, "^DCGM_FI_DEV_ECC_(SBE|DBE)_(VOL|AGG)_TOTAL$")
+          - set(name, "hw.temperature") where name == "DCGM_FI_DEV_GPU_TEMP"
+          - set(unit, "Cel") where name == "hw.temperature"
+          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_THERMAL_VIOLATION"
+          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_POWER_VIOLATION"
+          - set(name, "hw.gpu.throttle.duration") where name == "DCGM_FI_DEV_SYNC_BOOST_VIOLATION"
+          - set(unit, "s") where name == "hw.gpu.throttle.duration"
+          - set(name, "hw.gpu.io") where name == "DCGM_FI_PROF_PCIE_TX_BYTES"
+          - set(name, "hw.gpu.io") where name == "DCGM_FI_PROF_PCIE_RX_BYTES"
+          - set(unit, "By") where name == "hw.gpu.io"
+
+  batch:
+    send_batch_size: 8192
+    timeout: 10s
+
+exporters:
+  otlphttp:
+    endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT
+    compression: gzip
+    timeout: 10s
+
+service:
+  pipelines:
+    metrics/dcgm:
+      receivers: [prometheus/dcgm]
+      processors: [transform/dcgm_to_hw, batch]
+      exporters: [otlphttp]