docs(integrations): dcgm-exporter operator recipe #266

Closed
trilamsr wants to merge 1 commit into main from docs/recipe-dcgm-exporter

trilamsr commented Jun 1, 2026

Summary

Adopt-over-build: we install upstream NVIDIA/dcgm-exporter (Apache-2.0) from its Helm chart; no fork, no vendoring, no Pro features. The bundled values.yaml overrides only the keys the recipe depends on: nodeSelector; serviceMonitor.enabled=false, since tracecore is the scraper rather than Prometheus Operator; and service.port=9400.

docs: NVIDIA dcgm-exporter operator recipe (`docs/integrations/dcgm-exporter.md`) wiring `prometheusreceiver` + OTTL rename of `DCGM_FI_*` to the `hw.gpu.*` namespace consumed by pattern detectors #1 / #3 / #4 / #5.

Test plan

  • ./_build/tracecore validate --config=docs/integrations/examples/dcgm-exporter.yaml exits 0 — OTTL statements type-check against the metric / datapoint / resource contexts and the scrape target URL is well-formed.
  • bash scripts/doc-check.sh clean — example file resolves from first ```yaml block, tested-against + last-verified markers parse, README.md indexes the new recipe.
  • bash scripts/validator-recipe.sh reports docs/integrations/dcgm-exporter.md -> tracecore validate docs/integrations/examples/dcgm-exporter.yaml succeeding (8/8 in-tree recipes on darwin; CI ubuntu runner exercises the journald / k8sobjects skips).
  • make verify clean (fmt, tidy-check, lint, vet, mod-verify, license-check, generate-fixtures-check, build-tags, nccl-fr-rce-gate, register-lint, actionlint, zizmor, doc-check, no-autoupdate-check).
  • Manual on-GPU verification (not done in this PR — see the recipe's "Verify" section for the kubectl port-forward + curl /metrics + otelcol_receiver_accepted_metric_points{receiver="prometheus/dcgm"} check operators should run after helm install).
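
The last step of that manual check (confirming the receiver actually accepted points) can be scripted. The following is a minimal sketch, not part of the recipe: it assumes the collector's self-metrics text has already been fetched (for example with curl) and that the counter is exposed under exactly the name and `receiver` label quoted above. The function name `accepted_points` is hypothetical, and the regexes are a simplification that does not handle `}` or escaped quotes inside label values.

```python
import re

# One Prometheus text-exposition sample line: name{labels} value
# (simplified: assumes no '}' or escaped quotes inside label values)
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

# Counter name copied verbatim from the test plan; other collector
# versions may expose it under a different name (assumption).
COUNTER = "otelcol_receiver_accepted_metric_points"


def accepted_points(metrics_text, receiver="prometheus/dcgm"):
    """Sum the accepted-metric-points samples reported for one receiver."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP / TYPE comment lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != COUNTER:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        if labels.get("receiver") == receiver:
            total += float(match.group("value"))
    return total
```

A result of zero some time after `helm install` would suggest the scrape is not reaching dcgm-exporter, which is the condition the manual check is meant to catch.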

Drift (#265, NOT a v0.3.0 blocker): the recipe documents that hw.gpu.throttle.duration, hw.gpu.nvlink.io, hw.gpu.throttle.reason, error.subtype, error.persistence, hw.gpu.index, hw.gpu.pci.bdf are tracecore extensions ahead of the OTel hardware-semconv proposal. Pattern docs reference them verbatim today; the upstream-rectification work is out of scope per the user task constraint and tracked separately.

Patterns #1, #3, #4, #5 all assert their input is "dcgm-exporter
scraped via prometheusreceiver" but the closest existing recipe
(prometheus-scrape.md) is the generic vendor-agnostic shape with
only a `gpu.vendor` tag, leaving each operator to piece together
the DCGM_FI_* -> hw.gpu.* rename layer the pattern detectors
consume.

This adds the NVIDIA-specific install + wire-up recipe:
- helm install upstream dcgm-exporter (Apache-2.0, no Pro features)
- prometheusreceiver scrape at the conventional :9400 endpoint
- OTTL transform renaming the DCGM metric families pattern
  detectors #1/#3/#4/#5 consume (ECC, temp, throttle, PCIe Tx/Rx)
  to OTel hw.* semconv where canonical, tracecore extensions
  where the spec hasn't ratified the shape yet
- Mapping table explicitly tagging each rename as (semconv) or
  (ext) so operators see the contract surface area
- DCGM_FI_DEV_NVLINK_BANDWIDTH_L* noted as opt-in (commented in
  default-counters.csv); pattern #1 needs operator action

Drift between pattern docs and upstream OTel hw.* semconv (the
(ext) names) is filed as #265 -- intentionally not papered over
since pattern detectors consume the tracecore-extended attributes
verbatim and operators need to know which names a future
spec-compliant collector would emit unchanged.

Validate-clean against the in-tree binary; doc-check + validator-
recipe + make verify all pass.

Signed-off-by: Tri Lam <tri@maydow.com>

trilamsr commented Jun 1, 2026

Closing in favor of PR #267 which ships the correct DCGM→hw.* OTTL transform.

Reviewer findings: the OTTL transform in this PR used the wrong NVLink metric family (aggregate DCGM_FI_DEV_NVLINK_BANDWIDTH_L{N} instead of per-direction DCGM_FI_PROF_NVLINK_L{N}_(TX|RX)_BYTES), omitted network.io.direction for NVLink, and omitted the hw.gpu.index and hw.gpu.pci.bdf resource attributes. PR #267 (PR-A) has these correct.
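
To illustrate the shape of the per-direction projection the reviewers asked for, here is a small Python sketch. It is not the OTTL that PR #267 ships. It assumes the target metric is `hw.gpu.nvlink.io` and that TX/RX map to the `transmit`/`receive` values of `network.io.direction`; the `link_index` attribute key and the `project_nvlink` helper are hypothetical names introduced only for this example.

```python
import re

# Per-direction NVLink family named in the review finding above.
NVLINK_RE = re.compile(r"^DCGM_FI_PROF_NVLINK_L(\d+)_(TX|RX)_BYTES$")

# network.io.direction values (assumed mapping for TX / RX).
DIRECTION = {"TX": "transmit", "RX": "receive"}


def project_nvlink(dcgm_name):
    """Return (metric_name, attributes) for a per-direction NVLink metric.

    Returns None for anything else, including the aggregate
    DCGM_FI_DEV_NVLINK_BANDWIDTH_L{N} family, which carries no direction.
    """
    match = NVLINK_RE.match(dcgm_name)
    if match is None:
        return None
    link, direction = match.groups()
    attributes = {
        "network.io.direction": DIRECTION[direction],
        "link_index": int(link),  # hypothetical attribute key
    }
    return "hw.gpu.nvlink.io", attributes
```

The aggregate family falls through to `None` here because it has no TX/RX component from which a direction attribute could be derived, which is the core of the reviewer objection.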

Net-new content worth preserving (helm install command, kubectl port-forward verify, failure-modes table, operator-facing metric mapping table) is tracked in #271 — to be merged INTO docs/integrations/prometheus-scrape.md as additional sections rather than a duplicate recipe file.

@trilamsr trilamsr closed this Jun 1, 2026
trilamsr added a commit that referenced this pull request Jun 1, 2026
#396)

## Summary

Closes #271. Extends `docs/integrations/prometheus-scrape.md` with the
operator-facing onboarding content that the closed PR #266 carried
before the OTTL-transform half landed independently via merged PR #267.

Three new sections + five new failure-mode rows. Stays out of the OTTL
transform section per issue scope; does NOT recreate the
`docs/integrations/dcgm-exporter.md` file the issue explicitly forbids.

## Sections added

- **§ Install dcgm-exporter** — `helm repo add gpu-helm-charts` +
  `helm install` against NVIDIA's canonical chart, plus a minimal
  `dcgm-exporter-values.yaml` overlay (disables ServiceMonitor since
  tracecore scrapes via `prometheusreceiver`, not Prometheus Operator;
  pins service type/port; constrains DaemonSet to GPU nodes).
- **§ Verify dcgm-exporter is scrapable** — `kubectl port-forward` +
  `curl /metrics` on the `app.kubernetes.io/name=dcgm-exporter`
  selector, plus a per-pattern prefix-presence table so operators can
  see ahead of time whether the pattern they care about has its raw
  DCGM family enabled.
- **Failure modes table extension** — 5 new rows mapping to root causes:
  - CrashLoop / NVML driver-library version mismatch → driver/image pin
  - CrashLoop / NVML driver-not-loaded → DaemonSet on non-GPU node
  - Pod `Running` but `/metrics` returns 500 / hangs → no
    `nvidia-container-toolkit` runtime
  - Pod `Pending` with `forbidden: ... configmaps` → chart RBAC disabled
  - `prometheusreceiver` `context deadline exceeded` → DCGM cold-start
    watch initialization vs `scrape_timeout: 10s`
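
The prefix-presence check described in the Verify section can also be done programmatically against the `curl /metrics` output. The sketch below is illustrative only: the `families_present` helper and the family-to-prefix mapping in the usage example are assumptions, not the table from the recipe, so substitute the prefixes the recipe actually lists.

```python
import re

# Leading metric name on a Prometheus text-exposition sample line.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def families_present(metrics_text, prefixes):
    """Report, per family label, whether any exposed metric starts with its prefix.

    `prefixes` maps an operator-chosen label to a DCGM metric-name prefix.
    """
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        match = NAME_RE.match(line)
        if match:
            names.add(match.group(0))
    return {
        family: any(name.startswith(prefix) for name in names)
        for family, prefix in prefixes.items()
    }


# Usage with an illustrative (assumed) mapping and a tiny sample payload.
sample = (
    "# TYPE DCGM_FI_DEV_GPU_TEMP gauge\n"
    'DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41\n'
    'DCGM_FI_PROF_PCIE_TX_BYTES{gpu="0"} 1024\n'
)
report = families_present(
    sample,
    {
        "temp": "DCGM_FI_DEV_GPU_TEMP",
        "pcie": "DCGM_FI_PROF_PCIE_",
        "nvlink": "DCGM_FI_PROF_NVLINK_",
    },
)
```

A `False` entry means the corresponding DCGM family is not being exported, for example because it is commented out in the exporter's counters CSV.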

## Verification

- `docker run alpine/helm:3.16.4` renders the documented install command
  against chart v4.8.2 — 355 lines, Service ClusterIP:9400, DaemonSet
  containerPort:9400, RBAC + ConfigMap, ServiceMonitor absent (values
  overlay disables it as documented).
- `bash scripts/doc-check.sh` clean.
- Pre-commit hooks pass (golangci-lint, go vet, go mod verify,
  attribute-namespace-check, hit-line-format-stable, no-autoupdate-check).

## Cross-references

- Each failure-mode row ties to an OTTL stanza, a metric prefix, or a
  config value already documented in the recipe (no orphan claims).
- The Verify section's prefix table reuses the same DCGM family names
  the OTTL projection table already references — operators can grep for
  exactly the prefixes the recipe consumes.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>