docs(integrations): dcgm-exporter operator recipe by trilamsr · Pull Request #266 · TraceCoreAI/tracecore

trilamsr · 2026-06-01T03:44:04Z

Summary

Adds docs/integrations/dcgm-exporter.md, the NVIDIA-specific install + wire-up recipe that pattern detectors #1 / #3 / #4 / #5 implicitly depend on (each pattern doc says "input: dcgm-exporter scraped via prometheusreceiver", but no recipe ships the DCGM_FI_* to hw.gpu.* rename layer those patterns consume).
Adds docs/integrations/examples/dcgm-exporter.yaml — a tracecore validate-clean example wiring prometheusreceiver to an OTTL transformprocessor that renames the DCGM metric families to OTel hardware-semconv names where canonical (hw.errors, hw.gpu.io, hw.temperature) and to tracecore extensions where the spec hasn't ratified the shape yet (hw.gpu.throttle.duration, hw.gpu.nvlink.io, ECC error.subtype / error.persistence).
Files the upstream-semconv drift as docs(patterns): hw.gpu.* extension names diverge from upstream OTel semconv #265 — pattern docs reference the tracecore extensions verbatim and operators need to know which hw.* names a future spec-compliant collector would emit unchanged vs which are tracecore-defined; the new recipe's mapping table tags each rename (semconv) or (ext) so the contract surface is explicit.
Indexes the new recipe under "Source (receiver-side) recipes" in docs/README.md so the doc-check parity gate stays green.

Adopt-over-build: we install upstream NVIDIA/dcgm-exporter (Apache-2.0) from its Helm chart; no fork, no vendor, no Pro features. The bundled values.yaml overrides only the keys the recipe depends on (nodeSelector, serviceMonitor: enabled=false since tracecore is the scraper not Prometheus Operator, service.port: 9400).
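
The install described above could be sketched as the shell function below. It is an illustration under stated assumptions, not the recipe's actual command: the release name, the namespace, and the `nvidia.com/gpu.present` node label are hypothetical, and the chart repository URL is the one the upstream dcgm-exporter README documents. Only the three overridden keys (`serviceMonitor.enabled`, `service.port`, `nodeSelector`) come from this PR.

```shell
# Sketch only. Release name, namespace, and node label are assumptions,
# not values taken from the recipe.
install_dcgm_exporter() {
  release=${1:-dcgm-exporter}   # hypothetical release name
  namespace=${2:-monitoring}    # hypothetical namespace

  helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts &&
  helm repo update &&
  helm install "$release" gpu-helm-charts/dcgm-exporter \
    --namespace "$namespace" --create-namespace \
    --set serviceMonitor.enabled=false \
    --set service.port=9400 \
    --set-string 'nodeSelector.nvidia\.com/gpu\.present=true'
    # serviceMonitor off: tracecore scrapes, not Prometheus Operator.
    # Backslashes escape the dots inside the label key for helm's --set parser.
}
```

Passing the overrides with `--set` keeps the sketch self-contained; a checked-in values file, as the PR describes, is easier to review and diff.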

docs: NVIDIA dcgm-exporter operator recipe (`docs/integrations/dcgm-exporter.md`) wiring `prometheusreceiver` + OTTL rename of `DCGM_FI_*` to the `hw.gpu.*` namespace consumed by pattern detectors #1 / #3 / #4 / #5.

Test plan

./_build/tracecore validate --config=docs/integrations/examples/dcgm-exporter.yaml exits 0 — OTTL statements type-check against the metric / datapoint / resource contexts and the scrape target URL is well-formed.
bash scripts/doc-check.sh clean — example file resolves from first ```yaml block, tested-against + last-verified markers parse, README.md indexes the new recipe.
bash scripts/validator-recipe.sh reports docs/integrations/dcgm-exporter.md -> tracecore validate docs/integrations/examples/dcgm-exporter.yaml succeeding (8/8 in-tree recipes on darwin; CI ubuntu runner exercises the journald / k8sobjects skips).
make verify clean (fmt, tidy-check, lint, vet, mod-verify, license-check, generate-fixtures-check, build-tags, nccl-fr-rce-gate, register-lint, actionlint, zizmor, doc-check, no-autoupdate-check).
Manual on-GPU verification (not done in this PR — see the recipe's "Verify" section for the kubectl port-forward + curl /metrics + otelcol_receiver_accepted_metric_points{receiver="prometheus/dcgm"} check operators should run after helm install).
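
The manual check in the last item could look roughly like the sketch below. `check_dcgm_prefixes` is a hypothetical helper, not part of the repository; it reads Prometheus exposition text on stdin and reports whether each metric-name prefix passed as an argument has at least one sample line. The service name, namespace, and example prefixes in the usage comment are assumptions.

```shell
# Hypothetical helper: for each prefix argument, report whether the metrics
# text on stdin contains at least one sample line starting with that prefix.
# "# HELP" / "# TYPE" comment lines do not count because the match is anchored.
check_dcgm_prefixes() {
  metrics=$(cat)
  missing=0
  for prefix in "$@"; do
    if printf '%s\n' "$metrics" | grep -q "^${prefix}"; then
      echo "present ${prefix}"
    else
      echo "missing ${prefix}"
      missing=1
    fi
  done
  return "$missing"   # non-zero if any prefix had no samples
}

# Usage against a live cluster (service name and namespace are assumptions):
#   kubectl -n monitoring port-forward svc/dcgm-exporter 9400:9400 &
#   curl -s http://localhost:9400/metrics |
#     check_dcgm_prefixes DCGM_FI_DEV_GPU_TEMP DCGM_FI_PROF_NVLINK_
```

A non-zero exit status means at least one requested family is absent from the exporter output, which usually points at a counter that is not enabled in the exporter's counters CSV.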

Drift (#265, NOT a v0.3.0 blocker): the recipe documents that hw.gpu.throttle.duration, hw.gpu.nvlink.io, hw.gpu.throttle.reason, error.subtype, error.persistence, hw.gpu.index, hw.gpu.pci.bdf are tracecore extensions ahead of the OTel hardware-semconv proposal. Pattern docs reference them verbatim today; the upstream-rectification work is out of scope per the user task constraint and tracked separately.

Patterns #1, #3, #4, #5 all assert their input is "dcgm-exporter scraped via prometheusreceiver" but the closest existing recipe (prometheus-scrape.md) is the generic vendor-agnostic shape with only a `gpu.vendor` tag, leaving each operator to piece together the DCGM_FI_* -> hw.gpu.* rename layer the pattern detectors consume.

This adds the NVIDIA-specific install + wire-up recipe:

- helm install upstream dcgm-exporter (Apache-2.0, no Pro features)
- prometheusreceiver scrape at the conventional :9400 endpoint
- OTTL transform renaming the DCGM metric families pattern detectors #1/#3/#4/#5 consume (ECC, temp, throttle, PCIe Tx/Rx) to OTel hw.* semconv where canonical, tracecore extensions where the spec hasn't ratified the shape yet
- Mapping table explicitly tagging each rename as (semconv) or (ext) so operators see the contract surface area
- DCGM_FI_DEV_NVLINK_BANDWIDTH_L* noted as opt-in (commented in default-counters.csv); pattern #1 needs operator action

Drift between pattern docs and upstream OTel hw.* semconv (the (ext) names) is filed as #265 -- intentionally not papered over since pattern detectors consume the tracecore-extended attributes verbatim and operators need to know which names a future spec-compliant collector would emit unchanged.

Validate-clean against the in-tree binary; doc-check + validator-recipe + make verify all pass.

Signed-off-by: Tri Lam <tri@maydow.com>

trilamsr · 2026-06-01T03:50:19Z

Closing in favor of PR #267 which ships the correct DCGM→hw.* OTTL transform.

Reviewer findings: the OTTL transform in this PR used the wrong NVLink metric family (the aggregate DCGM_FI_DEV_NVLINK_BANDWIDTH_L{N} instead of the per-direction DCGM_FI_PROF_NVLINK_L{N}_(TX|RX)_BYTES), omitted network.io.direction for NVLink, and omitted the hw.gpu.index and hw.gpu.pci.bdf resource attributes. PR #267 (PR-A) has these correct.
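
To illustrate why the per-direction family matters, here is a minimal sketch of how a link index and a `network.io.direction` value can be read off a metric name of the form DCGM_FI_PROF_NVLINK_L{N}_(TX|RX)_BYTES, which the aggregate family cannot provide. `nvlink_direction` is a hypothetical helper written for this note; it is not the OTTL transform from PR #267, and the `transmit` / `receive` values assume the OTel semantic-convention spelling.

```shell
# Hypothetical helper: map a per-direction NVLink metric name to its link
# index and direction. Returns 1 for names outside the per-direction family,
# such as the aggregate DCGM_FI_DEV_NVLINK_BANDWIDTH_L{N}.
nvlink_direction() {
  name=$1
  case "$name" in
    DCGM_FI_PROF_NVLINK_L[0-9]*_TX_BYTES) dir=transmit ;;
    DCGM_FI_PROF_NVLINK_L[0-9]*_RX_BYTES) dir=receive ;;
    *) return 1 ;;
  esac
  link=${name#DCGM_FI_PROF_NVLINK_L}   # strip the family prefix
  link=${link%%_*}                     # keep only the link index
  printf 'link=%s network.io.direction=%s\n' "$link" "$dir"
}
```

For example, `nvlink_direction DCGM_FI_PROF_NVLINK_L3_RX_BYTES` yields link 3 with direction `receive`, while the aggregate name carries no direction and is rejected.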

Net-new content worth preserving (helm install command, kubectl port-forward verify, failure-modes table, operator-facing metric mapping table) is tracked in #271 — to be merged INTO docs/integrations/prometheus-scrape.md as additional sections rather than a duplicate recipe file.

#396)

## Summary

Closes #271. Extends `docs/integrations/prometheus-scrape.md` with the operator-facing onboarding content that closed PR #266 carried before the OTTL-transform half landed independently via merged PR #267. Three new sections + five new failure-mode rows. Stays out of the OTTL transform section per issue scope; does NOT recreate the `docs/integrations/dcgm-exporter.md` file the issue explicitly forbids.

## Sections added

- **§ Install dcgm-exporter**: `helm repo add gpu-helm-charts` + `helm install` against NVIDIA's canonical chart, plus a minimal `dcgm-exporter-values.yaml` overlay (disables ServiceMonitor since tracecore scrapes via `prometheusreceiver`, not Prometheus Operator; pins service type/port; constrains DaemonSet to GPU nodes).
- **§ Verify dcgm-exporter is scrapable**: `kubectl port-forward` + `curl /metrics` on the `app.kubernetes.io/name=dcgm-exporter` selector, plus a per-pattern prefix-presence table so operators can see ahead of time whether the pattern they care about has its raw DCGM family enabled.
- **Failure modes table extension**: 5 new rows mapping to root causes:
  - CrashLoop / NVML driver-library version mismatch → driver/image pin
  - CrashLoop / NVML driver-not-loaded → DaemonSet on non-GPU node
  - Pod `Running` but `/metrics` returns 500 / hangs → no `nvidia-container-toolkit` runtime
  - Pod `Pending` with `forbidden: ... configmaps` → chart RBAC disabled
  - `prometheusreceiver` `context deadline exceeded` → DCGM cold-start watch initialization vs `scrape_timeout: 10s`

## Verification

- `docker run alpine/helm:3.16.4` renders the documented install command against chart v4.8.2: 355 lines, Service ClusterIP:9400, DaemonSet containerPort:9400, RBAC + ConfigMap, ServiceMonitor absent (values overlay disables it as documented).
- `bash scripts/doc-check.sh` clean.
- Pre-commit hooks pass (golangci-lint, go vet, go mod verify, attribute-namespace-check, hit-line-format-stable, no-autoupdate-check).

## Cross-references

- Each failure-mode row ties to an OTTL stanza, a metric prefix, or a config value already documented in the recipe (no orphan claims).
- The Verify section's prefix table reuses the same DCGM family names the OTTL projection table already references, so operators can grep for exactly the prefixes the recipe consumes.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>

trilamsr mentioned this pull request Jun 1, 2026

docs(prometheus-scrape): add dcgm-exporter helm install + verify + failure modes #271

Closed

trilamsr closed this Jun 1, 2026

trilamsr mentioned this pull request Jun 1, 2026

docs(prometheus-scrape): add dcgm-exporter install + verify + failures #396

Merged

trilamsr wants to merge 1 commit into main from docs/recipe-dcgm-exporter
