docs(integrations): dcgm-exporter operator recipe #266
Closed
trilamsr wants to merge 1 commit into
Patterns #1, #3, #4, #5 all assert their input is "dcgm-exporter scraped via prometheusreceiver", but the closest existing recipe (prometheus-scrape.md) is the generic vendor-agnostic shape with only a `gpu.vendor` tag, leaving each operator to piece together the DCGM_FI_* -> hw.gpu.* rename layer the pattern detectors consume.

This adds the NVIDIA-specific install + wire-up recipe:

- helm install upstream dcgm-exporter (Apache-2.0, no Pro features)
- prometheusreceiver scrape at the conventional :9400 endpoint
- OTTL transform renaming the DCGM metric families pattern detectors #1/#3/#4/#5 consume (ECC, temp, throttle, PCIe Tx/Rx) to OTel hw.* semconv where canonical, tracecore extensions where the spec hasn't ratified the shape yet
- Mapping table explicitly tagging each rename as (semconv) or (ext) so operators see the contract surface area
- DCGM_FI_DEV_NVLINK_BANDWIDTH_L* noted as opt-in (commented in default-counters.csv); pattern #1 needs operator action

Drift between pattern docs and upstream OTel hw.* semconv (the (ext) names) is filed as #265 -- intentionally not papered over since pattern detectors consume the tracecore-extended attributes verbatim and operators need to know which names a future spec-compliant collector would emit unchanged.

Validate-clean against the in-tree binary; doc-check + validator-recipe + make verify all pass.

Signed-off-by: Tri Lam <tri@maydow.com>
Closing in favor of PR #267, which ships the correct DCGM→hw.* OTTL transform. Reviewer findings: the OTTL transform in this PR used the wrong NVLink metric family (aggregate […] Net-new content worth preserving (helm install command, kubectl port-forward verify, failure-modes table, operator-facing metric mapping table) is tracked in #271, to be merged INTO […]
trilamsr added a commit that referenced this pull request on Jun 1, 2026
#396)

## Summary

Closes #271. Extends `docs/integrations/prometheus-scrape.md` with the operator-facing onboarding content that closed PR #266 carried before the OTTL-transform half landed independently via merged PR #267. Three new sections + five new failure-mode rows. Stays out of the OTTL transform section per issue scope; does NOT recreate the `docs/integrations/dcgm-exporter.md` file the issue explicitly forbids.

## Sections added

- **§ Install dcgm-exporter**: `helm repo add gpu-helm-charts` + `helm install` against NVIDIA's canonical chart, plus a minimal `dcgm-exporter-values.yaml` overlay (disables ServiceMonitor since tracecore scrapes via `prometheusreceiver`, not Prometheus Operator; pins service type/port; constrains DaemonSet to GPU nodes).
- **§ Verify dcgm-exporter is scrapable**: `kubectl port-forward` + `curl /metrics` on the `app.kubernetes.io/name=dcgm-exporter` selector, plus a per-pattern prefix-presence table so operators can see ahead of time whether the pattern they care about has its raw DCGM family enabled.
- **Failure modes table extension**: 5 new rows mapping to root causes:
  - CrashLoop / NVML driver-library version mismatch → driver/image pin
  - CrashLoop / NVML driver-not-loaded → DaemonSet on non-GPU node
  - Pod `Running` but `/metrics` returns 500 / hangs → no `nvidia-container-toolkit` runtime
  - Pod `Pending` with `forbidden: ... configmaps` → chart RBAC disabled
  - `prometheusreceiver` `context deadline exceeded` → DCGM cold-start watch initialization vs `scrape_timeout: 10s`

## Verification

- `docker run alpine/helm:3.16.4` renders the documented install command against chart v4.8.2: 355 lines, Service ClusterIP:9400, DaemonSet containerPort:9400, RBAC + ConfigMap, ServiceMonitor absent (values overlay disables it as documented).
- `bash scripts/doc-check.sh` clean.
- Pre-commit hooks pass (golangci-lint, go vet, go mod verify, attribute-namespace-check, hit-line-format-stable, no-autoupdate-check).

## Cross-references

- Each failure-mode row ties to an OTTL stanza, a metric prefix, or a config value already documented in the recipe (no orphan claims).
- The Verify section's prefix table reuses the same DCGM family names the OTTL projection table already references, so operators can grep for exactly the prefixes the recipe consumes.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
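The prefix-presence check described in the Verify section can be approximated offline. The following is a minimal sketch, not the recipe's own tooling: it assumes the `/metrics` response has been saved to a string, and the family-to-prefix table is hypothetical. Only the `DCGM_FI_DEV_NVLINK_BANDWIDTH_L` prefix is quoted from the PR text; the other prefixes are assumed DCGM field names and may need adjusting to match the recipe's table.

```python
"""Report which DCGM metric families appear in a saved dcgm-exporter /metrics dump."""

# Hypothetical family -> metric-name prefix table. Only the NVLink prefix is
# taken from the PR text; the remaining prefixes are assumptions.
FAMILY_PREFIXES = {
    "ecc": "DCGM_FI_DEV_ECC_",
    "temperature": "DCGM_FI_DEV_GPU_TEMP",
    "throttle": "DCGM_FI_DEV_CLOCK_THROTTLE_REASONS",
    "pcie": "DCGM_FI_PROF_PCIE_",
    "nvlink": "DCGM_FI_DEV_NVLINK_BANDWIDTH_L",
}


def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format sample lines."""
    names = set()
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "name{labels} value" or "name value".
        head = line.split("{", 1)[0].split()
        if head:
            names.add(head[0])
    return names


def family_presence(exposition_text, prefixes=FAMILY_PREFIXES):
    """Map each family to True if any scraped metric name starts with its prefix."""
    names = metric_names(exposition_text)
    return {
        family: any(name.startswith(prefix) for name in names)
        for family, prefix in prefixes.items()
    }
```

A family reported as `False` suggests the corresponding counter is not enabled in the exporter's counters file, which is the situation the PR text describes for the opt-in NVLink bandwidth family.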
## Summary

- `docs/integrations/dcgm-exporter.md`: the NVIDIA-specific install + wire-up recipe that pattern detectors #1 / #3 / #4 / #5 implicitly depend on (each pattern doc says "input: dcgm-exporter scraped via prometheusreceiver" but no recipe ships the DCGM_FI_* to hw.gpu.* rename layer those patterns consume).
- `docs/integrations/examples/dcgm-exporter.yaml`: a `tracecore validate`-clean example wiring `prometheusreceiver` to an OTTL `transformprocessor` that renames the DCGM metric families to OTel hardware-semconv names where canonical (`hw.errors`, `hw.gpu.io`, `hw.temperature`) and to tracecore extensions where the spec hasn't ratified the shape yet (`hw.gpu.throttle.duration`, `hw.gpu.nvlink.io`, ECC `error.subtype` / `error.persistence`).
- […] `hw.*` names a future spec-compliant collector would emit unchanged vs which are tracecore-defined; the new recipe's mapping table tags each rename `(semconv)` or `(ext)` so the contract surface is explicit.
- […] `docs/README.md` so the doc-check parity gate stays green.

Adopt-over-build: we install upstream `NVIDIA/dcgm-exporter` (Apache-2.0) from its Helm chart; no fork, no vendor, no Pro features. The bundled `values.yaml` overrides only the keys the recipe depends on (`nodeSelector`, `serviceMonitor: enabled=false` since tracecore is the scraper not Prometheus Operator, `service.port: 9400`).

## Test plan

- `./_build/tracecore validate --config=docs/integrations/examples/dcgm-exporter.yaml` exits 0: OTTL statements type-check against the metric / datapoint / resource contexts and the scrape target URL is well-formed.
- `bash scripts/doc-check.sh` clean: example file resolves from the first `yaml` fenced block, tested-against + last-verified markers parse, README.md indexes the new recipe.
- `bash scripts/validator-recipe.sh` reports `docs/integrations/dcgm-exporter.md -> tracecore validate docs/integrations/examples/dcgm-exporter.yaml` succeeding (8/8 in-tree recipes on darwin; CI ubuntu runner exercises the journald / k8sobjects skips).
- `make verify` clean (fmt, tidy-check, lint, vet, mod-verify, license-check, generate-fixtures-check, build-tags, nccl-fr-rce-gate, register-lint, actionlint, zizmor, doc-check, no-autoupdate-check).
- […] (`kubectl port-forward` + `curl /metrics` + `otelcol_receiver_accepted_metric_points{receiver="prometheus/dcgm"}` check operators should run after `helm install`).

Drift (#265, NOT a v0.3.0 blocker): the recipe documents that `hw.gpu.throttle.duration`, `hw.gpu.nvlink.io`, `hw.gpu.throttle.reason`, `error.subtype`, `error.persistence`, `hw.gpu.index`, `hw.gpu.pci.bdf` are tracecore extensions ahead of the OTel hardware-semconv proposal. Pattern docs reference them verbatim today; the upstream-rectification work is out of scope per the user task constraint and tracked separately.
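To make the `(semconv)` / `(ext)` contract surface concrete, here is a small illustrative sketch that classifies the names listed in this description. The two sets are copied from the PR text as written (they have not been checked against the OTel specification here), and the helper names `contract_tag` and `extension_surface` are hypothetical, not part of tracecore.

```python
"""Classify emitted names as OTel semconv or tracecore extension, per the PR text."""

# Names the PR description treats as canonical OTel hardware semconv.
SEMCONV_NAMES = frozenset({"hw.errors", "hw.gpu.io", "hw.temperature"})

# Names the PR description lists as tracecore extensions (drift tracked in #265).
EXT_NAMES = frozenset({
    "hw.gpu.throttle.duration",
    "hw.gpu.nvlink.io",
    "hw.gpu.throttle.reason",
    "error.subtype",
    "error.persistence",
    "hw.gpu.index",
    "hw.gpu.pci.bdf",
})


def contract_tag(name):
    """Return "(semconv)" or "(ext)"; raise KeyError for names the PR does not classify."""
    if name in SEMCONV_NAMES:
        return "(semconv)"
    if name in EXT_NAMES:
        return "(ext)"
    raise KeyError(f"{name!r} is not classified in the PR description")


def extension_surface(names):
    """List, in sorted order, the names that are tracecore extensions."""
    return sorted(n for n in names if contract_tag(n) == "(ext)")
```

Raising on unclassified names, rather than defaulting to one tag, keeps the sketch from silently widening the contract when a new rename is added without a tag.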