docs(prometheus-scrape): add dcgm-exporter helm install + verify + failure modes #271

Description

@trilamsr

Problem

PR #266 (closed) added docs/integrations/dcgm-exporter.md + example with valuable operator-facing content:

  1. Helm install command for upstream dcgm-exporter (Apache-2.0)
  2. kubectl port-forward + curl /metrics verify step
  3. Failure modes table (missing GPU driver, NVML unreachable, RBAC denied)
  4. Operator-facing metric mapping table (DCGM_FI_* → tracecore hw.* → consumer pattern)

#266 was closed rather than merged because the OTTL transform in its example YAML had blockers (wrong NVLink metric family, missing attributes). A correct version of that transform has already shipped in PR #267 (PR-A: DCGM→hw.* recipe + ADR), which lives at docs/integrations/prometheus-scrape.md.

The remaining operator-onboarding content from #266 (helm install + verify + failure modes) is still missing from prometheus-scrape.md. Operators following pattern docs end-to-end need it.

Proposed fix

Add new sections to docs/integrations/prometheus-scrape.md (not a new file):

  • § Install dcgm-exporter — helm install command + minimal values.yaml override (nodeSelector, serviceMonitor: false, service.port)
  • § Verify dcgm-exporter is scrapable — kubectl port-forward + curl + expected metric prefixes
  • § Failure modes — table: symptom → root cause → fix (no GPU driver, NVML unreachable, RBAC denied, dcgm-exporter pod CrashLoop, prometheusreceiver scrape timeout)
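
As a starting point for the install and verify sections, here is a rough sketch of the commands they could document. It is not taken from #266. The namespace, release name, and Service name are hypothetical, and the chart values keys (`serviceMonitor.enabled`, `service.port`) and port 9400 are assumptions to confirm against the chart version actually installed. The nodeSelector override is left out because the right node label is cluster-specific.

```shell
#!/usr/bin/env bash
# Sketch only: names and values keys below are assumptions, not verified
# against a specific dcgm-exporter chart release.

NAMESPACE="${NAMESPACE:-gpu-monitoring}"   # hypothetical namespace
RELEASE="${RELEASE:-dcgm-exporter}"        # hypothetical release name
PORT="${PORT:-9400}"                       # assumed metrics port

# Install (or upgrade) the upstream chart with the ServiceMonitor disabled,
# since tracecore scrapes the endpoint directly.
install_dcgm_exporter() {
  helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts &&
  helm repo update &&
  helm upgrade --install "$RELEASE" gpu-helm-charts/dcgm-exporter \
    --namespace "$NAMESPACE" --create-namespace \
    --set serviceMonitor.enabled=false \
    --set service.port="$PORT"
}

# Reads Prometheus exposition text on stdin and succeeds if at least one
# sample line starts with DCGM_FI_. grep -c consumes all input, so the
# upstream writer never sees a closed pipe.
has_dcgm_metrics() {
  grep -c '^DCGM_FI_' >/dev/null
}

# Port-forward the Service, fetch /metrics once, then stop the port-forward.
# Assumes the Service is named after the release.
verify_dcgm_exporter() {
  local metrics status pf_pid
  kubectl -n "$NAMESPACE" port-forward "svc/$RELEASE" "$PORT:$PORT" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 3   # crude wait for the tunnel to come up
  metrics="$(curl -fsS "http://localhost:$PORT/metrics")"
  status=$?
  kill "$pf_pid" 2>/dev/null
  [ "$status" -eq 0 ] && printf '%s\n' "$metrics" | has_dcgm_metrics
}
```

Running `install_dcgm_exporter && verify_dcgm_exporter` would exit non-zero if the endpoint is unreachable or exposes no `DCGM_FI_*` samples, which gives the failure modes table a concrete first symptom to branch on.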

DO NOT touch the OTTL transform — PR #267 owns that.

Acceptance

  • prometheus-scrape.md is the single source of truth for the DCGM-exporter → tracecore wire path.
  • Operators can helm install dcgm-exporter + wire tracecore by following ONE doc.
  • No new file under docs/integrations/.
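
The "no new file" criterion can be checked mechanically on the fix branch. A minimal sketch, assuming the base branch is named `main` (not stated in this issue); the function name is hypothetical:

```shell
# Lists files that the current branch adds under docs/integrations/
# relative to a base branch. The default base "main" is an assumption.
new_integration_docs() {
  git diff --name-only --diff-filter=A "${1:-main}...HEAD" -- docs/integrations/
}
```

Empty output from `new_integration_docs` means the branch only modifies existing files in that directory.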

Out of scope

Metadata

Labels: documentation (Improvements or additions to documentation)
