feat(recipe): DCGM to hw.* OTTL transform + ADR for metrics-side detector wiring (PR-A for #260)#267

Merged
trilamsr merged 3 commits into main from feat/dcgm-hw-semconv-recipe
Jun 1, 2026

Conversation

@trilamsr trilamsr commented Jun 1, 2026

Contributor

Summary

PR-A for issue #260 — the recipe-side half. Two commits:

  1. docs(adr): metrics-to-logs converter for pattern input (Option A) — records the architectural decision and the upstream-blocker analysis under a new docs/adrs/ directory.
  2. docs(recipe): extend prometheus-scrape with DCGM to hw.* OTTL transform — adds transform/dcgm_to_hw_semconv (a second OTTL pass after transform/gpu_vendor) and documents every per-pattern projection in docs/integrations/prometheus-scrape.md.

Architectural decision (issue #260 §Proposed plan step 2)

Option A — extend patterndetectorprocessor with processor.WithMetrics alongside the existing WithLogs (deferred to PR-B; this PR ships only the wire-format contract PR-B will consume).

Per [[adopt-over-build]] Option B (OTTL metrics-to-logs converter) was the preferred starting point. It's blocked upstream: at OTel contrib v0.130 (the release pinned in builder-config.yaml) the transformprocessor signal contexts are sealed per-signal — metric_statements can only reference resource | scope | metric | datapoint paths, never log (README @ v0.130.0 § Config). Surveying every contrib connector at v0.130 also turns up zero metrics→logs primitives (countconnector, signaltometricsconnector, spanmetrics, exceptions, servicegraph all output metrics; routing/failover/roundrobin/otlpjson are signal-preserving; datadog and grafanacloud are vendor-specific). The full survey + reasoning is in ADR-0001.

Option C (routing connector + sibling) is strictly heavier than A with the same end state — routing preserves signal type, so it cannot bridge metrics→logs either.

Trade-off accepted: doubled processor surface inside patterndetectorprocessor (one ConsumeLogs, one ConsumeMetrics). Mitigation: the metrics path is a thin projection on top of the same module/pkg/patterns/ library; verdict emission stays log-based via a connector. Upstream-contribution slot opened — a metricthresholdconnector (inverse of signaltometricsconnector) would let us collapse Option A back to Option B without operator-visible churn. Tracked under RFC-0013 §5.

Per-pattern OTTL projection (this PR)

| Pattern | Raw DCGM input | OTel output | Trigger-statement shape (PR-B consumes this) |
| --- | --- | --- | --- |
| #1 NVLink | `DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES` | `hw.gpu.nvlink.io` (Counter, By) + `hw.gpu.nvlink.link={N}` + `network.io.direction={transmit\|receive}` | `rate(hw.gpu.nvlink.io[5m])` < 0.5 × per-GPU median over the per-`hw.id` link set, for 10m |
| #3 HBM ECC | `DCGM_FI_DEV_ECC_{SBE,DBE}_{VOL,AGG}_TOTAL` | `hw.errors` (Counter, {error}) + `error.{type,subtype,persistence}` | `increase(hw.errors{error.type="uncorrected", error.persistence="volatile"}[5m]) > 0` |
| #4 thermal | `DCGM_FI_DEV_{THERMAL,POWER,SYNC_BOOST,BOARD_LIMIT,LOW_UTIL}_VIOLATION` | `hw.gpu.throttle.duration` (Counter, s) + `hw.gpu.throttle.reason` | `increase(hw.gpu.throttle.duration{reason="thermal"}[5m]) > 30s`, for 10m |
| #5 PCIe | `DCGM_FI_PROF_PCIE_{TX,RX}_BYTES` | `hw.gpu.io` (Counter, By) + `network.io.direction` + `hw.gpu.pci.bdf` resource attr | `rate(hw.gpu.io[5m])` < 0.3 × host median, for 15m |
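
To make the name-to-attribute projection for patterns #1 and #5 concrete, here is a minimal Go sketch of the same mapping. It is not the recipe's OTTL. The `Projection` type and the `Project` and `direction` helpers are hypothetical names introduced for illustration; only the DCGM series names, the `hw.*` metric names, and the attribute keys come from the table.

```go
package main

import (
	"fmt"
	"regexp"
)

// Projection is a hypothetical holder for the projected metric name and the
// datapoint attributes derived from one raw DCGM series name.
type Projection struct {
	Name  string
	Attrs map[string]string
}

var (
	// Pattern #1: per-link NVLink counters, e.g. DCGM_FI_PROF_NVLINK_L3_TX_BYTES.
	nvlinkRe = regexp.MustCompile(`^DCGM_FI_PROF_NVLINK_L(\d+)_(TX|RX)_BYTES$`)
	// Pattern #5: PCIe counters, e.g. DCGM_FI_PROF_PCIE_RX_BYTES.
	pcieRe = regexp.MustCompile(`^DCGM_FI_PROF_PCIE_(TX|RX)_BYTES$`)
)

// direction maps the DCGM TX/RX token to a network.io.direction value.
func direction(token string) string {
	if token == "TX" {
		return "transmit"
	}
	return "receive"
}

// Project returns the projection for a raw series name, or false when the
// name belongs to neither pattern #1 nor pattern #5.
func Project(raw string) (Projection, bool) {
	if m := nvlinkRe.FindStringSubmatch(raw); m != nil {
		return Projection{
			Name: "hw.gpu.nvlink.io",
			Attrs: map[string]string{
				"hw.gpu.nvlink.link":   m[1],
				"network.io.direction": direction(m[2]),
			},
		}, true
	}
	if m := pcieRe.FindStringSubmatch(raw); m != nil {
		return Projection{
			Name:  "hw.gpu.io",
			Attrs: map[string]string{"network.io.direction": direction(m[1])},
		}, true
	}
	return Projection{}, false
}

func main() {
	for _, raw := range []string{
		"DCGM_FI_PROF_NVLINK_L3_TX_BYTES",
		"DCGM_FI_PROF_PCIE_RX_BYTES",
		"DCGM_FI_DEV_GPU_TEMP",
	} {
		p, ok := Project(raw)
		fmt.Println(raw, ok, p.Name, p.Attrs)
	}
}
```

The sketch leaves out the `hw.gpu.pci.bdf` resource attribute and the ECC and throttle families, because their attribute values are defined in the semconv proposal rather than on this page.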

Resource attribution: UUID or gpu_uuid → hw.id; gpu or GPU → hw.gpu.index; hw.type="gpu" stamped on every DCGM series. Dual-label fallback handles both legacy and modern dcgm-exporter releases.
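
The dual-label fallback can be sketched in Go as follows. This illustrates the mapping only and is not the recipe's OTTL. The `ResourceAttrs` and `firstNonEmpty` helpers are hypothetical, and the precedence order (`UUID` before `gpu_uuid`, `gpu` before `GPU`) is an assumption, since this page does not say which label wins when both are present. The sketch also keeps `hw.gpu.index` as a string, whereas the recipe may convert it with `Int`.

```go
package main

import "fmt"

// firstNonEmpty returns the value of the first key in keys that is present
// in labels with a non-empty value.
func firstNonEmpty(labels map[string]string, keys ...string) (string, bool) {
	for _, k := range keys {
		if v, ok := labels[k]; ok && v != "" {
			return v, true
		}
	}
	return "", false
}

// ResourceAttrs derives the hw.* attributes from dcgm-exporter labels.
// Assumption: UUID is tried before gpu_uuid, and gpu before GPU.
func ResourceAttrs(labels map[string]string) map[string]string {
	out := map[string]string{"hw.type": "gpu"} // stamped on every DCGM series
	if v, ok := firstNonEmpty(labels, "UUID", "gpu_uuid"); ok {
		out["hw.id"] = v
	}
	if v, ok := firstNonEmpty(labels, "gpu", "GPU"); ok {
		out["hw.gpu.index"] = v
	}
	return out
}

func main() {
	// One label set per naming style; the UUID values are made-up placeholders.
	fmt.Println(ResourceAttrs(map[string]string{"UUID": "GPU-aaaa", "gpu": "0"}))
	fmt.Println(ResourceAttrs(map[string]string{"gpu_uuid": "GPU-bbbb", "GPU": "1"}))
}
```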

What's NOT in this PR

The third commit originally planned by issue #260, feat(recipe): metrics→logs OTTL alert emission, is dropped because OTTL cannot emit log records from metric streams at v0.130 (root cause cited above, full analysis in ADR-0001). The fallback is Option A, which requires modifying patterndetectorprocessor source and is therefore out of scope for a recipe-only PR. The alert-emission half lands in PR-B alongside the NVLink detector code.
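
For orientation, the comparison at the core of the pattern #1 trigger shape (a link whose rate falls below 0.5 × the per-GPU median) can be sketched in Go. This is not the PR-B detector. `DegradedLinks` and `median` are hypothetical names, the input is assumed to be already-computed per-link rates for a single `hw.id`, and the "for 10m" hold is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of xs without modifying it, or 0 for an empty slice.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// DegradedLinks returns, in ascending order, the link IDs whose rate is below
// factor × the median rate across all links of one GPU.
func DegradedLinks(rates map[int]float64, factor float64) []int {
	vals := make([]float64, 0, len(rates))
	for _, r := range rates {
		vals = append(vals, r)
	}
	threshold := factor * median(vals)

	var out []int
	for link, r := range rates {
		if r < threshold {
			out = append(out, link)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	// Hypothetical per-link byte rates for one GPU; link 2 lags the others.
	rates := map[int]float64{0: 100, 1: 98, 2: 40, 3: 102}
	fmt.Println(DegradedLinks(rates, 0.5))
}
```

With these sample rates the median is 99, so the threshold is 49.5 and only link 2 falls below it.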

Test plan

  • ./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml exits 0.
  • bash scripts/doc-check.sh clean (test-name parity, markdown link integrity, banned-phrase lint, integration-recipe rubric all pass).
  • bash scripts/validator-recipe.sh validates 6 recipes (skips 2 requires-linux / requires-k8s-cluster — CI ubuntu runner exercises those paths).
  • make check (golangci-lint, go vet, mod-verify) clean.
  • Adversarial reviewer check that the OTTL set(metric.name, ...) identity-conflict caveat (recipe markdown § "Identity-conflict caveat") matches the operator's downstream backend's deduplication semantics for the four projected metric families.
  • CI runner exercises the requires-linux / requires-k8s recipes that this PR's host skipped.

Followup

  • PR-B (issue #260, "Recipe extension: emit hw.gpu.nvlink.* + wire metrics path for pattern-1 detector", blocked by this PR's merge): extend patterndetectorprocessor with processor.WithMetrics; land module/pkg/patterns/nvlink_degradation.go; wire verdict emission through a sibling logs pipeline via a connector. Decision rationale is baked into ADR-0001 so PR-B doesn't relitigate.
  • Upstream contribution (v0.3 cycle, RFC-0013 §5) — propose a metricthresholdconnector (or signaltologsconnector) to OTel-contrib. If/when it lands, the recipe can collapse back to Option B without operator-visible churn — the hw.gpu.* wire-format contract this PR ships is the load-bearing customer surface, not the upstream component selection.
  • dcgm-exporter throttle-reason mapping verification: the pattern #4 mapping table assumes modern dcgm-exporter DCGM_FI_DEV_*_VIOLATION series names. If your dcgm-exporter release exposes the throttle bitmask as DCGM_FI_DEV_CLOCKS_EVENT_REASONS instead (per the semconv proposal Open Question #1), extend the OTTL block to expand the bitmask into discrete hw.gpu.throttle.duration datapoints. That extension is out of this recipe's scope and is tracked as a follow-up under the same milestone.
docs(recipe): The `prometheus-scrape` recipe gains a second OTTL transform pass that projects raw `dcgm-exporter` `DCGM_FI_*` series into the customer-stable `hw.gpu.*` / `hw.errors` namespace (semconv proposal). Four pattern families covered — NVLink, HBM ECC, thermal throttle, PCIe — using upstream OTTL functions only (`set`, `IsMatch`, `ExtractPatterns`, `Int`). The wire-format contract this transform produces is the input PR-B's NVLink detector will consume; the in-tree component change for the detector itself lands in a follow-up PR per ADR-0001.

Tri Lam added 2 commits May 31, 2026 20:35
OTel transformprocessor v0.130 cannot emit log records from a metrics
pipeline (metric_statements path prefixes are sealed to
resource|scope|metric|datapoint). Surveying the contrib connector tree
at v0.130 turns up no metrics-to-logs primitive either. Issue #260's
preferred Option B (OTTL metrics-to-logs converter) is blocked
upstream; falling back to Option A (extend patterndetectorprocessor
with WithMetrics alongside WithLogs) is the only path that doesn't
introduce a net-new in-tree connector. Decision is recorded under a
new docs/adrs/ so the next PR (#260 PR-B) inherits the trade-off
without relitigating.

Signed-off-by: Tri Lam <tri@maydow.com>
Adds a second OTTL transform pass (transform/dcgm_to_hw_semconv) that
projects raw DCGM_FI_* series into the customer-stable hw.gpu.* /
hw.errors namespace declared in docs/proposals/semconv-hw-gpu-extensions.md.

Four patterns covered:
- #1 NVLink per-link Tx/Rx Counter (DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES
  -> hw.gpu.nvlink.io with hw.gpu.nvlink.link + network.io.direction).
- #3 HBM ECC counters (SBE/DBE x volatile/aggregate -> hw.errors with
  error.type/subtype/persistence).
- #4 thermal throttle reasons (DCGM_FI_DEV_*_VIOLATION -> hw.gpu.throttle.duration
  with hw.gpu.throttle.reason).
- #5 PCIe Tx/Rx counters (DCGM_FI_PROF_PCIE_{TX,RX}_BYTES -> hw.gpu.io
  with network.io.direction + hw.gpu.pci.bdf resource attr).

Resource attribution: maps UUID/gpu_uuid -> hw.id and gpu/GPU ->
hw.gpu.index with dual-label fallback so both legacy and modern
dcgm-exporter releases project correctly.

Uses upstream OTTL functions only (set, IsMatch, ExtractPatterns, Int);
no new OTTL functions required. tracecore validate clean against the
updated recipe; bumps last-verified to 2026-05-31.

Signed-off-by: Tri Lam <tri@maydow.com>
Per fresh-context review of #267:

- LOW_UTIL_VIOLATION row mapped sw_slowdown which is NOT in the
  hw.gpu.throttle.reason enum (proposal §2). Forward-incompat drift
  once the SIG resolves the vocabulary. Drop the row entirely; track
  upstream proposal extension at #272. Conservative per
  [[adopt-over-build]] — don't invent vocab ahead of proposal.
- NVLink DCGM_FI_PROF_NVLINK_* fields (1040-1075) are commented out
  in dcgm-exporter's default-counters.csv. Operators copy-pasting
  the recipe get zero hw.gpu.nvlink.io series; pattern #1 never
  fires. Add explicit opt-in note pointing at the custom-counters
  ConfigMap chart-flag path.

Lower-leverage findings declined per [[no-bloat]]:
- PR-266 coordination: PR-266 already closed (filed #271 for net-new
  helm-install content into prometheus-scrape.md).
- ADR upstream contribution link: no concrete OTel-contrib issue
  exists yet; would invent rather than reference.
- PR-B forward-or-patternverdict-connector pre-commit: pre-litigation.
- README guidance verbatim quote: cosmetic.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 03:57
@trilamsr trilamsr merged commit 0baa557 into main Jun 1, 2026
11 checks passed
@trilamsr trilamsr deleted the feat/dcgm-hw-semconv-recipe branch June 1, 2026 04:06
trilamsr added a commit that referenced this pull request Jun 1, 2026
#396)

## Summary

Closes #271. Extends `docs/integrations/prometheus-scrape.md` with the
operator-facing onboarding content that closed PR #266 carried before the
OTTL-transform half landed independently via merged PR #267.

Three new sections + five new failure-mode rows. Stays out of the OTTL
transform section per issue scope; does NOT recreate the
`docs/integrations/dcgm-exporter.md` file the issue explicitly forbids.

## Sections added

- **§ Install dcgm-exporter** — `helm repo add gpu-helm-charts` +
  `helm install` against NVIDIA's canonical chart, plus a minimal
  `dcgm-exporter-values.yaml` overlay (disables ServiceMonitor since
  tracecore scrapes via `prometheusreceiver`, not Prometheus Operator;
  pins service type/port; constrains DaemonSet to GPU nodes).
- **§ Verify dcgm-exporter is scrapable** — `kubectl port-forward` +
  `curl /metrics` on the `app.kubernetes.io/name=dcgm-exporter`
  selector, plus a per-pattern prefix-presence table so operators can
  see ahead of time whether the pattern they care about has its raw
  DCGM family enabled.
- **Failure modes table extension** — 5 new rows mapping to root causes:
  - CrashLoop / NVML driver-library version mismatch → driver/image pin
  - CrashLoop / NVML driver-not-loaded → DaemonSet on non-GPU node
  - Pod `Running` but `/metrics` returns 500 / hangs → no
    `nvidia-container-toolkit` runtime
  - Pod `Pending` with `forbidden: ... configmaps` → chart RBAC disabled
  - `prometheusreceiver` `context deadline exceeded` → DCGM cold-start
    watch initialization vs `scrape_timeout: 10s`

## Verification

- `docker run alpine/helm:3.16.4` renders the documented install command
  against chart v4.8.2 — 355 lines, Service ClusterIP:9400, DaemonSet
  containerPort:9400, RBAC + ConfigMap, ServiceMonitor absent (values
  overlay disables it as documented).
- `bash scripts/doc-check.sh` clean.
- Pre-commit hooks pass (golangci-lint, go vet, go mod verify,
  attribute-namespace-check, hit-line-format-stable, no-autoupdate-check).

## Cross-references

- Each failure-mode row ties to an OTTL stanza, a metric prefix, or a
  config value already documented in the recipe (no orphan claims).
- The Verify section's prefix table reuses the same DCGM family names
  the OTTL projection table already references — operators can grep for
  exactly the prefixes the recipe consumes.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>