feat(recipe): DCGM to hw.* OTTL transform + ADR for metrics-side detector wiring (PR-A for #260) by trilamsr · Pull Request #267 · TraceCoreAI/tracecore

trilamsr · 2026-06-01T03:45:11Z

Summary

PR-A for issue #260 — the recipe-side half. Two commits:

docs(adr): metrics-to-logs converter for pattern input (Option A) — records the architectural decision and the upstream-blocker analysis under a new docs/adrs/ directory.
docs(recipe): extend prometheus-scrape with DCGM to hw.* OTTL transform — adds transform/dcgm_to_hw_semconv (a second OTTL pass after transform/gpu_vendor) and documents every per-pattern projection in docs/integrations/prometheus-scrape.md.

Architectural decision (issue #260 §Proposed plan step 2)

Option A — extend patterndetectorprocessor with processor.WithMetrics alongside the existing WithLogs (deferred to PR-B; this PR ships only the wire-format contract PR-B will consume).

Per [[adopt-over-build]] Option B (OTTL metrics-to-logs converter) was the preferred starting point. It's blocked upstream: at OTel contrib v0.130 (the release pinned in builder-config.yaml) the transformprocessor signal contexts are sealed per-signal — metric_statements can only reference resource | scope | metric | datapoint paths, never log (README @ v0.130.0 § Config). Surveying every contrib connector at v0.130 also turns up zero metrics→logs primitives (countconnector, signaltometricsconnector, spanmetrics, exceptions, servicegraph all output metrics; routing/failover/roundrobin/otlpjson are signal-preserving; datadog and grafanacloud are vendor-specific). The full survey + reasoning is in ADR-0001.

Option C (routing connector + sibling) is strictly heavier than A with the same end state — routing preserves signal type, so it cannot bridge metrics→logs either.

Trade-off accepted: doubled processor surface inside patterndetectorprocessor (one ConsumeLogs, one ConsumeMetrics). Mitigation: the metrics path is a thin projection on top of the same module/pkg/patterns/ library; verdict emission stays log-based via a connector. Upstream-contribution slot opened — a metricthresholdconnector (inverse of signaltometricsconnector) would let us collapse Option A back to Option B without operator-visible churn. Tracked under RFC-0013 §5.

Per-pattern OTTL projection (this PR)

| Pattern | Raw DCGM input | OTel output | Trigger-statement shape (PR-B consumes this) |
| --- | --- | --- | --- |
| #1 NVLink | `DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES` | `hw.gpu.nvlink.io` (Counter, `By`) + `hw.gpu.nvlink.link={N}` + `network.io.direction={transmit\|receive}` | `rate(hw.gpu.nvlink.io[5m]) < 0.5 × per-GPU median` over the per-`hw.id` link set, for 10m |
| #3 HBM ECC | `DCGM_FI_DEV_ECC_{SBE,DBE}_{VOL,AGG}_TOTAL` | `hw.errors` (Counter, `{error}`) + `error.{type,subtype,persistence}` | `increase(hw.errors{error.type="uncorrected", error.persistence="volatile"}[5m]) > 0` |
| #4 thermal | `DCGM_FI_DEV_{THERMAL,POWER,SYNC_BOOST,BOARD_LIMIT,LOW_UTIL}_VIOLATION` | `hw.gpu.throttle.duration` (Counter, `s`) + `hw.gpu.throttle.reason` | `increase(hw.gpu.throttle.duration{reason="thermal"}[5m]) > 30s, for 10m` |
| #5 PCIe | `DCGM_FI_PROF_PCIE_{TX,RX}_BYTES` | `hw.gpu.io` (Counter, `By`) + `network.io.direction` + `hw.gpu.pci.bdf` resource attr | `rate(hw.gpu.io[5m]) < 0.3 × host median, for 15m` |
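
To make the name-level projection in the table concrete, here is a minimal Python sketch of the same mapping. It is not the OTTL that ships in the recipe; it only models which raw series name lands on which `hw.*` metric and attributes. The function name `project` and the regexes are hypothetical. The `corrected` value for SBE and the lowercased throttle reasons other than `thermal` are assumptions, since the description only spells out `uncorrected`, `volatile`, and `thermal`. `error.subtype` and the `hw.gpu.pci.bdf` resource attribute are left out because they cannot be derived from the series name alone, and `LOW_UTIL` is excluded because the follow-up review commit dropped that row.

```python
import re

# Hypothetical regexes mirroring the "Raw DCGM input" column.
NVLINK_RE = re.compile(r"^DCGM_FI_PROF_NVLINK_L(\d+)_(TX|RX)_BYTES$")
ECC_RE = re.compile(r"^DCGM_FI_DEV_ECC_(SBE|DBE)_(VOL|AGG)_TOTAL$")
VIOLATION_RE = re.compile(
    r"^DCGM_FI_DEV_(THERMAL|POWER|SYNC_BOOST|BOARD_LIMIT)_VIOLATION$"
)
PCIE_RE = re.compile(r"^DCGM_FI_PROF_PCIE_(TX|RX)_BYTES$")

DIRECTION = {"TX": "transmit", "RX": "receive"}


def project(name):
    """Return (otel_metric_name, datapoint_attributes), or None if not covered."""
    m = NVLINK_RE.match(name)
    if m:
        return "hw.gpu.nvlink.io", {
            "hw.gpu.nvlink.link": int(m.group(1)),
            "network.io.direction": DIRECTION[m.group(2)],
        }
    m = ECC_RE.match(name)
    if m:
        return "hw.errors", {
            # Assumption: SBE (single-bit) maps to "corrected".
            "error.type": "corrected" if m.group(1) == "SBE" else "uncorrected",
            "error.persistence": "volatile" if m.group(2) == "VOL" else "aggregate",
        }
    m = VIOLATION_RE.match(name)
    if m:
        # Assumption: the reason value is the lowercased field stem.
        return "hw.gpu.throttle.duration", {
            "hw.gpu.throttle.reason": m.group(1).lower()
        }
    m = PCIE_RE.match(name)
    if m:
        return "hw.gpu.io", {"network.io.direction": DIRECTION[m.group(1)]}
    return None
```

Series that match none of the four families return `None`, which corresponds to the transform leaving them untouched.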

Resource attribution: UUID or gpu_uuid → hw.id; gpu or GPU → hw.gpu.index; hw.type="gpu" stamped on every DCGM series. Dual-label fallback handles both legacy and modern dcgm-exporter releases.
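
The dual-label fallback can be pictured with a short Python sketch. The helper name `attribute_gpu` is hypothetical, and the precedence order (first matching label wins, in the order listed above) and the integer conversion of the index are assumptions rather than something the recipe text states.

```python
def attribute_gpu(labels):
    """Map dcgm-exporter labels to hw.* attributes with a dual-label fallback."""
    attrs = {"hw.type": "gpu"}  # stamped on every DCGM series

    # hw.id: take whichever UUID label this exporter release emits.
    for key in ("UUID", "gpu_uuid"):
        if key in labels:
            attrs["hw.id"] = labels[key]
            break

    # hw.gpu.index: same idea for the index label (assumed to be numeric).
    for key in ("gpu", "GPU"):
        if key in labels:
            attrs["hw.gpu.index"] = int(labels[key])
            break

    return attrs
```

A series carrying neither label variant still gets `hw.type="gpu"` but no `hw.id`, which is one way to spot an exporter release the fallback does not cover.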

What's NOT in this PR

The third commit originally planned by issue #260 — feat(recipe): metrics→logs OTTL alert emission — is dropped because OTTL cannot emit log records from metric streams at v0.130 (root cause cited above, full analysis in ADR-0001). The fallback is Option A, which requires modifying patterndetectorprocessor source — out of scope for a recipe-only PR. The alert-emission half lands in PR-B alongside the NVLink detector code.
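
For orientation only, the NVLink trigger shape from the projection table (a link whose rate falls below 0.5 × the median of its GPU's link set) could be evaluated along these lines. This is a sketch under stated assumptions, not the detector PR-B will ship: the function name `degraded_links` is hypothetical, the input is assumed to be precomputed 5-minute rates for a single `hw.id`, and the "for 10m" hold period is not modelled.

```python
from statistics import median


def degraded_links(rates_by_link, ratio=0.5):
    """Return link ids whose rate is below ratio x the median across the link set.

    rates_by_link: {link_id: bytes_per_second} for one GPU (one hw.id).
    """
    if not rates_by_link:
        return []
    threshold = ratio * median(rates_by_link.values())
    return sorted(link for link, rate in rates_by_link.items() if rate < threshold)
```

With rates of 100, 110, 40 and 105 bytes/s on links 0 to 3, the median is 102.5 and the threshold 51.25, so only link 2 is flagged.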

Test plan

./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml exits 0.
bash scripts/doc-check.sh clean (test-name parity, markdown link integrity, banned-phrase lint, integration-recipe rubric all pass).
bash scripts/validator-recipe.sh validates 6 recipes (skips 2 requires-linux / requires-k8s-cluster — CI ubuntu runner exercises those paths).
make check (golangci-lint, go vet, mod-verify) clean.
Adversarial reviewer check that the OTTL set(metric.name, ...) identity-conflict caveat (recipe markdown § "Identity-conflict caveat") matches the operator's downstream backend's deduplication semantics for the four projected metric families.
CI runner exercises the requires-linux / requires-k8s recipes that this PR's host skipped.

Followup

PR-B (issue #260, blocked by this PR's merge): extend patterndetectorprocessor with processor.WithMetrics; land module/pkg/patterns/nvlink_degradation.go; wire verdict emission through a sibling logs pipeline via a connector. Decision rationale is baked into ADR-0001 so PR-B doesn't relitigate.
Upstream contribution (v0.3 cycle, RFC-0013 §5) — propose a metricthresholdconnector (or signaltologsconnector) to OTel-contrib. If/when it lands, the recipe can collapse back to Option B without operator-visible churn — the hw.gpu.* wire-format contract this PR ships is the load-bearing customer surface, not the upstream component selection.
dcgm-exporter throttle-reason mapping verification: the pattern #4 mapping table assumes modern dcgm-exporter DCGM_FI_DEV_*_VIOLATION series names. If your dcgm-exporter release exposes the throttle bitmask as DCGM_FI_DEV_CLOCKS_EVENT_REASONS instead (per the semconv proposal's Open Question #1), extend the OTTL block to expand the bitmask into discrete hw.gpu.throttle.duration datapoints. That is out of this recipe's scope and is tracked as a follow-up under the same milestone.

docs(recipe): The `prometheus-scrape` recipe gains a second OTTL transform pass that projects raw `dcgm-exporter` `DCGM_FI_*` series into the customer-stable `hw.gpu.*` / `hw.errors` namespace (semconv proposal). Four pattern families covered — NVLink, HBM ECC, thermal throttle, PCIe — using upstream OTTL functions only (`set`, `IsMatch`, `ExtractPatterns`, `Int`). The wire-format contract this transform produces is the input PR-B's NVLink detector will consume; the in-tree component change for the detector itself lands in a follow-up PR per ADR-0001.

OTel transformprocessor v0.130 cannot emit log records from a metrics pipeline (metric_statements path prefixes are sealed to resource|scope|metric|datapoint). Surveying the contrib connector tree at v0.130 turns up no metrics-to-logs primitive either. Issue #260's preferred Option B (OTTL metrics-to-logs converter) is blocked upstream; falling back to Option A (extend patterndetectorprocessor with WithMetrics alongside WithLogs) is the only path that doesn't introduce a net-new in-tree connector. The decision is recorded under a new docs/adrs/ directory so the next PR (#260 PR-B) inherits the trade-off without relitigating.

Signed-off-by: Tri Lam <tri@maydow.com>

Adds a second OTTL transform pass (transform/dcgm_to_hw_semconv) that projects raw DCGM_FI_* series into the customer-stable hw.gpu.* / hw.errors namespace declared in docs/proposals/semconv-hw-gpu-extensions.md. Four patterns covered:

- #1 NVLink per-link Tx/Rx Counter (DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES -> hw.gpu.nvlink.io with hw.gpu.nvlink.link + network.io.direction).
- #3 HBM ECC counters (SBE/DBE x volatile/aggregate -> hw.errors with error.type/subtype/persistence).
- #4 thermal throttle reasons (DCGM_FI_DEV_*_VIOLATION -> hw.gpu.throttle.duration with hw.gpu.throttle.reason).
- #5 PCIe Tx/Rx counters (DCGM_FI_PROF_PCIE_{TX,RX}_BYTES -> hw.gpu.io with network.io.direction + hw.gpu.pci.bdf resource attr).

Resource attribution: maps UUID/gpu_uuid -> hw.id and gpu/GPU -> hw.gpu.index with dual-label fallback so both legacy and modern dcgm-exporter releases project correctly.

Uses upstream OTTL functions only (set, IsMatch, ExtractPatterns, Int); no new OTTL functions required.

tracecore validate clean against the updated recipe; bumps last-verified to 2026-05-31.

Signed-off-by: Tri Lam <tri@maydow.com>

Per fresh-context review of #267:

- LOW_UTIL_VIOLATION row mapped sw_slowdown, which is NOT in the hw.gpu.throttle.reason enum (proposal §2). Forward-incompat drift once the SIG resolves the vocabulary. Drop the row entirely; track upstream proposal extension at #272. Conservative per [[adopt-over-build]]: don't invent vocab ahead of proposal.
- NVLink DCGM_FI_PROF_NVLINK_* fields (1040-1075) are commented out in dcgm-exporter's default-counters.csv. Operators copy-pasting the recipe get zero hw.gpu.nvlink.io series; pattern #1 never fires. Add explicit opt-in note pointing at the custom-counters ConfigMap chart-flag path.

Lower-leverage findings declined per [[no-bloat]]:

- PR-266 coordination: PR-266 already closed (filed #271 for net-new helm-install content into prometheus-scrape.md).
- ADR upstream contribution link: no concrete OTel-contrib issue exists yet; would invent rather than reference.
- PR-B forward-or-patternverdict-connector pre-commit: pre-litigation.
- README guidance verbatim quote: cosmetic.

Signed-off-by: Tri Lam <tri@maydow.com>

#396)

## Summary

Closes #271. Extends `docs/integrations/prometheus-scrape.md` with the operator-facing onboarding content that closed PR #266 carried before the OTTL-transform half landed independently via merged PR #267. Three new sections + five new failure-mode rows. Stays out of the OTTL transform section per issue scope; does NOT recreate the `docs/integrations/dcgm-exporter.md` file the issue explicitly forbids.

## Sections added

- **§ Install dcgm-exporter**: `helm repo add gpu-helm-charts` + `helm install` against NVIDIA's canonical chart, plus a minimal `dcgm-exporter-values.yaml` overlay (disables ServiceMonitor since tracecore scrapes via `prometheusreceiver`, not Prometheus Operator; pins service type/port; constrains DaemonSet to GPU nodes).
- **§ Verify dcgm-exporter is scrapable**: `kubectl port-forward` + `curl /metrics` on the `app.kubernetes.io/name=dcgm-exporter` selector, plus a per-pattern prefix-presence table so operators can see ahead of time whether the pattern they care about has its raw DCGM family enabled.
- **Failure modes table extension**: 5 new rows mapping to root causes:
  - CrashLoop / NVML driver-library version mismatch → driver/image pin
  - CrashLoop / NVML driver-not-loaded → DaemonSet on non-GPU node
  - Pod `Running` but `/metrics` returns 500 / hangs → no `nvidia-container-toolkit` runtime
  - Pod `Pending` with `forbidden: ... configmaps` → chart RBAC disabled
  - `prometheusreceiver` `context deadline exceeded` → DCGM cold-start watch initialization vs `scrape_timeout: 10s`

## Verification

- `docker run alpine/helm:3.16.4` renders the documented install command against chart v4.8.2: 355 lines, Service ClusterIP:9400, DaemonSet containerPort:9400, RBAC + ConfigMap, ServiceMonitor absent (values overlay disables it as documented).
- `bash scripts/doc-check.sh` clean.
- Pre-commit hooks pass (golangci-lint, go vet, go mod verify, attribute-namespace-check, hit-line-format-stable, no-autoupdate-check).

## Cross-references

- Each failure-mode row ties to an OTTL stanza, a metric prefix, or a config value already documented in the recipe (no orphan claims).
- The Verify section's prefix table reuses the same DCGM family names the OTTL projection table already references, so operators can grep for exactly the prefixes the recipe consumes.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>

Tri Lam added 2 commits May 31, 2026 20:35

trilamsr enabled auto-merge (squash) June 1, 2026 03:57

trilamsr merged commit 0baa557 into main Jun 1, 2026
11 checks passed

trilamsr deleted the feat/dcgm-hw-semconv-recipe branch June 1, 2026 04:06

trilamsr mentioned this pull request Jun 1, 2026

docs(prometheus-scrape): add dcgm-exporter install + verify + failures #396

Merged

trilamsr merged 3 commits into main from feat/dcgm-hw-semconv-recipe