feat(recipe): DCGM to hw.* OTTL transform + ADR for metrics-side detector wiring (PR-A for #260) #267
Merged
added 2 commits
May 31, 2026 20:35
OTel transformprocessor v0.130 cannot emit log records from a metrics pipeline (metric_statements path prefixes are sealed to resource|scope|metric|datapoint). Surveying the contrib connector tree at v0.130 turns up no metrics-to-logs primitive either. Issue #260's preferred Option B (OTTL metrics-to-logs converter) is blocked upstream; falling back to Option A (extend patterndetectorprocessor with WithMetrics alongside WithLogs) is the only path that doesn't introduce a net-new in-tree connector. Decision is recorded under a new docs/adrs/ so the next PR (#260 PR-B) inherits the trade-off without relitigating. Signed-off-by: Tri Lam <tri@maydow.com>
Adds a second OTTL transform pass (transform/dcgm_to_hw_semconv) that projects raw DCGM_FI_* series into the customer-stable hw.gpu.* / hw.errors namespace declared in docs/proposals/semconv-hw-gpu-extensions.md.

Four patterns covered:
- #1 NVLink per-link Tx/Rx Counter (DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES -> hw.gpu.nvlink.io with hw.gpu.nvlink.link + network.io.direction).
- #3 HBM ECC counters (SBE/DBE x volatile/aggregate -> hw.errors with error.type/subtype/persistence).
- #4 thermal throttle reasons (DCGM_FI_DEV_*_VIOLATION -> hw.gpu.throttle.duration with hw.gpu.throttle.reason).
- #5 PCIe Tx/Rx counters (DCGM_FI_PROF_PCIE_{TX,RX}_BYTES -> hw.gpu.io with network.io.direction + hw.gpu.pci.bdf resource attr).

Resource attribution: maps UUID/gpu_uuid -> hw.id and gpu/GPU -> hw.gpu.index with dual-label fallback so both legacy and modern dcgm-exporter releases project correctly.

Uses upstream OTTL functions only (set, IsMatch, ExtractPatterns, Int); no new OTTL functions required. tracecore validate clean against the updated recipe; bumps last-verified to 2026-05-31.

Signed-off-by: Tri Lam <tri@maydow.com>
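The name-to-attribute projection this commit describes for patterns #1 and #5 can be illustrated outside OTTL. The following is a minimal Go sketch, not the recipe's actual OTTL statements: the `Projection` type and the `Project` and `direction` helpers are hypothetical names introduced here, and only the DCGM series names, the target metric names, and the attribute keys come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
)

// Projection is a hypothetical holder for one projected series:
// the hw.* metric name plus the datapoint attributes to stamp on it.
type Projection struct {
	Metric string
	Attrs  map[string]string
}

var (
	// Pattern #1: per-link NVLink byte counters.
	nvlinkRe = regexp.MustCompile(`^DCGM_FI_PROF_NVLINK_L(\d+)_(TX|RX)_BYTES$`)
	// Pattern #5: PCIe byte counters.
	pcieRe = regexp.MustCompile(`^DCGM_FI_PROF_PCIE_(TX|RX)_BYTES$`)
)

// direction maps the DCGM TX/RX suffix onto network.io.direction values.
func direction(d string) string {
	if d == "TX" {
		return "transmit"
	}
	return "receive"
}

// Project maps a raw DCGM series name onto the hw.* namespace.
// The second return value is false for series this sketch does not cover.
func Project(name string) (Projection, bool) {
	if m := nvlinkRe.FindStringSubmatch(name); m != nil {
		return Projection{
			Metric: "hw.gpu.nvlink.io",
			Attrs: map[string]string{
				"hw.gpu.nvlink.link":   m[1],
				"network.io.direction": direction(m[2]),
			},
		}, true
	}
	if m := pcieRe.FindStringSubmatch(name); m != nil {
		return Projection{
			Metric: "hw.gpu.io",
			Attrs:  map[string]string{"network.io.direction": direction(m[1])},
		}, true
	}
	return Projection{}, false
}

func main() {
	for _, n := range []string{
		"DCGM_FI_PROF_NVLINK_L3_TX_BYTES",
		"DCGM_FI_PROF_PCIE_RX_BYTES",
		"DCGM_FI_DEV_GPU_TEMP",
	} {
		p, ok := Project(n)
		fmt.Println(n, ok, p.Metric, p.Attrs)
	}
}
```

The sketch leaves out the ECC and throttle patterns and the resource-level attributes (hw.id, hw.gpu.index, hw.gpu.pci.bdf), which the recipe handles in the same transform pass.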
This was referenced Jun 1, 2026
Per fresh-context review of #267:
- LOW_UTIL_VIOLATION row mapped sw_slowdown which is NOT in the hw.gpu.throttle.reason enum (proposal §2). Forward-incompat drift once the SIG resolves the vocabulary. Drop the row entirely; track upstream proposal extension at #272. Conservative per [[adopt-over-build]]: don't invent vocab ahead of proposal.
- NVLink DCGM_FI_PROF_NVLINK_* fields (1040-1075) are commented out in dcgm-exporter's default-counters.csv. Operators copy-pasting the recipe get zero hw.gpu.nvlink.io series; pattern #1 never fires. Add explicit opt-in note pointing at the custom-counters ConfigMap chart-flag path.

Lower-leverage findings declined per [[no-bloat]]:
- PR-266 coordination: PR-266 already closed (filed #271 for net-new helm-install content into prometheus-scrape.md).
- ADR upstream contribution link: no concrete OTel-contrib issue exists yet; would invent rather than reference.
- PR-B forward-or-patternverdict-connector pre-commit: pre-litigation.
- README guidance verbatim quote: cosmetic.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
#396)

## Summary

Closes #271. Extends `docs/integrations/prometheus-scrape.md` with the operator-facing onboarding content that closed PR #266 carried before the OTTL-transform half landed independently via merged PR #267. Three new sections + five new failure-mode rows. Stays out of the OTTL transform section per issue scope; does NOT recreate the `docs/integrations/dcgm-exporter.md` file the issue explicitly forbids.

## Sections added

- **§ Install dcgm-exporter**: `helm repo add gpu-helm-charts` + `helm install` against NVIDIA's canonical chart, plus a minimal `dcgm-exporter-values.yaml` overlay (disables ServiceMonitor since tracecore scrapes via `prometheusreceiver`, not Prometheus Operator; pins service type/port; constrains DaemonSet to GPU nodes).
- **§ Verify dcgm-exporter is scrapable**: `kubectl port-forward` + `curl /metrics` on the `app.kubernetes.io/name=dcgm-exporter` selector, plus a per-pattern prefix-presence table so operators can see ahead of time whether the pattern they care about has its raw DCGM family enabled.
- **Failure modes table extension**: 5 new rows mapping to root causes:
  - CrashLoop / NVML driver-library version mismatch → driver/image pin
  - CrashLoop / NVML driver-not-loaded → DaemonSet on non-GPU node
  - Pod `Running` but `/metrics` returns 500 / hangs → no `nvidia-container-toolkit` runtime
  - Pod `Pending` with `forbidden: ... configmaps` → chart RBAC disabled
  - `prometheusreceiver` `context deadline exceeded` → DCGM cold-start watch initialization vs `scrape_timeout: 10s`

## Verification

- `docker run alpine/helm:3.16.4` renders the documented install command against chart v4.8.2: 355 lines, Service ClusterIP:9400, DaemonSet containerPort:9400, RBAC + ConfigMap, ServiceMonitor absent (values overlay disables it as documented).
- `bash scripts/doc-check.sh` clean.
- Pre-commit hooks pass (golangci-lint, go vet, go mod verify, attribute-namespace-check, hit-line-format-stable, no-autoupdate-check).

## Cross-references

- Each failure-mode row ties to an OTTL stanza, a metric prefix, or a config value already documented in the recipe (no orphan claims).
- The Verify section's prefix table reuses the same DCGM family names the OTTL projection table already references, so operators can grep for exactly the prefixes the recipe consumes.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Summary
PR-A for issue #260: the recipe-side half. Two commits:

- `docs(adr): metrics-to-logs converter for pattern input (Option A)` records the architectural decision and the upstream-blocker analysis under a new `docs/adrs/` directory.
- `docs(recipe): extend prometheus-scrape with DCGM to hw.* OTTL transform` adds `transform/dcgm_to_hw_semconv` (a second OTTL pass after `transform/gpu_vendor`) and documents every per-pattern projection in `docs/integrations/prometheus-scrape.md`.

Architectural decision (issue #260 §Proposed plan step 2)
Option A: extend `patterndetectorprocessor` with `processor.WithMetrics` alongside the existing `WithLogs` (deferred to PR-B; this PR ships only the wire-format contract PR-B will consume).

Per [[adopt-over-build]], Option B (OTTL metrics-to-logs converter) was the preferred starting point. It's blocked upstream: at OTel contrib v0.130 (the release pinned in `builder-config.yaml`) the `transformprocessor` signal contexts are sealed per-signal, so `metric_statements` can only reference `resource | scope | metric | datapoint` paths, never `log` (README @ v0.130.0 § Config). Surveying every contrib connector at v0.130 also turns up zero metrics→logs primitives (`countconnector`, `signaltometricsconnector`, `spanmetrics`, `exceptions`, `servicegraph` all output metrics; `routing` / `failover` / `roundrobin` / `otlpjson` are signal-preserving; `datadog` and `grafanacloud` are vendor-specific). The full survey + reasoning is in ADR-0001.

Option C (routing connector + sibling) is strictly heavier than A with the same end state: routing preserves signal type, so it cannot bridge metrics→logs either.

Trade-off accepted: doubled processor surface inside `patterndetectorprocessor` (one `ConsumeLogs`, one `ConsumeMetrics`). Mitigation: the metrics path is a thin projection on top of the same `module/pkg/patterns/` library; verdict emission stays log-based via a connector. Upstream-contribution slot opened: a `metricthresholdconnector` (inverse of `signaltometricsconnector`) would let us collapse Option A back to Option B without operator-visible churn. Tracked under RFC-0013 §5.

Per-pattern OTTL projection (this PR)
| DCGM source series | hw.* projection | Example detector predicate |
| --- | --- | --- |
| `DCGM_FI_PROF_NVLINK_L{N}_{TX,RX}_BYTES` | `hw.gpu.nvlink.io` (Counter, `By`) + `hw.gpu.nvlink.link={N}` + `network.io.direction={transmit\|receive}` | `rate(hw.gpu.nvlink.io[5m])` < 0.5 × per-GPU median over the per-`hw.id` link set, for 10m |
| `DCGM_FI_DEV_ECC_{SBE,DBE}_{VOL,AGG}_TOTAL` | `hw.errors` (Counter, `{error}`) + `error.{type,subtype,persistence}` | `increase(hw.errors{error.type="uncorrected", error.persistence="volatile"}[5m]) > 0` |
| `DCGM_FI_DEV_{THERMAL,POWER,SYNC_BOOST,BOARD_LIMIT,LOW_UTIL}_VIOLATION` | `hw.gpu.throttle.duration` (Counter, `s`) + `hw.gpu.throttle.reason` | `increase(hw.gpu.throttle.duration{reason="thermal"}[5m]) > 30s`, for 10m |
| `DCGM_FI_PROF_PCIE_{TX,RX}_BYTES` | `hw.gpu.io` (Counter, `By`) + `network.io.direction` + `hw.gpu.pci.bdf` resource attr | `rate(hw.gpu.io[5m])` < 0.3 × host median, for 15m |

Resource attribution: `UUID` or `gpu_uuid` → `hw.id`; `gpu` or `GPU` → `hw.gpu.index`; `hw.type="gpu"` stamped on every DCGM series. Dual-label fallback handles both legacy and modern dcgm-exporter releases.

What's NOT in this PR
The third commit originally planned by issue #260, `feat(recipe): metrics→logs OTTL alert emission`, is dropped because OTTL cannot emit log records from metric streams at v0.130 (root cause cited above, full analysis in ADR-0001). The fallback is Option A, which requires modifying `patterndetectorprocessor` source and is therefore out of scope for a recipe-only PR. The alert-emission half lands in PR-B alongside the NVLink detector code.

Test plan
- `./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml` exits 0.
- `bash scripts/doc-check.sh` clean (test-name parity, markdown link integrity, banned-phrase lint, integration-recipe rubric all pass).
- `bash scripts/validator-recipe.sh` validates 6 recipes (skips 2 requires-linux / requires-k8s-cluster; the CI ubuntu runner exercises those paths).
- `make check` (golangci-lint, go vet, mod-verify) clean.
- `set(metric.name, ...)` identity-conflict caveat (recipe markdown § "Identity-conflict caveat") matches the operator's downstream backend's deduplication semantics for the four projected metric families.

Followup
- PR-B: extend `patterndetectorprocessor` with `processor.WithMetrics`; land `module/pkg/patterns/nvlink_degradation.go`; wire verdict emission through a sibling logs pipeline via a connector. Decision rationale baked into ADR-0001 so PR-B doesn't relitigate.
- Contribute a `metricthresholdconnector` (or `signaltologsconnector`) to OTel-contrib. If/when it lands, the recipe can collapse back to Option B without operator-visible churn; the `hw.gpu.*` wire-format contract this PR ships is the load-bearing customer surface, not the upstream component selection.
- The throttle projection keys on the `DCGM_FI_DEV_*_VIOLATION` series names. If your dcgm-exporter release exposes the throttle bitmask as `DCGM_FI_DEV_CLOCKS_EVENT_REASONS` instead (per the semconv proposal Open Question #1), extend the OTTL block to expand the bitmask into discrete `hw.gpu.throttle.duration` datapoints. That is out of this recipe's scope and is tracked as a follow-up under the same milestone.
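For orientation on what the PR-B NVLink detector would evaluate, here is a minimal Go sketch of the example predicate from the projection table: a link is flagged when its rate falls below 0.5 × the median rate across that GPU's links. The `DegradedLinks` and `median` names are hypothetical and do not describe the contents of `module/pkg/patterns/nvlink_degradation.go`; the sketch assumes the rates for one `hw.id` are already computed and ignores the "for 10m" hold duration.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of xs, or 0 for an empty slice.
// It sorts a copy so the caller's slice is left untouched.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

// DegradedLinks takes per-link byte rates for a single GPU (keyed by
// hw.gpu.nvlink.link) and returns the links whose rate is below
// factor × the median rate over that GPU's link set, in ascending order.
func DegradedLinks(rates map[int]float64, factor float64) []int {
	vals := make([]float64, 0, len(rates))
	for _, r := range rates {
		vals = append(vals, r)
	}
	threshold := factor * median(vals)

	var out []int
	for link, r := range rates {
		if r < threshold {
			out = append(out, link)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	// Hypothetical rates in bytes/second for four links on one GPU.
	rates := map[int]float64{0: 100e9, 1: 98e9, 2: 101e9, 3: 10e9}
	fmt.Println(DegradedLinks(rates, 0.5))
}
```

Comparing against the per-GPU median rather than a fixed threshold keeps the predicate independent of the workload's absolute traffic level, which is presumably why the table phrases it that way.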