Blocker
Pattern #1 (NVLink silent degradation) is the next NORTHSTAR detector on the
v0.3.0 ladder, but the input plumbing it requires is not yet shipped. The
detector cannot be built until this lands.
Source-of-truth: docs/patterns/pattern-1-nvlink-degradation.md. That doc specifies the input signal as the customer-stable hw.gpu.nvlink.io Counter with attributes:
- hw.gpu.nvlink.link: 0..N-1 (18 on H100, 12 on A100)
- network.io.direction: transmit / receive
Resource attributes: hw.id, hw.gpu.index.
What's missing
- OTTL transform: raw DCGM → customer-stable semconv. The recipe at docs/integrations/prometheus-scrape.md scrapes dcgm-exporter but only stamps gpu.vendor. It does NOT rename DCGM_FI_PROF_NVLINK_L{0..17}_{TX,RX}_BYTES into the hw.gpu.nvlink.io Counter shape with per-link / per-direction attrs. The semconv itself is PROPOSED in docs/proposals/semconv-hw-gpu-extensions.md §3 and docs/rfcs/0005-dcgm-receiver-scope.md, but no recipe currently emits it.
- Metrics-side detector plumbing. patterndetectorprocessor is logs-only today (processor.WithLogs(...) in factory.go, no WithMetrics). The NVLink signal is a Counter, not a log record. Pattern #1 needs either:
  - a metrics input path on patterndetectorprocessor, OR
  - a metrics→logs converter (rule-engine style: emit one nvlink_degraded log per offending link, analogous to the PromQL alert rule in the pattern doc).
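To make the second option concrete, here is a minimal Go sketch of what a metrics→logs converter could look like. It is not the project's implementation. The types LinkKey and DegradedLog, the function Convert, and above all the degradation rule (a link is flagged when its byte rate falls below a fraction of the fastest sibling link on the same GPU and direction) are assumptions made for illustration. The actual rule lives in the pattern doc and may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// LinkKey identifies one counter series by the customer-stable attributes.
// Hypothetical type, introduced only for this sketch.
type LinkKey struct {
	HWID      string // resource attribute hw.id
	Link      int    // hw.gpu.nvlink.link
	Direction string // network.io.direction
}

// DegradedLog is a hypothetical stand-in for one nvlink_degraded log record.
type DegradedLog struct {
	Event string
	Key   LinkKey
	Rate  float64 // bytes per second over the window
	Peak  float64 // fastest sibling rate on the same GPU and direction
}

// Convert compares two cumulative counter snapshots taken windowSec apart and
// returns one record per link whose rate is below ratio * peak.
// ASSUMPTION: the ratio rule is illustrative, not the rule from the pattern doc.
func Convert(prev, curr map[LinkKey]uint64, windowSec, ratio float64) []DegradedLog {
	if windowSec <= 0 {
		return nil
	}
	type group struct{ hwid, dir string }
	rates := map[LinkKey]float64{}
	peak := map[group]float64{}
	for k, c := range curr {
		p, seen := prev[k]
		if !seen || c < p {
			continue // new series or counter reset: skip this window
		}
		r := float64(c-p) / windowSec
		rates[k] = r
		g := group{k.HWID, k.Direction}
		if r > peak[g] {
			peak[g] = r
		}
	}
	var out []DegradedLog
	for k, r := range rates {
		pk := peak[group{k.HWID, k.Direction}]
		if pk > 0 && r < ratio*pk {
			out = append(out, DegradedLog{Event: "nvlink_degraded", Key: k, Rate: r, Peak: pk})
		}
	}
	// Map iteration order is random, so sort for deterministic output.
	sort.Slice(out, func(i, j int) bool {
		a, b := out[i].Key, out[j].Key
		if a.HWID != b.HWID {
			return a.HWID < b.HWID
		}
		if a.Direction != b.Direction {
			return a.Direction < b.Direction
		}
		return a.Link < b.Link
	})
	return out
}

func main() {
	key := func(link int) LinkKey {
		return LinkKey{HWID: "GPU-0", Link: link, Direction: "transmit"}
	}
	prev := map[LinkKey]uint64{key(0): 0, key(1): 0, key(2): 0}
	curr := map[LinkKey]uint64{key(0): 1000, key(1): 1000, key(2): 100}
	for _, l := range Convert(prev, curr, 10, 0.5) {
		fmt.Printf("%s hw.id=%s link=%d rate=%.0f peak=%.0f\n",
			l.Event, l.Key.HWID, l.Key.Link, l.Rate, l.Peak)
	}
}
```

The sketch keys everything on the stable attributes rather than on raw DCGM_* names, which matches the constraint that pattern-detection never consumes the raw series. Whether the converter lives in OTTL or in Go is exactly the open decision; this only shows the shape of the data flow.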
Why this matters now
Per feedback_adopt_over_build.md: don't fake attribute projections. The upstream dcgm-exporter emits raw DCGM_FI_PROF_NVLINK_L*_TX_BYTES / *_RX_BYTES (18 links × 2 directions × N GPUs per node). Stamping those as hw.gpu.nvlink.io with the customer-stable attrs is a recipe-shaped task: transformprocessor OTTL is the right primitive, and the work belongs in docs/integrations/prometheus-scrape.md (or a sibling docs/integrations/dcgm-nvlink-recipe.md) rather than inside the detector.
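The name-to-attribute mapping described above is small enough to sketch in Go. This only illustrates what the recipe has to compute; it is not a substitute for the OTTL transform and not code from the repository. The type NVLinkAttrs and the function ParseNVLinkSeries are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// StableMetricName is the customer-stable metric name from the issue text.
const StableMetricName = "hw.gpu.nvlink.io"

// nvlinkSeries matches raw dcgm-exporter names such as
// DCGM_FI_PROF_NVLINK_L7_TX_BYTES and captures link index and direction.
var nvlinkSeries = regexp.MustCompile(`^DCGM_FI_PROF_NVLINK_L(\d+)_(TX|RX)_BYTES$`)

// NVLinkAttrs holds the two data point attributes named in the pattern doc.
// Hypothetical type, introduced only for this sketch.
type NVLinkAttrs struct {
	Link      int    // hw.gpu.nvlink.link
	Direction string // network.io.direction: "transmit" or "receive"
}

// ParseNVLinkSeries maps a raw DCGM series name to the stable attributes.
// ok is false for any name that is not a per-link NVLink byte counter.
func ParseNVLinkSeries(name string) (attrs NVLinkAttrs, ok bool) {
	m := nvlinkSeries.FindStringSubmatch(name)
	if m == nil {
		return NVLinkAttrs{}, false
	}
	link, err := strconv.Atoi(m[1])
	if err != nil {
		return NVLinkAttrs{}, false
	}
	dir := "transmit"
	if m[2] == "RX" {
		dir = "receive"
	}
	return NVLinkAttrs{Link: link, Direction: dir}, true
}

func main() {
	for _, name := range []string{
		"DCGM_FI_PROF_NVLINK_L0_TX_BYTES",
		"DCGM_FI_PROF_NVLINK_L17_RX_BYTES",
		"DCGM_FI_DEV_GPU_UTIL",
	} {
		attrs, ok := ParseNVLinkSeries(name)
		fmt.Println(name, StableMetricName, attrs.Link, attrs.Direction, ok)
	}
}
```

Series that do not match, such as DCGM_FI_DEV_GPU_UTIL, return ok=false and would pass through the recipe unchanged.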
Proposed plan (separate PR, blocks detector PR)
PR A (recipe extension, this issue):
- Extend the OTTL transform/gpu_vendor block (or add transform/dcgm_nvlink) to:
  - For each DCGM_FI_PROF_NVLINK_L{N}_TX_BYTES series: emit hw.gpu.nvlink.io Counter with hw.gpu.nvlink.link=N, network.io.direction=transmit.
  - For each DCGM_FI_PROF_NVLINK_L{N}_RX_BYTES: same with network.io.direction=receive.
  - Map gpu_uuid / UUID label → resource attr hw.id.
  - Map gpu / GPU label → resource attr hw.gpu.index.
- Decision needed: metrics input path on patterndetectorprocessor, OR a thin metrics→logs converter (OTTL transform/metrics → log signal)? Either way: pattern-detection consumes the customer-stable attribute namespace, never the raw DCGM_* series.
- Document the mapping in docs/integrations/prometheus-scrape.md alongside the existing gpu.vendor mapping table.
PR B (NVLink detector, blocked on PR A):
- module/pkg/patterns/nvlink_degradation.go + _test.go mirroring xid_correlation's two-layer shape.
- appendNVLinkDegradationVerdict in module/processor/patterndetectorprocessor/.
- PatternIDNVLink = "1", EvidenceKindNVLink = "nvlink".
Out of scope
- The semconv proposal merge: docs/proposals/semconv-hw-gpu-extensions.md remains PROPOSED until upstream OTel semconv adopts it. Tracecore's customer-stable contract (RFC-0013 §3) is the local pin while upstream catches up.
- Hopper/Ampere per MILESTONES.md M17.
Pre-existing references
- docs/patterns/pattern-1-nvlink-degradation.md
- docs/proposals/semconv-hw-gpu-extensions.md §3
- docs/rfcs/0005-dcgm-receiver-scope.md §hw.gpu.nvlink.io
- MILESTONES.md (search "M17")
- docs/rfcs/0013-distro-first-pivot.md §2