Recipe extension: emit hw.gpu.nvlink.* + wire metrics path for pattern-1 detector #260

Description

@trilamsr

Blocker

Pattern #1 (NVLink silent degradation) is the next NORTHSTAR detector on the
v0.3.0 ladder, but the input plumbing it requires is not yet shipped. The
detector cannot be built until this lands.

Source-of-truth: docs/patterns/pattern-1-nvlink-degradation.md.
That doc specifies the input signal as the customer-stable
hw.gpu.nvlink.io Counter with attributes:

Attribute               Value
hw.gpu.nvlink.link      0 .. N-1 (18 on H100, 12 on A100)
network.io.direction    transmit / receive

Resource attributes: hw.id, hw.gpu.index.

What's missing

  1. OTTL transform — raw DCGM → customer-stable semconv. The recipe at
    docs/integrations/prometheus-scrape.md
    scrapes dcgm-exporter but only stamps gpu.vendor. It does NOT rename
    DCGM_FI_PROF_NVLINK_L{0..17}_{TX,RX}_BYTES into the
    hw.gpu.nvlink.io Counter shape with per-link / per-direction attrs. The
    semconv itself is PROPOSED in
    docs/proposals/semconv-hw-gpu-extensions.md
    §3 and docs/rfcs/0005-dcgm-receiver-scope.md, but no recipe currently
    emits it.

  2. Metrics-side detector plumbing. patterndetectorprocessor is logs-only
    today (processor.WithLogs(...) in factory.go — no WithMetrics). The
    NVLink signal is a Counter, not a log record. Pattern #1 needs either:

    • a metrics input path on patterndetectorprocessor, OR
    • a metrics→logs converter (rule-engine style — emit one
      nvlink_degraded log per offending link, analogous to the PromQL alert
      rule in the pattern doc).
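To make the second option concrete, here is a minimal Go sketch of a rule-engine-style converter. It is an illustration only: `LinkSample`, `LogRecord`, and `ToDegradedLogs` are hypothetical names, not existing types in this repo or in the collector's pdata API. The "offending" rule is passed in as a predicate, because the real rule lives in the pattern doc and is not restated here.

```go
package main

import (
	"fmt"
	"strconv"
)

// LinkSample is a hypothetical, simplified view of one hw.gpu.nvlink.io
// data point after the recipe has applied the customer-stable attributes.
type LinkSample struct {
	HWID        string  // resource attr hw.id
	Link        int     // attr hw.gpu.nvlink.link
	Direction   string  // attr network.io.direction
	BytesPerSec float64 // rate derived from the Counter (derivation not shown)
}

// LogRecord is a hypothetical stand-in for the log record the detector reads.
type LogRecord struct {
	Body       string
	Attributes map[string]string
}

// ToDegradedLogs emits one nvlink_degraded record per offending link.
// A link counts once per GPU even if both directions match the predicate.
func ToDegradedLogs(samples []LinkSample, offending func(LinkSample) bool) []LogRecord {
	type linkKey struct {
		hwID string
		link int
	}
	seen := make(map[linkKey]bool)
	var out []LogRecord
	for _, s := range samples {
		k := linkKey{s.HWID, s.Link}
		if seen[k] || !offending(s) {
			continue
		}
		seen[k] = true
		out = append(out, LogRecord{
			Body: "nvlink_degraded",
			Attributes: map[string]string{
				"hw.id":              s.HWID,
				"hw.gpu.nvlink.link": strconv.Itoa(s.Link),
			},
		})
	}
	return out
}

func main() {
	samples := []LinkSample{
		{HWID: "GPU-abc", Link: 0, Direction: "transmit", BytesPerSec: 1e3},
		{HWID: "GPU-abc", Link: 0, Direction: "receive", BytesPerSec: 2e3},
		{HWID: "GPU-abc", Link: 1, Direction: "transmit", BytesPerSec: 5e9},
	}
	// Placeholder predicate for the demo; the real threshold belongs to the pattern doc.
	belowFloor := func(s LinkSample) bool { return s.BytesPerSec < 1e6 }
	for _, rec := range ToDegradedLogs(samples, belowFloor) {
		fmt.Println(rec.Body, rec.Attributes)
	}
}
```

The converter only ever sees the customer-stable attributes, so either option keeps raw DCGM_* names out of the detector.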

Why this matters now

Per feedback_adopt_over_build.md: don't fake attribute projections. The
upstream dcgm-exporter emits raw DCGM_FI_PROF_NVLINK_L*_TX_BYTES /
*_RX_BYTES (18 links × 2 directions × N GPUs per node). Stamping those as
hw.gpu.nvlink.io with the customer-stable attrs is a recipe-shaped task:
transformprocessor OTTL is the right primitive, and the work belongs
in docs/integrations/prometheus-scrape.md (or a sibling
docs/integrations/dcgm-nvlink-recipe.md) rather than inside the
detector.

Proposed plan (separate PR, blocks detector PR)

PR A — recipe extension (this issue):

  1. Extend the OTTL transform/gpu_vendor block (or add transform/dcgm_nvlink)
    to:
    • For each DCGM_FI_PROF_NVLINK_L{N}_TX_BYTES series: emit
      hw.gpu.nvlink.io Counter with hw.gpu.nvlink.link=N,
      network.io.direction=transmit.
    • For each DCGM_FI_PROF_NVLINK_L{N}_RX_BYTES: same with
      network.io.direction=receive.
    • Map gpu_uuid / UUID label → resource attr hw.id.
    • Map gpu / GPU label → resource attr hw.gpu.index.
  2. Decision needed: metrics input path on patterndetectorprocessor, OR a
    thin metrics→logs converter (OTTL transform/metrics → log signal)?
    Either way: pattern-detection consumes the customer-stable attribute
    namespace, never the raw DCGM_* series.
  3. Document in
    docs/integrations/prometheus-scrape.md
    alongside the existing gpu.vendor mapping table.
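For reference, the mapping in step 1 can be written as a small Go function. This is a sketch of the intended renaming semantics, not the OTTL recipe itself: `NVLinkPoint`, `MapSeries`, and `firstLabel` are hypothetical names introduced for illustration, and the label fallbacks (gpu_uuid then UUID, gpu then GPU) simply follow the bullets above.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// nvlinkSeries matches raw dcgm-exporter series names such as
// DCGM_FI_PROF_NVLINK_L3_TX_BYTES, capturing link index and direction.
var nvlinkSeries = regexp.MustCompile(`^DCGM_FI_PROF_NVLINK_L(\d+)_(TX|RX)_BYTES$`)

// NVLinkPoint is a hypothetical holder for the customer-stable shape.
type NVLinkPoint struct {
	Metric    string // always hw.gpu.nvlink.io
	Link      int    // attr hw.gpu.nvlink.link
	Direction string // attr network.io.direction
	HWID      string // resource attr hw.id
	GPUIndex  string // resource attr hw.gpu.index
}

// MapSeries converts one raw series name plus its labels into the
// customer-stable shape. It returns false for any non-NVLink series.
func MapSeries(name string, labels map[string]string) (NVLinkPoint, bool) {
	m := nvlinkSeries.FindStringSubmatch(name)
	if m == nil {
		return NVLinkPoint{}, false
	}
	link, err := strconv.Atoi(m[1])
	if err != nil {
		return NVLinkPoint{}, false
	}
	direction := "transmit"
	if m[2] == "RX" {
		direction = "receive"
	}
	return NVLinkPoint{
		Metric:    "hw.gpu.nvlink.io",
		Link:      link,
		Direction: direction,
		HWID:      firstLabel(labels, "gpu_uuid", "UUID"),
		GPUIndex:  firstLabel(labels, "gpu", "GPU"),
	}, true
}

// firstLabel returns the value of the first key present in labels, or "".
func firstLabel(labels map[string]string, keys ...string) string {
	for _, k := range keys {
		if v, ok := labels[k]; ok {
			return v
		}
	}
	return ""
}

func main() {
	labels := map[string]string{"UUID": "GPU-abc", "gpu": "0"}
	p, ok := MapSeries("DCGM_FI_PROF_NVLINK_L3_TX_BYTES", labels)
	fmt.Printf("%+v %v\n", p, ok)
}
```

Whatever form the OTTL statements take, they should produce the same (link, direction, hw.id, hw.gpu.index) tuple per series as this function does.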

PR B — NVLink detector (blocked on PR A):

  • module/pkg/patterns/nvlink_degradation.go + _test.go mirroring
    xid_correlation's two-layer shape.
  • appendNVLinkDegradationVerdict in
    module/processor/patterndetectorprocessor/.
  • PatternIDNVLink = "1", EvidenceKindNVLink = "nvlink".

Out of scope

  • The semconv proposal merge — docs/proposals/semconv-hw-gpu-extensions.md
    remains PROPOSED until upstream OTel semconv adopts it. Tracecore's
    customer-stable contract (RFC-0013 §3) is the local pin while upstream
    catches up.
  • Blackwell-class NVSwitch SXid evidence path: pattern #1 v0 scope is
    Hopper/Ampere per MILESTONES.md M17.

Pre-existing references

  • Pattern doc: docs/patterns/pattern-1-nvlink-degradation.md
  • Semconv proposal: docs/proposals/semconv-hw-gpu-extensions.md §3
  • DCGM receiver RFC (deleted, but signal table still load-bearing):
    docs/rfcs/0005-dcgm-receiver-scope.md §hw.gpu.nvlink.io
  • M17 milestone definition: MILESTONES.md (search "M17")
  • Pivot RFC adoption matrix: docs/rfcs/0013-distro-first-pivot.md §2
