
docs(pattern-3): reconcile hw.id (GPU UUID) vs RFC-0013 §3 gpu.id (PCI BDF) on join key #276

Description

@trilamsr

The pattern-3 spec doc (docs/patterns/pattern-3-hbm-ecc.md line 41) names hw.id (GPU UUID) as the resource attribute on the receiver-emitted hw.errors Counter; the PromQL examples (lines 47, 74, 86, 88, 104) all sum by (hw_id). RFC-0013 §3 (the customer-stable attribute contract) instead names gpu.id (PCI BDF) as the canonical identifier for the GPU at fault.

The hbm_ecc detector shipped in #274 follows RFC-0013 §3 (it uses gpu.id / PCI BDF as the join key), and the appended "Detector status" section reconciles the implementation choice. However, the older symptom, query, and alert sections still reference hw.id / hw_id, which is the upstream OTel hw.* semconv name. A future operator reading the doc top to bottom will see two different identifiers with no explanation of how they relate.
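To make the mismatch concrete, here is a minimal Python sketch that flags lines in a pattern doc mentioning either identifier, so stale hw.id / hw_id references can be listed next to the gpu.id ones. The function name `find_identifier_refs` and the sample text (including the `hw_errors_total` metric name in the query) are assumptions for illustration, not content copied from the actual doc.

```python
import re

# hw.id / hw_id: the upstream OTel hw.* name still used in the older sections.
LEGACY = re.compile(r"\bhw[._]id\b")
# gpu.id / gpu_id: the RFC-0013 §3 identifier the detector joins on.
CANONICAL = re.compile(r"\bgpu[._]id\b")


def find_identifier_refs(text):
    """Return (line_number, kind, line) for every line mentioning an identifier."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if LEGACY.search(line):
            hits.append((lineno, "legacy", line.strip()))
        if CANONICAL.search(line):
            hits.append((lineno, "canonical", line.strip()))
    return hits


# Hypothetical excerpt; the real doc's wording and queries may differ.
sample = """\
The receiver emits hw.errors with resource attribute hw.id (GPU UUID).
sum by (hw_id) (rate(hw_errors_total[5m]))
The detector joins on gpu.id (PCI BDF).
"""

refs = find_identifier_refs(sample)
for lineno, kind, line in refs:
    print(f"{lineno}: [{kind}] {line}")
```

A doc that yields both "legacy" and "canonical" hits is a candidate for the reconciliation described below.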

What's needed

  • Pin the doc on gpu.id (RFC-0013 §3) as the canonical join key. Either:
    • (a) update the symptom + PromQL sections to use gpu.id / gpu_id, OR
    • (b) document that the upstream dcgm-exporter metric resource attribute is hw.id (UUID) and the OTTL recipe rewrites/derives gpu.id (BDF) for the customer-stable contract, with an explicit mapping line.
  • Same reconciliation pass on pattern-1-nvlink-degradation.md, pattern-4-thermal-throttle.md, pattern-5-pcie-aer.md (all carry the same hw.id references).
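For option (b), the explicit mapping line could be backed by something like the following Python sketch. It is not OTTL and not the actual recipe; it only illustrates the intended relationship, namely that hw.id (UUID) stays as emitted upstream and gpu.id (BDF) is derived from it. The lookup table, the UUID and BDF values, and the function name `derive_gpu_id` are all hypothetical, and the source does not say how the real recipe obtains the UUID-to-BDF association.

```python
# Hypothetical lookup: GPU UUID (hw.id) -> PCI BDF (gpu.id).
# Values are made up for illustration only.
UUID_TO_BDF = {
    "GPU-00000000-0000-0000-0000-000000000001": "0000:3b:00.0",
}


def derive_gpu_id(attrs, uuid_to_bdf):
    """Return a copy of attrs with gpu.id added when hw.id has a known BDF.

    The upstream hw.id attribute is preserved; unknown UUIDs are left
    without gpu.id rather than given a guessed value.
    """
    out = dict(attrs)
    bdf = uuid_to_bdf.get(attrs.get("hw.id"))
    if bdf is not None:
        out["gpu.id"] = bdf
    return out


upstream = {"hw.id": "GPU-00000000-0000-0000-0000-000000000001"}
stable = derive_gpu_id(upstream, UUID_TO_BDF)
print(stable)
```

Keeping both attributes on the resource lets the doc state the mapping in one line while queries and alerts join on gpu.id only.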

Why

  • The detector + verdict schema both join on gpu.id. A doc that says hw.id is the identifier teaches operators to look in the wrong place.
  • RFC-0013 §3 is the binding contract for v0.3.x; pattern docs that contradict it create rework when the OTTL recipes land.

Not blocking #274

The detector implementation is correct per RFC-0013 §3 — this is a doc consistency follow-up across the v0.3.x pattern library.
