The pattern-3 spec doc (docs/patterns/pattern-3-hbm-ecc.md line 41) names hw.id (GPU UUID) as the resource attribute on the receiver-emitted hw.errors Counter; the PromQL examples (lines 47, 74, 86, 88, 104) all sum by (hw_id). RFC-0013 §3 (the customer-stable attribute contract) instead names gpu.id (PCI BDF) as the canonical identifier for the GPU at fault.
The hbm_ecc detector shipped in #274 follows RFC-0013 §3 (uses gpu.id / PCI BDF as the join key) and the appended "Detector status" section reconciles the implementation choice. But the older symptom + query + alert sections still reference hw.id / hw_id — which is the upstream OTel hw.* semconv name. A future operator reading the doc top-to-bottom will see two different identifiers without context.
What's needed
- Pin the doc on gpu.id (RFC-0013 §3) as the canonical join key. Either:
  - (a) update the symptom + PromQL sections to use gpu.id / gpu_id, OR
  - (b) document that the upstream dcgm-exporter metric resource attribute is hw.id (UUID) and the OTTL recipe rewrites/derives gpu.id (BDF) for the customer-stable contract, with an explicit mapping line.
- Same reconciliation pass on pattern-1-nvlink-degradation.md, pattern-4-thermal-throttle.md, pattern-5-pcie-aer.md (all carry the same hw.id references).
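To scope the reconciliation pass, a small script could list every line in the pattern docs that still mentions hw.id or hw_id. This is a minimal sketch, not tooling from the repo: the function names find_hw_id_refs and scan_files are hypothetical, and it only locates references; it does not decide between option (a) and option (b).

```python
import re
from pathlib import Path

# Matches the attribute form (hw.id) and the Prometheus label form (hw_id),
# but not longer names such as hw.identifier.
HW_ID_RE = re.compile(r"\bhw[._]id\b")


def find_hw_id_refs(text):
    """Return (line_number, stripped_line) pairs for lines mentioning hw.id or hw_id."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if HW_ID_RE.search(line)
    ]


def scan_files(paths):
    """Map each path to its hw.id / hw_id hits, skipping files with none.

    The caller supplies the paths; nothing about the repo layout is assumed here.
    """
    report = {}
    for path in paths:
        hits = find_hw_id_refs(Path(path).read_text(encoding="utf-8"))
        if hits:
            report[str(path)] = hits
    return report
```

Calling scan_files with the four pattern doc paths would give a per-file list of line numbers to review, which can be checked against the lines already cited for pattern-3 (41, 47, 74, 86, 88, 104).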
Why
- The detector + verdict schema both join on gpu.id. A doc that says hw.id is the identifier teaches operators to look in the wrong place.
- RFC-0013 §3 is the binding contract for v0.3.x; pattern docs that contradict it create rework when the OTTL recipes land.
Not blocking #274
The detector implementation is correct per RFC-0013 §3 — this is a doc consistency follow-up across the v0.3.x pattern library.