The hbm_ecc detector (pattern #3, shipped in feat/hbm-ecc-detector) consumes HBMECCRecord projections gated on the log-shape attribute set hw.errors.delta + gpu.id + error.type + error.subtype + error.persistence + hw.gpu.index. The pattern doc (docs/patterns/pattern-3-hbm-ecc.md) defines the receiver-emitted signal as the hw.errors Counter scraped from dcgm-exporter via prometheusreceiver.
There is no OTTL recipe today that derives the hw.errors.delta log-shape from the upstream hw.errors Counter. Without it, the detector wires up but cannot fire in a real deployment.
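The gap can be illustrated with a minimal sketch of the gate, assuming the detector sees a flattened attribute view of each log record. The attribute names come from this issue; `REQUIRED_ATTRS`, `passes_gate`, and `missing_attrs` are hypothetical names for illustration and are not the detector's actual API.

```python
# Gating attribute set as listed in this issue. The helpers below are
# hypothetical and only illustrate why a counter-shaped record is rejected.
REQUIRED_ATTRS = frozenset({
    "hw.errors.delta",
    "gpu.id",
    "error.type",
    "error.subtype",
    "error.persistence",
    "hw.gpu.index",
})


def passes_gate(attrs: dict) -> bool:
    """Return True only if every gating attribute is present on the record."""
    return REQUIRED_ATTRS.issubset(attrs)


def missing_attrs(attrs: dict) -> list:
    """List the gating attributes a record lacks, sorted for stable output."""
    return sorted(REQUIRED_ATTRS.difference(attrs))
```

A record built straight from the hw.errors Counter labels carries every attribute except hw.errors.delta, so `missing_attrs` would return only that name and the record would never be projected.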
What's needed
A metrics→logs OTTL transform recipe under docs/integrations/ that:
Reads hw.errors counter datapoints with hw.type=gpu AND error.type=uncorrected AND error.subtype=double_bit AND error.persistence=volatile.
Computes the per-scrape delta, i.e. the equivalent of PromQL increase(...[scrape_interval]).
Emits a log record carrying hw.errors.delta (int), gpu.id (PCI BDF resource attr), hw.gpu.index, error.type, error.subtype, error.persistence, plus k8s.node.name resource attr.
Validates against ./_build/tracecore validate --config=docs/integrations/examples/....
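The filter, delta, and emit steps above can be sketched in plain Python to pin down the intended semantics before an OTTL recipe exists. This is a sketch under stated assumptions, not the recipe itself: `DeltaTracker` and `to_log_record` are hypothetical names, the series key is a guess, and the handling of counter resets, first scrapes, and zero deltas is not specified in this issue.

```python
from dataclasses import dataclass, field

# Datapoint filter values, taken from this issue.
SELECTOR = {
    "hw.type": "gpu",
    "error.type": "uncorrected",
    "error.subtype": "double_bit",
    "error.persistence": "volatile",
}


@dataclass
class DeltaTracker:
    """Remembers the last cumulative value per series (hypothetical helper)."""
    last: dict = field(default_factory=dict)

    def observe(self, series_key, value):
        """Return the increase since the previous scrape, or None on first sight."""
        prev = self.last.get(series_key)
        self.last[series_key] = value
        if prev is None:
            return None  # no baseline yet, so no delta can be computed
        if value < prev:
            return value  # assumed reset handling: counter restarted from zero
        return value - prev


def to_log_record(point_attrs, resource_attrs, value, tracker):
    """Map one hw.errors datapoint to a log-shaped dict, or None if nothing to emit."""
    if any(point_attrs.get(k) != v for k, v in SELECTOR.items()):
        return None  # datapoint is not a volatile double-bit uncorrected GPU error
    # Assumed series identity: node, PCI BDF, and GPU index.
    key = (
        resource_attrs.get("k8s.node.name"),
        resource_attrs.get("gpu.id"),
        point_attrs.get("hw.gpu.index"),
    )
    delta = tracker.observe(key, value)
    if not delta:  # None (first scrape) or 0 (no new errors): assumed to emit nothing
        return None
    return {
        "resource": {
            "gpu.id": resource_attrs.get("gpu.id"),
            "k8s.node.name": resource_attrs.get("k8s.node.name"),
        },
        "attributes": {
            "hw.errors.delta": int(delta),
            "hw.gpu.index": point_attrs.get("hw.gpu.index"),
            "error.type": point_attrs["error.type"],
            "error.subtype": point_attrs["error.subtype"],
            "error.persistence": point_attrs["error.persistence"],
        },
    }
```

Keeping the previous value per series is what makes this stateful, which is why the transform needs either a metric consumer or a dedicated pipeline branch and cannot be a stateless attribute rewrite.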
Why this is a separate concern
The detector library is the v0.3 moat (per RFC-0013) and ships independently of any one upstream emitter.
The metrics→logs path requires either a metric-consumer in the processor (architectural shift) or an OTTL recipe that runs in a separate pipeline branch.
Acceptance
Recipe present under docs/integrations/.
Example config passes tracecore validate.