OTTL recipe: derive hw.errors.delta log shape from hw.errors counter (pattern #3 wiring) #273

Description

@trilamsr

The hbm_ecc detector (pattern #3, shipped in feat/hbm-ecc-detector) consumes HBMECCRecord projections gated on the log-shape attribute set hw.errors.delta + gpu.id + error.type + error.subtype + error.persistence + hw.gpu.index. The pattern doc (docs/patterns/pattern-3-hbm-ecc.md) defines the receiver-emitted signal as the hw.errors Counter scraped from dcgm-exporter via prometheusreceiver.

No OTTL recipe exists today that derives the hw.errors.delta log shape from the upstream hw.errors Counter. Without one, the detector wires up but cannot fire in a real deployment.

What's needed

A metrics→logs OTTL transform recipe under docs/integrations/ that:

  • Reads hw.errors counter datapoints with hw.type=gpu AND error.type=uncorrected AND error.subtype=double_bit AND error.persistence=volatile.
  • Computes the per-scrape delta (the equivalent of PromQL increase(...[scrape_interval])).
  • Emits a log record carrying hw.errors.delta (int), gpu.id (PCI BDF resource attr), hw.gpu.index, error.type, error.subtype, error.persistence, plus k8s.node.name resource attr.
  • Validates against ./_build/tracecore validate --config=docs/integrations/examples/....
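
The filter, delta, and emit steps above can be sketched outside the collector to pin down the intended semantics. This is a minimal illustration in Python, not the OTTL recipe itself. The `DataPoint` and `DeltaTracker` names are hypothetical, and so are three assumptions: that hw.gpu.index arrives as a datapoint attribute, that a counter decrease is treated as a reset (as PromQL's increase does), and that zero deltas emit no log record.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Attribute filter taken from the issue text.
REQUIRED_ATTRS = {
    "hw.type": "gpu",
    "error.type": "uncorrected",
    "error.subtype": "double_bit",
    "error.persistence": "volatile",
}


@dataclass
class DataPoint:
    """Hypothetical stand-in for one scraped hw.errors counter datapoint."""
    value: int
    attributes: Dict[str, str]  # datapoint attributes
    resource: Dict[str, str]    # resource attributes (gpu.id, k8s.node.name)


class DeltaTracker:
    """Hypothetical helper: remembers the last counter value per GPU."""

    def __init__(self) -> None:
        self._last: Dict[Tuple[Optional[str], Optional[str]], int] = {}

    def observe(self, dp: DataPoint) -> Optional[dict]:
        # Step 1: keep only volatile, uncorrected, double-bit GPU errors.
        if any(dp.attributes.get(k) != v for k, v in REQUIRED_ATTRS.items()):
            return None

        # Assumption: (gpu.id, hw.gpu.index) identifies one counter series.
        key = (dp.resource.get("gpu.id"), dp.attributes.get("hw.gpu.index"))
        prev = self._last.get(key)
        self._last[key] = dp.value

        # The first sample has no baseline, so no delta can be computed.
        if prev is None:
            return None

        # Step 2: per-scrape delta. A decrease is treated as a counter
        # reset, so the new value counts as the whole increase.
        delta = dp.value - prev if dp.value >= prev else dp.value
        if delta == 0:
            return None  # assumption: unchanged counters emit nothing

        # Step 3: log record carrying the attribute set named in the issue.
        return {
            "resource": {
                "gpu.id": dp.resource.get("gpu.id"),
                "k8s.node.name": dp.resource.get("k8s.node.name"),
            },
            "attributes": {
                "hw.errors.delta": delta,
                "hw.gpu.index": dp.attributes.get("hw.gpu.index"),
                "error.type": dp.attributes["error.type"],
                "error.subtype": dp.attributes["error.subtype"],
                "error.persistence": dp.attributes["error.persistence"],
            },
        }
```

Feeding successive scrapes of the same series through `observe` yields one record per scrape in which the counter moved. Whether the real recipe should drop the first sample or zero deltas is an open design choice that the recipe author would need to confirm against the detector's expectations.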

Why this is a separate concern

  • The detector library is the v0.3 moat (per RFC-0013) and ships independent of any one upstream emitter.
  • The metrics→logs path requires either a metric-consumer in the processor (architectural shift) or an OTTL recipe that runs in a separate pipeline branch.
  • Pattern #1 (NVLink) faces the same gap: both are dcgm-exporter metric-derived signals.

Acceptance

  • Recipe lands under docs/integrations/.
  • Validates with tracecore validate.
  • Combined with the journald-kernel recipe (Xid + gpu.id) and the hbm_ecc detector, an end-to-end fixture produces an hbm_ecc verdict.
