Merged
1 change: 1 addition & 0 deletions docs/README.md
@@ -29,6 +29,7 @@ Legend: 👤 operator · 🛠️ contributor · 🏛️ maintainer · 🌐 exter
| Path | Audience | Purpose |
|---|---|---|
| [rfcs/](rfcs/) | 🏛️ 🛠️ | Architecture decision records. See [rfcs/README.md](rfcs/README.md) for the status index. |
| [adrs/](adrs/) | 🛠️ 🏛️ | Narrowly-scoped architectural decisions that fit under an existing RFC. One file per decision; see [adrs/0001-metrics-to-logs-pattern-input.md](adrs/0001-metrics-to-logs-pattern-input.md) for the metrics-sourced pattern-input wiring. |
| [patterns/](patterns/) | 🛠️ 👤 | Root-cause-pattern walkthroughs (NVLink degradation, HBM ECC, thermal throttle, PCIe AER). |
| [proposals/](proposals/) | 🏛️ | Drafts pending upstream (semconv extensions, etc.). |
| [research/](research/) | 🛠️ | Synthesized findings from reading external sources (OTel collector internals, benchmark baselines). |
177 changes: 177 additions & 0 deletions docs/adrs/0001-metrics-to-logs-pattern-input.md
@@ -0,0 +1,177 @@
# ADR 0001 — Metrics-sourced pattern inputs feed `patterndetectorprocessor` via a sibling `WithMetrics` registration (Option A), NOT via an OTTL metrics-to-logs converter (Option B)

- **Status:** accepted, blocks issue #260 PR-B
- **Date:** 2026-05-31
- **Authors:** Tri Lam (@trilam)
- **Affects:** `module/processor/patterndetectorprocessor/`, `docs/integrations/prometheus-scrape.md`, patterns #1 (NVLink), #3 (HBM ECC), #4 (thermal throttle), #5 (PCIe AER)

## Context

`patterndetectorprocessor` (RFC-0013 §6.3, PR-I.2b) ships logs-only:
`factory.go` registers only `processor.WithLogs(...)`. The detectors
landed so far (`pod_evicted`, `nccl_hang`, `xid_correlation`) all
consume log records projected from K8s objects, NCCL FlightRecorder,
and journald.

The next four NORTHSTAR detectors (patterns #1, #3, #4, #5) consume
**metric** signals scraped from `dcgm-exporter` via `prometheusreceiver`
(per RFC-0013 §2 adoption matrix). They need a way to feed metric
datapoints — `hw.gpu.nvlink.io`, `hw.errors`, `hw.gpu.throttle.duration`,
`hw.gpu.io` — into a pattern engine that today only reads `plog.Logs`.

Issue [#260](https://github.com/tracecoreai/tracecore/issues/260)
frames the decision. Three options were on the table:

- **Option A** — extend `patterndetectorprocessor` to register
`processor.WithMetrics` alongside the existing `WithLogs`.
- **Option B** — keep `patterndetectorprocessor` logs-only and pipe a
**metrics→logs converter** in front of it. Each pattern declares its
trigger as an OTTL statement that, when a metric threshold/rate
crosses, emits a synthetic log record. The existing logs detector
consumes the synthetic record via the same attribute projection
contract it uses for kernel events.
- **Option C** — `routingconnector` splits the metrics signal off to a
sibling metrics-aware processor.
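
To make Option B's trigger idea concrete, here is a minimal sketch of the behavior a metrics→logs converter would need: compute a counter's rate between samples and emit one synthetic log record when the rate crosses a threshold. It is plain Go rather than OTTL, and `Sample`, `SyntheticLog`, and `RateTrigger` are hypothetical stand-ins, not types from the collector or from this repository.

```go
package main

import "fmt"

// Sample is a hypothetical stand-in for one cumulative counter datapoint.
type Sample struct {
	Metric string
	Value  float64 // cumulative counter value
	UnixS  int64   // sample timestamp in seconds
}

// SyntheticLog is a hypothetical stand-in for the emitted log record.
type SyntheticLog struct {
	Body       string
	Attributes map[string]string
}

// RateTrigger fires when the per-second rate rises above Threshold.
type RateTrigger struct {
	Metric    string
	Threshold float64
	prev      *Sample
	firing    bool
}

// Observe returns a synthetic log on the rising edge only, so a sustained
// breach yields one record rather than one per scrape.
func (t *RateTrigger) Observe(s Sample) (SyntheticLog, bool) {
	if s.Metric != t.Metric {
		return SyntheticLog{}, false
	}
	prev := t.prev
	t.prev = &s
	if prev == nil || s.UnixS <= prev.UnixS {
		return SyntheticLog{}, false
	}
	rate := (s.Value - prev.Value) / float64(s.UnixS-prev.UnixS)
	above := rate > t.Threshold
	rising := above && !t.firing
	t.firing = above
	if !rising {
		return SyntheticLog{}, false
	}
	return SyntheticLog{
		Body:       fmt.Sprintf("%s rate %.2f/s exceeded %.2f/s", s.Metric, rate, t.Threshold),
		Attributes: map[string]string{"metric.name": s.Metric},
	}, true
}

func main() {
	trigger := &RateTrigger{Metric: "hw.errors", Threshold: 1.0}
	samples := []Sample{
		{"hw.errors", 0, 0}, {"hw.errors", 5, 10}, {"hw.errors", 40, 20}, {"hw.errors", 90, 30},
	}
	for _, s := range samples {
		if rec, ok := trigger.Observe(s); ok {
			fmt.Println(rec.Body)
		}
	}
}
```

Edge-triggering is a design choice made for this sketch; a real converter might instead re-emit while the breach lasts.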

## Decision

**Option A.** We extend `patterndetectorprocessor` with a
`processor.WithMetrics` registration (separate factory entry, same
module, parallel `consumeMetrics` path). The metric-sourced detectors
read `pmetric.Metrics` directly and emit verdict log records onto a
sibling logs pipeline via a connector (`forward` or an in-tree
`patternverdict` connector — decided at PR-B time).
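
The registration shape can be sketched without the collector dependency. The example below only imitates the functional-option pattern of `processor.WithLogs` / `processor.WithMetrics`; `Factory`, `Option`, `Logs`, `Metrics`, and `detector` are simplified, hypothetical stand-ins, and the real signatures (which also take settings, config, and a stability level) are not reproduced here.

```go
package main

import "fmt"

// Logs and Metrics are hypothetical stand-ins for plog.Logs and pmetric.Metrics.
type Logs struct{ Records []string }
type Metrics struct{ Names []string }

// Factory is a simplified stand-in for a processor factory:
// it holds at most one constructor per signal.
type Factory struct {
	createLogs    func(next func(Logs) error) func(Logs) error
	createMetrics func(next func(Metrics) error) func(Metrics) error
}

// Option mirrors the functional-option shape of WithLogs / WithMetrics.
type Option func(*Factory)

func WithLogs(c func(next func(Logs) error) func(Logs) error) Option {
	return func(f *Factory) { f.createLogs = c }
}

func WithMetrics(c func(next func(Metrics) error) func(Metrics) error) Option {
	return func(f *Factory) { f.createMetrics = c }
}

func NewFactory(opts ...Option) *Factory {
	f := &Factory{}
	for _, o := range opts {
		o(f)
	}
	return f
}

// Signals lists the pipeline types this factory can be placed in.
func (f *Factory) Signals() []string {
	var out []string
	if f.createLogs != nil {
		out = append(out, "logs")
	}
	if f.createMetrics != nil {
		out = append(out, "metrics")
	}
	return out
}

// detector counts inputs per path; both constructors close over the same state.
type detector struct{ logsSeen, metricsSeen int }

func (d *detector) createLogs(next func(Logs) error) func(Logs) error {
	return func(l Logs) error {
		d.logsSeen += len(l.Records)
		return next(l)
	}
}

func (d *detector) createMetrics(next func(Metrics) error) func(Metrics) error {
	return func(m Metrics) error {
		d.metricsSeen += len(m.Names)
		return next(m)
	}
}

func main() {
	d := &detector{}
	before := NewFactory(WithLogs(d.createLogs))
	after := NewFactory(WithLogs(d.createLogs), WithMetrics(d.createMetrics))
	fmt.Println(before.Signals(), after.Signals()) // [logs] [logs metrics]
}
```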

The decision sequence is recorded here so the next PR (issue #260
PR-B — the NVLink detector) doesn't relitigate the trade-off.

## Architectural alternatives evaluated

### Option B (preferred starting point per `[[adopt-over-build]]`) — **BLOCKED upstream**

Option B was the recommended starting point because it adopts the
upstream `transformprocessor` already bundled in `builder-config.yaml`,
keeps `patterndetectorprocessor` shape-monomorphic (one signal in,
one signal out), and centralizes pattern triggers in declarative OTTL.

**Why it's blocked.** OTel-contrib `transformprocessor` v0.130 (the
release pinned in `builder-config.yaml`) cannot emit log records from a
metrics pipeline. Verified against the upstream README:

> Within each `<signal_statements>` list, only certain OTTL Path prefixes can be used:
>
> | Signal | Path Prefix Values |
> |--------------------|------------------------------------------------|
> | trace_statements | `resource`, `scope`, `span`, and `spanevent` |
> | metric_statements | `resource`, `scope`, `metric`, and `datapoint` |
> | log_statements | `resource`, `scope`, and `log` |
> | profile_statements | `resource`, `scope`, and `profile` |
>
> This means, for example, that you cannot use the Path `span.attributes`
> within the `log_statements` configuration section.

Source: [`processor/transformprocessor/README.md` § Config (v0.130.0)](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config).

`metric_statements` cannot reference `log.*` paths. The processor's
pipeline binding (`processor.WithMetrics` vs `processor.WithLogs`) is
sealed per-signal — there is no "emit a log on the side" path in OTTL.
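
The restriction can be stated as a small lookup. The sketch below transcribes the README table quoted above into a Go map and checks a path's leading segment against it. `pathAllowed` is a hypothetical helper written for illustration; it is not how `transformprocessor` validates its configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedPrefixes transcribes the path-prefix table from the README quote.
var allowedPrefixes = map[string][]string{
	"trace_statements":   {"resource", "scope", "span", "spanevent"},
	"metric_statements":  {"resource", "scope", "metric", "datapoint"},
	"log_statements":     {"resource", "scope", "log"},
	"profile_statements": {"resource", "scope", "profile"},
}

// pathAllowed reports whether an OTTL path such as "datapoint.attributes"
// starts with a prefix permitted in the given statements section.
func pathAllowed(section, path string) bool {
	prefix, _, _ := strings.Cut(path, ".")
	for _, p := range allowedPrefixes[section] {
		if p == prefix {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(pathAllowed("metric_statements", "datapoint.attributes")) // true
	fmt.Println(pathAllowed("metric_statements", "log.body"))             // false
	fmt.Println(pathAllowed("log_statements", "span.attributes"))         // false
}
```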

We also surveyed the contrib **connector** family at v0.130 for a
metrics→logs primitive (a connector is the only OTel-contrib component
type allowed to change signal type):

| Connector | Output signal (receiver-side pipeline type) | Notes |
|---|---|---|
| `countconnector` | **metrics** only | Counts any-signal inputs into metrics; cannot emit logs. |
| `signaltometricsconnector` | metrics only | Same as `countconnector`: output is always metrics. |
| `routingconnector` | signal-preserving | logs→logs / metrics→metrics / traces→traces. |
| `failoverconnector`, `roundrobinconnector` | signal-preserving | Same as `routingconnector`: output type equals input type. |
| `spanmetricsconnector`, `exceptionsconnector`, `servicegraphconnector` | traces → metrics | Wrong direction. |
| `datadogconnector`, `grafanacloudconnector` | vendor-specific | Out of scope. |
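| `otlpjsonconnector` | logs → logs / metrics / traces | Accepts logs input only (it unpacks OTLP JSON carried in log bodies), so it cannot start from a metrics pipeline. |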

Source: [`open-telemetry/opentelemetry-collector-contrib/tree/v0.130.0/connector`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/v0.130.0/connector) (directory listing, 2026-05-31).

**No upstream contrib component at v0.130 emits log records from a
metrics input.** A metric-rule-as-log-alert primitive would be a net-new
contrib component — an upstream contribution per RFC-0013 §5, not
recipe-shaped work tracecore can land in v0.3.0.

### Option C — `routingconnector` + sibling processor

Equivalent to Option A in operator surface (it still needs a
metrics-aware processor) but adds an extra component to the pipeline
graph. Routing preserves signal type, so it cannot bridge metrics→logs
either. Rejected: strictly heavier than Option A with the same end state.

### Option A (this ADR)

Trade-offs accepted:

- **Doubled processor surface.** `patterndetectorprocessor` now ships
both `ConsumeLogs` and `ConsumeMetrics` paths. Mitigation: the
metrics path is a thin projection layer (`pmetric.Metrics` →
`patterns.<MetricRecord>`) on top of the same `patterns/` library;
the verdict-emission shape stays log-based (verdicts go out on a
sibling logs pipeline via a connector, not as metrics). The two
consume paths share the verdict-emit code.
- **Per-signal join semantics divergence.** Logs are record-batched,
metrics are datapoint-batched within ResourceMetrics. The metrics
path will not pretend to be a log record at runtime; the projection
is explicit. The pattern library already separates per-input record
types (`patterns.Record`, `patterns.NodeRecord`, `patterns.XidRecord`,
`patterns.NCCLFRRecord`) — a new `patterns.GPUMetricRecord` family
follows the same shape.
- **`[[adopt-over-build]]` cost.** Option A is custom in-tree code we'd
rather not write. We accept it only because Option B is blocked
upstream. When OTel contrib ships a `metricthresholdconnector` (or
equivalent — emit a log when a metric rate/threshold crosses), the
recipe is re-evaluated and the metrics-path detectors can be
refactored onto the upstream primitive without changing the
customer-facing verdict-record contract.
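
A minimal sketch of the "thin projection layer" described above, assuming stand-in types instead of `pmetric`. `GPUMetricRecord` here is a guess at the planned `patterns.GPUMetricRecord` shape, and the attribute keys `host.name` and `hw.id`, the pattern name, and the threshold are all placeholders chosen for illustration.

```go
package main

import "fmt"

// DataPoint, Metric and ResourceMetrics are hypothetical stand-ins for the
// pmetric hierarchy (resource -> metric -> datapoint).
type DataPoint struct {
	Attrs map[string]string
	Value float64
}

type Metric struct {
	Name   string
	Points []DataPoint
}

type ResourceMetrics struct {
	Resource map[string]string
	Metrics  []Metric
}

// GPUMetricRecord is a guess at the planned record shape, not the real type.
type GPUMetricRecord struct {
	Node, GPU, Metric string
	Value             float64
}

// projectMetrics flattens datapoints into explicit records, keeping only
// the metric names a detector asked for.
func projectMetrics(rms []ResourceMetrics, wanted map[string]bool) []GPUMetricRecord {
	var out []GPUMetricRecord
	for _, rm := range rms {
		for _, m := range rm.Metrics {
			if !wanted[m.Name] {
				continue
			}
			for _, dp := range m.Points {
				out = append(out, GPUMetricRecord{
					Node:   rm.Resource["host.name"], // placeholder key
					GPU:    dp.Attrs["hw.id"],         // placeholder key
					Metric: m.Name,
					Value:  dp.Value,
				})
			}
		}
	}
	return out
}

// Verdict is what either consume path hands to the shared emit code.
type Verdict struct{ Pattern, Node, GPU string }

// emitVerdict renders a verdict as a log-record body; a logs-path detector
// would call the same function.
func emitVerdict(v Verdict) string {
	return fmt.Sprintf("pattern=%s node=%s gpu=%s", v.Pattern, v.Node, v.GPU)
}

func main() {
	const throttleLimit = 10.0 // arbitrary threshold for the demo
	rms := []ResourceMetrics{{
		Resource: map[string]string{"host.name": "node-a"},
		Metrics: []Metric{
			{Name: "hw.gpu.throttle.duration", Points: []DataPoint{
				{Attrs: map[string]string{"hw.id": "gpu-0"}, Value: 12},
			}},
			{Name: "unrelated.metric", Points: []DataPoint{{Value: 1}}},
		},
	}}
	wanted := map[string]bool{"hw.gpu.throttle.duration": true}
	for _, r := range projectMetrics(rms, wanted) {
		if r.Value > throttleLimit {
			fmt.Println(emitVerdict(Verdict{Pattern: "thermal_throttle", Node: r.Node, GPU: r.GPU}))
		}
	}
}
```

Keeping the projection explicit means the detector never sees a metric disguised as a log record, which matches the join-semantics point in the second bullet.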

## Consequences

1. **PR-A (this PR, issue #260)** ships only the DCGM → `hw.*`
namespace OTTL transform on the metric stream. The recipe does not
attempt to emit log records from metrics, because that path is
unsupported at v0.130. The transform is independently valuable: it
normalizes raw `DCGM_FI_PROF_NVLINK_L*_TX_BYTES` (and the other
three families) into the customer-stable `hw.gpu.*` namespace
declared in `docs/proposals/semconv-hw-gpu-extensions.md`, so PR-B's
detector can consume that shape directly, whichever architectural
path PR-B implements.

2. **PR-B (issue #260 follow-up, NOT in this PR)** extends
   `patterndetectorprocessor` with `processor.WithMetrics`. Scope:
   - `factory.go` registers `createMetrics` alongside `createLogs`.
   - `patterndetector.go` adds `ConsumeMetrics(ctx, pmetric.Metrics)`
     and a `collectMetricInputs` that mirrors `collectInputs` but
     reads metric datapoints by `metric.name` + attribute gates.
   - `pkg/patterns/nvlink_degradation.go` lands as the first
     `WithMetrics`-driven detector.
   - Verdicts continue to land as log records on a sibling logs
     pipeline via a connector wiring documented in the chart's
     `renderedConfig` template.

3. **Customer-facing recipe shape stays unchanged.** Operators do not
change their `prometheus-scrape.yaml` between PR-A and PR-B beyond
adding the metrics input to the `patterndetector` processor in
their pipeline config (one new line). The OTTL `hw.*` namespace
transform from PR-A is the load-bearing wire-format contract for
PR-B and any future GPU pattern.

4. **Upstream contribution slot opened.** A
`metricthresholdconnector` (or `signaltologsconnector`) that emits
a log record when a metric datapoint matches a condition — the
inverse of `signaltometricsconnector` — is the missing primitive
that would let us collapse Option A back to Option B. Tracked as a
v0.3 follow-up under RFC-0013 §5 (upstream contribution policy).
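
The name normalization from consequence 1 can be illustrated in Go. In PR-A the transform is written in OTTL; this sketch only shows the mapping idea. The regular expression, the `RX` variant, and the attribute keys `link` and `direction` are assumptions, since the authoritative shape lives in `docs/proposals/semconv-hw-gpu-extensions.md`.

```go
package main

import (
	"fmt"
	"regexp"
)

// nvlinkField matches per-link NVLink byte counters of the form the ADR
// abbreviates as DCGM_FI_PROF_NVLINK_L*_TX_BYTES. The RX variant and the
// numeric link index are assumptions.
var nvlinkField = regexp.MustCompile(`^DCGM_FI_PROF_NVLINK_L(\d+)_(TX|RX)_BYTES$`)

// Normalized is a hypothetical target shape: a stable metric name plus
// attributes recovered from the raw field name.
type Normalized struct {
	Name  string
	Attrs map[string]string
}

// normalize maps a raw DCGM field name onto the stable namespace, or
// reports false when the name is not an NVLink byte counter.
func normalize(raw string) (Normalized, bool) {
	m := nvlinkField.FindStringSubmatch(raw)
	if m == nil {
		return Normalized{}, false
	}
	dir := "transmit"
	if m[2] == "RX" {
		dir = "receive"
	}
	return Normalized{
		Name:  "hw.gpu.nvlink.io",
		Attrs: map[string]string{"link": m[1], "direction": dir}, // placeholder keys
	}, true
}

func main() {
	for _, raw := range []string{"DCGM_FI_PROF_NVLINK_L3_TX_BYTES", "DCGM_FI_DEV_GPU_TEMP"} {
		n, ok := normalize(raw)
		fmt.Println(raw, ok, n.Name, n.Attrs)
	}
}
```

Folding the link index and direction into attributes keeps one stable metric name per family, so a detector can gate on `metric.name` first and on attributes second.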

## References

- Issue [#260](https://github.com/tracecoreai/tracecore/issues/260) — recipe extension + metrics-side detector plumbing.
- [`docs/rfcs/0013-distro-first-pivot.md`](../rfcs/0013-distro-first-pivot.md) §3 (customer-stable contracts), §5 (upstream contribution policy), §6 (the four in-house moat scopes).
- [`docs/patterns/pattern-1-nvlink-degradation.md`](../patterns/pattern-1-nvlink-degradation.md) — the first metrics-sourced pattern unblocked by this decision.
- [`docs/proposals/semconv-hw-gpu-extensions.md`](../proposals/semconv-hw-gpu-extensions.md) §3 — the `hw.gpu.nvlink.io` shape PR-A's OTTL transform emits.
- [transformprocessor README @ v0.130.0](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config) — the upstream signal-context table cited above.
- [opentelemetry-collector-contrib connector tree @ v0.130.0](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/v0.130.0/connector) — surveyed for any metrics→logs primitive.