Skip to content

feat(processor): promote operator-facing scalar attrs onto verdict records (closes #270)#275

Merged
trilamsr merged 2 commits into
mainfrom
feat/verdict-scalar-attrs
Jun 1, 2026
Merged

feat(processor): promote operator-facing scalar attrs onto verdict records (closes #270)#275
trilamsr merged 2 commits into
mainfrom
feat/verdict-scalar-attrs

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #270. Promotes operator-facing scalar fields to top-level OTLP log-record attributes on every verdict the patterndetectorprocessor emits, so Grafana dashboards (PR #264) and LogQL queries can table-aggregate without server-side JSON parsing of pattern.verdict_json.

Root cause: verdict log records carried scalars only inside pattern.verdict_json. Fields like Regarding.Name, HangingRanks, EvictedPod were buried — 4/7 panels in PR #264 rendered empty. Fixed at source: extended verdict structs with the scalars, then promoted them via PutStr / PutInt in the emitters.

Two commits, one concern each:

  1. feat(patterns): expose operator-facing fields on verdict structsPodEvictedVerdict gains PodName / PodNamespace / NodeName / EventReason; XidCorrelationVerdict gains PodName / PodNamespace. NCCLHangVerdict already had PgID / CollectiveSeqID / HangingRanks. Schemas + canonical golden updated; pattern.verdict_json remains.
  2. feat(processor): promote verdict scalar attrs (closes #270) — extends VerdictAttr* constants and stamps them on the emitted log record alongside existing attrs. Empty optional strings are skipped (putStrIfSet) so dashboards filtering by attribute presence don't match malformed records.

Attribute taxonomy by pattern:

pattern new attrs mapped from
pod_evicted (14) k8s.pod.name, k8s.pod.namespace, k8s.node.name, k8s.event.reason Verdict.PodName / PodNamespace / NodeName / EventReason (new fields populated from ev.Regarding.{Name,Namespace}, ev.ReportingInstance, ev.Reason)
nccl_hang (15) nccl.fr.pg_id, nccl.fr.collective_seq_id, nccl.fr.hanging_ranks_count Verdict.PgID, Verdict.CollectiveSeqID, len(Verdict.HangingRanks) (no struct change)
xid_correlation (16) kernelevents.xid, k8s.node.name, k8s.pod.name, k8s.pod.namespace Verdict.XidCode, Verdict.Node, Verdict.PodName, Verdict.PodNamespace (last two new, split from EvictedPod)

Test plan

  • cd module && go test ./pkg/patterns/... ./processor/... -race -count=1 — green
  • make build — green (OCB compiles)
  • make check — green (golangci-lint, go vet, go mod verify)
  • New TDD red-first test: module/processor/patterndetectorprocessor/verdict_attrs_test.go asserts each new attribute on the emitted log record per pattern; verified RED before impl, then GREEN
  • Schema-conformance tests for all three verdicts pass against updated schemas (verdict.schema.json, xid_correlation_verdict.schema.json)
  • Schema-drift battery still rejects the documented mutation classes (additionalProperties:false / enum guards / minItems / type guards)
  • Canonical pod_evicted golden updated to include the new scalars; TestPatternDetector_GoldenAgainstCanonicalFixture round-trips JSONEq

Followup

Unblocks PR #264 dashboards — panels 2/3/4/5 should render populated tables once this lands.

patterndetectorprocessor verdict log records now carry operator-facing scalars as top-level OTLP attributes (k8s.pod.name, k8s.pod.namespace, k8s.node.name, k8s.event.reason on pod_evicted; nccl.fr.pg_id, nccl.fr.collective_seq_id, nccl.fr.hanging_ranks_count on nccl_hang; kernelevents.xid + k8s.node.name + k8s.pod.name + k8s.pod.namespace on xid_correlation). Dashboards and LogQL queries can table-aggregate without parsing pattern.verdict_json. The full pattern.verdict_json attribute remains unchanged.

Tri Lam added 2 commits May 31, 2026 20:55
PodEvictedVerdict gains PodName/PodNamespace/NodeName/EventReason and
XidCorrelationVerdict gains PodName/PodNamespace so the
patterndetectorprocessor can promote them to top-level OTLP log-record
attributes (issue #270). Schemas + canonical golden updated; existing
fields and JSON shapes unchanged.

Signed-off-by: Tri Lam <tri@maydow.com>
Extends VerdictAttr* taxonomy with k8s.pod.name, k8s.pod.namespace,
k8s.node.name, k8s.event.reason, nccl.fr.pg_id,
nccl.fr.collective_seq_id, nccl.fr.hanging_ranks_count, and
kernelevents.xid. appendVerdict / appendNCCLHangVerdict /
appendXidCorrelationVerdict stamp them on the emitted log record
alongside the existing pattern.verdict_json so dashboards
table-aggregate without server-side JSON parsing. Unblocks PR #264.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 04:02
@trilamsr trilamsr merged commit a1cca39 into main Jun 1, 2026
15 checks passed
@trilamsr trilamsr deleted the feat/verdict-scalar-attrs branch June 1, 2026 04:05
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Verdict panels queried OTLP attributes via `| json pattern_id="attributes_pattern_id"` (Promtail/Alloy shape), which returns empty against Loki 3.0+ native OTLP ingestion (PR #278). Loki normalizes OTLP log attributes into structured metadata with dots → underscores at the LogQL surface (`pattern.id` → `pattern_id`), no `attributes_` prefix, no JSON parse stage.

All 6 verdict panels now use direct structured-metadata filters (`| pattern_id="14" | k8s_node_name=~".+"`). top-N tables aggregate on promoted scalar attrs from PR #275 (`k8s_pod_name`, `k8s_pod_namespace`, `nccl_fr_pg_id`, `nccl_fr_collective_seq_id`, `kernelevents_xid`). Panel 6 reads `pattern_confidence` (was `attributes_pattern_confidence` via json).

README documents the Loki native-OTLP install path (cross-refs PR #278 recipe) and notes the Promtail/Alloy fallback shape for operators not on native OTLP. Same 7 panels, no new panels.

Verification: dashboard-linter --strict exit 0; `python3 -c 'import json; json.load(...)'` exit 0; `make check` + `make verify` clean.
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Verdict panels queried OTLP attributes via `| json pattern_id="attributes_pattern_id"` (Promtail/Alloy shape), which returns empty against Loki 3.0+ native OTLP ingestion (PR #278). Loki normalizes OTLP log attributes into structured metadata with dots → underscores at the LogQL surface (`pattern.id` → `pattern_id`), no `attributes_` prefix, no JSON parse stage.

All 6 verdict panels now use direct structured-metadata filters (`| pattern_id="14" | k8s_node_name=~".+"`). top-N tables aggregate on promoted scalar attrs from PR #275 (`k8s_pod_name`, `k8s_pod_namespace`, `nccl_fr_pg_id`, `nccl_fr_collective_seq_id`, `kernelevents_xid`). Panel 6 reads `pattern_confidence` (was `attributes_pattern_confidence` via json).

README documents the Loki native-OTLP install path (cross-refs PR #278 recipe) and notes the Promtail/Alloy fallback shape for operators not on native OTLP. Same 7 panels, no new panels.

Verification: dashboard-linter --strict exit 0; `python3 -c 'import json; json.load(...)'` exit 0; `make check` + `make verify` clean.
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Wires the patterns.ThermalThrottleDetector library into the logs-only
patterndetectorprocessor. Detector cannot fire end-to-end yet — the
metrics-side path is deferred to PR-B per ADR-0001 — but the
projection is staged so the wiring compiles against the future
metrics->logs converter's emitted shape.

- Config: ThermalThrottleWindow, ThermalThrottleDeltaThreshold,
  ThermalThrottleMinCascadeGPUs (floors + defaults mirror the
  detector library; MinCascadeGPUs floor 2 — a one-GPU cascade is
  not pattern-4 territory).
- collectInputs returns thermal_throttle records as the 5th typed
  shape; projectThermalThrottleRecord gates on hw.gpu.throttle.
  duration.delta + hw.gpu.throttle.reason=thermal + gpu.id.
- appendThermalThrottleVerdict mirrors peer emitters; promotes
  k8s.node.name and hw.gpu.throttle.cascade_size onto the verdict
  log record per PR #275 so dashboards filter without parsing JSON.
- Six wiring tests: positive cascade, single-GPU negative, cross-
  node negative, reason=power discriminator, Window + DeltaThreshold
  configurability, scalar-attr promotion check.
- pattern-4-thermal-throttle.md gains a "Detector status" section
  documenting the integration-pending state.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Adds `install/kubernetes/tracecore/dashboards/patterns.json` — an
operator-facing Grafana 10+ dashboard for the three shipped pattern
verdicts (pod_evicted #14, nccl_hang #15, xid_correlation #16), with
future-proof templating on `pattern.id` for M17/M18 detectors.

Closes #280 (LogQL drift vs Loki native OTLP). Issue #270 (scalar
attribute promotion) was closed by PR #275 upstream of this rebase.

### LogQL shape: Loki native OTLP structured metadata

Verdict log records carry the verdict scalars as OTLP log-record
attributes (PR #275): `pattern.id`, `pattern.confidence`,
`k8s.pod.name`, `k8s.pod.namespace`, `k8s.node.name`,
`k8s.event.reason`, `nccl.fr.pg_id`, `nccl.fr.collective_seq_id`,
`nccl.fr.hanging_ranks_count`, `kernelevents.xid`. The Loki backend
recipe (PR #278, `docs/integrations/loki.md`) sends these to Loki
3.0+'s native OTLP endpoint (`/otlp/v1/logs`) via the bundled
`otlphttp` exporter.

Loki's OTLP receiver lands log attributes as **structured metadata**,
queryable as direct LogQL label filters — no `| json` parser stage,
no `attributes_` prefix. Dots in attribute names normalize to
underscores at the LogQL surface (per Loki upstream docs
`docs/sources/shared/otel.md` "Format considerations"):

- `pattern.id` → `pattern_id`
- `pattern.confidence` → `pattern_confidence`
- `k8s.pod.name` → `k8s_pod_name`
- `nccl.fr.pg_id` → `nccl_fr_pg_id`
- `kernelevents.xid` → `kernelevents_xid`

All six verdict panels are written against this shape (e.g.
`{job=~"$job"} | pattern_id="14" | k8s_node_name=~".+" [$__auto]`).
README documents the Loki native-OTLP install path (cross-refs PR
#278) and notes the Promtail/Alloy fallback for operators not on
native OTLP (those need `| json pattern_id="attributes_pattern_id"`
extraction; not shipped in-tree, fork the JSON).

### Self-telemetry counter still on v0.3 roadmap

The spec proposed PromQL queries against
`otelcol_processor_patterndetector_verdicts_emitted_total`. That
metric does not exist yet — the patterndetectorprocessor README says
*"No self-telemetry yet. Self-telemetry is on the v0.3 roadmap"*.
Per the spec's constraint (*"DO NOT touch detector code or processor
code"*), root-causing the missing metric is out of scope for this
PR; the six verdict-derived panels query Loki via LogQL in the
meantime. Tracked under #261. When that lands, the six LogQL panels
swap to PromQL and the Loki dependency drops.

Panel 7 (throughput) queries Prometheus against the upstream
OTel-Collector standard `otelcol_processor_{incoming,outgoing}_items`
which the collector emits for every processor automatically.

### Panels shipped (7)

| # | Title | Datasource | Patterns covered |
|---|---|---|---|
| 1 | Verdict rate by pattern_id | Loki / LogQL | 14, 15, 16 (templated)
|
| 2 | Top 10 evicted pods | Loki / LogQL | 14 (pod_evicted) |
| 3 | Top 10 hung NCCL collectives | Loki / LogQL | 15 (nccl_hang) |
| 4 | Top 10 Xid+eviction correlations | Loki / LogQL | 16
(xid_correlation) |
| 5 | Verdict count by node | Loki / LogQL | all (templated) |
| 6 | Confidence distribution (full vs partial) | Loki / LogQL | all
(templated) |
| 7 | patterndetector processor throughput | Prom / PromQL | (pipeline
liveness) |

### Templating vars (6)

- `prometheus_datasource`, `loki_datasource` — datasource selectors (no
  hardcoded UIDs).
- `job`, `instance` — linter-mandated PromQL matchers, populated from
  `otelcol_process_uptime`.
- `cluster` — multi-cluster slice, populated from
`otelcol_process_uptime`.
- `pattern_id` — custom-options var seeded with the three shipped IDs
  (14/15/16). Extend in-place when new detectors land.

### Linter exclusion (justified)

`.lint` config waives `target-promql-rule` on the six Loki panels. The
dashboard-linter parses every target as PromQL irrespective of
`target.datasource.type` and fails on the first `|` in a valid LogQL
pipeline. `target-logql-rule` still validates each query as LogQL and
passes. Full justification in
`install/kubernetes/tracecore/dashboards/README.md` §"Linter exclusions"
and inline in the `.lint` `reason:` block. Removable once issue #261
swaps the panels to PromQL.

## Test plan

- [x] `dashboard-linter lint --strict --config .lint patterns.json`
      → exit 0 (built from source — `go install` rejects the linter's
      own go.mod replace directives; build steps in
      `install/kubernetes/tracecore/dashboards/README.md`).
- [x] `python3 -c "import json;
json.load(open('install/kubernetes/tracecore/dashboards/patterns.json'))"`
      → exit 0; 7 panels, 6 templating vars confirmed.
- [x] `make check` (golangci-lint + vet + tidy-check + mod verify) →
      exit 0.
- [x] `make verify` (license-check + doc-check + register-lint +
      actionlint + zizmor + no-autoupdate) → exit 0.
- [x] LogQL shape verified against Loki upstream docs
      (`docs/sources/shared/otel.md` "Format considerations" — dots
      and special characters normalize to underscores; no
      `attributes_` prefix on the native OTLP surface) and
      `docs/sources/send-data/otel/native_otlp_vs_loki_exporter.md`
      query examples (`{service_name="auth"} | severity_text="INFO"`).
- [x] Promoted scalar attrs verified against
      `module/processor/patterndetectorprocessor/patterndetector.go`
      `VerdictAttr*` constants (lines 25-88) and `appendVerdict` /
      `appendNCCLHangVerdict` / `appendXidCorrelationVerdict` emitters
      (lines 510-517+).
- [ ] Manual smoke test against a live cluster with Loki 3.0+ native
      OTLP receiver (deferred — adversarial reviewer to verify panels
      render against actual OTLP-native Loki output before merge).

```release-notes
Ship Grafana dashboard for tracecore's pattern verdicts: install/kubernetes/tracecore/dashboards/patterns.json. Seven panels cover the three shipped pattern detectors (pod_evicted #14, nccl_hang #15, xid_correlation #16) plus templated pattern.id for future M17/M18 patterns. LogQL queries target Loki 3.0+ native OTLP structured metadata (pairs with the Loki backend recipe). Includes README install guide (manual upload, grafana-cli, kube-prometheus-stack ConfigMap), Promtail/Alloy fallback notes, and pattern coverage matrix.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Ships NORTHSTAR pattern #5 per docs/patterns/pattern-5-pcie-aer.md.
Two layers required for a verdict; a single layer alone emits
nothing.

- Layer 1 (kernel AER): `PCIe Bus Error: severity=…, type=…` line
  on a specific `gpu.id` (PCI BDF) extracted by the journald-kernel
  OTTL recipe.
- Layer 2 (rate-collapse): per-GPU `hw.gpu.io` `BytesPerSecond` <=
  `(1 - RateDropThreshold) * BaselineBytesPerSecond` on the same
  BDF within the correlation window. Default threshold 0.5 (50%
  drop) and window 5min. Causality flows forward — the AER must
  precede the rate-collapse sample.

Why two layers required (no Confidence taxonomy, mirroring
xid_correlation + hbm_ecc):
- AER alone: corrected errors recovered — the link re-trained.
  Operators see raw AER telemetry on journald already.
- Rate-collapse alone: workload-natural Tx/Rx variance (a rank
  finished a comm phase, etc.). The AER is the hardware-fault
  confirmation that makes the join operator-actionable.

Join key is `gpu.id` (PCI BDF) per RFC-0013 §3, shared across the
dmesg AER preamble and the dcgm-exporter `hw.gpu.pci.bdf` resource
attribute on `hw.gpu.io`. Node is carried for the verdict prose
but not used in the join key — the BDF is the proximate hardware
identifier.

Defaults:
- CorrelationWindow: 5min — mirrors the spec's `rate(...[5m])`
  PromQL and typical post-AER link-renegotiation latency.
- RateDropThreshold: 0.5 — a one-generation PCIe downshift (Gen5
  → Gen3 or x16 → x8) lands well past this floor.

What's new:
- module/pkg/patterns/pcie_aer.go — PCIeAERDetector library type,
  PCIeAERRecord (AER kernel message), PCIeIORecord (hw.gpu.io
  rate sample with baseline), PCIeAERVerdict output.
  PatternIDPCIeAER = "5"; EvidenceKindAER = "pcie_aer";
  EvidenceKindPCIeIOCollapse = "hw_io_collapse".
- module/pkg/patterns/testdata/pcie_aer_verdict.schema.json — JSON
  schema with 10 drift falsifiers (extra field, confidence re-add,
  pattern.id numeric, pattern.id wrong value, severity outside
  enum, gpu_id empty, drop_ratio negative, drop_ratio over 1,
  evidence kind outside enum, evidence_trail under min).

Verdict struct carries promoted scalar fields (Severity, AERType,
DropRatio, GPUID, Node) so the processor wiring can stamp them as
top-level OTLP log attributes for dashboard table-aggregation per
PR #275's lesson. No dead fields — every struct field is read by
either the Evaluate path or the verdict-shape contract.

Integration end-to-end firing in a real deployment is blocked on
PR-B (issue #260): no OTTL recipe today derives the per-GPU
`tracecore.alert.pcie_rate_collapse`-shaped log record from
`hw.gpu.io`. The detector library is the v0.3 moat and ships
independently per ADR-0001.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Wires the PCIeAERDetector library type into patterndetectorprocessor:

- Config: PCIeAERWindow (yaml: pcie_aer_window, default 5min),
  PCIeAERRateDropThreshold (yaml: pcie_aer_rate_drop_threshold,
  default 0.5). Validate rejects sub-1s window and threshold
  outside [0, 1].
- collectInputs grows two new typed projections behind tighter
  discriminators than the existing five:
  - projectPCIeAERRecord — gate `kernelevents.pcie_aer.severity`
    AND `gpu.id`. Reads severity/type/gpu.id off log attrs, falls
    back to resource gpu.id when the journald-kernel OTTL stamps
    it on the resource.
  - projectPCIeIORecord — gate
    `tracecore.alert.pcie_rate_collapse.bytes_per_second` AND
    `gpu.id`. Reads BytesPerSecond + Baseline + Direction off log
    attrs; the bridge attribute name is namespaced
    `tracecore.alert.pcie_rate_collapse.*` so downstream knows it's
    a tracecore-derived alert (vs a raw hw.gpu.io scrape sample).
- appendPCIeAERVerdict promotes (GPUID, Severity, AERType,
  DropRatio, Node) onto the verdict log record as top-level OTLP
  attributes per PR #275's lesson, so dashboards can table-
  aggregate without parsing pattern.verdict_json.
  Names track OTel semconv (`gpu.id`, `k8s.node.name`) and
  recipe-canonical keys (`kernelevents.pcie_aer.severity/.type`).

Wiring tests cover: emits-verdict, AER-alone (no fire),
rate-collapse-alone (no fire), window-configurable, threshold-
configurable, promoted-scalar attribute presence, and the new
Validate guard.

Integration gap (filed separately, not blocking this PR):

1. The journald-kernel OTTL recipe extracts kernelevents.xid +
   gpu.id from `NVRM: Xid` lines but does NOT extract
   kernelevents.pcie_aer.* from `PCIe Bus Error: …` lines —
   needs a sibling OTTL stanza in
   docs/integrations/journald-kernel.md.
2. The metrics→logs PCIe rate-collapse alert OTTL recipe is the
   PR-B side of issue #260 (ADR-0001 the blocker for all
   metrics-sourced patterns).

Until both land, the detector is the v0.3 moat (pattern logic +
tests) and the wiring is configured-but-quiet (zero verdicts on
real input — projections find nothing).

Signed-off-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

patterndetectorprocessor: promote operator-facing scalar attrs onto verdict records

1 participant