feat(processor): promote operator-facing scalar attrs onto verdict records (closes #270)#275
Merged
Merged
Conversation
added 2 commits
May 31, 2026 20:55
PodEvictedVerdict gains PodName/PodNamespace/NodeName/EventReason and XidCorrelationVerdict gains PodName/PodNamespace so the patterndetectorprocessor can promote them to top-level OTLP log-record attributes (issue #270). Schemas + canonical golden updated; existing fields and JSON shapes unchanged. Signed-off-by: Tri Lam <tri@maydow.com>
Extends VerdictAttr* taxonomy with k8s.pod.name, k8s.pod.namespace, k8s.node.name, k8s.event.reason, nccl.fr.pg_id, nccl.fr.collective_seq_id, nccl.fr.hanging_ranks_count, and kernelevents.xid. appendVerdict / appendNCCLHangVerdict / appendXidCorrelationVerdict stamp them on the emitted log record alongside the existing pattern.verdict_json so dashboards table-aggregate without server-side JSON parsing. Unblocks PR #264. Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Verdict panels queried OTLP attributes via `| json pattern_id="attributes_pattern_id"` (Promtail/Alloy shape), which returns empty against Loki 3.0+ native OTLP ingestion (PR #278). Loki normalizes OTLP log attributes into structured metadata with dots → underscores at the LogQL surface (`pattern.id` → `pattern_id`), no `attributes_` prefix, no JSON parse stage. All 6 verdict panels now use direct structured-metadata filters (`| pattern_id="14" | k8s_node_name=~".+"`). top-N tables aggregate on promoted scalar attrs from PR #275 (`k8s_pod_name`, `k8s_pod_namespace`, `nccl_fr_pg_id`, `nccl_fr_collective_seq_id`, `kernelevents_xid`). Panel 6 reads `pattern_confidence` (was `attributes_pattern_confidence` via json). README documents the Loki native-OTLP install path (cross-refs PR #278 recipe) and notes the Promtail/Alloy fallback shape for operators not on native OTLP. Same 7 panels, no new panels. Verification: dashboard-linter --strict exit 0; `python3 -c 'import json; json.load(...)'` exit 0; `make check` + `make verify` clean.
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Verdict panels queried OTLP attributes via `| json pattern_id="attributes_pattern_id"` (Promtail/Alloy shape), which returns empty against Loki 3.0+ native OTLP ingestion (PR #278). Loki normalizes OTLP log attributes into structured metadata with dots → underscores at the LogQL surface (`pattern.id` → `pattern_id`), no `attributes_` prefix, no JSON parse stage. All 6 verdict panels now use direct structured-metadata filters (`| pattern_id="14" | k8s_node_name=~".+"`). top-N tables aggregate on promoted scalar attrs from PR #275 (`k8s_pod_name`, `k8s_pod_namespace`, `nccl_fr_pg_id`, `nccl_fr_collective_seq_id`, `kernelevents_xid`). Panel 6 reads `pattern_confidence` (was `attributes_pattern_confidence` via json). README documents the Loki native-OTLP install path (cross-refs PR #278 recipe) and notes the Promtail/Alloy fallback shape for operators not on native OTLP. Same 7 panels, no new panels. Verification: dashboard-linter --strict exit 0; `python3 -c 'import json; json.load(...)'` exit 0; `make check` + `make verify` clean.
7 tasks
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Wires the patterns.ThermalThrottleDetector library into the logs-only patterndetectorprocessor. Detector cannot fire end-to-end yet — the metrics-side path is deferred to PR-B per ADR-0001 — but the projection is staged so the wiring compiles against the future metrics->logs converter's emitted shape. - Config: ThermalThrottleWindow, ThermalThrottleDeltaThreshold, ThermalThrottleMinCascadeGPUs (floors + defaults mirror the detector library; MinCascadeGPUs floor 2 — a one-GPU cascade is not pattern-4 territory). - collectInputs returns thermal_throttle records as the 5th typed shape; projectThermalThrottleRecord gates on hw.gpu.throttle. duration.delta + hw.gpu.throttle.reason=thermal + gpu.id. - appendThermalThrottleVerdict mirrors peer emitters; promotes k8s.node.name and hw.gpu.throttle.cascade_size onto the verdict log record per PR #275 so dashboards filter without parsing JSON. - Six wiring tests: positive cascade, single-GPU negative, cross- node negative, reason=power discriminator, Window + DeltaThreshold configurability, scalar-attr promotion check. - pattern-4-thermal-throttle.md gains a "Detector status" section documenting the integration-pending state. Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
added a commit
that referenced
this pull request
Jun 1, 2026
## Summary Adds `install/kubernetes/tracecore/dashboards/patterns.json` — an operator-facing Grafana 10+ dashboard for the three shipped pattern verdicts (pod_evicted #14, nccl_hang #15, xid_correlation #16), with future-proof templating on `pattern.id` for M17/M18 detectors. Closes #280 (LogQL drift vs Loki native OTLP). Issue #270 (scalar attribute promotion) was closed by PR #275 upstream of this rebase. ### LogQL shape: Loki native OTLP structured metadata Verdict log records carry the verdict scalars as OTLP log-record attributes (PR #275): `pattern.id`, `pattern.confidence`, `k8s.pod.name`, `k8s.pod.namespace`, `k8s.node.name`, `k8s.event.reason`, `nccl.fr.pg_id`, `nccl.fr.collective_seq_id`, `nccl.fr.hanging_ranks_count`, `kernelevents.xid`. The Loki backend recipe (PR #278, `docs/integrations/loki.md`) sends these to Loki 3.0+'s native OTLP endpoint (`/otlp/v1/logs`) via the bundled `otlphttp` exporter. Loki's OTLP receiver lands log attributes as **structured metadata**, queryable as direct LogQL label filters — no `| json` parser stage, no `attributes_` prefix. Dots in attribute names normalize to underscores at the LogQL surface (per Loki upstream docs `docs/sources/shared/otel.md` "Format considerations"): - `pattern.id` → `pattern_id` - `pattern.confidence` → `pattern_confidence` - `k8s.pod.name` → `k8s_pod_name` - `nccl.fr.pg_id` → `nccl_fr_pg_id` - `kernelevents.xid` → `kernelevents_xid` All six verdict panels are written against this shape (e.g. `{job=~"$job"} | pattern_id="14" | k8s_node_name=~".+" [$__auto]`). README documents the Loki native-OTLP install path (cross-refs PR #278) and notes the Promtail/Alloy fallback for operators not on native OTLP (those need `| json pattern_id="attributes_pattern_id"` extraction; not shipped in-tree, fork the JSON). ### Self-telemetry counter still on v0.3 roadmap The spec proposed PromQL queries against `otelcol_processor_patterndetector_verdicts_emitted_total`. That metric does not exist yet — the patterndetectorprocessor README says *"No self-telemetry yet. Self-telemetry is on the v0.3 roadmap"*. Per the spec's constraint (*"DO NOT touch detector code or processor code"*), root-causing the missing metric is out of scope for this PR; the six verdict-derived panels query Loki via LogQL in the meantime. Tracked under #261. When that lands, the six LogQL panels swap to PromQL and the Loki dependency drops. Panel 7 (throughput) queries Prometheus against the upstream OTel-Collector standard `otelcol_processor_{incoming,outgoing}_items` which the collector emits for every processor automatically. ### Panels shipped (7) | # | Title | Datasource | Patterns covered | |---|---|---|---| | 1 | Verdict rate by pattern_id | Loki / LogQL | 14, 15, 16 (templated) | | 2 | Top 10 evicted pods | Loki / LogQL | 14 (pod_evicted) | | 3 | Top 10 hung NCCL collectives | Loki / LogQL | 15 (nccl_hang) | | 4 | Top 10 Xid+eviction correlations | Loki / LogQL | 16 (xid_correlation) | | 5 | Verdict count by node | Loki / LogQL | all (templated) | | 6 | Confidence distribution (full vs partial) | Loki / LogQL | all (templated) | | 7 | patterndetector processor throughput | Prom / PromQL | (pipeline liveness) | ### Templating vars (6) - `prometheus_datasource`, `loki_datasource` — datasource selectors (no hardcoded UIDs). - `job`, `instance` — linter-mandated PromQL matchers, populated from `otelcol_process_uptime`. - `cluster` — multi-cluster slice, populated from `otelcol_process_uptime`. - `pattern_id` — custom-options var seeded with the three shipped IDs (14/15/16). Extend in-place when new detectors land. ### Linter exclusion (justified) `.lint` config waives `target-promql-rule` on the six Loki panels. The dashboard-linter parses every target as PromQL irrespective of `target.datasource.type` and fails on the first `|` in a valid LogQL pipeline. `target-logql-rule` still validates each query as LogQL and passes. Full justification in `install/kubernetes/tracecore/dashboards/README.md` §"Linter exclusions" and inline in the `.lint` `reason:` block. Removable once issue #261 swaps the panels to PromQL. ## Test plan - [x] `dashboard-linter lint --strict --config .lint patterns.json` → exit 0 (built from source — `go install` rejects the linter's own go.mod replace directives; build steps in `install/kubernetes/tracecore/dashboards/README.md`). - [x] `python3 -c "import json; json.load(open('install/kubernetes/tracecore/dashboards/patterns.json'))"` → exit 0; 7 panels, 6 templating vars confirmed. - [x] `make check` (golangci-lint + vet + tidy-check + mod verify) → exit 0. - [x] `make verify` (license-check + doc-check + register-lint + actionlint + zizmor + no-autoupdate) → exit 0. - [x] LogQL shape verified against Loki upstream docs (`docs/sources/shared/otel.md` "Format considerations" — dots and special characters normalize to underscores; no `attributes_` prefix on the native OTLP surface) and `docs/sources/send-data/otel/native_otlp_vs_loki_exporter.md` query examples (`{service_name="auth"} | severity_text="INFO"`). - [x] Promoted scalar attrs verified against `module/processor/patterndetectorprocessor/patterndetector.go` `VerdictAttr*` constants (lines 25-88) and `appendVerdict` / `appendNCCLHangVerdict` / `appendXidCorrelationVerdict` emitters (lines 510-517+). - [ ] Manual smoke test against a live cluster with Loki 3.0+ native OTLP receiver (deferred — adversarial reviewer to verify panels render against actual OTLP-native Loki output before merge). ```release-notes Ship Grafana dashboard for tracecore's pattern verdicts: install/kubernetes/tracecore/dashboards/patterns.json. Seven panels cover the three shipped pattern detectors (pod_evicted #14, nccl_hang #15, xid_correlation #16) plus templated pattern.id for future M17/M18 patterns. LogQL queries target Loki 3.0+ native OTLP structured metadata (pairs with the Loki backend recipe). Includes README install guide (manual upload, grafana-cli, kube-prometheus-stack ConfigMap), Promtail/Alloy fallback notes, and pattern coverage matrix. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Ships NORTHSTAR pattern #5 per docs/patterns/pattern-5-pcie-aer.md. Two layers required for a verdict; a single layer alone emits nothing. - Layer 1 (kernel AER): `PCIe Bus Error: severity=…, type=…` line on a specific `gpu.id` (PCI BDF) extracted by the journald-kernel OTTL recipe. - Layer 2 (rate-collapse): per-GPU `hw.gpu.io` `BytesPerSecond` <= `(1 - RateDropThreshold) * BaselineBytesPerSecond` on the same BDF within the correlation window. Default threshold 0.5 (50% drop) and window 5min. Causality flows forward — the AER must precede the rate-collapse sample. Why two layers required (no Confidence taxonomy, mirroring xid_correlation + hbm_ecc): - AER alone: corrected errors recovered — the link re-trained. Operators see raw AER telemetry on journald already. - Rate-collapse alone: workload-natural Tx/Rx variance (a rank finished a comm phase, etc.). The AER is the hardware-fault confirmation that makes the join operator-actionable. Join key is `gpu.id` (PCI BDF) per RFC-0013 §3, shared across the dmesg AER preamble and the dcgm-exporter `hw.gpu.pci.bdf` resource attribute on `hw.gpu.io`. Node is carried for the verdict prose but not used in the join key — the BDF is the proximate hardware identifier. Defaults: - CorrelationWindow: 5min — mirrors the spec's `rate(...[5m])` PromQL and typical post-AER link-renegotiation latency. - RateDropThreshold: 0.5 — a one-generation PCIe downshift (Gen5 → Gen3 or x16 → x8) lands well past this floor. What's new: - module/pkg/patterns/pcie_aer.go — PCIeAERDetector library type, PCIeAERRecord (AER kernel message), PCIeIORecord (hw.gpu.io rate sample with baseline), PCIeAERVerdict output. PatternIDPCIeAER = "5"; EvidenceKindAER = "pcie_aer"; EvidenceKindPCIeIOCollapse = "hw_io_collapse". - module/pkg/patterns/testdata/pcie_aer_verdict.schema.json — JSON schema with 10 drift falsifiers (extra field, confidence re-add, pattern.id numeric, pattern.id wrong value, severity outside enum, gpu_id empty, drop_ratio negative, drop_ratio over 1, evidence kind outside enum, evidence_trail under min). Verdict struct carries promoted scalar fields (Severity, AERType, DropRatio, GPUID, Node) so the processor wiring can stamp them as top-level OTLP log attributes for dashboard table-aggregation per PR #275's lesson. No dead fields — every struct field is read by either the Evaluate path or the verdict-shape contract. Integration end-to-end firing in a real deployment is blocked on PR-B (issue #260): no OTTL recipe today derives the per-GPU `tracecore.alert.pcie_rate_collapse`-shaped log record from `hw.gpu.io`. The detector library is the v0.3 moat and ships independently per ADR-0001. Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr
pushed a commit
that referenced
this pull request
Jun 1, 2026
Wires the PCIeAERDetector library type into patterndetectorprocessor:
- Config: PCIeAERWindow (yaml: pcie_aer_window, default 5min),
PCIeAERRateDropThreshold (yaml: pcie_aer_rate_drop_threshold,
default 0.5). Validate rejects sub-1s window and threshold
outside [0, 1].
- collectInputs grows two new typed projections behind tighter
discriminators than the existing five:
- projectPCIeAERRecord — gate `kernelevents.pcie_aer.severity`
AND `gpu.id`. Reads severity/type/gpu.id off log attrs, falls
back to resource gpu.id when the journald-kernel OTTL stamps
it on the resource.
- projectPCIeIORecord — gate
`tracecore.alert.pcie_rate_collapse.bytes_per_second` AND
`gpu.id`. Reads BytesPerSecond + Baseline + Direction off log
attrs; the bridge attribute name is namespaced
`tracecore.alert.pcie_rate_collapse.*` so downstream knows it's
a tracecore-derived alert (vs a raw hw.gpu.io scrape sample).
- appendPCIeAERVerdict promotes (GPUID, Severity, AERType,
DropRatio, Node) onto the verdict log record as top-level OTLP
attributes per PR #275's lesson, so dashboards can table-
aggregate without parsing pattern.verdict_json.
Names track OTel semconv (`gpu.id`, `k8s.node.name`) and
recipe-canonical keys (`kernelevents.pcie_aer.severity/.type`).
Wiring tests cover: emits-verdict, AER-alone (no fire),
rate-collapse-alone (no fire), window-configurable, threshold-
configurable, promoted-scalar attribute presence, and the new
Validate guard.
Integration gap (filed separately, not blocking this PR):
1. The journald-kernel OTTL recipe extracts kernelevents.xid +
gpu.id from `NVRM: Xid` lines but does NOT extract
kernelevents.pcie_aer.* from `PCIe Bus Error: …` lines —
needs a sibling OTTL stanza in
docs/integrations/journald-kernel.md.
2. The metrics→logs PCIe rate-collapse alert OTTL recipe is the
PR-B side of issue #260 (ADR-0001 the blocker for all
metrics-sourced patterns).
Until both land, the detector is the v0.3 moat (pattern logic +
tests) and the wiring is configured-but-quiet (zero verdicts on
real input — projections find nothing).
Signed-off-by: Tri Lam <tri@maydow.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #270. Promotes operator-facing scalar fields to top-level OTLP log-record attributes on every verdict the
patterndetectorprocessoremits, so Grafana dashboards (PR #264) and LogQL queries can table-aggregate without server-side JSON parsing ofpattern.verdict_json.Root cause: verdict log records carried scalars only inside
pattern.verdict_json. Fields likeRegarding.Name,HangingRanks,EvictedPodwere buried — 4/7 panels in PR #264 rendered empty. Fixed at source: extended verdict structs with the scalars, then promoted them viaPutStr/PutIntin the emitters.Two commits, one concern each:
feat(patterns): expose operator-facing fields on verdict structs—PodEvictedVerdictgainsPodName/PodNamespace/NodeName/EventReason;XidCorrelationVerdictgainsPodName/PodNamespace.NCCLHangVerdictalready hadPgID/CollectiveSeqID/HangingRanks. Schemas + canonical golden updated;pattern.verdict_jsonremains.feat(processor): promote verdict scalar attrs (closes #270)— extendsVerdictAttr*constants and stamps them on the emitted log record alongside existing attrs. Empty optional strings are skipped (putStrIfSet) so dashboards filtering by attribute presence don't match malformed records.Attribute taxonomy by pattern:
k8s.pod.name,k8s.pod.namespace,k8s.node.name,k8s.event.reasonVerdict.PodName/PodNamespace/NodeName/EventReason(new fields populated fromev.Regarding.{Name,Namespace},ev.ReportingInstance,ev.Reason)nccl.fr.pg_id,nccl.fr.collective_seq_id,nccl.fr.hanging_ranks_countVerdict.PgID,Verdict.CollectiveSeqID,len(Verdict.HangingRanks)(no struct change)kernelevents.xid,k8s.node.name,k8s.pod.name,k8s.pod.namespaceVerdict.XidCode,Verdict.Node,Verdict.PodName,Verdict.PodNamespace(last two new, split fromEvictedPod)Test plan
cd module && go test ./pkg/patterns/... ./processor/... -race -count=1— greenmake build— green (OCB compiles)make check— green (golangci-lint, go vet, go mod verify)module/processor/patterndetectorprocessor/verdict_attrs_test.goasserts each new attribute on the emitted log record per pattern; verified RED before impl, then GREENverdict.schema.json,xid_correlation_verdict.schema.json)TestPatternDetector_GoldenAgainstCanonicalFixtureround-trips JSONEqFollowup
Unblocks PR #264 dashboards — panels 2/3/4/5 should render populated tables once this lands.