Skip to content

feat(dashboards): grafana JSON for pattern verdicts#264

Merged
trilamsr merged 3 commits into
mainfrom
feat/grafana-dashboard-patterns
Jun 1, 2026
Merged

feat(dashboards): grafana JSON for pattern verdicts#264
trilamsr merged 3 commits into
mainfrom
feat/grafana-dashboard-patterns

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds install/kubernetes/tracecore/dashboards/patterns.json — an
operator-facing Grafana 10+ dashboard for the three shipped pattern
verdicts (pod_evicted #14, nccl_hang #15, xid_correlation #16), with
future-proof templating on pattern.id for M17/M18 detectors.

Closes #280 (LogQL drift vs Loki native OTLP). Issue #270 (scalar
attribute promotion) was closed by PR #275 upstream of this rebase.

LogQL shape: Loki native OTLP structured metadata

Verdict log records carry the verdict scalars as OTLP log-record
attributes (PR #275): pattern.id, pattern.confidence,
k8s.pod.name, k8s.pod.namespace, k8s.node.name,
k8s.event.reason, nccl.fr.pg_id, nccl.fr.collective_seq_id,
nccl.fr.hanging_ranks_count, kernelevents.xid. The Loki backend
recipe (PR #278, docs/integrations/loki.md) sends these to Loki
3.0+'s native OTLP endpoint (/otlp/v1/logs) via the bundled
otlphttp exporter.

Loki's OTLP receiver lands log attributes as structured metadata,
queryable as direct LogQL label filters — no | json parser stage,
no attributes_ prefix. Dots in attribute names normalize to
underscores at the LogQL surface (per Loki upstream docs
docs/sources/shared/otel.md "Format considerations"):

  • pattern.idpattern_id
  • pattern.confidencepattern_confidence
  • k8s.pod.namek8s_pod_name
  • nccl.fr.pg_idnccl_fr_pg_id
  • kernelevents.xidkernelevents_xid

All six verdict panels are written against this shape (e.g.
{job=~"$job"} | pattern_id="14" | k8s_node_name=~".+" [$__auto]).
README documents the Loki native-OTLP install path (cross-refs PR
#278) and notes the Promtail/Alloy fallback for operators not on
native OTLP (those need | json pattern_id="attributes_pattern_id"
extraction; not shipped in-tree, fork the JSON).

Self-telemetry counter still on v0.3 roadmap

The spec proposed PromQL queries against
otelcol_processor_patterndetector_verdicts_emitted_total. That
metric does not exist yet — the patterndetectorprocessor README says
"No self-telemetry yet. Self-telemetry is on the v0.3 roadmap".
Per the spec's constraint ("DO NOT touch detector code or processor
code"
), root-causing the missing metric is out of scope for this
PR; the six verdict-derived panels query Loki via LogQL in the
meantime. Tracked under #261. When that lands, the six LogQL panels
swap to PromQL and the Loki dependency drops.

Panel 7 (throughput) queries Prometheus against the upstream
OTel-Collector standard otelcol_processor_{incoming,outgoing}_items
which the collector emits for every processor automatically.

Panels shipped (7)

# Title Datasource Patterns covered
1 Verdict rate by pattern_id Loki / LogQL 14, 15, 16 (templated)
2 Top 10 evicted pods Loki / LogQL 14 (pod_evicted)
3 Top 10 hung NCCL collectives Loki / LogQL 15 (nccl_hang)
4 Top 10 Xid+eviction correlations Loki / LogQL 16 (xid_correlation)
5 Verdict count by node Loki / LogQL all (templated)
6 Confidence distribution (full vs partial) Loki / LogQL all (templated)
7 patterndetector processor throughput Prom / PromQL (pipeline liveness)

Templating vars (6)

  • prometheus_datasource, loki_datasource — datasource selectors (no
    hardcoded UIDs).
  • job, instance — linter-mandated PromQL matchers, populated from
    otelcol_process_uptime.
  • cluster — multi-cluster slice, populated from otelcol_process_uptime.
  • pattern_id — custom-options var seeded with the three shipped IDs
    (14/15/16). Extend in-place when new detectors land.

Linter exclusion (justified)

.lint config waives target-promql-rule on the six Loki panels. The
dashboard-linter parses every target as PromQL irrespective of
target.datasource.type and fails on the first | in a valid LogQL
pipeline. target-logql-rule still validates each query as LogQL and
passes. Full justification in
install/kubernetes/tracecore/dashboards/README.md §"Linter exclusions"
and inline in the .lint reason: block. Removable once issue #261
swaps the panels to PromQL.

Test plan

  • dashboard-linter lint --strict --config .lint patterns.json
    → exit 0 (built from source — go install rejects the linter's
    own go.mod replace directives; build steps in
    install/kubernetes/tracecore/dashboards/README.md).
  • python3 -c "import json; json.load(open('install/kubernetes/tracecore/dashboards/patterns.json'))"
    → exit 0; 7 panels, 6 templating vars confirmed.
  • make check (golangci-lint + vet + tidy-check + mod verify) →
    exit 0.
  • make verify (license-check + doc-check + register-lint +
    actionlint + zizmor + no-autoupdate) → exit 0.
  • LogQL shape verified against Loki upstream docs
    (docs/sources/shared/otel.md "Format considerations" — dots
    and special characters normalize to underscores; no
    attributes_ prefix on the native OTLP surface) and
    docs/sources/send-data/otel/native_otlp_vs_loki_exporter.md
    query examples ({service_name="auth"} | severity_text="INFO").
  • Promoted scalar attrs verified against
    module/processor/patterndetectorprocessor/patterndetector.go
    VerdictAttr* constants (lines 25-88) and appendVerdict /
    appendNCCLHangVerdict / appendXidCorrelationVerdict emitters
    (lines 510-517+).
  • Manual smoke test against a live cluster with Loki 3.0+ native
    OTLP receiver (deferred — adversarial reviewer to verify panels
    render against actual OTLP-native Loki output before merge).
Ship Grafana dashboard for tracecore's pattern verdicts: install/kubernetes/tracecore/dashboards/patterns.json. Seven panels cover the three shipped pattern detectors (pod_evicted #14, nccl_hang #15, xid_correlation #16) plus templated pattern.id for future M17/M18 patterns. LogQL queries target Loki 3.0+ native OTLP structured metadata (pairs with the Loki backend recipe). Includes README install guide (manual upload, grafana-cli, kube-prometheus-stack ConfigMap), Promtail/Alloy fallback notes, and pattern coverage matrix.

trilamsr added a commit that referenced this pull request Jun 1, 2026
…cords (closes #270) (#275)

## Summary

Closes #270. Promotes operator-facing scalar fields to top-level OTLP
log-record attributes on every verdict the `patterndetectorprocessor`
emits, so Grafana dashboards (PR #264) and LogQL queries can
table-aggregate without server-side JSON parsing of
`pattern.verdict_json`.

Root cause: verdict log records carried scalars only inside
`pattern.verdict_json`. Fields like `Regarding.Name`, `HangingRanks`,
`EvictedPod` were buried — 4/7 panels in PR #264 rendered empty. Fixed
at source: extended verdict structs with the scalars, then promoted them
via `PutStr` / `PutInt` in the emitters.

Two commits, one concern each:

1. `feat(patterns): expose operator-facing fields on verdict structs` —
`PodEvictedVerdict` gains `PodName` / `PodNamespace` / `NodeName` /
`EventReason`; `XidCorrelationVerdict` gains `PodName` / `PodNamespace`.
`NCCLHangVerdict` already had `PgID` / `CollectiveSeqID` /
`HangingRanks`. Schemas + canonical golden updated;
`pattern.verdict_json` remains.
2. `feat(processor): promote verdict scalar attrs (closes #270)` —
extends `VerdictAttr*` constants and stamps them on the emitted log
record alongside existing attrs. Empty optional strings are skipped
(`putStrIfSet`) so dashboards filtering by attribute presence don't
match malformed records.

Attribute taxonomy by pattern:

| pattern | new attrs | mapped from |
| --- | --- | --- |
| pod_evicted (14) | `k8s.pod.name`, `k8s.pod.namespace`,
`k8s.node.name`, `k8s.event.reason` | `Verdict.PodName` / `PodNamespace`
/ `NodeName` / `EventReason` (new fields populated from
`ev.Regarding.{Name,Namespace}`, `ev.ReportingInstance`, `ev.Reason`) |
| nccl_hang (15) | `nccl.fr.pg_id`, `nccl.fr.collective_seq_id`,
`nccl.fr.hanging_ranks_count` | `Verdict.PgID`,
`Verdict.CollectiveSeqID`, `len(Verdict.HangingRanks)` (no struct
change) |
| xid_correlation (16) | `kernelevents.xid`, `k8s.node.name`,
`k8s.pod.name`, `k8s.pod.namespace` | `Verdict.XidCode`, `Verdict.Node`,
`Verdict.PodName`, `Verdict.PodNamespace` (last two new, split from
`EvictedPod`) |

## Test plan

- [x] `cd module && go test ./pkg/patterns/... ./processor/... -race
-count=1` — green
- [x] `make build` — green (OCB compiles)
- [x] `make check` — green (golangci-lint, go vet, go mod verify)
- [x] New TDD red-first test:
`module/processor/patterndetectorprocessor/verdict_attrs_test.go`
asserts each new attribute on the emitted log record per pattern;
verified RED before impl, then GREEN
- [x] Schema-conformance tests for all three verdicts pass against
updated schemas (`verdict.schema.json`,
`xid_correlation_verdict.schema.json`)
- [x] Schema-drift battery still rejects the documented mutation classes
(additionalProperties:false / enum guards / minItems / type guards)
- [x] Canonical pod_evicted golden updated to include the new scalars;
`TestPatternDetector_GoldenAgainstCanonicalFixture` round-trips JSONEq

## Followup

Unblocks PR #264 dashboards — panels 2/3/4/5 should render populated
tables once this lands.

```release-notes
patterndetectorprocessor verdict log records now carry operator-facing scalars as top-level OTLP attributes (k8s.pod.name, k8s.pod.namespace, k8s.node.name, k8s.event.reason on pod_evicted; nccl.fr.pg_id, nccl.fr.collective_seq_id, nccl.fr.hanging_ranks_count on nccl_hang; kernelevents.xid + k8s.node.name + k8s.pod.name + k8s.pod.namespace on xid_correlation). Dashboards and LogQL queries can table-aggregate without parsing pattern.verdict_json. The full pattern.verdict_json attribute remains unchanged.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Adds the Grafana Loki backend recipe (roadmap A11) — pairs with the
ClickHouse / Datadog / Honeycomb / generic-OTel recipes for the
exporter-side install path.

- Loki 3.0+ accepts OTLP/HTTP natively at `/otlp/v1/logs`, so the
  recipe uses the bundled `otlphttp` exporter — no Loki-specific
  exporter is needed and the deprecated contrib `lokiexporter` stays
  out of the OCB distro (RFC-0013 §2 adoption matrix). Adopt-over-
  build holds.
- Documents the `X-Scope-OrgID` tenant header (required for
  `auth_enabled: true` clusters; dropped for single-tenant).
- Documents the labels-vs-structured-metadata mapping (Loki's
  canonical cardinality footgun): OTLP **resource** attributes flow
  to stream labels via the distributor's
  `default_resource_attributes_as_index_labels` allow-list; OTLP
  **scope** and **log** attributes flow to structured metadata.
  Tracecore's `pattern.*` verdict attributes ship as log attributes,
  so they land as structured metadata by default — exactly the shape
  PR #264's Grafana panels query as `attributes_pattern_id`.
- Multi-tenant gateway + Grafana Enterprise Logs features are noted
  as optional, not required.

### Why this unblocks PR #264

PR #264's six LogQL panels query `attributes_pattern_id`,
`attributes_k8s_node_name`, etc. against Loki. Those LogQL keys exist
because the Loki distributor maps OTLP log attributes to structured
metadata with the `attributes_` prefix (and dots → underscores). This
recipe is the install path that makes that mapping default-correct on
day one; without it, an operator can accidentally promote `pattern.*`
to labels and blow out the stream-cardinality budget.

## Test plan

- [x] `make doc-check` — exit 0 (504 markdown links resolve, 8
      integration recipes carry tested-against+last-verified,
      docs/README.md indexes every recipe).
- [x] `make validator-recipe` — exit 0; the new
      `docs/integrations/loki.md` is explicitly validated by the
in-tree `tracecore validate
--config=docs/integrations/examples/loki.yaml`
      (7 of 9 recipes validated, 2 skipped on darwin per existing
      requires-linux / requires-k8s-cluster markers).
- [x] `make check` — golangci-lint + vet + tidy-check + mod-verify,
      exit 0.
- [x] `make verify` — pre-push hook ran (license-check, doc-check,
      register-lint, actionlint, zizmor, no-autoupdate), exit 0.
- [x] Confirmed Loki's `X-Scope-OrgID` semantics and the
      `default_resource_attributes_as_index_labels` list against
      upstream docs (`grafana.com/docs/loki/latest/send-data/otel/`
      and Loki repo `docs/sources/get-started/labels/`).
- [x] Confirmed `pattern.*` attribute names against
      `module/processor/patterndetectorprocessor/verdict_attrs.go`
      (constants `VerdictAttrPatternID` ... `VerdictAttrVerdictJSON`).
- [x] PR #264's LogQL key shape (`attributes_pattern_id`) verified
      against Loki's OTLP attribute-mapping convention (log
      attributes → structured metadata, dots → underscores,
      `attributes_` prefix). Recipe cross-references PR #264 in the
      "See also" section.

```release-notes
Add Grafana Loki backend recipe: docs/integrations/loki.md + docs/integrations/examples/loki.yaml. Uses the bundled `otlphttp` exporter against Loki 3.0+'s native OTLP endpoint (`/otlp/v1/logs`) with `X-Scope-OrgID` tenant routing. Documents the labels-vs-structured-metadata mapping that makes PR #264's Grafana dashboard panels (`attributes_pattern_id`, `attributes_k8s_node_name`) default-correct.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Adds install/kubernetes/tracecore/dashboards/patterns.json — an
operator-facing Grafana 10+ dashboard for the three shipped pattern
verdicts (pod_evicted #14, nccl_hang #15, xid_correlation #16) plus
future-proof templating on pattern.id.

Seven panels:

1. Verdict rate by pattern_id (timeseries, LogQL)
2. Top 10 evicted pods — pod_evicted (table, LogQL)
3. Top 10 hung NCCL collectives — nccl_hang (table, LogQL)
4. Top 10 Xid+eviction correlations — xid_correlation (table, LogQL)
5. Verdict count by node (bargauge, LogQL)
6. Confidence distribution full vs partial (piechart, LogQL)
7. patterndetector processor throughput (timeseries, PromQL)

The verdict signal lives in OTLP log records (the processor's README
explicitly defers self-tel metrics to v0.3 — tracked in #261), so the
six verdict-derived panels query Loki via LogQL and the throughput
panel queries Prometheus for the upstream-standard otelcol_processor_*
items metrics. Two datasource template vars (prometheus_datasource,
loki_datasource) keep both backends pluggable on import.

Bundled .lint config waives target-promql-rule on the six LogQL
panels — the dashboard-linter parses every target as PromQL
irrespective of target.datasource.type and fails on the first | in a
LogQL pipeline. The target-logql-rule still validates each query.

dashboard-linter lint --strict --config .lint patterns.json → exit 0.
json.load → 7 panels, 6 templating vars.

README documents prerequisites (Prometheus + Loki datasources, the
patterndetector processor wired into service.pipelines.logs), three
install paths (manual upload, grafana-cli API, kube-prometheus-stack
ConfigMap), the pattern → panel coverage matrix, the self-tel gap, and
the lint-exclusion justification.

Refs #261

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr force-pushed the feat/grafana-dashboard-patterns branch from 6d8c0d5 to efa57d8 Compare June 1, 2026 04:42
Verdict panels queried OTLP attributes via `| json pattern_id="attributes_pattern_id"` (Promtail/Alloy shape), which returns empty against Loki 3.0+ native OTLP ingestion (PR #278). Loki normalizes OTLP log attributes into structured metadata with dots → underscores at the LogQL surface (`pattern.id` → `pattern_id`), no `attributes_` prefix, no JSON parse stage.

All 6 verdict panels now use direct structured-metadata filters (`| pattern_id="14" | k8s_node_name=~".+"`). top-N tables aggregate on promoted scalar attrs from PR #275 (`k8s_pod_name`, `k8s_pod_namespace`, `nccl_fr_pg_id`, `nccl_fr_collective_seq_id`, `kernelevents_xid`). Panel 6 reads `pattern_confidence` (was `attributes_pattern_confidence` via json).

README documents the Loki native-OTLP install path (cross-refs PR #278 recipe) and notes the Promtail/Alloy fallback shape for operators not on native OTLP. Same 7 panels, no new panels.

Verification: dashboard-linter --strict exit 0; `python3 -c 'import json; json.load(...)'` exit 0; `make check` + `make verify` clean.
@trilamsr trilamsr force-pushed the feat/grafana-dashboard-patterns branch from efa57d8 to 1ed52e1 Compare June 1, 2026 04:43
@trilamsr

trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

Rebased onto current origin/main (now carries #275 verdict-attrs and #278 Loki recipe). Force-pushed; new HEAD 1ed52e1.

LogQL rewrite (closes #280). All six verdict panel targets now use direct Loki structured-metadata filters — | pattern_id="14" | k8s_node_name=~".+" [$__auto] form — instead of the prior | json pattern_id="attributes_pattern_id" extraction. Root cause: Loki's native OTLP receiver lands log attributes as structured metadata with dots normalized to underscores at the LogQL surface (per Loki upstream docs/sources/shared/otel.md "Format considerations" + docs/sources/send-data/otel/native_otlp_vs_loki_exporter.md examples like {service_name="auth"} | severity_text="INFO"). No | json parser stage, no attributes_ prefix. The old shape returned empty against the Loki recipe (#278) install path.

Panel-by-panel key mapping (all 6 verdict panels):

  • Panel 1 (rate): pattern_id=~"$pattern_id"
  • Panel 2 (evicted pods): pattern_id="14" | k8s_pod_name=~".+" | k8s_node_name=~".+", groups on k8s_pod_name, k8s_pod_namespace, k8s_node_name
  • Panel 3 (nccl hangs): pattern_id="15", groups on nccl_fr_pg_id, nccl_fr_collective_seq_id
  • Panel 4 (xid+evict): pattern_id="16", groups on k8s_node_name, kernelevents_xid, k8s_pod_name
  • Panel 5 (by node): groups on k8s_node_name, filters pattern_id=~"$pattern_id"
  • Panel 6 (confidence): groups on pattern_confidence, filters pattern_id=~"$pattern_id"

README updates. Prerequisites section now states "Loki 3.0+ with native OTLP ingestion" explicitly, links the Loki backend recipe (#278) as the canonical install path, and documents the Promtail/Alloy fallback shape for operators not on native OTLP (fork the JSON + add | json extraction; not shipped in-tree). The "Contributing a new panel" section now points authors at the VerdictAttr* constants in patterndetector.go instead of the old | json selector pattern.

Verification.

  • dashboard-linter lint --strict --config .lint patterns.json → exit 0
  • python3 -c "import json; json.load(open('patterns.json'))" → exit 0 (7 panels, 6 templating vars)
  • make check, make verify → exit 0

Out of scope (intentional). No new panels (same 7), no processor code touched (verdict_attrs shipped via #275), no Loki recipe doc edits (PR #278's docs/integrations/loki.md says attributes_pattern_id is the surface form, which contradicts upstream Loki native-OTLP docs — flagged for a follow-up correction; not blocking this PR).

Do not auto-merge. Adversarial reviewer should verify panels render against an actual OTLP-native Loki output (the manual smoke-test test-plan item remains unchecked).

PR #274 (hbm_ecc detector) landed during PR #264's rebase window.
Dashboard template options + README coverage matrix stale post-merge:

- pattern_id template: add '3 (hbm_ecc)' option
- README: promote pattern.id=3 from 'not shipped' to 'shipped' row;
  list Verdict rate (panel 1) + Verdict by node (panel 5) as the
  panels that cover it (no dedicated top-N table this round; can
  add in a follow-up when operator-facing HBM ECC fields stabilize)

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 04:54
@trilamsr trilamsr merged commit 65e7b9c into main Jun 1, 2026
14 checks passed
@trilamsr trilamsr deleted the feat/grafana-dashboard-patterns branch June 1, 2026 05:03
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
native OTLP path = bare-underscored (pattern_id, k8s_node_name).
attributes_* prefix is Promtail/Alloy JSON-extraction surface.
align loki.md with dashboard PR #264's native-OTLP shape.

closes #283

Signed-off-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PR #264 dashboards: rewrite LogQL from json-extract to OTLP-native structured-metadata filters (#278 follow-up)

1 participant