feat(dashboards): grafana JSON for pattern verdicts#264
Conversation
…cords (closes #270) (#275) ## Summary Closes #270. Promotes operator-facing scalar fields to top-level OTLP log-record attributes on every verdict the `patterndetectorprocessor` emits, so Grafana dashboards (PR #264) and LogQL queries can table-aggregate without server-side JSON parsing of `pattern.verdict_json`. Root cause: verdict log records carried scalars only inside `pattern.verdict_json`. Fields like `Regarding.Name`, `HangingRanks`, `EvictedPod` were buried — 4/7 panels in PR #264 rendered empty. Fixed at source: extended verdict structs with the scalars, then promoted them via `PutStr` / `PutInt` in the emitters. Two commits, one concern each: 1. `feat(patterns): expose operator-facing fields on verdict structs` — `PodEvictedVerdict` gains `PodName` / `PodNamespace` / `NodeName` / `EventReason`; `XidCorrelationVerdict` gains `PodName` / `PodNamespace`. `NCCLHangVerdict` already had `PgID` / `CollectiveSeqID` / `HangingRanks`. Schemas + canonical golden updated; `pattern.verdict_json` remains. 2. `feat(processor): promote verdict scalar attrs (closes #270)` — extends `VerdictAttr*` constants and stamps them on the emitted log record alongside existing attrs. Empty optional strings are skipped (`putStrIfSet`) so dashboards filtering by attribute presence don't match malformed records. Attribute taxonomy by pattern: | pattern | new attrs | mapped from | | --- | --- | --- | | pod_evicted (14) | `k8s.pod.name`, `k8s.pod.namespace`, `k8s.node.name`, `k8s.event.reason` | `Verdict.PodName` / `PodNamespace` / `NodeName` / `EventReason` (new fields populated from `ev.Regarding.{Name,Namespace}`, `ev.ReportingInstance`, `ev.Reason`) | | nccl_hang (15) | `nccl.fr.pg_id`, `nccl.fr.collective_seq_id`, `nccl.fr.hanging_ranks_count` | `Verdict.PgID`, `Verdict.CollectiveSeqID`, `len(Verdict.HangingRanks)` (no struct change) | | xid_correlation (16) | `kernelevents.xid`, `k8s.node.name`, `k8s.pod.name`, `k8s.pod.namespace` | `Verdict.XidCode`, `Verdict.Node`, `Verdict.PodName`, `Verdict.PodNamespace` (last two new, split from `EvictedPod`) | ## Test plan - [x] `cd module && go test ./pkg/patterns/... ./processor/... -race -count=1` — green - [x] `make build` — green (OCB compiles) - [x] `make check` — green (golangci-lint, go vet, go mod verify) - [x] New TDD red-first test: `module/processor/patterndetectorprocessor/verdict_attrs_test.go` asserts each new attribute on the emitted log record per pattern; verified RED before impl, then GREEN - [x] Schema-conformance tests for all three verdicts pass against updated schemas (`verdict.schema.json`, `xid_correlation_verdict.schema.json`) - [x] Schema-drift battery still rejects the documented mutation classes (additionalProperties:false / enum guards / minItems / type guards) - [x] Canonical pod_evicted golden updated to include the new scalars; `TestPatternDetector_GoldenAgainstCanonicalFixture` round-trips JSONEq ## Followup Unblocks PR #264 dashboards — panels 2/3/4/5 should render populated tables once this lands. ```release-notes patterndetectorprocessor verdict log records now carry operator-facing scalars as top-level OTLP attributes (k8s.pod.name, k8s.pod.namespace, k8s.node.name, k8s.event.reason on pod_evicted; nccl.fr.pg_id, nccl.fr.collective_seq_id, nccl.fr.hanging_ranks_count on nccl_hang; kernelevents.xid + k8s.node.name + k8s.pod.name + k8s.pod.namespace on xid_correlation). Dashboards and LogQL queries can table-aggregate without parsing pattern.verdict_json. The full pattern.verdict_json attribute remains unchanged. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
## Summary Adds the Grafana Loki backend recipe (roadmap A11) — pairs with the ClickHouse / Datadog / Honeycomb / generic-OTel recipes for the exporter-side install path. - Loki 3.0+ accepts OTLP/HTTP natively at `/otlp/v1/logs`, so the recipe uses the bundled `otlphttp` exporter — no Loki-specific exporter is needed and the deprecated contrib `lokiexporter` stays out of the OCB distro (RFC-0013 §2 adoption matrix). Adopt-over- build holds. - Documents the `X-Scope-OrgID` tenant header (required for `auth_enabled: true` clusters; dropped for single-tenant). - Documents the labels-vs-structured-metadata mapping (Loki's canonical cardinality footgun): OTLP **resource** attributes flow to stream labels via the distributor's `default_resource_attributes_as_index_labels` allow-list; OTLP **scope** and **log** attributes flow to structured metadata. Tracecore's `pattern.*` verdict attributes ship as log attributes, so they land as structured metadata by default — exactly the shape PR #264's Grafana panels query as `attributes_pattern_id`. - Multi-tenant gateway + Grafana Enterprise Logs features are noted as optional, not required. ### Why this unblocks PR #264 PR #264's six LogQL panels query `attributes_pattern_id`, `attributes_k8s_node_name`, etc. against Loki. Those LogQL keys exist because the Loki distributor maps OTLP log attributes to structured metadata with the `attributes_` prefix (and dots → underscores). This recipe is the install path that makes that mapping default-correct on day one; without it, an operator can accidentally promote `pattern.*` to labels and blow out the stream-cardinality budget. ## Test plan - [x] `make doc-check` — exit 0 (504 markdown links resolve, 8 integration recipes carry tested-against+last-verified, docs/README.md indexes every recipe). - [x] `make validator-recipe` — exit 0; the new `docs/integrations/loki.md` is explicitly validated by the in-tree `tracecore validate --config=docs/integrations/examples/loki.yaml` (7 of 9 recipes validated, 2 skipped on darwin per existing requires-linux / requires-k8s-cluster markers). - [x] `make check` — golangci-lint + vet + tidy-check + mod-verify, exit 0. - [x] `make verify` — pre-push hook ran (license-check, doc-check, register-lint, actionlint, zizmor, no-autoupdate), exit 0. - [x] Confirmed Loki's `X-Scope-OrgID` semantics and the `default_resource_attributes_as_index_labels` list against upstream docs (`grafana.com/docs/loki/latest/send-data/otel/` and Loki repo `docs/sources/get-started/labels/`). - [x] Confirmed `pattern.*` attribute names against `module/processor/patterndetectorprocessor/verdict_attrs.go` (constants `VerdictAttrPatternID` ... `VerdictAttrVerdictJSON`). - [x] PR #264's LogQL key shape (`attributes_pattern_id`) verified against Loki's OTLP attribute-mapping convention (log attributes → structured metadata, dots → underscores, `attributes_` prefix). Recipe cross-references PR #264 in the "See also" section. ```release-notes Add Grafana Loki backend recipe: docs/integrations/loki.md + docs/integrations/examples/loki.yaml. Uses the bundled `otlphttp` exporter against Loki 3.0+'s native OTLP endpoint (`/otlp/v1/logs`) with `X-Scope-OrgID` tenant routing. Documents the labels-vs-structured-metadata mapping that makes PR #264's Grafana dashboard panels (`attributes_pattern_id`, `attributes_k8s_node_name`) default-correct. ``` --------- Signed-off-by: Tri Lam <tri@maydow.com> Co-authored-by: Tri Lam <tri@maydow.com>
Adds install/kubernetes/tracecore/dashboards/patterns.json — an operator-facing Grafana 10+ dashboard for the three shipped pattern verdicts (pod_evicted #14, nccl_hang #15, xid_correlation #16) plus future-proof templating on pattern.id. Seven panels: 1. Verdict rate by pattern_id (timeseries, LogQL) 2. Top 10 evicted pods — pod_evicted (table, LogQL) 3. Top 10 hung NCCL collectives — nccl_hang (table, LogQL) 4. Top 10 Xid+eviction correlations — xid_correlation (table, LogQL) 5. Verdict count by node (bargauge, LogQL) 6. Confidence distribution full vs partial (piechart, LogQL) 7. patterndetector processor throughput (timeseries, PromQL) The verdict signal lives in OTLP log records (the processor's README explicitly defers self-tel metrics to v0.3 — tracked in #261), so the six verdict-derived panels query Loki via LogQL and the throughput panel queries Prometheus for the upstream-standard otelcol_processor_* items metrics. Two datasource template vars (prometheus_datasource, loki_datasource) keep both backends pluggable on import. Bundled .lint config waives target-promql-rule on the six LogQL panels — the dashboard-linter parses every target as PromQL irrespective of target.datasource.type and fails on the first | in a LogQL pipeline. The target-logql-rule still validates each query. dashboard-linter lint --strict --config .lint patterns.json → exit 0. json.load → 7 panels, 6 templating vars. README documents prerequisites (Prometheus + Loki datasources, the patterndetector processor wired into service.pipelines.logs), three install paths (manual upload, grafana-cli API, kube-prometheus-stack ConfigMap), the pattern → panel coverage matrix, the self-tel gap, and the lint-exclusion justification. Refs #261 Signed-off-by: Tri Lam <tri@maydow.com>
6d8c0d5 to
efa57d8
Compare
Verdict panels queried OTLP attributes via `| json pattern_id="attributes_pattern_id"` (Promtail/Alloy shape), which returns empty against Loki 3.0+ native OTLP ingestion (PR #278). Loki normalizes OTLP log attributes into structured metadata with dots → underscores at the LogQL surface (`pattern.id` → `pattern_id`), no `attributes_` prefix, no JSON parse stage. All 6 verdict panels now use direct structured-metadata filters (`| pattern_id="14" | k8s_node_name=~".+"`). top-N tables aggregate on promoted scalar attrs from PR #275 (`k8s_pod_name`, `k8s_pod_namespace`, `nccl_fr_pg_id`, `nccl_fr_collective_seq_id`, `kernelevents_xid`). Panel 6 reads `pattern_confidence` (was `attributes_pattern_confidence` via json). README documents the Loki native-OTLP install path (cross-refs PR #278 recipe) and notes the Promtail/Alloy fallback shape for operators not on native OTLP. Same 7 panels, no new panels. Verification: dashboard-linter --strict exit 0; `python3 -c 'import json; json.load(...)'` exit 0; `make check` + `make verify` clean.
efa57d8 to
1ed52e1
Compare
|
Rebased onto current LogQL rewrite (closes #280). All six verdict panel targets now use direct Loki structured-metadata filters — Panel-by-panel key mapping (all 6 verdict panels):
README updates. Prerequisites section now states "Loki 3.0+ with native OTLP ingestion" explicitly, links the Loki backend recipe (#278) as the canonical install path, and documents the Promtail/Alloy fallback shape for operators not on native OTLP (fork the JSON + add Verification.
Out of scope (intentional). No new panels (same 7), no processor code touched (verdict_attrs shipped via #275), no Loki recipe doc edits (PR #278's Do not auto-merge. Adversarial reviewer should verify panels render against an actual OTLP-native Loki output (the manual smoke-test test-plan item remains unchecked). |
PR #274 (hbm_ecc detector) landed during PR #264's rebase window. Dashboard template options + README coverage matrix stale post-merge: - pattern_id template: add '3 (hbm_ecc)' option - README: promote pattern.id=3 from 'not shipped' to 'shipped' row; list Verdict rate (panel 1) + Verdict by node (panel 5) as the panels that cover it (no dedicated top-N table this round; can add in a follow-up when operator-facing HBM ECC fields stabilize) Signed-off-by: Tri Lam <tri@maydow.com>
Summary
Adds
install/kubernetes/tracecore/dashboards/patterns.json— anoperator-facing Grafana 10+ dashboard for the three shipped pattern
verdicts (pod_evicted #14, nccl_hang #15, xid_correlation #16), with
future-proof templating on
pattern.idfor M17/M18 detectors.Closes #280 (LogQL drift vs Loki native OTLP). Issue #270 (scalar
attribute promotion) was closed by PR #275 upstream of this rebase.
LogQL shape: Loki native OTLP structured metadata
Verdict log records carry the verdict scalars as OTLP log-record
attributes (PR #275):
pattern.id,pattern.confidence,k8s.pod.name,k8s.pod.namespace,k8s.node.name,k8s.event.reason,nccl.fr.pg_id,nccl.fr.collective_seq_id,nccl.fr.hanging_ranks_count,kernelevents.xid. The Loki backendrecipe (PR #278,
docs/integrations/loki.md) sends these to Loki3.0+'s native OTLP endpoint (
/otlp/v1/logs) via the bundledotlphttpexporter.Loki's OTLP receiver lands log attributes as structured metadata,
queryable as direct LogQL label filters — no
| jsonparser stage,no
attributes_prefix. Dots in attribute names normalize tounderscores at the LogQL surface (per Loki upstream docs
docs/sources/shared/otel.md"Format considerations"):pattern.id→pattern_idpattern.confidence→pattern_confidencek8s.pod.name→k8s_pod_namenccl.fr.pg_id→nccl_fr_pg_idkernelevents.xid→kernelevents_xidAll six verdict panels are written against this shape (e.g.
{job=~"$job"} | pattern_id="14" | k8s_node_name=~".+" [$__auto]).README documents the Loki native-OTLP install path (cross-refs PR
#278) and notes the Promtail/Alloy fallback for operators not on
native OTLP (those need
| json pattern_id="attributes_pattern_id"extraction; not shipped in-tree, fork the JSON).
Self-telemetry counter still on v0.3 roadmap
The spec proposed PromQL queries against
otelcol_processor_patterndetector_verdicts_emitted_total. Thatmetric does not exist yet — the patterndetectorprocessor README says
"No self-telemetry yet. Self-telemetry is on the v0.3 roadmap".
Per the spec's constraint ("DO NOT touch detector code or processor
code"), root-causing the missing metric is out of scope for this
PR; the six verdict-derived panels query Loki via LogQL in the
meantime. Tracked under #261. When that lands, the six LogQL panels
swap to PromQL and the Loki dependency drops.
Panel 7 (throughput) queries Prometheus against the upstream
OTel-Collector standard
otelcol_processor_{incoming,outgoing}_itemswhich the collector emits for every processor automatically.
Panels shipped (7)
Templating vars (6)
prometheus_datasource,loki_datasource— datasource selectors (nohardcoded UIDs).
job,instance— linter-mandated PromQL matchers, populated fromotelcol_process_uptime.cluster— multi-cluster slice, populated fromotelcol_process_uptime.pattern_id— custom-options var seeded with the three shipped IDs(14/15/16). Extend in-place when new detectors land.
Linter exclusion (justified)
.lintconfig waivestarget-promql-ruleon the six Loki panels. Thedashboard-linter parses every target as PromQL irrespective of
target.datasource.typeand fails on the first|in a valid LogQLpipeline.
target-logql-rulestill validates each query as LogQL andpasses. Full justification in
install/kubernetes/tracecore/dashboards/README.md§"Linter exclusions"and inline in the
.lintreason:block. Removable once issue #261swaps the panels to PromQL.
Test plan
dashboard-linter lint --strict --config .lint patterns.json→ exit 0 (built from source —
go installrejects the linter'sown go.mod replace directives; build steps in
install/kubernetes/tracecore/dashboards/README.md).python3 -c "import json; json.load(open('install/kubernetes/tracecore/dashboards/patterns.json'))"→ exit 0; 7 panels, 6 templating vars confirmed.
make check(golangci-lint + vet + tidy-check + mod verify) →exit 0.
make verify(license-check + doc-check + register-lint +actionlint + zizmor + no-autoupdate) → exit 0.
(
docs/sources/shared/otel.md"Format considerations" — dotsand special characters normalize to underscores; no
attributes_prefix on the native OTLP surface) anddocs/sources/send-data/otel/native_otlp_vs_loki_exporter.mdquery examples (
{service_name="auth"} | severity_text="INFO").module/processor/patterndetectorprocessor/patterndetector.goVerdictAttr*constants (lines 25-88) andappendVerdict/appendNCCLHangVerdict/appendXidCorrelationVerdictemitters(lines 510-517+).
OTLP receiver (deferred — adversarial reviewer to verify panels
render against actual OTLP-native Loki output before merge).