docs(integrations): loki backend recipe #278
Merged
Native OTLP-HTTP path via the bundled otlphttp exporter: Loki 3.0+ accepts OTLP at /otlp/v1/logs, so no Loki-specific exporter is needed (the deprecated contrib lokiexporter is intentionally not bundled). The recipe documents the X-Scope-OrgID tenant header, the labels-vs-structured-metadata mapping (the canonical Loki cardinality footgun), and the otlp_config knobs Loki operators need to keep tracecore's pattern.* verdict attributes as structured metadata, where they are queryable as attributes_pattern_id in LogQL (the same shape PR #264's Grafana dashboard panels expect). Signed-off-by: Tri Lam <tri@maydow.com>
Per fresh-context review of #278: cited verdict_attrs.go (which doesn't exist); constants live in patterndetector.go. Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request on Jun 1, 2026
Verdict panels queried OTLP attributes via `| json pattern_id="attributes_pattern_id"` (Promtail/Alloy shape), which returns empty against Loki 3.0+ native OTLP ingestion (PR #278). Loki normalizes OTLP log attributes into structured metadata with dots → underscores at the LogQL surface (`pattern.id` → `pattern_id`), no `attributes_` prefix, no JSON parse stage.

All 6 verdict panels now use direct structured-metadata filters (`| pattern_id="14" | k8s_node_name=~".+"`). Top-N tables aggregate on promoted scalar attrs from PR #275 (`k8s_pod_name`, `k8s_pod_namespace`, `nccl_fr_pg_id`, `nccl_fr_collective_seq_id`, `kernelevents_xid`). Panel 6 reads `pattern_confidence` (was `attributes_pattern_confidence` via json).

README documents the Loki native-OTLP install path (cross-refs PR #278 recipe) and notes the Promtail/Alloy fallback shape for operators not on native OTLP. Same 7 panels, no new panels.

Verification: dashboard-linter --strict exit 0; `python3 -c 'import json; json.load(...)'` exit 0; `make check` + `make verify` clean.
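The dots-to-underscores mapping described in this commit message can be illustrated with a small helper. This is a minimal sketch, not code from the repository: `logql_key` and `verdict_selector` are hypothetical names, and the regex assumes that every character outside `[A-Za-z0-9_]` becomes an underscore, which is a simplification of Loki's documented "dots and special characters" rule.

```python
import re


def logql_key(otlp_attribute: str) -> str:
    """Map an OTLP attribute name to the key exposed at the LogQL surface.

    Assumption: any character outside [A-Za-z0-9_] is replaced by "_".
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", otlp_attribute)


def verdict_selector(job: str, filters: dict[str, str]) -> str:
    """Build a stream selector followed by structured-metadata equality filters."""
    stages = [f'{logql_key(name)}="{value}"' for name, value in filters.items()]
    return " | ".join([f'{{job="{job}"}}', *stages])


if __name__ == "__main__":
    # Native-OTLP shape: no `| json` stage and no `attributes_` prefix.
    print(logql_key("pattern.id"))
    print(verdict_selector("tracecore", {"pattern.id": "14"}))
```

The job name `tracecore` is a placeholder; the shipped panels use the `$job` template variable and a regex match on `k8s_node_name` instead of plain equality.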
trilamsr added a commit that referenced this pull request on Jun 1, 2026
## Summary

Adds `install/kubernetes/tracecore/dashboards/patterns.json`, an operator-facing Grafana 10+ dashboard for the three shipped pattern verdicts (pod_evicted #14, nccl_hang #15, xid_correlation #16), with future-proof templating on `pattern.id` for M17/M18 detectors.

Closes #280 (LogQL drift vs Loki native OTLP). Issue #270 (scalar attribute promotion) was closed by PR #275 upstream of this rebase.

### LogQL shape: Loki native OTLP structured metadata

Verdict log records carry the verdict scalars as OTLP log-record attributes (PR #275): `pattern.id`, `pattern.confidence`, `k8s.pod.name`, `k8s.pod.namespace`, `k8s.node.name`, `k8s.event.reason`, `nccl.fr.pg_id`, `nccl.fr.collective_seq_id`, `nccl.fr.hanging_ranks_count`, `kernelevents.xid`.

The Loki backend recipe (PR #278, `docs/integrations/loki.md`) sends these to Loki 3.0+'s native OTLP endpoint (`/otlp/v1/logs`) via the bundled `otlphttp` exporter. Loki's OTLP receiver lands log attributes as **structured metadata**, queryable as direct LogQL label filters, with no `| json` parser stage and no `attributes_` prefix. Dots in attribute names normalize to underscores at the LogQL surface (per Loki upstream docs `docs/sources/shared/otel.md` "Format considerations"):

- `pattern.id` → `pattern_id`
- `pattern.confidence` → `pattern_confidence`
- `k8s.pod.name` → `k8s_pod_name`
- `nccl.fr.pg_id` → `nccl_fr_pg_id`
- `kernelevents.xid` → `kernelevents_xid`

All six verdict panels are written against this shape (e.g. `{job=~"$job"} | pattern_id="14" | k8s_node_name=~".+" [$__auto]`). README documents the Loki native-OTLP install path (cross-refs PR #278) and notes the Promtail/Alloy fallback for operators not on native OTLP (those need `| json pattern_id="attributes_pattern_id"` extraction; not shipped in-tree, fork the JSON).

### Self-telemetry counter still on v0.3 roadmap

The spec proposed PromQL queries against `otelcol_processor_patterndetector_verdicts_emitted_total`. That metric does not exist yet: the patterndetectorprocessor README says *"No self-telemetry yet. Self-telemetry is on the v0.3 roadmap"*. Per the spec's constraint (*"DO NOT touch detector code or processor code"*), root-causing the missing metric is out of scope for this PR; the six verdict-derived panels query Loki via LogQL in the meantime. Tracked under #261. When that lands, the six LogQL panels swap to PromQL and the Loki dependency drops.

Panel 7 (throughput) queries Prometheus against the upstream OTel-Collector standard `otelcol_processor_{incoming,outgoing}_items`, which the collector emits for every processor automatically.

### Panels shipped (7)

| # | Title | Datasource | Patterns covered |
|---|---|---|---|
| 1 | Verdict rate by pattern_id | Loki / LogQL | 14, 15, 16 (templated) |
| 2 | Top 10 evicted pods | Loki / LogQL | 14 (pod_evicted) |
| 3 | Top 10 hung NCCL collectives | Loki / LogQL | 15 (nccl_hang) |
| 4 | Top 10 Xid+eviction correlations | Loki / LogQL | 16 (xid_correlation) |
| 5 | Verdict count by node | Loki / LogQL | all (templated) |
| 6 | Confidence distribution (full vs partial) | Loki / LogQL | all (templated) |
| 7 | patterndetector processor throughput | Prom / PromQL | (pipeline liveness) |

### Templating vars (6)

- `prometheus_datasource`, `loki_datasource`: datasource selectors (no hardcoded UIDs).
- `job`, `instance`: linter-mandated PromQL matchers, populated from `otelcol_process_uptime`.
- `cluster`: multi-cluster slice, populated from `otelcol_process_uptime`.
- `pattern_id`: custom-options var seeded with the three shipped IDs (14/15/16). Extend in-place when new detectors land.

### Linter exclusion (justified)

`.lint` config waives `target-promql-rule` on the six Loki panels. The dashboard-linter parses every target as PromQL irrespective of `target.datasource.type` and fails on the first `|` in a valid LogQL pipeline. `target-logql-rule` still validates each query as LogQL and passes. Full justification in `install/kubernetes/tracecore/dashboards/README.md` §"Linter exclusions" and inline in the `.lint` `reason:` block. Removable once issue #261 swaps the panels to PromQL.

## Test plan

- [x] `dashboard-linter lint --strict --config .lint patterns.json` → exit 0 (built from source: `go install` rejects the linter's own go.mod replace directives; build steps in `install/kubernetes/tracecore/dashboards/README.md`).
- [x] `python3 -c "import json; json.load(open('install/kubernetes/tracecore/dashboards/patterns.json'))"` → exit 0; 7 panels, 6 templating vars confirmed.
- [x] `make check` (golangci-lint + vet + tidy-check + mod verify) → exit 0.
- [x] `make verify` (license-check + doc-check + register-lint + actionlint + zizmor + no-autoupdate) → exit 0.
- [x] LogQL shape verified against Loki upstream docs (`docs/sources/shared/otel.md` "Format considerations": dots and special characters normalize to underscores; no `attributes_` prefix on the native OTLP surface) and `docs/sources/send-data/otel/native_otlp_vs_loki_exporter.md` query examples (`{service_name="auth"} | severity_text="INFO"`).
- [x] Promoted scalar attrs verified against `module/processor/patterndetectorprocessor/patterndetector.go` `VerdictAttr*` constants (lines 25-88) and `appendVerdict` / `appendNCCLHangVerdict` / `appendXidCorrelationVerdict` emitters (lines 510-517+).
- [ ] Manual smoke test against a live cluster with Loki 3.0+ native OTLP receiver (deferred; adversarial reviewer to verify panels render against actual OTLP-native Loki output before merge).

```release-notes
Ship Grafana dashboard for tracecore's pattern verdicts: install/kubernetes/tracecore/dashboards/patterns.json. Seven panels cover the three shipped pattern detectors (pod_evicted #14, nccl_hang #15, xid_correlation #16) plus templated pattern.id for future M17/M18 patterns. LogQL queries target Loki 3.0+ native OTLP structured metadata (pairs with the Loki backend recipe). Includes README install guide (manual upload, grafana-cli, kube-prometheus-stack ConfigMap), Promtail/Alloy fallback notes, and pattern coverage matrix.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
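The test plan's `python3 -c` one-liner confirms that the dashboard JSON parses and reports the panel and variable counts. A slightly fuller version of that check might look like the sketch below. It assumes the usual Grafana dashboard layout, with a top-level `panels` array and variables under `templating.list`, and it does not descend into row panels. `summarize_dashboard` and the inline sample are hypothetical stand-ins, not the shipped `patterns.json`.

```python
import json


def summarize_dashboard(dashboard: dict) -> tuple[int, int]:
    """Return (panel count, templating variable count) for a dashboard dict.

    Assumption: panels are listed flat under "panels" (nested row panels are
    not counted) and variables live under "templating" -> "list".
    """
    panels = dashboard.get("panels", [])
    variables = dashboard.get("templating", {}).get("list", [])
    return len(panels), len(variables)


# Tiny stand-in document; the real file would be read with json.load(open(path)).
sample = json.loads(
    '{"panels": [{"id": 1}, {"id": 2}], "templating": {"list": [{"name": "job"}]}}'
)

if __name__ == "__main__":
    print(summarize_dashboard(sample))
```

Run against the shipped file, the commit message's figures would correspond to a result of 7 panels and 6 variables.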
## Summary

Adds the Grafana Loki backend recipe (roadmap A11), which pairs with the ClickHouse / Datadog / Honeycomb / generic-OTel recipes for the exporter-side install path.

- Loki 3.0+ accepts OTLP at `/otlp/v1/logs`, so the recipe uses the bundled `otlphttp` exporter; no Loki-specific exporter is needed and the deprecated contrib `lokiexporter` stays out of the OCB distro (RFC-0013 §2 adoption matrix). Adopt-over-build holds.
- Documents the `X-Scope-OrgID` tenant header (required for `auth_enabled: true` clusters; dropped for single-tenant).
- Labels vs structured metadata (the canonical cardinality footgun): OTLP resource attributes flow to stream labels via the distributor's `default_resource_attributes_as_index_labels` allow-list; OTLP scope and log attributes flow to structured metadata. Tracecore's `pattern.*` verdict attributes ship as log attributes, so they land as structured metadata by default, exactly the shape PR #264's Grafana panels query as `attributes_pattern_id`.
- The `otlp_config` knobs are documented as optional, not required.
## Why this unblocks PR #264

PR #264's six LogQL panels query `attributes_pattern_id`, `attributes_k8s_node_name`, etc. against Loki. Those LogQL keys exist because the Loki distributor maps OTLP log attributes to structured metadata with the `attributes_` prefix (and dots → underscores). This recipe is the install path that makes that mapping default-correct on day one; without it, an operator can accidentally promote `pattern.*` to labels and blow out the stream-cardinality budget.
## Test plan

- `make doc-check`: exit 0 (504 markdown links resolve, 8 integration recipes carry `tested-against` + `last-verified`, `docs/README.md` indexes every recipe).
- `make validator-recipe`: exit 0; the new `docs/integrations/loki.md` is explicitly validated by the in-tree `tracecore validate --config=docs/integrations/examples/loki.yaml` (7 of 9 recipes validated, 2 skipped on darwin per existing requires-linux / requires-k8s-cluster markers).
- `make check`: golangci-lint + vet + tidy-check + mod-verify, exit 0.
- `make verify`: pre-push hook ran (license-check, doc-check, register-lint, actionlint, zizmor, no-autoupdate), exit 0.
- Checked `X-Scope-OrgID` semantics and the `default_resource_attributes_as_index_labels` list against upstream docs (`grafana.com/docs/loki/latest/send-data/otel/` and Loki repo `docs/sources/get-started/labels/`).
- Checked `pattern.*` attribute names against `module/processor/patterndetectorprocessor/patterndetector.go` (constants `VerdictAttrPatternID` ... `VerdictAttrVerdictJSON`).
- LogQL key shape (`attributes_pattern_id`) verified against Loki's OTLP attribute-mapping convention (log attributes → structured metadata, dots → underscores, `attributes_` prefix). Recipe cross-references PR #264 in the "See also" section.