docs(integrations): loki backend recipe#278

Merged
trilamsr merged 2 commits into main from docs/recipe-loki on Jun 1, 2026
@trilamsr trilamsr commented Jun 1, 2026

Summary

Adds the Grafana Loki backend recipe (roadmap A11) — pairs with the
ClickHouse / Datadog / Honeycomb / generic-OTel recipes for the
exporter-side install path.

  • Loki 3.0+ accepts OTLP/HTTP natively at /otlp/v1/logs, so the
    recipe uses the bundled otlphttp exporter — no Loki-specific
    exporter is needed and the deprecated contrib lokiexporter stays
    out of the OCB distro (RFC-0013 §2 adoption matrix). Adopt-over-
    build holds.
  • Documents the X-Scope-OrgID tenant header (required for
    auth_enabled: true clusters; dropped for single-tenant).
  • Documents the labels-vs-structured-metadata mapping (Loki's
    canonical cardinality footgun): OTLP resource attributes flow
    to stream labels via the distributor's
    default_resource_attributes_as_index_labels allow-list; OTLP
    scope and log attributes flow to structured metadata.
    Tracecore's pattern.* verdict attributes ship as log attributes,
    so they land as structured metadata by default — exactly the shape
    PR #264's Grafana panels query as attributes_pattern_id.
  • Multi-tenant gateway + Grafana Enterprise Logs features are noted
    as optional, not required.
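
A minimal sketch of the exporter stanza the bullets above describe, written as a Python dict that mirrors the collector YAML. The host `loki.example.internal:3100`, the tenant `team-a`, and the component name `otlphttp/loki` are placeholders rather than values from the recipe. The sketch assumes the `otlphttp` exporter appends `/v1/logs` to its base `endpoint`, so the base ends in `/otlp`.

```python
import json


def loki_otlphttp_exporter(base_url, tenant=None):
    """Build an otlphttp exporter stanza aimed at Loki's native OTLP endpoint.

    base_url and tenant are illustrative inputs; the component name
    "otlphttp/loki" is a hypothetical label, not taken from the recipe.
    """
    # The exporter is assumed to append /v1/logs, giving /otlp/v1/logs.
    exporter = {"endpoint": base_url.rstrip("/") + "/otlp"}
    if tenant is not None:
        # Needed when Loki runs with auth_enabled: true; omitted for single-tenant.
        exporter["headers"] = {"X-Scope-OrgID": tenant}
    return {"exporters": {"otlphttp/loki": exporter}}


multi = loki_otlphttp_exporter("http://loki.example.internal:3100", tenant="team-a")
single = loki_otlphttp_exporter("http://loki.example.internal:3100")
print(json.dumps(multi, indent=2))
```

Passing no tenant leaves the `headers` key out entirely, which matches the single-tenant case in the bullet on X-Scope-OrgID.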

Why this unblocks PR #264

PR #264's six LogQL panels query attributes_pattern_id,
attributes_k8s_node_name, etc. against Loki. Those LogQL keys exist
because the Loki distributor maps OTLP log attributes to structured
metadata with the attributes_ prefix (and dots → underscores). This
recipe is the install path that makes that mapping default-correct on
day one; without it, an operator can accidentally promote pattern.*
to labels and blow out the stream-cardinality budget.

Test plan

  • make doc-check — exit 0 (504 markdown links resolve, 8
    integration recipes carry tested-against+last-verified,
    docs/README.md indexes every recipe).
  • make validator-recipe — exit 0; the new
    docs/integrations/loki.md is explicitly validated by the
    in-tree tracecore validate --config=docs/integrations/examples/loki.yaml
    (7 of 9 recipes validated, 2 skipped on darwin per existing
    requires-linux / requires-k8s-cluster markers).
  • make check — golangci-lint + vet + tidy-check + mod-verify,
    exit 0.
  • make verify — pre-push hook ran (license-check, doc-check,
    register-lint, actionlint, zizmor, no-autoupdate), exit 0.
  • Confirmed Loki's X-Scope-OrgID semantics and the
    default_resource_attributes_as_index_labels list against
    upstream docs (grafana.com/docs/loki/latest/send-data/otel/
    and Loki repo docs/sources/get-started/labels/).
  • Confirmed pattern.* attribute names against
    module/processor/patterndetectorprocessor/patterndetector.go
    (constants VerdictAttrPatternID ... VerdictAttrVerdictJSON).
  • PR #264's LogQL key shape (attributes_pattern_id) verified
    against Loki's OTLP attribute-mapping convention (log
    attributes → structured metadata, dots → underscores,
    attributes_ prefix). Recipe cross-references PR #264 in the
    "See also" section.
Add Grafana Loki backend recipe: docs/integrations/loki.md + docs/integrations/examples/loki.yaml. Uses the bundled `otlphttp` exporter against Loki 3.0+'s native OTLP endpoint (`/otlp/v1/logs`) with `X-Scope-OrgID` tenant routing. Documents the labels-vs-structured-metadata mapping that makes PR #264's Grafana dashboard panels (`attributes_pattern_id`, `attributes_k8s_node_name`) default-correct.

Native OTLP-HTTP path via the bundled otlphttp exporter — Loki 3.0+
accepts OTLP at /otlp/v1/logs, so no Loki-specific exporter is needed
(the deprecated contrib lokiexporter is intentionally not bundled).
Recipe documents the X-Scope-OrgID tenant header, the labels-vs-
structured-metadata mapping (the canonical Loki cardinality footgun),
and the otlp_config knobs Loki operators need to keep tracecore's
pattern.* verdict attributes as structured metadata (where they are
queryable as attributes_pattern_id in LogQL — the same shape PR #264's
Grafana dashboard panels expect).

Signed-off-by: Tri Lam <tri@maydow.com>
Per fresh-context review of #278: cited verdict_attrs.go (which
doesn't exist); constants live in patterndetector.go.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 04:23
@trilamsr trilamsr merged commit f3f8b68 into main Jun 1, 2026
11 checks passed
@trilamsr trilamsr deleted the docs/recipe-loki branch June 1, 2026 04:32
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Verdict panels queried OTLP attributes via `| json pattern_id="attributes_pattern_id"` (Promtail/Alloy shape), which returns empty against Loki 3.0+ native OTLP ingestion (PR #278). Loki normalizes OTLP log attributes into structured metadata with dots → underscores at the LogQL surface (`pattern.id` → `pattern_id`), no `attributes_` prefix, no JSON parse stage.

All 6 verdict panels now use direct structured-metadata filters (`| pattern_id="14" | k8s_node_name=~".+"`). top-N tables aggregate on promoted scalar attrs from PR #275 (`k8s_pod_name`, `k8s_pod_namespace`, `nccl_fr_pg_id`, `nccl_fr_collective_seq_id`, `kernelevents_xid`). Panel 6 reads `pattern_confidence` (was `attributes_pattern_confidence` via json).

README documents the Loki native-OTLP install path (cross-refs PR #278 recipe) and notes the Promtail/Alloy fallback shape for operators not on native OTLP. Same 7 panels, no new panels.

Verification: dashboard-linter --strict exit 0; `python3 -c 'import json; json.load(...)'` exit 0; `make check` + `make verify` clean.
trilamsr added a commit that referenced this pull request Jun 1, 2026
## Summary

Adds `install/kubernetes/tracecore/dashboards/patterns.json` — an
operator-facing Grafana 10+ dashboard for the three shipped pattern
verdicts (pod_evicted #14, nccl_hang #15, xid_correlation #16), with
future-proof templating on `pattern.id` for M17/M18 detectors.

Closes #280 (LogQL drift vs Loki native OTLP). Issue #270 (scalar
attribute promotion) was closed by PR #275 upstream of this rebase.

### LogQL shape: Loki native OTLP structured metadata

Verdict log records carry the verdict scalars as OTLP log-record
attributes (PR #275): `pattern.id`, `pattern.confidence`,
`k8s.pod.name`, `k8s.pod.namespace`, `k8s.node.name`,
`k8s.event.reason`, `nccl.fr.pg_id`, `nccl.fr.collective_seq_id`,
`nccl.fr.hanging_ranks_count`, `kernelevents.xid`. The Loki backend
recipe (PR #278, `docs/integrations/loki.md`) sends these to Loki
3.0+'s native OTLP endpoint (`/otlp/v1/logs`) via the bundled
`otlphttp` exporter.

Loki's OTLP receiver lands log attributes as **structured metadata**,
queryable as direct LogQL label filters — no `| json` parser stage,
no `attributes_` prefix. Dots in attribute names normalize to
underscores at the LogQL surface (per Loki upstream docs
`docs/sources/shared/otel.md` "Format considerations"):

- `pattern.id` → `pattern_id`
- `pattern.confidence` → `pattern_confidence`
- `k8s.pod.name` → `k8s_pod_name`
- `nccl.fr.pg_id` → `nccl_fr_pg_id`
- `kernelevents.xid` → `kernelevents_xid`
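
The mapping listed above can be sketched as a small helper. This is an approximation, assuming the normalization simply replaces every character outside `[A-Za-z0-9_]` with an underscore; the function name `logql_key` is hypothetical and Loki's actual implementation may handle more edge cases.

```python
import re


def logql_key(otlp_attr):
    """Approximate the LogQL key for an OTLP attribute name.

    Assumption: any character outside [A-Za-z0-9_] becomes an underscore.
    """
    return re.sub(r"[^A-Za-z0-9_]", "_", otlp_attr)


for name in (
    "pattern.id",
    "pattern.confidence",
    "k8s.pod.name",
    "nccl.fr.pg_id",
    "kernelevents.xid",
):
    print(f"{name} -> {logql_key(name)}")
```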

All six verdict panels are written against this shape (e.g.
`{job=~"$job"} | pattern_id="14" | k8s_node_name=~".+" [$__auto]`).
README documents the Loki native-OTLP install path (cross-refs PR
#278) and notes the Promtail/Alloy fallback for operators not on
native OTLP (those need `| json pattern_id="attributes_pattern_id"`
extraction; not shipped in-tree, fork the JSON).
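
To make the difference between the two query shapes concrete, here is a rough sketch that builds both. The helper name `verdict_filter` is hypothetical, and the fallback pipeline is an assumption that combines the `| json` extraction quoted above with the same label filter; a real Promtail/Alloy fork may need further stages.

```python
def verdict_filter(pattern_id, native_otlp=True):
    """Return a LogQL pipeline that selects verdicts for one pattern ID."""
    stages = ['{job=~"$job"}']
    if not native_otlp:
        # Promtail/Alloy shape: the attribute sits in the JSON log body,
        # so it must be extracted before it can be filtered on.
        stages.append('| json pattern_id="attributes_pattern_id"')
    # Native OTLP shape: pattern_id is structured metadata, filter directly.
    stages.append(f'| pattern_id="{pattern_id}"')
    return " ".join(stages)


print(verdict_filter(14))
print(verdict_filter(14, native_otlp=False))
```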

### Self-telemetry counter still on v0.3 roadmap

The spec proposed PromQL queries against
`otelcol_processor_patterndetector_verdicts_emitted_total`. That
metric does not exist yet — the patterndetectorprocessor README says
*"No self-telemetry yet. Self-telemetry is on the v0.3 roadmap"*.
Per the spec's constraint (*"DO NOT touch detector code or processor
code"*), root-causing the missing metric is out of scope for this
PR; the six verdict-derived panels query Loki via LogQL in the
meantime. Tracked under #261. When that lands, the six LogQL panels
swap to PromQL and the Loki dependency drops.

Panel 7 (throughput) queries Prometheus against the upstream
OTel-Collector standard `otelcol_processor_{incoming,outgoing}_items`
which the collector emits for every processor automatically.

### Panels shipped (7)

| # | Title | Datasource | Patterns covered |
|---|---|---|---|
| 1 | Verdict rate by pattern_id | Loki / LogQL | 14, 15, 16 (templated) |
| 2 | Top 10 evicted pods | Loki / LogQL | 14 (pod_evicted) |
| 3 | Top 10 hung NCCL collectives | Loki / LogQL | 15 (nccl_hang) |
| 4 | Top 10 Xid+eviction correlations | Loki / LogQL | 16 (xid_correlation) |
| 5 | Verdict count by node | Loki / LogQL | all (templated) |
| 6 | Confidence distribution (full vs partial) | Loki / LogQL | all (templated) |
| 7 | patterndetector processor throughput | Prom / PromQL | (pipeline liveness) |

### Templating vars (6)

- `prometheus_datasource`, `loki_datasource` — datasource selectors (no
  hardcoded UIDs).
- `job`, `instance` — linter-mandated PromQL matchers, populated from
  `otelcol_process_uptime`.
- `cluster` — multi-cluster slice, populated from
  `otelcol_process_uptime`.
- `pattern_id` — custom-options var seeded with the three shipped IDs
  (14/15/16). Extend in-place when new detectors land.

### Linter exclusion (justified)

`.lint` config waives `target-promql-rule` on the six Loki panels. The
dashboard-linter parses every target as PromQL irrespective of
`target.datasource.type` and fails on the first `|` in a valid LogQL
pipeline. `target-logql-rule` still validates each query as LogQL and
passes. Full justification in
`install/kubernetes/tracecore/dashboards/README.md` §"Linter exclusions"
and inline in the `.lint` `reason:` block. Removable once issue #261
swaps the panels to PromQL.

## Test plan

- [x] `dashboard-linter lint --strict --config .lint patterns.json`
      → exit 0 (built from source — `go install` rejects the linter's
      own go.mod replace directives; build steps in
      `install/kubernetes/tracecore/dashboards/README.md`).
- [x] `python3 -c "import json; json.load(open('install/kubernetes/tracecore/dashboards/patterns.json'))"`
      → exit 0; 7 panels, 6 templating vars confirmed.
- [x] `make check` (golangci-lint + vet + tidy-check + mod verify) →
      exit 0.
- [x] `make verify` (license-check + doc-check + register-lint +
      actionlint + zizmor + no-autoupdate) → exit 0.
- [x] LogQL shape verified against Loki upstream docs
      (`docs/sources/shared/otel.md` "Format considerations" — dots
      and special characters normalize to underscores; no
      `attributes_` prefix on the native OTLP surface) and
      `docs/sources/send-data/otel/native_otlp_vs_loki_exporter.md`
      query examples (`{service_name="auth"} | severity_text="INFO"`).
- [x] Promoted scalar attrs verified against
      `module/processor/patterndetectorprocessor/patterndetector.go`
      `VerdictAttr*` constants (lines 25-88) and `appendVerdict` /
      `appendNCCLHangVerdict` / `appendXidCorrelationVerdict` emitters
      (lines 510-517+).
- [ ] Manual smoke test against a live cluster with Loki 3.0+ native
      OTLP receiver (deferred — adversarial reviewer to verify panels
      render against actual OTLP-native Loki output before merge).
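
The JSON check in the test plan only proves the file parses. A slightly fuller sketch, assuming the standard Grafana dashboard layout (top-level `panels` list and `templating.list`), could also report the panel and variable counts. The `sample` document below is invented for illustration and is not the shipped `patterns.json`; `summarize_dashboard` is a hypothetical helper.

```python
import json


def summarize_dashboard(text):
    """Parse dashboard JSON and return (panel count, templating variable names).

    Only top-level panels are counted; panels nested inside collapsed rows
    would need an extra pass.
    """
    dashboard = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    panels = dashboard.get("panels", [])
    variables = dashboard.get("templating", {}).get("list", [])
    return len(panels), [v.get("name") for v in variables]


# Invented stand-in for the real file, small enough to read at a glance.
sample = json.dumps({
    "panels": [{"id": 1, "title": "Verdict rate by pattern_id"}],
    "templating": {"list": [{"name": "pattern_id", "type": "custom"}]},
})
print(summarize_dashboard(sample))
```

Run against the real file, the same function would be expected to report the 7 panels and 6 templating vars noted in the checklist.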

```release-notes
Ship Grafana dashboard for tracecore's pattern verdicts: install/kubernetes/tracecore/dashboards/patterns.json. Seven panels cover the three shipped pattern detectors (pod_evicted #14, nccl_hang #15, xid_correlation #16) plus templated pattern.id for future M17/M18 patterns. LogQL queries target Loki 3.0+ native OTLP structured metadata (pairs with the Loki backend recipe). Includes README install guide (manual upload, grafana-cli, kube-prometheus-stack ConfigMap), Promtail/Alloy fallback notes, and pattern coverage matrix.
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>