1 change: 1 addition & 0 deletions docs/README.md
@@ -50,6 +50,7 @@ Backend (exporter-side) recipes:
| [integrations/honeycomb.md](integrations/honeycomb.md) | 👤 | Direct OTLP/HTTP to Honeycomb via the in-tree `otlphttp` exporter. |
| [integrations/datadog.md](integrations/datadog.md) | 👤 | Datadog via the bundled `datadogexporter`. |
| [integrations/clickhouse-direct.md](integrations/clickhouse-direct.md) | 👤 | Self-hosted ClickHouse via the bundled `clickhouseexporter`. |
| [integrations/loki.md](integrations/loki.md) | 👤 | Grafana Loki via OTLP/HTTP native ingestion (`otlphttp` exporter, `X-Scope-OrgID` tenant header); labels-vs-structured-metadata mapping for `pattern.*` verdict attributes. |

Source (receiver-side) recipes — RFC-0013 §migration PR-J replacements for the deleted in-tree receivers:

40 changes: 40 additions & 0 deletions docs/integrations/examples/loki.yaml
@@ -0,0 +1,40 @@
# Grafana Loki ingests OTLP/HTTP logs natively at /otlp/v1/logs since
# Loki 3.0 (2024). The upstream `otlphttp` exporter the OCB-assembled
# tracecore binary bundles (RFC-0013 §1) is sufficient — no Loki-
# specific exporter is needed, and the deprecated contrib `lokiexporter`
# is intentionally NOT bundled. Tracecore writes to Loki's distributor
# directly; the distributor promotes a default set of OTLP resource
# attributes to stream labels and stores all other attributes,
# including log attributes, as structured metadata.
#
# Tenant header: Loki uses `X-Scope-OrgID` to identify the tenant when
# the distributor runs with `auth_enabled: true`. Single-tenant clusters
# (`auth_enabled: false`) accept requests without the header and route
# them under the synthetic tenant `fake`. The `X-Scope-OrgID: tracecore`
# below assumes a multi-tenant cluster; drop the header for single-
# tenant. Tracecore does not expand environment variables in YAML;
# render the literal value at deploy time via envsubst / Helm template
# / CSI secret driver if the tenant ID is sensitive. See
# docs/integrations/loki.md.
#
# Endpoint: the `otlphttp` exporter appends the OTLP-spec path
# `/v1/logs` automatically, so `endpoint: http://loki.../otlp` resolves
# to `http://loki.../otlp/v1/logs` at request time. Do not include
# `/v1/logs` in the endpoint string: the exporter appends the suffix
# regardless, and the doubled path (`/otlp/v1/logs/v1/logs`) does not
# match Loki's OTLP route.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp/loki:
    endpoint: http://loki-distributor.observability.svc.cluster.local:3100/otlp
    compression: gzip
    headers:
      X-Scope-OrgID: tracecore

service:
  pipelines:
    logs/loki:
      receivers: [otlp]
      exporters: [otlphttp/loki]
192 changes: 192 additions & 0 deletions docs/integrations/loki.md
@@ -0,0 +1,192 @@
<!-- tested-against: tracecore -->
<!-- last-verified: 2026-05-31 -->

# Grafana Loki

Loki ingests OTLP/HTTP logs natively at `/otlp/v1/logs` since Loki 3.0
(2024). Tracecore reaches it directly through the upstream `otlphttp`
exporter bundled in the OCB-assembled tracecore distro; no Loki-specific
exporter is required, and the deprecated contrib `lokiexporter` is
intentionally not bundled (RFC-0013 §2 adoption matrix). The tenant ID
travels in the `X-Scope-OrgID` header.

Deployment shape:

```
tracecore (otlphttp exporter) ──▶ Loki distributor (/otlp/v1/logs)
```

## Config

```yaml
# docs/integrations/examples/loki.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlphttp/loki:
    endpoint: http://loki-distributor.observability.svc.cluster.local:3100/otlp
    compression: gzip
    headers:
      X-Scope-OrgID: tracecore

service:
  pipelines:
    logs/loki:
      receivers: [otlp]
      exporters: [otlphttp/loki]
```

Validate with the in-tree binary:

```sh
./tracecore validate --config=docs/integrations/examples/loki.yaml
```

## Endpoint and tenant

- The endpoint is the Loki distributor's HTTP listener at the path
`/otlp`; the `otlphttp` exporter appends the OTLP-spec `/v1/logs`
suffix automatically, so the request lands at `/otlp/v1/logs`. Do
not include `/v1/logs` in the YAML: the exporter appends the suffix
regardless, and the doubled path (`/otlp/v1/logs/v1/logs`) does not
match Loki's OTLP route.
- `X-Scope-OrgID` identifies the tenant when Loki's distributor runs
with `auth_enabled: true`. Single-tenant clusters
(`auth_enabled: false`) accept requests without the header and
route them under the synthetic tenant `fake`; you can drop the
`headers:` block in that case.
- Loki Operator and Grafana Enterprise Logs (GEL) layer additional
multi-tenant auth on top (e.g. mTLS gateways, per-tenant rate
limits); those are optional, not required for the basic OSS install.
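
For a single-tenant cluster (`auth_enabled: false`), the exporter block
from the Config section could be reduced to the sketch below. It reuses
the illustrative in-cluster service name from the example file and only
removes the `headers:` block; treat it as a starting point rather than a
verified deployment.

```yaml
exporters:
  otlphttp/loki:
    # No /v1/logs suffix here: the exporter appends it at request time.
    endpoint: http://loki-distributor.observability.svc.cluster.local:3100/otlp
    compression: gzip
    # No headers block: with auth_enabled: false, Loki files the logs
    # under the synthetic tenant `fake`.
```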

## Labels vs. structured metadata (the cardinality footgun)

Loki indexes logs by stream **labels** and stores everything else as
**structured metadata** (queryable in LogQL, NOT indexed). Label
cardinality directly drives index size and query cost; the canonical
Loki guidance is to keep the set of distinct values for each label
small and bounded.

The distributor's OTLP receiver sorts OTLP attributes into three buckets:

| Source | Default mapping | Cardinality risk |
|---|---|---|
| OTLP **resource** attributes | Index labels (only the ones in `default_resource_attributes_as_index_labels`) | Bounded; the default list is curated. |
| OTLP **scope** attributes | Structured metadata | Low — instrumentation-scope is rarely high-cardinality. |
| OTLP **log** attributes | Structured metadata | Safe by default; high-cardinality keys (e.g. `pattern.verdict_json`) stay out of the label index. |

The Loki-side defaults at the distributor pick up these resource
attributes as stream labels (from
`default_resource_attributes_as_index_labels`):

`service.name`, `service.namespace`, `deployment.environment`,
`deployment.environment.name`, `cloud.region`,
`cloud.availability_zone`, `k8s.cluster.name`, `k8s.namespace.name`,
`k8s.container.name`, `container.name`, `k8s.replicaset.name`,
`k8s.deployment.name`, `k8s.statefulset.name`, `k8s.daemonset.name`,
`k8s.cronjob.name`, `k8s.job.name`.

Operator-side tuning lives in Loki's config, not in tracecore:

```yaml
# loki.yaml (on the LOKI side, NOT in tracecore)
limits_config:
  allow_structured_metadata: true  # default in Loki 3.0+
  otlp_config:
    resource_attributes:
      attributes_config:
        - action: index_label
          regex: k8s\.node\.name  # opt-in: index by node
    log_attributes:
      - action: structured_metadata
        attributes:
          - pattern.id
          - pattern.headline
          - pattern.remediation
          - pattern.confidence
          - pattern.verdict_json
```

When OTLP attributes flow into Loki, dots in attribute names are
translated to underscores at the LogQL surface, with the bucket as
prefix: an attribute `pattern.id` on a log record becomes
`attributes_pattern_id` in a LogQL query; a resource attribute
`k8s.node.name` becomes `resources_k8s_node_name`. Verify against
your Loki version — the prefix convention is stable since Loki 3.0
but pre-3.0 callers should consult upstream release notes before
adopting LogQL queries that rely on it.

### Tracecore-specific attributes

The patterndetectorprocessor emits verdict records carrying these
attributes (defined in
`module/processor/patterndetectorprocessor/patterndetector.go`):

- `pattern.id`, `pattern.headline`, `pattern.remediation`,
`pattern.confidence`, `pattern.verdict_json`
- `k8s.pod.name`, `k8s.pod.namespace`, `k8s.node.name`
- `k8s.event.reason`
- `nccl.fr.pg_id`, `nccl.fr.collective_seq_id`,
`nccl.fr.hanging_ranks_count`

All ship as **log attributes**, so all land in Loki as **structured
metadata** by default. This is the right shape: `pattern.verdict_json`
in particular is per-incident JSON and would explode the label index
if promoted. The dashboards consume them as `attributes_pattern_id`,
`attributes_k8s_node_name`, etc. (see `## See also` below).

Only attributes on the OTLP resource that contains the verdict log
record are candidates for the label index, and the default list above
already covers `k8s.namespace.name` / `k8s.cluster.name` /
`service.name` / the rest of the k8s workload axis.

## Retention

Retention is configured on the Loki side: the compactor enforces it
(`compactor.retention_enabled` and the related `compactor.retention_*`
settings), `limits_config.retention_period` sets the global or
per-tenant period, and `limits_config.retention_stream` adds per-stream
overrides. Tracecore does not control retention; the recipe assumes the
operator has set a global retention compatible with the verdict signal
(~14-30d is typical for incident review; longer for compliance). If the
cluster has retention disabled, verdicts accumulate indefinitely until
storage fills, so set at least a default `retention_period` before
pointing tracecore at the cluster.
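
A minimal Loki-side sketch of a default retention is shown below. The
30-day period and the `filesystem` delete-request store are assumptions
chosen for illustration; other compactor settings (working directory,
object-store details) are omitted and depend on the deployment, so check
the fragment against the Loki version in use.

```yaml
# loki.yaml (on the LOKI side, NOT in tracecore)
compactor:
  retention_enabled: true           # the compactor only deletes data when this is set
  delete_request_store: filesystem  # assumption: use the store that matches your deployment
limits_config:
  retention_period: 720h            # assumption: 30d default applied to every tenant
```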

## Secret handling

Same shape as the other recipes: render the literal `X-Scope-OrgID`
value at deploy time through `envsubst`, Helm, or a CSI secret driver
if the tenant identifier is sensitive. The example file ships the
literal `tracecore` so `tracecore validate` succeeds offline. Single-
tenant Loki clusters can drop the `headers:` block entirely.
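
One way to keep the tenant ID out of the repository is a template that
is rendered before tracecore starts, for example with
`envsubst < loki.yaml.tmpl > loki.yaml`. The file name `loki.yaml.tmpl`
and the variable name `LOKI_TENANT_ID` are hypothetical; only the
rendered file, which contains the literal value, is passed to tracecore.

```yaml
# loki.yaml.tmpl (hypothetical template; render it before starting tracecore)
exporters:
  otlphttp/loki:
    endpoint: http://loki-distributor.observability.svc.cluster.local:3100/otlp
    compression: gzip
    headers:
      # LOKI_TENANT_ID is a hypothetical variable name. envsubst replaces
      # the placeholder; tracecore itself never expands it.
      X-Scope-OrgID: ${LOKI_TENANT_ID}
```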

## Failure modes

| Symptom | First check |
|---|---|
| HTTP 401 / 403 from Loki | Auth gateway in front of the distributor is rejecting the request. Confirm the deployed `X-Scope-OrgID` value matches the gateway's tenant allow-list. |
| HTTP 400 `the request body is too large` | Tracecore is sending batches larger than Loki accepts; check the per-tenant limits under `limits_config` (e.g. `ingestion_rate_mb`). Lower the batchprocessor flush size or raise the Loki limit. |
| HTTP 400 `structured metadata is not allowed` | Loki is below 3.0 OR `limits_config.allow_structured_metadata` is `false`. Upgrade Loki, or flip the limit. The OTLP receiver always emits structured metadata for non-label attributes. |
| HTTP 429 with `Retry-After` | Loki's per-tenant ingestion rate-limit is engaged. Either aggregate at tracecore (`batchprocessor`) before the exporter or raise `ingestion_rate_mb` / `ingestion_burst_size_mb` on the Loki side. |
| Verdicts arrive but `pattern.id` is missing from LogQL | The Loki distributor dropped log attributes per `otlp_config.log_attributes`. Confirm the operator-side config includes `action: structured_metadata` for `pattern.*` (see the labels-vs-metadata section above). |
| Repeated TLS handshake failures | The default trust store covers most managed Lokis. If a corporate proxy MITMs egress, install the proxy CA in the system trust store; do not enable `insecure_skip_verify` in production. |
| Stream cardinality alerts on the Loki cluster | Confirm no high-cardinality OTLP resource attribute (e.g. `service.instance.id`) was added to `default_resource_attributes_as_index_labels`; that list defaults sanely but is the most common operator footgun. |
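
For the HTTP 429 and oversized-request rows above, a sketch of batching
and retry tuning on the tracecore side follows. It assumes the upstream
`batch` processor is bundled in the tracecore distro (the table refers
to it as `batchprocessor`) and that the `otlp` receiver from the Config
section is still defined; every numeric value is illustrative, not a
recommendation.

```yaml
processors:
  batch:
    send_batch_size: 1024      # assumption: flush once this many records are queued
    send_batch_max_size: 2048  # assumption: hard cap on records per request
    timeout: 5s                # flush at least this often

exporters:
  otlphttp/loki:
    endpoint: http://loki-distributor.observability.svc.cluster.local:3100/otlp
    compression: gzip
    headers:
      X-Scope-OrgID: tracecore
    retry_on_failure:
      enabled: true
      initial_interval: 5s     # first backoff after a retryable response such as 429
      max_interval: 30s
      max_elapsed_time: 300s   # give up on a batch after five minutes

service:
  pipelines:
    logs/loki:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]
```

Smaller batches reduce the chance of a single oversized request, while
the retry block lets the exporter back off instead of dropping data when
Loki's rate limit engages.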

## See also

- Upstream Loki OTLP-ingestion docs:
[`grafana.com/docs/loki/latest/send-data/otel/`](https://grafana.com/docs/loki/latest/send-data/otel/)
- Upstream Loki labels-vs-structured-metadata reference:
[`grafana.com/docs/loki/latest/get-started/labels/structured-metadata/`](https://grafana.com/docs/loki/latest/get-started/labels/structured-metadata/)
- Upstream exporter docs:
[`exporter/otlphttpexporter`](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlphttpexporter)
- Generic OTel backend recipe: [`otel-backend.md`](otel-backend.md)
- Honeycomb (OTLP/HTTP with vendor headers, same exporter shape):
[`honeycomb.md`](honeycomb.md)
- Grafana dashboard for pattern verdicts (install path that this
recipe unblocks): `install/kubernetes/tracecore/dashboards/patterns.json`
(PR #264) — six panels query Loki via LogQL against
`attributes_pattern_id` / `attributes_k8s_node_name`, matching the
default OTLP-log-attribute-to-structured-metadata mapping this
recipe documents.