diff --git a/docs/README.md b/docs/README.md
index fefe079d..17a0db63 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -50,6 +50,7 @@ Backend (exporter-side) recipes:
 | [integrations/honeycomb.md](integrations/honeycomb.md) | 👀 | Direct OTLP/HTTP to Honeycomb via the in-tree `otlphttp` exporter. |
 | [integrations/datadog.md](integrations/datadog.md) | 👀 | Datadog via the bundled `datadogexporter`. |
 | [integrations/clickhouse-direct.md](integrations/clickhouse-direct.md) | 👀 | Self-hosted ClickHouse via the bundled `clickhouseexporter`. |
+| [integrations/loki.md](integrations/loki.md) | 👀 | Grafana Loki via OTLP/HTTP native ingestion (`otlphttp` exporter, `X-Scope-OrgID` tenant header); labels-vs-structured-metadata mapping for `pattern.*` verdict attributes. |
 
 Source (receiver-side) recipes, RFC-0013 §migration PR-J replacements for the deleted in-tree receivers:
 
diff --git a/docs/integrations/examples/loki.yaml b/docs/integrations/examples/loki.yaml
new file mode 100644
index 00000000..28f310df
--- /dev/null
+++ b/docs/integrations/examples/loki.yaml
@@ -0,0 +1,40 @@
+# Grafana Loki ingests OTLP/HTTP logs natively at /otlp/v1/logs since
+# Loki 3.0 (2024). The upstream `otlphttp` exporter the OCB-assembled
+# tracecore binary bundles (RFC-0013 §1) is sufficient; no Loki-
+# specific exporter is needed, and the deprecated contrib `lokiexporter`
+# is intentionally NOT bundled. Tracecore writes to Loki's distributor
+# directly; the distributor maps OTLP resource attributes to stream
+# labels and log attributes to structured metadata.
+#
+# Tenant header: Loki uses `X-Scope-OrgID` to identify the tenant when
+# the distributor runs with `auth_enabled: true`. Single-tenant clusters
+# (`auth_enabled: false`) accept requests without the header and route
+# them under the synthetic tenant `fake`. The `X-Scope-OrgID: tracecore`
+# below assumes a multi-tenant cluster; drop the header for
+# single-tenant.
+# Tracecore does not expand environment variables in YAML; render
+# the literal value at deploy time via envsubst / Helm template / CSI
+# secret driver if the tenant ID is sensitive. See docs/integrations/loki.md.
+#
+# Endpoint: the `otlphttp` exporter appends the OTLP-spec path
+# `/v1/logs` automatically, so `endpoint: http://loki.../otlp` resolves
+# to `http://loki.../otlp/v1/logs` at request time. Do not include
+# `/v1/logs` in the endpoint string, or the path ends up duplicated.
+receivers:
+  otlp:
+    protocols:
+      http:
+        endpoint: 0.0.0.0:4318
+
+exporters:
+  otlphttp/loki:
+    endpoint: http://loki-distributor.observability.svc.cluster.local:3100/otlp
+    compression: gzip
+    headers:
+      X-Scope-OrgID: tracecore
+
+service:
+  pipelines:
+    logs/loki:
+      receivers: [otlp]
+      exporters: [otlphttp/loki]
diff --git a/docs/integrations/loki.md b/docs/integrations/loki.md
new file mode 100644
index 00000000..367a2c42
--- /dev/null
+++ b/docs/integrations/loki.md
@@ -0,0 +1,192 @@
+
+
+
+# Grafana Loki
+
+Loki ingests OTLP/HTTP logs natively at `/otlp/v1/logs` since Loki 3.0
+(2024). Tracecore reaches it directly through the upstream `otlphttp`
+exporter bundled in the OCB-assembled tracecore distro; no Loki-specific
+exporter is required, and the deprecated contrib `lokiexporter` is
+intentionally not bundled (RFC-0013 §2 adoption matrix). The tenant ID
+travels in the `X-Scope-OrgID` header.
+
+Deployment shape:
+
+```
+tracecore (otlphttp exporter) ──▶ Loki distributor (/otlp/v1/logs)
+```
+
+## Config
+
+```yaml
+# docs/integrations/examples/loki.yaml
+receivers:
+  otlp:
+    protocols:
+      http:
+        endpoint: 0.0.0.0:4318
+
+exporters:
+  otlphttp/loki:
+    endpoint: http://loki-distributor.observability.svc.cluster.local:3100/otlp
+    compression: gzip
+    headers:
+      X-Scope-OrgID: tracecore
+
+service:
+  pipelines:
+    logs/loki:
+      receivers: [otlp]
+      exporters: [otlphttp/loki]
+```
+
+Validate with the in-tree binary:
+
+```sh
+./tracecore validate --config=docs/integrations/examples/loki.yaml
+```
+
+## Endpoint and tenant
+
+- The endpoint is the Loki distributor's HTTP listener at the path
+  `/otlp`; the `otlphttp` exporter appends the OTLP-spec `/v1/logs`
+  suffix automatically, so the request lands at `/otlp/v1/logs`. Do
+  not include `/v1/logs` in the YAML, or the exporter appends it a
+  second time and the request misses `/otlp/v1/logs`.
+- `X-Scope-OrgID` identifies the tenant when Loki's distributor runs
+  with `auth_enabled: true`. Single-tenant clusters
+  (`auth_enabled: false`) accept requests without the header and
+  route them under the synthetic tenant `fake`; you can drop the
+  `headers:` block in that case.
+- Loki Operator and Grafana Enterprise Logs (GEL) layer additional
+  multi-tenant auth on top (e.g. mTLS gateways, per-tenant rate
+  limits); those are optional, not required for the basic OSS install.
+
+## Labels vs. structured metadata (the cardinality footgun)
+
+Loki indexes logs by stream **labels** and stores everything else as
+**structured metadata** (queryable in LogQL, NOT indexed). Label
+cardinality directly drives index size and query cost; the canonical
+Loki guidance is to keep label values in the low hundreds per stream.
+
+The distributor's OTLP receiver maps OTLP attributes in three buckets:
+
+| Source | Default mapping | Cardinality risk |
+|---|---|---|
+| OTLP **resource** attributes | Index labels (only the ones in `default_resource_attributes_as_index_labels`) | Bounded; the default list is curated. |
+| OTLP **scope** attributes | Structured metadata | Low; instrumentation scope is rarely high-cardinality. |
+| OTLP **log** attributes | Structured metadata | Safe by default; high-cardinality keys (e.g. `pattern.verdict_json`) stay out of the label index. |
+
+The Loki-side defaults at the distributor pick up these resource
+attributes as stream labels (from
+`default_resource_attributes_as_index_labels`):
+
+`service.name`, `service.namespace`, `deployment.environment`,
+`deployment.environment.name`, `cloud.region`,
+`cloud.availability_zone`, `k8s.cluster.name`, `k8s.namespace.name`,
+`k8s.container.name`, `container.name`, `k8s.replicaset.name`,
+`k8s.deployment.name`, `k8s.statefulset.name`, `k8s.daemonset.name`,
+`k8s.cronjob.name`, `k8s.job.name`.
+
+Operator-side tuning lives in Loki's config, not in tracecore:
+
+```yaml
+# loki.yaml (on the LOKI side, NOT in tracecore)
+limits_config:
+  allow_structured_metadata: true  # default in Loki 3.0+
+  otlp_config:
+    resource_attributes:
+      attributes_config:
+        - action: index_label
+          regex: k8s\.node\.name  # opt-in: index by node
+    log_attributes:
+      - action: structured_metadata
+        attributes:
+          - pattern.id
+          - pattern.headline
+          - pattern.remediation
+          - pattern.confidence
+          - pattern.verdict_json
+```
+
+When OTLP attributes flow into Loki, dots in attribute names are
+translated to underscores at the LogQL surface, with the bucket as
+prefix: an attribute `pattern.id` on a log record becomes
+`attributes_pattern_id` in a LogQL query; a resource attribute
+`k8s.node.name` becomes `resources_k8s_node_name`.
+Verify against your Loki version: the prefix convention is stable
+since Loki 3.0, but pre-3.0 callers should consult upstream release
+notes before adopting LogQL queries that rely on it.
+
+### Tracecore-specific attributes
+
+The patterndetectorprocessor emits verdict records carrying these
+attributes (defined in
+`module/processor/patterndetectorprocessor/patterndetector.go`):
+
+- `pattern.id`, `pattern.headline`, `pattern.remediation`,
+  `pattern.confidence`, `pattern.verdict_json`
+- `k8s.pod.name`, `k8s.pod.namespace`, `k8s.node.name`
+- `k8s.event.reason`
+- `nccl.fr.pg_id`, `nccl.fr.collective_seq_id`,
+  `nccl.fr.hanging_ranks_count`
+
+All ship as **log attributes**, so all land in Loki as **structured
+metadata** by default. This is the right shape: `pattern.verdict_json`
+in particular is per-incident JSON and would explode the label index
+if promoted. The dashboards consume them as `attributes_pattern_id`,
+`attributes_k8s_node_name`, etc. (see `## See also` below).
+
+Only resource attributes on the verdict's containing log record are
+candidates for the label index, and the default list above already
+covers `k8s.namespace.name` / `k8s.cluster.name` / `service.name` /
+the rest of the k8s workload axis.
+
+## Retention
+
+Retention is configured on the Loki side via `compactor.retention_*`
+and per-tenant `limits_config.retention_period`. Tracecore does not
+control retention; the recipe assumes the operator has set a global
+retention compatible with the verdict signal (~14-30d is typical for
+incident review; longer for compliance). If the cluster has retention
+disabled, verdicts accumulate indefinitely until disk fills, so set at
+least a default `retention_period` before pointing tracecore at the
+cluster.
+
+## Secret handling
+
+Same shape as the other recipes: render the literal `X-Scope-OrgID`
+value at deploy time through `envsubst`, Helm, or a CSI secret driver
+if the tenant identifier is sensitive.
+The example file ships the literal `tracecore` so `tracecore validate`
+succeeds offline. Single-tenant Loki clusters can drop the `headers:` block.
+
+## Failure modes
+
+| Symptom | First check |
+|---|---|
+| HTTP 401 / 403 from Loki | Auth gateway in front of the distributor is rejecting the request. Confirm the deployed `X-Scope-OrgID` value matches the gateway's tenant allow-list. |
+| HTTP 400 `the request body is too large` | Tracecore is sending batches above `limits_config.ingestion_rate_mb`. Lower the batchprocessor flush size or raise the Loki limit. |
+| HTTP 400 `structured metadata is not allowed` | Loki is below 3.0 OR `limits_config.allow_structured_metadata` is `false`. Upgrade Loki, or flip the limit. The OTLP receiver always emits structured metadata for non-label attributes. |
+| HTTP 429 with `Retry-After` | Loki's per-tenant ingestion rate-limit is engaged. Either aggregate at tracecore (`batchprocessor`) before the exporter or raise `ingestion_rate_mb` / `ingestion_burst_size_mb` on the Loki side. |
+| Verdicts arrive but `pattern.id` is missing from LogQL | The Loki distributor dropped log attributes per `otlp_config.log_attributes`. Confirm the operator-side config includes `action: structured_metadata` for `pattern.*` (see the labels-vs-metadata section above). |
+| Repeated TLS handshake failures | The default trust store covers most managed Lokis. If a corporate proxy MITMs egress, install the proxy CA in the system trust store; do not enable `insecure_skip_verify` in production. |
+| Stream cardinality alerts on the Loki cluster | Confirm no high-cardinality OTLP resource attribute (e.g. `service.instance.id`) was added to `default_resource_attributes_as_index_labels`; that list defaults sanely but is the most common operator footgun. |
+
+## See also
+
+- Upstream Loki OTLP-ingestion docs:
+  [`grafana.com/docs/loki/latest/send-data/otel/`](https://grafana.com/docs/loki/latest/send-data/otel/)
+- Upstream Loki labels-vs-structured-metadata reference:
+  [`grafana.com/docs/loki/latest/get-started/labels/structured-metadata/`](https://grafana.com/docs/loki/latest/get-started/labels/structured-metadata/)
+- Upstream exporter docs:
+  [`exporter/otlphttpexporter`](https://github.com/open-telemetry/opentelemetry-collector/tree/main/exporter/otlphttpexporter)
+- Generic OTel backend recipe: [`otel-backend.md`](otel-backend.md)
+- Honeycomb (OTLP/HTTP with vendor headers, same exporter shape):
+  [`honeycomb.md`](honeycomb.md)
+- Grafana dashboard for pattern verdicts (install path that this
+  recipe unblocks): `install/kubernetes/tracecore/dashboards/patterns.json`
+  (PR #264); six panels query Loki via LogQL against
+  `attributes_pattern_id` / `attributes_k8s_node_name`, matching the
+  default OTLP-log-attribute-to-structured-metadata mapping this
+  recipe documents.
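
Review note: the naming rule the recipe states (bucket prefix, then the attribute name with dots turned into underscores) can be sketched as a small shell helper for checking dashboard queries by hand. This is only an illustration under the recipe's own description: `logql_key` is a hypothetical function name, not part of tracecore or Loki, and the prefix convention should be confirmed against the Loki version in use.

```sh
# Hypothetical helper (not shipped by tracecore or Loki): derives the LogQL
# key this recipe describes for an OTLP attribute.
#   $1: bucket prefix as named in the recipe ("attributes" or "resources")
#   $2: OTLP attribute name, e.g. pattern.id
logql_key() {
  # Replace every dot with an underscore, then prepend the bucket prefix.
  printf '%s_%s\n' "$1" "$(printf '%s' "$2" | tr '.' '_')"
}

logql_key attributes pattern.id      # log-record attribute
logql_key resources k8s.node.name    # resource attribute
```

The helper only rewrites names; it does not query Loki, so it says nothing about whether a given attribute was actually kept as structured metadata by the distributor.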