docs(slo): declare v0.4 self-observability SLO targets + runbook #288

Merged: trilamsr merged 1 commit into main from docs/slo-targets on Jun 1, 2026
trilamsr commented Jun 1, 2026

Summary

Roadmap item A14 (self-observability SLO targets). Adds three artifacts:

  • docs/SLOs.md — 5 SLIs (verdict-emit latency, collector availability, verdict correctness, pipeline drop rate, self-telemetry availability). v0.4 soft-commit; v1.0 binding per roadmap item B6.
  • docs/RUNBOOK-tracecore.md — symptom → root cause → first check → remediation for tracecore's own failure modes (OOM, receiver init, processor saturation, exporter backpressure, self-tel unreachable, verdict latency, false positives). Cross-references per-component RUNBOOKs without duplicating them.
  • install/kubernetes/tracecore/dashboards/slo-rules.yaml — 5 recording rules + 7 alerting rules. Every alert carries for: 5m and a runbook_url pointing at the matching section of RUNBOOK-tracecore.md.

Adopt-over-build: every SLI is vanilla PromQL over the upstream otelcol_* self-telemetry surface plus kube-state-metrics for availability. No tracecore-specific SLO tracker. RuleGroup shape is compatible with both rule_files: in prometheus.yml and PrometheusRule CRD.
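
The claim that one RuleGroup shape serves both consumers can be illustrated with a small sketch. It assumes the standard layouts: a Prometheus rule file has a top-level `groups:` list, and a PrometheusRule object carries the same list under `spec.groups`. The group name, rule expression, `job` label, and object name below are hypothetical placeholders, not the contents of slo-rules.yaml.

```python
import json

# Hypothetical rule group; the name, expression, and job label are
# illustrative assumptions, not taken from slo-rules.yaml.
GROUP = {
    "name": "tracecore-slo-example",
    "rules": [
        {
            "record": "tracecore:collector_availability:5m",
            "expr": 'avg_over_time(up{job="tracecore"}[5m])',
        }
    ],
}


def as_rule_file(groups):
    """Shape loaded through `rule_files:` in prometheus.yml."""
    return {"groups": groups}


def as_prometheus_rule(groups, name):
    """Shape of a PrometheusRule custom resource (Prometheus Operator)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name},
        "spec": {"groups": groups},
    }


rule_file = as_rule_file([GROUP])
crd = as_prometheus_rule([GROUP], "tracecore-slo-example")

# The same group list appears unchanged inside both wrappers.
assert rule_file["groups"] == crd["spec"]["groups"]
print(json.dumps(crd, indent=2))
```

Only the wrapper differs between the two targets, so the rules themselves would not need per-target variants.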

What this PR does not do

  • Does not make SLOs binding. v0.4 = soft commit (warning-tier alerts only); v1.0 graduates per roadmap B6.
  • Does not touch per-component RUNBOOKs (module/processor/patterndetectorprocessor/RUNBOOK.md etc.). Cross-reference only.
  • Does not add new self-telemetry counters. The deferred otelcol_processor_patterndetector_verdict_emit_seconds_bucket histogram (issue #261, "patterndetectorprocessor: emit verdict counter metric for dashboard panels") is referenced in the rules so they activate the moment the metric lands; a LogQL stop-gap is documented for SLI 1 in the interim.

Test plan

  • bash scripts/doc-check.sh — 512 markdown links + 52 non-md links resolve; banned-phrase clean; test-name parity clean.
  • bash scripts/alert-check.sh — no RUNBOOK/alerts.yaml pair drift introduced (the new rule file lives under the chart, not next to a component RUNBOOK, so it's out of the check's scope by design).
  • make check: golangci-lint, go vet, go mod verify all clean.
  • Static rule-file sanity: 7 unique alerts, all carry for: 5m + runbook_url; 5 recording rules use the tracecore:<sli>:<window> naming convention.
  • promtool check rules install/kubernetes/tracecore/dashboards/slo-rules.yaml: not run (promtool not installed in the dev sandbox; user did not request a brew install prometheus). Recommend running locally before merge.
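
The static rule-file sanity item above could be automated along the following lines. This is a minimal sketch, not the check that was actually run: it is a line-oriented scan rather than a YAML parser, it assumes block-style YAML with one key per line, and the sample rule text, alert name, and runbook URL are hypothetical stand-ins for the real slo-rules.yaml.

```python
import re

# Hypothetical sample in the Prometheus rule-file layout; not the real file.
SAMPLE = """\
groups:
  - name: tracecore-slo-example
    rules:
      - record: tracecore:pipeline_drop_rate:5m
        expr: vector(0)
      - alert: TracecoreExampleDropRateHigh
        expr: tracecore:pipeline_drop_rate:5m > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          runbook_url: https://example.invalid/RUNBOOK-tracecore.md#drop-rate
"""

# tracecore:<sli>:<window>, e.g. tracecore:pipeline_drop_rate:5m
RECORD_NAME = re.compile(r"^tracecore:[a-z0-9_]+:[0-9]+[smhdw]$")


def split_rules(text):
    """Group stripped lines into one chunk per `- record:` / `- alert:` entry."""
    chunks, current = [], None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(("- record:", "- alert:")):
            current = [stripped[2:]]  # drop the leading "- "
            chunks.append(current)
        elif current is not None:
            current.append(stripped)
    return chunks


def check(text):
    """Return a list of problems; an empty list means every check passed."""
    problems, seen_alerts = [], set()
    for chunk in split_rules(text):
        kind, _, name = chunk[0].partition(":")
        name = name.strip()
        if kind == "record":
            if not RECORD_NAME.match(name):
                problems.append(f"{name}: bad recording-rule name")
            continue
        if name in seen_alerts:
            problems.append(f"{name}: duplicate alert")
        seen_alerts.add(name)
        if "for: 5m" not in chunk:
            problems.append(f"{name}: missing 'for: 5m'")
        if not any(line.startswith("runbook_url:") for line in chunk):
            problems.append(f"{name}: missing runbook_url")
    return problems


print(check(SAMPLE))  # → []
```

A scan like this only covers the conventions listed in the test plan; it does not validate PromQL, so it complements promtool check rules rather than replacing it.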
docs(slo): declare v0.4 self-observability SLO targets + tracecore runbook

Adds three artifacts:

- docs/SLOs.md — 5 SLIs (verdict-emit latency, collector availability,
  verdict correctness, pipeline drop rate, self-telemetry availability)
  with v0.4 soft-commit targets and v1.0 binding posture per roadmap B6.
- docs/RUNBOOK-tracecore.md — symptom -> root-cause -> first-check ->
  remediation entries for tracecore's own failure modes (collector OOM,
  receiver init, processor saturation, exporter backpressure, self-tel
  unreachable). Cross-references per-component RUNBOOKs without
  duplicating their content.
- install/kubernetes/tracecore/dashboards/slo-rules.yaml — Prometheus
  recording + alerting rules for the 4 SLIs that have a PromQL
  expression. All alerts carry for: 5m and a runbook_url.

Adopt-over-build: SLO measurement is vanilla PromQL over the upstream
otelcol_* surface + kube-state-metrics. No tracecore-specific tracker.

Signed-off-by: Tri Lam <tree@lumalabs.ai>