docs(slo): declare v0.4 self-observability SLO targets + runbook by trilamsr · Pull Request #288 · TraceCoreAI/tracecore

trilamsr · 2026-06-01T05:18:39Z

Summary

Roadmap item A14 (self-observability SLO targets). Adds three artifacts:

docs/SLOs.md — 5 SLIs (verdict-emit latency, collector availability, verdict correctness, pipeline drop rate, self-telemetry availability). v0.4 soft-commit; v1.0 binding per roadmap item B6.
docs/RUNBOOK-tracecore.md — symptom → root cause → first check → remediation for tracecore's own failure modes (OOM, receiver init, processor saturation, exporter backpressure, self-tel unreachable, verdict latency, false positives). Cross-references per-component RUNBOOKs without duplicating them.
install/kubernetes/tracecore/dashboards/slo-rules.yaml — 5 recording rules + 7 alerting rules. Every alert carries for: 5m and a runbook_url pointing at the matching section of RUNBOOK-tracecore.md.

Adopt-over-build: every SLI is vanilla PromQL over the upstream otelcol_* self-telemetry surface plus kube-state-metrics for availability. No tracecore-specific SLO tracker. RuleGroup shape is compatible with both rule_files: in prometheus.yml and PrometheusRule CRD.
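As a rough illustration of the dual-compatibility point, the sketch below takes one rule-group list and exposes it both as a plain rule file body (`groups:` at the top level) and wrapped in a PrometheusRule manifest. Everything inside the example group (group name, rule names, expressions, the runbook URL, and placing `runbook_url` under `annotations`) is a hypothetical placeholder and not taken from slo-rules.yaml, as are the `name` and `namespace` arguments. Only the `apiVersion` / `kind` / `spec.groups` envelope is the standard prometheus-operator shape.

```python
import json


def to_prometheus_rule(groups, name, namespace):
    """Wrap plain Prometheus rule groups in a PrometheusRule manifest (as a dict)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        # The same list that sits under `groups:` in a rule file goes here unchanged.
        "spec": {"groups": groups},
    }


# Hypothetical rule group: names, expressions, and URL are placeholders only.
EXAMPLE_GROUPS = [
    {
        "name": "tracecore-slo-example",
        "rules": [
            {"record": "tracecore:example_sli:5m", "expr": "vector(1)"},
            {
                "alert": "TracecoreExampleSLI",
                "expr": "tracecore:example_sli:5m < 1",
                "for": "5m",
                "labels": {"severity": "warning"},
                "annotations": {
                    "runbook_url": "https://example.invalid/RUNBOOK-tracecore.md#example"
                },
            },
        ],
    }
]

# Shape consumed via `rule_files:` in prometheus.yml.
rule_file = {"groups": EXAMPLE_GROUPS}

# Shape consumed by the prometheus-operator.
crd = to_prometheus_rule(EXAMPLE_GROUPS, "tracecore-slo-rules", "monitoring")

if __name__ == "__main__":
    # JSON is a subset of YAML, so this output can be applied as a manifest.
    print(json.dumps(crd, indent=2))
```

Because the group list is passed through untouched, the two forms cannot drift apart as long as both are generated from the same source.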

What this PR does not do

Does not make SLOs binding. v0.4 = soft commit (warning-tier alerts only); v1.0 graduates per roadmap B6.
Does not touch per-component RUNBOOKs (module/processor/patterndetectorprocessor/RUNBOOK.md etc.). Cross-reference only.
Does not add new self-telemetry counters. The deferred otelcol_processor_patterndetector_verdict_emit_seconds_bucket histogram (issue #261, "patterndetectorprocessor: emit verdict counter metric for dashboard panels") is referenced in the rules so they activate the moment the metric lands; a LogQL stop-gap is documented for SLI 1 in the interim.

Test plan

bash scripts/doc-check.sh — 512 markdown links + 52 non-md links resolve; banned-phrase clean; test-name parity clean.
bash scripts/alert-check.sh — no RUNBOOK/alerts.yaml pair drift introduced (the new rule file lives under the chart, not next to a component RUNBOOK, so it's out of the check's scope by design).
make check — golangci-lint, go vet, go mod verify all clean.
Static rule-file sanity: 7 unique alerts, all carry for: 5m + runbook_url; 5 recording rules use the tracecore:<sli>:<window> naming convention.
promtool check rules install/kubernetes/tracecore/dashboards/slo-rules.yaml — not run (promtool is not installed in the dev sandbox, and installing it via brew install prometheus was not requested). Run it locally before merge.
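The static rule-file sanity check could be scripted along these lines. This is a minimal sketch, not the check that was actually run: it assumes the rule file has already been parsed into Python dicts (for example with a YAML loader), that `runbook_url` lives under `annotations`, and that `<window>` looks like a Prometheus duration such as `5m`. The function name and the regular expression are illustrative.

```python
import re
from collections import Counter

# Assumed reading of the tracecore:<sli>:<window> convention.
RECORD_NAME = re.compile(r"^tracecore:[a-z0-9_]+:[0-9]+[smhdw]$")


def check_rule_groups(groups, required_for="5m"):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    alert_names = Counter()
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" in rule:
                name = rule["alert"]
                alert_names[name] += 1
                # Every alert must carry the expected `for:` duration.
                if rule.get("for") != required_for:
                    problems.append(
                        f"{name}: expected for: {required_for}, got {rule.get('for')!r}"
                    )
                # Every alert must point at a runbook section.
                if not rule.get("annotations", {}).get("runbook_url"):
                    problems.append(f"{name}: missing runbook_url annotation")
            elif "record" in rule:
                if not RECORD_NAME.match(rule["record"]):
                    problems.append(
                        f"{rule['record']}: does not match tracecore:<sli>:<window>"
                    )
    # Alert names must be unique across all groups.
    for name, count in alert_names.items():
        if count > 1:
            problems.append(f"{name}: defined {count} times")
    return problems


if __name__ == "__main__":
    # Hypothetical input; real use would load slo-rules.yaml and pass its groups.
    sample = [
        {
            "name": "example",
            "rules": [
                {"record": "tracecore:drop_rate:5m", "expr": "vector(0)"},
                {"alert": "ExampleAlert", "expr": "vector(1)", "for": "1m"},
            ],
        }
    ]
    for problem in check_rule_groups(sample):
        print(problem)
```

This only covers structure and naming. It does not validate the PromQL expressions, which is what promtool check rules would add.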

docs(slo): declare v0.4 self-observability SLO targets + tracecore runbook

Adds three artifacts:

- docs/SLOs.md — 5 SLIs (verdict-emit latency, collector availability, verdict correctness, pipeline drop rate, self-telemetry availability) with v0.4 soft-commit targets and v1.0 binding posture per roadmap B6.
- docs/RUNBOOK-tracecore.md — symptom -> root-cause -> first-check -> remediation entries for tracecore's own failure modes (collector OOM, receiver init, processor saturation, exporter backpressure, self-tel unreachable). Cross-references per-component RUNBOOKs without duplicating their content.
- install/kubernetes/tracecore/dashboards/slo-rules.yaml — Prometheus recording + alerting rules for the 4 SLIs that have a PromQL expression. All alerts carry for: 5m and a runbook_url.

Adopt-over-build: SLO measurement is vanilla PromQL over the upstream otelcol_* surface + kube-state-metrics. No tracecore-specific tracker.

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr enabled auto-merge (squash) June 1, 2026 05:31

trilamsr merged commit 334a8ae into main Jun 1, 2026
14 checks passed

trilamsr deleted the docs/slo-targets branch June 1, 2026 05:31

trilamsr merged 1 commit into main from docs/slo-targets
