docs(slo): declare v0.4 self-observability SLO targets + runbook #288
Merged
Adds three artifacts:

- docs/SLOs.md: 5 SLIs (verdict-emit latency, collector availability, verdict correctness, pipeline drop rate, self-telemetry availability) with v0.4 soft-commit targets and v1.0 binding posture per roadmap B6.
- docs/RUNBOOK-tracecore.md: symptom -> root-cause -> first-check -> remediation entries for tracecore's own failure modes (collector OOM, receiver init, processor saturation, exporter backpressure, self-tel unreachable). Cross-references per-component RUNBOOKs without duplicating their content.
- install/kubernetes/tracecore/dashboards/slo-rules.yaml: Prometheus recording + alerting rules for the 4 SLIs that have a PromQL expression. All alerts carry for: 5m and a runbook_url.

Adopt-over-build: SLO measurement is vanilla PromQL over the upstream otelcol_* surface + kube-state-metrics. No tracecore-specific tracker.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
This was referenced Jun 1, 2026
Summary
Roadmap item A14 (self-observability SLO targets). Adds three artifacts:
- docs/SLOs.md: 5 SLIs (verdict-emit latency, collector availability, verdict correctness, pipeline drop rate, self-telemetry availability). v0.4 soft-commit; v1.0 binding per roadmap item B6.
- docs/RUNBOOK-tracecore.md: symptom → root cause → first check → remediation for tracecore's own failure modes (OOM, receiver init, processor saturation, exporter backpressure, self-tel unreachable, verdict latency, false positives). Cross-references per-component RUNBOOKs without duplicating them.
- install/kubernetes/tracecore/dashboards/slo-rules.yaml: 5 recording rules + 7 alerting rules. Every alert carries for: 5m and a runbook_url pointing at the matching section of RUNBOOK-tracecore.md.

Adopt-over-build: every SLI is vanilla PromQL over the upstream otelcol_* self-telemetry surface plus kube-state-metrics for availability. No tracecore-specific SLO tracker. RuleGroup shape is compatible with both rule_files: in prometheus.yml and the PrometheusRule CRD.

What this PR does not do
- module/processor/patterndetectorprocessor/RUNBOOK.md etc.). Cross-reference only.
- otelcol_processor_patterndetector_verdict_emit_seconds_bucket histogram (issue "patterndetectorprocessor: emit verdict counter metric for dashboard panels" #261) is referenced in the rules so they activate the moment the metric lands; a LogQL stop-gap is documented for SLI 1 in the interim.

Test plan
- bash scripts/doc-check.sh: 512 markdown links + 52 non-md links resolve; banned-phrase clean; test-name parity clean.
- bash scripts/alert-check.sh: no RUNBOOK/alerts.yaml pair drift introduced (the new rule file lives under the chart, not next to a component RUNBOOK, so it's out of the check's scope by design).
- make check: golangci-lint, go vet, go mod verify all clean.
- for: 5m + runbook_url; 5 recording rules use the tracecore:<sli>:<window> naming convention.
- promtool check rules install/kubernetes/tracecore/dashboards/slo-rules.yaml: not run (promtool not installed in the dev sandbox; user did not request a brew install prometheus). Recommend running locally before merge.
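
For reviewers who want a concrete picture of the conventions listed above, here is a minimal sketch of one recording rule and one alert in the shape the summary describes. It is not copied from slo-rules.yaml. The group name, rule names, threshold, and runbook URL are hypothetical placeholders. The exporter metric names are an assumption about the upstream otelcol_* surface, and some collector versions expose them with a _total suffix.

```yaml
groups:
  - name: tracecore-slo-example            # hypothetical group name
    rules:
      # Recording rule named per the tracecore:<sli>:<window> convention.
      # The SLI label "pipeline_drop_ratio" is illustrative only.
      - record: tracecore:pipeline_drop_ratio:5m
        expr: |
          sum(rate(otelcol_exporter_send_failed_spans[5m]))
          /
          (
            sum(rate(otelcol_exporter_sent_spans[5m]))
            + sum(rate(otelcol_exporter_send_failed_spans[5m]))
          )

      # Alert over the recorded series, carrying for: 5m and a runbook_url.
      - alert: TracecorePipelineDropRateHigh   # hypothetical alert name
        expr: tracecore:pipeline_drop_ratio:5m > 0.01   # placeholder threshold, not the PR's target
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Pipeline drop ratio above the example threshold for 5 minutes
          # Placeholder URL; real alerts point at the matching RUNBOOK-tracecore.md section.
          runbook_url: https://example.invalid/docs/RUNBOOK-tracecore.md#exporter-backpressure
```

A file of this shape can be listed under rule_files: in prometheus.yml as-is. For the Prometheus Operator, the same groups: list goes under spec: of a PrometheusRule object. Running promtool check rules against the file validates both the YAML structure and the PromQL expressions.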