slo-rules.yaml: 'pipeline' label doesn't exist on otelcol_* metrics; 'unless on (instance)' join silently no-ops #298

Description

@trilamsr

Summary

Two label-alignment defects in
install/kubernetes/tracecore/dashboards/slo-rules.yaml:

1. pipeline label does not exist on the source metrics

The drop-rate recording rule (tracecore:pipeline_drop_rate:5m)
groups by pipeline:

sum by (pipeline) (
  rate(otelcol_processor_refused_log_records_total[5m])
  + rate(otelcol_exporter_send_failed_log_records_total[5m])
)
/
clamp_min(sum by (pipeline) (
  rate(otelcol_receiver_accepted_log_records_total[5m])
), 0.001)

Upstream processorhelper/obsreport.go only stamps a processor
attribute. Upstream receiverhelper/obsreport.go stamps receiver and
transport. There is no pipeline label. sum by (pipeline)
collapses to a single empty pipeline="" group. The
{{ $labels.pipeline }} field in the alert annotations renders empty.

Math: the ratio is still per-pod-correct (since all metrics carry
the same instance), but the runbook's per-pipeline localization
guidance doesn't apply to the alert payload.

2. unless on (instance) join in TracecoreSelftelemetryDown doesn't match

up{job="tracecore"} == 0 unless on (instance) kube_pod_status_phase{phase="Failed"} == 1

up{} labels: instance, job (from scrape config).
kube_pod_status_phase{} labels: namespace, pod, phase, uid.

The two metrics have no shared instance label unless an external
relabel step injects one. unless on (instance) only drops a left-hand
series when a right-hand series has the same instance value, and none
does, so the clause is effectively a no-op and the alert becomes
functionally identical to TracecorePodDown. The intent (separate
"listener wedge" from "pod crashed") is not actually implemented.

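One way to confirm the no-op is a promtool unit test (promtool test rules). The sketch below is a minimal reproduction, assuming the label sets listed above; the instance address, namespace, pod name, and uid are hypothetical values chosen for illustration. If the join worked as intended, the Failed pod would suppress the result; with mismatched labels, the up series is expected to pass through unchanged.

```yaml
# Hypothetical promtool test file; label values are illustrative only.
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Scrape target is down; labels come from the scrape config.
      - series: 'up{job="tracecore", instance="10.0.0.1:8888"}'
        values: '0 0 0 0 0'
      # The pod is Failed, but the series carries no instance label.
      - series: 'kube_pod_status_phase{namespace="tracecore", pod="tracecore-0", phase="Failed", uid="abc"}'
        values: '1 1 1 1 1'
    promql_expr_test:
      - expr: 'up{job="tracecore"} == 0 unless on (instance) kube_pod_status_phase{phase="Failed"} == 1'
        eval_time: 4m
        # The unless clause finds no matching instance, so nothing is dropped.
        exp_samples:
          - labels: 'up{job="tracecore", instance="10.0.0.1:8888"}'
            value: 0
```
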
Fix

For (1): either replace sum by (pipeline) with sum by (component_id, instance)
(or sum by (processor) / sum by (exporter) per-rule), or have
operators wire pipeline naming via target_label: pipeline in
metric_relabel_configs. Document the choice.
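
A minimal sketch of the first option, grouping by instance only. The group name is hypothetical, and the rule keeps its existing record name purely for illustration. Each term is aggregated before the addition because the processor and exporter series carry different component labels, and PromQL's + only matches series with identical label sets.

```yaml
groups:
  - name: tracecore-slo-sketch   # hypothetical group name
    rules:
      - record: tracecore:pipeline_drop_rate:5m
        expr: |
          (
            # Aggregate each failure source to instance level first.
            sum by (instance) (rate(otelcol_processor_refused_log_records_total[5m]))
            +
            sum by (instance) (rate(otelcol_exporter_send_failed_log_records_total[5m]))
          )
          /
          clamp_min(
            sum by (instance) (rate(otelcol_receiver_accepted_log_records_total[5m])),
            0.001
          )
```

With this shape the alert annotations would reference {{ $labels.instance }} rather than {{ $labels.pipeline }}. One caveat: an instance that exposes only one of the two failure counters drops out of the + result, so this assumes both series exist per instance.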

For (2): join via namespace + pod (after relabeling up{} to
carry pod from the kube-prometheus-stack scrape config) — or
drop the unless clause and document why "self-tel down" implies
"investigate pod first" in the runbook.

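The first option for (2) could look roughly like the fragment below. It assumes up{job="tracecore"} has been relabeled to carry namespace and pod, and that the namespace and pod labels on kube_pod_status_phase identify the monitored pod rather than the kube-state-metrics pod. Other alert fields (for, labels, annotations) are omitted.

```yaml
- alert: TracecoreSelftelemetryDown
  expr: |
    up{job="tracecore"} == 0
    # Suppress when the same pod is already reported as Failed.
    unless on (namespace, pod)
    kube_pod_status_phase{phase="Failed"} == 1
```
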
References

  • Upstream label sources:
    • processor@v0.110.0/processorhelper/obsreport.go line 58 (ProcessorKey only)
    • receiver@v0.110.0/receiverhelper/obsreport.go line 64-65 (ReceiverKey, TransportKey)
  • slo-rules.yaml lines 117-126 (drop-rate rule), 181 (selftel unless-clause)
  • Found during adversarial review of PR docs(slo): declare v0.4 self-observability SLO targets + runbook #288.
