Two label-alignment defects in install/kubernetes/tracecore/dashboards/slo-rules.yaml:
1. pipeline label does not exist on the source metrics
The drop-rate recording rule (tracecore:pipeline_drop_rate:5m)
groups by pipeline:
    sum by (pipeline) (
        rate(otelcol_processor_refused_log_records_total[5m])
      + rate(otelcol_exporter_send_failed_log_records_total[5m])
    )
    /
    clamp_min(
      sum by (pipeline) (
        rate(otelcol_receiver_accepted_log_records_total[5m])
      ),
      0.001
    )
Upstream processorhelper/obsreport.go stamps only a processor
attribute, and upstream receiverhelper/obsreport.go stamps receiver
and transport. There is no pipeline label, so sum by (pipeline)
collapses to a single empty pipeline="" group and the {{ $labels.pipeline }} field in the alert annotations renders empty.
Math: the ratio is still per-pod-correct (since all metrics carry
the same instance), but the runbook's per-pipeline localization
guidance doesn't apply to the alert payload.
2. unless on (instance) join in TracecoreSelftelemetryDown doesn't match
    up{job="tracecore"} == 0 unless on (instance) kube_pod_status_phase{phase="Failed"} == 1
up{} labels: instance, job (from scrape config).
kube_pod_status_phase{} labels: namespace, pod, phase, uid.
The two metrics have no shared instance label unless an external
relabel step injects one, so the unless on (instance) ... clause is
effectively a no-op and the alert becomes functionally identical to
TracecorePodDown. The intent (separating a "listener wedge" from a
"pod crashed" condition) is not actually implemented.
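The no-op behaviour described above could be checked with a promtool unit test along these lines. This is a minimal sketch, not part of the repository: the instance address, namespace, pod name, and uid are hypothetical values, and the test evaluates the expression directly rather than loading slo-rules.yaml.

```yaml
# Sketch for `promtool test rules <this-file>`; all label values are hypothetical.
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Self-telemetry scrape target is down.
      - series: 'up{job="tracecore", instance="10.0.0.5:8888"}'
        values: '0 0 0'
      # The pod is reported as Failed, but this series has no matching instance label.
      - series: 'kube_pod_status_phase{namespace="obs", pod="tracecore-0", phase="Failed", uid="abc"}'
        values: '1 1 1'
    promql_expr_test:
      - expr: up{job="tracecore"} == 0 unless on (instance) kube_pod_status_phase{phase="Failed"} == 1
        eval_time: 2m
        exp_samples:
          # The left-hand sample should survive because nothing on the right matches on instance.
          - labels: 'up{job="tracecore", instance="10.0.0.5:8888"}'
            value: 0
```

If the unless clause worked as intended, the expression would return no samples for a Failed pod, and this test would fail.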
Fix
For (1): either replace sum by (pipeline) with sum by (component_id, instance)
(or sum by (processor) / sum by (exporter) per-rule), or have
operators wire pipeline naming via target_label: pipeline in metric_relabel_configs. Document the choice.
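A minimal sketch of the first option, narrowed to instance only. It assumes each term is aggregated before the addition, because the processor-side and exporter-side series carry different component labels and PromQL's binary + matches only series with identical label sets. The group name is hypothetical, and the record name is kept unchanged only for illustration.

```yaml
groups:
  - name: tracecore-slo-sketch   # hypothetical group name
    rules:
      - record: tracecore:pipeline_drop_rate:5m
        expr: |
          (
              sum by (instance) (rate(otelcol_processor_refused_log_records_total[5m]))
            +
              sum by (instance) (rate(otelcol_exporter_send_failed_log_records_total[5m]))
          )
          /
          clamp_min(
            sum by (instance) (rate(otelcol_receiver_accepted_log_records_total[5m])),
            0.001
          )
```

With this grouping, alert annotations would reference {{ $labels.instance }} instead of {{ $labels.pipeline }}. An instance that exposes only one of the two numerator series drops out of the addition, so the rule may need an or-fallback if that case occurs in practice.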
For (2): join via namespace + pod (after relabeling up{} to
carry pod from the kube-prometheus-stack scrape config) — or
drop the unless clause and document why "self-tel down" implies
"investigate pod first" in the runbook.
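The namespace + pod join for (2) might look like the sketch below. It assumes the tracecore target is discovered through kubernetes_sd_configs with role: pod, so the standard __meta_kubernetes_namespace and __meta_kubernetes_pod_name labels are available. A kube-prometheus-stack ServiceMonitor or PodMonitor may already attach these labels, in which case only the rule change is needed.

```yaml
# Scrape-config fragment: copy pod identity onto up{} and the scraped series.
relabel_configs:
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod
---
# Alert-rule fragment: suppress the alert when the same pod is already Failed.
- alert: TracecoreSelftelemetryDown
  expr: |
    up{job="tracecore"} == 0
    unless on (namespace, pod)
    kube_pod_status_phase{phase="Failed"} == 1
```

Joining on namespace and pod uses labels that kube_pod_status_phase carries natively, so no relabeling is required on the kube-state-metrics side.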
References
Upstream label sources:
processor@v0.110.0/processorhelper/obsreport.go line 58 (ProcessorKey only)
receiver@v0.110.0/receiverhelper/obsreport.go line 64-65 (ReceiverKey, TransportKey)
Rule locations:
slo-rules.yaml lines 117-126 (drop-rate rule), 181 (selftel unless-clause)