[reliability] Daily Reliability Review - 2026-06-12 #38970

@github-actions

Executive Summary

Telemetry source: Sentry org github, project gh-aw (spans dataset), last 24h. Authenticated as Mara Kiefer.

Overall health is stable, with no acute outages. Across 27,271 spans the agent fleet ran cleanly: 0 timeouts, 0 cancellations, and 0 OTLP export failures. Of the ~506 traces that emitted a job conclusion, 24 distinct traces carried a gh-aw.run.status:failure job (≈4.7%) and 482 carried a success. Failures are thinly spread across 14 workflows (8 of them with a single failed run), so this reads as normal background noise rather than a regression spike. However, telemetry does not record why the failed runs failed.

The more actionable signal is an instrumentation/queryability gap: core reliability fields needed to triage failures are not queryable in Sentry even though agent spans are flowing — most importantly gen_ai.response.finish_reasons (length-truncation detection) and a per-failure error reason.

⚠️ This is an evidence-first report. Several "failure" attributes are null/absent, so runtime outcomes are reported as confirmed-where-evidenced and inconclusive-with-instrumentation-gap otherwise.

Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
|---|---|---|---|---|
| P1 | Contribution Check | Recurring failed runs (top failer) | 4 distinct failing traces; rep trace 9bc13cae..., engine claude-sonnet-4.6, continuity intact | Inspect the 4 runs; no error reason in telemetry (see P4) |
| P1 | PR Sous Chef / Smoke Copilot | Repeated failures | 3 distinct failing traces each | Triage; Smoke Copilot failing = engine smoke regression risk |
| P2 | AI Moderator / Issue Monster / [aw] Failure Investigator (6h) | Multiple failures | 2 distinct failing traces each | Confirm whether shared cause vs independent |
| P3 | (fleet-wide) | Truncation undetectable | gen_ai.response.finish_reasons returns 0 results despite 4,300 gen_ai.operation.name:chat spans | Fix emit/indexing of finish_reasons array attr |
| P3 | DataFlow PR & Discussion Dataset Builder | Latency outlier | gen_ai span 21,609,288 ms (~6.0h), trace 1eddb4eb..., run.status success | Confirm real vs whole-agent span mislabel |
| P4 | (fleet-wide) | Failed runs lack a captured cause | gh-aw.error_count:>0 = 0; no gh-aw.error.messages on the 24 failing traces | Emit failure reason on conclusion spans |
| P5 | (fleet-wide) | span.status / release null | both null across all 27,271 spans | Use gh-aw.run.status + gh-aw.cli.version instead |

Failure distribution (distinct failing traces): Contribution Check 4 · PR Sous Chef 3 · Smoke Copilot 3 · AI Moderator 2 · Issue Monster 2 · [aw] Failure Investigator (6h) 2 · Smoke Codex 1 · Smoke Claude 1 · Auto-Triage Issues 1 · Daily Ambient Context Optimizer 1 · Daily CLI Tools Exploratory Tester 1 · Smoke Copilot - AOAI (Entra) 1 · Constraint Solving — Problem of the Day 1 · Daily Reliability Review 1.

Representative Traces

Operational failure — Contribution Check (9bc13cae49bf63826bffce421af11f2c)

  • Continuity intact: one trace = one run; the failure status appears on the agent, safe_outputs, detection, and conclusion job-conclusion spans of this run (this is expected multi-job-per-run behavior, not a duplicate-failure).
  • Engine: claude-sonnet-4.6; longest agent gen_ai span ≈ 434,154 ms (≈7.2 min). No gh-aw.error.messages / finish_reasons attached → cause not recoverable from telemetry.

Latency outlier — DataFlow PR & Discussion Dataset Builder (1eddb4eb20b7e512ea172cc70238ec69)

  • Two gen_ai-op spans at 21,609,288 ms and 21,602,759 ms (~6.0h), run.status:success, no gen_ai.request.model → these look like whole-agent-execution spans mislabeled under span.op:gen_ai rather than single LLM calls (still to be confirmed, per the P3 finding).
  • Context: gen_ai span avg 28.6s, p95 94.8s (n=16,146). Other long agent spans: "Daily Agent of the Day Blog Writer" ~47 min (a1ac970b...), "Daily Security Observability Report" ~46 min (c6945a28...) — all success.
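
The mislabel heuristic described above (a gen_ai span with no gen_ai.request.model and a duration far beyond the fleet p95) could be applied as a client-side filter. The following is a minimal sketch, not the report's actual tooling: it assumes each span arrives as a plain dict, that the duration field is called `span.duration` and is in milliseconds, and that "far beyond" means 10× the p95. All three are illustrative assumptions, and the sample rows are made up except for the two durations quoted above.

```python
P95_GEN_AI_MS = 94_800  # fleet gen_ai p95 quoted in this report (94.8 s)


def is_probable_whole_agent_span(span, p95_ms=P95_GEN_AI_MS, factor=10):
    """Return True when a gen_ai span looks like a whole-agent execution.

    Assumed heuristic: op is gen_ai, no request model is recorded, and the
    duration exceeds `factor` times the fleet p95.
    """
    if span.get("span.op") != "gen_ai":
        return False
    has_model = bool(span.get("gen_ai.request.model"))
    too_long = span.get("span.duration", 0) > factor * p95_ms
    return (not has_model) and too_long


def ms_to_hours(ms):
    """Convert milliseconds to hours."""
    return ms / 3_600_000


# Two durations from the outlier trace plus one made-up ordinary LLM call.
spans = [
    {"span.op": "gen_ai", "span.duration": 21_609_288},
    {"span.op": "gen_ai", "span.duration": 21_602_759},
    {"span.op": "gen_ai", "span.duration": 28_600,
     "gen_ai.request.model": "example-model"},  # hypothetical model name
]

suspects = [s for s in spans if is_probable_whole_agent_span(s)]
print(len(suspects), round(ms_to_hours(suspects[0]["span.duration"]), 1))
```

A filter like this only narrows the candidates; confirming the mislabel still requires checking how the span was emitted.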

Recommendations

  1. Make truncation queryable (smallest, highest-value fix). gen_ai.response.finish_reasons is emitted as an array attr at actions/setup/js/send_otlp_span.cjs:2139 (buildArrayAttr), but Sentry EAP returns 0 rows for both has:gen_ai.response.finish_reasons and finish_reasons:length despite 4,300 chat spans. Emit a scalar mirror (e.g. gen_ai.response.finish_reason as a string) so :length truncation/runaway detection actually works.
  2. Attach a failure reason to conclusion spans. The 24 failing runs carry gh-aw.run.status:failure but gh-aw.error_count:>0=0 and no gh-aw.error.messages. When the run fails via GitHub job conclusion (not output errors), also stamp gh-aw.run.status_message / gh-aw.failure.categories on the conclusion span so failures are self-explaining.
  3. Standardize triage fields in the dashboard. span.status, release, and service.version are null fleet-wide; use gh-aw.run.status for outcome and gh-aw.cli.version (present on 4,528 spans) for release correlation until the resource→span mapping is fixed.
  4. Spot-check the engine smoke failures (Smoke Copilot ×3, Smoke Codex ×1, Smoke Claude ×1): a failing smoke test is a cheaper early-warning signal than a failing user workflow.
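
Recommendation 1 could take roughly the following shape. This is a hedged sketch in Python rather than the emitter's JavaScript, and it does not reproduce send_otlp_span.cjs. The rule for collapsing the array to one string (prefer "length" when present, otherwise take the first reason) is an assumption chosen so that a plain string filter on the scalar attribute would catch truncation.

```python
def scalar_finish_reason(reasons):
    """Collapse a finish_reasons array into one string, or None if empty.

    Assumed rule: "length" wins when present so truncation is never hidden
    behind another reason; otherwise the first reason is used.
    """
    if not reasons:
        return None
    if "length" in reasons:
        return "length"
    return str(reasons[0])


def with_scalar_mirror(attrs):
    """Return a copy of span attributes with the scalar mirror added.

    The original array attribute is left untouched.
    """
    out = dict(attrs)
    scalar = scalar_finish_reason(attrs.get("gen_ai.response.finish_reasons"))
    if scalar is not None:
        out["gen_ai.response.finish_reason"] = scalar
    return out


attrs = {
    "gen_ai.operation.name": "chat",
    "gen_ai.response.finish_reasons": ["stop", "length"],
}
print(with_scalar_mirror(attrs)["gen_ai.response.finish_reason"])
```

Keeping the array alongside the scalar means nothing is lost if array indexing is later fixed on the Sentry side.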

Notes

Datasets

  • spans: healthy, 27,271 spans/24h (gen_ai 16,132 · http.server 7,605 · default 3,508).
  • errors: empty (no results, 24h) — explicit finding.
  • logs: empty (no results, 24h) — explicit finding. All reliability signal comes from spans.

Confirmed instrumentation/queryability gaps (evidence)

  • span.status = null on all 27,271 spans → OTLP status.code not mapped to Sentry span.status. Outcomes derived from gh-aw.run.status instead.
  • gen_ai.response.finish_reasons:length and has:gen_ai.response.finish_reasons both return 0 rows despite 4,300 agent chat spans → array attr not indexed/queryable in EAP.
  • has:gh-aw.turns = 0 → turns not queryable.
  • release and service.version = null on all spans (resource attrs not surfaced as span fields); gh-aw.cli.version present (4,528).
  • gh-aw.otlp.export_errors:>0 = 0 → exporter/auth healthy.
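
Until these gaps are closed, a dashboard or script can fall back to the fields that are populated. The sketch below is one possible way to do that, assuming spans are available as plain dicts keyed by attribute name. The fallback order and the "unknown" default are assumptions, and the sample version string is made up.

```python
def triage_fields(span):
    """Derive outcome and release from whichever fields are populated.

    Note that span.status and gh-aw.run.status use different vocabularies,
    so the outcome stays a free-form string here.
    """
    outcome = span.get("span.status") or span.get("gh-aw.run.status") or "unknown"
    release = (
        span.get("release")
        or span.get("service.version")
        or span.get("gh-aw.cli.version")
        or "unknown"
    )
    return {"outcome": outcome, "release": release}


# Sample span shaped like today's data: standard fields null, gh-aw.* present.
sample = {
    "span.status": None,
    "release": None,
    "service.version": None,
    "gh-aw.run.status": "failure",
    "gh-aw.cli.version": "v0.0.0-example",  # hypothetical value
}
print(triage_fields(sample))
```

Preferring the standard fields first means the same helper keeps working once the resource→span mapping is fixed.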

Method / limitations

  • search_events and get_trace_details are not present in this Sentry MCP build; used list_events (Sentry query syntax) with client-side aggregation per the skill's fallback. Trace continuity validated by listing all spans for trace:<id>.
  • Failure counts use count_unique(trace) (not raw span count) because gh-aw.run.status is stamped on every job-conclusion span of a run; raw span counts over-state failures (e.g. Contribution Check = 20 spans but 4 traces).
  • Success (482) and failure (24) trace buckets can overlap when a run has both a failed and a succeeded job; the 24 is the count of traces containing ≥1 failed job conclusion.
  • No historical baseline was queried, so regression-vs-normal is inferred from distribution shape (thin spread) rather than a trend comparison.
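
The difference between raw span counts and distinct-trace counts, and the overlap between the success and failure buckets, can be illustrated with a small client-side aggregation. This is a minimal sketch with made-up sample rows; the `trace` and `job` keys are assumed names for illustration, not the exact fields returned by list_events.

```python
from collections import defaultdict


def traces_by_status(spans):
    """Group distinct trace ids by gh-aw.run.status value."""
    buckets = defaultdict(set)
    for span in spans:
        status = span.get("gh-aw.run.status")
        if status:
            buckets[status].add(span["trace"])
    return buckets


# t1: every job of the run failed; t2: mixed outcome; t3: clean run.
spans = [
    {"trace": "t1", "job": "agent", "gh-aw.run.status": "failure"},
    {"trace": "t1", "job": "safe_outputs", "gh-aw.run.status": "failure"},
    {"trace": "t1", "job": "detection", "gh-aw.run.status": "failure"},
    {"trace": "t1", "job": "conclusion", "gh-aw.run.status": "failure"},
    {"trace": "t2", "job": "agent", "gh-aw.run.status": "success"},
    {"trace": "t2", "job": "conclusion", "gh-aw.run.status": "failure"},
    {"trace": "t3", "job": "agent", "gh-aw.run.status": "success"},
]

buckets = traces_by_status(spans)
raw_failed_spans = sum(1 for s in spans if s.get("gh-aw.run.status") == "failure")
failed = buckets["failure"]
succeeded = buckets["success"]

# Raw failed spans, distinct failing traces, traces present in both buckets.
print(raw_failed_spans, len(failed), len(failed & succeeded))
```

In this sample, five failed spans collapse to two failing traces, and one of those traces also appears in the success bucket. This is why the 482 and 24 figures above are not guaranteed to be disjoint.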

Generated by 🚨 Daily Reliability Review · 216.7 AIC · ⌖ 12.2 AIC · ⊞ 5.6K ·

  • expires on Jun 14, 2026, 3:25 PM UTC-08:00