Executive Summary
Telemetry source: Sentry org github, project gh-aw (spans dataset), last 24h. Authenticated as Mara Kiefer.
Overall health is stable, with no acute outages. Across 27,271 spans the fleet recorded 0 timeouts, 0 cancellations, and 0 OTLP export failures. Of the runs that emitted a job conclusion, 24 distinct traces carried a gh-aw.run.status:failure job (≈4.7% of the ~506 conclusion-bearing traces; 482 success). Failures are thinly spread across 14 workflows (8 of them with a single failed run), so this reads as normal background noise rather than a regression spike. However, the telemetry does not capture why the failed runs failed.
The more actionable signal is an instrumentation/queryability gap: core reliability fields needed to triage failures are not queryable in Sentry even though agent spans are flowing — most importantly gen_ai.response.finish_reasons (length-truncation detection) and a per-failure error reason.
⚠️ This is an evidence-first report. Several "failure" attributes are null/absent, so runtime outcomes are reported as confirmed-where-evidenced and inconclusive-with-instrumentation-gap otherwise.
Top Reliability Findings
| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | Contribution Check | Recurring failed runs (top failer) | 4 distinct failing traces; rep trace 9bc13cae..., engine claude-sonnet-4.6, continuity intact | Inspect the 4 runs; no error reason in telemetry (see P4) |
| P1 | PR Sous Chef / Smoke Copilot | Repeated failures | 3 distinct failing traces each | Triage; Smoke Copilot failing = engine smoke regression risk |
| P2 | AI Moderator / Issue Monster / [aw] Failure Investigator (6h) | Multiple failures | 2 distinct failing traces each | Confirm whether shared cause vs independent |
| P3 | (fleet-wide) | Truncation undetectable | gen_ai.response.finish_reasons returns 0 results despite 4,300 gen_ai.operation.name:chat spans | Fix emit/indexing of finish_reasons array attr |
| P3 | DataFlow PR & Discussion Dataset Builder | Latency outlier | gen_ai span 21,609,288 ms (~6.0h), trace 1eddb4eb..., run.status success | Confirm real vs whole-agent span mislabel |
| P4 | (fleet-wide) | Failed runs lack a captured cause | gh-aw.error_count:>0 = 0; no gh-aw.error.messages on the 24 failing traces | Emit failure reason on conclusion spans |
| P5 | (fleet-wide) | span.status / release null | both null across all 27,271 spans | Use gh-aw.run.status + gh-aw.cli.version instead |
Failure distribution (distinct failing traces): Contribution Check 4 · PR Sous Chef 3 · Smoke Copilot 3 · AI Moderator 2 · Issue Monster 2 · [aw] Failure Investigator (6h) 2 · Smoke Codex 1 · Smoke Claude 1 · Auto-Triage Issues 1 · Daily Ambient Context Optimizer 1 · Daily CLI Tools Exploratory Tester 1 · Smoke Copilot - AOAI (Entra) 1 · Constraint Solving — Problem of the Day 1 · Daily Reliability Review 1.
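The headline numbers can be re-derived from the distribution above. The following is a minimal sketch, not part of the report's tooling: the counts are copied in the order listed, the total of 506 is the approximate figure from the executive summary, and the function name `summarizeFailures` is hypothetical.

```javascript
// Failing-trace counts per workflow, in the order listed in the distribution above.
const failingTraceCounts = [4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1];

// Approximate number of conclusion-bearing traces ("~506" in the executive summary).
const conclusionBearingTraces = 506;

// Hypothetical helper: summarizes how failures are spread across workflows.
function summarizeFailures(counts, totalTraces) {
  const failed = counts.reduce((sum, n) => sum + n, 0);
  return {
    workflows: counts.length,
    failed,
    // Workflows with exactly one failing trace.
    singletons: counts.filter((n) => n === 1).length,
    // Failure rate as a percentage, rounded to one decimal place.
    ratePct: Number(((failed / totalTraces) * 100).toFixed(1)),
    // Share of all failures owned by the worst workflow.
    topSharePct: Number(((Math.max(...counts) / failed) * 100).toFixed(1)),
  };
}

const failureSummary = summarizeFailures(failingTraceCounts, conclusionBearingTraces);
console.log(failureSummary);
```

A low top share combined with many single-failure workflows is what the report means by a thin spread; a regression in one workflow would instead concentrate the count in a single entry.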
Representative Traces
Operational failure — Contribution Check (9bc13cae49bf63826bffce421af11f2c)
- Continuity intact: one trace = one run; the failure status appears on the agent, safe_outputs, detection, and conclusion job-conclusion spans of this run (this is expected multi-job-per-run behavior, not a duplicate failure).
- Engine: claude-sonnet-4.6; longest agent gen_ai span ≈ 434,154 ms (≈7.2 min). No gh-aw.error.messages / finish_reasons attached → cause not recoverable from telemetry.
Latency outlier — DataFlow PR & Discussion Dataset Builder (1eddb4eb20b7e512ea172cc70238ec69)
- Two gen_ai-op spans at 21,609,288 ms and 21,602,759 ms (~6.0h), run.status:success, no gen_ai.request.model → this points to whole-agent-execution spans mislabeled under span.op:gen_ai, not single LLM calls.
- Context: gen_ai span avg 28.6s, p95 94.8s (n=16,146). Other long agent spans: "Daily Agent of the Day Blog Writer" ~47 min (a1ac970b...), "Daily Security Observability Report" ~46 min (c6945a28...); all success.
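The mislabel hypothesis above rests on two signals: a very long duration and a missing gen_ai.request.model. A minimal sketch of that check follows. The span shape (`op`, `durationMs`, `attributes`), the function name, and the 10-minute cutoff are all assumptions made for illustration; only the two 6-hour durations come from the report, and the third sample span is invented.

```javascript
// Hypothetical cutoff; tune it against real span data before relying on it.
const WHOLE_AGENT_THRESHOLD_MS = 10 * 60 * 1000;

// Flags gen_ai spans that are long AND carry no model attribute.
// Assumes a simplified span shape: { trace, op, durationMs, attributes }.
function isLikelyWholeAgentSpan(span, thresholdMs = WHOLE_AGENT_THRESHOLD_MS) {
  const model = (span.attributes ?? {})["gen_ai.request.model"];
  const hasModel = typeof model === "string" && model.length > 0;
  return span.op === "gen_ai" && !hasModel && span.durationMs > thresholdMs;
}

const sampleSpans = [
  // The two outlier spans from trace 1eddb4eb... (durations from the report).
  { trace: "1eddb4eb...", op: "gen_ai", durationMs: 21609288, attributes: {} },
  { trace: "1eddb4eb...", op: "gen_ai", durationMs: 21602759, attributes: {} },
  // Invented example of an ordinary chat call with a model attribute.
  {
    trace: "example",
    op: "gen_ai",
    durationMs: 28600,
    attributes: { "gen_ai.request.model": "example-model" },
  },
];

const suspectSpans = sampleSpans.filter((span) => isLikelyWholeAgentSpan(span));
console.log(suspectSpans.map((span) => `${span.trace} ${span.durationMs} ms`));
```

Requiring both signals keeps slow but genuine single LLM calls out of the suspect list, since those would normally carry a model attribute.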
Recommendations
- Make truncation queryable (smallest, highest-value fix). gen_ai.response.finish_reasons is emitted as an array attr at actions/setup/js/send_otlp_span.cjs:2139 (buildArrayAttr), but Sentry EAP returns 0 rows for both has:gen_ai.response.finish_reasons and finish_reasons:length despite 4,300 chat spans. Emit a scalar mirror (e.g. gen_ai.response.finish_reason as a string) so that a :length query can detect truncation and runaway generations.
- Attach a failure reason to conclusion spans. The 24 failing runs carry gh-aw.run.status:failure, but gh-aw.error_count:>0 returns 0 rows and none of them has gh-aw.error.messages. When the run fails via GitHub job conclusion (not output errors), also stamp gh-aw.run.status_message / gh-aw.failure.categories on the conclusion span so failures are self-explaining.
- Standardize triage fields in the dashboard. span.status, release, and service.version are null fleet-wide; use gh-aw.run.status for outcome and gh-aw.cli.version (present on 4,528 spans) for release correlation until the resource→span mapping is fixed.
- Spot-check the engine smoke failures (Smoke Copilot ×3, Smoke Codex ×1, Smoke Claude ×1). A failing smoke test is a cheaper early-warning signal than a failing user workflow.
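The first two recommendations could look roughly like the sketch below. It is not the code in send_otlp_span.cjs: the helpers `stringAttr`, `stringArrayAttr`, `finishReasonAttrs`, and `failureReasonAttrs` are hypothetical, and the real buildArrayAttr may have a different signature. The attribute objects follow the OTLP JSON key/value encoding. Two design choices are assumptions rather than anything the report specifies: the scalar mirror prefers "length" when any choice was truncated, and failure categories are joined into one comma-separated string so they stay queryable as a scalar.

```javascript
// Hypothetical helper: one OTLP string attribute.
function stringAttr(key, value) {
  return { key, value: { stringValue: String(value) } };
}

// Hypothetical helper: one OTLP string-array attribute.
function stringArrayAttr(key, values) {
  return {
    key,
    value: { arrayValue: { values: values.map((v) => ({ stringValue: String(v) })) } },
  };
}

// Emits the existing array attribute plus a scalar mirror.
function finishReasonAttrs(finishReasons) {
  const reasons = (finishReasons ?? []).filter((r) => typeof r === "string" && r.length > 0);
  if (reasons.length === 0) return [];
  // Assumption: surface "length" whenever any choice was truncated.
  const scalar = reasons.includes("length") ? "length" : reasons[0];
  return [
    stringArrayAttr("gen_ai.response.finish_reasons", reasons),
    stringAttr("gen_ai.response.finish_reason", scalar),
  ];
}

// Emits failure-reason attributes for a conclusion span; empty unless the run failed.
function failureReasonAttrs(runStatus, statusMessage, categories) {
  if (runStatus !== "failure") return [];
  const attrs = [];
  if (statusMessage) attrs.push(stringAttr("gh-aw.run.status_message", statusMessage));
  if (categories && categories.length > 0) {
    // Assumption: a joined string, because array attrs are not queryable today.
    attrs.push(stringAttr("gh-aw.failure.categories", categories.join(",")));
  }
  return attrs;
}

const mirrored = finishReasonAttrs(["stop", "length"]);
const stamped = failureReasonAttrs("failure", "job concluded with failure", ["job_conclusion"]);
console.log(JSON.stringify([...mirrored, ...stamped], null, 2));
```

Keeping the array attribute alongside the mirror means nothing changes for consumers that already read it, while the scalar gives Sentry a plain string field to filter on.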
Notes
Datasets
- spans: healthy, 27,271 spans/24h (gen_ai 16,132 · http.server 7,605 · default 3,508).
- errors: empty (no results, 24h); explicit finding.
- logs: empty (no results, 24h); explicit finding. All reliability signal comes from spans.
Confirmed instrumentation/queryability gaps (evidence)
- span.status = null on all 27,271 spans → OTLP status.code not mapped to Sentry span.status. Outcomes derived from gh-aw.run.status instead.
- has:gen_ai.response.finish_reasons and finish_reasons:length = 0 rows despite 4,300 agent chat spans → array attr not indexed/queryable in EAP.
- has:gh-aw.turns = 0 → turns not queryable.
- release and service.version = null on all spans (resource attrs not surfaced as span fields); gh-aw.cli.version present (4,528).
- gh-aw.otlp.export_errors:>0 = 0 → exporter/auth healthy.
Method / limitations
- search_events and get_trace_details are not present in this Sentry MCP build; used list_events (Sentry query syntax) with client-side aggregation per the skill's fallback. Trace continuity validated by listing all spans for trace:<id>.
- Failure counts use count_unique(trace) (not raw span count) because gh-aw.run.status is stamped on every job-conclusion span of a run; raw span counts over-state failures (e.g. Contribution Check = 20 spans but 4 traces).
- Success (482) and failure (24) trace buckets can overlap when a run has both a failed and a succeeded job; the 24 is the count of traces containing ≥1 failed job conclusion.
- No historical baseline was queried, so regression-vs-normal is inferred from distribution shape (thin spread) rather than a trend comparison.
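The difference between raw span counts and distinct-trace counts can be shown with a small client-side aggregation. This is a sketch under assumptions, not the aggregation the report actually ran: the flat row shape (`workflow`, `trace`, `gh-aw.run.status`) and the function name `countFailuresByWorkflow` are hypothetical, and the sample rows are synthetic. The report only gives the totals (20 spans, 4 traces); the even split of 5 spans per trace is invented for the example.

```javascript
// Counts failing spans and distinct failing traces per workflow.
// Assumes flat rows: { workflow, trace, "gh-aw.run.status" }.
function countFailuresByWorkflow(rows) {
  const byWorkflow = new Map();
  for (const row of rows) {
    if (row["gh-aw.run.status"] !== "failure") continue;
    const entry = byWorkflow.get(row.workflow) ?? { spans: 0, traces: new Set() };
    entry.spans += 1;
    entry.traces.add(row.trace); // a Set de-duplicates repeated trace IDs
    byWorkflow.set(row.workflow, entry);
  }
  return [...byWorkflow].map(([workflow, entry]) => ({
    workflow,
    spans: entry.spans,
    traces: entry.traces.size,
  }));
}

// Synthetic rows: 4 traces with 5 failing job-conclusion spans each.
const syntheticRows = [];
for (let t = 1; t <= 4; t++) {
  for (let s = 0; s < 5; s++) {
    syntheticRows.push({
      workflow: "Contribution Check",
      trace: `synthetic-trace-${t}`,
      "gh-aw.run.status": "failure",
    });
  }
}
// One successful span that must not be counted.
syntheticRows.push({
  workflow: "Contribution Check",
  trace: "synthetic-trace-5",
  "gh-aw.run.status": "success",
});

const failureCounts = countFailuresByWorkflow(syntheticRows);
console.log(failureCounts);
```

Reading the `traces` column instead of `spans` is the client-side equivalent of count_unique(trace), which is why the report lists 4 failures for this workflow rather than 20.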
Generated by 🚨 Daily Reliability Review · 216.7 AIC · ⌖ 12.2 AIC · ⊞ 5.6K · ◷