[reliability] Daily Reliability Review - 2026-06-12 #38970

### Executive Summary

Telemetry source: **Sentry** org `github`, project `gh-aw` (spans dataset), last 24h. Authenticated as Mara Kiefer.

Overall health is **stable, with no acute outages**. Across **27,271 spans** the agent fleet ran cleanly: **0 timeouts, 0 cancellations, and 0 OTLP export failures**. Of the traces that emitted a job conclusion, **24 distinct traces carried a `gh-aw.run.status:failure` job** (≈4.7% of the ~506 conclusion-bearing traces; 482 succeeded). Failures are **thinly spread across 14 workflows** (8 of them with a single failed run), so this reads as normal background noise rather than a regression spike, although no historical baseline was queried to confirm that. The failed runs also **do not capture *why* they failed** in telemetry.

The more actionable signal is an **instrumentation/queryability gap**: core reliability fields needed to triage failures are not queryable in Sentry even though agent spans are flowing — most importantly `gen_ai.response.finish_reasons` (length-truncation detection) and a per-failure error reason.

> ⚠️ This is an **evidence-first** report. Several "failure" attributes are **null/absent**, so runtime outcomes are reported as *confirmed-where-evidenced* and *inconclusive-with-instrumentation-gap* otherwise.

### Top Reliability Findings

| Priority | Workflow | Problem | Evidence | Next Action |
| --- | --- | --- | --- | --- |
| P1 | Contribution Check | Recurring failed runs (top failing workflow) | 4 distinct failing traces; representative trace `9bc13cae...`, engine `claude-sonnet-4.6`, continuity intact | Inspect the 4 runs; no error reason in telemetry (see P4) |
| P1 | PR Sous Chef / Smoke Copilot | Repeated failures | 3 distinct failing traces each | Triage; a failing Smoke Copilot signals engine smoke regression risk |
| P2 | AI Moderator / Issue Monster / [aw] Failure Investigator (6h) | Multiple failures | 2 distinct failing traces each | Determine whether the failures share a cause or are independent |
| P3 | (fleet-wide) | Truncation **undetectable** | `gen_ai.response.finish_reasons` returns **0** results despite **4,300** `gen_ai.operation.name:chat` spans | Fix emission/indexing of the `finish_reasons` array attr |
| P3 | DataFlow PR & Discussion Dataset Builder | Latency outlier | gen_ai span **21,609,288 ms (~6.0h)**, trace `1eddb4eb...`, run.status **success** | Confirm whether this is a real LLM call or a mislabeled whole-agent span |
| P4 | (fleet-wide) | Failed runs lack a captured cause | `gh-aw.error_count:>0` = **0**; no `gh-aw.error.messages` on the 24 failing traces | Emit failure reason on conclusion spans |
| P5 | (fleet-wide) | `span.status` / `release` null | both null across all 27,271 spans | Use `gh-aw.run.status` + `gh-aw.cli.version` instead |

**Failure distribution (distinct failing traces):** Contribution Check 4 · PR Sous Chef 3 · Smoke Copilot 3 · AI Moderator 2 · Issue Monster 2 · [aw] Failure Investigator (6h) 2 · Smoke Codex 1 · Smoke Claude 1 · Auto-Triage Issues 1 · Daily Ambient Context Optimizer 1 · Daily CLI Tools Exploratory Tester 1 · Smoke Copilot - AOAI (Entra) 1 · Constraint Solving — Problem of the Day 1 · Daily Reliability Review 1.
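
The headline figures can be re-derived from this distribution. The snippet below is a small arithmetic check rather than part of the review tooling; the variable names are illustrative, and it treats the success and failure buckets as disjoint, which is only approximately true.

```javascript
// Distinct failing traces per workflow, copied from the distribution above.
const failingTraces = {
  "Contribution Check": 4,
  "PR Sous Chef": 3,
  "Smoke Copilot": 3,
  "AI Moderator": 2,
  "Issue Monster": 2,
  "[aw] Failure Investigator (6h)": 2,
  "Smoke Codex": 1,
  "Smoke Claude": 1,
  "Auto-Triage Issues": 1,
  "Daily Ambient Context Optimizer": 1,
  "Daily CLI Tools Exploratory Tester": 1,
  "Smoke Copilot - AOAI (Entra)": 1,
  "Constraint Solving — Problem of the Day": 1,
  "Daily Reliability Review": 1,
};
const successTraces = 482; // from the Executive Summary

const counts = Object.values(failingTraces);
const totalFailing = counts.reduce((sum, n) => sum + n, 0);
const singleFailure = counts.filter((n) => n === 1).length;
// Assumes the two buckets do not overlap (an approximation).
const failureRate = totalFailing / (totalFailing + successTraces);

console.log(
  totalFailing,
  counts.length,
  singleFailure,
  (failureRate * 100).toFixed(1) + "%"
);
// prints: 24 14 8 4.7%
```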

### Representative Traces
<details>
<summary>View representative traces</summary>

**Operational failure — Contribution Check** (`9bc13cae49bf63826bffce421af11f2c`)
- [View trace](https://github.sentry.io/explore/traces/trace/9bc13cae49bf63826bffce421af11f2c)
- Continuity **intact**: one trace = one run; the failure status appears on the `agent`, `safe_outputs`, `detection`, and `conclusion` job-conclusion spans of this run (this is expected multi-job-per-run behavior, not a duplicate-failure).
- Engine: `claude-sonnet-4.6`; longest agent `gen_ai` span ≈ 434,154 ms (≈7.2 min). No `gh-aw.error.messages` / `finish_reasons` attached → cause not recoverable from telemetry.

**Latency outlier — DataFlow PR & Discussion Dataset Builder** (`1eddb4eb20b7e512ea172cc70238ec69`)
- [View trace](https://github.sentry.io/explore/traces/trace/1eddb4eb20b7e512ea172cc70238ec69)
- Two `gen_ai`-op spans at **21,609,288 ms** and **21,602,759 ms** (~6.0h), `run.status:success`, **no `gen_ai.request.model`** → these are most likely whole-agent-execution spans mislabeled under `span.op:gen_ai` rather than single LLM calls (still to be confirmed, per the findings table).
- Context: gen_ai span **avg 28.6s, p95 94.8s** (n=16,146). Other long agent spans: "Daily Agent of the Day Blog Writer" ~47 min (`a1ac970b...`), "Daily Security Observability Report" ~46 min (`c6945a28...`) — all `success`.
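
The reasoning behind the mislabel hypothesis can be written as a simple heuristic. This is a sketch under assumptions: the row keys (`span.op`, `span.duration`, `gen_ai.request.model`) mirror the field names used in this report, `isLikelyWholeAgentSpan` is a hypothetical helper, and the 10× p95 cutoff is an arbitrary choice rather than a measured threshold.

```javascript
// Heuristic sketch: a "gen_ai" span with no request model and a duration far
// beyond the fleet p95 is more likely a whole-agent wrapper than one LLM call.
// OUTLIER_FACTOR is an arbitrary assumption, not a value from the report.
const OUTLIER_FACTOR = 10;

function isLikelyWholeAgentSpan(span, p95Ms) {
  const model = span["gen_ai.request.model"];
  const hasModel = typeof model === "string" && model.length > 0;
  return (
    span["span.op"] === "gen_ai" &&
    !hasModel &&
    span["span.duration"] > OUTLIER_FACTOR * p95Ms
  );
}

const P95_MS = 94800; // gen_ai p95 of 94.8s quoted above

// The ~6h span from trace 1eddb4eb..., as described in this section.
const outlier = { "span.op": "gen_ai", "span.duration": 21609288 };

// Illustrative row only: an ordinary LLM call with a model attribute set.
const ordinary = {
  "span.op": "gen_ai",
  "span.duration": 434154,
  "gen_ai.request.model": "claude-sonnet-4.6",
};

console.log(isLikelyWholeAgentSpan(outlier, P95_MS)); // true
console.log(isLikelyWholeAgentSpan(ordinary, P95_MS)); // false
```

A flag from this check is only a prompt to inspect the trace; it does not prove the span is mislabeled.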

</details>

### Recommendations

1. **Make truncation queryable (smallest, highest-value fix).** `gen_ai.response.finish_reasons` is emitted as an array attr at `actions/setup/js/send_otlp_span.cjs:2139` (`buildArrayAttr`), but Sentry EAP returns **0** rows for both `has:gen_ai.response.finish_reasons` and `finish_reasons:length` despite 4,300 `chat` spans. Emit a scalar mirror (e.g. `gen_ai.response.finish_reason` as a string) so `:length` truncation/runaway detection actually works.
2. **Attach a failure reason to conclusion spans.** The 24 failing runs carry `gh-aw.run.status:failure`, but `gh-aw.error_count:>0` matches 0 spans and none carries `gh-aw.error.messages`. When a run fails via its GitHub job conclusion (rather than output errors), also stamp `gh-aw.run.status_message` / `gh-aw.failure.categories` on the conclusion span so failures are self-explaining.
3. **Standardize triage fields in the dashboard.** `span.status`, `release`, and `service.version` are null fleet-wide; use `gh-aw.run.status` for outcome and `gh-aw.cli.version` (present on 4,528 spans) for release correlation until the resource→span mapping is fixed.
4. **Spot-check the engine smoke failures** (Smoke Copilot ×3, Smoke Codex ×1, Smoke Claude ×1): a failing smoke test is a cheaper early-warning signal than a failing user workflow.
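
Recommendations 1 and 2 could look roughly like the sketch below. It is not the code in `send_otlp_span.cjs`: both helper names are hypothetical, the attribute objects follow the standard OTLP/JSON shape (`{ key, value: { stringValue } }`), and joining categories into one comma-separated string is an assumption made so the value stays a scalar.

```javascript
// Hypothetical helper: derive a scalar string attribute from the
// finish_reasons array so a plain `key:value` search can match it.
// The attribute name follows the report's suggestion; it is not an
// existing field.
function buildFinishReasonMirror(finishReasons) {
  if (!Array.isArray(finishReasons) || finishReasons.length === 0) {
    return null; // nothing to mirror
  }
  const reasons = finishReasons.map(String);
  // Prefer "length" when present, since truncation is what we want to find.
  const primary = reasons.includes("length") ? "length" : reasons[0];
  return {
    key: "gen_ai.response.finish_reason",
    value: { stringValue: primary },
  };
}

// Hypothetical helper: failure context for a conclusion span.
// Returns no attributes unless the run actually failed.
function buildFailureAttrs(runStatus, statusMessage, categories) {
  if (runStatus !== "failure") return [];
  const attrs = [];
  if (statusMessage) {
    attrs.push({
      key: "gh-aw.run.status_message",
      value: { stringValue: String(statusMessage) },
    });
  }
  if (Array.isArray(categories) && categories.length > 0) {
    attrs.push({
      key: "gh-aw.failure.categories",
      // Assumption: a comma-joined scalar, to avoid another array attr.
      value: { stringValue: categories.join(",") },
    });
  }
  return attrs;
}

console.log(buildFinishReasonMirror(["stop", "length"]).value.stringValue); // length
console.log(buildFailureAttrs("success", "ignored", ["x"]).length); // 0
```

The array attribute would still be emitted alongside the mirror; the scalar only exists so that a search for truncated responses returns rows.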

### Notes
<details>
<summary>View notes — telemetry gaps & method</summary>

**Datasets**
- `spans`: healthy, 27,271 spans/24h (gen_ai 16,132 · http.server 7,605 · default 3,508).
- `errors`: **empty** (no results, 24h) — explicit finding.
- `logs`: **empty** (no results, 24h) — explicit finding. All reliability signal comes from spans.

**Confirmed instrumentation/queryability gaps (evidence)**
- `span.status` = **null** on all 27,271 spans → OTLP `status.code` not mapped to Sentry `span.status`. Outcomes derived from `gh-aw.run.status` instead.
- Value filters on `gen_ai.response.finish_reasons` and `has:gen_ai.response.finish_reasons` both return **0** rows despite 4,300 agent `chat` spans → array attr not indexed/queryable in EAP.
- `has:gh-aw.turns` = **0** → turns not queryable.
- `release` and `service.version` = **null** on all spans (resource attrs not surfaced as span fields); `gh-aw.cli.version` present (4,528).
- `gh-aw.otlp.export_errors:>0` = **0** → exporter/auth healthy.
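
Until the null fields are mapped, a dashboard or script can fall back to the `gh-aw.*` attributes. The following is a minimal sketch of that fallback; `resolveTriageFields` is a hypothetical helper, and the flat row shape keyed by field name is an assumption about how query results are represented.

```javascript
// Hypothetical helper: pick the first populated field for outcome and release.
// `??` falls through on both null and undefined, so it covers fields that are
// null as well as fields missing from the row entirely.
function resolveTriageFields(span) {
  const outcome =
    span["span.status"] ?? span["gh-aw.run.status"] ?? "unknown";
  const release =
    span["release"] ??
    span["service.version"] ??
    span["gh-aw.cli.version"] ??
    "unknown";
  return { outcome, release };
}

// Illustrative row shaped like today's data: standard fields null,
// gh-aw attributes present. The version string is made up.
const row = {
  "span.status": null,
  release: null,
  "service.version": null,
  "gh-aw.run.status": "failure",
  "gh-aw.cli.version": "0.0.0-example",
};

console.log(resolveTriageFields(row));
// { outcome: 'failure', release: '0.0.0-example' }
```

Preferring `span.status` and `release` when they are populated means the same helper keeps working once the OTLP mapping is fixed.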

**Method / limitations**
- `search_events` and `get_trace_details` are **not present** in this Sentry MCP build; used `list_events` (Sentry query syntax) with client-side aggregation per the skill's fallback. Trace continuity validated by listing all spans for `trace:<id>`.
- Failure counts use `count_unique(trace)` (not raw span count) because `gh-aw.run.status` is stamped on every job-conclusion span of a run; raw span counts over-state failures (e.g. Contribution Check = 20 spans but 4 traces).
- Success (482) and failure (24) trace buckets can overlap when a run has both a failed and a succeeded job; the 24 is the count of traces containing ≥1 failed job conclusion.
- No historical baseline was queried, so regression-vs-normal is inferred from distribution shape (thin spread) rather than a trend comparison.
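
The client-side aggregation described above amounts to counting distinct trace IDs per workflow instead of counting spans. Here is a minimal sketch with synthetic rows; the row keys (`trace`, `workflow`, `job`) and the helper name are illustrative and do not reflect the actual `list_events` response shape.

```javascript
// Count distinct failing traces per workflow rather than failing spans.
function countFailingTraces(spans) {
  const tracesByWorkflow = new Map(); // workflow -> Set of trace ids
  for (const span of spans) {
    if (span["gh-aw.run.status"] !== "failure") continue;
    if (!tracesByWorkflow.has(span.workflow)) {
      tracesByWorkflow.set(span.workflow, new Set());
    }
    tracesByWorkflow.get(span.workflow).add(span.trace);
  }
  return Object.fromEntries(
    [...tracesByWorkflow].map(([workflow, traces]) => [workflow, traces.size])
  );
}

// Synthetic data: two failed runs, each stamping the status on four job spans,
// plus one successful run.
const jobs = ["agent", "safe_outputs", "detection", "conclusion"];
const spans = ["trace-a", "trace-b"].flatMap((trace) =>
  jobs.map((job) => ({
    trace,
    job,
    workflow: "Example Workflow",
    "gh-aw.run.status": "failure",
  }))
);
spans.push({
  trace: "trace-c",
  job: "agent",
  workflow: "Example Workflow",
  "gh-aw.run.status": "success",
});

const rawFailureSpans = spans.filter(
  (s) => s["gh-aw.run.status"] === "failure"
).length;

console.log(rawFailureSpans); // 8
console.log(countFailingTraces(spans)); // { 'Example Workflow': 2 }
```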

**References:**
- [§27448503711](https://github.com/github/gh-aw/actions/runs/27448503711) (this review run)
- Sentry spans explorer: https://github.sentry.io/explore/traces/?project=4511347087179777&statsPeriod=24h

</details>


> Generated by [🚨 Daily Reliability Review](https://github.com/github/gh-aw/actions/runs/27448503711) · 216.7 AIC · ⌖ 12.2 AIC · ⊞ 5.6K · [◷](https://github.com/search?q=repo%3Agithub%2Fgh-aw+is%3Aissue+%22gh-aw-workflow-call-id%3A+github%2Fgh-aw%2Fdaily-reliability-review%22&type=issues)
> - [x] expires on Jun 14, 2026, 3:25 PM UTC-08:00
