[aw-failures] [aw] Failure Investigation Report — 6h window (2026-06-17 19:34 UTC) #39883

@github-actions

[aw] Failure Investigation — 6h window ending 2026-06-17 19:34 UTC

Executive summary

  1. Scope: 29 failed/cancelled runs in github/gh-aw over the last 6h. 13 were cancelled (concurrency/guard, not defects); 16 were genuine failure outcomes across 6 signature clusters.
  2. P0 (provider/config): Codex engine cannot reach its model — every Codex run 404s on gpt-5-codex-alpha-2025-11-07. Tracked per-workflow but root cause not escalated.
  3. P1 (product bug, new/untracked): four asset-producing workflows succeed at the agent step but the upload_assets job fails because the declared PNG asset files are never staged. Filed as sub-issue below.
  4. P1 (provider): Copilot-CLI BYOK proxy rejects claude-sonnet-4.6 with 400 model not supported; partially covered by the existing cascade rollup.
  5. No open agentic-workflows issue qualified for closure — all 20 are <6h old and reflect still-active failure modes; none have fresh evidence of being fixed.

Failure cluster table

| Cluster | Class | Runs | Representative | Comparator | Existing coverage |
|---|---|---|---|---|---|
| C1 codex-model-404 | model_resolution_404 (P0) | 2 | §27713303874 | §27703903782 | #39878, #39844 (per-workflow) |
| C2 phantom-asset | missing-staged-file (P1) | 4 | §27713375907 | §27705239494 | none → sub-issue |
| C3 copilot-byok | model-not-supported 400 / denials (P1) | 8 | §27703869930 | §27702793798 | #39850 #39851 #39848 #39853 #39852 |
| C4 claude-cli | engine failure (P2) | 1 | §27703868296 | | (in cascade #39852) |
| C5 checkout-infra | checkout step (P2) | 1 | §27707338407 | | #39856 |
| C6 super-linter | linter step (P2) | 1 | §27698271426 | | none (low impact) |
| cancelled | not a defect | 13 | Smoke CI ×8, AI Moderator ×4, Auto-Triage ×1 | | concurrency/guard cancellations |

Evidence

C1 — Codex model 404 (P0)
  • Dominant error: 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07 at (172.30.0.30/redacted)
  • The Codex engine resolves alias gpt-5-codex to backend id gpt-5-codex-alpha-2025-11-07, which no longer exists on the proxy. The Unknown model gpt-5-codex ... fallback model metadata path swaps metadata only — the request still targets the missing backend, so all 5 reconnect retries 404 and the turn fails with 0 tool calls.
  • Audit posture: read-only, turns 11→0 vs baseline (agent never started real work).
  • Impact: every Codex-engine workflow (Daily Cache Strategy Analyzer, Smoke Codex). Auto-filed at [aw] Daily Cache Strategy Analyzer failed #39878 / [aw] Smoke Codex produced no safe outputs #39844 as symptoms, but the shared root cause (removed alpha model id) is not escalated.
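
The difference between a metadata-only fallback and one that also re-targets the request can be sketched as follows. This is a minimal illustration under stated assumptions, not the Codex engine's code: `ALIASES`, `LIVE_MODELS`, `FALLBACK_MODEL`, `Resolution`, and both resolver functions are hypothetical names, and `example-live-model` is a placeholder rather than a real model id.

```python
from dataclasses import dataclass

# Hypothetical tables for illustration; the real alias map and the
# proxy's model catalogue are not shown in this report.
ALIASES = {"gpt-5-codex": "gpt-5-codex-alpha-2025-11-07"}
LIVE_MODELS = {"example-live-model"}   # backends the proxy still serves (assumed)
FALLBACK_MODEL = "example-live-model"  # placeholder fallback target


@dataclass(frozen=True)
class Resolution:
    request_target: str   # model id actually sent to the proxy
    metadata_source: str  # model id whose metadata is used locally


def resolve_metadata_only(alias: str) -> Resolution:
    """Behaviour described in C1: metadata is swapped, the request target is not."""
    backend = ALIASES.get(alias, alias)
    if backend not in LIVE_MODELS:
        return Resolution(request_target=backend, metadata_source=FALLBACK_MODEL)
    return Resolution(backend, backend)


def resolve_retargeting(alias: str) -> Resolution:
    """Alternative: a missing backend re-targets the request as well."""
    backend = ALIASES.get(alias, alias)
    if backend not in LIVE_MODELS:
        return Resolution(FALLBACK_MODEL, FALLBACK_MODEL)
    return Resolution(backend, backend)
```

With the first resolver the request still goes to the removed backend, so every retry would 404; with the second, the request goes to a model the proxy serves.
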
C2 — Phantom asset (P1, new)
  • upload_assets job fails: ERR_SYSTEM: Asset file not found: /tmp/gh-aw/safeoutputs/assets/quality_score_breakdown.png while the agent job succeeded.
  • The agent emitted upload_asset safe-output items referencing PNGs with sha + byte size + markdown image links (e.g. quality_score_breakdown.png size=178654, historical_trends.png size=422386) that were never written to the staging dir.
  • Path mismatch: agent referenced .../.gh-aw-assets/(file) while the upload pipeline reads /tmp/gh-aw/safeoutputs/assets/; the safe-outputs-assets artifact was never produced (Artifact not found for name: safe-outputs-assets).
  • Affected: §27713375907 Daily Code Metrics, §27705239494 Daily Security Observability Report, §27704205632 Daily Repository Chronicle, §27699677975 Daily Agent of the Day Blog Writer.
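
A pre-upload check along these lines would surface phantom assets before the upload_assets job runs. It is a sketch only: the `missing_assets` helper and the `{"name", "size"}` item shape are assumptions for illustration, not the actual safe-outputs schema; the staging path is the one quoted in the error above.

```python
from pathlib import Path

# Staging directory quoted in the upload_assets error message.
STAGING_DIR = Path("/tmp/gh-aw/safeoutputs/assets")


def missing_assets(declared, staging_dir=STAGING_DIR):
    """Return (name, reason) pairs for declared assets that are not usable.

    `declared` is a list of dicts with hypothetical keys "name" and "size".
    Only the file name is used, so a declaration that points into another
    directory is still looked up in the staging directory.
    """
    problems = []
    for item in declared:
        path = Path(staging_dir) / Path(item["name"]).name
        if not path.is_file():
            problems.append((item["name"], "not staged"))
        elif path.stat().st_size != item["size"]:
            problems.append((item["name"], "size mismatch"))
    return problems
```

An empty result means every declared file exists in the staging directory with the declared byte size; anything else can be reported against the agent job instead of failing later in upload_assets.
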
C3 — Copilot BYOK (P1, heterogeneous)
  • Representative §27703869930: 400 The requested model is not supported (model=claude-sonnet-4.6, isModelNotSupportedError=true), not retried; not auth, not denial-limit (permissionDeniedCount=0).
  • Comparator §27702793798: hasNumerousPermissionDenied (permissionDeniedCount=11).
  • Third mode §27701060020: harness exit code 127 (runtime).
  • Cluster mixes three distinct Copilot-CLI failure modes; mostly covered by cascade rollup [aw] Failure cascade detected #39852 and per-workflow issues.
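
Because C3 mixes three modes, splitting runs by signal before counting them may make the cluster easier to track. The sketch below reuses the field names quoted above (`isModelNotSupportedError`, `hasNumerousPermissionDenied`); the `exitCode` key, the `classify_copilot_failure` function, and the returned labels are hypothetical.

```python
def classify_copilot_failure(run: dict) -> str:
    """Map one run's failure signals to a single mode label.

    Checks are ordered so that an explicit provider rejection wins over
    denial counts, and denial counts win over a bare exit code.
    """
    if run.get("isModelNotSupportedError"):
        return "model-not-supported"
    if run.get("hasNumerousPermissionDenied"):
        return "permission-denials"
    if run.get("exitCode") == 127:  # "exitCode" is an assumed key name
        return "harness-runtime"
    return "unclassified"
```

Applied to the three runs above, this yields one label each for §27703869930, §27702793798, and §27701060020; runs where no signal fires fall through to `unclassified` rather than being folded into an existing mode.
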

Existing issue correlation

  1. Clusters → tracking:
       • C1 ↔ [aw] Daily Cache Strategy Analyzer failed #39878 / [aw] Smoke Codex produced no safe outputs #39844
       • C3 ↔ [aw] Smoke Copilot - AOAI (apikey) failed #39850 / [aw] Smoke Copilot - AOAI (Entra) failed #39851 / [aw] Daily Formal Spec Verifier exceeded tool denial limit #39848 / [aw] Daily SPDD Spec Planner exceeded tool denial limit #39853 / [aw] Failure cascade detected #39852
       • C5 ↔ [aw] Documentation Unbloat failed #39856
  2. Cascade rollup [aw] Failure cascade detected #39852 already groups 10 of today's [aw] * failed issues — consistent with C1/C3 sharing provider-side root causes.
  3. Gaps: C2 (phantom asset) has no tracking coverage → addressed by the sub-issue below. C1 root cause is filed only as per-workflow symptoms; recommend treating [aw] Daily Cache Strategy Analyzer failed #39878 as the canonical P0 and updating the Codex model alias.
  4. Potential duplicates: [aw] Smoke Copilot - AOAI (apikey) failed #39850 / [aw] Smoke Copilot - AOAI (apikey) produced no safe outputs #39861, and [aw] Smoke Copilot - AOAI (Entra) failed #39851 / [aw] Smoke Copilot - AOAI (Entra) produced no safe outputs #39862. In each pair, "failed" and "produced no safe outputs" describe the same Smoke Copilot run (apikey and Entra respectively). These are candidates for consolidation by the owning workflow and were not closed here without confirmation.
  5. Closures: none performed — every open agentic-workflows issue is <6h old and reflects an active failure mode; no fresh evidence of resolution.

Fix roadmap

  1. P0 — Restore the Codex model route. The configured Codex model resolves to gpt-5-codex-alpha-2025-11-07, which 404s on the proxy. Point the alias at a live model (or restore the backend) and make the "unknown model" fallback re-target the request, not just metadata. Canonical tracker: [aw] Daily Cache Strategy Analyzer failed #39878.
  2. P1 — Fix asset staging (sub-issue below). Ensure declared upload_asset files are staged to /tmp/gh-aw/safeoutputs/assets/ before the upload_assets job, and validate file existence at safe-output emission time so the agent cannot declare phantom assets.
  3. P1 — Copilot BYOK model support. Resolve claude-sonnet-4.6 rejection (400 not supported) on the BYOK proxy; covered by cascade [aw] Failure cascade detected #39852.
  4. P2 — Monitor C4 (Claude CLI), C5 (checkout, [aw] Documentation Unbloat failed #39856), C6 (super-linter); single-occurrence, no action this cycle.

Sub-issues created

  • Phantom asset staging failure (C2) — see linked sub-issue.

References: §27713303874 · §27713375907 · §27703869930

Generated by 🔍 [aw] Failure Investigator (6h)

  • expires on Jun 24, 2026, 11:47 AM UTC-08:00


6h-window follow-up — 2026-06-18 08:26 UTC

Scope: 18 failed/cancelled runs. 5 were cancelled Smoke CI runs (concurrency/guard, not defects); 13 were genuine failures across 5 signature clusters. No P0 (no provider-down). No issue closures: all open trackers reflect still-active failure modes.

| Cluster | Class | Runs | Branch | Coverage |
|---|---|---|---|---|
| phantom-asset | upload_assets/Push assets, agent success (P1) | 2 | main | #39885 updated (recurrence) |
| copilot-exit-1 | Execute Copilot CLI exit 1, classifiers false (P1) | 3 | main×2, PR×1 | #39946 updated (recurrence + 429 evidence) |
| safe-outputs-skip | Process Safe Outputs fail on unsatisfiable item (P1) | 1 | main | new sub-issue (LintMonster) |
| smoke-pr-noise | Process Safe Outputs on PR/dev branches (P2) | 5 | PR branches | already noted in #39946 |
| post-step-singleton | agent success, post-agent infra step fail (P2) | 2 | main | monitored (below) |

Closures: none. #39885 and #39946 both recur this window; parent (this issue) and #39790 (token audit) are unrelated/fresh.

New sub-issue: LintMonster — scheduled run fails Process Safe Outputs because the agent emitted update_issue target:triggering (unsatisfiable outside issue context) and the skip is counted as a hard failure. Linked to this report.
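
One possible handling, sketched under assumptions, is to record an unsatisfiable item as a skip and fail the step only on real errors. The `process_items` function, the `KNOWN_TYPES` set, and the item keys `type` and `target` are hypothetical and only mirror the wording of this report; they are not the Process Safe Outputs implementation.

```python
KNOWN_TYPES = {"update_issue", "noop"}  # assumed subset, for illustration only


def process_items(items, issue_number=None):
    """Partition safe-output items into applied, skipped, and errors.

    An item that targets the triggering issue cannot be satisfied when the
    run has no triggering issue (for example a scheduled run), so it is
    recorded as skipped instead of being counted as an error.
    """
    applied, skipped, errors = [], [], []
    for item in items:
        kind = item.get("type")
        if kind not in KNOWN_TYPES:
            errors.append(f"unknown item type: {kind!r}")
        elif item.get("target") == "triggering" and issue_number is None:
            skipped.append(kind)
        else:
            applied.append(kind)
    return {
        "applied": applied,
        "skipped": skipped,
        "errors": errors,
        "exit_code": 1 if errors else 0,  # skips alone do not fail the step
    }
```

For a scheduled run, `process_items([{"type": "update_issue", "target": "triggering"}])` reports one skipped item and exit code 0, while the same item on an issue-triggered run is applied.
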

Monitored singletons (no issue filed — single occurrence, post-agent infra step):

  1. Avenger §27740190790 (main): agent job marked failure at Parse agent logs for step summary, but the agent itself completed and emitted a noop ("No PR created"). Post-processing log-parser step failure; watch for recurrence.
  2. Daily Safe Outputs Git Simulator §27739899787 (main): agent success (emitted noop, state persisted to repo memory), but push_repo_memory/Push repo-memory changes (default) failed — likely a repo-memory branch push race/conflict. Single occurrence.

References: §27735258410 · §27737401463 · §27738455642

Generated by 🔍 [aw] Failure Investigator (6h)
