Skip to content

[aw-failures] P1: Daily BYOK Ollama Test — every request returns transient_bad_request, all 4 attempts fail (6-day 100% outage) #39476

@github-actions

Description

@github-actions

Recommendation

Fix the Ollama BYOK request path — every inference call returns transient_bad_request, taking this test red for 6+ consecutive days. The BYOK Ollama endpoint is rejecting the request shape (or the backend is down), so the agent never produces a turn.

Problem statement

The Daily BYOK Ollama Test agent job fails on every run. The Copilot CLI spawns, issues a model request to the Ollama BYOK backend, and the request immediately fails with Request failed (transient_bad_request). The harness retries with --continue across all 4 attempts; each dies in ~3s. Result: turns=0, exitCode=1, no agent work performed.

Affected workflow and run IDs

  • Workflow: Daily BYOK Ollama Test (.github/workflows/daily-byok-ollama-test.lock.yml)
  • Representative failed run: §27581947650 (2026-06-15)
  • Comparator (prior failed, same signature): §27447521382 (2026-06-12)
  • Chronic: failure on 2026-06-08, 06-09, 06-10, 06-11, 06-12, 06-15 (every scheduled run sampled).

Probable root cause

Requests to the Ollama BYOK backend are rejected before any tokens are generated. Two correlated signals: (1) all 4 agent attempts fail in ~3s with transient_bad_request (a 400-class rejection, not a timeout), and (2) the post-run awf-reflect models probe to (apiproxy/redacted) returns 503after 5 retries. Together these indicate the Ollama model/endpoint behind the BYOK api-proxy (port 10002) is unavailable or the request payload (model name / API shape) is incompatible with the backend. NoteisModelNotSupportedError=falseandisCAPIError400=false`, so the harness does not currently classify this as a model/credit error — it is silently treated as a generic transient failure.

Proposed remediation

  1. Verify the Ollama BYOK backend reachable behind api-proxy:10002 is healthy and serving the configured model; the 503 on /v1/models suggests the upstream Ollama service is down or unconfigured in CI.
  2. If the backend is intentionally external/flaky, classify transient_bad_request + /v1/models 503 as an infrastructure-unavailable condition and emit a non-failing noop/skip rather than a hard failure, so a missing BYOK backend does not generate daily false-failure noise.
  3. Add a pre-flight model-availability check (probe /v1/models before spawning the agent) and short-circuit with a clear diagnostic when the BYOK endpoint is unavailable.

Success criteria / verification

  • A scheduled run either (a) completes with the agent producing ≥1 turn against a healthy Ollama backend, or (b) cleanly skips with a non-failure conclusion and a logged reason when the backend is unavailable.
  • No transient_bad_request-only runs marked failure for 3 consecutive scheduled runs.

Analyzed run IDs: 27581947650, 27447521382. Parent: #29109.
Related to #29109

Generated by 🔍 [aw] Failure Investigator (6h) · 183.4 AIC · ⌖ 12.6 AIC · ⊞ 4.5K ·

  • expires on Jun 22, 2026, 5:50 PM UTC-08:00

Metadata

Metadata

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions