Skip to content

feat(auth): force token refresh on rejected token (401/403), never the browser#1015

Merged
wpfleger96 merged 6 commits into
mainfrom
duncan/oauth-refresh-on-401
Jun 12, 2026
Merged

feat(auth): force token refresh on rejected token (401/403), never the browser#1015
wpfleger96 merged 6 commits into
mainfrom
duncan/oauth-refresh-on-401

Conversation

@wpfleger96

@wpfleger96 wpfleger96 commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Agents were silently dying mid-turn against Databricks. The root cause was a broken token refresh: a cached token the provider rejected was never actually replaced, so the agent re-sent the dead token, got rejected again, and died with no recovery. This PR closes that, plus a related multi-process refresh race. It carries two halves — flagging the first explicitly because it has never been independently reviewed.

Half 1 — multi-process refresh race in bearer() (commit 408d1da5, not previously reviewed)

Multiple buzz-agent processes share one OAuth cache file. When the access token expires they all race to refresh, and Databricks rotates refresh tokens — so the first process to refresh invalidates the old one, and the losers' refresh grants previously fell through to browser_pkce_flow, causing repeated browser auth prompts (a headless hang).

bearer() now re-reads the disk cache before attempting a refresh (picks up a token a sibling already wrote) and again after a refresh failure (catches a sibling that won the race between our check and our failed attempt). The endpoints() discovery call is hoisted so it runs at most once per bearer() invocation. This commit was developed separately and ships here for the first time; it inherits this PR's review.

Half 2 — force refresh on a rejected token, keyed by token identity

A cached token can be rejected by the provider before its local expiry clock says it is stale — clock skew, server-side revocation, or a load-balancer node that never saw it. The original refresh_now() gated the refresh grant on is_expired(), so on a locally-fresh-but-rejected token it returned the same dead token unchanged. The agent re-sent it, was rejected again, and died. This was the actual cause of the silent deaths.

  • TokenSource::refresh_now(&self, rejected: &str) — a new trait method that takes the rejected access token and coalesces on token identity, not the expiry clock. It holds the cache lock for the whole re-check -> refresh -> save sequence. Under the lock: if the cached token no longer equals rejected, a sibling already refreshed -> return it (true coalesce); otherwise it is still the rejected token, so run the refresh-token grant unconditionally. is_expired() is gone from the method entirely. It runs the refresh-token grant only — never browser_pkce_flow — and on a dead refresh token returns a terminal LlmAuth rather than hanging a headless harness. The refresh token is preserved, never nulled.
  • post() status mapping — both 401 and 403 map to LlmAuth and are treated as refreshable. A 403 can mean a revoked/expired token (Databricks and OAuth gateways return 403 for this), indistinguishable from a pure-authorization 403 at the HTTP layer. The identity-coalesce design is what makes retrying a 403 safe: the retry sends a genuinely fresh token instead of re-sending a dead one.
  • post_openai one-shot retry — on the first LlmAuth it calls refresh_now(&rejected) and retries post() once. A second LlmAuth propagates terminally. The guard is a local bool scoped to the single call, so an earlier turn's rejection can never suppress a later turn's legitimate retry, and a persistent rejection is bounded to one grant + one retry. This seam covers all three OpenAI-family paths (/responses, /chat/completions, and the auto-upgrade).
  • StaticTokenSource uses the trait default — refresh_now() returns the static key unchanged, since a static key can't refresh and one harmless retry is fine.

Tests

  • refresh_now_runs_grant_on_unexpired_rejected_token — the proof test for the core bug: an unexpired cached token (expires_at = now+3600) that equals the rejected token forces the grant to run exactly once, the returned bearer differs from the rejected one, no browser, refresh token preserved (rotated). This test fails against the old expiry-clock gate and passes now.
  • refresh_now_coalesces_when_another_caller_already_refreshed — an unexpired cached token that differs from the rejected one short-circuits with zero grants, proving coalesce is identity-keyed, not clock-keyed.
  • refresh_now_without_refresh_token_is_terminal — no refresh token -> terminal LlmAuth, no browser, no grant.
  • post_openai_refreshes_once_per_call_on_401 — one 401 -> exactly one refresh + successful retry; a later call gets its own refresh (guard is per-call).
  • post_openai_persistent_401_propagates_after_one_retry — a token the refresh can't satisfy propagates LlmAuth after exactly one refresh (no loop).
  • post_openai_refreshes_once_on_403 — a 403 on the stale token, 200 on the fresh one: proves a 403 enters the refresh path and recovers after exactly one refresh.
  • post_openai_persistent_403_propagates_after_one_retry — always-403 -> terminal LlmAuth after exactly one retry, bounding a pure-authorization 403; the regression guard against re-narrowing the mapping to 401-only.
  • static_token_source_refresh_now_returns_static_token — default impl returns the key.
  • Plus the two existing bearer() race-fix tests from Half 1.

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 6 commits June 12, 2026 15:39
…resh race

Multiple buzz-agent processes share one OAuth cache file. When the access
token expires, they all race to refresh. Databricks uses rotating refresh
tokens — the first process to refresh invalidates the old token. Remaining
processes fail to refresh and previously fell through to browser_pkce_flow,
causing repeated browser auth prompts.

Add two disk re-reads in bearer():
- Before attempting refresh: picks up a token another process already wrote,
  avoiding a wasted network round-trip that will fail.
- After refresh failure: catches the case where another process won the race
  between our pre-refresh check and the failed refresh attempt.

Also hoists the endpoints() discovery call so it happens at most once per
bearer() invocation, reused by both the refresh and browser flow paths.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
A cached OAuth token can be rejected by the provider (401) before its
local expiry clock says it is stale — clock skew, server-side
revocation, or a load-balancer node that never saw it. Today post()
surfaces that as a terminal LlmAuth with no attempt to recover, so the
agent silently dies mid-turn instead of refreshing.

Add an atomic refresh_now() to the TokenSource trait. The PKCE impl
holds the cache lock for the whole re-check -> refresh -> save sequence
so concurrent 401s from a fanned-out turn coalesce onto a single grant:
the first refreshes, the rest see the fresh token on the re-check and
return it. It runs the refresh-token grant only — never browser_pkce_flow
— and on a dead refresh token returns a terminal LlmAuth rather than
hanging a headless harness on an interactive prompt. The refresh token
is preserved (never nulled), so a future 401 can refresh again.

post_openai retries exactly once on the first LlmAuth, guarded by a
local bool scoped to the call so an earlier turn's 401 can never suppress
a later turn's legitimate retry. The StaticTokenSource default returns
its key unchanged — a static key can't refresh and one harmless retry is
fine.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
refresh_now() gated the refresh behind is_expired(), but a 401 means the
provider rejected a token that still looks locally fresh (skew, revocation,
a node that never saw it). On that dominant case is_expired() returned
false, so refresh_now() returned the rejected token unchanged, post_openai
retried with the byte-identical bearer, got another 401, and the agent died
silently — reintroducing the original browser-hang CRITICAL through the
coalesce gate.

Key coalescing off token identity instead. post_openai passes the rejected
bearer into refresh_now(&rejected); under the lock, if the cached token no
longer equals the rejected one a sibling already refreshed (return it), else
run the refresh-token grant unconditionally. The expiry clock is never
consulted here.

Also scope the one-shot retry to 401: post() now maps 403 to a terminal Llm
error rather than LlmAuth, so a genuine authorization failure (scope, quota,
entitlement) no longer burns a pointless refresh and retry.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
A 403 is indistinguishable from an expired-or-revoked-token 403 at the
HTTP-status layer — Databricks and OAuth gateways return 403 for stale
tokens, not only 401. The prior commit scoped the one-shot retry to 401
and routed 403 to a terminal Llm error, which would silently drop the
stale-token-403 case the agent needs to recover from.

Now safe because refresh_now() forces a grant by token identity: the
retry sends a genuinely fresh token, so a stale-token 403 recovers, while
the per-call one-shot guard bounds a pure-authz 403 to one wasted refresh
before it propagates terminally. No loop, no hang.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The 403-is-refreshable behavior had no direct coverage — the auth stub
only emitted 401, so a regression narrowing the retry back to 401-only
would pass every test. Parameterize the stub's reject status and add a
persistent-403 test asserting exactly one refresh then terminal LlmAuth,
mirroring the 401 propagation test.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The persistent-403 test covers only the failure tail; it never proves a
403 actually recovers when the refreshed token is accepted. Add the
success-path test (stale token 403s, fresh token 200s) so the
stale-token-403 recovery — the reason 403 is refreshable at all — is
pinned, not just the propagation case.

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
@wpfleger96 wpfleger96 changed the title feat(auth): force token refresh on 401, never the browser feat(auth): force token refresh on rejected token (401/403), never the browser Jun 12, 2026
@wpfleger96 wpfleger96 merged commit 5a8cc79 into main Jun 12, 2026
27 checks passed
@wpfleger96 wpfleger96 deleted the duncan/oauth-refresh-on-401 branch June 12, 2026 21:19
tlongwell-block pushed a commit that referenced this pull request Jun 13, 2026
* origin/main: (33 commits)
  fix(desktop): make Windows release compile cleanly (#1029)
  Add production Docker Compose bundle (#985)
  feat(profile): show active turn badges on agent profile panel and popover (#1026)
  chore(release): release version 0.3.20 (#1027)
  fix(release): resolve Windows sidecar path and Linux AppImage updater format (#1024)
  chore(release): release version 0.3.19 (#1014)
  fix(release): ignore prerelease tags in changelog generation (#1021)
  fix: repair main build after cross-PR merge skew (#1020)
  feat(agents): show per-turn duration and prune dead turns within ~25s of host crash (#1017)
  fix(release): replace hermit with native tool setup on Windows job (#1018)
  feat(acp): surface error-class outcomes to the activity feed only, never the channel (#1010)
  fix(desktop): migrate Sprout workspace storage (#1016)
  feat(auth): force token refresh on rejected token (401/403), never the browser (#1015)
  fix(release): mark prerelease versions so they do not become latest (#1013)
  feat(acp): implement systemPrompt with protocol version gating (#981)
  fix(release): update repository name check from block/sprout to block/buzz (#1012)
  feat(release): all-OS desktop builds + universal auto-update manifest (#1011)
  Add relay disconnect UX: friendly errors, reconnect, cached identity (#1004)
  feat(agents): add active turn indicators to Agents Menu (#1005)
  ci: add fork guards to docker, release, and auto-tag workflows (#1007)
  ...

Co-authored-by: npub1t2tgm7d8f995uqvmnm8h88sg3wnpp9a5xysjf6dg3tjmgt3ltulqdp8ehr <5a968df9a7494b4e019b9ecf739e088ba61097b4312124e9a88ae5b42e3f5f3e@sprout-oss.stage.blox.sqprod.co>
Signed-off-by: npub1t2tgm7d8f995uqvmnm8h88sg3wnpp9a5xysjf6dg3tjmgt3ltulqdp8ehr <5a968df9a7494b4e019b9ecf739e088ba61097b4312124e9a88ae5b42e3f5f3e@sprout-oss.stage.blox.sqprod.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant