feat(auth): force token refresh on rejected token (401/403), never the browser by wpfleger96 · Pull Request #1015 · block/buzz

wpfleger96 · 2026-06-12T20:01:57Z

Agents were silently dying mid-turn against Databricks. The root cause was a broken token refresh: a cached token the provider rejected was never actually replaced, so the agent re-sent the dead token, got rejected again, and died with no recovery. This PR closes that, plus a related multi-process refresh race. It carries two halves — flagging the first explicitly because it has never been independently reviewed.

Half 1 — multi-process refresh race in `bearer()` (commit `408d1da5`, not previously reviewed)

Multiple buzz-agent processes share one OAuth cache file. When the access token expires they all race to refresh, and Databricks rotates refresh tokens — so the first process to refresh invalidates the old one, and the losers' refresh grants previously fell through to browser_pkce_flow, causing repeated browser auth prompts (a headless hang).

bearer() now re-reads the disk cache before attempting a refresh (picks up a token a sibling already wrote) and again after a refresh failure (catches a sibling that won the race between our check and our failed attempt). The endpoints() discovery call is hoisted so it runs at most once per bearer() invocation. This commit was developed separately and ships here for the first time; it inherits this PR's review.

Half 2 — force refresh on a rejected token, keyed by token identity

A cached token can be rejected by the provider before its local expiry clock says it is stale — clock skew, server-side revocation, or a load-balancer node that never saw it. The original refresh_now() gated the refresh grant on is_expired(), so on a locally-fresh-but-rejected token it returned the same dead token unchanged. The agent re-sent it, was rejected again, and died. This was the actual cause of the silent deaths.

TokenSource::refresh_now(&self, rejected: &str) — a new trait method that takes the rejected access token and coalesces on token identity, not the expiry clock. It holds the cache lock for the whole re-check -> refresh -> save sequence. Under the lock: if the cached token no longer equals rejected, a sibling already refreshed -> return it (true coalesce); otherwise it is still the rejected token, so run the refresh-token grant unconditionally. is_expired() is gone from the method entirely. It runs the refresh-token grant only — never browser_pkce_flow — and on a dead refresh token returns a terminal LlmAuth rather than hanging a headless harness. The refresh token is preserved, never nulled.
post() status mapping — both 401 and 403 map to LlmAuth and are treated as refreshable. A 403 can mean a revoked/expired token (Databricks and OAuth gateways return 403 for this), indistinguishable from a pure-authorization 403 at the HTTP layer. The identity-coalesce design is what makes retrying a 403 safe: the retry sends a genuinely fresh token instead of re-sending a dead one.
post_openai one-shot retry — on the first LlmAuth it calls refresh_now(&rejected) and retries post() once. A second LlmAuth propagates terminally. The guard is a local bool scoped to the single call, so an earlier turn's rejection can never suppress a later turn's legitimate retry, and a persistent rejection is bounded to one grant + one retry. This seam covers all three OpenAI-family paths (/responses, /chat/completions, and the auto-upgrade).
StaticTokenSource uses the trait default — refresh_now() returns the static key unchanged, since a static key can't refresh and one harmless retry is fine.

Tests

refresh_now_runs_grant_on_unexpired_rejected_token — the proof test for the core bug: an unexpired cached token (expires_at = now+3600) that equals the rejected token forces the grant to run exactly once, the returned bearer differs from the rejected one, no browser, refresh token preserved (rotated). This test fails against the old expiry-clock gate and passes now.
refresh_now_coalesces_when_another_caller_already_refreshed — an unexpired cached token that differs from the rejected one short-circuits with zero grants, proving coalesce is identity-keyed, not clock-keyed.
refresh_now_without_refresh_token_is_terminal — no refresh token -> terminal LlmAuth, no browser, no grant.
post_openai_refreshes_once_per_call_on_401 — one 401 -> exactly one refresh + successful retry; a later call gets its own refresh (guard is per-call).
post_openai_persistent_401_propagates_after_one_retry — a token the refresh can't satisfy propagates LlmAuth after exactly one refresh (no loop).
post_openai_refreshes_once_on_403 — a 403 on the stale token, 200 on the fresh one: proves a 403 enters the refresh path and recovers after exactly one refresh.
post_openai_persistent_403_propagates_after_one_retry — always-403 -> terminal LlmAuth after exactly one retry, bounding a pure-authorization 403; the regression guard against re-narrowing the mapping to 401-only.
static_token_source_refresh_now_returns_static_token — default impl returns the key.
Plus the two existing bearer() race-fix tests from Half 1.

…resh race Multiple buzz-agent processes share one OAuth cache file. When the access token expires, they all race to refresh. Databricks uses rotating refresh tokens — the first process to refresh invalidates the old token. Remaining processes fail to refresh and previously fell through to browser_pkce_flow, causing repeated browser auth prompts. Add two disk re-reads in bearer(): - Before attempting refresh: picks up a token another process already wrote, avoiding a wasted network round-trip that will fail. - After refresh failure: catches the case where another process won the race between our pre-refresh check and the failed refresh attempt. Also hoists the endpoints() discovery call so it happens at most once per bearer() invocation, reused by both the refresh and browser flow paths. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

A cached OAuth token can be rejected by the provider (401) before its local expiry clock says it is stale — clock skew, server-side revocation, or a load-balancer node that never saw it. Today post() surfaces that as a terminal LlmAuth with no attempt to recover, so the agent silently dies mid-turn instead of refreshing. Add an atomic refresh_now() to the TokenSource trait. The PKCE impl holds the cache lock for the whole re-check -> refresh -> save sequence so concurrent 401s from a fanned-out turn coalesce onto a single grant: the first refreshes, the rest see the fresh token on the re-check and return it. It runs the refresh-token grant only — never browser_pkce_flow — and on a dead refresh token returns a terminal LlmAuth rather than hanging a headless harness on an interactive prompt. The refresh token is preserved (never nulled), so a future 401 can refresh again. post_openai retries exactly once on the first LlmAuth, guarded by a local bool scoped to the call so an earlier turn's 401 can never suppress a later turn's legitimate retry. The StaticTokenSource default returns its key unchanged — a static key can't refresh and one harmless retry is fine. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

refresh_now() gated the refresh behind is_expired(), but a 401 means the provider rejected a token that still looks locally fresh (skew, revocation, a node that never saw it). On that dominant case is_expired() returned false, so refresh_now() returned the rejected token unchanged, post_openai retried with the byte-identical bearer, got another 401, and the agent died silently — reintroducing the original browser-hang CRITICAL through the coalesce gate. Key coalescing off token identity instead. post_openai passes the rejected bearer into refresh_now(&rejected); under the lock, if the cached token no longer equals the rejected one a sibling already refreshed (return it), else run the refresh-token grant unconditionally. The expiry clock is never consulted here. Also scope the one-shot retry to 401: post() now maps 403 to a terminal Llm error rather than LlmAuth, so a genuine authorization failure (scope, quota, entitlement) no longer burns a pointless refresh and retry. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

A 403 is indistinguishable from an expired-or-revoked-token 403 at the HTTP-status layer — Databricks and OAuth gateways return 403 for stale tokens, not only 401. The prior commit scoped the one-shot retry to 401 and routed 403 to a terminal Llm error, which would silently drop the stale-token-403 case the agent needs to recover from. Now safe because refresh_now() forces a grant by token identity: the retry sends a genuinely fresh token, so a stale-token 403 recovers, while the per-call one-shot guard bounds a pure-authz 403 to one wasted refresh before it propagates terminally. No loop, no hang. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

The 403-is-refreshable behavior had no direct coverage — the auth stub only emitted 401, so a regression narrowing the retry back to 401-only would pass every test. Parameterize the stub's reject status and add a persistent-403 test asserting exactly one refresh then terminal LlmAuth, mirroring the 401 propagation test. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

The persistent-403 test covers only the failure tail; it never proves a 403 actually recovers when the refreshed token is accepted. Add the success-path test (stale token 403s, fresh token 200s) so the stale-token-403 recovery — the reason 403 is refreshable at all — is pinned, not just the propagation case. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

* origin/main: (33 commits) fix(desktop): make Windows release compile cleanly (#1029) Add production Docker Compose bundle (#985) feat(profile): show active turn badges on agent profile panel and popover (#1026) chore(release): release version 0.3.20 (#1027) fix(release): resolve Windows sidecar path and Linux AppImage updater format (#1024) chore(release): release version 0.3.19 (#1014) fix(release): ignore prerelease tags in changelog generation (#1021) fix: repair main build after cross-PR merge skew (#1020) feat(agents): show per-turn duration and prune dead turns within ~25s of host crash (#1017) fix(release): replace hermit with native tool setup on Windows job (#1018) feat(acp): surface error-class outcomes to the activity feed only, never the channel (#1010) fix(desktop): migrate Sprout workspace storage (#1016) feat(auth): force token refresh on rejected token (401/403), never the browser (#1015) fix(release): mark prerelease versions so they do not become latest (#1013) feat(acp): implement systemPrompt with protocol version gating (#981) fix(release): update repository name check from block/sprout to block/buzz (#1012) feat(release): all-OS desktop builds + universal auto-update manifest (#1011) Add relay disconnect UX: friendly errors, reconnect, cached identity (#1004) feat(agents): add active turn indicators to Agents Menu (#1005) ci: add fork guards to docker, release, and auto-tag workflows (#1007) ... Co-authored-by: npub1t2tgm7d8f995uqvmnm8h88sg3wnpp9a5xysjf6dg3tjmgt3ltulqdp8ehr <5a968df9a7494b4e019b9ecf739e088ba61097b4312124e9a88ae5b42e3f5f3e@sprout-oss.stage.blox.sqprod.co> Signed-off-by: npub1t2tgm7d8f995uqvmnm8h88sg3wnpp9a5xysjf6dg3tjmgt3ltulqdp8ehr <5a968df9a7494b4e019b9ecf739e088ba61097b4312124e9a88ae5b42e3f5f3e@sprout-oss.stage.blox.sqprod.co>

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 6 commits June 12, 2026 15:39

wpfleger96 changed the title ~~feat(auth): force token refresh on 401, never the browser~~ feat(auth): force token refresh on rejected token (401/403), never the browser Jun 12, 2026

wpfleger96 merged commit 5a8cc79 into main Jun 12, 2026
27 checks passed

wpfleger96 deleted the duncan/oauth-refresh-on-401 branch June 12, 2026 21:19

wpfleger96 mentioned this pull request Jun 12, 2026

chore(release): release version 0.3.19 #1014

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(auth): force token refresh on rejected token (401/403), never the browser#1015

feat(auth): force token refresh on rejected token (401/403), never the browser#1015
wpfleger96 merged 6 commits into
mainfrom
duncan/oauth-refresh-on-401

wpfleger96 commented Jun 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wpfleger96 commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Half 1 — multi-process refresh race in bearer() (commit 408d1da5, not previously reviewed)

Half 2 — force refresh on a rejected token, keyed by token identity

Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wpfleger96 commented Jun 12, 2026 •

edited

Loading

Half 1 — multi-process refresh race in `bearer()` (commit `408d1da5`, not previously reviewed)