feat(auth): force token refresh on rejected token (401/403), never the browser#1015
Merged
Conversation
…resh race Multiple buzz-agent processes share one OAuth cache file. When the access token expires, they all race to refresh. Databricks uses rotating refresh tokens — the first process to refresh invalidates the old token. Remaining processes fail to refresh and previously fell through to browser_pkce_flow, causing repeated browser auth prompts. Add two disk re-reads in bearer(): - Before attempting refresh: picks up a token another process already wrote, avoiding a wasted network round-trip that will fail. - After refresh failure: catches the case where another process won the race between our pre-refresh check and the failed refresh attempt. Also hoists the endpoints() discovery call so it happens at most once per bearer() invocation, reused by both the refresh and browser flow paths. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
A cached OAuth token can be rejected by the provider (401) before its local expiry clock says it is stale — clock skew, server-side revocation, or a load-balancer node that never saw it. Today post() surfaces that as a terminal LlmAuth with no attempt to recover, so the agent silently dies mid-turn instead of refreshing. Add an atomic refresh_now() to the TokenSource trait. The PKCE impl holds the cache lock for the whole re-check -> refresh -> save sequence so concurrent 401s from a fanned-out turn coalesce onto a single grant: the first refreshes, the rest see the fresh token on the re-check and return it. It runs the refresh-token grant only — never browser_pkce_flow — and on a dead refresh token returns a terminal LlmAuth rather than hanging a headless harness on an interactive prompt. The refresh token is preserved (never nulled), so a future 401 can refresh again. post_openai retries exactly once on the first LlmAuth, guarded by a local bool scoped to the call so an earlier turn's 401 can never suppress a later turn's legitimate retry. The StaticTokenSource default returns its key unchanged — a static key can't refresh and one harmless retry is fine. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
refresh_now() gated the refresh behind is_expired(), but a 401 means the provider rejected a token that still looks locally fresh (skew, revocation, a node that never saw it). On that dominant case is_expired() returned false, so refresh_now() returned the rejected token unchanged, post_openai retried with the byte-identical bearer, got another 401, and the agent died silently — reintroducing the original browser-hang CRITICAL through the coalesce gate. Key coalescing off token identity instead. post_openai passes the rejected bearer into refresh_now(&rejected); under the lock, if the cached token no longer equals the rejected one a sibling already refreshed (return it), else run the refresh-token grant unconditionally. The expiry clock is never consulted here. Also scope the one-shot retry to 401: post() now maps 403 to a terminal Llm error rather than LlmAuth, so a genuine authorization failure (scope, quota, entitlement) no longer burns a pointless refresh and retry. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
A 403 is indistinguishable from an expired-or-revoked-token 403 at the HTTP-status layer — Databricks and OAuth gateways return 403 for stale tokens, not only 401. The prior commit scoped the one-shot retry to 401 and routed 403 to a terminal Llm error, which would silently drop the stale-token-403 case the agent needs to recover from. Now safe because refresh_now() forces a grant by token identity: the retry sends a genuinely fresh token, so a stale-token 403 recovers, while the per-call one-shot guard bounds a pure-authz 403 to one wasted refresh before it propagates terminally. No loop, no hang. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The 403-is-refreshable behavior had no direct coverage — the auth stub only emitted 401, so a regression narrowing the retry back to 401-only would pass every test. Parameterize the stub's reject status and add a persistent-403 test asserting exactly one refresh then terminal LlmAuth, mirroring the 401 propagation test. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
The persistent-403 test covers only the failure tail; it never proves a 403 actually recovers when the refreshed token is accepted. Add the success-path test (stale token 403s, fresh token 200s) so the stale-token-403 recovery — the reason 403 is refreshable at all — is pinned, not just the propagation case. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
tlongwell-block
pushed a commit
that referenced
this pull request
Jun 13, 2026
* origin/main: (33 commits) fix(desktop): make Windows release compile cleanly (#1029) Add production Docker Compose bundle (#985) feat(profile): show active turn badges on agent profile panel and popover (#1026) chore(release): release version 0.3.20 (#1027) fix(release): resolve Windows sidecar path and Linux AppImage updater format (#1024) chore(release): release version 0.3.19 (#1014) fix(release): ignore prerelease tags in changelog generation (#1021) fix: repair main build after cross-PR merge skew (#1020) feat(agents): show per-turn duration and prune dead turns within ~25s of host crash (#1017) fix(release): replace hermit with native tool setup on Windows job (#1018) feat(acp): surface error-class outcomes to the activity feed only, never the channel (#1010) fix(desktop): migrate Sprout workspace storage (#1016) feat(auth): force token refresh on rejected token (401/403), never the browser (#1015) fix(release): mark prerelease versions so they do not become latest (#1013) feat(acp): implement systemPrompt with protocol version gating (#981) fix(release): update repository name check from block/sprout to block/buzz (#1012) feat(release): all-OS desktop builds + universal auto-update manifest (#1011) Add relay disconnect UX: friendly errors, reconnect, cached identity (#1004) feat(agents): add active turn indicators to Agents Menu (#1005) ci: add fork guards to docker, release, and auto-tag workflows (#1007) ... Co-authored-by: npub1t2tgm7d8f995uqvmnm8h88sg3wnpp9a5xysjf6dg3tjmgt3ltulqdp8ehr <5a968df9a7494b4e019b9ecf739e088ba61097b4312124e9a88ae5b42e3f5f3e@sprout-oss.stage.blox.sqprod.co> Signed-off-by: npub1t2tgm7d8f995uqvmnm8h88sg3wnpp9a5xysjf6dg3tjmgt3ltulqdp8ehr <5a968df9a7494b4e019b9ecf739e088ba61097b4312124e9a88ae5b42e3f5f3e@sprout-oss.stage.blox.sqprod.co>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Agents were silently dying mid-turn against Databricks. The root cause was a broken token refresh: a cached token the provider rejected was never actually replaced, so the agent re-sent the dead token, got rejected again, and died with no recovery. This PR closes that, plus a related multi-process refresh race. It carries two halves — flagging the first explicitly because it has never been independently reviewed.
Half 1 — multi-process refresh race in
bearer()(commit408d1da5, not previously reviewed)Multiple
buzz-agentprocesses share one OAuth cache file. When the access token expires they all race to refresh, and Databricks rotates refresh tokens — so the first process to refresh invalidates the old one, and the losers' refresh grants previously fell through tobrowser_pkce_flow, causing repeated browser auth prompts (a headless hang).bearer()now re-reads the disk cache before attempting a refresh (picks up a token a sibling already wrote) and again after a refresh failure (catches a sibling that won the race between our check and our failed attempt). Theendpoints()discovery call is hoisted so it runs at most once perbearer()invocation. This commit was developed separately and ships here for the first time; it inherits this PR's review.Half 2 — force refresh on a rejected token, keyed by token identity
A cached token can be rejected by the provider before its local expiry clock says it is stale — clock skew, server-side revocation, or a load-balancer node that never saw it. The original
refresh_now()gated the refresh grant onis_expired(), so on a locally-fresh-but-rejected token it returned the same dead token unchanged. The agent re-sent it, was rejected again, and died. This was the actual cause of the silent deaths.TokenSource::refresh_now(&self, rejected: &str)— a new trait method that takes the rejected access token and coalesces on token identity, not the expiry clock. It holds the cache lock for the whole re-check -> refresh -> save sequence. Under the lock: if the cached token no longer equalsrejected, a sibling already refreshed -> return it (true coalesce); otherwise it is still the rejected token, so run the refresh-token grant unconditionally.is_expired()is gone from the method entirely. It runs the refresh-token grant only — neverbrowser_pkce_flow— and on a dead refresh token returns a terminalLlmAuthrather than hanging a headless harness. The refresh token is preserved, never nulled.post()status mapping — both401and403map toLlmAuthand are treated as refreshable. A 403 can mean a revoked/expired token (Databricks and OAuth gateways return 403 for this), indistinguishable from a pure-authorization 403 at the HTTP layer. The identity-coalesce design is what makes retrying a 403 safe: the retry sends a genuinely fresh token instead of re-sending a dead one.post_openaione-shot retry — on the firstLlmAuthit callsrefresh_now(&rejected)and retriespost()once. A secondLlmAuthpropagates terminally. The guard is a localboolscoped to the single call, so an earlier turn's rejection can never suppress a later turn's legitimate retry, and a persistent rejection is bounded to one grant + one retry. This seam covers all three OpenAI-family paths (/responses,/chat/completions, and the auto-upgrade).StaticTokenSourceuses the trait default —refresh_now()returns the static key unchanged, since a static key can't refresh and one harmless retry is fine.Tests
refresh_now_runs_grant_on_unexpired_rejected_token— the proof test for the core bug: an unexpired cached token (expires_at= now+3600) that equals the rejected token forces the grant to run exactly once, the returned bearer differs from the rejected one, no browser, refresh token preserved (rotated). This test fails against the old expiry-clock gate and passes now.refresh_now_coalesces_when_another_caller_already_refreshed— an unexpired cached token that differs from the rejected one short-circuits with zero grants, proving coalesce is identity-keyed, not clock-keyed.refresh_now_without_refresh_token_is_terminal— no refresh token -> terminalLlmAuth, no browser, no grant.post_openai_refreshes_once_per_call_on_401— one 401 -> exactly one refresh + successful retry; a later call gets its own refresh (guard is per-call).post_openai_persistent_401_propagates_after_one_retry— a token the refresh can't satisfy propagatesLlmAuthafter exactly one refresh (no loop).post_openai_refreshes_once_on_403— a 403 on the stale token, 200 on the fresh one: proves a 403 enters the refresh path and recovers after exactly one refresh.post_openai_persistent_403_propagates_after_one_retry— always-403 -> terminalLlmAuthafter exactly one retry, bounding a pure-authorization 403; the regression guard against re-narrowing the mapping to 401-only.static_token_source_refresh_now_returns_static_token— default impl returns the key.bearer()race-fix tests from Half 1.