fix: add retry status timeout to prevent infinite hang #2263

Open
cloudwaddie-agent wants to merge 2 commits into code-yeongyu:dev from cloudwaddie-agent:fix/retry-status-timeout

cloudwaddie-agent commented Mar 3, 2026

Summary

When runtime fallback triggers on errors like 503, the session status becomes "retry". The poll-for-completion loop treats "retry" the same as "busy", causing it to wait indefinitely without any timeout.

This fix adds a 3-minute timeout (max_fallback_attempts * cooldown_seconds = 3 * 60s) for the "retry" status. If the session stays in "retry" longer than the timeout, polling exits with an error instead of hanging until the process is terminated (exit code 143, i.e. 128 + SIGTERM).

Changes

  • Added DEFAULT_RETRY_STATUS_TIMEOUT_MS constant (180000ms = 3 minutes)
  • Added retryStatusTimeoutMs option to PollOptions interface
  • Track when session first enters "retry" status
  • Check if retry timeout exceeded and exit with error message if so
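
The changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: only `DEFAULT_RETRY_STATUS_TIMEOUT_MS`, `retryStatusTimeoutMs`, `PollOptions`, and `retryStatusStartTimestamp` appear in this PR; `SessionStatus`, `RetryTracker`, and `checkRetryTimeout` are hypothetical names, and clearing the timestamp when the session leaves "retry" is an assumption of this sketch.

```typescript
// 3 fallback attempts * 60 s cooldown = 180000 ms (3 minutes), per the PR description.
const DEFAULT_RETRY_STATUS_TIMEOUT_MS = 3 * 60 * 1000;

// Only retryStatusTimeoutMs is named in the PR; the real interface has other fields.
interface PollOptions {
  retryStatusTimeoutMs?: number;
}

// Hypothetical status union, limited to the statuses mentioned in this PR.
type SessionStatus = "idle" | "busy" | "retry";

// Hypothetical holder for the moment the session first entered "retry".
interface RetryTracker {
  retryStatusStartTimestamp: number | null;
}

// Returns true when the session has stayed in "retry" longer than the timeout.
function checkRetryTimeout(
  tracker: RetryTracker,
  status: SessionStatus,
  now: number,
  options: PollOptions = {},
): boolean {
  const timeoutMs = options.retryStatusTimeoutMs ?? DEFAULT_RETRY_STATUS_TIMEOUT_MS;
  if (status !== "retry") {
    // Assumption: leaving "retry" ends the current retry window.
    tracker.retryStatusStartTimestamp = null;
    return false;
  }
  if (tracker.retryStatusStartTimestamp === null) {
    tracker.retryStatusStartTimestamp = now; // first poll that saw "retry"
  }
  return now - tracker.retryStatusStartTimestamp > timeoutMs;
}
```

A poll loop would call `checkRetryTimeout` once per iteration with the current time and exit with an error when it returns `true`.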

Root Cause

The issue was reported in CloudWaddie/actions-agent#82 where the agent hangs with exit code 143 when 503 errors trigger runtime fallback retry loops. The session gets stuck in "retry" status and the poll loop keeps waiting forever.

Testing

The fix should be tested by:

  1. Triggering a runtime fallback scenario (e.g., 503 error)
  2. Verifying the session exits after 3 minutes instead of hanging indefinitely
  3. Verifying the error message is helpful
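
One way to exercise steps 1 and 2 without waiting three real minutes is to drive the loop with a simulated clock. The sketch below is an assumption-laden stand-in, not the PR's poll loop: `simulatePoll`, `PollResult`, the 500 ms poll interval, treating "idle" as completion, and exit code 1 for the timeout are all hypothetical.

```typescript
type SessionStatus = "idle" | "busy" | "retry";

interface PollResult {
  exitCode: number; // 0 = completed; 1 = retry timeout (hypothetical non-zero code)
  elapsedMs: number; // simulated time at which polling stopped
}

// Synchronous simulation of a poll loop. statusAt stands in for the real
// session status lookup and receives the simulated elapsed time.
function simulatePoll(
  statusAt: (elapsedMs: number) => SessionStatus,
  retryStatusTimeoutMs: number,
  pollIntervalMs = 500,
  maxElapsedMs = 60 * 60 * 1000,
): PollResult {
  let retryStart: number | null = null;
  for (let now = 0; now <= maxElapsedMs; now += pollIntervalMs) {
    const status = statusAt(now);
    if (status === "idle") {
      return { exitCode: 0, elapsedMs: now };
    }
    if (status === "retry") {
      if (retryStart === null) retryStart = now;
      if (now - retryStart > retryStatusTimeoutMs) {
        return { exitCode: 1, elapsedMs: now };
      }
    } else {
      retryStart = null; // "busy" ends the current retry window
    }
  }
  throw new Error("simulation did not terminate");
}

// A session that never leaves "retry" (e.g. repeated 503s).
const stuck = simulatePoll(() => "retry", 180000);

// A session that retries for 10 s, works for 10 s, then finishes.
const recovering = simulatePoll(
  (t) => (t < 10000 ? "retry" : t < 20000 ? "busy" : "idle"),
  180000,
);

console.log(stuck); // { exitCode: 1, elapsedMs: 180500 }
console.log(recovering); // { exitCode: 0, elapsedMs: 20000 }
```

The stuck case stops on the first poll after the 3-minute mark, while the recovering case completes normally and never reaches the timeout.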

Fixes: CloudWaddie/actions-agent#82


Summary by cubic

Prevent infinite hangs when runtime fallback keeps retrying by adding a 3-minute timeout for sessions in "retry". The poll loop now exits with a clear, actionable error instead of waiting forever.

  • Bug Fixes
    • Add DEFAULT_RETRY_STATUS_TIMEOUT_MS (180000 ms) and retryStatusTimeoutMs option.
    • Track first entry into "retry"; reset timer when status becomes "busy" or "idle".
    • On timeout, exit non-zero and print guidance to check model availability and network connectivity.

Written for commit e1ce286. Summary will update on new commits.

When runtime fallback triggers on errors like 503, the session status
becomes 'retry'. The poll-for-completion loop treats 'retry' the same
as 'busy', causing it to wait indefinitely without any timeout.

This fix adds a 3-minute timeout (based on max_fallback_attempts *
cooldown_seconds = 3 * 60s) for the retry status. If the session
stays in retry status longer than the timeout, it exits with an error
instead of hanging until the process is terminated (exit code 143).

Fixes: Session hangs with exit code 143 when 503 errors trigger
runtime fallback retry loops

cubic-dev-ai bot left a comment

1 issue found across 1 file

Confidence score: 2/5

  • There is a high-confidence, high-severity logic issue in src/cli/run/poll-for-completion.ts: retryStatusStartTimestamp is not reset on busy, which can incorrectly accumulate elapsed time across unrelated retries.
  • This can falsely trigger the 3-minute timeout and prematurely fail completion polling, so there is clear user-impacting regression risk if merged as-is.
  • Given the concrete behavior impact and strong confidence (8/10 severity, 10/10 confidence), this is better treated as high merge risk until fixed.
  • Pay close attention to src/cli/run/poll-for-completion.ts - timeout tracking needs to be scoped to the current retry window, not carried across transient status changes.

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/cli/run/poll-for-completion.ts">

<violation number="1" location="src/cli/run/poll-for-completion.ts:115">
P1: Custom agent: **Opencode Compatibility**

The `retryStatusStartTimestamp` is not reset when the status changes to `busy`, causing accumulated time between unrelated transient errors to falsely trigger the 3-minute timeout and prematurely terminate long-running sessions.</violation>
</file>
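
The failure mode described in this violation can be reproduced with a small replay of poll samples. This is a hypothetical sketch rather than the code under review: `StatusSample`, `retryTimeoutFires`, and the sample timings are invented for illustration; only `retryStatusStartTimestamp` and the 180000 ms timeout come from the PR.

```typescript
type SessionStatus = "idle" | "busy" | "retry";

interface StatusSample {
  atMs: number; // time of the poll, in ms since polling started
  status: SessionStatus;
}

// Replays a sequence of polls and reports whether the retry timeout would fire.
// resetOnLeave = false models the flagged behavior; true models the suggested fix.
function retryTimeoutFires(
  samples: StatusSample[],
  timeoutMs: number,
  resetOnLeave: boolean,
): boolean {
  let retryStatusStartTimestamp: number | null = null;
  for (const { atMs, status } of samples) {
    if (status === "retry") {
      if (retryStatusStartTimestamp === null) retryStatusStartTimestamp = atMs;
      if (atMs - retryStatusStartTimestamp > timeoutMs) return true;
    } else if (resetOnLeave) {
      retryStatusStartTimestamp = null; // scope the timer to the current retry window
    }
  }
  return false;
}

// Two brief, unrelated retries separated by four minutes of normal "busy" work.
const samples: StatusSample[] = [
  { atMs: 0, status: "retry" },
  { atMs: 5000, status: "busy" },
  { atMs: 245000, status: "retry" },
  { atMs: 250000, status: "busy" },
];

const withoutReset = retryTimeoutFires(samples, 180000, false);
const withReset = retryTimeoutFires(samples, 180000, true);
console.log({ withoutReset, withReset }); // { withoutReset: true, withReset: false }
```

Without the reset, the second retry is measured from the first retry's timestamp (245 s elapsed), so the timeout fires even though neither retry window lasted more than a few seconds.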

Co-authored-by: cubic-dev-ai[bot] <191113872+cubic-dev-ai[bot]@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

freezing leads to 143