Skip to content

fix(reliability): Add timeouts and error handling for CI operations#943

Merged
canihavesomecoffee merged 9 commits intomasterfrom
fix/reliability-quick-wins
Dec 22, 2025
Merged

fix(reliability): Add timeouts and error handling for CI operations#943
canihavesomecoffee merged 9 commits intomasterfrom
fix/reliability-quick-wins

Conversation

@cfsmp3
Copy link
Contributor

@cfsmp3 cfsmp3 commented Dec 22, 2025

Summary

Phase 1 quick wins for Sample Platform reliability (fixes #942):

  • Add timeout constants for external API calls (GitHub, GCP, artifact downloads)
  • Fix wait_for_operation() infinite loop - now times out after 30 minutes with exponential backoff
  • Add mark_test_failed() helper to properly update test status on errors
  • Fix artifact download with timeout and proper error handling (no more silent failures)
  • Add basic locking to prevent duplicate test starts
  • Fix file handle leak in artifact download
  • Add error handling for GitHub API calls in gcp_instance()

Test plan

  • Verify syntax is valid (python3 -m py_compile mod_ci/controllers.py)
  • Deploy to staging and trigger a test run
  • Verify timeout behavior works (can be tested by setting a short timeout temporarily)
  • Verify duplicate test prevention works
  • Monitor logs for proper error messages on failures

🤖 Generated with Claude Code

Phase 1 quick wins for Sample Platform reliability:

1. Add timeout constants for external API calls
2. Fix wait_for_operation() infinite loop - now times out after 30 minutes
3. Add mark_test_failed() helper to properly update test status on errors
4. Fix artifact download with timeout and proper error handling
5. Add basic locking to prevent duplicate test starts
6. Fix file handle leak in artifact download
7. Add error handling for GitHub API calls

Fixes #942 (Phase 1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cfsmp3 and others added 4 commits December 22, 2025 06:44
The exponential backoff calculation (poll_interval * 1.5) produces a float,
so poll_interval must be initialized as a float to satisfy mypy type checking.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
pycodestyle requires all imports at top of file before any other code.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The new duplicate test prevention code checks for existing GcpInstance and
TestProgress entries. The test needs to mock these queries to return None
so the locking checks pass and the test can proceed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@codecov
Copy link

codecov bot commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 90.66667% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.88%. Comparing base (a53988b) to head (724de92).
⚠️ Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
mod_ci/controllers.py 90.66% 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #943      +/-   ##
==========================================
+ Coverage   86.70%   86.88%   +0.17%     
==========================================
  Files          35       35              
  Lines        3709     3759      +50     
  Branches      761      767       +6     
==========================================
+ Hits         3216     3266      +50     
  Misses        355      355              
  Partials      138      138              
Flag Coverage Δ
unittests 86.88% <90.66%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

cfsmp3 and others added 4 commits December 22, 2025 07:11
Add tests to improve code coverage for the reliability fixes:

- test_mark_test_failed_success: Tests successful test failure marking
- test_mark_test_failed_exception: Tests exception handling in mark_test_failed
- test_gcp_instance_github_exception: Tests GithubException handling
- test_gcp_instance_unexpected_exception: Tests generic exception handling
- test_start_test_duplicate_instance_check: Tests GCP instance deduplication
- test_start_test_duplicate_progress_check: Tests progress entry deduplication
- test_start_test_artifact_timeout: Tests artifact download timeout handling
- test_start_test_artifact_http_error: Tests HTTP error handling for artifacts
- test_wait_for_operation_timeout: Tests operation timeout logic
- test_wait_for_operation_api_error: Tests API error handling
- test_wait_for_operation_success: Tests successful operation completion

These tests cover the new error handling and reliability code paths
introduced in the reliability fixes PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove unused 'import zipfile' from artifact tests
- Fix import ordering to satisfy isort

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Align continuation lines with opening parenthesis for pycodestyle.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The gcp_instance function doesn't return anything, so assigning
its result and asserting it's None is meaningless. This fixes the
SonarCloud "C Reliability Rating" issue.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@sonarqubecloud
Copy link

@canihavesomecoffee canihavesomecoffee merged commit 53b3d7f into master Dec 22, 2025
8 checks passed
cfsmp3 added a commit that referenced this pull request Dec 22, 2025
The tests from PR #943 passed mock_g.db to functions that internally
call db.query().filter(), but the mock chain wasn't properly set up.
In Python 3.10+, this causes 'coroutine' object has no attribute 'filter'
errors.

Since these tests are testing exception handling (GithubException,
timeout, HTTP errors), not database interactions, use the real g.db
instead of the mock.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
canihavesomecoffee pushed a commit that referenced this pull request Dec 23, 2025
The tests from PR #943 passed mock_g.db to functions that internally
call db.query().filter(), but the mock chain wasn't properly set up.
In Python 3.10+, this causes 'coroutine' object has no attribute 'filter'
errors.

Since these tests are testing exception handling (GithubException,
timeout, HTTP errors), not database interactions, use the real g.db
instead of the mock.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cfsmp3 added a commit that referenced this pull request Dec 25, 2025
This fixes a regression introduced in Phase 2 reliability improvements
(PRs #943/#946) where tests would be permanently marked as failed when
the GitHub Actions build hadn't completed yet.

Previous behavior (before Dec 2025):
- If artifact not found, just return silently
- Test stays in "pending" state
- Next cron cycle retries the test
- Eventually build completes and test runs

Regression behavior (after Phase 2):
- If artifact not found, mark test as failed immediately
- Test is permanently marked as "canceled" with ERROR status
- No retry - test never runs even after build completes

Fixed behavior (this PR):
- Distinguish between retryable and permanent failures
- Retryable: build in progress, workflow queued, diagnostic error
- Permanent: build failed, artifact expired
- Only mark as failed for permanent failures
- Return silently for retryable failures (next cron cycle retries)

Changes:
- Modified _diagnose_missing_artifact() to return tuple (message, is_retryable)
- Modified start_test() to check is_retryable before marking as failed
- Added 8 comprehensive tests for retry behavior

Fixes issue where Windows tests fail with "Build still in progress"
when the cron job runs before the 40-minute Windows build completes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Sample Platform Initial Assessment - Reliability Issues and Improvement Plan

2 participants