fix(reliability): Add timeouts and error handling for CI operations by cfsmp3 · Pull Request #943 · CCExtractor/sample-platform

cfsmp3 · 2025-12-22T05:28:10Z

Summary

Phase 1 quick wins for Sample Platform reliability (fixes #942):

Add timeout constants for external API calls (GitHub, GCP, artifact downloads)
Fix wait_for_operation() infinite loop - now times out after 30 minutes with exponential backoff
Add mark_test_failed() helper to properly update test status on errors
Fix artifact download with timeout and proper error handling (no more silent failures)
Add basic locking to prevent duplicate test starts
Fix file handle leak in artifact download
Add error handling for GitHub API calls in gcp_instance()

Test plan

Verify syntax is valid (python3 -m py_compile mod_ci/controllers.py)
Deploy to staging and trigger a test run
Verify timeout behavior works (can be tested by setting a short timeout temporarily)
Verify duplicate test prevention works
Monitor logs for proper error messages on failures

🤖 Generated with Claude Code

Phase 1 quick wins for Sample Platform reliability: 1. Add timeout constants for external API calls 2. Fix wait_for_operation() infinite loop - now times out after 30 minutes 3. Add mark_test_failed() helper to properly update test status on errors 4. Fix artifact download with timeout and proper error handling 5. Add basic locking to prevent duplicate test starts 6. Fix file handle leak in artifact download 7. Add error handling for GitHub API calls Fixes #942 (Phase 1) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The exponential backoff calculation (poll_interval * 1.5) produces a float, so poll_interval must be initialized as a float to satisfy mypy type checking. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

pycodestyle requires all imports at top of file before any other code. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The new duplicate test prevention code checks for existing GcpInstance and TestProgress entries. The test needs to mock these queries to return None so the locking checks pass and the test can proceed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

codecov · 2025-12-22T05:58:01Z

Codecov Report

❌ Patch coverage is 90.66667% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.88%. Comparing base (a53988b) to head (724de92).
⚠️ Report is 9 commits behind head on master.

Files with missing lines	Patch %	Lines
mod_ci/controllers.py	90.66%	5 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #943      +/-   ##
==========================================
+ Coverage   86.70%   86.88%   +0.17%     
==========================================
  Files          35       35              
  Lines        3709     3759      +50     
  Branches      761      767       +6     
==========================================
+ Hits         3216     3266      +50     
  Misses        355      355              
  Partials      138      138

Flag	Coverage Δ
unittests	`86.88% <90.66%> (+0.17%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Add tests to improve code coverage for the reliability fixes: - test_mark_test_failed_success: Tests successful test failure marking - test_mark_test_failed_exception: Tests exception handling in mark_test_failed - test_gcp_instance_github_exception: Tests GithubException handling - test_gcp_instance_unexpected_exception: Tests generic exception handling - test_start_test_duplicate_instance_check: Tests GCP instance deduplication - test_start_test_duplicate_progress_check: Tests progress entry deduplication - test_start_test_artifact_timeout: Tests artifact download timeout handling - test_start_test_artifact_http_error: Tests HTTP error handling for artifacts - test_wait_for_operation_timeout: Tests operation timeout logic - test_wait_for_operation_api_error: Tests API error handling - test_wait_for_operation_success: Tests successful operation completion These tests cover the new error handling and reliability code paths introduced in the reliability fixes PR. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Remove unused 'import zipfile' from artifact tests - Fix import ordering to satisfy isort 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Align continuation lines with opening parenthesis for pycodestyle. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The gcp_instance function doesn't return anything, so assigning its result and asserting it's None is meaningless. This fixes the SonarCloud "C Reliability Rating" issue. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

sonarqubecloud · 2025-12-22T06:23:00Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

The tests from PR #943 passed mock_g.db to functions that internally call db.query().filter(), but the mock chain wasn't properly set up. In Python 3.10+, this causes 'coroutine' object has no attribute 'filter' errors. Since these tests are testing exception handling (GithubException, timeout, HTTP errors), not database interactions, use the real g.db instead of the mock. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

This fixes a regression introduced in Phase 2 reliability improvements (PRs #943/#946) where tests would be permanently marked as failed when the GitHub Actions build hadn't completed yet. Previous behavior (before Dec 2025): - If artifact not found, just return silently - Test stays in "pending" state - Next cron cycle retries the test - Eventually build completes and test runs Regression behavior (after Phase 2): - If artifact not found, mark test as failed immediately - Test is permanently marked as "canceled" with ERROR status - No retry - test never runs even after build completes Fixed behavior (this PR): - Distinguish between retryable and permanent failures - Retryable: build in progress, workflow queued, diagnostic error - Permanent: build failed, artifact expired - Only mark as failed for permanent failures - Return silently for retryable failures (next cron cycle retries) Changes: - Modified _diagnose_missing_artifact() to return tuple (message, is_retryable) - Modified start_test() to check is_retryable before marking as failed - Added 8 comprehensive tests for retry behavior Fixes issue where Windows tests fail with "Build still in progress" when the cron job runs before the 40-minute Windows build completes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cfsmp3 requested review from canihavesomecoffee and thealphadollar as code owners December 22, 2025 05:28

cfsmp3 and others added 4 commits December 22, 2025 06:44

fix(style): Move timeout constants after imports to fix E402

db01439

pycodestyle requires all imports at top of file before any other code. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix(style): Break long lines to satisfy E501 (max 120 chars)

5add1df

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cfsmp3 and others added 4 commits December 22, 2025 07:11

fix(test): Fix import sorting and remove unused imports

66ac9b7

- Remove unused 'import zipfile' from artifact tests - Fix import ordering to satisfy isort 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

fix(test): Fix continuation line indentation (E127)

b4e0410

Align continuation lines with opening parenthesis for pycodestyle. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

canihavesomecoffee approved these changes Dec 22, 2025

View reviewed changes

canihavesomecoffee merged commit 53b3d7f into master Dec 22, 2025
8 checks passed

cfsmp3 mentioned this pull request Dec 25, 2025

fix: Retry tests when build is still in progress instead of failing #973

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(reliability): Add timeouts and error handling for CI operations#943

fix(reliability): Add timeouts and error handling for CI operations#943
canihavesomecoffee merged 9 commits intomasterfrom
fix/reliability-quick-wins

cfsmp3 commented Dec 22, 2025

Uh oh!

codecov bot commented Dec 22, 2025 •

edited

Loading

Uh oh!

sonarqubecloud bot commented Dec 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cfsmp3 commented Dec 22, 2025

Summary

Test plan

Uh oh!

codecov bot commented Dec 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sonarqubecloud bot commented Dec 22, 2025

Quality Gate passed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov bot commented Dec 22, 2025 •

edited

Loading