E2E: Add health-aware navigation to reduce false-positive failures #64366
vatsrahul1001 merged 7 commits into apache:main
@Dev-iL This is great work for sure.
Hopefully all tests go green.
E2E: Add health-aware navigation to eliminate false-positive test failures

Tests running with 2 parallel workers produce cascading failures when one heavy test (HITL, backfill) overwhelms the shared Airflow server, causing all concurrent navigations to time out. Retries also fail because the server hasn't recovered.

Add a health-check utility that polls /api/v2/monitor/health before every navigation, using backoff intervals [1s, 2s, 4s, 8s] with a 60s cap. When the server is responsive (the common case), the check returns immediately with negligible overhead. When overloaded, it waits for recovery instead of blindly navigating into timeouts.

The health check is centralized in a new `BasePage.goto()` method that all page objects inherit. 12 direct `page.goto()` calls across DagsPage, RequiredActionsPage, DagCalendarTab, and EventsPage now use `this.goto()` instead. No spec files modified, no timeouts increased.
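The polling behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual utility: the `probe`, `sleep`, and `now` parameters are hypothetical injection points added so the sketch runs without a browser, and only the backoff list and the 60s cap come from the commit message.

```typescript
// Hypothetical sketch: the function shape, parameter names, and injected
// helpers are assumptions, not the contents of the PR's health utility.
type Probe = () => Promise<number>; // resolves to an HTTP status code
type Sleep = (ms: number) => Promise<void>;

const BACKOFF_MS = [1_000, 2_000, 4_000, 8_000];

async function waitForServerReady(
  probe: Probe,
  maxWaitMs: number = 60_000,
  sleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
  now: () => number = Date.now,
): Promise<void> {
  const deadline = now() + maxWaitMs;
  for (let attempt = 0; ; attempt++) {
    try {
      if ((await probe()) === 200) {
        return; // healthy server: no waiting at all
      }
    } catch {
      // A failed or timed-out request counts as "not ready yet".
    }
    // Walk the backoff list, then keep reusing its last entry.
    const delay = BACKOFF_MS[Math.min(attempt, BACKOFF_MS.length - 1)];
    if (now() + delay > deadline) {
      throw new Error(`Server not ready after ${maxWaitMs}ms`);
    }
    await sleep(delay);
  }
}
```

In a Playwright page object, the probe would presumably wrap a request to `/api/v2/monitor/health` and return its status code. That wiring is left out here because it depends on the test fixtures.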
E2E: Rename BasePage.goto() to safeGoto() to fix stack overflow

AssetDetailPage defines its own public goto() method that calls this.navigateTo(). When BasePage.navigateTo() dispatched to this.goto(), polymorphism resolved to AssetDetailPage.goto() instead of BasePage.goto(), creating infinite recursion. Rename to safeGoto(), which doesn't collide with any subclass method names.
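The dispatch problem can be illustrated with a stripped-down pair of classes. This is a simplified sketch: the real methods are asynchronous and wrap a Playwright page, and the `/assets/...` path and the `visited` array are hypothetical stand-ins for actual navigation.

```typescript
// Illustrative classes only; the real page objects are async and wrap a
// Playwright Page. The URL shape and the `visited` log are assumptions.
class BasePage {
  public readonly visited: string[] = [];

  // Named safeGoto so that no subclass method can shadow it by accident.
  protected safeGoto(path: string): void {
    this.visited.push(path); // stand-in for health check + page.goto()
  }

  public navigateTo(path: string): void {
    // Had this line called this.goto(path), dynamic dispatch would select
    // AssetDetailPage.goto(), which calls navigateTo() again: endless recursion.
    this.safeGoto(path);
  }
}

class AssetDetailPage extends BasePage {
  // Public convenience method with the same name the base helper used to have.
  public goto(assetId: string): void {
    this.navigateTo(`/assets/${assetId}`);
  }
}

const assetPage = new AssetDetailPage();
assetPage.goto("42");
console.log(assetPage.visited);
```

With the distinct name, `navigateTo()` always reaches the base-class helper regardless of what subclasses define.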
E2E: Relax health check to fix Firefox test timeouts

Remove the 2000ms response time threshold from health endpoint checks. The threshold was too strict for Firefox on CI, causing tests to time out while waiting for server readiness even though the server was healthy (returning 200). The health check should verify that the server responds, not enforce a specific response time.

Increases REQUEST_TIMEOUT_MS to 10000ms for individual request attempts to give slower environments time to respond, while keeping the overall MAX_WAIT_MS at 60s.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

E2E: Increase element visibility timeouts for Firefox CI flakiness

The health-aware safeGoto() protects against an unresponsive server, but two Firefox-only flaky failures occur after navigation succeeds: click-based navigation (RequiredActionsPage link clicks) and post-navigation React rendering (TaskInstancesPage table) can exceed 10s on Firefox under CI load. Increase these element visibility timeouts from 10s to 30s, matching the threshold already used by handleWaitForMultipleOptionsTask.
E2E: Fix webkit-specific flakiness across 5 page objects

- ConnectionsPage: combobox click timeout 3s→10s (matching actionTimeout)
- DagCalendarTab: wrap hover+tooltip in retry loop; webkit hover events are unreliable and may not trigger tooltips on first attempt
- DagCodePage: inner Monaco editor visibility check 5s→10s per retry attempt, giving the editor more time to initialize on webkit
- RequiredActionsPage: wait_for_default_option toPass timeout 30s→60s, too tight on webkit under 3-browser CI load
- XComsPage: table visibility wait 10s→30s, same pattern as TaskInstancesPage fix
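The hover+tooltip retry follows a general "repeat until the assertion passes or the budget runs out" pattern. The sketch below is a generic stand-in for that pattern, not Playwright's `toPass` implementation. All names are hypothetical, and the injected `sleep` and `now` exist only to make it runnable in isolation.

```typescript
// Generic stand-in for a toPass-style retry loop; this is NOT Playwright's
// implementation, and every name here is hypothetical.
async function retryUntilPasses(
  action: () => Promise<void>,
  intervalMs: number,
  timeoutMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
  now: () => number = Date.now,
): Promise<number> {
  const deadline = now() + timeoutMs;
  for (let attempt = 1; ; attempt++) {
    try {
      // e.g. hover the calendar cell, then assert that the tooltip is visible
      await action();
      return attempt; // number of attempts that were needed
    } catch (error) {
      if (now() + intervalMs > deadline) {
        throw error; // out of budget: surface the last failure
      }
      await sleep(intervalMs);
    }
  }
}
```

The PR description gives 500ms intervals and a 20s total for the calendar tooltip check. A caller following this sketch would pass those as `intervalMs` and `timeoutMs`.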
E2E: Rebalance health check and test timeouts

The 60s health check exceeded the 30s default test timeout, causing tests to die while the health check was still polling (plugins.spec.ts). It also consumed the entire toPass budget (triggerDag has toPass 60s), leaving no room for retries after the health check exhausted the timeout.

- Reduce health check MAX_WAIT_MS from 60s to 30s
- Increase default test timeout from 30s to 60s

This ensures the health check fits within any test's time budget (30s < 60s), leaving 30s for navigation and assertions. Heavy tests with custom timeouts (120s+) are unaffected.
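The budget reasoning above is simple subtraction, spelled out here as a small sketch. The helper name is hypothetical. The numbers are the ones quoted in the commit message.

```typescript
// Illustrative arithmetic only; the helper name is hypothetical.
function budgetLeftMs(outerTimeoutMs: number, healthCheckMaxMs: number): number {
  return outerTimeoutMs - healthCheckMaxMs;
}

const SECOND = 1_000;

// Before: a 60s health check inside the 30s default test timeout.
const testBefore = budgetLeftMs(30 * SECOND, 60 * SECOND);
// Before: a 60s health check inside triggerDag's 60s toPass budget.
const toPassBefore = budgetLeftMs(60 * SECOND, 60 * SECOND);
// After: a 30s health check inside the 60s default test timeout.
const testAfter = budgetLeftMs(60 * SECOND, 30 * SECOND);
// After: a 30s health check inside the 60s toPass budget.
const toPassAfter = budgetLeftMs(60 * SECOND, 30 * SECOND);

console.log({ testBefore, toPassBefore, testAfter, toPassAfter });
```

A negative or zero result means the worst-case health check alone uses up the enclosing timeout. After the rebalancing, both cases leave 30s.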
E2E: Robustify task-instances spec setup against server overload

The beforeAll makes many direct API calls (POST dagRuns, GET taskInstances, PATCH state) that use the global 10s actionTimeout. Under CI load these time out. Also, the setup had no health check and inherited the 60s describe-level timeout for a multi-step setup.

- Add waitForServerReady before API calls
- Increase beforeAll timeout to 120s
- Add explicit 30s timeout to all API calls (3x the actionTimeout)
It seems like the flaky issue might not be fully resolved yet 🥲
I know... But the question is - is it rarer? Also, one of the AI's suggestions was to use 1 concurrent worker instead of 2. This will make the tests run longer, but should be safer. Thoughts?
The main cause of the current test failures is related to data isolation and handling race conditions.
Context
E2E tests running with 2 parallel workers produce false-positive failures in CI. The test suite runs 88 tests across chromium, firefox, and webkit, and consistently shows 2-7 failures + 2-6 flaky tests per run -- all caused by server unresponsiveness or browser-specific interaction issues rather than actual bugs.
Observed Failure Pattern
From CI logs (2026-03-28 runs):
Chromium: `home-dashboard.spec.ts`, `plugins.spec.ts` x2, `requiredAction.spec.ts`, `task-instances.spec.ts`, `xcoms.spec.ts` x2 | `plugins.spec.ts` | `toPass` budgets consumed by health check; spec `beforeAll` API calls using 10s `actionTimeout`
Firefox: `requiredAction.spec.ts` (server not ready after 60s), `xcoms.spec.ts` x2 | `xcoms.spec.ts` | `triggerDag` retry budget consumed by health check
Webkit:
Failure Analysis
All failures trace back to infrastructure-level root causes, not test logic bugs:
1. `plugins.spec.ts` -- Health check exceeds test timeout (chromium)
The default test timeout (30s) is shorter than the health check timeout (60s). Playwright kills the test while the health check is still polling for recovery, so the health check never gets a chance to complete.
2. `requiredAction.spec.ts` -- Health check consumes toPass budget
`triggerDag` uses `toPass({timeout: 60_000})`. A single health check attempt (60s) exhausts the entire retry budget, leaving zero room for `toPass` to retry after recovery.
3. `xcoms.spec.ts` -- Same toPass/health check conflict
The `triggerDag` call in `beforeAll` fails because the health check consumes the full `toPass` timeout. Multiple xcoms tests then fail because setup never completed.
4. `task-instances.spec.ts` -- Unprotected API calls in beforeAll (chromium)
The `beforeAll` makes direct API calls (POST dagRuns, GET taskInstances, PATCH state) using the global 10s `actionTimeout`. No health check, no extended timeouts, no retry logic. When the server is overloaded by concurrent heavy tests, these API calls time out.
5. `dag-calendar-tab.spec.ts` -- Tooltip interaction failure
Calendar cell hover fails because the tooltip never appears. On webkit, hover events are unreliable and may not trigger on the first attempt.
6. `connections.spec.ts` -- Combobox click timeout (webkit)
The combobox is found, visible, enabled, and stable, but the click action can't complete within 3s on webkit. The explicit 3s timeout is far below the global `actionTimeout` of 10s.
7. `dag-code-tab.spec.ts` -- Monaco editor slow to initialize (webkit)
The navigation retry loop (2s intervals, 60s total) repeatedly navigates and checks for the editor. With only 5s per inner check, Monaco doesn't have enough time to initialize on webkit before the next retry fires.
Root Cause Analysis
Eight interconnected root causes identified during investigation:
`page.goto()` with no retry or health check
`navigateTo` consumers
`page.goto()` calls -- 12 direct `page.goto()` calls across 4 page objects bypass any protection in `navigateTo`
`navigateTo` is hardened, these bypass calls remain vulnerable
`toPass` budgets (60s), consuming entire time budgets with no room for retries or actual work
`toPass` loops can't retry after health check timeout
Approach
The fix is a six-layer approach across shared infrastructure, page objects, config, and one spec file:
Layer 1: Health Check Utility (`tests/e2e/utils/health.ts`)
New shared utility that polls `/api/v2/monitor/health` (unauthenticated endpoint already used by Breeze for startup checks). Checks HTTP 200 only -- no response time threshold, since Firefox on CI regularly exceeds 2s for health responses (making time-based checks unsatisfiable). Uses exponential backoff intervals `[1s, 2s, 4s, 8s]` with 30s max wait and 10s per-request timeout. When the server is healthy (the common case), returns immediately with negligible overhead.
Layer 2: Health-Aware BasePage Navigation (`BasePage.safeGoto()`)
Added a `protected safeGoto()` method to `BasePage` that wraps `waitForServerReady()` + `page.goto()`. All page object subclasses use `this.safeGoto()` instead of `this.page.goto()` directly. The existing `navigateTo()` delegates to `this.safeGoto()`. Named `safeGoto` (not `goto`) to avoid colliding with `AssetDetailPage.goto()`, which would cause infinite recursion via polymorphic dispatch through `navigateTo()`.
Layer 3: Consistent Element Visibility Timeouts
After click-based navigation (e.g., clicking a "required action" link), the health check does not apply -- the server IS responsive, but React components must still fetch API data and render. On Firefox/webkit under CI load, this regularly exceeds 10s. Increased element visibility timeouts from 10s to 30s across `RequiredActionsPage` (4 `handle*` methods), `TaskInstancesPage.navigate()`, and `XComsPage.navigate()`. Also increased `ConnectionsPage` combobox click timeout from 3s to 10s, `DagCodePage` inner Monaco editor check from 5s to 10s per retry attempt, and `RequiredActionsPage` `wait_for_default_option` task completion timeout from 30s to 60s.
Layer 4: Webkit Interaction Reliability
Webkit hover events don't reliably trigger tooltips on the first attempt. Wrapped `DagCalendarTab.getManualRunStates()` hover+tooltip check in a `toPass` retry loop (500ms intervals, 20s total) with `force: true` on hover. This retries the hover if the tooltip doesn't appear, without changing the overall timeout budget.
Layer 5: Health Check / Test Timeout Rebalancing
The 60s health check exceeded the 30s default test timeout -- tests were killed while the health check was still polling. It also consumed the entire `toPass` budget in `triggerDag` (60s), leaving no room for retries.
- Reduced health check `MAX_WAIT_MS` from 60s to 30s
- Increased default test timeout from 30s to 60s in `playwright.config.ts`
This ensures the health check always fits within the test's time budget (30s < 60s), leaving 30s for navigation and assertions. `toPass` loops with 60s budgets can now retry twice after health check failures.
Layer 6: Spec-Level API Resilience (`task-instances.spec.ts`)
The `beforeAll` setup makes direct API calls (POST dagRuns, GET taskInstances, PATCH state) that bypass the health check. These used the global 10s `actionTimeout` and had no health gating. Under server load, they time out.
- Added `waitForServerReady(page)` before API calls
- Increased `beforeAll` timeout to 120s (was inheriting 60s)
- Added explicit 30s timeout to all API calls (3x the `actionTimeout`)
Design Decisions
`toPass` budgets (60s), leaving room for actual navigation and retries.
`BasePage.safeGoto()` wrapper
`safeGoto()` not `goto()`: `AssetDetailPage` has a public `goto()` that calls `navigateTo()`. If `navigateTo()` dispatched to `this.goto()`, polymorphism would resolve to the subclass method, creating infinite recursion.
`handleWaitForMultipleOptionsTask` threshold. 10s is insufficient for Firefox/webkit under CI load.
`toPass` loop with `force: true`
Changes
New File
- `tests/e2e/utils/health.ts` -- `waitForServerReady(page)` function with backoff polling (HTTP 200 check, 10s per-request timeout, 30s overall)
Modified Files (page objects)
- `tests/e2e/pages/BasePage.ts` -- Added `protected safeGoto()` method; `navigateTo()` delegates to it
- `tests/e2e/pages/DagsPage.ts` -- 3 direct `page.goto()` calls replaced with `this.safeGoto()`
- `tests/e2e/pages/RequiredActionsPage.ts` -- 7 direct `page.goto()` calls replaced with `this.safeGoto()`; first element visibility timeouts after link clicks 10s→30s in 4 `handle*` methods; `wait_for_default_option` task timeout 30s→60s
- `tests/e2e/pages/DagCalendarTab.ts` -- 1 direct `page.goto()` call replaced with `this.safeGoto()`; hover+tooltip wrapped in retry loop for webkit reliability
- `tests/e2e/pages/EventsPage.ts` -- 1 direct `page.goto()` call replaced with `this.safeGoto()`
- `tests/e2e/pages/TaskInstancesPage.ts` -- Table visibility timeout 10s→30s
- `tests/e2e/pages/ConnectionsPage.ts` -- Combobox click timeout 3s→10s
- `tests/e2e/pages/DagCodePage.ts` -- Inner Monaco editor visibility check 5s→10s per retry attempt
- `tests/e2e/pages/XComsPage.ts` -- Table visibility timeout 10s→30s
Modified Files (config and specs)
- `playwright.config.ts` -- Default test timeout 30s→60s
- `tests/e2e/specs/task-instances.spec.ts` -- `beforeAll`: added `waitForServerReady`, 120s timeout, 30s per API call
All paths relative to `airflow-core/src/airflow/ui/`
Not Modified
`navigateTo()` exclusively (only `page.request.*` for API calls)
`navigateTo()`
Known Remaining Issues
Three categories of flakiness remain. All pass on retry (0 hard failures); none are addressable with further timeout tuning.
- `requiredAction.spec.ts` (firefox/webkit) -- The HITL workflow (`verifyFinalTaskStates` → `waitForTaskState`) sometimes exceeds the test timeout under 2-worker parallel load. The Airflow scheduler is overwhelmed when heavy DAG operations overlap across workers. Passes on retry once the server recovers. The only fix is reducing to `workers: 1` (doubling test time) or increasing server capacity.
- `click()` hangs at the 10s `actionTimeout`. Affects `expandAllButton` and `addFilterButton`. Passes on retry, suggesting a transient webkit click interception issue (possibly an overlay or loading state). Increasing the global `actionTimeout` would affect all tests across all browsers.
- `dag-calendar-tab.spec.ts` (webkit) -- "failed filter shows only failed runs" expects both success and failed runs in the calendar, but only sees success. This is a data timing issue: the failed DAG run created in `beforeAll` hasn't been reflected in the calendar data when the test reads tooltip states. Not a timeout issue: the test reads stale data, not missing UI elements.
How It Works
When the server is healthy (the normal case), `waitForServerReady` completes on the first attempt with negligible overhead. When the server is overloaded (the flaky CI case), the health check waits with backoff until the server recovers, then proceeds with navigation. The 30s health check fits within the 60s default test timeout, leaving room for the actual navigation and assertions. For `toPass` loops, the 30s health check leaves budget for at least one retry.
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Opus 4.6 following the guidelines