
E2E: Add health-aware navigation to reduce false-positive failures #64366

Merged
vatsrahul1001 merged 7 commits into apache:main from Dev-iL:2603/e2e_deflake on Mar 29, 2026

@Dev-iL
Collaborator

@Dev-iL Dev-iL commented Mar 28, 2026

Context

E2E tests running with 2 parallel workers produce false-positive failures in CI. The test suite runs 88 tests across chromium, firefox, and webkit, and consistently shows 2-7 failures + 2-6 flaky tests per run -- all caused by server unresponsiveness or browser-specific interaction issues rather than actual bugs.

Observed Failure Pattern

From CI logs (2026-03-28 runs):

Chromium:

  • 7 failed: home-dashboard.spec.ts, plugins.spec.ts x2, requiredAction.spec.ts, task-instances.spec.ts, xcoms.spec.ts x2
  • 1 flaky: plugins.spec.ts
  • Root causes: health check (60s) exceeded test timeout (30s); toPass budgets consumed by health check; spec beforeAll API calls using 10s actionTimeout

Firefox:

  • 3 failed: requiredAction.spec.ts (server not ready after 60s), xcoms.spec.ts x2
  • 1 flaky: xcoms.spec.ts
  • Root causes: server genuinely down >60s during concurrent heavy tests; triggerDag retry budget consumed by health check

Webkit:

  • 0 failed, all passed
  • Previously: 1 failed + 6 flaky (fixed by webkit-specific changes)

Failure Analysis

All failures trace back to infrastructure-level root causes, not test logic bugs:

1. plugins.spec.ts -- Health check exceeds test timeout (chromium)

page.waitForTimeout: Test timeout of 30000ms exceeded
> 60 |     await page.waitForTimeout(Math.min(interval, remaining));

The default test timeout (30s) is shorter than the health check timeout (60s). Playwright kills the test while the health check is still polling for recovery. The health check never gets a chance to complete.

2. requiredAction.spec.ts -- Health check consumes toPass budget

Server not ready after 60000ms — health endpoint did not return 200
Timeout 60000ms exceeded while waiting on the predicate

triggerDag uses toPass({timeout: 60_000}). A single health check attempt (60s) exhausts the entire retry budget, leaving zero room for toPass to retry after recovery.

3. xcoms.spec.ts -- Same toPass/health check conflict

Timeout 60000ms exceeded while waiting on the predicate

The triggerDag call in beforeAll fails because the health check consumes the full toPass timeout. Multiple xcoms tests then fail because setup never completed.

4. task-instances.spec.ts -- Unprotected API calls in beforeAll (chromium)

apiRequestContext.patch: Timeout 10000ms exceeded
apiRequestContext.get: Timeout 10000ms exceeded

The beforeAll makes direct API calls (POST dagRuns, GET taskInstances, PATCH state) using the global 10s actionTimeout. No health check, no extended timeouts, no retry logic. When the server is overloaded by concurrent heavy tests, these API calls time out.

5. dag-calendar-tab.spec.ts -- Tooltip interaction failure

DagCalendarTab.ts:73 - expect(locator).toBeVisible() failed - getByTestId('calendar-tooltip')

Calendar cell hover fails because the tooltip never appears. On webkit, hover events are unreliable and may not trigger on the first attempt.

6. connections.spec.ts -- Combobox click timeout (webkit)

ConnectionsPage.ts:218 - locator.click: Timeout 3000ms exceeded

The combobox is found, visible, enabled, and stable, but the click action can't complete within 3s on webkit. The explicit 3s timeout is far below the global actionTimeout of 10s.

7. dag-code-tab.spec.ts -- Monaco editor slow to initialize (webkit)

DagCodePage.ts:42 - expect(locator).toBeVisible() failed - locator('[role="code"]')

The navigation retry loop (2s intervals, 60s total) repeatedly navigates and checks for the editor. With only 5s per inner check, Monaco doesn't have enough time to initialize on webkit before the next retry fires.

Root Cause Analysis

Eight interconnected root causes were identified during the investigation:

| # | Root Cause | Impact | Affected Tests |
|---|------------|--------|----------------|
| 1 | No server health gating -- Tests navigate blindly regardless of server state | When server is overwhelmed by heavy tests (HITL, backfill), all concurrent tests fail | ALL failing tests |
| 2 | BasePage.navigateTo lacks resilience -- Simple page.goto() with no retry or health check | Navigation times out within test timeout if server is slow | XComs, TaskInstances, all navigateTo consumers |
| 3 | No inter-test recovery -- No mechanism to wait for server to recover between tests | Heavy test on worker 1 overwhelms server, test on worker 2 fails, retries fail too | Cross-worker failures |
| 4 | Page objects bypass BasePage with fragile page.goto() calls -- 12 direct page.goto() calls across 4 page objects bypass any protection in navigateTo | Even if navigateTo is hardened, these bypass calls remain vulnerable | RequiredActions (7), DagsPage (3), Calendar (1), Events (1) |
| 5 | Health check response time threshold too strict for Firefox -- Original health check required HTTP 200 AND response < 2s, but Firefox on CI regularly exceeds 2s | Health check condition is never satisfied, causing 60s timeout even though the server is healthy | ALL Firefox tests |
| 6 | Element visibility timeouts too tight -- After navigation (click-based or programmatic), page elements must fetch API data and render. Hardcoded 3-10s timeouts are insufficient on Firefox/webkit under CI load | Server IS responsive, but post-navigation rendering exceeds tight timeouts | RequiredActions, TaskInstances, Connections, XComs, DagCode |
| 7 | Webkit-specific interaction unreliability -- Webkit hover events don't reliably trigger tooltips on first attempt; click dispatch is slower than chromium/firefox | Single-attempt interactions fail intermittently on webkit | DagCalendarTab (hover/tooltip), Connections (combobox click) |
| 8 | Health check / test timeout imbalance -- Health check MAX_WAIT (60s) ≥ default test timeout (30s) and toPass budgets (60s), consuming entire time budgets with no room for retries or actual work | Tests die mid-health-check; toPass loops can't retry after health check timeout | plugins, requiredAction, xcoms, home-dashboard (all browsers) |
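
Root cause 5 can be illustrated with a small sketch. It assumes a hypothetical `ProbeResult` shape (status plus elapsed time) that is not taken from the PR, and the 2.5s response time is an illustrative number. It only shows why a 200-plus-latency criterion can stay unsatisfied while the server is in fact healthy.

```typescript
// Hypothetical shape of one health probe result (an assumption for illustration).
interface ProbeResult {
  status: number; // HTTP status code
  elapsedMs: number; // how long the request took
}

// Original criterion as described in the table: HTTP 200 AND a response under 2s.
function isReadyWithThreshold(result: ProbeResult, thresholdMs = 2_000): boolean {
  return result.status === 200 && result.elapsedMs < thresholdMs;
}

// Relaxed criterion: any HTTP 200 counts as ready.
function isReadyStatusOnly(result: ProbeResult): boolean {
  return result.status === 200;
}

// A healthy but slow response, as reported for Firefox on CI (numbers are illustrative).
const slowButHealthy: ProbeResult = { status: 200, elapsedMs: 2_500 };
const oldVerdict = isReadyWithThreshold(slowButHealthy); // false: keeps polling until the cap
const newVerdict = isReadyStatusOnly(slowButHealthy); // true: navigation can proceed
```

With the threshold in place, every slow-but-successful probe counts as a failure, so the poll loop runs to its cap even though each response was a 200.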

Approach

The fix is a six-layer approach across shared infrastructure, page objects, config, and one spec file:

Layer 1: Health Check Utility (tests/e2e/utils/health.ts)

New shared utility that polls /api/v2/monitor/health (unauthenticated endpoint already used by Breeze for startup checks). Checks HTTP 200 only -- no response time threshold, since Firefox on CI regularly exceeds 2s for health responses (making time-based checks unsatisfiable). Uses exponential backoff intervals [1s, 2s, 4s, 8s] with 30s max wait and 10s per-request timeout. When the server is healthy (the common case), returns immediately with negligible overhead.
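
A minimal sketch of the polling loop described above, not the actual contents of health.ts. To keep it runnable without Playwright, the HTTP request, the sleep, and the clock are passed in through a hypothetical `HealthDeps` object. In the real utility the probe would presumably be a request to /api/v2/monitor/health with a 10s per-request timeout. The return value (number of attempts) is also an addition for illustration.

```typescript
// Hypothetical dependency bundle so the sketch runs without Playwright.
interface HealthDeps {
  probe: () => Promise<number>; // resolves to an HTTP status; rejects on timeout or network error
  sleep: (ms: number) => Promise<void>;
  now: () => number; // current time in milliseconds
}

const BACKOFF_MS = [1_000, 2_000, 4_000, 8_000];
const MAX_WAIT_MS = 30_000;

// Polls until the probe returns 200 or maxWaitMs has elapsed.
// Returns the number of attempts that were needed.
async function waitForServerReady(deps: HealthDeps, maxWaitMs = MAX_WAIT_MS): Promise<number> {
  const start = deps.now();
  let attempt = 0;
  for (;;) {
    attempt += 1;
    try {
      if ((await deps.probe()) === 200) return attempt; // fast path when the server is healthy
    } catch {
      // A timed-out or failed request is treated the same as a non-200 response.
    }
    const remaining = maxWaitMs - (deps.now() - start);
    if (remaining <= 0) {
      throw new Error(`Server not ready after ${maxWaitMs}ms: health endpoint did not return 200`);
    }
    // Walk through the backoff list, then stay at its last value.
    const interval = BACKOFF_MS[Math.min(attempt - 1, BACKOFF_MS.length - 1)] ?? 8_000;
    await deps.sleep(Math.min(interval, remaining));
  }
}
```

Capping each sleep at the remaining budget keeps the total wait at `maxWaitMs` rather than overshooting by up to one backoff interval.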

Layer 2: Health-Aware BasePage Navigation (BasePage.safeGoto())

Added a protected safeGoto() method to BasePage that wraps waitForServerReady() + page.goto(). All page object subclasses use this.safeGoto() instead of this.page.goto() directly. The existing navigateTo() delegates to this.safeGoto(). Named safeGoto (not goto) to avoid colliding with AssetDetailPage.goto(), which would cause infinite recursion via polymorphic dispatch through navigateTo().
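
The naming constraint can be sketched as follows. This is a simplified model rather than the real page objects: `PageLike`, the injected `serverReady` callback, and the `/assets/...` path are assumptions made so the example runs on its own.

```typescript
// Hypothetical minimal stand-in for the Playwright page used by the page objects.
interface PageLike {
  goto(url: string): Promise<void>;
}

class BasePage {
  protected readonly page: PageLike;
  private readonly serverReady: () => Promise<void>;

  constructor(page: PageLike, serverReady: () => Promise<void>) {
    this.page = page;
    this.serverReady = serverReady;
  }

  // Health check first, then the actual navigation.
  protected async safeGoto(path: string): Promise<void> {
    await this.serverReady();
    await this.page.goto(path);
  }

  // Delegates to safeGoto(). If this called this.goto() instead, the call would
  // resolve to AssetDetailPage.goto() below and recurse without end.
  async navigateTo(path: string): Promise<void> {
    await this.safeGoto(path);
  }
}

class AssetDetailPage extends BasePage {
  // Public goto() that calls navigateTo(); the path is illustrative.
  async goto(assetId: string): Promise<void> {
    await this.navigateTo(`/assets/${assetId}`);
  }
}
```

Because no subclass defines `safeGoto`, the call inside `navigateTo()` always lands on the BasePage implementation.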

Layer 3: Consistent Element Visibility Timeouts

After click-based navigation (e.g., clicking a "required action" link), the health check does not apply -- the server IS responsive, but React components must still fetch API data and render. On Firefox/webkit under CI load, this regularly exceeds 10s. Increased element visibility timeouts from 10s to 30s across RequiredActionsPage (4 handle* methods), TaskInstancesPage.navigate(), and XComsPage.navigate(). Also increased ConnectionsPage combobox click timeout from 3s to 10s, DagCodePage inner Monaco editor check from 5s to 10s per retry attempt, and RequiredActionsPage wait_for_default_option task completion timeout from 30s to 60s.

Layer 4: Webkit Interaction Reliability

Webkit hover events don't reliably trigger tooltips on the first attempt. Wrapped DagCalendarTab.getManualRunStates() hover+tooltip check in a toPass retry loop (500ms intervals, 20s total) with force: true on hover. This retries the hover if the tooltip doesn't appear, without changing the overall timeout budget.
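
A rough sketch of the hover retry pattern. The PR uses Playwright's toPass for this; the `retryUntilPass` helper below is a hand-written stand-in so the example runs without Playwright, and `HoverTarget`, `TooltipProbe`, and `RetryClock` are hypothetical structural types, not Playwright's.

```typescript
// Hypothetical structural types standing in for Playwright locators and timers.
interface HoverTarget {
  hover(options: { force: boolean }): Promise<void>;
}
interface TooltipProbe {
  isVisible(): Promise<boolean>;
}
interface RetryClock {
  now(): number;
  sleep(ms: number): Promise<void>;
}

// Stand-in for a toPass-style loop: re-run `attempt` until it stops throwing
// or the next retry would fall past the deadline. Returns the number of tries.
async function retryUntilPass(
  attempt: () => Promise<void>,
  clock: RetryClock,
  intervalMs = 500,
  timeoutMs = 20_000,
): Promise<number> {
  const deadline = clock.now() + timeoutMs;
  let tries = 0;
  for (;;) {
    tries += 1;
    try {
      await attempt();
      return tries;
    } catch (error) {
      if (clock.now() + intervalMs > deadline) throw error;
      await clock.sleep(intervalMs);
    }
  }
}

// Hover again on every attempt, so a hover that did not register gets repeated.
async function hoverUntilTooltip(cell: HoverTarget, tooltip: TooltipProbe, clock: RetryClock): Promise<number> {
  return retryUntilPass(async () => {
    await cell.hover({ force: true });
    if (!(await tooltip.isVisible())) throw new Error("tooltip not visible yet");
  }, clock);
}
```

The key point is that the hover sits inside the retried block. Retrying only the visibility assertion would keep waiting on a hover that never registered.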

Layer 5: Health Check / Test Timeout Rebalancing

The 60s health check exceeded the 30s default test timeout -- tests were killed while the health check was still polling. It also consumed the entire toPass budget in triggerDag (60s), leaving no room for retries.

  • Reduced health check MAX_WAIT_MS from 60s to 30s
  • Increased default test timeout from 30s to 60s in playwright.config.ts

This ensures the health check always fits within the test's time budget (30s < 60s), leaving 30s for navigation and assertions. A toPass loop with a 60s budget now has room for a second attempt after a health check times out, instead of being exhausted by the first.
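
The budget arithmetic can be checked with a few lines. The helper names below are made up for illustration; the numbers are the ones quoted in this section.

```typescript
// How many worst-case (fully timed-out) health checks fit inside a time budget, in ms.
function worstCaseChecksThatFit(budgetMs: number, healthCheckMaxWaitMs: number): number {
  return Math.floor(budgetMs / healthCheckMaxWaitMs);
}

// Time left for navigation and assertions after one worst-case health check.
function remainingAfterHealthCheck(testTimeoutMs: number, healthCheckMaxWaitMs: number): number {
  return testTimeoutMs - healthCheckMaxWaitMs;
}

// Before: 60s health check, 30s default test timeout, 60s toPass budget.
const budgetBefore = {
  remainingMs: remainingAfterHealthCheck(30_000, 60_000), // -30000: the test is killed mid-check
  checksInToPass: worstCaseChecksThatFit(60_000, 60_000), // 1: nothing left for another attempt
};

// After: 30s health check, 60s default test timeout, 60s toPass budget.
const budgetAfter = {
  remainingMs: remainingAfterHealthCheck(60_000, 30_000), // 30000
  checksInToPass: worstCaseChecksThatFit(60_000, 30_000), // 2
};
```

A negative remainder is the plugins.spec.ts failure mode: the test timeout fires while the health check is still polling.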

Layer 6: Spec-Level API Resilience (task-instances.spec.ts)

The beforeAll setup makes direct API calls (POST dagRuns, GET taskInstances, PATCH state) that bypass the health check. These used the global 10s actionTimeout and had no health gating. Under server load, they time out.

  • Added waitForServerReady(page) before API calls
  • Increased beforeAll timeout to 120s (was inheriting 60s)
  • Added explicit 30s timeout to all 8 API calls (3x the actionTimeout)
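
The shape of the hardened setup might look roughly like this. It is a sketch under assumptions: `ApiLike` is a hypothetical minimal view of an API client (Playwright's APIRequestContext accepts a per-call `timeout` option on get/post/patch), the health check is injected as a callback, and the URL paths, payloads, and `example_task` name are illustrative rather than copied from the spec.

```typescript
// Hypothetical minimal view of an API client, so the sketch runs without Playwright.
interface ApiOptions {
  timeout?: number;
  data?: unknown;
}
interface ApiLike {
  get(url: string, options?: ApiOptions): Promise<unknown>;
  post(url: string, options?: ApiOptions): Promise<unknown>;
  patch(url: string, options?: ApiOptions): Promise<unknown>;
}

const API_TIMEOUT_MS = 30_000; // 3x the 10s actionTimeout

async function setUpTaskInstances(
  api: ApiLike,
  waitForServerReady: () => Promise<void>,
  dagId: string,
  runId: string,
): Promise<void> {
  // Gate the whole setup on server health before the first API call.
  await waitForServerReady();

  const runsUrl = `/api/v2/dags/${dagId}/dagRuns`; // illustrative path
  // Every call carries an explicit timeout instead of inheriting the global one.
  await api.post(runsUrl, { data: { dag_run_id: runId }, timeout: API_TIMEOUT_MS });
  await api.get(`${runsUrl}/${runId}/taskInstances`, { timeout: API_TIMEOUT_MS });
  await api.patch(`${runsUrl}/${runId}/taskInstances/example_task`, {
    data: { new_state: "success" },
    timeout: API_TIMEOUT_MS,
  });
}
```

In a Playwright spec, a function like this would presumably be called from a beforeAll hook whose own timeout has been raised to 120s, for example with test.setTimeout(120_000) inside the hook.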

Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Health check timeout | 30s (reduced from 60s) | Must fit within default test timeout (60s) and toPass budgets (60s), leaving room for actual navigation and retries. |
| Default test timeout | 60s (increased from 30s) | Tests average ~14s; 60s is generous without masking real failures. Prevents test death during health check polling. |
| Health check criteria | HTTP 200 only, no response time threshold | Firefox on CI regularly exceeds 2s for health responses. A time threshold makes the check unsatisfiable. |
| Pattern consolidation | Single BasePage.safeGoto() wrapper | Eliminates 12 scattered manual calls. Future page objects inherit protection automatically. |
| Method name | safeGoto() not goto() | AssetDetailPage has a public goto() that calls navigateTo(). If navigateTo() dispatched to this.goto(), polymorphism would resolve to the subclass method, creating infinite recursion. |
| Post-click timeouts | 30s for first element after link navigation | Matches existing handleWaitForMultipleOptionsTask threshold. 10s is insufficient for Firefox/webkit under CI load. |
| Hover retry pattern | toPass loop with force: true | Webkit hover events are unreliable. Retrying the hover handles cases where the first hover didn't register. |
| Spec API timeouts | 30s per request (3x actionTimeout) | Enough for a slow but responsive server. The health check gates the start, timeouts handle in-flight slowness. |

Changes

New File

  • tests/e2e/utils/health.ts -- waitForServerReady(page) function with backoff polling (HTTP 200 check, 10s per-request timeout, 30s overall)

Modified Files (page objects)

  • tests/e2e/pages/BasePage.ts -- Added protected safeGoto() method; navigateTo() delegates to it
  • tests/e2e/pages/DagsPage.ts -- 3 direct page.goto() calls replaced with this.safeGoto()
  • tests/e2e/pages/RequiredActionsPage.ts -- 7 direct page.goto() calls replaced with this.safeGoto(); first element visibility timeouts after link clicks 10s→30s in 4 handle* methods; wait_for_default_option task timeout 30s→60s
  • tests/e2e/pages/DagCalendarTab.ts -- 1 direct page.goto() call replaced with this.safeGoto(); hover+tooltip wrapped in retry loop for webkit reliability
  • tests/e2e/pages/EventsPage.ts -- 1 direct page.goto() call replaced with this.safeGoto()
  • tests/e2e/pages/TaskInstancesPage.ts -- Table visibility timeout 10s→30s
  • tests/e2e/pages/ConnectionsPage.ts -- Combobox click timeout 3s→10s
  • tests/e2e/pages/DagCodePage.ts -- Inner Monaco editor visibility check 5s→10s per retry attempt
  • tests/e2e/pages/XComsPage.ts -- Table visibility timeout 10s→30s

Modified Files (config and specs)

  • playwright.config.ts -- Default test timeout 30s→60s
  • tests/e2e/specs/task-instances.spec.ts -- beforeAll: added waitForServerReady, 120s timeout, 30s per API call

All paths relative to airflow-core/src/airflow/ui/

Not Modified

  • BackfillPage.ts -- Already uses navigateTo() exclusively (only page.request.* for API calls)
  • LoginPage.ts -- Runs before health infra; already uses navigateTo()

Known Remaining Issues

Three categories of flakiness remain. All pass on retry (0 hard failures); none are addressable with further timeout tuning.

  1. requiredAction.spec.ts (firefox/webkit) -- The HITL workflow (verifyFinalTaskStates → waitForTaskState) sometimes exceeds the test timeout under 2-worker parallel load. The Airflow scheduler is overwhelmed when heavy DAG operations overlap across workers. Passes on retry once the server recovers. The only fix is reducing to workers: 1 (doubling test time) or increasing server capacity.
  2. Webkit click dispatch (xcoms.spec.ts) -- Buttons are found, visible, enabled, and stable, but click() hangs at the 10s actionTimeout. Affects expandAllButton and addFilterButton. Passes on retry, suggesting a transient webkit click interception issue (possibly an overlay or loading state). Increasing the global actionTimeout would affect all tests across all browsers.
  3. dag-calendar-tab.spec.ts (webkit) -- "failed filter shows only failed runs" expects both success and failed runs in the calendar, but only sees success. This is a data timing issue: the failed DAG run created in beforeAll hasn't been reflected in the calendar data when the test reads tooltip states. Not a timeout issue — the test reads stale data, not missing UI elements.

How It Works

Test calls this.navigateTo("/dags")
  -> BasePage.navigateTo() calls this.safeGoto("/dags", { waitUntil: "domcontentloaded" })
    -> BasePage.safeGoto() calls waitForServerReady(this.page)
      -> GET /api/v2/monitor/health (10s timeout per request)
        -> 200? Return immediately (fast path, ~0ms overhead)
        -> Non-200 or timeout? Backoff [1s, 2s, 4s, 8s], retry up to 30s
        -> Still failing after 30s? Throw descriptive error
    -> this.page.goto(path, options)

Test calls this.safeGoto("/dags/my_dag/runs/abc123")  (subclass direct call)
  -> Same flow as above -- health check + goto

When the server is healthy (the normal case), waitForServerReady completes on the first attempt with negligible overhead. When the server is overloaded (the flaky CI case), the health check waits with backoff until the server recovers, then proceeds with navigation. The 30s health check fits within the 60s default test timeout, leaving room for the actual navigation and assertions. For toPass loops, the 30s health check leaves budget for at least one retry.


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: Claude Opus 4.6 following the guidelines



@boring-cyborg boring-cyborg bot added the area:UI Related to UI/UX. For Frontend Developers. label Mar 28, 2026
@Dev-iL Dev-iL force-pushed the 2603/e2e_deflake branch from decb6f4 to 2837758 Compare March 28, 2026 18:08
@Dev-iL Dev-iL marked this pull request as ready for review March 28, 2026 18:08
@vatsrahul1001
Contributor

@Dev-iL This is great work for sure.

@vatsrahul1001
Contributor

Hopefully all tests go green

@Dev-iL Dev-iL force-pushed the 2603/e2e_deflake branch from 2837758 to a5ce75c Compare March 28, 2026 19:32
Dev-iL and others added 7 commits March 29, 2026 08:57
@Dev-iL Dev-iL force-pushed the 2603/e2e_deflake branch from a5ce75c to 018d96b Compare March 29, 2026 06:07
@Dev-iL Dev-iL added the area:CI Airflow's tests and continious integration label Mar 29, 2026
@vatsrahul1001 vatsrahul1001 merged commit 28f7cf8 into apache:main Mar 29, 2026
84 checks passed
@Dev-iL Dev-iL deleted the 2603/e2e_deflake branch March 29, 2026 14:35
@choo121600
Member

It seems like the flaky issue might not be fully resolved yet 🥲

nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Mar 30, 2026
…pache#64366)

* E2E: Add health-aware navigation to eliminate false-positive test failures

Tests running with 2 parallel workers produce cascading failures when one
heavy test (HITL, backfill) overwhelms the shared Airflow server, causing
all concurrent navigations to timeout. Retries also fail because the
server hasn't recovered.

Add a health-check utility that polls /api/v2/monitor/health before every
navigation, using backoff intervals [1s, 2s, 4s, 8s] with a 60s cap. When
the server is responsive (the common case), the check returns immediately
with negligible overhead. When overloaded, it waits for recovery instead
of blindly navigating into timeouts.

The health check is centralized in a new `BasePage.goto()` method that all
page objects inherit. 12 direct `page.goto()` calls across DagsPage,
RequiredActionsPage, DagCalendarTab, and EventsPage now use `this.goto()`
instead. No spec files modified, no timeouts increased.

* E2E: Rename BasePage.goto() to safeGoto() to fix stack overflow

AssetDetailPage defines its own public goto() method that calls
this.navigateTo(). When BasePage.navigateTo() dispatched to this.goto(),
polymorphism resolved to AssetDetailPage.goto() instead of BasePage.goto(),
creating infinite recursion.

Rename to safeGoto() which doesn't collide with any subclass method names.

* E2E: Relax health check to fix Firefox test timeouts

Remove the 2000ms response time threshold from health endpoint checks.
The threshold was too strict for Firefox on CI, causing tests to timeout
while waiting for server readiness even though the server was healthy
(returning 200). The health check should verify the server responds, not
enforce a specific response time.

Increases REQUEST_TIMEOUT_MS to 10000ms for individual request attempts
to give slower environments time to respond, while keeping the overall
MAX_WAIT_MS at 60s.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

* E2E: Increase element visibility timeouts for Firefox CI flakiness

The health-aware safeGoto() protects against an unresponsive server,
but two Firefox-only flaky failures occur after navigation succeeds:
click-based navigation (RequiredActionsPage link clicks) and
post-navigation React rendering (TaskInstancesPage table) can exceed
10s on Firefox under CI load. Increase these element visibility
timeouts from 10s to 30s, matching the threshold already used by
handleWaitForMultipleOptionsTask.

* E2E: Fix webkit-specific flakiness across 5 page objects

- ConnectionsPage: combobox click timeout 3s→10s (matching actionTimeout)
- DagCalendarTab: wrap hover+tooltip in retry loop — webkit hover events
  are unreliable and may not trigger tooltips on first attempt
- DagCodePage: inner Monaco editor visibility check 5s→10s per retry
  attempt, giving the editor more time to initialize on webkit
- RequiredActionsPage: wait_for_default_option toPass timeout 30s→60s,
  too tight on webkit under 3-browser CI load
- XComsPage: table visibility wait 10s→30s, same pattern as
  TaskInstancesPage fix

* E2E: Rebalance health check and test timeouts

The 60s health check exceeded the 30s default test timeout, causing
tests to die while the health check was still polling (plugins.spec.ts).
It also consumed the entire toPass budget (triggerDag has toPass 60s),
leaving no room for retries after the health check exhausted the
timeout.

- Reduce health check MAX_WAIT_MS from 60s to 30s
- Increase default test timeout from 30s to 60s

This ensures the health check fits within any test's time budget
(30s < 60s), leaving 30s for navigation and assertions. Heavy tests
with custom timeouts (120s+) are unaffected.

* E2E: Robustify task-instances spec setup against server overload

The beforeAll makes many direct API calls (POST dagRuns, GET
taskInstances, PATCH state) that use the global 10s actionTimeout.
Under CI load these time out. Also, the setup had no health check
and inherited the 60s describe-level timeout for a multi-step setup.

- Add waitForServerReady before API calls
- Increase beforeAll timeout to 120s
- Add explicit 30s timeout to all API calls (3x the actionTimeout)

---------

Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
@Dev-iL
Collaborator Author

Dev-iL commented Mar 30, 2026

> It seems like the flaky issue might not be fully resolved yet 🥲

I know... But the question is - is it rarer? Also, one of the AI's suggestions was to use 1 concurrent worker instead of 2. This will make the tests run longer, but should be safer. Thoughts?

@choo121600
Copy link
Copy Markdown
Member

The main cause of the current test failures is related to data isolation and handling race conditions.
Additionally, issues arise during retries when timeouts occur due to resource constraints in the CI environment.
Previously, using networkidle masked these underlying issues, which is why they didn’t surface before.
Therefore, I believe it’s better to address the root cause rather than reducing the number of workers :)

