Azure DevOps history-driven slow-test threshold enricher #9140

Description

@Evangelink

Summary

Add an extension that fetches per-test historical duration data from Azure DevOps Test Analytics and uses it to:

  1. Lower the per-test "still running after N seconds" threshold for tests with a known short historical runtime, surfacing potential hangs much earlier than the static default allows.
  2. Decorate the emitted line with the historical p95/p99 so investigators have immediate context and do not need to leave the CI log to check whether the slowness is normal for this test.

Example output:

[slow] still running after 5s: Foo (historical p95 = 2s, p99 = 3s, samples = 120)

Hooks into the IProgressEnricher surface introduced by #9139 (silence-driven heartbeat renderer).

Motivation

Today's static threshold (e.g. 60s in #9139) is a lowest-common-denominator value. A test that historically completes in 2 seconds is interesting at 5s, not 60s. AzDO already collects per-test duration history; we can use it to get a much earlier signal on the most common .NET CI host.

This realises the history-aware threshold engine extension point sketched in #3495 and the design discussion on #9125, without putting that intelligence into core (per Marco's earlier guidance on #3495).

Proposed design

Where it lives

Two options to decide before implementation:

  • Option A — extend the existing Microsoft.Testing.Extensions.AzureDevOpsReport package with a new opt-in feature.
  • Option B — new standalone Microsoft.Testing.Extensions.AzureDevOpsTestHistory package (keeps Report focused on reporting; this is data-fetching).

Default recommendation: Option B, because the new package needs different permissions (read access to Analytics) and depends on extra HTTP plumbing we don't want polluting the Report package.

Data source

  • Azure DevOps Test Analytics OData, TestResults entity (per-run granularity).
  • Spike needed: confirm whether TestResultsDaily (cheap, aggregate-only — likely just gives mean) is sufficient, or whether we need TestResults (per-run, more rows, true distribution required for p95/p99). Web research suggests percentile metrics are NOT exposed directly — we must fetch durations and compute client-side.
  • Endpoint shape: GET https://analytics.dev.azure.com/{org}/{project}/_odata/v3.0-preview/TestResults?$filter=...&$select=AutomatedTestName,Duration,Date.
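  • Auth: the System.AccessToken pipeline variable, read from the SYSTEM_ACCESSTOKEN environment variable (requires the "Allow scripts to access OAuth token" toggle in classic pipelines, or an explicit env mapping of $(System.AccessToken) in YAML pipelines).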
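
A rough sketch of composing the query URL from the endpoint shape above, written in Python purely for illustration (the extension itself would be .NET). It assumes the two query options are OData's `$filter` and `$select`; `build_history_url` and the filter expression in the usage line are hypothetical, and the real entity and field names still need confirming in the spike.

```python
from urllib.parse import quote, urlencode

ANALYTICS_HOST = "https://analytics.dev.azure.com"


def build_history_url(org: str, project: str, odata_filter: str) -> str:
    """Compose the TestResults query URL (hypothetical helper, not an AzDO API)."""
    base = (
        f"{ANALYTICS_HOST}/{quote(org)}/{quote(project)}"
        "/_odata/v3.0-preview/TestResults"
    )
    # Keep '$' and ',' readable; everything else is percent-encoded.
    query = urlencode(
        {
            "$filter": odata_filter,
            "$select": "AutomatedTestName,Duration,Date",
        },
        safe="$,",
        quote_via=quote,
    )
    return f"{base}?{query}"


# Hypothetical filter expression; the real one depends on the spike's findings.
url = build_history_url("org", "proj", "Date ge 2024-01-01Z")
```

The access token would presumably be sent as a bearer token in the Authorization header, which this sketch leaves out.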

Scope of the history fetch

  • Same pipeline (build definition ID from BUILD_DEFINITIONID).
  • Default branch only (configurable). Cross-branch comparisons can mislead — different test environments.
  • Window: N days back (configurable; default 30).
  • Minimum sample size before emitting p95/p99 (default 10).
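
As a minimal sketch of the client-side percentile computation with the minimum-sample guard, in Python for illustration only: it uses the nearest-rank method, which is one reasonable choice but not one this proposal prescribes. `DurationStats`, `percentile`, and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DurationStats:
    p95: float
    p99: float
    samples: int


def percentile(sorted_values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    n = len(sorted_values)
    rank = max(1, (q * n + 99) // 100)  # integer ceil(q * n / 100)
    return sorted_values[rank - 1]


def summarize(durations: Sequence[float], min_samples: int = 10) -> Optional[DurationStats]:
    """Return p95/p99 for one test, or None when there are too few samples to trust."""
    if len(durations) < min_samples:
        return None
    ordered = sorted(durations)
    return DurationStats(percentile(ordered, 95), percentile(ordered, 99), len(ordered))
```

Returning None below the minimum lets the caller fall through to the static threshold without a special case.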

Bootstrap & failure modes

  • First run / unknown test: fall back to the static threshold from #9139 (silence-driven progress heartbeat renderer).
  • API unavailable / token missing / non-AzDO host / fork PR with no token: extension silently no-ops (verbose log only). Never fail the run.
  • Async fetch at session start; never block test execution. If fetch takes longer than the first test, the first few slow-test emissions use static thresholds.
  • Cache per-process: don't refetch within a run.
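
The non-blocking fetch, silent fallback, and per-process cache could fit together roughly as follows. This is a Python sketch for illustration only; `HistoryCache` and `try_get_p99` are hypothetical names, and the injected `fetch` callable stands in for the real OData call.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional


class HistoryCache:
    """Starts one background fetch per process and never blocks callers."""

    def __init__(self, fetch: Callable[[], Dict[str, float]]) -> None:
        self._executor = ThreadPoolExecutor(max_workers=1)
        # Submitted once at session start; the result is reused for the whole run.
        self._future: Future = self._executor.submit(fetch)

    def try_get_p99(self, test_name: str) -> Optional[float]:
        """Return the historical p99, or None so the caller uses the static threshold."""
        if not self._future.done():
            return None  # fetch still in flight: do not wait for it
        if self._future.exception() is not None:
            return None  # API unavailable, token missing, etc.: silent no-op
        return self._future.result().get(test_name)

    def close(self) -> None:
        self._executor.shutdown(wait=False)
```

A failed or slow fetch looks the same to the caller as an unknown test, so every failure mode degrades to the static threshold.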

Hook usage

  • IProgressEnricher.OnSlowTestThreshold(test) → return min(staticDefault, p99 * multiplier) if we have history, else null (uses static).
  • IProgressEnricher.OnSlowTestEmit(test, currentDuration) → returns the (historical p95 = 2s, p99 = 3s, samples = 120) suffix.
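
The logic behind the two hooks can be sketched as below, in Python for illustration only. The function names are hypothetical and do not reflect the actual IProgressEnricher signatures, which are defined in #9139.

```python
from typing import Optional


def slow_test_threshold(
    static_default: float, p99: Optional[float], multiplier: float = 3.0
) -> Optional[float]:
    """Return the lowered threshold in seconds, or None to keep the static default."""
    if p99 is None:
        return None  # no usable history for this test
    # Never raise the threshold above the static default.
    return min(static_default, p99 * multiplier)


def slow_test_suffix(p95: float, p99: float, samples: int) -> str:
    """Format the historical context appended to the slow-test line."""
    return f"(historical p95 = {p95:g}s, p99 = {p99:g}s, samples = {samples})"
```

With a 60s static default and the default multiplier of 3.0, a test with p99 = 3s gets a 9s threshold, while a test with p99 = 30s stays capped at 60s.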

Test identity matching

Open question / spike: AzDO AutomatedTestName may not match MTP TestNode.Uid exactly for parameterised tests. Need an investigation spike to define the lookup key. Possible approaches:

  • Match on FQN namespace.class.method and rely on parameter-set aggregation.
  • Add an extension-supplied mapper hook.
  • Treat parameterised variants as a single bucket (less precise but simpler).
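
The single-bucket approach could look like the sketch below, in Python for illustration only. It assumes parameterised names carry their arguments in a trailing parenthesised list such as `Ns.Class.Method(1, "a")`; that format is an assumption to verify in the spike, and `bucket_key` and `group_durations` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket_key(test_name: str) -> str:
    """Drop everything from the first '(' so all parameter sets share one key."""
    head, _, _ = test_name.partition("(")
    return head.strip()


def group_durations(rows: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Aggregate (test name, duration) rows into one duration list per bucket."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for name, duration in rows:
        buckets[bucket_key(name)].append(duration)
    return dict(buckets)
```

Pooling variants increases the sample count per key but blurs the distribution when parameter sets differ widely in runtime, which is the precision trade-off noted above.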

Knobs

  • --report-azdo-test-history (on/off) — or auto-on when --report-azdo-progress is set and access token is available.
  • --report-azdo-test-history-window-days N (default 30).
  • --report-azdo-test-history-min-sample N (default 10).
  • --report-azdo-test-history-multiplier X (default 3.0; warn at p99 * multiplier, so a test with a historical p99 of 3s triggers at 9s, not 60s).
  • --report-azdo-test-history-branch <name> (default: main / repo default branch).

Privacy / safety

  • Don't log p95/p99 when sample size < minimum.
  • Don't log historical failure rates (out of scope; could leak flakiness data; separate feature).
  • Fork PRs typically have no token → extension no-ops, no leakage.

Touchpoints

  • New package: src/Platform/Microsoft.Testing.Extensions.AzureDevOpsTestHistory/ mirroring the AzureDevOpsReport layout.
  • AzDoTestHistoryFetcher.cs — OData query + percentile compute.
  • AzDoTestHistorySlowTestEnricher.cs — implements IProgressEnricher.
  • AzDoTestHistoryCommandLineProvider.cs: the five options listed under Knobs.
  • Help/info acceptance test expectations (alphabetically sorted):
    • test/IntegrationTests/Microsoft.Testing.Platform.Acceptance.IntegrationTests/HelpInfoAllExtensionsTests.cs
    • test/IntegrationTests/Microsoft.Testing.Platform.Acceptance.IntegrationTests/MSBuild.KnownExtensionRegistration.cs
  • Resource strings + dotnet msbuild ... /t:UpdateXlf.
  • Unit tests for percentile math + caching + fallback behaviour.
  • Acceptance test for graceful no-op outside AzDO.

Open questions

  1. TestResults vs TestResultsDaily — what's the minimum we can query? (Spike.)
  2. Test identity matching for parameterised tests. (Spike.)
  3. Cross-branch fallback policy (when running on a branch with no history, fall back to default branch's history? or just static threshold?).
  4. Should the extension also report duration regressions ("this test is 3x slower than last week") as warnings? (Probably out of scope; flag as follow-up.)

Out of scope

  • Other CI hosts (GitHub Actions doesn't have a comparable first-class test analytics API; would need a different data source, possibly cached artifacts).
  • Flakiness prediction / quarantine recommendations — separate feature.
  • Writing data back to AzDO — rely on AzDO's existing test result ingestion.
