Azure DevOps history-driven slow-test threshold enricher

## Summary

Add an extension that fetches per-test historical duration data from Azure DevOps Test Analytics and uses it to:

1. **Lower the per-test `still running after N seconds` threshold** for tests with known short historical runtime, surfacing potential hangs much earlier than the static default.
2. **Decorate the emitted line** with the historical p95/p99 so investigators have immediate context — no need to leave the CI log to confirm "is this slow normal for this test?".

Example output:

`[slow] still running after 5s: Foo  (historical p95 = 2s, p99 = 3s, samples = 120)`

Hooks into the `IProgressEnricher` surface introduced by #9139 (silence-driven heartbeat renderer).

## Motivation

Today's static threshold (e.g. 60s in #9139) is a lowest-common-denominator. A test that historically completes in 2 seconds is interesting at 5s, not 60s. AzDO already collects per-test duration history; we can leverage that for a dramatically better signal in the most common .NET CI host.

This realises the **history-aware threshold engine** extension point sketched in #3495 and the design discussion on #9125, without putting that intelligence into core (per Marco's earlier guidance on #3495).

## Proposed design

### Where it lives

Two options to decide before implementation:

- **Option A** — extend the existing `Microsoft.Testing.Extensions.AzureDevOpsReport` package with a new opt-in feature.
- **Option B** — new standalone `Microsoft.Testing.Extensions.AzureDevOpsTestHistory` package (keeps Report focused on reporting; this is data-fetching).

Default recommendation: **Option B**, because the new package needs different permissions (read access to Analytics) and depends on extra HTTP plumbing we don't want polluting the Report package.

### Data source

- **Azure DevOps Test Analytics OData**, `TestResults` entity (per-run granularity).
- ⚠ **Spike needed**: confirm whether `TestResultsDaily` (cheap, aggregate-only — likely just gives mean) is sufficient, or whether we need `TestResults` (per-run, more rows, true distribution required for p95/p99). Web research suggests percentile metrics are NOT exposed directly — we must fetch durations and compute client-side.
- Endpoint shape: `GET https://analytics.dev.azure.com/{org}/{project}/_odata/v3.0-preview/TestResults?$filter=...&$select=AutomatedTestName,Duration,Date`.
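- Auth: the `System.AccessToken` pipeline variable, read from the `SYSTEM_ACCESSTOKEN` environment variable. YAML pipelines must map it into the step's `env:` block explicitly; classic pipelines need the "Allow scripts to access the OAuth token" toggle.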
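Since percentiles apparently have to be computed client-side, here is a minimal sketch of that step. It is written in Python for brevity (the extension itself would be C#) and uses the nearest-rank definition, which is one reasonable choice and not something this proposal has settled on. `DurationStats`, `percentile`, and `summarize` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DurationStats:
    p95: float
    p99: float
    samples: int


def percentile(sorted_values: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100.

    Returns the smallest sample such that at least q% of samples are <= it.
    """
    if not sorted_values:
        raise ValueError("percentile of an empty sequence")
    # Integer ceiling of q * n / 100 avoids floating-point rounding surprises.
    rank = (q * len(sorted_values) + 99) // 100
    return sorted_values[max(rank, 1) - 1]


def summarize(durations: Sequence[float], min_samples: int = 10) -> Optional[DurationStats]:
    """Return p95/p99 for one test, or None when there is too little history."""
    if len(durations) < min_samples:
        return None
    ordered = sorted(durations)
    return DurationStats(percentile(ordered, 95), percentile(ordered, 99), len(ordered))
```

Nearest-rank always returns an observed duration, which keeps the reported numbers easy to explain; an interpolating definition would also work if smoother values are preferred.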

### Scope of the history fetch

- Same pipeline (build definition ID from `SYSTEM_DEFINITIONID`, the environment form of the documented `System.DefinitionId` predefined variable).
- Default branch only (configurable). Cross-branch comparisons can mislead — different test environments.
- Window: N days back (configurable; default 30).
- Minimum sample size before emitting p95/p99 (default 10).
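The scope rules above translate into an OData `$filter`. The sketch below shows roughly how the query URL could be assembled. It is an illustration only: the filter property names (`PipelineId`, `BranchName`) and the date literal format are unverified placeholders that the data-source spike would need to confirm, and `build_history_url` is a hypothetical helper.

```python
from datetime import date, timedelta
from urllib.parse import quote


def build_history_url(org: str, project: str, pipeline_id: int, branch: str,
                      today: date, window_days: int = 30) -> str:
    """Build the history query URL for one pipeline, one branch, N days back."""
    since = today - timedelta(days=window_days)
    # OData string literals escape a single quote by doubling it.
    safe_branch = branch.replace("'", "''")
    # ASSUMPTION: PipelineId / BranchName are placeholder property names,
    # not confirmed against the Analytics metadata.
    filter_expr = (
        f"PipelineId eq {pipeline_id} "
        f"and BranchName eq '{safe_branch}' "
        f"and Date ge {since.isoformat()}T00:00:00Z"
    )
    base = (
        f"https://analytics.dev.azure.com/{quote(org)}/{quote(project)}"
        "/_odata/v3.0-preview/TestResults"
    )
    return f"{base}?$filter={quote(filter_expr)}&$select=AutomatedTestName,Duration,Date"
```

Passing `today` in explicitly keeps the window calculation deterministic, which makes the helper straightforward to unit test.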

### Bootstrap & failure modes

- **First run / unknown test**: fall back to static threshold from #9139.
- **API unavailable / token missing / non-AzDO host / fork PR with no token**: extension silently no-ops (verbose log only). Never fail the run.
- **Async fetch at session start**; never block test execution. If fetch takes longer than the first test, the first few slow-test emissions use static thresholds.
- **Cache per-process**: don't refetch within a run.
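The failure modes above amount to a fetch-once, never-block, never-throw cache. A minimal sketch of that shape, in Python for brevity, follows. `HistoryCache` and the `fetch` callback are hypothetical, and a real implementation would write to the verbose log where the comments say so.

```python
import threading
from typing import Callable, Dict, Optional, Tuple

# Per-test stats as (p95, p99, samples); the shape is an assumption for this sketch.
Stats = Tuple[float, float, int]


class HistoryCache:
    """Fetches history once in the background; lookups never block or raise."""

    def __init__(self, fetch: Callable[[str], Dict[str, Stats]], token: Optional[str]):
        self._stats: Optional[Dict[str, Stats]] = None
        self._ready = threading.Event()
        if not token:
            # Non-AzDO host or fork PR without a token: permanent no-op.
            self._ready.set()
            return
        threading.Thread(target=self._load, args=(fetch, token), daemon=True).start()

    def _load(self, fetch: Callable[[str], Dict[str, Stats]], token: str) -> None:
        try:
            self._stats = fetch(token)
        except Exception:
            # API unavailable: swallow (verbose log only), never fail the run.
            self._stats = None
        finally:
            self._ready.set()

    def try_get(self, test_name: str) -> Optional[Stats]:
        """Return stats if the fetch has finished and knows the test, else None."""
        if not self._ready.is_set() or self._stats is None:
            return None
        return self._stats.get(test_name)

    def wait(self, timeout: float) -> bool:
        """Block until the fetch settles; intended for tests only."""
        return self._ready.wait(timeout)
```

A caller that gets `None` from `try_get` simply uses the static threshold, which covers the first-run, unknown-test, fetch-still-running, and fetch-failed cases with a single code path.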

### Hook usage

- `IProgressEnricher.OnSlowTestThreshold(test)` → return `min(staticDefault, p99 * multiplier)` if we have history, else null (uses static).
- `IProgressEnricher.OnSlowTestEmit(test, currentDuration)` → returns the `(historical p95 = 2s, p99 = 3s, samples = 120)` suffix.
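The logic behind the two hooks is small enough to sketch directly. The snippet below is Python pseudologic rather than the C# hook signatures, which come from #9139 and are not reproduced here; `slow_test_threshold` and `slow_test_suffix` are hypothetical names.

```python
from typing import Optional


def slow_test_threshold(static_default: float, p99: Optional[float],
                        multiplier: float = 3.0) -> Optional[float]:
    """Return the lowered threshold in seconds, or None to keep the static one."""
    if p99 is None:
        return None
    # History can only lower the threshold, never raise it above the default.
    return min(static_default, p99 * multiplier)


def slow_test_suffix(p95: float, p99: float, samples: int) -> str:
    """Format the historical context appended to the slow-test line."""
    return f"(historical p95 = {p95:g}s, p99 = {p99:g}s, samples = {samples})"
```

With a 60s static default and the default multiplier, a test whose p99 is 3s gets a 9s threshold, while a test whose p99 is 30s keeps the 60s default because `30 * 3` exceeds it.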

### Test identity matching

⚠ **Open question / spike**: AzDO `AutomatedTestName` may not match MTP `TestNode.Uid` exactly for parameterised tests. Need an investigation spike to define the lookup key. Possible approaches:

- Match on FQN `namespace.class.method` and rely on parameter-set aggregation.
- Add an extension-supplied mapper hook.
- Treat parameterised variants as a single bucket (less precise but simpler).
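To make the single-bucket approach concrete, here is a rough Python sketch. It assumes parameterised variants show up as the method's fully qualified name followed by a parenthesised argument list, which is exactly what the spike needs to verify; `history_key` and `bucket_durations` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def history_key(test_name: str) -> str:
    """Drop a trailing parameter list so all variants share one lookup key."""
    # ASSUMPTION: arguments, if present, start at the first "(".
    paren = test_name.find("(")
    return (test_name if paren < 0 else test_name[:paren]).strip()


def bucket_durations(rows: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Group (test name, duration) rows by their normalised key."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for name, duration in rows:
        buckets[history_key(name)].append(duration)
    return dict(buckets)
```

The trade-off is the one noted above: variants with very different runtimes end up sharing one distribution, so the resulting p99 reflects the slowest variants.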

### Knobs

- `--report-azdo-test-history` (on/off) — or auto-on when `--report-azdo-progress` is set and access token is available.
- `--report-azdo-test-history-window-days N` (default 30).
- `--report-azdo-test-history-min-sample N` (default 10).
- `--report-azdo-test-history-multiplier X` (default 3.0; the warning fires at `p99 * X`, so a test with a historical p99 of 3s triggers at 9s, not 60s).
- `--report-azdo-test-history-branch <name>` (default: `main` / repo default branch).

### Privacy / safety

- Don't log p95/p99 when sample size < minimum.
- Don't log historical failure rates (out of scope; could leak flakiness data; separate feature).
- Fork PRs typically have no token → extension no-ops, no leakage.

### Touchpoints

- **New package**: `src/Platform/Microsoft.Testing.Extensions.AzureDevOpsTestHistory/` mirroring the `AzureDevOpsReport` layout.
- `AzDoTestHistoryFetcher.cs` — OData query + percentile compute.
- `AzDoTestHistorySlowTestEnricher.cs` — implements `IProgressEnricher`.
- `AzDoTestHistoryCommandLineProvider.cs` — the five options listed under Knobs.
- Help/info acceptance test expectations (alphabetically sorted):
  - `test/IntegrationTests/Microsoft.Testing.Platform.Acceptance.IntegrationTests/HelpInfoAllExtensionsTests.cs`
  - `test/IntegrationTests/Microsoft.Testing.Platform.Acceptance.IntegrationTests/MSBuild.KnownExtensionRegistration.cs`
- Resource strings + `dotnet msbuild ... /t:UpdateXlf`.
- Unit tests for percentile math + caching + fallback behaviour.
- Acceptance test for graceful no-op outside AzDO.

## Open questions

1. `TestResults` vs `TestResultsDaily` — what's the minimum we can query? (Spike.)
2. Test identity matching for parameterised tests. (Spike.)
3. Cross-branch fallback policy (when running on a branch with no history, fall back to default branch's history? or just static threshold?).
4. Should the extension also report duration regressions ("this test is 3x slower than last week") as warnings? (Probably out of scope; flag as follow-up.)

## Out of scope

- Other CI hosts (GitHub Actions doesn't have a comparable first-class test analytics API; would need a different data source, possibly cached artifacts).
- Flakiness prediction / quarantine recommendations — separate feature.
- Writing data back to AzDO — rely on AzDO's existing test result ingestion.

## Related

- #9125 — parent design discussion.
- #9139 — silence-driven heartbeat renderer + `IProgressEnricher` hook surface (prerequisite).
- #3495 — original `slowest tests` feature request; this realises the extension layer of that design.

