Make scheduled outerloop builds succeed when only Helix tests fail #129049
The libraries outerloop pipeline runs on a daily schedule with always: false, meaning AzDO only re-queues a commit if there were changes since the last successful scheduled run. Because flaky outerloop tests cause the 'Send to Helix' task to fail on essentially every scheduled run, the build never succeeds, so AzDO re-queues the same commit every day and submits ever more Helix work for an unchanged sha.
Set shouldContinueOnError on the Send to Helix step for scheduled builds only (Build.Reason == 'Schedule'), so Helix work item failures no longer fail the build. Compile/build breaks still fail the build, and PR/CI/manual runs are unaffected.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tagging subscribers to this area: @dotnet/area-infrastructure-libraries
Pull request overview
This PR updates the libraries outerloop Azure DevOps pipeline to avoid failing scheduled runs due to Helix work item/test failures, with the intent of preventing always: false schedules from repeatedly re-queuing the same commit and submitting duplicate Helix work.
Changes:
- Pass shouldContinueOnError: ${{ eq(variables['Build.Reason'], 'Schedule') }} into the three platform-matrix.yml invocations in outerloop.yml.
- Add inline YAML comments explaining the rationale (avoid same-SHA daily re-queues and wasted Helix capacity).
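For illustration, the first bullet above might look roughly like this in one of the three invocations. This is only a sketch: the shouldContinueOnError expression is quoted from the review, while the template path and the surrounding layout are hypothetical placeholders, and every other parameter the real invocation passes is left out.

```yaml
# outerloop.yml (sketch of one platform-matrix.yml invocation)
jobs:
  # Hypothetical path; the real template reference may differ.
  - template: /eng/pipelines/common/platform-matrix.yml
    parameters:
      # Template expression, evaluated when the pipeline is expanded:
      # true only when the run was started by the schedule trigger.
      shouldContinueOnError: ${{ eq(variables['Build.Reason'], 'Schedule') }}
```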
Bleh, it's right. partiallySucceeded won't cause AzDO to avoid scheduling.
continueOnError only marks the build partiallySucceeded, which AzDO's always: false scheduler still treats as not-successful, so the same commit keeps getting re-queued daily. Instead, for scheduled builds, tell the Helix SDK not to fail the build on work item / test failures by passing FailOnWorkItemFailure=false and FailOnTestFailure=false. The Send to Helix step then fully succeeds, so a perpetually-flaky scheduled run no longer causes AzDO to re-queue the same sha.
- helix.yml: add failOnTestFailures parameter (default true = current behavior) wired to the FailOnWorkItemFailure/FailOnTestFailure Helix SDK properties.
- outerloop.yml: pass failOnTestFailures=false only for scheduled builds (Build.Reason == 'Schedule'); replaces the earlier shouldContinueOnError approach.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…will revert) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If this looks reasonable we should backport to 9.0 and 10.0 for outerloop.
/azp list
@lewing any concerns here? See https://dev.azure.com/dnceng-public/public/_build/results?buildId=1451767&view=results for a test run (conditional changed to "manual" to verify the functionality)
lewing left a comment
I'm fine with it. @steveisok @jeffschwMSFT for visibility
Will watch to see whether outerloop runs on 8.0 stop happening when there are no changes. If so, will backport to 9.0 and 10.0.
Note
This pull request was authored with the assistance of GitHub Copilot.
Problem
Several scheduled outerloop pipelines (the outerloop.yml family: runtime-libraries-coreclr outerloop and its -windows/-linux/-osx variants) use an always: false scheduled trigger. With always: false, AzDO only starts a new scheduled run if the source changed since the last successful scheduled run.
Because the repo has many flaky outerloop tests, the Helix test work items virtually always have at least one failure, which fails the "Send to Helix" step and therefore the whole build. The build never reaches a succeeded state, so AzDO re-queues the same, unchanged commit day after day, submitting more and more Helix work for no benefit. (Empirically confirmed: a single commit was re-run and failed for 19 consecutive days; once a sibling definition produced a genuinely successful run, the same-SHA re-queue stopped.)
Why continueOnError is not enough
continueOnError: true only downgrades the build to partiallySucceeded, which AzDO's always: false scheduler still does not treat as successful, so the same commit keeps getting re-queued. The Helix step must end fully successful (exit 0).
Fix
Make the "Send to Helix" step actually succeed on scheduled runs by disabling the two Arcade Microsoft.DotNet.Helix.Sdk properties that fail the build (both default to true):
- FailOnWorkItemFailure: CheckHelixJobStatus errors when a work item exits non-zero.
- FailOnTestFailure: CheckAzurePipelinesTestResults errors when any published test failed.
Setting both to false lets the msbuild step exit 0, producing a fully succeeded build. Failed tests are still published and visible in the test results tab. AzDO does not auto-degrade a build to partiallySucceeded just because a published test run contains failures; only a failing task would.
Changes
- eng/pipelines/libraries/helix.yml: Added a failOnTestFailures parameter (default true, preserving today's behavior) wired to /p:FailOnWorkItemFailure and /p:FailOnTestFailure on the Send to Helix msbuild invocation.
- eng/pipelines/libraries/outerloop.yml: Passes failOnTestFailures: false only on scheduled runs (Build.Reason == 'Schedule') for all three matrix legs (Release, Debug, NET48).
Behavior preservation
The new parameter defaults to true, so all other helix.yml callers are unaffected (none set WaitForWorkItemCompletion or these properties on this path, so they already resolve to true). Only scheduled outerloop runs change behavior. PR / rolling / manual outerloop runs continue to fail on Helix failures exactly as before. Build/compile breaks still fail scheduled runs (this only affects the Helix step).
Tradeoff
On scheduled runs, FailOnWorkItemFailure=false also masks work-item crashes/timeouts/infra failures, not just test-assertion failures. This is an accepted tradeoff for the goal of stopping the wasteful daily re-queue of unchanged commits; results remain visible in the Helix/test reporting.
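A minimal sketch of how the two changes described above could fit together. The failOnTestFailures parameter name, the two /p: properties, and the Build.Reason check come from this description; the script step, the project file name, the relative template path, and the overall layout are hypothetical placeholders and are not taken from the actual helix.yml or outerloop.yml.

```yaml
# helix.yml (sketch): a template parameter that defaults to today's behavior.
parameters:
  - name: failOnTestFailures
    type: boolean
    default: true

steps:
  # Hypothetical step; the real Send to Helix invocation passes more arguments.
  - script: >-
      dotnet msbuild sendtohelix.proj
      /p:FailOnWorkItemFailure=${{ parameters.failOnTestFailures }}
      /p:FailOnTestFailure=${{ parameters.failOnTestFailures }}
    displayName: Send to Helix
---
# outerloop.yml (sketch): the caller overrides the default per build reason.
steps:
  - template: helix.yml   # hypothetical relative path
    parameters:
      # false for scheduled runs, true for PR / rolling / manual runs.
      failOnTestFailures: ${{ ne(variables['Build.Reason'], 'Schedule') }}
```

Because the parameter defaults to true, a caller that never mentions it keeps the existing fail-on-failure behavior, which is the property the "Behavior preservation" section relies on.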