Enrich EndBuild hang diagnostics with logging service and submission state #13385

Merged: YuliiaKovalova merged 7 commits into main from dev/enrich-endbuild-hang-telemetry on Mar 16, 2026

@YuliiaKovalova commented Mar 13, 2026

## Summary

When `EndBuild` hangs waiting for submissions to complete, the existing `EndBuildHang` crash telemetry captures basic counts (pending submissions, unmatched project started events) but lacks the information needed to determine *why* the hang is occurring. This PR adds additional diagnostic properties to narrow down the root cause.

## New EndBuild Hang Properties

| Property | Type | Description |
|---|---|---|
| `LoggingServiceState` | string | Whether the logging pipeline is alive, shutting down, or already shut down (`Initialized`, `ShuttingDown`, `Shutdown`) |
| `LoggingEventQueueDepth` | int | Number of events queued in the async logging pipeline. A large value indicates the pipeline is backed up. |
| `IsShuttingDown` | bool | Whether `BuildManager` shutdown has been initiated |
| `IsCancellationRequested` | bool | Whether the execution cancellation token was triggered |
| `WorkQueueDepth` | int | Pending items in the `BuildManager` work queue. `OnProjectFinished` posts to this queue, so a blocked queue prevents logging completion. |
| `SubmissionDetails` | string | Per-submission diagnostic state: `id:Started:HasResult:HasException:LoggingCompleted` separated by semicolons |
| `RegisteredLoggerTypeNames` | string | Semicolon-separated list of registered logger type names, to identify which loggers could be blocking the pipeline |
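
To illustrate how the `SubmissionDetails` string might be consumed on the analysis side, here is a minimal Python sketch. It assumes booleans are serialized as `True`/`False` and that the id is an integer; the PR does not state either, and the names `SubmissionState`, `parse_submission_details`, and `stuck_on_logging` are hypothetical helpers, not part of MSBuild.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubmissionState:
    submission_id: int
    started: bool
    has_result: bool
    has_exception: bool
    logging_completed: bool


def parse_submission_details(value):
    """Parse 'id:Started:HasResult:HasException:LoggingCompleted;...' entries."""
    states = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an empty property or a trailing semicolon
        parts = entry.split(":")
        if len(parts) != 5:
            raise ValueError(f"unexpected SubmissionDetails entry: {entry!r}")
        sub_id, *flags = parts
        # Assumption: flags are serialized as "True"/"False" (case-insensitive here).
        started, has_result, has_exception, logging_completed = (
            flag.strip().lower() == "true" for flag in flags
        )
        states.append(
            SubmissionState(int(sub_id), started, has_result, has_exception, logging_completed)
        )
    return states


def stuck_on_logging(states):
    """Ids of submissions that have a result but never completed logging."""
    return [s.submission_id for s in states if s.has_result and not s.logging_completed]
```

A submission that reports a result but not `LoggingCompleted` points at the logging pipeline rather than at the build itself, which is the distinction this property is meant to expose.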

## New Crash Telemetry Properties (all crash types)

| Property | Type | Description |
|---|---|---|
| `InnerExceptionStackTrace` | string | Sanitized stack trace of the inner exception. For wrapper exceptions like `InternalLoggerException`, the outer stack only shows MSBuild infrastructure; the inner stack reveals the actual faulting component. |
| `InnerExceptionMessage` | string | Truncated and path-sanitized inner exception message |
| `LoggerEventType` | string | The build event type name being processed when a logger faulted (extracted via reflection from `InternalLoggerException.BuildEventArgs`) |

## StackHash Improvement

`ComputeStackHash` now includes the inner exception's stack trace so that wrapper exceptions (e.g., all `InternalLoggerException` instances thrown from `EventSourceSink.Consume`) get different telemetry buckets based on which logger actually faulted.
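
The bucketing effect can be sketched in a few lines of Python. This is not the MSBuild implementation: the hash algorithm (SHA-256), the separator, and the 16-character truncation are all assumptions chosen for illustration, and `compute_stack_hash` is a hypothetical stand-in for `ComputeStackHash`.

```python
import hashlib


def compute_stack_hash(outer_stack, inner_stack=None):
    """Hash the outer stack, plus the inner stack when one is present."""
    text = outer_stack or ""
    if inner_stack:
        # Folding in the inner stack separates wrappers that share one outer stack.
        text += "\n--- inner ---\n" + inner_stack
    # Assumption: SHA-256 truncated to 16 hex characters; the real format may differ.
    return hashlib.sha256(text.encode("utf-8")).hexdigest().upper()[:16]
```

With only the outer stack hashed, every exception wrapped at the same call site would collapse into one bucket; including the inner stack yields one bucket per faulting logger.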

## Interface Change

Added `EventQueueCount` property to `ILoggingService` (internal interface) to expose the async event queue depth for hang diagnostics.

ScheduleTimeRecord.AccumulatedTime throws InternalErrorException with
'Can't get the accumulated time while the timer is still running' during
Scheduler.WriteDetailedSummary(). This exception kills the BuildManager
work queue, preventing any further build results from being processed.
EndBuild() hangs indefinitely, causing VS to freeze for hours.

The fix returns the best-effort elapsed time (accumulated + current
elapsed) when the timer is still running, instead of throwing.
This is diagnostic summary data — throwing has no correctness benefit
but causes a catastrophic hang.
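
The behavior change can be sketched as follows. This Python class is a hypothetical model of a start/stop time record, not the actual `ScheduleTimeRecord` code; the injectable clock and the method names are assumptions made so the example can be exercised deterministically.

```python
import time


class ScheduleTimeRecordSketch:
    """Accumulates elapsed time across start/stop intervals."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._accumulated = 0.0
        self._started_at = None  # None means the timer is not running

    def start(self):
        if self._started_at is not None:
            raise RuntimeError("timer is already running")
        self._started_at = self._clock()

    def stop(self):
        if self._started_at is None:
            raise RuntimeError("timer is not running")
        self._accumulated += self._clock() - self._started_at
        self._started_at = None

    @property
    def accumulated_time(self):
        if self._started_at is None:
            return self._accumulated
        # Best effort while running: previous total plus the current interval,
        # instead of raising and taking down the caller.
        return self._accumulated + (self._clock() - self._started_at)
```

Reading the value while the timer runs gives a slightly stale number, which is acceptable for summary output and avoids turning a reporting call into a fatal error.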

11 hits in 30 days confirmed via telemetry (StackHash: 2C721D65...).
All occurrences happened during solution close with running timers.

- Remove placeholder issues/XXXXX URL from XML doc comment
- Add ScheduleTimeRecord_AccumulatedTime_DoesNotThrowWhenTimerIsRunning test
- Add ScheduleTimeRecord_AccumulatedTime_IncludesPreviousAccumulation test
…state

Add new telemetry properties to the EndBuildHang crash event to help
diagnose why EndBuild gets stuck waiting for submissions to complete:

- LoggingServiceState: whether the logging pipeline is alive or shutting down
- LoggingEventQueueDepth: number of events backed up in the async queue
- IsShuttingDown: whether BuildManager shutdown has been initiated
- IsCancellationRequested: whether the cancellation token was triggered
- WorkQueueDepth: pending items in the BuildManager work queue
- SubmissionDetails: per-submission state (started, has result, has exception, logging completed)
- RegisteredLoggerTypeNames: which loggers are registered on the node

Also add inner exception diagnostics for all crash telemetry:
- InnerExceptionStackTrace: sanitized stack trace of the inner exception
- InnerExceptionMessage: truncated and path-sanitized inner exception message
- LoggerEventType: the build event type being processed when a logger faulted
- Include inner exception stack in StackHash computation for better bucketing

All string fields are sanitized to remove file paths and truncated to prevent PII leakage.
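
As a rough picture of what "sanitized and truncated" can mean, the Python sketch below replaces anything that looks like an absolute file path with a placeholder and then caps the length. The regular expression, the `<path>` placeholder, and the 256-character limit are assumptions for illustration; the PR does not describe the actual rules MSBuild applies.

```python
import re

# Assumption: a path is a drive-letter path, a UNC path, or a Unix absolute
# path with at least two segments. Real sanitizers may be stricter.
_PATH_PATTERN = re.compile(
    r"""
      [A-Za-z]:\\[^\s"'<>|]*                      # drive-letter path
    | \\\\[^\s"'<>|]+                             # UNC path
    | (?<![\w/])/[^\s/"'<>|]+(?:/[^\s/"'<>|]+)+   # Unix absolute path
    """,
    re.VERBOSE,
)


def sanitize_for_telemetry(text, max_length=256):
    """Replace path-like substrings, then truncate to max_length characters."""
    sanitized = _PATH_PATTERN.sub("<path>", text)
    # Truncate after sanitizing so a cut-off path fragment cannot slip through.
    return sanitized[:max_length]
```

Sanitizing before truncating matters: truncating first could leave half a path at the end of the string that no longer matches the pattern.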
Copilot AI review requested due to automatic review settings March 13, 2026 17:21

Copilot AI left a comment

Pull request overview

Improves MSBuild crash/hang telemetry to better diagnose EndBuild hangs (especially those related to logging/submission completion) by enriching CrashTelemetry and emitting additional EndBuild state.

Changes:

  • Add inner-exception diagnostics (message/stack) and logger event type to crash telemetry, and improve stack hashing by incorporating inner stack traces.
  • Expand EndBuild hang telemetry with logging service state/queue depth, work queue depth, cancellation/shutdown state, submission details, and registered logger types.
  • Refactor EndBuild hang emission to pass a pre-populated CrashTelemetry object; add EventQueueCount to ILoggingService and implementations.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| `src/Framework/Telemetry/CrashTelemetryRecorder.cs` | Refactors EndBuild hang diagnostic emission to accept a pre-populated CrashTelemetry. |
| `src/Framework/Telemetry/CrashTelemetry.cs` | Adds new crash + EndBuild-hang properties and updates stack hashing and exception population logic. |
| `src/Framework.UnitTests/CrashTelemetry_Tests.cs` | Extends unit coverage for new telemetry fields and stack-hash behavior. |
| `src/Build/BackEnd/Components/Scheduler/ScheduleTimeRecord.cs` | Changes AccumulatedTime to return best-effort elapsed time while running instead of throwing. |
| `src/Build/BackEnd/Components/Logging/LoggingService.cs` | Exposes async logging queue depth via EventQueueCount. |
| `src/Build/BackEnd/Components/Logging/ILoggingService.cs` | Adds EventQueueCount to the logging service interface. |
| `src/Build/BackEnd/BuildManager/BuildManager.cs` | Populates and emits enriched EndBuild hang telemetry, including logging and submission state details. |
| `src/Build.UnitTests/BackEnd/Scheduler_Tests.cs` | Adds tests validating the new ScheduleTimeRecord.AccumulatedTime behavior. |
| `src/Build.UnitTests/BackEnd/MockLoggingService.cs` | Updates mock to implement the new EventQueueCount interface member. |

Comment thread src/Build/BackEnd/BuildManager/BuildManager.cs Outdated
…nableNodeReuse, ActiveNodeDetails

For WaitingForNodes hangs where nodes refuse to shut down, the existing
telemetry only reports the count of active nodes. Add:

- ActiveNodeIds: comma-separated list of stuck node IDs
- EnableNodeReuse: whether nodes were told to go idle vs exit
- ActiveNodeDetails: per-node state showing what each node was last
  executing (nodeId:configId:projectFileName), idle, or error
@YuliiaKovalova YuliiaKovalova merged commit 1f15113 into main Mar 16, 2026
10 checks passed
@YuliiaKovalova YuliiaKovalova deleted the dev/enrich-endbuild-hang-telemetry branch March 16, 2026 12:20
AR-May pushed a commit to AR-May/msbuild that referenced this pull request Mar 19, 2026
…state (dotnet#13385)
