feat(aspire): dump resource error logs on test failure + warn on missing telemetry wiring#6323
Conversation
…ing telemetry wiring Two DX features so a failing Aspire test surfaces SUT logs without manual setup: 1. AspireFixture<T> implements ITestEndEventReceiver. On a failed test it appends each waited-on resource's recent stderr lines to that test's output (resource-scoped, capped). Gated by DumpResourceLogsOnFailure (default on), MaxFailureLogLinesPerResource, FailureLogCollectionTimeout. Complements live OTLP per-request log correlation. 2. OtlpReceiver records the service.name of every incoming OTLP log record. At session end the fixture hints (WarnOnMissingTelemetry, default on) for any project resource that produced console output but never sent correlated OTLP logs - the usual symptom of an SUT missing OpenTelemetry log export (e.g. ServiceDefaults not wired). Emitted after the receiver drain and skipped on run-abort. Adds Receiver_LogWithServiceName_RecordsSeenService.
There was a problem hiding this comment.
Review: feat(aspire): dump resource error logs on test failure + warn on missing telemetry wiring
Good DX addition overall — the two-feature design is well-reasoned, the PR description is thorough, and the test is solid. A few things worth fixing before merge:
1. Potential null reference: _app passed to non-nullable parameter
In StopAndDisposeCoreAsync, _app (typed DistributedApplication?) is passed to EmitTelemetryWiringHintsAsync which takes DistributedApplication app (non-nullable):
await EmitTelemetryWiringHintsAsync(_app, _otlpReceiver);If _app is null at that point (e.g. initialization failed before the app was assigned), this is either a CS8604 nullable warning or a runtime null dereference. The call site should guard:
if (WarnOnMissingTelemetry && _app is not null && !RunCancellationToken.IsCancellationRequested)
{
await EmitTelemetryWiringHintsAsync(_app, _otlpReceiver);
}Or make the parameter nullable and guard inside the method (similar to how OnTestEnd does var app = _app; if (... || app is null) early return).
2. Memory inefficiency: CollectResourceErrorLinesAsync collects unbounded lines before truncating
var errors = new List<string>();
// ... fills with ALL error lines from the backlog ...
if (max >= 0 && errors.Count > max)
{
errors.RemoveRange(0, errors.Count - max);
}On a long test run where a resource produced thousands of error lines, this allocates the full list before discarding all but the last max. For shared [ClassDataSource] or [SessionDataSource] fixtures this could be a significant allocation on every test failure.
The better approach is a bounded ring-buffer that never exceeds max entries:
var errors = new Queue<string>(max + 1);
// inside loop:
if (line.IsErrorMessage)
{
errors.Enqueue(line.Content);
if (errors.Count > max) errors.Dequeue();
}
// return errors.ToList()This keeps memory proportional to MaxFailureLogLinesPerResource, not to the total number of error lines the resource has ever produced.
3. Hardcoded 2-second timeout in ResourceProducedConsoleOutputAsync
FailureLogCollectionTimeout is a configurable virtual property, but ResourceProducedConsoleOutputAsync bakes in TimeSpan.FromSeconds(2) with no hook for customization. These are separate concerns (failure dumps vs. the telemetry hint), but the inconsistency is surprising.
More importantly: if a project resource produced no output at all, the probe will wait the full 2 seconds before returning false. With N project resources in parallel, this is always 2s of teardown overhead on every run — even when all resources are perfectly wired with OTLP. Consider a short-circuit: if the receiver has already seen a service name for a resource, skip the probe entirely:
if (!receiver.HasSeenLogsFrom(p.Name) && await ResourceProducedConsoleOutputAsync(...))
{
// emit hint
}Reversing the check order avoids the I/O probe for resources that clearly sent OTLP logs.
4. Minor: unnecessary local copy in ProcessLogs
private void ProcessLogs(byte[] body)
{
var diag = _diagnostics;_diagnostics is a readonly field, so var diag = _diagnostics is a needless alias. Just use _diagnostics directly as before. The motivation for the change (making the method instance rather than static) is correct — the alias is just noise.
What's solid
- The
ITestEndEventReceiverintegration is the right hook — cleaner than a global cleanup callback. ObjectDisposedExceptionguard inOnTestEndis the right call: a race against run-abort teardown is expected and shouldn't poison the test result.ConcurrentDictionary<string, byte>for the seen-services set is an appropriate allocation-free pattern (compared toConcurrentHashSet).- Case-insensitive service name matching via
StringComparer.OrdinalIgnoreCaseon the dictionary is correctly placed at construction time. - The new test
Receiver_LogWithServiceName_RecordsSeenServicetests both the happy path and case-insensitivity, and uses an unregistered trace ID to verify the seen-service recording is independent of trace correlation — good. - The PR description is unusually complete and the "landmines for reviewers" section is helpful context.
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 38 |
NEW Get contextual insights on your PRs based on Codacy's metrics, along with PR and Jira context, without leaving GitHub. Enable AI reviewer
TIP This summary will be updated as you push new changes.
… probes Cleanups from /simplify review: - Merge CollectResourceErrorLinesAsync into CollectResourceLogLinesAsync via optional (timeout, errorsOnly, max) params; defaults reproduce the existing startup-diagnostics caller's behavior (5s, all lines, E> prefix, no cap). - EmitTelemetryWiringHintsAsync: filter project resources by !HasSeenLogsFrom before probing console output, so a correctly-wired SUT opens zero probes. - OnTestEnd: build the waited-on resource name list with LINQ instead of a manual foreach.
There was a problem hiding this comment.
Follow-up review (second commit: 016b04a3)
What the new commit addressed
The refactor(aspire) commit picked up two concerns from the prior review:
- Pre-filter telemetry-hint probes —
EmitTelemetryWiringHintsAsyncnow filters by!HasSeenLogsFrombefore opening any console probe, so a correctly-wired SUT (all resources seen) takes zero I/O overhead at teardown. Good fix. - Method consolidation — merging
CollectResourceErrorLinesAsyncintoCollectResourceLogLinesAsyncvia optional params reduces duplication. Clean.
Correction to my prior review — issue #1 was a misread
My prior comment about a potential null dereference on _app was wrong. The call to EmitTelemetryWiringHintsAsync(_app, _otlpReceiver) is nested inside:
if (_otlpReceiver is not null && _app is not null)so _app is guaranteed non-null at that point. Apologies for the noise.
Still open from prior review
Issue #2: Memory — unbounded list before tail-truncation (still present)
CollectResourceLogLinesAsync still allocates a List<string> for every line in the backlog, then removes the head:
// line 1062
lines.RemoveRange(0, lines.Count - max);On a long test run with a shared (session/class) fixture, a noisy resource can produce thousands of lines, all allocated before all-but-50 are discarded. A fixed-size ring-buffer keeps memory proportional to max throughout:
// Replace List<string> with a sliding window
var window = new Queue<string>(max + 1);
// inside loop, when errorsOnly:
if (line.IsErrorMessage)
{
window.Enqueue(line.Content);
if (window.Count > max) window.Dequeue();
}
// return window.ToList()This is especially worth fixing here because OnTestEnd fires for every failing test that consumes the fixture — on a shared fixture, that means N calls per run, each potentially pulling a large backlog.
Issue #4: Unnecessary alias in ProcessLogs (still present)
// OtlpReceiver.cs line 554
private void ProcessLogs(byte[] body)
{
var diag = _diagnostics; // _diagnostics is a readonly field — this alias adds nothingJust use _diagnostics directly. The alias was a remnant of when the method was static and took diag as a parameter; now that it's an instance method, the variable is dead weight.
Hardcoded timeout in ResourceProducedConsoleOutputAsync (FYI, not blocking)
// line 743
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));FailureLogCollectionTimeout defaults to 2s too, so the values coincidentally match, but the probe timeout isn't wired to any configurable property. With the pre-filter now in place the probe is only opened for resources that haven't sent OTLP logs, so in practice this affects mis-wired SUTs only. Still, if someone extends the failure dump timeout they'd reasonably expect the hint probes to follow. A minor but clean fix would be passing FailureLogCollectionTimeout through to the probe (or a separate virtual TelemetryHintProbeTimeout). Not blocking.
Summary
Two real fixes landed in the second commit; issue #1 from my prior review was a misread (already safe). Memory inefficiency (#2) and the var diag alias (#4) still need attention. The timeout inconsistency is low priority but worth a note.
…lias - CollectResourceLogLinesAsync: use a fixed-size sliding window (Queue) when `max` is set so memory stays proportional to MaxFailureLogLinesPerResource instead of buffering the whole backlog before tail-truncating (review #2). - OtlpReceiver.ProcessLogs: drop the leftover `var diag = _diagnostics` alias from when the method was static; use the field directly (review #4). - Wire the telemetry-hint console probe to FailureLogCollectionTimeout instead of a hardcoded 2s, so extending the dump timeout extends the probe too.
|
Addressed the open items from the follow-up review in
|
Code ReviewThis PR adds two opt-out Aspire DX features to Findings
Not flagged (verified and refuted)
Nice test coverage on the new |
Addresses the deep-review findings on the OTLP log-correlation DX PR. - #1 false-positive hint: OtlpLogParser dropped every record without a trace id *before* OtlpReceiver could record its service.name, so a correctly-wired SUT emitting only untraced logs (startup/background, no active Activity) was never marked "seen" and triggered the missing-telemetry hint it was meant to avoid. Parse now invokes an onResourceSeen callback for each record ahead of the trace-id filter; the receiver records seen services via a cached delegate. The record-emission contract (untraced records excluded from results) is unchanged. - #2 retry leak: OnTestEnd fires once per retry attempt (end-event receivers run inside the retried body), and TestContext.Output is shared across attempts, so an early failed attempt's log dump leaked into the output of a test that passed on retry. Gate the dump on the final attempt (CurrentRetryAttempt >= RetryLimit). - #4 consistency: the telemetry-hint candidate list now applies ShouldWaitForResource, matching the on-failure dump's filtering, so a user-excluded resource isn't probed. - #6 docs: FailureLogCollectionTimeout XML doc now discloses its reuse as the session-end console-output probe timeout. New test Parse_UntracedLogRecord_StillReportsSeenService covers #1.
|
Thanks — the deep pass caught two real bugs. Addressed in #1 (false-positive hint) — fixed. Confirmed: #2 (retry leak) — fixed. Verified against #4 (filter consistency) — fixed. #6 (doc) — fixed. #3 (tracking on shared #5 (dedup the two
|
Follow-up review (verifying fixes in
|
Quality cleanups from a /simplify pass (no behaviour change): - OtlpLogParser: fire the onResourceSeen callback once per resource scope instead of once per log record. ParseScopeLogs now returns whether it saw any record; ParseResourceLogs signals after the loop, dropping the per-record ConcurrentDictionary.TryAdd churn for a noisy resource and removing the callback param from ParseScopeLogs's signature. - AspireFixture: invert the log-line prefix ternary to test the special case directly (E> only when an error line and not errors-only). - AspireFixture: merge the two telemetry-hint candidate .Where() filters into a single predicate.
There was a problem hiding this comment.
Follow-up review (latest commit 1054d3fa: "tidy log-parser seen-service signal and hint filter")
All prior rounds' findings (the deep-pass review and the two follow-ups) are already confirmed fixed and verified. This commit is a small post-"ready to merge" cleanup on top of that state — reviewed it specifically since it landed after the last sign-off.
Changes in this commit
-
AspireFixture.cs:714-717— merged the two chained.Where(ShouldWaitForResource).Where(p => !receiver.HasSeenLogsFrom(p.Name))calls into a single.Where(p => ShouldWaitForResource(p) && !receiver.HasSeenLogsFrom(p.Name)). Pure style change — same predicates, same short-circuit order, no behavior difference. -
AspireFixture.cs:1070-1072— flipped theE>prefix ternary fromerrorsOnly || !line.IsErrorMessage ? Content : "E> " + Contentto!errorsOnly && line.IsErrorMessage ? "E> " + Content : Content. Worked through the truth table for all four(errorsOnly, IsErrorMessage)combinations — identical output in every case. The new form reads more directly (positive condition for the prefix-added branch instead of a double-negative guard), so this is a legitimate readability win with no regression. -
OtlpLogParser.cs— reworksonResourceSeenfrom firing once per LogRecord (insideParseScopeLogs, before the trace-id filter) to firing once per resource (ParseResourceLogs, via asawLogRecordflag OR'd across all of a resource'sScopeLogsblocks). This avoids redundantonResourceSeeninvocations for a noisy resource —AspireFixture's consumer is aConcurrentDictionary.TryAdd, so the old per-record version wasn't wrong, just wasteful at scale. The new code explicitly documents the wire-format assumption it leans on ("Resource (field 1) precedes ScopeLogs (field 2) ... true for all known OTel SDK implementations"), which is the right way to disclose it given the parser has no way to enforce field ordering.
Verification
- Confirmed the ternary rewrite is logically equivalent by hand (see truth table above).
- Built
TUnit.OpenTelemetryandTUnit.Aspire.Coreagainst the PR head (1054d3fa) — both succeed with 0 warnings/errors. - Ran the directly-affected suites against the PR head:
OtlpLogParserTests(incl.Parse_UntracedLogRecord_StillReportsSeenService, which exercises exactly this code path): 26/26 passed.OtlpReceiverIngestionTests: 8/8 passed.
No new issues. This commit doesn't touch any of the previously-flagged areas and doesn't introduce new ones — the seen-service contract (presence-before-trace-filter, case-insensitive lookup) is preserved, just invoked more efficiently. Ready to merge from my side.
What
Two opt-out DX features so a failing Aspire test surfaces SUT logs without any manual wiring.
1. Dump resource error logs on test failure (default on)
AspireFixture<T>now implementsITestEndEventReceiver. When a test that consumes the fixture fails, each waited-on resource's recent stderr lines are appended to that test's output.ResourceLoggerService.WatchAsyncpattern; collected in parallel with a short per-resource timeout.DumpResourceLogsOnFailure(defaulttrue),MaxFailureLogLinesPerResource(50),FailureLogCollectionTimeout(2s).2. Warn on missing telemetry wiring (default on)
OtlpReceivernow records theservice.nameof every incoming OTLP log record (before the trace-id filter — presence means OTLP logging reached us at all). At session end the fixture emits a one-time hint for any project resource that produced console output but never sent correlated OTLP logs — the usual symptom of an SUT missing OpenTelemetry log export (e.g. Aspire ServiceDefaults not wired). Turns a silent gap into an actionable nudge.StopAsync(app still up to probe console output); skipped on a run-abort (incomplete data).OTEL_SERVICE_NAME=<resourceName>; an SUT that overrides its own service name may produce a false hint, hence the soft wording. Knob:WarnOnMissingTelemetry(defaulttrue).Why
For a SUT wired with standard ServiceDefaults, correlated per-request logs already reach the owning test live. These two changes cover the two remaining gaps with zero user code: failure visibility for uncorrelated console logs, and a hint when correlation silently can't work because OTLP log export isn't on.
Tests
Receiver_LogWithServiceName_RecordsSeenService(hand-built OTLP logs protobuf, unregistered trace id, asserts case-insensitiveHasSeenLogsFrom+SeenLogServiceNames).OtlpReceiverIngestionTests8/8; pure AspireTraceRegistryTests/ResourcesToRemoveTestspass; clean build on net10.0.Notes / landmines for reviewers
TestContext.Resultis internal; the public read path iscontext.Execution.Result?.State.OtlpReceiver.ProcessLogschanged fromstaticto an instance method (needs the seen-services field).[ClassDataSource]-injected fixtures becauseBuildEligibleEventObjectsscans test class arguments / injected properties.