Server-side App Insights: close mobile↔API correlation loop #166
Merged
…PI (close mobile↔API correlation loop)

Adds server-side Application Insights via Azure.Monitor.OpenTelemetry.Exporter into SentenceStudio.ServiceDefaults so API, WebApp, Workers, and Marketing all tag themselves with distinct cloud_RoleName values and export to the same sstudio-mobile-ai resource the mobile client already uses (PR #165). With HttpClient instrumentation on the client and AspNetCore instrumentation on the server, W3C traceparent propagates automatically and requests join to dependencies on operation_Id.

Critical MAUI-safety pivot: Azure.Monitor.OpenTelemetry.AspNetCore transitively requires Microsoft.AspNetCore.App, which has no runtime pack for maccatalyst-*/ios-*/android-* RIDs. ServiceDefaults is consumed by every MAUI head via AppLib, so the AspNetCore variant breaks MAUI builds (NETSDK1082). Using the lower-level Exporter package (same one mobile uses) with the three AddAzureMonitor{Log,Metric,Trace}Exporter calls in shared defaults keeps MAUI buildable; AspNetCore instrumentation is added only to the API csproj and wired from Program.cs.

OTel stack bumped to 1.15.x across ServiceDefaults and AppLib to stay aligned with MauiServiceDefaults and clear NU1605 downgrade errors in web projects.

Azure Monitor wiring gated #if !DEBUG so local aspire run keeps streaming to the Aspire dashboard via OTLP without dual-exporting.

AzureMonitor:ConnectionString committed to Api/appsettings.Production.json (write-only key, same approach mobile slice used; ingestion capped at 0.5 GB/day).

Companion skill at .squad/skills/aspnetcore-azure-monitor/SKILL.md documents the MAUI-safe server pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
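The MAUI-safe wiring this commit describes could look roughly like the sketch below. It assumes a generic host builder and the three exporter extension methods from Azure.Monitor.OpenTelemetry.Exporter; the class name and method shape are hypothetical and are not the actual contents of Extensions.cs.

```csharp
using Azure.Monitor.OpenTelemetry.Exporter;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using OpenTelemetry;
using OpenTelemetry.Resources;

// Hypothetical class name; a sketch, not the real ServiceDefaults code.
public static class ServiceDefaultsSketch
{
    public static TBuilder AddServiceDefaults<TBuilder>(this TBuilder builder, string cloudRoleName)
        where TBuilder : IHostApplicationBuilder
    {
        // The OTel service name is what the Azure Monitor exporter maps to cloud_RoleName.
        var otel = builder.Services.AddOpenTelemetry()
            .ConfigureResource(resource => resource.AddService(cloudRoleName));

#if !DEBUG
        // Release-only: Debug builds keep exporting solely over OTLP.
        var connectionString = builder.Configuration["AzureMonitor:ConnectionString"];
        if (!string.IsNullOrWhiteSpace(connectionString))
        {
            builder.Logging.AddOpenTelemetry(logging =>
                logging.AddAzureMonitorLogExporter(o => o.ConnectionString = connectionString));
            otel.WithMetrics(metrics =>
                metrics.AddAzureMonitorMetricExporter(o => o.ConnectionString = connectionString));
            otel.WithTracing(tracing =>
                tracing.AddAzureMonitorTraceExporter(o => o.ConnectionString = connectionString));
        }
#endif
        return builder;
    }
}
```

Because nothing here references Microsoft.AspNetCore.App, a project containing only this code should remain consumable from MAUI heads; the AspNetCore instrumentation would be added separately in the API project.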
ASP.NET Core OTel instrumentation tags exceptions on the request span but does NOT produce rows in App Insights' exceptions table; that table is populated only from ILogger records carrying an Exception. Wire UseExceptionHandler as the first middleware, log via a named 'UnhandledException' ILogger, and write a ProblemDetails 500. The OTel log exporter ships the record so KQL like exceptions | where cloud_RoleName == 'SentenceStudio.Api' surfaces unhandled controller / minimal-API throws.

Placement: before UseAuthentication so exceptions in auth handlers and custom middleware are also caught.

Smoke-validated locally by temporarily adding /__debug/boom (removed before commit): HTTP 500 + application/problem+json body, fail: UnhandledException[0] log line with full stack, no process crash.

Also:
- aspnetcore-azure-monitor SKILL: replace 'no middleware needed' claim with the correct pattern; add az monitor app-insights component billing update recipe for daily-cap management.
- wash history: learnings for the cap raise CLI and the span-vs-exceptions-row distinction.

Companion: 0.5 GB/day → 2 GB/day cap raise on sstudio-mobile-ai (done via az CLI, read-back archived in PR #166 body).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
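A minimal sketch of the exception-handler pattern described in this commit, assuming an ASP.NET Core minimal-API Program.cs with the Web SDK's implicit usings. The log message text and problem title are illustrative, and the real handler in Program.cs may differ.

```csharp
using Microsoft.AspNetCore.Diagnostics;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Registered first so it wraps authentication and every later middleware.
app.UseExceptionHandler(errorApp => errorApp.Run(async context =>
{
    var error = context.Features.Get<IExceptionHandlerFeature>()?.Error;
    var logger = context.RequestServices
        .GetRequiredService<ILoggerFactory>()
        .CreateLogger("UnhandledException");

    // Passing the Exception object is what allows a log exporter to emit an exception record.
    logger.LogError(error, "Unhandled exception on {Method} {Path}",
        context.Request.Method, context.Request.Path);

    // Writes an application/problem+json body with status 500.
    await Results.Problem(
        title: "An unexpected error occurred.",
        statusCode: StatusCodes.Status500InternalServerError).ExecuteAsync(context);
}));

// Authentication, custom middleware, and endpoint mapping would follow here.
app.MapGet("/", () => "ok");
app.Run();
```

Keeping the logger category as a fixed string rather than a type makes the resulting rows easy to filter by category name, at the cost of losing the usual per-class categorization.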
davidortinau added a commit that referenced this pull request on Apr 22, 2026
Big one: always read AppHost.cs before recommending a deploy-tool flip. Plus: stale safety scripts are worse than no safety scripts, ACA prepends env name to cloud_RoleName (prod-only KQL gotcha), and azd deploy on Flexible-Server architecture is clean.
This was referenced Apr 22, 2026
davidortinau added a commit that referenced this pull request on Apr 22, 2026
…tion

PR #165 (mobile) + PR #166 (server) shipped the App Insights pipeline end to end, but the mobile↔API correlation join in App Insights was returning zero rows. Every server request had `operation_Id == operation_ParentId`, i.e., no `traceparent` header was arriving from the device.

Diagnosis (see #171):
- `OpenTelemetry.Instrumentation.Http`'s `AddHttpClientInstrumentation()` was already on the MAUI TracerProvider since commit 216a2da, and a trim-disabled Release build on DX24 produced the same zero-span result, so neither registration nor trimming was the problem.
- Mobile logs had empty `operation_Id` across the board, confirming no ambient `Activity` ever existed on the device.
- Root cause (tracked in #171): MAUI's `MauiApp` doesn't run `IHostedService` instances, so the `TelemetryHostedService` that would normally materialize the TracerProvider and attach its listeners never runs. Logs work because they hook `ILoggerFactory` synchronously; the tracer path needs the hosted-service startup.

This PR:
- Adds `ApiActivityHandler`, a `DelegatingHandler` that starts a `Client` Activity per outbound API call using a dedicated `ActivitySource` (`SentenceStudio.Mobile.HttpClient`). With an Activity current, HttpClient's built-in `DiagnosticsHandler` auto-injects the W3C `traceparent` header.
- Registers the new ActivitySource on the mobile TracerProvider via `.AddSource(...)` in `MauiServiceDefaults.Extensions` so the spans actually export.
- Wires the handler onto every API-bound HttpClient: CoreSync's `HttpClientToServer`, the auth client, the four typed API clients, and `VersionCheckService`. The handler is placed FIRST in the chain so the span wraps the full operation including auth token attachment.
- Hardens `OpenTelemetryInitializer` to call `GetRequiredService<T>()` instead of the nullable `GetService<T>()` for all three providers, so a misregistration fails loudly at startup instead of silently breaking telemetry at runtime.

Out of scope (explicitly):
- Root-cause fix for the IHostedService gap: tracked in #171.
- The raw `new HttpClient()` in `SentenceStudio.Shared/Services/AiService.cs:93` bypasses `HttpClientFactory` entirely. Separate refactor.
- The KQL in `docs/deploy-runbook.md` is still wrong (joins requests to requests; should be dependencies to requests). Separate doc PR.

Verification: Mac Catalyst Debug + Release both build clean. Post-merge verification will be an iOS publish to DX24 + KQL query for non-empty `operation_ParentId` on server requests.
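To illustrate the "placed FIRST in the chain" point, here is a small sketch of handler ordering with `IHttpClientFactory`. The client name `"api"`, the `AuthTokenHandler` type, and both stub classes are hypothetical stand-ins; only the ordering behaviour of `AddHttpMessageHandler` is the point.

```csharp
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();
services.AddTransient<ApiActivityHandler>();
services.AddTransient<AuthTokenHandler>();

// Handlers run in registration order: the first one added is the outermost,
// so its span covers everything the later handlers do (including token attachment).
services.AddHttpClient("api")
    .AddHttpMessageHandler<ApiActivityHandler>()
    .AddHttpMessageHandler<AuthTokenHandler>();

// Hypothetical stubs standing in for the real handlers.
public sealed class ApiActivityHandler : DelegatingHandler { }
public sealed class AuthTokenHandler : DelegatingHandler { }
```

If the order were reversed, time spent acquiring or refreshing the auth token would fall outside the recorded dependency span.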
davidortinau added a commit that referenced this pull request on Apr 22, 2026
…ile↔API correlation (#172)

* fix(mobile): wrap API HttpClients with ApiActivityHandler for correlation

* fix(mobile): use Activity.AddException for OTel-conformant exception recording

  Code review feedback on #172: exceptions should be recorded as Activity events (via AddException/RecordException), not raw tags. Emits the standard OTel 'exception' event with type/message/stacktrace, which surfaces in App Insights' exception timeline rather than being tag-only.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
davidortinau added a commit that referenced this pull request on Apr 22, 2026
…tion (#173)

PR #172 got mobile HttpClient dependency spans emitting with operation_Id, but the correlation join against API requests still returned zero rows: the API saw every incoming request without a traceparent header and started a fresh operation_Id.

Root cause: HttpClient's built-in DiagnosticsHandler only injects traceparent automatically when an OTel-style ActivityListener is attached to "System.Net.Http". On MAUI the listener never attaches because OpenTelemetry's TelemetryHostedService (which wires listeners to the TracerProvider) relies on IHostedService, and MauiApp doesn't run hosted services (issue #171).

Fix: have ApiActivityHandler explicitly call DistributedContextPropagator.Current.Inject(...) on the outbound request headers after starting its Activity. Guards against double-injection if a caller or a resilience retry already set traceparent.

This is the user-space workaround to #171. Framework fix is still desirable but now lower priority.

Verification plan: re-run the App Insights correlation join; expect requests | join dependencies on operation_Id to return > 0 rows for the mobile role name.

Refs: #165 #166 #172 #171
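The fix described above can be sketched as a `DelegatingHandler` along the following lines. This is an assumption-laden illustration, not the actual `ApiActivityHandler` source: the span name format and the error-status handling are invented for the example, and `Activity.AddException` requires .NET 9 or later.

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading;
using System.Threading.Tasks;

public sealed class ApiActivityHandler : DelegatingHandler
{
    // Source name taken from the commit message; it must also be registered via AddSource(...).
    private static readonly ActivitySource Source = new("SentenceStudio.Mobile.HttpClient");

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Null when no listener is subscribed to the source; the request then goes out untraced.
        using var activity = Source.StartActivity($"HTTP {request.Method}", ActivityKind.Client);

        // Skip injection if a caller or a resilience retry already set the header.
        if (activity is not null && !request.Headers.Contains("traceparent"))
        {
            DistributedContextPropagator.Current.Inject(activity, request.Headers,
                static (carrier, name, value) =>
                {
                    if (carrier is HttpRequestHeaders headers)
                        headers.TryAddWithoutValidation(name, value);
                });
        }

        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        catch (Exception ex)
        {
            // Records the standard OTel 'exception' event on the span (.NET 9+).
            activity?.AddException(ex);
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
```

Injecting explicitly means propagation no longer depends on a listener being attached to "System.Net.Http"; it only needs the handler's own Activity to have been created.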
Server-side App Insights: close mobile↔API correlation loop
Companion to PR #165 (the mobile-side slice that shipped `Azure.Monitor.OpenTelemetry.Exporter` into `SentenceStudio.MauiServiceDefaults`). This PR wires the same App Insights resource (`sstudio-mobile-ai`) into the server tier via `SentenceStudio.ServiceDefaults`, giving us joined `requests`/`dependencies` rows keyed by `operation_Id`.

Agent: Wash (Backend Dev). Draft until the deploy + correlation proof below is filled in.
Cap raise
Daily ingestion cap on `sstudio-mobile-ai` raised from 0.5 GB → 2 GB before deploy so combined mobile + 4 server emitters (API, WebApp, Workers, Marketing) don't get throttled in the first day.

    az monitor app-insights component billing update \
      --app sstudio-mobile-ai --resource-group rg-sstudio-prod \
      --cap 2 --stop false

Result (from `az monitor app-insights component billing show …`):

Exception handler
Added a global `app.UseExceptionHandler(…)` as the first middleware in `src/SentenceStudio.Api/Program.cs` (just after `builder.Build()`, lines 276–302) that logs unhandled exceptions via a named `UnhandledException` `ILogger` and returns an `application/problem+json` 500. This is required on top of `AddAspNetCoreInstrumentation`: that instrumentation tags the request span with exception events but does NOT produce rows in App Insights' `exceptions` table (those come only from `ILogger` records carrying an `Exception`, shipped through the OTel log exporter).

Smoke-validated locally via a temporary `/__debug/boom` endpoint (removed before commit): HTTP 500 + problem+json body, `fail: UnhandledException[0]` log line with full stack, process kept running for further requests.

What's in this PR
| File | Change |
| --- | --- |
| `src/SentenceStudio.ServiceDefaults/SentenceStudio.ServiceDefaults.csproj` | `Azure.Monitor.OpenTelemetry.Exporter 1.7.0`. |
| `src/SentenceStudio.ServiceDefaults/Extensions.cs` | `AddServiceDefaults(..., cloudRoleName)`; `ConfigureResource(AddService(roleName))`; `#if !DEBUG` three-exporter Azure Monitor block gated on `AzureMonitor:ConnectionString`. |
| `src/SentenceStudio.AppLib/SentenceStudio.AppLib.csproj` | |
| `src/SentenceStudio.Api/SentenceStudio.Api.csproj` | `OpenTelemetry.Instrumentation.AspNetCore 1.15.0` locally (kept out of shared defaults; see MAUI-safety note below). |
| `src/SentenceStudio.Api/Program.cs` | `AddServiceDefaults("SentenceStudio.Api")`; local `.WithMetrics`/`.WithTracing` with `AddAspNetCoreInstrumentation()`; global `UseExceptionHandler` → `ILogger.LogError` (new; pre-deploy review fix). |
| `src/SentenceStudio.Api/appsettings.Production.json` | `AzureMonitor:ConnectionString` (write-only ingestion key, same as mobile's; intentional reuse). |
| `src/SentenceStudio.WebApp/Program.cs`, `Workers/Program.cs`, `Marketing/Program.cs` | `cloud_RoleName` literal. No connection string shipped in their `appsettings.Production.json` yet → they stay OTLP-only until Captain opts them in. |
| `.squad/skills/aspnetcore-azure-monitor/SKILL.md` | `maui-azure-monitor/SKILL.md`. Captures the MAUI-safe server pattern + the exception-handler recipe + the cap-raise az CLI recipe. |
| `.squad/agents/wash/history.md` | |

Locked-decisions adherence
- `sstudio-mobile-ai`, workspace-backed by `law-3ovvqiybthkb6`): ✅ reused verbatim, same connection string as mobile.
- `cloud_RoleName`, no runtime detection: ✅ `"SentenceStudio.Api"`, `"SentenceStudio.WebApp"`, `"SentenceStudio.Workers"`, `"SentenceStudio.Marketing"`, all passed from their respective `Program.cs`.
- `#if !DEBUG`. Container builds are Release so it activates in prod; `aspire run` is Debug so it stays OTLP-only. No double-export.
- `Aspire.Hosting.AzureMonitor` integration in AppHost: ✅ verified absent. The manual wiring in ServiceDefaults is the only path.

MAUI-safety pivot (deviation from task brief)
The task brief suggested `Azure.Monitor.OpenTelemetry.AspNetCore 1.4.0` + `UseAzureMonitor()`. That package transitively pulls `OpenTelemetry.Instrumentation.AspNetCore`, which declares `<FrameworkReference Include="Microsoft.AspNetCore.App" />`. There is no `Microsoft.AspNetCore.App` runtime pack for `maccatalyst-arm64`/`ios-arm64`/`android-*` RIDs, so putting it in `ServiceDefaults` broke every MAUI head with `NETSDK1082`.

`ServiceDefaults` is consumed by web hosts and (transitively, via `AppLib`) by every MAUI head. So the `.AspNetCore` variant of Azure Monitor simply can't live in the shared project.

Resolution: swapped to the lower-level `Azure.Monitor.OpenTelemetry.Exporter 1.7.0` (exactly what `MauiServiceDefaults` already uses client-side) with the three `AddAzureMonitor{Log,Metric,Trace}Exporter` calls. Added `OpenTelemetry.Instrumentation.AspNetCore` only to the API's csproj, and wired `.AddAspNetCoreInstrumentation()` from `Program.cs`. Net observability fidelity matches `UseAzureMonitor()`; MAUI stays buildable. The MAUI-safety note was already documented in the `maui-azure-monitor` skill; the sibling `aspnetcore-azure-monitor` skill in this PR captures the server flip-side.

Build proof
All zero-error:
- `dotnet build src/SentenceStudio.Api -f net10.0 -c Debug` ✅
- `dotnet build src/SentenceStudio.Api -f net10.0 -c Release` ✅
- `dotnet build src/SentenceStudio.WebApp -c Release` ✅
- `dotnet build src/SentenceStudio.Workers -c Release` ✅
- `dotnet build src/SentenceStudio.Marketing -c Release` ✅
- `dotnet build src/SentenceStudio.MacCatalyst -f net10.0-maccatalyst -c Debug` ✅ (the MAUI-safety proof)

Deploy + validation: TO BE FILLED IN BEFORE MARKING READY
Captain to confirm VPN off, then:
Then generate mobile→API traffic (any authenticated call from Mac Catalyst on DX24 or sim in Release), wait 2–5 min for ingestion, and paste results of these three KQL queries into a follow-up comment:
1. Server requests are flowing with the right role name
2. Mobile → API correlation (the money shot)
Expect ≥1 row with `client_role = SentenceStudio.Mobile.MacCatalyst` (or iOS).

3. Server-side exceptions land with the right role name
Trigger a malformed payload against any API endpoint, then:
Out of scope (follow-ups noted in `.squad/decisions/inbox/wash-server-appinsights-shipped.md`)
- Global `UseExceptionHandler` + `AddProblemDetails` middleware: landed in this PR as pre-deploy review fix.
- `BackgroundService` startup-failure wrapping in Workers (silent before OTel sees them).
- `/health` endpoint for ACA liveness probes, currently gated to `IsDevelopment()`.
- `SentenceStudio.WebServiceDefaults` is dead code (nobody references it); delete or migrate web projects to it in a separate PR.
- `appsettings.Production.json`: they'll export to App Insights the moment we do.

Known unrelated breakage
`ci.yml` has been red on `main` since ~Apr 17 (`wasm-tools` workload missing for `net10.0-ios` on Ubuntu). Pre-existing; not in scope here.

Dress rehearsal (Release-build, local, 2026-04-22)
Before risking `azd deploy`, the API was built Release and run standalone against a local Docker Postgres (`sstudio-pg-rehearsal`, `postgres:16`, port 5433) with `ASPNETCORE_ENVIRONMENT=Production` and the real `AzureMonitor:ConnectionString` from `appsettings.Production.json`. This activates the `#if !DEBUG` branch of `SentenceStudio.ServiceDefaults.AddOpenTelemetryExporters` → live telemetry to `sstudio-mobile-ai`. Zero blast radius to the deployed container.

Smoke tests (API on `https://localhost:7801`):
- `POST /api/auth/login` with bad creds → 401 ✅
- `GET /__debug/boom` (temp endpoint, removed before commit) → 500 + `application/problem+json` body ✅
- `POST /api/auth/login` with an injected W3C `traceparent: 00-5c4324bba96c15b5da00f712ac863982-d96513170a11dd97-01` → 401 ✅

After ~4 min ingestion wait, these KQL queries against `sstudio-mobile-ai` (App ID `74e94530-…`) all returned non-empty:

Query A: `exceptions` table (the Fix 🟡 proof)
4 rows returned (2 distinct operation_Ids × 2 log entries each from the ExceptionHandler + UnhandledException loggers):
Proves the `UseExceptionHandler` → `ILogger.LogError` → OTel log exporter → App Insights `exceptions` table chain works end-to-end. This is the check that was green-field in the review fix (PR commit `4ff69c7`).

Query B: `requests` table (instrumentation + role name)
4 rows:

Proves the `AddAspNetCoreInstrumentation` + `AddAzureMonitorTraceExporter` pipeline ships `requests` rows with the correct `cloud_RoleName = "SentenceStudio.Api"`.

Query C: W3C traceparent propagation (the correlation proof)
2 rows: the server adopted the injected trace id AND the injected span id as parent:

`operation_ParentId = d96513170a11dd97` exactly matches the span id we sent in the `traceparent` header. This is the proof that when Mac Catalyst (or any mobile head running `MauiServiceDefaults` with HttpClient instrumentation) calls the deployed API, the server span will inherit the mobile-originated trace id automatically; Query 2 of the pre-deploy proof block (mobile→API correlation) will light up the moment real mobile traffic hits the deployed container. And the Postgres dependency row in the same operation shows server-internal spans also hang off the correct trace, so mobile→API→DB will all chain under one operation_Id.

What's still unproven
- `azd deploy` proves it).
- `api.livelyforest-b32e7d63.centralus.azurecontainerapps.io`).

These are the residuals `azd deploy` is expected to cover. The dress rehearsal de-risks the code itself; leaving this PR draft for Captain to flip ready-for-review and run `azd deploy`.

Production deploy (2026-04-22)
Deploy tool: `azd deploy` (aspire deploy migration deferred; AppHost does not yet register `AddAzureContainerAppEnvironment`; tracked as separate follow-up)
Deploy duration: 2m 18s
Pre-deploy check: refreshed for Flexible-Server architecture (commit `56a98cf`)

`scripts/pre-deploy-check.sh`: ALL CHECKS PASSED
Flexible Server `db-3ovvqiybthkb6` state=Ready · both RG locks (`do-not-delete-db`, `do-not-delete-db-storage`) present · ACA env `cae-3ovvqiybthkb6` Succeeded · latest backup 18h old.

`scripts/post-deploy-validate.sh`: 17 PASS / 0 FAIL / 2 SKIP / 1 WARN
Phase 1 infra: all services Running on latest revision, no crash indicators, DB reachable, migrations ran (12 migration log entries). Phase 2 skipped (no `DEPLOY_TEST_PASSWORD` configured). Phase 4 regression: bootstrap/login/register/marketing all green. Warning was `workers` revision scaled to zero, which is expected for an on-demand job.

Query A: production API telemetry proof
Ran against `sstudio-mobile-ai` (AppId `74e94530-...b70f`) ~2 minutes after deploy:

Gotcha surfaced in prod, not caught by rehearsal: `cloud_RoleName` in Container Apps is `[cae-3ovvqiybthkb6]/SentenceStudio.Api` (the ACA env name gets prepended), NOT the plain `SentenceStudio.Api` the local dress rehearsal used. Existing KQL queries should use `endswith "SentenceStudio.Api"` or strip the bracketed prefix. Runbook note worth filing.

Postgres dependency spans: correlation substrate confirmed
3 rows, all pointing at `db-3ovvqiybthkb6.postgres.database.azure.com | sentencestudio`, each with a distinct `operation_Id`. This is the critical plumbing: when a mobile client's `traceparent` header lands on the API, the same `operation_Id` will chain through the API request span and the Postgres child dependency span, giving Captain full mobile → API → DB tracing under one trace id.

Mobile ↔ API correlation: pending next iOS publish
DX24 was unreachable via `xcrun devicectl` at deploy time and no mobile-tagged telemetry (`cloud_RoleName` containing Mobile/iOS/Maui) has arrived in the last 2 hours. The installed build on DX24 may predate PR #165's `e002d3e`. Not a blocker for merging this PR: the server side of the correlation is proven ready via the dependency span data above. The join query will light up on the next iOS publish to DX24 that carries PR #165's `MauiServiceDefaults` wiring.

Exceptions
0 rows, as expected (no errors occurred post-deploy). The `UseExceptionHandler`-before-Exporter fix from the rehearsal is still the authoritative proof that the `exceptions` table populates when errors do occur.

Follow-up work filed
See linked issues: #167 (aspire-deploy migration), #168 (Managed Identity for App Insights auth), #169 (JS bridge for Blazor WebView exceptions), #170 (OTel linker preserve configs for Android/iOS Release).
DX24 correlation lit — partial (2026-04-22)
Wash published iOS Release build to DX24 (iPhone 15 Pro, prod bundle) against this PR's API. Mobile telemetry is flowing to `sstudio-mobile-ai`: 701 traces from `cloud_RoleName = SentenceStudio.Mobile.iOS` in 15 minutes, sync-agent loop hammering `api.livelyforest-b32e7d63.centralus.azurecontainerapps.io` as expected.

API side (this PR): 72 requests, all named `/api/sync-agent/*`, all 200.

Correlation join: ❌ empty. Root cause: on every API request, `operation_Id == operation_ParentId`, meaning ASP.NET Core generated its own root trace because no `traceparent` header arrived from the mobile client. PR #165's mobile OTel setup wires up the logs exporter but not `AddHttpClientInstrumentation()` on the TracerProvider, so client-side HTTP calls emit log traces ("Sending HTTP request GET …") but no dependency spans, and no W3C context propagates.

Verdict: Server-side plumbing from this PR is fully operational. Mobile→API stitching needs a follow-up on #165 to add HttpClient tracer instrumentation. Filing as separate issue.