Server-side App Insights: close mobile↔API correlation loop #166

Merged
davidortinau merged 2 commits into main from squad/server-appinsights on Apr 22, 2026

@davidortinau davidortinau commented Apr 22, 2026

Server-side App Insights: close mobile↔API correlation loop

Companion to PR #165 (the mobile-side slice that shipped Azure.Monitor.OpenTelemetry.Exporter into SentenceStudio.MauiServiceDefaults). This PR wires the same App Insights resource (sstudio-mobile-ai) into the server tier via SentenceStudio.ServiceDefaults, giving us joined requests / dependencies rows keyed by operation_Id.

Agent: Wash (Backend Dev). Draft until the deploy + correlation proof below is filled in.

Cap raise

Daily ingestion cap on sstudio-mobile-ai raised from 0.5 GB → 2 GB before deploy so combined mobile + 4 server emitters (API, WebApp, Workers, Marketing) don't get throttled in the first day.

az monitor app-insights component billing update \
  --app sstudio-mobile-ai --resource-group rg-sstudio-prod \
  --cap 2 --stop false

Result (from az monitor app-insights component billing show …):

// BEFORE
{
  "currentBillingFeatures": ["Basic"],
  "dataVolumeCap": {
    "cap": 0.5,
    "maxHistoryCap": 1000.0,
    "resetTime": 0,
    "stopSendNotificationWhenHitCap": true,
    "stopSendNotificationWhenHitThreshold": false,
    "warningThreshold": 90
  }
}

// AFTER
{
  "currentBillingFeatures": ["Basic"],
  "dataVolumeCap": {
    "cap": 2.0,
    "maxHistoryCap": 1000.0,
    "resetTime": 0,
    "stopSendNotificationWhenHitCap": false,
    "stopSendNotificationWhenHitThreshold": false,
    "warningThreshold": 90
  }
}

CLI quirk: --stop / -s only exposes stopSendNotificationWhenHitCap. There is no --stop-sending-notification-when-hitting-threshold flag despite what some docs claim — the CLI rejects it. Notifications at 90% threshold remain enabled.
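A quick way to confirm what the cap raise actually touched is to diff the two read-backs. The sketch below is illustrative Python (not part of the PR); the JSON is copied from the BEFORE/AFTER blocks above, and the helper name `changed_fields` is hypothetical.

```python
import json

# Read-backs copied from the BEFORE / AFTER blocks above (comment lines removed).
BEFORE = json.loads("""
{
  "currentBillingFeatures": ["Basic"],
  "dataVolumeCap": {
    "cap": 0.5,
    "maxHistoryCap": 1000.0,
    "resetTime": 0,
    "stopSendNotificationWhenHitCap": true,
    "stopSendNotificationWhenHitThreshold": false,
    "warningThreshold": 90
  }
}
""")

AFTER = json.loads("""
{
  "currentBillingFeatures": ["Basic"],
  "dataVolumeCap": {
    "cap": 2.0,
    "maxHistoryCap": 1000.0,
    "resetTime": 0,
    "stopSendNotificationWhenHitCap": false,
    "stopSendNotificationWhenHitThreshold": false,
    "warningThreshold": 90
  }
}
""")


def changed_fields(before: dict, after: dict) -> dict:
    """Return {field: (old, new)} for each dataVolumeCap field that differs."""
    old, new = before["dataVolumeCap"], after["dataVolumeCap"]
    return {key: (old[key], new.get(key)) for key in old if old[key] != new.get(key)}


CAP_CHANGES = changed_fields(BEFORE, AFTER)
print(CAP_CHANGES)
```

Only `cap` and `stopSendNotificationWhenHitCap` differ, which matches the two flags passed to the CLI (`--cap 2 --stop false`); the threshold setting is untouched.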

Exception handler

Added a global app.UseExceptionHandler(…) as the first middleware in src/SentenceStudio.Api/Program.cs (just after builder.Build(), lines 276–302) that logs unhandled exceptions via a named UnhandledException ILogger and returns application/problem+json 500. This is required on top of AddAspNetCoreInstrumentation — that instrumentation tags the request span with exception events but does NOT produce rows in App Insights' exceptions table (those come only from ILogger records carrying an Exception, shipped through the OTel log exporter).

Smoke-validated locally via a temporary /__debug/boom endpoint (removed before commit): HTTP 500 + problem+json body, fail: UnhandledException[0] log line with full stack, process kept running for further requests.
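The shape of the handler, sketched in Python purely for illustration (the PR's real handler is C# middleware and is not reproduced here). The point it shows: the exception is attached to a log record, which is what a log exporter needs, and the client gets a problem+json 500. The function names and the `title` text are assumptions.

```python
import json
import logging

# Logger name mirrors the named "UnhandledException" logger described above.
logger = logging.getLogger("UnhandledException")


def run_with_exception_handler(handler):
    """Call handler(); on an unhandled exception, log it and return a 500 response."""
    try:
        return handler()
    except Exception as exc:
        # exc_info attaches the exception object to the log record itself,
        # rather than only tagging the surrounding request span.
        logger.error("Unhandled exception", exc_info=exc)
        body = json.dumps({"title": "Internal Server Error", "status": 500})
        return 500, {"Content-Type": "application/problem+json"}, body


def boom():
    """Stand-in for the temporary /__debug/boom endpoint."""
    raise RuntimeError("boom")


if __name__ == "__main__":
    print(run_with_exception_handler(boom))
```

The process keeps running after the call returns, which is the same property the smoke test checked.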

What's in this PR

| File | Change |
| --- | --- |
| src/SentenceStudio.ServiceDefaults/SentenceStudio.ServiceDefaults.csproj | OTel → 1.15.x; added Azure.Monitor.OpenTelemetry.Exporter 1.7.0. |
| src/SentenceStudio.ServiceDefaults/Extensions.cs | AddServiceDefaults(..., cloudRoleName); ConfigureResource(AddService(roleName)); #if !DEBUG three-exporter Azure Monitor block gated on AzureMonitor:ConnectionString. |
| src/SentenceStudio.AppLib/SentenceStudio.AppLib.csproj | OTel 1.11.x → 1.15.x to clear NU1605 downgrades surfaced by the ServiceDefaults bump. |
| src/SentenceStudio.Api/SentenceStudio.Api.csproj | Added OpenTelemetry.Instrumentation.AspNetCore 1.15.0 locally (kept out of shared defaults; see MAUI-safety note below). |
| src/SentenceStudio.Api/Program.cs | AddServiceDefaults("SentenceStudio.Api"); local .WithMetrics/.WithTracing with AddAspNetCoreInstrumentation(); global UseExceptionHandler → ILogger.LogError (new; pre-deploy review fix). |
| src/SentenceStudio.Api/appsettings.Production.json | Added AzureMonitor:ConnectionString (write-only ingestion key, same as mobile's; intentional reuse). |
| src/SentenceStudio.WebApp/Program.cs, Workers/Program.cs, Marketing/Program.cs | Each now passes its own cloud_RoleName literal. No connection string shipped in their appsettings.Production.json yet → they stay OTLP-only until Captain opts them in. |
| .squad/skills/aspnetcore-azure-monitor/SKILL.md | Sibling to maui-azure-monitor/SKILL.md. Captures the MAUI-safe server pattern + the exception-handler recipe + the cap-raise az CLI recipe. |
| .squad/agents/wash/history.md | Appended learnings sections for 2026-04-22 (server slice + review fixes). |

Locked-decisions adherence

  • ONE App Insights resource (sstudio-mobile-ai, workspace-backed by law-3ovvqiybthkb6): ✅ reused verbatim, same connection string as mobile.
  • Constant literal cloud_RoleName, no runtime detection: ✅ "SentenceStudio.Api", "SentenceStudio.WebApp", "SentenceStudio.Workers", "SentenceStudio.Marketing", all passed from their respective Program.cs.
  • Local-dev null-out, OTLP → Aspire dashboard preserved: ✅ Azure Monitor wiring is #if !DEBUG. Container builds are Release so it activates in prod; aspire run is Debug so it stays OTLP-only. No double-export.
  • No Aspire.Hosting.AzureMonitor integration in AppHost: ✅ verified absent. The manual wiring in ServiceDefaults is the only path.
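The two gates on Azure Monitor export (Release build, plus a configured AzureMonitor:ConnectionString) reduce to a small truth table. A Python sketch of that decision, for illustration only; the function name and the placeholder connection string are hypothetical, and the real gate is a C# `#if !DEBUG` block in ServiceDefaults.

```python
from typing import Optional


def azure_monitor_enabled(is_debug_build: bool, connection_string: Optional[str]) -> bool:
    """Model the two gates: a Release build AND a non-empty connection string."""
    if is_debug_build:
        # Models the `#if !DEBUG` compile-time gate: Debug builds never wire Azure Monitor.
        return False
    return bool(connection_string and connection_string.strip())


# Placeholder value; the real connection string is not reproduced here.
FAKE_CONNECTION_STRING = "InstrumentationKey=<placeholder>"

SCENARIOS = {
    "aspire run (Debug)": azure_monitor_enabled(True, FAKE_CONNECTION_STRING),
    "API container (Release, string set)": azure_monitor_enabled(False, FAKE_CONNECTION_STRING),
    "WebApp container (Release, no string)": azure_monitor_enabled(False, None),
}

for name, enabled in SCENARIOS.items():
    print(f"{name}: Azure Monitor export {'on' if enabled else 'off'}")
```

Under this model only the API container exports to App Insights today; WebApp, Workers, and Marketing flip on as soon as their appsettings.Production.json carries the string.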

MAUI-safety pivot (deviation from task brief)

The task brief suggested Azure.Monitor.OpenTelemetry.AspNetCore 1.4.0 + UseAzureMonitor(). That package transitively pulls OpenTelemetry.Instrumentation.AspNetCore, which declares <FrameworkReference Include="Microsoft.AspNetCore.App" />. There is no Microsoft.AspNetCore.App runtime pack for maccatalyst-arm64 / ios-arm64 / android-* RIDs, so putting it in ServiceDefaults broke every MAUI head with NETSDK1082.

ServiceDefaults is consumed by web hosts and (transitively, via AppLib) by every MAUI head. So the .AspNetCore variant of Azure Monitor simply can't live in the shared project.

Resolution: swapped to the lower-level Azure.Monitor.OpenTelemetry.Exporter 1.7.0 — exactly what MauiServiceDefaults already uses client-side — with the three AddAzureMonitor{Log,Metric,Trace}Exporter calls. Added OpenTelemetry.Instrumentation.AspNetCore only to the API's csproj, and wired .AddAspNetCoreInstrumentation() from Program.cs. Net observability fidelity matches UseAzureMonitor(); MAUI stays buildable. The MAUI-safety note was already documented in the maui-azure-monitor skill; the sibling aspnetcore-azure-monitor skill in this PR captures the server flip-side.

Build proof

All zero-error:

  • dotnet build src/SentenceStudio.Api -f net10.0 -c Debug
  • dotnet build src/SentenceStudio.Api -f net10.0 -c Release
  • dotnet build src/SentenceStudio.WebApp -c Release
  • dotnet build src/SentenceStudio.Workers -c Release
  • dotnet build src/SentenceStudio.Marketing -c Release
  • dotnet build src/SentenceStudio.MacCatalyst -f net10.0-maccatalyst -c Debug ✅ (the MAUI-safety proof)

Deploy + validation — TO BE FILLED IN BEFORE MARKING READY

Captain to confirm VPN off, then:

./scripts/pre-deploy-check.sh         # resource locks, DB, volume mount, storage, file share
azd deploy                            # full-stack; azure.yaml has a single `app` mapping AppHost
./scripts/post-deploy-validate.sh     # infra + smoke

Then generate mobile→API traffic (any authenticated call from Mac Catalyst on DX24 or sim in Release), wait 2–5 min for ingestion, and paste results of these three KQL queries into a follow-up comment:

1. Server requests are flowing with the right role name

requests
| where timestamp > ago(15m)
| where cloud_RoleName == "SentenceStudio.Api"
| summarize count() by name, resultCode
| order by count_ desc

2. Mobile → API correlation (the money shot)

requests
| where timestamp > ago(15m)
| where cloud_RoleName == "SentenceStudio.Api"
| project timestamp, name, operation_Id, operation_ParentId, success, resultCode
| join kind=leftouter (
    dependencies
    | where cloud_RoleName startswith "SentenceStudio.Mobile"
    | project depTimestamp=timestamp, depName=name, operation_Id, client_role=cloud_RoleName
  ) on operation_Id
| where isnotempty(client_role)

Expect ≥1 row with client_role = SentenceStudio.Mobile.MacCatalyst (or iOS).
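What query 2 computes, sketched in Python over made-up rows (every row and operation id below is hypothetical sample data, not telemetry from this PR). The leftouter join followed by `isnotempty(client_role)` keeps only API requests that share an operation_Id with a mobile dependency, so it behaves like an inner join.

```python
def correlate(requests: list, dependencies: list) -> list:
    """Pair each API request with mobile dependency rows sharing its operation_Id."""
    deps_by_operation = {}
    for dep in dependencies:
        if dep["cloud_RoleName"].startswith("SentenceStudio.Mobile"):
            deps_by_operation.setdefault(dep["operation_Id"], []).append(dep)

    joined = []
    for req in requests:
        if req["cloud_RoleName"] != "SentenceStudio.Api":
            continue
        for dep in deps_by_operation.get(req["operation_Id"], []):
            joined.append({
                "name": req["name"],
                "operation_Id": req["operation_Id"],
                "client_role": dep["cloud_RoleName"],
            })
    return joined


# Hypothetical sample rows: op-1 was started by the mobile client, op-2 was not.
REQUESTS = [
    {"name": "GET /api/example", "operation_Id": "op-1", "cloud_RoleName": "SentenceStudio.Api"},
    {"name": "GET /api/example", "operation_Id": "op-2", "cloud_RoleName": "SentenceStudio.Api"},
]
DEPENDENCIES = [
    {"name": "GET /api/example", "operation_Id": "op-1",
     "cloud_RoleName": "SentenceStudio.Mobile.MacCatalyst"},
]

JOINED = correlate(REQUESTS, DEPENDENCIES)
print(JOINED)
```

An empty result therefore means one of two things: no mobile dependency rows exist at all, or the server started its own operation_Id instead of adopting the client's.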

3. Server-side exceptions land with the right role name
Trigger a malformed payload against any API endpoint, then:

exceptions
| where timestamp > ago(15m)
| where cloud_RoleName == "SentenceStudio.Api"
| project timestamp, type, outerMessage, operation_Id, cloud_RoleInstance

Out of scope (follow-ups noted in .squad/decisions/inbox/wash-server-appinsights-shipped.md)

  • Custom sampling / telemetry processors.
  • Alerts + dashboards (5xx spike, OpenAI failure rate, mobile↔API latency).
  • Global UseExceptionHandler + AddProblemDetails middleware: ✅ landed in this PR as a pre-deploy review fix, so no longer a follow-up.
  • BackgroundService startup-failure wrapping in Workers (silent before OTel sees them).
  • /health endpoint for ACA liveness probes — currently gated to IsDevelopment().
  • SentenceStudio.WebServiceDefaults is dead code (nobody references it) — delete or migrate web projects to it in a separate PR.
  • Rolling out the connection string to WebApp/Workers/Marketing appsettings.Production.json — they'll export to App Insights the moment we do.
  • Managed Identity instead of write-only connection string — 🟠 follow-up only; out of scope for this PR.

Known unrelated breakage

ci.yml has been red on main since ~Apr 17 (wasm-tools workload missing for net10.0-ios on Ubuntu). Pre-existing; not in scope here.

Dress rehearsal (Release-build, local — 2026-04-22)

Before risking azd deploy, the API was built Release and run standalone against a local Docker Postgres (sstudio-pg-rehearsal, postgres:16, port 5433) with ASPNETCORE_ENVIRONMENT=Production and the real AzureMonitor:ConnectionString from appsettings.Production.json. This activates the #if !DEBUG branch of SentenceStudio.ServiceDefaults.AddOpenTelemetryExporters → live telemetry to sstudio-mobile-ai. Zero blast radius to the deployed container.

Smoke tests (API on https://localhost:7801):

  • POST /api/auth/login with bad creds → 401
  • GET /__debug/boom (temp endpoint, removed before commit) → 500 + application/problem+json body ✅
  • POST /api/auth/login with an injected W3C traceparent: 00-5c4324bba96c15b5da00f712ac863982-d96513170a11dd97-01 → 401

After ~4 min ingestion wait, these KQL queries against sstudio-mobile-ai (App ID 74e94530-…) all returned non-empty:

Query A — exceptions table (the Fix 🟡 proof)

exceptions
| where timestamp > ago(15m)
| where cloud_RoleName == "SentenceStudio.Api"
| where outerMessage has "Dress rehearsal"
| project timestamp, type, outerMessage, cloud_RoleName, operation_Id
| order by timestamp desc
| take 5

4 rows returned (2 distinct operation_Ids × 2 log entries each from the ExceptionHandler + UnhandledException loggers):

| timestamp | type | outerMessage | cloud_RoleName | operation_Id |
| --- | --- | --- | --- | --- |
| 2026-04-22T01:44:30.66344Z | System.InvalidOperationException | Dress rehearsal: server-side AppInsights exception capture | SentenceStudio.Api | f024e18789cefa6595e8d17a3addf15b |
| 2026-04-22T01:44:30.663304Z | System.InvalidOperationException | Dress rehearsal: server-side AppInsights exception capture | SentenceStudio.Api | f024e18789cefa6595e8d17a3addf15b |
| 2026-04-22T01:44:30.627562Z | System.InvalidOperationException | Dress rehearsal: server-side AppInsights exception capture | SentenceStudio.Api | 41700dd3f0449dce514423173427f4b9 |
| 2026-04-22T01:44:30.627123Z | System.InvalidOperationException | Dress rehearsal: server-side AppInsights exception capture | SentenceStudio.Api | 41700dd3f0449dce514423173427f4b9 |

Proves the UseExceptionHandler → ILogger.LogError → OTel log exporter → App Insights exceptions table chain works end-to-end. This is the check that was green-field in the review fix (PR commit 4ff69c7).

Query B — requests table (instrumentation + role name)

requests
| where timestamp > ago(15m)
| where cloud_RoleName == "SentenceStudio.Api"
| project timestamp, name, resultCode, duration, operation_Id
| order by timestamp desc
| take 10

4 rows:

| timestamp | name | resultCode | duration (ms) | operation_Id |
| --- | --- | --- | --- | --- |
| 2026-04-22T01:44:30.834154Z | POST /api/auth/login | 401 | 5.528 | 5c4324bba96c15b5da00f712ac863982 |
| 2026-04-22T01:44:30.661627Z | GET /__debug/boom | 500 | 1.85 | f024e18789cefa6595e8d17a3addf15b |
| 2026-04-22T01:44:30.612819Z | GET /__debug/boom | 500 | 15.023 | 41700dd3f0449dce514423173427f4b9 |
| 2026-04-22T01:44:30.371703Z | POST /api/auth/login | 401 | 207.482 | 87dc748e91e149db7bab784a52ef54cf |

Proves AddAspNetCoreInstrumentation + AddAzureMonitorTraceExporter pipeline ships requests rows with the correct cloud_RoleName = "SentenceStudio.Api".

Query C — W3C traceparent propagation (the correlation proof)

union requests, dependencies
| where timestamp > ago(15m)
| where operation_Id == "5c4324bba96c15b5da00f712ac863982"
| project timestamp, itemType, name, cloud_RoleName, operation_Id, operation_ParentId

2 rows — the server adopted the injected trace id AND the injected span id as parent:

| timestamp | itemType | name | cloud_RoleName | operation_Id | operation_ParentId |
| --- | --- | --- | --- | --- | --- |
| 2026-04-22T01:44:30.834154Z | request | POST /api/auth/login | SentenceStudio.Api | 5c4324bba96c15b5da00f712ac863982 | d96513170a11dd97 |
| 2026-04-22T01:44:30.838081Z | dependency | postgresql | SentenceStudio.Api | 5c4324bba96c15b5da00f712ac863982 | ab809153d6e3145d |

operation_ParentId = d96513170a11dd97 exactly matches the span id we sent in the traceparent header. This is the proof that when Mac Catalyst (or any mobile head running MauiServiceDefaults with HttpClient instrumentation) calls the deployed API, the server span will inherit the mobile-originated trace id automatically — Query 2 of the pre-deploy proof block (mobile→API correlation) will light up the moment real mobile traffic hits the deployed container. And the Postgres dependency row in the same operation shows server-internal spans also hang off the correct trace, so mobile→API→DB will all chain under one operation_Id.
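For reference, how the injected header maps onto those two columns. A W3C traceparent has four hyphen-separated hex fields: version, trace id (32 hex chars), parent span id (16 hex chars), and flags. The Python parser below is an illustrative sketch (the helper name is hypothetical); the header value is the one sent in the smoke test.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"not a valid traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,    # surfaces as operation_Id in App Insights
        "parent_id": parent_id,  # surfaces as operation_ParentId on the request row
        "sampled": bool(int(flags, 16) & 0x01),
    }


PARSED = parse_traceparent("00-5c4324bba96c15b5da00f712ac863982-d96513170a11dd97-01")
print(PARSED)
```

The `trace_id` and `parent_id` fields line up with the operation_Id and operation_ParentId of the request row in Query C.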

What's still unproven

  • Container Apps startup path (only azd deploy proves it).
  • Production DNS (api.livelyforest-b32e7d63.centralus.azurecontainerapps.io).
  • DX24 → real API correlation under production load.

These are the residuals azd deploy is expected to cover. The dress rehearsal de-risks the code itself; leaving this PR draft for Captain to flip ready-for-review and run azd deploy.


Production deploy (2026-04-22)

Deploy tool: azd deploy (aspire deploy migration deferred — AppHost does not yet register AddAzureContainerAppEnvironment; tracked as separate follow-up)
Deploy duration: 2m 18s
Pre-deploy check: refreshed for Flexible-Server architecture — commit 56a98cf

scripts/pre-deploy-check.sh — ALL CHECKS PASSED

Flexible Server db-3ovvqiybthkb6 state=Ready · both RG locks (do-not-delete-db, do-not-delete-db-storage) present · ACA env cae-3ovvqiybthkb6 Succeeded · latest backup 18h old.

scripts/post-deploy-validate.sh — 17 PASS / 0 FAIL / 2 SKIP / 1 WARN

Phase 1 infra: all services Running on latest revision, no crash indicators, DB reachable, migrations ran (12 migration log entries). Phase 2 skipped (no DEPLOY_TEST_PASSWORD configured). Phase 4 regression: bootstrap/login/register/marketing all green. Warning was workers revision scaled to zero — expected for an on-demand job.

Query A — production API telemetry proof

Ran against sstudio-mobile-ai (AppId 74e94530-...b70f) ~2 minutes after deploy:

requests
| where timestamp > ago(30m)
| where cloud_RoleName endswith "SentenceStudio.Api"
| summarize count=count() by resultCode, name
| order by count desc

| resultCode | name | count |
| --- | --- | --- |
| 401 | POST /api/auth/login | 3 |
| 404 | GET /bootstrap | 3 |
| 404 | POST /auth/login | 3 |
| 401 | GET /api/v1/auth/bootstrap | 2 |
| 400 | POST /api/auth/register | 1 |
| 404 | GET / | 1 |

Gotcha surfaced in prod, not caught by rehearsal: cloud_RoleName in Container Apps is [cae-3ovvqiybthkb6]/SentenceStudio.Api (the ACA env name gets prepended), NOT the plain SentenceStudio.Api the local dress rehearsal used. Existing KQL queries should use endswith "SentenceStudio.Api" or strip the bracketed prefix. Runbook note worth filing.
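For the "strip the bracketed prefix" option, a small Python sketch of the normalization (illustrative; the helper name is hypothetical, and it assumes the prefix always has the `[env-name]/` shape observed above).

```python
import re

# Matches a leading "[<env name>]/" such as "[cae-3ovvqiybthkb6]/".
_ACA_PREFIX = re.compile(r"^\[[^\]]+\]/")


def normalize_role_name(role_name: str) -> str:
    """Drop the Container Apps environment prefix, if present."""
    return _ACA_PREFIX.sub("", role_name)


PROD_ROLE = normalize_role_name("[cae-3ovvqiybthkb6]/SentenceStudio.Api")
LOCAL_ROLE = normalize_role_name("SentenceStudio.Api")
print(PROD_ROLE, LOCAL_ROLE)
```

Both inputs normalize to the same value, so dashboards keyed on the normalized name work for local rehearsal and production rows alike.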

Postgres dependency spans — correlation substrate confirmed

dependencies
| where timestamp > ago(30m)
| where cloud_RoleName endswith "SentenceStudio.Api"
| project target, name, type, operation_Id

3 rows, all pointing at db-3ovvqiybthkb6.postgres.database.azure.com | sentencestudio, each with a distinct operation_Id. This is the critical plumbing: when a mobile client's traceparent header lands on the API, the same operation_Id will chain through the API request span and the Postgres child dependency span — giving Captain full mobile → API → DB tracing under one trace id.

Mobile ↔ API correlation — pending next iOS publish

DX24 was unreachable via xcrun devicectl at deploy time and no mobile-tagged telemetry (cloud_RoleName containing Mobile/iOS/Maui) has arrived in the last 2 hours. The installed build on DX24 may predate PR #165's e002d3e. Not a blocker for merging this PR — the server side of the correlation is proven ready via the dependency span data above. The join query will light up on the next iOS publish to DX24 that carries PR #165's MauiServiceDefaults wiring.

Exceptions

exceptions
| where timestamp > ago(30m)
| where cloud_RoleName endswith "SentenceStudio.Api"

0 rows — expected (no errors occurred post-deploy). The UseExceptionHandler-before-Exporter fix from the rehearsal is still the authoritative proof that the exceptions table populates when errors do occur.

Follow-up work filed

See linked issues: #167 (aspire-deploy migration), #168 (Managed Identity for App Insights auth), #169 (JS bridge for Blazor WebView exceptions), #170 (OTel linker preserve configs for Android/iOS Release).


DX24 correlation lit — partial (2026-04-22)

Wash published iOS Release build to DX24 (iPhone 15 Pro, prod bundle) against this PR's API. Mobile telemetry is flowing to sstudio-mobile-ai — 701 traces from cloud_RoleName = SentenceStudio.Mobile.iOS in 15 minutes, sync-agent loop hammering api.livelyforest-b32e7d63.centralus.azurecontainerapps.io as expected.

API side (this PR): 72 requests, all named /api/sync-agent/*, all 200.

Correlation join: ❌ empty. Root cause: on every API request, operation_Id == operation_ParentId, meaning ASP.NET Core generated its own root trace because no traceparent header arrived from the mobile client. PR #165's mobile OTel setup wires up the logs exporter but not AddHttpClientInstrumentation() on the TracerProvider — so client-side HTTP calls emit log traces ("Sending HTTP request GET …") but no dependency spans, and no W3C context propagates.
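The check used here, sketched in Python for illustration (the helper name is hypothetical). The first row is the rehearsal request from Query C, which adopted an injected parent; the second is a made-up row shaped like the DX24 traffic, where the two ids are equal.

```python
def has_upstream_parent(row: dict) -> bool:
    """True when the request's parent id differs from its operation id,
    i.e. the server adopted a parent span from an incoming traceparent."""
    parent = row.get("operation_ParentId")
    return bool(parent) and parent != row["operation_Id"]


# Rehearsal row from Query C: parent id came from the injected header.
REHEARSAL_ROW = {
    "operation_Id": "5c4324bba96c15b5da00f712ac863982",
    "operation_ParentId": "d96513170a11dd97",
}

# Hypothetical row shaped like the DX24 traffic: server-generated root trace.
ROOT_ROW = {
    "operation_Id": "0123456789abcdef0123456789abcdef",
    "operation_ParentId": "0123456789abcdef0123456789abcdef",
}

print(has_upstream_parent(REHEARSAL_ROW), has_upstream_parent(ROOT_ROW))
```

If every API request looks like the second row, the join cannot match regardless of what the mobile side exports.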

Verdict: Server-side plumbing from this PR is fully operational. Mobile→API stitching needs a follow-up on #165 to add HttpClient tracer instrumentation. Filing as separate issue.

davidortinau and others added 2 commits April 21, 2026 19:28
…PI (close mobile↔API correlation loop)

Adds server-side Application Insights via Azure.Monitor.OpenTelemetry.Exporter
into SentenceStudio.ServiceDefaults so API, WebApp, Workers, and Marketing all
tag themselves with distinct cloud_RoleName values and export to the same
sstudio-mobile-ai resource the mobile client already uses (PR #165). With
HttpClient instrumentation on the client and AspNetCore instrumentation on
the server, W3C traceparent propagates automatically and requests join to
dependencies on operation_Id.

Critical MAUI-safety pivot: Azure.Monitor.OpenTelemetry.AspNetCore transitively
requires Microsoft.AspNetCore.App, which has no runtime pack for maccatalyst-*/
ios-*/android-* RIDs. ServiceDefaults is consumed by every MAUI head via AppLib,
so the AspNetCore variant breaks MAUI builds (NETSDK1082). Using the lower-level
Exporter package (same one mobile uses) with the three AddAzureMonitor{Log,
Metric,Trace}Exporter calls in shared defaults keeps MAUI buildable; AspNetCore
instrumentation is added only to the API csproj and wired from Program.cs.

OTel stack bumped to 1.15.x across ServiceDefaults and AppLib to stay aligned
with MauiServiceDefaults and clear NU1605 downgrade errors in web projects.

Azure Monitor wiring gated #if !DEBUG so local aspire run keeps streaming to
the Aspire dashboard via OTLP without dual-exporting. AzureMonitor:ConnectionString
committed to Api/appsettings.Production.json (write-only key, same approach
mobile slice used; ingestion capped at 0.5 GB/day).

Companion skill at .squad/skills/aspnetcore-azure-monitor/SKILL.md documents
the MAUI-safe server pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ASP.NET Core OTel instrumentation tags exceptions on the request span
but does NOT produce rows in App Insights' exceptions table — that's
populated only from ILogger records carrying an Exception.

Wire UseExceptionHandler as the first middleware, log via a named
'UnhandledException' ILogger, and write a ProblemDetails 500. The OTel
log exporter ships the record so KQL like
  exceptions | where cloud_RoleName == 'SentenceStudio.Api'
surfaces unhandled controller / minimal-API throws.

Placement: before UseAuthentication so exceptions in auth handlers and
custom middleware are also caught.

Smoke-validated locally by temporarily adding /__debug/boom (removed
before commit): HTTP 500 + application/problem+json body, fail:
UnhandledException[0] log line with full stack, no process crash.

Also:
- aspnetcore-azure-monitor SKILL: replace 'no middleware needed' claim
  with the correct pattern; add az monitor app-insights component
  billing update recipe for daily-cap management.
- wash history: learnings for the cap raise CLI and the span-vs-
  exceptions-row distinction.

Companion: 0.5 GB/day → 2 GB/day cap raise on sstudio-mobile-ai
(done via az CLI, read-back archived in PR #166 body).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@davidortinau davidortinau marked this pull request as ready for review April 22, 2026 02:29
@davidortinau davidortinau merged commit 9e8c2b4 into main Apr 22, 2026
2 of 6 checks passed
@davidortinau davidortinau deleted the squad/server-appinsights branch April 22, 2026 02:29
davidortinau added a commit that referenced this pull request Apr 22, 2026
Big one: always read AppHost.cs before recommending a deploy-tool
flip. Plus: stale safety scripts are worse than no safety scripts,
ACA prepends env name to cloud_RoleName (prod-only KQL gotcha),
and azd deploy on Flexible-Server architecture is clean.
davidortinau added a commit that referenced this pull request Apr 22, 2026
…tion

PR #165 (mobile) + PR #166 (server) shipped the App Insights pipeline end
to end, but the mobile↔API correlation join in App Insights was returning
zero rows. Every server request had `operation_Id == operation_ParentId`
— i.e., no `traceparent` header was arriving from the device.

Diagnosis (see #171):

- `OpenTelemetry.Instrumentation.Http`'s `AddHttpClientInstrumentation()`
  was already on the MAUI TracerProvider since commit 216a2da and a
  trim-disabled Release build on DX24 produced the same zero-span result,
  so neither registration nor trimming was the problem.
- Mobile logs had empty `operation_Id` across the board, confirming no
  ambient `Activity` ever existed on the device.
- Root cause (tracked in #171): MAUI's `MauiApp` doesn't run
  `IHostedService` instances, so the `TelemetryHostedService` that would
  normally materialize the TracerProvider and attach its listeners never
  runs. Logs work because they hook `ILoggerFactory` synchronously; the
  tracer path needs the hosted-service startup.

This PR:

- Adds `ApiActivityHandler`, a `DelegatingHandler` that starts a
  `Client` Activity per outbound API call using a dedicated
  `ActivitySource` (`SentenceStudio.Mobile.HttpClient`). With an Activity
  current, HttpClient's built-in `DiagnosticsHandler` auto-injects the
  W3C `traceparent` header.
- Registers the new ActivitySource on the mobile TracerProvider via
  `.AddSource(...)` in `MauiServiceDefaults.Extensions` so the spans
  actually export.
- Wires the handler onto every API-bound HttpClient: CoreSync's
  `HttpClientToServer`, the auth client, the four typed API clients,
  and `VersionCheckService`. The handler is placed FIRST in the chain
  so the span wraps the full operation including auth token attachment.
- Hardens `OpenTelemetryInitializer` to call `GetRequiredService<T>()`
  instead of the nullable `GetService<T>()` for all three providers, so
  a misregistration fails loudly at startup instead of silently breaking
  telemetry at runtime.

Out of scope (explicitly):

- Root-cause fix for the IHostedService gap — tracked in #171.
- The raw `new HttpClient()` in `SentenceStudio.Shared/Services/AiService.cs:93`
  — bypasses `HttpClientFactory` entirely. Separate refactor.
- The KQL in `docs/deploy-runbook.md` is still wrong (joins requests to
  requests; should be dependencies to requests). Separate doc PR.

Verification: Mac Catalyst Debug + Release both build clean.
Post-merge verification will be an iOS publish to DX24 + KQL query for
non-empty `operation_ParentId` on server requests.
davidortinau added a commit that referenced this pull request Apr 22, 2026
…ile↔API correlation (#172)

* fix(mobile): wrap API HttpClients with ApiActivityHandler for correlation

* fix(mobile): use Activity.AddException for OTel-conformant exception recording

Code review feedback on #172: exceptions should be recorded as Activity
events (via AddException/RecordException), not raw tags. Emits the
standard OTel 'exception' event with type/message/stacktrace, which
surfaces in App Insights' exception timeline rather than being tag-only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
davidortinau added a commit that referenced this pull request Apr 22, 2026
…tion (#173)

PR #172 got mobile HttpClient dependency spans emitting with operation_Id,
but the correlation join against API requests still returned zero rows:
the API saw every incoming request without a traceparent header and started
a fresh operation_Id.

Root cause: HttpClient's built-in DiagnosticsHandler only injects traceparent
automatically when an OTel-style ActivityListener is attached to
"System.Net.Http". On MAUI the listener never attaches because OpenTelemetry's
TelemetryHostedService — which wires listeners to the TracerProvider — relies
on IHostedService, and MauiApp doesn't run hosted services (issue #171).

Fix: have ApiActivityHandler explicitly call
DistributedContextPropagator.Current.Inject(...) on the outbound request
headers after starting its Activity. Guards against double-injection if a
caller or a resilience retry already set traceparent.

This is the user-space workaround to #171. Framework fix is still desirable
but now lower priority.

Verification plan: re-run the App Insights correlation join; expect
requests | join dependencies on operation_Id to return > 0 rows for the
mobile role name.

Refs: #165 #166 #172 #171
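The inject-unless-present guard described in this commit message, sketched in Python for illustration. The actual fix calls DistributedContextPropagator.Current.Inject(...) in C#; the function names below are hypothetical and the ids are generated randomly.

```python
import secrets


def new_ids() -> tuple:
    """Generate a random 16-byte trace id and 8-byte span id as lowercase hex."""
    return secrets.token_hex(16), secrets.token_hex(8)


def inject_traceparent(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> bool:
    """Add a W3C traceparent header unless one is already present.

    Returns True if the header was injected, False if an existing one was kept.
    """
    # Header names are case-insensitive, so compare in lowercase.
    if any(name.lower() == "traceparent" for name in headers):
        return False
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return True


HEADERS = {"Authorization": "Bearer <token>"}
TRACE_ID, SPAN_ID = new_ids()
FIRST = inject_traceparent(HEADERS, TRACE_ID, SPAN_ID)    # header absent: injected
SECOND = inject_traceparent(HEADERS, *new_ids())          # e.g. a retry: existing header kept
print(FIRST, SECOND, HEADERS["traceparent"])
```

Keeping the existing header on the second call is what prevents a resilience retry from re-parenting the request under a different trace.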