Mobile App Insights: small slice (Mac Catalyst) #165

Merged
davidortinau merged 5 commits into main from squad/mobile-appinsights-slice on Apr 21, 2026

@davidortinau

Owner

Ships the first-increment (~3 hour) mobile Application Insights slice per Captain's path 1 go approval. Wires Azure.Monitor.OpenTelemetry.Exporter into the existing MAUI OTel pipeline, subscribes to MauiExceptions.UnhandledException, and embeds the connection string in appsettings.Production.json (write-only ingestion key — embedding is the documented pattern; per the decisions memo wash-mobile-appinsights-answers).

In scope

  • Azure resource sstudio-mobile-ai in rg-sstudio-prod, workspace-linked to law-3ovvqiybthkb6, daily cap 0.5 GB.
  • Azure.Monitor.OpenTelemetry.Exporter 1.7.0 added to SentenceStudio.MauiServiceDefaults. Bumped OpenTelemetry.Extensions.Hosting / Exporter.OTLP / Instrumentation.Http to 1.15.x to satisfy Azure Monitor's transitive floor.
  • cloud_RoleName set explicitly via ResourceBuilder.AddService("SentenceStudio.Mobile.<DeviceInfo.Platform>").
  • SentenceStudioAppBuilder.InitializeApp subscribes once to MauiExceptions.UnhandledException → ILogger.LogCritical → best-effort ForceFlush(3000) on all three OTel providers before the process dies.
  • appsettings.Production.json updated with AzureMonitor:ConnectionString.
  • Fixed pre-existing Release-build break: MacCatalyst/MauiProgram.cs had unguarded using Microsoft.Maui.DevFlow.* even though those packages are Condition='$(Configuration)'=='Debug'. Wrapped the usings in #if DEBUG.

Out of scope (deferred to full plan)

  • Blazor WebView JS error bridge (window.onerror, unhandledrejection)
  • Android / iOS linker preserve directives for Release link-time trimming
  • PrivacyInfo.xcprivacy (Captain confirmed no App Store submission planned)
  • Windows head, marketing site
  • Custom telemetry processors / sampling / session replay / live metrics

Validation

Built Release Mac Catalyst, launched with SENTENCESTUDIO_CRASH_TEST=1, temp thread threw InvalidOperationException 10s after launch. Waited 4 minutes. KQL query:

exceptions
| where timestamp > ago(15m)
| where cloud_RoleName contains "SentenceStudio"
| project timestamp, type, outerMessage, cloud_RoleName, operation_Id
| order by timestamp desc

Result (truncated):

timestamp              type                               outerMessage                                                             cloud_RoleName
2026-04-20T03:09:14Z   System.InvalidOperationException   AppInsights pipeline validation (wash: squad/mobile-appinsights-slice)   SentenceStudio.Mobile.MacCatalyst
2026-04-20T03:09:04Z   System.InvalidOperationException   Model building is not supported when publishing with NativeAOT…          SentenceStudio.Mobile.MacCatalyst
2026-04-20T03:09:04Z   System.InvalidOperationException   HelpKit could not select a presenter automatically…                      SentenceStudio.Mobile.MacCatalyst

The forced exception (AppInsights pipeline validation…) is there, plus bonus evidence the pipeline is already catching caught-and-logged exceptions from startup code paths (EF NativeAOT model-build, HelpKit presenter). cloud_RoleName is correct. Temp validation code reverted before commit.

operation_Id is empty because we throw from a bare Thread, not inside an OTel-instrumented activity. Real crashes inside an HttpClient span will populate it and correlate to the API once the server-side companion ships.

Secret management

Committed the connection string to appsettings.Production.json (which is already a tracked file containing service-discovery endpoints). Rationale, per .squad/decisions.md wash-mobile-appinsights-answers:

  • InstrumentationKey is write-only: it authorizes ingestion push; it cannot read telemetry or touch any other Azure resource.
  • Microsoft's documented pattern for mobile/desktop/JS clients is to ship it in the bundle.
  • Worst case is fake-telemetry spam, bounded by the 0.5 GB/day ingestion cap we set.
  • All "secure" alternatives (fetch-at-startup, per-user keys, Key Vault) are strictly worse for mobile — chicken-and-egg if the API is down, or require an Azure identity the app doesn't have.

If Captain prefers an env-var / non-committed override path later, that is easy to add (the config system reads env vars already).

Follow-ups before full rollout

  • Server-side Azure Monitor exporter in SentenceStudio.Api so W3C traceparent correlates mobile → API spans.
  • Blazor WebView JS bridge.
  • Android/iOS linker preserve + Android Release smoke test.
  • iOS device smoke test on DX24 (Release).
  • Windows head.

Resource

  • App Insights: /subscriptions/a25bc5f2-e641-47b9-89a8-5e5fd428d9d6/resourceGroups/rg-sstudio-prod/providers/microsoft.insights/components/sstudio-mobile-ai
  • AppId: 74e94530-d17f-404a-8726-b7266724b70f
  • Daily cap: 0.5 GB, notifications on cap

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

davidortinau and others added 5 commits April 19, 2026 17:37
- Merge wash-observability.md from decisions/inbox/ to decisions.md
- Delete inbox file
- Add orchestration log: 2026-04-19T22:36:52Z-wash.md
- Add session log: 2026-04-19-azure-error-visibility.md
- Append observability note to wash/history.md for cross-agent visibility

Audit outcome: Container logs flow correctly; App Insights unconfigured; no global exception handler; AI endpoints silent on failures; /health unmapped. Wash proposes four-part fix (App Insights, exception middleware, AI endpoint logging, /health) awaiting Captain approval (~1 day).

Immediate workaround provided: CLI tail + KQL query against law-3ovvqiybthkb6.
- Merge wash-mobile-observability.md from inbox → decisions.md
- Update decisions.md with full mobile App Insights plan
- Append cross-agent note to Kaylee history (Blazor JS error bridge opportunity)
- Update Wash history with planning context

Awaiting Captain decisions on:
1. One vs. two App Insights resources
2. Connection string embedding OK
3. App Store submission timeline (PrivacyInfo.xcprivacy)
4. Marketing site scope

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Orchestration log: 2026-04-20T02-42-17Z-wash.md
  * Wash answered Captain's 3 follow-up questions on mobile observability
  * Findings: ONE App Insights resource (shared MAUI+API), embed connection string, reject TinyInsights.Maui

- Session log: 2026-04-20T02-42-17Z-mobile-appinsights-qa.md (brief summary)

- Merged decision inbox → decisions.md (now 54.5KB)
  * wash-mobile-appinsights-answers.md merged with full QA rationale
  * Deleted inbox file after merge

- Tasks 4-7: no-op (no cross-agent work, no archiving needed, history.md already summarized)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wires Azure Monitor OpenTelemetry exporter into the existing MAUI OTel
pipeline and subscribes to MauiExceptions.UnhandledException so unhandled
crashes land in Application Insights.

Changes:
- MauiServiceDefaults: add Azure.Monitor.OpenTelemetry.Exporter 1.7.0,
  bump OpenTelemetry.Extensions.Hosting/Exporter.OTLP/Instrumentation.Http
  to 1.15.x to satisfy the transitive floor. AddAzureMonitor{Log,Metric,
  Trace}Exporter is gated on #if !DEBUG + a non-empty connection string so
  simulator/dev runs stay silent. Sets a stable service name
  SentenceStudio.Mobile.<Platform> so App Insights cloud_RoleName
  identifies the client clearly.
- SentenceStudioAppBuilder.InitializeApp: one subscriber on
  MauiExceptions.UnhandledException that critical-logs the exception and
  best-effort ForceFlush's LoggerProvider/TracerProvider/MeterProvider
  (3s budget) before the process dies.
- appsettings.Production.json: AzureMonitor:ConnectionString for the
  sstudio-mobile-ai App Insights resource in rg-sstudio-prod.
- MacCatalyst/MauiProgram.cs: #if DEBUG guard the DevFlow usings so
  Release builds don't fail on the debug-only package references
  (pre-existing bug, unblocks Release validation).

OUT of scope (deferred to full plan): Blazor WebView JS bridge, Android
linker preserve, iOS linker preserve, PrivacyInfo.xcprivacy, Windows,
custom processors.

Validated on Mac Catalyst Release with a forced InvalidOperationException;
record appeared in App Insights within ~5 minutes with cloud_RoleName=
SentenceStudio.Mobile.MacCatalyst. Server-side companion still pending
so client spans will be orphan until the API also emits to App Insights.

Refs: .squad/decisions.md wash-mobile-observability, wash-mobile-appinsights-answers

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Three issues flagged on the crash-handling hot path, all fixed:

1. (HIGH) DeviceInfo.Platform was read inside ConfigureOpenTelemetry while
   the host builder was still configuring — pre-MauiApp.Build() timing is
   unsafe for MAUI Essentials and could produce cloud_RoleName="Unknown".
   Added platformName parameter to AddMauiServiceDefaults /
   ConfigureOpenTelemetry; each platform head now passes the literal
   ("MacCatalyst" / "iOS" / "Android") from its MauiProgram.cs where
   per-TFM context is unambiguous. Windows head doesn't call
   AddMauiServiceDefaults today so no change needed there.

2. (MEDIUM) Serial ForceFlush across logger/tracer/meter at 3000ms each
   risked a 9s worst case, exceeding the ~5-10s iOS watchdog on crash
   paths. Replaced with three Task.Run calls (each with their own 2500ms
   internal timeout) wrapped in a single Task.WaitAll with a 3000ms
   outer deadline. Hard 3s wall-time ceiling regardless of per-provider
   stalls. All exceptions swallowed.

3. (MEDIUM) MauiExceptions.UnhandledException subscription wasn't
   idempotent — hot reload / re-init could double-wire the handler.
   Gated behind Interlocked.Exchange on a static int flag so concurrent
   init paths can't race, and repeated InitializeApp() calls won't
   double-fire the crash handler.

Validation: clean Release build + SENTENCESTUDIO_CRASH_TEST=1 reproduced
the forced InvalidOperationException. App Insights confirmed arrival at
2026-04-21T20:22:36Z with cloud_RoleName=SentenceStudio.Mobile.MacCatalyst.

Also captured the parallel-bounded-flush pattern and Interlocked-gated
subscription in .squad/skills/maui-azure-monitor/SKILL.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@davidortinau

Owner Author

Review fixes applied (commit 656556b)

All three issues from review are addressed. PR stays in draft.

Fix 1 (HIGH) — DeviceInfo.Platform read too early

ConfigureOpenTelemetry was calling DeviceInfo.Platform while the host builder was still configuring — pre-MauiApp.Build(). MAUI Essentials aren't guaranteed to be ready there; cloud_RoleName could have landed as Unknown and defeated the whole tag.

Threaded the platform string through from each platform head where the per-TFM context is unambiguous:

  • AddMauiServiceDefaults and ConfigureOpenTelemetry now take string platformName = "Unknown".
  • src/SentenceStudio.MacCatalyst/MauiProgram.cs → AddMauiServiceDefaults("MacCatalyst")
  • src/SentenceStudio.iOS/MauiProgram.cs → AddMauiServiceDefaults("iOS")
  • src/SentenceStudio.Android/MauiProgram.cs → AddMauiServiceDefaults("Android")
  • Windows head doesn't call AddMauiServiceDefaults today — left alone.

Deterministic, zero-runtime-surface, works even pre-Essentials.
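
A minimal sketch of the shape of this fix, not the actual SentenceStudio code: ServiceDefaultsSketch and BuildServiceName are hypothetical names standing in for the AddMauiServiceDefaults / ConfigureOpenTelemetry pair. It only illustrates that the role name is built from a caller-supplied literal with an "Unknown" default, so nothing in the shared code path depends on MAUI Essentials being initialized.

```csharp
using System;

// Each platform head passes its own literal; the shared code never reads DeviceInfo.
Console.WriteLine(ServiceDefaultsSketch.BuildServiceName("MacCatalyst")); // SentenceStudio.Mobile.MacCatalyst
Console.WriteLine(ServiceDefaultsSketch.BuildServiceName("iOS"));         // SentenceStudio.Mobile.iOS

// A head that forgets the argument falls back to the default.
Console.WriteLine(ServiceDefaultsSketch.BuildServiceName());              // SentenceStudio.Mobile.Unknown

static class ServiceDefaultsSketch
{
    // Hypothetical helper: the real extension methods take the same optional
    // parameter and hand the resulting name to the OTel resource builder.
    public static string BuildServiceName(string platformName = "Unknown")
        => $"SentenceStudio.Mobile.{platformName}";
}
```

The default keeps existing call sites compiling, at the cost that a forgotten argument shows up as "Unknown" in cloud_RoleName rather than as a build error.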

Fix 2 (MEDIUM) — Serial ForceFlush risked iOS watchdog kill

Previous code flushed logger → tracer → meter serially at 3000ms each = 9s worst case. iOS watchdog is ~5–10s. Now: three Task.Run flushes (each with their own 2500ms internal ceiling) wrapped in one Task.WaitAll(..., TimeSpan.FromMilliseconds(3000)). Hard 3s wall-time cap regardless of per-provider stalls. All exceptions swallowed — exception-in-handler is worse than missed telemetry.
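
The double-timeout pattern can be sketched roughly as follows. This is an illustration under assumptions, not the committed code: CrashFlush and FlushAllBounded are hypothetical names, and each Func<int, bool> stands in for a provider's ForceFlush(timeoutMilliseconds) call. The timeouts are scaled down (250ms inner, 300ms outer) so the demo finishes quickly.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Demo: one healthy provider, one that stalls well past its budget, one that throws.
var flushes = new Func<int, bool>[]
{
    _ => true,
    timeoutMs => { Thread.Sleep(timeoutMs * 4); return false; },
    _ => throw new InvalidOperationException("exporter faulted"),
};

var sw = Stopwatch.StartNew();
bool allFlushed = CrashFlush.FlushAllBounded(innerTimeoutMs: 250, outerTimeoutMs: 300, flushes);
Console.WriteLine($"allFlushed={allFlushed}, elapsed={sw.ElapsedMilliseconds}ms");

static class CrashFlush
{
    // Runs every flush in parallel and waits at most outerTimeoutMs in total.
    // Never throws: on a crash path a second exception is worse than lost telemetry.
    public static bool FlushAllBounded(int innerTimeoutMs, int outerTimeoutMs, Func<int, bool>[] flushes)
    {
        try
        {
            Task<bool>[] tasks = flushes
                .Select(flush => Task.Run(() =>
                {
                    try { return flush(innerTimeoutMs); } // inner, per-provider budget
                    catch { return false; }               // swallow per-provider faults
                }))
                .ToArray();

            // Outer deadline: returns false if any task is still running when it expires.
            bool finishedInTime = Task.WaitAll(tasks, TimeSpan.FromMilliseconds(outerTimeoutMs));
            return finishedInTime && tasks.All(t => t.Result);
        }
        catch
        {
            return false;
        }
    }
}
```

Note that Task.WaitAll timing out does not cancel the stalled task; it only stops the crash handler from waiting on it, which is enough when the process is about to die anyway.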

Fix 3 (MEDIUM) — UnhandledException subscription now idempotent

Gated behind Interlocked.Exchange(ref _unhandledExceptionWired, 1) == 0 on a static int (not a plain bool) so hot-reload / double-init paths can't race or double-wire the handler.
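
A minimal sketch of the gate, assuming a hypothetical CrashHandlerWiring class whose UnhandledException event stands in for MauiExceptions.UnhandledException. Interlocked.Exchange returns the previous value, so exactly one caller observes 0 and subscribes, even when several init paths run concurrently.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Simulate concurrent and repeated initialization (hot reload, double init).
Parallel.For(0, 8, _ => CrashHandlerWiring.InitializeApp());
CrashHandlerWiring.InitializeApp();

CrashHandlerWiring.RaiseForDemo(new InvalidOperationException("boom"));
Console.WriteLine($"handler invocations: {CrashHandlerWiring.HandlerInvocations}"); // handler invocations: 1

static class CrashHandlerWiring
{
    // Hypothetical stand-in for MauiExceptions.UnhandledException.
    public static event Action<Exception>? UnhandledException;

    // An int rather than a bool, because Interlocked.Exchange has no bool overload here.
    private static int _unhandledExceptionWired; // 0 = not wired, 1 = wired

    public static int HandlerInvocations;

    public static void InitializeApp()
    {
        // Only the first caller sees the previous value 0 and wires the handler.
        if (Interlocked.Exchange(ref _unhandledExceptionWired, 1) == 0)
        {
            UnhandledException += _ => Interlocked.Increment(ref HandlerInvocations);
        }
    }

    public static void RaiseForDemo(Exception ex) => UnhandledException?.Invoke(ex);
}
```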


Validation

SENTENCESTUDIO_CRASH_TEST=1 rerun on a clean Mac Catalyst Release build. Forced InvalidOperationException landed in sstudio-mobile-ai:

timestamp                  type                                outerMessage                                                   cloud_RoleName
2026-04-21T20:22:36.560Z   System.InvalidOperationException    AppInsights pipeline validation — PR #165 review fixes (wash) SentenceStudio.Mobile.MacCatalyst

KQL used:

union exceptions, traces
| where timestamp > datetime(2026-04-21T20:22:00Z)
| project timestamp, itemType, type, msg=coalesce(outerMessage, message), cloud_RoleName, severityLevel
| order by timestamp desc

cloud_RoleName is the threaded-in literal, confirming Fix 1 is live. Temp validation hook reverted before commit.

Incidental learnings (captured in .squad/skills/maui-azure-monitor/SKILL.md)

  • Parallel-bounded flush pattern is now documented in the skill with double-timeout rationale.
  • Launch Mac Catalyst Release bundles via open --env VAR=VAL app.app, NOT by invoking the binary in Contents/MacOS/ directly — MAUI aborts in load_aot_module when launched outside LaunchServices.
  • Stale obj/Release after OTel package bumps can produce ghost Failed to load AOT module 'Azure.Core' crashes. Clean src/*/obj/Release + rebuild fixes it.

PR remains in draft per the task.

@davidortinau davidortinau marked this pull request as ready for review April 21, 2026 23:58
@davidortinau davidortinau merged commit e002d3e into main Apr 21, 2026
2 of 6 checks passed
@davidortinau davidortinau deleted the squad/mobile-appinsights-slice branch April 21, 2026 23:59
davidortinau added a commit that referenced this pull request Apr 22, 2026
* server-appinsights: wire Azure Monitor OpenTelemetry exporters into API (close mobile↔API correlation loop)

Adds server-side Application Insights via Azure.Monitor.OpenTelemetry.Exporter
into SentenceStudio.ServiceDefaults so API, WebApp, Workers, and Marketing all
tag themselves with distinct cloud_RoleName values and export to the same
sstudio-mobile-ai resource the mobile client already uses (PR #165). With
HttpClient instrumentation on the client and AspNetCore instrumentation on
the server, W3C traceparent propagates automatically and requests join to
dependencies on operation_Id.

Critical MAUI-safety pivot: Azure.Monitor.OpenTelemetry.AspNetCore transitively
requires Microsoft.AspNetCore.App, which has no runtime pack for maccatalyst-*/
ios-*/android-* RIDs. ServiceDefaults is consumed by every MAUI head via AppLib,
so the AspNetCore variant breaks MAUI builds (NETSDK1082). Using the lower-level
Exporter package (same one mobile uses) with the three AddAzureMonitor{Log,
Metric,Trace}Exporter calls in shared defaults keeps MAUI buildable; AspNetCore
instrumentation is added only to the API csproj and wired from Program.cs.

OTel stack bumped to 1.15.x across ServiceDefaults and AppLib to stay aligned
with MauiServiceDefaults and clear NU1605 downgrade errors in web projects.

Azure Monitor wiring gated #if !DEBUG so local aspire run keeps streaming to
the Aspire dashboard via OTLP without dual-exporting. AzureMonitor:ConnectionString
committed to Api/appsettings.Production.json (write-only key, same approach
mobile slice used; ingestion capped at 0.5 GB/day).

Companion skill at .squad/skills/aspnetcore-azure-monitor/SKILL.md documents
the MAUI-safe server pattern.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* api: global unhandled-exception handler for AppInsights exceptions table

ASP.NET Core OTel instrumentation tags exceptions on the request span
but does NOT produce rows in App Insights' exceptions table — that's
populated only from ILogger records carrying an Exception.

Wire UseExceptionHandler as the first middleware, log via a named
'UnhandledException' ILogger, and write a ProblemDetails 500. The OTel
log exporter ships the record so KQL like
  exceptions | where cloud_RoleName == 'SentenceStudio.Api'
surfaces unhandled controller / minimal-API throws.

Placement: before UseAuthentication so exceptions in auth handlers and
custom middleware are also caught.

Smoke-validated locally by temporarily adding /__debug/boom (removed
before commit): HTTP 500 + application/problem+json body, fail:
UnhandledException[0] log line with full stack, no process crash.

Also:
- aspnetcore-azure-monitor SKILL: replace 'no middleware needed' claim
  with the correct pattern; add az monitor app-insights component
  billing update recipe for daily-cap management.
- wash history: learnings for the cap raise CLI and the span-vs-
  exceptions-row distinction.

Companion: 0.5 GB/day → 2 GB/day cap raise on sstudio-mobile-ai
(done via az CLI, read-back archived in PR #166 body).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
davidortinau added a commit that referenced this pull request Apr 22, 2026
…tion

PR #165 (mobile) + PR #166 (server) shipped the App Insights pipeline end
to end, but the mobile↔API correlation join in App Insights was returning
zero rows. Every server request had `operation_Id == operation_ParentId`
— i.e., no `traceparent` header was arriving from the device.

Diagnosis (see #171):

- `OpenTelemetry.Instrumentation.Http`'s `AddHttpClientInstrumentation()`
  was already on the MAUI TracerProvider since commit 216a2da and a
  trim-disabled Release build on DX24 produced the same zero-span result,
  so neither registration nor trimming was the problem.
- Mobile logs had empty `operation_Id` across the board, confirming no
  ambient `Activity` ever existed on the device.
- Root cause (tracked in #171): MAUI's `MauiApp` doesn't run
  `IHostedService` instances, so the `TelemetryHostedService` that would
  normally materialize the TracerProvider and attach its listeners never
  runs. Logs work because they hook `ILoggerFactory` synchronously; the
  tracer path needs the hosted-service startup.

This PR:

- Adds `ApiActivityHandler`, a `DelegatingHandler` that starts a
  `Client` Activity per outbound API call using a dedicated
  `ActivitySource` (`SentenceStudio.Mobile.HttpClient`). With an Activity
  current, HttpClient's built-in `DiagnosticsHandler` auto-injects the
  W3C `traceparent` header.
- Registers the new ActivitySource on the mobile TracerProvider via
  `.AddSource(...)` in `MauiServiceDefaults.Extensions` so the spans
  actually export.
- Wires the handler onto every API-bound HttpClient: CoreSync's
  `HttpClientToServer`, the auth client, the four typed API clients,
  and `VersionCheckService`. The handler is placed FIRST in the chain
  so the span wraps the full operation including auth token attachment.
- Hardens `OpenTelemetryInitializer` to call `GetRequiredService<T>()`
  instead of the nullable `GetService<T>()` for all three providers, so
  a misregistration fails loudly at startup instead of silently breaking
  telemetry at runtime.

Out of scope (explicitly):

- Root-cause fix for the IHostedService gap — tracked in #171.
- The raw `new HttpClient()` in `SentenceStudio.Shared/Services/AiService.cs:93`
  — bypasses `HttpClientFactory` entirely. Separate refactor.
- The KQL in `docs/deploy-runbook.md` is still wrong (joins requests to
  requests; should be dependencies to requests). Separate doc PR.

Verification: Mac Catalyst Debug + Release both build clean.
Post-merge verification will be an iOS publish to DX24 + KQL query for
non-empty `operation_ParentId` on server requests.
davidortinau added a commit that referenced this pull request Apr 22, 2026
…ile↔API correlation (#172)

* fix(mobile): wrap API HttpClients with ApiActivityHandler for correlation

* fix(mobile): use Activity.AddException for OTel-conformant exception recording

Code review feedback on #172: exceptions should be recorded as Activity
events (via AddException/RecordException), not raw tags. Emits the
standard OTel 'exception' event with type/message/stacktrace, which
surfaces in App Insights' exception timeline rather than being tag-only.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
davidortinau added a commit that referenced this pull request Apr 22, 2026
…tion (#173)

PR #172 got mobile HttpClient dependency spans emitting with operation_Id,
but the correlation join against API requests still returned zero rows:
the API saw every incoming request without a traceparent header and started
a fresh operation_Id.

Root cause: HttpClient's built-in DiagnosticsHandler only injects traceparent
automatically when an OTel-style ActivityListener is attached to
"System.Net.Http". On MAUI the listener never attaches because OpenTelemetry's
TelemetryHostedService — which wires listeners to the TracerProvider — relies
on IHostedService, and MauiApp doesn't run hosted services (issue #171).

Fix: have ApiActivityHandler explicitly call
DistributedContextPropagator.Current.Inject(...) on the outbound request
headers after starting its Activity. Guards against double-injection if a
caller or a resilience retry already set traceparent.
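
A rough sketch of what such a handler could look like, using only BCL types. The class and source names come from the commit messages, but the body is an assumption about the implementation, and CaptureHandler plus the ActivityListener are demo scaffolding: the listener stands in for the TracerProvider's .AddSource(...) registration so that StartActivity returns a non-null Activity.

```csharp
using System;
using System.Diagnostics;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading;
using System.Threading.Tasks;

// Demo scaffolding: sample the handler's ActivitySource so activities are created.
using var listener = new ActivityListener
{
    ShouldListenTo = source => source.Name == ApiActivityHandler.SourceName,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllDataAndRecorded,
};
ActivitySource.AddActivityListener(listener);

var capture = new CaptureHandler();
using var client = new HttpClient(new ApiActivityHandler { InnerHandler = capture });
await client.GetAsync("http://localhost/api/ping"); // no network: CaptureHandler answers
Console.WriteLine($"traceparent: {capture.TraceParent}");

sealed class ApiActivityHandler : DelegatingHandler
{
    public const string SourceName = "SentenceStudio.Mobile.HttpClient";
    private static readonly ActivitySource Source = new(SourceName);

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        using Activity? activity = Source.StartActivity($"HTTP {request.Method}", ActivityKind.Client);

        // Inject explicitly, but skip if a caller or a retry already set the header.
        if (activity is not null && !request.Headers.Contains("traceparent"))
        {
            DistributedContextPropagator.Current.Inject(activity, request.Headers,
                static (carrier, name, value) =>
                {
                    if (carrier is HttpRequestHeaders headers)
                        headers.TryAddWithoutValidation(name, value);
                });
        }

        return await base.SendAsync(request, cancellationToken).ConfigureAwait(false);
    }
}

// Test double: records the outbound traceparent instead of hitting the network.
sealed class CaptureHandler : HttpMessageHandler
{
    public string? TraceParent { get; private set; }

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        TraceParent = request.Headers.TryGetValues("traceparent", out var values)
            ? string.Join(",", values)
            : null;
        return Task.FromResult(new HttpResponseMessage(HttpStatusCode.OK));
    }
}
```

With the default propagator and W3C activity IDs, the injected value is the Activity's Id, so the server can parent its request span on it.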

This is the user-space workaround to #171. Framework fix is still desirable
but now lower priority.

Verification plan: re-run the App Insights correlation join; expect
requests | join dependencies on operation_Id to return > 0 rows for the
mobile role name.

Refs: #165 #166 #172 #171