TraceCoreAI · trilamsr · May 19, 2026
diff --git a/docs/FOLLOWUPS.md b/docs/FOLLOWUPS.md
@@ -1776,6 +1776,36 @@ scope. Each carries the trigger that should reopen the question.
 
 ### Phase: post-RFC-acceptance, before Phase 2 implementation
 
+- [ ] **v0.2 admission-webhook injection for image-immutable operators.**
+      M13 v0.1 alpha serves operators who can modify the workload
+      training image. Vendor-locked images (SageMaker training jobs,
+      Lightning AI Studios, Vertex AI Workbench, NVIDIA NGC PyTorch
+      consumed without modification) are explicitly out of scope per
+      RFC-0009 §Migration / rollout. Datadog's Cluster Agent pattern
+      (mutating admission webhook that injects the helper as an init
+      container + `LD_PRELOAD` env) is the industry-standard path for
+      this audience. v0.2 design needs: (a) inventory of which managed
+      platforms allow admission webhooks at all (some k8s services
+      restrict them); (b) injection mechanism that avoids `LD_PRELOAD`
+      conflicts with torchrun/accelerate/DeepSpeed launchers per
+      RFC-0009 §Alternatives `PyThreadState_Next via LD_PRELOAD`;
+      (c) UID-alignment story that survives image rebuilds. *Trigger:*
+      first operator request for non-cooperative-image coverage, or
+      M13 v0.1 alpha gains its first three production deployments and
+      a fourth requests vendor-locked-image support.
+- [ ] **Upstream `gen_ai.training.*` semconv PR.** The proposal
+      markdown body lives at `docs/proposals/gen-ai-training-semconv.md`
+      (drafted 2026-05-19; closes the M1 first-draft KPI gap from
+      NORTHSTARS O4 line 219). The remaining action is filing the
+      actual PR on `open-telemetry/semantic-conventions`. Prerequisite
+      cross-org work: (a) GenAI Instrumentation SIG attendance — at
+      least one meeting attended before filing to confirm scope
+      doesn't overlap a pending proposal; (b) competing-proposal
+      survey on the upstream repo; (c) ideally a co-author or sponsor
+      from a non-tracecore vendor (Pyroscope, Datadog, Honeycomb SIG
+      members are candidates). *Trigger:* O4 owner has bandwidth for
+      weekly SIG cadence (NORTHSTARS O4 KPI line 220), or M13 alpha
+      ships and external review pressure on the namespace surfaces.
 - [ ] **Fork-handler in the Python helper.** RFC-0009 §Open questions §2.
       v0.1 contract is: operator calls `attach()` in each worker entry
       point (post-fork) so each PID gets its own UDS socket. The
@@ -1871,6 +1901,107 @@ scope. Each carries the trigger that should reopen the question.
       settle window so the silent-under-collection misconfig surfaces
       on the operator dashboard. *Trigger:* Phase 1 trigger-loop ships.
 
+### Phase: empirical validation (needs production data or GPU hardware)
+
+Items here cannot be performed at design time. They require a real
+PyTorch training workload on GPU hardware, a fleet of real customer
+deployments, or external-org engagement.
+
+- [ ] **NFR ceilings sourced + measured.** RFC-0009 §Non-functional
+      rubrics marks all four ceilings (CPU ≤0.05%, RSS ≤10 MB,
+      egress ≤0.05 Mbps, shutdown ≤1s p99) as design contracts
+      pending Phase 3 benchstat. Two gaps survive that framing.
+      First, the ceiling values are inherited from MILESTONES §M13,
+      which inherited them from NORTHSTARS O2 per-receiver budget,
+      which states the budget without an external anchor — the
+      derivation chain is tracecore-defined all the way down. A
+      principal-tier rationale names the basis explicitly (e.g.,
+      "0.05% CPU is 1/2000 of a single hyperthread on a 32-vCPU node,
+      the smallest-acceptable allocation a multi-receiver collector
+      can spend on one signal class without crowding") so operators
+      know whether the ceiling can be raised under real workload
+      pressure. Second, the wording itself replaces "design
+      contract" with measured bounds: "measured X sustained CPU
+      under cadence (60s/15s) on a 32-thread PyTorch DDP workload
+      (n=600, σ=Y, 1-hour soak); ceiling Z." The derivation half is
+      doc-only and could land earlier if the O2 owner has bandwidth;
+      the measurement half needs GPU hardware + real PyTorch.
+      *Needs:* GPU hardware (≥1 node with ≥1 NVIDIA GPU running a
+      PyTorch DDP toy model). *Trigger:* Phase 3 benchstat gate
+      wiring; the same PR replaces design-contract wording in
+      RFC-0009 §Non-functional rubrics.
+- [ ] **Rolling-upgrade chaos validation across mixed
+      helper/receiver versions.** RFC-0009 §Wire protocol commits to
+      a version-skew matrix (helper v vs receiver v±1). The matrix
+      has not been exercised against a real fleet rolling-upgrade
+      sequence. *Needs:* production data (a fleet of ≥3 simultaneous
+      deployments at different helper-pin / chart-version
+      combinations). *Trigger:* M13 alpha gains ≥3 production
+      deployments; chaos-test fixture lands alongside the third
+      adopter's rollout.
+- [ ] **GIL-hold blast radius of `dump_traceback` during an NCCL
+      collective.** RFC-0009 §Motivation argues the in-target helper
+      approach is safer than ptrace-based sampling. The helper's own
+      `dump_traceback` call holds the GIL briefly. If an NCCL
+      collective is in flight during a dump, the workload-visible
+      cost is uncharacterized. The RFC commits to a ≤0.05% CPU
+      budget on the receiver; the *target-process* cost during
+      collective overlap is unbudgeted. *Needs:* GPU hardware
+      (≥1 multi-GPU node) running NCCL collectives + Phase 3 helper.
+      *Trigger:* Phase 3 benchstat measurement on a multi-GPU workload.
+- [ ] **CI tolerance multiplier (5×) variance justification.**
+      Both M13 and M14 NFR rubrics use a "5× CI ceiling for
+      shared-runner variance" multiplier. The constant is
+      tracecore-defined with no empirical anchor. A measurement on
+      the actual CI runner class (GitHub-hosted or self-hosted) would
+      establish whether 5× is too tight or too loose. *Needs:*
+      production data (variance distribution across ~50 CI runs on
+      the target runner class). *Trigger:* Phase 3 benchstat
+      stability; or first flake report on either receiver's overhead
+      gate.
+- [ ] **GenAI Instrumentation SIG attendance + competing-proposal
+      survey.** NORTHSTARS O4 KPI line 220 commits to weekly SIG
+      attendance. As of 2026-05-19 no attendance log exists. Before
+      filing the upstream `gen_ai.training.*` proposal (see "Upstream
+      `gen_ai.training.*` semconv PR" row above), at least one SIG
+      meeting must be attended to confirm the proposal does not
+      overlap a pending vendor-shaped proposal. *Needs:* cross-org
+      engagement (weekly meeting time + GitHub SIG-issues review).
+      *Trigger:* O4 owner schedules first meeting; or first credible
+      competing-proposal report from the OTel semconv repo.
+- [ ] **Vendor verification for `kubectl debug` viability across
+      managed training platforms.** RFC-0009 §Migration / rollout
+      names `kubectl debug` + py-spy as operator-standard for
+      non-cooperative targets. Per-platform availability is
+      unverified: SageMaker training jobs disallow ephemeral debug
+      containers entirely; Lightning AI Studios and Vertex AI
+      Workbench behavior is undocumented in this repo. A per-platform
+      table belongs in Phase 4 `docs/integrations/pyspy.md` so
+      operators know which platforms the escape-hatch story actually
+      works on. *Needs:* cross-org engagement (vendor doc review or
+      direct testing on at least three managed platforms). *Trigger:*
+      Phase 4 integrations doc writing; first operator question
+      about non-cooperative coverage.
+- [ ] **OTel Python profiling SDK comparator verification.** The
+      `m13-industry-alignment-decisions` cross-check excluded the
+      OTel Python profiling SDK from its industry-comparator table
+      with an "unverified" caveat. The SDK's Python implementation
+      status is separate from the Profiles signal Alpha (data-model
+      level). Confirming whether the SDK is a credible v0.2 sampler
+      candidate (vs `grafana/otel-profiling-python`) needs an
+      upstream status check. *Needs:* cross-org engagement (OTel
+      Python SIG check). *Trigger:* v0.2 audience-expansion work; or
+      `grafana/otel-profiling-python` deprecation announcement.
+- [ ] **`pprofile` v1.0 graduation upgrade path.** RFC-0009 §6
+      footnote names two Phase 3 options. Option (a) vendors a
+      v0.152.0 `pprofile` against the v1.58.0 main `pdata` line.
+      When upstream `pprofile` reaches v1.0, tracecore should bump
+      and exercise the upgrade in a follow-up PR, capturing any
+      breaking API changes as a Phase 3+ amendment to RFC-0009 or a
+      companion RFC per the convention call left to the Phase 3
+      author. *Needs:* upstream `pprofile` v1.0 release.
+      *Trigger:* `pprofile` v1.0 published.
+
 ### Phase: Phase 3 polish (surfaced in Phase 2 review)
 
 - [ ] **Parser allocation discipline.** Receiver-side parser should