131 changes: 131 additions & 0 deletions docs/FOLLOWUPS.md
@@ -1776,6 +1776,36 @@ scope. Each carries the trigger that should reopen the question.

### Phase: post-RFC-acceptance, before Phase 2 implementation

- [ ] **v0.2 admission-webhook injection for image-immutable operators.**
M13 v0.1 alpha serves operators who can modify the workload
training image. Vendor-locked images (SageMaker training jobs,
Lightning AI Studios, Vertex AI Workbench, NVIDIA NGC PyTorch
consumed without modification) are explicitly out of scope per
RFC-0009 §Migration / rollout. Datadog's cluster agent pattern
(mutating admission webhook that injects the helper as an init
container + `LD_PRELOAD` env) is the industry-standard path for
this audience. v0.2 design needs: (a) inventory of which managed
platforms allow admission webhooks at all (some k8s services
restrict them); (b) injection mechanism that avoids `LD_PRELOAD`
conflicts with torchrun/accelerate/DeepSpeed launchers per
RFC-0009 §Alternatives `PyThreadState_Next via LD_PRELOAD`;
(c) UID-alignment story that survives image rebuilds. *Trigger:*
first operator request for non-cooperative-image coverage, or
M13 v0.1 alpha reaches three production deployments and
a fourth operator requests vendor-locked-image support.
- [ ] **Upstream `gen_ai.training.*` semconv PR.** The proposal
markdown body lives at `docs/proposals/gen-ai-training-semconv.md`
(drafted 2026-05-19; closes the M1 first-draft KPI gap from
NORTHSTARS O4 line 219). The remaining action is filing the
actual PR on `open-telemetry/semantic-conventions`. Prerequisite
cross-org work: (a) GenAI Instrumentation SIG attendance — at
least one meeting attended before filing to confirm scope
doesn't overlap a pending proposal; (b) competing-proposal
survey on the upstream repo; (c) ideally a co-author or sponsor
from a non-tracecore vendor (Pyroscope, Datadog, Honeycomb SIG
members are candidates). *Trigger:* O4 owner has bandwidth for
weekly SIG cadence (NORTHSTARS O4 KPI line 220), or M13 alpha
ships and external review pressure on the namespace emerges.
- [ ] **Fork-handler in the Python helper.** RFC-0009 §Open questions §2.
v0.1 contract is: operator calls `attach()` in each worker entry
point (post-fork) so each PID gets its own UDS socket. The
@@ -1871,6 +1901,107 @@ scope. Each carries the trigger that should reopen the question.
settle window so the silent-under-collection misconfig surfaces
on the operator dashboard. *Trigger:* Phase 1 trigger-loop ships.

### Phase: empirical validation (needs production data or GPU hardware)

Items here cannot be performed at design time. They require a real
PyTorch training workload on GPU hardware, a fleet of real customer
deployments, or external-org engagement.

- [ ] **NFR ceilings sourced + measured.** RFC-0009 §Non-functional
rubrics marks all four ceilings (CPU ≤0.05%, RSS ≤10 MB,
egress ≤0.05 Mbps, shutdown ≤1s p99) as design contracts
pending Phase 3 benchstat. Two gaps survive that framing.
First, the ceiling values are inherited from MILESTONES §M13,
which inherited them from NORTHSTARS O2 per-receiver budget,
which states the budget without an external anchor — the
derivation chain is tracecore-defined all the way down. A
principal-tier rationale should name the basis explicitly (e.g.,
"0.05% CPU is 1/20 of a single hyperthread on a 32-vCPU node,
the smallest acceptable allocation a multi-receiver collector
can spend on one signal class without crowding") so operators
know whether the ceiling can be raised under real workload
pressure. Second, the wording itself should replace "design
contract" with measured bounds: "measured X sustained CPU
under cadence (60s/15s) on a 32-thread PyTorch DDP workload
(n=600, σ=Y, 1-hour soak); ceiling Z." The derivation half is
doc-only and could land earlier if the O2 owner has bandwidth;
the measurement half needs GPU hardware + real PyTorch.
*Needs:* GPU hardware (≥1 node with ≥1 NVIDIA GPU running a
PyTorch DDP toy model). *Trigger:* Phase 3 benchstat gate
wiring; the same PR replaces design-contract wording in
RFC-0009 §Non-functional rubrics.
- [ ] **Rolling-upgrade chaos validation across mixed
helper/receiver versions.** RFC-0009 §Wire protocol commits to
a version-skew matrix (helper v vs receiver v±1). The matrix
has not been exercised against a real fleet rolling-upgrade
sequence. *Needs:* production data (a fleet of ≥3 simultaneous
deployments at different helper-pin / chart-version
combinations). *Trigger:* M13 alpha gains ≥3 production
deployments; chaos-test fixture lands alongside the third
adopter's rollout.
- [ ] **GIL-hold blast radius of `dump_traceback` during NCCL
collective.** RFC-0009 §Motivation argues the in-target helper
approach is safer than ptrace-based sampling. The helper's own
`dump_traceback` call holds the GIL briefly. If a NCCL
collective is in flight during a dump, the workload-visible
cost is uncharacterized. The RFC commits a ≤0.05% CPU budget on the
receiver; the *target-process* cost during collective overlap
is unbudgeted. *Needs:* GPU hardware (≥1 multi-GPU node)
running NCCL collectives + Phase 3 helper. *Trigger:* Phase 3
benchstat measurement on multi-GPU workload.
- [ ] **CI tolerance multiplier (5×) variance justification.**
Both M13 and M14 NFR rubrics use a "5× CI ceiling for
shared-runner variance" multiplier. The constant is
tracecore-defined with no empirical anchor. A measurement on
the actual CI runner class (GitHub-hosted or self-hosted) would
establish whether 5× is over-tight or over-loose. *Needs:*
production data (variance distribution across ~50 CI runs on
the target runner class). *Trigger:* Phase 3 benchstat
stability; or first flake report on either receiver's overhead
gate.
- [ ] **GenAI Instrumentation SIG attendance + competing-proposal
survey.** NORTHSTARS O4 KPI line 220 commits to weekly SIG
attendance. As of 2026-05-19 no attendance log exists. Before
filing the upstream `gen_ai.training.*` proposal (see "Upstream
`gen_ai.training.*` semconv PR" item above), at least one SIG
meeting must be attended to confirm the proposal does not
overlap a pending vendor-shaped proposal. *Needs:* cross-org
engagement (weekly meeting time + GitHub SIG-issues review).
*Trigger:* O4 owner schedules first meeting; or first credible
competing-proposal report from the OTel semconv repo.
- [ ] **Vendor verification for `kubectl debug` viability across
managed training platforms.** RFC-0009 §Migration / rollout
names `kubectl debug` + py-spy as operator-standard for
non-cooperative targets. Per-platform availability is
unverified: SageMaker training jobs disallow ephemeral debug
containers entirely; Lightning AI Studios and Vertex AI
Workbench behavior is undocumented in this repo. A per-platform
table belongs in Phase 4 `docs/integrations/pyspy.md` so
operators know which platforms the escape-hatch story actually
works on. *Needs:* cross-org engagement (vendor doc review or
direct testing on at least three managed platforms). *Trigger:*
Phase 4 integrations doc writing; first operator question
about non-cooperative coverage.
- [ ] **OTel Python profiling SDK comparator verification.** The
`m13-industry-alignment-decisions` cross-check excluded the
OTel Python profiling SDK from its industry-comparator table
with an "unverified" caveat. The SDK's Python implementation
status is separate from the Profiles signal Alpha (data-model
level). Confirming whether the SDK is a credible v0.2 sampler
candidate (vs `grafana/otel-profiling-python`) needs an
upstream status check. *Needs:* cross-org engagement (OTel
Python SIG check). *Trigger:* v0.2 audience-expansion work; or
`grafana/otel-profiling-python` deprecation announcement.
- [ ] **`pprofile` v1.0 graduation upgrade path.** RFC-0009 §6
footnote names two Phase 3 options. Option (a) vendors a
v0.152.0 `pprofile` against the v1.58.0 main `pdata` line.
When upstream `pprofile` reaches v1.0, tracecore should bump
and exercise the upgrade in a follow-up PR, capturing any
breaking API changes as a Phase 3+ amendment to RFC-0009 or a
companion RFC per the convention call left to the Phase 3
author. *Needs:* upstream `pprofile` v1.0 release.
*Trigger:* `pprofile` v1.0 published.
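
The "NFR ceilings sourced + measured" item asks for the CPU ceiling's basis to be stated explicitly. As a rough sketch of the arithmetic only, the snippet below converts a percentage ceiling into hyperthreads under two possible readings of the denominator. Which reading RFC-0009 intends is an assumption here, and the function and constant names are hypothetical.

```python
def ceiling_in_hyperthreads(cpu_percent: float, vcpus: int) -> float:
    """Express a CPU ceiling, given as a percentage of `vcpus` hyperthreads,
    as a number of hyperthreads."""
    return cpu_percent / 100.0 * vcpus


CEILING_PERCENT = 0.05  # ceiling value named in the NFR item
NODE_VCPUS = 32         # node size used in the item's illustrative wording

# Reading 1 (assumption): the ceiling is a share of the whole node.
node_wide = ceiling_in_hyperthreads(CEILING_PERCENT, NODE_VCPUS)

# Reading 2 (assumption): the ceiling is a share of a single hyperthread.
single_thread = ceiling_in_hyperthreads(CEILING_PERCENT, 1)

print(f"node-wide reading:     {node_wide:.4f} hyperthreads (1/{1 / node_wide:.1f})")
print(f"single-thread reading: {single_thread:.4f} hyperthreads (1/{1 / single_thread:.1f})")
```

The two readings differ by a factor of 32 (about 0.016 versus 0.0005 of a hyperthread), and neither reproduces the illustrative "1/20 of a single hyperthread" wording, which corresponds to 5% of one hyperthread. The final rationale therefore needs to state its denominator.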

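For the rolling-upgrade item, the version-skew matrix (helper v vs receiver v±1) can be enumerated mechanically before any chaos fixture exists. This is a minimal sketch assuming an ordered list of release labels and assuming the same-version pair is part of the matrix. The labels and the `skew_matrix` name are hypothetical, not taken from RFC-0009.

```python
def skew_matrix(versions: list[str]) -> list[tuple[str, str]]:
    """Return (helper, receiver) pairs where the receiver is at most one
    release away from the helper in the ordered `versions` list."""
    pairs = []
    for i, helper in enumerate(versions):
        for j in (i - 1, i, i + 1):
            if 0 <= j < len(versions):  # skip neighbours that do not exist
                pairs.append((helper, versions[j]))
    return pairs


# Hypothetical release labels, oldest first.
for helper, receiver in skew_matrix(["v0.1", "v0.2", "v0.3"]):
    print(f"helper {helper} <-> receiver {receiver}")
```

Each pair is one cell a rolling-upgrade fixture would need to exercise. Pairs more than one release apart are deliberately left out, matching the v±1 bound.
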
### Phase: Phase 3 polish (surfaced in Phase 2 review)

- [ ] **Parser allocation discipline.** Receiver-side parser should