From 7469d5d3fd471bb8fa3174e15fe76915fa592acf Mon Sep 17 00:00:00 2001 From: Tri Lam Date: Tue, 19 May 2026 10:49:08 -0700 Subject: [PATCH 1/3] [docs] m13 industry-alignment pass: RFC-0009 + gen_ai.training proposal + FOLLOWUPS Four-lens A+ review (operator/SRE, senior staff, OTel SIG, doc hygiene) defined gap closure criteria; this commit executes them. RFC-0009 corrections (M13 pyspy receiver, draft-locked): - Generalize rank derivation: orchestrator-neutral chain env RANK -> SLURM_PROCID -> Ray train.context() -> k8s label fallback - Three new IncError rows: uds_dir_permission_denied, helper_oom_mid_dump, sidecar_uid_drift - NFR ceilings reframed as design contracts pending Phase 3 benchstat measurement, not asserted measurements - Anchor invented constants: n=10^7 birthday-bound derived from 10^4 ranks x 10 stacks x 100 fleet-days; 30s target_not_attached retry from kubelet liveness-probe grace window; one-minor version skew window from K8s API deprecation policy - Honest non-cooperative-image audience scoping: v0.1 requires image cooperation; vendor-locked images (SageMaker, Lightning AI, Vertex AI, NGC PyTorch unmodified) explicitly out of scope; kubectl debug reframed as operator literacy, not coverage - New section 6 footnote: Phase 1 registers as logs receiver via CreateLogs; Phase 3 picks between CreateProfiles factory extension or plog.LogRecord bodies with pprof content-type; asymmetric rework cost documented - Remove uncited Pyroscope roadmap directional-alignment claim New docs/proposals/gen-ai-training-semconv.md: upstream proposal draft for gen_ai.training.* namespace. Closes M1 first-draft KPI from NORTHSTARS O4 (4 months overdue). Mirrors semconv-hw-gpu-extensions.md shape; full attribute set, cardinality guidance, prior-art table (PyTorch DDP / Slurm / Ray / MLflow / W&B / Kueue / SageMaker / torchrun), cross-language SDK adoption checklist, OTel transform-processor translation examples. 
New docs/research/README.md: directory purpose statement clarifying what belongs in research/ vs rfcs/ vs proposals/; no-orphan rule. Amended docs/rfcs/README.md: codifies in-place amendment convention (editorial / substantive / decision-change tiers); names the RFC-0009 section 6 footnote as the borderline case that prompted the convention. Amended docs/FOLLOWUPS.md: new "empirical validation" section under M13 with 10 rows for items needing GPU hardware, production data, or cross-org engagement (NFR ceiling derivation + measurement, rolling-upgrade chaos, GIL-hold blast radius during NCCL collective, SIG attendance, upstream PR filing, vendor escape-hatch verification, OTel Python SDK status, pprof v1.0 upgrade, v0.2 admission-webhook for image-immutable operators). No standalone cross-check research doc shipped; findings landed directly in RFC-0009 section 6 + FOLLOWUPS rows per doc-hygiene reviewer recommendation. Final-state validator confirmed grade movement without new defects: Operator B-/74 -> B+/87, Senior staff A-/85 -> A/92, OTel SIG D+/C- -> B+/87, Doc hygiene B-/C+ -> B+/87. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/FOLLOWUPS.md | 143 +++++++++++++ docs/proposals/gen-ai-training-semconv.md | 235 ++++++++++++++++++++++ docs/research/README.md | 21 ++ docs/rfcs/0009-pyspy-receiver-scope.md | 36 ++-- docs/rfcs/README.md | 27 +++ 5 files changed, 450 insertions(+), 12 deletions(-) create mode 100644 docs/proposals/gen-ai-training-semconv.md create mode 100644 docs/research/README.md diff --git a/docs/FOLLOWUPS.md b/docs/FOLLOWUPS.md index 3421eab8..b62023f7 100644 --- a/docs/FOLLOWUPS.md +++ b/docs/FOLLOWUPS.md @@ -1776,6 +1776,36 @@ scope. Each carries the trigger that should reopen the question. ### Phase: post-RFC-acceptance, before Phase 2 implementation +- [ ] **v0.2 admission-webhook injection for image-immutable operators.** + M13 v0.1 alpha serves operators who can modify the workload + training image. 
Vendor-locked images (SageMaker training jobs, + Lightning AI Studios, Vertex AI Workbench, NVIDIA NGC PyTorch + consumed without modification) are explicitly out of scope per + RFC-0009 §Migration / rollout. Datadog's cluster agent pattern + (mutating admission webhook that injects the helper as an init + container + `LD_PRELOAD` env) is the industry-standard path for + this audience. v0.2 design needs: (a) inventory of which managed + platforms allow admission webhooks at all (some k8s services + restrict them); (b) injection mechanism that avoids `LD_PRELOAD` + conflicts with torchrun/accelerate/DeepSpeed launchers per + RFC-0009 §Alternatives `PyThreadState_Next via LD_PRELOAD`; + (c) UID-alignment story that survives image rebuilds. *Trigger:* + first operator request for non-cooperative-image coverage, or + M13 v0.1 alpha gains its first three production deployments and + a fourth requests vendor-locked-image support. +- [ ] **Upstream `gen_ai.training.*` semconv PR.** The proposal + markdown body lives at `docs/proposals/gen-ai-training-semconv.md` + (drafted 2026-05-19; closes the M1 first-draft KPI gap from + NORTHSTARS O4 line 219). The remaining action is filing the + actual PR on `open-telemetry/semantic-conventions`. Prerequisite + cross-org work: (a) GenAI Instrumentation SIG attendance — at + least one meeting attended before filing to confirm scope + doesn't overlap a pending proposal; (b) competing-proposal + survey on the upstream repo; (c) ideally a co-author or sponsor + from a non-tracecore vendor (Pyroscope, Datadog, Honeycomb SIG + members are candidates). *Trigger:* O4 owner has bandwidth for + weekly SIG cadence (NORTHSTARS O4 KPI line 220), or M13 alpha + ships and external review pressure on the namespace surfaces. - [ ] **Fork-handler in the Python helper.** RFC-0009 §Open questions §2. v0.1 contract is: operator calls `attach()` in each worker entry point (post-fork) so each PID gets its own UDS socket. 
The @@ -1871,6 +1901,119 @@ scope. Each carries the trigger that should reopen the question. settle window so the silent-under-collection misconfig surfaces on the operator dashboard. *Trigger:* Phase 1 trigger-loop ships. +### Phase: empirical validation (needs production data or GPU hardware) + +Items here cannot be performed at design time. They require a real +PyTorch training workload on GPU hardware, a fleet of real customer +deployments, or external-org engagement. The senior-staff and +operator-lens reviews of RFC-0009 surfaced these gaps; they are captured +here so they aren't lost between design-locked and Phase 3 +benchstat work. + +- [ ] **NFR ceilings as measured bounds with sourced derivation.** + RFC-0009 §Non-functional rubrics currently marks all four + ceilings (CPU ≤0.05%, RSS ≤10 MB, egress ≤0.05 Mbps, shutdown + ≤1s p99) as "design contract; measurement deferred to Phase 3 + benchstat." Two gaps survive the design-contract framing, + and both should close in the Phase 3 measurement PR: + **(a) Source the derivation.** The ceiling *values* (the + `0.05%`, the `10 MB`, the `1s`) are inherited from MILESTONES + §M13 rubric, which inherited them from NORTHSTARS O2 + per-receiver budget. NORTHSTARS O2 in turn states the + per-receiver budget without an external anchor. The + derivation chain is "tracecore-defined all the way down." + A principal-tier RFC would name the empirical or + design-budget basis for each ceiling (e.g., "0.05% of node CPU is + about 1/60 of a single hyperthread on a 32-vCPU node, the largest + allocation a multi-receiver collector + can spend on one signal class without crowding"). The + derivation should land alongside the measurement so future + operators understand whether the ceiling can be raised when + a real workload pressure-tests it. 
+ **(b) Measure.** Once derived, replace "design contract" with + "measured X sustained CPU under cadence (60s/15s) on a + 32-thread PyTorch DDP workload (n=600, σ=Y, 1-hour soak); + ceiling Z." That measurement requires GPU hardware + real + PyTorch workload. + *Needs:* GPU hardware (≥1 node with ≥1 NVIDIA GPU running a + PyTorch DDP toy model) for the measurement; the derivation + half is doc-only and could land earlier if the O2 owner has + bandwidth. *Trigger:* Phase 3 benchstat gate wiring; the + derivation column + measurement column replace the + design-contract wording in RFC-0009 §Non-functional rubrics + in the same PR. +- [ ] **Rolling-upgrade chaos validation across mixed + helper/receiver versions.** RFC-0009 §Wire protocol commits to + a version-skew matrix (helper v vs receiver v±1). The matrix + has not been exercised against a real fleet rolling-upgrade + sequence. *Needs:* production data (a fleet of ≥3 simultaneous + deployments at different helper-pin / chart-version + combinations). *Trigger:* M13 alpha gains ≥3 production + deployments; chaos-test fixture lands alongside the third + adopter's rollout. +- [ ] **GIL-hold blast radius of `dump_traceback` during NCCL + collective.** RFC-0009 §Motivation argues the in-target helper + approach is safer than ptrace-based sampling. The helper's own + `dump_traceback` call holds the GIL briefly. If a NCCL + collective is in flight during a dump, the workload-visible + cost is uncharacterized. RFC commits ≤0.05% CPU budget on the + receiver; the *target-process* cost during collective overlap + is unbudgeted. *Needs:* GPU hardware (≥1 multi-GPU node) + running NCCL collectives + Phase 3 helper. *Trigger:* Phase 3 + benchstat measurement on multi-GPU workload. +- [ ] **CI tolerance multiplier (5×) variance justification.** + Both M13 and M14 NFR rubrics use a "5× CI ceiling for + shared-runner variance" multiplier. The constant is + tracecore-defined with no empirical anchor. 
A measurement on + the actual CI runner class (GitHub-hosted, self-hosted) would + establish whether 5× is over-tight or over-loose. *Needs:* + production data (variance distribution across ~50 CI runs on + the target runner class). *Trigger:* Phase 3 benchstat + stability; or first flake report on either receiver's overhead + gate. +- [ ] **GenAI Instrumentation SIG attendance + competing-proposal + survey.** NORTHSTARS O4 KPI line 220 commits to weekly SIG + attendance. As of 2026-05-19 no attendance log exists. Before + filing the upstream `gen_ai.training.*` proposal (see "Upstream + `gen_ai.training.*` semconv PR" row above), at least one SIG + meeting must be attended to confirm the proposal does not + overlap a pending vendor-shaped proposal. *Needs:* cross-org + engagement (weekly meeting time + GitHub SIG-issues review). + *Trigger:* O4 owner schedules first meeting; or first credible + competing-proposal report from the OTel semconv repo. +- [ ] **Vendor verification for `kubectl debug` viability across + managed training platforms.** RFC-0009 §Migration / rollout + names `kubectl debug` + py-spy as operator-standard for + non-cooperative targets. Per-platform availability is + unverified: SageMaker training jobs disallow ephemeral debug + containers entirely; Lightning AI Studios and Vertex AI + Workbench behavior is undocumented in this repo. A per-platform + table belongs in Phase 4 `docs/integrations/pyspy.md` so + operators know which platforms the escape-hatch story actually + works on. *Needs:* cross-org engagement (vendor doc review or + direct testing on at least three managed platforms). *Trigger:* + Phase 4 integrations doc writing; first operator question + about non-cooperative coverage. +- [ ] **OTel Python profiling SDK comparator verification.** The + `m13-industry-alignment-decisions` cross-check excluded the + OTel Python profiling SDK from its industry-comparator table + with an "unverified" caveat. 
The SDK's Python implementation + status is separate from the Profiles signal Alpha (data-model + level). Confirming whether the SDK is a credible v0.2 sampler + candidate (vs `grafana/otel-profiling-python`) needs an + upstream status check. *Needs:* cross-org engagement (OTel + Python SIG check). *Trigger:* v0.2 audience-expansion work; or + `grafana/otel-profiling-python` deprecation announcement. +- [ ] **`pprofile` v1.0 graduation upgrade path.** RFC-0009 §6 + footnote names two Phase 3 options. Option (a) vendors a + v0.152.0 `pprofile` against the v1.58.0 main `pdata` line. + When upstream `pprofile` reaches v1.0, tracecore should bump + and exercise the upgrade in a follow-up PR; capturing any + breaking API changes as a Phase 3+ amendment to RFC-0009 or a + companion RFC per the convention call left to the Phase 3 + author. *Needs:* upstream `pprofile` v1.0 release. + *Trigger:* `pprofile` v1.0 published. + ### Phase: Phase 3 polish (surfaced in Phase 2 review) - [ ] **Parser allocation discipline.** Receiver-side parser should diff --git a/docs/proposals/gen-ai-training-semconv.md b/docs/proposals/gen-ai-training-semconv.md new file mode 100644 index 00000000..5be24b44 --- /dev/null +++ b/docs/proposals/gen-ai-training-semconv.md @@ -0,0 +1,235 @@ +# Proposal — OTel semconv extensions for distributed AI training: run / job / rank / collective + +**Status:** draft, ready to copy-paste into an upstream PR. +**Target:** [open-telemetry/semantic-conventions], `gen_ai.training.*` namespace (subordinate to `gen_ai.*` LLM-client work, parallel scope: training-side). +**Authoring milestone:** M1 (overdue per NORTHSTARS O4 line 219); refreshed at M13 alongside the pyspy receiver reference implementation. + +This document is the markdown body the tracecore O4 owner copies into an upstream PR description. It is NOT yet submitted — the SIG engagement is the gating step. Once a draft PR exists, replace this file with a link to the PR + the SIG's current direction. 
+ +## Motivation + +Training-stack telemetry today emits one of: PyTorch Distributed env vars (`RANK`/`WORLD_SIZE`/`LOCAL_RANK`/`GROUP_RANK`), Slurm (`SLURM_PROCID`/`SLURM_NTASKS`/`SLURM_LOCALID`), Ray (`train.context()`), MLflow (`mlflow.runId`), Weights & Biases (`wandb.run.id`/`wandb.run.group`), Kueue (`Workload.metadata.name`), SageMaker (`TRAINING_JOB_NAME`). No neutral name exists. + +The existing `gen_ai.*` namespace covers LLM client-side operations (request, response, token counts, agent spans). It does not cover training-side concepts (run, job, step, rank, world_size, collective). The gap forces every collector (tracecore, Datadog LLM Observability, Honeycomb, Pyroscope, Grafana Cloud) to invent a private vocabulary, and forces every distributed-training framework to maintain a translation table per backend. + +The window to shepherd a neutral namespace is narrow: vendor-shaped proposals are likely to land within 1-2 quarters. + +## Proposed names + +All entries are **Development** stability tier when first shipped; promotion to Stable follows the existing semconv process (at least one external implementation; SIG approval). + +### 1. `gen_ai.training.run.id` (string, **required**) + +Logical training run identifier. Stable across restarts and pre-emption; same run after recovery. + +| Attribute | Type | Required? | Notes | +|---|---|---|---| +| `gen_ai.training.run.id` | string | yes | Maximum 256 bytes | + +Source-priority chain (consumers MUST implement in order): + +1. `MLFLOW_RUN_ID` env var (MLflow convention) +2. `WANDB_RUN_ID` env var (Weights & Biases convention) +3. `TRACECORE_RUN_ID` env var (operator-set override, for fleets not using MLflow/W&B) +4. Kubernetes Job `metadata.uid` (k8s deployments fallback) + +### 2. `gen_ai.training.job.id` (string, **recommended**) + +Orchestrator-assigned job identifier. Distinct from `run.id`: a single run may comprise multiple orchestrator jobs across pre-emption boundaries. 
+ +| Attribute | Type | Required? | Notes | +|---|---|---|---| +| `gen_ai.training.job.id` | string | no | Maximum 256 bytes | + +Source-priority chain: + +1. Kueue `Workload.metadata.name` +2. `SLURM_JOB_ID` env var (Slurm) +3. Ray `runtime_context().job_id` +4. `TORCHELASTIC_RUN_ID` env var (PyTorch elastic; canonical per [torchrun docs](https://docs.pytorch.org/docs/stable/elastic/run.html)) +5. SageMaker `TRAINING_JOB_NAME` env var +6. Kubernetes Job `metadata.name` + +### 3. `gen_ai.training.rank` (int, **required**) + +Global process rank within the training job. Zero-indexed. + +| Attribute | Type | Required? | Notes | +|---|---|---|---| +| `gen_ai.training.rank` | int | yes | `0 ≤ rank < world_size` | + +Source-priority chain: + +1. `RANK` env var (PyTorch Distributed convention) +2. `SLURM_PROCID` env var (Slurm) +3. Ray `train.context().get_world_rank()` (when running under Ray Train) +4. Kubernetes Pod label `gen_ai.training.io/rank` (k8s-environment fallback only) + +### 4. `gen_ai.training.world_size` (int, **required**) + +Total process count in the training job. + +| Attribute | Type | Required? | Notes | +|---|---|---|---| +| `gen_ai.training.world_size` | int | yes | Constant per `job.id`; backends MAY use it as a safe aggregation key | + +Source-priority chain: + +1. `WORLD_SIZE` env var +2. `SLURM_NTASKS` env var +3. Ray `train.context().get_world_size()` +4. Pod label `gen_ai.training.io/world-size` + +### 5. `gen_ai.training.local_rank` (int, **recommended**) + +Per-node rank. Useful for diagnosing per-host failures (one GPU/NIC unhealthy on one machine). + +| Attribute | Type | Required? | Notes | +|---|---|---|---| +| `gen_ai.training.local_rank` | int | no | `0 ≤ local_rank < gpus_per_node` | + +Source-priority chain: `LOCAL_RANK` env, `SLURM_LOCALID` env, Ray `train.context().get_local_rank()`. + +### 6. 
`gen_ai.training.group_rank` (int, **optional**) + +Rank within a parallelism group (pipeline-parallel, tensor-parallel, data-parallel groups in 3D-parallel training). + +| Attribute | Type | Required? | Notes | +|---|---|---|---| +| `gen_ai.training.group_rank` | int | no | Framework-specific; e.g. `GROUP_RANK` env in PyTorch elastic | +| `gen_ai.training.group_kind` | string | when `group_rank` present | `data`, `tensor`, `pipeline`, `expert` | + +### 7. `gen_ai.training.step` (int, **recommended**) + +Current optimizer step. Monotonic per `run.id`. + +| Attribute | Type | Required? | Notes | +|---|---|---|---| +| `gen_ai.training.step` | int | no | Provided by framework hook; backends MAY rate-limit ingestion | + +### 8. `gen_ai.training.collective.op` (string, **optional**) + +Collective communication operation type. Carried on spans / log records / profile samples that observe a collective in flight. + +| Attribute | Type | Required? | Values | +|---|---|---|---| +| `gen_ai.training.collective.op` | string | no | `all_reduce`, `all_gather`, `reduce_scatter`, `broadcast`, `send`, `recv`, `barrier` | +| `gen_ai.training.collective.tag` | string | no | Application-supplied tag distinguishing collectives in a step | + +## Cardinality guidance + +Consumers and backends should expect the following cardinality shape: + +| Attribute | Cardinality | Aggregation guidance | +|---|---|---| +| `gen_ai.training.run.id` | 10² – 10⁴ per organization per day | Safe to aggregate on | +| `gen_ai.training.job.id` | Same as `run.id` × ~1.1 (pre-emption multiplier) | Safe to aggregate on | +| `gen_ai.training.rank` | **10² – 10⁵ per job** (frontier training) | Unsafe to aggregate on directly; aggregate on `world_size` + `job.id` instead | +| `gen_ai.training.world_size` | Constant per job | Safe aggregation key | +| `gen_ai.training.local_rank` | 1 – ~64 per host | Safe to aggregate on within a host | +| `gen_ai.training.step` | Monotonic int up to 10⁶ – 10⁸ per run | Backends MAY 
rate-limit ingestion | +| `gen_ai.training.collective.op` | Bounded (7 values) | Safe to aggregate on | + +Backends materializing one timeseries per `rank` at 10⁵ ranks should expect cardinality pressure proportional to `world_size`. + +## Prior art + +| Vocabulary | Source | Maps to | +|---|---|---| +| `RANK` / `WORLD_SIZE` / `LOCAL_RANK` / `GROUP_RANK` | [PyTorch torch.distributed](https://docs.pytorch.org/docs/stable/distributed.html) | `gen_ai.training.{rank,world_size,local_rank,group_rank}` | +| `SLURM_PROCID` / `SLURM_NTASKS` / `SLURM_LOCALID` | Slurm | same | +| `train.context()` | [Ray Train](https://docs.ray.io/en/latest/train/api/api.html) | same | +| `MLFLOW_RUN_ID` | [MLflow](https://mlflow.org/docs/latest/tracking.html) | `gen_ai.training.run.id` | +| `WANDB_RUN_ID` / `WANDB_RUN_GROUP` | [Weights & Biases](https://docs.wandb.ai/) | `gen_ai.training.run.id` | +| `Workload.metadata.name` | [Kueue](https://kueue.sigs.k8s.io/) | `gen_ai.training.job.id` | +| `TRAINING_JOB_NAME` | [SageMaker Training Jobs](https://docs.aws.amazon.com/sagemaker/) | `gen_ai.training.job.id` | +| `TORCHELASTIC_RUN_ID` | PyTorch elastic | `gen_ai.training.job.id` | + +## Out of scope (intentionally) + +- **LLM client-side attributes.** Covered by existing `gen_ai.*` (chat completion, agent spans, generative-AI metrics). This proposal does not modify or rename any merged client-side attribute. +- **GPU / hardware attributes.** Covered by the parallel `hw.gpu.*` proposal at `docs/proposals/semconv-hw-gpu-extensions.md`. +- **Inference-side serving attributes.** Token throughput, queue depth, batch-size dynamics. Separate scope. +- **Non-cooperative attach paths and operator escape hatches.** The receiver-side question of how instrumentation reaches a workload (in-process SDK, admission-webhook injection, ptrace, eBPF) is an implementation-side concern, irrelevant to semconv. 
The spec assumes the instrumentation is in-process and emits the names; how it gets there is not standardized here. +- **Framework-internal step substructure** (forward / backward / optimizer phase). Defer to v2 with empirical evidence about which substructure attributes operators actually query. +- **Per-collective payload bytes.** Lives in `hw.gpu.nvlink.io` (parallel proposal). + +## Reference implementation + +[tracecore](https://github.com/TraceCoreAI/tracecore) uses the attribute set in the four components listed below (three receivers that emit it and one detector that consumes it), with the namespace flagged PROPOSED until this proposal lands: + +- **M13 pyspy** receiver (Python stack-sampling): emits `gen_ai.training.rank`, `gen_ai.training.world_size` on every record. See [RFC-0009 §Records and attributes](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0009-pyspy-receiver-scope.md#records-and-attributes). +- **M15 container stdout** receiver: derives the namespace from Pod env vars + label fallback. See [MILESTONES.md M15](https://github.com/TraceCoreAI/tracecore/blob/main/MILESTONES.md#m15-container-stdout-receiver). +- **M14 Kineto** receiver (PyTorch profiler traces): carries `gen_ai.training.rank` + `gen_ai.training.step` per span. +- **M18 stragglers detector**: consumes `gen_ai.training.rank` as the cross-receiver join key for cross-rank correlation. + +The PROPOSED flag on each receiver's README signals the upstream-pending status; if the accepted proposal diverges from this draft, tracecore will adopt the official names verbatim. + +## Open questions for the SIG + +1. **Namespace placement.** `gen_ai.training.*` parallel to `gen_ai.*` LLM-client, or nested under `gen_ai.*` with a discriminator attribute? Tracecore's working hypothesis: parallel, because the LLM-client conventions assume a single-process request/response shape that training cannot inherit. + +2. 
**`run.id` stability across pre-emption + restart.** Two viable models: (a) one `run.id` per logical run, multiple `job.id`s for restart attempts (tracecore's working model); (b) one `run.id` per orchestrator-scheduled instance, requiring an external `experiment.id` to join. Industry split: MLflow uses (a), W&B uses (a), Kueue uses (b) implicitly. + +3. **`collective.op` enumeration source.** Should the value set be aligned with NCCL operation names (`all_reduce`), MPI operation names (`MPI_Allreduce`), or neutral (`all_reduce` lowercased)? Working choice: neutral lowercased, with `gen_ai.training.collective.library` as an optional discriminator. + +4. **`local_rank` carrying host identity vs separate `gen_ai.training.host.id`.** Two ranks on the same host share `local_rank` semantics only if a `host.id` (or `k8s.node.name`) is also present. Should the namespace include its own host concept or rely on existing `host.*` semconv? + +5. **Cardinality enforcement.** Should the spec require backends to support aggregation on `world_size` rather than `rank` (a normative MUST), or recommend it (a SHOULD)? Stricter wording costs adoption; looser wording invites unbounded timeseries growth. + +## Cross-language SDK adoption checklist + +When adopting these attributes, SDKs in each language must verify env-var lookup parity: + +- **Python**: `os.environ["RANK"]`, etc. Already standard in PyTorch / Ray / W&B / MLflow SDKs. +- **Go**: `os.LookupEnv("RANK")`. The tracecore receivers use this directly. +- **Java**: `System.getenv("RANK")`. DJL (Deep Java Library) compatible. +- **Rust**: `std::env::var("RANK")`. Burn / Candle frameworks emit these env vars. + +Names are lowercase-dotted (no language-shaped abbreviations) and free of reserved keywords across the four languages above. + +## Migration / coexistence with vendor vocabularies + +Existing telemetry pipelines emitting MLflow/W&B/Slurm/Ray/Kueue names continue to work without change. 
The spec is additive on entry: no breaking changes to merged `gen_ai.*` LLM-client attributes, and no rename of vendor-emitted names. Adoption is via a translation layer at ingest, illustrated below. + +### Translation examples (OTel Collector `transform` processor) + +A backend or collector pipeline can map vendor-native attributes into `gen_ai.training.*` at ingest. The OTel Collector's `transform` processor is the canonical seam: + +```yaml +processors: + transform/gen_ai_training: + log_statements: + - context: log + statements: + # MLflow → gen_ai.training.run.id + - set(attributes["gen_ai.training.run.id"], attributes["mlflow.run.id"]) where attributes["mlflow.run.id"] != nil + # Weights & Biases → gen_ai.training.run.id + - set(attributes["gen_ai.training.run.id"], attributes["wandb.run.id"]) where attributes["wandb.run.id"] != nil and attributes["gen_ai.training.run.id"] == nil + # PyTorch DDP env → gen_ai.training.rank + - set(attributes["gen_ai.training.rank"], Int(attributes["RANK"])) where attributes["RANK"] != nil + # Slurm → gen_ai.training.rank (when not already set by PyTorch path) + - set(attributes["gen_ai.training.rank"], Int(attributes["SLURM_PROCID"])) where attributes["SLURM_PROCID"] != nil and attributes["gen_ai.training.rank"] == nil + # Kueue Workload → gen_ai.training.job.id + - set(attributes["gen_ai.training.job.id"], attributes["k8s.label.kueue.x-k8s.io/workload"]) where attributes["k8s.label.kueue.x-k8s.io/workload"] != nil +``` + +Order matters: the source-priority chain in each attribute's §Proposed names section determines which vendor wins when more than one is present. Operators pin the translation rules to their fleet's actual emitters; the spec does not require dual-emission from vendor SDKs. + +### Deprecation timeline + +The spec ships at **Development** tier (per the OTel semconv stability process). 
Attribute names MAY change during this tier with at least one minor-version-cycle deprecation warning published in the semconv release notes. Vendors implementing Development-tier names accept this risk and pin against a specific semconv release. + +Promotion to **Stable** requires: (a) ≥1 external (non-tracecore) implementation per major attribute group (`run`, `job`, `rank`, `step`, `collective`); (b) SIG approval; (c) at least one published cardinality validation report from a backend that consumes the names at production scale. + +After promotion to Stable, name changes follow the [OpenTelemetry SemVer compatibility policy](https://opentelemetry.io/docs/specs/otel/versioning-and-stability/): breaking renames require a new major-version semconv release with at least one minor-version overlap window where backends accept both old and new names. + +### Vendor-SDK adoption pattern + +Vendor SDKs (Pyroscope, Datadog's `ddtrace`, Sentry, OTel language SDKs) adopt the namespace by emitting `gen_ai.training.*` alongside their native names during a one-minor-cycle overlap, then dropping the native-name emission once their downstream consumers have switched over. This mirrors the dual-emission overlap OTel has used in earlier semantic-convention migrations, such as the HTTP attribute renames. + +### What this proposal does NOT migrate + +- **`gen_ai.*` LLM-client attributes** (chat completion, agent spans, token counts) remain untouched. The `training.*` discriminator scopes the namespace explicitly. +- **`hw.gpu.*` attributes** (parallel `hw.gpu.*` proposal at `docs/proposals/semconv-hw-gpu-extensions.md`) are an independent migration with its own timeline. +- **Vendor-internal `mlflow.*` / `wandb.*` / `slurm.*` attributes** are not subsumed. Operators who want to keep them alongside `gen_ai.training.*` can; the translation layer copies, it does not move. 
diff --git a/docs/research/README.md b/docs/research/README.md new file mode 100644 index 00000000..833c6c21 --- /dev/null +++ b/docs/research/README.md @@ -0,0 +1,21 @@ +# `docs/research/` + +Research notes that inform but do not themselves make load-bearing decisions. Files here are working artifacts: external-research summaries, measurement baselines, and cross-references that feed RFCs or proposals downstream. + +## What belongs here + +- External-research summaries (`otel-graph-notes.md` shape: industry-standards scans, comparator-tool capability matrices, upstream-status digests) +- Measurement baselines (`baselines.md` shape: raw numbers, methodology, fixture inventory; cited by RFCs that pin against them) +- Time-bounded investigation notes that fed an RFC and remain useful as the RFC's audit trail + +## What does NOT belong here + +- **Load-bearing architecture decisions.** Those are RFCs in [`docs/rfcs/`](../rfcs/). +- **Upstream-proposal markdown bodies.** Those are in [`docs/proposals/`](../proposals/) (ready to copy into a PR description). +- **Operator-facing documentation.** Those are in [`docs/`](..) or per-component READMEs. +- **Deferred work items.** Those are in [`docs/FOLLOWUPS.md`](../FOLLOWUPS.md). +- **Session retrospectives or how-the-doc-came-to-be commentary.** Those belong in commit messages or per-session notes outside the repo, not in the doc body. + +## Discoverability + +Every file here SHOULD be linked from one of: the relevant RFC, the relevant MILESTONES.md entry, or another research file in this directory. Orphan files in this directory accumulate stale context faster than RFCs because they have no "owner milestone" gate; the linking requirement is the hygiene rule that prevents that. 
diff --git a/docs/rfcs/0009-pyspy-receiver-scope.md b/docs/rfcs/0009-pyspy-receiver-scope.md index 83a76b5f..aa9ac649 100644 --- a/docs/rfcs/0009-pyspy-receiver-scope.md +++ b/docs/rfcs/0009-pyspy-receiver-scope.md @@ -60,7 +60,7 @@ A single `sync.Mutex`-protected `inFlight bool` gates both. If the previous fram **Receiver, reader and emit loop** (Go). Reads the length-prefixed dump frame from the UDS, parses `faulthandler`'s line format (`Thread 0x (most recent call first):` followed by ` File "", line , in ` lines), hashes each frame list into `stack.id` via `hash/fnv` `New128a` over the byte-encoded `(file, func, line)` tuples in order, dedups against a per-cadence-window LRU keyed by `stack.id`, emits one `plog.LogRecord` per distinct stack per window with `repeat.count` accumulated. -`fnv128a` (not `fnv64a`) is chosen because `stack.id` is M18's cross-rank join key (per [`MILESTONES.md`](../../MILESTONES.md) M18 straggler-detector decision tree). A single 64-bit collision across ranks would produce a false straggler match. (derived; birthday-bound `n²/(2·2^k)` at n=10⁷ distinct (rank, stack) pairs per fleet-day: 64-bit ≈ 2.7e-6, 128-bit ≈ 1.5e-25.) Go's stdlib `hash/fnv` package exposes `New128a()` returning a `hash.Hash` with 16-byte output, no third-party dep. +`fnv128a` (not `fnv64a`) is chosen because `stack.id` is M18's cross-rank join key (per [`MILESTONES.md`](../../MILESTONES.md) M18 straggler-detector decision tree). A single 64-bit collision across ranks would produce a false straggler match. Birthday-bound derivation: the probability of at least one collision among `n` distinct (rank, stack) pairs hashed to `k` bits is approximately `n²/(2·2^k)`. For a 10⁴-rank training job × ~10 distinct main-thread stacks per rank per day × ~100 fleet-days (per the M18 replay-corpus design horizon), `n ≈ 10⁷`. At `n=10⁷`: 64-bit collision probability ≈ 2.7e-6 (about one collision per ~370k corpora of that size); 128-bit ≈ 1.5e-25 (effectively zero across the project lifetime). 
128-bit is chosen so a multi-job aggregator running against a fleet for years does not produce a false straggler verdict from hash collision alone. Go's stdlib `hash/fnv` package exposes `New128a()` returning a `hash.Hash` with 16-byte output, no third-party dep. **Cadence pairing with M18.** The 15s main-thread cadence is chosen against M18's "≥3 consecutive main-thread samples" threshold for the GIL-hold pattern (per `MILESTONES.md` M18 decision tree). 15s × 3 = 45s minimum sustained-state detection window. The Phase 3 deliverable list below carries an explicit cross-link fixture that asserts the M13 cadence × M18 threshold product holds at every build, so neither side can drift silently. @@ -86,7 +86,7 @@ Maximum frame length is `32 MiB` (refused above this with `IncError("frame_too_l The direction-labeled self-metric is the surface a Phase 4 alert rule binds to: a sustained non-zero `helper_newer` rate after a chart rollout pages on-call to chase pip-pin updates in workload images. -**Versioning.** Helper's first frame after `accept()` is `{"kind":"hello","version":1}`. The receiver-action rules are the table above. Major bumps require a one-minor overlap where the receiver accepts both `version` and `version-1`; minor bumps are forward-compatible reads only. +**Versioning.** Helper's first frame after `accept()` is `{"kind":"hello","version":1}`. The receiver-action rules are the table above. Major bumps require a one-minor overlap where the receiver accepts both `version` and `version-1`; minor bumps are forward-compatible reads only. The one-minor overlap window is sized against the [Kubernetes API deprecation policy](https://kubernetes.io/docs/reference/using-api/deprecation-policy/) (Beta API elements: 3 releases or 9 months after deprecation; this RFC adopts the shorter "one minor cycle" because the helper/receiver pair is two-party rather than ecosystem-wide). 
Operators with longer rollout windows pin both halves explicitly via Helm chart `appVersion` and the helper's `pip install ==X.Y.Z` hash-pin. ### Records and attributes @@ -94,8 +94,8 @@ Each emitted record carries: | Attribute | Source | Notes | |---|---|---| -| `gen_ai.training.rank` | Pod env `RANK`, fallback to Pod label `tracecore.io/rank` | Canonical join key. Same derivation rule as M15. The `gen_ai.training.*` namespace is the subject of NORTHSTARS O4. Tracecore is upstreaming it through `open-telemetry/semantic-conventions`. Attributes carry this namespace per the NORTHSTARS O4 shepherding commitment. | -| `gen_ai.training.world_size` | Pod env `WORLD_SIZE`, fallback to Pod label `tracecore.io/world-size` | Required for cross-rank queries. | +| `gen_ai.training.rank` | Orchestrator-neutral chain: env `RANK` → env `SLURM_PROCID` → Ray `ray.train.get_context().get_world_rank()` → Pod label `gen_ai.training.io/rank` (k8s-environment fallback only). | Canonical join key. Same derivation rule as M15. The `gen_ai.training.*` namespace is the subject of NORTHSTARS O4; the draft upstream proposal is `docs/proposals/gen-ai-training-semconv.md`. Attributes carry this namespace per the NORTHSTARS O4 shepherding commitment. | +| `gen_ai.training.world_size` | Orchestrator-neutral chain: env `WORLD_SIZE` → env `SLURM_NTASKS` → Ray `ray.train.get_context().get_world_size()` → Pod label `gen_ai.training.io/world-size`. | Required for cross-rank queries. Constant per `job.id`; backends use it as a safe aggregation key per the upstream proposal §Cardinality guidance. | | `python.thread.id` | faulthandler header `0x` | Native thread id, not the Python `threading.Thread.ident`. | | `python.thread.name` | faulthandler header parenthetical, e.g. `(MainThread)` | Empty when faulthandler omits it (older CPython). | | `stack.id` | `fnv128a(file_0 \0 func_0 \0 line_0 \0 ... file_N \0 func_N \0 line_N)` | Bytes deliberate, not text-format: identical across ranks for identical stacks.
128-bit width per [Design overview](#design-overview). | @@ -147,7 +147,7 @@ The reader thread itself wraps each `dump_traceback` invocation in `try/except B ### Receiver lifecycle -Plumbing matches [`internal/runtime/lifecycle.Lifecycle`](../../internal/runtime/lifecycle/lifecycle.go) (the same plumbing M9 uses). `Start(ctx)` scans `uds_dir` for `pyspy.*.sock` files and connects to each. A `connect()` returning `ECONNREFUSED` against an existing `.sock` file indicates a stale path from a crashed prior helper; the receiver logs once and skips (does not unlink, since the receiver is not the file's owner). Absence of any live helper triggers `target_not_attached` posture and a 30s retry (tracecore-defined; mirrors the M9 `kernelevents` retry cadence for unavailable-source posture). `Shutdown(ctx)` cancels the context, writes a `{"kind":"shutdown"}` frame to each open connection, closes the sockets, and waits on the lifecycle WaitGroup with the 1s Phase-1 budget. Every receiver-owned goroutine is wrapped in `internal/safe.Call("pyspy.", fn)`. +Plumbing matches [`internal/runtime/lifecycle.Lifecycle`](../../internal/runtime/lifecycle/lifecycle.go) (the same plumbing M9 uses). `Start(ctx)` scans `uds_dir` for `pyspy.*.sock` files and connects to each. A `connect()` returning `ECONNREFUSED` against an existing `.sock` file indicates a stale path from a crashed prior helper; the receiver logs once and skips (does not unlink, since the receiver is not the file's owner). Absence of any live helper triggers `target_not_attached` posture and a 30s retry. The 30s cadence is derived from the default Kubernetes liveness-probe failure window (`periodSeconds: 10` × `failureThreshold: 3` = 30s of consecutive failures before kubelet escalates; `initialDelaySeconds` defaults to 0 and adds nothing), so the receiver retries on the same envelope an operator sees in cluster-side health signals. M9 `kernelevents` adopted the same cadence for the same reason; the derivation is the kubelet timing, not a tracecore-internal convention.
`Shutdown(ctx)` cancels the context, writes a `{"kind":"shutdown"}` frame to each open connection, closes the sockets, and waits on the lifecycle WaitGroup with the 1s Phase-1 budget. Every receiver-owned goroutine is wrapped in `internal/safe.Call("pyspy.", fn)`. ### Degraded modes @@ -164,6 +164,9 @@ Each row maps to one `IncError(kind)` invocation and one `FAILURE-MODES.md` entr | Frame larger than `max_frame_bytes` | `frame_too_large` | Connection closed; receiver idles for this PID. | | Frame payload malformed | `parse_error` | Drop the frame, increment counter, continue. | | Helper signals `faulthandler` unavailable (CPython built without it) | `faulthandler_missing` | Helper sends `{"kind":"hello","version":1,"unsupported":true}`; receiver idles with self-metric. | +| `uds_dir` exists but receiver lacks read+execute permission | `uds_dir_permission_denied` | Validate-at-Start fails with named-field error per M1's config-error contract; receiver does not start. Operator sees error in pod startup logs. Distinct from `target_not_attached` (which is "dir exists and is readable but empty"). | +| Helper-side OOM during a `dump_traceback` call (Python interpreter OOM, not workload OOM) | `helper_oom_mid_dump` | Helper's reader thread catches the resulting `MemoryError` via its `BaseException` handler and replies `{"kind":"dump_failed","reason":"MemoryError"}`. Distinct from `dump_failed` only in the `reason` payload; counts under the broader `dump_failed` kind. Documented separately so operators know `MemoryError` is the surface to alert on. | +| Workload image rebuilt with a different `runAsUser` than the sidecar | `sidecar_uid_drift` | Helper binds UDS with the workload's UID + mode `0700`; sidecar with different UID gets `EACCES` on connect. Receiver enters `target_not_listening` posture for that PID, but the cause is operator-visible UID misconfig. 
Phase 4 chart `values.yaml` defaults both UIDs from one variable to make this loud; in Phase 1 the receiver self-metric `disabled_reason="sidecar_uid_drift"` distinguishes from `target_not_listening`. | | Receiver shutdown lands mid-frame (in-flight response not drained) | `target_gone` (on next Start) | Current Shutdown is silent (no per-shutdown error counter); next Start observes the helper-side socket closure and posts the `target_gone` row. Operator-visible signal: gap in dump records bracketed by the receiver restart timestamp. | There is no `cap_missing` row. The receiver requires no capability addition. @@ -177,12 +180,12 @@ There is no `cap_missing` row. The receiver requires no capability addition. ### Non-functional rubrics -Per [`MILESTONES.md`](../../MILESTONES.md) §M13: +Per [`MILESTONES.md`](../../MILESTONES.md) §M13. The four ceilings below are **design contracts** inherited from NORTHSTARS O2 per-receiver budgets and the M13 rubric. They are not measurements. The asserting tests below are Phase 1 (`pyspy-lint`, integration scaffold) / Phase 3 (overhead benchstat, 1-hour soak) deliverables; current numbers are targets, not validated bounds. The receiver promotes from alpha to beta only when overhead rubrics pass under benchstat p<0.05 across two consecutive `main` runs, at which point the "measured" wording replaces "design contract" wording here. -- Sustained CPU ≤0.05% via `syscall.Getrusage(RUSAGE_SELF)` delta over a 10-min run at default cadence. CI ceiling 5× (0.25%) for shared-runner variance, matching M14's identical multiplier. -- Sustained egress ≤0.05 Mbps via a counting OTLP sink over the same window. -- RSS ≤10 MB via `/proc/self/status` `VmRSS` over a 1-hour soak; gated by `//go:build soak`. -- Shutdown ≤1s p99 from SIGTERM. Deadline test pins this. +- Sustained CPU ≤0.05% (design contract) via `syscall.Getrusage(RUSAGE_SELF)` delta over a 10-min run at default cadence. 
CI ceiling 5× (0.25%) for shared-runner variance, matching M14's identical multiplier. Measurement deferred to Phase 3 overhead benchstat. +- Sustained egress ≤0.05 Mbps (design contract) via a counting OTLP sink over the same window. Measurement deferred to Phase 3. +- RSS ≤10 MB (design contract) via `/proc/self/status` `VmRSS` over a 1-hour soak; gated by `//go:build soak`. Measurement deferred to Phase 3 soak. +- Shutdown ≤1s p99 from SIGTERM (design contract). Deadline test pins this in Phase 1 against a fake-clock fixture; real-workload p99 measurement deferred to Phase 3. ### Rubric amendments @@ -287,7 +290,14 @@ OQs §1, §2, §6, §7 were phase-blocking before this RFC's acceptance into dra 6. **Wire format: OTel pprof-dictionary vs. `plog.LogRecord`.** *(resolved; Phase 3 unblocked.)* - **Resolved: emit pprof-dictionary from v0.1 alpha.** Rationale: [OpenTelemetry Profiles](https://opentelemetry.io/docs/concepts/signals/profiles/) entered public Alpha on 2026-03-26 and is positioned as the cross-vendor profiling-data wire format the industry is consolidating around (Elastic-donated reference implementation; Pyroscope's roadmap signals directional alignment without a committed wire-format migration date; eBPF profiler interop). Tracecore's NORTHSTARS O4 commits to leading on OTel semconv; emitting pprof-dictionary from v0.1 aligns the receiver with both the upstream signal and the project's own direction. Alpha-stage risk is mitigated by pinning the OTel Profiles version in `components.yaml` and treating any future format shift as an ecosystem-change SLA event per NORTHSTARS O6. `plog.LogRecord` rejected on adoption grounds: forcing operators to use tracecore-proprietary parsing rather than the cross-vendor pprof format would surface as friction during exactly the operator-habit-forming period this RFC names. 
+ **Resolved: emit pprof-dictionary from v0.1 alpha.** Rationale: [OpenTelemetry Profiles](https://opentelemetry.io/docs/concepts/signals/profiles/) entered public Alpha on 2026-03-26 with Elastic-donated reference implementation, pprof-compatible per [OTEP-0239](https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md). Pyroscope 2.0 (April 2026) accepts OTLP profile ingestion; Grafana Alloy carries an OTLP-profiles receiver. Tracecore's NORTHSTARS O4 commits to leading on OTel semconv; emitting pprof-dictionary from v0.1 aligns the receiver with the upstream signal direction. Alpha-stage risk is mitigated by pinning the OTel Profiles version in `components.yaml` and treating any future format shift as an ecosystem-change SLA event per NORTHSTARS O6. `plog.LogRecord` rejected on adoption grounds: tracecore-proprietary parsing rather than the cross-vendor pprof format would surface as friction during the operator-habit-forming period this RFC names. + + **Phase 1 registration posture.** Phase 1 of the receiver registers as a logs receiver via `pipeline.ReceiverFactory.CreateLogs`; `CreateMetrics` and `CreateTraces` return `pipeline.ErrSignalNotSupported`. Phase 1 has no emission, so the wire-format choice does not yet bind to a pdata type. At Phase 3, two options for landing pprof-dictionary records through the pipeline: + + 1. Add `CreateProfiles(ctx, set, cfg, next consumer.Profiles)` to `pipeline.ReceiverFactory` and import `go.opentelemetry.io/collector/pdata/pprofile`. As of 2026-05-19 this package is at v0.152.0 (pre-v1.0) against tracecore's stable v1.58.0 `pdata` pin; the receiver would carry a v0.x dependency alongside the v1.x stable line until upstream `pprofile` reaches v1.0. 
Under this option the Phase 1 `CreateLogs` implementation is retired and re-implemented as `CreateProfiles`, the pipeline wiring in `cmd/tracecore/components.go` moves from a logs pipeline to a profiles pipeline, and `components.yaml` may need a `signals:` column added. + 2. Ship pprof-dictionary payloads inside `plog.LogRecord` bodies with a content-type attribute during the alpha window, deferring the `ReceiverFactory` extension until upstream `pprofile` is stable. Under this option the Phase 1 `CreateLogs` registration is the eventual production path; Phase 3 adds emission on top of the Phase 1 scaffold without rewiring. + + The Phase 1 scaffold, CI gates, config schema, `target_not_attached` posture, and `kinds.go` declarations survive both choices. Phase 3 picks based on `pprofile` stability tier at Phase 3 start and accepts the corresponding rework cost. Whether that decision lives in an in-place amendment to this RFC or in a companion RFC is itself a Phase 3 question; the project has no established precedent for in-place RFC amendments today (this paragraph is the first such entry in any tracecore RFC), so the Phase 3 author makes that convention call. 7. **`stack.id` path normalization across heterogeneous Python images.** *(resolved; Phase 3 unblocked.)* @@ -307,7 +317,9 @@ The Python helper is `pip install`-only. No vendored CPython, no compiled extens **Rollback:** Setting `receivers.pyspy.enabled=false` removes the receiver from the pipeline. The Python helper still binds its UDS but no one connects, so the workload sees zero behavior change. The operator can also `pip uninstall tracecore-pyspy` and redeploy; nothing else in tracecore depends on it. -**On-demand forensics escape hatch.** M13 covers *continuous cooperative* sampling. 
For one-off investigation of a Python target the operator did not build with `tracecore-pyspy` in the image, `kubectl debug --target=` plus `py-spy dump --pid 1` (or the `kubectl-prof` style ephemeral attach pattern) remains the operator-standard tool. The two coexist: `tracecore-pyspy` emits OTLP to the same sink as the rest of tracecore for always-on observability; on-demand py-spy prints to the operator's terminal for one-shot answers against non-cooperative targets. This split is documented operator-side in `docs/integrations/pyspy.md` (Phase 4 deliverable). +**Audience scope (v0.1 alpha).** M13 v0.1 alpha serves operators who can modify the workload training image (add one `pip install` line, one `import` line). This covers in-house production training and academic ML; it does NOT cover workloads running vendor-locked images the operator cannot rebuild — managed services like AWS SageMaker training jobs, Lightning AI Studios, Vertex AI Workbench, multi-tenant JupyterHub clusters, or pre-baked NVIDIA NGC PyTorch images consumed without modification. For those, M13 v0.1 is explicitly out of scope, and the receiver enters `target_not_attached` posture and idles. An admission-webhook-injection mode (the path Datadog's cluster agent takes for tracing) is named in `docs/FOLLOWUPS.md` as v0.2+ work; until that lands, image-immutable operators are uncovered by M13. + +**On-demand forensics for non-cooperative targets.** For one-off investigation of a Python process the operator cannot modify the image of, `kubectl debug --target=` plus `py-spy dump --pid 1` (or `kubectl-prof`-style ephemeral attach) is the operator-standard tool. This is NOT a tracecore-supported continuous coverage path: ephemeral debug containers are disabled on most managed Kubernetes training platforms (SageMaker training jobs disallow them entirely), and even where allowed they do not survive pod restarts. The reference here is operator literacy, not a tracecore capability claim. 
Phase 4 `docs/integrations/pyspy.md` will document this boundary explicitly so operators reach for the right tool per case. ## References diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md index 09dd1d55..83eb21ce 100644 --- a/docs/rfcs/README.md +++ b/docs/rfcs/README.md @@ -48,3 +48,30 @@ renumbering the *unaccepted* RFC, not the accepted one. The file naming inconsistency (`0006` is `RFC-0006-*.md`, others are `NNNN-*.md`) is grandfathered; new RFCs use the unprefixed `NNNN-short-title.md` form. + +## Amendments to accepted or draft-locked RFCs + +The first in-place amendment to a tracecore RFC was the "Phase 1 +registration posture" paragraph added to RFC-0009 §6 on 2026-05-19. +Until that point the project had no convention. Going forward: + +- **Editorial corrections** (typo, broken link, dead citation) edit + the RFC body directly with a normal commit message; no in-body + date-stamp. +- **Substantive additions inside an existing section** (clarification, + forward-compatibility note, scope refinement that does not alter + any decision) edit the body in-place, dated only by `git log`; do + not add inline "(added YYYY-MM-DD)" markers. The git history is the + audit trail. +- **Decision changes or scope reversals** open a new RFC that supersedes + the old one. The superseded RFC's `Status:` flips to + `superseded by RFC-NNNN` and the body stays in-tree as historical + record. + +The RFC-0009 §6 "Phase 1 registration posture" footnote is the +borderline case that prompted this convention (substantive addition +that names a Phase 3 choice it defers). The original draft of that +footnote carried an inline `(added 2026-05-19)` date-stamp; after +this convention was codified, the date-stamp was removed and the +audit trail moved to `git log`. Subsequent amendments follow the +rules above without inline date-stamps. 
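Review note on the RFC-0009 `stack.id` hunk in this patch: the birthday-bound figures it quotes can be re-derived with a few lines of arithmetic. The sketch below is a hypothetical check, not tracecore code. The function name and the choice of Python are assumptions made for illustration. The inputs (10⁴ ranks, ~10 distinct stacks per rank per day, ~100 fleet-days) are the ones the hunk states, and the approximation `n²/(2·2^k)` only holds while the result is much smaller than 1.

```python
# Hypothetical sanity check of the birthday-bound numbers cited for stack.id.
# Same approximation as the RFC text: p ~= n^2 / (2 * 2^k), valid when p << 1.


def collision_probability(n: int, bits: int) -> float:
    """Approximate chance of at least one collision among n distinct keys
    hashed uniformly into a space of 2**bits values."""
    return (n * n) / (2.0 * 2.0 ** bits)


# Inputs from the RFC hunk: ranks x stacks per rank per day x fleet-days.
n = 10**4 * 10 * 100

p64 = collision_probability(n, 64)
p128 = collision_probability(n, 128)

print(f"n       = {n:.0e}")
print(f"64-bit  ~ {p64:.1e}")
print(f"128-bit ~ {p128:.1e}")
```

Under those inputs the script reproduces the hunk's ≈ 2.7e-6 (64-bit) and ≈ 1.5e-25 (128-bit). Changing the rank count or horizon shows how quickly the 64-bit margin erodes, since the bound grows with `n²`.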
From 3837b6a0c3296f75637aeb8b4b05971ccc69fc11 Mon Sep 17 00:00:00 2001 From: Tri Lam Date: Tue, 19 May 2026 10:50:57 -0700 Subject: [PATCH 2/3] =?UTF-8?q?[docs]=20RFC-0009=20=C2=A76=20footnote:=20d?= =?UTF-8?q?efer=20amendment=20convention=20to=20docs/rfcs/README.md?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The original §6 footnote claimed "no established precedent for in-place RFC amendments today (this paragraph is the first such entry in any tracecore RFC)." That self-reference was true when written but is now stale: docs/rfcs/README.md (added in the same branch) codifies the amendment-convention tiers (editorial / substantive / decision-change). Replace the self-referential paragraph with a link to the README section so a six-months-cold reader sees the rule, not the meta. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/rfcs/0009-pyspy-receiver-scope.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/rfcs/0009-pyspy-receiver-scope.md b/docs/rfcs/0009-pyspy-receiver-scope.md index aa9ac649..f9ff3b99 100644 --- a/docs/rfcs/0009-pyspy-receiver-scope.md +++ b/docs/rfcs/0009-pyspy-receiver-scope.md @@ -297,7 +297,7 @@ OQs §1, §2, §6, §7 were phase-blocking before this RFC's acceptance into dra 1. Add `CreateProfiles(ctx, set, cfg, next consumer.Profiles)` to `pipeline.ReceiverFactory` and import `go.opentelemetry.io/collector/pdata/pprofile`. As of 2026-05-19 this package is at v0.152.0 (pre-v1.0) against tracecore's stable v1.58.0 `pdata` pin; the receiver would carry a v0.x dependency alongside the v1.x stable line until upstream `pprofile` reaches v1.0. Under this option the Phase 1 `CreateLogs` implementation is retired and re-implemented as `CreateProfiles`, the pipeline wiring in `cmd/tracecore/components.go` moves from a logs pipeline to a profiles pipeline, and `components.yaml` may need a `signals:` column added. 2. 
Ship pprof-dictionary payloads inside `plog.LogRecord` bodies with a content-type attribute during the alpha window, deferring the `ReceiverFactory` extension until upstream `pprofile` is stable. Under this option the Phase 1 `CreateLogs` registration is the eventual production path; Phase 3 adds emission on top of the Phase 1 scaffold without rewiring. - The Phase 1 scaffold, CI gates, config schema, `target_not_attached` posture, and `kinds.go` declarations survive both choices. Phase 3 picks based on `pprofile` stability tier at Phase 3 start and accepts the corresponding rework cost. Whether that decision lives in an in-place amendment to this RFC or in a companion RFC is itself a Phase 3 question; the project has no established precedent for in-place RFC amendments today (this paragraph is the first such entry in any tracecore RFC), so the Phase 3 author makes that convention call. + The Phase 1 scaffold, CI gates, config schema, `target_not_attached` posture, and `kinds.go` declarations survive both choices. Phase 3 picks based on `pprofile` stability tier at Phase 3 start and accepts the corresponding rework cost. Whether the Phase 3 decision lives in an in-place amendment to this RFC or in a companion RFC follows the convention in [`docs/rfcs/README.md` § Amendments to accepted or draft-locked RFCs](README.md#amendments-to-accepted-or-draft-locked-rfcs). 7. **`stack.id` path normalization across heterogeneous Python images.** *(resolved; Phase 3 unblocked.)* From c01abc141d24236ea50e5420e230a38a4da24cca Mon Sep 17 00:00:00 2001 From: Tri Lam Date: Tue, 19 May 2026 10:59:12 -0700 Subject: [PATCH 3/3] [docs] m13 review pass: in-place refactor for six-months-cold reader MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Review of PR #93 surfaced six structural improvements; this commit makes them as in-place rewrites rather than footnote/postscript additions. 
RFC-0009 §6: merge "Phase 1 registration posture" into the resolved paragraph. The two now read as one coherent answer (resolution + elaboration on Phase 3 plumbing options) instead of two competing Resolved subsections inside one OQ. RFC-0009 §Migration / rollout: collapse "Audience scope" and "On-demand forensics for non-cooperative targets" into one paragraph. Both restated the same in-scope / out-of-scope boundary; the merged version covers image-modifiable audience + vendor-locked exclusion + kubectl debug as one continuous read. docs/rfcs/README.md: drop the self-history paragraph about RFC-0009 §6 being the first amendment to prompt the convention. Cold readers need the rules; the anecdote is git-history material. docs/proposals/gen-ai-training-semconv.md: - Cardinality table: world_size entry now gives a number (same as job.id, 10^2-10^4 per day) instead of just "constant per job". - Reference implementation: trim to honest scope. M13 (design-locked RFC, the closest to actual implementation) is the reference; M14 / M15 / M18 framed as roadmap extensions, not co-equal references. docs/FOLLOWUPS.md: NFR row reformatted as single coherent paragraph without bold (a)/(b) sub-labels that broke FOLLOWUPS visual pattern. Empirical-validation section preamble dropped the meta sentence about reviewer-lens origins; readers want the items, not the section's provenance. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/FOLLOWUPS.md | 60 +++++++++-------------- docs/proposals/gen-ai-training-semconv.md | 11 ++--- docs/rfcs/0009-pyspy-receiver-scope.md | 14 +++--- docs/rfcs/README.md | 12 ----- 4 files changed, 33 insertions(+), 64 deletions(-) diff --git a/docs/FOLLOWUPS.md b/docs/FOLLOWUPS.md index b62023f7..27a1484e 100644 --- a/docs/FOLLOWUPS.md +++ b/docs/FOLLOWUPS.md @@ -1905,43 +1905,31 @@ scope. Each carries the trigger that should reopen the question. Items here cannot be performed at design time. 
They require a real PyTorch training workload on GPU hardware, a fleet of real customer -deployments, or external-org engagement. The senior-staff and -operator-lens reviews of RFC-0009 surfaced these gaps; capturing -them here so they aren't lost between design-locked and Phase 3 -benchstat work. - -- [ ] **NFR ceilings as measured bounds with sourced derivation.** - RFC-0009 §Non-functional rubrics currently marks all four - ceilings (CPU ≤0.05%, RSS ≤10 MB, egress ≤0.05 Mbps, shutdown - ≤1s p99) as "design contract; measurement deferred to Phase 3 - benchstat." Two gaps survive the design-contract framing, - and both should close in the Phase 3 measurement PR: - **(a) Source the derivation.** The ceiling *values* (the - `0.05%`, the `10 MB`, the `1s`) are inherited from MILESTONES - §M13 rubric, which inherited them from NORTHSTARS O2 - per-receiver budget. NORTHSTARS O2 in turn states the - per-receiver budget without an external anchor. The - derivation chain is "tracecore-defined all the way down." - A principal-tier RFC would name the empirical or - design-budget basis for each ceiling (e.g., "0.05% CPU is - 1/20 of a single hyperthread on a 32-vCPU node, the largest - smallest-acceptable allocation a multi-receiver collector - can spend on one signal class without crowding"). The - derivation should land alongside the measurement so future - operators understand whether the ceiling can be raised when - a real workload pressure-tests it. - **(b) Measure.** Once derived, replace "design contract" with - "measured X sustained CPU under cadence (60s/15s) on a - 32-thread PyTorch DDP workload (n=600, σ=Y, 1-hour soak); - ceiling Z." That measurement requires GPU hardware + real - PyTorch workload. +deployments, or external-org engagement. + +- [ ] **NFR ceilings sourced + measured.** RFC-0009 §Non-functional + rubrics marks all four ceilings (CPU ≤0.05%, RSS ≤10 MB, + egress ≤0.05 Mbps, shutdown ≤1s p99) as design contracts + pending Phase 3 benchstat. 
Two gaps survive that framing. + First, the ceiling values are inherited from MILESTONES §M13, + which inherited them from NORTHSTARS O2 per-receiver budget, + which states the budget without an external anchor — the + derivation chain is tracecore-defined all the way down. A + principal-tier rationale names the basis explicitly (e.g., + "0.05% CPU is 1/20 of a single hyperthread on a 32-vCPU node, + the smallest-acceptable allocation a multi-receiver collector + can spend on one signal class without crowding") so operators + know whether the ceiling can be raised under real workload + pressure. Second, the wording itself replaces "design + contract" with measured bounds: "measured X sustained CPU + under cadence (60s/15s) on a 32-thread PyTorch DDP workload + (n=600, σ=Y, 1-hour soak); ceiling Z." The derivation half is + doc-only and could land earlier if the O2 owner has bandwidth; + the measurement half needs GPU hardware + real PyTorch. *Needs:* GPU hardware (≥1 node with ≥1 NVIDIA GPU running a - PyTorch DDP toy model) for the measurement; the derivation - half is doc-only and could land earlier if the O2 owner has - bandwidth. *Trigger:* Phase 3 benchstat gate wiring; the - derivation column + measurement column replace the - design-contract wording in RFC-0009 §Non-functional rubrics - in the same PR. + PyTorch DDP toy model). *Trigger:* Phase 3 benchstat gate + wiring; the same PR replaces design-contract wording in + RFC-0009 §Non-functional rubrics. - [ ] **Rolling-upgrade chaos validation across mixed helper/receiver versions.** RFC-0009 §Wire protocol commits to a version-skew matrix (helper v vs receiver v±1). 
The matrix diff --git a/docs/proposals/gen-ai-training-semconv.md b/docs/proposals/gen-ai-training-semconv.md index 5be24b44..69f34c93 100644 --- a/docs/proposals/gen-ai-training-semconv.md +++ b/docs/proposals/gen-ai-training-semconv.md @@ -125,7 +125,7 @@ Consumers and backends should expect the following cardinality shape: | `gen_ai.training.run.id` | 10² – 10⁴ per organization per day | Safe to aggregate on | | `gen_ai.training.job.id` | Same as `run.id` × ~1.1 (pre-emption multiplier) | Safe to aggregate on | | `gen_ai.training.rank` | **10² – 10⁵ per job** (frontier training) | Unsafe to aggregate on directly; aggregate on `world_size` + `job.id` instead | -| `gen_ai.training.world_size` | Constant per job | Safe aggregation key | +| `gen_ai.training.world_size` | Same as `job.id` (10² – 10⁴ per day); constant within a job | Safe aggregation key | | `gen_ai.training.local_rank` | 1 – ~64 per host | Safe to aggregate on within a host | | `gen_ai.training.step` | Monotonic int up to 10⁶ – 10⁸ per run | Backends MAY rate-limit ingestion | | `gen_ai.training.collective.op` | Bounded (8 values) | Safe to aggregate on | @@ -156,14 +156,9 @@ Backends materializing one timeseries per `rank` at 10⁵ ranks should expect ca ## Reference implementation -[tracecore](https://github.com/TraceCoreAI/tracecore) emits the attribute set in five receivers, with the namespace flagged PROPOSED until this proposal lands: +[tracecore](https://github.com/TraceCoreAI/tracecore) is the reference implementation, with the namespace flagged PROPOSED on emit until this proposal lands. -- **M13 pyspy** receiver (Python stack-sampling): emits `gen_ai.training.rank`, `gen_ai.training.world_size` on every record. See [RFC-0009 §Records and attributes](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0009-pyspy-receiver-scope.md#records-and-attributes). -- **M15 container stdout** receiver: derives the namespace from Pod env vars + label fallback. 
See [MILESTONES.md M15](https://github.com/TraceCoreAI/tracecore/blob/main/MILESTONES.md#m15-container-stdout-receiver). -- **M14 Kineto** receiver (PyTorch profiler traces): carries `gen_ai.training.rank` + `gen_ai.training.step` per span. -- **M18 stragglers detector**: consumes `gen_ai.training.rank` as the cross-receiver join key for cross-rank correlation. - -The PROPOSED flag on each receiver's README signals the upstream-pending status; tracecore will adopt the official names verbatim if/when this proposal lands with divergence from the draft. +The M13 pyspy receiver (design-locked in [RFC-0009](https://github.com/TraceCoreAI/tracecore/blob/main/docs/rfcs/0009-pyspy-receiver-scope.md)) commits to emitting `gen_ai.training.rank` and `gen_ai.training.world_size` on every record, with the derivation rules in this proposal's §Proposed names. Three further milestones on tracecore's roadmap (M14 Kineto profiler-trace consumer, M15 container stdout receiver, M18 straggler detector) extend the namespace to `step`, `job.id`, and `run.id` semantics as they ship; see [MILESTONES.md](https://github.com/TraceCoreAI/tracecore/blob/main/MILESTONES.md). The PROPOSED flag on each emitting receiver's README signals upstream-pending status; tracecore will adopt the official names verbatim if/when this proposal lands with divergence from the draft. ## Open questions for the SIG diff --git a/docs/rfcs/0009-pyspy-receiver-scope.md b/docs/rfcs/0009-pyspy-receiver-scope.md index f9ff3b99..c832069a 100644 --- a/docs/rfcs/0009-pyspy-receiver-scope.md +++ b/docs/rfcs/0009-pyspy-receiver-scope.md @@ -290,14 +290,14 @@ OQs §1, §2, §6, §7 were phase-blocking before this RFC's acceptance into dra 6. **Wire format: OTel pprof-dictionary vs. 
`plog.LogRecord`.** *(resolved; Phase 3 unblocked.)* - **Resolved: emit pprof-dictionary from v0.1 alpha.** Rationale: [OpenTelemetry Profiles](https://opentelemetry.io/docs/concepts/signals/profiles/) entered public Alpha on 2026-03-26 with Elastic-donated reference implementation, pprof-compatible per [OTEP-0239](https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md). Pyroscope 2.0 (April 2026) accepts OTLP profile ingestion; Grafana Alloy carries an OTLP-profiles receiver. Tracecore's NORTHSTARS O4 commits to leading on OTel semconv; emitting pprof-dictionary from v0.1 aligns the receiver with the upstream signal direction. Alpha-stage risk is mitigated by pinning the OTel Profiles version in `components.yaml` and treating any future format shift as an ecosystem-change SLA event per NORTHSTARS O6. `plog.LogRecord` rejected on adoption grounds: tracecore-proprietary parsing rather than the cross-vendor pprof format would surface as friction during the operator-habit-forming period this RFC names. + **Resolved: emit pprof-dictionary from v0.1 alpha.** [OpenTelemetry Profiles](https://opentelemetry.io/docs/concepts/signals/profiles/) entered public Alpha on 2026-03-26 with Elastic-donated reference implementation, pprof-compatible per [OTEP-0239](https://github.com/open-telemetry/oteps/blob/main/text/profiles/0239-profiles-data-model.md); Pyroscope 2.0 (April 2026) and Grafana Alloy consume OTLP profile records natively. Tracecore's NORTHSTARS O4 commits to leading on OTel semconv; emitting pprof-dictionary from v0.1 aligns the receiver with the upstream signal direction. Alpha-stage risk is mitigated by pinning the OTel Profiles version in `components.yaml` and treating any future format shift as an ecosystem-change SLA event per NORTHSTARS O6. 
`plog.LogRecord` rejected on adoption grounds: tracecore-proprietary parsing rather than the cross-vendor pprof format would surface as friction during the operator-habit-forming period this RFC names. - **Phase 1 registration posture.** Phase 1 of the receiver registers as a logs receiver via `pipeline.ReceiverFactory.CreateLogs`; `CreateMetrics` and `CreateTraces` return `pipeline.ErrSignalNotSupported`. Phase 1 has no emission, so the wire-format choice does not yet bind to a pdata type. At Phase 3, two options for landing pprof-dictionary records through the pipeline: + The Phase 1 receiver carries no emission, so the wire-format choice does not yet bind to a pdata type. The factory registers as a logs receiver via `pipeline.ReceiverFactory.CreateLogs`; `CreateMetrics` and `CreateTraces` return `pipeline.ErrSignalNotSupported`. Phase 3 lands pprof-dictionary records through the pipeline via one of two paths, picked based on `pprofile` stability tier at Phase 3 start: - 1. Add `CreateProfiles(ctx, set, cfg, next consumer.Profiles)` to `pipeline.ReceiverFactory` and import `go.opentelemetry.io/collector/pdata/pprofile`. As of 2026-05-19 this package is at v0.152.0 (pre-v1.0) against tracecore's stable v1.58.0 `pdata` pin; the receiver would carry a v0.x dependency alongside the v1.x stable line until upstream `pprofile` reaches v1.0. Under this option the Phase 1 `CreateLogs` implementation is retired and re-implemented as `CreateProfiles`, the pipeline wiring in `cmd/tracecore/components.go` moves from a logs pipeline to a profiles pipeline, and `components.yaml` may need a `signals:` column added. - 2. Ship pprof-dictionary payloads inside `plog.LogRecord` bodies with a content-type attribute during the alpha window, deferring the `ReceiverFactory` extension until upstream `pprofile` is stable. Under this option the Phase 1 `CreateLogs` registration is the eventual production path; Phase 3 adds emission on top of the Phase 1 scaffold without rewiring. + 1. 
**`CreateProfiles` factory extension.** Add `CreateProfiles(ctx, set, cfg, next consumer.Profiles)` to `pipeline.ReceiverFactory` and import `go.opentelemetry.io/collector/pdata/pprofile`. As of 2026-05-19 this package is at v0.152.0 (pre-v1.0) against tracecore's stable v1.58.0 `pdata` pin; the receiver would carry a v0.x dependency alongside the v1.x stable line until upstream `pprofile` reaches v1.0. Under this path the Phase 1 `CreateLogs` implementation is retired and re-implemented as `CreateProfiles`, the pipeline wiring in `cmd/tracecore/components.go` moves from a logs pipeline to a profiles pipeline, and `components.yaml` may need a `signals:` column added. + 2. **`plog.LogRecord` bodies with content-type.** Ship pprof-dictionary payloads inside `plog.LogRecord` bodies with a content-type attribute during the alpha window, deferring the `ReceiverFactory` extension until upstream `pprofile` is stable. Under this path the Phase 1 `CreateLogs` registration is the eventual production path; Phase 3 adds emission on top of the Phase 1 scaffold without rewiring. - The Phase 1 scaffold, CI gates, config schema, `target_not_attached` posture, and `kinds.go` declarations survive both choices. Phase 3 picks based on `pprofile` stability tier at Phase 3 start and accepts the corresponding rework cost. Whether the Phase 3 decision lives in an in-place amendment to this RFC or in a companion RFC follows the convention in [`docs/rfcs/README.md` § Amendments to accepted or draft-locked RFCs](README.md#amendments-to-accepted-or-draft-locked-rfcs). + The Phase 1 scaffold, CI gates, config schema, `target_not_attached` posture, and `kinds.go` declarations survive both choices. Whether the Phase 3 decision lives in an in-place amendment to this RFC or in a companion RFC follows the convention in [`docs/rfcs/README.md` § Amendments to accepted or draft-locked RFCs](README.md#amendments-to-accepted-or-draft-locked-rfcs). 7. 
**`stack.id` path normalization across heterogeneous Python images.** *(resolved; Phase 3 unblocked.)* @@ -317,9 +317,7 @@ The Python helper is `pip install`-only. No vendored CPython, no compiled extens **Rollback:** Setting `receivers.pyspy.enabled=false` removes the receiver from the pipeline. The Python helper still binds its UDS but no one connects, so the workload sees zero behavior change. The operator can also `pip uninstall tracecore-pyspy` and redeploy; nothing else in tracecore depends on it. -**Audience scope (v0.1 alpha).** M13 v0.1 alpha serves operators who can modify the workload training image (add one `pip install` line, one `import` line). This covers in-house production training and academic ML; it does NOT cover workloads running vendor-locked images the operator cannot rebuild — managed services like AWS SageMaker training jobs, Lightning AI Studios, Vertex AI Workbench, multi-tenant JupyterHub clusters, or pre-baked NVIDIA NGC PyTorch images consumed without modification. For those, M13 v0.1 is explicitly out of scope, and the receiver enters `target_not_attached` posture and idles. An admission-webhook-injection mode (the path Datadog's cluster agent takes for tracing) is named in `docs/FOLLOWUPS.md` as v0.2+ work; until that lands, image-immutable operators are uncovered by M13. - -**On-demand forensics for non-cooperative targets.** For one-off investigation of a Python process the operator cannot modify the image of, `kubectl debug --target=` plus `py-spy dump --pid 1` (or `kubectl-prof`-style ephemeral attach) is the operator-standard tool. This is NOT a tracecore-supported continuous coverage path: ephemeral debug containers are disabled on most managed Kubernetes training platforms (SageMaker training jobs disallow them entirely), and even where allowed they do not survive pod restarts. The reference here is operator literacy, not a tracecore capability claim. 
Phase 4 `docs/integrations/pyspy.md` will document this boundary explicitly so operators reach for the right tool per case. +**Audience scope (v0.1 alpha).** M13 v0.1 alpha serves operators who can modify the workload training image (add one `pip install` line, one `import` line). This covers in-house production training and academic ML. It does NOT cover workloads running vendor-locked images the operator cannot rebuild: managed services like AWS SageMaker training jobs, Lightning AI Studios, Vertex AI Workbench, multi-tenant JupyterHub clusters, or pre-baked NVIDIA NGC PyTorch images consumed without modification. For those, M13 v0.1 is explicitly out of scope; the receiver enters `target_not_attached` posture and idles. An admission-webhook-injection mode (Datadog's cluster-agent pattern for tracing) is named in `docs/FOLLOWUPS.md` as v0.2+ work; until that lands, image-immutable operators are uncovered by M13. For one-off investigation against such workloads, operators reach for `kubectl debug --target=` plus `py-spy dump --pid 1` (or `kubectl-prof`-style ephemeral attach) as their own tool, NOT through tracecore — ephemeral debug containers are disabled on SageMaker training jobs entirely and do not survive pod restarts elsewhere, so the path is forensic and one-shot, not continuous coverage. Phase 4 `docs/integrations/pyspy.md` documents this boundary so operators pick the right tool per case. ## References diff --git a/docs/rfcs/README.md b/docs/rfcs/README.md index 83eb21ce..72a1ed81 100644 --- a/docs/rfcs/README.md +++ b/docs/rfcs/README.md @@ -51,10 +51,6 @@ naming inconsistency (`0006` is `RFC-0006-*.md`, others are ## Amendments to accepted or draft-locked RFCs -The first in-place amendment to a tracecore RFC was the "Phase 1 -registration posture" paragraph added to RFC-0009 §6 on 2026-05-19. -Until that point the project had no convention. 
Going forward: - - **Editorial corrections** (typo, broken link, dead citation) edit the RFC body directly with a normal commit message; no in-body date-stamp. @@ -67,11 +63,3 @@ Until that point the project had no convention. Going forward: the old one. The superseded RFC's `Status:` flips to `superseded by RFC-NNNN` and the body stays in-tree as historical record. - -The RFC-0009 §6 "Phase 1 registration posture" footnote is the -borderline case that prompted this convention (substantive addition -that names a Phase 3 choice it defers). The original draft of that -footnote carried an inline `(added 2026-05-19)` date-stamp; after -this convention was codified, the date-stamp was removed and the -audit trail moved to `git log`. Subsequent amendments follow the -rules above without inline date-stamps.