[docs] m13 industry-alignment pass: RFC-0009 + gen_ai.training.* proposal + FOLLOWUPS#93
Merged
…al + FOLLOWUPS

Four-lens A+ review (operator/SRE, senior staff, OTel SIG, doc hygiene) defined gap closure criteria; this commit executes them.

RFC-0009 corrections (M13 pyspy receiver, draft-locked):
- Generalize rank derivation: orchestrator-neutral chain env RANK -> SLURM_PROCID -> Ray train.context() -> k8s label fallback
- Three new IncError rows: uds_dir_permission_denied, helper_oom_mid_dump, sidecar_uid_drift
- NFR ceilings reframed as design contracts pending Phase 3 benchstat measurement, not asserted measurements
- Anchor invented constants: n=10^7 birthday-bound derived from 10^4 ranks x 10 stacks x 100 fleet-days; 30s target_not_attached retry from kubelet liveness-probe grace window; one-minor version skew window from K8s API deprecation policy
- Honest non-cooperative-image audience scoping: v0.1 requires image cooperation; vendor-locked images (SageMaker, Lightning AI, Vertex AI, NGC PyTorch unmodified) explicitly out of scope; kubectl debug reframed as operator literacy, not coverage
- New section 6 footnote: Phase 1 registers as logs receiver via CreateLogs; Phase 3 picks between CreateProfiles factory extension or plog.LogRecord bodies with pprof content-type; asymmetric rework cost documented
- Remove uncited Pyroscope roadmap directional-alignment claim

New docs/proposals/gen-ai-training-semconv.md: upstream proposal draft for gen_ai.training.* namespace. Closes M1 first-draft KPI from NORTHSTARS O4 (4 months overdue). Mirrors semconv-hw-gpu-extensions.md shape; full attribute set, cardinality guidance, prior-art table (PyTorch DDP / Slurm / Ray / MLflow / W&B / Kueue / SageMaker / torchrun), cross-language SDK adoption checklist, OTel transform-processor translation examples.

New docs/research/README.md: directory purpose statement clarifying what belongs in research/ vs rfcs/ vs proposals/; no-orphan rule.

Amended docs/rfcs/README.md: codifies in-place amendment convention (editorial / substantive / decision-change tiers); names the RFC-0009 section 6 footnote as the borderline case that prompted the convention.

Amended docs/FOLLOWUPS.md: new "empirical validation" section under M13 with 10 rows for items needing GPU hardware, production data, or cross-org engagement (NFR ceiling derivation + measurement, rolling-upgrade chaos, GIL-hold blast radius during NCCL collective, SIG attendance, upstream PR filing, vendor escape-hatch verification, OTel Python SDK status, pprof v1.0 upgrade, v0.2 admission-webhook for image-immutable operators).

No standalone cross-check research doc shipped; findings landed directly in RFC-0009 section 6 + FOLLOWUPS rows per doc-hygiene reviewer recommendation.

Final-state validator confirmed grade movement without new defects: Operator B-/74 -> B+/87, Senior staff A-/85 -> A/92, OTel SIG D+/C- -> B+/87, Doc hygiene B-/C+ -> B+/87.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
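The orchestrator-neutral rank chain named in the RFC-0009 corrections (env RANK -> SLURM_PROCID -> Ray -> k8s label fallback) can be illustrated with a minimal Python sketch. It is not the receiver's implementation: the function name `derive_rank`, the injected `ray_world_rank` callable (standing in for whatever Ray's training context exposes), and the label key `example.com/rank` are all hypothetical, chosen only to show the fall-through order.

```python
from typing import Callable, Mapping, Optional, Tuple


def _parse_rank(value: Optional[str]) -> Optional[int]:
    """Return a non-negative integer rank, or None if the value is unusable."""
    if value is None:
        return None
    try:
        rank = int(value.strip())
    except ValueError:
        return None
    return rank if rank >= 0 else None


def derive_rank(
    env: Mapping[str, str],
    ray_world_rank: Optional[Callable[[], Optional[int]]] = None,
    pod_labels: Optional[Mapping[str, str]] = None,
    rank_label: str = "example.com/rank",  # hypothetical label key
) -> Optional[Tuple[int, str]]:
    """Walk the chain in order and return (rank, source), or None if all fail."""
    # Steps 1 and 2: environment variables set by torchrun-style launchers and Slurm.
    for name in ("RANK", "SLURM_PROCID"):
        rank = _parse_rank(env.get(name))
        if rank is not None:
            return rank, "env:" + name

    # Step 3: Ray. The callable is injected so this sketch does not import Ray.
    if ray_world_rank is not None:
        try:
            rank = ray_world_rank()
        except Exception:
            rank = None  # treat any Ray-side failure as "not available"
        if rank is not None and rank >= 0:
            return rank, "ray"

    # Step 4: Kubernetes pod label fallback.
    if pod_labels is not None:
        rank = _parse_rank(pod_labels.get(rank_label))
        if rank is not None:
            return rank, "k8s_label"

    return None
```

Returning the source alongside the rank lets a caller record which link of the chain answered; an unparseable value at one step falls through to the next rather than aborting.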
…README.md

The original §6 footnote claimed "no established precedent for in-place RFC amendments today (this paragraph is the first such entry in any tracecore RFC)." That self-reference was true when written but is now stale: docs/rfcs/README.md (added in the same branch) codifies the amendment-convention tiers (editorial / substantive / decision-change). Replace the self-referential paragraph with a link to the README section so a six-months-cold reader sees the rule, not the meta.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review of PR #93 surfaced six structural improvements; this commit makes them as in-place rewrites rather than footnote/postscript additions.

RFC-0009 §6: merge "Phase 1 registration posture" into the resolved paragraph. The two now read as one coherent answer (resolution + elaboration on Phase 3 plumbing options) instead of two competing Resolved subsections inside one OQ.

RFC-0009 §Migration / rollout: collapse "Audience scope" and "On-demand forensics for non-cooperative targets" into one paragraph. Both restated the same in-scope / out-of-scope boundary; the merged version covers image-modifiable audience + vendor-locked exclusion + kubectl debug as one continuous read.

docs/rfcs/README.md: drop the self-history paragraph about RFC-0009 §6 being the first amendment to prompt the convention. Cold readers need the rules; the anecdote is git-history material.

docs/proposals/gen-ai-training-semconv.md:
- Cardinality table: world_size entry now gives a number (same as job.id, 10^2-10^4 per day) instead of just "constant per job".
- Reference implementation: trim to honest scope. M13 (design-locked RFC, the closest to actual implementation) is the reference; M14 / M15 / M18 framed as roadmap extensions, not co-equal references.

docs/FOLLOWUPS.md: NFR row reformatted as single coherent paragraph without bold (a)/(b) sub-labels that broke FOLLOWUPS visual pattern. Empirical-validation section preamble dropped the meta sentence about reviewer-lens origins; readers want the items, not the section's provenance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request on May 19, 2026
Three build-approach-independent rubric refinements from the M15 research evidence base (PR #92) and RFC-0010 (PR #94):
- R-4 (cosmetic): max_log_size citation now references the OTel `container` stanza operator default (1 MiB matches; documents the prior art whether we depend on it via BA-1 or port it via BA-2).
- R-5 (new reliability caveat): containerd #11149 silently drops bytes from 0.log when an in-container process reads its own FD 1. Shared-pipe contention; not generic backpressure. Standard workloads unaffected. Was previously misframed in research round 1 as "disk-I/O backpressure"; the 2025-01-22 reproducer in the issue pinpoints the mechanism.
- R-8 (degraded-mode specificity): rotation-stalled is now defined concretely as 0.log size > containerLogMaxSize for ≥30 s (3× kubelet default containerLogMonitorInterval of 10 s, cited at source). Surfaced via IncError("rotation_stalled"). Prior text was generic "kubelet rotation breakage" with no detection mechanism.

R-1 / R-2 (namespace) withheld: OD-12 effectively resolved by the upstream proposal at docs/proposals/gen-ai-training-semconv.md (O4-overdue first-draft KPI closed PR #93). No rename needed.

R-3 / R-7 (rotation correctness, gzip handling) deferred: pending the corresponding integration-test fixtures (TestContainerStdout_*) landing in the M15 implementation phase per RFC-0010.

R-6 (bbolt cursor) dropped: BA-2 build approach (RFC-0010) keeps the JSON cursor at /var/lib/tracecore/container_stdout/cursor.json as originally rubricked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request on May 19, 2026
## Summary

Three build-approach-independent rubric refinements for M15, surfaced by the research doc (PR #92) and locked in by RFC-0010 (PR #94). Five additional rubric edits (R-1, R-2, R-3, R-6, R-7) are intentionally withheld per the rationale below.

## Diff

3 lines changed in `MILESTONES.md`:

- **R-4 (cosmetic):** `max_log_size` citation now references the OTel `container` stanza operator default. 1 MiB matches; documents the prior art whether we depend on it (BA-1) or port it (BA-2 per RFC-0010).
- **R-5 (new reliability caveat):** containerd [#11149](containerd/containerd#11149) silently drops bytes from `0.log` when an in-container process reads its own FD 1. Mechanism is shared-pipe contention, not generic backpressure (round-1 research framed it incorrectly; the 2025-01-22 reproducer in the issue pinpoints the mechanism). Standard workloads unaffected.
- **R-8 (degraded-mode specificity):** rotation-stalled is now defined concretely as `0.log` size > `containerLogMaxSize` for ≥30 s (3× kubelet default `containerLogMonitorInterval` of 10 s, cited at source). Surfaced via `IncError("rotation_stalled")`. Prior text was generic "kubelet rotation breakage" with no detection mechanism.

## Why not the other 5 rubric edits

- **R-1 / R-2 (namespace):** OD-12 effectively resolved by the upstream proposal at `docs/proposals/gen-ai-training-semconv.md` (PR #93 closed the O4-overdue first-draft KPI). No rename needed.
- **R-3 / R-7 (rotation correctness, gzip handling):** deferred pending the corresponding `TestContainerStdout_*` integration-test fixtures landing in the M15 implementation phase per RFC-0010.
- **R-6 (bbolt cursor):** dropped because BA-2 keeps the JSON cursor at `/var/lib/tracecore/container_stdout/cursor.json` as originally rubricked.

## Test plan

- [x] `make doc-check` clean (273 markdown links resolve, banned-phrase lint clean across 67 files, RUNBOOK ↔ alerts pairing clean).
- [x] Containerd #11149 mechanism re-verified at the issue's 2025-01-22 reproducer comment; mechanism is shared-pipe contention.
- [x] `containerLogMonitorInterval` default cited verbatim at `pkg/kubelet/apis/config/v1beta1/defaults.go`; verified in research §13.8.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
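The R-8 definition (`0.log` size above `containerLogMaxSize` for at least 30 s) amounts to a small piece of state: the time at which the file first went over the limit. The Python sketch below is one way to express that rule under stated assumptions. The class name `RotationStallDetector`, its fields, and the 10 MiB limit in the usage example are illustrative and not taken from the tracecore code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RotationStallDetector:
    """Flags a stall when the log stays over the size limit for stall_after_s."""

    max_size_bytes: int          # the kubelet containerLogMaxSize, in bytes
    stall_after_s: float = 30.0  # 3 x containerLogMonitorInterval (10 s)
    _over_since: Optional[float] = field(default=None, init=False, repr=False)

    def observe(self, size_bytes: int, now_s: float) -> bool:
        """Record one size sample; return True once the stall condition holds."""
        if size_bytes <= self.max_size_bytes:
            # Rotation happened (or the file never grew past the limit): reset.
            self._over_since = None
            return False
        if self._over_since is None:
            # First sample over the limit starts the clock.
            self._over_since = now_s
        return now_s - self._over_since >= self.stall_after_s


# Usage: feed samples from a polling loop (file size plus a monotonic clock).
detector = RotationStallDetector(max_size_bytes=10 * 1024 * 1024)
oversize = 11 * 1024 * 1024
first = detector.observe(oversize, now_s=0.0)      # clock starts, not stalled yet
stalled = detector.observe(oversize, now_s=30.0)   # still oversize 30 s later
```

In a receiver, a `True` result is the point at which the error counter named in R-8 (`rotation_stalled`) would be incremented. Resetting on any in-limit sample means a late but successful rotation does not count toward the 30 s window.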
Summary
- `RANK` → `SLURM_PROCID` → Ray `train.context()` → k8s label fallback); add three IncError rows (`uds_dir_permission_denied`, `helper_oom_mid_dump`, `sidecar_uid_drift`); reframe NFRs as design contracts pending Phase 3 measurement; anchor invented constants (`n=10⁷` derivation, kubelet-timing retry, K8s deprecation-policy version skew); honest non-cooperative-image audience scoping. Add §6 footnote: Phase 1 logs-receiver registration; Phase 3 picks between `CreateProfiles` factory extension or `plog.LogRecord` bodies. Follow-up commit refactors the §6 footnote's amendment-convention reference to link `docs/rfcs/README.md`.
- `docs/proposals/gen-ai-training-semconv.md`. Closes NORTHSTARS O4 first-draft KPI (4 months overdue). Mirrors hw-gpu proposal shape; full attribute set with cardinality guidance, prior-art table (PyTorch DDP / Slurm / Ray / MLflow / W&B / Kueue / SageMaker / torchrun), cross-language SDK adoption checklist, OTel `transform` processor translation examples.
- `docs/research/README.md` purpose statement. Codify in-place amendment convention in `docs/rfcs/README.md` (editorial / substantive / decision-change tiers). FOLLOWUPS new "empirical validation" section with 10 rows for Phase 3 / cross-org / GPU-hardware work.

Review surface
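Four reviewer lenses (operator/SRE, senior staff, OTel SIG, doc hygiene) defined A+ criteria. Final-state validator confirmed grade movement without new defects:

| Lens | Before | After |
| --- | --- | --- |
| Operator | B-/74 | B+/87 |
| Senior staff | A-/85 | A/92 |
| OTel SIG | D+/C- | B+/87 |
| Doc hygiene | B-/C+ | B+/87 |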
Single residual gap: NFR ceiling derivation + measurement. Both halves captured in FOLLOWUPS as Phase 3 work requiring GPU hardware.
Test plan
- `make doc-check` passes (banned-phrase lint clean across 67 markdown files; 267 internal links resolve; 53 test refs verified; alert-check + release-doc-parity gates green)
- `make lint` clean (0 golangci-lint issues)
- `make vet` clean
- `docs/proposals/gen-ai-training-semconv.md` mirrors `semconv-hw-gpu-extensions.md` section structure (Motivation, Proposed names, Cardinality guidance, Prior art, Out of scope, Reference implementation, Open questions; plus Cross-language SDK checklist + Migration for richer adoption story)
- `uds_dir_permission_denied`, `helper_oom_mid_dump`, `sidecar_uid_drift` (verified via grep, count = 3)
- `docs/research/README.md` links to canonical entry points (`docs/rfcs/`, `docs/proposals/`, `FOLLOWUPS.md`)
- `docs/rfcs/README.md` per six-months-cold-reader sweep

🤖 Generated with Claude Code