
[docs] m13 industry-alignment pass: RFC-0009 + gen_ai.training.* proposal + FOLLOWUPS#93

Merged
trilamsr merged 3 commits into main from worktree-m13-decisions
May 19, 2026

Conversation

@trilamsr

Contributor

Summary

  • RFC-0009 corrections (M13 pyspy receiver, design-locked draft). Generalize rank derivation to an orchestrator-neutral chain (env RANK → SLURM_PROCID → Ray train.context() → k8s label fallback); add three IncError rows (uds_dir_permission_denied, helper_oom_mid_dump, sidecar_uid_drift); reframe NFRs as design contracts pending Phase 3 measurement; anchor invented constants (n=10⁷ derivation, kubelet-timing retry, K8s deprecation-policy version skew); scope the non-cooperative-image audience honestly. Add §6 footnote: Phase 1 registers as a logs receiver; Phase 3 picks between a CreateProfiles factory extension and plog.LogRecord bodies. A follow-up commit refactors the §6 footnote's amendment-convention reference to link docs/rfcs/README.md.
  • New upstream proposal at docs/proposals/gen-ai-training-semconv.md. Closes NORTHSTARS O4 first-draft KPI (4 months overdue). Mirrors hw-gpu proposal shape; full attribute set with cardinality guidance, prior-art table (PyTorch DDP / Slurm / Ray / MLflow / W&B / Kueue / SageMaker / torchrun), cross-language SDK adoption checklist, OTel transform processor translation examples.
  • Hygiene. No standalone cross-check research doc shipped; findings folded directly into RFC-0009 §6 + FOLLOWUPS rows per doc-hygiene reviewer recommendation. Add docs/research/README.md purpose statement. Codify in-place amendment convention in docs/rfcs/README.md (editorial / substantive / decision-change tiers). FOLLOWUPS new "empirical validation" section with 10 rows for Phase 3 / cross-org / GPU-hardware work.
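
The rank-derivation chain in the first bullet can be sketched as a simple first-match fallback. This is a minimal illustration, not the RFC's implementation: the function names, the `ray_rank` callable, and the `example.com/rank` label key are all hypothetical, and the Ray step is injected as a callable rather than imported because the exact Ray call the RFC intends is not shown here.

```python
import os
from typing import Callable, Mapping, Optional


def _parse_rank(value: Optional[str]) -> Optional[int]:
    """Return a non-negative integer rank, or None if missing or malformed."""
    if value is None:
        return None
    try:
        rank = int(value.strip())
    except ValueError:
        return None
    return rank if rank >= 0 else None


def derive_rank(
    env: Mapping[str, str] = os.environ,
    ray_rank: Optional[Callable[[], Optional[int]]] = None,  # hypothetical hook
    pod_labels: Optional[Mapping[str, str]] = None,
    label_key: str = "example.com/rank",  # hypothetical label name
) -> Optional[int]:
    """Walk the chain RANK -> SLURM_PROCID -> Ray -> k8s label; first hit wins."""
    # 1. Generic env var (set by torchrun-style launchers).
    rank = _parse_rank(env.get("RANK"))
    if rank is not None:
        return rank

    # 2. Slurm task id.
    rank = _parse_rank(env.get("SLURM_PROCID"))
    if rank is not None:
        return rank

    # 3. Ray Train context, supplied by the caller so this sketch has no Ray dependency.
    if ray_rank is not None:
        try:
            rank = ray_rank()
        except Exception:
            rank = None  # treat any Ray lookup failure as "not available"
        if rank is not None and rank >= 0:
            return rank

    # 4. Kubernetes pod label as the last resort.
    if pod_labels is not None:
        return _parse_rank(pod_labels.get(label_key))
    return None
```

A malformed value at one step (for example `RANK=abc`) falls through to the next step instead of failing, which is one reasonable reading of "fallback"; the RFC may choose to surface that case as an error instead.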

Review surface

Four reviewer lenses (operator/SRE, senior staff, OTel SIG, doc hygiene) defined A+ criteria. Final-state validator confirmed grade movement without new defects:

| Lens | Before | After |
| --- | --- | --- |
| Operator/SRE | B−/74 | B+/87 |
| Senior staff | A−/85 | A/92 |
| OTel SIG | D+/C− | B+/87 |
| Doc hygiene | B−/C+ | B+/87 |

Single residual gap: NFR ceiling derivation + measurement. Both halves captured in FOLLOWUPS as Phase 3 work requiring GPU hardware.

Test plan

  • make doc-check passes (banned-phrase lint clean across 67 markdown files; 267 internal links resolve; 53 test refs verified; alert-check + release-doc-parity gates green)
  • make lint clean (0 golangci-lint issues)
  • make vet clean
  • docs/proposals/gen-ai-training-semconv.md mirrors semconv-hw-gpu-extensions.md section structure (Motivation, Proposed names, Cardinality guidance, Prior art, Out of scope, Reference implementation, Open questions; plus Cross-language SDK checklist + Migration for richer adoption story)
  • RFC-0009 §Degraded modes contains three new rows: uds_dir_permission_denied, helper_oom_mid_dump, sidecar_uid_drift (verified via grep, count = 3)
  • FOLLOWUPS new "empirical validation" section present under M13 (grep count = 1)
  • docs/research/README.md links to canonical entry points (docs/rfcs/, docs/proposals/, FOLLOWUPS.md)
  • §6 footnote no longer self-references; defers amendment convention to docs/rfcs/README.md per six-months-cold-reader sweep

🤖 Generated with Claude Code

trilamsr and others added 3 commits May 19, 2026 10:49
…al + FOLLOWUPS

Four-lens A+ review (operator/SRE, senior staff, OTel SIG, doc
hygiene) defined gap closure criteria; this commit executes them.

RFC-0009 corrections (M13 pyspy receiver, design-locked draft):
- Generalize rank derivation: orchestrator-neutral chain env RANK
  -> SLURM_PROCID -> Ray train.context() -> k8s label fallback
- Three new IncError rows: uds_dir_permission_denied,
  helper_oom_mid_dump, sidecar_uid_drift
- NFR ceilings reframed as design contracts pending Phase 3
  benchstat measurement, not asserted measurements
- Anchor invented constants: n=10^7 birthday-bound derived from
  10^4 ranks x 10 stacks x 100 fleet-days; 30s target_not_attached
  retry from kubelet liveness-probe grace window; one-minor version
  skew window from K8s API deprecation policy
- Honest non-cooperative-image audience scoping: v0.1 requires
  image cooperation; vendor-locked images (SageMaker, Lightning AI,
  Vertex AI, NGC PyTorch unmodified) explicitly out of scope;
  kubectl debug reframed as operator literacy, not coverage
- New section 6 footnote: Phase 1 registers as logs receiver via
  CreateLogs; Phase 3 picks between CreateProfiles factory
  extension or plog.LogRecord bodies with pprof content-type;
  asymmetric rework cost documented
- Remove uncited Pyroscope roadmap directional-alignment claim
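
The n=10^7 figure above is the product of the three stated factors. A minimal sketch of that arithmetic, plus the standard birthday-bound approximation it feeds, follows. The 64-bit identifier width is an assumption made only for illustration; the commit message does not state the hash width the RFC uses.

```python
import math

RANKS = 10**4       # from the commit message
STACKS = 10         # from the commit message
FLEET_DAYS = 100    # from the commit message


def sample_count(ranks: int = RANKS, stacks: int = STACKS, days: int = FLEET_DAYS) -> int:
    """n = ranks x stacks x fleet-days."""
    return ranks * stacks * days


def collision_probability(n: int, bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits))."""
    space = 2**bits
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-n * (n - 1) / (2 * space))


n = sample_count()
p = collision_probability(n, bits=64)  # 64 bits is an assumed width
print(f"n = {n:.0e}, p(collision) ~= {p:.2e}")
```

Under the assumed 64-bit width the collision probability comes out on the order of 10^-6; a different width changes the result by a factor of two per bit.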

New docs/proposals/gen-ai-training-semconv.md: upstream proposal
draft for gen_ai.training.* namespace. Closes M1 first-draft KPI
from NORTHSTARS O4 (4 months overdue). Mirrors
semconv-hw-gpu-extensions.md shape; full attribute set,
cardinality guidance, prior-art table (PyTorch DDP / Slurm / Ray /
MLflow / W&B / Kueue / SageMaker / torchrun), cross-language SDK
adoption checklist, OTel transform-processor translation examples.

New docs/research/README.md: directory purpose statement clarifying
what belongs in research/ vs rfcs/ vs proposals/; no-orphan rule.

Amended docs/rfcs/README.md: codifies in-place amendment convention
(editorial / substantive / decision-change tiers); names the
RFC-0009 section 6 footnote as the borderline case that prompted
the convention.

Amended docs/FOLLOWUPS.md: new "empirical validation" section under
M13 with 10 rows for items needing GPU hardware, production data,
or cross-org engagement (NFR ceiling derivation + measurement,
rolling-upgrade chaos, GIL-hold blast radius during NCCL collective,
SIG attendance, upstream PR filing, vendor escape-hatch
verification, OTel Python SDK status, pprof v1.0 upgrade, v0.2
admission-webhook for image-immutable operators).

No standalone cross-check research doc shipped; findings landed
directly in RFC-0009 section 6 + FOLLOWUPS rows per doc-hygiene
reviewer recommendation.

Final-state validator confirmed grade movement without new defects:
Operator B-/74 -> B+/87, Senior staff A-/85 -> A/92, OTel SIG
D+/C- -> B+/87, Doc hygiene B-/C+ -> B+/87.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…README.md

The original §6 footnote claimed "no established precedent for in-place
RFC amendments today (this paragraph is the first such entry in any
tracecore RFC)." That self-reference was true when written but is now
stale: docs/rfcs/README.md (added in the same branch) codifies the
amendment-convention tiers (editorial / substantive / decision-change).
Replace the self-referential paragraph with a link to the README
section so a six-months-cold reader sees the rule, not the meta.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Review of PR #93 surfaced six structural improvements; this commit
makes them as in-place rewrites rather than footnote/postscript
additions.

RFC-0009 §6: merge "Phase 1 registration posture" into the resolved
paragraph. The two now read as one coherent answer (resolution +
elaboration on Phase 3 plumbing options) instead of two competing
Resolved subsections inside one OQ.

RFC-0009 §Migration / rollout: collapse "Audience scope" and
"On-demand forensics for non-cooperative targets" into one
paragraph. Both restated the same in-scope / out-of-scope boundary;
the merged version covers image-modifiable audience + vendor-locked
exclusion + kubectl debug as one continuous read.

docs/rfcs/README.md: drop the self-history paragraph about RFC-0009
§6 being the first amendment to prompt the convention. Cold readers
need the rules; the anecdote is git-history material.

docs/proposals/gen-ai-training-semconv.md:
- Cardinality table: world_size entry now gives a number (same as
  job.id, 10^2-10^4 per day) instead of just "constant per job".
- Reference implementation: trim to honest scope. M13 (design-locked
  RFC, the closest to actual implementation) is the reference; M14
  / M15 / M18 framed as roadmap extensions, not co-equal references.

docs/FOLLOWUPS.md: NFR row reformatted as single coherent paragraph
without bold (a)/(b) sub-labels that broke FOLLOWUPS visual pattern.
Empirical-validation section preamble dropped the meta sentence
about reviewer-lens origins; readers want the items, not the
section's provenance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@trilamsr trilamsr merged commit 5dd9df4 into main May 19, 2026
8 checks passed
@trilamsr trilamsr deleted the worktree-m13-decisions branch May 19, 2026 18:28
trilamsr added a commit that referenced this pull request May 19, 2026
Three build-approach-independent rubric refinements from the M15
research evidence base (PR #92) and RFC-0010 (PR #94):

- R-4 (cosmetic): max_log_size citation now references the OTel
  `container` stanza operator default (1 MiB matches; documents the
  prior art whether we depend on it via BA-1 or port it via BA-2).
- R-5 (new reliability caveat): containerd #11149 silently drops
  bytes from 0.log when an in-container process reads its own FD 1.
  Shared-pipe contention; not generic backpressure. Standard
  workloads unaffected. Was previously misframed in research round 1
  as "disk-I/O backpressure"; the 2025-01-22 reproducer in the issue
  pinpoints the mechanism.
- R-8 (degraded-mode specificity): rotation-stalled is now defined
  concretely as 0.log size > containerLogMaxSize for ≥30 s
  (3× kubelet default containerLogMonitorInterval of 10 s, cited at
  source). Surfaced via IncError("rotation_stalled"). Prior text was
  generic "kubelet rotation breakage" with no detection mechanism.

R-1 / R-2 (namespace) withheld: OD-12 effectively resolved by the
upstream proposal at docs/proposals/gen-ai-training-semconv.md
(PR #93 closed the O4-overdue first-draft KPI). No rename needed.

R-3 / R-7 (rotation correctness, gzip handling) deferred: pending
the corresponding integration-test fixtures (TestContainerStdout_*)
landing in the M15 implementation phase per RFC-0010.

R-6 (bbolt cursor) dropped: BA-2 build approach (RFC-0010) keeps the
JSON cursor at /var/lib/tracecore/container_stdout/cursor.json as
originally rubricked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
## Summary

Three build-approach-independent rubric refinements for M15, surfaced by
the research doc (PR #92) and locked in by RFC-0010 (PR #94). Five
additional rubric edits (R-1, R-2, R-3, R-6, R-7) are intentionally
withheld per the rationale below.

## Diff

3 lines changed in `MILESTONES.md`:

- **R-4 (cosmetic):** `max_log_size` citation now references the OTel
`container` stanza operator default. 1 MiB matches; documents the prior
art whether we depend on it (BA-1) or port it (BA-2 per RFC-0010).
- **R-5 (new reliability caveat):** containerd/containerd#11149 silently
drops bytes from `0.log` when an in-container process reads its own
FD 1. Mechanism is shared-pipe contention, not generic backpressure
(round-1 research framed it incorrectly; the 2025-01-22 reproducer in
the issue pinpoints the mechanism). Standard workloads unaffected.
- **R-8 (degraded-mode specificity):** rotation-stalled is now defined
concretely as `0.log` size > `containerLogMaxSize` for ≥30 s (3× kubelet
default `containerLogMonitorInterval` of 10 s, cited at source).
Surfaced via `IncError("rotation_stalled")`. Prior text was generic
"kubelet rotation breakage" with no detection mechanism.

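The R-8 definition (log size above `containerLogMaxSize` for at least 30 s, i.e. 3× the 10 s monitor interval) amounts to a small duration-gated threshold check. The sketch below illustrates that logic only; `RotationStallDetector` and `observe` are hypothetical names, and the real receiver would report a positive result via `IncError("rotation_stalled")` rather than a boolean.

```python
from dataclasses import dataclass
from typing import Optional

MONITOR_INTERVAL_S = 10.0                    # kubelet default cited in R-8
STALL_THRESHOLD_S = 3 * MONITOR_INTERVAL_S   # 30 s


@dataclass
class RotationStallDetector:
    """Flags a stall once the log has stayed oversize for threshold_s seconds."""

    max_size_bytes: int                       # the configured containerLogMaxSize
    threshold_s: float = STALL_THRESHOLD_S
    _oversize_since: Optional[float] = None   # timestamp of first oversize sample

    def observe(self, size_bytes: int, now_s: float) -> bool:
        """Feed one size sample; return True when rotation looks stalled."""
        if size_bytes <= self.max_size_bytes:
            # Rotation happened (or was never needed): reset the timer.
            self._oversize_since = None
            return False
        if self._oversize_since is None:
            self._oversize_since = now_s
        return now_s - self._oversize_since >= self.threshold_s


# Usage: poll the size of 0.log and pass a monotonic timestamp.
detector = RotationStallDetector(max_size_bytes=10 * 1024 * 1024)  # example limit
detector.observe(size_bytes=11 * 1024 * 1024, now_s=0.0)   # oversize, timer starts
detector.observe(size_bytes=11 * 1024 * 1024, now_s=30.0)  # stalled for 30 s
```

Resetting the timer on any in-bounds sample means a log that rotates and then regrows has to stay oversize for a fresh 30 s before it is flagged again.
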
## Why not the other 5 rubric edits

- **R-1 / R-2 (namespace):** OD-12 effectively resolved by the upstream
proposal at `docs/proposals/gen-ai-training-semconv.md` (PR #93 closed
the O4-overdue first-draft KPI). No rename needed.
- **R-3 / R-7 (rotation correctness, gzip handling):** deferred pending
the corresponding `TestContainerStdout_*` integration-test fixtures
landing in the M15 implementation phase per RFC-0010.
- **R-6 (bbolt cursor):** dropped because BA-2 keeps the JSON cursor at
`/var/lib/tracecore/container_stdout/cursor.json` as originally
rubricked.

## Test plan

- [x] `make doc-check` clean (273 markdown links resolve, banned-phrase
lint clean across 67 files, RUNBOOK ↔ alerts pairing clean).
- [x] Containerd #11149 mechanism re-verified at the issue's 2025-01-22
reproducer comment; mechanism is shared-pipe contention.
- [x] `containerLogMonitorInterval` default cited verbatim at
`pkg/kubelet/apis/config/v1beta1/defaults.go`; verified in research
§13.8.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>