
[docs] rfc-0010: containerstdout receiver scope (M15, alpha; design-locked draft)#94

Merged
trilamsr merged 6 commits into main from rfc-0010-m15-containerstdout
May 19, 2026

trilamsr commented May 19, 2026

Summary

Locks M15 design ahead of implementation. Builds on the 2664-line research evidence base at docs/research/m15-container-stdout.md (PR #92) and survived 5 review passes on this branch — see commit log for the trail.

Key decisions (post-review):

  • Build approach BA-2 (port pkg/stanza/fileconsumer); BA-1 adapter rejected: M16's rubric is consistent with either path, and M16 is presumed (not stated) to be tracecore-native, which leaves the adapter without a second user.
  • Attribute namespace gen_ai.training.* per the upstream proposal at docs/proposals/gen-ai-training-semconv.md, with one documented exception for tracecore.training.data_time_s / iter_time_s (M18 input contract is older than the upstream draft).
  • Cursor JSON at /var/lib/tracecore/container_stdout/cursor.json (atomic rename, mode 0600).
  • Pod attribution chain mirrors RFC-0009 for cross-receiver join consistency.
  • MILESTONES.md M15 line 358 amended (tracecore.io/rank → gen_ai.training.io/rank) to canonicalize the cross-RFC label name.

Review history

| Pass | Commit | Top finding |
| --- | --- | --- |
| Self-review | 456f6da | Softened M16 claim; surfaced validate-at-load; 4 deferrals |
| 8 stakeholder lenses | 19bd492 | BLOCKER (namespace carve-out) resolved; 12 CONCERN/NIT applied; 13 deferred |
| 2 adversarial | 63dccbb | Cross-RFC label resolved in-PR; propagation gaps closed; "compile-time pin" corrected |
| 2 A+ aspiration | be3db09 | §Operator-surfaces inlined; §Performance promoted from FOLLOWUPS; SHA-pinned manifest URL |
| 2 simplification | (this commit) | Removed meta-prose, duplicate justifications, self-referential open-question tombstones |

Per-commit message body has the full pushback table.

Test plan

  • make doc-check clean across all 5 commits.
  • grep -rn "tracecore.io/rank\|tracecore.io/job-id" MILESTONES.md docs/rfcs/ returns only the rename-explanation citation in RFC-0010 (no stale uses).
  • HintPodEvicted symbol resolves at components/receivers/k8sevents/hint.go:24.
  • AI-vocab diff gate clean.

🤖 Generated with Claude Code

…ocked draft)

Locks the M15 design ahead of implementation. Builds on the research
evidence base at docs/research/m15-container-stdout.md.

Decisions baked in:
- Build approach BA-2 (port pkg/stanza/fileconsumer). BA-1 adapter
  rejected because M16 is planned tracecore-native (Prometheus
  scraper, single-line factory per MILESTONES.md lines 383-388);
  adapter has no second user.
- Attribute namespace gen_ai.training.* per the upstream proposal at
  docs/proposals/gen-ai-training-semconv.md. No pre-emptive
  tracecore.training.* hedge; collector-side rename is the fallback
  if upstream rejects.
- Cursor at /var/lib/tracecore/container_stdout/cursor.json (JSON +
  atomic rename, not bbolt). Matches MILESTONES.md M15 rubric line
  364; vendoring file_storage rejected.
- Pod attribution chain mirrors RFC-0009 for join-key consistency
  across M13/M14/M15/M18.
- Typed Record schema exported for M18/M19 compile-time joins;
  WorldSize > 0 sentinel pattern (Rank=0 is valid).
- Self-telemetry kinds: KindRotationStalled, KindCursorWriteFailed,
  KindBackpressureDrop, KindCardinality, KindWatch.
- 15 named TestContainerStdout_* identifiers covering rotation,
  shutdown, pod-deletion / eviction / drain, sibling isolation,
  cursor write failure, FD hygiene, JSON parse fallthrough,
  per-key rate-limit, namespace allowlist.
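
The WorldSize > 0 sentinel pattern named above can be illustrated from the consumer side. The sketch below assumes a two-field subset of the typed Record; the field types and the `rankOf` helper are hypothetical and exist only to show why Rank alone cannot signal "unknown".

```go
package main

import "fmt"

// Record is a hypothetical subset of the typed Record schema.
type Record struct {
	Rank      int // 0 is a valid rank, so it cannot double as "unknown"
	WorldSize int // 0 means rank attribution was never discovered
}

// rankOf is a consumer-side helper (an assumption for this example, not a
// receiver API): it returns the rank only when WorldSize marks it as known.
func rankOf(r Record) (int, bool) {
	if r.WorldSize > 0 {
		return r.Rank, true
	}
	return 0, false
}

func main() {
	known := Record{Rank: 0, WorldSize: 8}
	unknown := Record{} // the zero value looks like rank 0 without the sentinel

	if rank, ok := rankOf(known); ok {
		fmt.Println("rank known:", rank)
	}
	if _, ok := rankOf(unknown); !ok {
		fmt.Println("rank not discovered")
	}
}
```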

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added 5 commits May 19, 2026 11:56
…validate-at-load, defer 4 items

Author self-review against PRINCIPLES.md / NORTHSTARS.md / STYLE.md /
MEMORY.md (branch does not match feat/m<X>-* pattern; rubric anchor
is repo-standards only). Edge-case hunt produced 8 candidates; 5
applied inline, 4 deferred to docs/FOLLOWUPS.md, 1 explicitly-
skipped (Rank=0 ambiguity — schema doc already addresses).

Pushback table:

| ID   | Severity | Beneficiary    | Finding                                                                                          | Proof                                                              | Contradict                                                          | Action                                  |
| ---- | -------- | -------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------- | --------------------------------------- |
| P1.1 | CONCERN  | repo-long-term | "M16 planned tracecore-native" is inferred from MILESTONES.md, not stated                        | M16 rubric lines 383-395 specify "Prometheus scrape", not approach | M16 RFC could land BA-1; RFC-0010 amendment convention covers       | applied inline                          |
| P1.2 | CONCERN  | customer-M13   | RFC-0009 uses pod label gen_ai.training.io/rank; M15 rubric uses tracecore.io/rank               | grep MILESTONES.md + RFC-0009                                      | Could be deliberate divergence; unlikely given cross-receiver joins | deferred to FOLLOWUPS.md M15 section    |
| P1.3 | NIT      | repo-long-term | "production-tested via otelcol-k8s" propagated from sub-agent report without direct verification | research §13.3 cites manifest URL, not direct deployment data      | manifest inclusion is reasonable evidence of production load        | applied — added "unverified at the production-deployment level"   |
| E1   | CONCERN  | operator       | NODE_NAME unset → FieldSelector empty → cluster-wide watch → cardinality blowup                  | mock test with NODE_NAME unset                                     | downward-API pattern is standard k8s; near-zero chance to be missed | applied — Validate rejects with node_name_unset error             |
| E2   | CONCERN  | operator       | Operator-overridden regex with Go-RE2-unsupported syntax fails at runtime                        | round-3 lookahead regression already proved this real              | none — already happened once                                        | applied — Validate rejects unsupported Perl regex constructs       |
| E3   | NIT      | repo-long-term | cursor.json size not bounded                                                                     | benchmark threshold test                                           | atomic rename remains atomic at multi-MB                             | deferred to FOLLOWUPS.md M15 section    |
| E5   | CONCERN  | repo-long-term | Upstream-SIG rejection fallback not designed                                                     | follow-up RFC when upstream lands                                  | upstream PR may yet be accepted with the proposed names              | applied — added to Open questions       |
| E7   | NIT      | repo-long-term | Rank=0 vs undiscovered Rank ambiguity in schema                                                  | IsValid() helper                                                   | schema doc already warns "consumers MUST check WorldSize > 0"        | explicitly-skipped — schema warning sufficient (PRINCIPLES §13 operator-first; honor-system OK at receiver-API boundary) |
| E8   | NIT      | repo-long-term | Self-telemetry kind aliasing CI gate is manual ("grep at PR time")                               | go-vet-level lint                                                  | k8sevents + kernelevents have not collided; small N                  | deferred to FOLLOWUPS.md M15 section    |
| P1.5 | NIT      | repo-long-term | /var/lib/tracecore/ subdir namespace governance not formalized                                   | future amendment to RFC-0001                                       | M15 is currently sole tracked cursor consumer                        | deferred to FOLLOWUPS.md M15 section    |

Validation-cycle stats:
- Findings rejected during contradict: 0 (rate-limit-processor framing collapsed but was not raised as a finding; weakened during planning)
- Findings whose hard-proof did not reproduce: 0

TDD discipline: N/A — RFC is prose. Test reproducibility: make doc-check
clean (manual run; same SHA + git clean -fdx would reproduce).

Rubric evolution: (none accepted in this phase)

Discovered constraints: (none load-bearing)

Signed-off-by: Tri Lam <trilamsr@gmail.com>
…IT applied; 13 deferred

8 parallel stakeholder-lens reviews against the design-locked draft.
6/8 lenses returned CONCERNS-REQUIRE-FIX; 1 returned BLOCKER-REQUIRES-RESOLUTION (researcher).

Pushback (severity-prefixed, applied inline unless noted):

BLOCKER:
- P2.R-1 (researcher) — §Attribute namespace's blanket "gen_ai-only, no dual-emit" contradicts MILESTONES.md M15 rubric L360 + M18 rubric L481, both of which compile-time-pin tracecore.training.data_time_s / iter_time_s (M18 via ErrStragglerInputMissing in internal/synthesis/patterns/straggler.go). APPLIED: wire-attribute table carves the two timing attributes as a documented exception until upstream proposal lands, then sibling cross-receiver PR.

CONCERN (multi-lens — escalated):
- P2.multi-2 (sre,adopter,security) — /var/log/containers mount + fsGroup + runAsNonRoot missing per M15 rubric L373. DEFERRED to FOLLOWUPS (§Pod manifest subsection in Phase-2 PR).
- P2.multi-3 (operator,sre,adopter) — KindCardinality covers both fingerprint and attribution LRUs; on-call can't distinguish. APPLIED: split into KindFingerprintCardinality + KindAttributionCardinality; renamed config to `attribution.lru_cap` vs `egress_rate_limit.lru_cap`.
- P2.multi-4 (contributor,maintainer,adopter,researcher) — "makes funded" ungrammatical. APPLIED: replaced with "the drafted upstream proposal already absorbs".
- P2.multi-6 (contributor,adopter) — dataloader_regex shipped as default without marking placeholder per rubric L360 "operators MUST override". APPLIED: inline comment block.
- P2.multi-7 (operator,sre) — RUNBOOK.md + prometheus-alerts.example.yaml + FAILURE-MODES.md rows not named as Phase-4 deliverables. DEFERRED.

CONCERN (single-lens, applied):
- P2.R-2 (researcher) — HintEvicted does not compile; actual constant is k8sevents.HintPodEvicted at hint.go:24. APPLIED: typo fixed + evictionMatchWindow defined as eviction_match_window default 5s in Configuration surface.
- P2.R-4 (researcher) — JobID source-priority chain missing. APPLIED: JOB_ID → TORCHELASTIC_RUN_ID → SLURM_JOB_ID → k8s label fallback (per rubric L358 + research §13.13).
- P2.sre-1 — ClusterRole named tracecore-containerstdout-clusterrole breaks k8sevents convention. APPLIED: renamed to tracecore-containerstdout (no -clusterrole suffix).
- P2.contrib-1 — SchemaURLv0/SchemaURL collapse inverts k8sevents anti-pattern warning. APPLIED: rewrote snippet to match k8sevents/record.go:90-107 line-for-line with the warning comment preserved.
- P2.maint-3 — "§Functional rubric #3" cross-doc reference points outside the RFC. APPLIED: fully-qualified to "MILESTONES.md M15 functional rubric bullet 3".
- P2.sec-4 — cursor file mode unspecified. APPLIED: cursor.file_mode: 0600 + dir_mode: 0700 in Configuration.
- P2.sec-5 — FieldSelector defense-in-depth missing. APPLIED: Validate rejects empty FieldSelector; runtime drops mismatched-nodeName events with KindWatch.
- P2.R-7 — named-capture (?P<name>…) allowance not stated in validate-at-load list. APPLIED.
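
The JobID source-priority chain from P2.R-4 can be sketched as a first-non-empty lookup. This is an illustration under assumptions: the `resolveJobID` name and the map-based inputs are hypothetical, and the label key uses `gen_ai.training.io/job-id`, the name the later adversarial pass settled on.

```go
package main

import "fmt"

// jobIDLabel is the pod-label fallback key (name taken from the later
// cross-RFC label resolution in this PR).
const jobIDLabel = "gen_ai.training.io/job-id"

// jobIDEnvPriority lists the environment variables in priority order.
var jobIDEnvPriority = []string{"JOB_ID", "TORCHELASTIC_RUN_ID", "SLURM_JOB_ID"}

// resolveJobID returns the first non-empty source in the chain. How the
// receiver obtains env and labels is out of scope for this sketch.
func resolveJobID(env, labels map[string]string) (string, bool) {
	for _, key := range jobIDEnvPriority {
		if v := env[key]; v != "" {
			return v, true
		}
	}
	if v := labels[jobIDLabel]; v != "" {
		return v, true
	}
	return "", false
}

func main() {
	env := map[string]string{"SLURM_JOB_ID": "991", "TORCHELASTIC_RUN_ID": "run-7"}
	labels := map[string]string{jobIDLabel: "label-job"}

	id, ok := resolveJobID(env, labels)
	fmt.Println(id, ok)
}
```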

CONCERN (single-lens, deferred to FOLLOWUPS):
- P2.perf-1/2/3/4 — §Performance subsection (alloc budget, channel sizing rationale, regex literal-prefilter, attribution-lookup concurrency). DEFERRED as consolidated FOLLOWUPS entry.
- P2.sec-2 — log-body secret redaction posture. DEFERRED (alpha-stage posture acceptable per sibling-receiver pattern).
- P2.sec-3 — vendor-tree CVE tracking. DEFERRED.
- P2.sre-2 — RBAC opt-out path when rank_source: downward_api. DEFERRED.
- P2.sre-5 — Rollback subsection in §Migration. DEFERRED.
- P2.maint-1/2 overlap with P2.R-2/R-4 (already applied).

NIT (deferred or skipped):
- P2.perf-5 — cursor flush cadence. DEFERRED.
- P2.op-3 — covered by P2.multi-3 (applied).
- P2.op-4 — chart smoke test asserts NODE_NAME. DEFERRED.
- P2.op-5 — M16 reversal recheck. DEFERRED.
- P2.contrib-3 — FOLLOWUPS governance-row falsifying-check. DEFERRED.
- P2.contrib-4 — covered by P2.multi-6.
- P2.maint-5 — forward-pointer to FOLLOWUPS. DEFERRED.
- P2.maint-6 — vendor/ path collision with go mod vendor. DEFERRED.
- P2.R-8 — SHA-pin manifest URL. DEFERRED.

EXPLICITLY-SKIPPED:
- P2.multi-5 / P2.adopt-3 / P2.maint-5 / P2.R-5 (cross-RFC label tracecore.io/rank vs gen_ai.training.io/rank). Phase 1 deferred to FOLLOWUPS; Phase 2 escalation contradicted by: design-locked DRAFT is the appropriate place to defer cross-receiver reconciliation pre-v1.0 (PRINCIPLES §11 backward-compat opt-in pre-v1.0); label fallback only fires when env discovery already failed, in which case rank is "unknown" regardless of label name. Phase 1 defer stands.

Validation-cycle stats:
- Findings rejected during contradict: 1 (P2.multi-5 namespace-rename; defended by pre-v1.0 + fallback-path-only argument)
- Findings whose hard-proof did not reproduce: 0

TDD discipline: N/A — RFC is prose. Reproducibility: make doc-check clean.

Rubric additions accepted (binding):
- Alpha-receiver RFCs MUST contain §Operator-surfaces section (RUNBOOK + alerts + FAILURE-MODES rows).
- RFCs with typed-record schemas MUST include wire-attribute name table cross-checked against MILESTONES.
- RFCs naming sibling-package symbols MUST verify compile-resolve at HEAD.

Discovered constraints:
- Pattern: alpha receivers consistently drop one of {RUNBOOK row, alerts row, FAILURE-MODES row}. Promote rubric.
- Pattern: cross-RFC join-key inconsistencies tend to defer to "whichever PR lands first"; design-locked status should exclude this class.

Per-lens final verdicts:
- Performance:    CONCERNS-REQUIRE-FIX (deferred)
- SRE/Infra:      CONCERNS-REQUIRE-FIX (deferred/applied mix)
- Maintainer:     CONCERNS-REQUIRE-FIX (mostly applied)
- Contributor:    CONCERNS-REQUIRE-FIX (applied)
- Operator/User:  CONCERNS-REQUIRE-FIX (deferred)
- Adopter:        CONCERNS-REQUIRE-FIX (overlapped with SRE/Security)
- Security:       CONCERNS-REQUIRE-FIX (mostly deferred)
- Researcher:     BLOCKER-REQUIRES-RESOLUTION (resolved inline)

Signed-off-by: Tri Lam <trilamsr@gmail.com>
…solve cross-RFC label; sharpen contracts

Two adversarial deep reads against the post-Phase-2 state (commit 19bd492).
Both verdicts: CONCERNS-REQUIRE-FIX. Adversarial-1 hunted symbol drift
across the RFC body; adversarial-2 hunted architectural-fitness +
prose-vs-symbol claims.

Pushback (severity-prefixed, applied unless noted):

CONCERN (multi-adversarial — both reviewers raised; escalated):
- P3.adv-A — ClusterRole name `tracecore-containerstdout-clusterrole` at line 369 (§Coexistence) was not updated when Phase-2 P2.sre-1 fixed line 196 (§RBAC). Both adversaries grepped k8sevents/rbac.yaml; actual sibling artifact uses `tracecore-k8sevents` (no -clusterrole suffix). APPLIED: line 369 fixed; §Self-telemetry kinds enumerated.

CONCERN (adversarial-1 line-by-line):
- P3.adv1-1 — KindCardinality at line 172 is dangling after Phase-2 P2.multi-3 split; §Coexistence at line 369 only listed 2 of (now 6) Kinds. APPLIED: line 172 → KindAttributionCardinality; §Coexistence enumerates all kinds including intentional aliases.
- P3.adv1-3 — Phase-2 EXPLICITLY-SKIPPED cross-RFC label inconsistency (P2.multi-5) failed contradict step. MILESTONES.md line 358 defines the label fallback as NON-unknown; if a customer follows RFC-0009's `gen_ai.training.io/rank` and M15 expects `tracecore.io/rank`, the two receivers' rank attributions diverge for the same pod. APPLIED IN-PR: §Pod attribution + JobID chain updated to `gen_ai.training.io/rank` / `gen_ai.training.io/job-id`; MILESTONES.md M15 line 358 amended in this same PR.
- P3.adv1-7 — Phase-2's KindCardinality split into 2 left `egress_rate_limit.lru_cap` overflow without a named Kind. APPLIED: added `KindRateLimitCardinality` for symmetric coverage of all 3 LRUs.
- P3.adv1-8 — `/var/log/containers` mount + fsGroup + runAsNonRoot + readOnlyRootFilesystem are rubric-binding design contracts (M15 rubric line 373), not implementation detail; deferral to Phase-2 PR was inappropriate. Three lenses (Phase-2 SRE + Adopter + Security) already raised; adversarial-1 escalated. APPLIED: new §Pod manifest subsection enumerates both mounts + full securityContext.

CONCERN (adversarial-2 prose-vs-symbol):
- P3.adv2-2 — "compile-time pin" / "enforced at compile time" phrasing at lines 144, 149 oversells the M18 contract; `internal/synthesis/patterns/straggler.go` does not exist at HEAD; `ErrStragglerInputMissing` does not resolve under `git grep`. APPLIED: replaced with "rubric-binding (with compile-time enforcement to follow once straggler.go ships per M18)" + "(planned per M18 rubric line 481; not yet at HEAD)" markers.

NIT:
- P3.adv1-4 — §16.3 / §16.1 unqualified cross-doc references at line 260 (research doc anchors, not RFC sections). APPLIED: qualified to "[research §16.3 / §16.1](../research/m15-container-stdout.md)".
- P3.adv1-5 — `process_rank_regex` body never described what it matches against. APPLIED: §Pod attribution gains a "process_rank_regex fallback" paragraph describing the log-body regex use case.
- P3.adv1-6 — channel_cap cited in §Lifecycle (matches k8sevents config.go:55-60) but not exposed in Configuration surface. APPLIED: added `channel_cap: 1024` to the YAML example.
- P3.adv2-4 — `TestContainerStdout_SiblingReceiverIsolation` ambiguous (isolation from what? RBAC? cursor? channels?). APPLIED: renamed to `SiblingReceiverCursorAndKindIsolation`.

Findings dropped during contradict (5 total):
- Operator-surfaces deferral defended by precedent (RFC-0007 / 0009 follow the same pattern); dropped.
- M16 reversal recheck — already named as Open Question + amendment path; dropped.
- gen_ai.training.* upstream-SIG rejection — concrete fallback already in Open Questions; dropped.
- fileconsumer breaking-change cost — quantified + budgeted in research §13.3 + Migration; dropped.
- §Migration/rollout deprecation contract "fictional" — precedent in RFC-0005 + RFC-0007 establishes class-level contract; un-tested but not absent; dropped.

EXPLICITLY-SKIPPED:
- P3.adv2-5 (Migration paragraph adds no falsifiable content) — kept; the paragraph provides operator-search-relevant context and removing it costs deploy-time findability for marginal hygiene gain. Defended by: operator-facing doc readers do search by keyword.

Rubric additions accepted (binding):
- Design-locked drafts MUST grep-validate symbol consistency across all sections after every applied fix.
- Cross-RFC join-keys MUST resolve at design-lock, not defer.
- Symbol splits MUST propagate to every section.
- Forward-referenced sibling symbols MUST be marked "(planned per <rubric>; not yet at HEAD)".

Validation-cycle stats:
- Findings rejected during contradict: 5 (dropped above)
- Findings whose hard-proof did not reproduce: 0
- Phase-2 explicitly-skipped reopened by Phase-3 contradict: 1 (P2.multi-5 → P3.adv1-3)

TDD discipline: N/A — RFC is prose. Reproducibility: make doc-check clean; AI-vocab diff-gate clean.

Discovered constraints:
- Pattern: Phase-2's "applied" findings consistently missed sibling-section propagation. Promoted to rubric: every applied fix runs a grep audit before commit.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
… of P2 rubric closed

Two A+ aspiration reviewers; both grade A- with verdict RAISE-TO-A-PLUS.
Both converge on the same gap: §Operator-surfaces deferral violates the
P2-stakeholders rubric I myself accepted (self-violation closure).

A+ criteria applied inline (mechanical doc edits, no design work):

- A+1 (both reviewers) — §Operator surfaces subsection added; names
  RUNBOOK.md + prometheus-alerts.example.yaml + one alert row per
  Kind (≥7 rows post-Phase-3 split). Closes the P2-stakeholders
  self-violation.
- A+2 (both) — §Rollback subsection added with literal
  `rm -rf /var/lib/tracecore/container_stdout/` command and
  forward-compat-across-patch-versions claim.
- A+3 (both) — SHA-pinned manifest.yaml URL to commit
  e8f3db9c2f7ddfb21c5e84c9f295d0a820318e04; hedge replaced with the
  pin.
- A+4 (reviewer 1) — §Performance budget table promoted from FOLLOWUPS
  inline. 4 commitments with falsifying benches
  (BenchmarkContainerStdout_HotPath_AllocPerLine,
  ChannelDepthUnderEviction, DataloaderRegex_PerLine,
  AttributionLookup).
- A+5 (reviewer 2) — §Rubric trace appendix added; maps every M15
  MILESTONES.md rubric bullet (lines 357-375) to the RFC subsection
  that addresses it or to a FOLLOWUPS trigger.
- A+6 (reviewer 1) — Forward-vs-at-HEAD marker legend added before
  §Typed Record schema; standardizes the convention introduced in
  Phase 3.
- A+7 (reviewer 2, cross-doc hygiene) — Stale FOLLOWUPS entries
  cleaned: cross-RFC pod-label entry struck-through with "Resolved
  by RFC-0010" pointer; §Performance, §Pod manifest, §Operator
  surfaces, §Rollback, SHA-pin entries all closed.

Deferred to Phase 5 / Phase-2 implementation:
- A+8 (reviewer 1) — Cross-RFC label-name lint script in doc-check.
  Adds CI gate but not the design change; FOLLOWUPS row would carry
  the trigger.
- A+9 (reviewer 1) — Test identifiers for orphan typed-Record fields
  (DroppedLines, LogTag, RestartCount, LocalRank, JobID). Phase-2
  implementation adds these as the test bodies are written.
- A+10 (reviewer 1) — Kind → IncError-call-site → alert-severity →
  runbook-anchor table. Phase-2 lands the IncError call sites; the
  table becomes lossless then.

Both reviewers explicitly punted PR title/body sync to Phase 5; per
skill spec that is where the sync lands.

Validation-cycle stats:
- Findings rejected during contradict: 0 (all A+ criteria were measurable from the start)
- Findings whose hard-proof did not reproduce: 0

TDD discipline: every applied criterion ships with a falsifying check:
- §Operator-surfaces: `grep -L "RUNBOOK.md\|prometheus-alerts" docs/rfcs/0010-*.md` empty
- §Rollback: `grep -n 'rm -rf /var/lib/tracecore/container_stdout' docs/rfcs/0010-*.md` returns 1
- SHA-pin: `grep -n 'e8f3db9c2f7ddfb21c5e84c9f295d0a820318e04' docs/rfcs/0010-*.md` returns 1
- §Performance: `grep -nE "benchmem|alloc/op|ns/op" docs/rfcs/0010-*.md` returns ≥4
- Cross-RFC label resolution: `grep -rn "tracecore.io/rank\|tracecore.io/job-id" MILESTONES.md docs/rfcs/` returns only intentional resolution citations (1 hit, in the rename-explanation sentence in RFC-0010 itself)

Reproducibility: make doc-check clean; AI-vocab diff-gate clean.

Rubric additions accepted (binding, Phase 4):
- Alpha-receiver RFCs MUST inline §Operator-surfaces or explicitly waive.
- Alpha-receiver RFCs MUST contain §Performance budget when CPU/RSS thresholds are rubric-stated.
- Alpha-receiver RFCs MUST contain §Rubric trace appendix.
- FOLLOWUPS entries resolved by the closing PR MUST be struck-through with a Resolved-by pointer.

Discovered constraints:
- Pattern: self-accepted rubric additions tend to be over-deferred to FOLLOWUPS in the same PR that proposes them. The Phase-4 A+ pass is the natural place to close that gap because the rubric is fresh and the cost is mechanical.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
…ced to 5-phase trail

Two simplification reviewers. Reviewer 1 proposed a ~80-line cut;
reviewer 2 a ~10-line conservative cut. Applied convergent findings plus
reviewer-1 findings whose contradiction failed; skipped the items where
reviewer 2's contradiction held (RFC-0009 parallel conventions,
rubric-trace anchor dependencies).

Removed inline (~25 lines):
- §Forward-referenced symbol markers H3 (8 lines) → collapsed to a one-line
  note at the top of §Typed Record schema. Phase-3 attribution sentence
  dropped (review-loop scar tissue).
- §Open questions OD-9 self-tombstone (announces its own removal) —
  removed outright.
- §Open questions "Upstream-SIG rejection fallback" (5 lines, duplicate
  of line 141 + §Alternatives concede-O4) — removed.
- §Open questions "M16's actual implementation choice" (2 lines, duplicate
  of §Proposal first bullet + §Alternatives BA-1) — removed.
- §Wire-attribute trailing paragraph (3 lines, restates row 6 of the
  table verbatim) — removed.
- §Pod manifest closing meta-paragraph (2 lines, restates that the RFC
  is rubric-binding) — collapsed to "Helm chart lands these in the
  DaemonSet template."
- §Performance budget intro paragraph (3 lines of promotion-history
  meta-prose) — collapsed to one sentence.
- §Rubric trace intro second sentence (1 line of table-self-description)
  — removed.

PR title: unchanged (parallel to RFC-0009 convention; reviewer 2's
contradict held — bikeshedding "(design locked)" → "(decisions locked,
body iterating)" not worth breaking cross-PR parallel for one RFC).

PR body: synced via `gh pr edit 94`. Adds §Review history table with
5 commit pointers; drops the pre-Phase-1 Key-decisions list (now
captured per-commit); keeps the 3-item §Test plan per
`feedback_test_plan_before_pr` MEMORY rule.

Findings that survived reviewer-2 contradiction (kept as-is):
- §Pod manifest + §RBAC + §Operator surfaces are NOT collapsed into one
  §Operational footprint; §Rubric trace cites each by name.
- 5 BAs in §Alternatives kept (each carries distinct rejection reasoning).
- README §"When to read which" not amended (per-receiver instances
  would dilute the curated decision-axis index).

Validation-cycle stats:
- Findings rejected during contradict: 6 (5 from reviewer 2; 1 from
  reviewer 1 — §KindCardinality justifications kept because the design
  rationale is load-bearing)
- Findings whose hard-proof did not reproduce: 0

TDD discipline: every removal verified by re-reading the surrounding
context post-edit; no cross-section reference was broken.

Reproducibility: make doc-check clean; AI-vocab diff-gate clean.

Discovered constraints:
- Pattern: multi-phase review accretes meta-prose (notation conventions,
  promotion history, self-removing open questions). Phase 5 is the
  natural place to strip; doing so is mechanical given the rubric trace
  + commit log already preserve the audit.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr merged commit bee6642 into main May 19, 2026
8 checks passed
trilamsr deleted the rfc-0010-m15-containerstdout branch May 19, 2026 19:55
trilamsr added a commit that referenced this pull request May 19, 2026
Three build-approach-independent rubric refinements from the M15
research evidence base (PR #92) and RFC-0010 (PR #94):

- R-4 (cosmetic): max_log_size citation now references the OTel
  `container` stanza operator default (1 MiB matches; documents the
  prior art whether we depend on it via BA-1 or port it via BA-2).
- R-5 (new reliability caveat): containerd #11149 silently drops
  bytes from 0.log when an in-container process reads its own FD 1.
  Shared-pipe contention; not generic backpressure. Standard
  workloads unaffected. Was previously misframed in research round 1
  as "disk-I/O backpressure"; the 2025-01-22 reproducer in the issue
  pinpoints the mechanism.
- R-8 (degraded-mode specificity): rotation-stalled is now defined
  concretely as 0.log size > containerLogMaxSize for ≥30 s
  (3× kubelet default containerLogMonitorInterval of 10 s, cited at
  source). Surfaced via IncError("rotation_stalled"). Prior text was
  generic "kubelet rotation breakage" with no detection mechanism.

R-1 / R-2 (namespace) withheld: OD-12 effectively resolved by the
upstream proposal at docs/proposals/gen-ai-training-semconv.md
(O4-overdue first-draft KPI closed PR #93). No rename needed.

R-3 / R-7 (rotation correctness, gzip handling) deferred: pending
the corresponding integration-test fixtures (TestContainerStdout_*)
landing in the M15 implementation phase per RFC-0010.

R-6 (bbolt cursor) dropped: BA-2 build approach (RFC-0010) keeps the
JSON cursor at /var/lib/tracecore/container_stdout/cursor.json as
originally rubricked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
## Summary

Three build-approach-independent rubric refinements for M15, surfaced by
the research doc (PR #92) and locked-in by RFC-0010 (PR #94). Five
additional rubric edits (R-1, R-2, R-3, R-6, R-7) are intentionally
withheld per the rationale below.

## Diff

3 lines changed in `MILESTONES.md`:

- **R-4 (cosmetic):** `max_log_size` citation now references the OTel
`container` stanza operator default. 1 MiB matches; documents the prior
art whether we depend on it (BA-1) or port it (BA-2 per RFC-0010).
- **R-5 (new reliability caveat):** containerd issue #11149
(containerd/containerd#11149) silently
drops bytes from `0.log` when an in-container process reads its own FD
1. Mechanism is shared-pipe contention, not generic backpressure
(round-1 research framed it incorrectly; the 2025-01-22 reproducer in
the issue pinpoints the mechanism). Standard workloads unaffected.
- **R-8 (degraded-mode specificity):** rotation-stalled is now defined
concretely as `0.log` size > `containerLogMaxSize` for ≥30 s (3× kubelet
default `containerLogMonitorInterval` of 10 s, cited at source).
Surfaced via `IncError("rotation_stalled")`. Prior text was generic
"kubelet rotation breakage" with no detection mechanism.

## Why not the other 5 rubric edits

- **R-1 / R-2 (namespace):** OD-12 effectively resolved by the upstream
proposal at `docs/proposals/gen-ai-training-semconv.md` (PR #93 closed
the O4-overdue first-draft KPI). No rename needed.
- **R-3 / R-7 (rotation correctness, gzip handling):** deferred pending
the corresponding `TestContainerStdout_*` integration-test fixtures
landing in the M15 implementation phase per RFC-0010.
- **R-6 (bbolt cursor):** dropped because BA-2 keeps the JSON cursor at
`/var/lib/tracecore/container_stdout/cursor.json` as originally
rubricked.

## Test plan

- [x] `make doc-check` clean (273 markdown links resolve, banned-phrase
lint clean across 67 files, RUNBOOK ↔ alerts pairing clean).
- [x] Containerd #11149 mechanism re-verified at the issue's 2025-01-22
reproducer comment; mechanism is shared-pipe contention.
- [x] `containerLogMonitorInterval` default cited verbatim at
`pkg/kubelet/apis/config/v1beta1/defaults.go`; verified in research
§13.8.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…15 RFC-0010) (#104)

## Summary

Resolves the RFC-0010 collision created when PR #94 (M15
containerstdout, merged first at `bee6642`) and PR #96 (M16 kueue,
merged second at `24c3073`) both numbered themselves 0010.

Per [`docs/rfcs/README.md` §
Numbering](../tree/main/docs/rfcs/README.md#numbering): collisions are
resolved by renumbering the *unaccepted* RFC. M15 was accepted first;
M16 renumbers to 0011.

## Changes

- Rename `docs/rfcs/0010-m16-kueue-receiver-scope.md` →
`docs/rfcs/0011-m16-kueue-receiver-scope.md`
- Update file title `# RFC 0010` → `# RFC 0011` + one-line
renumber-history note in the body.
- Add the missing 0011 row to the README Status index (the M16 PR landed
the file but never registered it; fixing as part of the renumber).
- Rewrite 5 cross-references (1 in `m16-kueue.md`, 4 in
`m16-kueue-production-followups.md`) from RFC-0010 → RFC-0011.

M15 RFC-0010 references in `docs/FOLLOWUPS.md` (17 hits) are unchanged —
all correctly point at the containerstdout receiver.

## Test plan

- [x] `make doc-check` clean (link integrity validated across the
rename).
- [x] `grep -rn "0010-m16" docs/` returns empty.
- [x] `grep -rn "0011-m16" docs/` returns the rename target + the README
row + 4 cross-refs in research docs.
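
The two grep checks above could also be scripted so the renumber stays verifiable later. The sketch below is a rough Python equivalent of `grep -rn` with a fixed-string needle; `grep_tree` is a hypothetical helper and is not part of the repository's `make doc-check`.

```python
from pathlib import Path


def grep_tree(root: str, needle: str) -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for every line under `root` containing `needle`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # skip binary files
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line))
    return hits
```

Under this sketch, `grep_tree("docs", "0010-m16")` returning an empty list corresponds to the first check, and the hits from `grep_tree("docs", "0011-m16")` correspond to the second.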

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…-index pointer

Adds one load-bearing lesson and one topic-index pointer derived from
the M15 / RFC-0010 review trail.

Load-bearing lesson:
- Cross-receiver join contracts resolve at design-lock, not at
  implementation-merge. Every `M-X emits Y` MILESTONES.md rubric is
  implicitly a contract with every `M-Z consumes Y` receiver. Before
  opening a receiver-scope RFC, grep for every attribute / label / field
  name introduced; reconcile divergences in-PR, not in a FOLLOWUPS row.
  Anchor: PR #94 (M15 RFC-0010) caught the tracecore.io/rank vs
  gen_ai.training.io/rank divergence with RFC-0009 in Phase-3 review and
  amended MILESTONES.md line 358 in the same PR.
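
  One way to mechanize the grep step in this lesson, as a sketch only: collect
  every namespaced spelling of a join key across a set of documents and flag
  the key when more than one spelling appears. `find_label_variants` and its
  in-memory `docs` mapping are hypothetical; the project may well do this check
  by hand.

  ```python
  import re
  from collections import defaultdict


  def find_label_variants(docs: dict[str, str], suffix: str) -> dict[str, set[str]]:
      """Map each spelling that ends in `suffix` to the documents that use it."""
      # A label is a run of name characters followed by the shared suffix,
      # e.g. "tracecore.io/rank" for the suffix "/rank".
      pattern = re.compile(r"[A-Za-z0-9_.\-]+" + re.escape(suffix))
      variants = defaultdict(set)
      for name, text in docs.items():
          for match in pattern.findall(text):
              variants[match].add(name)
      return dict(variants)
  ```

  More than one key in the returned mapping means the documents disagree on
  the label name, which is the kind of divergence the anchor describes.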

Topic-index pointer:
- `docs/rfcs/README.md` — surfaces the receiver-scope RFC conventions
  (Operator surfaces, Rubric trace, Performance budget, wire-attribute
  table, cross-RFC join-key resolution, components.yaml codegen)
  promoted to that file in the body of this PR. Without the index
  entry, the next agent has to know the file exists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…d + notes/reviews.md (#106)

## Summary

The 5-pass review on RFC-0010 (PR #94) plus the broader M15 session
accumulated 11 binding rubric additions + 3 review-pattern lessons in
session-ephemeral state (gitignored `.claude/pr-review-loop.local.md`)
that would have been lost when the worktree was cleaned. This PR promotes
the load-bearing ones to three durable surfaces:

## Changes

### `docs/rfcs/README.md` additions

- **§ Authoring conventions** — 8 binding rules for receiver-scope RFCs,
each carrying a falsifying grep / test:
  - § Operator surfaces, § Rubric trace, § Performance budget required
    (inline or waived)
  - Wire-attribute name table required for typed-record RFCs
  - Sibling-package symbols compile-resolve at HEAD or carry `(planned per
    …; not yet at HEAD)`
  - Cross-RFC join-keys resolve at design-lock, not defer
  - Grep-validate symbol consistency after every applied review fix
  - FOLLOWUPS entries resolved by closing PR are struck-through with
    "Resolved by RFC-XXXX §Y"

- **§ Component registration** — the `components.yaml` + `make generate`
codegen pattern documented for the next receiver author.

### `docs/notes/reviews.md` additions (matches existing format)

- **Multi-pass review accretes propagation gaps; grep after every
applied fix.** Anchor: RFC-0010 commit `63dccbb`.
- **Rubric self-violation sneaks in when the rubric is fresh.** Anchor:
PR #94 Phase-2 accepted §Operator-surfaces rubric AND deferred it in the
same commit; Phase-4 caught it.
- **Single-maintainer projects collapse multi-stakeholder meeting
processes to solo decision exercises.** Anchor: RFC-0010 §15.6.

### `AGENTS.md` additions

- **New load-bearing lesson:** Cross-receiver join contracts resolve at
design-lock, not at implementation-merge. Every `M-X emits Y` rubric is
implicitly a contract with every `M-Z consumes Y` receiver. Grep
MILESTONES.md before opening any receiver-scope RFC. Anchor: PR #94's
`tracecore.io/rank` vs `gen_ai.training.io/rank` divergence with
RFC-0009.
- **Topic-index pointer to `docs/rfcs/README.md`** — surfaces the
receiver-scope RFC conventions added in this same PR so the next agent
finds them.

## Test plan

- [x] `make doc-check` clean (banned-phrase lint across 69 markdown
files; link integrity verified).
- [x] AI-vocab diff-gate clean (literal commit-tag citations rewritten
to SHA + description form).
- [x] Lesson anchors (commits `63dccbb` and `be3db09`, plus PR #94) are
reachable on main.
- [x] AGENTS.md additions match the established `**Bold lesson title**
body. Anchor: ...` format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>