[docs] rfc-0010: containerstdout receiver scope (M15, alpha; design-locked draft) #94
Merged
…ocked draft)

Locks the M15 design ahead of implementation. Builds on the research evidence base at docs/research/m15-container-stdout.md.

Decisions baked in:
- Build approach BA-2 (port pkg/stanza/fileconsumer). BA-1 adapter rejected because M16 is planned tracecore-native (Prometheus scraper, single-line factory per MILESTONES.md lines 383-388); adapter has no second user.
- Attribute namespace gen_ai.training.* per the upstream proposal at docs/proposals/gen-ai-training-semconv.md. No pre-emptive tracecore.training.* hedge; collector-side rename is the fallback if upstream rejects.
- Cursor at /var/lib/tracecore/container_stdout/cursor.json (JSON + atomic rename, not bbolt). Matches MILESTONES.md M15 rubric line 364; vendoring file_storage rejected.
- Pod attribution chain mirrors RFC-0009 for join-key consistency across M13/M14/M15/M18.
- Typed Record schema exported for M18/M19 compile-time joins; WorldSize > 0 sentinel pattern (Rank=0 is valid).
- Self-telemetry kinds: KindRotationStalled, KindCursorWriteFailed, KindBackpressureDrop, KindCardinality, KindWatch.
- 15 named TestContainerStdout_* identifiers covering rotation, shutdown, pod-deletion / eviction / drain, sibling isolation, cursor write failure, FD hygiene, JSON parse fallthrough, per-key rate-limit, namespace allowlist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
…validate-at-load, defer 4 items
Author self-review against PRINCIPLES.md / NORTHSTARS.md / STYLE.md /
MEMORY.md (branch does not match feat/m<X>-* pattern; rubric anchor
is repo-standards only). Edge-case hunt produced 8 candidates; 5
applied inline, 4 deferred to docs/FOLLOWUPS.md, 1 explicitly-
skipped (Rank=0 ambiguity — schema doc already addresses).
Pushback table:
| ID | Severity | Beneficiary | Finding | Proof | Contradict | Action |
| ---- | -------- | -------------- | ------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------- | --------------------------------------- |
| P1.1 | CONCERN | repo-long-term | "M16 planned tracecore-native" is inferred from MILESTONES.md, not stated | M16 rubric lines 383-395 specify "Prometheus scrape", not approach | M16 RFC could land BA-1; RFC-0010 amendment convention covers | applied inline |
| P1.2 | CONCERN | customer-M13 | RFC-0009 uses pod label gen_ai.training.io/rank; M15 rubric uses tracecore.io/rank | grep MILESTONES.md + RFC-0009 | Could be deliberate divergence; unlikely given cross-receiver joins | deferred to FOLLOWUPS.md M15 section |
| P1.3 | NIT | repo-long-term | "production-tested via otelcol-k8s" propagated from sub-agent report without direct verification | research §13.3 cites manifest URL, not direct deployment data | manifest inclusion is reasonable evidence of production load | applied — added "unverified at the production-deployment level" |
| E1 | CONCERN | operator | NODE_NAME unset → FieldSelector empty → cluster-wide watch → cardinality blowup | mock test with NODE_NAME unset | downward-API pattern is standard k8s; near-zero chance to be missed | applied — Validate rejects with node_name_unset error |
| E2 | CONCERN | operator | Operator-overridden regex with Go-RE2-unsupported syntax fails at runtime | round-3 lookahead regression already proved this real | none — already happened once | applied — Validate rejects unsupported Perl regex constructs |
| E3 | NIT | repo-long-term | cursor.json size not bounded | benchmark threshold test | atomic rename remains atomic at multi-MB | deferred to FOLLOWUPS.md M15 section |
| E5 | CONCERN | repo-long-term | Upstream-SIG rejection fallback not designed | follow-up RFC when upstream lands | upstream PR may yet be accepted with the proposed names | applied — added to Open questions |
| E7 | NIT | repo-long-term | Rank=0 vs undiscovered Rank ambiguity in schema | IsValid() helper | schema doc already warns "consumers MUST check WorldSize > 0" | explicitly-skipped — schema warning sufficient (PRINCIPLES §13 operator-first; honor-system OK at receiver-API boundary) |
| E8 | NIT | repo-long-term | Self-telemetry kind aliasing CI gate is manual ("grep at PR time") | go-vet-level lint | k8sevents + kernelevents have not collided; small N | deferred to FOLLOWUPS.md M15 section |
| P1.5 | NIT | repo-long-term | /var/lib/tracecore/ subdir namespace governance not formalized | future amendment to RFC-0001 | M15 is currently sole tracked cursor consumer | deferred to FOLLOWUPS.md M15 section |
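Row E7 relies on consumers checking `WorldSize > 0` because `Rank=0` is a valid rank. A minimal consumer-side sketch of that check follows; the cut-down `Record` struct and the `rankKnown` helper are hypothetical and are not part of the exported schema described in the RFC.

```go
package main

import "fmt"

// Record is a hypothetical cut-down shape; only Rank and WorldSize are named in this thread.
type Record struct {
	Rank      int
	WorldSize int
}

// rankKnown reports whether Rank was actually discovered.
// Rank 0 is a legitimate value, so WorldSize > 0 serves as the sentinel instead.
func rankKnown(r Record) bool {
	return r.WorldSize > 0
}

func main() {
	fmt.Println(rankKnown(Record{Rank: 0, WorldSize: 8})) // rank 0 of an 8-process job
	fmt.Println(rankKnown(Record{}))                      // nothing discovered
}
```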
Validation-cycle stats:
- Findings rejected during contradict: 0 (rate-limit-processor framing collapsed but was not raised as a finding; weakened during planning)
- Findings whose hard-proof did not reproduce: 0
TDD discipline: N/A — RFC is prose. Test reproducibility: make doc-check
clean (manual run; same SHA + git clean -fdx would reproduce).
Rubric evolution: (none accepted in this phase)
Discovered constraints: (none load-bearing)
Signed-off-by: Tri Lam <trilamsr@gmail.com>
…IT applied; 13 deferred

8 parallel stakeholder-lens reviews against the design-locked draft. 6/8 lenses returned CONCERNS-REQUIRE-FIX; 1 returned BLOCKER-REQUIRES-RESOLUTION (researcher).

Pushback (severity-prefixed, applied inline unless noted):

BLOCKER:
- P2.R-1 (researcher): §Attribute namespace's blanket "gen_ai-only, no dual-emit" contradicts MILESTONES.md M15 rubric L360 + M18 rubric L481, both of which compile-time-pin tracecore.training.data_time_s / iter_time_s (M18 via ErrStragglerInputMissing in internal/synthesis/patterns/straggler.go). APPLIED: wire-attribute table carves the two timing attributes as a documented exception until upstream proposal lands, then sibling cross-receiver PR.

CONCERN (multi-lens, escalated):
- P2.multi-2 (sre,adopter,security): /var/log/containers mount + fsGroup + runAsNonRoot missing per M15 rubric L373. DEFERRED to FOLLOWUPS (§Pod manifest subsection in Phase-2 PR).
- P2.multi-3 (operator,sre,adopter): KindCardinality covers both fingerprint and attribution LRUs; on-call can't distinguish. APPLIED: split into KindFingerprintCardinality + KindAttributionCardinality; renamed config to `attribution.lru_cap` vs `egress_rate_limit.lru_cap`.
- P2.multi-4 (contributor,maintainer,adopter,researcher): "makes funded" ungrammatical. APPLIED: replaced with "the drafted upstream proposal already absorbs".
- P2.multi-6 (contributor,adopter): dataloader_regex shipped as default without marking placeholder per rubric L360 "operators MUST override". APPLIED: inline comment block.
- P2.multi-7 (operator,sre): RUNBOOK.md + prometheus-alerts.example.yaml + FAILURE-MODES.md rows not named as Phase-4 deliverables. DEFERRED.

CONCERN (single-lens, applied):
- P2.R-2 (researcher): HintEvicted does not compile; actual constant is k8sevents.HintPodEvicted at hint.go:24. APPLIED: typo fixed + evictionMatchWindow defined as eviction_match_window default 5s in Configuration surface.
- P2.R-4 (researcher): JobID source-priority chain missing. APPLIED: JOB_ID → TORCHELASTIC_RUN_ID → SLURM_JOB_ID → k8s label fallback (per rubric L358 + research §13.13).
- P2.sre-1: ClusterRole named tracecore-containerstdout-clusterrole breaks k8sevents convention. APPLIED: renamed to tracecore-containerstdout (no -clusterrole suffix).
- P2.contrib-1: SchemaURLv0/SchemaURL collapse inverts k8sevents anti-pattern warning. APPLIED: rewrote snippet to match k8sevents/record.go:90-107 line-for-line with the warning comment preserved.
- P2.maint-3: "§Functional rubric #3" cross-doc reference points outside the RFC. APPLIED: fully-qualified to "MILESTONES.md M15 functional rubric bullet 3".
- P2.sec-4: cursor file mode unspecified. APPLIED: cursor.file_mode: 0600 + dir_mode: 0700 in Configuration.
- P2.sec-5: FieldSelector defense-in-depth missing. APPLIED: Validate rejects empty FieldSelector; runtime drops mismatched-nodeName events with KindWatch.
- P2.R-7: named-capture (?P<name>…) allowance not stated in validate-at-load list. APPLIED.

CONCERN (single-lens, deferred to FOLLOWUPS):
- P2.perf-1/2/3/4: §Performance subsection (alloc budget, channel sizing rationale, regex literal-prefilter, attribution-lookup concurrency). DEFERRED as consolidated FOLLOWUPS entry.
- P2.sec-2: log-body secret redaction posture. DEFERRED (alpha-stage posture acceptable per sibling-receiver pattern).
- P2.sec-3: vendor-tree CVE tracking. DEFERRED.
- P2.sre-2: RBAC opt-out path when rank_source: downward_api. DEFERRED.
- P2.sre-5: Rollback subsection in §Migration. DEFERRED.
- P2.maint-1/2 overlap with P2.R-2/R-4 (already applied).

NIT (deferred or skipped):
- P2.perf-5: cursor flush cadence. DEFERRED.
- P2.op-3: covered by P2.multi-3 (applied).
- P2.op-4: chart smoke test asserts NODE_NAME. DEFERRED.
- P2.op-5: M16 reversal recheck. DEFERRED.
- P2.contrib-3: FOLLOWUPS governance-row falsifying-check. DEFERRED.
- P2.contrib-4: covered by P2.multi-6.
- P2.maint-5: forward-pointer to FOLLOWUPS. DEFERRED.
- P2.maint-6: vendor/ path collision with go mod vendor. DEFERRED.
- P2.R-8: SHA-pin manifest URL. DEFERRED.

EXPLICITLY-SKIPPED:
- P2.multi-5 / P2.adopt-3 / P2.maint-5 / P2.R-5 (cross-RFC label tracecore.io/rank vs gen_ai.training.io/rank). Phase 1 deferred to FOLLOWUPS; Phase 2 escalation contradicted by: design-locked DRAFT is the appropriate place to defer cross-receiver reconciliation pre-v1.0 (PRINCIPLES §11 backward-compat opt-in pre-v1.0); label fallback only fires when env discovery already failed, in which case rank is "unknown" regardless of label name. Phase 1 defer stands.

Validation-cycle stats:
- Findings rejected during contradict: 1 (P2.multi-5 namespace-rename; defended by pre-v1.0 + fallback-path-only argument)
- Findings whose hard-proof did not reproduce: 0

TDD discipline: N/A; RFC is prose. Reproducibility: make doc-check clean.

Rubric additions accepted (binding):
- Alpha-receiver RFCs MUST contain §Operator-surfaces section (RUNBOOK + alerts + FAILURE-MODES rows).
- RFCs with typed-record schemas MUST include wire-attribute name table cross-checked against MILESTONES.
- RFCs naming sibling-package symbols MUST verify compile-resolve at HEAD.

Discovered constraints:
- Pattern: alpha receivers consistently drop one of {RUNBOOK row, alerts row, FAILURE-MODES row}. Promote rubric.
- Pattern: cross-RFC join-key inconsistencies tend to defer to "whichever PR lands first"; design-locked status should exclude this class.

Per-lens final verdicts:
- Performance: CONCERNS-REQUIRE-FIX (deferred)
- SRE/Infra: CONCERNS-REQUIRE-FIX (deferred/applied mix)
- Maintainer: CONCERNS-REQUIRE-FIX (mostly applied)
- Contributor: CONCERNS-REQUIRE-FIX (applied)
- Operator/User: CONCERNS-REQUIRE-FIX (deferred)
- Adopter: CONCERNS-REQUIRE-FIX (overlapped with SRE/Security)
- Security: CONCERNS-REQUIRE-FIX (mostly deferred)
- Researcher: BLOCKER-REQUIRES-RESOLUTION (resolved inline)

Signed-off-by: Tri Lam <trilamsr@gmail.com>
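The JobID source-priority chain applied under P2.R-4 (JOB_ID → TORCHELASTIC_RUN_ID → SLURM_JOB_ID → k8s label fallback) can be sketched as a first-non-empty lookup. The `resolveJobID` function and the way environment variables and labels are passed in as maps are assumptions for illustration; how the receiver actually discovers them is not described in this thread. The label key is the `gen_ai.training.io/job-id` name that the Phase-3 commit in this PR settles on.

```go
package main

import "fmt"

// jobIDLabel is the pod label used as the final fallback (name per the Phase-3 rename).
const jobIDLabel = "gen_ai.training.io/job-id"

// jobIDEnvPriority lists environment variables in descending priority.
var jobIDEnvPriority = []string{"JOB_ID", "TORCHELASTIC_RUN_ID", "SLURM_JOB_ID"}

// resolveJobID returns the first non-empty source in priority order.
// env and labels are hypothetical inputs: the container's environment and the pod's labels.
func resolveJobID(env, labels map[string]string) (string, bool) {
	for _, key := range jobIDEnvPriority {
		if v := env[key]; v != "" {
			return v, true
		}
	}
	if v := labels[jobIDLabel]; v != "" {
		return v, true
	}
	return "", false
}

func main() {
	env := map[string]string{"SLURM_JOB_ID": "4711", "TORCHELASTIC_RUN_ID": "run-42"}
	labels := map[string]string{jobIDLabel: "label-job"}
	id, ok := resolveJobID(env, labels)
	fmt.Println(id, ok)
}
```

Keeping the priority order in a single slice means a later reordering is a one-line change that a test can pin.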
…solve cross-RFC label; sharpen contracts

Two adversarial deep reads against the post-Phase-2 state (commit 19bd492). Both verdicts: CONCERNS-REQUIRE-FIX. Adversarial-1 hunted symbol drift across the RFC body; adversarial-2 hunted architectural-fitness + prose-vs-symbol claims.

Pushback (severity-prefixed, applied unless noted):

CONCERN (multi-adversarial; both reviewers raised; escalated):
- P3.adv-A: ClusterRole name `tracecore-containerstdout-clusterrole` at line 369 (§Coexistence) was not updated when Phase-2 P2.sre-1 fixed line 196 (§RBAC). Both adversaries grepped k8sevents/rbac.yaml; actual sibling artifact uses `tracecore-k8sevents` (no -clusterrole suffix). APPLIED: line 369 fixed; §Self-telemetry kinds enumerated.

CONCERN (adversarial-1 line-by-line):
- P3.adv1-1: KindCardinality at line 172 is dangling after Phase-2 P2.multi-3 split; §Coexistence at line 369 only listed 2 of (now 6) Kinds. APPLIED: line 172 → KindAttributionCardinality; §Coexistence enumerates all kinds including intentional aliases.
- P3.adv1-3: Phase-2 EXPLICITLY-SKIPPED cross-RFC label inconsistency (P2.multi-5) failed contradict step. MILESTONES.md line 358 defines the label fallback as NON-unknown; if a customer follows RFC-0009's `gen_ai.training.io/rank` and M15 expects `tracecore.io/rank`, the two receivers' rank attributions diverge for the same pod. APPLIED IN-PR: §Pod attribution + JobID chain updated to `gen_ai.training.io/rank` / `gen_ai.training.io/job-id`; MILESTONES.md M15 line 358 amended in this same PR.
- P3.adv1-7: Phase-2's KindCardinality split into 2 left `egress_rate_limit.lru_cap` overflow without a named Kind. APPLIED: added `KindRateLimitCardinality` for symmetric coverage of all 3 LRUs.
- P3.adv1-8: `/var/log/containers` mount + fsGroup + runAsNonRoot + readOnlyRootFilesystem are rubric-binding design contracts (M15 rubric line 373), not implementation detail; deferral to Phase-2 PR was inappropriate. Three lenses (Phase-2 SRE + Adopter + Security) already raised; adversarial-1 escalated. APPLIED: new §Pod manifest subsection enumerates both mounts + full securityContext.

CONCERN (adversarial-2 prose-vs-symbol):
- P3.adv2-2: "compile-time pin" / "enforced at compile time" phrasing at lines 144, 149 oversells the M18 contract; `internal/synthesis/patterns/straggler.go` does not exist at HEAD; `ErrStragglerInputMissing` does not resolve under `git grep`. APPLIED: replaced with "rubric-binding (with compile-time enforcement to follow once straggler.go ships per M18)" + "(planned per M18 rubric line 481; not yet at HEAD)" markers.

NIT:
- P3.adv1-4: §16.3 / §16.1 unqualified cross-doc references at line 260 (research doc anchors, not RFC sections). APPLIED: qualified to "[research §16.3 / §16.1](../research/m15-container-stdout.md)".
- P3.adv1-5: `process_rank_regex` body never described what it matches against. APPLIED: §Pod attribution gains a "process_rank_regex fallback" paragraph describing the log-body regex use case.
- P3.adv1-6: channel_cap cited in §Lifecycle (matches k8sevents config.go:55-60) but not exposed in Configuration surface. APPLIED: added `channel_cap: 1024` to the YAML example.
- P3.adv2-4: `TestContainerStdout_SiblingReceiverIsolation` ambiguous (isolation from what? RBAC? cursor? channels?). APPLIED: renamed to `SiblingReceiverCursorAndKindIsolation`.

Findings dropped during contradict (5 total):
- Operator-surfaces deferral defended by precedent (RFC-0007 / 0009 follow the same pattern); dropped.
- M16 reversal recheck: already named as Open Question + amendment path; dropped.
- gen_ai.training.* upstream-SIG rejection: concrete fallback already in Open Questions; dropped.
- fileconsumer breaking-change cost: quantified + budgeted in research §13.3 + Migration; dropped.
- §Migration/rollout deprecation contract "fictional": precedent in RFC-0005 + RFC-0007 establishes class-level contract; un-tested but not absent; dropped.

EXPLICITLY-SKIPPED:
- P3.adv2-5 (Migration paragraph adds no falsifiable content): kept; the paragraph provides operator-search-relevant context and removing it costs deploy-time findability for marginal hygiene gain. Defended by: operator-facing doc readers do search by keyword.

Rubric additions accepted (binding):
- Design-locked drafts MUST grep-validate symbol consistency across all sections after every applied fix.
- Cross-RFC join-keys MUST resolve at design-lock, not defer.
- Symbol splits MUST propagate to every section.
- Forward-referenced sibling symbols MUST be marked "(planned per <rubric>; not yet at HEAD)".

Validation-cycle stats:
- Findings rejected during contradict: 5 (dropped above)
- Findings whose hard-proof did not reproduce: 0
- Phase-2 explicitly-skipped reopened by Phase-3 contradict: 1 (P2.multi-5 → P3.adv1-3)

TDD discipline: N/A; RFC is prose. Reproducibility: make doc-check clean; AI-vocab diff-gate clean.

Discovered constraints:
- Pattern: Phase-2's "applied" findings consistently missed sibling-section propagation. Promoted to rubric: every applied fix runs a grep audit before commit.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
… of P2 rubric closed

Two A+ aspiration reviewers; both grade A- with verdict RAISE-TO-A-PLUS. Both converge on the same gap: §Operator-surfaces deferral violates the P2-stakeholders rubric I myself accepted (self-violation closure).

A+ criteria applied inline (mechanical doc edits, no design work):
- A+1 (both reviewers): §Operator surfaces subsection added; names RUNBOOK.md + prometheus-alerts.example.yaml + one alert row per Kind (≥7 rows post-Phase-3 split). Closes the P2-stakeholders self-violation.
- A+2 (both): §Rollback subsection added with literal `rm -rf /var/lib/tracecore/container_stdout/` command and forward-compat-across-patch-versions claim.
- A+3 (both): SHA-pinned manifest.yaml URL to commit e8f3db9c2f7ddfb21c5e84c9f295d0a820318e04; hedge replaced with the pin.
- A+4 (reviewer 1): §Performance budget table promoted from FOLLOWUPS inline. 4 commitments with falsifying benches (BenchmarkContainerStdout_HotPath_AllocPerLine, ChannelDepthUnderEviction, DataloaderRegex_PerLine, AttributionLookup).
- A+5 (reviewer 2): §Rubric trace appendix added; maps every M15 MILESTONES.md rubric bullet (lines 357-375) to the RFC subsection that addresses it or to a FOLLOWUPS trigger.
- A+6 (reviewer 1): Forward-vs-at-HEAD marker legend added before §Typed Record schema; standardizes the convention introduced in Phase 3.
- A+7 (reviewer 2, cross-doc hygiene): Stale FOLLOWUPS entries cleaned: cross-RFC pod-label entry struck-through with "Resolved by RFC-0010" pointer; §Performance, §Pod manifest, §Operator surfaces, §Rollback, SHA-pin entries all closed.

Deferred to Phase 5 / Phase-2 implementation:
- A+8 (reviewer 1): Cross-RFC label-name lint script in doc-check. Adds CI gate but not the design change; FOLLOWUPS row would carry the trigger.
- A+9 (reviewer 1): Test identifiers for orphan typed-Record fields (DroppedLines, LogTag, RestartCount, LocalRank, JobID). Phase-2 implementation adds these as the test bodies are written.
- A+10 (reviewer 1): Kind → IncError-call-site → alert-severity → runbook-anchor table. Phase-2 lands the IncError call sites; the table becomes lossless then.

Both reviewers explicitly punted PR title/body sync to Phase 5; per skill spec that is where the sync lands.

Validation-cycle stats:
- Findings rejected during contradict: 0 (all A+ criteria were measurable from the start)
- Findings whose hard-proof did not reproduce: 0

TDD discipline: every applied criterion ships with a falsifying check:
- §Operator-surfaces: `grep -L "RUNBOOK.md\|prometheus-alerts" docs/rfcs/0010-*.md` empty
- §Rollback: `grep -n 'rm -rf /var/lib/tracecore/container_stdout' docs/rfcs/0010-*.md` returns 1
- SHA-pin: `grep -n 'e8f3db9c2f7ddfb21c5e84c9f295d0a820318e04' docs/rfcs/0010-*.md` returns 1
- §Performance: `grep -nE "benchmem|alloc/op|ns/op" docs/rfcs/0010-*.md` returns ≥4
- Cross-RFC label resolution: `grep -rn "tracecore.io/rank\|tracecore.io/job-id" MILESTONES.md docs/rfcs/` returns only intentional resolution citations (1 hit, in the rename-explanation sentence in RFC-0010 itself)

Reproducibility: make doc-check clean; AI-vocab diff-gate clean.

Rubric additions accepted (binding, Phase 4):
- Alpha-receiver RFCs MUST inline §Operator-surfaces or explicitly waive.
- Alpha-receiver RFCs MUST contain §Performance budget when CPU/RSS thresholds are rubric-stated.
- Alpha-receiver RFCs MUST contain §Rubric trace appendix.
- FOLLOWUPS entries resolved by the closing PR MUST be struck-through with a Resolved-by pointer.

Discovered constraints:
- Pattern: self-accepted rubric additions tend to be over-deferred to FOLLOWUPS in the same PR that proposes them. The Phase-4 A+ pass is the natural place to close that gap because the rubric is fresh and the cost is mechanical.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
…ced to 5-phase trail

Two simplification reviewers. Reviewer 1 ~80-line cut candidate; reviewer 2 ~10-line conservative cut. Applied convergent findings + reviewer-1 finds whose contradiction failed; skipped the reviewer-2-contradicted-held items (RFC-0009 parallel conventions, rubric-trace anchor dependencies).

Removed inline (~25 lines):
- §Forward-referenced symbol markers H3 (8 lines) → collapsed to a one-line note at the top of §Typed Record schema. Phase-3 attribution sentence dropped (review-loop scar tissue).
- §Open questions OD-9 self-tombstone (announces its own removal): removed outright.
- §Open questions "Upstream-SIG rejection fallback" (5 lines, duplicate of line 141 + §Alternatives concede-O4): removed.
- §Open questions "M16's actual implementation choice" (2 lines, duplicate of §Proposal first bullet + §Alternatives BA-1): removed.
- §Wire-attribute trailing paragraph (3 lines, restates row 6 of the table verbatim): removed.
- §Pod manifest closing meta-paragraph (2 lines, restates that the RFC is rubric-binding): collapsed to "Helm chart lands these in the DaemonSet template."
- §Performance budget intro paragraph (3 lines of promotion-history meta-prose): collapsed to one sentence.
- §Rubric trace intro second sentence (1 line of table-self-description): removed.

PR title: unchanged (parallel to RFC-0009 convention; reviewer 2's contradict held; bikeshedding "(design locked)" → "(decisions locked, body iterating)" not worth breaking cross-PR parallel for one RFC).

PR body: synced via `gh pr edit 94`. Adds §Review history table with 5 commit pointers; drops the pre-Phase-1 Key-decisions list (now captured per-commit); keeps the 3-item §Test plan per `feedback_test_plan_before_pr` MEMORY rule.

Findings that survived reviewer-2 contradiction (kept as-is):
- §Pod manifest + §RBAC + §Operator surfaces are NOT collapsed into one §Operational footprint; §Rubric trace cites each by name.
- 5 BAs in §Alternatives kept (each carries distinct rejection reasoning).
- README §"When to read which" not amended (per-receiver instances diluting the curated decision-axis index).

Validation-cycle stats:
- Findings rejected during contradict: 6 (5 from reviewer 2; 1 from reviewer 1: §KindCardinality justifications kept because the design rationale is load-bearing)
- Findings whose hard-proof did not reproduce: 0

TDD discipline: every removal verified by re-reading the surrounding context post-edit; no cross-section reference was broken. Reproducibility: make doc-check clean; AI-vocab diff-gate clean.

Discovered constraints:
- Pattern: multi-phase review accretes meta-prose (notation conventions, promotion history, self-removing open questions). Phase 5 is the natural place to strip; doing so is mechanical given the rubric trace + commit log already preserve the audit.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request on May 19, 2026
Three build-approach-independent rubric refinements from the M15 research evidence base (PR #92) and RFC-0010 (PR #94):
- R-4 (cosmetic): max_log_size citation now references the OTel `container` stanza operator default (1 MiB matches; documents the prior art whether we depend on it via BA-1 or port it via BA-2).
- R-5 (new reliability caveat): containerd #11149 silently drops bytes from 0.log when an in-container process reads its own FD 1. Shared-pipe contention; not generic backpressure. Standard workloads unaffected. Was previously misframed in research round 1 as "disk-I/O backpressure"; the 2025-01-22 reproducer in the issue pinpoints the mechanism.
- R-8 (degraded-mode specificity): rotation-stalled is now defined concretely as 0.log size > containerLogMaxSize for ≥30 s (3× kubelet default containerLogMonitorInterval of 10 s, cited at source). Surfaced via IncError("rotation_stalled"). Prior text was generic "kubelet rotation breakage" with no detection mechanism.

R-1 / R-2 (namespace) withheld: OD-12 effectively resolved by the upstream proposal at docs/proposals/gen-ai-training-semconv.md (O4-overdue first-draft KPI closed PR #93). No rename needed.

R-3 / R-7 (rotation correctness, gzip handling) deferred: pending the corresponding integration-test fixtures (TestContainerStdout_*) landing in the M15 implementation phase per RFC-0010.

R-6 (bbolt cursor) dropped: BA-2 build approach (RFC-0010) keeps the JSON cursor at /var/lib/tracecore/container_stdout/cursor.json as originally rubricked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request on May 19, 2026
## Summary

Three build-approach-independent rubric refinements for M15, surfaced by the research doc (PR #92) and locked-in by RFC-0010 (PR #94). Five additional rubric edits (R-1, R-2, R-3, R-6, R-7) are intentionally withheld per the rationale below.

## Diff

3 lines changed in `MILESTONES.md`:

- **R-4 (cosmetic):** `max_log_size` citation now references the OTel `container` stanza operator default. 1 MiB matches; documents the prior art whether we depend on it (BA-1) or port it (BA-2 per RFC-0010).
- **R-5 (new reliability caveat):** containerd [#11149](containerd/containerd#11149) silently drops bytes from `0.log` when an in-container process reads its own FD 1. Mechanism is shared-pipe contention, not generic backpressure (round-1 research framed it incorrectly; the 2025-01-22 reproducer in the issue pinpoints the mechanism). Standard workloads unaffected.
- **R-8 (degraded-mode specificity):** rotation-stalled is now defined concretely as `0.log` size > `containerLogMaxSize` for ≥30 s (3× kubelet default `containerLogMonitorInterval` of 10 s, cited at source). Surfaced via `IncError("rotation_stalled")`. Prior text was generic "kubelet rotation breakage" with no detection mechanism.

## Why not the other 5 rubric edits

- **R-1 / R-2 (namespace):** OD-12 effectively resolved by the upstream proposal at `docs/proposals/gen-ai-training-semconv.md` (PR #93 closed the O4-overdue first-draft KPI). No rename needed.
- **R-3 / R-7 (rotation correctness, gzip handling):** deferred pending the corresponding `TestContainerStdout_*` integration-test fixtures landing in the M15 implementation phase per RFC-0010.
- **R-6 (bbolt cursor):** dropped because BA-2 keeps the JSON cursor at `/var/lib/tracecore/container_stdout/cursor.json` as originally rubricked.

## Test plan

- [x] `make doc-check` clean (273 markdown links resolve, banned-phrase lint clean across 67 files, RUNBOOK ↔ alerts pairing clean).
- [x] Containerd #11149 mechanism re-verified at the issue's 2025-01-22 reproducer comment; mechanism is shared-pipe contention.
- [x] `containerLogMonitorInterval` default cited verbatim at `pkg/kubelet/apis/config/v1beta1/defaults.go`; verified in research §13.8.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3 tasks
trilamsr added a commit that referenced this pull request on May 19, 2026
…15 RFC-0010) (#104)

## Summary

Resolves the RFC-0010 collision created when PR #94 (M15 containerstdout, merged first at `bee6642`) and PR #96 (M16 kueue, merged second at `24c3073`) both numbered themselves 0010. Per [`docs/rfcs/README.md` § Numbering](../tree/main/docs/rfcs/README.md#numbering): collisions are resolved by renumbering the *unaccepted* RFC. M15 was accepted first; M16 renumbers to 0011.

## Changes

- Rename `docs/rfcs/0010-m16-kueue-receiver-scope.md` → `docs/rfcs/0011-m16-kueue-receiver-scope.md`
- Update file title `# RFC 0010` → `# RFC 0011` + one-line renumber-history note in the body.
- Add the missing 0011 row to the README Status index (the M16 PR landed the file but never registered it; fixing as part of the renumber).
- Rewrite 5 cross-references (1 in `m16-kueue.md`, 4 in `m16-kueue-production-followups.md`) from RFC-0010 → RFC-0011.

M15 RFC-0010 references in `docs/FOLLOWUPS.md` (17 hits) are unchanged; all correctly point at the containerstdout receiver.

## Test plan

- [x] `make doc-check` clean (link integrity validated across the rename).
- [x] `grep -rn "0010-m16" docs/` returns empty.
- [x] `grep -rn "0011-m16" docs/` returns the rename target + the README row + 4 cross-refs in research docs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merged
4 tasks
trilamsr added a commit that referenced this pull request on May 19, 2026
…-index pointer

Adds one load-bearing lesson and one topic-index pointer derived from the M15 / RFC-0010 review trail.

Load-bearing lesson:
- Cross-receiver join contracts resolve at design-lock, not at implementation-merge. Every `M-X emits Y` MILESTONES.md rubric is implicitly a contract with every `M-Z consumes Y` receiver. Before opening a receiver-scope RFC, grep for every attribute / label / field name introduced; reconcile divergences in-PR, not in a FOLLOWUPS row. Anchor: PR #94 (M15 RFC-0010) caught the tracecore.io/rank vs gen_ai.training.io/rank divergence with RFC-0009 in Phase-3 review and amended MILESTONES.md line 358 in the same PR.

Topic-index pointer:
- `docs/rfcs/README.md`: surfaces the receiver-scope RFC conventions (Operator surfaces, Rubric trace, Performance budget, wire-attribute table, cross-RFC join-key resolution, components.yaml codegen) promoted to that file in the body of this PR. Without the index entry, the next agent has to know the file exists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request on May 19, 2026
…d + notes/reviews.md (#106)

## Summary

The 5-pass review on RFC-0010 (PR #94) plus the broader M15 session accumulated 11 binding rubric additions + 3 review-pattern lessons in session-ephemeral state (gitignored `.claude/pr-review-loop.local.md`) that would have been lost when the worktree was cleaned. Promoting the load-bearing ones to three durable surfaces:

## Changes

### `docs/rfcs/README.md` additions

- **§ Authoring conventions:** 8 binding rules for receiver-scope RFCs, each carrying a falsifying grep / test:
  - § Operator surfaces, § Rubric trace, § Performance budget required (inline or waived)
  - Wire-attribute name table required for typed-record RFCs
  - Sibling-package symbols compile-resolve at HEAD or carry `(planned per …; not yet at HEAD)`
  - Cross-RFC join-keys resolve at design-lock, not defer
  - Grep-validate symbol consistency after every applied review fix
  - FOLLOWUPS entries resolved by closing PR are struck-through with "Resolved by RFC-XXXX §Y"
- **§ Component registration:** the `components.yaml` + `make generate` codegen pattern documented for the next receiver author.

### `docs/notes/reviews.md` additions (matches existing format)

- **Multi-pass review accretes propagation gaps; grep after every applied fix.** Anchor: RFC-0010 commit `63dccbb`.
- **Rubric self-violation sneaks in when the rubric is fresh.** Anchor: PR #94 Phase-2 accepted §Operator-surfaces rubric AND deferred it in the same commit; Phase-4 caught it.
- **Single-maintainer projects collapse multi-stakeholder meeting processes to solo decision exercises.** Anchor: RFC-0010 §15.6.

### `AGENTS.md` additions

- **New load-bearing lesson:** Cross-receiver join contracts resolve at design-lock, not at implementation-merge. Every `M-X emits Y` rubric is implicitly a contract with every `M-Z consumes Y` receiver. Grep MILESTONES.md before opening any receiver-scope RFC. Anchor: PR #94's `tracecore.io/rank` vs `gen_ai.training.io/rank` divergence with RFC-0009.
- **Topic-index pointer to `docs/rfcs/README.md`:** surfaces the receiver-scope RFC conventions added in this same PR so the next agent finds them.

## Test plan

- [x] `make doc-check` clean (banned-phrase lint across 69 markdown files; link integrity verified).
- [x] AI-vocab diff-gate clean (literal commit-tag citations rewritten to SHA + description form).
- [x] Lesson anchors point at real commit SHAs (`63dccbb`, `be3db09`, PR #94) reachable on main.
- [x] AGENTS.md additions match the established `**Bold lesson title** body. Anchor: ...` format.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closed
11 tasks
Summary
Locks M15 design ahead of implementation. Builds on the 2664-line research evidence base at docs/research/m15-container-stdout.md (PR #92) and survived 5 review passes on this branch; see the commit log for the trail.

Key decisions (post-review):
- Build approach BA-2 (port pkg/stanza/fileconsumer); BA-1 adapter rejected because M16's rubric is consistent with either path and presumes tracecore-native.
- Attribute namespace gen_ai.training.* per the upstream proposal at docs/proposals/gen-ai-training-semconv.md, with one documented exception for tracecore.training.data_time_s / iter_time_s (M18 input contract is older than the upstream draft).
- Cursor at /var/lib/tracecore/container_stdout/cursor.json (atomic rename, mode 0600).
- … (tracecore.io/rank → gen_ai.training.io/rank) to canonicalize the cross-RFC label name.

Review history
456f6da, 19bd492, 63dccbb, be3db09
Per-commit message body has the full pushback table.

Test plan
- make doc-check clean across all 5 commits.
- grep -rn "tracecore.io/rank\|tracecore.io/job-id" MILESTONES.md docs/rfcs/ returns only the rename-explanation citation in RFC-0010 (no stale uses).
- HintPodEvicted symbol resolves at components/receivers/k8sevents/hint.go:24.

🤖 Generated with Claude Code