[docs] research: M15 container-stdout receiver (evidence base for design phase) #92
Merged
…alysis + cross-cutting findings)

Research evidence base for M15 (Kubernetes container-stdout receiver), to be consumed by the M15 design doc. Not a design doc itself.

Highlights:
- Build approach: 5 options (BA-0 sidecar / BA-1 adapter / BA-2 port fileconsumer / BA-3 reimplement / BA-4 defer). No single recommendation; §15.6 has a 1-week stakeholder-meeting resolution path.
- Namespace: NORTHSTARS O4 upstream PR is NOT filed as of 2026-05-19 (verified via gh search). §7.4 reframes the strategic-bet posture. Owner decision needed; doc punts.
- containerd #11149 mechanism corrected: shared-pipe contention when an in-container process reads FD 1, not generic disk-I/O backpressure.
- 8 rubric edits drafted; only R-4 / R-5 / R-8 are proposable now (build-approach-independent). R-1 / R-2 / R-3 / R-6 / R-7 are gated.
- Typed Record schema (§8.1) for M18/M19 downstream consumers.
- Security threat model (§16) with 7-adversary matrix.
- 9-alert catalog (§19), 7-failure-mode coverage (§17), overhead-budget methodology (§20).
- 20 follow-ups tagged by requirement type (§21); none need GPU or production data.

Three review rounds + adversarial pass + independent design-readiness and stakeholder-perspective reviewers; ~70% of load-bearing claims hold up under cross-checking, gaps explicitly enumerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 19, 2026
trilamsr added a commit that referenced this pull request on May 19, 2026
…ocked draft) (#94)

## Summary

Locks M15 design ahead of implementation. Builds on the 2664-line research evidence base at [`docs/research/m15-container-stdout.md`](https://github.com/TraceCoreAI/tracecore/blob/main/docs/research/m15-container-stdout.md) (PR #92) and survived 5 review passes on this branch; see commit log for the trail.

Key decisions (post-review):

- Build approach **BA-2** (port `pkg/stanza/fileconsumer`); BA-1 adapter rejected because M16's rubric is consistent with either path and presumes tracecore-native.
- Attribute namespace `gen_ai.training.*` per the upstream proposal at `docs/proposals/gen-ai-training-semconv.md`, with one documented exception for `tracecore.training.data_time_s` / `iter_time_s` (M18 input contract is older than the upstream draft).
- Cursor JSON at `/var/lib/tracecore/container_stdout/cursor.json` (atomic rename, mode 0600).
- Pod attribution chain mirrors RFC-0009 for cross-receiver join consistency.
- **MILESTONES.md M15 line 358 amended** (`tracecore.io/rank` → `gen_ai.training.io/rank`) to canonicalize the cross-RFC label name.

## Review history

| Pass | Commit | Top finding |
|---|---|---|
| Self-review | `456f6da` | Softened M16 claim; surfaced validate-at-load; 4 deferrals |
| 8 stakeholder lenses | `19bd492` | BLOCKER (namespace carve-out) resolved; 12 CONCERN/NIT applied; 13 deferred |
| 2 adversarial | `63dccbb` | Cross-RFC label resolved in-PR; propagation gaps closed; "compile-time pin" corrected |
| 2 A+ aspiration | `be3db09` | §Operator-surfaces inlined; §Performance promoted from FOLLOWUPS; SHA-pinned manifest URL |
| 2 simplification | (this commit) | Removed meta-prose, duplicate justifications, self-referential open-question tombstones |

Per-commit message body has the full pushback table.

## Test plan

- [x] `make doc-check` clean across all 5 commits.
- [x] `grep -rn "tracecore.io/rank\|tracecore.io/job-id" MILESTONES.md docs/rfcs/` returns only the rename-explanation citation in RFC-0010 (no stale uses).
- [x] `HintPodEvicted` symbol resolves at `components/receivers/k8sevents/hint.go:24`.
- [x] AI-vocab diff gate clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request on May 19, 2026
Three build-approach-independent rubric refinements from the M15 research evidence base (PR #92) and RFC-0010 (PR #94):

- R-4 (cosmetic): max_log_size citation now references the OTel `container` stanza operator default (1 MiB matches; documents the prior art whether we depend on it via BA-1 or port it via BA-2).
- R-5 (new reliability caveat): containerd #11149 silently drops bytes from 0.log when an in-container process reads its own FD 1. Shared-pipe contention; not generic backpressure. Standard workloads unaffected. Was previously misframed in research round 1 as "disk-I/O backpressure"; the 2025-01-22 reproducer in the issue pinpoints the mechanism.
- R-8 (degraded-mode specificity): rotation-stalled is now defined concretely as 0.log size > containerLogMaxSize for ≥30 s (3× kubelet default containerLogMonitorInterval of 10 s, cited at source). Surfaced via IncError("rotation_stalled"). Prior text was generic "kubelet rotation breakage" with no detection mechanism.

R-1 / R-2 (namespace) withheld: OD-12 effectively resolved by the upstream proposal at docs/proposals/gen-ai-training-semconv.md (O4-overdue first-draft KPI closed by PR #93). No rename needed.

R-3 / R-7 (rotation correctness, gzip handling) deferred: pending the corresponding integration-test fixtures (TestContainerStdout_*) landing in the M15 implementation phase per RFC-0010.

R-6 (bbolt cursor) dropped: BA-2 build approach (RFC-0010) keeps the JSON cursor at /var/lib/tracecore/container_stdout/cursor.json as originally rubricked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request on May 19, 2026
## Summary

Three build-approach-independent rubric refinements for M15, surfaced by the research doc (PR #92) and locked-in by RFC-0010 (PR #94). Five additional rubric edits (R-1, R-2, R-3, R-6, R-7) are intentionally withheld per the rationale below.

## Diff

3 lines changed in `MILESTONES.md`:

- **R-4 (cosmetic):** `max_log_size` citation now references the OTel `container` stanza operator default. 1 MiB matches; documents the prior art whether we depend on it (BA-1) or port it (BA-2 per RFC-0010).
- **R-5 (new reliability caveat):** containerd [#11149](containerd/containerd#11149) silently drops bytes from `0.log` when an in-container process reads its own FD 1. Mechanism is shared-pipe contention, not generic backpressure (round-1 research framed it incorrectly; the 2025-01-22 reproducer in the issue pinpoints the mechanism). Standard workloads unaffected.
- **R-8 (degraded-mode specificity):** rotation-stalled is now defined concretely as `0.log` size > `containerLogMaxSize` for ≥30 s (3× kubelet default `containerLogMonitorInterval` of 10 s, cited at source). Surfaced via `IncError("rotation_stalled")`. Prior text was generic "kubelet rotation breakage" with no detection mechanism.

## Why not the other 5 rubric edits

- **R-1 / R-2 (namespace):** OD-12 effectively resolved by the upstream proposal at `docs/proposals/gen-ai-training-semconv.md` (PR #93 closed the O4-overdue first-draft KPI). No rename needed.
- **R-3 / R-7 (rotation correctness, gzip handling):** deferred pending the corresponding `TestContainerStdout_*` integration-test fixtures landing in the M15 implementation phase per RFC-0010.
- **R-6 (bbolt cursor):** dropped because BA-2 keeps the JSON cursor at `/var/lib/tracecore/container_stdout/cursor.json` as originally rubricked.

## Test plan

- [x] `make doc-check` clean (273 markdown links resolve, banned-phrase lint clean across 67 files, RUNBOOK ↔ alerts pairing clean).
- [x] Containerd #11149 mechanism re-verified at the issue's 2025-01-22 reproducer comment; mechanism is shared-pipe contention.
- [x] `containerLogMonitorInterval` default cited verbatim at `pkg/kubelet/apis/config/v1beta1/defaults.go`; verified in research §13.8.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary

Research evidence base for M15 (Kubernetes container-stdout receiver) to feed the design doc. Not a design doc itself; no code, no MILESTONES.md edits.

Key findings:

- No upstream `gen_ai.training.*` PR exists as of 2026-05-19 (verified via `gh search`). NORTHSTARS O4 M1-deadline commitment is overdue. §7.4 routes the decision to the O4 owner via new OD-12.
- `pipeline.ReceiverFactory` is not upstream OTel's `receiver.Factory`. §15 enumerates 5 build approaches; §15.6 has a 1-week stakeholder-meeting resolution path.

## Actionable now

## Review order

## Test plan

- `make doc-check` clean.
- `make lint` 0 issues.
- … (`selftelemetry.Receiver`, MILESTONES line 360, regex against Go RE2, upstream PR existence).

🤖 Generated with Claude Code