[docs] research: M15 container-stdout receiver (evidence base for design phase)#92

Merged: trilamsr merged 1 commit into main from worktree-m15-research on May 19, 2026

@trilamsr trilamsr commented May 19, 2026

Contributor

Summary

Research evidence base for M15 (Kubernetes container-stdout receiver) to feed the design doc. Not a design doc itself; no code, no MILESTONES.md edits.

Key findings:

  • No upstream gen_ai.training.* PR exists as of 2026-05-19 (verified via gh search). NORTHSTARS O4 M1-deadline commitment is overdue. §7.4 routes the decision to the O4 owner via new OD-12.
  • containerd #11149 mechanism corrected: shared-pipe contention when in-container processes read FD 1, not generic disk-I/O backpressure. Standard workloads unaffected.
  • Tracecore's pipeline.ReceiverFactory is not upstream OTel's receiver.Factory. §15 enumerates 5 build approaches; §15.6 has a 1-week stakeholder-meeting resolution path.

Actionable now

  • 3 rubric edits proposable: R-4, R-5, R-8 (build-approach-independent).
  • 5 rubric edits gated on OD-11 / OD-12 or tests that don't yet exist (R-1 / R-2 / R-3 / R-6 / R-7).
  • OD-11 (build approach) blocks 6 other open decisions; §15.6 has the resolution process.

Review order

  1. §1 decision sheet.
  2. §15 build-approach correction.
  3. §7 namespace reframing.
  4. §10.1 + §21 owner table + tagged follow-ups.
  5. §8.1 / §16 / §17 / §19 design-doc inputs.

Test plan

  • make doc-check clean.
  • make lint 0 issues.
  • Claims cross-checked at source on 2026-05-19 (k8sevents lines, selftelemetry.Receiver, MILESTONES line 360, regex against Go RE2, upstream PR existence).

🤖 Generated with Claude Code

…alysis + cross-cutting findings)

Research evidence base for M15 (Kubernetes container-stdout receiver),
to be consumed by the M15 design doc. Not a design doc itself.

Highlights:
- Build approach: 5 options (BA-0 sidecar / BA-1 adapter / BA-2 port
  fileconsumer / BA-3 reimplement / BA-4 defer). No single recommendation;
  §15.6 has a 1-week stakeholder-meeting resolution path.
- Namespace: NORTHSTARS O4 upstream PR is NOT filed as of 2026-05-19
  (verified via gh search). §7.4 reframes the strategic-bet posture.
  Owner decision needed; doc punts.
- containerd #11149 mechanism corrected: shared-pipe contention when
  in-container processes read FD 1, not generic disk-I/O backpressure.
- 8 rubric edits drafted; only R-4 / R-5 / R-8 are proposable now
  (build-approach-independent). R-1 / R-2 / R-3 / R-6 / R-7 are gated.
- Typed Record schema (§8.1) for M18/M19 downstream consumers.
- Security threat model (§16) with 7-adversary matrix.
- 9-alert catalog (§19), 7-failure-mode coverage (§17), overhead-budget
  methodology (§20).
- 20 follow-ups tagged by requirement type (§21); none need GPU or
  production data.

Three review rounds + adversarial pass + independent design-readiness
and stakeholder-perspective reviewers; ~70% of load-bearing claims hold
up under cross-checking, and the gaps are explicitly enumerated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@trilamsr trilamsr merged commit a59a552 into main May 19, 2026
9 checks passed
@trilamsr trilamsr deleted the worktree-m15-research branch May 19, 2026 17:56
trilamsr added a commit that referenced this pull request May 19, 2026
…ocked draft) (#94)

## Summary

Locks M15 design ahead of implementation. Builds on the 2664-line
research evidence base at
[`docs/research/m15-container-stdout.md`](https://github.com/TraceCoreAI/tracecore/blob/main/docs/research/m15-container-stdout.md)
(PR #92) and survived 5 review passes on this branch — see commit log
for the trail.

Key decisions (post-review):
- Build approach **BA-2** (port `pkg/stanza/fileconsumer`); BA-1 adapter
rejected because M16's rubric is consistent with either path and
presumes tracecore-native.
- Attribute namespace `gen_ai.training.*` per the upstream proposal at
`docs/proposals/gen-ai-training-semconv.md`, with one documented
exception for `tracecore.training.data_time_s` / `iter_time_s` (M18
input contract is older than the upstream draft).
- Cursor JSON at `/var/lib/tracecore/container_stdout/cursor.json`
(atomic rename, mode 0600).
- Pod attribution chain mirrors RFC-0009 for cross-receiver join
consistency.
- **MILESTONES.md M15 line 358 amended** (`tracecore.io/rank` →
`gen_ai.training.io/rank`) to canonicalize the cross-RFC label name.

## Review history

| Pass | Commit | Top finding |
|---|---|---|
| Self-review | `456f6da` | Softened M16 claim; surfaced
validate-at-load; 4 deferrals |
| 8 stakeholder lenses | `19bd492` | BLOCKER (namespace carve-out)
resolved; 12 CONCERN/NIT applied; 13 deferred |
| 2 adversarial | `63dccbb` | Cross-RFC label resolved in-PR;
propagation gaps closed; "compile-time pin" corrected |
| 2 A+ aspiration | `be3db09` | §Operator-surfaces inlined; §Performance
promoted from FOLLOWUPS; SHA-pinned manifest URL |
| 2 simplification | (this commit) | Removed meta-prose, duplicate
justifications, self-referential open-question tombstones |

Per-commit message body has the full pushback table.

## Test plan

- [x] `make doc-check` clean across all 5 commits.
- [x] `grep -rn "tracecore.io/rank\|tracecore.io/job-id" MILESTONES.md
docs/rfcs/` returns only the rename-explanation citation in RFC-0010 (no
stale uses).
- [x] `HintPodEvicted` symbol resolves at
`components/receivers/k8sevents/hint.go:24`.
- [x] AI-vocab diff gate clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
Three build-approach-independent rubric refinements from the M15
research evidence base (PR #92) and RFC-0010 (PR #94):

- R-4 (cosmetic): max_log_size citation now references the OTel
  `container` stanza operator default (1 MiB matches; documents the
  prior art whether we depend on it via BA-1 or port it via BA-2).
- R-5 (new reliability caveat): containerd #11149 silently drops
  bytes from 0.log when an in-container process reads its own FD 1.
  Shared-pipe contention; not generic backpressure. Standard
  workloads unaffected. Was previously misframed in research round 1
  as "disk-I/O backpressure"; the 2025-01-22 reproducer in the issue
  pinpoints the mechanism.
- R-8 (degraded-mode specificity): rotation-stalled is now defined
  concretely as 0.log size > containerLogMaxSize for ≥30 s
  (3× kubelet default containerLogMonitorInterval of 10 s, cited at
  source). Surfaced via IncError("rotation_stalled"). Prior text was
  generic "kubelet rotation breakage" with no detection mechanism.

R-1 / R-2 (namespace) withheld: OD-12 effectively resolved by the
upstream proposal at docs/proposals/gen-ai-training-semconv.md
(PR #93 closed the O4-overdue first-draft KPI). No rename needed.

R-3 / R-7 (rotation correctness, gzip handling) deferred: pending
the corresponding integration-test fixtures (TestContainerStdout_*)
landing in the M15 implementation phase per RFC-0010.

R-6 (bbolt cursor) dropped: BA-2 build approach (RFC-0010) keeps the
JSON cursor at /var/lib/tracecore/container_stdout/cursor.json as
originally rubricked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr added a commit that referenced this pull request May 19, 2026
## Summary

Three build-approach-independent rubric refinements for M15, surfaced by
the research doc (PR #92) and locked in by RFC-0010 (PR #94). Five
additional rubric edits (R-1, R-2, R-3, R-6, R-7) are intentionally
withheld per the rationale below.

## Diff

3 lines changed in `MILESTONES.md`:

- **R-4 (cosmetic):** `max_log_size` citation now references the OTel
`container` stanza operator default. 1 MiB matches; documents the prior
art whether we depend on it (BA-1) or port it (BA-2 per RFC-0010).
- **R-5 (new reliability caveat):** containerd
[#11149](https://github.com/containerd/containerd/issues/11149) silently
drops bytes from `0.log` when an in-container process reads its own
FD 1. Mechanism is shared-pipe contention, not generic backpressure
(round-1 research framed it incorrectly; the 2025-01-22 reproducer in
the issue pinpoints the mechanism). Standard workloads unaffected.
- **R-8 (degraded-mode specificity):** rotation-stalled is now defined
concretely as `0.log` size > `containerLogMaxSize` for ≥30 s (3× kubelet
default `containerLogMonitorInterval` of 10 s, cited at source).
Surfaced via `IncError("rotation_stalled")`. Prior text was generic
"kubelet rotation breakage" with no detection mechanism.

## Why not the other 5 rubric edits

- **R-1 / R-2 (namespace):** OD-12 effectively resolved by the upstream
proposal at `docs/proposals/gen-ai-training-semconv.md` (PR #93 closed
the O4-overdue first-draft KPI). No rename needed.
- **R-3 / R-7 (rotation correctness, gzip handling):** deferred pending
the corresponding `TestContainerStdout_*` integration-test fixtures
landing in the M15 implementation phase per RFC-0010.
- **R-6 (bbolt cursor):** dropped because BA-2 keeps the JSON cursor at
`/var/lib/tracecore/container_stdout/cursor.json` as originally
rubricked.

## Test plan

- [x] `make doc-check` clean (273 markdown links resolve, banned-phrase
lint clean across 67 files, RUNBOOK ↔ alerts pairing clean).
- [x] Containerd #11149 mechanism re-verified at the issue's 2025-01-22
reproducer comment; mechanism is shared-pipe contention.
- [x] `containerLogMonitorInterval` default cited verbatim at
`pkg/kubelet/apis/config/v1beta1/defaults.go`; verified in research
§13.8.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>