From 6a48806b31507f3bbbbcda0586ffd2625fe5e621 Mon Sep 17 00:00:00 2001
From: Tri Lam
Date: Tue, 19 May 2026 11:37:58 -0700
Subject: [PATCH] [docs] milestones m15 rubric: precursor edits (R-4/R-5/R-8)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three build-approach-independent rubric refinements from the M15 research
evidence base (PR #92) and RFC-0010 (PR #94):

- R-4 (cosmetic): max_log_size citation now references the OTel
  `container` stanza operator default (1 MiB matches; documents the prior
  art whether we depend on it via BA-1 or port it via BA-2).
- R-5 (new reliability caveat): containerd #11149 silently drops bytes
  from 0.log when an in-container process reads its own FD 1. Shared-pipe
  contention; not generic backpressure. Standard workloads unaffected.
  Was previously misframed in research round 1 as "disk-I/O
  backpressure"; the 2025-01-22 reproducer in the issue pinpoints the
  mechanism.
- R-8 (degraded-mode specificity): rotation-stalled is now defined
  concretely as 0.log size > containerLogMaxSize for ≥30 s (3× kubelet
  default containerLogMonitorInterval of 10 s, cited at source). Surfaced
  via IncError("rotation_stalled"). Prior text was generic "kubelet
  rotation breakage" with no detection mechanism.

R-1 / R-2 (namespace) withheld: OD-12 effectively resolved by the
upstream proposal at docs/proposals/gen-ai-training-semconv.md
(O4-overdue first-draft KPI closed PR #93). No rename needed.

R-3 / R-7 (rotation correctness, gzip handling) deferred: pending the
corresponding integration-test fixtures (TestContainerStdout_*) landing
in the M15 implementation phase per RFC-0010.

R-6 (bbolt cursor) dropped: BA-2 build approach (RFC-0010) keeps the JSON
cursor at /var/lib/tracecore/container_stdout/cursor.json as originally
rubricked.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 MILESTONES.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/MILESTONES.md b/MILESTONES.md
index de4d12e4..02129b42 100644
--- a/MILESTONES.md
+++ b/MILESTONES.md
@@ -354,7 +354,7 @@ Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). Ta
 - **Depends on:** M1
 
 **Functional rubrics:**
-- Tails files matching `/var/log/pods/__//*.log`; parses CRI log format into `body`, `log.iostream`, `timestamp`; reassembles `P` (partial) into `F` (full) bounded by `max_log_size` (default 1 MiB). (per https://kubernetes.io/docs/concepts/cluster-administration/logging/)
+- Tails files matching `/var/log/pods/__//*.log`; parses CRI log format into `body`, `log.iostream`, `timestamp`; reassembles `P` (partial) into `F` (full) bounded by `max_log_size` (default 1 MiB; matches the OTel `container` stanza operator default — per [`pkg/stanza/docs/operators/container.md`](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/operators/container.md)). (per https://kubernetes.io/docs/concepts/cluster-administration/logging/)
 - Per-rank attribution: derives `gen_ai.training.rank` (canonical join key across receivers) from Pod env vars `RANK`/`WORLD_SIZE`/`LOCAL_RANK`, and `gen_ai.training.job.id` from `JOB_ID` *or* `TORCHELASTIC_RUN_ID` (canonical per [torchrun docs](https://docs.pytorch.org/docs/stable/elastic/run.html) — "equal to the rendezvous run_id"); falls back to Pod labels `gen_ai.training.io/rank`, `gen_ai.training.io/job-id` (matches RFC-0009 / M13 for cross-receiver join consistency); missing → record emitted with `rank=unknown`.
 - Structured-log auto-detection: first non-whitespace byte `{` triggers JSON parse; on success emits parsed fields as attributes capped at `max_attributes` (default 16); on failure or non-`{`, passthrough as `body`.
 - **Dataloader-timing extraction (M18 feed):** PyTorch has no canonical dataloader-progress format (torchvision `MetricLogger` emits `time: ... data: ...`; detectron2 logs `data_time` as a scalar; neither prints `iter_time`). The receiver ships with a tracecore-placeholder regex (`dataloader_regex` config key) — operators MUST override it to match their training stack. When a log line matches, receiver emits `tracecore.training.data_time_s` and `tracecore.training.iter_time_s` keyed by `gen_ai.training.rank` for M18's straggler detector to join on; schema lives in fixture.
@@ -362,7 +362,7 @@ Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). Ta
 - Rotation: kubelet rotates by renaming `0.log` → `0.log.` and creating new `0.log`; receiver follows inode, not path; integration test asserts zero record loss.
 - Straggler-pattern feed: emits per-rank line-rate (`tracecore.container.lines_per_s`) as derived metric on 15s window.
 - Checkpoint persistence: cursor stored under `/var/lib/tracecore/container_stdout/cursor.json` (atomic rename); on restart resumes within 1 record of last-acknowledged position.
-- **Degraded mode:** `/var/log/pods` missing or unreadable, Pod informer flap, or kubelet rotation breakage each set `Degraded()=true` with an `IncError("")` counter; receiver stays alive and recovers without restart; `FAILURE-MODES.md` row exists per failure mode.
+- **Degraded mode:** `/var/log/pods` missing or unreadable, Pod informer flap, or kubelet rotation breakage each set `Degraded()=true` with an `IncError("")` counter; receiver stays alive and recovers without restart; `FAILURE-MODES.md` row exists per failure mode. Rotation-stalled is detected when `0.log` size exceeds `containerLogMaxSize` for ≥30 s (3× kubelet's default `containerLogMonitorInterval` of 10 s, set at [`pkg/kubelet/apis/config/v1beta1/defaults.go`](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/v1beta1/defaults.go)); surfaced via `IncError("rotation_stalled")`.
 
 **Non-functional rubrics:**
 - Overhead: ≤0.10% CPU, ≤0.3 Mbps egress sustained, ≤20 MB RSS. (per NORTHSTARS O2 "Container stdout (rate-limited)" row)
@@ -371,6 +371,7 @@ Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). Ta
 - Back-pressure: 1M-line burst from one rank MUST NOT block sibling streams; bounded per-file goroutine + bounded channel (1024); `goleak` test.
 - File-handle hygiene: ≤2× pod-count open fds steady-state; closed within 30s of Pod `Terminated`; verified by `lsof` golden.
 - Security: hostPath read-only mounts for `/var/log/pods` AND `/var/log/containers` (containerd symlink target — per https://github.com/containerd/containerd/issues/11149); runs as non-root with fsGroup matching kubelet log-group.
+- Reliability caveat: containerd [#11149](https://github.com/containerd/containerd/issues/11149) (open upstream, last updated 2025-05-30) silently drops bytes from `0.log` when an in-container process reads its own FD 1 (e.g. application self-tee, sidecar reading `/proc/1/fd/1`). The mechanism is shared-pipe contention with containerd's log copier, not generic disk-I/O backpressure; standard workloads that do not read FD 1 are unaffected. RUNBOOK enumerates this failure mode; receiver does not claim universal lossless delivery from `0.log` alone.
 - Panic recovery: per-file goroutines wrap `defer/recover`; malformed CRI line MUST NOT crash other watchers. (per PRINCIPLES §1)
 - Shutdown: SIGTERM → 1s Phase-1 budget; in-flight reads abandoned cleanly; cursor flushed best-effort.
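
Reviewer note (illustrative only, not part of the patch): the R-8 rule above could be sketched as a small state machine. This is a minimal sketch under stated assumptions, not the receiver's implementation. The names `stallDetector`, `Observe`, `errorCounter`, and `at` are hypothetical; the `IncError(reason string)` signature is assumed from the rubric text; the 10 MiB cap is assumed to equal kubelet's default `containerLogMaxSize`; and counting the error once per stall episode (rather than on every sample) is a design choice made here, not something the rubric specifies.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Assumed to match kubelet's default containerLogMaxSize (10Mi).
	defaultMaxSize = 10 << 20
	// 3x kubelet's default containerLogMonitorInterval of 10 s, per the rubric.
	stallThreshold = 30 * time.Second
)

// errorCounter is a hypothetical stand-in for the receiver's IncError sink.
type errorCounter map[string]int

func (c errorCounter) IncError(reason string) { c[reason]++ }

// stallDetector tracks how long 0.log has stayed above the size cap.
type stallDetector struct {
	maxSize   int64
	threshold time.Duration
	incError  func(reason string)
	overSince time.Time // zero while the file is at or under the cap
	stalled   bool      // true once the current episode has been reported
}

func newStallDetector(maxSize int64, threshold time.Duration, incError func(string)) *stallDetector {
	return &stallDetector{maxSize: maxSize, threshold: threshold, incError: incError}
}

// Observe records one size sample of 0.log taken at time now and reports
// whether rotation is considered stalled. The error counter is incremented
// once per episode, on the first sample that crosses the threshold.
func (d *stallDetector) Observe(size int64, now time.Time) bool {
	if size <= d.maxSize {
		// Rotation caught up (or was never needed): reset the episode.
		d.overSince = time.Time{}
		d.stalled = false
		return false
	}
	if d.overSince.IsZero() {
		d.overSince = now
	}
	if now.Sub(d.overSince) < d.threshold {
		return false
	}
	if !d.stalled {
		d.stalled = true
		d.incError("rotation_stalled")
	}
	return true
}

// at is a test helper that builds a timestamp sec seconds after the Unix epoch.
func at(sec int) time.Time { return time.Unix(int64(sec), 0) }

func main() {
	errs := errorCounter{}
	d := newStallDetector(defaultMaxSize, stallThreshold, errs.IncError)
	// Sample an oversized 0.log every 10 s, as a poller might.
	for _, sec := range []int{0, 10, 20, 30, 40} {
		stalled := d.Observe(defaultMaxSize+1, at(sec))
		fmt.Printf("t=%2ds stalled=%v\n", sec, stalled)
	}
	fmt.Println("rotation_stalled count:", errs["rotation_stalled"])
}
```

Passing the timestamp into `Observe` instead of calling `time.Now()` inside it keeps the 30 s rule testable without sleeping; a real receiver would presumably feed it from a `stat` poll of `0.log`.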