Merged
5 changes: 3 additions & 2 deletions MILESTONES.md
@@ -354,15 +354,15 @@ Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). Ta
- **Depends on:** M1

**Functional rubrics:**
- Tails files matching `/var/log/pods/<namespace>_<podname>_<uid>/<container>/*.log`; parses CRI log format into `body`, `log.iostream`, `timestamp`; reassembles `P` (partial) into `F` (full) bounded by `max_log_size` (default 1 MiB). (per https://kubernetes.io/docs/concepts/cluster-administration/logging/)
- Tails files matching `/var/log/pods/<namespace>_<podname>_<uid>/<container>/*.log`; parses CRI log format into `body`, `log.iostream`, `timestamp`; reassembles `P` (partial) into `F` (full) bounded by `max_log_size` (default 1 MiB; matches the OTel `container` stanza operator default — per [`pkg/stanza/docs/operators/container.md`](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/operators/container.md)). (per https://kubernetes.io/docs/concepts/cluster-administration/logging/)
- Per-rank attribution: derives `gen_ai.training.rank` (canonical join key across receivers) from Pod env vars `RANK`/`WORLD_SIZE`/`LOCAL_RANK`, and `gen_ai.training.job.id` from `JOB_ID` *or* `TORCHELASTIC_RUN_ID` (canonical per [torchrun docs](https://docs.pytorch.org/docs/stable/elastic/run.html) — "equal to the rendezvous run_id"); falls back to Pod labels `gen_ai.training.io/rank`, `gen_ai.training.io/job-id` (matches RFC-0009 / M13 for cross-receiver join consistency); missing → record emitted with `rank=unknown`.
- Structured-log auto-detection: first non-whitespace byte `{` triggers JSON parse; on success emits parsed fields as attributes capped at `max_attributes` (default 16); on failure or non-`{`, passthrough as `body`.
- **Dataloader-timing extraction (M18 feed):** PyTorch has no canonical dataloader-progress format (torchvision `MetricLogger` emits `time: ... data: ...`; detectron2 logs `data_time` as a scalar; neither prints `iter_time`). The receiver ships with a tracecore-placeholder regex (`dataloader_regex` config key) — operators MUST override it to match their training stack. When a log line matches, receiver emits `tracecore.training.data_time_s` and `tracecore.training.iter_time_s` keyed by `gen_ai.training.rank` for M18's straggler detector to join on; schema lives in fixture.
- Pod-discovery: single per-node Pod informer with `FieldSelector=spec.nodeName=<NODE_NAME>` (env via downward API); new Pod → new file watcher within 5s p99 *(unverified)*.
- Rotation: kubelet rotates by renaming `0.log` → `0.log.<N>` and creating new `0.log`; receiver follows inode, not path; integration test asserts zero record loss.
- Straggler-pattern feed: emits per-rank line-rate (`tracecore.container.lines_per_s`) as derived metric on 15s window.
- Checkpoint persistence: cursor stored under `/var/lib/tracecore/container_stdout/cursor.json` (atomic rename); on restart resumes within 1 record of last-acknowledged position.
- **Degraded mode:** `/var/log/pods` missing or unreadable, Pod informer flap, or kubelet rotation breakage each set `Degraded()=true` with an `IncError("<kind>")` counter; receiver stays alive and recovers without restart; `FAILURE-MODES.md` row exists per failure mode.
- **Degraded mode:** `/var/log/pods` missing or unreadable, Pod informer flap, or kubelet rotation breakage each set `Degraded()=true` with an `IncError("<kind>")` counter; receiver stays alive and recovers without restart; `FAILURE-MODES.md` row exists per failure mode. Rotation-stalled is detected when `0.log` size exceeds `containerLogMaxSize` for ≥30 s (3× kubelet's default `containerLogMonitorInterval` of 10 s, set at [`pkg/kubelet/apis/config/v1beta1/defaults.go`](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/v1beta1/defaults.go)); surfaced via `IncError("rotation_stalled")`.

**Non-functional rubrics:**
- Overhead: ≤0.10% CPU, ≤0.3 Mbps egress sustained, ≤20 MB RSS. (per NORTHSTARS O2 "Container stdout (rate-limited)" row)
@@ -371,6 +371,7 @@ Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). Ta
- Back-pressure: 1M-line burst from one rank MUST NOT block sibling streams; bounded per-file goroutine + bounded channel (1024); `goleak` test.
- File-handle hygiene: ≤2× pod-count open fds steady-state; closed within 30s of Pod `Terminated`; verified by `lsof` golden.
- Security: hostPath read-only mounts for `/var/log/pods` AND `/var/log/containers` (containerd symlink target — per https://github.com/containerd/containerd/issues/11149); runs as non-root with fsGroup matching kubelet log-group.
- Reliability caveat: containerd [#11149](https://github.com/containerd/containerd/issues/11149) (open upstream, last updated 2025-05-30) silently drops bytes from `0.log` when an in-container process reads its own FD 1 (e.g. application self-tee, sidecar reading `/proc/1/fd/1`). The mechanism is shared-pipe contention with containerd's log copier, not generic disk-I/O backpressure; standard workloads that do not read FD 1 are unaffected. RUNBOOK enumerates this failure mode; receiver does not claim universal lossless delivery from `0.log` alone.
- Panic recovery: per-file goroutines wrap `defer/recover`; malformed CRI line MUST NOT crash other watchers. (per PRINCIPLES §1)
- Shutdown: SIGTERM → 1s Phase-1 budget; in-flight reads abandoned cleanly; cursor flushed best-effort.
