diff --git a/docs/research/m15-container-stdout.md b/docs/research/m15-container-stdout.md new file mode 100644 index 00000000..07662e91 --- /dev/null +++ b/docs/research/m15-container-stdout.md @@ -0,0 +1,2664 @@ +# M15 — Container-stdout receiver: research + +Synthesis of six parallel research passes for [`M15` in +MILESTONES.md](../../MILESTONES.md#m15-container-stdout-receiver), +performed 2026-05-19 against current upstream sources. Purpose: address +the architectural knowledge gaps **before** any design or code work. + +This file is not a design doc. It is the evidence base for one. Open +decisions are listed in §10. Rubric-edit asks are listed in §11. + +## 1. Decision sheet + +| Area | Finding | Decision implication | +|---|---|---| +| CRI on-disk format | Space-delimited `TS STREAM TAG BODY\n`, RFC3339Nano timestamp, `F`/`P` partial-line tags. Identical across containerd and CRI-O. | Parser is well-specified. Use the off-the-shelf parser, not a hand-rolled regex. | +| Path topology | Runtime writes `/var/log/pods/__//.log`. `/var/log/containers/*.log` is a kubelet-created symlink farm. | Tail `/var/log/pods/**/*.log` directly. Mount `/var/log` (parent) read-only, not just `/var/log/pods`. | +| Rotation driver | Kubelet (not the runtime) rotates via `rename → ReopenContainerLog → gzip`. Timestamped suffix, not numeric. | Tailer must handle inode changes and a brief window where `0.log` is absent. | +| Tailer strategy | OTel filelog uses **poll + fingerprint**, not `fsnotify`/`tail` libs. Both common Go tail libs (`nxadm/tail`, `hpcloud/tail`) are stale or unmaintained. | Depend on `pkg/stanza/fileconsumer` rather than rolling our own or pulling a stale dep. | +| Build approach | Filelog receiver + `container` stanza operator already do glob, rotation, partial-line recombine, pod attribution from path. | **Depend** on filelog + container operator. Add tracecore features as downstream processors. 
| +| Per-rank attribution | No upstream OTel component reads pod env vars. Filelog + `k8sattributes` only attributes by IP / UID / labels. | Net-new for tracecore. Read `Pod.spec.containers[].env` via informer, not by execing in containers. | +| SemConv `gen_ai.*` | Upstream explicitly forbids vendor-prefixed `gen_ai.*` attributes. No `gen_ai.training.*` exists. Active proposal puts training under `rl.*`. | **Rubric edit needed:** rename `gen_ai.training.rank` → `tracecore.training.rank`. See §7 and §11. | +| Dataloader regex | Only torchvision and detectron2 emit `data_time` by default. Lightning / NeMo / HF Trainer / Composer do not. | Ship a multi-pattern default covering torchvision + detectron2. Document the rest as user-instrumentation territory. | +| containerd #11149 | Open upstream bug, last updated 2025-05-30, marked Stale. Mechanism (per the 2025-01-22 reproducer in the issue): container stdout can be silently dropped when an in-container process reads from FD 1 (e.g. application self-tee, `cat /proc/1/fd/1`). Shared-pipe contention with containerd's log copier. Standard workloads that do not read FD 1 are unaffected. | Receiver README must enumerate the narrow failure mode rather than claim universal lossless delivery, but the practical surface is small. | + +## 2. CRI log format and path topology + +The wire format is defined in the [kubelet CRI logging design +proposal](https://github.com/kubernetes/design-proposals-archive/blob/main/node/kubelet-cri-logging.md), +not the CRI protobuf spec itself. The CRI proto only covers RPCs like +`ReopenContainerLog`; the on-disk encoding is a kubelet convention each +runtime implements. + +``` +2016-10-06T00:17:09.669794202Z stdout F The content of the log entry 1 +``` + +- **Separator:** single ASCII space between the four fields; body may + contain spaces, so parsers split on the first three only. +- **Timestamp:** `time.RFC3339Nano`, always UTC. +- **Stream:** literal `stdout` or `stderr`. 
+- **Tag:** single rune `F` (full) or `P` (partial), but the proposal + reserves comma-extension (`P,foo`); split tag field on `,` and look + for `F`/`P` rather than match the whole token. +- **Partial-line reassembly:** runtime emits `P` lines until the final + segment is tagged `F`. Consumers concatenate consecutive same-stream + `P` lines with the trailing `F`. Containerd's stdout buffer is 16 KiB; + CRI-O (via conmon) has historically used the same size. +- **Docker JSON format** (`{"log":...,"stream":...,"time":...}`) is + obsolete; dockershim was removed in Kubernetes 1.24. If we target + CRI runtimes only (containerd, CRI-O, cri-dockerd), assume CRI text. + +### Path topology + +The runtime writes directly to +`/var/log/pods/<namespace>_<pod>_<uid>/<container>/<restart-count>.log`. +`/var/log/containers/<pod>_<namespace>_<container>-<container-id>.log` +is a symlink kubelet creates via +[`legacyLogSymlink` in `pkg/kubelet/kuberuntime/legacy.go`](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/legacy.go), +pointing at the real pod-log file. Both paths exist on every modern +distro (GKE/COS, EKS AL2/Bottlerocket, kind, k3s, OpenShift). + +Note the **field-order inversion**: the pod-log directory is +`<namespace>_<pod>_<uid>` but the symlink filename is `<pod>_<namespace>_<container>-<container-id>`. + +Pod and namespace names cannot contain `_` per +[Kubernetes object names](https://kubernetes.io/docs/concepts/overview/working-with-objects/names/) +(namespaces: RFC 1123 label; pods: DNS subdomain), so the underscore +split is unambiguous. Defensively split on the **last two** underscores +(UID first, then pod) so future CRDs that produce pod-like objects with +relaxed naming don't break the parser silently. + +### Mount strategy + +Mount `/var/log` (the parent) read-only. Mounting only `/var/log/pods` +breaks on OpenShift and Bottlerocket where `/var/log/pods` may be a +symlink into `/var/lib/kubelet/...` and the resolved target lives +elsewhere under `/var/log`. Read-only is sufficient: kubelet and the +runtime own writes and rotation; we never write. + +## 3. 
Rotation mechanics + +Driven by the kubelet `containerLogManager` goroutine +(`pkg/kubelet/logs/container_log_manager.go`), not by the runtime. The +runtime only responds to a CRI `ReopenContainerLog` RPC. + +Tunables: `containerLogMaxSize` (default `10Mi`), `containerLogMaxFiles` +(default 5), `containerLogMaxWorkers`, `containerLogMonitorInterval`. +Trigger is **size-only**; no time or line-count thresholds. + +Sequence on rotation: + +1. `rename("0.log", "0.log.20060102-150405")` — atomic on same FS; the + inode is now reachable only via the timestamped path. Any open `fd` + on the old name stays valid (POSIX semantics). +2. Kubelet calls `ReopenContainerLog` over CRI. +3. Runtime closes its old `fd` and opens a fresh `0.log` (new inode). +4. Later, a separate `compressLog` step gzips older rotated files via a + `.tmp` intermediate, leaving `0.log.20060102-150405.gz`. + +Observable on disk in sequence: +`0.log` (live) → `0.log.` (just rotated, plain) → `0.log..gz`. + +Edge cases: + +- **Brief absence window.** Between rename and runtime reopen, `0.log` + does not exist. A tailer that treats "file missing" as fatal will + mis-handle this. Container writes are not lost during the window; the + runtime buffers through the shim until reopen. +- **`MaxFiles=2`.** Retention math is `MaxFiles - 2 = 0` retained + rotated files. A slow tailer can lose the tail of the rotated file + before draining it. +- **Compression race.** If the tailer is slow, the renamed plain file + becomes `.gz` mid-read. Either drain promptly or decompress. +- **Fast writers.** Kubelet does not truncate; it only rotates. A + pod writing faster than the monitor interval can grow `0.log` past + `MaxSize` until the next tick. No drop at the kubelet layer. +- **containerd #11149.** Container stdout can be silently dropped when + **anything inside the container reads from FD 1** (e.g. the + application tees its own stdout, or another in-container process + `cat /proc/1/fd/1`). 
Reproducer in the issue (2025-01-22): writing + 100 lines from nginx with an in-container `cat /proc/1/fd/1 | tee` — + the container's tee saw 90 lines, the kubelet log file saw only 10. + Without the in-container reader, all 100 lines reached the log. + The mechanism is shared-pipe contention between the in-container + reader and containerd's log copier, not generic disk-I/O backpressure. + Open, marked Stale, last updated 2025-05-30, no PR. Narrow failure + mode: standard workloads that don't read their own FD 1 are + unaffected. We still cannot claim universal lossless delivery, but + the practical reliability surface is much smaller than a generic + "0.log is lossy" framing would suggest. + +## 4. Tailer strategy + +OTel filelog uses neither `fsnotify` nor a `tail`-style library. Its +[`pkg/stanza/fileconsumer`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/stanza/fileconsumer) +is a **poll-based reader** that scans the glob each `poll_interval` +(default 200 ms), fingerprints file-head bytes to detect rotation +(handles both move/create and copy/truncate), and persists byte offsets +via the `storage` extension. No OS-level inotify or `inotify`-via-cgo; +works identically on Linux, macOS, Windows. + +Go tail-library survey: + +| Library | Last commit | Inode-follow | Verdict | +|---|---|---|---| +| `github.com/nxadm/tail` | 2023-10 | `Config.ReOpen` works | Maintained but stale; ~400 stars | +| `github.com/hpcloud/tail` | 2018 (abandoned) | Buggy edges | Don't use | +| `github.com/fsnotify/fsnotify` | active | N/A (low-level) | Foundation if rolling our own | +| `pkg/stanza/fileconsumer` (OTel) | active | Fingerprint-based | Production-grade, what filelog uses | + +If we depend on filelog (see §5), we inherit `fileconsumer` for free. +Rolling our own on `fsnotify` only makes sense if we explicitly reject +filelog; pulling `nxadm/tail` adds a stale dependency for no gain. 
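
To make the poll-plus-fingerprint idea concrete, here is a minimal Go sketch of rotation detection by file-head comparison. It is our own illustration under stated assumptions, not `fileconsumer`'s code: `headSize`, `fingerprint`, and `sameFile` are hypothetical names, and the 64-byte head is an arbitrary demo value. Two files that share identical leading bytes would be indistinguishable under this scheme.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// headSize is an illustrative fingerprint length (assumption, not an upstream default).
const headSize = 64

// fingerprint returns up to headSize bytes from the start of the file at path.
func fingerprint(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, headSize)
	n, err := io.ReadFull(f, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return nil, err
	}
	return buf[:n], nil
}

// sameFile reports whether cur can be the same append-only file that
// produced prev: the earlier head must be a prefix of the later head.
func sameFile(prev, cur []byte) bool {
	return len(prev) > 0 && bytes.HasPrefix(cur, prev)
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	dir, err := os.MkdirTemp("", "rotation-demo")
	must(err)
	defer os.RemoveAll(dir)
	live := filepath.Join(dir, "0.log")

	// First poll: remember the head of the live file.
	must(os.WriteFile(live, []byte("2026-05-19T00:00:00Z stdout F first\n"), 0o644))
	prev, err := fingerprint(live)
	must(err)

	// Kubelet-style rotation: rename, then a window where 0.log is absent.
	must(os.Rename(live, live+".20260519-000001"))
	if _, err := fingerprint(live); os.IsNotExist(err) {
		fmt.Println("0.log absent: retry on next poll, not fatal")
	}

	// Runtime reopens a fresh 0.log; its head differs from the old one.
	must(os.WriteFile(live, []byte("2026-05-19T00:00:02Z stdout F second\n"), 0o644))
	cur, err := fingerprint(live)
	must(err)
	fmt.Println("same file after rotation:", sameFile(prev, cur))
}
```

Because CRI lines begin with a nanosecond timestamp, the head bytes of a freshly reopened `0.log` will almost always differ from the rotated file's, which is why head comparison works well for this format.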
+ +**Rubric note.** [`MILESTONES.md` §M15](../../MILESTONES.md#m15-container-stdout-receiver) +says the receiver "follows inode, not path". Poll-plus-fingerprint +satisfies the *intent* (correctly track rotation) but does it by file +identity (fingerprint hash) rather than by `fstat` inode number. The +rubric phrasing predates this finding; the integration-test gate should +verify "zero record loss across rotation", not specifically inode +semantics. See §11. + +## 5. Build approach: depend on filelog + container operator + +Three options were evaluated: **depend** (import filelog as a Go +dependency), **borrow** (fork parsing code), **rewrite** (own the +stack). + +Recommendation: **depend**. Apache-2.0 ↔ Apache-2.0 is clean. + +The [`container` stanza operator](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/operators/container.md) +already handles: auto-detection across docker/crio/containerd, CRI +partial-line recombine (`P`/`F`, `max_log_size: 1MiB`), path parsing of +`NAMESPACE_PODNAME_UID/CONTAINERNAME/RESTARTCOUNT.log`, and extraction +of `k8s.pod.name`, `k8s.pod.uid`, `k8s.container.name`, +`k8s.container.restart_count`, `k8s.namespace.name`, plus `time`, +`logtag`, `log.iostream`. + +Minimal config that gets us most of the rubric for free: + +```yaml +receivers: + filelog: + include: [/var/log/pods/*/*/*.log] + start_at: end + operators: + - type: container +``` + +The three tracecore differentiators all compose **downstream** of this: + +- **Per-rank attribution.** New processor reading + `Pod.spec.containers[*].env` via an informer, mapping configured + env names (e.g. `RANK`, `WORLD_SIZE`, `TORCHELASTIC_RUN_ID`) to + attributes keyed on `(k8s.pod.uid, k8s.container.name)`. 
+ No upstream OTel-collector-contrib processor we surveyed does this + today; + [`resourcedetectionprocessor`'s `env` detector](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/resourcedetectionprocessor) + reads only `OTEL_RESOURCE_ATTRIBUTES`, not arbitrary env vars. + Components not surveyed include `k8sobjectsreceiver` (could be + configured to project pod-spec env into attributes), Vector's + `kubernetes_logs` source, and Datadog Agent autodiscovery — each + may approximate the behavior with different ergonomics. Validate + one of these is unsuitable before declaring this work greenfield. +- **Dataloader regex tagging.** Either a `transform` processor with OTTL + statements, or a dedicated processor with a configurable regex map. + See §6 for the regex landscape. +- **Rate limiting.** New per-key token-bucket processor keyed on + `(k8s.pod.uid, k8s.container.name)`. Filelog has none today; the + closest existing processors are `probabilisticsampler` and + `tailsampling`, neither of which solves per-key budgeting. + +Trade-offs accepted: + +- **Binary size.** Filelog + `pkg/stanza` adds dependency weight, but + tracecore's binary already ships an OTel-derived pipeline; the delta + is small relative to current footprint. Measure before claiming. +- **API stability risk.** `pkg/stanza` Go API is less stable than the + YAML config surface. If we hit churn, the fallback is to wrap filelog + by config rather than by Go import. Lower risk if we only consume + filelog's receiver factory. + +## 6. Pod attribution and dataloader regex + +### Env-var landscape + +Six launchers surveyed; the canonical ground truth is **the in-process +env at training-script time**, not the PodSpec. + +- **torchrun / torch.distributed.run** sets `RANK`, `LOCAL_RANK`, + `GROUP_RANK`, `ROLE_RANK`, `WORLD_SIZE`, `LOCAL_WORLD_SIZE`, + `ROLE_WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`, + `TORCHELASTIC_RESTART_COUNT`, `TORCHELASTIC_RUN_ID`. 
From + [`torch/distributed/run.py` lines 205-241](https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py): + `TORCHELASTIC_RUN_ID` is documented as "equal to the rendezvous + `run_id`"; it is stable only if the operator passes `--rdzv-id=$JOB_ID`. +- **Kubeflow PyTorchJob v1** injects `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, + `MASTER_PORT`, plus `PET_*` mirrors for torchelastic. Worker pods are + offset by +1 (Master is rank 0). PodSpec-level `RANK` is **wrong** as + soon as `nproc_per_node > 1`, because torchrun overwrites it + per-process. Confirmed at + [`pkg/controller.v1/pytorch/envvar.go` (release-1.9)](https://github.com/kubeflow/training-operator/blob/release-1.9/pkg/controller.v1/pytorch/envvar.go). +- **torchx Kubernetes scheduler** sets only `TORCHX_RANK0_HOST` and + `TORCHX_IMAGE` at pod level; rank vars come from the inner `torchrun`. +- **MPI Operator** sets nothing rank-like on the PodSpec. + `OMPI_COMM_WORLD_RANK` is set per-process by `orted` at spawn; it is + not visible in `Pod.spec.containers[].env`. It must be read from + `/proc/<pid>/environ`, or the training script must log it. +- **Ray Train** sets `RANK`/`WORLD_SIZE`/`LOCAL_RANK` etc. **inside the + actor process**, not on the pod. Mirrors the torchrun contract + deliberately ([`ray/train/torch/config.py` line 167](https://github.com/ray-project/ray/blob/master/python/ray/train/torch/config.py)). + Run-id via `RAY_JOB_ID` only when using KubeRay's `RayJob`. +- **JobSet** sets no custom env vars. Attribution by pod labels + (`jobset.sigs.k8s.io/job-index`, `replicatedjob-name`) and the + standard `JOB_COMPLETION_INDEX` from kube's IndexedJob feature. + +**Implication for tracecore.** Reading `Pod.spec.containers[].env` via an informer +gives us correct *node-level* attribution (job ID, replica index) but +wrong *process-level* rank when `nproc_per_node > 1`. The robust +attribution chain is: + +1. Pod metadata for: namespace, pod name, container name, node, job + labels. +2. 
Pod env for: job ID / run ID (Kubeflow injects `TORCHELASTIC_RUN_ID` + when in elastic mode; vanilla pods can be configured via downward + API). +3. Training-script log content (regex on `body`) for: per-process + `RANK` and `LOCAL_RANK`. Many training scripts print these on + startup (e.g., `[Rank 3] starting epoch 0`). + +Document this in the receiver README so operators understand why the +rubric's "derives `RANK` from Pod env vars" is correct for the common +case (1 process per pod) but wrong for `nproc_per_node > 1`. See +§10 open decision OD-2. + +### Dataloader log formats + +Only the torchvision reference scripts and detectron2 emit per-step `data_time` +out of the box. Surveyed in detail: + +| Framework | Per-step `data_time`? | Sample line | +|---|---|---| +| torchvision `MetricLogger` | Yes | `... time: 2.0156 data: 1.4523 max mem: 14523` | +| detectron2 `CommonMetricPrinter` | Yes | `... time: 0.2540 ... data_time: 0.0084 ...` | +| PyTorch Lightning | No (only SimpleProfiler post-train table) | `[_TrainingEpochLoop].train_dataloader_next \| 0.012345 \| 12.345` | +| NVIDIA NeMo | No (only `train_step_timing in s=`) | `... train_step_timing in s=0.512]` | +| HF `transformers.Trainer` | No (only end-of-train aggregates) | `'train_runtime': 1234.5678, 'train_steps_per_second': 8.04` | +| MosaicML Composer | No (throughput only) | `throughput/batches_per_sec: 1234.567` | + +The default `dataloader_regex` should be a multi-pattern alternation +matching torchvision and detectron2. For Lightning / NeMo / HF / +Composer, the only path is user-side instrumentation (log a line with +the same shape from the training script). State in the receiver +README that the **placeholder regex is a starting point**, not a +silver bullet. 
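
As a hedged illustration of how a two-framework pattern of this kind could be applied in Go, the sketch below compiles one alternation-based regex with the stdlib `regexp` package and pulls out both durations. The capture names `iter_time_s` and `data_time_s` mirror the attribute names used in this document, and `extractTimes` is a hypothetical helper, not existing tracecore code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// dataloaderRE matches "time: <n> ... data: <n>" (torchvision) and
// "time: <n> ... data_time: <n>" (detectron2). Group names are assumptions.
var dataloaderRE = regexp.MustCompile(
	`\btime:\s+(?P<iter_time_s>\d+(?:\.\d+)?)\b.*?\b(?:data_time|data):\s+(?P<data_time_s>\d+(?:\.\d+)?)\b`)

// extractTimes returns iteration seconds and dataloader seconds parsed from
// a log body; ok is false when the body does not match the pattern.
func extractTimes(body string) (float64, float64, bool) {
	m := dataloaderRE.FindStringSubmatch(body)
	if m == nil {
		return 0, 0, false
	}
	it, errIter := strconv.ParseFloat(m[dataloaderRE.SubexpIndex("iter_time_s")], 64)
	dt, errData := strconv.ParseFloat(m[dataloaderRE.SubexpIndex("data_time_s")], 64)
	if errIter != nil || errData != nil {
		return 0, 0, false
	}
	return it, dt, true
}

func main() {
	bodies := []string{
		"... time: 2.0156 data: 1.4523 max mem: 14523",
		"... time: 0.2540 last_time: 0.2491 data_time: 0.0084 ...",
		"dataloader_idle_time: 0.05 unrelated:0",
	}
	for _, b := range bodies {
		it, dt, ok := extractTimes(b)
		fmt.Printf("match=%v iter=%v data=%v\n", ok, it, dt)
	}
}
```

The leading `\b` on `time:` is what keeps `last_time:` and `dataloader_idle_time:` from matching, since `_` counts as a word character.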
+ +Proposed default (covers torchvision and detectron2; validated against +Go's `regexp` package on 2026-05-19): + +```regex +\btime:\s+(?P<iter_time_s>\d+(?:\.\d+)?)\b.*?\b(?:data_time|data):\s+(?P<data_time_s>\d+(?:\.\d+)?)\b +``` + +**Validation:** the regex compiles under Go RE2 and produces correct +captures across five test inputs. Test fixture: + +| Input | iter_time_s | data_time_s | +|---|---|---| +| `... time: 2.0156 data: 1.4523 max mem: 14523` (torchvision) | `2.0156` | `1.4523` | +| `... time: 0.2540 last_time: 0.2491 data_time: 0.0084 ...` (detectron2) | `0.2540` | `0.0084` | +| `dataloader_idle_time: 0.05 unrelated:0` (false-positive guard) | (no match) | (no match) | +| `time: 2 data: 1` (integer seconds) | `2` | `1` | +| `no match here` | (no match) | (no match) | + +Design notes: + +- **`\d+(?:\.\d+)?` accepts integer and decimal seconds.** Round-1's + `\d+\.\d+` rejected `time: 2 data: 1`; the new form is more + defensive. +- **Alternation order is `data_time` first, then `data`.** Go's + `regexp` uses leftmost-first semantics and prefers earlier + alternatives, so `data_time:` is tried before `data:`. The literal + `:` after the group also means `data` alone can never match the + prefix of `data_time:` and leave `_time` residue. +- **Trailing `\b` after each numeric capture anchors the end.** Round-2 + used `(?=\s|$)` lookahead, which Go's RE2 engine does not support + (`error parsing regexp: invalid or unsupported Perl syntax: '(?='`). + `\b` is the RE2-compatible substitute: it matches between a digit + and any non-word character (space, comma, end-of-string, + punctuation), which is what we want at the end of a numeric token. +- **`.*?` is intentionally non-greedy.** If a log line emits multiple + `time:`/`data:` pairs, the non-greedy match preferentially binds the + first pair, so iter and data are correctly paired. +- **No multi-line concerns.** OTel `container` operator recombines + `P`/`F` partials into a single record before this regex sees the + body. 
The same holds under any BA-* build approach: CRI partial-line + reassembly happens before the regex. +- **Go RE2 lookahead is unavailable.** Do not re-introduce `(?=…)` in + any tracecore-emitted regex; the regex must run unchanged in both + tracecore-native code and (under BA-1) OTel `ExtractPatterns`, both + of which use Go's stdlib `regexp` package. + +The lit-GPT format in §13.7 is **a separate pattern**, not covered by +this default. Multi-pattern support would need a config map +(`framework_name → regex`), not regex alternation. + +## 7. SemConv namespace decision + +**Status: contested. The current M15 rubric uses `gen_ai.training.rank` +deliberately as part of NORTHSTARS objective O4 (shepherd the +`gen_ai.training.*` namespace into upstream SemConv before the +ecosystem standardizes elsewhere). The naming.md "recommendation" cuts +against it. This is a strategic bet, not an oversight. R-1 below is +revised accordingly.** + +### 7.1 Evidence that `gen_ai.training.*` is a deliberate project goal + +[`NORTHSTARS.md`](../../NORTHSTARS.md) O4 row (line 38) names "Standards" +as a top-level objective with `gen_ai.training.*` external +implementations as the hero KPI. Line 202 states the goal verbatim: +"author and shepherd the OpenTelemetry `gen_ai.training.*` semantic +conventions through to wide adoption, before the ecosystem standardizes +on someone else's vocabulary." The O4 commitments include: + +- First-draft PR filed on `open-telemetry/semantic-conventions` by M1 + (M1 has shipped; PR status not verified in this research pass). +- First merged `gen_ai.training.*` upstream PR by M6 (M6 in progress). +- Tracecore receivers emit semconv attribute names: 100% per release. + +[`MILESTONES.md`](../../MILESTONES.md) line 29 reinforces this: +"M7 is absent by design — OTel `gen_ai.training.*` semconv work lives +in `open-telemetry/semantic-conventions`, not this repo (recurring +cadence in NORTHSTARS.md O4)." 
+ +So `gen_ai.training.rank` in M15's rubric is consistent with a stated +project objective, not a naming mistake. The earlier framing in this +section (round 1) missed this entirely. + +### 7.2 Evidence that the naming.md rule cuts the other way + +The +[`docs/general/naming.md` rule](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/naming.md) +states verbatim (re-verified 2026-05-19): "It is not recommended to use +existing OpenTelemetry semantic convention namespace as a prefix for a +new company- or application-specific attribute name. Doing so may +result in a name clash in the future." + +This is a **recommendation**, not a hard prohibition. The risk it warns +against is a name clash if upstream later defines the same attribute +with different semantics. Tracecore's bet under O4 is that **tracecore +is the upstream definer**, so the clash risk reduces to "the upstream +PR fails to land or lands with different semantics." + +The `gen_ai.*` registry today is inference-only +(`request.*`/`response.*`/`usage.*`/`agent.*`/`tool.*`/`operation.*`). +Issue +[semantic-conventions-genai #88](https://github.com/open-telemetry/semantic-conventions-genai/issues/88) +proposes a sibling `rl.*` namespace for training, which is a competing +proposal to O4's plan. Either could land first. + +### 7.3 Verdict and revised recommendation + +The decision is not "which namespace is technically correct" but "how +much risk does tracecore want to carry that O4 doesn't land." Two +defensible postures: + +| Posture | Emit | Risk | +|---|---|---| +| **Hold the bet** (status quo) | `gen_ai.training.*` | If `rl.*` lands first or `gen_ai.training.*` PR is rejected, we own a clash. Mitigation: collector-side rename via attributes processor at that point. | +| **Hedge** | `tracecore.training.*` primary, `gen_ai.training.*` aliased | Doubles attribute cardinality. Reads heavier in dashboards. Operationally pre-pays the clash risk. 
| +| **Concede** | `tracecore.training.*` only | Cleanest under naming.md. Abandons O4's namespace bet. Out of scope for a research doc to recommend. | + +This research pass cannot decide the bet; it requires sign-off from the +O4 owner (per NORTHSTARS.md line 204, "OTel/semconv lead"). The +research finding here is to **stop treating this as a research-resolved +question**. The rubric edit framing in §11 R-1 (round-1's "rename to +`tracecore.training.*`") is **withdrawn** in favor of recommending the +owner make the call. + +If O4 status is "stalled or rejected" at design-doc time, the fallback +to `tracecore.training.*` is well-evidenced. If O4 status is "draft PR +open and tracking", hold the bet. + +This is a [`MILESTONES.md`](../../MILESTONES.md) rubric question with +8 cross-receiver call sites (lines 29, 358, 360, 433, 453, 459, 460, +481 — see §11 R-1). Any change must be a single cross-cutting PR, not +an M15-scope edit. + +### 7.4 O4 status check: no upstream PR exists as of 2026-05-19 + +Round-2 noted the M1-target first-draft PR's status was unverified. +This pass verified directly via `gh search` against both upstream +repos: + +- `open-telemetry/semantic-conventions`: zero PRs containing + `tracecore` (open or closed); zero PRs containing + `gen_ai.training` (open or closed). +- `open-telemetry/semantic-conventions-genai` (the post-split repo + hosting `gen_ai.*`): zero PRs containing `tracecore`; one open PR + containing `training` (PR #172 by `renovate[bot]`, dependency update, + unrelated). + +**Verdict: the NORTHSTARS O4 "First-draft PR filed on +`open-telemetry/semantic-conventions` by M1" commitment is not +fulfilled.** M1 has shipped per MILESTONES.md. Either the PR was filed +under different keywords this search missed (verify by re-running with +the O4 owner's GitHub handle and the actual PR title if known), the PR +exists in a different repo, or the commitment is overdue. 
+ +**Implication for §7.3 verdict.** The "hold the bet" posture in §7.3 +assumed an active upstream effort. Without a filed PR, the bet is +unfunded. The defensible postures now are: + +- **Hedge** (`tracecore.training.*` primary, `gen_ai.training.*` + aliased on emit): pre-pays the clash risk, doubles attribute + cardinality, but lets the receiver ship without depending on an + upstream effort that has not started. +- **Concede** (`tracecore.training.*` only): cleanest under naming.md; + acknowledges O4 as dormant. +- **Re-activate O4** (escalate to the O4 owner): file the draft PR + before M15 ships so the upstream effort is at least visible. If this + happens within the M15 design window, the original "hold the bet" + recommendation becomes defensible again. + +This doc cannot resolve which posture to adopt; the call must go to the project +lead and the O4 owner. The negative finding above is the load-bearing +input: it changes the strategic-bet framing from "we're in flight" +to "we're not yet." 
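
To show what the "hedge" posture would cost in practice, here is a minimal Go sketch of alias-on-emit. It is an illustration only: `withAliases` and the plain `map[string]any` attribute bag are hypothetical stand-ins, not tracecore or OTel collector APIs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

const (
	primaryPrefix = "tracecore.training." // primary namespace under "hedge"
	aliasPrefix   = "gen_ai.training."    // alias emitted alongside it
)

// withAliases returns a copy of attrs in which every key under
// primaryPrefix is also present under aliasPrefix with the same value.
func withAliases(attrs map[string]any) map[string]any {
	out := make(map[string]any, len(attrs)*2)
	for k, v := range attrs {
		out[k] = v
		if strings.HasPrefix(k, primaryPrefix) {
			out[aliasPrefix+strings.TrimPrefix(k, primaryPrefix)] = v
		}
	}
	return out
}

func main() {
	attrs := map[string]any{
		"k8s.pod.name":                  "trainer-0",
		"tracecore.training.rank":       3,
		"tracecore.training.world_size": 8,
	}
	out := withAliases(attrs)

	// Print in sorted order so the output is deterministic.
	keys := make([]string, 0, len(out))
	for k := range out {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s=%v\n", k, out[k])
	}
}
```

Every training attribute appears twice per record, which is the doubled key count the hedge bullet refers to; upstream `k8s.*` attributes are untouched.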
+ +### 7.5 Attribute table + +The attributes M15 would emit under the current rubric, scattered +across §6, §11, §13.11, §14, §16 and consolidated here for the +O4-owner stakeholder lens: + +| Attribute | Type | Cardinality | Source | Stability | Rationale | +|---|---|---|---|---|---| +| `k8s.namespace.name` | string | low (≤cluster namespaces) | path parse / `container` operator | stable (upstream SemConv) | Standard pod attribution | +| `k8s.pod.name` | string | medium (pods/cluster) | path parse | stable (upstream SemConv) | Standard | +| `k8s.pod.uid` | string | medium (pods/cluster lifetime) | path parse | stable (upstream SemConv) | Globally unique join key | +| `k8s.container.name` | string | low (containers/pod) | path parse | stable (upstream SemConv) | Per-container disambiguation | +| `k8s.container.restart_count` | int | low | path parse | stable (upstream SemConv) | Restart correlation | +| `k8s.node.name` | string | low (nodes/cluster) | env via downward API | stable (upstream SemConv) | Per-node sharding | +| `log.iostream` | string | 2 (stdout/stderr) | CRI parse | stable (upstream SemConv) | Stream attribution | +| `log.file.path` | string | medium | tailer | stable (upstream SemConv) | Path traceability | +| `tracecore.training.rank` | int | medium (≤world_size) | pod env via informer | tracecore-coined; contested namespace per §7 | Per-process join key | +| `tracecore.training.world_size` | int | low (≤jobs × replicas) | pod env via informer | tracecore-coined | Cluster-size context | +| `tracecore.training.local_rank` | int | low | pod env / body regex | tracecore-coined | Per-node process index | +| `tracecore.training.job.id` | string | medium | pod env or label | tracecore-coined | Cross-restart correlation; UUID4 under torchrun `--standalone` per §13.13 | +| `tracecore.training.data_time_s` | double, nullable | per-record | body regex (§6) | tracecore-coined; emitted only on dataloader-line match | M18 straggler input | +| 
`tracecore.training.iter_time_s` | double, nullable | per-record | body regex (§6) | tracecore-coined; emitted only on dataloader-line match | M18 straggler input | +| `tracecore.container.lines_per_s` | double | per-(pod_uid, container) per 15s | derived metric | tracecore-coined | Straggler-pattern feed; rate-derived | +| `tracecore.dropped_lines` | int | per-record (sampled) | rate-limiter | tracecore-coined | Rate-limit observability | +| `k8s.event.hint` (joined) | string | low | from k8sevents M10 record | tracecore-coined | Cross-receiver pattern input | + +**Cardinality budget:** the `tracecore.training.*` set is bounded by +`(jobs × world_size)` cluster-wide, which for typical training +workloads (≤32 jobs × ≤8192 ranks) keeps the attribute-value +cardinality under 256K — within Prometheus's default scrape sanity +limits. Document the budget in the receiver README. + +**Stability classes:** + +- **stable (upstream SemConv):** the attribute name and semantics are + fixed by upstream OpenTelemetry SemConv; tracecore must align. +- **tracecore-coined:** the name was introduced by tracecore. Two + sub-classes under §7: + - `tracecore.training.*` attributes: the name is committed but the + namespace choice is contested per §7.4. If the bet ends as + "hedge" or "concede", the prefix changes; the semantics do not. + - `tracecore.container.*` and `tracecore.dropped_lines`: pure + receiver-specific telemetry, no upstream contention. + +### 7.6 Draft semantic-conventions YAML sketch (for the O4 owner) + +If the O4 effort is re-activated, the upstream PR shape would be a +new file under `model/gen-ai/training/registry.yaml` in +`open-telemetry/semantic-conventions-genai`. Sketch: + +```yaml +groups: + - id: registry.gen_ai.training + type: attribute_group + display_name: Generative AI Training Attributes + brief: > + Describes attributes for distributed-training observability. 
+ Applies to receivers and processors that emit per-process, + per-job training-loop telemetry from Kubernetes-orchestrated + jobs. + attributes: + - id: gen_ai.training.rank + type: int + stability: experimental + brief: > + Global rank of the worker within the training job + (0 ≤ rank < gen_ai.training.world_size). Sourced from + torchrun `RANK`, Ray Train's mirrored env, or Kubeflow + PyTorchJob worker offset (+1 for workers in v1). + examples: [0, 1, 7, 4095] + - id: gen_ai.training.world_size + type: int + stability: experimental + brief: Total number of workers in the training job. + examples: [8, 64, 4096] + - id: gen_ai.training.local_rank + type: int + stability: experimental + brief: > + Rank of the worker within its node. 0 ≤ local_rank + < local_world_size. Sourced from torchrun `LOCAL_RANK`. + examples: [0, 1, 7] + - id: gen_ai.training.job.id + type: string + stability: experimental + brief: > + Stable identifier for the training job across worker + restarts. Sourced from orchestrator (Kubeflow + `training.kubeflow.org/job-name`, JobSet + `jobset.sigs.k8s.io/jobset-uid`, Slurm `SLURM_JOB_ID`). + NOT `TORCHELASTIC_RUN_ID` in `--standalone` mode (random + UUID per launch; see implementation notes). + examples: ["pytorch-train-12345", "jobset-abc-uid"] + - id: gen_ai.training.data_time_s + type: double + stability: experimental + brief: > + Seconds the dataloader spent producing the batch consumed + by this training step. Emitted per-step when the training + framework logs a recognizable pattern (torchvision + MetricLogger, detectron2 CommonMetricPrinter). + - id: gen_ai.training.iter_time_s + type: double + stability: experimental + brief: > + Seconds for the full training iteration (forward + backward + + optimizer step). Pairs with data_time_s for straggler + detection (M18 in tracecore). +``` + +This is **a sketch, not a proposal text.** The actual PR would need: +- Reference implementation link (tracecore as the first emitter). 
+- Cross-references to existing `gen_ai.*` attribute groups. +- A `stability: experimental` lifecycle commitment. +- Engagement with the `rl.*` proposal (issue #88) for scope overlap. + +Filing this as a draft PR closes the §7.4 negative finding and +re-enables the §7.3 "hold the bet" posture. + +## 8. Internal repo prior art + +[`components/receivers/k8sevents/`](../../components/receivers/k8sevents) +is the closest template. Patterns below verified against source on +2026-05-19 (not just relayed from sub-agent report). + +- **Lifecycle.** Embed `pipeline.ComponentState` and own a + `*lifecycle.Lifecycle` per independent source + (`components/receivers/k8sevents/receiver.go:61, 73`). +- **Optional sub-sources.** k8sevents ships a `NodeConditions.Enabled` + toggle that registers an additional Node SharedInformer **with its + own typed record, channel, and degraded flag** + (`receiver.go:75, 99-101`; `config.go:90-99`). Pattern relevant for + M15's OD-1b Path B (informer-based env-var projection): ship the + pod-spec informer as an opt-in sub-source, not as a hard dependency, + so the receiver still works for operators using OD-1b Path A + (downward API). +- **Config.** Public struct, YAML-stable field names, `Validate()` at + load time (not Start time). Ambiguity errors are exit-2 with named + field. `config.go:18-80, 269-288`. Cap internal channel at `2^20` + with an error message redirecting to a persistent queue + (`config.go:197-206`). +- **Typed record export.** `record.go:20-69` defines `Record` and + `ObjectRef` structs; downstream consumers (M19 pattern detector) + import these for compile-time joins. Same approach for M15: export + a `containerstdout.Record` with the canonical attribute fields + including `tracecore.training.rank` once §11 lands. 
+- **Self-telemetry.** `selftelemetry.Receiver` interface has five + methods (verified at source 2026-05-19, + `internal/selftelemetry/interface.go:193-237`): + `IncError(Kind)` (line 200), `IncEmissions(n int64)` (line 206), + `ObserveLatency(d time.Duration)` (line 218), + `SetDegraded(degraded bool)` (line 231), + `MarkActivity()` (line 237). Package-level + `selftelemetry.RecordInitError(ctx, mp, kind, id, reason)` handles + the noop-fallback case when MeterProvider is unavailable (used in + k8sevents' factory.go:57-58). Kinds are receiver-local typed + constants (e.g. `KindBackpressureDrop`, `KindCardinality`), never + raw error strings; the package documents this cardinality contract + but does not guard it at runtime. +- **Degraded transitions.** Track `lastWatchSeen` atomically; + `SetDegraded(false)` fires exactly once per error→recovery edge + (`receiver.go:276-305`). Don't call it on every successful emit. +- **RBAC pinning.** Hand-authored `rbac.yaml` per receiver; golden + file `rbac.can-i.golden` validated by Go test (`rbac_test.go`). + M15 needs a new ClusterRole granting pod-list/watch (and namespace + list for filtering) — see §10 OD-1. +- **Backpressure vs downstream errors.** `KindBackpressureDrop` and + `KindDownstream` are distinct kinds with different remediations. + M15 should partition: rate-limit drops vs file-read errors vs pod + informer flap. + +**Gaps with no in-tree helper:** + +- No file-tailer code anywhere (kernelevents reads journald + kmsg, not + files). M15 brings the first file-watch path; depending on filelog + inherits `pkg/stanza/fileconsumer` and removes the question. +- No generic per-key token-bucket rate limiter. + [`internal/telemetry/windowed_rate.go`](../../internal/telemetry) + exists but is for self-telemetry, not data-plane rate limiting. + Bring in `golang.org/x/time/rate` with a per-key map and eviction. +- No cursor/checkpoint helper. K8sevents is stateless. 
M15 needs to + persist byte offsets; if we depend on filelog, the `storage` + extension contract provides this, so the remaining choice is a backend (likely the + `file_storage` extension at a configurable path). +- No NODE_NAME downward-API wiring in the chart. + [`install/kubernetes/tracecore/`](../../install/kubernetes/tracecore) + does not expose `NODE_NAME`; the nccl_fr example documents + `${POD_NAME}` / `${POD_UID}` as operator-provided. Per chart + convention, the tracecore config loader does not expand env vars, so + the DaemonSet template must inject them via `fieldRef:` and the + receiver reads `os.Getenv("NODE_NAME")`. + +### 8.1 Typed-record contract for downstream consumers + +M18 (straggler pattern) and M19 (pod-evicted pattern) must consume M15's +records via a compile-time-stable Go schema. The k8sevents +`Record` / `ObjectRef` / `SchemaURLv0` pattern (verified at +`components/receivers/k8sevents/record.go:20-107`) is the template. + +Sketch (Go pseudocode for the design phase; field names contingent on the +§7 namespace decision: `tracecore.training.*` if conceded, otherwise +`gen_ai.training.*`): + +```go +package containerstdout + +import "time" + +// Record is the typed projection of a single CRI log line plus +// receiver-derived attribution. Exported for M18/M19 to import +// without grepping plog.LogRecord attributes. +type Record struct { + // CRI parse output (always populated). + Timestamp time.Time // RFC3339Nano from CRI line + Stream StreamType // StreamStdout / StreamStderr + LogTag LogTag // LogTagFinal / LogTagPartial; recombined upstream + Body string // post-recombine; may be JSON-parsed if §6 auto-detect fires + + // Kubernetes attribution (always populated; sourced from filepath). + Pod PodRef // namespace, name, uid, container, restartCount + NodeName string // from $NODE_NAME via downward API + + // Training attribution (populated when discoverable; undiscovered + // fields carry the sentinels documented on TrainingRef, so + // "not discovered" stays distinguishable from rank 0). + Training TrainingRef // see TrainingRef below + + // Receiver-derived metrics (per-record samples; aggregated by the + // derived-metric path). + DroppedLines int64 // 0 unless this record is a rate-limiter sample + + // The schema URL for version-gated joins is pinned by the SchemaURLv0 + // constant below (mirrors k8sevents pattern); bumping is a deprecation hook. +} + +type PodRef struct { + Namespace string + Name string + UID string + Container string + RestartCount int +} + +// TrainingRef holds the per-process training attribution. Undiscovered +// fields carry sentinels: -1 for Rank and LocalRank, 0 for WorldSize, +// "" for JobID. Rank 0 is a valid value, so consumers must test +// WorldSize > 0 (not Rank != 0) to decide whether Rank is meaningful. +type TrainingRef struct { + Rank int // -1 if undiscovered, 0..N-1 otherwise + WorldSize int // 0 if undiscovered; >0 means Rank is meaningful + LocalRank int // -1 if undiscovered + JobID string // empty if undiscovered + + // Body-regex outputs (per-record; nullable via pointer). + DataTimeS *float64 // nil when no dataloader match on this record + IterTimeS *float64 // nil when no dataloader match on this record +} + +type StreamType uint8 +const ( + StreamUnknown StreamType = iota + StreamStdout + StreamStderr +) + +type LogTag uint8 +const ( + LogTagFinal LogTag = iota + LogTagPartial +) + +// SchemaURLv0 freezes the v0 join schema for M18/M19 consumers. +// Pattern follows components/receivers/k8sevents/record.go:100. +const SchemaURLv0 = "https://tracecore.ai/schemas/containerstdout/v0" +const SchemaURL = SchemaURLv0 +``` + +**Decisions baked into the sketch:** + +- **Rank uses `-1` as the "undiscovered" sentinel, not `*int`.** Rank + 0 is a valid global rank, so the zero value cannot mean "undiscovered", and a nullable pointer would force every + consumer to nil-check before dereferencing. Consumers test validity with `WorldSize > 0`, + not by comparing Rank itself. +- **`DataTimeS` and `IterTimeS` ARE nullable pointers.** Most records + do not match the dataloader regex; emitting zero would be a false + signal.
M18's input contract should accept "match present" / "match + absent" cleanly. +- **`Body` post-recombine, not pre-recombine.** Under any BA-* the + CRI partial-line reassembly happens before this Record is emitted. +- **`SchemaURLv0` frozen pattern.** Mirrors k8sevents + `record.go:90-107`'s explicit "do not redefine in terms of SchemaURL" + comment; future v1 bumps require a separate constant. + +**M18 join-key contract (load-bearing for §11 R-2):** + +M18's straggler detector requires the following attribute set co-present +on a single record for the join to fire: + +- `Training.Rank` (valid: `WorldSize > 0` && `Rank >= 0`) +- `Training.WorldSize` (valid: `> 0`) +- One of (`Training.DataTimeS`, `Training.IterTimeS`) non-nil +- `Pod.UID` (always present) + +Records missing any of these are not M18 input; the receiver emits them +anyway for log-fidelity reasons (per PRINCIPLES §1, never drop data +just because one consumer wouldn't use it). + +**M19 join-key contract:** + +M19's pod-evicted detector consumes M15's records as one of two +inputs (the other is k8sevents' Record with `Hint == HintEvicted`). +M19's join is: + +- `Pod.UID` (matches against k8sevents `Regarding.UID`) +- `Timestamp` (within evictionMatchWindow of k8sevents `EventTime`) + +The receiver SLA for M19 (§17.4) is: **on pod-eviction event, the +receiver MUST emit any pending records from that pod within `2 × +poll_interval` (default `400 ms`) of receiving the informer's +deletion event.** This is enforced by the +`TestContainerStdout_PodEvictionTailFlush` fixture. + +### components.go is generated + +[`cmd/tracecore/components.go`](../../cmd/tracecore/components.go) is +code-generated from +[`cmd/tracecore/components.yaml`](../../cmd/tracecore/components.yaml) +by [`tools/components-gen/main.go`](../../tools/components-gen/main.go). +Sorted alphabetically. The receiver registration is a one-line edit to +`components.yaml`, then `make generate`. Never hand-edit +`components.go`. + +## 9. 
Chart and RBAC additions + +### 9.0 Operator-facing values.yaml schema + +The operator-visible configuration surface, derived from the rubric + +the open decisions in §10. Field names are alpha-stable contracts; +renames go through the §18.1 deprecation flow. + +```yaml +receivers: + containerstdout: + enabled: false # Alpha; opt-in (§18.1) + + # File discovery + include: ["/var/log/pods/*/*/*.log"] # Default; rarely changed + namespaces: [] # Allowlist (empty = all); §16.1 + + # CRI parser knobs (rubric line 358-359) + max_log_size: 1MiB # Per partial-line cap + max_attributes: 16 # JSON parse output cap + + # Pod attribution (§13.4, OD-1b) + rank_source: downward_api # downward_api | informer + process_rank_regex: "\\bRANK[=:]\\s*(\\d+)\\b" # OD-2; user-override expected + + # Dataloader timing (§6) + dataloader_regex: | + \btime:\s+(?P\d+(?:\.\d+)?)\b.*?\b(?:data_time|data):\s+(?P\d+(?:\.\d+)?)\b + + # Rate limiting (OD-3; §10.1 OD-3 expanded below) + egress_rate_limit: + rate: 200 # lines/s per (pod_uid, container) + burst: 1000 # token bucket depth + lru_cap: 8192 # max tracked (pod_uid, container) keys + lru_evict_after: 5m # idle TTL before eviction + namespace_budgets: {} # optional per-namespace sub-budget + + # Cursor persistence (§13.5, R-6) + cursor: + dir: /var/lib/tracecore/container_stdout + fsync: true # BA-1 forwards to file_storage; BA-2/3 native + + # Tailer tuning (§4) + poll_interval: 200ms + fingerprint_size: 1KiB + max_concurrent_files: 1024 + + # Compression (R-7; BA-1 only) + compression: auto # gzip | auto | "" +``` + +### 9.1 NODE_NAME downward-API injection + +M15 is the first tracecore receiver to require `NODE_NAME` via +downward API. 
The DaemonSet template addition (sketch; the chart +maintainer is the owner of the exact YAML): + +```yaml +spec: + containers: + - name: tracecore + env: + - name: NODE_NAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + - name: POD_NAMESPACE + valueFrom: + fieldRef: + fieldPath: metadata.namespace + - name: POD_UID + valueFrom: + fieldRef: + fieldPath: metadata.uid +``` + +`POD_NAMESPACE` and `POD_UID` are pre-existing tracecore downward-API +patterns (per nccl_fr's example config); `NODE_NAME` is the new +addition for M15. + +### 9.2 Post-deployment smoke test + +After enabling M15 on a node, operators should verify M15 sees the +same records `kubectl logs` does. Recommended one-liner for +`docs/integrations/containerstdout.md` runbook: + +```bash +# Pick a pod with stdout activity +POD=$(kubectl get pod -o name | head -1) + +# Capture 10 lines via kubectl +kubectl logs --tail=10 "$POD" > /tmp/kubectl.log + +# Wait for M15 to flush (≈poll_interval + force_flush_period) +sleep 1 + +# Compare against M15-emitted records (via Prometheus or downstream sink) +# Expect: 10 records in M15's output with matching body+timestamp. +``` + +Failures here indicate misconfigured `include` glob, RBAC denying pod +informer, or NODE_NAME not injected. Cross-reference RUNBOOK § +"Smoke test failure" (to be written in M15 design phase). + +### 9.3 Helm-chart asks + + + +Helm-chart changes M15 will require: + +1. `receivers.containerstdout.enabled` toggle in `values.yaml`. +2. New host-path mount on the DaemonSet template: + ```yaml + - name: var-log + hostPath: + path: /var/log + type: Directory + ``` + with `readOnly: true` `volumeMount`. +3. New volume for cursor persistence: + ```yaml + - name: tracecore-state + hostPath: + path: /var/lib/tracecore/container_stdout + type: DirectoryOrCreate + ``` +4. Downward-API env injection for `NODE_NAME` (new pattern; nothing in + the chart uses it today). +5. 
New `rbac.yaml` per receiver (pods list+watch in target namespaces; + namespaces list for filtering). Pinned by `rbac.can-i.golden`. +6. `conftest`/`kyverno` policy delta: M5b rejects extra capabilities + and writable mounts. The two new hostPath mounts above must be + read-only or `DirectoryOrCreate`, never writable across pod + boundaries. + +## 10. Open decisions for design phase + +Round-2 research closed several of these. Resolved ones are kept with their +verdict for the design-doc trail; surviving open items have `OPEN` in the +status column. + +| ID | Status | Decision | Verdict / notes | +|---|---|---|---| +| OD-1 | RESOLVED | Pod-attribution mechanism. | **Hybrid.** Pod metadata (namespace, pod, container, node, owner refs, training labels) via `k8sattributesprocessor` with `pod_association` keyed on `k8s.pod.uid` (the `container` operator already emits this; default `from: connection` does not work for local file tails). RANK / WORLD_SIZE / JOB_ID via a tracecore-owned env-projection mechanism — see OD-1b. | +| OD-1b | OPEN | Env-var projection path: tracecore processor reading `Pod.spec.containers[].env` via an informer, **or** ask training launchers to set `OTEL_RESOURCE_ATTRIBUTES=tracecore.training.rank=$(RANK),...` via downward API and ride `resourcedetectionprocessor` env detector. | Zero upstream prior art either way. The downward-API path is one shell line per launch script with no extra RBAC; the informer path is invisible to operators but adds pod-list/watch RBAC and a cardinality budget. Lean toward the downward-API path as the default and the informer as an opt-in for unmodified workloads. | +| OD-2 | OPEN | Process-level `RANK` recovery when `nproc_per_node > 1`. | Pod env reports the launcher's view (master rank only); per-process rank requires regex on body content. 
Ship a configurable `process_rank_regex` that defaults to a tolerant pattern (`\bRANK[=:]\s*(\d+)\b`); document that pod-level rank is the join key when `nproc_per_node = 1` and operators must instrument their script otherwise. | +| OD-3 | RESOLVED (semantics specified below) | Rate-limit processor vs inline. | **Processor.** Per-key token-bucket keyed on `(k8s.pod.uid, k8s.container.name)`. Reusable beyond filelog. Receiver-inline would tighten coupling for no architectural gain. Concrete semantics: `golang.org/x/time/rate.Limiter` per key; default `rate=200` lines/s, `burst=1000`; per-key state stored in an LRU map with `lru_cap=8192` keys and `lru_evict_after=5m` idle TTL; over-budget records sampled (1 in N) emitted with `tracecore.dropped_lines=N` attribute and `IncError(KindBackpressureDrop)` counter; namespace-scoped sub-budgets supported via a parent-bucket-and-child-buckets structure (per-namespace budget consumed before per-(pod,container) budget). Cardinality cap on the LRU prevents runaway memory under pod churn. | +| OD-4 | RESOLVED | Dataloader-regex tagging: OTTL `transform` processor or dedicated processor. | **OTTL first.** Config-only, no new Go code, and the regex set is small (~2 default patterns). Re-evaluate if the regex set grows beyond a half-dozen patterns or needs stateful book-keeping. | +| OD-5 | RESOLVED (with rubric edit) | Cursor persistence backend. | **Use OTel `file_storage` extension (bbolt) as authoritative.** Production landscape is split (SQLite in Fluent Bit, JSON in Vector/Datadog, YAML in Promtail, bbolt in OTel). The rubric's JSON-at-`cursor.json` predates the depend-on-filelog finding; reconciling means rubric edit R-6. Optionally emit a low-frequency JSON snapshot for human inspection — read-only, not authoritative. | +| OD-6 | RESOLVED | Gzip-compressed rotated files. | **Decompress and continue** via filelog `compression: auto`. Supported since contrib v0.144+; gzip-corruption bugs (#46105, #45572) are fixed. 
Pin contrib >= v0.144.0 and add an integration test that gzips a rotated file mid-tail and asserts zero record loss. | +| OD-7 | RESOLVED | Filelog dependency surface. | **Factory-only.** Consume `filelogreceiver.NewFactory()` and configure via YAML; do not embed `pkg/stanza` operators directly in Go. Round-2 found ~3 BREAKING changes in 18 months, all in adjacent code (OTTL, Windows event logs); `pkg/stanza/fileconsumer` surface has been stable but the deprecation policy is not documented. Factory-only keeps the blast radius bounded. | +| OD-8 | OPEN | Bench harness coupling with M5. | M5 (install + overhead benchmark harness) is still planned; M15's overhead-budget rubrics (≤0.10% CPU, ≤20 MB RSS, ≤0.3 Mbps) must land through M5's framework, not a parallel one. Sequence: M5 lands the harness, M15 lands a benchmark fixture under it. | +| OD-9 | OPEN | Filelog feature-gate posture. | Two relevant gates: `filelog.protobufCheckpointEncoding` (new bbolt key encoding, stable timeline TBD) and `filelog.decompressFingerprint` (stable at v0.142). Decide whether tracecore opts in by default or follows upstream defaults; affects upgrade churn. | +| OD-10 | OPEN | Custom builder manifest for tree-shaking. | If the binary-size delta from `filelogreceiver` + `pkg/stanza` (~15–30 MB estimated, unmeasured) is unacceptable, the only path is a custom builder manifest excluding unused operators. Measure before committing engineering effort. | + +## 11. Rubric edits to propose against MILESTONES.md + +**Status note.** Round-1 of this research recommended opening these as a +small PR before any M15 implementation. The §15 correction and the +NORTHSTARS O4 finding in §7 changed the picture substantially: +- R-1 and R-2 are namespace edits, which §7 has now reframed as a + cross-cutting strategic question, not an M15-local rubric mistake. + They are **withdrawn as PR proposals**; the call is the O4 owner's. +- R-3 and R-7 are gated on tests that do not yet exist. 
They should + not be opened as MILESTONES.md edits until the corresponding fixtures + are written and pass. Marked as **deferred-pending-fixture**. +- R-6 is conditional on the build-approach choice (§15.1 BA-1 vs the + others). Only valid under BA-1 (adapter to upstream filelog). +- R-5 and R-8 are independent of build approach and remain proposable. +- R-4 is cosmetic and low-priority. + +| Edit | Status | Current rubric | Proposed | Rationale | +|---|---|---|---|---| +| R-1 | **WITHDRAWN; recast as cross-cutting strategic question.** Originally proposed to rename `gen_ai.training.rank` → `tracecore.training.rank` in M15's rubric. §7 now reframes this as a NORTHSTARS O4 bet (own the namespace upstream). MILESTONES.md uses `gen_ai.training.*` at **lines 29, 358, 360, 433, 453, 459, 460, 481** — eight sites across M7-absence-note, M15, M13, M14, M18. Any rename must be a cross-receiver PR with the O4 owner's sign-off, not an M15-local edit. | "per-rank attribution: derives `gen_ai.training.rank` (canonical join key across receivers) from Pod env vars ... falls back to Pod labels `tracecore.io/rank`, `tracecore.io/job-id`; missing → record emitted with `rank=unknown`." (line 358 verbatim) | No change at M15 level; surface decision to O4 owner. | §7. The namespace is a strategic bet, not an oversight. | +| R-2 | **REVISED.** Original proposal misquoted the rubric (dropped the `keyed by 'gen_ai.training.rank' for M18's straggler detector to join on` clause). Corrected quote below. Status now ties to R-1: if R-1 is concluded as "hedge" or "concede", R-2 mirrors that choice on the M18-join-key clause; if R-1 is "hold the bet", R-2 is unchanged. | "When a log line matches, receiver emits `tracecore.training.data_time_s` and `tracecore.training.iter_time_s` keyed by `gen_ai.training.rank` for M18's straggler detector to join on; schema lives in fixture." (line 360 verbatim) | If R-1 = "hold the bet": no change. 
If R-1 = "hedge"/"concede": rename `gen_ai.training.rank` in the keyed-by clause to match R-1's chosen target. | §7 + §6 dataloader survey. | +| R-3 | **DEFERRED-PENDING-FIXTURE.** Round-2's "fingerprint detection satisfies the intent" is research, not a passing test. Should not open as a rubric edit until an integration test demonstrates zero-record-loss across both inode-rename and copy-truncate rotation on a kind cluster. | "Rotation: kubelet rotates by renaming `0.log` → `0.log.` and creating new `0.log`; receiver follows inode, not path; integration test asserts zero record loss." | (Pending) "Rotation: receiver preserves zero-record-loss across both inode-rename and copy-truncate rotation, validated by integration test at `TestContainerStdout_RotationInodeRename` and `TestContainerStdout_RotationCopyTruncate`." | §4. Edit gates on the named tests existing. | +| R-4 | **PROPOSABLE.** Cosmetic; documents the prior art. Low priority. | "`max_log_size` (default 1 MiB)" | Add citation: "`max_log_size` (default 1 MiB; matches OTel `container` stanza operator default)". | Cosmetic. | +| R-5 | **PROPOSABLE, but revised mechanism.** Round-1 framed #11149 as disk-I/O backpressure; the actual mechanism (per the 2025-01-22 reproducer) is shared-pipe contention when something inside the container reads FD 1. Narrower failure surface than originally implied. | (No current row; new rubric.) | New: "Reliability caveat: containerd #11149 (open upstream) causes log-loss in `/var/log/pods/.../0.log` when an in-container process reads from its own FD 1 (e.g. application self-tee, sidecar reading `/proc/1/fd/1`). The mechanism is shared-pipe contention, not generic backpressure. Standard workloads that do not read FD 1 are unaffected. README must enumerate this failure mode." | §3 / §13.2. No CRI-RPC mitigation exists. | +| R-6 | **CONDITIONAL on §15.1 BA-1.** Only valid if we adopt the OTel-adapter build approach. 
Under BA-2 (port `fileconsumer`) or BA-3 (reimplement), the rubric's original JSON-at-`cursor.json` framing remains correct because there is no `file_storage` extension to delegate to. | "Checkpoint persistence: cursor stored under `/var/lib/tracecore/container_stdout/cursor.json` (atomic rename); on restart resumes within 1 record of last-acknowledged position." (line 364) | **Under BA-1 only:** "Checkpoint persistence: cursor stored via OTel `file_storage` extension (bbolt) at `/var/lib/tracecore/container_stdout/`; on restart resumes within 1 record of last-acknowledged position. Optional read-only JSON snapshot for human inspection." Otherwise: no change. | §13.5 + OD-5. | +| R-7 | **DEFERRED-PENDING-FIXTURE.** Also build-approach-conditional. The integration test must exist before the rubric edit can be proposed. | (No current row; would be new.) | (Pending fixture, BA-1 only) "Compressed rotated files (`0.log..gz`) are read transparently via filelog `compression: auto`; an integration test at `TestContainerStdout_RotationCompressed` gzips a rotated file mid-tail and asserts zero record loss." | §13.6. | +| R-8 | **PROPOSABLE.** Build-approach-independent; the receiver must surface kubelet rotation failure regardless of whether it consumes filelog or rolls its own. | (No current row; new rubric.) | New: "Degraded mode for kubelet rotation failure: receiver tracks observed `0.log` size and surfaces `IncError(KindRotationStalled)` plus `SetDegraded(true)` when size > `containerLogMaxSize` for ≥ 30 s (3× kubelet default `containerLogMonitorInterval`); receiver stays alive; FAILURE-MODES.md row references `TestContainerStdout_RotationStalled`." | §13.1. Kubelet emits klog-only on rotation failure; no K8s Event or metric. Tracecore is the observability layer that surfaces this. | + +**Net proposable edits at this time: R-4, R-5, R-8.** Three small, +build-approach-independent edits that can land as a precursor PR with +no architecture commitment. 
R-1/R-2/R-3/R-6/R-7 are all gated on +decisions or fixtures that do not yet exist. + +### 11.1 Coverage of M15 rubric lines not addressed elsewhere + +Reviewer feedback (round 4) flagged that 5 M15 rubric lines had no +analysis in this doc. Analyzed here: + +**MILESTONES.md line 359 — Structured-log JSON auto-detection.** +Rubric: "first non-whitespace byte `{` triggers JSON parse; on success +emits parsed fields as attributes capped at `max_attributes` (default +16); on failure or non-`{`, passthrough as `body`." + +- **Implementation under each BA:** under BA-1, OTel's `json_parser` + stanza operator (`pkg/stanza/operator/parser/json`) does exactly + this; wire it after the `container` operator with a routing rule on + `body[0] == '{'`. Under BA-2/3, native code calling + `encoding/json.Decoder` on the body, with same routing. +- **Cardinality risk:** `max_attributes: 16` is the existing + k8sevents default. Reasonable for structured logs from training + scripts (most emit ≤8 fields). +- **Failure mode:** JSON parse failure on body that starts with `{` + but is not valid JSON. Receiver MUST fall through to body + passthrough, not drop the record. Test target: + `TestContainerStdout_JSONParseFailureFallthrough`. +- **Rubric stands as written.** No edit proposed. + +**MILESTONES.md line 363 — `tracecore.container.lines_per_s` derived +metric.** +Rubric: "emits per-rank line-rate (`tracecore.container.lines_per_s`) +as derived metric on 15s window." + +- **Aggregation key:** the rubric says "per-rank" but `Training.Rank` + is only populated when discoverable (see §8.1 typed-record schema). + Records with `WorldSize == 0` (undiscovered rank) need a bucket + key — `Pod.UID` is the natural fallback. Decision for design phase: + emit one metric per `(rank, pod_uid)` tuple, with `rank = "unknown"` + for the undiscovered case. +- **Window mechanics:** 15s sliding window or 15s tumbling? + Tumbling is simpler and matches Prometheus scrape cadence. 
Sliding + would smooth bursts but doubles state. Recommend tumbling for v0. +- **Cardinality:** `(rank, pod_uid)` cardinality on a node is + `pod_count × ranks_per_pod`. Default cardinality cap (§13.4 pattern) + should bound this; emit `IncError(KindCardinality)` on overflow. +- **Test target:** `TestContainerStdout_LinesPerSDerivedMetric`. +- **Rubric stands.** No edit proposed; designate the undiscovered-rank + bucket key in the design doc. + +**MILESTONES.md line 370 — Multi-tenancy namespace allowlist.** +Rubric: "`namespaces:` allowlist filters Pod discovery before file +watch opened; per-namespace egress sub-budget configurable." + +- **Filesystem-level filter is required.** §16.1 emphasized this: + filter at file-open time, not at emit time. Filelog's `include` + globs CAN filter by directory pattern, but the Pod namespace lives + in the directory name (`__`), not as a top-level + directory. Practical implementation: pre-glob filter that walks + `/var/log/pods/*` and rejects entries whose `_` prefix is not + in the allowlist BEFORE the watcher opens the file. Under BA-1, + the allowlist cannot be written as a single negated pattern such as `exclude: + ['/var/log/pods/^(?!ns1_|ns2_).*']`: filelog's `include`/`exclude` entries are globs, not + regexes, and Go's RE2 does not support negative lookahead in any case. The config must therefore + enumerate the excluded namespaces explicitly, one glob each, OR use filelog's filter-after-glob path. Confirm during + design. +- **Per-namespace sub-budget:** rate-limit processor (OD-3) keys on + `(k8s.pod.uid, k8s.container.name)`. Per-namespace budget is a + second-tier aggregator: token bucket per namespace, sub-budgets per + pod within. Implementation under any BA: a `namespace_budgets` map + in the rate-limit processor config. +- **Test targets:** `TestContainerStdout_NamespaceAllowlistFiltersAtOpen`, + `TestContainerStdout_PerNamespaceSubBudget`. +- **Rubric stands.** Worth a follow-up note in §10 OD-3 about the + two-tier bucket structure.
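
The pre-glob filter described under line 370 can be sketched in Go as follows. This is a minimal illustration, not the receiver's implementation: `namespaceOf` and `filterPodDirs` are hypothetical helper names, and the sketch assumes the kubelet directory layout `<namespace>_<pod-name>_<pod-uid>` under `/var/log/pods`. It relies on namespace names being DNS labels, which cannot contain `_`. An empty allowlist admits every directory, mirroring `namespaces: []` in §9.0.

```go
package main

import (
	"fmt"
	"strings"
)

// namespaceOf returns the namespace encoded in a pod log directory name.
// Assumption: names look like "<namespace>_<pod-name>_<pod-uid>". Namespace
// names are DNS labels (no "_"), so the first "_" ends the namespace.
func namespaceOf(dirName string) (string, bool) {
	ns, rest, found := strings.Cut(dirName, "_")
	if !found || ns == "" || rest == "" {
		return "", false
	}
	return ns, true
}

// filterPodDirs keeps the directory names whose namespace is allowlisted.
// An empty allowlist keeps everything; with a non-empty allowlist, names
// that do not parse are rejected so no watch is opened on them.
func filterPodDirs(dirNames, allow []string) []string {
	if len(allow) == 0 {
		return append([]string(nil), dirNames...)
	}
	allowed := make(map[string]struct{}, len(allow))
	for _, ns := range allow {
		allowed[ns] = struct{}{}
	}
	var kept []string
	for _, name := range dirNames {
		ns, ok := namespaceOf(name)
		if !ok {
			continue
		}
		if _, hit := allowed[ns]; hit {
			kept = append(kept, name)
		}
	}
	return kept
}

func main() {
	// Hypothetical directory names, as os.ReadDir("/var/log/pods") might list them.
	dirs := []string{
		"ml-train_worker-0_0f1e2d3c",
		"kube-system_coredns-abc_9a8b7c6d",
		"ml-train_worker-1_1a2b3c4d",
	}
	for _, d := range filterPodDirs(dirs, []string{"ml-train"}) {
		fmt.Println(d)
	}
}
```

In a receiver this check would run before any file watch is opened, so logs from non-allowlisted namespaces are never read rather than read and then discarded.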
+ +**MILESTONES.md line 371 — `goleak` test for back-pressure.** +Rubric: "1M-line burst from one rank MUST NOT block sibling streams; +bounded per-file goroutine + bounded channel (1024); `goleak` test." + +- **goleak setup:** tracecore already uses `go.uber.org/goleak` (per + go.mod), so the test integration is straightforward. The `goleak` + test wraps the receiver Start/Shutdown lifecycle and asserts no + leaked goroutines post-shutdown. +- **Bounded channel: 1024.** Matches k8sevents' default per + `config.go:55-60`. The k8sevents ceiling of `2^20` (cap against + operator typos that allocate the channel into swap territory) is + the right ceiling. +- **Per-file goroutine isolation:** under BA-1, filelog manages this + internally via `pkg/stanza/fileconsumer`'s reader pool. Under + BA-2/3, tracecore owns it; pattern is "spawn one goroutine per + tailed file, bound the total by `max_concurrent_files` (default + 1024), drop on overflow with `IncError(KindMaxFilesExceeded)`." +- **Test target:** `TestContainerStdout_BackPressureGoLeak` already + named in §17. +- **Rubric stands.** No edit proposed. + +**MILESTONES.md line 372 — File-handle hygiene with `lsof` golden.** +Rubric: "≤2× pod-count open fds steady-state; closed within 30s of +Pod `Terminated`; verified by `lsof` golden." + +- **2× pod-count rationale:** ~1 fd per `0.log` plus ~1 fd per + currently-rotating `0.log.` (transient during the kubelet- + rotation window of §13.1). +- **`lsof` golden:** parse `lsof -p $(pidof tracecore) -F n` output; + assert count of `/var/log/pods/...` entries ≤ `2 × len(pods)`. + Easy CI gate but requires `lsof` in the test image. +- **30s close-after-Terminated:** the Pod informer's deletion event + triggers cursor-GC (§17.3) and fd-close. 30s window covers + drain-then-close. Cursor-GC must complete in this window. +- **Test target:** `TestContainerStdout_FdHygieneAfterPodTermination`. 
+- **Rubric stands.** Worth adding `lsof` to the test-image + requirements in the design doc. + +## 12. Follow-up gaps + +Closed gaps and the round-2 evidence that resolved them are recorded here +for the audit trail. + +### Closed + +1. **containerd #11149 mitigation path.** **No CRI-RPC mitigation exists.** + The CRI `RuntimeService` has no `GetContainerLog` / `ContainerLog` / + `StreamLogs` RPC; only `ReopenContainerLog` (a notification). `kubectl + logs` tails the same `/var/log/pods` file via kubelet's `/containerLogs` + handler, which calls `m.ReadLogs(...)` on the file path returned by + `ContainerStatus(...).log_path`. Real fixes have to live upstream of the + disk write (ring-buffered shim logger, sidecar logger). Resolved. +2. **`MaxFiles=1` validation.** **Rejected at two layers.** + `KubeletConfiguration.ContainerLogMaxFiles <= 1` returns + `"containerLogMaxFiles must be greater than 1"` at config validation + (`pkg/kubelet/apis/config/validation/validation.go`); `NewContainerLogManager` + also rejects `<= 1` at constructor time. `MaxFiles=2` is accepted but is + degenerate (zero retained rotated files; the rotated file is deleted at + the next monitor tick before compression can run). Resolved. +3. **Host FS-full rotation behavior.** **`Rename` fails synchronously; no + in-tick retry, no K8s Event, no metric.** `rotateLatestLog` returns immediately + on `Rename` error; klog logs at error level. The live `0.log` stays + open via the runtime's fd and continues to grow past `MaxSize` until + the rename succeeds on a later tick or the container exits. + `processContainer`'s `defer Forget(key)` neutralizes the workqueue's + exponential backoff, so the rotation is re-attempted on every monitor tick. Resolved. +4. **Containerd vs CRI-O reopen timing.** Both implement + `ReopenContainerLog` synchronously (gRPC response IS the ack). + Containerd does the dup2 in its CRI plugin; CRI-O signals conmon + (historically `SIGUSR1`). From the tailer's point of view the two + runtimes behave identically.
Resolved. +5. **`pkg/stanza` Go API stability.** **~3 BREAKING in 18 months + (v0.140–v0.152), all in adjacent code (`pkg/ottl`, `receiver/windowseventlog`).** + `pkg/stanza/fileconsumer` public surface has not had a tracked + breaking change in this window. No formal deprecation policy + documented; rely on collector-wide stability matrix + feature gates. + Budget ~1 day/quarter for upgrades, pin contrib version, integration-test + each bump. Resolved. +6. **`fileconsumer` fingerprint default.** **1 KiB default, 16 B minimum.** + CRI lines as short as 32 B produce a 32 B fingerprint; collision risk + exists only across pods sharing identical first-line prefixes that then + idle. Resolves automatically with normal log flow. Recommendation: keep + default; raise `fingerprint_size` only if tailing very-low-volume pods. + Resolved. +7. **lit-GPT log format.** Format: + `Epoch N | iter N step N | loss train: X, val: Y | iter time: Z ms` + with optional ` (step)` suffix on optimizer-step boundaries. **No + per-step data_time.** Source: + [`litgpt/pretrain.py`](https://github.com/Lightning-AI/litgpt/blob/main/litgpt/pretrain.py) + `fit()` format string. Default dataloader regex must convert ms → s. + Resolved. +8. **`TORCHELASTIC_RUN_ID` in `--standalone` mode.** **Random `uuid4()` per + invocation.** No correlation to PID, hostname, or start time. Useful as + intra-launch grouping key only; do not treat as a stable job + identifier. Always pair with an orchestrator-provided ID (Kubeflow + `training.kubeflow.org/job-name`, Slurm `SLURM_JOB_ID`, JobSet + `jobset-uid`). Resolved. + +### Closed (round 3) + +11. **Filestorage corruption observability.** Extension emits **zero + metrics** and 5 log lines. Tracecore alerts must bind to log-string + matches (`"Database corruption detected"`, `"compaction on start + failed"`) or host metrics on the storage directory. See §13.10. +14. 
**`monitoringPeriod` default value.** **10 seconds.** Set in + `pkg/kubelet/apis/config/v1beta1/defaults.go` + `SetDefaults_KubeletConfiguration`; unchanged since 2024-02-09. See + §13.8. +15. **PodLogs API KEP status.** **Dead.** KEP-3059 was auto-closed via + `lifecycle/rotten` without reaching alpha. No replacement filed. + Status quo (file tail via `/var/log/pods`) is the only supported + path. See §13.9. + +### Still open (deferred to future iterations or measurement) + +9. **MPI `OMPI_COMM_WORLD_RANK` extraction.** Reading from + `/proc//environ` is the only path; needs hostPID + + `CAP_SYS_PTRACE` or equivalent. Conflicts with M5b's minimal-privilege + policy. **Defer MPI attribution to a future receiver iteration; + document the gap.** +10. **Empirical binary-size delta of `filelogreceiver` import.** Estimate + is 15–30 MB unstripped; `-ldflags='-s -w'` saves 20–25%; only + custom-builder-manifest tree-shaking removes unused operators. + **Measure on a tracecore build before merging M15.** +12. **Compression-window race.** Kubelet's compress-after-rotate happens + in the next monitor tick (~10 s after rename) and writes via + `.tmp` + rename (confirmed round-2). Verify filelog only opens after + the final rename completes by stat-ing for `.gz` suffix without + `.tmp`. Add to integration test. +13. **Cross-pod fingerprint collision under realistic CRI prefixes.** + Write a property-style test: N pods, identical first-line timestamps + truncated to second precision + same stream marker, idle, force + rotation, assert no offset cross-pollination. + +## 13. Round-2 deeper-dive findings + +These sections record the evidence that resolved most of the §12 gaps and +several §10 decisions. They sit here (rather than expanding earlier +sections) so the original first-pass research stays readable as written. 
+ +### 13.1 Kubelet rotation source dive + +Source: `pkg/kubelet/logs/container_log_manager.go` (kubernetes/kubernetes, +master), `pkg/kubelet/apis/config/validation/validation.go`. + +- **`MaxFiles <= 1` is rejected.** Two layers: KubeletConfiguration + validation returns `"containerLogMaxFiles must be greater than 1"`; + `NewContainerLogManager` returns `"invalid MaxFiles N, must be > 1"`. + `MaxFiles=2` is the degenerate case: `removeExcessLogs` deletes every + rotated file at the next tick (`maxRotatedFiles = MaxFiles - 2 = 0`), + before compression can run. +- **FS-full rotation failure.** `rotateLatestLog` returns immediately on + `Rename` failure; no in-call retry, no K8s Event, no metric, only klog at + error level. The live `0.log` stays open via the runtime's fd and + grows past `MaxSize` until either the rename succeeds on a later tick or + the container exits. +- **Rename → Reopen sequence.** `Rename(log, rotated)` is synchronous on + the same FS; if it succeeds, kubelet calls + `runtimeService.ReopenContainerLog(ctx, id)` (synchronous gRPC). On + reopen failure, kubelet attempts a rollback rename (`rotated → log`). + Pathological case: if the rollback itself fails AND kubelet then + restarts, the original log is orphaned (containerd/CRI-O is still writing + to the rotated-file inode, but `0.log` does not exist; the comment in + source warns "we'll lose original log"). +- **Workqueue backoff neutralized.** + `processContainer`'s `defer queue.Forget(key)` resets the + rate-limiter history; failed rotations retry on every monitor tick + with no backoff. +- **Compression timing.** `compressLog` runs in the **same goroutine as + rotation**, BEFORE `rotateLatestLog`, on the NEXT tick after the rename + (default `monitoringPeriod` ~ 10 s). So a renamed plain file sits + uncompressed for roughly one monitor period before becoming `.gz`. + Compression uses `.tmp` + rename for atomicity; a mid-write `.gz` + is therefore not observable.
+ +**Receiver implication.** Add a degraded-mode signal when `0.log` size +exceeds `containerLogMaxSize` for sustained periods — this is the only +observable a tailer has for upstream rotation failure (rubric R-8). + +### 13.2 CRI `GetContainerLogs` is not a thing + +The CRI v1 `RuntimeService` +([`pkg/apis/runtime/v1/api.proto`](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1/api.proto)) +has no `GetContainerLog`, `ContainerLog`, or `StreamLogs` RPC. The only +log-related RPC is `ReopenContainerLog(ReopenContainerLogRequest) +returns (ReopenContainerLogResponse)` — a notification, not a read. +Log content lives only as a file path (`ContainerStatus.log_path`). + +`kubectl logs` is a file tail behind kubelet's HTTP API: +`pkg/kubelet/server/server.go`'s `/containerLogs/{ns}/{pod}/{container}` +calls `HostInterface.GetKubeletContainerLogs`, which retrieves +`log_path` from `ContainerStatus(...)` and then calls `m.ReadLogs(...)` +in `pkg/kubelet/kuberuntime/logs/logs.go` — a standard `os.Open` on the +CRI text file. + +Implication for #11149: **no CRI consumer can recover bytes lost when +an in-container reader contends on the shared stdout pipe**, because +the loss happens before anything observable at the CRI surface. The +fix has to live in containerd's IO copier (move from a shared pipe to +a non-blocking ring-buffer or separate copy targets) or in operator +discipline (don't `cat /proc/1/fd/1` from inside training containers). +Document the failure mode in the receiver README; do not propose +CRI-RPC mitigation in the design doc. + +### 13.3 `pkg/stanza` API stability and binary cost + +CHANGELOG audit of contrib v0.140–v0.152 (~18 months): + +- **3 BREAKING changes, all in adjacent code:** OTTL semantics changes + in v0.150 / v0.152 and a Windows event_data shape change in v0.148. + None in `pkg/stanza/fileconsumer` itself. 
+- **6 enhancements + 3 bug fixes** tagged `pkg/stanza` or + `receiver/file_log` in the same window — active maintenance. +- **Feature gates** in flight: `filelog.protobufCheckpointEncoding` + (v0.148, alpha — new bbolt key encoding) and + `filelog.decompressFingerprint` (stable at v0.142). +- **No formal deprecation policy** documented in the CHANGELOG; the + project relies on collector-wide stability matrix and feature gates. +- **Production users:** `otelcol-k8s` distribution ships filelog + + `filestorage` by default + ([manifest](https://github.com/open-telemetry/opentelemetry-collector-releases/blob/main/distributions/otelcol-k8s/manifest.yaml)). + Strong signal that filelog is production-load-tested. + +**Binary-size estimate:** 15–30 MB unstripped (qualitative, from package +count and reflection-driven decode patterns); `-ldflags='-s -w'` saves +20–25 % by stripping DWARF/symbol tables; **only custom-builder-manifest +tree-shaking removes unused operators**, because `pkg/stanza` +type-registers all operators at init. Measure before claiming. + +**Filestorage internals:** bbolt single-writer per file (one file per +component, sharing a directory); default +`/var/lib/otelcol/file_storage`; `create_directory: true` / +`recreate_on_error: true` toggles; on corruption renames +`..backup` and starts fresh. Single-collector-per-node is +fine; shared-PVC anti-pattern would deadlock on the file lock. + +**Fingerprint default:** 1 KiB (`pkg/stanza/fileconsumer/internal/fingerprint/fingerprint.go`), +minimum 16 B. CRI lines as short as 32 B yield 32 B fingerprints — +collision risk exists across pods that share an identical first-line +prefix and then idle, but resolves under normal log flow because the +fingerprint grows with the file. 
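
To make the collision window concrete, here is a small sketch of prefix-based fingerprinting. It assumes only what the paragraph above states (a fingerprint is the first N bytes of the file, shorter if the file is shorter, and it grows with the file). The function names `fingerprint` and `sameFile` are hypothetical and do not mirror the `fileconsumer` internals; the log lines are contrived.

```go
package main

import (
	"bytes"
	"fmt"
)

// fingerprint returns a copy of at most size leading bytes of content.
func fingerprint(content []byte, size int) []byte {
	if len(content) < size {
		size = len(content)
	}
	return append([]byte(nil), content[:size]...)
}

// sameFile reports whether newFP could be a later read of the file that
// produced oldFP: a file that only grew keeps its old fingerprint as a prefix.
func sameFile(oldFP, newFP []byte) bool {
	return len(oldFP) > 0 && bytes.HasPrefix(newFP, oldFP)
}

func main() {
	// Two pods whose only line so far is byte-identical (contrived).
	a := []byte("2026-05-19T00:00:00Z stdout F ready\n")
	b := []byte("2026-05-19T00:00:00Z stdout F ready\n")
	fmt.Println(sameFile(fingerprint(a, 1024), fingerprint(b, 1024))) // collide while idle

	// Once each pod logs again, the fingerprints diverge.
	a = append(a, "2026-05-19T00:00:01Z stdout F step 1\n"...)
	b = append(b, "2026-05-19T00:00:02Z stdout F step 1\n"...)
	fmt.Println(sameFile(fingerprint(a, 1024), fingerprint(b, 1024)))
}
```

This is the same shape the property-style test in §12 item 13 would exercise at scale: identical short prefixes, idle files, then divergence.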
+ +### 13.4 Pod attribution: `k8sattributes` pairing + env-var prior art + +**Pairing config** for the `container` stanza operator's output: + +```yaml +processors: + k8sattributes: + pod_association: + - sources: + - from: resource_attribute + name: k8s.pod.uid + extract: + metadata: + - k8s.node.name + - k8s.deployment.name + - k8s.daemonset.name + - k8s.job.name + - k8s.statefulset.name + - k8s.pod.start_time + labels: + - tag_name: training.kubeflow.org/job-name + key: training.kubeflow.org/job-name + from: pod + - tag_name: training.kubeflow.org/replica-type + key: training.kubeflow.org/replica-type + from: pod + - tag_name: training.kubeflow.org/replica-index + key: training.kubeflow.org/replica-index + from: pod + - tag_name: jobset.sigs.k8s.io/jobset-name + key: jobset.sigs.k8s.io/jobset-name + from: pod + - tag_name: jobset.sigs.k8s.io/job-index + key: jobset.sigs.k8s.io/job-index + from: pod +``` + +Critical: the **default `pod_association`** is `from: connection` (the +OTLP client's source IP). For a local file-tail pipeline, the +"connection" is the agent itself, so the default does not work; explicit +`k8s.pod.uid` association is required. + +**RBAC for k8sattributes (under §15.1 BA-1 only):** +`get,list,watch` on `pods`, `namespaces`, `nodes` (core); +`apps` `replicasets,deployments,statefulsets,daemonsets` +for owner-kind resolution; `batch` `jobs,cronjobs` for job ownership. +ClusterRoleBinding to tracecore's ServiceAccount. + +This is the **upstream `k8sattributesprocessor`'s default RBAC**, which +is what BA-1 inherits. Under BA-2 or BA-3, tracecore implements its +own informer and SHOULD scope the watch to node-local pods via +`FieldSelector=spec.nodeName=$NODE_NAME` (see §16.3 — this contradicts +the BA-1 default and intentionally so, because under BA-2/BA-3 we +control the informer scope and should use the tighter setting). 
+Resolution of the apparent §13.4 vs §16.3 conflict: §13.4 documents +the upstream-OTel default (BA-1 inherits this); §16.3 documents +tracecore's recommended posture (BA-2/BA-3 own this choice). + +**Env-var-as-attributes prior art: not found in the components +surveyed.** The components I checked +(`resourcedetectionprocessor`, OTel Operator, `k8sattributesprocessor`, +OpenInference / LangSmith / Arize semantic conventions) do not project +arbitrary running-pod env vars into attributes. The OTel Operator +injects only the OTel SDK config set (`OTEL_RESOURCE_ATTRIBUTES`, +`OTEL_NODE_IP`, etc.). Components NOT surveyed and worth validating +before declaring greenfield: `k8sobjectsreceiver` (a generic K8s API +object collector), Datadog Agent's autodiscovery, Vector's +`kubernetes_logs` source. The strength of the "greenfield" framing +is therefore "no prior art among the components surveyed", not "no +prior art exists." + +Implication for OD-1b: two viable paths. + +- **Path A (downward API + resourcedetection).** Operators add an env + block to their PodSpec / training launcher script: + ```yaml + env: + - name: OTEL_RESOURCE_ATTRIBUTES + value: "tracecore.training.rank=$(RANK),tracecore.training.world_size=$(WORLD_SIZE),tracecore.training.job.id=$(TORCHELASTIC_RUN_ID)" + ``` + Then `resourcedetectionprocessor` with the `env` detector lifts these + to resource attributes. No new tracecore code; one shell line per + launch script. +- **Path B (tracecore env-projection processor).** New processor reading + `Pod.spec.containers[*].env` via informer; maps configured env names + to attributes keyed on `(k8s.pod.uid, k8s.container.name)`. Invisible + to operators but adds pod-watch RBAC and a cardinality surface. + +Recommend shipping Path A as the default (documented in the receiver +README) and Path B as an opt-in for unmodified workloads. 
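
As a rough illustration of the Path B projection step, the sketch below maps configured env names to attribute keys for one container. The `envVar` and `container` types are simplified stand-ins for the Kubernetes pod-spec types, and `projectEnv` is a hypothetical name, not an existing tracecore or OTel function. One assumption worth flagging: only literal `value` entries are visible in a pod spec, so variables injected at process start (for example by a launcher) would not appear here.

```go
package main

import (
	"fmt"
	"sort"
)

// envVar and container are simplified stand-ins for the pod-spec types
// an informer would deliver; only literal values are modeled.
type envVar struct{ Name, Value string }

type container struct {
	Name string
	Env  []envVar
}

// projectEnv maps configured env names to attribute keys for one
// container. Unmapped and empty-valued variables are skipped.
func projectEnv(c container, mapping map[string]string) map[string]string {
	attrs := make(map[string]string)
	for _, e := range c.Env {
		if key, ok := mapping[e.Name]; ok && e.Value != "" {
			attrs[key] = e.Value
		}
	}
	return attrs
}

func main() {
	c := container{Name: "trainer", Env: []envVar{
		{"RANK", "3"}, {"WORLD_SIZE", "8"}, {"HOME", "/root"},
	}}
	attrs := projectEnv(c, map[string]string{
		"RANK":       "tracecore.training.rank",
		"WORLD_SIZE": "tracecore.training.world_size",
	})
	keys := make([]string, 0, len(attrs))
	for k := range attrs {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output order for display
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, attrs[k])
	}
}
```

An explicit allow-list mapping, rather than projecting every env var, is what keeps the cardinality surface mentioned above bounded.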
+ +### 13.5 Cursor persistence: bbolt wins over JSON + +Production landscape (round-2 survey): + +| Shipper | Backend | Format | Notes | +|---|---|---|---| +| Fluent Bit `tail` | SQLite (WAL) | binary | `.db-shm` / `.db-wal` companions | +| Vector `file` | JSON | `checkpoints.json` | tmp + atomic rename, human-readable | +| Datadog Agent | JSON | `registry.json` | optional atomic write toggle | +| Promtail | YAML | `positions.yaml` | flat path → offset map | +| OTel filelog | bbolt | binary | via `file_storage` extension | + +The rubric's JSON-at-`cursor.json` matches Vector/Datadog convention, +not OTel. The rubric predates the depend-on-filelog decision; the +cleanest resolution is to adopt filelog's `file_storage` (bbolt) as +authoritative and update the rubric (R-6). Optional read-only JSON +snapshot for human inspection can be a low-frequency side-channel — +not the source of truth, to avoid dual-cursor consistency bugs. + +Config: + +```yaml +extensions: + file_storage: + directory: /var/lib/tracecore/container_stdout + fsync: true + compaction: + on_start: true + +receivers: + filelog: + storage: file_storage + ... +``` + +### 13.6 Gzip-compressed rotated files: decompress and continue + +Filelog supports `compression: auto` (since contrib v0.142+) and +`compression: gzip`; with `auto`, files matching gzip suffix are read +through a transparent decompressing reader. + +Known gzip-corruption bugs (`#46105` rotation-with-compression +corruption; `#45572` last-line-without-newline) are **fixed in contrib +v0.144+**. Pin and integration-test. + +Other shippers: + +- Fluent Bit `tail`: no gzip decompression; users typically + `exclude_path: *.gz` and accept the gap. +- Vector `file`: no gzip decompression. +- Filebeat: supports `.gz` via the `tail` reader with a separate + `gzip` processor. + +Recommendation: **decompress.** Kubelet's rename-then-gzip flow makes +"accept the gap" a real data-loss surface on chatty pods during +pod-death bursts. 
Add integration test that writes 50 MB to `0.log`, +triggers rotation + gzip + new `0.log` in sequence, asserts zero record +loss across the seam. + +### 13.7 lit-GPT log format + +Real samples from +[`litgpt` issues #1110 and #1607](https://github.com/Lightning-AI/litgpt/issues): + +``` +Epoch 4 | iter 962 step 962 | loss train: 0.937, val: 1.057 | iter time: 503.53 ms +Epoch 34 | iter 3051050 step 610210 | loss train: 2.895, val: 2.861 | iter time: 111.24 ms (step) +``` + +Source format string in `litgpt/pretrain.py:fit()`: + +```python +f"Epoch {metrics['epoch'] + 1} | iter {metrics['iter']} step {metrics['step']} |" +f" loss train: {metrics['loss']:.3f}, val: {val_loss} |" +f" iter time: {metrics['iter_time'] * 1000:.2f} ms" +f"{' (step)' if not is_accumulating else ''}" +``` + +Fields present: `epoch`, `iter`, `step`, `loss train`, `val`, +`iter time` (ms), optional ` (step)` marker on optimizer-step +boundaries. **No per-step data_time** in the printed line. + +Regex for extraction (note: source emits ms; receiver must divide): + +```regex +^Epoch\s+\d+\s+\|\s+iter\s+\d+\s+step\s+\d+\s+\|\s+loss train:\s+[\d.]+,\s+val:\s+\S+\s+\|\s+iter time:\s+(?P[\d.]+)\s+ms +``` + +Use the trailing `(step)` flag to gate on true optimizer steps; do not +emit derived metrics on accumulation-only iterations. + +### 13.8 Kubelet `containerLogMonitorInterval` default + +**10 seconds.** Set in `pkg/kubelet/apis/config/v1beta1/defaults.go` in +`SetDefaults_KubeletConfiguration`; introduced 2024-02-09 in commit +`ab8c784ee970d72b03fd1c2ed7c228914e17e954` ("kubelet: enable configurable +rotation duration and parallel rotate"), unchanged since. + +```go +if obj.ContainerLogMonitorInterval == nil { + obj.ContainerLogMonitorInterval = &metav1.Duration{Duration: 10 * time.Second} +} +``` + +Worst-case rotation latency on defaults: ~10 s (one monitor period) plus +single-worker queue drain (`ContainerLogMaxWorkers` default `1`). 
A +container emitting >`ContainerLogMaxSize` (10 MiB default) in <10 s can +briefly exceed the cap before kubelet rotates. Relevant for rubric R-8's +"sustained period" definition: 30 s is a safe threshold for "rotation +stalled" alerting (3× the monitor period). + +### 13.9 PodLogs API KEP status: dead + +**No upstream replacement path.** KEP-3059 ("Add pod level logs API") +was filed 2021-11-27 as `[WIP]`, never assigned a sig-node owner, and +auto-closed via `lifecycle/rotten` without reaching alpha +(). No +replacement KEP has been opened. Adjacent work (KEP-2411 CRI log +rotation, KEP-1602 structured logging, KEP-1753 logs sanitization, +KEP-3077 contextual logging) does not promote container logs to a +streaming API. + +The status quo — kubelet `/containerLogs/{ns}/{pod}/{container}` +proxied through apiserver `GET .../log` plus node-local file tailing — +is the only supported pattern. **Tracecore M15 design does not need to +anticipate a near-term upstream shift.** + +### 13.10 `file_storage` extension observability surface + +**Zero metrics. Five log lines. No collector-side health metric.** + +Source: `extension/storage/filestorage/extension.go`, `factory.go`. + +Log lines emitted (via component logger): + +1. `Warn` — "filename too long, using hashed filename instead" +2. `Warn` — "Database corruption detected, recreating database file" + (fires from `recover()` panic handler) +3. `Info` — "Corrupted database file renamed" (backup path) +4. `Error` — "compaction on start failed" +5. `Info`/`Debug` — "cleanup" / "cleanup error listing temporary files" + +Notably absent: +- No lock-timeout log when bbolt can't acquire the file lock within the + configured `Timeout` (default 1 s); the error propagates to the + caller (filelog) as a component error. +- No metric instrumenting rebuild count, lock-wait, or db-size growth. 
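
Since the log lines listed above are the only collector-side signal, an alert rule reduces to a substring match on the message. A minimal sketch, assuming the two message strings quoted in this section are stable across versions (not verified here); `isStorageAlert` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// alertSubstrings holds the two file_storage messages treated as
// alert-worthy; the strings are copied from the list above.
var alertSubstrings = []string{
	"Database corruption detected",
	"compaction on start failed",
}

// isStorageAlert reports whether a log message contains one of the
// alert-worthy substrings.
func isStorageAlert(msg string) bool {
	for _, s := range alertSubstrings {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isStorageAlert("Database corruption detected, recreating database file"))
	fmt.Println(isStorageAlert("Corrupted database file renamed"))
}
```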
+ +**Tracecore alerting strategy (M15 RUNBOOK):** + +- **Log-string alert** on `"Database corruption detected"` and + `"compaction on start failed"` from logger name + `extension/file_storage`. This is the only collector-side signal. +- **Host metrics** on the storage directory via `hostmetricsreceiver` + (`filesystem` scraper): disk free, inode pressure, db file mtime + drift as liveness proxy. +- **Downstream signal:** filelog's + `otelcol_receiver_refused_log_records` rising while the storage + path is unreachable. Indirect but observable. + +Defaults that matter: `Timeout: 1s`, `Compaction.OnStart: false`, +`Compaction.OnRebound: false`, `Compaction.CheckInterval: 5s`, +`FSync: false`, `DirectoryPermissions: 0750`. Recommend +`Compaction.OnStart: true` and `FSync: true` for tracecore's +production posture; both add small startup/write cost but improve +crash safety. + +### 13.11 `container` operator vs `k8sattributes` attribute overlap + +**Status banner (read before continuing):** the pipeline YAML below is +**illustrative of the semantic shape**, not the deployable +configuration. Tracecore's pipeline runtime is not the OTel Collector +(see §15). Under any build approach other than BA-1, the components +named here (filelog, k8sattributes, resourcedetection, transform, +file_storage) cannot run as-shown. The YAML is preserved as a +specification of *which concerns* the receiver and its surrounding +processors must address, not *how* they are wired. Under BA-2/BA-3, +each concern becomes a tracecore-native component with equivalent +behavior. 
+ +**Operator wins by default; processor fills in gaps.** Three attributes +overlap (`k8s.namespace.name`, `k8s.pod.name`, `k8s.pod.uid`) but +`k8sattributesprocessor.setResourceAttribute` skips non-empty +existing values: + +```go +func setResourceAttribute(attributes pcommon.Map, key, val string) { + attr, found := attributes.Get(key) + if !found || attr.AsString() == "" { + attributes.PutStr(key, val) + } +} +``` + +**Canonical pipeline config** that avoids double-extraction and +exercises both components' strengths: + +```yaml +receivers: + filelog/containerstdout: + include: [/var/log/pods/*/*/*.log] + start_at: end + storage: file_storage + compression: auto # OD-6: decompress rotated .gz + operators: + - type: container # writes k8s.{pod.uid,pod.name,namespace.name,container.name,container.restart_count} + +processors: + k8sattributes/training: + auth_type: serviceAccount + pod_association: + - sources: + - from: resource_attribute + name: k8s.pod.uid # operator already set this + extract: + metadata: + - k8s.pod.start_time # not duplicated by operator + - k8s.deployment.name + - k8s.statefulset.name + - k8s.daemonset.name + - k8s.job.name + - k8s.node.name + labels: + - tag_name: training.kubeflow.org/job-name + key: training.kubeflow.org/job-name + from: pod + - tag_name: training.kubeflow.org/replica-type + key: training.kubeflow.org/replica-type + from: pod + - tag_name: training.kubeflow.org/replica-index + key: training.kubeflow.org/replica-index + from: pod + - tag_name: jobset.sigs.k8s.io/jobset-name + key: jobset.sigs.k8s.io/jobset-name + from: pod + + resourcedetection/training: + detectors: [env] # OD-1b Path A: lifts $OTEL_RESOURCE_ATTRIBUTES + override: false # preserve any operator-set values + + transform/training_dataloader: # §14.1 — OTTL recipe for dataloader regex + error_mode: ignore + log_statements: + - context: log + statements: + - merge_maps(attributes, + ExtractPatterns(body, + 
"\\btime:\\s+(?P\\d+(?:\\.\\d+)?)\\b.*?\\b(?:data_time|data):\\s+(?P\\d+(?:\\.\\d+)?)\\b"), + "upsert") + where IsString(body) and IsMatch(body, "\\btime:.*\\b(?:data_time|data):") + - set(attributes["tracecore.training.iter_time_s"], Double(attributes["iter_time_s_str"])) + where attributes["iter_time_s_str"] != nil + - set(attributes["tracecore.training.data_time_s"], Double(attributes["data_time_s_str"])) + where attributes["data_time_s_str"] != nil + - delete_key(attributes, "iter_time_s_str") + - delete_key(attributes, "data_time_s_str") + +extensions: + file_storage: # OD-5: bbolt cursor persistence + directory: /var/lib/tracecore/container_stdout + fsync: true + compaction: + on_start: true + +service: + extensions: [file_storage] + pipelines: + logs/containerstdout: + receivers: [filelog/containerstdout] + processors: + - k8sattributes/training + - resourcedetection/training + - transform/training_dataloader + - tracecore_ratelimit # OD-3: per-(pod_uid, container) token bucket + exporters: [...] +``` + +This is **not the final M15 config** — it's the canonical shape that +the receiver wrapper should produce or document, with tracecore-specific +defaults baked in. 
+ +### 13.12 Industry training-observability namespace survey + +Surveyed 10 platforms; **no shared namespace exists for distributed-training +signals.** Each platform invents its own: + +| Platform | Namespace | Rank attribution | +|---|---|---| +| NVIDIA Triton | `nv_*` (flat snake_case Prometheus) | `gpu_uuid` label, no rank | +| NVIDIA NeMo | None (flat PTL keys: `train_step_timing`) | Not first-class | +| Google Vertex AI | `AIP_*` env vars; no metric namespace | JSON `CLUSTER_SPEC` env | +| AWS SageMaker | `/aws/sagemaker/TrainingJobs` CW namespace | Encoded in `Host` dimension | +| AWS Bedrock fine-tuning | None — no metric stream | API/EventBridge only | +| MosaicML Composer | Slash-path (`throughput/batches_per_sec`) | W&B/MLflow run tag | +| Together fine-tuning | Flat object fields | Not exposed | +| OpenAI fine-tuning | Flat fields under `data` | Not exposed | +| Weights & Biases | `gpu.*`, `gpu.process.*` | Per-rank run | +| MLflow | Flat snake_case (`gpu_{i}_*`) | Per-rank run | + +**Three robust patterns emerge from the survey:** + +1. **Rank is universally a resource-level identifier**, not a per-metric + label. SageMaker bakes it into `Host`; W&B/MLflow tag the run; + Composer stores it as a run attribute. Tracecore should follow: + `tracecore.training.rank` is a *resource attribute*, set once per + process, never a record-level label. +2. **No vendor uses a shared namespace.** `nv_*` is Triton-only. `gpu.*` + is W&B-only. Tracecore's `tracecore.training.*` is consistent with + the prevailing pattern (vendor-prefixed for unstandardized concepts). +3. **`process.runtime.*` does not fit.** SemConv's process registry is + explicitly OS-process + language-VM metadata; zero collective-comm + vocabulary. Filing an upstream PR is unlikely to land given the + GenAI SIG's inference-only scope. `system.*` is forbidden by spec + for non-host metrics. There is no upstream group that accepts + distributed-training attributes today. 
+ +**Verdict (caveated): the surveyed industry has no shared +distributed-training namespace.** The sample is biased: it covers the +ten most visible commercial / cloud-vendor platforms (Triton, NeMo, +Vertex, SageMaker, Bedrock, Composer, Together, OpenAI, W&B, MLflow), +which are exactly the actors with the market power to invent and +sustain a vendor-prefixed namespace. Smaller / newer platforms not +surveyed in this pass include **ClearML, Determined.AI, Skypilot, +lit-Lightning, Anyscale**, any of which may use shared conventions +inherited from PyTorch / MLflow rather than inventing their own. + +The conclusion "no shared namespace exists" is therefore "no shared +namespace exists among the platforms surveyed." Under §7's revised +framing, this verdict is also less decisive than it appeared: tracecore +is actively trying to *create* the shared namespace via NORTHSTARS O4. +A small-platform follow-up survey would either reveal an existing +convention worth aligning with or confirm O4's hypothesis that no +convention exists yet. + +Round-1 used this survey to corroborate `tracecore.training.*` as the +right namespace. §7's revision withdraws that recommendation pending +the O4 owner's call. The survey remains a useful negative-evidence +artifact, not a positive recommendation. + +### 13.13 `TORCHELASTIC_RUN_ID` in `--standalone` mode + +Source: `torch/distributed/run.py` in +[pytorch/pytorch](https://github.com/pytorch/pytorch). + +```python +if args.standalone: + args.rdzv_backend = "c10d" + args.rdzv_endpoint = "localhost:0" + args.rdzv_id = str(uuid.uuid4()) +``` + +That `rdzv_id` is passed to `LaunchConfig(run_id=args.rdzv_id, ...)` and +surfaced to workers as `TORCHELASTIC_RUN_ID`. + +- **Random `uuid4()` per launcher invocation.** Siblings within one + launch share the value; cross-launch correlation is zero. +- **No correlation to PID, hostname, or wall clock.** Purely random + bits. +- **Practical implication.** Intra-launch grouping key only. 
Do not + treat as a stable job identifier; a crash-loop produces a new UUID + every restart. For job-level attribution, pair with an orchestrator + identifier: `training.kubeflow.org/job-name`, `SLURM_JOB_ID`, or + `jobset.sigs.k8s.io/jobset-uid`. + +Receiver README must document this so operators don't store +`tracecore.training.job.id = $TORCHELASTIC_RUN_ID` and expect cross-restart +joins to work. + +## 14. Pipeline-shape illustration and OTTL recipe (semantics-only) + +**Status banner.** The OTTL recipe in this section is **specification +of the desired semantic behavior**, not a runnable Collector config. It +has not been validated against `otelcol validate` or executed against a +real pipeline. Under §15.1 BA-0 (sidecar collector), the recipe would +run inside the sidecar; under BA-1, it would run via the adapter; +under BA-2 or BA-3, the equivalent must be reimplemented as a native +tracecore processor using `regexp` from the Go standard library. The +syntax shown was checked against the OTTL `ottlfuncs` README on +2026-05-19 but not exercised. Treat as a design hint, not as a +deliverable. + +The pipeline config sketched in §13.11 captures the canonical shape. +Notable elements: + +- **`compression: auto`** on the filelog receiver (OD-6). +- **`storage: file_storage`** + `fsync: true` + `compaction.on_start: true` + (OD-5). +- **`k8sattributes` with `pod_association` keyed on `k8s.pod.uid`**. + Default IP-based association doesn't work for local file tails. +- **`resourcedetection` with `env` detector** for OD-1b Path A: lifts + `OTEL_RESOURCE_ATTRIBUTES=tracecore.training.rank=$(RANK),...` set + by the training launcher via downward API. +- **OTTL `transform` processor** for dataloader regex extraction, with + named capture groups, body-type guard (`IsString`), match-shortcut + (`IsMatch`), explicit `Double(...)` coercion, and temp-key cleanup. + This resolves OD-4 as config-only; no new Go code required. 
+- **`tracecore_ratelimit` processor** (new) for a per-key token bucket + on `(k8s.pod.uid, k8s.container.name)` (OD-3). + +### 14.1 OTTL recipe nuances + +Worth calling out for the design phase: + +- **`ExtractPatterns` requires named capture groups** (Go regex + identifier rules; no dots in the name). Use temp keys then rename in + a second step. +- **Body-type guard is mandatory.** Structured logs where `body` is a + map make `ExtractPatterns` return an error, which `error_mode: propagate` + passes up the pipeline. Under + `error_mode: ignore` it silently no-ops but is slower; the `IsString` + + `IsMatch` guards short-circuit cleanly. +- **Type coercion is explicit.** `ExtractPatterns` always returns + strings; downstream gauges/histograms need floats. `Double(...)` + returns `nil` on parse failure (per the OTTL `ottlfuncs` README, verified + 2026-05-19), so the `!= nil` guard is required to skip rather than + emit a null-valued attribute. +- **Pattern configurability:** OTTL has no env-var interpolation; + expose the regex pattern via Collector-config template substitution + (`confmap` providers) at startup, not inside OTTL. + +## 15. Correction: tracecore is not an OTel Collector distribution + +**Discovered during the 2026-05-19 confidence-raising pass; invalidates +significant parts of §5, §13.11, and §14 as originally written. Those +sections are left in place for the audit trail; this section is the load-bearing one.** + +Tracecore has its **own** pipeline runtime under +[`internal/pipeline/`](../../internal/pipeline).
Quoting +`internal/pipeline/factory.go:72-79`: + +```go +type ReceiverFactory interface { + Type() Type + CreateDefaultConfig() Config + + CreateMetrics(ctx context.Context, set CreateSettings, cfg Config, next consumer.Metrics) (Receiver, error) + CreateTraces(ctx context.Context, set CreateSettings, cfg Config, next consumer.Traces) (Receiver, error) + CreateLogs(ctx context.Context, set CreateSettings, cfg Config, next consumer.Logs) (Receiver, error) +} +``` + +This **mirrors** upstream `go.opentelemetry.io/collector/receiver.Factory` +at v1.55.0 in shape (per the source comment in `factory.go:37-38`), but +is **a different Go type**. Upstream filelog's `receiver.Factory` does +not satisfy tracecore's `pipeline.ReceiverFactory`; the interfaces have +different method names (`CreateLogs` vs `CreateLogsReceiver`), different +parameter types (`pipeline.CreateSettings` vs `receiver.Settings`, +`pipeline.Config` vs `component.Config`), and tracecore's version omits +the `*ReceiverStability()` methods. + +Confirmed by grep across the repo: only test files in +`components/receivers/kernelevents/otelcontrib_e2e_test.go` import the +upstream `go.opentelemetry.io/collector/receiver` packages, and only as +test-time end-to-end fixtures. No runtime adapter or shim exists. + +### 15.1 What this means for the build approach + +**You cannot "depend on filelog" as §5 originally said.** Five viable +paths. The adversarial review surfaced two that round-2 missed (BA-0, +BA-4); they are included here. + +| ID | Approach | Cost | Trade-offs | +|---|---|---|---| +| BA-0 | **Sidecar otelcol DaemonSet.** Run upstream `otelcol-k8s` as a sibling DaemonSet (separate from the tracecore DaemonSet); configure its filelog receiver and exporters; have it ship OTLP to tracecore as an upstream producer. Tracecore consumes that OTLP via its existing receiver path. | Lowest. No tracecore code change. Operator-config-only. | Adds a second pod per node. Two binaries to keep current. Two RBAC sets. 
Resource overhead duplicates what tracecore already provides for hostmetrics. Loses any tracecore-specific receiver knobs (per-rank attribution, dataloader regex, tracecore rate-limiting) unless we also build them as native processors downstream. | +| BA-1 | Build a generic `pipeline.OTelReceiverFactory` adapter in `internal/pipeline/` that wraps any upstream `receiver.Factory`. Import `filelogreceiver` through the adapter. | High up front; pays back across future imports of OTel components. | Once-and-done infrastructure. Future M16 (Kueue scraper) and other receivers benefit. Risk: subtle semantic mismatches in `CreateSettings`/`TelemetrySettings` field mapping. Reasonable mitigation: `plog.Logs` IS what tracecore's `consumer.Logs` already takes (verified at `internal/consumer/logs.go:13`), so data-model interop is not the problem; factory-interface bridging is. | +| BA-2 | Port `pkg/stanza/fileconsumer` and the `container` operator as a vendored Go dependency, write tracecore-native wiring (factory, config, lifecycle) around them. The consumed types are `plog.Logs` and `pcommon.Map`, both already in tracecore's data path (`internal/consumer/logs.go:13`). | Medium; one-time port + tracking upstream releases. | Inherits filelog's tail mechanics without the upstream-receiver-interface coupling. Vendor-update burden on every contrib release we care about. Data-model interop verified clean per `plog.Logs` finding above. | +| BA-3 | Reimplement on top of `fsnotify` or polling + custom CRI parser. | Highest; we own the rotation/fingerprint/partial-line bugs forever. | No upstream dependency surface. Reinvents proven code. Round-1 finding was that this is exactly what we shouldn't do. | +| BA-4 | **Defer M15. 
Promote the adapter to its own milestone first.** If BA-1 is the right answer architecturally but only justifies its cost when ≥2 upstream components ride on it, ship the adapter as its own milestone (M-Adapter, between Lane 1 and Lane 4), then ship M15 against it as the first user. | Low for the M15 sequencing decision; the adapter cost moves to a new milestone. | Scope-discipline win: M15 doesn't carry the adapter's design overhead. Calendar cost: M15 ships later. Surfaces the build-approach question at a milestone-planning level (the right level), not as a Day-1 M15 design decision. | + +**No single recommendation in this research pass.** The choice depends +on stakeholder questions this doc cannot answer alone: + +1. Is M16 (Kueue receiver) intended to consume upstream OTel + components? Verified evidence (MILESTONES.md lines 377-395): M16 is + a Prometheus scrape against `kueue-controller-manager`, emits OTLP + metrics, has a custom `cluster_queue` cardinality cap (default 256), + and registers via `components.go` one-line factory. This rubric + **could** be implemented either way: (a) tracecore-native scraper + (no BA-1 dependency, ~mid-size work), or (b) wrap upstream OTel + `prometheusreceiver` via BA-1's adapter (reuses upstream Prom + + histogram + label-translation machinery, but commits to BA-1's + adapter cost). The rubric does not mandate either path. **Net: + M16's owner has a real choice.** If they pick (b), BA-1 amortizes + across both receivers; if they pick (a), BA-1 serves only M15 and + BA-2 becomes the cheaper choice for tracecore overall. +2. Is the additional pod-per-node footprint of BA-0 acceptable in + tracecore's "minimal-privilege, single-binary" positioning? Per + `NORTHSTARS.md` O2 (Convenience) — likely no, but it's a stakeholder + call. +3. Is scope-discipline more valuable than calendar speed? BA-4 explicitly + trades the latter for the former. 
+ +Round-1 recommended BA-1; round-2's correction implied BA-1 or BA-2; +round-3 (this) recognizes BA-0 and BA-4 as legitimate options that +should be on the table before any commitment. + +This decision should be made before any M15 implementation work, with +M16's owner, the O2 stakeholder, and the milestone-planning lead in the +room. + +### 15.2 What this means for the pipeline config in §13.11 + +**Most of §13.11 is illustrative-only.** Tracecore's pipeline runtime +does not load an `otelcol`-style YAML config; it has its own loader and +component graph. The example pipeline shows what an equivalent +upstream-OTel pipeline would look like, useful for understanding which +*concerns* need to be addressed, but **not deployable as-shown**. + +The concerns that still apply, with adjusted owners: + +- **CRI parsing + partial-line recombine.** Either ported from + `pkg/stanza/operator/parser/container` (BA-2) or invoked through the + adapter (BA-1). Owner: the M15 receiver. +- **Cursor persistence.** No filelog `file_storage` extension; the + receiver owns this. Reconsider OD-5: under BA-1 the bbolt path is + re-attainable through the adapter; under BA-2/BA-3 the rubric's + original JSON-at-`cursor.json` becomes the sane choice. **Rubric edit + R-6 is now conditional on BA-1.** +- **Pod attribution.** `k8sattributesprocessor` is also an upstream + component; same adapter question. If we don't have BA-1, tracecore + owns the informer + attribute-projection logic. The k8sevents + informer wiring (verified in §8) is the template. +- **Dataloader regex.** OTTL `transformprocessor` is upstream-only too. + Without BA-1, this needs to be a native tracecore processor (config: + `dataloader_regex` string; behavior: ExtractPatterns-equivalent + using `regexp` stdlib). The OTTL recipe in §14 is illustrative of + the *semantics* but cannot run as-written. 
+- **Rate limiting.** Native tracecore processor regardless of build + approach; no upstream equivalent existed anyway (OD-3 conclusion + unchanged). +- **Env-var projection.** Native tracecore (Path B in OD-1b); + upstream `resourcedetectionprocessor` only reads + `OTEL_RESOURCE_ATTRIBUTES`, which is Path A. + +### 15.3 What this means for §10 open decisions + +New open decision and updates: + +| ID | Status | Update | +|---|---|---| +| OD-11 | OPEN (new) | **Build approach choice: BA-0 (sidecar) / BA-1 (adapter) / BA-2 (port `fileconsumer`) / BA-3 (reimplement) / BA-4 (defer).** See §15.1 and §15.6. Decision needs M16 owner input. | +| OD-3 | RESOLVED | Unchanged: native tracecore processor. | +| OD-4 | OPEN (re-opened) | OTTL `transformprocessor` is upstream-only. Native tracecore processor with `regexp` is the new default unless BA-1 lands first. | +| OD-5 | OPEN (re-opened) | bbolt vs JSON depends on BA-1 vs BA-2/BA-3. Under BA-1, keep R-6 (file_storage). Under BA-2/BA-3, revert to rubric's JSON cursor. | +| OD-7 | RESOLVED → SUPERSEDED by OD-11 | "Factory-only vs Go embedding" was an upstream-OTel framing; the actual question is now BA-1 vs BA-2 vs BA-3. | + +### 10.1 Open-decision owner table + +| OD | Open question | Primary owner | Stakeholders | Target decision date | +|---|---|---|---|---| +| OD-1b | Env-var projection path: downward-API or informer | Receiver owner | Operator (deployment cost), Security reviewer (RBAC scope) | Before M15 design doc opens | +| OD-2 | Process-rank regex default | Receiver owner | Operator (regex override workflow) | Before M15 alpha | +| OD-8 | Bench harness coupling with M5 | M5 owner | M15 receiver owner | Before M5 starts harness work | +| OD-9 | Filelog feature-gate posture | Receiver owner (BA-1 conditional) | Upstream-tracking lead | After OD-11 lands | +| OD-10 | Binary-size delta measurement | Receiver owner | Project lead (binary-size budget) | After BA-1 vs BA-2 spike | +| OD-11 | Build-approach (BA-0..BA-4) | **Milestone-planning lead** | M16 owner, O2 stakeholder, Receiver owner | **Blocking; resolve in 1 week per §15.6** | +| OD-12 (new) | O4 namespace posture (hold/hedge/concede) | **O4 owner** per NORTHSTARS.md line 204 | Project lead, M13/M14/M18 owners (cross-receiver join contract) | Before any rubric edit affecting `gen_ai.training.*` | + +"OD-12" is added in this pass to make the namespace decision an +explicit open item rather than embedded in §7's narrative. The §11 +R-1 row references this OD.
+ +### 15.6 How to resolve OD-11 in one week + +OD-11 (build approach) is the load-bearing open decision blocking +six other ODs (OD-1b informer choice, OD-3 rate-limit processor, OD-4 +OTTL vs native, OD-5 cursor backend, OD-9 feature gates, OD-10 binary +size). The decision tree below collapses the stakeholder meeting to +about three working days plus one async review. + +**Day 1: BA-1 feasibility spike (1 day).** +Owner: Receiver implementer. Goal: prove or disprove that a +`pipeline.OTelReceiverFactory` adapter can wrap an upstream +`receiver.Factory` cleanly. Deliverable: 200-LoC adapter skeleton +that satisfies the tracecore `pipeline.ReceiverFactory` interface and +delegates to upstream-OTel calls. Pass = adapter compiles and unit-tests +against a no-op upstream factory. Fail = field-mapping mismatch that +requires invasive changes to `internal/pipeline/`. Per-day-cap: if the +spike takes >1 day, BA-1 is implicitly downgraded. + +**Day 2: M16 owner async sign-off (~2 hours total).** +The doc has the evidence (MILESTONES.md lines 377-395, §15.1 footnote) +that M16 *could* go either tracecore-native or upstream-`prometheusreceiver`-via-BA-1. +Ask M16's owner: "If BA-1 lands as M15 infra, would you build M16 on +top of it, or build M16 native?" Two-line answer is sufficient. If +yes, BA-1 amortizes (and is the recommended path). If no, BA-1 amortizes +only to M15 and BA-2 is the cheaper alternative.
+ +**Day 3: stakeholder decision (1 hour).** +M15 receiver owner + Project lead + O2 stakeholder + (optional) M16 +owner. Decision matrix: + +| M16 owner says | BA-1 spike result | Decision | +|---|---|---| +| Will use BA-1 | Pass | **BA-1.** Amortization confirmed; ship adapter as M15-precursor work. | +| Will use BA-1 | Fail | **BA-2.** Adapter is too costly; port `fileconsumer` natively. M16 also goes native. | +| Will go native | Pass | **BA-2.** Adapter only serves M15; not worth standing infrastructure for one user. | +| Will go native | Fail | **BA-2 or BA-3.** Both options for M15-only-cost; lean BA-2 for prior-art benefit. | +| No answer | Pass | **BA-2 with adapter spike preserved.** Future M16 owner can retroactively adopt. | +| No answer | Fail | **BA-2.** Default to the tractable option. | + +**BA-0 and BA-4 as escape hatches.** +- BA-0 (sidecar otelcol) becomes the recommended path if the Day 1 + spike fails AND O2 stakeholder accepts the additional pod-per-node + footprint AND operator UX cost is acceptable. Lower likelihood; + needs explicit stakeholder agreement. +- BA-4 (defer M15) becomes the recommended path if the Day 1 spike + succeeds but Day 2 + Day 3 leave BA-1 amortization ambiguous AND + milestone-planning lead prefers to ship adapter-as-infrastructure + first. Lower likelihood; needs explicit milestone-planning sign-off. + +**Day 4 (async): write the design-doc resolution.** +Receiver owner drafts the single-paragraph "we chose X because Y" +section that locks the choice. Reviewers async-sign. Design phase +proceeds. + +**Fallback rule for stakeholder unavailability (per Reviewer B P1):** +If M16's owner is unavailable within the Day 2 window, the Receiver +owner defaults to BA-2 (port `fileconsumer`). Rationale: BA-2's cost is +borne by tracecore unilaterally; M16 owner can later retroactively +adopt the same model. 
Choosing BA-2 by default never blocks M16; the +inverse (defaulting to BA-1) commits adapter infrastructure that may +not be amortized. + +### 15.4 Why this wasn't caught earlier + +The round-1 internal-repo agent surveyed the k8sevents receiver +structurally (lifecycle, config, factory pattern, RBAC golden) but +**did not surface that the `pipeline.ReceiverFactory` interface is +tracecore-owned, not upstream-OTel-derived**. The factory.go file +header comment names "OTel component.Settings shape at v1.55.0" as the +mirror reference, which made the interface look like it might be the +upstream one. Lesson: when relaying interface-membership claims through +a sub-agent, verify the actual import path of the interface, not just +its shape. + +This is exactly the kind of layer-of-indirection error my §1 confidence +self-assessment flagged for §8 ("trusted the agent's report, not read +the source"). Reading the source on the confidence-raising pass caught +it. + +### 15.5 Updated confidence + +| Section | Pre-correction | Post-correction | +|---|---|---| +| §2 CRI format | 90% | 90% | +| §3 Rotation mechanics | 90% | 90% | +| §4 Tailer strategy | 75% | 75% | +| §5 Build approach (Depend) | 70% | **20%** (recommendation was based on a false premise) | +| §6 Pod attribution | 80% | 80% (substance unchanged; implementation owner shifts) | +| §7 SemConv namespace | 90% | 90% (re-verified naming.md) | +| §8 Internal repo prior art | 65% | **88%** (source read; agent claims validated; interface ownership corrected) | +| §11 Rubric edits | 85% | **75%** (R-6 conditional; R-3 unchanged; others fine) | +| §13.11 Pipeline config | 80% | **35%** (illustrative-only) | +| §14 OTTL recipe | 50% | **20%** (cannot run in tracecore's pipeline as-shown) | +| §15 Build-approach correction | n/a | 85% (load-bearing new finding) | + +Overall: confidence shifts from ~75% to ~70%. The gain in §8 is more +than offset by the drops in §5, §11, §13.11, and §14 that follow from +the build-approach correction. The doc is now
The doc is now +substantively more *correct*; the lower headline number reflects honest +re-assessment, not regression. + +## 16. Security threat model (deferred from §9) + +§9 named the chart-and-RBAC additions M15 requires but did not reason +about the trust implications. The adversarial review correctly flagged +this as a load-bearing gap. The threats below are not exhaustive; they +are the surface a design-doc threat-model section should expand on. + +### 16.0 Adversary model + +The threat-model lens for M15. Without a named adversary, "mitigation" +is performative. + +**Assets:** +1. Log content from every pod on the node (read). +2. Pod-spec env vars (read, possibly secret-bearing) via the informer + if OD-1b Path B is chosen. +3. Cursor state at `/var/lib/tracecore/container_stdout/` (read/write, + host-local). +4. The tracecore binary itself on each node. + +**Trust boundary:** +The tracecore-binary process running as the DaemonSet pod is inside +the trust boundary. Anything outside the binary's process address +space, including all co-tenant pods, all in-container processes of +non-tracecore pods, and the apiserver, is outside. + +**Adversaries:** + +| ID | Adversary | Capability | Goal | +|---|---|---|---| +| A-1 | **Co-tenant pod, no node-level access** | Can run arbitrary code in a sibling pod on the same node. Cannot read tracecore's process memory or `/var/lib/tracecore`. | Read another pod's logs via M15 as a confused-deputy attack. | +| A-2 | **Pod with crafted name** | Can request to create a pod with a name that contains shell metacharacters / path separators (Kubernetes name rules prevent most of this, but not all). | Trigger path-traversal in M15's filename parser. | +| A-3 | **In-container reader of FD 1** | Can run `cat /proc/1/fd/1` or `tee` inside their own container, triggering containerd #11149. | Silently drop their own logs to evade M15-based detection. Within their own trust boundary; not an attack on others. 
| +| A-4 | **High-volume logger** | Can emit 1M+ lines/s to stdout. | Exhaust receiver resources (rate-limit drops, fingerprint cardinality, cursor-write FS pressure). DoS the receiver or sibling tracecore-binary functions. | +| A-5 | **Compromised tracecore image** | Replaced the published image SHA. | Exfiltrate every log on every node. Full compromise of all assets. | +| A-6 | **Compromised supply chain** | Injected malicious dep into tracecore's `go.mod` (filelog, `pkg/stanza`, bbolt, fsnotify, etc.). | Same as A-5 but with smaller blast radius (one dep). | +| A-7 | **Operator with kubeapi access** | Can patch the DaemonSet pod's env or config. | Disable the receiver, alter the namespace allowlist, exfiltrate cursor. Equivalent to legitimate admin; out-of-scope by definition. | + +**Out of scope:** node-level attackers (already have root on the node; +M15 adds no surface), apiserver attackers (compromised cluster control +plane is a project-level concern, not M15-local), DoS against the +kubelet itself (not M15's path). + +**Mitigations matrix:** + +| Adversary | Primary mitigation | Residual risk | +|---|---|---| +| A-1 | Co-tenant cannot read M15's process or `/var/lib/tracecore` because file permissions + PSS `restricted` prevent privilege escalation. The receiver does not expose an inbound network surface that a co-tenant could query. | Low. Co-tenant can still read its own logs and the kubelet log surface they were already allowed to see. | +| A-2 | §2 "split on last two underscores" defensive parse + Kubernetes name validation (RFC 1123 label). | Low. A non-conformant CRD-created object would have to bypass kubelet's own object validation, which is a cluster-level break. | +| A-3 | Cannot be mitigated at M15 (root cause is in containerd). README enumerates the failure. | Medium for self-tee pattern; low for standard workloads. 
| +| A-4 | Per-key token bucket (OD-3), bounded channel (k8sevents pattern), cursor compaction cap (§16.2), informer cardinality cap (§13.4 + §16.3). | Sustained DoS at world-size > rate-limiter capacity degrades the receiver gracefully (drops + degraded mode), does not crash. | +| A-5 | M3 reproducible-build + SBOM + cosign chain. Image SHA pinning at deploy time. | High by design; the binary's compromise compromises all assets. No M15-local mitigation. | +| A-6 | `go.mod` checksum DB, `pkg/stanza` version pin, Renovate-style automated dep audit. | Medium. Standard supply-chain risk. | + +### 16.1 Cluster-wide log-read surface from `/var/log` hostPath + +A read-only `hostPath: /var/log` mount on the tracecore DaemonSet +gives the receiver process **read access to every container's stdout +on the node**, including kube-system pods (kube-apiserver client +errors, kubelet, scheduler), every tenant's workload, and any sidecar +that logs secrets to stdout (token-rotator pods, image-pull-secret +controllers, mTLS-injection sidecars). On a multi-tenant cluster this +is **read-equivalent to cluster-admin for log content**. Effective +controls: + +- **Namespace allowlist** at the receiver level + (`include_namespaces` config), even though the filesystem read is + unrestricted. The receiver MUST filter at the source-of-truth level + (the file path it opens), not just at the emission level, to keep + records out of the in-process channel entirely. +- **Document the trust boundary** explicitly in the receiver README: + operators deploying M15 are granting the receiver-binary equivalent + of read-only access to every container's logs. Single-tenant + clusters: low risk. Multi-tenant: needs policy review. +- **Avoid `/var/log/containers/`** for tailing if it adds symlink- + resolution attack surface. Direct `/var/log/pods/**/*.log` reads + are simpler and equally functional per §2. +- **No supplementary capabilities.** No `CAP_SYS_PTRACE`, no hostPID, + no privileged container. 
The file read works at the kubelet's + log-group fsGroup with no escalation. + +### 16.2 hostPath write surface + +`/var/lib/tracecore/container_stdout/` (cursor persistence) with +`DirectoryOrCreate` writes to host filesystem **without size limits**. +Runaway bbolt growth (e.g. corrupted DB triggering rebuild loops, +fingerprint cardinality explosion under churn) can fill host root FS, +which evicts every pod on the node. Effective controls: + +- **Bound cursor DB size**. Either via filelog's `compaction.on_start` + (BA-1) or via a tracecore-native LRU eviction policy capped at a + configurable byte limit (BA-2/3). Default 100 MiB on a node-local + bbolt is a reasonable starting budget. +- **emptyDir is not viable** for this volume because cursor must + survive pod restart. The hostPath is structurally required. +- **Surface the FS-usage metric** to the receiver's self-telemetry + surface so tracecore alerting (not just host-side) can catch growth + before it evicts pods. + +### 16.3 Pod-list/watch RBAC scope + +Under §15.1 BA-2/BA-3 the receiver embeds a Pod informer for env-var +projection (OD-1b Path B) or k8sattributes-equivalent metadata +hydration. The RBAC scope choice is load-bearing: + +- **Cluster-wide pod list/watch** is the simplest and lets the receiver + attribute logs from any pod whose container the tailer sees. It + also means the receiver SA can read every PodSpec in the cluster, + including env vars that may contain secrets (Kubernetes does not + prevent operators from putting tokens in env). Trust boundary + equivalent to cluster-admin read on PodSpec. +- **Namespace-scoped** requires one informer per allowed namespace, + more RBAC machinery, and breaks the "single SharedInformer per node" + pattern from §13.4. But it caps the blast radius if the receiver is + compromised. 
+- **Node-scoped via FieldSelector** is the right answer, mirroring the + rubric: `FieldSelector=spec.nodeName=$NODE_NAME` on a cluster-wide + pod informer reduces apiserver load and receiver attack surface to + pods on this node. RBAC remains cluster-wide list/watch (k8s does + not scope by node), but watched data is filtered server-side. + +### 16.4 seccomp / AppArmor delta from M5b + +M5b ships a Pod Security Standard (PSS) `restricted` policy with the +DaemonSet running `runAsNonRoot`, `RuntimeDefault` seccomp, +`allowPrivilegeEscalation: false`. M15's deltas: + +- **fsGroup matching kubelet log group.** Distros vary (often `root` + via 0, sometimes a kubelet-specific group). Configurable in + `values.yaml` with a default sentinel and an operator-override. +- **hostPath read-only `/var/log`** is not permitted by PSS + `restricted` (nor by `baseline`): both profiles forbid `hostPath` + volumes, and the `restricted` list of permitted volume types is + fixed and excludes it. Operators will need a policy exception for + the tracecore DaemonSet, which is the practical cost of any + node-local log tailer. +- **No `procMount: Unmasked`, no `hostPID`, no `hostIPC`**. M15 stays + inside `restricted` modulo the hostPath exception. + +### 16.4a Env-var redaction (OD-1b Path B only) + +If OD-1b Path B (tracecore informer reads Pod env) is selected, +`Pod.spec.containers[].env` is a credential-leak surface. Kubernetes +does not prevent operators from setting `AWS_SECRET_ACCESS_KEY`, +`DATABASE_PASSWORD`, or `OPENAI_API_KEY` as PodSpec env. If M15 +naively projects every env var, those values land in attributes +exported downstream. + +Mitigations: + +- **Allowlist mode (recommended).** Operators name the env vars to + project (e.g., `["RANK", "WORLD_SIZE", "LOCAL_RANK", + "TORCHELASTIC_RUN_ID", "JOB_ID"]`). Default empty; missing env vars + silently skipped. Cannot leak unnamed env.
+- **Pattern blocklist (defense in depth).** Block any env-var name + matching `(?i).*(SECRET|TOKEN|PASSWORD|KEY|CREDENTIAL).*` even if + allowlisted; emit `IncError(KindEnvRedacted)` for observability. +- **Value-length cap.** Env values are length-capped at, say, 256 + bytes before projection. Long values are nearly always credentials + or paths; rank-style values are at most a few bytes. + +**Body-content redaction is out of scope for M15.** A training script +that prints AWS credentials to stdout is the operator's responsibility +to handle (typically via container-level secret masking or +stdout-aware log scrubbers upstream). M15 README documents this as a +non-goal. + +### 16.5 Trust boundary summary + +Operators deploying M15 are granting the tracecore-binary on each node +the practical equivalent of: +1. Read-only access to every container's stdout (`/var/log`). +2. Read-only access to PodSpec for pods on the node (via informer). +3. Read/write access to a host-local bbolt at + `/var/lib/tracecore/container_stdout/`. + +None of these alone is a privilege escalation. Combined, the binary's +compromise would leak log content cluster-wide. The mitigation surface +is image provenance + the existing M3 reproducible-build + SBOM + +cosign chain. README must spell this out so operators don't deploy +M15 to multi-tenant clusters without a policy review. + +## 17. Failure-mode coverage (deferred from §13) + +### 17.0 Receiver-runtime contract + +**Asserted: every failure mode below preserves the +`pipeline.Receiver` runtime contract.** Concretely: + +- The receiver process does NOT panic out of any failure path. Per + PRINCIPLES §1 ("never crash the workload"), every per-file + goroutine wraps `defer recover()`; malformed CRI lines do not + cascade. +- The receiver does NOT block kubelet's SIGTERM beyond the 1s phase-1 + shutdown budget (per the rubric). 
+- The receiver continues to attempt forward progress in degraded + states; `SetDegraded(true)` is a signal, not a stop. +- The receiver MUST surface every failure mode via `IncError(kind)` + with a canonical or receiver-local typed `Kind` constant. Untyped + error-string passing is forbidden (cardinality risk per + `internal/selftelemetry/interface.go:200`). + +The subsections below enumerate the failure modes; each row in §17.1 +through §17.7 satisfies the contract above. Receiver tests must +verify the assertion holds for the named failure mode. + +The round-2 §13.1 dive covered kubelet's internal rotation failures +but did not cover **receiver-side failure surfaces under realistic +operational events**. The adversarial review correctly flagged the +gap. Each failure mode below should map to a row in the receiver's +RUNBOOK and (where the behavior is observable) a `Test*` identifier +in FAILURE-MODES.md per §6 doc-check rubric. + +### 17.1 Node drain (`kubectl drain`) + +The tracecore DaemonSet pod is evicted with +`terminationGracePeriodSeconds` (default 30 s). Receiver behavior: + +- Open per-file tailer goroutines should drain in-flight reads up to + the grace period, flush the cursor to disk, exit cleanly. +- Records already in the bounded channel are best-effort flushed to + the exporter pipeline within the grace period; overflow drops on + exporter back-pressure rather than blocking shutdown. +- Cursor durability across drain depends on whether the host volume + survives pod replacement. `hostPath` persists across pod restart on + the same node, so resume-after-drain is well-defined. The new pod + reads the cursor and resumes at the recorded offset; any records + the runtime wrote during the drain window are picked up. +- Receiver MUST NOT block kubelet's grace-period termination. SIGTERM + → 1 s phase-1 budget per the rubric is appropriate. + +Test target: `TestContainerStdout_GracefulShutdown`. 
+ +### 17.2 Kubelet restart (without node restart) + +Containerd / CRI-O keep writing during kubelet downtime (they own the +log fd via `ReopenContainerLog`'s last call). Rotation cannot happen +while kubelet is down (kubelet drives it per §13.1). When kubelet +restarts: + +- Pre-restart writes accumulated in the live `0.log` past + `containerLogMaxSize` are still readable to the tailer. +- Kubelet's first monitor tick after restart triggers a backlog of + rotations across all containers. Tailer's fingerprint-based rotation + detection should handle the burst without losing records. +- The receiver's Pod informer disconnects from apiserver during the + kubelet-restart window only if it goes through kubelet (most do + not; they go to apiserver directly). So pod attribution remains + uninterrupted as long as apiserver is reachable. + +Test target: `TestContainerStdout_KubeletRestartBacklog`. + +### 17.3 Pod deletion + +Kubelet removes `/var/log/pods/__/` after the +TerminationGracePeriod elapses. Tailer behavior: + +- The per-file tailer holding an open fd on `0.log` reads until EOF + (POSIX rename / unlink leaves the fd valid). +- After EOF, the tailer must close the fd, remove the corresponding + cursor entry from bbolt / JSON, and exit. +- The directory-deletion event from the Pod informer is the canonical + trigger to garbage-collect the cursor. Without GC the cursor file + grows unboundedly under pod churn. + +Test target: `TestContainerStdout_PodDeletionCursorGC`. + +### 17.4 Pod eviction + +A subset of pod deletion: the pod object is removed but the container +may have produced its final log lines just before eviction. The +receiver needs to drain the file before the directory is removed. +This is the M19 "pod evicted" pattern's reliance on M15 — M19 must see +the tail of the evicted container's stdout to verify why it died. + +- Kubelet's deletion sequence: stop containers → delete container + filesystem → delete pod-log directory. 
There is a short window + between the last write and the directory removal. +- Tailer's poll interval (default 200 ms in filelog) is fast enough + to catch the tail bytes if it gets one more tick before the + directory removal. +- Edge case: very-short-lived containers that produce their last log + in <1 poll interval may have records that exist in `0.log` but the + tailer never reads them. Mitigation: fast-flush on directory-removal + event from the informer; force one final read on EOF before + cursor GC. + +Test target: `TestContainerStdout_PodEvictionTailFlush`. + +### 17.5 Sibling-receiver interaction on `/var/lib/tracecore/` + +k8sevents (M10), kernelevents (M9), and future receivers share +`/var/lib/tracecore/`. M15's cursor lives at +`/var/lib/tracecore/container_stdout/`. Concerns: + +- **Filesystem permissions.** Each receiver should own a subdirectory + with no cross-receiver write. Confirmed pattern matches: k8sevents + has no cursor today; if it adds one under M10's evolution, it goes + under `k8sevents/`, not the root. +- **bbolt single-writer per file** (per §13.3). Multiple receivers + writing to **different** bbolt files in the same directory is fine. + Multiple receivers sharing a single bbolt file is not. Cursor-per- + receiver-subdirectory keeps the lock surface trivially isolated. +- **Backup-on-corruption renames** (`..backup`) can + produce filesystem-level garbage. Receiver should clean up backups + older than N days (configurable, default 7). + +Test target: `TestContainerStdout_SiblingReceiverIsolation`. + +### 17.6 Container-runtime crash / restart + +If containerd or CRI-O crashes and is restarted by systemd: + +- The runtime's open fd to `0.log` is closed at crash. Bytes in the + shim's stdout pipe buffer that hadn't yet been written are lost + (this is mostly orthogonal to #11149 but related — same shared-pipe + surface). 
+- After restart, the runtime calls neither `ReopenContainerLog` nor + re-stat; it reopens via the path it had cached. The tailer sees a + brief absence of new writes followed by a resumption. + +Test target: not feasible at unit-test level; covered by chaos.yml +integration if at all. + +### 17.7 Filesystem full at `/var/lib/tracecore/` + +If the host volume backing the cursor directory is full: + +- bbolt write fails. Filestorage's recovery path under BA-1 is + recreate-from-corruption; under BA-2/3, tracecore-owned cursor + format must handle ENOSPC gracefully. +- Receiver MUST surface this via `IncError(KindCursorWriteFailed)` + and continue tailing in-memory; on next successful cursor write, + the offsets catch up. Loss-on-restart in this regime is bounded + by the time the FS was full. +- Alerting on `KindCursorWriteFailed` is essential because the only + observable downstream symptom otherwise is "records re-played after + pod restart" (cursor not durable). + +Test target: `TestContainerStdout_CursorWriteFailureGraceful`. + +## 18. Rollout posture + +This section addresses the design-team / project-lead gap flagged by +Reviewer B P1-13. + +### 18.1 Stability stage and default + +M15 ships at **alpha**, `receivers.containerstdout.enabled: false` by +default in the Helm chart. Operators opt in per the alpha-receiver +contract documented in `docs/STABILITY.md` (if missing, document the +contract as part of M15's design doc): + +- Backward-compat is opt-in (per PRINCIPLES §11). +- Config field names may rename through a 1-minor-version + deprecation; new names ship alongside old, old emits warning, old + is removed on next minor. +- Attribute names follow the §7 namespace decision; if R-1 is later + conceded, deprecated emit happens via a `tracecore_compat` processor + with a 1-version overlap window. + +### 18.2 Coexistence with sibling receivers + +M15 coexists with kernelevents (M9, shipped) and k8sevents (M10, +alpha) on the same DaemonSet pod. 
Boundaries: + +- **Filesystem scope:** kernelevents reads `/dev/kmsg` + journald; + M15 reads `/var/log/pods`; k8sevents reads apiserver only. Zero + shared file surface. +- **Cursor directory:** M15 owns + `/var/lib/tracecore/container_stdout/`. kernelevents and k8sevents + do not have cursors today. Reserved sibling subdirectories prevent + any future cross-receiver write collision. +- **Self-telemetry namespace:** all three receivers emit + `tracecore_receiver_*` metrics partitioned by `receiver_id`. Per + k8sevents' `KindBackpressureDrop` / `KindWatch` pattern (§8), + M15 introduces `KindRotationStalled` / `KindCursorWriteFailed`; + these MUST NOT alias any kernelevents or k8sevents kinds — verify at + PR time by grep. +- **RBAC namespacing:** each receiver ships its own ClusterRole with + a unique name. M15 introduces + `tracecore-containerstdout-clusterrole`. + +### 18.3 Upgrade path across `pkg/stanza` BREAKING changes (BA-1 only) + +Under §15.1 BA-1, M15 depends on upstream `pkg/stanza` evolution. +Per §13.3, ~3 BREAKING changes per 18 months, all in adjacent code +(OTTL, windows event logs), none in `fileconsumer` surface. Tracecore's +upgrade contract: + +- Pin contrib version in `go.mod`. +- CHANGELOG entry at every contrib bump describes operator-visible + changes. +- Feature-gate posture (OD-9): tracecore opts in to stable gates + (`filelog.decompressFingerprint`) by default; tracks alpha gates + (`filelog.protobufCheckpointEncoding`) but does not flip until + upstream marks beta. Default tracking matches upstream defaults to + minimize divergence. + +Under BA-2, M15 is decoupled from contrib churn at the receiver level +but inherits any `fileconsumer` algorithmic improvements only via +manual port. Document the upstream-port cadence in the receiver +README. 
+
+### 18.4 Migration from alternative loggers
+
+Operators currently using Fluent Bit / Vector / Promtail to tail
+`/var/log/pods` can run M15 alongside without conflict (read-only
+mounts, distinct cursor paths). Migration to M15-only is a deployment
+choice, not a contract requirement. This document does not currently
+recommend M15 as a replacement for general-purpose log shippers (it
+is training-observability-focused per O1 scope); a future RFC could
+revisit this.
+
+## 19. Alerts catalog
+
+The §17 failure modes name test identifiers; this section names the
+corresponding alerts so operators can wire monitoring at deploy time.
+Per docs/STYLE-docs.md §5, every alert binds to a RUNBOOK section.
+
+| Alert name | Trigger | Severity | RUNBOOK | Notes |
+|---|---|---|---|---|
+| `M15RotationStalled` | `tracecore_receiver_errors_total{receiver_id="containerstdout", kind="rotation_stalled"} > 0` for 5m | Warning | RUNBOOK § Rotation stall | Kubelet rotation has not happened in 30s after `0.log` exceeded `containerLogMaxSize`. |
+| `M15CursorWriteFailed` | `tracecore_receiver_errors_total{receiver_id="containerstdout", kind="cursor_write_failed"} > 0` for 1m | Warning | RUNBOOK § Cursor write failure | Host FS at cursor dir failed; in-memory tailing continues but durability is lost. |
+| `M15BackpressureDrop` | `rate(tracecore_receiver_errors_total{receiver_id="containerstdout", kind="backpressure_drop"}[5m]) > 100` | Warning | RUNBOOK § Backpressure drop | Per-key rate limit is dropping records; investigate noisy pod or raise budget. |
+| `M15Degraded` | `tracecore_receiver_degraded_seconds_total{receiver_id="containerstdout"}` increasing | Warning | RUNBOOK § Degraded mode | Receiver in degraded state; any failure mode from §17 could be the cause. |
+| `M15PodInformerDisconnected` | `tracecore_receiver_errors_total{receiver_id="containerstdout", kind="watch"} > 0` for 2m | Warning | RUNBOOK § Pod informer disconnect | apiserver unreachable; attribution falls back to filepath-only. |
+| `M15Cardinality` | `tracecore_receiver_errors_total{kind="cardinality"}` rate > 0 | Critical | RUNBOOK § Cardinality cap | Fingerprint set or rank set exceeded cap; data loss. |
+| `M15FileStorageCorruption` *(BA-1 only)* | log-string match on `"Database corruption detected"` from `extension/file_storage` | Critical | RUNBOOK § Filestorage corruption | bbolt rebuild; offsets lost; resume-from-EOF until next write. |
+| `M15FdLeak` | host metric: open fds for tracecore process > 2× current pod count for 5m | Warning | RUNBOOK § fd hygiene | Per rubric line 372; investigate slow-closing tailers. |
+| `M15HighDroppedLines` | `rate(tracecore_dropped_lines_total{receiver_id="containerstdout"}[5m]) > 1000` | Warning | RUNBOOK § Rate-limit drops | Per-pod rate limit hit sustained; tune budget or investigate. |
+
+**Operator-side runbook prose** must be drafted as part of M15's
+RUNBOOK.md per the k8sevents template. Examples of triage steps per
+alert live in `components/receivers/k8sevents/RUNBOOK.md`.
+
+## 20. Overhead-budget methodology (deferred from OD-8)
+
+MILESTONES.md line 368 gates M15 at ≤0.10% CPU, ≤20 MB RSS, ≤0.3 Mbps
+egress. Without a measurement methodology, these numbers are
+unfalsifiable.
+
+### 20.1 Workload spec
+
+A reference workload that exercises M15's hot paths:
+
+- **Topology.** Single-node kind cluster, tracecore DaemonSet, 100
+  fixture pods.
+- **Per-pod log rate.** Configurable; default 100 lines/s × 256 B avg
+  line = 25.6 KB/s per pod → 2.56 MB/s aggregate on the node.
+- **Rotation cadence.** Each pod hits `containerLogMaxSize` ~ every
+  6.5 minutes at default. Aggregate ~15 rotations/min cluster-wide.
+- **Training-pattern coverage.** 10% of pods emit dataloader-format
+  lines (`time:` / `data:`); 10% emit JSON-structured logs; 80% emit
+  free-text.
+- **Pod churn.** 1 pod restart every 30s (kubelet-driven).
+- **Duration.** 60-minute steady-state measurement after a 10-minute
+  warmup.
+
+### 20.2 Measurement points
+
+- **CPU%:** cgroup-derived process CPU% averaged over the 60-min
+  window. Source:
+  `/sys/fs/cgroup/cpu.stat` for the tracecore container.
+- **RSS:** `/proc/self/status` `VmRSS` field sampled every 30s,
+  reported as p50, p95, and p99 (§20.3 gates on p95 and p99).
+- **Egress Mbps:** OTLP-out bytes summed at the exporter, converted
+  to megabits per second.
+
+### 20.3 Pass/fail
+
+- **CPU.** Window-average ≤0.10% of one CPU core. Hard fail at any
+  30s window exceeding 0.50%.
+- **RSS.** p95 ≤20 MB AND p99 ≤30 MB. Hard fail at any 30s
+  sample >50 MB.
+- **Egress.** Window-average ≤0.3 Mbps. Hard fail at any 30s
+  window >1 Mbps.
+
+### 20.4 Capacity-model extrapolation
+
+The reference workload sizes the receiver at ~25 MB RSS for 100 pods
+under the assumed rates. Naive linear extrapolation (not
+empirically validated): each tailed pod adds ~200 KB of `fileconsumer`
+state plus ~0.001% CPU, so a 200-pod node would land at ~50 MB RSS /
+0.20% CPU, over the rubric's per-pod budget. Either raise the
+budget proportionally to pod density, or design the receiver for an
+explicit 100-pod assumption. Recommend the latter (PRINCIPLES §13:
+do not over-engineer for the wide tail). Document the assumption in
+the receiver README.
+
+### 20.5 Owner
+
+M5 owns the harness. M15 contributes the workload spec (this section)
+and the fixture pods. The 60-min steady-state run is M5's CI gate
+once the harness ships.
+
+## 21. Follow-ups beyond research scope
+
+Items raised by Reviewers A / B / C that cannot be closed by more
+research alone. Tagged by requirement type so the design phase and
+project-lead can route each to the right owner.
+
+### 21.1 Requires stakeholder decision
+
+| # | Item | Owner | Notes |
+|---|---|---|---|
+| 1 | Resolve OD-11 (BA-0..BA-4 build approach) | Milestone-planning lead | Day-1-spike + Day-2 M16-owner async + Day-3 stakeholder meeting per §15.6. |
+| 2 | Resolve OD-12 (`gen_ai.training.*` namespace posture: hold / hedge / concede / re-activate O4) | O4 owner per NORTHSTARS.md line 204 | Load-bearing on §7.4 negative finding (no upstream PR exists). |
+| 3 | Confirm M16's build approach (will M16 use BA-1's adapter or go native?) | M16 owner | Two-line async answer suffices per §15.6 Day 2. |
+| 4 | Confirm O2 stakeholder accepts BA-0 sidecar pod footprint (if BA-1 spike fails) | O2 stakeholder | Conditional; only if §15.6 Day 1 spike fails. |
+| 5 | Sign off on §18.4 migration posture (no recommendation to replace Fluent Bit / Vector / Promtail) | Project lead | Statement of scope, not architectural. |
+| 6 | Decide OD-1b Path A vs Path B at design time | Receiver owner | Operator-facing trade-off (downward API requires opt-in vs informer requires RBAC). |
+
+### 21.2 Requires kind cluster / fixture infrastructure (no production data, no GPU)
+
+| # | Item | Notes |
+|---|---|---|
+| 7 | Empirical binary-size delta of `filelogreceiver` import | Needs `go build` with filelog linked; requires committing to BA-1 path. Estimated 15–30 MB unstripped (§13.3). |
+| 8 | Cross-pod fingerprint-collision property test | kind cluster + N pods with identical first-line prefixes, force rotation, assert no offset cross-pollination (§12 #13). |
+| 9 | Real-cluster overhead measurement at §20 workload spec | kind cluster + 100 fixture pods + 60-min steady-state. Validates rubric line 368. |
+| 10 | Containerd vs CRI-O reopen timing window | Instrumented kind cluster with both runtimes; measure rename→reopen gap (§13.1 follow-up). |
+| 11 | Container-runtime crash recovery test (§17.6) | chaos.yml integration; SIGKILL containerd, observe receiver behavior. |
+| 12 | OTTL recipe validation against `otelcol validate` (BA-1 only) | Download `otelcol-contrib`, instantiate the §13.11 pipeline, verify all components resolve. Only meaningful under BA-1 (under BA-2/3 the recipe is reimplemented natively). |
+| 13 | `lsof`-golden fd-hygiene test | kind cluster + pod churn; assert `lsof -p $(pidof tracecore)` shows ≤ 2× pod-count entries (rubric line 372). |
+
+None of these require production data (synthetic kind-cluster fixtures
+suffice). None require GPU.
+
+### 21.3 Requires external action / contribution
+
+| # | Item | Owner | Notes |
+|---|---|---|---|
+| 14 | File the upstream `gen_ai.training.*` draft PR | O4 owner | §7.4 finding: no PR exists as of 2026-05-19. Closes the §7.3 "hold the bet" posture's evidence gap. |
+| 15 | Engage semantic-conventions-genai issue #88 (`rl.*` proposal) for scope overlap | O4 owner / O7 governance | Determines whether training observability lands under `rl.*` or `gen_ai.training.*` upstream. |
+| 16 | Small-platform training-namespace survey (ClearML, Determined.AI, Skypilot, Anyscale) | Research follow-up | Reviewer C P2-22; weakens or confirms §13.12 "no shared namespace" finding. |
+| 17 | Path-traversal hardening analysis on pod-name filepath parse | Security reviewer | Reviewer C P2-23; defensive last-two-underscores split is documented (§2) but no explicit attack-tree. |
+
+### 21.4 Consciously deferred (out-of-scope for M15 v0)
+
+| # | Item | Reason |
+|---|---|---|
+| 18 | MPI `OMPI_COMM_WORLD_RANK` extraction | Requires `CAP_SYS_PTRACE` + hostPID, conflicts with M5b minimal-privilege policy. Future receiver iteration. |
+| 19 | PodLogs API KEP tracking | KEP-3059 dead; no replacement filed. Re-check in ~6 months (§13.9). |
+| 20 | Body-content credential redaction | Operator responsibility per §16.4a. M15 README documents as non-goal. |
+
+### 21.5 What about GPU? Production data?
+
+**No M15 follow-up requires GPU.** M15 tails container stdout; the
+GPU surface (DCGM, NCCL FlightRecorder, Kineto) lives in Lane 6,
+which is GPU-hardware-gated and does not block M15.
+
+**No M15 follow-up requires production data.** Every measurement need
+above is satisfiable on a kind cluster with synthetic fixtures, per
+PRINCIPLES §6's "test against real components, not mocks" balanced
+against the M3 reproducible-build constraint that production data
+must not enter CI.
+
+## Sources
+
+Primary references consulted during this research pass (all current as
+of 2026-05-19):
+
+- Kubernetes logging architecture:
+- CRI logging design proposal:
+- Kubelet log manager source: `pkg/kubelet/logs/container_log_manager.go`
+- Kubelet symlink construction: `pkg/kubelet/kuberuntime/legacy.go`
+- Kubernetes object naming rules:
+- containerd #11149:
+- OTel filelog receiver:
+- OTel container operator:
+- OTel `pkg/stanza/fileconsumer`:
+- OTel `k8sattributesprocessor`:
+- OTel SemConv naming rule:
+- OTel SemConv GenAI:
+- RL SemConv proposal:
+- torchrun env vars:
+- Kubeflow PyTorchJob env injection:
+- Kubeflow MPI Operator:
+- torchx Kubernetes scheduler:
+- Ray Train env injection:
+- JobSet concepts:
+- torchvision MetricLogger:
+- detectron2 events:
+- PyTorch Lightning SimpleProfiler:
+- NeMo TimingCallback:
+- HF Trainer speed_metrics:
+- MosaicML Composer SpeedMonitor:
+
+Round-2 additions:
+
+- CRI v1 proto (no log RPC):
+- kubelet `/containerLogs` handler: `pkg/kubelet/server/server.go`, `pkg/kubelet/kuberuntime/kuberuntime_container.go`, `pkg/kubelet/kuberuntime/logs/logs.go`
+- KubeletConfiguration validation: `pkg/kubelet/apis/config/validation/validation.go`
+- OTel filelog README:
+- OTel `file_storage` extension:
+- OTel container operator:
+- otelcol-k8s distribution manifest:
+- contrib CHANGELOG (stanza/filelog churn):
+- lit-GPT pretrain.py:
+- lit-GPT issue #1110 (sample logs):
+- lit-GPT issue #1607 (sample logs):
+- torch/distributed/run.py (standalone branch):
+
+Round-3 additions:
+
+- Kubelet defaults:
+- KEP-3059 (closed):
+- `file_storage` extension source: `extension/storage/filestorage/extension.go`, `factory.go`
+- `k8sattributesprocessor` overlap behavior:
+- Triton metrics:
+- SageMaker CloudWatch monitoring:
+- Bedrock model-customization monitor:
+- W&B GPU asset source:
+- MLflow system metrics:
+- OTel process registry (no `distributed.*`):
+- OTel system-metrics spec (`system.*` host-only):
+- OTel community SIG list (no ML SIG):
+- OTTL ExtractPatterns / IsMatch:
+- transformprocessor README: