Merged
35 commits
c11d2db
[m15] containerstdout: scaffold package + doc.go
trilamsr May 20, 2026
3c53c68
[m15] containerstdout: receiver-local self-telemetry Kind constants
trilamsr May 20, 2026
e8dc840
[m15] containerstdout: typed Record + frozen SchemaURLv0 + wire-attri…
trilamsr May 20, 2026
6e06232
[m15] containerstdout: compile-time pin for M18/M19 join surface
trilamsr May 20, 2026
e26959b
[m15] containerstdout: Config struct + defaultConfig
trilamsr May 21, 2026
d34197f
[m15] containerstdout: Config.Validate + RE2 + ceiling checks
trilamsr May 21, 2026
ef2f6fc
[m15] containerstdout: Factory + components.yaml registration
May 21, 2026
b2ee5a2
[m15] containerstdout: CRI text-format parser + partial-line stitcher
May 21, 2026
264d971
[m15] containerstdout: pod-log path resolver
May 21, 2026
23efae2
[m15] containerstdout: AttributionSource interface + LRU cache
May 21, 2026
98c7ab6
[m15] containerstdout: informer-backed attribution source
May 21, 2026
f32c496
[m15] containerstdout: body-match rank fallback (process_rank_regex)
May 21, 2026
631d6e4
[m15] containerstdout: dataloader regex extractor
May 21, 2026
f607c0e
[m15] containerstdout: egress rate-limit token bucket
May 21, 2026
93e25c2
[m15] containerstdout: cursor persistence (atomic write + fsync)
May 21, 2026
7355272
[m15] containerstdout: stdlib file tailer (rotation+truncation)
May 21, 2026
e782842
[m15] containerstdout: pod informer (node + namespace scoped)
May 21, 2026
34a7472
[m15] containerstdout: receiver lifecycle wiring + degraded OR
May 21, 2026
4a27f5f
[m15] containerstdout: failure-mode test stubs (15 pins, Phase 13.6)
May 21, 2026
bfa8d83
[m15] containerstdout: 4 hot-path benchmarks (alloc/lookup/regex/RL)
May 21, 2026
de0e376
[m15] containerstdout: backfill 10 failure-mode tests (Phase 13.6)
May 21, 2026
7379422
[m15] containerstdout: emit plog construction + wire attrs
May 21, 2026
2f17079
[m15] containerstdout: JSON body auto-detect
May 21, 2026
fe7c384
[m15] containerstdout: per-tailer pipeline goroutine
May 21, 2026
31c6a9c
[m15] containerstdout: tailer pool + 15s lines/s flush
May 21, 2026
8c60d98
[m15] containerstdout: end-to-end integration tests
May 21, 2026
d1c8344
[m15] containerstdout: bump waitForTailer timeout to 5s
May 21, 2026
519501e
[m15] containerstdout: Helm chart (RBAC + DaemonSet + values)
May 21, 2026
ae967b0
[m15] containerstdout: RUNBOOK + FAILURE-MODES + README
May 21, 2026
5a0a762
[m15] containerstdout: prometheus alerts + conftest policy
May 21, 2026
5d0f391
[m15] MILESTONES: flip M15 acceptance criteria
May 21, 2026
149709c
[m15] containerstdout: replace em/en dashes per doc-check policy
May 21, 2026
c7f9408
[m15] MILESTONES: link PR #158
May 21, 2026
b80a80f
[m15] containerstdout: body-match enriches regardless of informer sync
May 22, 2026
4e248fb
[m15] containerstdout: replace em-dashes in new comments per doc-check
May 22, 2026
41 changes: 22 additions & 19 deletions MILESTONES.md
@@ -350,30 +350,33 @@ Alpha unified-source logs receiver covering L2 + L9 (kernel + system events). Ta

### M15. Container stdout receiver

- **Status:**
- **Status:** ☑ alpha (opt-in via `containerstdout.enabled=true`)
- **Depends on:** M1

**Functional rubrics:**
- Tails files matching `/var/log/pods/<namespace>_<podname>_<uid>/<container>/*.log`; parses CRI log format into `body`, `log.iostream`, `timestamp`; reassembles `P` (partial) into `F` (full) bounded by `max_log_size` (default 1 MiB; matches the OTel `container` stanza operator default per [`pkg/stanza/docs/operators/container.md`](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/operators/container.md)). (per https://kubernetes.io/docs/concepts/cluster-administration/logging/)
- Per-rank attribution: derives `gen_ai.training.rank` (canonical join key across receivers) from Pod env vars `RANK`/`WORLD_SIZE`/`LOCAL_RANK`, and `gen_ai.training.job.id` from `JOB_ID` *or* `TORCHELASTIC_RUN_ID` (canonical per [torchrun docs](https://docs.pytorch.org/docs/stable/elastic/run.html) "equal to the rendezvous run_id"); falls back to Pod labels `gen_ai.training.io/rank`, `gen_ai.training.io/job-id` (matches RFC-0009 / M13 for cross-receiver join consistency); missing → record emitted with `rank=unknown`.
- Structured-log auto-detection: first non-whitespace byte `{` triggers JSON parse; on success emits parsed fields as attributes capped at `max_attributes` (default 16); on failure or non-`{`, passthrough as `body`.
- **Dataloader-timing extraction (M18 feed):** PyTorch has no canonical dataloader-progress format (torchvision `MetricLogger` emits `time: ... data: ...`; detectron2 logs `data_time` as a scalar; neither prints `iter_time`). The receiver ships with a tracecore-placeholder regex (`dataloader_regex` config key) operators MUST override it to match their training stack. When a log line matches, receiver emits `tracecore.training.data_time_s` and `tracecore.training.iter_time_s` keyed by `gen_ai.training.rank` for M18's straggler detector to join on; schema lives in fixture.
- Pod-discovery: single per-node Pod informer with `FieldSelector=spec.nodeName=<NODE_NAME>` (env via downward API); new Pod → new file watcher within 5s p99 *(unverified)*.
- Rotation: kubelet rotates by renaming `0.log` → `0.log.<N>` and creating new `0.log`; receiver follows inode, not path; integration test asserts zero record loss.
- Straggler-pattern feed: emits per-rank line-rate (`tracecore.container.lines_per_s`) as derived metric on 15s window.
- Checkpoint persistence: cursor stored under `/var/lib/tracecore/container_stdout/cursor.json` (atomic rename); on restart resumes within 1 record of last-acknowledged position.
- **Degraded mode:** `/var/log/pods` missing or unreadable, Pod informer flap, or kubelet rotation breakage each set `Degraded()=true` with an `IncError("<kind>")` counter; receiver stays alive and recovers without restart; `FAILURE-MODES.md` row exists per failure mode. Rotation-stalled is detected when `0.log` size exceeds `containerLogMaxSize` for ≥30 s (3× kubelet's default `containerLogMonitorInterval` of 10 s, set at [`pkg/kubelet/apis/config/v1beta1/defaults.go`](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/v1beta1/defaults.go)); surfaced via `IncError("rotation_stalled")`.
- Tails files matching `/var/log/pods/<namespace>_<podname>_<uid>/<container>/*.log`; parses CRI log format into `body`, `log.iostream`, `timestamp`; reassembles `P` (partial) into `F` (full) bounded by `max_log_size` (default 1 MiB; matches the OTel `container` stanza operator default - per [`pkg/stanza/docs/operators/container.md`](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/docs/operators/container.md)). (per https://kubernetes.io/docs/concepts/cluster-administration/logging/)
- Per-rank attribution: derives `gen_ai.training.rank` (canonical join key across receivers) from Pod env vars `RANK`/`WORLD_SIZE`/`LOCAL_RANK`, and `gen_ai.training.job.id` from `JOB_ID` *or* `TORCHELASTIC_RUN_ID` (canonical per [torchrun docs](https://docs.pytorch.org/docs/stable/elastic/run.html) - "equal to the rendezvous run_id"); falls back to Pod labels `gen_ai.training.io/rank`, `gen_ai.training.io/job-id` (matches RFC-0009 / M13 for cross-receiver join consistency); missing → record emitted with `rank=unknown`.
- Structured-log auto-detection: first non-whitespace byte `{` triggers JSON parse; on success emits parsed fields as attributes capped at `max_attributes` (default 16); on failure or non-`{`, passthrough as `body`.
- **Dataloader-timing extraction (M18 feed):** PyTorch has no canonical dataloader-progress format (torchvision `MetricLogger` emits `time: ... data: ...`; detectron2 logs `data_time` as a scalar; neither prints `iter_time`). The receiver ships with a tracecore-placeholder regex (`dataloader_regex` config key) - operators MUST override it to match their training stack. When a log line matches, receiver emits `tracecore.training.data_time_s` and `tracecore.training.iter_time_s` keyed by `gen_ai.training.rank` for M18's straggler detector to join on; schema lives in fixture.
- Pod-discovery: single per-node Pod informer with `FieldSelector=spec.nodeName=<NODE_NAME>` (env via downward API); new Pod → new file watcher within 5s p99 *(unverified)*.
- Rotation: kubelet rotates by renaming `0.log` → `0.log.<N>` and creating new `0.log`; receiver follows inode, not path; integration test asserts zero record loss.
- Straggler-pattern feed: emits per-rank line-rate (`tracecore.container.lines_per_s`) as derived metric on 15s window.
- Checkpoint persistence: cursor stored under `/var/lib/tracecore/container_stdout/cursor.json` (atomic rename); on restart resumes within 1 record of last-acknowledged position.
- **Degraded mode:** `/var/log/pods` missing or unreadable, Pod informer flap, or kubelet rotation breakage each set `Degraded()=true` with an `IncError("<kind>")` counter; receiver stays alive and recovers without restart; `FAILURE-MODES.md` row exists per failure mode. Rotation-stalled is detected when `0.log` size exceeds `containerLogMaxSize` for ≥30 s (3× kubelet's default `containerLogMonitorInterval` of 10 s, set at [`pkg/kubelet/apis/config/v1beta1/defaults.go`](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/apis/config/v1beta1/defaults.go)); surfaced via `IncError("rotation_stalled")`.
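
The CRI parsing and `P`/`F` reassembly rubric above can be illustrated with a minimal sketch. It is not the receiver's actual parser: the names `criLine`, `parseCRILine`, and `stitcher` are hypothetical, and three behaviours are assumptions made for the example (only bare `P`/`F` tags are accepted, bytes past `max_log_size` are discarded, and the stitched record takes the timestamp of its final fragment).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// criLine is one entry of the CRI text log format:
// "<RFC3339Nano timestamp> <stdout|stderr> <P|F> <content>".
type criLine struct {
	Timestamp time.Time
	Stream    string // would map to log.iostream
	Partial   bool   // true for tag "P", false for tag "F"
	Content   string
}

// parseCRILine splits one line (without its trailing newline) into fields.
// Simplification: only the bare tags "P" and "F" are accepted.
func parseCRILine(line string) (criLine, error) {
	parts := strings.SplitN(line, " ", 4)
	if len(parts) < 3 {
		return criLine{}, errors.New("malformed CRI line: want timestamp, stream, tag")
	}
	ts, err := time.Parse(time.RFC3339Nano, parts[0])
	if err != nil {
		return criLine{}, fmt.Errorf("bad timestamp: %w", err)
	}
	if parts[1] != "stdout" && parts[1] != "stderr" {
		return criLine{}, fmt.Errorf("bad stream %q", parts[1])
	}
	if parts[2] != "P" && parts[2] != "F" {
		return criLine{}, fmt.Errorf("bad tag %q", parts[2])
	}
	content := ""
	if len(parts) == 4 {
		content = parts[3]
	}
	return criLine{Timestamp: ts, Stream: parts[1], Partial: parts[2] == "P", Content: content}, nil
}

// stitcher reassembles P fragments into one full record, keeping at most
// maxSize bytes. Discarding the overflow is an assumed policy.
type stitcher struct {
	maxSize int
	buf     strings.Builder
}

// push buffers a fragment and returns the full record once an F line arrives.
func (s *stitcher) push(l criLine) (criLine, bool) {
	if room := s.maxSize - s.buf.Len(); room > 0 {
		c := l.Content
		if len(c) > room {
			c = c[:room] // byte cut; may split a multi-byte UTF-8 rune
		}
		s.buf.WriteString(c)
	}
	if l.Partial {
		return criLine{}, false
	}
	// Assumption: the record carries the timestamp of its final fragment.
	out := criLine{Timestamp: l.Timestamp, Stream: l.Stream, Content: s.buf.String()}
	s.buf.Reset()
	return out, true
}

func main() {
	s := &stitcher{maxSize: 1 << 20} // 1 MiB, the documented default
	for _, raw := range []string{
		"2026-05-21T10:00:00.000000001Z stdout P step=10 ",
		"2026-05-21T10:00:00.000000002Z stdout F loss=0.42",
	} {
		l, err := parseCRILine(raw)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		if rec, ok := s.push(l); ok {
			fmt.Printf("%s %q\n", rec.Stream, rec.Content)
		}
	}
}
```

Because stdout and stderr fragments can interleave within one file, a real tailer would likely keep one such stitcher per stream rather than one per file.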

**Non-functional rubrics:**
- Overhead: ≤0.10% CPU, ≤0.3 Mbps egress sustained, ≤20 MB RSS. (per NORTHSTARS O2 "Container stdout (rate-limited)" row)
- Egress rate-limit: token-bucket per `(pod_uid, container)` at `rate=200 lines/s`, `burst=1000` (defaults; configurable via `egress_rate_limit:` config key); over-budget → `IncError("rate_limit_drop")` + sampled record with `tracecore.dropped_lines=N`; never blocks file reader.
- Multi-tenancy: `namespaces:` allowlist filters Pod discovery before file watch opened; per-namespace egress sub-budget configurable.
- Back-pressure: 1M-line burst from one rank MUST NOT block sibling streams; bounded per-file goroutine + bounded channel (1024); `goleak` test.
- File-handle hygiene: ≤2× pod-count open fds steady-state; closed within 30s of Pod `Terminated`; verified by `lsof` golden.
- Security: hostPath read-only mounts for `/var/log/pods` AND `/var/log/containers` (containerd symlink target — per https://github.com/containerd/containerd/issues/11149); runs as non-root with fsGroup matching kubelet log-group.
- Reliability caveat: containerd [#11149](https://github.com/containerd/containerd/issues/11149) (open upstream, last updated 2025-05-30) silently drops bytes from `0.log` when an in-container process reads its own FD 1 (e.g. application self-tee, sidecar reading `/proc/1/fd/1`). The mechanism is shared-pipe contention with containerd's log copier, not generic disk-I/O backpressure; standard workloads that do not read FD 1 are unaffected. RUNBOOK enumerates this failure mode; receiver does not claim universal lossless delivery from `0.log` alone.
- Panic recovery: per-file goroutines wrap `defer/recover`; malformed CRI line MUST NOT crash other watchers. (per PRINCIPLES §1)
- Shutdown: SIGTERM → 1s Phase-1 budget; in-flight reads abandoned cleanly; cursor flushed best-effort.
- ⧗ Overhead: ≤0.10% CPU, ≤0.3 Mbps egress sustained, ≤20 MB RSS. (per NORTHSTARS O2 "Container stdout (rate-limited)" row) *(unit benchmarks shipped - `BenchmarkHotPath`, `BenchmarkLookup`, `BenchmarkRegexExtraction`, `BenchmarkRateLimit`; end-to-end 1k-line/s budget assertion deferred to Phase 17.)*
- ☑ Egress rate-limit: token-bucket per `(pod_uid, container)` at `rate=200 lines/s`, `burst=1000` (defaults; configurable via `egress_rate_limit:` config key); over-budget → `IncError("rate_limit_drop")` + sampled record with `tracecore.dropped_lines=N`; never blocks file reader.
- ☑ Multi-tenancy: `namespaces:` allowlist filters Pod discovery before file watch opened; per-namespace egress sub-budget configurable.
- ☑ Back-pressure: 1M-line burst from one rank MUST NOT block sibling streams; bounded per-file goroutine + bounded channel (1024); `goleak` test.
- ☑ File-handle hygiene: ≤2× pod-count open fds steady-state; closed within 30s of Pod `Terminated`; verified by `lsof` golden.
- ☑ Security: hostPath read-only mounts for `/var/log/pods` AND `/var/log/containers` (containerd symlink target - per https://github.com/containerd/containerd/issues/11149); runs as non-root with fsGroup matching kubelet log-group.
- ☑ Reliability caveat: containerd [#11149](https://github.com/containerd/containerd/issues/11149) (open upstream, last updated 2025-05-30) silently drops bytes from `0.log` when an in-container process reads its own FD 1 (e.g. application self-tee, sidecar reading `/proc/1/fd/1`). The mechanism is shared-pipe contention with containerd's log copier, not generic disk-I/O backpressure; standard workloads that do not read FD 1 are unaffected. RUNBOOK enumerates this failure mode; receiver does not claim universal lossless delivery from `0.log` alone.
- ☑ Panic recovery: per-file goroutines wrap `defer/recover`; malformed CRI line MUST NOT crash other watchers. (per PRINCIPLES §1)
- ☑ Shutdown: SIGTERM → 1s Phase-1 budget; in-flight reads abandoned cleanly; cursor flushed best-effort.
- ☑ **Done - alpha behind containerstdout.enabled flag (opt-in)** - https://github.com/TraceCoreAI/tracecore/pull/158

> NOTE: Vendored `pkg/stanza/fileconsumer` swap deferred to Phase 17 (RFC-0010 §FOLLOWUPS). Current implementation uses a stdlib `Tailer` that handles rotation (inode change) and truncation; the fileconsumer swap is an optimization, not a correctness requirement. The 5s p99 pod→watcher latency and end-to-end overhead-budget assertions are also Phase-17 carry-forwards.

### M16. Kueue scheduler receiver

18 changes: 10 additions & 8 deletions cmd/tracecore/components.go

Some generated files are not rendered by default.

2 changes: 2 additions & 0 deletions components.yaml
@@ -12,6 +12,8 @@
receivers:
- type: clockreceiver
package: github.com/tracecoreai/tracecore/components/receivers/clockreceiver
- type: containerstdout
package: github.com/tracecoreai/tracecore/components/receivers/containerstdout
- type: dcgm
package: github.com/tracecoreai/tracecore/components/receivers/dcgm
- type: kernelevents
115 changes: 115 additions & 0 deletions components/receivers/containerstdout/README.md
@@ -0,0 +1,115 @@
# containerstdout

**Stability:** alpha - public config keys MAY change with a
one-minor-cycle deprecation warning. RFC contract:
[`docs/rfcs/0010-containerstdout-receiver-scope.md`](../../../docs/rfcs/0010-containerstdout-receiver-scope.md);
milestone rubric:
[`MILESTONES.md § M15`](../../../MILESTONES.md).

Tails kubelet-managed CRI logs under `/var/log/pods`, parses partial
lines per the CRI log format, attributes records to `(pod_uid,
container)` via a per-node Pod informer (scope-filtered on
`spec.nodeName`), applies an egress rate limit, and emits one
`plog.LogRecord` per line with training-rank + dataloader-timing
attributes.

## Overview

| Aspect | Detail |
|---|---|
| Upstream surface | kubelet CRI log files under `/var/log/pods/<ns>_<pod>_<uid>/<container>/<n>.log` |
| Watch primitive | per-file `inotify`-driven tailer + per-node Pod `SharedInformer` |
| Deployment shape | DaemonSet (one Pod per node) - runs as root for `/var/log/pods` access |
| Egress model | `core/v1.pods` get,list,watch + `core/v1.nodes` get |
| Cursor persistence | hostPath `/var/lib/tracecore/container_stdout/cursor.json` |
| Schema URL | `https://tracecore.ai/schemas/containerstdout/v0` |
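
The upstream path layout in the table encodes the attribution key directly. A minimal sketch of resolving a path into its parts follows; `podLogRef` and `resolvePodLogPath` are hypothetical names, not the receiver's API, and the sketch assumes that namespace, Pod name, and UID never contain an underscore, so the first directory splits cleanly into three fields.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// podLogRef holds the fields encoded in
// <root>/<ns>_<pod>_<uid>/<container>/<n>.log.
type podLogRef struct {
	Namespace, Pod, UID, Container string
	Index                          int // the numeric <n> in <n>.log
}

// resolvePodLogPath extracts the attribution fields from a log path under
// root. Paths that do not match the layout (for example rotated files such
// as 0.log.1) return an error.
func resolvePodLogPath(root, path string) (podLogRef, error) {
	rel, err := filepath.Rel(root, path)
	if err != nil {
		return podLogRef{}, err
	}
	segs := strings.Split(filepath.ToSlash(rel), "/")
	if len(segs) != 3 {
		return podLogRef{}, fmt.Errorf("want <ns>_<pod>_<uid>/<container>/<n>.log, got %q", rel)
	}
	ids := strings.Split(segs[0], "_")
	if len(ids) != 3 || ids[0] == "" || ids[1] == "" || ids[2] == "" {
		return podLogRef{}, fmt.Errorf("bad pod directory %q", segs[0])
	}
	if !strings.HasSuffix(segs[2], ".log") {
		return podLogRef{}, fmt.Errorf("not an active log file: %q", segs[2])
	}
	n, err := strconv.Atoi(strings.TrimSuffix(segs[2], ".log"))
	if err != nil || n < 0 {
		return podLogRef{}, fmt.Errorf("bad log index in %q", segs[2])
	}
	return podLogRef{Namespace: ids[0], Pod: ids[1], UID: ids[2], Container: segs[1], Index: n}, nil
}

func main() {
	ref, err := resolvePodLogPath("/var/log/pods",
		"/var/log/pods/ml_trainer-0_1234-abcd/worker/0.log")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", ref)
}
```

The resulting `(UID, Container)` pair is the kind of key the informer-backed attribution lookup and the rate limiter would share.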

## Configuration reference

See RFC-0010 § Configuration surface for the authoritative table.
The values surfaced through the Helm chart are documented in
[`install/kubernetes/tracecore/values.yaml`](../../../install/kubernetes/tracecore/values.yaml).
Key fields:

| Key | Type | Default | Notes |
|---|---|---|---|
| `enabled` | bool | `false` | Opt-in alpha. |
| `include` | []string | `["/var/log/pods/*/*/*.log"]` | File globs the tailer watches. |
| `namespaces` | []string | `[]` | In-process allowlist; empty = all namespaces. |
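| `max_log_size` | int | `1048576` (1 MiB) | Cap on a line reassembled from partial fragments. Matches the OTel `container` stanza operator default (kubelet's `containerLogMaxSize` is a per-file rotation threshold, not a line cap). |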
| `rank_source` | enum | `"informer"` | `"informer"` (production) or `"downward_api"` (HW-validation override). |
| `process_rank_regex` | RE2 string | `\bRANK[=:]\s*(\d+)\b` | Body-match fallback regex for rank extraction. |
| `dataloader_regex` | RE2 string | torchvision pattern | PLACEHOLDER - operators MUST override per framework. |
| `egress_rate_limit.rate` | float | `200` | Steady-state lines/s per `(pod_uid, container)`. |
| `egress_rate_limit.burst` | int | `1000` | Token-bucket burst capacity. |
| `cursor.dir` | path | `/var/lib/tracecore/container_stdout` | Persisted cursor directory. |
| `eviction_match_window` | duration | RFC-0010 default | M19 join boundary for k8sevents pairing. |
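
To show how the default `process_rank_regex` behaves, the sketch below applies it with Go's `regexp` package (RE2 syntax). `rankFromBody` is a hypothetical helper, not the receiver's function. Note that the leading `\b` prevents a match inside `LOCAL_RANK=...`, because `_` counts as a word character.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// defaultRankRegex is the documented default for process_rank_regex.
var defaultRankRegex = regexp.MustCompile(`\bRANK[=:]\s*(\d+)\b`)

// rankFromBody returns the rank captured by the first submatch of re, if any.
func rankFromBody(re *regexp.Regexp, body string) (int, bool) {
	m := re.FindStringSubmatch(body)
	if len(m) < 2 {
		return 0, false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	for _, body := range []string{
		"[RANK=3] step 10 loss 0.42",
		"RANK: 12 starting epoch",
		"LOCAL_RANK=2 init", // no match: '_' before RANK is a word character
		"no rank here",
	} {
		rank, ok := rankFromBody(defaultRankRegex, body)
		fmt.Printf("%q -> rank=%d matched=%v\n", body, rank, ok)
	}
}
```

An override regex needs to keep the rank in the first capture group for a helper of this shape to work.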

## Operational notes

The receiver opts into root execution (UID 0) and a per-node Pod
informer. Operational playbooks live in [`RUNBOOK.md`](./RUNBOOK.md);
the chart's conftest policy enforces the volume / RBAC / env
invariants at `helm template` time (see
[`install/kubernetes/tracecore/policies/conftest/tracecore.rego`](../../../install/kubernetes/tracecore/policies/conftest/tracecore.rego)
§ containerstdout (M15) operational invariants).

Per-Kind alert mapping:

| Kind | Alert | Severity |
|---|---|---|
| `rotation_stalled` | `ContainerStdoutRotationStalled` | warning |
| `backpressure_drop` | `ContainerStdoutBackpressure` | warning |
| `cursor_write_failed` | `ContainerStdoutCursorWriteFailed` | critical |
| `watch` | `ContainerStdoutWatchFlap` | warning |
| `fingerprint_cardinality` | `ContainerStdoutCardinalityFingerprint` | info |
| `attribution_cardinality` | `ContainerStdoutCardinalityAttribution` | info |
| `rate_limit_cardinality` | `ContainerStdoutCardinalityRateLimit` | info |
| (composite OR of source flags) | `ContainerStdoutDegraded` | warning |

The `kind="watch"` and `kind="backpressure_drop"` labels are also
emitted by the k8sevents receiver. PromQL queries against
`tracecore_receiver_errors_total` MUST disambiguate via
`component_id=~"containerstdout/.*"`; the prometheus alert rules
ship a `receiver_id="containerstdout"` label as the doc-side
equivalent per RFC-0010 § Kind aliasing.

## RBAC + Helm

The chart [`install/kubernetes/tracecore`](../../../install/kubernetes/tracecore)
renders the receiver, ClusterRole, ClusterRoleBinding, hostPath
mounts, and downward-API env when
`receivers.containerstdout.enabled=true`. The chart's conftest policy
enforces the operational invariants at render time:

- DaemonSet MUST carry the `containerstdout-pod-logs` hostPath
volume.
- DaemonSet MUST carry the `containerstdout-cursor` hostPath volume.
- The tracecore container MUST carry `K8S_NODE_NAME` env from
`spec.nodeName` (downward API).

Operators with strict bring-your-own RBAC can set
`receivers.containerstdout.rbac.create=false` and apply their own
ClusterRole/ClusterRoleBinding. The intended pin is
`rbac.can-i.golden` (TODO: lands with the parity test); until then,
mirror the verb list from
[`templates/containerstdout-rbac.yaml`](../../../install/kubernetes/tracecore/templates/containerstdout-rbac.yaml).

## Limitations

- **Root execution required.** The kubelet writes CRI symlinks
under root-owned directories on every distro tracecore supports.
Phase 17 (fileconsumer swap) is the planned path to drop root -
see [`MILESTONES.md § M15`](../../../MILESTONES.md) carry-forward
rubric.
- **Dataloader regex is a placeholder.** The default RE2 matches
torchvision MetricLogger output; operators on detectron2 /
Lightning / NeMo / HF Trainer MUST override per RFC-0010
§ Configuration surface.
- **containerd #11149 caveat.** containerd silently drops bytes from
`0.log` when an in-container process reads its own FD 1 (e.g.
application self-tee, sidecar reading `/proc/1/fd/1`). Standard
workloads that do not read FD 1 are unaffected. See
[RUNBOOK § Failure mode inventory](./RUNBOOK.md#failure-mode-inventory).
- **No metrics / traces signal.** Logs-only. `CreateMetrics` and
`CreateTraces` return `pipeline.ErrSignalNotSupported`.