[m15] containerstdout: Kubernetes container stdout tail receiver (alpha) #158
Merged
Signed-off-by: Tri Lam <tri@maydow.com>
Adds DataloaderExtractor parsing data_time_s / iter_time_s named captures. Signed-off-by: Tri Lam <tri@maydow.com>
Adds the chart wiring for the containerstdout receiver (M15, alpha):

- receivers.containerstdout block in values.yaml: opt-in alpha (enabled: false), with structured knobs for include globs, namespaces, max_log_size, rank_source, egress_rate_limit, plus chart-only rbac and hostPath keys.
- templates/containerstdout-rbac.yaml: ClusterRole + ClusterRoleBinding granting core/v1.pods get,list,watch + core/v1.nodes get on the chart's ServiceAccount; render gated on enabled AND rbac.create.
- templates/daemonset.yaml: conditional pod-level root override (kubelet CRI symlinks under /var/log/pods are root-owned), conditional /var/log/pods, /var/log/containers, and cursor hostPath volume mounts, and automountServiceAccountToken flips true so the per-node Pod informer authenticates.
- templates/_helpers.tpl: omits chart-only rbac/hostPath keys from the rendered tracecore config so config.Load does not reject the output with an unknown-field error.
- templates/NOTES.txt: operator-facing warning when the receiver is enabled (root, ClusterRole, RUNBOOK link).
- ci/containerstdout-on-values.yaml: happy-path render fixture for the chart CI gate.

RFC contract: docs/rfcs/0010-containerstdout-receiver-scope.md.

Signed-off-by: Tri Lam <tri@maydow.com>
Operator-facing docs for the M15 containerstdout receiver:

- components/receivers/containerstdout/RUNBOOK.md: per-Kind triage (KindRotationStalled, KindBackpressureDrop, KindCursorWriteFailed, KindWatch, KindFingerprintCardinality, KindAttributionCardinality, KindRateLimitCardinality) with alert names, root-cause taxonomy, diagnostic commands, mitigation, and the TestFailure_ identifier that pins each failure mode. Plus Recovery and Rollback sections.
- components/receivers/containerstdout/README.md: receiver overview, configuration reference, operational notes, per-Kind alert mapping, RBAC + Helm pointer, and limitations.
- docs/FAILURE-MODES.md: adds containerstdout to the per-component RUNBOOK link list AND the Alert -> RUNBOOK index (eight new rows, one per containerstdout alert).

Every Kind from components/receivers/containerstdout/kind.go now has a RUNBOOK entry + linked test name; FAILURE-MODES routes operators to it.

Signed-off-by: Tri Lam <tri@maydow.com>
- components/receivers/containerstdout/prometheus-alerts.example.yaml: eight alert rules, one per Kind in kind.go plus the composite ContainerStdoutDegraded. Per-Kind severity table in the YAML comment block (cursor_write_failed = critical; rotation / backpressure / watch = warning; cardinality kinds = info). Rules filter on component_id=~"containerstdout/.*" and stamp receiver_id=containerstdout for dashboard disambiguation of Kinds shared with k8sevents (KindWatch, KindBackpressureDrop) per RFC-0010 § Kind aliasing.
- install/kubernetes/tracecore/policies/conftest/tracecore.rego: carves a containerstdout-enabled exemption to the runAsNonRoot / runAsUser==0 / runAsGroup==0 deny rules (gated on the presence of the containerstdout-pod-logs hostPath volume, NOT a values flag), then adds three new rules enforcing operational invariants when the receiver is enabled: required containerstdout-pod-logs volume, required containerstdout-cursor volume, and required K8S_NODE_NAME downward-API env from spec.nodeName.

helm + conftest not on PATH locally; templates verified by visual inspection of expected render shape under both enabled=false (default) and enabled=true (ci fixture).

Signed-off-by: Tri Lam <tri@maydow.com>
added 5 commits
May 20, 2026 23:56
Marks the M15 container stdout receiver acceptance criteria done for items the alpha behind containerstdout.enabled=true ships: attribution, JSON detect, dataloader extract, rotation, lines/s feed, cursor, degraded-mode, egress rate-limit, multi-tenancy, back-pressure, fd hygiene, security, containerd #11149 caveat, panic recovery, shutdown.

Partial (⧗): 5s p99 pod→watcher latency and end-to-end overhead budget (≤0.10% CPU, ≤0.3 Mbps egress, ≤20 MB RSS). Unit benchmarks ship; e2e budget assertion deferred to Phase 17.

NOTE: vendored pkg/stanza/fileconsumer swap deferred to Phase 17 (RFC-0010 §FOLLOWUPS). Current stdlib Tailer handles rotation + truncation correctly; the swap is an optimization carry-forward.

PR link will be patched in a follow-up commit once gh pr create returns the URL.

Signed-off-by: Tri Lam <tri@maydow.com>
This was referenced May 22, 2026
Closed
10 tasks
trilamsr pushed a commit that referenced this pull request May 30, 2026
shellcheck SC2221/SC2222 flagged the path-filter case statement because `*.md` already covers CHANGELOG.md / README.md / NORTHSTARS.md etc; listing the top-level docs explicitly created overlapping patterns. Same semantics with the simpler two-arm match.

Also: verify-test failed on TestTailer_TruncationWithoutRotation (components/receivers/containerstdout/tailer_test.go:347, "timed out waiting for tailer line"). That test arrived via PR #158 (M15 containerstdout, merged on main earlier today) and is not introduced by this PR. It looks like a flaky timeout on the slow ubuntu-latest runner. Triaging separately via re-run; if it persists, a follow-up will be registered in docs/FLAKY-TESTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr added a commit that referenced this pull request May 30, 2026
## What this PR does

Lands [RFC-0013](docs/rfcs/0013-distro-first-pivot.md) as the binding decision doc for tracecore's distribution-first pivot, and propagates it across the doc surface in fourteen scoped commits. After this PR, every doc that names a soon-to-be-deleted receiver, the hand-rolled release pipeline, or the custom self-telemetry surface carries a Status banner pointing at the upstream replacement and the release boundary (v0.1.0 / v0.2.0 / v0.3.0).

No code paths change. No deletions yet. The next eleven PRs (per RFC-0013 §Migration) do the receiver removals, the OCB skeleton swap, the goreleaser stack adoption, the receivers-only module split, and the migration guides.

### Scope summary

| Concern | Adopt | Delete | When |
|---|---|---|---|
| GPU telemetry | `dcgm-exporter` (NV) + `ROCm/device-metrics-exporter` (AMD) + `xpumanager` (Intel) + Habana exporter via `prometheusreceiver` | `components/receivers/dcgm/` (cgo stub) | v0.1.0 |
| Container stdout | `filelogreceiver` + container stanza + `file_storage` | `components/receivers/containerstdout/` (M15 alpha, PR #158) | v0.2.0 |
| Kernel events | `journaldreceiver` + `filelogreceiver` + OTTL Xid transform | `components/receivers/kernelevents/` | v0.2.0 |
| K8s events | `k8sobjectsreceiver` + OTTL `k8s.event.hint` transform | `components/receivers/k8sevents/` | v0.2.0 |
| Kueue scheduler | `prometheusreceiver` + bearer-token | `components/receivers/kueue/` (never shipped) | v0.1.0 |
| Python profiling | `parca-agent` (eBPF) | `components/receivers/pyspy/` + `python/tracecore_pyspy/` + `tools/pyspy-lint/` | v0.3.0 |
| Kineto | Deferred pending OTel Profiles GA | `components/receivers/kineto/` (partial) | v0.1.0 |
| Heartbeat | `telemetrygeneratorreceiver` | `components/receivers/clockreceiver/` | v0.1.0 |
| Self-telemetry | upstream `componentstatus` + `service/telemetry` + standard `otelcol_*` | `internal/componentstatus/` + `internal/selftelemetry/` + `internal/telemetry/` | v0.1.0 |
| Release | goreleaser + slsa-github-generator + cosign-installer + anchore/sbom-action + actions/attest-build-provenance | hand-rolled `.github/workflows/release.yml` | v0.1.0 |
| Image publish | `ko` (push) + `Renovate` (pull) | (refactor) | v0.1.0 |
| Datadog / ClickHouse | OCB bundles `datadogexporter` + `clickhouseexporter` directly | "intermediate otelcol-contrib" recipe | v0.1.0 |

### Customer-stable contracts preserved across the pivot

`k8s.event.hint` 11-entry enum, `kernelevents.xid`, `gpu.id`, `gpu.vendor`, `gen_ai.training.{rank,job_id}`, NCCL FlightRecorder span schema, pattern detector outputs (M17/M18/M19). Normalized back to the stable surface via the OTTL `transform` processor in the bundled recipe. Operator alerts written against these survive the receiver swap.

### What tracecore still builds (the moat)

Bounded by RFC-0013 §6: NCCL FlightRecorder receiver (`ncclfrreceiver`), windowed cross-signal join processor (`rankjoinprocessor`), pattern engine + replay corpus (`patterndetectorprocessor`), install/overhead bench harness. Everything else comes from upstream + contrib.

### Upstream contribution policy

Now first-class. Tracecore contributes patches upstream first; forks only when upstream rejects. When a contribution is in-flight, tracecore ships against a `replace` directive in `go.mod` pointing at the contribution branch; the replace is removed when the upstream tag lands.

## Commits in this PR

1. `1fe08c3` - RFC-0013 binding doc + index updates
2. `2047abe` - research notes superseded headers
3. `81dc29e` - patterns input-source references at adopted receivers
4. `ca9fbc6` - RFC bodies 0001-0012 supersede/revise headers (0004 archived)
5. `d06c929` - status banners on queued-for-deletion files (15 files)
6. `d116775` - receiver READMEs + RUNBOOKs status banners (14 files)
7. `c250509` - followups triage against RFC-0013 (21 files)
8. `3edd226` - top-level orienting docs reframed (9 files)
9. `cb01365` - integration recipes + operator docs for OCB distro (10 files)
10. `946af13` - `golang.org/x/net` v0.54.0 → v0.55.0 (GO-2026-5026, blocking pre-push)
11. `4eb03c0` - em-dash + en-dash → hyphen across pivot surface (88 hits in 32 files); also fixes 3 broken link refs to archived/0004 and one fabricated test name in FAILURE-MODES.md
12. `de585ba` - language tightening + dedupe paraphrase + ancillary doc Status banners (STYLE.md, bench/install/tracecore-values.yaml, docs/examples/*.yaml, docs/integrations/examples/*.yaml, docs/notes/*, docs/proposals/*)
13. `9ceeaef` - merge `origin/main` (resolves M15 containerstdout (PR #158) alongside RFC-0013 v0.2.0 deletion banner; FAILURE-MODES.md gains containerstdout alert rows with upstream replacement column; go.mod takes both x/sys v0.45.0 and new x/time v0.14.0)

## Linked issue(s)

_No linked issue._

## Test plan

- [x] `make doc-check` clean on every modified file
- [x] `make ci` (full pre-push) passed locally - lint + race tests + 30s fuzz + govulncheck + build + doc-check + release-doc-parity
- [x] `govulncheck ./...` reports zero vulnerabilities after the x/net bump
- [x] `gh attestation` / `cosign` flag set unchanged (parity gate intact per RFC-0013 §Migration PR-C)
- [ ] **CI expected to fail on `validator-recipe`**: integration docs route Datadog and ClickHouse through tracecore's own `validate` subcommand (no intermediate otelcol-contrib), but the in-tree binary on this branch does not yet have `datadogexporter` / `clickhouseexporter` registered. Lands in the follow-on OCB-skeleton PR (RFC-0013 §Migration PR-A). Expected, not a regression.

### Known follow-ups (not in this PR; tracked for next PRs per RFC-0013 §Migration)

- `docs/integrations/examples/*.yaml` Status comments now reflect OCB-bundled posture (commit `de585ba`); operator placeholders unchanged.
- `docs/FAILURE-MODES.md` retains references to `internal/pipeline/*` test paths that will be deleted with the v0.1.0 PR-F cut.
- The `tracecoreai/tracecore-components` separate-repo module does not exist yet; created in PR-I per v0.2.0.

## Release notes

```release-notes
[CHANGE] RFC-0013 adopted: tracecore pivots to a distribution-first posture. The binary is assembled via OpenTelemetry Collector Builder (OCB) from upstream + contrib components plus a thin tracecore-components module containing only the moat (NCCL FlightRecorder receiver, OTTL join processor, pattern engine, bench harness). Seven in-tree receivers and three internal self-telemetry packages are queued for deletion across v0.1.0 / v0.2.0 / v0.3.0; the release pipeline migrates to goreleaser + SLSA + cosign + ko. Customer-stable telemetry contracts (k8s.event.hint enum, kernelevents.xid, gpu.id, gpu.vendor, NCCL span schema) are preserved across the pivot via an OTTL normalization layer in the bundled recipe; operator alerts written against these survive the swap.
```

## Checklist

- [x] Tests added or updated (doc-only PR; doc-check + link-check gates updated transitively)
- [x] `make check` runs green; pre-push hook passes
- [x] Commits are signed off (`git commit -s`)
- [x] For new components, follows the layout required by [`STYLE.md`](STYLE.md) - n/a (no new components)

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Implements M15 (RFC-0010): Kubernetes container stdout tail receiver. Tails `/var/log/pods/**/*.log` per node, parses CRI text format, attributes records to training rank via Pod informer + body-match fallback, joins with M19 pod-evicted via EvictionMatchWindow, emits to consumer.Logs.

Alpha - opt-in via `receivers.containerstdout.enabled=true` (default off in Helm values).

Phases

Deferred to FOLLOWUP
- `pkg/stanza/fileconsumer` swap - currently uses stdlib Tailer (RFC-0010 §FOLLOWUPS)

Test plan
- `make check` passes (fmt + tidy + lint + race tests across full repo)
- `go test -race -count=2 ./components/receivers/containerstdout/...` - all green
- `TestPipeline_E2E_*` end-to-end tests pass
- `helm template --set receivers.containerstdout.enabled=true` renders DaemonSet + RBAC + NODE_NAME env

Rollback
`helm upgrade --set receivers.containerstdout.enabled=false` removes DaemonSet, RBAC, and config - verified via template diff.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com
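To illustrate the "parses CRI text format" step from the Summary, here is a minimal sketch of splitting one CRI log line into its fields. The CRI layout (`<RFC3339Nano timestamp> <stream> <P|F> <content>`) is the standard one written by container runtimes under `/var/log/pods`; the `criLine` type and `parseCRILine` function are hypothetical names and do not come from this PR.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// criLine is a hypothetical holder for one parsed CRI record.
type criLine struct {
	Time    time.Time
	Stream  string // "stdout" or "stderr"
	Partial bool   // true for tag "P": the runtime split a long line
	Body    string
}

// parseCRILine splits "<timestamp> <stream> <tag> <content>".
// Only the plain "P" and "F" tags are accepted in this sketch.
func parseCRILine(line string) (criLine, error) {
	parts := strings.SplitN(line, " ", 4)
	if len(parts) < 3 {
		return criLine{}, errors.New("cri: expected timestamp, stream, and tag")
	}
	ts, err := time.Parse(time.RFC3339Nano, parts[0])
	if err != nil {
		return criLine{}, fmt.Errorf("cri: bad timestamp: %w", err)
	}
	if parts[1] != "stdout" && parts[1] != "stderr" {
		return criLine{}, fmt.Errorf("cri: unknown stream %q", parts[1])
	}
	if parts[2] != "P" && parts[2] != "F" {
		return criLine{}, fmt.Errorf("cri: unknown tag %q", parts[2])
	}
	body := ""
	if len(parts) == 4 {
		body = parts[3] // content may itself contain spaces
	}
	return criLine{Time: ts, Stream: parts[1], Partial: parts[2] == "P", Body: body}, nil
}

func main() {
	rec, err := parseCRILine("2026-05-20T23:56:01.123456789Z stdout F rank=3 step=42 loss=1.9")
	if err != nil {
		panic(err)
	}
	fmt.Println(rec.Stream, rec.Partial, rec.Body)
}
```

A tailer built on this would typically buffer consecutive `P` records and emit them joined with the closing `F` record, so that a long application line split by the runtime reaches downstream consumers as a single log body.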