When a distributed training run breaks, the operator gets told what broke - by name, with its evidence trail - instead of correlating signals across the stack by hand.
Tracecore is an OpenTelemetry Collector distribution + AI-training pattern library for distributed-training observability. The binary is assembled from upstream OpenTelemetry + contrib components via the OpenTelemetry Collector Builder (OCB); the differentiator is the bundled pattern detectors, NCCL FlightRecorder receiver, OTTL processors (cross-signal rank join, dataloader timing, eviction join), and the recipes that wire upstream receivers into training-cluster-shaped signal pipelines. The 15 named root-cause patterns in NORTHSTARS.md define what "told what broke" means concretely; four DCGM-observable patterns ship with walkthroughs in docs/patterns/ today, with the remainder tracked in MILESTONES.md.
The collector is open source. The synthesis engine that interprets the data is a separate, hosted product.
See RFC-0013 for the binding architectural posture (adopt upstream first; build only the four moat scopes).
Tracecore is an OTel Collector distribution - assembled via OCB from upstream + contrib components and pinned to a single release cycle. What it adds on top is the part operators cannot get from otelcol-contrib alone:
- Pattern detectors + replay corpus - the cross-signal root-cause patterns in `docs/patterns/` ship as a `patterndetectorprocessor` with synthetic-fixture replay tests pinning the alert math. OTel contrib has no cross-signal pattern engine; this is the moat.
- NCCL FlightRecorder parsing + cross-rank join - `ncclfrreceiver` and `rankjoinprocessor` (5s windowed join) have no upstream equivalent today; tracecore upstreams them when CNCF/OTel SIG accepts the contribution.
- Pre-wired training-cluster recipes - bundled Helm chart ships OTTL normalization (`gen_ai.training.rank`, `gen_ai.training.job_id`, `k8s.event.hint`, `gpu.vendor`) so the upstream receivers under the hood (`filelogreceiver`, `journaldreceiver`, `k8sobjectsreceiver`, `prometheusreceiver` against `dcgm-exporter` / ROCm / Intel / Habana) emit the customer-stable contract operators alert against.
- Opinionated operator UX - single OCB-built binary, the standard upstream `/metrics` + `/healthz` + `/readyz` surface, alerts ship next to RUNBOOKs and the link is CI-enforced.

When NOT to use tracecore: if your needs are application-tracing-shaped (OTel SDK + a stock contrib build is the right answer) or if you want a vendor-agnostic collector with a broader receiver set than the training stack.
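The OTTL normalization described in the recipes bullet can be sketched as a single-pipeline config. This is a minimal, hypothetical example rather than the chart's actual contents: the file name `dcgm-recipe.yaml`, the `localhost:9400` target (dcgm-exporter's default port), the 15s scrape interval, and the single `set(...)` statement are all assumptions, and it presumes the build includes the upstream `prometheus` receiver and `transform` processor.

```shell
# Hypothetical recipe sketch: scrape dcgm-exporter and stamp gpu.vendor.
# Assumptions (not taken from the tracecore chart): the file name, the
# scrape target, the interval, and the single OTTL statement below.
cat > dcgm-recipe.yaml <<'YAML'
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: dcgm-exporter
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:9400"]
processors:
  transform:
    metric_statements:
      - context: datapoint
        statements:
          # Stamp the vendor attribute on every scraped data point.
          - set(attributes["gpu.vendor"], "nvidia")
exporters:
  debug:
    verbosity: basic
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [transform]
      exporters: [debug]
YAML

# Validate only when a local build exists; otherwise just leave the file.
if [ -x ./_build/tracecore ]; then
  ./_build/tracecore validate --config=dcgm-recipe.yaml
fi
```

The chart's real statements live in its Helm values. The sketch only shows the shape: an upstream receiver feeds a transform step that stamps the stable attribute operators alert against.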
Pre-alpha. The repo is mid-pivot to the distribution-first posture (RFC-0013): the binary is moving to OCB assembly from upstream + contrib components, with the in-house surface contracting to the four moat scopes (pattern detectors, OTTL processors with windowed semantics, NCCL FlightRecorder parsing, install/overhead bench). The current in-tree receivers under components/receivers/ are queued for deletion across v0.1.0 / v0.2.0 / v0.3.0; see MILESTONES.md and RFC-0013 §7 for the release-boundary schedule. See CHANGELOG.md for the moving parts.
What's safe to deploy today, what's still shipping. Honest read at HEAD; check CHANGELOG.md + MILESTONES.md for the moving parts.
| Surface | Stability | CI tests | Signed binaries |
|---|---|---|---|
| OCB-assembled distribution (binary) | pre-alpha (in transition) | ✅ unit + race + integration | ☑ (M3) |
| Self-telemetry (`/metrics`, `/healthz`, `/readyz`) - upstream `service/telemetry` + standard `otelcol_*` metrics | alpha | ✅ unit + integration | ☑ (M3) |
| Bundled recipes (Helm chart + OTTL normalization layer) | alpha | ✅ chart-lint + conftest policies | ☑ (M3) |
| `ncclfrreceiver` + safe pickle parser (moat) | alpha | ✅ unit + race + fuzz (RCE-gate) | ☑ (M3) |
| `patterndetectorprocessor` + replay corpus (moat) | partial - pattern #14 (pod-evicted) shipped | ✅ unit + race + replay | ☑ (M3) |
| Install + overhead bench harness (moat) | planned (M5) | - | n/a |
| Reproducible build / SBOM / sigstore signing / SLSA provenance | shipped (M3 via goreleaser + OpenSSF stack) | ✅ diffoscope + cosign + slsa-verifier | ☑ |
The honest read for production decisions:
- The OCB distribution skeleton is the v0.1.0 target. Operators evaluating tracecore today should pin to a tagged release and follow the migration boundary in RFC-0013 §4.
- Releases ship signed (cosign keyless), with a CycloneDX SBOM, SLSA v1.0 provenance, and a reproducible byte-identical rebuild path per `docs/reproducibility.md`.
- The bundled Helm chart at `install/kubernetes/tracecore/` is alpha and ships the OTTL normalization layer that preserves the customer-stable telemetry contract (k8s.event.hint enum, kernelevents.xid, gpu.id, gpu.vendor, NCCL span schema) across the receiver swaps in RFC-0013 §3.
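The "byte-identical rebuild" claim above reduces, at its last step, to a plain byte comparison between a local rebuild and the published artifact. The helper below is a minimal sketch of that final check only. The function name `same_bytes` and the `./release/tracecore` path are hypothetical, and the authoritative procedure (including the diffoscope, cosign, and SLSA steps) remains the one in docs/reproducibility.md.

```shell
# same_bytes A B: succeed only if files A and B are byte-for-byte identical.
# Hypothetical helper for illustration; not part of the tracecore tooling.
same_bytes() {
  cmp -s "$1" "$2"
}

# Usage sketch with assumed paths: a local rebuild vs. a downloaded release.
if [ -f ./_build/tracecore ] && [ -f ./release/tracecore ]; then
  if same_bytes ./_build/tracecore ./release/tracecore; then
    echo "rebuild is byte-identical to the release artifact"
  else
    echo "rebuild differs from the release artifact" >&2
  fi
fi
```

`cmp -s` stays silent and reports the result through its exit status, which keeps the helper usable inside CI conditionals.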
# 1. Build the binary. `make build` invokes the OpenTelemetry
# Collector Builder against `builder-config.yaml` (RFC-0013
# PR-A2) and emits `./_build/tracecore`.
go mod download
make build
# 2. Write a minimal config: hostmetrics' load scraper emits 3
# low-cardinality series (system.cpu.load_average.{1m,5m,15m})
# at 1s heartbeat into the debug exporter. Portable across
# linux/darwin/windows — works on any developer laptop.
cat > demo.yaml <<'YAML'
receivers:
  hostmetrics:
    collection_interval: 1s
    scrapers:
      load: {}
exporters:
  debug:
    verbosity: basic
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [debug]
YAML
# 3. Dry-run: validate the config without starting any I/O.
./_build/tracecore validate --config=demo.yaml
# 4. Run. Ctrl-C (SIGINT) for graceful shutdown; second Ctrl-C
# forces immediate exit.
./_build/tracecore --config=demo.yaml

Lifecycle logs go to stderr. Run `./_build/tracecore --help` for the full flag set and signal semantics. For training-cluster recipes (filelog + container stanza, journald + Xid OTTL transform, k8sobjects + hint mapping, prometheusreceiver against dcgm-exporter / ROCm / Intel / Habana), see docs/integrations/ and the bundled Helm chart values.
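As a rough illustration of the filelog + container recipe shape, the sketch below tails pod logs and parses the container runtime format with the upstream `container` operator. The file name `podlogs-recipe.yaml`, the include glob, and the `start_at: end` choice are assumptions made for illustration; the maintained recipes are the ones under docs/integrations/.

```shell
# Hypothetical logs recipe sketch: tail Kubernetes pod logs via filelog.
# Assumptions: file name, include glob, and start_at are illustrative only.
cat > podlogs-recipe.yaml <<'YAML'
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: end
    operators:
      # Parses docker / containerd / CRI-O log line formats.
      - type: container
exporters:
  debug:
    verbosity: basic
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [debug]
YAML

# Validate only when a local build exists; otherwise just leave the file.
if [ -x ./_build/tracecore ]; then
  ./_build/tracecore validate --config=podlogs-recipe.yaml
fi
```

Starting at the end of each file avoids replaying old log history on first start. Whether that suits a given cluster is a deployment decision the sketch does not settle.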
| If you're a … | Start here |
|---|---|
| Operator running tracecore in production | docs/getting-started.md → bundled recipes under docs/integrations/ → docs/FAILURE-MODES.md |
| Contributor adding a receiver / processor / exporter | CONTRIBUTING.md → PRINCIPLES.md (the why) → STYLE.md (the what) → upstream go.opentelemetry.io/collector component/receiver/processor/exporter packages |
| Maintainer making architectural calls | docs/STRATEGY.md → NORTHSTARS.md → docs/rfcs/ → MILESTONES.md → docs/FOLLOWUPS.md |
| Evaluating tracecore for your fleet | This README + CHANGELOG.md → docs/STRATEGY.md "single load-bearing principle" |
| Verifying a published release end-to-end (auditor / supply-chain) | docs/reproducibility.md (rebuild → diffoscope → cosign → SLSA → SBOM) |
Full doc index with one-line purpose per file: docs/README.md.