Skip to content

TraceCoreAI/tracecore

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

386 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

tracecore

When a distributed training run breaks, the operator gets told what broke - by name, with its evidence trail - instead of correlating signals across the stack by hand.

Tracecore is an OpenTelemetry Collector distribution + AI-training pattern library for distributed-training observability. The binary is assembled from upstream OpenTelemetry + contrib components via the OpenTelemetry Collector Builder (OCB); the differentiator is the bundled pattern detectors, NCCL FlightRecorder receiver, OTTL processors (cross-signal rank join, dataloader timing, eviction join), and the recipes that wire upstream receivers into training-cluster-shaped signal pipelines. The 15 named root-cause patterns in NORTHSTARS.md define what "told what broke" means concretely; four DCGM-observable patterns ship with walkthroughs in docs/patterns/ today, with the remainder tracked in MILESTONES.md.

The collector is open source. The synthesis engine that interprets the data is a separate, hosted product.

See RFC-0013 for the binding architectural posture (adopt upstream first; build only the four moat scopes).

Why tracecore vs. OpenTelemetry Collector?

Tracecore is an OTel Collector distribution - assembled via OCB from upstream + contrib components and pinned to a single release cycle. What it adds on top is the part operators cannot get from otelcol-contrib alone:

  • Pattern detectors + replay corpus - the cross-signal root-cause patterns in docs/patterns/ ship as a patterndetectorprocessor with synthetic-fixture replay tests pinning the alert math. OTel contrib has no cross-signal pattern engine; this is the moat.
  • NCCL FlightRecorder parsing + cross-rank join - ncclfrreceiver and rankjoinprocessor (5s windowed join) have no upstream equivalent today; tracecore upstreams them when CNCF/OTel SIG accepts the contribution.
  • Pre-wired training-cluster recipes - bundled Helm chart ships OTTL normalization (gen_ai.training.rank, gen_ai.training.job_id, k8s.event.hint, gpu.vendor) so the upstream receivers under the hood (filelogreceiver, journaldreceiver, k8sobjectsreceiver, prometheusreceiver against dcgm-exporter / ROCm / Intel / Habana) emit the customer-stable contract operators alert against.
  • Opinionated operator UX - single OCB-built binary, the standard upstream /metrics + /healthz + /readyz surface, alerts ship next to RUNBOOKs and the link is CI-enforced.
  • When NOT to use tracecore: if your needs are application-tracing-shaped (OTel SDK + a stock contrib build is the right answer) or if you want a vendor-agnostic collector with a broader receiver set than the training stack.

Status

Pre-alpha. The repo is mid-pivot to the distribution-first posture (RFC-0013): the binary is moving to OCB assembly from upstream + contrib components, with the in-house surface contracting to the four moat scopes (pattern detectors, OTTL processors with windowed semantics, NCCL FlightRecorder parsing, install/overhead bench). The current in-tree receivers under components/receivers/ are queued for deletion across v0.1.0 / v0.2.0 / v0.3.0; see MILESTONES.md and RFC-0013 §7 for the release-boundary schedule. See CHANGELOG.md for the moving parts.

Production readiness

What's safe to deploy today, what's still shipping. Honest read at HEAD; check CHANGELOG.md + MILESTONES.md for the moving parts.

Surface Stability CI tests Signed binaries
OCB-assembled distribution (binary) pre-alpha (in transition) ✅ unit + race + integration ☑ (M3)
Self-telemetry (/metrics, /healthz, /readyz) - upstream service/telemetry + standard otelcol_* metrics alpha ✅ unit + integration ☑ (M3)
Bundled recipes (Helm chart + OTTL normalization layer) alpha ✅ chart-lint + conftest policies ☑ (M3)
ncclfrreceiver + safe pickle parser (moat) alpha ✅ unit + race + fuzz (RCE-gate) ☑ (M3)
patterndetectorprocessor + replay corpus (moat) partial - pattern #14 (pod-evicted) shipped ✅ unit + race + replay ☑ (M3)
Install + overhead bench harness (moat) planned (M5) - n/a
Reproducible build / SBOM / sigstore signing / SLSA provenance shipped (M3 via goreleaser + OpenSSF stack) ✅ diffoscope + cosign + slsa-verifier

The honest read for production decisions:

  • The OCB distribution skeleton is the v0.1.0 target. Operators evaluating tracecore today should pin to a tagged release and follow the migration boundary in RFC-0013 §4.
  • Releases ship signed (cosign keyless), with a CycloneDX SBOM, SLSA v1.0 provenance, and a reproducible byte-identical rebuild path per docs/reproducibility.md.
  • The bundled Helm chart at install/kubernetes/tracecore/ is alpha and ships the OTTL normalization layer that preserves the customer-stable telemetry contract (k8s.event.hint enum, kernelevents.xid, gpu.id, gpu.vendor, NCCL span schema) across the receiver swaps in RFC-0013 §3.

Quickstart

# 1. Build the binary. `make build` invokes the OpenTelemetry
#    Collector Builder against `builder-config.yaml` (RFC-0013
#    PR-A2) and emits `./_build/tracecore`.
go mod download
make build

# 2. Write a minimal config: hostmetrics' load scraper emits 3
#    low-cardinality series (system.cpu.load_average.{1m,5m,15m})
#    at 1s heartbeat into the debug exporter. Portable across
#    linux/darwin/windows — works on any developer laptop.
cat > demo.yaml <<'YAML'
receivers:
  hostmetrics:
    collection_interval: 1s
    scrapers:
      load: {}
exporters:
  debug:
    verbosity: basic
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [debug]
YAML

# 3. Dry-run: validate the config without starting any I/O.
./_build/tracecore validate --config=demo.yaml

# 4. Run. Ctrl-C (SIGINT) for graceful shutdown; second Ctrl-C
#    forces immediate exit.
./_build/tracecore --config=demo.yaml

Lifecycle logs go to stderr. Run ./_build/tracecore --help for the full flag set and signal semantics. For training-cluster recipes (filelog + container stanza, journald + Xid OTTL transform, k8sobjects + hint mapping, prometheusreceiver against dcgm-exporter / ROCm / Intel / Habana), see docs/integrations/ and the bundled Helm chart values.

Documentation

If you're a … Start here
Operator running tracecore in production docs/getting-started.md → bundled recipes under docs/integrations/docs/FAILURE-MODES.md
Contributor adding a receiver / processor / exporter CONTRIBUTING.mdPRINCIPLES.md (the why) → STYLE.md (the what) → upstream go.opentelemetry.io/collector component/receiver/processor/exporter packages
Maintainer making architectural calls docs/STRATEGY.mdNORTHSTARS.mddocs/rfcs/MILESTONES.mddocs/FOLLOWUPS.md
Evaluating tracecore for your fleet This README + CHANGELOG.mddocs/STRATEGY.md "single load-bearing principle"
Verifying a published release end-to-end (auditor / supply-chain) docs/reproducibility.md (rebuild → diffoscope → cosign → SLSA → SBOM)

Full doc index with one-line purpose per file: docs/README.md.

License

Apache License 2.0.

About

Cross-vendor observability collector for distributed AI training

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors