Tracecore is a distribution of the OpenTelemetry Collector - assembled
by the OpenTelemetry Collector Builder (OCB) from upstream and contrib
components plus a thin layer of pattern-detection processors that
live in the in-repo Go submodule at module/ (path
github.com/tracecoreai/tracecore/module). Welcome.
This walkthrough takes you from a fresh clone to your first OTLP byte in five commands. No GPU, no container runtime, no Kubernetes - just Go and a terminal.
The default demo wires the upstream hostmetrics receiver (load
scraper) to the debug exporter so you can verify the binary works
on your host before pointing it at real signals. Production sources
(GPU telemetry, kernel events, k8s events, …) flow through upstream
receivers (prometheusreceiver, journaldreceiver, filelogreceiver,
k8sobjectsreceiver) per the adoption matrix in
RFC-0013 §2.
- Go ≥ the version pinned in
.go-version(read withcat .go-version). - A POSIX shell (bash/zsh/etc.) - Linux or macOS.
# 1. Clone and build. `make build` invokes the OpenTelemetry Collector
# Builder (`builder --config=builder-config.yaml`); OCB generates
# main.go from the manifest, then compiles the assembled binary.
# The Go toolchain + lint/test tools are pinned via `go.mod` tool
# directives; no separate install step.
git clone https://github.com/tracecoreai/tracecore.git
cd tracecore
make build
# 2. Write a minimal config. `hostmetrics`' load scraper emits 3
# low-cardinality series (system.cpu.load_average.{1m,5m,15m})
# at 1s heartbeat — the OCB-bundled replacement for the retired
# in-tree `clockreceiver` per RFC-0013 PR-E. `debug` prints OTLP
# to stdout. Portable across linux/darwin/windows.
cat > demo.yaml <<'YAML'
receivers:
hostmetrics:
collection_interval: 1s
scrapers:
load: {}
exporters:
debug:
verbosity: detailed
service:
pipelines:
metrics:
receivers: [hostmetrics]
exporters: [debug]
YAML
# 3. Dry-run the config without starting any I/O. Catches typos,
# unknown component types, and missing required fields with
# file:line:column.
./_build/tracecore validate --config=demo.yaml
# 4. Run. Stderr carries structured lifecycle logs; stdout carries
# the OTLP-encoded debug output. Ctrl-C (SIGINT) triggers graceful
# shutdown; a second Ctrl-C forces immediate exit.
./_build/tracecore --config=demo.yamlYou should see structured logs on stderr from service/telemetry -
the upstream self-telemetry surface - naming the active pipelines
and components, followed by per-interval debug-exporter dumps on
stdout.
Stop with Ctrl-C. The collector logs "Shutdown complete" and exits
0. Exit code 0 = clean. Exit code 1 = a deadline elapsed during
shutdown or a forced second-Ctrl-C exit. Exit code 2 = config /
data issue. See ./_build/tracecore --help for the full subcommand
table.
For a cluster install, deploy from a local checkout of the in-repo
chart at install/kubernetes/tracecore/:
helm install tracecore install/kubernetes/tracecore \
--namespace tracecore-system --create-namespaceThe default values pair the upstream hostmetrics receiver with the
debug exporter so a fresh install boots cleanly on a no-GPU cluster.
See install/kubernetes/tracecore/README.md
for the values surface, PSS posture, and per-receiver opt-ins.
A standalone OCI chart at
oci://ghcr.io/tracecoreai/charts/tracecore-recipes (versioned
independently of the binary per
RFC-0013 §8 (Recipe versioning))
is the post-PR-L target; until then, install from the in-repo path
above. The container image is ko-built per RFC-0013 PR-D and signed
- attested via the release pipeline (see
docs/reproducibility.md).
For clusters without internet egress, every release publishes a
self-contained bundle that holds the binary, container image, Helm
chart, every cosign signature, and the SLSA v1.0 provenance — plus a
verify.sh that re-verifies the full chain offline. The bundle is
the single artifact you copy across the air gap.
TAG=v0.2.0
REPO=TraceCoreAI/tracecore
VERSION="${TAG#v}"
gh release download "$TAG" --repo "$REPO" \
--pattern "tracecore-airgap-${VERSION}.tar.gz" \
--pattern "tracecore-airgap-${VERSION}.tar.gz.cosign.bundle" \
--pattern "tracecore-airgap-${VERSION}.tar.gz.intoto.jsonl"cosign verify-blob \
--bundle "tracecore-airgap-${VERSION}.tar.gz.cosign.bundle" \
--certificate-identity-regexp "^https://github.com/${REPO}/\.github/workflows/release\.yml@refs/tags/" \
--certificate-oidc-issuer 'https://token.actions.githubusercontent.com' \
--certificate-github-workflow-ref "refs/tags/$TAG" \
--certificate-github-workflow-trigger 'push' \
"tracecore-airgap-${VERSION}.tar.gz"Warning — do not skip lightly. Skipping this step means trusting the download channel. The bundle's offline cosign signature is still re-verified internally by
verify.shonce unpacked, but a man-in-the-middle on the connected host could replace the bundle before extraction (substituting a different signed bundle, or a bundle signed by a different repo's workflow). Air-gapped operators with strict supply-chain requirements MUST verify the outer bundle here before copying across the air gap.
The three files above are everything you copy. Once on the air-gapped host:
tar xzf "tracecore-airgap-${VERSION}.tar.gz"
cd "tracecore-airgap-${VERSION}"
# Inventory + every signature + the SLSA bundle, verified offline.
# verify.sh sets `cosign verify-blob --offline=true` and
# `gh attestation verify --bundle` so no network is required.
bash verify.shFork operators:
verify.shpins the OIDC identity toTraceCoreAI/tracecoreby default. If you are verifying a bundle built by a fork's release workflow, exportTRACECORE_REPO=ForkOwner/repobefore runningverify.shso the identity regexp matches the fork's signing certificate.
verify.sh exits 0 on success. Inspect MANIFEST.txt for the image
digest, chart version, source commit, and per-file sha256s — those are
the values the script binds its identity checks against.
Pick one of the two formats the bundle ships — your registry's documentation will tell you which:
# Option A: docker-load tarball (Docker Engine, podman, containerd).
docker load < images/tracecore.docker.tar
docker tag ghcr.io/tracecoreai/tracecore:"$VERSION" \
<your-registry>/tracecore:"$VERSION"
docker push <your-registry>/tracecore:"$VERSION"
# Option B: OCI layout push (crane, Harbor, JFrog, Artifactory).
crane push images/tracecore.oci <your-registry>/tracecore:"$VERSION"helm install tracecore chart/tracecore-*.tgz \
--set image.repository=<your-registry>/tracecore \
--set image.tag="$VERSION" \
--namespace tracecore-system --create-namespaceIf your registry requires authentication, also set
--set imagePullSecrets[0].name=<your-secret> (the chart wires this
through to the DaemonSet's spec.imagePullSecrets).
The cosign identity binding holds across the registry move: the
signature is bound to the source workflow + tag, not to the registry
hostname, so a post-install re-verification against your private
registry mirror (cosign verify <your-registry>/tracecore@$DIGEST ...)
passes with the same identity flags from
docs/reproducibility.md step 8.
Operators write alerts against these attribute names; they are stable across the pivot per RFC-0013 §3 (Customer-stable telemetry contracts):
| Contract | Surface | Notes |
|---|---|---|
k8s.event.hint |
log attribute | 11-entry enum (pod_evicted, oom_killed, image_pull_failure, …). |
kernelevents.xid |
log attribute | NVRM Xid code (integer). |
gpu.id |
log + metric attribute | PCI BDF for the GPU at fault. |
gpu.vendor |
resource attribute | nvidia | amd | intel | habana, normalized by OTTL. |
tracecore.container.lines_per_s |
metric | Per-rank line rate (15s window). (pre-v0.1.0 placeholder; final name follows RFC-0013 §3) |
gen_ai.training.rank |
resource attribute | Cross-receiver join key. |
gen_ai.training.job_id |
resource attribute | Cross-receiver join key. |
| NCCL FlightRecorder span schema | trace | ncclfrreceiver (retained per RFC-0013 §6). |
| Pattern detector outputs | log + trace | M17/M18/M19 pattern engine. |
The bundled chart ships these as a default OTTL pipeline. You inherit them unless you explicitly override.
service/telemetry exposes Prometheus metrics, a health-check
endpoint via the healthcheck extension, and zpages via the
zpages extension. The default chart values enable all three on
localhost. Standard otelcol_* metric names apply per
RFC-0013 §3;
alerts must not assume the legacy tracecore_* prefix.
Pick the receiver shape that matches your source. All routes below are upstream contrib unless noted:
| Source | Receiver | Reference |
|---|---|---|
| GPU device telemetry (NVIDIA) | prometheus scraping dcgm-exporter |
NVIDIA 1st-party; OTTL normalizes gpu.vendor. |
| GPU device telemetry (AMD) | prometheus scraping ROCm/device-metrics-exporter |
AMD 1st-party. |
| GPU device telemetry (Intel) | prometheus scraping intel/xpumanager |
Intel 1st-party. |
| Kernel log + systemd journal | journald + filelog + OTTL Xid transform |
Contrib. |
| NCCL FlightRecorder dumps | ncclfr |
In-repo module/receiver/ncclfrreceiver; the moat. |
| Kubernetes events | k8sobjects (watch mode on events) + OTTL hint mapping |
Contrib. |
| Kueue scheduler metrics | prometheus (bearer-token + TLS) |
Contrib. |
| Container stdout (pod logs) | filelog + container stanza + file_storage |
Contrib. |
| On-demand Python stack sampling | parca-agent (separate DaemonSet, OTLP profiles) |
Polar Signals / CNCF; deferred to v0.4.0+ per #222. |
| Heartbeat / install-bench primitive | hostmetrics (load scraper) |
Contrib; OCB-bundled. |
For exporters, debug (used above) is the dev path; otlphttp
(integrations/otel-backend.md) is
the production sink to an OTel Collector or backend. The OCB distro
also bundles datadogexporter and clickhouseexporter directly -
see integrations/datadog.md and
integrations/clickhouse-direct.md.
- A receiver is silent → check
otelcol_receiver_accepted_*/otelcol_receiver_refused_*against the standard self-telemetry surface, then the per-recipe runbook bundled with the chart. - A runtime-level failure mode you can't find in a recipe → check
docs/FAILURE-MODES.mdfor lifecycle, shutdown, signal, and self-telemetry behavior. - A panic, crash, or unexpected exit → grab the binary version, stderr, and a minimal reproducing config and file an issue.
- Operator:
docs/integrations/- backend recipes (Datadog, ClickHouse, Honeycomb, generic OTel collector). - Operator (deploy shape):
docs/reference-architectures/- three starting-point archs by cluster topology (single-cluster, multi-cluster aggregation, air-gapped). Each names scale envelope, required components, values overlay, NetworkPolicy posture, and validation checklist. - Operator (envelope check):
docs/SUPPORT-MATRIX.md- which Kubernetes versions, GPU vendors, CNIs, and Linux distros the v1.0-rc1 cut commits to, with the CI gate behind each row. - Contributor:
CONTRIBUTING.md- RFC process, PR conventions, upstream-contribution policy (RFC-0013 §5). - Evaluator:
docs/rfcs/0013-distro-first-pivot.md- the binding doc for the distribution-first posture; covers what is built upstream vs in the tracecore-components moat.