Skip to content

Latest commit

 

History

History
290 lines (238 loc) · 12.6 KB

File metadata and controls

290 lines (238 loc) · 12.6 KB

Getting started

Tracecore is a distribution of the OpenTelemetry Collector - assembled by the OpenTelemetry Collector Builder (OCB) from upstream and contrib components plus a thin layer of pattern-detection processors that live in the in-repo Go submodule at module/ (path github.com/tracecoreai/tracecore/module). Welcome.

This walkthrough takes you from a fresh clone to your first OTLP byte in five commands. No GPU, no container runtime, no Kubernetes - just Go and a terminal.

The default demo wires the upstream hostmetrics receiver (load scraper) to the debug exporter so you can verify the binary works on your host before pointing it at real signals. Production sources (GPU telemetry, kernel events, k8s events, …) flow through upstream receivers (prometheusreceiver, journaldreceiver, filelogreceiver, k8sobjectsreceiver) per the adoption matrix in RFC-0013 §2.

Prerequisites

  • Go ≥ the version pinned in .go-version (read with cat .go-version).
  • A POSIX shell (bash/zsh/etc.) - Linux or macOS.

Walkthrough

# 1. Clone and build. `make build` invokes the OpenTelemetry Collector
#    Builder (`builder --config=builder-config.yaml`); OCB generates
#    main.go from the manifest, then compiles the assembled binary.
#    The Go toolchain + lint/test tools are pinned via `go.mod` tool
#    directives; no separate install step.
git clone https://github.com/tracecoreai/tracecore.git
cd tracecore
make build

# 2. Write a minimal config. `hostmetrics`' load scraper emits 3
#    low-cardinality series (system.cpu.load_average.{1m,5m,15m})
#    at 1s heartbeat — the OCB-bundled replacement for the retired
#    in-tree `clockreceiver` per RFC-0013 PR-E. `debug` prints OTLP
#    to stdout. Portable across linux/darwin/windows.
cat > demo.yaml <<'YAML'
receivers:
  hostmetrics:
    collection_interval: 1s
    scrapers:
      load: {}
exporters:
  debug:
    verbosity: detailed
service:
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [debug]
YAML

# 3. Dry-run the config without starting any I/O. Catches typos,
#    unknown component types, and missing required fields with
#    file:line:column.
./_build/tracecore validate --config=demo.yaml

# 4. Run. Stderr carries structured lifecycle logs; stdout carries
#    the OTLP-encoded debug output. Ctrl-C (SIGINT) triggers graceful
#    shutdown; a second Ctrl-C forces immediate exit.
./_build/tracecore --config=demo.yaml

You should see structured logs on stderr from service/telemetry - the upstream self-telemetry surface - naming the active pipelines and components, followed by per-interval debug-exporter dumps on stdout.

Stop with Ctrl-C. The collector logs "Shutdown complete" and exits 0. Exit code 0 = clean. Exit code 1 = a deadline elapsed during shutdown or a forced second-Ctrl-C exit. Exit code 2 = config / data issue. See ./_build/tracecore --help for the full subcommand table.

Install via Helm

For a cluster install, deploy from a local checkout of the in-repo chart at install/kubernetes/tracecore/:

helm install tracecore install/kubernetes/tracecore \
  --namespace tracecore-system --create-namespace

The default values pair the upstream hostmetrics receiver with the debug exporter so a fresh install boots cleanly on a no-GPU cluster. See install/kubernetes/tracecore/README.md for the values surface, PSS posture, and per-receiver opt-ins.

A standalone OCI chart at oci://ghcr.io/tracecoreai/charts/tracecore-recipes (versioned independently of the binary per RFC-0013 §8 (Recipe versioning)) is the post-PR-L target; until then, install from the in-repo path above. The container image is ko-built per RFC-0013 PR-D and signed

Air-gapped install

For clusters without internet egress, every release publishes a self-contained bundle that holds the binary, container image, Helm chart, every cosign signature, and the SLSA v1.0 provenance — plus a verify.sh that re-verifies the full chain offline. The bundle is the single artifact you copy across the air gap.

1. On a connected host: download

TAG=v0.2.0
REPO=TraceCoreAI/tracecore
VERSION="${TAG#v}"
gh release download "$TAG" --repo "$REPO" \
  --pattern "tracecore-airgap-${VERSION}.tar.gz" \
  --pattern "tracecore-airgap-${VERSION}.tar.gz.cosign.bundle" \
  --pattern "tracecore-airgap-${VERSION}.tar.gz.intoto.jsonl"

2. Verify the outer bundle (still connected, recommended)

cosign verify-blob \
  --bundle "tracecore-airgap-${VERSION}.tar.gz.cosign.bundle" \
  --certificate-identity-regexp "^https://github.com/${REPO}/\.github/workflows/release\.yml@refs/tags/" \
  --certificate-oidc-issuer 'https://token.actions.githubusercontent.com' \
  --certificate-github-workflow-ref "refs/tags/$TAG" \
  --certificate-github-workflow-trigger 'push' \
  "tracecore-airgap-${VERSION}.tar.gz"

Warning — do not skip lightly. Skipping this step means trusting the download channel. The bundle's offline cosign signature is still re-verified internally by verify.sh once unpacked, but a man-in-the-middle on the connected host could replace the bundle before extraction (substituting a different signed bundle, or a bundle signed by a different repo's workflow). Air-gapped operators with strict supply-chain requirements MUST verify the outer bundle here before copying across the air gap.

3. Copy across the air gap, unpack, verify offline

The three files above are everything you copy. Once on the air-gapped host:

tar xzf "tracecore-airgap-${VERSION}.tar.gz"
cd "tracecore-airgap-${VERSION}"

# Inventory + every signature + the SLSA bundle, verified offline.
# verify.sh sets `cosign verify-blob --offline=true` and
# `gh attestation verify --bundle` so no network is required.
bash verify.sh

Fork operators: verify.sh pins the OIDC identity to TraceCoreAI/tracecore by default. If you are verifying a bundle built by a fork's release workflow, export TRACECORE_REPO=ForkOwner/repo before running verify.sh so the identity regexp matches the fork's signing certificate.

verify.sh exits 0 on success. Inspect MANIFEST.txt for the image digest, chart version, source commit, and per-file sha256s — those are the values the script binds its identity checks against.

4. Load the image into your private registry

Pick one of the two formats the bundle ships — your registry's documentation will tell you which:

# Option A: docker-load tarball (Docker Engine, podman, containerd).
docker load < images/tracecore.docker.tar
docker tag ghcr.io/tracecoreai/tracecore:"$VERSION" \
  <your-registry>/tracecore:"$VERSION"
docker push <your-registry>/tracecore:"$VERSION"

# Option B: OCI layout push (crane, Harbor, JFrog, Artifactory).
crane push images/tracecore.oci <your-registry>/tracecore:"$VERSION"

5. Install the chart pointing at your private registry

helm install tracecore chart/tracecore-*.tgz \
  --set image.repository=<your-registry>/tracecore \
  --set image.tag="$VERSION" \
  --namespace tracecore-system --create-namespace

If your registry requires authentication, also set --set imagePullSecrets[0].name=<your-secret> (the chart wires this through to the DaemonSet's spec.imagePullSecrets).

The cosign identity binding holds across the registry move: the signature is bound to the source workflow + tag, not to the registry hostname, so a post-install re-verification against your private registry mirror (cosign verify <your-registry>/tracecore@$DIGEST ...) passes with the same identity flags from docs/reproducibility.md step 8.

Customer-stable telemetry contracts

Operators write alerts against these attribute names; they are stable across the pivot per RFC-0013 §3 (Customer-stable telemetry contracts):

Contract Surface Notes
k8s.event.hint log attribute 11-entry enum (pod_evicted, oom_killed, image_pull_failure, …).
kernelevents.xid log attribute NVRM Xid code (integer).
gpu.id log + metric attribute PCI BDF for the GPU at fault.
gpu.vendor resource attribute nvidia | amd | intel | habana, normalized by OTTL.
tracecore.container.lines_per_s metric Per-rank line rate (15s window). (pre-v0.1.0 placeholder; final name follows RFC-0013 §3)
gen_ai.training.rank resource attribute Cross-receiver join key.
gen_ai.training.job_id resource attribute Cross-receiver join key.
NCCL FlightRecorder span schema trace ncclfrreceiver (retained per RFC-0013 §6).
Pattern detector outputs log + trace M17/M18/M19 pattern engine.

The bundled chart ships these as a default OTTL pipeline. You inherit them unless you explicitly override.

Self-telemetry surface

service/telemetry exposes Prometheus metrics, a health-check endpoint via the healthcheck extension, and zpages via the zpages extension. The default chart values enable all three on localhost. Standard otelcol_* metric names apply per RFC-0013 §3; alerts must not assume the legacy tracecore_* prefix.

Add a real receiver

Pick the receiver shape that matches your source. All routes below are upstream contrib unless noted:

Source Receiver Reference
GPU device telemetry (NVIDIA) prometheus scraping dcgm-exporter NVIDIA 1st-party; OTTL normalizes gpu.vendor.
GPU device telemetry (AMD) prometheus scraping ROCm/device-metrics-exporter AMD 1st-party.
GPU device telemetry (Intel) prometheus scraping intel/xpumanager Intel 1st-party.
Kernel log + systemd journal journald + filelog + OTTL Xid transform Contrib.
NCCL FlightRecorder dumps ncclfr In-repo module/receiver/ncclfrreceiver; the moat.
Kubernetes events k8sobjects (watch mode on events) + OTTL hint mapping Contrib.
Kueue scheduler metrics prometheus (bearer-token + TLS) Contrib.
Container stdout (pod logs) filelog + container stanza + file_storage Contrib.
On-demand Python stack sampling parca-agent (separate DaemonSet, OTLP profiles) Polar Signals / CNCF; deferred to v0.4.0+ per #222.
Heartbeat / install-bench primitive hostmetrics (load scraper) Contrib; OCB-bundled.

For exporters, debug (used above) is the dev path; otlphttp (integrations/otel-backend.md) is the production sink to an OTel Collector or backend. The OCB distro also bundles datadogexporter and clickhouseexporter directly - see integrations/datadog.md and integrations/clickhouse-direct.md.

When something goes wrong

  • A receiver is silent → check otelcol_receiver_accepted_* / otelcol_receiver_refused_* against the standard self-telemetry surface, then the per-recipe runbook bundled with the chart.
  • A runtime-level failure mode you can't find in a recipe → check docs/FAILURE-MODES.md for lifecycle, shutdown, signal, and self-telemetry behavior.
  • A panic, crash, or unexpected exit → grab the binary version, stderr, and a minimal reproducing config and file an issue.

Next steps

  • Operator: docs/integrations/ - backend recipes (Datadog, ClickHouse, Honeycomb, generic OTel collector).
  • Operator (deploy shape): docs/reference-architectures/ - three starting-point archs by cluster topology (single-cluster, multi-cluster aggregation, air-gapped). Each names scale envelope, required components, values overlay, NetworkPolicy posture, and validation checklist.
  • Operator (envelope check): docs/SUPPORT-MATRIX.md - which Kubernetes versions, GPU vendors, CNIs, and Linux distros the v1.0-rc1 cut commits to, with the CI gate behind each row.
  • Contributor: CONTRIBUTING.md - RFC process, PR conventions, upstream-contribution policy (RFC-0013 §5).
  • Evaluator: docs/rfcs/0013-distro-first-pivot.md - the binding doc for the distribution-first posture; covers what is built upstream vs in the tracecore-components moat.