2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/feature_request.md
@@ -7,7 +7,7 @@ labels: enhancement

<!--
Before filing: please check existing issues and the milestone roadmap in
-MILESTONES.md to see whether this is already planned or out of scope.
+docs/MILESTONES.md to see whether this is already planned or out of scope.

If the change is *architectural* (new subsystem, new dependency, new
public surface), please file an RFC under docs/rfcs/ instead. See
2 changes: 1 addition & 1 deletion .github/branch-protection.yml
@@ -16,7 +16,7 @@ allow_force_push: false
allow_deletions: false
require_conversation_resolution: true

-# Solo maintainer until NORTHSTARS.md §O7's M18 target (≥1 non-employee
+# Solo maintainer until docs/NORTHSTARS.md §O7's M18 target (≥1 non-employee
# maintainer active) is met. GitHub doesn't allow authors to approve their
# own PRs, so `1` review required + sole maintainer = unmergeable PRs.
# Bump both back to `1` / `true` the day a second maintainer joins.
2 changes: 1 addition & 1 deletion .github/workflows/chaos.yml
@@ -18,7 +18,7 @@ name: Chaos
# rides on upstream `go.opentelemetry.io/collector/service` and is
# covered by upstream's own chaos tests.
#
-# Matrix-of-patterns rule: per MILESTONES.md §M4b the workflow grows
+# Matrix-of-patterns rule: per docs/MILESTONES.md §M4b the workflow grows
# a row when each pattern lands. M17 / M18 are still open and will
# add their own rows when they land.

6 changes: 3 additions & 3 deletions AGENTS.md
@@ -147,16 +147,16 @@ Universal rules every change should respect. The file is capped at

- **Cross-receiver join contracts resolve at design-lock, not
implementation-merge.** Every `M-X emits Y` rubric line in
-MILESTONES.md is implicitly a contract with every `M-Z consumes Y`
-receiver. Before opening a receiver-scope RFC PR, grep MILESTONES.md
+docs/MILESTONES.md is implicitly a contract with every `M-Z consumes Y`
+receiver. Before opening a receiver-scope RFC PR, grep docs/MILESTONES.md
and every sibling RFC for each attribute name, label name, or field
name the new receiver introduces; reconcile divergences in the same
PR (or its sibling-RFC's PR), not in a follow-up shard row saying
"whichever lands first". The "first-author-pays, second-author-fights-the-amendment"
pattern is the failure mode. Anchor: PR #94 found that the M15
rubric's pod-label fallback `tracecore.io/rank` diverged from
RFC-0009's `gen_ai.training.io/rank`; Phase-3 review caught it,
-amended MILESTONES.md line 358 in-PR, and the follow-up row was
+amended docs/MILESTONES.md line 358 in-PR, and the follow-up row was
marked resolved in the same commit.

## Topic index - repo-wide
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -8,7 +8,7 @@ Thanks for your interest. Tracecore is in early development and we welcome focus
- Read [RFC-0013](docs/rfcs/0013-distro-first-pivot.md) for the binding architectural posture (distribution-first; adopt upstream first; in-house code bounded to the four moat scopes in §6).
- Skim [`AGENTS.md`](AGENTS.md) for the load-bearing lessons every change should respect, and the topic index pointing into `docs/notes/` for deeper per-area guidance.
- For governance questions - who has commit access, how RFCs are sponsored, how security disclosure works - read [`docs/maintainership.md`](docs/maintainership.md). This file owns day-to-day PR mechanics; that one owns who decides what.
-- **Keep the tracking docs current.** Any PR that starts, advances, ships, or re-scopes work tracked in [`MILESTONES.md`](MILESTONES.md) or the follow-up shards under [`docs/followups/`](docs/followups/README.md) updates the corresponding entry (and, for milestones, the lane table) in the same PR; see [`MILESTONES.md` § "Keeping this document current"](MILESTONES.md#keeping-this-document-current) for the exact transitions covering both. Status drift is a review blocker. The legacy `docs/FOLLOWUPS.md` is now a redirect stub; add new items directly to the matching shard, and see [`docs/followups/README.md`](docs/followups/README.md) for the filing convention and conflict-free append rules.
+- **Keep the tracking docs current.** Any PR that starts, advances, ships, or re-scopes work tracked in [`MILESTONES.md`](docs/MILESTONES.md) or the follow-up shards under [`docs/followups/`](docs/followups/README.md) updates the corresponding entry (and, for milestones, the lane table) in the same PR; see [`MILESTONES.md` § "Keeping this document current"](docs/MILESTONES.md#keeping-this-document-current) for the exact transitions covering both. Status drift is a review blocker. The legacy `docs/FOLLOWUPS.md` is now a redirect stub; add new items directly to the matching shard, and see [`docs/followups/README.md`](docs/followups/README.md) for the filing convention and conflict-free append rules.
- File an **RFC** in `docs/rfcs/` for load-bearing decisions you want documented for future contributors - architectural calls with non-obvious alternatives, policy changes, deprecations. Not gated. Small architectural choices can land via a clear commit message + a code comment. Judgment-called; over-RFC beats under-RFC at this stage, but velocity beats both.
- Check open issues to avoid duplicate work.

8 changes: 4 additions & 4 deletions README.md
@@ -2,7 +2,7 @@

**When a distributed training run breaks, the operator gets told what broke - by name, with its evidence trail - instead of correlating signals across the stack by hand.**

-Tracecore is **an OpenTelemetry Collector distribution + AI-training pattern library for distributed-training observability**. The binary is assembled from upstream OpenTelemetry + contrib components via the OpenTelemetry Collector Builder (OCB); the differentiator is the bundled **pattern detectors**, **NCCL FlightRecorder receiver**, **OTTL processors** (cross-signal rank join, dataloader timing, eviction join), and the **recipes** that wire upstream receivers into training-cluster-shaped signal pipelines. The 15 named root-cause patterns in [`NORTHSTARS.md`](NORTHSTARS.md#appendix-a-the-15-named-root-cause-patterns) define what "told what broke" means concretely; four DCGM-observable patterns ship with walkthroughs in [`docs/patterns/`](docs/patterns/) today, with the remainder tracked in [`MILESTONES.md`](MILESTONES.md).
+Tracecore is **an OpenTelemetry Collector distribution + AI-training pattern library for distributed-training observability**. The binary is assembled from upstream OpenTelemetry + contrib components via the OpenTelemetry Collector Builder (OCB); the differentiator is the bundled **pattern detectors**, **NCCL FlightRecorder receiver**, **OTTL processors** (cross-signal rank join, dataloader timing, eviction join), and the **recipes** that wire upstream receivers into training-cluster-shaped signal pipelines. The 15 named root-cause patterns in [`NORTHSTARS.md`](docs/NORTHSTARS.md#appendix-a-the-15-named-root-cause-patterns) define what "told what broke" means concretely; four DCGM-observable patterns ship with walkthroughs in [`docs/patterns/`](docs/patterns/) today, with the remainder tracked in [`MILESTONES.md`](docs/MILESTONES.md).

The collector is open source. The synthesis engine that interprets the data is a separate, hosted product.

@@ -20,11 +20,11 @@ Tracecore **is** an OTel Collector distribution - assembled via OCB from upstrea

## Status

-Pre-alpha. The repo is mid-pivot to the distribution-first posture (RFC-0013): the binary is moving to OCB assembly from upstream + contrib components, with the in-house surface contracting to the four moat scopes (pattern detectors, OTTL processors with windowed semantics, NCCL FlightRecorder parsing, install/overhead bench). The current in-tree receivers under [`components/receivers/`](components/receivers/) are queued for deletion across v0.1.0 / v0.2.0 / v0.3.0; see [`MILESTONES.md`](MILESTONES.md) and RFC-0013 §7 for the release-boundary schedule. See [`CHANGELOG.md`](CHANGELOG.md) for the moving parts.
+Pre-alpha. The repo is mid-pivot to the distribution-first posture (RFC-0013): the binary is moving to OCB assembly from upstream + contrib components, with the in-house surface contracting to the four moat scopes (pattern detectors, OTTL processors with windowed semantics, NCCL FlightRecorder parsing, install/overhead bench). The current in-tree receivers under [`components/receivers/`](components/receivers/) are queued for deletion across v0.1.0 / v0.2.0 / v0.3.0; see [`MILESTONES.md`](docs/MILESTONES.md) and RFC-0013 §7 for the release-boundary schedule. See [`CHANGELOG.md`](CHANGELOG.md) for the moving parts.

## Production readiness

-What's safe to deploy today, what's still shipping. Honest read at HEAD; check [`CHANGELOG.md`](CHANGELOG.md) + [`MILESTONES.md`](MILESTONES.md) for the moving parts.
+What's safe to deploy today, what's still shipping. Honest read at HEAD; check [`CHANGELOG.md`](CHANGELOG.md) + [`MILESTONES.md`](docs/MILESTONES.md) for the moving parts.

| Surface | Stability | CI tests | Signed binaries |
|---|---|---|---|
@@ -86,7 +86,7 @@ Lifecycle logs go to stderr. Run `./_build/tracecore --help` for the full flag s
|---|---|
| **Operator** running tracecore in production | [`docs/getting-started.md`](docs/getting-started.md) → bundled recipes under [`docs/integrations/`](docs/integrations/) → [`docs/FAILURE-MODES.md`](docs/FAILURE-MODES.md) |
| **Contributor** adding a receiver / processor / exporter | [`CONTRIBUTING.md`](CONTRIBUTING.md) → [`PRINCIPLES.md`](PRINCIPLES.md) (the *why*) → [`STYLE.md`](STYLE.md) (the *what*) → upstream [`go.opentelemetry.io/collector`](https://pkg.go.dev/go.opentelemetry.io/collector) component/receiver/processor/exporter packages |
-| **Maintainer** making architectural calls | [`docs/STRATEGY.md`](docs/STRATEGY.md) → [`NORTHSTARS.md`](NORTHSTARS.md) → [`docs/rfcs/`](docs/rfcs/) → [`MILESTONES.md`](MILESTONES.md) → [`docs/FOLLOWUPS.md`](docs/FOLLOWUPS.md) |
+| **Maintainer** making architectural calls | [`docs/STRATEGY.md`](docs/STRATEGY.md) → [`NORTHSTARS.md`](docs/NORTHSTARS.md) → [`docs/rfcs/`](docs/rfcs/) → [`MILESTONES.md`](docs/MILESTONES.md) → [`docs/FOLLOWUPS.md`](docs/FOLLOWUPS.md) |
| **Evaluating** tracecore for your fleet | This README + [`CHANGELOG.md`](CHANGELOG.md) → [`docs/STRATEGY.md`](docs/STRATEGY.md) "single load-bearing principle" |
| **Verifying** a published release end-to-end (auditor / supply-chain) | [`docs/reproducibility.md`](docs/reproducibility.md) (rebuild → diffoscope → cosign → SLSA → SBOM) |

2 changes: 1 addition & 1 deletion bench/overhead/nccl_fr_bench_test.go
@@ -2,7 +2,7 @@

// Package overhead measures the steady-state cost of receivers replaying
// realistic input volumes. The benchmarks here are advisory (not gating)
-// per STYLE.md §Testing; the MILESTONES.md §M11 line 517 rubric asserts
+// per STYLE.md §Testing; the docs/MILESTONES.md §M11 line 517 rubric asserts
// the cost is within NORTHSTARS O2.
package overhead

4 changes: 2 additions & 2 deletions components/receivers/pyspy/README.md
@@ -76,7 +76,7 @@ Operator pip-installs [`tracecore-pyspy`](../../../python/) into the workload tr
## References

- [RFC-0009](../../../docs/rfcs/0009-pyspy-receiver-scope.md) - design source.
-- [NORTHSTARS.md §O2](../../../NORTHSTARS.md) - per-receiver overhead budget (CPU ≤0.05%, RSS ≤10 MB, egress ≤0.05 Mbps).
-- [NORTHSTARS.md §O4](../../../NORTHSTARS.md) - `gen_ai.training.*` semconv shepherding commitment.
+- [NORTHSTARS.md §O2](../../../docs/NORTHSTARS.md) - per-receiver overhead budget (CPU ≤0.05%, RSS ≤10 MB, egress ≤0.05 Mbps).
+- [NORTHSTARS.md §O4](../../../docs/NORTHSTARS.md) - `gen_ai.training.*` semconv shepherding commitment.
- [PRINCIPLES.md §1](../../../PRINCIPLES.md) - never crash the workload.
- [`docs/FOLLOWUPS.md`](../../../docs/FOLLOWUPS.md) - Phase 3/4 deferred items.