Merged
2 changes: 1 addition & 1 deletion docs/MILESTONES.md
Original file line number Diff line number Diff line change
@@ -315,7 +315,7 @@ Lane 6 covers NVIDIA-side device telemetry (DCGM), NCCL collective diagnostics (

- **Status:** ☐
- **Depends on:** M8 cgo, M9, M11, M4b
- **Reference:** [`docs/patterns/pattern-1-nvlink-degradation.md`](patterns/pattern-1-nvlink-degradation.md)
- **Reference:** [`docs/patterns/01-nvlink-degradation-walkthrough.md`](patterns/01-nvlink-degradation-walkthrough.md)
- **Hardware:** replay corpus: none. Integration variants exercising real DCGM: Linux + NVIDIA GPU (flood-gated).

**Rubric summary:** Replay fixture `internal/synthesis/replay/nvlink_silent/fixture_xid79_nccl_hang/` (synthetic Xid 79 dmesg + NCCL FR hang pickle across 8 ranks) → exactly one `pattern.id="1"` verdict within 60s. **Cross-rank join export:** M17 creates `pkg/nccl/fr_parser/cross_rank.go` exporting `JoinByCollectiveSeq(perRank [][]Record) StepTree` as public API (M18 build-time consumer); tolerates `N-1` missing ranks (`state=unknown_hung`). Evidence trail (a) Xid 79 / Xid 74 [Hopper/Ampere] kernel event, (b) DCGM `hw.gpu.nvlink.io` 50%-of-median divergence, (c) NCCL FR `state=started, !completed`; missing layer → `confidence=partial`. Structured-query lint (no raw PromQL). Hermetic replay (no network/GPU); `//go:build integration` for real DCGM variants. Headline regex `^NVLink degradation on rank \d+ at step \d+$`. Negative + adjacent-pattern fixtures (#2/#6/#8) zero-verdict. NFR: zero FP on healthy + adjacent corpora; 64-rank join ≤100ms p99 *(unverified)*; NCCL 2.29.x vs 2.30.x golden-identical; component+rank+step+missing-layer in every return path.
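
The headline contract in the rubric summary can be checked mechanically. The following is a minimal Go sketch, not the detector's actual code: only the regular expression is taken from the rubric, while `formatHeadline` is a hypothetical helper introduced here for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// headlineRE is the headline pattern quoted in the rubric summary.
var headlineRE = regexp.MustCompile(`^NVLink degradation on rank \d+ at step \d+$`)

// formatHeadline is a hypothetical helper (an assumption, not from the
// tracecore source) that renders a headline in the shape the regex expects.
func formatHeadline(rank, step int) string {
	return fmt.Sprintf("NVLink degradation on rank %d at step %d", rank, step)
}

func main() {
	candidates := []string{
		formatHeadline(3, 1200),
		"NVLink degradation on rank 3 at step 1200 (partial)", // trailing text
		"nvlink degradation on rank 3 at step 1200",           // wrong case
	}
	for _, h := range candidates {
		fmt.Printf("%q matches: %v\n", h, headlineRE.MatchString(h))
	}
}
```

Because the pattern is anchored at both ends, any extra context (confidence, missing layer) would have to travel in other verdict fields rather than in the headline itself.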
8 changes: 4 additions & 4 deletions docs/NORTHSTARS.md
@@ -407,11 +407,11 @@ The pattern set tracecore is built to root-cause end-to-end. Coverage of this li

| # | Pattern | Symptom | Layers crossed | Spec |
|---|---|---|---|---|
| 1 | NVLink silent degradation | Xid 79/74; NCCL hangs on AllReduce | DCGM + Xid + NVLink fabric + NCCL | ☑ [walkthrough](patterns/pattern-1-nvlink-degradation.md) |
| 1 | NVLink silent degradation | Xid 79/74; NCCL hangs on AllReduce | DCGM + Xid + NVLink fabric + NCCL | ☑ [walkthrough](patterns/01-nvlink-degradation-walkthrough.md) |
| 2 | InfiniBand link flap | NCCL "HCA error" or sudden bandwidth collapse | IB/RDMA + NCCL | ☐ [spec](patterns/02-ib-link-flap.md) |
| 3 | Uncorrectable HBM ECC | Xid 48/63/64; single GPU dies | DCGM + dmesg | ☑ [walkthrough](patterns/pattern-3-hbm-ecc.md) |
| 4 | Thermal throttling cascade | Rack throttles; GPUs become stragglers | DCGM + power telemetry + stragglers | ☑ [walkthrough](patterns/pattern-4-thermal-throttle.md) |
| 5 | PCIe AER cascade | Correctable PCIe errors; no crash | PCIe/BMC + DCGM | ☑ [walkthrough](patterns/pattern-5-pcie-aer.md) |
| 3 | Uncorrectable HBM ECC | Xid 48/63/64; single GPU dies | DCGM + dmesg | ☑ [walkthrough](patterns/03-hbm-ecc-walkthrough.md) |
| 4 | Thermal throttling cascade | Rack throttles; GPUs become stragglers | DCGM + power telemetry + stragglers | ☑ [walkthrough](patterns/04-thermal-throttle-walkthrough.md) |
| 5 | PCIe AER cascade | Correctable PCIe errors; no crash | PCIe/BMC + DCGM | ☑ [walkthrough](patterns/05-pcie-aer-walkthrough.md) |
| 6 | Stragglers from slow node | `data_time` 3× normal on one node | CPU/DRAM + dataloader I/O + stragglers | ☑ in-tree detector |
| 7 | Dataloader hang | Worker death, FUSE stall, S3 throttling | Dataloader + storage + Python runtime | ☐ [spec](patterns/07-dataloader-hang.md) |
| 8 | NCCL timeout, no hardware cause | All-reduce hangs without HW signal | NCCL + distributed framework | ☐ [spec](patterns/08-nccl-timeout-no-hw.md) |
7 changes: 4 additions & 3 deletions docs/followups/M4b.md
@@ -87,10 +87,11 @@ implementation gated on an upstream package that has not landed yet.
cover `failure-inject` at all. Fix: extend the flag help text +
add a RUNBOOK section. *Trigger:* first operator script that
depends on the exit code, OR M21 docs sweep.
- **`docs/patterns/pattern-14-pod-evicted.md` missing.** Patterns
- **`docs/patterns/14-pod-evicted-walkthrough.md` missing.** Patterns
#1, #3, #4, #5 each carry a research-note companion under
`docs/patterns/`. M19 inverted the order (detector first) and
the companion was not tracked. *Trigger:* first M17/M18
`docs/patterns/` (per the `NN-slug-walkthrough.md` convention in
`docs/patterns/README.md`). M19 inverted the order (detector first)
and the companion was not tracked. *Trigger:* first M17/M18
contributor notices the asymmetry, OR M21 docs sweep.
- **Verdict JSON Schema path pattern-#14-specific at package root.**
`module/pkg/patterns/testdata/verdict.schema.json` hard-
10 changes: 5 additions & 5 deletions docs/integrations/prometheus-scrape.md
@@ -296,7 +296,7 @@ The link index lift uses `Int(ExtractPatterns(metric.name,
so the resulting attribute is integer-typed (matches the semconv
proposal's `hw.gpu.nvlink.link: int`). Per-link decomposition is
the diagnostic-critical surface for
[pattern #1 silent NVLink degradation](../patterns/pattern-1-nvlink-degradation.md);
[pattern #1 silent NVLink degradation](../patterns/01-nvlink-degradation-walkthrough.md);
without it the alert query has no group-by axis.

> **dcgm-exporter opt-in required.** The
@@ -325,7 +325,7 @@ references.

The attribute names match the semconv `hw.errors` shape (see
[hw common](https://opentelemetry.io/docs/specs/semconv/hardware/common/)).
[Pattern #3 doc](../patterns/pattern-3-hbm-ecc.md) consumes the
[Pattern #3 doc](../patterns/03-hbm-ecc-walkthrough.md) consumes the
`error.persistence=volatile` row in its alert query.

### Pattern #4 — Thermal throttle cascade
@@ -352,7 +352,7 @@ resolves the vocabulary. Tracked at
[#272](https://github.com/TraceCoreAI/tracecore/issues/272) for the
upstream proposal extension.

[Pattern #4 doc](../patterns/pattern-4-thermal-throttle.md) alerts
[Pattern #4 doc](../patterns/04-thermal-throttle-walkthrough.md) alerts
on the `reason=thermal` row; the other reasons are diagnostic
context (`power` correlates with PSU sag, `hw_slowdown` is the
"GPU has decided to clock itself down" hard signal).
@@ -372,7 +372,7 @@ GPU set.
The `pci_bus_id` label is lifted to the resource-level
`hw.gpu.pci.bdf` so pattern #5's escalation matrix can cross-
reference dmesg `PCIe Bus Error: Corrected` lines against the
same BDF without joining series. [Pattern #5 doc](../patterns/pattern-5-pcie-aer.md)
same BDF without joining series. [Pattern #5 doc](../patterns/05-pcie-aer-walkthrough.md)
shows the divergence query in PromQL form.

### Pattern #10 — CUDA OOM (framebuffer)
@@ -446,7 +446,7 @@ statement evaluates. Renaming first would short-circuit the
attribute stamps because the second statement's guard would no
longer find the original name.

[Pattern #2 doc](../patterns/pattern-2-ib-link-flap.md) consumes the
[Pattern #2 doc](../patterns/02-ib-link-flap-walkthrough.md) consumes the
joined record via
[`projectIBPortStateRecord`](../../module/processor/patterndetectorprocessor/ib_link_flap.go)
(gate: `hw.network.ib.port.state` AND `hw.network.ib.device` AND
2 changes: 1 addition & 1 deletion docs/patterns/02-ib-link-flap.md
@@ -1,6 +1,6 @@
# Pattern #2 — InfiniBand link flap

**Status:** ☑ shipped — detector library at `module/pkg/patterns/ib_link_flap.go`, processor wiring at `module/processor/patterndetectorprocessor/ib_link_flap.go`. Operator-facing walkthrough at [pattern-2-ib-link-flap.md](pattern-2-ib-link-flap.md).
**Status:** ☑ shipped — detector library at `module/pkg/patterns/ib_link_flap.go`, processor wiring at `module/processor/patterndetectorprocessor/ib_link_flap.go`. Operator-facing walkthrough at [02-ib-link-flap-walkthrough.md](02-ib-link-flap-walkthrough.md).

Design spec for the pattern-#2 detector. Distinct from the operator-facing walkthroughs (`pattern-N-*.md`) — this is the engineering contract that a TDD red test gets written against.

31 changes: 25 additions & 6 deletions docs/patterns/README.md
@@ -41,11 +41,11 @@ Five of [NORTHSTARS Appendix A's 15 patterns](../NORTHSTARS.md#appendix-a-the-15

| Pattern | File | DCGM signal |
|---|---|---|
| #1 NVLink silent degradation | [pattern-1-nvlink-degradation.md](pattern-1-nvlink-degradation.md) | per-link `hw.gpu.nvlink.io` Tx/Rx divergence |
| #2 InfiniBand link flap | [pattern-2-ib-link-flap.md](pattern-2-ib-link-flap.md) | `hw.network.ib.port.state` ACTIVE↔DOWN transitions (>=2 in 2m window) joined to same-node NCCL FR stuck collective |
| #3 Uncorrectable HBM ECC | [pattern-3-hbm-ecc.md](pattern-3-hbm-ecc.md) | `hw.errors{error.type=uncorrected}` non-zero |
| #4 Thermal throttling cascade | [pattern-4-thermal-throttle.md](pattern-4-thermal-throttle.md) | `hw.gpu.throttle.duration{reason=thermal}` rate-of-change |
| #5 PCIe AER cascade | [pattern-5-pcie-aer.md](pattern-5-pcie-aer.md) | `hw.gpu.io` Tx/Rx counter discontinuities |
| #1 NVLink silent degradation | [01-nvlink-degradation-walkthrough.md](01-nvlink-degradation-walkthrough.md) | per-link `hw.gpu.nvlink.io` Tx/Rx divergence |
| #2 InfiniBand link flap | [02-ib-link-flap-walkthrough.md](02-ib-link-flap-walkthrough.md) | `hw.network.ib.port.state` ACTIVE↔DOWN transitions (>=2 in 2m window) joined to same-node NCCL FR stuck collective |
| #3 Uncorrectable HBM ECC | [03-hbm-ecc-walkthrough.md](03-hbm-ecc-walkthrough.md) | `hw.errors{error.type=uncorrected}` non-zero |
| #4 Thermal throttling cascade | [04-thermal-throttle-walkthrough.md](04-thermal-throttle-walkthrough.md) | `hw.gpu.throttle.duration{reason=thermal}` rate-of-change |
| #5 PCIe AER cascade | [05-pcie-aer-walkthrough.md](05-pcie-aer-walkthrough.md) | `hw.gpu.io` Tx/Rx counter discontinuities |
| #7 Dataloader hang | [07-dataloader-hang.md](07-dataloader-hang.md) | `tracecore.alert.training_step_stalled.*` + `dataloader.error_class` OR `k8s.event.reason{FailedMount\|VolumeMountFailure}` |

## Design specs (planned detectors — TDD red-test inputs)
@@ -54,7 +54,7 @@ Engineering-facing pattern-design specs for the 8 unspec'd v1 patterns. Each fol

| Pattern | Spec | Status |
|---|---|---|
| #2 InfiniBand link flap | [02-ib-link-flap.md](02-ib-link-flap.md) | ☑ shipped (operator walkthrough: [pattern-2-ib-link-flap.md](pattern-2-ib-link-flap.md)) |
| #2 InfiniBand link flap | [02-ib-link-flap.md](02-ib-link-flap.md) | ☑ shipped (operator walkthrough: [02-ib-link-flap-walkthrough.md](02-ib-link-flap-walkthrough.md)) |
| #8 NCCL timeout, no hardware cause | [08-nccl-timeout-no-hw.md](08-nccl-timeout-no-hw.md) | ☐ planned |
| #9 NCCL bootstrap timeout | [09-nccl-bootstrap-timeout.md](09-nccl-bootstrap-timeout.md) | ☑ shipped |
| #10 CUDA OOM, deceptive allocator | [10-cuda-oom-deceptive.md](10-cuda-oom-deceptive.md) | ☐ planned ([#303](https://github.com/TraceCoreAI/tracecore/issues/303) filed) |
@@ -90,6 +90,25 @@ Patterns are emergent - operators in the field find them.
Contributions welcome via the standard PR flow; the format below
should be preserved for consistency.

## Filename convention

Every file in this directory uses a zero-padded numeric prefix
matching the NORTHSTARS Appendix A pattern number, so lexsort and
pattern-number ordering agree:

- **`NN-slug.md`** — engineering-facing design spec (the TDD
red-test input). One per pattern. Lands first when a detector is
spec'd before implementation; remains as the contract once the
detector ships.
- **`NN-slug-walkthrough.md`** — operator-facing runbook (Symptom →
Signal → Query → Alert → Escalation → Replay). Lands when the
detector ships and an operator needs to triage a real incident.
May coexist with the spec at `NN-slug.md` (they cross-reference).

Pattern #2 carries both today (`02-ib-link-flap.md` spec +
`02-ib-link-flap-walkthrough.md` runbook). New patterns follow the
same split when both audiences need a page.
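
The two filename shapes above can be told apart with a small classifier. This is a sketch under stated assumptions: slugs are taken to be lowercase alphanumerics separated by hyphens (inferred from the filenames in this directory, not stated by the convention), and `classify`, `Kind`, and `patternFileRE` are hypothetical names, not existing tracecore code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Kind distinguishes the two page types named by the convention.
type Kind string

const (
	KindSpec        Kind = "spec"
	KindWalkthrough Kind = "walkthrough"
)

// patternFileRE matches a two-digit prefix, a hyphen, and a slug.
// Assumption: slugs are lowercase alphanumerics joined by hyphens.
var patternFileRE = regexp.MustCompile(`^(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$`)

// classify splits a filename into pattern number, slug, and kind.
// ok is false when the name does not follow either shape.
func classify(name string) (pattern int, slug string, kind Kind, ok bool) {
	m := patternFileRE.FindStringSubmatch(name)
	if m == nil {
		return 0, "", "", false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, "", "", false
	}
	slug, kind = m[2], KindSpec
	if strings.HasSuffix(slug, "-walkthrough") {
		slug = strings.TrimSuffix(slug, "-walkthrough")
		kind = KindWalkthrough
	}
	return n, slug, kind, true
}

func main() {
	for _, name := range []string{
		"02-ib-link-flap.md",
		"02-ib-link-flap-walkthrough.md",
		"pattern-1-nvlink-degradation.md", // old naming, rejected
	} {
		n, slug, kind, ok := classify(name)
		fmt.Println(name, n, slug, kind, ok)
	}
}
```

A check along these lines could run in a docs lint step so that a file under the old `pattern-N-` naming fails fast instead of silently breaking lexical ordering.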

## Format

Each pattern walkthrough has the same sections:
2 changes: 1 addition & 1 deletion docs/rfcs/0014-metrics-to-logs-pattern-input.md
@@ -102,7 +102,7 @@ No operator-visible change at this RFC. PR-A's OTTL transform (already shipped)

- Issue [#260](https://github.com/tracecoreai/tracecore/issues/260) — recipe extension + metrics-side detector plumbing.
- [`docs/rfcs/0013-distro-first-pivot.md`](0013-distro-first-pivot.md) §3 (customer-stable contracts), §5 (upstream contribution policy), §6 (the four in-house moat scopes).
- [`docs/patterns/pattern-1-nvlink-degradation.md`](../patterns/pattern-1-nvlink-degradation.md) — the first metrics-sourced pattern unblocked by this decision.
- [`docs/patterns/01-nvlink-degradation-walkthrough.md`](../patterns/01-nvlink-degradation-walkthrough.md) — the first metrics-sourced pattern unblocked by this decision.
- [`docs/proposals/semconv-hw-gpu-extensions.md`](../proposals/semconv-hw-gpu-extensions.md) §3 — the `hw.gpu.nvlink.io` shape PR-A's OTTL transform emits.
- [transformprocessor README @ v0.130.0](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/v0.130.0/processor/transformprocessor/README.md#config) — the upstream signal-context table cited above.
- [opentelemetry-collector-contrib connector tree @ v0.130.0](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/v0.130.0/connector) — surveyed for any metrics→logs primitive.
4 changes: 2 additions & 2 deletions module/pkg/patterns/hbm_ecc.go
@@ -13,7 +13,7 @@ import (
// DefaultHBMECCWindow is the maximum gap between a DCGM uncorrected
// double-bit volatile ECC counter rise and an uncorrectable Xid
// kernel event on the same GPU for the detector to join them. 5min
// mirrors the alert window in docs/patterns/pattern-3-hbm-ecc.md
// mirrors the alert window in docs/patterns/03-hbm-ecc-walkthrough.md
// (PromQL `increase(...[5m]) > 0`) and the typical operator-facing
// "is this happening now" window. Operators on long scrape paths
// (30s+ scrape interval) raise this via
@@ -42,7 +42,7 @@ var uncorrectableHBMXidCodes = map[int]struct{}{
// double-bit volatile ECC counter delta the detector consumes. The
// patterndetectorprocessor builds these from log records derived
// (via OTTL transform on the metrics→logs path) from the customer-
// stable `hw.errors` Counter (per the docs/patterns/pattern-3-hbm-ecc.md
// stable `hw.errors` Counter (per the docs/patterns/03-hbm-ecc-walkthrough.md
// receiver-emitted signal). Detectors read HBMECCRecord values
// directly — no plog grep — so a schema rename surfaces as a
// compile error.
4 changes: 2 additions & 2 deletions module/pkg/patterns/pcie_aer.go
@@ -13,7 +13,7 @@ import (
// DefaultPCIeAERWindow is the maximum gap between a PCIe AER kernel
// message and a same-BDF hw.gpu.io rate-collapse sample for the
// detector to join them. 5min mirrors the rate window in
// docs/patterns/pattern-5-pcie-aer.md (PromQL `rate(...[5m])`) and
// docs/patterns/05-pcie-aer-walkthrough.md (PromQL `rate(...[5m])`) and
// the typical post-AER Tx/Rx renegotiation latency — the link
// re-trains to a lower Gen/width within tens of seconds, and the
// next scrape lands on the new rate within one scrape interval.
@@ -70,7 +70,7 @@ type PCIeAERRecord struct {
// rate sample. The patterndetectorprocessor builds these from log
// records derived (via OTTL transform on the metrics→logs path)
// from the customer-stable hw.gpu.io Counter (per the
// docs/patterns/pattern-5-pcie-aer.md receiver-emitted signal). The
// docs/patterns/05-pcie-aer-walkthrough.md receiver-emitted signal). The
// ADR-0001 PR-B metrics-path consumer pattern shipped for cuda_oom
// (#10) via issue #437; the PCIe AER metrics-path consumer is a
// pending sibling follow-up under RFC-0014. BaselineBytesPerSecond
2 changes: 1 addition & 1 deletion module/pkg/patterns/thermal_throttle.go
@@ -14,7 +14,7 @@ import (
// DefaultThermalThrottleWindow is the rolling time window over which
// per-GPU thermal throttle deltas are summed before applying the
// cascade predicate. 5min mirrors the spec's PromQL `[5m]` rate
// window in docs/patterns/pattern-4-thermal-throttle.md and the
// window in docs/patterns/04-thermal-throttle-walkthrough.md and the
// typical operator-facing "is this happening now" horizon. Operators
// on long scrape paths (30s+ scrape interval) raise this via
// ThermalThrottleDetector.Window.
2 changes: 1 addition & 1 deletion module/pkg/replay/thermal_throttle/canonical/manifest.json
@@ -1,6 +1,6 @@
{
"pattern_id": "4",
"fixture_name": "canonical_4gpu_cascade",
"description": "Four GPUs on gpu-node-0001 each accumulate >=30s of thermal throttle duration within the 5-min rolling window. Pattern #4 canonical positive case — the half-rack cascade described in docs/patterns/pattern-4-thermal-throttle.md escalation step 2. Headline must name thermal throttling and the node; remediation must reference airflow / HVAC / cooling / facilities.",
"description": "Four GPUs on gpu-node-0001 each accumulate >=30s of thermal throttle duration within the 5-min rolling window. Pattern #4 canonical positive case — the half-rack cascade described in docs/patterns/04-thermal-throttle-walkthrough.md escalation step 2. Headline must name thermal throttling and the node; remediation must reference airflow / HVAC / cooling / facilities.",
"expected_timing": "4 GPU records spaced 10s apart, each ThrottleDelta >= 30s; one verdict naming the node and 4 GPUIDs"
}