21 changes: 14 additions & 7 deletions docs/ATTRIBUTES.md
@@ -128,6 +128,9 @@ hardware signal.
| `tracecore.alert.checkpointer_hang.stall_seconds` | int | tracecore-ext | alpha | Wall-clock stall duration promoted on `checkpointer_hang` verdicts so dashboards graph hang severity. | `patterndetectorprocessor.appendCheckpointerHangVerdict` | Operator dashboards |
| `tracecore.alert.checkpointer_hang.phase` | string | tracecore-ext | alpha | Checkpoint phase the stall caught (`plan` / `write` / `barrier`) promoted on `checkpointer_hang` verdicts so operators triage by phase. | `appendCheckpointerHangVerdict` | Operator dashboards |
| `tracecore.alert.checkpointer_hang.storage_backend` | string | tracecore-ext | alpha | Inferred storage backend on `checkpointer_hang` verdicts (`lustre` / `fsx` / `weka` / `nfs` / `s3` / `unknown`). | `appendCheckpointerHangVerdict` | Operator dashboards |
| `tracecore.alert.nccl_bootstrap_timeout.cohort_size` | int | tracecore-ext | alpha | Distinct-rank count the pattern-#9 detector observed pod-Ready signals for, promoted on `nccl_bootstrap` verdicts | `patterndetectorprocessor.appendNCCLBootstrapVerdict` | Operator dashboards |
| `tracecore.alert.nccl_bootstrap_timeout.failed_rank_count` | int | tracecore-ext | alpha | Number of cohort ranks with no NCCL FR record past `BootstrapDeadline` | `patterndetectorprocessor.appendNCCLBootstrapVerdict` | Operator dashboards |
| `tracecore.alert.nccl_bootstrap_timeout.discriminator` | string | tracecore-ext | alpha | Pattern-#9 discriminator branch (`cni_error` / `socket_ifname_mismatch` / `rendezvous_unreachable` / `unknown`) | `patterndetectorprocessor.appendNCCLBootstrapVerdict` | Operator dashboards |
| `tracecore.alert.training_step_stalled.no_progress_seconds` | int | tracecore-ext | alpha | Wall-clock no-progress seconds on `gen_ai.training.step_duration_seconds`; bridge attribute emitted by the OTTL stanza that gates pattern #7 (dataloader hang) | OTTL recipe (sibling to RFC-0014 metrics→logs path) | `projectTrainingStepStallRecord` (gates dataloader_hang pattern) |
| `tracecore.alert.training_step_stalled.last_step_ns` | int | tracecore-ext | alpha | Wall-clock nanosecond stamp of the last observed step-progress sample; carried for operator triage on the dataloader_hang bridge record | OTTL recipe | `projectTrainingStepStallRecord` |
| `tracecore.alert.dataloader_hang.stall_seconds` | int | tracecore-ext | alpha | Promoted wall-clock stall duration on `dataloader_hang` verdicts so dashboards render a histogram without parsing JSON | `patterndetectorprocessor.appendDataLoaderHangVerdict` | Operator dashboards (verdict-stream tier) |
@@ -235,6 +238,7 @@ extracted by the `k8sobjects-events` recipe's OTTL transform.
| `k8s.regarding.namespace` | string | k8s-semconv | stable | Regarding-object namespace | k8sobjectsreceiver | `projectObjectRef` |
| `k8s.regarding.uid` | string | k8s-semconv | stable | Regarding-object uid | k8sobjectsreceiver | `projectObjectRef` |
| `k8s.pod_evicted_at` | string (RFC3339Nano) | tracecore-ext | stable | Eviction timestamp stamped onto rank records by the rank-join processor when an eviction window aligns with the rank's NCCL FlightRecorder activity | `rankjoinprocessor` (`AttrPodEvictedAt`) | Cross-signal joins, dashboards |
| `k8s.pod.ready_time` | int (unix-nano) \| string (RFC3339Nano) | tracecore-ext (k8s.* extension) | alpha | Pod `Ready` condition LastTransitionTime, stamped per training-pod log record by the k8sobjectsreceiver+OTTL recipe so pattern #9 (nccl_bootstrap) can measure BootstrapDeadline from pod-ready (not job-creation, per spec edge case "cold cache prefetch"). int-nanos preferred for OTel wire-format precision; RFC3339Nano string accepted as a fallback for recipe authors who stamp human-readable. | k8sobjectsreceiver+OTTL recipe (sibling to the nccl_bootstrap detector PR) | `projectTrainingPodRecord` |
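
Because `k8s.pod.ready_time` may arrive as either int unix-nanos or an RFC3339Nano string, a consumer has to normalize both encodings before measuring the deadline. The following is a minimal Python sketch of that normalization, not the repository's implementation: `ready_time_to_unix_nano` is a hypothetical helper, and it assumes the string form always carries a `Z` or a numeric UTC offset.

```python
import calendar
import re
from datetime import datetime

# Hypothetical pattern: date-time, optional 1-9 fractional digits, then Z or an offset.
_RFC3339_NANO = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d{1,9}))?(Z|[+-]\d{2}:\d{2})"
)


def ready_time_to_unix_nano(value):
    """Normalize a k8s.pod.ready_time value to integer unix nanoseconds."""
    if isinstance(value, bool):
        raise TypeError("ready_time must be an int or an RFC3339Nano string")
    if isinstance(value, int):
        return value  # preferred wire form: already unix-nanos
    match = _RFC3339_NANO.fullmatch(value)
    if match is None:
        raise ValueError(f"not an RFC3339Nano timestamp: {value!r}")
    base, fraction, zone = match.groups()
    offset = "+00:00" if zone == "Z" else zone
    # Parse at whole-second precision; the fraction is handled as an integer
    # so no nanosecond digits are lost to float rounding.
    seconds = calendar.timegm(datetime.fromisoformat(base + offset).utctimetuple())
    nanos = int((fraction or "").ljust(9, "0"))
    return seconds * 1_000_000_000 + nanos
```

Keeping the fractional part as an integer is the design choice that matters here: a float-based conversion cannot represent every nanosecond value at current epoch magnitudes.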

---

@@ -298,16 +302,18 @@ is no equivalent upstream contract.

Upstream OTel `gen_ai` namespace. We use it as the cross-receiver
join surface for distributed training workloads (rank, job id, eval
accuracy). `gen_ai.training.rank` is consumed by the NCCL
FlightRecorder projection; `gen_ai.training.job_id` is the same-job
join key for pattern #13 (silent data corruption); the
`eval_accuracy.*` + `checkpoint.*` keys gate the SDC detector's
discriminator + false-positive guards.
accuracy). `gen_ai.training.rank` is the primary rank key consumed
across the pattern library (NCCL FlightRecorder projection +
pattern-#9 pod-Ready signal). `gen_ai.training.job_id` is the cohort-
grouping key for pattern #9 (nccl_bootstrap) AND the same-job join
key for pattern #13 (silent data corruption). The `eval_accuracy.*`
+ `checkpoint.*` keys gate the SDC detector's discriminator + false-
positive guards.
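
As an illustration of `gen_ai.training.job_id` acting as the cohort-grouping key, here is a minimal Python sketch. It assumes pod records are plain dicts keyed by attribute name and that an unstamped job id degrades to namespace-only grouping, as this change describes for pattern #9. `group_cohorts` is a hypothetical name, not a function in the repository, and the handling of namespaces where only some pods are stamped is a guess.

```python
from collections import defaultdict


def group_cohorts(pod_records):
    """Group pod records into cohorts keyed by (job_id, namespace).

    A record without gen_ai.training.job_id gets the empty string as its
    job id, so all unstamped pods in a namespace share one cohort
    (the namespace-only fallback).
    """
    cohorts = defaultdict(list)
    for record in pod_records:
        job_id = record.get("gen_ai.training.job_id", "")
        namespace = record["k8s.namespace.name"]
        cohorts[(job_id, namespace)].append(record)
    return dict(cohorts)
```

For example, two stamped pods of job `job-a` and one unstamped pod in the same namespace would land in two cohorts under this sketch: `("job-a", ns)` and `("", ns)`.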

| Attribute | Type | Source | Stability | Description | Emitted by | Consumed by |
|---|---|---|---|---|---|---|
| `gen_ai.training.rank` | int | upstream-semconv (alpha) | development | Rank index — canonical per M19 | `rankjoinprocessor` (`module/processor/rankjoinprocessor`) | `projectNCCLFRRecord` (preferred over `nccl.rank` / `nccl.fr.rank`) |
| `gen_ai.training.job_id` | string | upstream-semconv (alpha) | alpha | Training-job id — same-job join key for pattern #13 (vendor SDC counter → eval accuracy). Verdict scalar on `silent_data_corruption` | eval-pipeline OTTL recipe + vendor SDC OTTL recipe (TBD) | `projectEvalAccuracyRecord`, `projectSDCCounterRecord` |
| `gen_ai.training.rank` | int | upstream-semconv (alpha) | development | Rank index — canonical per M19 | `rankjoinprocessor` (`module/processor/rankjoinprocessor`) | `projectNCCLFRRecord` (preferred over `nccl.rank` / `nccl.fr.rank`); `projectTrainingPodRecord` |
| `gen_ai.training.job_id` | string | upstream-semconv (alpha) | alpha | Training-job id — cohort key for pattern #9 (nccl_bootstrap; namespace-only fallback when unstamped) AND same-job join key for pattern #13 (vendor SDC counter → eval accuracy). Verdict scalar on `silent_data_corruption` | k8sobjectsreceiver + eval-pipeline OTTL recipe + vendor SDC OTTL recipe (TBD) | `projectTrainingPodRecord`; `appendNCCLBootstrapVerdict`; `projectEvalAccuracyRecord`, `projectSDCCounterRecord` |
| `gen_ai.training.step` | int | upstream-semconv (alpha) | alpha | Most recent observed training step; carried on dataloader_hang bridge records so the warmup guard (step < 2) can skip first-load prefetch | OTTL recipe on `gen_ai.training.step_duration_seconds` | `projectTrainingStepStallRecord` (warmup guard) |
| `gen_ai.training.step_duration_seconds` | int | tracecore-ext (proposed upstream) | alpha | Wall-clock training-step duration in seconds. Stamped by the OTTL step-progress bridge when step duration exceeded StallThreshold. Gates the `checkpointer_hang` (pattern #11) and future `dataloader_hang` (pattern #7) input projections. | OTTL step-progress bridge (TBD) | `projectTrainingStepStallRecord` (gates `checkpointer_hang` pattern) |
| `gen_ai.training.phase` | string | tracecore-ext (upstream proposal target) | alpha | Training-loop phase (`train` / `eval`); load-bearing for the dataloader_hang eval-phase guard (eval pauses produce no step-progress samples but are not a hang) | OTTL recipe | `projectTrainingStepStallRecord` (eval-phase guard) |
@@ -358,6 +364,7 @@ need to fire?" without reading source.
| `thermal_throttle` | `hw.gpu.throttle.duration.delta` + `hw.gpu.throttle.reason` + `gpu.id` | `hw.gpu.index`, `k8s.node.name` | `pattern.*`, `hw.gpu.throttle.cascade_size` |
| `pcie_aer` *(wiring on a follow-up PR — projections present in library)* | `kernelevents.pcie_aer.severity` + `gpu.id` OR `tracecore.alert.pcie_rate_collapse.bytes_per_second` + `gpu.id` | `kernelevents.pcie_aer.type`, `network.io.direction`, `tracecore.alert.pcie_rate_collapse.{baseline_bytes_per_second,direction}` | `pattern.*` |
| `nccl_hang` | `nccl.fr.collective_seq_id` + (one of `gen_ai.training.rank` \| `nccl.rank` \| `nccl.fr.rank`) | `nccl.fr.{pg_id,state,profiling_name,time_discovered_started_ns}` | `pattern.*`, `nccl.fr.{pg_id,collective_seq_id,hanging_ranks_count}` |
| `nccl_bootstrap` | (pod input) `k8s.pod.ready_time` + `gen_ai.training.rank` + `k8s.namespace.name`; (event input) `k8s.event.reason` ∈ `{FailedCreatePodSandBox, NetworkNotReady, CNIError}` + namespace | `gen_ai.training.job_id` (cohort grouping), `k8s.pod.name`, `k8s.node.name`, `k8s.event.{uid,note}`, `k8s.regarding.name` | `pattern.*`, `k8s.namespace.name`, `gen_ai.training.job_id`, `tracecore.alert.nccl_bootstrap_timeout.{cohort_size,failed_rank_count,discriminator}` |
| `dataloader_hang` | `tracecore.alert.training_step_stalled.no_progress_seconds` + `k8s.pod.name` (stall) AND ONE OF: `dataloader.error_class` + `k8s.pod.name` (worker-killed) OR `k8s.event.reason` in {`FailedMount`,`VolumeMountFailure`,`FailedAttachVolume`,`VolumeFailedMount`} + `k8s.node.name` (storage-event) | `tracecore.alert.training_step_stalled.last_step_ns`, `gen_ai.training.{step,phase}`, `dataloader.worker_pid`, `k8s.event.note` | `pattern.*`, `k8s.pod.{name,namespace}`, `k8s.node.name`, `tracecore.alert.dataloader_hang.{discriminator,stall_seconds}`, `dataloader.{worker_pid,error_class}` OR `k8s.event.reason` |
| `silent_data_corruption` | (eval) `gen_ai.training.eval_accuracy` + `gen_ai.training.job_id`; (sdc) `hw.gpu.sdc.delta` + `gen_ai.training.job_id` | `gen_ai.training.eval_accuracy.baseline`, `gen_ai.training.eval_set.{checksum,baseline_checksum}`, `gen_ai.training.checkpoint.{step,baseline_step}`, `gen_ai.training.job.{start,end}_unix_nano`, `hw.gpu.sdc.kind`, `gpu.id`, `k8s.node.name` | `pattern.*`, `tracecore.alert.silent_data_corruption.{kind,accuracy_drop,suspect_gpu_id,suspect_node}`, `gen_ai.training.job_id` |
| `node_condition` (input to multiple patterns) | `k8s.node.name` + `k8s.node.condition.pressure` | `k8s.node.{uid,condition.message}` | (no direct verdict — feeds pod-eviction correlation) |
20 changes: 16 additions & 4 deletions docs/patterns/09-nccl-bootstrap-timeout.md
@@ -1,8 +1,8 @@
# Pattern #9 — NCCL bootstrap timeout

**Status:** ☐ planned (no detector implementation yet)
**Status:** ☑ shipped — detector at [`module/pkg/patterns/nccl_bootstrap.go`](../../module/pkg/patterns/nccl_bootstrap.go); wiring at [`module/processor/patterndetectorprocessor/nccl_bootstrap.go`](../../module/processor/patterndetectorprocessor/nccl_bootstrap.go). Verdict schema [`module/pkg/patterns/testdata/nccl_bootstrap_verdict.schema.json`](../../module/pkg/patterns/testdata/nccl_bootstrap_verdict.schema.json).

Design spec for the pattern-#9 detector. Sibling to pattern #8 — fires at job start, not mid-run.
Design spec retained below for the operator-facing walkthrough. Sibling to pattern #8 — fires at job start, not mid-run.

## Symptom

@@ -26,11 +26,11 @@ A new training job starts; NCCL never completes `ncclCommInitRank`. Logs show `N

```
for each TrainingJobCohort cohort (grouped by gen_ai.training.job_id):
pod_ready_time = max(cohort.pods.ReadyTimestamp)
pod_ready_time = min(cohort.pods.ReadyTimestamp) # see Impl note 5
age_since_ready = Now - pod_ready_time
if age_since_ready < BootstrapDeadline: skip
bootstrap_failed_ranks = ranks where
no NCCLFRRecord exists for rank
no NCCLFRRecord exists for rank # ABSENCE signal
OR the only NCCL log lines are bootstrap-phase, no collective records
network_event_present = exists k8sobjects event in cohort with
reason in {FailedCreatePodSandBox, NetworkNotReady, CNIError}
@@ -41,6 +41,8 @@ for each TrainingJobCohort cohort (grouped by gen_ai.training.job_id):
emit NCCLBootstrapTimeoutVerdict (confidence: partial)
```
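
The visible part of the pseudocode can be sketched in Python as follows. This is an illustration under stated assumptions, not the shipped Go detector: `PodReady` and `evaluate_cohort` are hypothetical names, only the "no FR record" absence branch is modeled (not the bootstrap-phase-only log branch), and the confidence decision is left to the caller because the diff elides those lines.

```python
from dataclasses import dataclass

# Event reasons listed in the pseudocode above.
NETWORK_EVENT_REASONS = frozenset(
    {"FailedCreatePodSandBox", "NetworkNotReady", "CNIError"}
)


@dataclass(frozen=True)
class PodReady:
    """Hypothetical pod-Ready observation for one rank."""
    rank: int
    ready_at_ns: int


def evaluate_cohort(pods, ranks_with_fr_records, event_reasons, now_ns, deadline_ns):
    """Return None while the deadline gate is closed.

    Otherwise return (failed_ranks, network_event_present), where
    failed_ranks are cohort ranks with no NCCL FR record at all.
    """
    if not pods:
        return None
    # Deadline gate anchored on the FIRST rank to become Ready.
    first_ready_ns = min(pod.ready_at_ns for pod in pods)
    if now_ns - first_ready_ns < deadline_ns:
        return None
    cohort_ranks = {pod.rank for pod in pods}
    failed_ranks = sorted(cohort_ranks - set(ranks_with_fr_records))
    network_event_present = any(
        reason in NETWORK_EVENT_REASONS for reason in event_reasons
    )
    return failed_ranks, network_event_present
```

A caller would map the returned pair onto the verdict's confidence and discriminator fields; that mapping is deliberately left out here.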

**Pattern #8 (NCCL hang) vs Pattern #9 (NCCL bootstrap) — disjoint triggers.** Pattern #8 fires on the **PRESENCE** of non-completed FR records (a collective started but never completed mid-run); pattern #9 fires on the **ABSENCE** of any FR record for a rank past the bootstrap deadline. Both patterns can fire concurrently on the same cohort when a heterogeneous bootstrap leaves some ranks past `collective_seq_id == 0` and others not — by design, this is the operator-visible signal that a hetero-bootstrap is in progress, not a duplication bug.
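
To make the presence/absence split concrete, here is a small Python sketch. It assumes FR records reduce to `(rank, completed)` pairs and it omits both detectors' timing gates; `split_triggers` is a hypothetical helper, not part of either detector.

```python
def split_triggers(cohort_ranks, fr_records):
    """Partition ranks by which pattern's trigger they satisfy.

    fr_records is an iterable of (rank, completed) pairs.
    Returns (hang_ranks, bootstrap_ranks):
      hang_ranks      - ranks with at least one non-completed FR record (pattern #8)
      bootstrap_ranks - cohort ranks with no FR record at all (pattern #9)
    """
    records = list(fr_records)
    ranks_with_records = {rank for rank, _ in records}
    hang_ranks = sorted({rank for rank, completed in records if not completed})
    bootstrap_ranks = sorted(set(cohort_ranks) - ranks_with_records)
    return hang_ranks, bootstrap_ranks
```

A single rank can never appear in both lists, since one requires a record to exist and the other requires none. Both lists can still be non-empty for the same cohort, which is the concurrent-firing case described above.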

## Verdict attributes

| Key | Type | Description |
@@ -73,3 +75,13 @@ Replay fixture under `module/pkg/replay/nccl_bootstrap_timeout/` (planned). Seed
2. **`BootstrapDeadline` default.** 5 min is reasonable for warm-cache clusters, too tight for cold-cache. Per-job override via annotation?
3. **CNI signal vocabulary.** Each CNI (Cilium, Calico, multus, ENI, GKE-native) emits different error strings on the same fault. Single OTTL recipe stanza or per-CNI? Pre-impl decision.
4. **Cohort-size discovery.** Without a `gen_ai.training.world_size` attribute, the detector cannot tell "3 of 8 failed" from "3 of 3 succeeded, 5 pods never started." Need a canonical world-size signal.

## Implementation notes (shipped detector)

The v0 detector resolves the open questions above with the most-conservative interpretation:

1. **Cohort grouping.** Pods are grouped by `(gen_ai.training.job_id, k8s.namespace.name)` when the alpha `gen_ai.training.job_id` resource attribute is stamped; otherwise the detector falls back to namespace-only grouping and emits the verdict with an empty `gen_ai.training.job_id`. The empty job-id is the operator-visible signal that the fallback path fired.
2. **`BootstrapDeadline` default.** Ships at 5 min (`DefaultBootstrapDeadline`); operators raise it via `nccl_bootstrap_deadline` for cold-cache clusters.
3. **CNI signal vocabulary.** v0 ships the K8s-event-level vocabulary only (`FailedCreatePodSandBox` / `NetworkNotReady` / `CNIError` — the K8s control-plane-visible reasons that every CNI emits via the kubelet sandbox-setup path). Per-CNI raw-error parsing (Cilium / Calico / multus / ENI / GKE-native distinct strings) is a follow-up that lights up the `socket_ifname_mismatch` and `rendezvous_unreachable` discriminator branches.
4. **Cohort-size discovery.** v0 cohort size is the count of distinct ranks the detector observed pod-Ready signals for. Pods that never reached Ready (image-pull stuck) don't enter the cohort — they belong to pattern #15 (pod-evicted / scheduled-but-not-Ready). The verdict's `tracecore.alert.nccl_bootstrap_timeout.cohort_size` is the post-Ready count; the operator can compare it to the underlying `gen_ai.training.world_size` (when stamped) to detect a sub-quorum "pods never started" subcase.
5. **Deadline gate uses `min(ReadyAt)`; `max(ReadyAt)` anchors evidence.** The pseudocode above says `pod_ready_time = min(cohort.pods.ReadyTimestamp)`, a deliberate departure from an earlier draft that used `max(ReadyAt)`. The original `max()` phrasing was meant to guard against slow-image-pull false positives (a cold-cache cohort whose pods become Ready on a rolling basis). The problem with `max()`: a late-joining rank pushes the deadline forward and silently suppresses verdicts for genuinely stuck early ranks. Concretely, take rank-0 ReadyAt = T−10min, rank-1 ReadyAt = T+2min, and a 5min deadline. With `max(ReadyAt) = T+2min`, `age = Now − (T+2min)`, so the gate stays closed until `Now = T+7min`. At `Now = T+5min` the age is only 3min, even though rank-0 has been Ready for 15min and is genuinely stuck, so it is not flagged. The detector therefore uses `min(ReadyAt)` for the deadline gate, measuring the bootstrap window from the first rank to become Ready, which is the rank that has been waiting on bootstrap the longest. The slow-image-pull guard is handled naturally because pods that haven't reached Ready don't enter the cohort at all (edge case "Slow image pull"). `max(ReadyAt)` is retained as the cohort's last-known-good Ready signal for evidence-trail timestamp anchoring: the operator-visible "most recent Ready event on this cohort" surface.
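
The arithmetic in note 5 can be checked with a few lines of Python. This is a worked example only: `gate_is_open` is a hypothetical helper, T is taken as 0, and the gate is assumed to open when `age >= BootstrapDeadline`, matching the pseudocode's `if age_since_ready < BootstrapDeadline: skip`.

```python
MINUTE_NS = 60 * 1_000_000_000


def gate_is_open(ready_at_ns, now_ns, deadline_ns, anchor=min):
    """True once the cohort's age, measured from anchor(ReadyAt), reaches the deadline."""
    return now_ns - anchor(ready_at_ns) >= deadline_ns


T = 0
ready_at_ns = [T - 10 * MINUTE_NS, T + 2 * MINUTE_NS]  # rank-0, rank-1
deadline_ns = 5 * MINUTE_NS
now_ns = T + 5 * MINUTE_NS  # rank-0 has been Ready for 15 minutes

# max() anchor: age is 3 minutes, so the gate is still closed.
print(gate_is_open(ready_at_ns, now_ns, deadline_ns, anchor=max))  # False
# min() anchor: age is 15 minutes, so the gate is open.
print(gate_is_open(ready_at_ns, now_ns, deadline_ns, anchor=min))  # True
```

Under the `max()` anchor the gate first opens at T+7min, and every further late-joining rank would push that point later still.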
2 changes: 1 addition & 1 deletion docs/patterns/README.md
@@ -55,7 +55,7 @@ Engineering-facing pattern-design specs for the 8 unspec'd v1 patterns. Each fol
|---|---|---|
| #2 InfiniBand link flap | [02-ib-link-flap.md](02-ib-link-flap.md) | ☐ planned |
| #8 NCCL timeout, no hardware cause | [08-nccl-timeout-no-hw.md](08-nccl-timeout-no-hw.md) | ☐ planned |
| #9 NCCL bootstrap timeout | [09-nccl-bootstrap-timeout.md](09-nccl-bootstrap-timeout.md) | ☐ planned |
| #9 NCCL bootstrap timeout | [09-nccl-bootstrap-timeout.md](09-nccl-bootstrap-timeout.md) | ☑ shipped |
| #10 CUDA OOM, deceptive allocator | [10-cuda-oom-deceptive.md](10-cuda-oom-deceptive.md) | ☐ planned ([#303](https://github.com/TraceCoreAI/tracecore/issues/303) filed) |
| #11 Checkpointer hang | [11-checkpointer-hang.md](11-checkpointer-hang.md) | ☑ shipped (detector + processor wiring) |
| #12 Loss spike → NaN | [12-loss-spike-nan.md](12-loss-spike-nan.md) | ☐ planned |