Merged
29 changes: 23 additions & 6 deletions docs/ATTRIBUTES.md
@@ -53,7 +53,7 @@ trail for downstream consumers that want it.

| Attribute | Type | Source | Stability | Description | Emitted by | Consumed by |
|---|---|---|---|---|---|---|
| `pattern.id` | string | tracecore-ext | stable | Canonical pattern identifier (`pod_evicted`, `xid_correlation`, `hbm_ecc`, `nccl_hang`, `thermal_throttle`, `pcie_aer`) | `patterndetectorprocessor` (`VerdictAttrPatternID`) | Dashboards, LogQL filters, runbooks |
| `pattern.id` | string | tracecore-ext | stable | Canonical pattern identifier (`pod_evicted`, `xid_correlation`, `hbm_ecc`, `nccl_hang`, `thermal_throttle`, `pcie_aer`, `ib_link_flap`, `cuda_oom`, `silent_data_corruption`) | `patterndetectorprocessor` (`VerdictAttrPatternID`) | Dashboards, LogQL filters, runbooks |
| `pattern.confidence` | string | tracecore-ext | stable | Verdict confidence (`high`, `partial`) | `patterndetectorprocessor` (`VerdictAttrConfidence`) | Dashboards |
| `pattern.headline` | string | tracecore-ext | stable | Operator-facing one-line summary | `patterndetectorprocessor` (`VerdictAttrHeadline`) | Dashboards, alerting |
| `pattern.remediation` | string | tracecore-ext | stable | Operator-actionable remediation prose | `patterndetectorprocessor` (`VerdictAttrRemediation`) | Dashboards |
@@ -112,6 +112,10 @@ hardware signal.
| `tracecore.alert.pcie_rate_collapse.direction` | string | tracecore-ext | alpha | `transmit` or `receive` — falls back to upstream `network.io.direction` if absent | OTTL metrics→logs recipe | `projectPCIeIORecord` |
| `tracecore.alert.pcie_rate_collapse.drop_ratio` | double | tracecore-ext | alpha | Promoted drop-ratio scalar on `pcie_aer` verdicts so dashboards render histograms without parsing JSON | `patterndetectorprocessor.appendPCIeAERVerdict` | Operator dashboards (verdict-stream tier) |
| `tracecore.alert.ib_link_flap.transition_count` | int | tracecore-ext | alpha | In-window ACTIVE→DOWN transition count promoted on `ib_link_flap` verdicts so dashboards distinguish "noisy 4 flaps" from "thrashing 40 flaps" | `patterndetectorprocessor.appendIBLinkFlapVerdict` | Operator dashboards |
| `tracecore.alert.silent_data_corruption.kind` | string | tracecore-ext | alpha | `vendor_signaled` (full confidence; same-job `hw.gpu.sdc.*` rose during job window) or `accuracy_only` (partial; eval drop >= 2x threshold, no vendor signal) | `patterndetectorprocessor.appendSilentDataCorruptionVerdict` | Operator dashboards (verdict-stream tier) |
| `tracecore.alert.silent_data_corruption.accuracy_drop` | double | tracecore-ext | alpha | `baseline - observed` accuracy drop in absolute units. Range [0, 1]. Promoted so dashboards bucket regression magnitudes without parsing JSON | `appendSilentDataCorruptionVerdict` | Operator dashboards |
| `tracecore.alert.silent_data_corruption.suspect_gpu_id` | string | tracecore-ext | alpha | PCI BDF of the GPU whose vendor SDC counter rose during the job window. Omitted (not empty-stamped) on `accuracy_only` verdicts to avoid empty-filter false-matches | `appendSilentDataCorruptionVerdict` | Operator dashboards (drain candidate) |
| `tracecore.alert.silent_data_corruption.suspect_node` | string | tracecore-ext | alpha | Kubernetes node name carrying the suspect GPU. Omitted on `accuracy_only` verdicts | `appendSilentDataCorruptionVerdict` | Operator dashboards |

---
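The `tracecore.alert.silent_data_corruption.*` rows above state that the suspect keys are omitted, not empty-stamped, on `accuracy_only` verdicts. A minimal Go sketch of that stamping rule follows. It uses a plain map in place of the collector's attribute API, and `sdcVerdict`, `sdcAttrPrefix`, and `stampSDCAttrs` are hypothetical names chosen for illustration; this is not the code in `appendSilentDataCorruptionVerdict`.

```go
package main

// sdcVerdict is a hypothetical, simplified stand-in for the detector's verdict.
type sdcVerdict struct {
	Kind         string  // "vendor_signaled" or "accuracy_only"
	AccuracyDrop float64 // baseline - observed, in [0, 1]
	SuspectGPUID string  // PCI BDF; only set on vendor_signaled verdicts
	SuspectNode  string  // Kubernetes node name; only set on vendor_signaled verdicts
}

// sdcAttrPrefix is the attribute namespace documented in the table.
const sdcAttrPrefix = "tracecore.alert.silent_data_corruption."

// stampSDCAttrs returns the promoted verdict scalars. Suspect keys are left
// out entirely (never stamped as "") unless the verdict is vendor_signaled,
// so a dashboard filter on an empty value cannot match by accident.
func stampSDCAttrs(v sdcVerdict) map[string]any {
	attrs := map[string]any{
		sdcAttrPrefix + "kind":          v.Kind,
		sdcAttrPrefix + "accuracy_drop": v.AccuracyDrop,
	}
	if v.Kind == "vendor_signaled" {
		if v.SuspectGPUID != "" {
			attrs[sdcAttrPrefix+"suspect_gpu_id"] = v.SuspectGPUID
		}
		if v.SuspectNode != "" {
			attrs[sdcAttrPrefix+"suspect_node"] = v.SuspectNode
		}
	}
	return attrs
}
```

Omitting the key keeps "attribute exists" queries meaningful: a present `suspect_gpu_id` always names a drain candidate.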

@@ -157,6 +161,8 @@ issues [#265](https://github.com/TraceCoreAI/tracecore/issues/265)
| `hw.gpu.throttle.cascade_size` | int | tracecore-ext | development | Promoted scalar on `thermal_throttle` verdicts (see `pattern.*` table) | `patterndetectorprocessor` | Dashboards |
| `hw.gpu.memory.free` | int | upstream-proposal ([#303](https://github.com/TraceCoreAI/tracecore/issues/303)) | development | Per-GPU framebuffer free bytes projected from `DCGM_FI_DEV_FB_FREE` onto the logs path | DCGM OTTL transform | Pattern #10 (CUDA OOM) |
| `hw.gpu.memory.total` | int | upstream-proposal ([#303](https://github.com/TraceCoreAI/tracecore/issues/303)) | development | Per-GPU framebuffer capacity (`DCGM_FI_DEV_FB_USED + FREE`) projected onto the logs path | DCGM OTTL transform | Pattern #10 |
| `hw.gpu.sdc.delta` | int | tracecore-ext | alpha | Per-GPU vendor SDC counter rise (NVIDIA SDC catcher / row-remap / AMD ECC at non-fatal threshold). Counter rise > 0 inside a job window gates pattern #13's full-confidence branch | OTTL metrics→logs recipe (TBD; blocked on RFC-0014 PR-B + vendor exporter wiring) | `projectSDCCounterRecord` (gates `silent_data_corruption`) |
| `hw.gpu.sdc.kind` | string | tracecore-ext | alpha | Which vendor SDC family rose — `remap_pending` / `remap_failure` / `catcher_count`. Carried onto the verdict's evidence trail for operator-side DCGM debug queries | OTTL metrics→logs recipe | `projectSDCCounterRecord` |

---

@@ -255,15 +261,25 @@ is no equivalent upstream contract.
## `gen_ai.training.*` — training-job join keys

Upstream OTel `gen_ai` namespace. We use it as the cross-receiver
join surface for distributed training workloads (rank, job id).
Today only `gen_ai.training.rank` is consumed; `gen_ai.training.job_id`
is contracted in RFC-0013 §3 but not yet wired through the pattern
library (future cross-receiver join surface).
join surface for distributed training workloads (rank, job id, eval
accuracy). `gen_ai.training.rank` is consumed by the NCCL
FlightRecorder projection; `gen_ai.training.job_id` is the same-job
join key for pattern #13 (silent data corruption); the
`eval_accuracy.*` + `checkpoint.*` keys gate the SDC detector's
discriminator + false-positive guards.

| Attribute | Type | Source | Stability | Description | Emitted by | Consumed by |
|---|---|---|---|---|---|---|
| `gen_ai.training.rank` | int | upstream-semconv (alpha) | development | Rank index — canonical per M19 | `rankjoinprocessor` (`module/processor/rankjoinprocessor`) | `projectNCCLFRRecord` (preferred over `nccl.rank` / `nccl.fr.rank`) |
| `gen_ai.training.job_id` | string | upstream-semconv (alpha) | alpha | Training-job id — contracted but not wired into pattern library yet | (future) | (future) |
| `gen_ai.training.job_id` | string | upstream-semconv (alpha) | alpha | Training-job id — same-job join key for pattern #13 (vendor SDC counter → eval accuracy). Verdict scalar on `silent_data_corruption` | eval-pipeline OTTL recipe + vendor SDC OTTL recipe (TBD) | `projectEvalAccuracyRecord`, `projectSDCCounterRecord` |
| `gen_ai.training.eval_accuracy` | double | upstream-proposal | development | Eval-pass accuracy in [0, 1]. Gates pattern #13 | eval-pipeline OTTL recipe (blocked on upstream framework instrumentation per `docs/patterns/13` §"Open questions" 5) | `projectEvalAccuracyRecord` |
| `gen_ai.training.eval_accuracy.baseline` | double | tracecore-ext | alpha | Operator-stamped reference accuracy the current eval is compared against. Zero or absent skips pattern #13 evaluation (no comparator) | eval-pipeline OTTL recipe / operator config | `projectEvalAccuracyRecord` |
| `gen_ai.training.eval_set.checksum` | string | tracecore-ext | alpha | Eval-set checksum. Compared against `baseline_checksum` to suppress dataset-drift false positives on pattern #13 | eval-pipeline OTTL recipe | `projectEvalAccuracyRecord` |
| `gen_ai.training.eval_set.baseline_checksum` | string | tracecore-ext | alpha | Eval-set checksum the baseline accuracy was measured against | eval-pipeline OTTL recipe / operator config | `projectEvalAccuracyRecord` |
| `gen_ai.training.checkpoint.step` | int | tracecore-ext | alpha | Training step the evaluated checkpoint was saved at. Compared against `baseline_step` to suppress cherry-picked-checkpoint false positives on pattern #13 | eval-pipeline OTTL recipe | `projectEvalAccuracyRecord` |
| `gen_ai.training.checkpoint.baseline_step` | int | tracecore-ext | alpha | Training step the baseline accuracy was measured at | eval-pipeline OTTL recipe / operator config | `projectEvalAccuracyRecord` |
| `gen_ai.training.job.start_unix_nano` | int | tracecore-ext | alpha | Job start wall-clock (unix nanos). Lower bound for the same-job SDC counter join window on pattern #13 | eval-pipeline OTTL recipe / `rankjoinprocessor` | `projectEvalAccuracyRecord` |
| `gen_ai.training.job.end_unix_nano` | int | tracecore-ext | alpha | Job end wall-clock (unix nanos). Upper bound for the SDC counter join window. Falls back to eval record's Timestamp when absent | eval-pipeline OTTL recipe / `rankjoinprocessor` | `projectEvalAccuracyRecord` |

---
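Read together, the `gen_ai.training.*` rows above describe a gate sequence for pattern #13: a zero or absent baseline skips evaluation, an eval-set checksum or checkpoint-step mismatch suppresses the verdict, and a vendor SDC counter rise inside the job window separates `vendor_signaled` from `accuracy_only`. The Go sketch below is one possible reading of that sequence. Everything not named in the tables is an assumption: the type and function names, the caller-supplied threshold, treating any step inequality as a mismatch, requiring the drop to reach the threshold on the vendor-signaled branch, and the expectation that samples are already filtered to the same `gen_ai.training.job_id`. The authoritative rule is §"Detector evaluation rule" in `docs/patterns/13-silent-data-corruption.md`.

```go
package main

// evalRecord is a hypothetical projection of the eval-accuracy log record.
type evalRecord struct {
	Accuracy         float64 // gen_ai.training.eval_accuracy
	Baseline         float64 // gen_ai.training.eval_accuracy.baseline; 0 means absent
	Checksum         string  // gen_ai.training.eval_set.checksum
	BaselineChecksum string  // gen_ai.training.eval_set.baseline_checksum
	Step             int64   // gen_ai.training.checkpoint.step
	BaselineStep     int64   // gen_ai.training.checkpoint.baseline_step
	JobStart         int64   // gen_ai.training.job.start_unix_nano
	JobEnd           int64   // gen_ai.training.job.end_unix_nano; 0 means absent
	Timestamp        int64   // eval record timestamp, fallback for JobEnd
}

// sdcSample is a hypothetical projection of one hw.gpu.sdc.delta record,
// assumed to be pre-filtered to the same job id as the eval record.
type sdcSample struct {
	Delta     int64
	Timestamp int64
}

// classifySDC returns "vendor_signaled", "accuracy_only", or "" (no verdict).
func classifySDC(e evalRecord, samples []sdcSample, threshold float64) string {
	if e.Baseline == 0 {
		return "" // no comparator
	}
	if e.Checksum != e.BaselineChecksum {
		return "" // dataset-drift guard
	}
	if e.Step != e.BaselineStep {
		return "" // cherry-picked-checkpoint guard (assumed: any inequality)
	}
	drop := e.Baseline - e.Accuracy
	if drop < threshold {
		return ""
	}
	end := e.JobEnd
	if end == 0 {
		end = e.Timestamp // documented fallback when the job end is absent
	}
	for _, s := range samples {
		if s.Delta > 0 && s.Timestamp >= e.JobStart && s.Timestamp <= end {
			return "vendor_signaled"
		}
	}
	if drop >= 2*threshold {
		return "accuracy_only" // partial confidence: no vendor signal
	}
	return ""
}
```

With a threshold of 0.02, a drop from 0.90 to 0.87 yields a verdict only when a counter rise lands inside the job window, while a drop to 0.80 yields `accuracy_only` without any vendor signal.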

@@ -303,6 +319,7 @@ need to fire?" without reading source.
| `thermal_throttle` | `hw.gpu.throttle.duration.delta` + `hw.gpu.throttle.reason` + `gpu.id` | `hw.gpu.index`, `k8s.node.name` | `pattern.*`, `hw.gpu.throttle.cascade_size` |
| `pcie_aer` *(wiring on a follow-up PR — projections present in library)* | `kernelevents.pcie_aer.severity` + `gpu.id` OR `tracecore.alert.pcie_rate_collapse.bytes_per_second` + `gpu.id` | `kernelevents.pcie_aer.type`, `network.io.direction`, `tracecore.alert.pcie_rate_collapse.{baseline_bytes_per_second,direction}` | `pattern.*` |
| `nccl_hang` | `nccl.fr.collective_seq_id` + (one of `gen_ai.training.rank` \| `nccl.rank` \| `nccl.fr.rank`) | `nccl.fr.{pg_id,state,profiling_name,time_discovered_started_ns}` | `pattern.*`, `nccl.fr.{pg_id,collective_seq_id,hanging_ranks_count}` |
| `silent_data_corruption` | (eval) `gen_ai.training.eval_accuracy` + `gen_ai.training.job_id`; (sdc) `hw.gpu.sdc.delta` + `gen_ai.training.job_id` | `gen_ai.training.eval_accuracy.baseline`, `gen_ai.training.eval_set.{checksum,baseline_checksum}`, `gen_ai.training.checkpoint.{step,baseline_step}`, `gen_ai.training.job.{start,end}_unix_nano`, `hw.gpu.sdc.kind`, `gpu.id`, `k8s.node.name` | `pattern.*`, `tracecore.alert.silent_data_corruption.{kind,accuracy_drop,suspect_gpu_id,suspect_node}`, `gen_ai.training.job_id` |
| `node_condition` (input to multiple patterns) | `k8s.node.name` + `k8s.node.condition.pressure` | `k8s.node.{uid,condition.message}` | (no direct verdict — feeds pod-eviction correlation) |

---
4 changes: 2 additions & 2 deletions docs/patterns/13-silent-data-corruption.md
@@ -1,8 +1,8 @@
# Pattern #13 — Silent data corruption (SDC)

**Status:** ☐ planned (no detector implementation yet) — frontier-layer pattern, hardest to detect
**Status:** ☑ shipped — `patterns.SilentDataCorruptionDetector` in `module/pkg/patterns/silent_data_corruption.go`, wired via `module/processor/patterndetectorprocessor/silent_data_corruption.go`. Verdict shape pinned by `module/pkg/patterns/testdata/silent_data_corruption_verdict.schema.json`. Spec preserved here as the engineering record; the detector implements the algorithm in §"Detector evaluation rule" with conservative-by-default thresholds.

Design spec for the pattern-#13 detector. SDC is the highest-difficulty pattern in the 15-set: the run completes, loss looks normal, but downstream eval shows degraded model quality. The detector must surface a *suspicion* with a clear evidence trail; certainty requires re-run.
Design spec for the pattern-#13 detector. SDC is the highest-difficulty pattern in the 15-set: the run completes, loss looks normal, but downstream eval shows degraded model quality. The detector surfaces a *suspicion* with a clear evidence trail; certainty requires re-run on different hardware.

## Symptom

2 changes: 1 addition & 1 deletion docs/patterns/README.md
@@ -59,7 +59,7 @@ Engineering-facing pattern-design specs for the 8 unspec'd v1 patterns. Each fol
| #10 CUDA OOM, deceptive allocator | [10-cuda-oom-deceptive.md](10-cuda-oom-deceptive.md) | ☐ planned ([#303](https://github.com/TraceCoreAI/tracecore/issues/303) filed) |
| #11 Checkpointer hang | [11-checkpointer-hang.md](11-checkpointer-hang.md) | ☐ planned |
| #12 Loss spike → NaN | [12-loss-spike-nan.md](12-loss-spike-nan.md) | ☐ planned |
| #13 Silent data corruption | [13-silent-data-corruption.md](13-silent-data-corruption.md) | ☐ planned |
| #13 Silent data corruption | [13-silent-data-corruption.md](13-silent-data-corruption.md) | ☑ shipped |

## Replay test fixture
