TraceCoreAI · trilamsr · Jun 1, 2026
diff --git a/docs/ATTRIBUTES.md b/docs/ATTRIBUTES.md
@@ -53,7 +53,7 @@ trail for downstream consumers that want it.
 
 | Attribute | Type | Source | Stability | Description | Emitted by | Consumed by |
 |---|---|---|---|---|---|---|
-| `pattern.id` | string | tracecore-ext | stable | Canonical pattern identifier (`pod_evicted`, `xid_correlation`, `hbm_ecc`, `nccl_hang`, `thermal_throttle`, `pcie_aer`) | `patterndetectorprocessor` (`VerdictAttrPatternID`) | Dashboards, LogQL filters, runbooks |
+| `pattern.id` | string | tracecore-ext | stable | Canonical pattern identifier (`pod_evicted`, `xid_correlation`, `hbm_ecc`, `nccl_hang`, `thermal_throttle`, `pcie_aer`, `ib_link_flap`, `cuda_oom`, `silent_data_corruption`) | `patterndetectorprocessor` (`VerdictAttrPatternID`) | Dashboards, LogQL filters, runbooks |
 | `pattern.confidence` | string | tracecore-ext | stable | Verdict confidence (`high`, `partial`) | `patterndetectorprocessor` (`VerdictAttrConfidence`) | Dashboards |
 | `pattern.headline` | string | tracecore-ext | stable | Operator-facing one-line summary | `patterndetectorprocessor` (`VerdictAttrHeadline`) | Dashboards, alerting |
 | `pattern.remediation` | string | tracecore-ext | stable | Operator-actionable remediation prose | `patterndetectorprocessor` (`VerdictAttrRemediation`) | Dashboards |
@@ -112,6 +112,10 @@ hardware signal.
 | `tracecore.alert.pcie_rate_collapse.direction` | string | tracecore-ext | alpha | `transmit` or `receive` — falls back to upstream `network.io.direction` if absent | OTTL metrics→logs recipe | `projectPCIeIORecord` |
 | `tracecore.alert.pcie_rate_collapse.drop_ratio` | double | tracecore-ext | alpha | Promoted drop-ratio scalar on `pcie_aer` verdicts so dashboards render histograms without parsing JSON | `patterndetectorprocessor.appendPCIeAERVerdict` | Operator dashboards (verdict-stream tier) |
 | `tracecore.alert.ib_link_flap.transition_count` | int | tracecore-ext | alpha | In-window ACTIVE→DOWN transition count promoted on `ib_link_flap` verdicts so dashboards distinguish "noisy 4 flaps" from "thrashing 40 flaps" | `patterndetectorprocessor.appendIBLinkFlapVerdict` | Operator dashboards |
+| `tracecore.alert.silent_data_corruption.kind` | string | tracecore-ext | alpha | `vendor_signaled` (`high` confidence; same-job `hw.gpu.sdc.*` rose during the job window) or `accuracy_only` (`partial` confidence; eval drop >= 2x threshold, no vendor signal) | `patterndetectorprocessor.appendSilentDataCorruptionVerdict` | Operator dashboards (verdict-stream tier) |
+| `tracecore.alert.silent_data_corruption.accuracy_drop` | double | tracecore-ext | alpha | `baseline - observed` accuracy drop in absolute units. Range [0, 1]. Promoted so dashboards bucket regression magnitudes without parsing JSON | `appendSilentDataCorruptionVerdict` | Operator dashboards |
+| `tracecore.alert.silent_data_corruption.suspect_gpu_id` | string | tracecore-ext | alpha | PCI BDF of the GPU whose vendor SDC counter rose during the job window. Omitted (not empty-stamped) on `accuracy_only` verdicts to avoid empty-filter false-matches | `appendSilentDataCorruptionVerdict` | Operator dashboards (drain candidate) |
+| `tracecore.alert.silent_data_corruption.suspect_node` | string | tracecore-ext | alpha | Kubernetes node name carrying the suspect GPU. Omitted on `accuracy_only` verdicts | `appendSilentDataCorruptionVerdict` | Operator dashboards |
 
 ---
 
@@ -157,6 +161,8 @@ issues [#265](https://github.com/TraceCoreAI/tracecore/issues/265)
 | `hw.gpu.throttle.cascade_size` | int | tracecore-ext | development | Promoted scalar on `thermal_throttle` verdicts (see `pattern.*` table) | `patterndetectorprocessor` | Dashboards |
 | `hw.gpu.memory.free` | int | upstream-proposal ([#303](https://github.com/TraceCoreAI/tracecore/issues/303)) | development | Per-GPU framebuffer free bytes projected from `DCGM_FI_DEV_FB_FREE` onto the logs path | DCGM OTTL transform | Pattern #10 (CUDA OOM) |
 | `hw.gpu.memory.total` | int | upstream-proposal ([#303](https://github.com/TraceCoreAI/tracecore/issues/303)) | development | Per-GPU framebuffer capacity (`DCGM_FI_DEV_FB_USED + FREE`) projected onto the logs path | DCGM OTTL transform | Pattern #10 |
+| `hw.gpu.sdc.delta` | int | tracecore-ext | alpha | Per-GPU vendor SDC counter rise (NVIDIA SDC catcher / row-remap / AMD ECC at non-fatal threshold). A counter rise > 0 inside a job window gates pattern #13's `vendor_signaled` (high-confidence) branch | OTTL metrics→logs recipe (TBD; blocked on RFC-0014 PR-B + vendor exporter wiring) | `projectSDCCounterRecord` (gates `silent_data_corruption`) |
+| `hw.gpu.sdc.kind` | string | tracecore-ext | alpha | Which vendor SDC family rose — `remap_pending` / `remap_failure` / `catcher_count`. Carried onto the verdict's evidence trail for operator-side DCGM debug queries | OTTL metrics→logs recipe | `projectSDCCounterRecord` |
 
 ---
 
@@ -255,15 +261,25 @@ is no equivalent upstream contract.
 ## `gen_ai.training.*` — training-job join keys
 
 Upstream OTel `gen_ai` namespace. We use it as the cross-receiver
-join surface for distributed training workloads (rank, job id).
-Today only `gen_ai.training.rank` is consumed; `gen_ai.training.job_id`
-is contracted in RFC-0013 §3 but not yet wired through the pattern
-library (future cross-receiver join surface).
+join surface for distributed training workloads (rank, job id, eval
+accuracy). `gen_ai.training.rank` is consumed by the NCCL
+FlightRecorder projection; `gen_ai.training.job_id` is the same-job
+join key for pattern #13 (silent data corruption); the
+`eval_accuracy*`, `eval_set.*`, and `checkpoint.*` keys feed the SDC
+detector's discriminator and false-positive guards.
 
 | Attribute | Type | Source | Stability | Description | Emitted by | Consumed by |
 |---|---|---|---|---|---|---|
 | `gen_ai.training.rank` | int | upstream-semconv (alpha) | development | Rank index — canonical per M19 | `rankjoinprocessor` (`module/processor/rankjoinprocessor`) | `projectNCCLFRRecord` (preferred over `nccl.rank` / `nccl.fr.rank`) |
-| `gen_ai.training.job_id` | string | upstream-semconv (alpha) | alpha | Training-job id — contracted but not wired into pattern library yet | (future) | (future) |
+| `gen_ai.training.job_id` | string | upstream-semconv (alpha) | alpha | Training-job id — same-job join key for pattern #13 (vendor SDC counter → eval accuracy). Verdict scalar on `silent_data_corruption` | eval-pipeline OTTL recipe + vendor SDC OTTL recipe (TBD) | `projectEvalAccuracyRecord`, `projectSDCCounterRecord` |
+| `gen_ai.training.eval_accuracy` | double | upstream-proposal | development | Eval-pass accuracy in [0, 1]. Gates pattern #13 | eval-pipeline OTTL recipe (blocked on upstream framework instrumentation per `docs/patterns/13` §"Open questions" 5) | `projectEvalAccuracyRecord` |
+| `gen_ai.training.eval_accuracy.baseline` | double | tracecore-ext | alpha | Operator-stamped reference accuracy the current eval is compared against. A zero or absent value skips pattern #13 evaluation (no comparator) | eval-pipeline OTTL recipe / operator config | `projectEvalAccuracyRecord` |
+| `gen_ai.training.eval_set.checksum` | string | tracecore-ext | alpha | Eval-set checksum. Compared against `baseline_checksum` to suppress dataset-drift false positives on pattern #13 | eval-pipeline OTTL recipe | `projectEvalAccuracyRecord` |
+| `gen_ai.training.eval_set.baseline_checksum` | string | tracecore-ext | alpha | Eval-set checksum the baseline accuracy was measured against | eval-pipeline OTTL recipe / operator config | `projectEvalAccuracyRecord` |
+| `gen_ai.training.checkpoint.step` | int | tracecore-ext | alpha | Training step the evaluated checkpoint was saved at. Compared against `baseline_step` to suppress cherry-picked-checkpoint false positives on pattern #13 | eval-pipeline OTTL recipe | `projectEvalAccuracyRecord` |
+| `gen_ai.training.checkpoint.baseline_step` | int | tracecore-ext | alpha | Training step the baseline accuracy was measured at | eval-pipeline OTTL recipe / operator config | `projectEvalAccuracyRecord` |
+| `gen_ai.training.job.start_unix_nano` | int | tracecore-ext | alpha | Job start wall-clock (unix nanos). Lower bound for the same-job SDC counter join window on pattern #13 | eval-pipeline OTTL recipe / `rankjoinprocessor` | `projectEvalAccuracyRecord` |
+| `gen_ai.training.job.end_unix_nano` | int | tracecore-ext | alpha | Job end wall-clock (unix nanos). Upper bound for the SDC counter join window. Falls back to eval record's Timestamp when absent | eval-pipeline OTTL recipe / `rankjoinprocessor` | `projectEvalAccuracyRecord` |
 
 ---
 
@@ -303,6 +319,7 @@ need to fire?" without reading source.
 | `thermal_throttle` | `hw.gpu.throttle.duration.delta` + `hw.gpu.throttle.reason` + `gpu.id` | `hw.gpu.index`, `k8s.node.name` | `pattern.*`, `hw.gpu.throttle.cascade_size` |
 | `pcie_aer` *(wiring on a follow-up PR — projections present in library)* | `kernelevents.pcie_aer.severity` + `gpu.id` OR `tracecore.alert.pcie_rate_collapse.bytes_per_second` + `gpu.id` | `kernelevents.pcie_aer.type`, `network.io.direction`, `tracecore.alert.pcie_rate_collapse.{baseline_bytes_per_second,direction}` | `pattern.*` |
 | `nccl_hang` | `nccl.fr.collective_seq_id` + (one of `gen_ai.training.rank` \| `nccl.rank` \| `nccl.fr.rank`) | `nccl.fr.{pg_id,state,profiling_name,time_discovered_started_ns}` | `pattern.*`, `nccl.fr.{pg_id,collective_seq_id,hanging_ranks_count}` |
+| `silent_data_corruption` | (eval) `gen_ai.training.eval_accuracy` + `gen_ai.training.job_id`; (sdc) `hw.gpu.sdc.delta` + `gen_ai.training.job_id` | `gen_ai.training.eval_accuracy.baseline`, `gen_ai.training.eval_set.{checksum,baseline_checksum}`, `gen_ai.training.checkpoint.{step,baseline_step}`, `gen_ai.training.job.{start,end}_unix_nano`, `hw.gpu.sdc.kind`, `gpu.id`, `k8s.node.name` | `pattern.*`, `tracecore.alert.silent_data_corruption.{kind,accuracy_drop,suspect_gpu_id,suspect_node}`, `gen_ai.training.job_id` |
 | `node_condition` (input to multiple patterns) | `k8s.node.name` + `k8s.node.condition.pressure` | `k8s.node.{uid,condition.message}` | (no direct verdict — feeds pod-eviction correlation) |
 
 ---
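Reviewer note: the attribute rows above imply an evaluation order for pattern #13 (comparator check, false-positive guards, same-job window join, then `vendor_signaled` vs `accuracy_only`). The Go sketch below is one reading of those rows, not the shipped `SilentDataCorruptionDetector`. The struct shapes, the `evaluate` function, and the `threshold` parameter are hypothetical. It also assumes that a checksum or step mismatch suppresses the verdict only when both values are present, and that the `vendor_signaled` branch still requires a drop of at least `threshold`; the diff does not state either rule.

```go
package main

import "fmt"

// EvalRecord is a hypothetical projection of the gen_ai.training.* keys.
// Zero values stand in for "attribute absent".
type EvalRecord struct {
	JobID                      string
	Accuracy, Baseline         float64
	Checksum, BaselineChecksum string
	Step, BaselineStep         int64
	JobStartUnixNano           int64
	JobEndUnixNano             int64
	TimestampUnixNano          int64
}

// SDCCounterRecord is a hypothetical projection of hw.gpu.sdc.* records.
type SDCCounterRecord struct {
	JobID             string
	Delta             int64
	GPUID, Node       string
	TimestampUnixNano int64
}

// Verdict carries the promoted scalars. Empty SuspectGPUID and SuspectNode
// mean the attribute is omitted from the verdict, not stamped as "".
type Verdict struct {
	Kind, Confidence          string
	AccuracyDrop              float64
	SuspectGPUID, SuspectNode string
}

// evaluate returns a verdict and true when the record looks suspicious.
func evaluate(ev EvalRecord, counters []SDCCounterRecord, threshold float64) (Verdict, bool) {
	// No comparator: a zero or absent baseline skips evaluation.
	if ev.Baseline <= 0 {
		return Verdict{}, false
	}
	// Dataset-drift guard (assumed: mismatch suppresses when both are set).
	if ev.Checksum != "" && ev.BaselineChecksum != "" && ev.Checksum != ev.BaselineChecksum {
		return Verdict{}, false
	}
	// Cherry-picked-checkpoint guard (assumed: equality comparison).
	if ev.Step != 0 && ev.BaselineStep != 0 && ev.Step != ev.BaselineStep {
		return Verdict{}, false
	}
	drop := ev.Baseline - ev.Accuracy
	if drop < threshold {
		return Verdict{}, false
	}
	// Join window: job start to job end, falling back to the eval timestamp.
	end := ev.JobEndUnixNano
	if end == 0 {
		end = ev.TimestampUnixNano
	}
	for _, c := range counters {
		inWindow := c.TimestampUnixNano >= ev.JobStartUnixNano && c.TimestampUnixNano <= end
		if c.JobID == ev.JobID && c.Delta > 0 && inWindow {
			return Verdict{Kind: "vendor_signaled", Confidence: "high",
				AccuracyDrop: drop, SuspectGPUID: c.GPUID, SuspectNode: c.Node}, true
		}
	}
	// No vendor signal: require a drop of at least twice the threshold.
	if drop >= 2*threshold {
		return Verdict{Kind: "accuracy_only", Confidence: "partial", AccuracyDrop: drop}, true
	}
	return Verdict{}, false
}

func main() {
	ev := EvalRecord{JobID: "job-a", Accuracy: 0.82, Baseline: 0.90,
		JobStartUnixNano: 100, TimestampUnixNano: 200}
	rise := []SDCCounterRecord{{JobID: "job-a", Delta: 3,
		GPUID: "0000:3b:00.0", Node: "node-1", TimestampUnixNano: 150}}
	v, ok := evaluate(ev, rise, 0.05)
	fmt.Printf("%v %+v\n", ok, v)
}
```

Leaving the suspect fields empty on `accuracy_only` mirrors the "omitted, not empty-stamped" rule in the table: the emitter would skip the attribute entirely rather than write an empty string.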

diff --git a/docs/patterns/13-silent-data-corruption.md b/docs/patterns/13-silent-data-corruption.md
@@ -1,8 +1,8 @@
 # Pattern #13 — Silent data corruption (SDC)
 
-**Status:** ☐ planned (no detector implementation yet) — frontier-layer pattern, hardest to detect
+**Status:** ☑ shipped — `patterns.SilentDataCorruptionDetector` in `module/pkg/patterns/silent_data_corruption.go`, wired via `module/processor/patterndetectorprocessor/silent_data_corruption.go`. Verdict shape pinned by `module/pkg/patterns/testdata/silent_data_corruption_verdict.schema.json`. Spec preserved here as the engineering record; the detector implements the algorithm in §"Detector evaluation rule" with conservative-by-default thresholds.
 
-Design spec for the pattern-#13 detector. SDC is the highest-difficulty pattern in the 15-set: the run completes, loss looks normal, but downstream eval shows degraded model quality. The detector must surface a *suspicion* with a clear evidence trail; certainty requires re-run.
+Design spec for the pattern-#13 detector. SDC is the highest-difficulty pattern in the 15-set: the run completes, loss looks normal, but downstream eval shows degraded model quality. The detector surfaces a *suspicion* with a clear evidence trail; certainty requires a re-run on different hardware.
 
 ## Symptom
 

diff --git a/docs/patterns/README.md b/docs/patterns/README.md
@@ -59,7 +59,7 @@ Engineering-facing pattern-design specs for the 8 unspec'd v1 patterns. Each fol
 | #10 CUDA OOM, deceptive allocator | [10-cuda-oom-deceptive.md](10-cuda-oom-deceptive.md) | ☐ planned ([#303](https://github.com/TraceCoreAI/tracecore/issues/303) filed) |
 | #11 Checkpointer hang | [11-checkpointer-hang.md](11-checkpointer-hang.md) | ☐ planned |
 | #12 Loss spike → NaN | [12-loss-spike-nan.md](12-loss-spike-nan.md) | ☐ planned |
-| #13 Silent data corruption | [13-silent-data-corruption.md](13-silent-data-corruption.md) | ☐ planned |
+| #13 Silent data corruption | [13-silent-data-corruption.md](13-silent-data-corruption.md) | ☑ shipped |
 
 ## Replay test fixture