Merged
2 changes: 1 addition & 1 deletion docs/ATTRIBUTES.md
@@ -136,7 +136,7 @@ hardware signal.
| `tracecore.alert.dataloader_hang.stall_seconds` | int | tracecore-ext | alpha | Promoted wall-clock stall duration on `dataloader_hang` verdicts so dashboards render a histogram without parsing JSON | `patterndetectorprocessor.appendDataLoaderHangVerdict` | Operator dashboards (verdict-stream tier) |
| `tracecore.alert.dataloader_hang.discriminator` | string | tracecore-ext | alpha | Promoted discriminator branch on `dataloader_hang` verdicts (`worker_killed` \| `storage_event`) so dashboards fan out by root cause | `appendDataLoaderHangVerdict` | Operator dashboards |
| `tracecore.alert.silent_data_corruption.kind` | string | tracecore-ext | alpha | `vendor_signaled` (full confidence; same-job `hw.gpu.sdc.*` rose during job window) or `accuracy_only` (partial; eval drop >= 2x threshold, no vendor signal) | `patterndetectorprocessor.appendSilentDataCorruptionVerdict` | Operator dashboards (verdict-stream tier) |
-| `tracecore.alert.silent_data_corruption.accuracy_drop` | double | tracecore-ext | alpha | `baseline - observed` accuracy drop in absolute units. Range [0, 1]. Promoted so dashboards bucket regression magnitudes without parsing JSON | `appendSilentDataCorruptionVerdict` | Operator dashboards |
+| `tracecore.alert.silent_data_corruption.accuracy_drop` | double | tracecore-ext | alpha | `baseline - observed` accuracy drop in absolute units. Range [0, 1]. Promoted so dashboards bucket regression magnitudes without parsing JSON. The emission threshold is operator-tunable via `sdc_accuracy_drop_threshold` (default `0.005`/0.5pp, a heuristic; see `docs/patterns/13-silent-data-corruption.md` §"Threshold tuning") | `appendSilentDataCorruptionVerdict` | Operator dashboards |
| `tracecore.alert.silent_data_corruption.suspect_gpu_id` | string | tracecore-ext | alpha | PCI BDF of the GPU whose vendor SDC counter rose during the job window. Omitted (not empty-stamped) on `accuracy_only` verdicts to avoid empty-filter false-matches | `appendSilentDataCorruptionVerdict` | Operator dashboards (drain candidate) |
| `tracecore.alert.silent_data_corruption.suspect_node` | string | tracecore-ext | alpha | Kubernetes node name carrying the suspect GPU. Omitted on `accuracy_only` verdicts | `appendSilentDataCorruptionVerdict` | Operator dashboards |

30 changes: 30 additions & 0 deletions docs/patterns/13-silent-data-corruption.md
@@ -70,6 +70,36 @@ See [correlation-window semantics comparison](README.md#correlation-window-seman
- **Dataset drift.** Eval-set changes (new sharding, new tokenizer) cause apparent accuracy drops. Detector requires `eval_set.checksum` to match the baseline's; otherwise suppress.
- **Cherry-picked checkpoint.** If the operator evaluates an intermediate checkpoint vs. a final-checkpoint baseline, the drop is expected. Verdict suppressed unless `gen_ai.training.checkpoint.step` matches baseline's.
- **Hardware-attribution false positive.** A vendor SDC counter rising doesn't *prove* THIS job consumed corrupted data — it proves the GPU saw an SDC event. Verdict says "suspect," not "confirmed."
- **Sub-threshold SDC.** Real-world SDC can land below the 0.5pp default — single-bit weight flips in small models and transient single-batch corruption in large training jobs both produce sub-threshold regressions. The default is a precision floor, not a physical lower bound; see "Threshold tuning" below for when to lower it.

## Threshold tuning

`AccuracyDropThreshold` (operator-facing knob `sdc_accuracy_drop_threshold`; default `0.005`, 0.5pp) is a precision-vs-recall knob, not a physical SDC floor. The detector skips an eval cycle when `accuracy_drop < threshold`, so the threshold MUST exceed the recipe's known run-to-run variance band — but otherwise the lowest value that stays above it is preferred (more recall, same precision). Validation rejects values outside `[0, 1]`. The companion knob `sdc_accuracy_only_multiplier` (default `2.0`) gates the partial-confidence (`kind=accuracy_only`) branch — that branch requires `accuracy_drop >= threshold × multiplier`.
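
The gating described above can be sketched as follows. This is a minimal illustration, not the detector's actual code: the function names `validate_threshold` and `classify_sdc` and the boolean `vendor_signal` input are hypothetical, and the assumption that a drop between `threshold` and `threshold × multiplier` with no vendor signal yields no verdict is inferred from the text rather than confirmed by it.

```python
from typing import Optional


def validate_threshold(threshold: float) -> None:
    """Reject thresholds outside [0, 1], mirroring the documented validation."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold {threshold!r} is outside [0, 1]")


def classify_sdc(
    accuracy_drop: float,
    vendor_signal: bool,
    threshold: float = 0.005,   # documented default (0.5pp)
    multiplier: float = 2.0,    # documented default for sdc_accuracy_only_multiplier
) -> Optional[str]:
    """Return the verdict kind, or None when the eval cycle is skipped.

    vendor_signal is a hypothetical stand-in for "a same-job vendor SDC
    counter rose during the job window".
    """
    validate_threshold(threshold)
    if accuracy_drop < threshold:
        return None                      # below threshold: cycle skipped
    if vendor_signal:
        return "vendor_signaled"         # full-confidence branch
    if accuracy_drop >= threshold * multiplier:
        return "accuracy_only"           # partial-confidence branch
    return None                          # assumed: no verdict in the gap
```

With the defaults, a 0.6pp drop produces a verdict only when the vendor counter rose, while a 1.2pp drop produces `accuracy_only` even without one.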

**When to raise above 0.5pp (lose recall, gain precision):**

- Recipes with high run-to-run variance (small-batch fine-tunes, low-precision training such as FP8/BF16 with aggressive loss scaling, data-augmented pipelines with non-deterministic augmentation order) where the recipe's own variance band exceeds 0.5pp. Raise to at least the upper bound of the measured variance band (e.g. `0.01` / 1pp for a recipe whose seeded re-runs span ±0.7pp).
- Noisy eval sets (small held-out splits, high-variance eval metrics like ROUGE-L on long-form generation) where the eval itself contributes variance independent of the model.
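
One way to check whether a recipe falls into the "raise" case is to measure the band from a handful of seeded re-runs. The sketch below assumes the "variance band" means half the spread between the best and worst seeded re-run (so ±0.7pp corresponds to a half-width of `0.007`); the helper names are hypothetical and this is not part of the detector.

```python
from typing import Sequence


def variance_band_half_width(accuracies: Sequence[float]) -> float:
    """Half the max-min spread of eval accuracy across seeded re-runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeded re-runs to measure a band")
    return (max(accuracies) - min(accuracies)) / 2.0


def exceeds_threshold(accuracies: Sequence[float], threshold: float = 0.005) -> bool:
    """True when the recipe's own band exceeds the threshold (candidate for raising)."""
    return variance_band_half_width(accuracies) > threshold
```

For example, re-runs scoring `0.900`, `0.907`, and `0.914` give a half-width of about `0.007`, which exceeds the `0.005` default and suggests raising the threshold.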

**When to lower below 0.5pp (gain recall, keep precision via vendor join):**

- High-precision recipes with tight measured variance — large supervised fine-tunes where seeded re-runs span <0.1pp, deterministic eval suites (token-level accuracy on a fixed eval set with deterministic dataloader), final-checkpoint evals only. Lower to ~2-3× the recipe's measured run-to-run std-dev (e.g. `0.002` / 0.2pp when the recipe's variance band is ±0.07pp).
- Catching single-bit weight flips in small models. A single-bit flip in a small (<1B-param) model's weight tensor can shift downstream eval by 0.1-0.3pp — below the default. If the operator has paired vendor SDC counter coverage (DCGM SDC catcher running on every node), lowering the threshold lets the `vendor_signaled` branch fire on smaller regressions where the vendor counter provides the high-precision discriminator.
- Catching transient single-batch corruption in large training. A single corrupted minibatch on one rank in a 1000-rank job can shift the eval-pass accuracy by <0.2pp because the corrupted gradient is diluted when gradients are averaged across ranks. Operators chasing these single-batch events lower the threshold AND rely on the `vendor_signaled` branch (which requires the SDC counter to rise during the job window).
- Detector running in advisory-only mode where operators want maximum recall and triage false positives manually. Lowering the threshold widens the `accuracy_only` band too; operators lowering the threshold SHOULD also raise `sdc_accuracy_only_multiplier` (e.g. to `3.0` or `4.0`) to preserve precision on the no-vendor-signal branch.
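
The "2-3× measured std-dev" guidance and the multiplier interaction can be worked through numerically. This is a sketch under stated assumptions: `suggest_threshold` and `accuracy_only_floor` are hypothetical helpers, the sample standard deviation is used as the "measured std-dev", and `k = 3.0` is simply the upper end of the range given above.

```python
import statistics
from typing import Sequence


def suggest_threshold(accuracies: Sequence[float], k: float = 3.0) -> float:
    """Suggest a threshold of k times the sample std-dev of seeded re-runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeded re-runs")
    if k <= 0:
        raise ValueError("k must be positive")
    sigma = statistics.stdev(accuracies)
    return min(1.0, k * sigma)  # keep the result inside the valid [0, 1] range


def accuracy_only_floor(threshold: float, multiplier: float) -> float:
    """Smallest drop that can fire the accuracy_only branch."""
    return threshold * multiplier
```

Five re-runs scoring `0.9000`, `0.9007`, `0.8993`, `0.9004`, and `0.8996` have a sample std-dev of roughly 0.06pp, so the suggested threshold lands a little under 0.2pp. `accuracy_only_floor(0.003, 3.0)` is approximately `0.009`, i.e. the no-vendor-signal branch still needs a drop of about 0.9pp.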

Example — lower threshold to 0.3pp while keeping the no-vendor-signal branch quiet:

```yaml
processors:
  patterndetector:
    sdc_accuracy_drop_threshold: 0.003
    sdc_accuracy_only_multiplier: 3.0  # accuracy_only branch still needs >= 0.9pp
```

**Why the default stays at 0.005.** The 0.5pp default targets the recall/precision sweet spot for typical foundation-model training recipes whose seeded run-to-run variance bands sit in the 0.1-0.3pp range. A lower default would generate `accuracy_only` false positives on recipes the operator hasn't characterized yet (the most common case at first deployment). The default is the conservative "ship without per-recipe tuning" value; operators who have measured their recipe's variance band SHOULD tune in whichever direction the measurement supports.

**Empirical basis (heuristic, not a published study).** The 0.5-3pp band cited in the §Symptom section is observational, drawn from public SDC incident reports (Meta's *Detecting Silent Data Corruptions in the Wild*, 2022; Google's *Cores that don't count*, HotOS 2021) and the NVIDIA SDC catcher synthetic-event behaviour. These describe the *typical* observed band, not a hard physical floor — sub-0.5pp incidents are reported less often because they're harder to attribute, not because they don't occur. Treat the 0.5pp default as a heuristic; pick a per-recipe value where one is measurable.

## Tested-against
