diff --git a/docs/ATTRIBUTES.md b/docs/ATTRIBUTES.md
index 195d405f..421f2e41 100644
--- a/docs/ATTRIBUTES.md
+++ b/docs/ATTRIBUTES.md
@@ -136,7 +136,7 @@ hardware signal.
 | `tracecore.alert.dataloader_hang.stall_seconds` | int | tracecore-ext | alpha | Promoted wall-clock stall duration on `dataloader_hang` verdicts so dashboards render a histogram without parsing JSON | `patterndetectorprocessor.appendDataLoaderHangVerdict` | Operator dashboards (verdict-stream tier) |
 | `tracecore.alert.dataloader_hang.discriminator` | string | tracecore-ext | alpha | Promoted discriminator branch on `dataloader_hang` verdicts (`worker_killed` \| `storage_event`) so dashboards fan out by root cause | `appendDataLoaderHangVerdict` | Operator dashboards |
 | `tracecore.alert.silent_data_corruption.kind` | string | tracecore-ext | alpha | `vendor_signaled` (full confidence; same-job `hw.gpu.sdc.*` rose during job window) or `accuracy_only` (partial; eval drop >= 2x threshold, no vendor signal) | `patterndetectorprocessor.appendSilentDataCorruptionVerdict` | Operator dashboards (verdict-stream tier) |
-| `tracecore.alert.silent_data_corruption.accuracy_drop` | double | tracecore-ext | alpha | `baseline - observed` accuracy drop in absolute units. Range [0, 1]. Promoted so dashboards bucket regression magnitudes without parsing JSON | `appendSilentDataCorruptionVerdict` | Operator dashboards |
+| `tracecore.alert.silent_data_corruption.accuracy_drop` | double | tracecore-ext | alpha | `baseline - observed` accuracy drop in absolute units. Range [0, 1]. Promoted so dashboards bucket regression magnitudes without parsing JSON. The emission threshold is operator-tunable via `sdc_accuracy_drop_threshold` (default `0.005`, i.e. 0.5pp; a heuristic, see `docs/patterns/13-silent-data-corruption.md` §"Threshold tuning") | `appendSilentDataCorruptionVerdict` | Operator dashboards |
 | `tracecore.alert.silent_data_corruption.suspect_gpu_id` | string | tracecore-ext | alpha | PCI BDF of the GPU whose vendor SDC counter rose during the job window. Omitted (not empty-stamped) on `accuracy_only` verdicts to avoid empty-filter false-matches | `appendSilentDataCorruptionVerdict` | Operator dashboards (drain candidate) |
 | `tracecore.alert.silent_data_corruption.suspect_node` | string | tracecore-ext | alpha | Kubernetes node name carrying the suspect GPU. Omitted on `accuracy_only` verdicts | `appendSilentDataCorruptionVerdict` | Operator dashboards |
 
diff --git a/docs/patterns/13-silent-data-corruption.md b/docs/patterns/13-silent-data-corruption.md
index d7d1828d..4a58e2b4 100644
--- a/docs/patterns/13-silent-data-corruption.md
+++ b/docs/patterns/13-silent-data-corruption.md
@@ -70,6 +70,36 @@ See [correlation-window semantics comparison](README.md#correlation-window-seman
 - **Dataset drift.** Eval-set changes (new sharding, new tokenizer) cause apparent accuracy drops. Detector requires `eval_set.checksum` to match the baseline's; otherwise suppress.
 - **Cherry-picked checkpoint.** If the operator evaluates an intermediate checkpoint vs. a final-checkpoint baseline, the drop is expected. Verdict suppressed unless `gen_ai.training.checkpoint.step` matches baseline's.
 - **Hardware-attribution false positive.** A vendor SDC counter rising doesn't *prove* THIS job consumed corrupted data — it proves the GPU saw an SDC event. Verdict says "suspect," not "confirmed."
+- **Sub-threshold SDC.** Real-world SDC can land below the 0.5pp default: single-bit weight flips in small models and transient single-batch corruption in large training jobs both produce sub-threshold regressions. The default is a precision floor, not a physical lower bound; see "Threshold tuning" below for when to lower it.
+
+## Threshold tuning
+
+`AccuracyDropThreshold` (operator-facing knob `sdc_accuracy_drop_threshold`; default `0.005`, i.e. 0.5pp) is a precision-vs-recall knob, not a physical SDC floor. The detector skips an eval cycle when `accuracy_drop < threshold`, so the threshold MUST exceed the recipe's known run-to-run variance band; above that band, the lowest workable value is preferred (more recall, same precision). Validation rejects values outside `[0, 1]`. The companion knob `sdc_accuracy_only_multiplier` (default `2.0`) gates the partial-confidence (`kind=accuracy_only`) branch, which requires `accuracy_drop >= threshold × multiplier`.
+
+**When to raise above 0.5pp (lose recall, gain precision):**
+
+- Recipes with high run-to-run variance (small-batch fine-tunes, low-precision training such as FP8/BF16 with aggressive loss-scale, data-augmented pipelines with non-deterministic augmentation order) where the recipe's own variance band exceeds 0.5pp. Raise to at least the upper bound of the measured variance band (e.g. `0.01` / 1pp for a recipe whose seeded re-runs span ±0.7pp).
+- Noisy eval sets (small held-out splits, high-variance eval metrics like ROUGE-L on long-form generation) where the eval itself contributes variance independent of the model.
+
+**When to lower below 0.5pp (gain recall, keep precision via vendor join):**
+
+- High-precision recipes with tight measured variance: large supervised fine-tunes where seeded re-runs span <0.1pp, deterministic eval suites (token-level accuracy on a fixed eval set with deterministic dataloader), final-checkpoint evals only. Lower to ~2-3× the recipe's measured run-to-run std-dev (e.g. `0.002` / 0.2pp when the recipe's variance band is ±0.07pp).
+- Catching single-bit weight flips in small models. A single-bit flip in a small (<1B-param) model's weight tensor can shift downstream eval by 0.1-0.3pp, below the default. If the operator has paired vendor SDC counter coverage (DCGM SDC catcher running on every node), lowering the threshold lets the `vendor_signaled` branch fire on smaller regressions where the vendor counter provides the high-precision discriminator.
+- Catching transient single-batch corruption in large training. A single corrupted minibatch on one rank in a 1000-rank job can shift the eval-pass accuracy by <0.2pp because the corruption is diluted when gradients are averaged across ranks. Operators chasing these sub-batch events lower the threshold AND rely on the `vendor_signaled` branch (which requires the SDC counter to rise during the job window).
+- Detector running in advisory-only mode where operators want maximum recall and triage false positives manually. Lowering the threshold widens the `accuracy_only` band too; operators lowering the threshold SHOULD also raise `sdc_accuracy_only_multiplier` (e.g. to `3.0` or `4.0`) to preserve precision on the no-vendor-signal branch.
+
+Example that lowers the threshold to 0.3pp while keeping the no-vendor-signal branch quiet:
+
+```yaml
+processors:
+  patterndetector:
+    sdc_accuracy_drop_threshold: 0.003
+    sdc_accuracy_only_multiplier: 3.0 # accuracy_only branch still needs >= 0.9pp
+```
+
+**Why the default stays at 0.005.** The 0.5pp default targets the recall/precision sweet spot for typical foundation-model training recipes whose seeded run-to-run variance bands sit in the 0.1-0.3pp range. Setting the default lower would generate `accuracy_only` false positives on recipes the operator hasn't characterized yet (the most common case at first deployment). The default is the conservative "ship without per-recipe tuning" value; operators who measure their recipe's variance band SHOULD tune in either direction.
+
+**Empirical basis (heuristic, not a published study).** The 0.5-3pp band cited in §Symptom is observational, drawn from public SDC incident reports (Meta's *Detecting Silent Data Corruptions in the Wild*, 2022; Google's *Cores that don't count*, HotOS 2021) and the NVIDIA SDC catcher synthetic-event behaviour. These describe the *typical* observed band, not a hard physical floor: sub-0.5pp incidents are reported less often because they're harder to attribute, not because they don't occur. Treat the 0.5pp default as a heuristic; pick a per-recipe value where one is measurable.
 
 ## Tested-against
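
The gating rules described in the "Threshold tuning" section of this patch can be sketched in Go as follows. This is a minimal illustration under assumptions, not the processor's implementation: the `Config` struct, the `Classify` function, and the field name `AccuracyOnlyMultiplier` are hypothetical (only `AccuracyDropThreshold` and the two YAML knob names appear in the patch), and the eval-set checksum and checkpoint-step suppressions are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mirrors the two operator-facing knobs from the patch.
// The struct itself and AccuracyOnlyMultiplier are hypothetical names.
type Config struct {
	AccuracyDropThreshold  float64 // sdc_accuracy_drop_threshold (default 0.005)
	AccuracyOnlyMultiplier float64 // sdc_accuracy_only_multiplier (default 2.0)
}

// Validate rejects thresholds outside [0, 1], as the section describes.
func (c Config) Validate() error {
	if c.AccuracyDropThreshold < 0 || c.AccuracyDropThreshold > 1 {
		return errors.New("sdc_accuracy_drop_threshold must be within [0, 1]")
	}
	return nil
}

// Classify returns the verdict kind for one eval cycle,
// or "" when no verdict would be emitted.
func Classify(c Config, baseline, observed float64, vendorSignaled bool) string {
	drop := baseline - observed
	if drop < c.AccuracyDropThreshold {
		return "" // below the threshold: the eval cycle is skipped
	}
	if vendorSignaled {
		return "vendor_signaled" // full confidence: vendor SDC counter rose
	}
	if drop >= c.AccuracyDropThreshold*c.AccuracyOnlyMultiplier {
		return "accuracy_only" // partial confidence: no vendor signal
	}
	return "" // above threshold but below the accuracy_only bar
}

func main() {
	// Values taken from the YAML example in the patch.
	cfg := Config{AccuracyDropThreshold: 0.003, AccuracyOnlyMultiplier: 3.0}
	if err := cfg.Validate(); err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", Classify(cfg, 0.900, 0.896, true))  // 0.4pp drop, vendor signal
	fmt.Printf("%q\n", Classify(cfg, 0.900, 0.896, false)) // 0.4pp drop, no vendor signal
	fmt.Printf("%q\n", Classify(cfg, 0.900, 0.888, false)) // 1.2pp drop, no vendor signal
}
```

With the example config from the patch (`0.003`, `3.0`), a 0.4pp drop produces a verdict only when the vendor counter rose, while a 1.2pp drop clears the 0.9pp bar and produces `accuracy_only` without one.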