feat(sdc): pattern-13 silent data corruption detector#344

Merged: trilamsr merged 2 commits into main from feat/pattern-15-silent-data-corruption on Jun 1, 2026

trilamsr (Contributor) commented Jun 1, 2026

Summary

Ships pattern #13 (silent data corruption) end-to-end per the spec at docs/patterns/13-silent-data-corruption.md — the hardest pattern in the 15-set, where the run completes, loss looks normal, but downstream eval shows degraded model quality.

  • Library: SilentDataCorruptionDetector in module/pkg/patterns/silent_data_corruption.go consumes EvalAccuracyRecord + SDCCounterRecord typed projections and emits SilentDataCorruptionVerdict. Discriminator follows the spec's evaluation rule:
    • accuracy_drop >= AccuracyDropThreshold AND same-job hw.gpu.sdc.* counter rose during the job window → kind=vendor_signaled (full confidence)
    • accuracy_drop >= AccuracyDropThreshold * 2 alone → kind=accuracy_only (partial confidence)
    • sub-2x-threshold without vendor signal → no verdict (conservative-by-default noise band)
  • False-positive guards (spec §"Edge cases"): dataset-checksum mismatch suppresses verdicts; checkpoint-step mismatch suppresses verdicts. Both guards are opt-in (only fire when both sides set the field).
  • Wiring: patterndetectorprocessor projects gen_ai.training.eval_accuracy* + hw.gpu.sdc.* log records, runs the detector, and emits the verdict with promoted scalars per issue #270 (patterndetectorprocessor: promote operator-facing scalar attrs onto verdict records).
  • Schema: module/pkg/patterns/testdata/silent_data_corruption_verdict.schema.json pins the verdict wire format; 13 schema-drift falsifiers + conformance round-trip for both branches.
  • Docs: spec status flipped to shipped; docs/patterns/README.md table updated; docs/ATTRIBUTES.md registers the 13 new customer-stable attribute keys (hw.gpu.sdc.{delta,kind}, gen_ai.training.eval_accuracy.*, gen_ai.training.eval_set.*, gen_ai.training.checkpoint.*, gen_ai.training.job.{start,end}_unix_nano, tracecore.alert.silent_data_corruption.*) and adds the silent_data_corruption row to the per-pattern matrix.
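
For readers skimming the discriminator and the two false-positive guards above, the following is a minimal sketch of how the rule could be expressed. It is not the code in silent_data_corruption.go: the `evalInput` struct, the `classify` function, and the convention that an empty checksum or a zero step means "unset" are all assumptions made for illustration.

```go
package main

import "fmt"

// Kind mirrors the two verdict kinds named in the PR summary.
type Kind string

const (
	KindNone           Kind = ""                // no verdict emitted
	KindVendorSignaled Kind = "vendor_signaled" // full confidence
	KindAccuracyOnly   Kind = "accuracy_only"   // partial confidence
)

// evalInput is a hypothetical flattened view of one eval result plus the
// vendor signal. Field names are illustrative, not the shipped types.
type evalInput struct {
	BaselineAccuracy  float64
	ObservedAccuracy  float64
	VendorCounterRose bool // a same-job hw.gpu.sdc.* counter rose in the job window

	// Guard fields. Assumption: "" and 0 mean the side did not set the field.
	BaselineChecksum, ObservedChecksum string
	BaselineStep, ObservedStep         int64
}

// classify applies the guards first, then the threshold rule.
func classify(in evalInput, threshold, multiplier float64) Kind {
	// Opt-in guard: only fires when both sides set a dataset checksum.
	if in.BaselineChecksum != "" && in.ObservedChecksum != "" &&
		in.BaselineChecksum != in.ObservedChecksum {
		return KindNone
	}
	// Opt-in guard: only fires when both sides set a checkpoint step.
	if in.BaselineStep != 0 && in.ObservedStep != 0 &&
		in.BaselineStep != in.ObservedStep {
		return KindNone
	}

	drop := in.BaselineAccuracy - in.ObservedAccuracy
	switch {
	case drop >= threshold && in.VendorCounterRose:
		return KindVendorSignaled
	case drop >= threshold*multiplier:
		return KindAccuracyOnly
	default:
		// Below the accuracy-only gate with no vendor signal: noise band.
		return KindNone
	}
}

func main() {
	in := evalInput{BaselineAccuracy: 0.90, ObservedAccuracy: 0.894, VendorCounterRose: true}
	fmt.Printf("kind=%q\n", classify(in, 0.005, 2.0))
}
```

Checking the guards before the thresholds keeps an expected drop (different eval set or checkpoint) from ever reaching the vendor-signal branch; whether the shipped detector orders them this way is not stated in the PR.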

Verdict remains explicitly advisory per spec §"Detector evaluation rule" — both branches route the operator to a same-recipe + same-seed re-run on different hardware as the disambiguator, because SDC repro is non-deterministic.
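
The "same-job counter rose during the job window" condition used by the vendor_signaled branch might be checked roughly as below. This is a sketch under assumptions: the `sdcCounter` and `jobWindow` types, the job-ID join key, and the inclusive window bounds are hypothetical; only the attribute names in the comments (hw.gpu.sdc.delta, gen_ai.training.job.{start,end}_unix_nano) come from this PR.

```go
package main

import "fmt"

// sdcCounter is a hypothetical projection of one hw.gpu.sdc.* log record.
type sdcCounter struct {
	JobID        string
	TimeUnixNano int64
	Delta        int64 // from hw.gpu.sdc.delta
}

// jobWindow is a hypothetical view of the job bounds.
type jobWindow struct {
	JobID         string
	StartUnixNano int64 // from gen_ai.training.job.start_unix_nano
	EndUnixNano   int64 // from gen_ai.training.job.end_unix_nano
}

// counterRoseInWindow reports whether any counter for the same job shows a
// positive delta inside the window. Bounds are inclusive (an assumption).
func counterRoseInWindow(w jobWindow, counters []sdcCounter) bool {
	for _, c := range counters {
		if c.JobID != w.JobID {
			continue // different job: not evidence for this run
		}
		if c.Delta <= 0 {
			continue // counter did not rise
		}
		if c.TimeUnixNano < w.StartUnixNano || c.TimeUnixNano > w.EndUnixNano {
			continue // outside the job window
		}
		return true
	}
	return false
}

func main() {
	w := jobWindow{JobID: "job-a", StartUnixNano: 100, EndUnixNano: 200}
	counters := []sdcCounter{
		{JobID: "job-b", TimeUnixNano: 150, Delta: 3}, // other job
		{JobID: "job-a", TimeUnixNano: 250, Delta: 1}, // after the window
		{JobID: "job-a", TimeUnixNano: 150, Delta: 2}, // qualifies
	}
	fmt.Println(counterRoseInWindow(w, counters))
}
```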

Test plan

  • cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -count=1 — green (19 SDC library tests incl. 13 schema falsifiers, 7 wiring tests).
  • cd module && go vet ./... — clean.
  • golangci-lint run ./... — 0 issues (via pre-commit hook).
  • attribute-namespace-check — 82/82 unique attribute literals documented (via pre-commit hook).
  • Full module go build ./... — clean.

Integration gaps (upstream-blocked, documented in-place)

  • gen_ai.training.eval_accuracy upstream framework instrumentation (spec §"Open questions" 5) — projection tested against hand-crafted records; real-world input arrives once the OTTL recipe lands.
  • hw.gpu.sdc.* OTTL recipe for DCGM SDC catcher / row-remap / AMD ECC counters — blocked on RFC-0014 PR-B (metrics→logs pattern). Cross-vendor support depends on hw.gpu.sdc.* semconv that doesn't exist yet (spec §"Open questions" 2).
  • Baseline-accuracy provenance + cross-run state are v0.3+ product questions (spec §"Open questions" 1, 4) — the detector accepts operator-stamped baselines today.
Add pattern #13 (silent data corruption) detector. Surfaces suspected SDC when an eval-accuracy regression (`gen_ai.training.eval_accuracy` vs baseline) crosses the configured threshold; a same-job `hw.gpu.sdc.*` counter rise during the job window flips the verdict to high confidence (`kind=vendor_signaled`). The verdict is advisory — both branches recommend a same-recipe re-run on different hardware as the disambiguator. New config knobs: `sdc_accuracy_drop_threshold` (default `0.005`) and `sdc_accuracy_only_multiplier` (default `2.0`).
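
As an illustration of the two knobs and their defaults, a config struct with validation could look like the sketch below. The struct name, field names, `mapstructure` tags, and the specific bounds enforced in `Validate` are assumptions; the PR only states the knob names, their defaults, and that Validate guards exist for the new fields.

```go
package main

import "fmt"

// sdcConfig is a hypothetical slice of the processor config holding the two
// knobs named in this PR. The mapstructure tags are assumed, not confirmed.
type sdcConfig struct {
	AccuracyDropThreshold  float64 `mapstructure:"sdc_accuracy_drop_threshold"`
	AccuracyOnlyMultiplier float64 `mapstructure:"sdc_accuracy_only_multiplier"`
}

// defaultSDCConfig returns the defaults stated in the PR description.
func defaultSDCConfig() sdcConfig {
	return sdcConfig{
		AccuracyDropThreshold:  0.005, // absolute accuracy drop
		AccuracyOnlyMultiplier: 2.0,   // accuracy_only gate = threshold * multiplier
	}
}

// Validate rejects values that would make the rule meaningless.
// The exact bounds here are illustrative assumptions.
func (c sdcConfig) Validate() error {
	if c.AccuracyDropThreshold <= 0 || c.AccuracyDropThreshold >= 1 {
		return fmt.Errorf("sdc_accuracy_drop_threshold must be in (0, 1), got %v", c.AccuracyDropThreshold)
	}
	if c.AccuracyOnlyMultiplier < 1 {
		return fmt.Errorf("sdc_accuracy_only_multiplier must be >= 1, got %v", c.AccuracyOnlyMultiplier)
	}
	return nil
}

func main() {
	cfg := defaultSDCConfig()
	if err := cfg.Validate(); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	// Drop required for an accuracy_only verdict when no vendor signal exists.
	fmt.Println("accuracy_only gate:", cfg.AccuracyDropThreshold*cfg.AccuracyOnlyMultiplier)
}
```

Requiring the multiplier to be at least 1 keeps the accuracy_only gate from being looser than the vendor_signaled gate, which matches the conservative-by-default intent described in the summary.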

Tri Lam added 2 commits June 1, 2026 01:34
Add SilentDataCorruptionDetector + EvalAccuracyRecord +
SDCCounterRecord typed projections and the verdict shape per
docs/patterns/13. Discriminator follows the spec's evaluation rule:
accuracy drop >= threshold joined by a same-job hw.gpu.sdc.* counter
during the job window emits kind=vendor_signaled (full); drop alone
>= 2x threshold emits kind=accuracy_only (partial). Verdict is
advisory — both branches route to a same-recipe re-run on different
hardware as the disambiguator.

Conservative-by-default thresholds (0.005 absolute drop, 2x partial
gate) reject the sub-2x-threshold-no-vendor-signal noise band. False-
positive guards on dataset checksum + checkpoint step suppress
expected-drop scenarios. Library-side schema (19 detector tests + 13
schema-drift falsifiers) pins the verdict wire format.

Signed-off-by: Tri Lam <tri@maydow.com>

Wire SilentDataCorruptionDetector into the patterndetectorprocessor
ConsumeLogs path. Add projectEvalAccuracyRecord +
projectSDCCounterRecord projections (gated on gen_ai.training.* +
hw.gpu.sdc.* customer-stable namespaces) and
appendSilentDataCorruptionVerdict to emit the verdict log record
with promoted scalars per issue #270.

Register the new signal attributes in docs/ATTRIBUTES.md:
hw.gpu.sdc.{delta,kind}, gen_ai.training.eval_accuracy.*,
gen_ai.training.eval_set.*, gen_ai.training.checkpoint.*,
gen_ai.training.job.{start,end}_unix_nano, and
tracecore.alert.silent_data_corruption.{kind,accuracy_drop,
suspect_gpu_id,suspect_node}. Update docs/patterns/13 status to
shipped and bump docs/patterns/README accordingly.

Wiring tests cover: vendor_signaled (full) + accuracy_only (partial)
end-to-end emission, partial suppression toggle, no-eval-no-verdict
gating, threshold configurability, and Validate guards on the new
config fields.

Signed-off-by: Tri Lam <tri@maydow.com>

trilamsr (Contributor, Author) commented Jun 1, 2026

Reviewer verdict: SHIP-WITH-NITS. The yellow finding (threshold-lowering guidance for sub-0.5pp SDC) is tracked in #348 for follow-up. The blue nit (panic guard) is defensive-only; library tests pass, so it is punted. The question about the 0.5pp citation also moves to #348.

trilamsr merged commit a7fcfc5 into main on Jun 1, 2026
15 checks passed
trilamsr deleted the feat/pattern-15-silent-data-corruption branch on June 1, 2026 08:55