feat(sdc): pattern-13 silent data corruption detector by trilamsr · Pull Request #344 · TraceCoreAI/tracecore

trilamsr · 2026-06-01T08:39:30Z

Summary

Ships pattern #13 (silent data corruption) end-to-end per the spec at docs/patterns/13-silent-data-corruption.md — the hardest pattern in the 15-set, where the run completes, loss looks normal, but downstream eval shows degraded model quality.

- Library: SilentDataCorruptionDetector in module/pkg/patterns/silent_data_corruption.go consumes EvalAccuracyRecord + SDCCounterRecord typed projections and emits SilentDataCorruptionVerdict. Discriminator follows the spec's evaluation rule:
  - accuracy_drop >= AccuracyDropThreshold AND same-job hw.gpu.sdc.* counter rose during the job window → kind=vendor_signaled (full confidence)
  - accuracy_drop >= AccuracyDropThreshold * 2 alone → kind=accuracy_only (partial confidence)
  - sub-2x-threshold without vendor signal → no verdict (conservative-by-default noise band)
- False-positive guards (spec §"Edge cases"): dataset-checksum mismatch suppresses verdicts; checkpoint-step mismatch suppresses verdicts. Both guards are opt-in (only fire when both sides set the field).
- Wiring: patterndetectorprocessor projects gen_ai.training.eval_accuracy* + hw.gpu.sdc.* log records, runs the detector, and emits the verdict with promoted scalars per issue #270 (patterndetectorprocessor: promote operator-facing scalar attrs onto verdict records).
- Schema: module/pkg/patterns/testdata/silent_data_corruption_verdict.schema.json pins the verdict wire format; 13 schema-drift falsifiers + conformance round-trip for both branches.
- Docs: spec status flipped to shipped; docs/patterns/README.md table updated; docs/ATTRIBUTES.md registers the 13 new customer-stable attribute keys (hw.gpu.sdc.{delta,kind}, gen_ai.training.eval_accuracy.*, gen_ai.training.eval_set.*, gen_ai.training.checkpoint.*, gen_ai.training.job.{start,end}_unix_nano, tracecore.alert.silent_data_corruption.*) and adds the silent_data_corruption row to the per-pattern matrix.

Verdict remains explicitly advisory per spec §"Detector evaluation rule" — both branches route the operator to a same-recipe + same-seed re-run on different hardware as the disambiguator, because SDC repro is non-deterministic.
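
For readers skimming the evaluation rule above, a minimal sketch of the three-way discriminator might look like the following. It is not the code in silent_data_corruption.go: the `Thresholds` type, the `Classify` function, and their field names are hypothetical, and only the `vendor_signaled` / `accuracy_only` labels and the threshold semantics come from this description.

```go
package main

// Kind names the verdict branch. The string values mirror the kind= labels
// used in the PR description; the Go identifiers are hypothetical.
type Kind string

const (
	KindNone           Kind = ""                // no verdict emitted
	KindVendorSignaled Kind = "vendor_signaled" // full confidence
	KindAccuracyOnly   Kind = "accuracy_only"   // partial confidence
)

// Thresholds is a hypothetical stand-in for the detector's tunables.
type Thresholds struct {
	AccuracyDrop           float64 // absolute accuracy drop, e.g. 0.005
	AccuracyOnlyMultiplier float64 // gate for the accuracy-only branch, e.g. 2.0
}

// Classify applies the rule as described in the summary:
//   - drop >= threshold AND a same-job SDC counter rose -> vendor_signaled
//   - drop >= threshold * multiplier, no vendor signal  -> accuracy_only
//   - anything else                                     -> no verdict
func Classify(baseline, observed float64, sdcCounterRose bool, t Thresholds) Kind {
	drop := baseline - observed
	switch {
	case drop >= t.AccuracyDrop && sdcCounterRose:
		return KindVendorSignaled
	case drop >= t.AccuracyDrop*t.AccuracyOnlyMultiplier:
		return KindAccuracyOnly
	default:
		return KindNone
	}
}
```

With the default-style values (0.005 and 2.0), a 0.7pp drop yields a verdict only when a vendor counter also rose, while a 2pp drop yields `accuracy_only` on its own.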

Test plan

- cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -count=1: green (19 SDC library tests incl. 13 schema falsifiers, 7 wiring tests).
- cd module && go vet ./...: clean.
- golangci-lint run ./...: 0 issues (via pre-commit hook).
- attribute-namespace-check: 82/82 unique attribute literals documented (via pre-commit hook).
- Full module go build ./...: clean.

Integration gaps (upstream-blocked, documented in-place)

- gen_ai.training.eval_accuracy upstream framework instrumentation (spec §"Open questions" 5): projection tested against hand-crafted records; real-world input arrives once the OTTL recipe lands.
- hw.gpu.sdc.* OTTL recipe for DCGM SDC catcher / row-remap / AMD ECC counters: blocked on RFC-0014 PR-B (metrics→logs pattern). Cross-vendor support depends on hw.gpu.sdc.* semconv that doesn't exist yet (spec §"Open questions" 2).
- Baseline-accuracy provenance + cross-run state are v0.3+ product questions (spec §"Open questions" 1, 4); the detector accepts operator-stamped baselines today.

Add pattern #13 (silent data corruption) detector. Surfaces suspected SDC when an eval-accuracy regression (`gen_ai.training.eval_accuracy` vs baseline) crosses the configured threshold; a same-job `hw.gpu.sdc.*` counter rise during the job window flips the verdict to high confidence (`kind=vendor_signaled`). The verdict is advisory — both branches recommend a same-recipe re-run on different hardware as the disambiguator. New config knobs: `sdc_accuracy_drop_threshold` (default `0.005`) and `sdc_accuracy_only_multiplier` (default `2.0`).
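
As an illustration of the two new knobs, a config struct with defaults and a validation pass could be sketched as below. The knob names and default values are taken from the entry above; the struct name, the `mapstructure` tags, and the accepted ranges (threshold in (0, 1], multiplier >= 1) are assumptions made for this sketch, not the rules enforced by the processor's actual Validate.

```go
package main

import (
	"errors"
	"fmt"
)

// SDCConfig is a hypothetical slice of the processor config covering only
// the two knobs named in the changelog entry.
type SDCConfig struct {
	AccuracyDropThreshold  float64 `mapstructure:"sdc_accuracy_drop_threshold"`
	AccuracyOnlyMultiplier float64 `mapstructure:"sdc_accuracy_only_multiplier"`
}

// DefaultSDCConfig returns the defaults quoted in the changelog entry.
func DefaultSDCConfig() SDCConfig {
	return SDCConfig{AccuracyDropThreshold: 0.005, AccuracyOnlyMultiplier: 2.0}
}

// Validate rejects values outside the ranges assumed for this sketch.
// The negated comparisons also reject NaN.
func (c SDCConfig) Validate() error {
	var errs []error
	if !(c.AccuracyDropThreshold > 0 && c.AccuracyDropThreshold <= 1) {
		errs = append(errs, fmt.Errorf(
			"sdc_accuracy_drop_threshold must be in (0, 1], got %v",
			c.AccuracyDropThreshold))
	}
	if !(c.AccuracyOnlyMultiplier >= 1) {
		errs = append(errs, fmt.Errorf(
			"sdc_accuracy_only_multiplier must be >= 1, got %v",
			c.AccuracyOnlyMultiplier))
	}
	// errors.Join returns nil when there is nothing to report.
	return errors.Join(errs...)
}
```

Under these assumptions a multiplier below 1 is rejected because it would make the accuracy-only branch easier to trigger than the vendor-signaled one.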

Add SilentDataCorruptionDetector + EvalAccuracyRecord + SDCCounterRecord typed projections and the verdict shape per docs/patterns/13. Discriminator follows the spec's evaluation rule: accuracy drop >= threshold joined by a same-job hw.gpu.sdc.* counter during the job window emits kind=vendor_signaled (full); drop alone >= 2x threshold emits kind=accuracy_only (partial). Verdict is advisory; both branches route to a same-recipe re-run on different hardware as the disambiguator. Conservative-by-default thresholds (0.005 absolute drop, 2x partial gate) reject the sub-2x-threshold-no-vendor-signal noise band. False-positive guards on dataset checksum + checkpoint step suppress expected-drop scenarios. Library-side schema (19 detector tests + 13 schema-drift falsifiers) pins the verdict wire format.

Signed-off-by: Tri Lam <tri@maydow.com>
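
The opt-in behavior of the false-positive guards (a guard fires only when both sides set the field) can be sketched as follows. The `EvalSide` type and `Suppressed` function are hypothetical, and treating an empty checksum or a zero step as "unset" is an assumption of this sketch rather than something the PR states.

```go
package main

// EvalSide is a hypothetical view of one side of the comparison: either the
// baseline or the current eval record.
type EvalSide struct {
	DatasetChecksum string // empty is treated as unset (assumption)
	CheckpointStep  int64  // zero is treated as unset (assumption)
}

// Suppressed reports whether a verdict should be withheld because the two
// sides are not comparable. Each guard is opt-in: it can only fire when
// both sides carry the field.
func Suppressed(baseline, current EvalSide) bool {
	// Different eval datasets make an accuracy drop expected, not suspicious.
	if baseline.DatasetChecksum != "" && current.DatasetChecksum != "" &&
		baseline.DatasetChecksum != current.DatasetChecksum {
		return true
	}
	// Different checkpoint steps mean the two evals measured different models.
	if baseline.CheckpointStep != 0 && current.CheckpointStep != 0 &&
		baseline.CheckpointStep != current.CheckpointStep {
		return true
	}
	return false
}
```

If only one side stamps a checksum, the guard stays silent and the discriminator runs as usual.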

Wire SilentDataCorruptionDetector into the patterndetectorprocessor ConsumeLogs path. Add projectEvalAccuracyRecord + projectSDCCounterRecord projections (gated on gen_ai.training.* + hw.gpu.sdc.* customer-stable namespaces) and appendSilentDataCorruptionVerdict to emit the verdict log record with promoted scalars per issue #270.

Register the new signal attributes in docs/ATTRIBUTES.md: hw.gpu.sdc.{delta,kind}, gen_ai.training.eval_accuracy.*, gen_ai.training.eval_set.*, gen_ai.training.checkpoint.*, gen_ai.training.job.{start,end}_unix_nano, and tracecore.alert.silent_data_corruption.{kind,accuracy_drop,suspect_gpu_id,suspect_node}. Update docs/patterns/13 status to shipped and bump docs/patterns/README accordingly.

Wiring tests cover: vendor_signaled (full) + accuracy_only (partial) end-to-end emission, partial suppression toggle, no-eval-no-verdict gating, threshold configurability, and Validate guards on the new config fields.

Signed-off-by: Tri Lam <tri@maydow.com>
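
The "same-job counter rose during the job window" join can be sketched roughly as below. The `SDCCounterSample` and `JobWindow` types and their fields are hypothetical simplifications of the projected records; the fields are loosely modeled on hw.gpu.sdc.delta and gen_ai.training.job.{start,end}_unix_nano, and the inclusive window bounds are an assumption.

```go
package main

// SDCCounterSample is a hypothetical, simplified projection of one
// hw.gpu.sdc.* log record.
type SDCCounterSample struct {
	JobID        string
	TimeUnixNano int64
	Delta        int64 // counter increase, modeled on hw.gpu.sdc.delta
}

// JobWindow is a hypothetical view of the job bounds, modeled on
// gen_ai.training.job.{start,end}_unix_nano.
type JobWindow struct {
	JobID         string
	StartUnixNano int64
	EndUnixNano   int64
}

// CounterRoseInWindow reports whether any sample for the same job shows a
// positive delta inside the window. Bounds are inclusive (assumption).
func CounterRoseInWindow(w JobWindow, samples []SDCCounterSample) bool {
	for _, s := range samples {
		if s.JobID != w.JobID {
			continue // different job: never joins
		}
		if s.TimeUnixNano < w.StartUnixNano || s.TimeUnixNano > w.EndUnixNano {
			continue // outside the job window
		}
		if s.Delta > 0 {
			return true
		}
	}
	return false
}
```

A counter rise on another job, or one recorded before the job started, does not upgrade the verdict in this sketch.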

trilamsr · 2026-06-01T08:46:48Z

Reviewer verdict: SHIP-WITH-NITS. The yellow finding (threshold-lowering guidance for sub-0.5pp SDC) is tracked in #348 for follow-up. The blue nit (panic guard) is defensive-only; library tests pass, so it is punted. The question about the 0.5pp citation moves to #348.

Tri Lam added 2 commits June 1, 2026 01:34

trilamsr mentioned this pull request Jun 1, 2026

[rc1+] sdc detector: document threshold-lowering for sub-0.5pp SDC #348

Closed

trilamsr merged commit a7fcfc5 into main Jun 1, 2026
15 checks passed

trilamsr deleted the feat/pattern-15-silent-data-corruption branch June 1, 2026 08:55

This was referenced Jun 1, 2026

feat(schema): publish Verdict v1.0-rc1 envelope schema #351

Merged

docs(pattern-13): threshold-lowering guidance for sub-0.5pp SDC (#348) #410

Closed
