feat(sdc): pattern-13 silent data corruption detector #344
Merged
added 2 commits
June 1, 2026 01:34
Add SilentDataCorruptionDetector + EvalAccuracyRecord + SDCCounterRecord typed projections and the verdict shape per docs/patterns/13. Discriminator follows the spec's evaluation rule: accuracy drop >= threshold joined by a same-job hw.gpu.sdc.* counter during the job window emits kind=vendor_signaled (full); drop alone >= 2x threshold emits kind=accuracy_only (partial). Verdict is advisory: both branches route to a same-recipe re-run on different hardware as the disambiguator. Conservative-by-default thresholds (0.005 absolute drop, 2x partial gate) reject the sub-2x-threshold-no-vendor-signal noise band. False-positive guards on dataset checksum + checkpoint step suppress expected-drop scenarios. Library-side schema (19 detector tests + 13 schema-drift falsifiers) pins the verdict wire format.

Signed-off-by: Tri Lam <tri@maydow.com>
Wire SilentDataCorruptionDetector into the patterndetectorprocessor ConsumeLogs path. Add projectEvalAccuracyRecord + projectSDCCounterRecord projections (gated on gen_ai.training.* + hw.gpu.sdc.* customer-stable namespaces) and appendSilentDataCorruptionVerdict to emit the verdict log record with promoted scalars per issue #270. Register the new signal attributes in docs/ATTRIBUTES.md: hw.gpu.sdc.{delta,kind}, gen_ai.training.eval_accuracy.*, gen_ai.training.eval_set.*, gen_ai.training.checkpoint.*, gen_ai.training.job.{start,end}_unix_nano, and tracecore.alert.silent_data_corruption.{kind,accuracy_drop,suspect_gpu_id,suspect_node}. Update docs/patterns/13 status to shipped and bump docs/patterns/README accordingly. Wiring tests cover: vendor_signaled (full) + accuracy_only (partial) end-to-end emission, partial suppression toggle, no-eval-no-verdict gating, threshold configurability, and Validate guards on the new config fields.

Signed-off-by: Tri Lam <tri@maydow.com>
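The commit message mentions threshold configurability, a partial suppression toggle, and Validate guards on the new config fields, without showing them. A minimal sketch of what such a guard might look like follows. Only `AccuracyDropThreshold` and the defaults (0.005 absolute drop, 2x partial gate) come from this PR; the `SDCConfig` type, the `PartialMultiplier` and `EmitPartial` field names, and the exact bounds checked are assumptions for illustration, not the processor's actual config.

```go
package main

import "fmt"

// SDCConfig is a hypothetical config shape. Only AccuracyDropThreshold is
// named in the PR; the other field names are illustrative assumptions.
type SDCConfig struct {
	// AccuracyDropThreshold is the absolute eval-accuracy drop that arms the detector.
	AccuracyDropThreshold float64
	// PartialMultiplier scales the threshold for the accuracy_only branch (assumed configurable).
	PartialMultiplier float64
	// EmitPartial toggles emission of accuracy_only verdicts (assumed name for the suppression toggle).
	EmitPartial bool
}

// DefaultSDCConfig returns the conservative defaults quoted in the commit message.
func DefaultSDCConfig() SDCConfig {
	return SDCConfig{AccuracyDropThreshold: 0.005, PartialMultiplier: 2, EmitPartial: true}
}

// Validate rejects values that would make the discriminator meaningless.
// The bounds are assumptions: accuracy is treated as a fraction in (0, 1),
// and the partial gate must not be looser than the full gate.
func (c SDCConfig) Validate() error {
	// Written as a negated conjunction so NaN is rejected as well.
	if !(c.AccuracyDropThreshold > 0 && c.AccuracyDropThreshold < 1) {
		return fmt.Errorf("accuracy drop threshold must be in (0, 1), got %v", c.AccuracyDropThreshold)
	}
	if !(c.PartialMultiplier >= 1) {
		return fmt.Errorf("partial multiplier must be >= 1, got %v", c.PartialMultiplier)
	}
	return nil
}

func main() {
	fmt.Println(DefaultSDCConfig().Validate())
	fmt.Println(SDCConfig{AccuracyDropThreshold: 0, PartialMultiplier: 2}.Validate())
}
```

Requiring the multiplier to be at least 1 keeps the accuracy-only gate at or above the vendor-signaled gate, which matches the conservative-by-default intent described above.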
Summary
Ships pattern #13 (silent data corruption) end-to-end per the spec at `docs/patterns/13-silent-data-corruption.md`. This is the hardest pattern in the 15-set: the run completes, loss looks normal, but downstream eval shows degraded model quality.

- `SilentDataCorruptionDetector` in `module/pkg/patterns/silent_data_corruption.go` consumes `EvalAccuracyRecord` + `SDCCounterRecord` typed projections and emits `SilentDataCorruptionVerdict`. Discriminator follows the spec's evaluation rule:
  - `accuracy_drop >= AccuracyDropThreshold` AND same-job `hw.gpu.sdc.*` counter rose during the job window → `kind=vendor_signaled` (full confidence)
  - `accuracy_drop >= AccuracyDropThreshold * 2` alone → `kind=accuracy_only` (partial confidence)
- `patterndetectorprocessor` projects `gen_ai.training.eval_accuracy*` + `hw.gpu.sdc.*` log records, runs the detector, and emits the verdict with promoted scalars per issue #270 ("patterndetectorprocessor: promote operator-facing scalar attrs onto verdict records").
- `module/pkg/patterns/testdata/silent_data_corruption_verdict.schema.json` pins the verdict wire format; 13 schema-drift falsifiers + conformance round-trip for both branches.
- `docs/patterns/README.md` table updated; `docs/ATTRIBUTES.md` registers the 13 new customer-stable attribute keys (`hw.gpu.sdc.{delta,kind}`, `gen_ai.training.eval_accuracy.*`, `gen_ai.training.eval_set.*`, `gen_ai.training.checkpoint.*`, `gen_ai.training.job.{start,end}_unix_nano`, `tracecore.alert.silent_data_corruption.*`) and adds the `silent_data_corruption` row to the per-pattern matrix.

Verdict remains explicitly advisory per spec §"Detector evaluation rule": both branches route the operator to a same-recipe + same-seed re-run on different hardware as the disambiguator, because SDC repro is non-deterministic.
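The evaluation rule described in this summary can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `silent_data_corruption.go`: the `EvalAccuracy` and `SDCCounter` structs are simplified, hypothetical stand-ins for `EvalAccuracyRecord` and `SDCCounterRecord`, their field names are invented, and "counter rose during the job window" is interpreted as a positive delta whose timestamp falls inside the job's start/end bounds.

```go
package main

import "fmt"

// Kind is the verdict branch; the two non-empty values are the ones named in the PR.
type Kind string

const (
	KindNone           Kind = "" // no verdict emitted
	KindVendorSignaled Kind = "vendor_signaled"
	KindAccuracyOnly   Kind = "accuracy_only"
)

// EvalAccuracy is a hypothetical, simplified stand-in for EvalAccuracyRecord.
type EvalAccuracy struct {
	JobID        string
	Baseline     float64 // expected eval accuracy
	Observed     float64 // accuracy measured after the run
	JobStartNano int64
	JobEndNano   int64
}

// SDCCounter is a hypothetical, simplified stand-in for SDCCounterRecord.
type SDCCounter struct {
	JobID    string
	Delta    int64 // increase in a vendor SDC counter
	TimeNano int64
}

// vendorSignaled reports whether any same-job counter rose inside the job window.
func vendorSignaled(e EvalAccuracy, counters []SDCCounter) bool {
	for _, c := range counters {
		inWindow := c.TimeNano >= e.JobStartNano && c.TimeNano <= e.JobEndNano
		if c.JobID == e.JobID && c.Delta > 0 && inWindow {
			return true
		}
	}
	return false
}

// Classify applies the two-branch rule: drop plus vendor signal is the full
// branch, a drop of at least twice the threshold alone is the partial branch.
func Classify(e EvalAccuracy, counters []SDCCounter, threshold float64) Kind {
	drop := e.Baseline - e.Observed
	switch {
	case drop >= threshold && vendorSignaled(e, counters):
		return KindVendorSignaled
	case drop >= 2*threshold:
		return KindAccuracyOnly
	default:
		return KindNone
	}
}

func main() {
	small := EvalAccuracy{JobID: "job-a", Baseline: 0.90, Observed: 0.893, JobStartNano: 100, JobEndNano: 200}
	large := EvalAccuracy{JobID: "job-a", Baseline: 0.90, Observed: 0.88, JobStartNano: 100, JobEndNano: 200}
	counters := []SDCCounter{{JobID: "job-a", Delta: 1, TimeNano: 150}}

	fmt.Printf("small drop + counter: %q\n", Classify(small, counters, 0.005))
	fmt.Printf("small drop, no counter: %q\n", Classify(small, nil, 0.005))
	fmt.Printf("large drop, no counter: %q\n", Classify(large, nil, 0.005))
}
```

With the default 0.005 threshold, a drop of about 0.007 without a vendor counter falls in the noise band and yields no verdict, while the same drop joined by an in-window counter takes the full branch. Either verdict is still advisory, so the re-run on different hardware remains the disambiguating step.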
Test plan
- `cd module && go test ./pkg/patterns/... ./processor/patterndetectorprocessor/... -count=1`: green (19 SDC library tests incl. 13 schema falsifiers, 7 wiring tests).
- `cd module && go vet ./...`: clean.
- `golangci-lint run ./...`: 0 issues (via pre-commit hook).
- `attribute-namespace-check`: 82/82 unique attribute literals documented (via pre-commit hook).
- `go build ./...`: clean.

Integration gaps (upstream-blocked, documented in-place)

- `gen_ai.training.eval_accuracy` upstream framework instrumentation (spec §"Open questions" 5): projection tested against hand-crafted records; real-world input arrives once the OTTL recipe lands.
- `hw.gpu.sdc.*` OTTL recipe for DCGM SDC catcher / row-remap / AMD ECC counters: blocked on RFC-0014 PR-B (metrics→logs pattern). Cross-vendor support depends on `hw.gpu.sdc.*` semconv that doesn't exist yet (spec §"Open questions" 2).