docs(sdc): threshold-tuning guidance + heuristic framing (#348) by trilamsr · Pull Request #408 · TraceCoreAI/tracecore

trilamsr · 2026-06-01T22:51:12Z

Summary

Addresses #348: pattern-13 (silent data corruption) shipped with sdc_accuracy_drop_threshold defaulting to 0.005 (0.5pp), but the spec carried only "raise" guidance and offered no citation for the 0.5pp basis. Real-world SDC can land below the threshold (single-bit weight flips in small models; transient single-batch corruption in large training runs).

Four prose-only changes:

New §"Threshold tuning" in docs/patterns/13-silent-data-corruption.md covering both directions (raise + lower) with concrete triggers, the relationship to sdc_accuracy_only_multiplier, a copy-pasteable YAML example, and a "Why the default stays at 0.005" note.
New "Sub-threshold SDC" bullet in §"Edge cases" pointing readers to the tuning section.
§"Empirical basis" paragraph that frames the 0.5pp band as a heuristic drawn from public SDC incident reports (Meta 2022; Google HotOS 2021) and the NVIDIA SDC catcher fixture, not as a published study.
docs/ATTRIBUTES.md row for tracecore.alert.silent_data_corruption.accuracy_drop now notes the threshold is operator-tunable and points to the new §.

Root cause

The spec asserted "typical SDC regressions land between 0.5pp and 3pp" with no citation, and the §"Edge cases" guard only documented when to raise the threshold above the recipe's variance band. Operators with high-precision recipes (variance bands <0.1pp) or paired vendor SDC counter coverage had no in-spec rationale to lower the knob — so legitimate sub-0.5pp regressions were silently skipped.
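To make the skipped case concrete, here is a minimal Go sketch of the documented skip condition (accuracy_drop < threshold), with drops expressed as fractions so that 0.005 means 0.5pp. The helper names `ppToFraction` and `skipped` are hypothetical and do not come from the repository, and the lowered value 0.002 is only an illustration, not a recommended setting.

```go
package main

import "fmt"

// defaultThreshold mirrors the documented default for
// sdc_accuracy_drop_threshold (0.005, i.e. 0.5 percentage points).
const defaultThreshold = 0.005

// ppToFraction converts percentage points to the fractional unit the
// threshold uses. Hypothetical helper, for illustration only.
func ppToFraction(pp float64) float64 { return pp / 100 }

// skipped restates the documented skip condition: a drop strictly below
// the threshold is ignored by the detector.
func skipped(drop, threshold float64) bool { return drop < threshold }

func main() {
	drop := ppToFraction(0.3) // a 0.3pp regression

	fmt.Println(skipped(drop, defaultThreshold)) // true: below the 0.5pp default
	fmt.Println(skipped(drop, 0.002))            // false: evaluated once the knob is 0.2pp
}
```

A recipe whose run-to-run variance is well under 0.1pp could plausibly afford the lower setting; a noisier recipe would likely see false positives from it.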

Claims verified against detector code

Default sdc_accuracy_drop_threshold = 0.005 — module/processor/patterndetectorprocessor/config.go:178.
Default sdc_accuracy_only_multiplier = 2.0 — config.go:189.
Validation rejects values outside [0, 1] — config.go:429-430.
Skip condition is accuracy_drop < threshold — module/pkg/patterns/silent_data_corruption.go:286.
accuracy_only branch gate is drop >= threshold * multiplier — silent_data_corruption.go:295.
YAML lives under processors.patterndetector — module/processor/patterndetectorprocessor/example_config.yaml.
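The validation and accuracy_only claims above can be sketched in Go as follows. This is an illustration under stated assumptions, not the detector's code: the function, constant, and error names are hypothetical, the range check is assumed to apply to sdc_accuracy_drop_threshold with both endpoints accepted, and the accuracy_only branch is assumed to be the path taken when no corroborating signal accompanies the accuracy drop.

```go
package main

import (
	"errors"
	"fmt"
)

// Documented defaults; the constant names are hypothetical.
const (
	defaultDropThreshold      = 0.005 // sdc_accuracy_drop_threshold
	defaultAccuracyOnlyFactor = 2.0   // sdc_accuracy_only_multiplier
)

// errThresholdRange is a hypothetical error value for this sketch.
var errThresholdRange = errors.New("sdc_accuracy_drop_threshold must be within [0, 1]")

// validateThreshold rejects values outside the closed interval [0, 1].
// Treating both endpoints as valid is an assumption.
func validateThreshold(t float64) error {
	if t < 0 || t > 1 {
		return errThresholdRange
	}
	return nil
}

// passesAccuracyOnlyGate restates the documented gate for the
// accuracy_only branch: drop >= threshold * multiplier.
func passesAccuracyOnlyGate(drop, threshold, multiplier float64) bool {
	return drop >= threshold*multiplier
}

func main() {
	fmt.Println(validateThreshold(1.5) != nil) // true: out of range

	// With the defaults the gate sits at 0.005 * 2.0, i.e. 1.0pp.
	fmt.Println(passesAccuracyOnlyGate(0.008, defaultDropThreshold, defaultAccuracyOnlyFactor)) // false
	fmt.Println(passesAccuracyOnlyGate(0.012, defaultDropThreshold, defaultAccuracyOnlyFactor)) // true
}
```

Because the gate is the product of the two knobs, lowering the threshold also lowers the accuracy_only gate proportionally unless the multiplier is raised to compensate.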

No new field names invented; no thresholds cited that don't exist in code.

Test plan

Pre-commit hooks pass (golangci-lint 0 issues; go vet; go mod verify; attribute-namespace-check 100/100; hit-line-format-stable; no-autoupdate-check).
Spec links resolve — §"Threshold tuning" cross-referenced from the new §"Edge cases" bullet and from ATTRIBUTES.md.
YAML example uses only existing knobs (sdc_accuracy_drop_threshold, sdc_accuracy_only_multiplier) under the documented processors.patterndetector parent.

NONE

Closes #348

Signed-off-by: Tri Lam <tri@maydow.com>

docs(sdc): threshold-tuning guidance + heuristic framing (#348)

2ab8188

Signed-off-by: Tri Lam <tri@maydow.com>

trilamsr mentioned this pull request Jun 1, 2026

docs(pattern-13): threshold-lowering guidance for sub-0.5pp SDC (#348) #410

Closed

4 tasks

trilamsr enabled auto-merge (squash) June 1, 2026 22:54

trilamsr merged commit 497f877 into main Jun 1, 2026
12 checks passed

trilamsr deleted the docs/sdc-threshold-tuning-348-v2 branch June 1, 2026 22:59

trilamsr mentioned this pull request Jun 1, 2026

audit(wave-2026-06-01): post-wave cross-cut review #421

Closed

trilamsr merged 1 commit into main from docs/sdc-threshold-tuning-348-v2
