
docs(sdc): threshold-tuning guidance + heuristic framing (#348) #408

Merged
trilamsr merged 1 commit into main from docs/sdc-threshold-tuning-348-v2
Jun 1, 2026
Conversation

@trilamsr trilamsr commented Jun 1, 2026

Contributor

Summary

Addresses #348 — pattern-13 (silent data corruption) shipped with sdc_accuracy_drop_threshold defaulting to 0.005 (0.5pp) but the spec only carried "raise" guidance and offered no citation for the 0.5pp basis. Real-world SDC can land sub-threshold (single-bit weight flips in small models; transient single-batch corruption in large training).

Four prose-only changes:

  • New §"Threshold tuning" in docs/patterns/13-silent-data-corruption.md covering both directions (raise + lower) with concrete triggers, the relationship to sdc_accuracy_only_multiplier, a copy-pasteable YAML example, and a "Why the default stays at 0.005" note.
  • New "Sub-threshold SDC" bullet in §"Edge cases" pointing readers to the tuning section.
  • §"Empirical basis" paragraph honestly framing the 0.5pp band as a heuristic drawn from public SDC incident reports (Meta 2022; Google HotOS 2021) and the NVIDIA SDC catcher fixture — not a published study.
  • docs/ATTRIBUTES.md row for tracecore.alert.silent_data_corruption.accuracy_drop now notes the threshold is operator-tunable and points to the new §.

Root cause

The spec asserted "typical SDC regressions land between 0.5pp and 3pp" with no citation, and the §"Edge cases" guard only documented when to raise the threshold above the recipe's variance band. Operators with high-precision recipes (variance bands <0.1pp) or paired vendor SDC counter coverage had no in-spec rationale to lower the knob — so legitimate sub-0.5pp regressions were silently skipped.
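The failure mode above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming accuracy drops are expressed as fractions in [0, 1] (so 0.5pp corresponds to 0.005). The helper names `ppToFraction` and `isSkipped` and the lowered value 0.002 are illustrative assumptions, not identifiers or recommendations from the detector.

```go
package main

import "fmt"

// ppToFraction converts percentage points to the fractional scale the
// threshold is assumed to use (0.5pp -> 0.005). Hypothetical helper.
func ppToFraction(pp float64) float64 {
	return pp / 100.0
}

// isSkipped mirrors the skip condition quoted in this PR
// (accuracy_drop < threshold). The function name is hypothetical.
func isSkipped(accuracyDrop, threshold float64) bool {
	return accuracyDrop < threshold
}

func main() {
	drop := ppToFraction(0.3) // a 0.3pp regression on a high-precision recipe

	// With the default threshold the regression is silently skipped.
	fmt.Println("default 0.005 skips:", isSkipped(drop, 0.005))

	// With an illustrative lowered threshold it is no longer skipped.
	fmt.Println("lowered 0.002 skips:", isSkipped(drop, 0.002))
}
```

Under these assumptions a 0.3pp regression is skipped at the default and evaluated at the lowered value, which is the case the new tuning section is meant to cover.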

Claims verified against detector code

  • Default sdc_accuracy_drop_threshold = 0.005: module/processor/patterndetectorprocessor/config.go:178.
  • Default sdc_accuracy_only_multiplier = 2.0: config.go:189.
  • Validation rejects values outside [0, 1]: config.go:429-430.
  • Skip condition is accuracy_drop < threshold: module/pkg/patterns/silent_data_corruption.go:286.
  • accuracy_only branch gate is drop >= threshold * multiplier: silent_data_corruption.go:295.
  • YAML lives under processors.patterndetector: module/processor/patterndetectorprocessor/example_config.yaml.

No new field names invented; no thresholds cited that don't exist in code.
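The verified claims above can be read together as a small decision procedure. The sketch below is not the detector's code: the struct, method names, and result strings are hypothetical, and three points are assumptions flagged in the comments. They are that the [0, 1] validation applies to the threshold with inclusive bounds, that the accuracy_only branch means "no corroborating signal such as a vendor SDC counter", and that a drop between the threshold and threshold * multiplier with no corroboration produces no alert.

```go
package main

import (
	"errors"
	"fmt"
)

// sdcConfig holds the two knobs named in this PR. The struct and field
// names are hypothetical; only the YAML keys and defaults come from the PR.
type sdcConfig struct {
	AccuracyDropThreshold  float64 // sdc_accuracy_drop_threshold
	AccuracyOnlyMultiplier float64 // sdc_accuracy_only_multiplier
}

// defaultSDCConfig returns the defaults cited in the PR (0.005 and 2.0).
func defaultSDCConfig() sdcConfig {
	return sdcConfig{AccuracyDropThreshold: 0.005, AccuracyOnlyMultiplier: 2.0}
}

// validate rejects thresholds outside [0, 1].
// Assumption: the check targets the threshold and the bounds are inclusive.
func (c sdcConfig) validate() error {
	if c.AccuracyDropThreshold < 0 || c.AccuracyDropThreshold > 1 {
		return errors.New("sdc_accuracy_drop_threshold must be within [0, 1]")
	}
	return nil
}

// decide applies the skip condition and the accuracy_only gate.
// Assumption: corroborated means another signal supports the finding.
func (c sdcConfig) decide(drop float64, corroborated bool) string {
	if drop < c.AccuracyDropThreshold {
		return "skip" // accuracy_drop < threshold
	}
	if corroborated {
		return "alert"
	}
	if drop >= c.AccuracyDropThreshold*c.AccuracyOnlyMultiplier {
		return "alert_accuracy_only" // drop >= threshold * multiplier
	}
	return "no_alert" // assumed outcome for the uncorroborated middle band
}

func main() {
	cfg := defaultSDCConfig()
	if err := cfg.validate(); err != nil {
		panic(err)
	}
	fmt.Println(cfg.decide(0.003, true))  // below threshold
	fmt.Println(cfg.decide(0.007, true))  // above threshold, corroborated
	fmt.Println(cfg.decide(0.007, false)) // above threshold, below 2x gate
	fmt.Println(cfg.decide(0.012, false)) // above the 2x gate
}
```

With the defaults, the accuracy_only gate sits at 0.005 * 2.0 = 0.01 (1pp), so lowering the threshold also lowers that gate unless the multiplier is raised to compensate.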

Test plan

  • Pre-commit hooks pass (golangci-lint 0 issues; go vet; go mod verify; attribute-namespace-check 100/100; hit-line-format-stable; no-autoupdate-check).
  • Spec links resolve — §"Threshold tuning" cross-referenced from the new §"Edge cases" bullet and from ATTRIBUTES.md.
  • YAML example uses only existing knobs (sdc_accuracy_drop_threshold, sdc_accuracy_only_multiplier) under the documented processors.patterndetector parent.
NONE

Closes #348

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 22:54
@trilamsr trilamsr merged commit 497f877 into main Jun 1, 2026
12 checks passed
@trilamsr trilamsr deleted the docs/sdc-threshold-tuning-348-v2 branch June 1, 2026 22:59
Development

Successfully merging this pull request may close these issues.

[rc1+] sdc detector: document threshold-lowering for sub-0.5pp SDC
