test(replay): land 7-detector replay corpora (#366 Path A) #402
Merged
Adds replay/<pattern>/{canonical,_negative,_real_world}/ fixture
corpora for the seven detectors that lacked one. Each canonical
fixture exercises the pattern's positive-discriminator path with
realistic inputs (PCI BDFs, mlx5 device names, real Xid codes); each
_negative fixture exercises one false-positive guard. Goldens were
generated from the detectors via UPDATE_REPLAY_GOLDEN=1 and pinned
so future drift in detector output (headline / remediation prose,
evidence-trail UID shape, scalar field rename) shows up as a JSONEq
diff against the on-disk fixture.
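
A minimal sketch of the golden round-trip described above, using only the standard library. The names `jsonEqual`, `checkGolden`, and `updateRequested` are hypothetical and are not the helpers in `helpers_test.go`; the `JSONEq` comparison is approximated here by decoding both documents and comparing them structurally, so key order and whitespace do not count as drift.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"reflect"
)

// updateRequested mirrors the UPDATE_REPLAY_GOLDEN=1 switch named in the
// commit message. A real test would pass its result to checkGolden.
func updateRequested() bool { return os.Getenv("UPDATE_REPLAY_GOLDEN") == "1" }

// jsonEqual reports whether two JSON documents are semantically equal,
// ignoring key order and whitespace.
func jsonEqual(a, b []byte) (bool, error) {
	var va, vb any
	if err := json.Unmarshal(a, &va); err != nil {
		return false, fmt.Errorf("decode first document: %w", err)
	}
	if err := json.Unmarshal(b, &vb); err != nil {
		return false, fmt.Errorf("decode second document: %w", err)
	}
	return reflect.DeepEqual(va, vb), nil
}

// checkGolden compares got against the golden file at path. When update is
// true it rewrites the golden (indented, for reviewable diffs) instead.
func checkGolden(path string, got []byte, update bool) error {
	if update {
		var buf bytes.Buffer
		if err := json.Indent(&buf, got, "", "  "); err != nil {
			return err
		}
		buf.WriteByte('\n')
		return os.WriteFile(path, buf.Bytes(), 0o644)
	}
	want, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	eq, err := jsonEqual(want, got)
	if err != nil {
		return err
	}
	if !eq {
		return fmt.Errorf("golden mismatch at %s", path)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "replay-golden")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	golden := filepath.Join(dir, "golden.json")

	// Pin an illustrative detector output as the golden.
	pinned := []byte(`{"headline":"HBM ECC errors on GPU 3","gpu":3}`)
	if err := checkGolden(golden, pinned, true); err != nil {
		panic(err)
	}

	// Same content with reordered keys: no drift reported.
	same := []byte(`{"gpu":3,"headline":"HBM ECC errors on GPU 3"}`)
	fmt.Println("reordered keys:", checkGolden(golden, same, false))

	// Changed headline prose: reported as a mismatch.
	drifted := []byte(`{"gpu":3,"headline":"ECC errors"}`)
	fmt.Println("drifted headline:", checkGolden(golden, drifted, false))
}
```

The payload fields (`headline`, `gpu`) are illustrative only; the actual golden schema is whatever each detector emits.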
Patterns covered: hbm_ecc (#3), thermal_throttle (#4), pcie_aer
(#5), ib_link_flap (#2), nccl_hang (#15), cuda_oom (#10),
xid_correlation (#16). The existing pod_evicted (#14) corpus is
unchanged.
Detector heterogeneity: each detector takes a different input shape
(HBMECCRecord+XidRecord, ThermalThrottleRecord, PCIeAERRecord+
PCIeIORecord, ...), so the legacy LoadFixturesUnder helper (typed
on Record+NodeRecord) cannot be reused. Each detector gets its own
*_replay_test.go that inlines the JSON read; shared discovery /
golden-assert helpers live in helpers_test.go.
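
The split between shared discovery and per-detector decoding could look roughly like the sketch below. Everything named here is an assumption for illustration: `thermalRecord` is a stand-in whose fields are invented (the real `ThermalThrottleRecord` is not shown in this PR), `inputs.json` is assumed as the input file name, and `discoverFixtures`, `readThermalInputs`, and `demoTree` are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// thermalRecord is a hypothetical stand-in for one detector's input type.
type thermalRecord struct {
	Node  string  `json:"node"`
	GPU   int     `json:"gpu"`
	TempC float64 `json:"temp_c"`
}

// discoverFixtures is the shared, type-agnostic part: it returns every
// directory under root that contains an inputs.json, in sorted order.
func discoverFixtures(root string) ([]string, error) {
	var dirs []string
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && d.Name() == "inputs.json" {
			dirs = append(dirs, filepath.Dir(path))
		}
		return nil
	})
	sort.Strings(dirs)
	return dirs, err
}

// readThermalInputs is the per-detector part: each replay test decodes
// inputs.json into its own record type.
func readThermalInputs(dir string) ([]thermalRecord, error) {
	raw, err := os.ReadFile(filepath.Join(dir, "inputs.json"))
	if err != nil {
		return nil, err
	}
	var recs []thermalRecord
	if err := json.Unmarshal(raw, &recs); err != nil {
		return nil, fmt.Errorf("%s: %w", dir, err)
	}
	return recs, nil
}

// demoTree builds a throwaway corpus with one canonical and one negative
// fixture so the sketch runs without the real repository.
func demoTree() (root string, cleanup func(), err error) {
	root, err = os.MkdirTemp("", "replay-corpus")
	if err != nil {
		return "", nil, err
	}
	cleanup = func() { os.RemoveAll(root) }
	body := []byte(`[{"node":"node-a","gpu":0,"temp_c":91.5}]`)
	for _, rel := range []string{"canonical", filepath.Join("_negative", "single_transition")} {
		dir := filepath.Join(root, rel)
		if err = os.MkdirAll(dir, 0o755); err != nil {
			cleanup()
			return "", nil, err
		}
		if err = os.WriteFile(filepath.Join(dir, "inputs.json"), body, 0o644); err != nil {
			cleanup()
			return "", nil, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := demoTree()
	if err != nil {
		panic(err)
	}
	defer cleanup()

	dirs, err := discoverFixtures(root)
	if err != nil {
		panic(err)
	}
	for _, dir := range dirs {
		recs, err := readThermalInputs(dir)
		if err != nil {
			panic(err)
		}
		rel, _ := filepath.Rel(root, dir)
		fmt.Printf("%s: %d record(s)\n", rel, len(recs))
	}
}
```

Keeping discovery untyped and decoding typed avoids forcing seven unrelated input shapes through one loader signature.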
Two detectors (nccl_hang, ib_link_flap) take a Now reference. Tests
pin Now to a fixed timestamp matching the fixture's started_ns so
hang age / flap-window inclusion stay deterministic across replay
runs and aren't drift-prone vs the wall clock.
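
To illustrate why the pinned `Now` matters, here is a small sketch of a window check. The function `flapsInWindow`, the variables `fixtureEvents` and `fixtureNow`, and the 5-minute `flapWindow` are all hypothetical; the real detectors' signatures and thresholds are not shown in this PR, and the one-minute offset from the last event is an assumption made so the window is non-empty.

```go
package main

import (
	"fmt"
	"time"
)

// flapWindow is an assumed look-back window, for illustration only.
const flapWindow = 5 * time.Minute

// fixtureEvents are illustrative event timestamps in Unix nanoseconds,
// one minute apart.
var fixtureEvents = []int64{
	1_700_000_000_000_000_000,
	1_700_000_060_000_000_000,
}

// fixtureNow is derived from the fixture's own timestamps rather than the
// wall clock, so the result is identical on every run.
var fixtureNow = time.Unix(0, fixtureEvents[1]).Add(time.Minute)

// flapsInWindow counts events that fall inside [now-window, now].
func flapsInWindow(eventsNS []int64, now time.Time, window time.Duration) int {
	n := 0
	for _, ns := range eventsNS {
		age := now.Sub(time.Unix(0, ns))
		if age >= 0 && age <= window {
			n++
		}
	}
	return n
}

func main() {
	// Pinned reference time: both events are inside the window.
	fmt.Println("pinned Now:", flapsInWindow(fixtureEvents, fixtureNow, flapWindow))

	// Wall clock: the fixture has aged out of the window, so the count
	// (and any verdict built on it) changes without any code change.
	fmt.Println("wall clock:", flapsInWindow(fixtureEvents, time.Now(), flapWindow))
}
```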
Wires every row of the chaos.yml pattern-detectors matrix to
run_corpus: true now that every detector has a corpus.
Closes #366.
```release-notes
NONE
```
Signed-off-by: Tri Lam <tri@maydow.com>
This was referenced Jun 1, 2026
trilamsr added a commit that referenced this pull request on Jun 2, 2026
## Summary

Closes #428. Path B replay corpora for 4 missing detectors + manifest schema-lock test + chaos.yml matrix coverage.

### What landed

- `module/pkg/replay/<pattern>/canonical/{manifest,inputs,golden}.json` + 2 negatives per detector for: `dataloader_hang`, `nccl_bootstrap`, `checkpointer_hang`, `silent_data_corruption`
- `manifest_schema_test.go`: corpus-wide regression test
- `.github/workflows/chaos.yml`: matrix extended 8 → 12 patterns

### Scope correction

Issue #428 said "5 missing" but `pod_evicted` was already shipped in #402 with `canonical/` + 3 negatives. Actual gap was 4. Verified by `ls module/pkg/replay/`.

### Test plan

- [x] `go test ./pkg/replay/...`: full replay suite green (12 patterns × canonical + negatives + schema lock)
- [x] `go test ./pkg/patterns/...`: detector unit tests still green
- [x] `go vet ./pkg/replay/...`: clean
- [x] Mutation test on `manifest_schema_test.go`: confirmed schema check fails when a required manifest field is removed
- [x] Verified chaos.yml `run_regex` prefixes match real `pkg/patterns` tests
- [x] Confirmed pre-existing tests untouched

### BLOCK fix follow-up (post-review)

chaos.yml `silent_data_corruption` run_regex was `^TestSDCDetector`, which missed the new `TestSilentDataCorruptionReplay`. Fixed to `^(TestSDCDetector|TestSilentDataCorruptionReplay)`. The other 3 detectors share a prefix between unit and replay tests.

```release-notes
test(replay): Path B replay corpora for 4 detectors (dataloader_hang, nccl_bootstrap, checkpointer_hang, silent_data_corruption) with manifest schema-lock test and chaos.yml matrix coverage extension (8 → 12 patterns).
```

Refs #421 #402 #366.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
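
The `run_regex` fix in the follow-up commit above can be checked with a few lines of Go. This is a sketch under assumptions: `matches` is a hypothetical helper, and `TestSDCDetector_Fires` is an invented unit-test name used only to show a name that carries the old prefix.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches returns the test names selected by pattern, using the same
// unanchored regexp matching that a run filter applies to test names.
func matches(pattern string, names []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	names := []string{
		"TestSDCDetector_Fires",          // hypothetical unit test name
		"TestSilentDataCorruptionReplay", // replay test named in the commit
	}

	// Original pattern: selects only the unit test.
	fmt.Println(matches(`^TestSDCDetector`, names))

	// Fixed pattern: selects both the unit test and the replay test.
	fmt.Println(matches(`^(TestSDCDetector|TestSilentDataCorruptionReplay)`, names))
}
```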
Summary
Lands replay corpora for the seven pattern detectors that lacked one,
closing the Path-A test gap named in the v1-rc1 audit (#366). Each
pattern now ships
`module/pkg/replay/<pattern>/{canonical,_negative,_real_world}/` fixtures plus a `*_replay_test.go` runner that JSON-eqs detector output against the on-disk golden.
Detectors covered:
`hbm_ecc` (#3), `thermal_throttle` (#4), `pcie_aer` (#5), `ib_link_flap` (#2), `nccl_hang` (#15), `cuda_oom` (#10),
`xid_correlation` (#16). The `pod_evicted` (#14) corpus is unchanged.
Design
- Detector heterogeneity: each detector takes a different input shape (`HBMECCRecord + XidRecord` for `hbm_ecc`, `ThermalThrottleRecord` for `thermal_throttle`, ...), so the existing `LoadFixturesUnder` helper (typed on `Record + NodeRecord`) cannot be reused. Each detector gets its own `*_replay_test.go` that inlines the per-detector JSON read; shared fixture-discovery and golden-assert helpers live in `helpers_test.go`.
- Two detectors (`nccl_hang`, `ib_link_flap`) take a `Now` reference. Tests pin `Now` to a fixed timestamp matching the fixture's `started_ns` so hang-age and flap-window inclusion stay deterministic across replay runs (otherwise wall-clock drift would silently flip the verdicts as the fixture aged).
- Goldens were generated from the detectors via `UPDATE_REPLAY_GOLDEN=1 go test ./module/pkg/replay/...` and pinned; future drift in detector output (headline / remediation prose, evidence-trail UID shape, scalar-field rename) surfaces as a `JSONEq` diff against the fixture. Operators can also read the golden to compare what they expect against what actually fires.
- Each `_negative` fixture exercises one false-positive guard (… code, single GPU, no AER, no eviction, completed state, single transition, no OOM log) so a regression in one false-positive guard lights up the corresponding row only.
- Wires `run_corpus: true` on every row of the chaos.yml `pattern-detectors` matrix now that every detector has a corpus.
Test plan
- `make check`: clean
- `go test -race -count=1 ./pkg/replay/...`: 35 tests pass (28 new + 7 pre-existing pod_evicted)
- `go test -count=1 ./processor/patterndetectorprocessor/...`: unchanged, still green
- `make verify` (pre-push): clean
- `pattern-detectors` matrix: 8 rows, each runs the hermetic regex + replay-corpus step
Closes #366.