test(replay): land 7-detector replay corpora (#366 Path A)#402

Merged
trilamsr merged 1 commit into main from worktree-agent-a14b7ee36ea1155cb
Jun 1, 2026

@trilamsr trilamsr commented Jun 1, 2026

Contributor

Summary

Lands replay corpora for the seven pattern detectors that lacked one,
closing the Path-A test gap named in the v1-rc1 audit (#366). Each
pattern now ships module/pkg/replay/<pattern>/{canonical,_negative,_real_world}/
fixtures plus a *_replay_test.go runner that checks detector output
against the on-disk golden with a JSONEq comparison.

Detectors covered: hbm_ecc (#3), thermal_throttle (#4),
pcie_aer (#5), ib_link_flap (#2), nccl_hang (#15), cuda_oom
(#10), xid_correlation (#16). pod_evicted (#14) corpus is
unchanged.

Design

  • Each detector takes a different input shape (e.g. HBMECCRecord + XidRecord for hbm_ecc, ThermalThrottleRecord for
    thermal_throttle, ...), so the existing LoadFixturesUnder helper
    (typed on Record + NodeRecord) cannot be reused. Each detector
    gets its own *_replay_test.go that inlines the per-detector JSON
    read; shared fixture-discovery and golden-assert helpers live in
    helpers_test.go.
  • Two detectors (nccl_hang, ib_link_flap) take a Now reference.
    Tests pin Now to a fixed timestamp matching the fixture's
    started_ns so hang-age and flap-window inclusion stay
    deterministic across replay runs (otherwise wall-clock drift would
    silently flip the verdicts as the fixture aged).
  • Goldens were generated from the live detectors via
    UPDATE_REPLAY_GOLDEN=1 go test ./module/pkg/replay/... and pinned;
    future drift in detector output (headline / remediation prose,
    evidence-trail UID shape, scalar-field rename) surfaces as a
    JSONEq diff against the fixture. Operators can also read the
    golden to compare what they expect against what actually fires.
  • Negative fixtures each exercise a distinct discriminator (wrong Xid
    code, single GPU, no AER, no eviction, completed state, single
    transition, no OOM log) so a regression in one false-positive guard
    lights up the corresponding row only.
  • Flipped run_corpus: true on every row of the chaos.yml
    pattern-detectors matrix now that every detector has a corpus.
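
To illustrate the shared fixture-discovery idea from the design notes, here is a minimal sketch, not the code in helpers_test.go. It assumes every replay case is a directory containing a golden.json; the function names, the in-memory tree, and the case name wrong_xid are all hypothetical.

```go
package main

import (
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// discoverCases walks root and returns every directory that holds a
// golden.json, in the lexical order fs.WalkDir visits them.
// Assumption: one case per directory, marked by a golden.json file.
func discoverCases(fsys fs.FS, root string) ([]string, error) {
	var cases []string
	err := fs.WalkDir(fsys, root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.IsDir() && d.Name() == "golden.json" {
			cases = append(cases, path.Dir(p))
		}
		return nil
	})
	return cases, err
}

// sampleCorpus builds an in-memory tree; every file below is hypothetical.
func sampleCorpus() fs.FS {
	return fstest.MapFS{
		"hbm_ecc/canonical/inputs.json":           {Data: []byte(`{}`)},
		"hbm_ecc/canonical/golden.json":           {Data: []byte(`{}`)},
		"hbm_ecc/_negative/wrong_xid/inputs.json": {Data: []byte(`{}`)},
		"hbm_ecc/_negative/wrong_xid/golden.json": {Data: []byte(`{}`)},
		"hbm_ecc/_real_world/README.md":           {Data: []byte("no golden yet")},
	}
}

func main() {
	cases, err := discoverCases(sampleCorpus(), "hbm_ecc")
	if err != nil {
		panic(err)
	}
	for _, c := range cases {
		fmt.Println(c) // each entry would become one subtest
	}
}
```

Taking an fs.FS rather than a path keeps the sketch hermetic; a real runner could pass os.DirFS pointing at the corpus root instead.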

Test plan

  • make check — clean
  • go test -race -count=1 ./pkg/replay/... — 35 tests pass
    (28 new + 7 pre-existing pod_evicted)
  • go test -count=1 ./processor/patterndetectorprocessor/...
    unchanged, still green
  • make verify (pre-push) — clean
  • CI: chaos.yml pattern-detectors matrix — 8 rows, each runs
    hermetic regex + replay-corpus step

Closes #366.

NONE

Adds replay/<pattern>/{canonical,_negative,_real_world}/ fixture
corpora for the seven detectors that lacked one. Each canonical
fixture exercises the pattern's positive-discriminator path with
realistic inputs (PCI BDFs, mlx5 device names, real Xid codes); each
_negative fixture exercises one false-positive guard. Goldens were
generated from the detectors via UPDATE_REPLAY_GOLDEN=1 and pinned
so future drift in detector output (headline / remediation prose,
evidence-trail UID shape, scalar field rename) shows up as a JSONEq
diff against the on-disk fixture.
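
The golden workflow described above can be sketched roughly as follows. This is an illustration under assumptions, not the helper that ships in helpers_test.go: jsonEqual, checkGolden, updateRequested, and the sample JSON fields are hypothetical, and the only name taken from this PR is the UPDATE_REPLAY_GOLDEN environment variable.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"reflect"
)

// jsonEqual reports whether two JSON documents are semantically equal,
// ignoring key order and whitespace.
func jsonEqual(a, b []byte) (bool, error) {
	var va, vb any
	if err := json.Unmarshal(a, &va); err != nil {
		return false, fmt.Errorf("left side: %w", err)
	}
	if err := json.Unmarshal(b, &vb); err != nil {
		return false, fmt.Errorf("right side: %w", err)
	}
	return reflect.DeepEqual(va, vb), nil
}

// updateRequested mirrors the UPDATE_REPLAY_GOLDEN=1 switch named in the PR.
func updateRequested() bool { return os.Getenv("UPDATE_REPLAY_GOLDEN") == "1" }

// checkGolden rewrites the golden when update is true; otherwise it compares
// got against the pinned file and returns an error on drift.
func checkGolden(path string, got []byte, update bool) error {
	if update {
		return os.WriteFile(path, got, 0o644)
	}
	want, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	ok, err := jsonEqual(want, got)
	if err != nil {
		return err
	}
	if !ok {
		return fmt.Errorf("golden drift at %s\nwant: %s\ngot:  %s", path, want, got)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "replay-golden")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	golden := filepath.Join(dir, "golden.json")

	// Pin the golden once (forced here; a real test would pass updateRequested()).
	pinned := []byte(`{"pattern":"hbm_ecc","headline":"HBM ECC errors"}`)
	if err := checkGolden(golden, pinned, true); err != nil {
		panic(err)
	}
	// Reordered keys and extra whitespace are not drift.
	fmt.Println(checkGolden(golden, []byte(`{ "headline": "HBM ECC errors", "pattern": "hbm_ecc" }`), false) == nil)
	// A reworded headline is drift and fails the comparison.
	fmt.Println(checkGolden(golden, []byte(`{"pattern":"hbm_ecc","headline":"changed"}`), false) != nil)
}
```

Comparing decoded values rather than raw bytes is what keeps formatting-only changes from failing the replay while prose or field renames still do.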

Patterns covered: hbm_ecc (#3), thermal_throttle (#4), pcie_aer
(#5), ib_link_flap (#2), nccl_hang (#15), cuda_oom (#10),
xid_correlation (#16). The existing pod_evicted (#14) corpus is
unchanged.

Detector heterogeneity: each detector takes a different input shape
(HBMECCRecord+XidRecord, ThermalThrottleRecord, PCIeAERRecord+
PCIeIORecord, ...), so the legacy LoadFixturesUnder helper (typed
on Record+NodeRecord) cannot be reused. Each detector gets its own
*_replay_test.go that inlines the JSON read; shared discovery /
golden-assert helpers live in helpers_test.go.

Two detectors (nccl_hang, ib_link_flap) take a Now reference. Tests
pin Now to a fixed timestamp matching the fixture's started_ns so
hang age / flap-window inclusion stay deterministic across replay
runs instead of drifting with the wall clock.
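
A small sketch of why pinning Now matters. The types and threshold here are hypothetical simplifications (the real nccl_hang and ib_link_flap inputs are not shown in this PR); only the started_ns field name comes from the text above.

```go
package main

import (
	"fmt"
	"time"
)

// hangThreshold is a hypothetical cutoff; the real detector's value may differ.
const hangThreshold = 5 * time.Minute

// hangInput is a simplified, hypothetical stand-in for a detector input.
type hangInput struct {
	StartedNs int64 // mirrors the fixture's started_ns field
}

// pinNow derives a fixed reference time from the fixture itself, so the
// computed age never depends on when the test runs.
func pinNow(in hangInput, offset time.Duration) time.Time {
	return time.Unix(0, in.StartedNs).Add(offset)
}

// isHung reports whether the operation has been running for at least
// threshold as of now.
func isHung(in hangInput, now time.Time, threshold time.Duration) bool {
	return now.Sub(time.Unix(0, in.StartedNs)) >= threshold
}

func main() {
	in := hangInput{StartedNs: 1_700_000_000_000_000_000}

	// Pinned: one minute after start, so a negative fixture stays negative.
	fmt.Println("pinned:", isHung(in, pinNow(in, time.Minute), hangThreshold))

	// Wall clock: the verdict depends on how old the fixture is at run time.
	fmt.Println("wall clock:", isHung(in, time.Now(), hangThreshold))
}
```

With the wall clock, a fixture meant to be a negative case would eventually age past the threshold and start firing, which is the silent verdict flip the pinned timestamp avoids.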

Wires every row of the chaos.yml pattern-detectors matrix to
run_corpus: true now that every detector has a corpus.

Closes #366.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 22:29
@trilamsr trilamsr merged commit 8f668e1 into main Jun 1, 2026
23 checks passed
@trilamsr trilamsr deleted the worktree-agent-a14b7ee36ea1155cb branch June 1, 2026 22:30
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

Closes #428. Path B replay corpora for 4 missing detectors + manifest
schema-lock test + chaos.yml matrix coverage.

### What landed
- `module/pkg/replay/<pattern>/canonical/{manifest,inputs,golden}.json`
  plus 2 negatives per detector for: `dataloader_hang`, `nccl_bootstrap`,
  `checkpointer_hang`, `silent_data_corruption`
- `manifest_schema_test.go` — corpus-wide regression test
- `.github/workflows/chaos.yml` — matrix extended 8 → 12 patterns
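
As a rough illustration of what a manifest schema lock can look like (not the contents of `manifest_schema_test.go`), the sketch below rejects unknown fields and missing required fields. The manifest fields `pattern`, `description`, and `expect_fire` are hypothetical; the real schema is not shown in this PR.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// manifest is a hypothetical schema. Pointer fields let the check tell
// "absent" apart from a zero value.
type manifest struct {
	Pattern     *string `json:"pattern"`
	Description *string `json:"description"`
	ExpectFire  *bool   `json:"expect_fire"`
}

// validateManifest fails on unknown fields and on any missing required field.
func validateManifest(data []byte) error {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	var m manifest
	if err := dec.Decode(&m); err != nil {
		return fmt.Errorf("decode manifest: %w", err)
	}
	switch {
	case m.Pattern == nil || *m.Pattern == "":
		return fmt.Errorf("manifest: missing required field %q", "pattern")
	case m.Description == nil || *m.Description == "":
		return fmt.Errorf("manifest: missing required field %q", "description")
	case m.ExpectFire == nil:
		return fmt.Errorf("manifest: missing required field %q", "expect_fire")
	}
	return nil
}

func main() {
	good := []byte(`{"pattern":"dataloader_hang","description":"canonical case","expect_fire":true}`)
	mutated := []byte(`{"pattern":"dataloader_hang","description":"canonical case"}`)

	fmt.Println("good:", validateManifest(good))
	// Removing a required field is the kind of mutation the test should catch.
	fmt.Println("mutated:", validateManifest(mutated))
}
```

Running a check like this over every manifest in the corpus is one way to get a corpus-wide regression test: a field rename or removal in any single fixture fails the suite.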

### Scope correction
Issue #428 said "5 missing" but `pod_evicted` already had a corpus by
#402 (`canonical/` + 3 negatives), which left it unchanged. Actual gap
was 4. Verified by `ls module/pkg/replay/`.

### Test plan
- [x] `go test ./pkg/replay/...` — full replay suite green (12 patterns
× canonical + negatives + schema lock)
- [x] `go test ./pkg/patterns/...` — detector unit tests still green
- [x] `go vet ./pkg/replay/...` — clean
- [x] Mutation test on `manifest_schema_test.go` — confirmed schema
check fails when a required manifest field is removed
- [x] Verified chaos.yml `run_regex` prefixes match real `pkg/patterns`
tests
- [x] Confirmed pre-existing tests untouched

### BLOCK fix follow-up (post-review)
chaos.yml `silent_data_corruption` run_regex was `^TestSDCDetector`
which missed the new `TestSilentDataCorruptionReplay`. Fixed to
`^(TestSDCDetector|TestSilentDataCorruptionReplay)`. Other 3 detectors
share prefix between unit + replay tests.
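
The miss is easy to reproduce with Go's `regexp` package, assuming the `run_regex` value is handed to `go test -run` (which treats it as an unanchored regular expression over test names). The helper name `selected` is hypothetical; the two patterns and both test-name prefixes are the ones quoted above.

```go
package main

import (
	"fmt"
	"regexp"
)

// selected reports whether testName matches a -run style pattern.
// Assumption: top-level test names are matched the way regexp.MatchString does.
func selected(pattern, testName string) bool {
	return regexp.MustCompile(pattern).MatchString(testName)
}

func main() {
	const (
		oldPattern = `^TestSDCDetector`
		newPattern = `^(TestSDCDetector|TestSilentDataCorruptionReplay)`
	)
	for _, name := range []string{"TestSDCDetector", "TestSilentDataCorruptionReplay"} {
		fmt.Printf("%-32s old=%v new=%v\n", name, selected(oldPattern, name), selected(newPattern, name))
	}
}
```

The old pattern anchors on `TestSDC`, while the replay test name begins `TestSil`, so it could never be selected; the alternation covers both prefixes.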

```release-notes
test(replay): Path B replay corpora for 4 detectors (dataloader_hang, nccl_bootstrap, checkpointer_hang, silent_data_corruption) with manifest schema-lock test and chaos.yml matrix coverage extension (8 → 12 patterns).
```

Refs #421 #402 #366.

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>

Development

Successfully merging this pull request may close these issues.

[test-gap] replay corpora for 5 detectors (Path A from chaos.yml)
