Context
Audit in docs/v1-rc1-test-audit.md §4.
.github/workflows/chaos.yml has three rows. Only one (pattern-pod-evicted) covers a detector pattern; the other two (harness-determinism, cpu-steal-mpstat) are pivot-neutral harness gates. Today six detector patterns ship on main (#14 pod-evicted, #8 nccl-hang, #3 hbm-ecc, #4 thermal-throttle, #5 pcie-aer, plus xid-correlation), but only one has a chaos row.
Cut-criterion §1 targets ≥ 12 of 15 patterns at rc1 — every shipping detector needs a chaos-matrix row, otherwise the chaos gate's coverage doesn't track the moat's coverage.
Fix — Path A from the audit
Convert pattern-pod-evicted into a matrix-strategy job:
strategy:
  fail-fast: false
  matrix:
    pattern:
      - pod_evicted
      - nccl_hang
      - xid_correlation
      - hbm_ecc
      - thermal_throttle
      - pcie_aer
Each row runs go test -race -run TestReplay_<Pattern> ./module/pkg/replay/... against a committed fixture under module/pkg/replay/<pattern>/testdata/. Reuses the existing race-detector + golden-SHA shape — no new failure-inject mode required.
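For orientation, the converted job might look roughly like the sketch below. It is a sketch under stated assumptions, not the actual workflow. The job name `pattern-replay`, the runner, the action versions, the Go setup, and the per-row `test` key are all hypothetical. The concrete test names are also assumed, since the issue only gives the form TestReplay_<Pattern>. The sketch uses `matrix.include` instead of a plain `pattern` list so each snake_case pattern is paired explicitly with its Go test name.

```yaml
# Sketch only. Job name, runner, action versions, Go setup, and the `test`
# key are assumptions; keep whatever the existing pattern-pod-evicted job uses.
jobs:
  pattern-replay:                 # hypothetical name for the converted job
    runs-on: ubuntu-latest
    timeout-minutes: 5            # per row; rows run in parallel
    strategy:
      fail-fast: false            # one red pattern must not cancel the others
      matrix:
        include:
          # Test names are assumed from the TestReplay_<Pattern> convention.
          - pattern: pod_evicted
            test: TestReplay_PodEvicted
          - pattern: nccl_hang
            test: TestReplay_NcclHang
          - pattern: xid_correlation
            test: TestReplay_XidCorrelation
          - pattern: hbm_ecc
            test: TestReplay_HbmEcc
          - pattern: thermal_throttle
            test: TestReplay_ThermalThrottle
          - pattern: pcie_aer
            test: TestReplay_PcieAer
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-go@v5
        with:
          go-version: stable      # assumption; pin to the project's Go version
      - name: Replay ${{ matrix.pattern }}
        run: go test -race -run '${{ matrix.test }}' ./module/pkg/replay/...
```

The explicit pairing avoids converting `pod_evicted` to a PascalCase test name in shell. If the plain `pattern` list is preferred, a small shell step could derive the test name instead.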
Path B (per-pattern failure-inject subcommands) is deferred to v1.x — too much code for the rc1 window. (Audit §4.)
Acceptance
- All six matrix rows green on main.
- Each pattern has at least one committed fixture under module/pkg/replay/<pattern>/testdata/.
- Job timeout-minutes stays ≤ 10 (currently 5; the six matrix rows run in parallel, so total wall-clock time stays ≤ 5 minutes instead of 5 × 6 = 30).
Effort
M (1–2 days). The bulk is curating six replay-corpus fixtures; the workflow change is ~10 lines.