ci(chaos): extend matrix to all v1-rc1 detectors #341
Merged
added 2 commits
June 1, 2026 01:31
Replaces the single pattern-pod-evicted job with a matrix-strategy pattern-detectors job covering every detector named in v1-rc1 cut-criterion §1: hbm_ecc, thermal_throttle, pcie_aer, pod_evicted, nccl_hang, xid_correlation.

Each row runs the detector's positive + negative + schema *_test.go prefix under -race. Positive asserts that a verdict is emitted with the correct pattern.id; negative asserts zero verdicts on baseline / confounder input, which is the false-positive gate the chaos matrix exists to enforce.

The pod_evicted row additionally runs the on-disk replay corpus under module/pkg/replay/pod_evicted/; sibling rows skip that step until their fixtures land (Path A in docs/v1-rc1-test-audit.md §4).

Per-row timeout stays at 5 min; observed local runtime per row is ~1.3 s under -race, and matrix parallelism keeps total wall-time well inside the chaos-gate budget.

Closes #331.

Signed-off-by: Tri Lam <tri@maydow.com>
Signed-off-by: Tri Lam <tri@maydow.com>
Addressed reviewer findings: added ib_link_flap + cuda_oom rows (closes #338 coverage gap), tightened pod_evicted regex to actual test names (^Test(PodEvictedDetector|PodEvictedVerdict)), scoped replay glob to ./pkg/replay/pod_evicted/... so future replay dirs don't trip the gate.
Signed-off-by: Tri Lam <tri@maydow.com>
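For readers unfamiliar with why the anchored regex matters: `go test -run` applies an unanchored regular-expression match to top-level test names, so a loose pattern can pull in tests the row was not meant to run. The sketch below illustrates this with Go's `regexp` package. The test names in it are hypothetical and the `selectTests` helper is an assumption for illustration; only the tightened regex comes from the comment above.

```go
package main

import (
	"fmt"
	"regexp"
)

// selectTests returns the names a `go test -run`-style pattern would select.
// Like `go test`, it uses an unanchored regexp match, so the pattern must
// carry its own ^ anchor to pin the prefix.
func selectTests(pattern string, names []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Hypothetical test names, for illustration only.
	names := []string{
		"TestPodEvictedDetector_Positive",
		"TestPodEvictedVerdict_Schema",
		"TestPodEvictedReplayCorpus",
		"TestHBMECCDetector_Positive",
	}

	// A loose pattern matches anything containing "PodEvicted".
	loose := selectTests("PodEvicted", names)
	// The tightened pattern from this PR pins the two intended prefixes.
	tight := selectTests("^Test(PodEvictedDetector|PodEvictedVerdict)", names)

	fmt.Println(len(loose), loose)
	fmt.Println(len(tight), tight)
}
```

With these made-up names the loose pattern also selects the replay-corpus test, while the tightened one selects only the detector and verdict tests.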
CI fix: my prior glob tightening to |
Summary
Closes #331. Replaces the single `pattern-pod-evicted` chaos job with a `matrix`-strategy `pattern-detectors` job covering every detector named in v1-rc1 cut-criterion §1: `hbm_ecc`, `thermal_throttle`, `pcie_aer`, `pod_evicted`, `nccl_hang`, `xid_correlation`.

Each row runs the detector's positive + negative + schema `*_test.go` prefix under `-race`:

- Positive asserts that a verdict is emitted with the correct `pattern.id` within bounded time.
- Negative asserts zero verdicts on baseline / confounder input (the false-positive gate the chaos matrix exists to enforce).
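The positive/negative contract can be sketched roughly as follows. This is a toy illustration, not the repository's detector code: `Verdict`, `Event`, `detectPodEvicted`, `checkPositive`, and `checkNegative` are all hypothetical names, and the bounded-time aspect of the positive check is not modeled.

```go
package main

import "fmt"

// Verdict is a hypothetical stand-in for a detector's output record.
type Verdict struct {
	PatternID string
}

// Event is a hypothetical input sample; Reason mimics a pod status reason.
type Event struct {
	Reason string
}

// detectPodEvicted is a toy detector: it emits one verdict per "Evicted" event.
func detectPodEvicted(events []Event) []Verdict {
	var out []Verdict
	for _, e := range events {
		if e.Reason == "Evicted" {
			out = append(out, Verdict{PatternID: "pod_evicted"})
		}
	}
	return out
}

// checkPositive fails unless at least one verdict was emitted and every
// verdict carries the expected pattern ID.
func checkPositive(vs []Verdict, wantID string) error {
	if len(vs) == 0 {
		return fmt.Errorf("no verdict emitted, want pattern.id %q", wantID)
	}
	for _, v := range vs {
		if v.PatternID != wantID {
			return fmt.Errorf("pattern.id = %q, want %q", v.PatternID, wantID)
		}
	}
	return nil
}

// checkNegative is the false-positive gate: any verdict on baseline or
// confounder input is a failure.
func checkNegative(vs []Verdict) error {
	if len(vs) != 0 {
		return fmt.Errorf("false positive: %d verdict(s) on baseline input", len(vs))
	}
	return nil
}

func main() {
	fault := []Event{{Reason: "Evicted"}}
	baseline := []Event{{Reason: "Completed"}, {Reason: "OOMKilled"}}

	fmt.Println("positive:", checkPositive(detectPodEvicted(fault), "pod_evicted"))
	fmt.Println("negative:", checkNegative(detectPodEvicted(baseline)))
}
```

In a real row these checks would live in `*_test.go` files and run under `-race`; the sketch only shows the shape of the two assertions.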
The `pod_evicted` row additionally runs the on-disk replay corpus under `module/pkg/replay/pod_evicted/`; sibling rows skip that step until their `replay/<pattern>/` fixtures land, tracked as Path A in `docs/v1-rc1-test-audit.md` §4.

Why this matrix shape, not new `failure-inject` modes

Audit §4 explicitly recommends Path A (matrix over existing hermetic detector tests) for rc1 and defers Path B (per-pattern `failure-inject` CLI modes) to v1.x. Path A reuses the race-detector + golden-SHA shape already validated by `pattern-pod-evicted` and costs ~60 LoC of workflow YAML instead of five new CLI subcommands.
Wall-time budget
Observed per-row runtime under `-race` (measured locally on this worktree):

Per-row `timeout-minutes: 5` is ~3× headroom; matrix parallelism keeps total wall-time well inside the chaos-gate budget.
Test plan
- `actionlint .github/workflows/chaos.yml`: clean.
- `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/chaos.yml'))"`: parses, 3 jobs (`harness-determinism`, `cpu-steal-mpstat`, `pattern-detectors`), 6-row matrix.
- `cd module && go test -race -count=1 -run '<regex>' ./pkg/patterns/...`: all green (counts: HBMECC=16, ThermalThrottle=17, PCIeAER=15, PodEvicted+PressureFromNote=12, NCCLHang=10, XidCorrelation=14).
- `cd module && go test -race -count=1 ./pkg/replay/...` (pod_evicted corpus step): green.
- `main` after merge (verifiable only post-merge: nightly cron + push-to-main).
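As a rough illustration of how each matrix row maps onto the `go test` invocation in the test plan, the sketch below expands a row into its argument list. The `Row` type and `goTestArgs` helper are hypothetical, and the `hbm_ecc` regex is an assumed placeholder rather than the value in the workflow; the `pod_evicted` regex is the one quoted in the review follow-up.

```go
package main

import (
	"fmt"
	"strings"
)

// Row is a hypothetical model of one matrix entry: a detector name plus the
// -run regex that selects its tests.
type Row struct {
	Detector string
	RunRegex string
}

// goTestArgs expands a row into the go test invocation used in the test plan
// (race detector on, test cache bypassed via -count=1).
func goTestArgs(r Row) []string {
	return []string{
		"go", "test", "-race", "-count=1",
		"-run", r.RunRegex,
		"./pkg/patterns/...",
	}
}

func main() {
	rows := []Row{
		{Detector: "pod_evicted", RunRegex: "^Test(PodEvictedDetector|PodEvictedVerdict)"},
		{Detector: "hbm_ecc", RunRegex: "^TestHBMECC"}, // assumed regex, for illustration
	}
	for _, r := range rows {
		fmt.Printf("%-12s %s\n", r.Detector, strings.Join(goTestArgs(r), " "))
	}
}
```

Keeping the per-row difference down to a single regex is what lets one job definition serve every detector.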