
ci(chaos): extend matrix to all v1-rc1 detectors #341

Merged: trilamsr merged 3 commits into main from ci/chaos-matrix-all-detectors on Jun 1, 2026

trilamsr (Contributor) commented Jun 1, 2026

Summary

Closes #331. Replaces the single pattern-pod-evicted chaos job with a
matrix-strategy pattern-detectors job covering every detector named
in v1-rc1 cut-criterion §1: hbm_ecc, thermal_throttle, pcie_aer,
pod_evicted, nccl_hang, xid_correlation.

Each row runs the detector's positive + negative + schema *_test.go
prefix under -race:

  • Positive / Edge asserts — verdict emitted with correct
    pattern.id within bounded time.
  • Negative / SchemaRejectsDrift asserts — zero verdicts on
    baseline / confounder input (the false-positive gate the chaos
    matrix exists to enforce).

The pod_evicted row additionally runs the on-disk replay corpus under
module/pkg/replay/pod_evicted/; sibling rows skip that step until
their replay/<pattern>/ fixtures land — tracked as Path A in
docs/v1-rc1-test-audit.md §4.

Why this matrix shape, not new failure-inject modes

Audit §4 explicitly recommends Path A (matrix over existing
hermetic detector tests) for rc1 and defers Path B (per-pattern
failure-inject CLI modes) to v1.x. Path A reuses the
race-detector + golden-SHA shape already validated by
pattern-pod-evicted and costs ~60 LoC of workflow YAML instead of
five new CLI subcommands.

Wall-time budget

Observed per-row runtime under -race (measured locally on this
worktree):

Row                                  Runtime
hbm_ecc                              1.35s
thermal_throttle                     1.33s
pcie_aer                             1.30s
pod_evicted (incl. replay corpus)    1.35s + 1.34s
nccl_hang                            1.31s
xid_correlation                      1.32s

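Per-row timeout-minutes: 5 leaves ample headroom over these local
figures (300 s against roughly 1.3 s of test time per step, not
counting CI checkout and build); matrix parallelism keeps total
wall-time well inside the chaos-gate budget.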

Test plan

  • actionlint .github/workflows/chaos.yml — clean.
  • python3 -c "import yaml; yaml.safe_load(open('.github/workflows/chaos.yml'))" — parses, 3 jobs (harness-determinism, cpu-steal-mpstat, pattern-detectors), 6-row matrix.
  • For every matrix row, cd module && go test -race -count=1 -run '<regex>' ./pkg/patterns/... — all green (counts: HBMECC=16, ThermalThrottle=17, PCIeAER=15, PodEvicted+PressureFromNote=12, NCCLHang=10, XidCorrelation=14).
  • cd module && go test -race -count=1 ./pkg/replay/... (pod_evicted corpus step) — green.
  • Pre-commit hook: golangci-lint + go vet + go mod verify + attribute-namespace-check — all green.
  • Workflow runs green on main after merge (verifiable only post-merge — nightly cron + push-to-main).

ci: chaos.yml now matrix-tests all six shipped detectors (#14
pod_evicted, #8 nccl_hang, #3 hbm_ecc, #4 thermal_throttle, #5
pcie_aer, xid_correlation) instead of just pod_evicted. Closes the
chaos-coverage gap called out in v1-rc1-test-audit §4. No
operator-visible behavior change.

Tri Lam added 2 commits June 1, 2026 01:31
Replaces the single pattern-pod-evicted job with a matrix-strategy
pattern-detectors job covering every detector named in v1-rc1
cut-criterion §1: hbm_ecc, thermal_throttle, pcie_aer, pod_evicted,
nccl_hang, xid_correlation.

Each row runs the detector's positive + negative + schema *_test.go
prefix under -race. Positive asserts verdict emitted with correct
pattern.id; negative asserts zero verdicts on baseline / confounder
input — the false-positive gate the chaos matrix exists to enforce.
The pod_evicted row additionally runs the on-disk replay corpus
under module/pkg/replay/pod_evicted/; sibling rows skip that step
until their fixtures land (Path A in docs/v1-rc1-test-audit.md §4).

Per-row timeout stays at 5 min; observed local runtime per row is
~1.3 s under -race, and matrix parallelism keeps total wall-time
well inside the chaos-gate budget.

Closes #331.

Signed-off-by: Tri Lam <tri@maydow.com>

trilamsr (Contributor, Author) commented Jun 1, 2026

Addressed reviewer findings: added ib_link_flap + cuda_oom rows (closes #338 coverage gap), tightened pod_evicted regex to actual test names (^Test(PodEvictedDetector|PodEvictedVerdict)), scoped replay glob to ./pkg/replay/pod_evicted/... so future replay dirs don't trip the gate.
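The effect of tightening the pattern can be illustrated with Go's `regexp` package, since `go test -run` selects tests by an unanchored regular-expression match on the test name. In this sketch the test names and the loose pattern `PodEvicted` are assumptions chosen for contrast; only the anchored pattern comes from the comment above.

```go
package main

import (
	"fmt"
	"regexp"
)

// selected returns the names that an unanchored regexp match would pick,
// approximating how `go test -run` filters top-level test names.
func selected(pattern string, names []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Hypothetical test names, for illustration only.
	names := []string{
		"TestPodEvictedDetector_Positive",
		"TestPodEvictedVerdict_Schema",
		"TestReplayPodEvictedCorpus",
	}

	// A loose pattern matches anywhere in the name, so all three are selected.
	fmt.Println(selected("PodEvicted", names))

	// The anchored pattern selects only names that start with one of the two prefixes.
	fmt.Println(selected("^Test(PodEvictedDetector|PodEvictedVerdict)", names))
}
```

Anchoring with `^` and spelling out the prefixes keeps an unrelated test whose name merely contains the pattern from silently joining the row.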

trilamsr changed the title from "ci(chaos): extend matrix to all shipped detectors" to "ci(chaos): extend matrix to all v1-rc1 detectors" on Jun 1, 2026

trilamsr (Contributor, Author) commented Jun 1, 2026

CI fix: my prior glob tightening to ./pkg/replay/pod_evicted/... matched zero Go packages (replay package's runner_test.go lives at module/pkg/replay/, fixtures only at subdirs). Restored original ./pkg/replay/... glob with explanatory comment — runner_test fixture-walks pod_evicted today; future patterns flip their row's run_corpus flag and reuse the same runner_test discovery.

trilamsr merged commit 030a9fd into main on Jun 1, 2026 (22 checks passed)
trilamsr deleted the ci/chaos-matrix-all-detectors branch on June 1, 2026 09:27

Development

Successfully merging this pull request may close these issues.

[rc1-prep] test-gap: extend chaos.yml matrix to all six shipped detectors