[rc1-prep] test-gap: extend chaos.yml matrix to all six shipped detectors #331

Description

@trilamsr

Context

Audit in docs/v1-rc1-test-audit.md §4.

.github/workflows/chaos.yml has three rows. Only one (pattern-pod-evicted) covers a detector pattern; the other two (harness-determinism, cpu-steal-mpstat) are pivot-neutral harness gates. Six detector patterns ship on main today (#14 pod-evicted, #8 nccl-hang, #3 hbm-ecc, #4 thermal-throttle, #5 pcie-aer, plus xid-correlation), so five of them have no chaos row.

Cut-criterion §1 targets ≥ 12 of 15 patterns at rc1. Every shipping detector therefore needs a chaos-matrix row; otherwise the chaos gate's coverage does not track the moat's coverage.

Fix — Path A from the audit

Convert pattern-pod-evicted into a matrix-strategy job:

strategy:
  fail-fast: false
  matrix:
    pattern:
      - pod_evicted
      - nccl_hang
      - xid_correlation
      - hbm_ecc
      - thermal_throttle
      - pcie_aer

Each row runs go test -race -run TestReplay_<Pattern> ./module/pkg/replay/... against a committed fixture under module/pkg/replay/<pattern>/testdata/. This reuses the existing race-detector + golden-SHA shape, so no new failure-inject mode is required.
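For reviewers unfamiliar with the golden-SHA shape, here is a minimal sketch of the check each row could perform: replay a fixture, hash the output, and compare against a committed digest. The names replayFunc, digest, and checkGolden are hypothetical; the issue does not show the replay package's real API, and in the repo this logic would presumably sit in a _test.go file and report through testing.T rather than main.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// replayFunc is a hypothetical stand-in for a detector's replay entry point:
// it consumes fixture bytes and returns the detector's rendered output.
type replayFunc func(fixture []byte) []byte

// digest returns the lowercase hex SHA-256 of b.
func digest(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// checkGolden replays the fixture and compares the output digest with the
// golden value. Surrounding whitespace in golden is ignored so that a
// trailing newline in a committed golden file does not cause a mismatch.
func checkGolden(replay replayFunc, fixture []byte, golden string) error {
	got := digest(replay(fixture))
	if want := strings.TrimSpace(golden); got != want {
		return fmt.Errorf("golden SHA mismatch: got %s, want %s", got, want)
	}
	return nil
}

func main() {
	// Toy replay (assumption): the "detector output" is the trimmed fixture.
	replay := func(fixture []byte) []byte { return bytes.TrimSpace(fixture) }
	fixture := []byte("pod evicted: node pressure\n")

	// In practice the golden digest would be read from a committed file;
	// here it is computed inline so the sketch runs on its own.
	golden := digest(replay(fixture))

	if err := checkGolden(replay, fixture, golden); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```

Comparing digests rather than full output keeps the committed golden small, at the cost of a less readable failure message; a real test might also dump the output on mismatch.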

Path B (per-pattern failure-inject subcommands) is deferred to v1.x — too much code for the rc1 window. (Audit §4.)

Acceptance

  • All six matrix rows green on main.
  • Each pattern has at least one committed fixture under module/pkg/replay/<pattern>/testdata/.
  • Job timeout-minutes stays ≤ 10 (currently 5; the six rows run in parallel, so total wall-clock time stays ≤ 5 minutes instead of 6 × 5 = 30).

Effort

M (1–2 days). The bulk is curating six replay-corpus fixtures; the workflow change is ~10 lines.
