Skip to content

ci(bench): cover all 6 detectors in % delta gate (#302)#483

Merged
trilamsr merged 1 commit into
mainfrom
ci/bench-registry-cover-all-detectors
Jun 2, 2026
Merged

ci(bench): cover all 6 detectors in % delta gate (#302)#483
trilamsr merged 1 commit into
mainfrom
ci/bench-registry-cover-all-detectors

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

scripts/bench-registry.sh::bench_entries covered only BenchmarkPodEvictedDetector and the pyspy parser benches, so the % delta regression gate (scripts/bench-check.sh via bench-check-all.sh) missed five of the six detectors. The absolute allocs/event ceiling (allocs_gate) caught a detector that drifted past its hard ceiling, but a detector that drifted +50% while staying under the ceiling would ship silently. This PR closes the gap by registering all six detectors against both gates.

Root cause

The registry was authored when pod_evicted was the only detector with a bench. The five later detectors (added in #417/#418/#434/#438/#448/#456) were plumbed into allocs_gate (absolute ceiling) but the parallel % delta gate was never extended — a historical artifact, not an architectural defect. Fix: extend bench_entries to cover the missing five.

What changed

  • scripts/bench-registry.sh: added one entry covering five detectors against ./bench/detectors/, single-regex / single go test invocation (~5s incremental vs ~30s if split into 5 entries).
  • bench/detectors/testdata/bench-baseline.txt: generated baseline (count=10 × 500ms, Apple M1 Max). Hardware-invariant signals (B/op + allocs/op) pin the gate; sec/op stays advisory.
  • Existing ceilings in allocs_gate unchanged. Existing baselines unchanged.

Mutation verification

Bumped PCIeAERDetector allocs baseline down by ~18% (524 → 430):

PCIeAERDetector-10  430.0 ± 0%  524.0 ± 0%  +21.86% (p=0.000 n=10)
REGRESSION: the following benchmarks exceeded the 10% threshold vs baseline:
  PCIeAERDetector-10  +21.86% (p=0.000 n=10)

scripts/bench-check-all.sh exit 1. Revert → exit 0. Gate fires on regression, stays clean on baseline.

Coverage gap context

Cross-link: #302 (allocs/event rollup), #417 (xid_correlation NORTHSTAR), #418 (nccl_hang NORTHSTAR), #434 (pod_evicted NORTHSTAR), #438/#448/#456 (sibling perf PRs). Every detector that hit NORTHSTAR under the absolute ceiling now also has a relative-regression gate.

Test plan

  • make bench-check exit 0 (covers all 6 detectors now)
  • make bench-allocs-check exit 0
  • make bench-detectors-check exit 0 (soft-gate, unchanged)
  • Mutation: lower baseline → exit 1; revert → exit 0
  • Lint + vet (pre-commit hook): 0 issues
ci(bench): % delta regression gate now covers all six detectors (#302). xid_correlation, hbm_ecc, nccl_hang, thermal_throttle, and pcie_aer were previously gated only by the absolute allocs/event ceiling — they now also fail builds on >10% allocs/op or B/op drift vs the committed baseline.

bench_entries previously had only BenchmarkPodEvictedDetector +
the pyspy parser benches. The five other detectors (xid_correlation,
hbm_ecc, nccl_hang, thermal_throttle, pcie_aer) were covered by the
absolute allocs/event ceiling (allocs_gate) but had no relative
regression catch — a detector that drifted +50% but stayed under its
ceiling would ship silently.

Add a single registry entry that gates all five via one go-test
invocation against ./bench/detectors/ (single-pass keeps gate ~5s
vs ~30s if split into 5 entries). Generate the matching baseline at
bench/detectors/testdata/bench-baseline.txt on Apple M1 Max
(allocs/op + B/op are hardware-invariant; sec/op stays advisory).

Mutation-verified: artificially lower the PCIeAERDetector allocs
baseline by ~18% -> scripts/bench-check-all.sh exits 1 with
"+21.86% (p=0.000 n=10)"; revert -> exit 0. Confirms gate fires on
regression and stays clean on the unmutated baseline.

Related: #302 (allocs/event rollup), #417 (xid_correlation NORTHSTAR),
#418 (nccl_hang NORTHSTAR), #434 (pod_evicted NORTHSTAR), #438/#448/#456
(sibling perf PRs).

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Review: A (ship)

B/A/A+ Criteria

Coverage Verification

  • Regex ^Benchmark(XidCorrelation|HBMECC|NCCLHang|ThermalThrottle|PCIeAER)Detector$ covers 5/5 gated detectors (PodEvicted intentionally separate).
  • find bench -name '*_bench_test.go' → 2 files; nccl_fr_bench_test.go correctly excluded (1 GiB/iter advisory per comment L19).
  • Baseline: 50 data rows (5 detectors × 10 runs) + 6 header/trailer lines = 56 total. ✓

Mutation Test Proof

Builder's claim: PCIeAER -18% baseline → gate fires at 10% threshold.
Gate uses B/op + allocs/op (hardware-invariant); sec/op advisory per bench-check.sh L41–50. ✓

No Blocking Issues

  • Single-regex choice (~5s vs ~30s split): justified by comment; maintainability acceptable.
  • Baseline generation on Apple M1 Max: allocs/op + B/op are hardware-invariant per bench-check.sh design.
  • Comments load-bearing: explain gate strategy, reference related issues, motivate perf choice.

Grade: A

Ship. Ready to merge.

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 05:19
@trilamsr trilamsr merged commit 877caac into main Jun 2, 2026
12 checks passed
@trilamsr trilamsr deleted the ci/bench-registry-cover-all-detectors branch June 2, 2026 05:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant