ci(bench): cover all 6 detectors in % delta gate (#302) #483
Merged
bench_entries previously had only BenchmarkPodEvictedDetector + the pyspy parser benches. The five other detectors (xid_correlation, hbm_ecc, nccl_hang, thermal_throttle, pcie_aer) were covered by the absolute allocs/event ceiling (allocs_gate) but had no relative regression catch: a detector that drifted +50% but stayed under its ceiling would ship silently.

Add a single registry entry that gates all five via one go-test invocation against ./bench/detectors/ (single-pass keeps gate ~5s vs ~30s if split into 5 entries). Generate the matching baseline at bench/detectors/testdata/bench-baseline.txt on Apple M1 Max (allocs/op + B/op are hardware-invariant; sec/op stays advisory).

Mutation-verified: artificially lower the PCIeAERDetector allocs baseline by ~18% -> scripts/bench-check-all.sh exits 1 with "+21.86% (p=0.000 n=10)"; revert -> exit 0. Confirms gate fires on regression and stays clean on the unmutated baseline.

Related: #302 (allocs/event rollup), #417 (xid_correlation NORTHSTAR), #418 (nccl_hang NORTHSTAR), #434 (pod_evicted NORTHSTAR), #438/#448/#456 (sibling perf PRs).

Signed-off-by: Tri Lam <tree@lumalabs.ai>
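The two percentages in the mutation check (~18% and +21.86%) describe the same change measured from opposite sides. A minimal sketch of that arithmetic, using the 524 → 430 allocs/op figures quoted in the PR description and assuming the gate compares one representative allocs/op value per side; `percentDelta` is a hypothetical helper, not part of the repository's scripts:

```go
package main

import "fmt"

// percentDelta returns the relative change of current against baseline, in percent.
// Hypothetical helper for illustration only.
func percentDelta(baseline, current float64) float64 {
	return (current - baseline) / baseline * 100
}

func main() {
	const measured = 524.0 // allocs/op the benchmark actually produces
	const mutated = 430.0  // baseline after the artificial reduction

	// How far the baseline was lowered, relative to the real value.
	fmt.Printf("baseline lowered by %.2f%%\n", -percentDelta(measured, mutated))
	// What the gate reports: the real value relative to the lowered baseline.
	fmt.Printf("gate sees a delta of %+.2f%%\n", percentDelta(mutated, measured))
}
```

Because the denominator shrinks when the baseline is lowered, a reduction of about 18% shows up as a regression of about 22%, which matches the "+21.86%" the gate printed.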
Review: A (ship)
B/A/A+ Criteria
Coverage Verification
Mutation Test Proof
Builder's claim: PCIeAER -18% baseline → gate fires at 10% threshold.
No Blocking Issues
Grade: A
Ship. Ready to merge.
Summary
scripts/bench-registry.sh::bench_entries covered only BenchmarkPodEvictedDetector and the pyspy parser benches, so the % delta regression gate (scripts/bench-check.sh via bench-check-all.sh) missed five of the six detectors. The absolute allocs/event ceiling (allocs_gate) caught a detector that drifted past its hard ceiling, but a detector that drifted +50% while staying under the ceiling would ship silently. This PR closes the gap by registering all six detectors against both gates.

Root cause
The registry was authored when pod_evicted was the only detector with a bench. The five later detectors (added in #417/#418/#434/#438/#448/#456) were plumbed into allocs_gate (absolute ceiling), but the parallel % delta gate was never extended. This is a historical artifact, not an architectural defect. Fix: extend bench_entries to cover the missing five.

What changed
scripts/bench-registry.sh: added one entry covering five detectors against ./bench/detectors/, using a single regex and a single go test invocation (~5s incremental vs ~30s if split into 5 entries).
bench/detectors/testdata/bench-baseline.txt: generated baseline (count=10 × 500ms, Apple M1 Max). Hardware-invariant signals (B/op + allocs/op) pin the gate; sec/op stays advisory.
allocs_gate unchanged. Existing baselines unchanged.

Mutation verification
Lowered the PCIeAERDetector allocs baseline by ~18% (524 → 430): scripts/bench-check-all.sh exits 1. Revert → exit 0. The gate fires on regression and stays clean on the unmutated baseline.

Coverage gap context
Cross-link: #302 (allocs/event rollup), #417 (xid_correlation NORTHSTAR), #418 (nccl_hang NORTHSTAR), #434 (pod_evicted NORTHSTAR), #438/#448/#456 (sibling perf PRs). Every detector that hit NORTHSTAR under the absolute ceiling now also has a relative-regression gate.
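To illustrate why a detector needs both gates, here is a minimal sketch of the two checks side by side. The function names and the sample numbers (baseline 400, ceiling 1000) are hypothetical; only the +50% drift scenario and the 10% threshold come from this PR's discussion, and the real gates live in the shell scripts, not in Go:

```go
package main

import "fmt"

// passesCeiling is the absolute check: current must not exceed a fixed limit.
// Hypothetical stand-in for the allocs_gate idea.
func passesCeiling(current, ceiling float64) bool {
	return current <= ceiling
}

// passesDelta is the relative check: current may not grow past the baseline
// by more than maxPct percent. Hypothetical stand-in for the % delta gate.
func passesDelta(baseline, current, maxPct float64) bool {
	return (current-baseline)/baseline*100 <= maxPct
}

func main() {
	// Assumed illustrative values, not taken from the repository.
	baseline, ceiling, maxPct := 400.0, 1000.0, 10.0
	current := baseline * 1.5 // a +50% drift in allocs/op

	fmt.Println("ceiling gate passes:", passesCeiling(current, ceiling))
	fmt.Println("delta gate passes:", passesDelta(baseline, current, maxPct))
}
```

With these values the ceiling check still passes while the delta check fails, which is the silent-drift case the relative gate is meant to catch.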
Test plan
make bench-check exit 0 (covers all 6 detectors now)
make bench-allocs-check exit 0
make bench-detectors-check exit 0 (soft-gate, unchanged)