[telemetry] bench-check: actually gate on regression#19
Merged
Conversation
Caught pre-merge: the `make bench-check` Makefile target advertised
"fails if any row regresses >10%", but benchstat itself always exits
0 — it prints the comparison, no gate. Reviewer flagged the truth-
in-advertising gap.
Fix: add scripts/bench-check.sh wrapping benchstat. After printing
the table, awk-parses the `+NN.NN%` deltas in each row and exits 1
if any exceeds the threshold (default 10%, overridable via
THRESHOLD env). Tested locally:
- Identical inputs → PASS, exit 0.
- Synthetic +31% regression → REGRESSION + exit 1 with the
offending row + the delta printed to stderr.
Compatible with macOS BSD awk (POSIX match()+substr() rather than
gawk's array-form match()).
Makefile bench-check now delegates to the wrapper:
THRESHOLD=5 make bench-check # tighter local gate
README's regression-detection section rewritten to reflect actual
behavior + explain when + how to update the baseline intentionally.
The baseline file is still the load-bearing artifact (per the
reviewer); CI auto-gating filed under FOLLOWUPS for the next PR
since regressing during heavy refactor weeks would block legitimate
work.
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two reviewer micro-nits on the regression gate:
1. **Regex `^\+[0-9]+\.[0-9]+%$` required decimals.** Today's
benchstat always emits two decimals (e.g., `+31.28%`); a future
major version that prints round numbers without decimals
(`+31%`) would silently pass the gate. Loosened to
`^\+[0-9]+(\.[0-9]+)?%$` so both forms catch.
2. **Regressed-row output now includes (p=… n=…) when benchstat
emits it.** Significant rows get the annotation; the geomean
summary row prints without (benchstat doesn't attach p+n there).
Example output on a synthetic +31% regression:
WindowedRate_Observe-14 +31.28% (p=0.008 n=5)
geomean +31.28%
Operators see significance + sample count alongside the delta
without re-reading the benchstat table.
Verified end-to-end: script exits 1 on regression; synthetic
"+31%" (no decimals) catches via the loosened regex.
Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What landed (at a glance)
scripts/bench-check.shparses benchstat+NN.NN%deltas and exits non-zero if any row exceeds the thresholdmake bench-check(withTHRESHOLD=Noverride) delegates to the wrapperinternal/telemetry/README.mdregression-detection section rewritten to describe actual behaviorWhat this PR does
Post-merge fix for the M2 self-telemetry PR (#17). The
make bench-checktarget in that PR claimed "fails if any row regresses >10%" — butbenchstatitself always exits 0, so the target reported deltas without gating on them. Caught by the post-merge reviewer.Wraps benchstat in
scripts/bench-check.sh:+NN.NN%deltasTested locally:
PASS, exit 0.REGRESSION: WindowedRate_Observe-14 +31.28%+ exit 1.POSIX-awk compatible (BSD awk on macOS, gawk on Linux). README's regression-detection section rewritten to match actual behavior and document the "intentional regression — update the baseline" workflow.
Linked issue(s)
Refs #17 (post-merge fix).
Release notes
```release-notes
NONE
```
Checklist