[telemetry] bench-check: actually gate on regression#19

Merged
trilamsr merged 2 commits into main from fix/bench-check-regression-gate
May 14, 2026

@trilamsr

What landed (at a glance)

| Area | Shipped |
|---|---|
| Bench-regression gate | `scripts/bench-check.sh` parses benchstat `+NN.NN%` deltas and exits non-zero if any row exceeds the threshold |
| Makefile target | `make bench-check` (with `THRESHOLD=N` override) delegates to the wrapper |
| README correction | `internal/telemetry/README.md` regression-detection section rewritten to describe actual behavior |

What this PR does

Post-merge fix for the M2 self-telemetry PR (#17). The make bench-check target in that PR claimed "fails if any row regresses >10%" — but benchstat itself always exits 0, so the target reported deltas without gating on them. Caught by the post-merge reviewer.

Wraps benchstat in scripts/bench-check.sh:

  • Runs the benchmark output through benchstat (printed verbatim so operators still see the comparison table)
  • Awk-parses the per-row +NN.NN% deltas
  • Exits 1 if any benchmark regressed by more than the configured threshold; prints the offending row + delta to stderr
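
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `scripts/bench-check.sh`: the function name `gate_on_deltas` is hypothetical, and it assumes the delta appears as its own whitespace-separated field in each benchstat row.

```shell
# gate_on_deltas: hypothetical helper, not the real scripts/bench-check.sh.
# Reads benchstat comparison text on stdin, echoes it verbatim to stdout,
# reports rows whose "+NN.NN%" delta exceeds the threshold on stderr,
# and returns 1 if any row was reported.
gate_on_deltas() {
    awk -v threshold="${THRESHOLD:-10}" '
        { print }   # pass the table through unchanged
        {
            # Look at every field; only a "+NN.NN%" (or "+NN%") token counts.
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^\+[0-9]+(\.[0-9]+)?%$/) {
                    # Drop the leading "+" and trailing "%", then force numeric.
                    delta = substr($i, 2, length($i) - 2) + 0
                    if (delta > threshold + 0) {
                        printf "REGRESSION: %s %s\n", $1, $i | "cat 1>&2"
                        failed = 1
                    }
                }
            }
        }
        END { exit (failed ? 1 : 0) }
    '
}
```

With the function defined, `benchstat old.txt new.txt | gate_on_deltas` would print the table and return non-zero on a regression; the two file names are placeholders.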

Tested locally:

  • Identical input files → PASS, exit 0.
  • Synthetic +31% regression → prints `REGRESSION: WindowedRate_Observe-14 +31.28%` and exits 1.

POSIX-awk compatible (BSD awk on macOS, gawk on Linux). README's regression-detection section rewritten to match actual behavior and document the "intentional regression — update the baseline" workflow.

Linked issue(s)

Refs #17 (post-merge fix).

Release notes

```release-notes
NONE
```

Checklist

  • Tests added or updated — manual end-to-end test of both pass + fail paths documented in commit message; baseline file unchanged so the gate is a no-op until someone actually regresses.
  • `make ci` passes locally
  • Commits are signed off (`git commit -s`)
  • Commits cryptographically signed (SSH)
  • N/A — no new components

trilamsr added 2 commits May 14, 2026 13:46
Caught post-merge: the `make bench-check` Makefile target advertised
"fails if any row regresses >10%", but benchstat itself always exits
0. It prints the comparison and never gates on it. The reviewer
flagged the truth-in-advertising gap.

Fix: add scripts/bench-check.sh wrapping benchstat. After printing
the table, awk-parses the `+NN.NN%` deltas in each row and exits 1
if any exceeds the threshold (default 10%, overridable via
THRESHOLD env). Tested locally:
- Identical inputs → PASS, exit 0.
- Synthetic +31% regression → REGRESSION + exit 1 with the
  offending row + the delta printed to stderr.

Compatible with macOS BSD awk (POSIX match()+substr() rather than
gawk's array-form match()).
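
As an illustration of the portable idiom mentioned above, the sketch below pulls the numeric delta out of a row using only two-argument `match()` plus `substr()`. The helper name `extract_delta` is hypothetical and the row format is simplified; the real script may structure this differently.

```shell
# extract_delta: hypothetical helper illustrating the portable idiom.
# Prints the numeric part of the first "+NN.NN%" token in its argument,
# or nothing if the argument contains no such token.
extract_delta() {
    printf '%s\n' "$1" | awk '
        # Two-argument match() sets RSTART and RLENGTH, which POSIX defines.
        # The three-argument array form is a gawk extension, so it is avoided.
        match($0, /\+[0-9]+(\.[0-9]+)?%/) {
            # Skip the leading "+" and drop the trailing "%".
            print substr($0, RSTART + 1, RLENGTH - 2)
        }
    '
}
```

For example, `extract_delta 'WindowedRate_Observe-14 12.3n 16.1n +31.28% (p=0.008 n=5)'` prints `31.28`.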

Makefile bench-check now delegates to the wrapper:
    THRESHOLD=5 make bench-check     # tighter local gate
README's regression-detection section rewritten to reflect actual
behavior + explain when + how to update the baseline intentionally.

The baseline file is still the load-bearing artifact (per the
reviewer); CI auto-gating filed under FOLLOWUPS for the next PR
since regressing during heavy refactor weeks would block legitimate
work.

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
Two reviewer micro-nits on the regression gate:

1. **Regex `^\+[0-9]+\.[0-9]+%$` required decimals.** Today's
   benchstat always emits two decimals (e.g., `+31.28%`); a future
   major version that prints round numbers without decimals
   (`+31%`) would silently pass the gate. Loosened to
   `^\+[0-9]+(\.[0-9]+)?%$` so both forms are caught.

2. **Regressed-row output now includes (p=… n=…) when benchstat
   emits it.** Significant rows get the annotation; the geomean
   summary row prints without (benchstat doesn't attach p+n there).
   Example output on a synthetic +31% regression:

       WindowedRate_Observe-14  +31.28% (p=0.008 n=5)
       geomean  +31.28%

   Operators see significance + sample count alongside the delta
   without re-reading the benchstat table.
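
One way the annotated output in item 2 could be produced is sketched below. The helper `report_row` is hypothetical, and it assumes benchstat emits the annotation as the two fields immediately after the delta (`(p=0.008` and `n=5)`), as in the example output above.

```shell
# report_row: hypothetical helper. Given one benchstat row, prints
# "<name>  <delta>" and appends "(p=... n=...)" when the two fields
# after the delta look like that annotation.
report_row() {
    printf '%s\n' "$1" | awk '
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^\+[0-9]+(\.[0-9]+)?%$/) {
                    line = $1 "  " $i
                    # Fields past NF read as empty strings, so the geomean
                    # row simply fails this check and prints without p/n.
                    if ($(i + 1) ~ /^\(p=/ && $(i + 2) ~ /^n=[0-9]+\)$/)
                        line = line " " $(i + 1) " " $(i + 2)
                    print line
                }
            }
        }
    '
}
```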

Verified end-to-end: the script exits 1 on regression, and a synthetic
"+31%" (no decimals) is caught by the loosened regex.
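
The difference between the two patterns can be checked in isolation. The sketch below is illustrative only: `classify_delta` is a hypothetical helper that reports which of the two regexes from item 1 accepts a given token.

```shell
# classify_delta: hypothetical helper. Reports whether a single token is
# accepted by the original (decimals required) pattern, only by the
# loosened pattern, or by neither.
classify_delta() {
    printf '%s\n' "$1" | awk '
        /^\+[0-9]+\.[0-9]+%$/    { print "strict and loose"; next }
        /^\+[0-9]+(\.[0-9]+)?%$/ { print "loose only"; next }
                                 { print "no match" }
    '
}
```

Here `classify_delta '+31.28%'` reports `strict and loose`, while `classify_delta '+31%'` reports `loose only`, which is the case the original pattern would have let through.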

Assisted-by: Anthropic:claude-opus-4-7 [Claude Code]
Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr enabled auto-merge (squash) May 14, 2026 20:53
@trilamsr trilamsr merged commit 5d8035b into main May 14, 2026
9 checks passed
@trilamsr trilamsr deleted the fix/bench-check-regression-gate branch May 14, 2026 20:54
