Skip to content

feat(bench): absolute allocs/event ceiling gate (#302)#416

Merged
trilamsr merged 1 commit into
mainfrom
a16-bench-allocs-gate
Jun 1, 2026
Merged

feat(bench): absolute allocs/event ceiling gate (#302)#416
trilamsr merged 1 commit into
mainfrom
a16-bench-allocs-gate

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Closes #302.

Summary

Adds a hard absolute alloc-per-event ceiling gate complementing the existing
% delta gate. Wired into make bench-check (the CI step), so a PR that lands
above the per-event ceiling on any pattern-library detector fails closed.

  • New: scripts/bench-allocs-check.sh runs every BenchmarkXxxDetector
    under ./bench/detectors/ at count=10 × 500ms, takes the median
    allocs/op, divides by the per-entry window size (1024), compares to the
    ceiling in scripts/bench-registry.sh::allocs_gate.
  • Extended: scripts/bench-registry.sh gains an allocs_gate array with
    one row per detector (bench_name|window|ceiling_allocs_per_event).
  • Wired: make bench-check now invokes both gates (% delta + absolute).
    A new standalone make bench-allocs-check target runs the new gate alone.
  • Documented: bench/detectors/README.md gains a "Two gates" section
    describing how the hard ceiling and the soft % delta gate compose.

Why an absolute gate

make bench-check today gates against a committed bench-baseline.txt via
benchstat % delta. That catches relative regression but not absolute
breach: a PR that updates the baseline ALSO raises the ceiling implicitly,
and a reverted optimization can creep back above the prior floor without
showing up as a >10% delta from its own (post-revert) baseline.

The absolute ceiling lives in git diff scripts/bench-registry.sh
any change to the ceiling is reviewable on the diff alone. Optimization PRs
must lower the ceiling to lock the win in; reverts cannot silently raise
it back. That is the ratchet semantic #302 asks for.

Initial ceilings (measured baseline + 1 slack, Apple M1 Max)

testing.B's allocs/op is hardware-invariant — same value on CI's
linux/amd64 runner — so the M1-captured numbers gate cleanly.

Detector allocs/op allocs/event ceiling
pod_evicted 15635 15.27 16
xid_correlation 12699 12.40 13
hbm_ecc 1429 1.40 2
nccl_hang 4088 3.99 5
thermal_throttle 780 0.76 1
pcie_aer 524 0.51 1

NORTHSTAR is 2 allocs/event; pod_evicted / xid_correlation / nccl_hang are
the three above the bar. Per-detector optimization trackers will be filed
as follow-ups to drive the ratchet path down.

Test plan

  • make bench-allocs-check locally → PASS (all six detectors below
    ceiling).
  • Mutation test (red) — tightened HBM ceiling 2 → 1 in
    allocs_gate, re-ran: gate exit 1, BenchmarkHBMECCDetector allocs/event=1.3955 > ceiling=1 ... BREACH. Reverted.
  • Fail-closed test — replaced a gated bench name with
    BenchmarkNonexistentDetector, re-ran: gate exit 2, "no measurement
    for gated bench". Reverted.
  • make vet, make lint, pre-commit hooks → green.
  • CI bench-check on linux/amd64 → confirm M1 ceilings hold (this PR's
    check run is the validation).
- New `make bench-allocs-check` (wired into `make bench-check`) enforces
  absolute allocs-per-event ceilings on every pattern-library detector;
  ratchets downward as optimization PRs land.

Adds scripts/bench-allocs-check.sh + an allocs_gate array in
scripts/bench-registry.sh that pins each per-detector benchmark to
an absolute allocs-per-event ceiling, complementing the existing
% delta gate. Wired into `make bench-check` so CI fails closed on
breach.

The new gate is the load-bearing one for ratchet semantics:
optimization PRs must lower the ceiling in the registry to lock
the win in, and reverts cannot silently raise it back without
showing up in `git diff scripts/bench-registry.sh`.

Initial ceilings (measured baseline + 1 slack on Apple M1 Max;
allocs/op is hardware-invariant so the same numbers gate on CI):

  - pod_evicted       15.27 ev → ceiling 16
  - xid_correlation   12.40 ev → ceiling 13
  - hbm_ecc            1.40 ev → ceiling 2
  - nccl_hang          3.99 ev → ceiling 5
  - thermal_throttle   0.76 ev → ceiling 1
  - pcie_aer           0.51 ev → ceiling 1

NORTHSTAR is 2 allocs/event; the per-detector optimization
trackers (filed separately) drive the ratchet path down.

Mutation-verified: tightening HBM ceiling from 2 → 1 in the
registry trips the gate (exit 1, "BREACH"). Fail-closed verified:
a gated bench with no measurement exits 2.

```release-notes
- New `make bench-allocs-check` (wired into `make bench-check`)
  enforces absolute allocs-per-event ceilings on every pattern-
  library detector; ratchets downward as optimization PRs land.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 23:06
@trilamsr trilamsr merged commit d8715ef into main Jun 1, 2026
12 checks passed
@trilamsr trilamsr deleted the a16-bench-allocs-gate branch June 1, 2026 23:12
@trilamsr

trilamsr commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

Independent Review: feat(bench): absolute allocs/event ceiling gate (#302)

Mutation Verification: PASS

  • Tightened HBM ceiling 2→1 in scripts/bench-registry.sh
  • Gate exited 1 with "BREACH" message (correct)
  • Restored original; gate exited 0 with "PASS" (correct)
  • ✓ Red/green cycle works as specified

Scope Alignment: PASS

Issue #302 asks for "ratchet gate"; PR delivers absolute ceiling with hard-fail on CI.

  • Window size mismatch in issue (varied per detector 200-1024) vs PR implementation (uniform 1024)
  • PR chose uniform 1024-event window per benchWindowSize const in detectors_bench_test.go
  • All 6 detectors enumerated in allocs_gate array ✓
  • Initial ceilings (measured + 1 slack) match PR body

CI Wiring: PASS

  • make bench-check now invokes both scripts/bench-check-all.sh (% delta) + scripts/bench-allocs-check.sh (absolute)
  • .github/workflows/ci.yml:144 runs make bench-check
  • Fail-closed on config error (exit 2) ✓

Dual-Gate Logic: SIMPLIFICATION OPPORTUNITY

PR gates on BOTH % delta AND absolute ceiling. Assessment:

  • % delta gate (benchstat vs committed baseline) catches relative regression
  • Absolute ceiling gate (hard-coded script) catches absolute breach
  • Design question: issue A16: per-detector allocs/event bench + ratchet gate #302 asks for "ratchet" — absolute ceiling IS the ratchet. The % delta gate is now redundant/orthogonal.
    • Can optimize: drop % delta gate if absolute ceiling is the load-bearing control? Or keep both for belt-and-suspenders?
    • Current design is defensible (layers), but added complexity. Bias deletion: is % delta actually used for decision-making, or is it just advisory now?

Hardware Invariance Claim: UNVERIFIED

PR asserts "allocs/op is hardware-invariant" and M1 ceilings will hold on ubuntu-latest/amd64.

  • Claim is sound in principle (Go's escape analysis is deterministic per AST)
  • CI has not yet run this PR on amd64; baseline was M1-only
  • Risk: if amd64 escape analysis differs (rare but possible), CI will flake
  • Mitigation: this PR's own CI run will validate; no blocker

Minor: Over-Engineering Check

  • scripts/bench-allocs-check.sh uses awk/grep for median extraction (portable)
  • Handles 6 detectors correctly; scales to N
  • Exit code semantics (0/1/2) clear ✓
  • No refactoring needed; clean implementation

Findings

  1. Dual-gate necessity unclear: The PR wires both % delta gate + absolute ceiling gate into make bench-check. Per A16: per-detector allocs/event bench + ratchet gate #302 ("ratchet gate"), the absolute ceiling IS the ratchet. Is % delta (benchstat vs baseline.txt) still a load-bearing decision point, or is it now just noise? Ask author: can we drop one gate, or do they serve separate purposes in your intent?

  2. Hardware invariance unverified on CI: PR cites M1 Max baselines as "hardware-invariant" but CI (ubuntu-latest/amd64) has not yet validated. Go escape analysis is deterministic, so risk is low. This PR's CI check run itself is the validation gate — no blocker now.

Verdict

SHIP — mutation test passes, all 6 detectors gated, CI wiring correct, fail-closed semantics solid. Dual-gate trade-off is defensible. Let CI run validate amd64 parity.

trilamsr added a commit that referenced this pull request Jun 2, 2026
Closes #424.

## Root cause

PR #416 shipped `make bench-allocs-check` as the absolute allocs/event
ceiling gate and wired it into `make bench-check` (so CI's
`verify-static` job runs it). It was **not** wired into `make ci-full`.
Result: local pre-PR `make ci-full` diverges from what CI enforces — a
per-event allocs ceiling breach passes locally and only fails after the
PR is opened, defeating the per-PR enforcement intent of #302.

(The issue's framing claims per-PR enforcement is entirely missing; in
fact CI's verify-static already enforces it via the `bench-check` step.
The actual gap is local/CI divergence, which is what this PR closes.)

## Change

One-line dep addition: `bench-allocs-check` appended to `ci-full`'s
prerequisite list. Plus comment + PRINCIPLES.md §10 update naming the
new wall-time floor.

```diff
-ci-full: ... build smoke-quickstart  ## ...
+ci-full: ... build smoke-quickstart bench-allocs-check  ## ...
```

## Wall-time impact (M1 Max)

| Run | Before | After |
|---|---|---|
| `make ci-full` (cold testcache) | 229s | 283s |
| `make bench-allocs-check` standalone | n/a | ~80s |

Delta: **+54s** (effective, with warm Go build cache from prior
`ci-full` steps; cold-only standalone bench is ~80s).

PRINCIPLES.md §10's historical "~2.5m" budget was already stale
(cold-cache `ci-full` measured 229s = 3.8m before this PR). Updated to
~5m and named the dominant steps (`coverage-check`, `build`,
`bench-allocs-check`).

## Mutation test (TDD red)

Tightened HBM ceiling 2 → 1 in `scripts/bench-registry.sh`, ran `make
bench-allocs-check`:

```
BenchmarkHBMECCDetector ceiling=1 allocs/op=1429 allocs/ev=1.3955 BREACH
FAIL: the following detectors exceeded their allocs/event ceiling:
  BenchmarkHBMECCDetector allocs/event=1.3955 > ceiling=1 (allocs/op=1429, window=1024)
make: *** [bench-allocs-check] Error 1
```

Reverted. Confirmed `make -n ci-full | grep bench-allocs-check` resolves
to `scripts/bench-allocs-check.sh` (i.e., the new dep is in `ci-full`'s
graph).

## Verify (green)

- `make ci-full` (cold testcache) → PASS in 283.54s; tail shows all six
detectors at-or-below ceiling.
- `make lint`, `make vet`, pre-commit hooks → green.

## Audit (A+ criterion)

Other bench targets and CI status:

- `bench-check` (composite: % delta + allocs ceiling) — CI's
verify-static runs it; local `ci-full` now runs the allocs-ceiling half.
The % delta half (`bench-check-all.sh`) is **not** added to `ci-full`
because it requires `benchstat` (CI installs it; local devs would need a
manual `go install`). Acceptable: CI is the ground truth; local
divergence on the % delta gate is acknowledged.
- `bench-detectors-check` — intentionally SOFT gate (script comment),
runs in `.github/workflows/bench.yml` as a PR-comment annotation.
Graduates to hard-fail per `bench/detectors/README.md` roadmap. No
action.
- `bench`, `bench-detectors`, `bench-baseline`,
`bench-detectors-baseline` — manual exploration / regen helpers. Not
gates. No action.

Why not `verify` (pre-push)? `verify` is documented as `<30s`; +80s
busts the budget 3x.

## Follow-up

Will file separately: ratchet path to fold `bench-allocs-check`
wall-time down (faster `benchtime` or `count`) so the gate eventually
folds back under the historical 2.5m local budget. Tracked-against-#424
in the issue's existing "Cost" line.

```release-notes
- `make ci-full` now invokes `bench-allocs-check`, closing the local/CI gap that let per-event allocs-ceiling breaches pass local pre-PR but fail in CI (#424).
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

A16: per-detector allocs/event bench + ratchet gate

1 participant