Skip to content

ci(bench): re-baseline make bench-check + add regression gate#244

Merged
trilamsr merged 3 commits into
mainfrom
ci/i227-bench-check
May 31, 2026
Merged

ci(bench): re-baseline make bench-check + add regression gate#244
trilamsr merged 3 commits into
mainfrom
ci/i227-bench-check

Conversation

@trilamsr

@trilamsr trilamsr commented May 31, 2026

Copy link
Copy Markdown
Contributor

Closes #227.

Summary

make bench-check has been a no-op since PR-K.2 (#217) deleted components/receivers/k8sevents/ — its testdata/bench-baseline.txt was the sole row in the gate's for pkg in … loop, leaving the loop body literally empty (for pkg in ; do …). The target stayed as a stub so make ci automation kept a stable invocation, with a TODO to re-register packages later. That re-registration is this PR.

Root cause

Makefile:71-101 (pre-diff) iterated an empty list. The wrapper around benchstat+scripts/bench-check.sh was intact; only the registry of which packages to gate had been emptied. So the fix is purely a registration + baseline-capture exercise — no logic change to the gate's comparison machinery, plus one scope-tightening change called out under "Adversarial-review fix" below.

What this PR does

  1. Registers two packages with deterministic-allocation micro-benchmarks for gating:

  2. Captures committed baselines at module/pkg/patterns/testdata/bench-baseline.txt and components/receivers/pyspy/testdata/bench-baseline.txt, both at count=10 -benchtime=500ms -benchmem. count=10 is the minimum that survives benchstat's ≥6 samples for confidence interval at level 0.95 warning with headroom for one outlier.

  3. Extracts the registry to scripts/bench-registry.sh sourced by both scripts/bench-check-all.sh (gate runner, wired to make bench-check) and scripts/bench-baseline.sh (regenerator, wired to make bench-baseline). Single source of truth — drift between "what we gate" and "what we regenerate" can no longer happen.

    Why scripts and not an inline Make loop: the bench regex ^Benchmark(ParseDump|StackID)$ contains parentheses that the Make-invoked /bin/sh tokenises as subshell metacharacters. Quoting under make shellquote escaping is more fragile than just punting to plain shell.

  4. Restricts the gate to B/op + allocs/op tables in scripts/bench-check.sh; sec/op is no longer gated. See "Adversarial-review fix" below.

  5. Wires the gate into CI as a new step in .github/workflows/ci.yml's verify-static job, after make build. A pinned go install golang.org/x/perf/cmd/benchstat@… step installs the tool on the runner.

Adversarial-review fix (mid-PR)

The first commit gated sec/op too, with the design note claiming pure wall-clock jitter would show as benchstat's ~ (non-significant) and never reach the +NN% parser. That's false. Repeating make bench-check immediately after a baseline capture, with identical code on identical hardware, fired:

PodEvictedDetector_1kEventWindow-10  sec/op  +16.50% (p=0.001 n=10)
PodEvictedDetector_1kEventWindow-10  B/op    +0.01% (p=0.005 n=10)
PodEvictedDetector_1kEventWindow-10  allocs/op  ~ (p=1.000 n=10)

The 16.5% delta was background-load drift between two runs minutes apart. benchstat marks it significant because the CV is small (~5-10%) and the means are reliably different — both true symptoms of real but uninteresting wall-clock variance. Gating on sec/op would produce continuous false fires across the team without catching anything code review can act on.

Fix: scripts/bench-check.sh now tracks the active benchstat table by indented header line (sec/op / B/op / allocs/op) and only counts +NN% deltas under B/op + allocs/op. Both stay pinned to 0% CV across runs (deterministic Go allocator + identical bytes-allocated per code path) and only move when the code does. make bench-check on clean main now passes consistently across runs; the TDD planted regression (heap-allocating sink in stackID) still trips the gate at +800% B/op + +133% allocs/op.

Design notes

  • Why these two benches and not bench/overhead/nccl_fr_bench_test.go: that benchmark replays 1 GiB of fixtures per iter (~90s on M1 Max) and already self-asserts on HeapAlloc delta against the NORTHSTARS O2 ceiling. Adding it to bench-check would push CI past the budget and produce nothing the in-bench assertion doesn't already gate. Stays advisory.
  • Hardware skew is irrelevant for B/op + allocs/op: baselines were captured on Apple M1 Max; CI runs on ubuntu-latest (x86_64). The Go allocator is deterministic across GOARCH for a given code path — b.ReportAllocs() returns identical numbers on both. The historical k8sevents baseline lived under the same cross-arch arrangement.
  • make ci unchanged: bench-check is NOT added to the make ci recipe. That would force every contributor to install benchstat. CI installs benchstat itself in the workflow; local dev runs make bench-check only when intentionally vetting a perf change.
  • module/ submodule resolution: module/pkg/patterns lives in the in-repo Go submodule (module/go.mod); local dev runs resolve through go.work and OCB through builder-config.yaml's replaces: ./module. go test ./module/pkg/patterns/… works from the repo root because of the workspace.

TDD record

  • Define gate: scripts/bench-check-all.sh + bench-registry.sh register two packages; threshold 10%, env-overridable; only B/op + allocs/op rows count.
  • Red (no plant): make bench-check on clean main passes — exit 0, both packages report PASS: no benchmarks regressed by more than 10% vs baseline. (Re-run multiple times across different machine load — stays green.)
  • Red→Green (planted regression): edited stackID to allocate 64 B/iter via a package-level sink (escape-analysis-routed to the heap, so allocs/op actually moved); make bench-check exit 2 (make wraps the inner exit 1). Output flagged StackID-10 +800.00% (p=0.000 n=10) on B/op and +133.33% (p=0.000 n=10) on allocs/op.
  • Green (revert): plant removed; make bench-check exit 0; git diff main -- components/receivers/pyspy/stackid.go empty.

Release notes

### Added
- `make bench-check` is now a real perf-regression gate again (was a no-op since PR-K.2). Registry-driven via `scripts/bench-registry.sh` (single source of truth, sourced by both `bench-check-all.sh` runner and `bench-baseline.sh` regenerator); the runner gates `B/op` + `allocs/op` deltas against a committed `testdata/bench-baseline.txt` via benchstat; any row regressing past `THRESHOLD%` (default 10, env-overridable) fails. `sec/op` is deliberately not gated (wall-clock CV on dev hardware and shared CI runners routinely crosses 10% from background load alone, even when benchstat marks the delta significant). Two packages registered: `module/pkg/patterns` (M19 PodEvictedDetector budget) and `components/receivers/pyspy` (M18 ParseDump + StackID hash). Regenerate baselines with `make bench-baseline` after a vetted, intentional perf change and commit the diff.
- CI `verify-static` job now runs `make bench-check` after installing a pinned benchstat. Local dev only needs benchstat when intentionally running the perf gate.

Test plan

  • make check (fmt, tidy-check, lint, vet, mod-verify) — clean (re-run post-adversarial-fix)
  • make actionlint — clean
  • make zizmor — clean (No findings to report. Good job!)
  • make license-check — clean
  • make doc-check — clean
  • make bench-check on clean main — exit 0, both packages PASS, repeated runs stay green (false-positive bug fixed)
  • make bench-check with deliberate alloc-regression planted in stackID — exit 2, gate flags +800% B/op + +133% allocs/op
  • make bench-check after revert — exit 0 again, stackid.go diff vs main is empty
  • CI verify-static runs make bench-check green on this PR (awaiting CI)

Self-grading

A — root cause named, fix scoped to missing registration without expanding scope (no new benchmark framework, no rewrite of the benchstat-comparison machinery beyond the sec/op-vs-alloc scoping fix). TDD cycle complete with a deliberate regression that the gate catches. Adversarial self-review found a real false-positive bug (sec/op gating) and fixed it in-PR before merge. Registry extracted to a single source of truth so the two consumers can't drift. CHANGELOG entry written.

Not A+ because: even with sec/op excluded, the gate still depends on benchstat's text-output format remaining stable; a future major version that changes column layout could silently disable the gate (it would skip every row, exit 0, print PASS). A future iteration could parse benchstat's -format=csv output instead, which is its versioned machine-readable interface. Out of scope here; the current text format hasn't changed for years and matches the existing scripts/bench-check.sh shape that already shipped with the k8sevents baseline.

Tri Lam added 3 commits May 31, 2026 13:27
Per #227. The 'bench-check' target had been an empty 'for pkg in ;'
loop since PR-K.2 deleted the in-tree k8sevents receiver (the gate's
sole baseline row). Perf regressions could ship.

Restore the gate by registering two fast, deterministic-allocation
benchmarks against committed baselines:
  - internal/synthesis/patterns (M19 PodEvictedDetector budget)
  - components/receivers/pyspy   (M18 ParseDump + StackID hash)

The 1 GiB bench/overhead/nccl_fr_bench_test.go stays advisory: it
self-asserts on HeapAlloc against NORTHSTARS O2 and is too slow for
CI gating.

The Make → shell indirection (scripts/bench-check-all.sh) exists
because the bench regex contains parentheses that Make-invoked
/bin/sh tokenises as subshell metacharacters.

Signed-off-by: Tri Lam <tri@maydow.com>
Signed-off-by: Tri Lam <tri@maydow.com>

# Conflicts:
#	module/pkg/patterns/testdata/bench-baseline.txt
Adversarial-review pass on the prior commit caught two real issues:

1. The prior commit gated sec/op too, but wall-clock CV on dev
   hardware routinely crosses the 10% threshold from background load
   alone even when benchstat marks the delta significant (p<0.05).
   Reproduced locally: 'make bench-check' false-fired on a +16.5%
   ns/op row immediately after baseline capture, with allocs/op + B/op
   both pristine at 0% CV. Hide sec/op from the gate; alloc-count +
   B/op are hardware-invariant and only move when the code does.

2. scripts/bench-{check-all,baseline}.sh duplicated the package
   registry — DRY violation that would silently desync. Extract to
   scripts/bench-registry.sh sourced by both consumers.

Re-verified TDD plant (alloc regression in stackID): gate still
trips with exit 2 on B/op (+800%) + allocs/op (+133%); plant
reverted; clean tree exits 0.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) May 31, 2026 20:43
@trilamsr trilamsr merged commit 3991646 into main May 31, 2026
18 checks passed
@trilamsr trilamsr deleted the ci/i227-bench-check branch May 31, 2026 20:48
trilamsr pushed a commit that referenced this pull request May 31, 2026
Land both processors (rankjoin + patterndetector) at v0.120 and
keep bench-check scaffolding from #244 alongside the OTel bump.
go mod tidy in module/ collapsed indirects to the v0.120 surface
incl. xprocessor + consumertest.

Signed-off-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

make bench-check is no-op — re-baseline perf gates

1 participant