From b41d212222157ad431dd456846a58c6e2fb99279 Mon Sep 17 00:00:00 2001 From: Tri Lam Date: Sun, 31 May 2026 13:27:55 -0700 Subject: [PATCH 1/2] ci(bench): re-baseline make bench-check + add regression gate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Per #227. The 'bench-check' target had been an empty 'for pkg in ;' loop since PR-K.2 deleted the in-tree k8sevents receiver (the gate's sole baseline row). Perf regressions could ship. Restore the gate by registering two fast, deterministic-allocation benchmarks against committed baselines: - internal/synthesis/patterns (M19 PodEvictedDetector budget) - components/receivers/pyspy (M18 ParseDump + StackID hash) The 1 GiB bench/overhead/nccl_fr_bench_test.go stays advisory: it self-asserts on HeapAlloc against NORTHSTARS O2 and is too slow for CI gating. The Make → shell indirection (scripts/bench-check-all.sh) exists because the bench regex contains parentheses that Make-invoked /bin/sh tokenises as subshell metacharacters. Signed-off-by: Tri Lam --- .github/workflows/ci.yml | 6 ++ CHANGELOG.md | 2 + Makefile | 31 ++++------ .../pyspy/testdata/bench-baseline.txt | 26 +++++++++ .../patterns/testdata/bench-baseline.txt | 16 ++++++ scripts/bench-baseline.sh | 31 ++++++++++ scripts/bench-check-all.sh | 57 +++++++++++++++++++ 7 files changed, 148 insertions(+), 21 deletions(-) create mode 100644 components/receivers/pyspy/testdata/bench-baseline.txt create mode 100644 internal/synthesis/patterns/testdata/bench-baseline.txt create mode 100755 scripts/bench-baseline.sh create mode 100755 scripts/bench-check-all.sh diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml index 6770c89f..7260c30a 100644 --- a/.github/workflows/ci.yml +++ b/.github/workflows/ci.yml @@ -128,6 +128,12 @@ jobs: run: make doc-check - name: build run: make build + - name: install benchstat + run: | + go install golang.org/x/perf/cmd/benchstat@v0.0.0-20260512194132-3cf34090a3db + echo "$(go env GOPATH)/bin" >> "$GITHUB_PATH" + - name: bench-check (perf regression gate) + run: make bench-check # M6: validate every docs/integrations/examples/*.yaml against the # right binary. Tracecore-tagged recipes use the in-tree binary; diff --git a/CHANGELOG.md b/CHANGELOG.md index 4ff7e3ea..ceff91d0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,8 @@ User-visible changes are documented here. Format: [Keep a Changelog](https://kee ## [Unreleased] +**`make bench-check` re-baselined as a real perf-regression gate.** Since PR-K.2 deleted the in-tree k8sevents receiver (whose `testdata/bench-baseline.txt` was the gate's sole row), the target had been a no-op `for pkg in ;` empty loop. The loop now iterates a quoted-shell registry in `scripts/bench-check-all.sh` covering `internal/synthesis/patterns` (`BenchmarkPodEvictedDetector_1kEventWindow`, the M19 detector budget) and `components/receivers/pyspy` (`BenchmarkParseDump` + `BenchmarkStackID`, the M18 join-key hash + faulthandler parser). Each package owns a checked-in `testdata/bench-baseline.txt` captured at `count=10 -benchtime=500ms -benchmem` (benchstat needs ≥6 samples for a meaningful CI band); regenerate with `make bench-baseline` after a vetted, intentional perf change and commit the diff. `scripts/bench-check.sh` (unchanged) parses the `+NN.NN%` deltas from `benchstat baseline new` and exits non-zero on any row past `THRESHOLD` (default 10%, env-overridable). The wire-up keeps in-tree CV-noisy `sec/op` rows from false-firing — only deltas exceeding run-to-run noise reach the gate — while `B/op` + `allocs/op` (hardware-invariant) gate precisely. The 1 GiB `bench/overhead/nccl_fr_bench_test.go` stays advisory (self-asserts on `HeapAlloc` against NORTHSTARS O2; too slow + already gated internally) and is NOT in the registry. CI runs the gate in `verify-static` after a pinned `go install golang.org/x/perf/cmd/benchstat@...` step. The shell-registry indirection (Makefile → `scripts/bench-check-all.sh`) exists because the bench regex `^Benchmark(ParseDump|StackID)$` contains parentheses that the Make-invoked `/bin/sh` tokenises as subshell metacharacters. Closes #227. + **`tracecore --version` now reports the git-describe state of the build commit**, not the hardcoded `builder-config.yaml` `dist.version` value. `make build` resolves the injected version in order: `$TRACECORE_VERSION` (CI / release override) → `git describe --tags --always --match 'v*' --dirty=-dev` → the unmodified `dist.version` fallback (used when git is unavailable, e.g. a tarball extract). The `v*` filter keeps `module/vX.Y.Z` submodule tags from shadowing the binary's release namespace; the `-dev` suffix flags uncommitted working trees so a laptop build is never indistinguishable from a tagged release. The injection sed-rewrites a temporary copy of `builder-config.yaml` and restores the original on exit via a trap, so the on-disk source-tree value stays in lockstep with `Chart.yaml` `appVersion` and `make doc-check`'s parity gate. `.github/workflows/release.yml` pins `TRACECORE_VERSION="$TAG"` for both the goreleaser pre-build matrix and the ko-publish step so the in-image binary's `--version` matches the published tag verbatim regardless of git-history depth. Closes RFC-0013 PR-D follow-up. Pre-alpha. **Distribution-first pivot adopted ([RFC-0013](docs/rfcs/0013-distro-first-pivot.md))** - binary now assembled via the OpenTelemetry Collector Builder (OCB) from upstream + contrib components plus a thin in-repo Go submodule at `module/` (path `github.com/tracecoreai/tracecore/module`) containing only the moat (NCCL FlightRecorder receiver, OTTL processors with windowed semantics, pattern detectors). The M1 in-tree pipeline runtime + factory-based assembly is queued for deletion at v0.1.0 in favor of the OCB-generated boot path; the canonical `clockreceiver` + `stdoutexporter` examples ship for one PR cycle and then exit. Targeting v0.1.0 / v0.2.0 / v0.3.0 release boundaries per RFC-0013 §4. diff --git a/Makefile b/Makefile index 82d37019..2cc09068 100644 --- a/Makefile +++ b/Makefile @@ -2,7 +2,7 @@ .PHONY: help build clean hooks # Test suites -.PHONY: test test-extras test-extras-sustained test-extras-fuzz test-extras-fuzz-kmsg test-extras-fuzz-journald test-extras-fuzz-nccl-fr test-extras-race bench bench-check +.PHONY: test test-extras test-extras-sustained test-extras-fuzz test-extras-fuzz-kmsg test-extras-fuzz-journald test-extras-fuzz-nccl-fr test-extras-race bench bench-check bench-baseline # Format + tidy .PHONY: fmt fmt-fix vet lint lint-fix tidy tidy-check mod-verify @@ -79,26 +79,15 @@ test: ## Run unit tests with the race detector. bench: ## Run benchmarks across the repo with -benchmem, count=5. go test -bench=Benchmark -benchmem -benchtime=500ms -count=5 -run='^$$' ./... -bench-check: ## Run benches; fail if any row regresses >10% vs the per-package baseline under testdata/bench-baseline.txt. THRESHOLD overridable. - @# No per-package baselines remain in tree at this point — the - @# k8sevents baseline was retired in RFC-0013 PR-K.2 alongside the - @# in-tree k8sevents receiver. Future per-package baselines wire - @# back into the `for pkg in ...` loop below. The target stays so - @# downstream automation (`make ci`) keeps a stable invocation. - @status=0; \ - out=$$(mktemp); \ - trap 'rm -f $$out' EXIT; \ - for pkg in ; do \ - baseline=$$pkg/testdata/bench-baseline.txt; \ - if [ ! -f $$baseline ]; then \ - echo "bench-check: missing baseline $$baseline" >&2; \ - status=2; continue; \ - fi; \ - echo "==> $$pkg"; \ - go test -bench=Benchmark -benchmem -benchtime=500ms -count=5 -run='^$$' ./$$pkg/ > $$out 2>&1; \ - scripts/bench-check.sh $$baseline $$out $${THRESHOLD:-10} || status=$$?; \ - done; \ - exit $$status +bench-check: ## Run perf gate: bench each gated pkg vs its committed baseline; fail if any row regresses >THRESHOLD% (default 10). + @# Per-package registry + execution lives in scripts/bench-check-all.sh + @# rather than inline here: the bench regex contains parentheses that + @# Make-invoked /bin/sh tokenises as subshell metacharacters. THRESHOLD + @# env var passes through ("THRESHOLD=20 make bench-check" etc.). + scripts/bench-check-all.sh + +bench-baseline: ## Regenerate every per-package bench-baseline.txt. Commit the diff after vetting it on your hardware. + scripts/bench-baseline.sh fmt: ## Check formatting; fails if any file is not gofumpt-clean. diff --git a/components/receivers/pyspy/testdata/bench-baseline.txt b/components/receivers/pyspy/testdata/bench-baseline.txt new file mode 100644 index 00000000..4dfe3ef4 --- /dev/null +++ b/components/receivers/pyspy/testdata/bench-baseline.txt @@ -0,0 +1,26 @@ +goos: darwin +goarch: arm64 +pkg: github.com/tracecoreai/tracecore/components/receivers/pyspy +cpu: Apple M1 Max +BenchmarkParseDump-10 10000 69886 ns/op 66339 B/op 14 allocs/op +BenchmarkParseDump-10 10000 109513 ns/op 66337 B/op 14 allocs/op +BenchmarkParseDump-10 10000 50370 ns/op 66337 B/op 14 allocs/op +BenchmarkParseDump-10 10000 78885 ns/op 66337 B/op 14 allocs/op +BenchmarkParseDump-10 33966 19766 ns/op 66336 B/op 14 allocs/op +BenchmarkParseDump-10 42043 19763 ns/op 66336 B/op 14 allocs/op +BenchmarkParseDump-10 36619 16521 ns/op 66336 B/op 14 allocs/op +BenchmarkParseDump-10 10000 68161 ns/op 66337 B/op 14 allocs/op +BenchmarkParseDump-10 33739 24151 ns/op 66336 B/op 14 allocs/op +BenchmarkParseDump-10 32496 17291 ns/op 66336 B/op 14 allocs/op +BenchmarkStackID-10 1000000 2463 ns/op 32 B/op 3 allocs/op +BenchmarkStackID-10 1000000 2175 ns/op 32 B/op 3 allocs/op +BenchmarkStackID-10 1000000 2419 ns/op 32 B/op 3 allocs/op +BenchmarkStackID-10 1000000 2151 ns/op 32 B/op 3 allocs/op +BenchmarkStackID-10 788203 2649 ns/op 32 B/op 3 allocs/op +BenchmarkStackID-10 442069 2578 ns/op 32 B/op 3 allocs/op +BenchmarkStackID-10 605858 2137 ns/op 32 B/op 3 allocs/op +BenchmarkStackID-10 1000000 2168 ns/op 32 B/op 3 allocs/op +BenchmarkStackID-10 1000000 1993 ns/op 32 B/op 3 allocs/op +BenchmarkStackID-10 982612 2085 ns/op 32 B/op 3 allocs/op +PASS +ok github.com/tracecoreai/tracecore/components/receivers/pyspy 32.159s diff --git a/internal/synthesis/patterns/testdata/bench-baseline.txt b/internal/synthesis/patterns/testdata/bench-baseline.txt new file mode 100644 index 00000000..653879d3 --- /dev/null +++ b/internal/synthesis/patterns/testdata/bench-baseline.txt @@ -0,0 +1,16 @@ +goos: darwin +goarch: arm64 +pkg: github.com/tracecoreai/tracecore/internal/synthesis/patterns +cpu: Apple M1 Max +BenchmarkPodEvictedDetector_1kEventWindow-10 100 5140860 ns/op 847365 B/op 15636 allocs/op +BenchmarkPodEvictedDetector_1kEventWindow-10 100 6203007 ns/op 847152 B/op 15636 allocs/op +BenchmarkPodEvictedDetector_1kEventWindow-10 370 4990751 ns/op 847097 B/op 15635 allocs/op +BenchmarkPodEvictedDetector_1kEventWindow-10 100 6649042 ns/op 847089 B/op 15635 allocs/op +BenchmarkPodEvictedDetector_1kEventWindow-10 150 4384935 ns/op 846940 B/op 15635 allocs/op +BenchmarkPodEvictedDetector_1kEventWindow-10 100 5087184 ns/op 846956 B/op 15634 allocs/op +BenchmarkPodEvictedDetector_1kEventWindow-10 250 5339458 ns/op 846707 B/op 15634 allocs/op +BenchmarkPodEvictedDetector_1kEventWindow-10 100 5578500 ns/op 846980 B/op 15635 allocs/op +BenchmarkPodEvictedDetector_1kEventWindow-10 153 4252240 ns/op 847009 B/op 15635 allocs/op +BenchmarkPodEvictedDetector_1kEventWindow-10 500 5139900 ns/op 846761 B/op 15634 allocs/op +PASS +ok github.com/tracecoreai/tracecore/internal/synthesis/patterns 13.295s diff --git a/scripts/bench-baseline.sh b/scripts/bench-baseline.sh new file mode 100755 index 00000000..5711f3a7 --- /dev/null +++ b/scripts/bench-baseline.sh @@ -0,0 +1,31 @@ +#!/usr/bin/env bash +# bench-baseline.sh — regenerate every per-package bench baseline +# registered in scripts/bench-check-all.sh. +# +# benchstat needs >= 6 samples for confidence intervals at level 0.95 +# (anything less yields a "need >= 6 samples" warning + ± ∞ band, which +# defeats the noise filter). count=10 gives headroom + matches what +# `make bench-check` runs, so a freshly-regenerated baseline self- +# compares cleanly under `benchstat `. +# +# Run on the hardware you intend to gate on. The committed baseline +# pins B/op + allocs/op (hardware-invariant) precisely; sec/op gating +# only fires when a regression exceeds the run-to-run CV, which is the +# right behavior — alloc-count growth is the load-bearing signal, not +# wall-clock jitter. +set -euo pipefail + +entries=( + "internal/synthesis/patterns|^BenchmarkPodEvictedDetector" + "components/receivers/pyspy|^Benchmark(ParseDump|StackID)\$" +) + +for entry in "${entries[@]}"; do + pkg="${entry%%|*}" + pat="${entry#*|}" + baseline="$pkg/testdata/bench-baseline.txt" + mkdir -p "$pkg/testdata" + echo "==> regenerating $baseline ($pat)" + go test -bench="$pat" -benchmem -benchtime=500ms -count=10 \ + -run='^$' "./$pkg/" > "$baseline" +done diff --git a/scripts/bench-check-all.sh b/scripts/bench-check-all.sh new file mode 100755 index 00000000..b2435fb4 --- /dev/null +++ b/scripts/bench-check-all.sh @@ -0,0 +1,57 @@ +#!/usr/bin/env bash +# bench-check-all.sh — perf gate runner. +# +# Iterates the registry of (package, bench-regex) pairs below and, for +# each, runs the matching benchmarks and compares results against the +# package's frozen `testdata/bench-baseline.txt` via +# scripts/bench-check.sh (benchstat under the hood). +# +# Exit codes: +# 0 every gated row stayed within THRESHOLD (default 10%). +# 1 one or more rows regressed past THRESHOLD. +# 2 a baseline file was missing or benchstat itself was missing. +# +# Why a separate script (not a Makefile loop): the bench regex +# `^Benchmark(ParseDump|StackID)$` contains parentheses, which the +# Make-invoked /bin/sh tokenises as subshell metacharacters. Pinning +# the registry in shell with quoted strings avoids that escape-hell. +# +# Why a registry (not `go test -bench=. ./...`): each row is an +# explicit, reviewed contract — adding a benchmark to the gate is an +# intentional act, not an accident of placing a `func Benchmark…` in +# the tree. Volume benchmarks (e.g. bench/overhead/nccl_fr_bench_test.go +# at 1 GiB/iter with self-asserting HeapAlloc check) stay advisory and +# OUT of this registry; only fast, deterministic-allocation micro- +# benchmarks belong here. +set -euo pipefail + +threshold="${THRESHOLD:-10}" + +# Registry: one entry per gated package. +# pkg — Go import path relative to repo root. +# pattern — `-bench` regex; quote-safe since this is plain shell. +entries=( + "internal/synthesis/patterns|^BenchmarkPodEvictedDetector" + "components/receivers/pyspy|^Benchmark(ParseDump|StackID)\$" +) + +status=0 +out=$(mktemp) +trap 'rm -f "$out"' EXIT + +for entry in "${entries[@]}"; do + pkg="${entry%%|*}" + pat="${entry#*|}" + baseline="$pkg/testdata/bench-baseline.txt" + if [[ ! -f "$baseline" ]]; then + echo "bench-check: missing baseline $baseline" >&2 + status=2 + continue + fi + echo "==> $pkg ($pat)" + go test -bench="$pat" -benchmem -benchtime=500ms -count=10 \ + -run='^$' "./$pkg/" > "$out" 2>&1 + scripts/bench-check.sh "$baseline" "$out" "$threshold" || status=$? +done + +exit $status From 43fcd5ea7073b7a9b24d105d920c39ca183de38a Mon Sep 17 00:00:00 2001 From: Tri Lam Date: Sun, 31 May 2026 13:41:42 -0700 Subject: [PATCH 2/2] ci(bench): gate B/op + allocs/op only; extract registry to shared file MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Adversarial-review pass on the prior commit caught two real issues: 1. The prior commit gated sec/op too, but wall-clock CV on dev hardware routinely crosses the 10% threshold from background load alone even when benchstat marks the delta significant (p<0.05). Reproduced locally: 'make bench-check' false-fired on a +16.5% ns/op row immediately after baseline capture, with allocs/op + B/op both pristine at 0% CV. Hide sec/op from the gate; alloc-count + B/op are hardware-invariant and only move when the code does. 2. scripts/bench-{check-all,baseline}.sh duplicated the package registry — DRY violation that would silently desync. Extract to scripts/bench-registry.sh sourced by both consumers. Re-verified TDD plant (alloc regression in stackID): gate still trips with exit 2 on B/op (+800%) + allocs/op (+133%); plant reverted; clean tree exits 0. Signed-off-by: Tri Lam --- CHANGELOG.md | 2 +- scripts/bench-baseline.sh | 10 ++++------ scripts/bench-check-all.sh | 35 +++++++++-------------------------- scripts/bench-check.sh | 28 ++++++++++++++++++++++++---- scripts/bench-registry.sh | 25 +++++++++++++++++++++++++ 5 files changed, 63 insertions(+), 37 deletions(-) create mode 100644 scripts/bench-registry.sh diff --git a/CHANGELOG.md b/CHANGELOG.md index ceff91d0..0a6411aa 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,7 +4,7 @@ User-visible changes are documented here. Format: [Keep a Changelog](https://kee ## [Unreleased] -**`make bench-check` re-baselined as a real perf-regression gate.** Since PR-K.2 deleted the in-tree k8sevents receiver (whose `testdata/bench-baseline.txt` was the gate's sole row), the target had been a no-op `for pkg in ;` empty loop. The loop now iterates a quoted-shell registry in `scripts/bench-check-all.sh` covering `internal/synthesis/patterns` (`BenchmarkPodEvictedDetector_1kEventWindow`, the M19 detector budget) and `components/receivers/pyspy` (`BenchmarkParseDump` + `BenchmarkStackID`, the M18 join-key hash + faulthandler parser). Each package owns a checked-in `testdata/bench-baseline.txt` captured at `count=10 -benchtime=500ms -benchmem` (benchstat needs ≥6 samples for a meaningful CI band); regenerate with `make bench-baseline` after a vetted, intentional perf change and commit the diff. `scripts/bench-check.sh` (unchanged) parses the `+NN.NN%` deltas from `benchstat baseline new` and exits non-zero on any row past `THRESHOLD` (default 10%, env-overridable). The wire-up keeps in-tree CV-noisy `sec/op` rows from false-firing — only deltas exceeding run-to-run noise reach the gate — while `B/op` + `allocs/op` (hardware-invariant) gate precisely. The 1 GiB `bench/overhead/nccl_fr_bench_test.go` stays advisory (self-asserts on `HeapAlloc` against NORTHSTARS O2; too slow + already gated internally) and is NOT in the registry. CI runs the gate in `verify-static` after a pinned `go install golang.org/x/perf/cmd/benchstat@...` step. The shell-registry indirection (Makefile → `scripts/bench-check-all.sh`) exists because the bench regex `^Benchmark(ParseDump|StackID)$` contains parentheses that the Make-invoked `/bin/sh` tokenises as subshell metacharacters. Closes #227. +**`make bench-check` re-baselined as a real perf-regression gate.** Since PR-K.2 deleted the in-tree k8sevents receiver (whose `testdata/bench-baseline.txt` was the gate's sole row), the target had been a no-op `for pkg in ;` empty loop. The loop now iterates a quoted-shell registry in `scripts/bench-check-all.sh` covering `module/pkg/patterns` (`BenchmarkPodEvictedDetector_1kEventWindow`, the M19 detector budget) and `components/receivers/pyspy` (`BenchmarkParseDump` + `BenchmarkStackID`, the M18 join-key hash + faulthandler parser). Each package owns a checked-in `testdata/bench-baseline.txt` captured at `count=10 -benchtime=500ms -benchmem` (benchstat needs ≥6 samples for a meaningful CI band); regenerate with `make bench-baseline` after a vetted, intentional perf change and commit the diff. `scripts/bench-check.sh` parses the `+NN.NN%` deltas from `benchstat baseline new` for the `B/op` + `allocs/op` tables and exits non-zero on any row past `THRESHOLD` (default 10%, env-overridable); `sec/op` rows are skipped because wall-clock CV on dev hardware and shared CI runners routinely crosses 10% from background load alone, even when benchstat marks the delta statistically significant. Alloc-count + B/op are the hardware-invariant signals that pin to 0% CV across runs and only move when the code does. The 1 GiB `bench/overhead/nccl_fr_bench_test.go` stays advisory (self-asserts on `HeapAlloc` against NORTHSTARS O2; too slow + already gated internally) and is NOT in the registry. CI runs the gate in `verify-static` after a pinned `go install golang.org/x/perf/cmd/benchstat@...` step. The shell-registry indirection (Makefile → `scripts/bench-check-all.sh`) exists because the bench regex `^Benchmark(ParseDump|StackID)$` contains parentheses that the Make-invoked `/bin/sh` tokenises as subshell metacharacters. Closes #227. **`tracecore --version` now reports the git-describe state of the build commit**, not the hardcoded `builder-config.yaml` `dist.version` value. `make build` resolves the injected version in order: `$TRACECORE_VERSION` (CI / release override) → `git describe --tags --always --match 'v*' --dirty=-dev` → the unmodified `dist.version` fallback (used when git is unavailable, e.g. a tarball extract). The `v*` filter keeps `module/vX.Y.Z` submodule tags from shadowing the binary's release namespace; the `-dev` suffix flags uncommitted working trees so a laptop build is never indistinguishable from a tagged release. The injection sed-rewrites a temporary copy of `builder-config.yaml` and restores the original on exit via a trap, so the on-disk source-tree value stays in lockstep with `Chart.yaml` `appVersion` and `make doc-check`'s parity gate. `.github/workflows/release.yml` pins `TRACECORE_VERSION="$TAG"` for both the goreleaser pre-build matrix and the ko-publish step so the in-image binary's `--version` matches the published tag verbatim regardless of git-history depth. Closes RFC-0013 PR-D follow-up. diff --git a/scripts/bench-baseline.sh b/scripts/bench-baseline.sh index 4b888096..675a1300 100755 --- a/scripts/bench-baseline.sh +++ b/scripts/bench-baseline.sh @@ -1,6 +1,6 @@ #!/usr/bin/env bash # bench-baseline.sh — regenerate every per-package bench baseline -# registered in scripts/bench-check-all.sh. +# registered in scripts/bench-registry.sh. # # benchstat needs >= 6 samples for confidence intervals at level 0.95 # (anything less yields a "need >= 6 samples" warning + ± ∞ band, which @@ -15,12 +15,10 @@ # wall-clock jitter. set -euo pipefail -entries=( - "module/pkg/patterns|^BenchmarkPodEvictedDetector" - "components/receivers/pyspy|^Benchmark(ParseDump|StackID)\$" -) +# shellcheck source=scripts/bench-registry.sh +source "$(dirname "$0")/bench-registry.sh" -for entry in "${entries[@]}"; do +for entry in "${bench_entries[@]}"; do pkg="${entry%%|*}" pat="${entry#*|}" baseline="$pkg/testdata/bench-baseline.txt" diff --git a/scripts/bench-check-all.sh b/scripts/bench-check-all.sh index f11cf520..5fe607c7 100755 --- a/scripts/bench-check-all.sh +++ b/scripts/bench-check-all.sh @@ -1,45 +1,28 @@ #!/usr/bin/env bash # bench-check-all.sh — perf gate runner. # -# Iterates the registry of (package, bench-regex) pairs below and, for -# each, runs the matching benchmarks and compares results against the -# package's frozen `testdata/bench-baseline.txt` via -# scripts/bench-check.sh (benchstat under the hood). +# Iterates the registry in scripts/bench-registry.sh and, for each +# (package, bench-regex) entry, runs the matching benchmarks and +# compares results against the package's frozen +# `testdata/bench-baseline.txt` via scripts/bench-check.sh (benchstat +# under the hood). # # Exit codes: # 0 every gated row stayed within THRESHOLD (default 10%). # 1 one or more rows regressed past THRESHOLD. # 2 a baseline file was missing or benchstat itself was missing. -# -# Why a separate script (not a Makefile loop): the bench regex -# `^Benchmark(ParseDump|StackID)$` contains parentheses, which the -# Make-invoked /bin/sh tokenises as subshell metacharacters. Pinning -# the registry in shell with quoted strings avoids that escape-hell. -# -# Why a registry (not `go test -bench=. ./...`): each row is an -# explicit, reviewed contract — adding a benchmark to the gate is an -# intentional act, not an accident of placing a `func Benchmark…` in -# the tree. Volume benchmarks (e.g. bench/overhead/nccl_fr_bench_test.go -# at 1 GiB/iter with self-asserting HeapAlloc check) stay advisory and -# OUT of this registry; only fast, deterministic-allocation micro- -# benchmarks belong here. set -euo pipefail -threshold="${THRESHOLD:-10}" +# shellcheck source=scripts/bench-registry.sh +source "$(dirname "$0")/bench-registry.sh" -# Registry: one entry per gated package. -# pkg — Go import path relative to repo root. -# pattern — `-bench` regex; quote-safe since this is plain shell. -entries=( - "module/pkg/patterns|^BenchmarkPodEvictedDetector" - "components/receivers/pyspy|^Benchmark(ParseDump|StackID)\$" -) +threshold="${THRESHOLD:-10}" status=0 out=$(mktemp) trap 'rm -f "$out"' EXIT -for entry in "${entries[@]}"; do +for entry in "${bench_entries[@]}"; do pkg="${entry%%|*}" pat="${entry#*|}" baseline="$pkg/testdata/bench-baseline.txt" diff --git a/scripts/bench-check.sh b/scripts/bench-check.sh index f2dd5642..486ecd2b 100755 --- a/scripts/bench-check.sh +++ b/scripts/bench-check.sh @@ -37,12 +37,32 @@ trap 'rm -f "$cmp"' EXIT benchstat "$baseline" "$new" | tee "$cmp" -# Parse `+NN.NN%` regressions from benchstat output. The vs-base -# column shows `~` for noise, `+X.YZ%` for slowdowns, `-X.YZ%` for -# speedups. We only care about + with magnitude > threshold. +# Parse `+NN.NN%` regressions from benchstat output, restricted to the +# B/op and allocs/op tables. sec/op rows are deliberately ignored: the +# wall-clock CV on dev hardware (and on shared CI runners) routinely +# crosses the 10% threshold from background-load jitter alone, even +# when benchstat marks the delta as statistically significant — gating +# on it would false-fire on most PRs without catching anything code +# review can act on. Alloc-count and B/op growth are the +# hardware-invariant signals that track real regressions (escape- +# analysis bugs, accidentally-heap-allocated locals, hot-loop string +# concatenations, etc.); they pin to 0% CV across runs and only move +# when the code does. # -# awk extracts {benchmark_name, delta_percent} pairs, then we filter. +# Tracking the active table: benchstat prints an indented header line +# containing only whitespace, box-drawing chars (`│`, U+2502 — multi- +# byte, not portable to match as a literal in awk regex), and the unit +# label (`sec/op` / `B/op` / `allocs/op`) before each comparison table. +# We detect these by `index(line, "")` since the unit substrings +# do not appear in benchmark names (Go disallows `/` in identifiers). +# Header lines start with whitespace; data rows start with an alpha +# benchmark name; the indent prefix is therefore an unambiguous +# disambiguator. Set `gate=1` only inside B/op + allocs/op tables. regressed=$(awk -v thr="$threshold" ' + /^ +/ && index($0, "sec/op") { gate = 0; next } + /^ +/ && index($0, "B/op") { gate = 1; next } + /^ +/ && index($0, "allocs/op") { gate = 1; next } + !gate { next } # Skip header lines that contain column titles (no benchmark name). /^[A-Za-z]/ && /\+[0-9]+(\.[0-9]+)?%/ { # Match either +NN.NN% (current benchstat output) or +NN% diff --git a/scripts/bench-registry.sh b/scripts/bench-registry.sh new file mode 100644 index 00000000..6a57f89f --- /dev/null +++ b/scripts/bench-registry.sh @@ -0,0 +1,25 @@ +# shellcheck shell=bash +# bench-registry.sh — single source of truth for the perf-gate registry. +# +# Sourced by scripts/bench-check-all.sh (gate runner) and +# scripts/bench-baseline.sh (regenerator); not directly executable. +# Defines the `bench_entries` array; each entry is "pkg|regex": +# pkg — Go import path relative to repo root. +# regex — `-bench` regex; quote-safe since both consumers run under +# plain shell, not Make-invoked /bin/sh (which would tokenise +# unescaped parentheses as subshell metacharacters). +# +# Gating criteria for a new entry: +# 1. Benchmark must be deterministic-allocation — B/op + allocs/op are +# hardware-invariant and gate cleanly across CI runner and dev box; +# sec/op is noisy on dev machines and only fires the gate when the +# regression exceeds the run-to-run CV (which is the right behavior). +# 2. Benchmark must run fast (≲30s per package at count=10 × 500ms × +# benches-in-package). Volume benchmarks such as +# bench/overhead/nccl_fr_bench_test.go (1 GiB/iter, self-asserting +# HeapAlloc) stay advisory and OUT of this registry. + +bench_entries=( + "module/pkg/patterns|^BenchmarkPodEvictedDetector" + "components/receivers/pyspy|^Benchmark(ParseDump|StackID)\$" +)