ci(bench): re-baseline make bench-check + add regression gate by trilamsr · Pull Request #244 · TraceCoreAI/tracecore

trilamsr · 2026-05-31T20:29:38Z

Closes #227.

Summary

make bench-check has been a no-op since PR-K.2 (#217) deleted components/receivers/k8sevents/ — its testdata/bench-baseline.txt was the sole row in the gate's for pkg in … loop, leaving the loop body literally empty (for pkg in ; do …). The target stayed as a stub so make ci automation kept a stable invocation, with a TODO to re-register packages later. That re-registration is this PR.

Root cause

Makefile:71-101 (pre-diff) iterated an empty list. The wrapper around benchstat+scripts/bench-check.sh was intact; only the registry of which packages to gate had been emptied. So the fix is purely a registration + baseline-capture exercise — no logic change to the gate's comparison machinery, plus one scope-tightening change called out under "Adversarial-review fix" below.

What this PR does

Registers two packages with deterministic-allocation micro-benchmarks for gating:
- module/pkg/patterns — BenchmarkPodEvictedDetector_1kEventWindow (the M19 detector budget). (Path is post-PR-I.2a / refactor(module): relocate patterns + replay into module/pkg/ (PR-I.2a) #242; this branch was rebased onto that merge during the PR.)
- components/receivers/pyspy — BenchmarkParseDump + BenchmarkStackID (the M18 join-key hash + faulthandler parser).
Captures committed baselines at module/pkg/patterns/testdata/bench-baseline.txt and components/receivers/pyspy/testdata/bench-baseline.txt, both at count=10 -benchtime=500ms -benchmem. count=10 is the minimum that survives benchstat's ≥6 samples for confidence interval at level 0.95 warning with headroom for one outlier.
Extracts the registry to scripts/bench-registry.sh sourced by both scripts/bench-check-all.sh (gate runner, wired to make bench-check) and scripts/bench-baseline.sh (regenerator, wired to make bench-baseline). Single source of truth — drift between "what we gate" and "what we regenerate" can no longer happen.

Why scripts and not an inline Make loop: the bench regex ^Benchmark(ParseDump|StackID)$ contains parentheses that the Make-invoked /bin/sh tokenises as subshell metacharacters. Quoting under make shellquote escaping is more fragile than just punting to plain shell.
Restricts the gate to B/op + allocs/op tables in scripts/bench-check.sh; sec/op is no longer gated. See "Adversarial-review fix" below.
Wires the gate into CI as a new step in .github/workflows/ci.yml's verify-static job, after make build. A pinned go install golang.org/x/perf/cmd/benchstat@… step installs the tool on the runner.

Adversarial-review fix (mid-PR)

The first commit gated sec/op too, with the design note claiming pure wall-clock jitter would show as benchstat's ~ (non-significant) and never reach the +NN% parser. That's false. Repeating make bench-check immediately after a baseline capture, with identical code on identical hardware, fired:

PodEvictedDetector_1kEventWindow-10  sec/op  +16.50% (p=0.001 n=10)
PodEvictedDetector_1kEventWindow-10  B/op    +0.01% (p=0.005 n=10)
PodEvictedDetector_1kEventWindow-10  allocs/op  ~ (p=1.000 n=10)

The 16.5% delta was background-load drift between two runs minutes apart. benchstat marks it significant because the CV is small (~5-10%) and the means are reliably different — both true symptoms of real but uninteresting wall-clock variance. Gating on sec/op would produce continuous false fires across the team without catching anything code review can act on.

Fix: scripts/bench-check.sh now tracks the active benchstat table by indented header line (sec/op / B/op / allocs/op) and only counts +NN% deltas under B/op + allocs/op. Both stay pinned to 0% CV across runs (deterministic Go allocator + identical bytes-allocated per code path) and only move when the code does. make bench-check on clean main now passes consistently across runs; the TDD planted regression (heap-allocating sink in stackID) still trips the gate at +800% B/op + +133% allocs/op.

Design notes

Why these two benches and not bench/overhead/nccl_fr_bench_test.go: that benchmark replays 1 GiB of fixtures per iter (~90s on M1 Max) and already self-asserts on HeapAlloc delta against the NORTHSTARS O2 ceiling. Adding it to bench-check would push CI past the budget and produce nothing the in-bench assertion doesn't already gate. Stays advisory.
Hardware skew is irrelevant for B/op + allocs/op: baselines were captured on Apple M1 Max; CI runs on ubuntu-latest (x86_64). The Go allocator is deterministic across GOARCH for a given code path — b.ReportAllocs() returns identical numbers on both. The historical k8sevents baseline lived under the same cross-arch arrangement.
make ci unchanged: bench-check is NOT added to the make ci recipe. That would force every contributor to install benchstat. CI installs benchstat itself in the workflow; local dev runs make bench-check only when intentionally vetting a perf change.
module/ submodule resolution: module/pkg/patterns lives in the in-repo Go submodule (module/go.mod); local dev runs resolve through go.work and OCB through builder-config.yaml's replaces: ./module. go test ./module/pkg/patterns/… works from the repo root because of the workspace.

TDD record

Define gate: scripts/bench-check-all.sh + bench-registry.sh register two packages; threshold 10%, env-overridable; only B/op + allocs/op rows count.
Red (no plant): make bench-check on clean main passes — exit 0, both packages report PASS: no benchmarks regressed by more than 10% vs baseline. (Re-run multiple times across different machine load — stays green.)
Red→Green (planted regression): edited stackID to allocate 64 B/iter via a package-level sink (escape-analysis-routed to the heap, so allocs/op actually moved); make bench-check exit 2 (make wraps the inner exit 1). Output flagged StackID-10 +800.00% (p=0.000 n=10) on B/op and +133.33% (p=0.000 n=10) on allocs/op.
Green (revert): plant removed; make bench-check exit 0; git diff main -- components/receivers/pyspy/stackid.go empty.

Release notes

### Added
- `make bench-check` is now a real perf-regression gate again (was a no-op since PR-K.2). Registry-driven via `scripts/bench-registry.sh` (single source of truth, sourced by both `bench-check-all.sh` runner and `bench-baseline.sh` regenerator); the runner gates `B/op` + `allocs/op` deltas against a committed `testdata/bench-baseline.txt` via benchstat; any row regressing past `THRESHOLD%` (default 10, env-overridable) fails. `sec/op` is deliberately not gated (wall-clock CV on dev hardware and shared CI runners routinely crosses 10% from background load alone, even when benchstat marks the delta significant). Two packages registered: `module/pkg/patterns` (M19 PodEvictedDetector budget) and `components/receivers/pyspy` (M18 ParseDump + StackID hash). Regenerate baselines with `make bench-baseline` after a vetted, intentional perf change and commit the diff.
- CI `verify-static` job now runs `make bench-check` after installing a pinned benchstat. Local dev only needs benchstat when intentionally running the perf gate.

Test plan

make check (fmt, tidy-check, lint, vet, mod-verify) — clean (re-run post-adversarial-fix)
make actionlint — clean
make zizmor — clean (No findings to report. Good job!)
make license-check — clean
make doc-check — clean
make bench-check on clean main — exit 0, both packages PASS, repeated runs stay green (false-positive bug fixed)
make bench-check with deliberate alloc-regression planted in stackID — exit 2, gate flags +800% B/op + +133% allocs/op
make bench-check after revert — exit 0 again, stackid.go diff vs main is empty
CI verify-static runs make bench-check green on this PR (awaiting CI)

Self-grading

A — root cause named, fix scoped to missing registration without expanding scope (no new benchmark framework, no rewrite of the benchstat-comparison machinery beyond the sec/op-vs-alloc scoping fix). TDD cycle complete with a deliberate regression that the gate catches. Adversarial self-review found a real false-positive bug (sec/op gating) and fixed it in-PR before merge. Registry extracted to a single source of truth so the two consumers can't drift. CHANGELOG entry written.

Not A+ because: even with sec/op excluded, the gate still depends on benchstat's text-output format remaining stable; a future major version that changes column layout could silently disable the gate (it would skip every row, exit 0, print PASS). A future iteration could parse benchstat's -format=csv output instead, which is its versioned machine-readable interface. Out of scope here; the current text format hasn't changed for years and matches the existing scripts/bench-check.sh shape that already shipped with the k8sevents baseline.

Per #227. The 'bench-check' target had been an empty 'for pkg in ;' loop since PR-K.2 deleted the in-tree k8sevents receiver (the gate's sole baseline row). Perf regressions could ship. Restore the gate by registering two fast, deterministic-allocation benchmarks against committed baselines: - internal/synthesis/patterns (M19 PodEvictedDetector budget) - components/receivers/pyspy (M18 ParseDump + StackID hash) The 1 GiB bench/overhead/nccl_fr_bench_test.go stays advisory: it self-asserts on HeapAlloc against NORTHSTARS O2 and is too slow for CI gating. The Make → shell indirection (scripts/bench-check-all.sh) exists because the bench regex contains parentheses that Make-invoked /bin/sh tokenises as subshell metacharacters. Signed-off-by: Tri Lam <tri@maydow.com>

Signed-off-by: Tri Lam <tri@maydow.com> # Conflicts: # module/pkg/patterns/testdata/bench-baseline.txt

Adversarial-review pass on the prior commit caught two real issues: 1. The prior commit gated sec/op too, but wall-clock CV on dev hardware routinely crosses the 10% threshold from background load alone even when benchstat marks the delta significant (p<0.05). Reproduced locally: 'make bench-check' false-fired on a +16.5% ns/op row immediately after baseline capture, with allocs/op + B/op both pristine at 0% CV. Hide sec/op from the gate; alloc-count + B/op are hardware-invariant and only move when the code does. 2. scripts/bench-{check-all,baseline}.sh duplicated the package registry — DRY violation that would silently desync. Extract to scripts/bench-registry.sh sourced by both consumers. Re-verified TDD plant (alloc regression in stackID): gate still trips with exit 2 on B/op (+800%) + allocs/op (+133%); plant reverted; clean tree exits 0. Signed-off-by: Tri Lam <tri@maydow.com>

Land both processors (rankjoin + patterndetector) at v0.120 and keep bench-check scaffolding from #244 alongside the OTel bump. go mod tidy in module/ collapsed indirects to the v0.120 surface incl. xprocessor + consumertest. Signed-off-by: Tri Lam <tri@maydow.com>

Tri Lam added 3 commits May 31, 2026 13:27

Merge remote-tracking branch 'origin/main' into ci/i227-bench-check

ebbd3cc

Signed-off-by: Tri Lam <tri@maydow.com> # Conflicts: # module/pkg/patterns/testdata/bench-baseline.txt

trilamsr enabled auto-merge (squash) May 31, 2026 20:43

trilamsr merged commit 3991646 into main May 31, 2026
18 checks passed

trilamsr deleted the ci/i227-bench-check branch May 31, 2026 20:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci(bench): re-baseline make bench-check + add regression gate#244

ci(bench): re-baseline make bench-check + add regression gate#244
trilamsr merged 3 commits into
mainfrom
ci/i227-bench-check

trilamsr commented May 31, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

trilamsr commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

What this PR does

Adversarial-review fix (mid-PR)

Design notes

TDD record

Release notes

Test plan

Self-grading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

trilamsr commented May 31, 2026 •

edited

Loading