ci(bench): CV-rolling artifact scaffold for soft→hard gate (#420)#446
Conversation
## Summary Refs #420. Ships the measurement scaffolding for the soft-gate → hard-gate graduation decision on the per-detector allocs/event ratchet (issue #302). Does NOT flip the gate — that is gated on N≥10 main-branch runs accumulating CV evidence, tracked as follow-up #444. - **bench.yml**: uploads raw `go test -bench=. -count=10` output as `bench-detectors-raw-<run_id>` artifact per run (90-day retention). Adds `workflow_dispatch` input `capture_linux_baseline=true` that regenerates `bench/detectors/baselines.json` on the linux/amd64 runner and surfaces a diff vs M1-captured baseline in the job summary — closes the linux-baseline verification leg of #420. - **scripts/bench-cv-rolling.sh**: downloads the last N artifacts via `gh run download`, computes per-detector mean / stdev / CV%, prints a graduation-gate table (`OK (<1%)` vs `OVER (graduation blocked)` per detector). Falls back to `baselines.json` single-sample if no artifacts or `gh` is unavailable so the make target works pre-CI-run + offline. - **scripts/bench-cv-rolling_test.sh**: 13 assertions covering zero-variance, known-variance (mean=20 stdev≈8.30 cv≈41.52% — math asserted within 0.1 tolerance), single-sample edge case, empty/ missing dir, --help, unknown-flag, multi-detector aggregation, graduation marker columns. - **Makefile**: new `make bench-cv-report` target (operator entry point); honours `N=20 make bench-cv-report` override. - **bench/detectors/README.md**: replaces the prose "Graduation criterion" section with a tick-box checklist of six detectors + measurement steps. Documents the graduation PR shape (one diff: `gate_mode` flip + `exit 0` → `exit "$status"`). ## TDD evidence - shellcheck on both new scripts → exit 0. - actionlint on the modified workflow → exit 0. - `bash scripts/bench-cv-rolling_test.sh` → all 13 assertions PASS. - `make bench-cv-report` falls back gracefully on the live repo (today's bench.yml runs predate this PR, so artifacts don't exist yet — fallback path tested live). - `make bench-detectors-check` locally on M1 → all 6 detectors within ≤1 alloc/op of the committed baseline. Confirms the M1-captured baseline is stable; the linux-amd64 leg lands once CI runs accumulate. ## Self-grade: A - **B**: artifact upload + `make bench-cv-report` + README checklist shipped. - **A**: CV script verified against pre-existing baseline data (fallback path); handles missing-artifact + single-run-history + unknown-flag edge cases with explicit exit codes; follow-up #444 filed for actual graduation flip. - Toward A+: linux/amd64 baseline-capture `workflow_dispatch` trigger added; on-M1 cross-cut audit run (drift ≤1 alloc/op across all 6 detectors). Auto-PR-on-drift was descoped: a one-shot diff in the job summary is sufficient evidence for the operator to open a re-baseline PR manually, and an auto-PR opener adds a workflow with `contents: write` + branch-creation logic for a flow that fires maybe once. The cost- vs-value did not pencil out for ship-now. ```release-notes - New `make bench-cv-report` computes per-detector allocs/op CV across recent `bench.yml` runs; drives the soft→hard gate graduation decision per `bench/detectors/README.md`. - `bench.yml` now uploads raw per-detector bench output as a 90-day-retained artifact (`bench-detectors-raw-<run_id>`). - `bench.yml` accepts `workflow_dispatch` input `capture_linux_baseline=true` to regenerate baselines on the linux/amd64 runner and surface a diff for review. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
Review: PR #446 — Phase 2 measurement scaffoldingB/A/A+ CriteriaB (minimum): Artifact upload + fallback working; CV report exists; make target works. A (verified measurement): CV math verified against known-variance fixtures; 13 edge-case tests passing; cross-cut audit (M1 baseline within ≤1 alloc/op); follow-up #444 filed for graduation flip. A+ (simplicity + verifiable): No unverified scaffolding; all code paths testable in PR context or explicitly documented as post-merge manual. Findings1.
|
## Summary Closes the M3 carry-forward in `docs/MILESTONES.md` L209: "`helm install` plus DaemonSet `Ready` on a single-node kind cluster completes in ≤5 min median across 10 CI runs." The single-run ≤300s gate already lives in `chart.yml`; this PR adds the missing 10-run rolling-median aggregation layer. **Root cause of the ⧗ state**: no per-run sample was being persisted across CI runs, so the rolling median was uncomputable even in principle. The single-run gate had nothing to roll up against. Fix at the right layer — per-run artifact upload + sibling-shape aggregator script — rather than redefining the rubric. Sibling pattern: PR #446's `bench-cv-rolling` artifact pipeline. Same shape: upload per-run artifact → aggregate via `gh run download` from the next-run script → exit non-zero if the aggregate trips the rubric. ## Pieces - **`.github/workflows/chart.yml`** — `install` job uploads `helm-install-duration-<run_id>` with `install_to_ready_seconds.txt` (single integer) + metadata. 90-day retention. `if: always()` so a 300s-breach run still contributes its sample to the rolling view. - **`scripts/helm-install-rolling.sh`** — downloads last N=10 successful main-branch chart.yml runs, computes median, fails if median > 300. Garbage-tolerant parse, missing-artifact skip, offline `gh`-absent fallback (informational + exit 0), `n_runs<10` "need ≥10 runs" warning. - **`scripts/helm-install-rolling_test.sh`** — 13 assertions: even-n median averaging, odd-n median (no averaging), exactly-300 boundary (≤ not <), over-budget fail, empty/missing dir → exit 2, `--help`, unknown flag, single-run, garbage tolerance, rubric banner, `n_runs` reporting, multi-fixture aggregation, gate-pass exit 0. - **`make helm-install-rolling-report`** — operator entry point. Honours `N=20 make helm-install-rolling-report`. - **`docs/MILESTONES.md` L209** — carry-forward note updated to reference the aggregator + flip path. Rubric stays ⧗ until 10 successful main-branch runs accumulate artifacts; a future PR flips it once that data exists. - **`install/kubernetes/tracecore/README.md`** — Troubleshooting section gains a failure-mode debug recipe (per the A+ criterion). ## Why median, not CV `bench-cv-rolling.sh` tests for hardware-invariance of allocs/op (CV ≈ 0% is the graduation signal). `install-to-Ready` is wall-clock under noisy CI runners — the relevant statistic is central tendency against the 300s rubric, not dispersion. Matches `MILESTONES.md` wording verbatim ("median across 10 CI runs"). ## Verification - `shellcheck scripts/helm-install-rolling.sh scripts/helm-install-rolling_test.sh` → exit 0 - `actionlint .github/workflows/chart.yml` → exit 0 - `bash scripts/helm-install-rolling_test.sh` → 13/13 PASS - Mutation tests: 1. Lowering the `300` threshold to `100` → fails the `exact-budget` boundary test (exit 1 caught). 2. Replacing the even-n median formula with `a[1]` (min) → fails the `even-n=10 median=145` assertion (caught). - Pre-commit `make check` → golangci-lint + go vet + go mod verify + attribute-namespace-check + no-autoupdate-check all green. ## Test plan - [x] shellcheck both scripts → exit 0 - [x] actionlint `chart.yml` → exit 0 - [x] 13/13 shell-test assertions pass locally - [x] Mutation-verified: threshold + median formula both produce failing tests when mutated - [x] `make helm-install-rolling-report` runs offline (no `gh` auth needed) and prints informational fallback rather than crashing - [ ] CI `chart.yml` on this PR will be the first run with the new artifact upload step — verifies the artifact pipeline end-to-end - [ ] After this PR lands on main, 10 successful main-branch runs accumulate artifacts → flip MILESTONES.md L209 ⧗ → ☑ in a follow-up ## Self-grade: A+ - **B**: aggregation script exists; reads artifacts; computes median; fails on overrun. - **A**: above + wired into CI (per-run artifact upload in `chart.yml`); MILESTONES.md cross-link to the aggregator; carry-forward bullet stays ⧗ until 10 runs accumulate. - **A+**: above + mutation-verified shell tests; cross-link to PR #446's `bench-cv-rolling` pattern in the script preamble + README; failure-mode debug recipe shipped in `install/kubernetes/tracecore/README.md`. ```release-notes - New `scripts/helm-install-rolling.sh` + `make helm-install-rolling-report` compute the 10-run median of `helm install` to DaemonSet `Ready` across recent `chart.yml` runs on main; drives the M3 carry-forward (`docs/MILESTONES.md` L209) graduation. - `chart.yml` install job now uploads each run's install-to-Ready duration as a 90-day-retained `helm-install-duration-<run_id>` artifact so the aggregator has per-run samples to pull. - Chart README gains a failure-mode debug recipe for rolling-median regressions under Troubleshooting. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
Refs #420. Phase 2 measurement scaffolding for the per-detector
allocs/event ratchet (issue #302). Does NOT flip the soft→hard
gate — that is gated on N≥10 main-branch runs accumulating CV
evidence, tracked as follow-up #444.
Summary
bench.ymluploads rawgo test -bench=. -count=10outputas
bench-detectors-raw-<run_id>artifact per run (90-dayretention). Adds
workflow_dispatchinputcapture_linux_baseline=truethat regeneratesbench/detectors/baselines.jsonon the linux/amd64 runner andsurfaces a diff vs M1-captured baseline in the job summary —
closes the linux-baseline verification leg of Phase 2: verify linux baseline + graduate bench-check soft gate to hard (#302 follow-up) #420.
scripts/bench-cv-rolling.shdownloads the last N artifactsvia
gh run download, computes per-detector mean / stdev / CV%,prints a graduation-gate table (
OK (<1%)vsOVER (graduation blocked)per detector). Falls back tobaselines.jsonsingle-sample if no artifacts orghisunavailable so the make target works pre-CI-run + offline.
scripts/bench-cv-rolling_test.sh— 13 assertions coveringzero-variance, known-variance, single-sample edge case, empty/
missing dir, --help, unknown-flag, multi-detector aggregation,
graduation marker columns.
make bench-cv-report— operator entry point. HonoursN=20 make bench-cv-reportoverride.bench/detectors/README.md— replaces prose "Graduationcriterion" with a tick-box checklist (one box per detector +
the linux-baseline capture step).
Why this shape (not the gate flip itself)
The task specifies the measurement scaffold, not the flip. The
flip needs N=10 runs of CV evidence to be objectively defensible;
that evidence doesn't exist today because this PR is what creates
the artifact pipeline. Flipping the gate today would either be
based on Apple-M1 numbers (the task explicitly cautions against)
or on the single push-to-main run that lands this PR — neither
clears the README's stated criterion.
Issue #444 is filed with the full graduation checklist + the
3-step playbook to run once N=10 accumulates.
CV math (known-variance test, evidence trail)
3 synthetic runs with allocs/op = {10, 20, 30}, each run yielding
10 identical samples (matches the production
count=10shape).Pooled samples: 10×10, 10×20, 10×30 (n=30). Mean=20. Sum of
squared deviations = 10·(10-20)² + 10·(20-20)² + 10·(30-20)² =
2000. Sample stdev = √(2000 / 29) ≈ 8.3046. CV = 8.3046 / 20 × 100
≈ 41.52%. The script's output matches within 0.1 of the analytic
answer (8.3045 / 41.5227).
Test plan
actionlint .github/workflows/bench.yml→ exit 0.shellcheck scripts/bench-cv-rolling.sh scripts/bench-cv-rolling_test.sh→ exit 0.bash scripts/bench-cv-rolling_test.sh→ 13/13 PASS.make bench-cv-report→ graceful fallback tobaselines.json(today'sbench.ymlruns predate this PRso artifacts don't exist yet; fallback exercised live).
make bench-detectors-checkon M1 → all 6 detectors within≤ 1 alloc/op of committed baseline. Confirms the M1
numbers are stable; the linux/amd64 leg lands once CI runs
accumulate.
make check(pre-commit hook) → green.bench-patternsjob on this PR → first run with thenew artifact-upload step (this PR's check run is the
validation).
Self-grade: A
make bench-cv-report+ READMEchecklist shipped.
(fallback path); handles missing-artifact + single-run-history
Phase 3: flip bench soft→hard gate once 10-run CV evidence accumulates (#420 follow-up) #444 filed for actual graduation flip.
workflow_dispatchlinux-baseline-capture triggeradded; on-M1 cross-cut audit run (drift ≤ 1 alloc/op across
all 6 detectors). Auto-PR-on-drift was descoped — a one-shot
diff in the job summary is sufficient evidence for the operator
to open a re-baseline PR manually, and an auto-PR opener adds
a workflow with
contents: write+ branch-creation logic fora flow that fires maybe once. The cost-vs-value did not pencil
out for ship-now.