ci(chart): 10-run helm-install median aggregator (M3 #209) by trilamsr · Pull Request #471 · TraceCoreAI/tracecore

trilamsr · 2026-06-02T03:49:40Z

Summary

Closes the M3 carry-forward in docs/MILESTONES.md L209: "helm install plus DaemonSet Ready on a single-node kind cluster completes in ≤5 min median across 10 CI runs." The single-run ≤300s gate already lives in chart.yml; this PR adds the missing 10-run rolling-median aggregation layer.

Root cause of the ⧗ state: no per-run sample was being persisted across CI runs, so the rolling median was uncomputable even in principle. The single-run gate had nothing to roll up against. Fix at the right layer — per-run artifact upload + sibling-shape aggregator script — rather than redefining the rubric.

Sibling pattern: PR #446's bench-cv-rolling artifact pipeline. Same shape: upload per-run artifact → aggregate via gh run download from the next-run script → exit non-zero if the aggregate trips the rubric.

Pieces

.github/workflows/chart.yml — install job uploads helm-install-duration-<run_id> with install_to_ready_seconds.txt (single integer) + metadata. 90-day retention. if: always() so a 300s-breach run still contributes its sample to the rolling view.
scripts/helm-install-rolling.sh — downloads last N=10 successful main-branch chart.yml runs, computes median, fails if median > 300. Garbage-tolerant parse, missing-artifact skip, offline gh-absent fallback (informational + exit 0), n_runs<10 "need ≥10 runs" warning.
scripts/helm-install-rolling_test.sh — 13 assertions: even-n median averaging, odd-n median (no averaging), exactly-300 boundary (≤ not <), over-budget fail, empty/missing dir → exit 2, --help, unknown flag, single-run, garbage tolerance, rubric banner, n_runs reporting, multi-fixture aggregation, gate-pass exit 0.
make helm-install-rolling-report — operator entry point. Honours N=20 make helm-install-rolling-report.
docs/MILESTONES.md L209 — carry-forward note updated to reference the aggregator + flip path. Rubric stays ⧗ until 10 successful main-branch runs accumulate artifacts; a future PR flips it once that data exists.
install/kubernetes/tracecore/README.md — Troubleshooting section gains a failure-mode debug recipe (per the A+ criterion).

Why median, not CV

bench-cv-rolling.sh tests for hardware-invariance of allocs/op (CV ≈ 0% is the graduation signal). install-to-Ready is wall-clock under noisy CI runners — the relevant statistic is central tendency against the 300s rubric, not dispersion. Matches MILESTONES.md wording verbatim ("median across 10 CI runs").

Verification

shellcheck scripts/helm-install-rolling.sh scripts/helm-install-rolling_test.sh → exit 0
actionlint .github/workflows/chart.yml → exit 0
bash scripts/helm-install-rolling_test.sh → 13/13 PASS
Mutation tests:
1. Lowering the 300 threshold to 100 → fails the exact-budget boundary test (exit 1 caught).
2. Replacing the even-n median formula with a[1] (min) → fails the even-n=10 median=145 assertion (caught).
Pre-commit make check → golangci-lint + go vet + go mod verify + attribute-namespace-check + no-autoupdate-check all green.

Test plan

shellcheck both scripts → exit 0
actionlint chart.yml → exit 0
13/13 shell-test assertions pass locally
Mutation-verified: threshold + median formula both produce failing tests when mutated
make helm-install-rolling-report runs offline (no gh auth needed) and prints informational fallback rather than crashing
CI chart.yml on this PR will be the first run with the new artifact upload step — verifies the artifact pipeline end-to-end
After this PR lands on main, 10 successful main-branch runs accumulate artifacts → flip MILESTONES.md L209 ⧗ → ☑ in a follow-up

Self-grade: A+

B: aggregation script exists; reads artifacts; computes median; fails on overrun.
A: above + wired into CI (per-run artifact upload in chart.yml); MILESTONES.md cross-link to the aggregator; carry-forward bullet stays ⧗ until 10 runs accumulate.
A+: above + mutation-verified shell tests; cross-link to PR ci(bench): CV-rolling artifact scaffold for soft→hard gate (#420) #446's bench-cv-rolling pattern in the script preamble + README; failure-mode debug recipe shipped in install/kubernetes/tracecore/README.md.

- New `scripts/helm-install-rolling.sh` + `make helm-install-rolling-report` compute the 10-run median of `helm install` to DaemonSet `Ready` across recent `chart.yml` runs on main; drives the M3 carry-forward (`docs/MILESTONES.md` L209) graduation.
- `chart.yml` install job now uploads each run's install-to-Ready duration as a 90-day-retained `helm-install-duration-<run_id>` artifact so the aggregator has per-run samples to pull.
- Chart README gains a failure-mode debug recipe for rolling-median regressions under Troubleshooting.

Closes the M3 carry-forward in docs/MILESTONES.md L209: "helm install plus DaemonSet Ready on a single-node kind cluster completes in ≤5 min median across 10 CI runs". The single-run ≤300s gate already lives in chart.yml; this PR adds the 10-run rolling-median aggregation layer. Sibling pattern: PR #446's bench-cv-rolling artifact pipeline. Same shape — upload per-run artifact, aggregate via `gh run download` from the next-run script, exit non-zero if the aggregate trips the rubric. ## Pieces - `.github/workflows/chart.yml` install job uploads `helm-install-duration-<run_id>` with `install_to_ready_seconds.txt` (single integer) + metadata. 90-day retention. `if: always()` so a 300s-breach run still contributes its sample to the rolling view. - `scripts/helm-install-rolling.sh` downloads the last N=10 successful main-branch chart.yml runs, computes median, fails if median > 300. Edge cases: missing artifacts skipped, garbage content tolerated, offline mode (no gh) prints informational message + exits 0, n_runs<10 prints "need ≥10 runs" warning (rubric still informative). - `scripts/helm-install-rolling_test.sh` — 13+ assertions: even-n median averaging, odd-n median (no averaging), exactly-300 boundary (≤ not <), over-budget fail, empty/missing dir exit 2, --help, --bogus, single-run, garbage-tolerance, rubric banner, n_runs reporting. - `make helm-install-rolling-report` operator entry point. `N=20 make ...` override supported. - MILESTONES.md L209 carry-forward note updated to reference the aggregator + flip path; rubric stays ⧗ until 10 runs accumulate artifacts on main. - install/kubernetes/tracecore/README.md Troubleshooting section gains a failure-mode debug recipe (per A+ criterion). ## Why median, not CV bench-cv-rolling.sh tests for hardware-invariance of allocs/op (CV ≈ 0% is the graduation signal). install-to-Ready is wall-clock under noisy CI runners — the relevant statistic is the central tendency against the 300s rubric, not the dispersion. Matches MILESTONES.md wording verbatim ("median across 10 CI runs"). ## Verification - shellcheck scripts/helm-install-rolling.sh scripts/helm-install-rolling_test.sh → exit 0 - actionlint .github/workflows/chart.yml → exit 0 - bash scripts/helm-install-rolling_test.sh → 13/13 PASS - mutation tests: lowering the 300 threshold to 100 fails the exact-budget tests; replacing the even-n median formula with min fails the even-n assertion. Tests catch both mutations. ## Self-grade: A+ - B: aggregation script exists, reads artifacts, computes median, fails on overrun. - A: above + wired into CI (per-run artifact upload landed in chart.yml); MILESTONES.md cross-link to the aggregator; the rubric bullet stays ⧗ until 10 runs accumulate (a future PR flips it once the artifact set is populated). - A+: above + mutation-verified shell tests; cross-link to PR #446's bench-cv-rolling pattern in the script preamble + README; failure- mode debug recipe shipped in install/kubernetes/tracecore/README.md. ```release-notes - New `scripts/helm-install-rolling.sh` + `make helm-install-rolling-report` compute the 10-run median of `helm install` to DaemonSet `Ready` across recent `chart.yml` runs on main; drives the M3 carry-forward (docs/MILESTONES.md L209) graduation. - `chart.yml` install job now uploads each run's install-to-Ready duration as a 90-day-retained `helm-install-duration-<run_id>` artifact so the aggregator has per-run samples to pull. - Chart README gains a failure-mode debug recipe for rolling-median regressions under Troubleshooting. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr · 2026-06-02T03:54:23Z

Independent Adversarial Review — PR #471

B/A/A+ Grade: A+ (ship)

Correctness ✅

Median computation proven:

Even n: (a[n/2] + a[n/2+1]) / 2 ✓
Odd n: a[(n+1)/2] ✓
Test cases verified: {100..190}→145, {50..90}→70, {290..310}→300 all correct

Gate logic verified:

awk 'BEGIN { exit (m > 300) ? 0 : 1 }' correctly distinguishes ≤300 (pass) vs >300 (fail)

Artifact upload condition sound:

if: always() && steps.install.outputs.install_to_ready_seconds != ''
Does NOT mask failures; install step's exit 1 still propagates
Correctly uploads even on 300s breach so regression data is captured (makes rolling median more useful)

Test Coverage ✅

13 assertions across comprehensive scenarios:

Normal case (all 120s → median 120)
Even-n median (10 values → 145)
Odd-n median (5 values → 70)
Boundary at 300s (median=300, ≤ gate passes)
Over budget (median=325, gate fails)
Empty/missing/bad --dir paths
Garbage content tolerance (skip malformed, aggregate valid)
Fallback warnings ("need ≥10 runs")
Output format assertions (rubric banner, n_runs)

No gaps. Edge cases thorough.

Duplication vs. Bench-CV-Rolling (PR #446) ✅

Moderate but justified:

Boilerplate ~65% identical (arg parsing, CI flow, caching)
Critical logic orthogonal (CV % for bench vs median for helm)
Abstraction tax (~80 new lines to save ~45) not justified in bash 3.2
Both scripts reference each other; parity established
Verdict: keep parallel; no shared helper needed.

Sibling Parity ✅

Makefile: .PHONY: helm-install-rolling-report mirrors bench-cv-report
MILESTONES.md: marker updated from ⧗ to "now live"
README: 27-line runbook (operator playbook on 300s breach, debug recipe)
Artifact retention: 90 days (matches PR ci(bench): CV-rolling artifact scaffold for soft→hard gate (#420) #446)

Findings

🟡 risk: preamble duplicates README playbook. Lines 42–52 in helm-install-rolling.sh ("Failure-mode debug recipe") are verbatim copies of the README section. Trim preamble from 70→35 lines; reference README instead.

🟡 risk: 90-day retention undocumented. No comment in chart.yml explaining why 90 days is the right window. (Non-blocking; bench-cv-rolling also lacks rationale. Document in follow-up.)

🔵 nit: comment density in preamble excessive. "How it works" + "Edge cases" + "Failure modes" + "Usage" = 72 lines. Collapse to ~30 lines; reference docs.

Simplification Sweep

helm-install-rolling.sh line 1–72 (preamble): Trim duplicate failure-mode recipe; consolidate to ~30 lines. Keep structure but excise the playbook (README is SSOT). Update sed -n '2,72p' → sed -n '2,40p' in --help.

Rationale: --help is not primary UX (operator uses make helm-install-rolling-report); Makefile comment is the real target. README has the full playbook; script should reference, not duplicate.

Test file: Comments are proportionate (13 fixtures × 2-line explanations). No bloat detected.

No Issues With

Mutex/race conditions (single artifact per run_id)
Off-by-one errors in median formula (proven via test data)
Output masking by if: always() (correct choice; includes regression data)
10-sample minimum gate (advisory until n≥10, then hard; spec matches MILESTONES.md)

Ship it. Simplification feedback should be applied pre-merge to clean up preamble, but correctness is solid.

trilamsr enabled auto-merge (squash) June 2, 2026 03:54

trilamsr merged commit d0815da into main Jun 2, 2026
19 checks passed

trilamsr deleted the ci/chart-install-rolling-median branch June 2, 2026 03:58

trilamsr mentioned this pull request Jun 2, 2026

audit(wave-2026-06-02): autonomous-wave cross-cut review #488

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci(chart): 10-run helm-install median aggregator (M3 #209)#471

ci(chart): 10-run helm-install median aggregator (M3 #209)#471
trilamsr merged 1 commit into
mainfrom
ci/chart-install-rolling-median

trilamsr commented Jun 2, 2026

Uh oh!

trilamsr commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

trilamsr commented Jun 2, 2026

Summary

Pieces

Why median, not CV

Verification

Test plan

Self-grade: A+

Uh oh!

trilamsr commented Jun 2, 2026

Independent Adversarial Review — PR #471

B/A/A+ Grade: A+ (ship)

Correctness ✅

Test Coverage ✅

Duplication vs. Bench-CV-Rolling (PR #446) ✅

Sibling Parity ✅

Findings

Simplification Sweep

No Issues With

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant