Skip to content

[ci] kernelevents SLI test: warmup + alert tier + sli emit helper#76

Merged
trilamsr merged 7 commits into
mainfrom
ci/sli-test-warmup-and-deflake
May 19, 2026
Merged

[ci] kernelevents SLI test: warmup + alert tier + sli emit helper#76
trilamsr merged 7 commits into
mainfrom
ci/sli-test-warmup-and-deflake

Conversation

@trilamsr

@trilamsr trilamsr commented May 19, 2026

Copy link
Copy Markdown
Contributor

Summary

Deflakes TestReceiver_SLIBudget in components/receivers/kernelevents/bench_test.go (observed a 539ms spike under -race from GH Actions runner contention; local baseline is 10–18ms). Ships four substantive changes:

  • Warmup phase. First 500 events of a 5500-event fixture have their latencies discarded before p99 is computed over the remaining 5000-event measurement window. Absorbs GC, race-detector metadata, and scheduler cold-start.
  • Remove t.Parallel(). Measuring latency under cross-test contention measures contention, not the receiver.
  • Graduated p99 signal levels. alertP99Ms (100ms race / 0.5ms non-race) emits an SLI ALERT line to the job summary without failing; targetP99Ms (500ms / 5ms) is the hard ceiling. Restores regression signal value the loose 500ms ceiling lost — a 20× regression now produces actionable summary signal even though the test still passes.
  • Falsifying test for the trim. TestSLIBudget_WarmupDiscardIsLoadBearing injects synthetic cold-start spikes and asserts the trim collapses p99 by ≥10×. Mutation-resistant: if a future refactor drops the discard, this test fails.

Plus three pieces of supporting infrastructure:

  • Reusable SLI emit helper at internal/sli/. PublishObservation handles the GITHUB_STEP_SUMMARY env-var gate + append/close + best-effort error handling. 100% test coverage. Next perf-budget test in the repo can reuse it.
  • CI p99 visibility via GITHUB_STEP_SUMMARY. make coverage-check runs go test -race without -v, so t.Logf is hidden in CI. The helper surfaces p99 in the PR job-summary tab and accumulates across PRs — the empirical history the budget comment conditions tightening on.
  • Defensive-guard rationale. The len(latencies) > warmupEvents guard now documents the invariant it protects (sink.count and len(latencies) can diverge if a future receiver emits records without ObservedTimestamp).

FOLLOWUPS updated: SLI-visibility entry closed (this PR ships it); three replacements opened — tighten the 500ms ceiling once CI history accumulates, switch to baseline-relative SLI assertion, repo-wide flake-pattern lint gate.

Empirical validation

  • Local 5 isolated runs (-race): p99 9.9 – 12.6 ms.
  • Local 10 runs under GOMAXPROCS=2: p99 9.3 – 10.9 ms.
  • Local 10 runs under GOMAXPROCS=2 + -count=10 + full kernelevents package: p99 9.9 – 17.0 ms (median 13.4 ms).
  • CI on this PR: verify-test green, race-stability (count=10) green five consecutive times (50+ SLI assertions under stress).
  • New TestSLIBudget_WarmupDiscardIsLoadBearing: 3 runs under -race PASS. Asserts p99(untrimmed) ≥ 1s, p99(trimmed) < 100ms, ratio ≥ 10×.
  • internal/sli package: 3 unit tests, 100% statement coverage.
Stabilizes the kernelevents SLI test (`TestReceiver_SLIBudget`) with a 500-event warmup phase, serial execution, and a graduated `SLI ALERT` signal tier emitted to `GITHUB_STEP_SUMMARY` so CI runs surface a visible p99 history per PR. Extracts the summary-publishing helper into `internal/sli` for reuse by future perf-budget tests. No change to the operator-facing 5 ms non-race SLO or the 500 ms race-mode flake guard.

PR #73 caught a flake: TestReceiver_SLIBudget asserted p99 emit
latency < 500ms under -race, observed 539ms. Local baseline is
10-18ms; the spike was GitHub Actions runner contention, not a
receiver regression. The aggregator bypass fix (PR #74) means this
flake will now correctly block subsequent PRs.

Three changes:

1. Warmup phase. The first 500 events feed a cold pipeline (GC not
   yet stabilized, race-detector metadata warming, scheduler still
   bouncing). Those latencies are now discarded before p99 is
   computed over the 5000-event measurement window. Locally, p99
   tightens to a 9.9-12.6ms range (was 10-18ms).

2. Remove t.Parallel(). The test specifically measures
   latency-under-load on the receiver hot path. Running it
   concurrently with other parallel tests in the same binary
   measures goroutine-pool contention, not the receiver. Wall-time
   cost: ~200ms of serial execution within the kernelevents
   package.

3. Fix the misleading comment. The previous comment claimed the
   budget was tuned against -count=10 contention from the
   race-stability workflow. That workflow is `make
   test-extras-race`, not `make coverage-check`. The bound was
   historically chased upward to absorb GH Actions runner noise,
   which is what actually happens under `coverage-check`. The
   comment now names the real noise source and the path forward
   (tighten only after sustained sub-250ms CI observations).

Local validation: 5 isolated runs under -race produced p99 9.9 -
12.6 ms. Within the full kernelevents package (-race ./...):
15.3 ms. Both comfortably within the 500ms budget.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
…ve value (#78)

## Summary

`TestReceiver_SetDegraded` asserted the `degraded_seconds_total` counter
matched `0\.0[0-9]+` — an implicit upper bound of `[0.0, 0.1)`. The
test's stated intent (per its comment, "Value must be >0") is just
"counter advanced past zero"; the upper bound is brittle artifact.

Caught on PR #76's rerun, where a slow GitHub Actions runner observed
0.126s of degraded time — well past the regex's implicit 0.1s ceiling,
but well within the test's actual intent. Same flake shape as the SLI
deflake landed in PR #76: a CI-timing assertion calibrated to
fast-runner expectations failing on a slow runner.

**Fix:** regex becomes `\d+\.[0-9]*[1-9]` — match any value with at
least one non-zero digit after the decimal point. Accepts `0.05`,
`0.126`, `1.5`, etc.; correctly rejects `0`, `0.0`, `0.000`.

## Test plan

- [x] 10 isolated runs of `TestReceiver_SetDegraded` under `-race`: all
PASS.
- [x] 20-iteration stress run under `GOMAXPROCS=2` + full
`internal/selftelemetry` package: all PASS (2.554s).
- [x] Other `require.Regexp` calls in the file inspected (lines 72, 75,
92, 110, 154): all assert exact integer counter values (`2`, `1`, `8`,
etc.) or histogram counts — none have the same upper-bound-on-zero
shape. No further changes needed.
- [ ] CI: `verify-test` passes on this PR.
- [ ] Once this lands, re-run PR #76's CI to clear the path for the SLI
deflake.

```release-notes
Stabilizes the `internal/selftelemetry` degraded-counter test by relaxing its regex to accept any positive counter value (the upper bound of 0.1s was an artifact, not load-bearing). Reduces CI flakes from slow GitHub Actions runners.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added 6 commits May 18, 2026 23:51
Drops the historical bullet trail (50ms → 250ms → 500ms across
2026-04/05) and the 'PR #73' reference. Per the user-memory rule
on comments: don't reference specific PRs or one-off incidents
that rot as the codebase evolves; the budget rationale (CI flake
guard, not regression bound) is what matters going forward.

Comment goes from 10 lines to 6 lines. Mechanical cleanup only,
no behavior change.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Surfaced during PR #76's review cycle. Both items arose from the
session that landed the SLI and SetDegraded deflakes; neither is
in scope for this PR but both warrant tracking.

1. Flake-pattern audit. Two flakes of the same shape (timing-
   sensitive regex assertions calibrated to fast-runner timings)
   landed in one session. A grep sweep would find any sibling
   cases before they bite future PRs.

2. SLI p99 visibility. The deflaked test's comment instructs
   tightening only after sustained sub-250ms CI history, but
   `make coverage-check` runs without `-v`, so the p99 line is
   suppressed. A future maintainer has no operable mechanism to
   observe the threshold. Two paths recorded.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Drives PR #76 from B+ to A+ within the same PR via three additions:

1. Emit p99 + throughput + heap to GITHUB_STEP_SUMMARY when the
   env var is set. `make coverage-check` runs without `-v`, so the
   `t.Logf("SLI: ...")` line was hidden in CI. The summary surfaces
   in the PR job-summary tab and accumulates across PRs — the
   empirical history the budget comment conditions tightening on.
   Best-effort: errors ignored so the test never fails on a missing
   summary file or permission quirk.

2. New test `TestSLIBudget_WarmupDiscardIsLoadBearing`. Constructs
   a fabricated latency stream of warmupEvents 1s spikes followed
   by targetEvents 10ms steady-state samples, then asserts:
   - Untrimmed p99 is ≥ 1s (sanity).
   - After applying the same slice trim the SLI test uses, p99
     collapses to < 100ms.
   - The trim collapses p99 by ≥10×.
   Falsifies the mechanism: if a future refactor changes the slice
   semantics or removes the discard, this test fails.

3. Defensive guard `if len(latencies) > warmupEvents` now carries a
   one-line rationale explaining why it isn't redundant with the
   prior `Eventually` waiter — they track different invariants
   (`sink.count` vs `len(sink.latencies)`) that can diverge if the
   receiver ever emits records without `ObservedTimestamp` set.

FOLLOWUPS: SLI-visibility entry marked closed; three replacements
opened (tighten the 500ms ceiling once CI history accumulates;
switch to baseline-relative SLI assertion; repo-wide flake-pattern
lint gate).

Local: 3 isolated runs of both tests under -race PASS. `make vet`
clean, `make doc-check` clean.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
…nal/sli

Three changes drive PR #76 toward A+:

1. Graduated p99 signal levels. The test now defines two ceilings:
   - alertP99Ms (100ms under -race; 0.5ms non-race): emits a
     "SLI ALERT" line to the job summary when crossed but does NOT
     fail the test. Restores regression-signal value the loose 500ms
     ceiling lost.
   - targetP99Ms (500ms / 5ms): unchanged hard-failure budget.
   A 20x regression now produces actionable signal in the PR summary
   tab without forcing a flake; a 100x+ regression still fails.

2. Reusable helper at internal/sli/emit.go. PublishObservation handles
   the GITHUB_STEP_SUMMARY env-var gate + append/close + best-effort
   error handling. Next perf-budget test in the repo can use it
   without duplicating the boilerplate (or the G304 nosec). bench_test
   now imports it; the inline os.OpenFile block in bench_test is gone.

3. Unit tests for the helper. internal/sli/emit_test.go pins three
   behaviors: no-op when env unset, append when env points at a
   writable file, silent when path is unwritable. 100% coverage.

Also fixes the gosec G304 finding that broke verify-lint on the prior
commit (false positive: GITHUB_STEP_SUMMARY is a runner-controlled
writable path, not attacker-controlled input). The annotation now
documents the rationale.

Verified locally per `feedback_test_plan_before_pr` memory:
- make lint: 0 issues
- make vet: clean
- go test -race -count=3 on both SLI tests: PASS
- go test -race -cover ./internal/sli/: 100% statements, 3/3 PASS
- go test -race ./components/receivers/kernelevents/: full package PASS

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Drops comments that don't earn six-months-cold-reader weight: the
two-tier-signal explanation collapses from 14 lines to 4, the trim
+ guard block from 14 to 3, the falsifying test docstring from 19
to 4, the sli package doc from 10 to 5. Sli emit_test docstrings
drop boilerplate intent that the test name already conveys.

The rationale that remains is what code can't tell a reader on its
own: WHY no t.Parallel, why two tiers, why the count/latencies
guard, why the slice-trim is load-bearing, why GITHUB_STEP_SUMMARY.

No semantic change. Lint + vet + tests clean locally.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr changed the title [ci] kernelevents SLI test: warmup phase + remove t.Parallel [ci] kernelevents SLI test: warmup + alert tier + sli emit helper May 19, 2026
@trilamsr trilamsr merged commit 869450b into main May 19, 2026
17 checks passed
@trilamsr trilamsr deleted the ci/sli-test-warmup-and-deflake branch May 19, 2026 07:48
trilamsr added a commit that referenced this pull request May 19, 2026
…83)

## Summary

Flake-pattern audit follow-up to PR #76 + #78. Two assertions in
`tools/failure-inject/cpusteal/cpusteal_test.go` match the same shape we
fixed in `TestReceiver_SLIBudget` and `TestReceiver_SetDegraded`: hard
absolute upper bound on observed timing, calibrated to fast-runner
expectations.

| Before | After | What changed |
|---|---|---|
| `require.Less(elapsed, 500ms)` for 100ms request |
`require.Less(elapsed, 2s)` | Hang sentinel, not perf bound — busy-loop
scheduler delay under contention can run a 100ms request to 300-400ms |
| `require.Less(elapsed, 250ms)` for cancel response |
`require.Less(elapsed, 2s)` | Same — context-cancellation latency varies
by an order of magnitude under contention |

The lower-bound assertion on `TestRun_HonorsDuration` (`elapsed >=
95ms`) still pins the real contract (busy-loop runs for the requested
time). The upper bounds only catch "never returned." This matches the
lesson landed in `AGENTS.md` via PR #81 — *match perf-budget assertions
by the invariant only*.

## Test plan

- [x] Local: `go test -race -count=3 -v
./tools/failure-inject/cpusteal/` — all 4 tests PASS each iteration.
- [x] `make lint` clean.
- [x] `make vet` clean.
- [x] Audit completeness verified: broader grep sweep
(`require.Less.*Millisecond`, `assert.Less.*Millisecond`, `elapsed >
N*time.X`, `WithinDuration`, `Budget` callsites, `isRaceBuild`
callsites) found no other instances of the same shape outside the
kernelevents SLI test we already covered.
- [ ] CI on this PR.

## Rollback

Single Edit to restore the original numeric bounds. No dependents; the
bounds are local to two test functions.

```release-notes
NONE — test stability only. Relaxes two absolute-time assertions in cpusteal's test to hang sentinels rather than performance bounds, matching the pattern landed in PR #76 and #78. No production behavior change.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
trilamsr added a commit that referenced this pull request May 19, 2026
## Summary

Three observations from the recent session that didn't fit the
structured surfaces (already used for the load-bearing AGENTS.md entries
in PR #81 and the agent-internal notes in PR #82). Each captured via the
`learn-from-mistakes` flow and lands in its existing topic note.

**`.claude/notes/automation.md`** — *Memory captures rationale; hooks
enforce.* The pre-PR checklist personal memory landed mid-session was
followed by a lint failure shipped to CI within the hour. The same gap
closed reliably by the `PreToolUse` hook installed shortly after. For
any "always do X before Y" pattern, prefer the hook; the memory
documents *why* the hook exists.

**`docs/notes/ci.md`** — *Frame CI / perf projections as ranges, not
single numbers.* PR #72's 155s wall-time projection vs 242s actual cost
an investigation round (later landed in PR #77) because the projection's
setup-go-cache amortization assumption was unverified. Either verify
assumptions empirically before publishing the number, or frame as a
range.

**`.claude/notes/review-patterns.md`** — *Self-rate work, then write
criteria for the next grade up.* Forces articulation of measurable
improvements rather than free-form "anything else?". PR #76's B+ → A →
A+ came from two iterations of this exact pattern; each iteration closed
real structural gaps.

A fourth lesson — "fix existing tools before proposing new ones" — was
captured to personal memory (no PR, lives in
`~/.claude/projects/.../memory/`), not the repo, because it's a judgment
heuristic about my own decision-making rather than a repo-resident
convention.

## Test plan

- [x] `make doc-check` clean (banned-phrase lint, link resolution, all
gates).
- [x] `learn-from-mistakes` format check: banned vocabulary absent, no
first-person AI phrasing, no AI attribution, all three entries carry
`Anchor:` lines pointing at concrete PRs.
- [ ] CI on this PR exercises `doc-check` + `pr-lint`.

## Rollback

Each entry is a self-contained `### title` + body + `Anchor:` block at
the top of its file. No dependents elsewhere; reverting is a single Edit
per file.

```release-notes
NONE — documentation only. Three meta-lessons from a recent session retrospective land in their existing topic notes (`automation.md`, `ci.md`, `review-patterns.md`). No runtime behavior change.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant