[ci] kernelevents SLI test: warmup + alert tier + sli emit helper#76
Merged
Conversation
PR #73 caught a flake: TestReceiver_SLIBudget asserted p99 emit latency < 500ms under -race, observed 539ms. Local baseline is 10-18ms; the spike was GitHub Actions runner contention, not a receiver regression. The aggregator bypass fix (PR #74) means this flake will now correctly block subsequent PRs. Three changes: 1. Warmup phase. The first 500 events feed a cold pipeline (GC not yet stabilized, race-detector metadata warming, scheduler still bouncing). Those latencies are now discarded before p99 is computed over the 5000-event measurement window. Locally, p99 tightens to a 9.9-12.6ms range (was 10-18ms). 2. Remove t.Parallel(). The test specifically measures latency-under-load on the receiver hot path. Running it concurrently with other parallel tests in the same binary measures goroutine-pool contention, not the receiver. Wall-time cost: ~200ms of serial execution within the kernelevents package. 3. Fix the misleading comment. The previous comment claimed the budget was tuned against -count=10 contention from the race-stability workflow. That workflow is `make test-extras-race`, not `make coverage-check`. The bound was historically chased upward to absorb GH Actions runner noise, which is what actually happens under `coverage-check`. The comment now names the real noise source and the path forward (tighten only after sustained sub-250ms CI observations). Local validation: 5 isolated runs under -race produced p99 9.9 - 12.6 ms. Within the full kernelevents package (-race ./...): 15.3 ms. Both comfortably within the 500ms budget. Signed-off-by: Tri Lam <trilamsr@gmail.com>
5 tasks
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…ve value (#78) ## Summary `TestReceiver_SetDegraded` asserted the `degraded_seconds_total` counter matched `0\.0[0-9]+` — an implicit upper bound of `[0.0, 0.1)`. The test's stated intent (per its comment, "Value must be >0") is just "counter advanced past zero"; the upper bound is brittle artifact. Caught on PR #76's rerun, where a slow GitHub Actions runner observed 0.126s of degraded time — well past the regex's implicit 0.1s ceiling, but well within the test's actual intent. Same flake shape as the SLI deflake landed in PR #76: a CI-timing assertion calibrated to fast-runner expectations failing on a slow runner. **Fix:** regex becomes `\d+\.[0-9]*[1-9]` — match any value with at least one non-zero digit after the decimal point. Accepts `0.05`, `0.126`, `1.5`, etc.; correctly rejects `0`, `0.0`, `0.000`. ## Test plan - [x] 10 isolated runs of `TestReceiver_SetDegraded` under `-race`: all PASS. - [x] 20-iteration stress run under `GOMAXPROCS=2` + full `internal/selftelemetry` package: all PASS (2.554s). - [x] Other `require.Regexp` calls in the file inspected (lines 72, 75, 92, 110, 154): all assert exact integer counter values (`2`, `1`, `8`, etc.) or histogram counts — none have the same upper-bound-on-zero shape. No further changes needed. - [ ] CI: `verify-test` passes on this PR. - [ ] Once this lands, re-run PR #76's CI to clear the path for the SLI deflake. ```release-notes Stabilizes the `internal/selftelemetry` degraded-counter test by relaxing its regex to accept any positive counter value (the upper bound of 0.1s was an artifact, not load-bearing). Reduces CI flakes from slow GitHub Actions runners. ``` Signed-off-by: Tri Lam <trilamsr@gmail.com>
Drops the historical bullet trail (50ms → 250ms → 500ms across 2026-04/05) and the 'PR #73' reference. Per the user-memory rule on comments: don't reference specific PRs or one-off incidents that rot as the codebase evolves; the budget rationale (CI flake guard, not regression bound) is what matters going forward. Comment goes from 10 lines to 6 lines. Mechanical cleanup only, no behavior change. Signed-off-by: Tri Lam <trilamsr@gmail.com>
Surfaced during PR #76's review cycle. Both items arose from the session that landed the SLI and SetDegraded deflakes; neither is in scope for this PR but both warrant tracking. 1. Flake-pattern audit. Two flakes of the same shape (timing- sensitive regex assertions calibrated to fast-runner timings) landed in one session. A grep sweep would find any sibling cases before they bite future PRs. 2. SLI p99 visibility. The deflaked test's comment instructs tightening only after sustained sub-250ms CI history, but `make coverage-check` runs without `-v`, so the p99 line is suppressed. A future maintainer has no operable mechanism to observe the threshold. Two paths recorded. Signed-off-by: Tri Lam <trilamsr@gmail.com>
Drives PR #76 from B+ to A+ within the same PR via three additions: 1. Emit p99 + throughput + heap to GITHUB_STEP_SUMMARY when the env var is set. `make coverage-check` runs without `-v`, so the `t.Logf("SLI: ...")` line was hidden in CI. The summary surfaces in the PR job-summary tab and accumulates across PRs — the empirical history the budget comment conditions tightening on. Best-effort: errors ignored so the test never fails on a missing summary file or permission quirk. 2. New test `TestSLIBudget_WarmupDiscardIsLoadBearing`. Constructs a fabricated latency stream of warmupEvents 1s spikes followed by targetEvents 10ms steady-state samples, then asserts: - Untrimmed p99 is ≥ 1s (sanity). - After applying the same slice trim the SLI test uses, p99 collapses to < 100ms. - The trim collapses p99 by ≥10×. Falsifies the mechanism: if a future refactor changes the slice semantics or removes the discard, this test fails. 3. Defensive guard `if len(latencies) > warmupEvents` now carries a one-line rationale explaining why it isn't redundant with the prior `Eventually` waiter — they track different invariants (`sink.count` vs `len(sink.latencies)`) that can diverge if the receiver ever emits records without `ObservedTimestamp` set. FOLLOWUPS: SLI-visibility entry marked closed; three replacements opened (tighten the 500ms ceiling once CI history accumulates; switch to baseline-relative SLI assertion; repo-wide flake-pattern lint gate). Local: 3 isolated runs of both tests under -race PASS. `make vet` clean, `make doc-check` clean. Signed-off-by: Tri Lam <trilamsr@gmail.com>
…nal/sli Three changes drive PR #76 toward A+: 1. Graduated p99 signal levels. The test now defines two ceilings: - alertP99Ms (100ms under -race; 0.5ms non-race): emits a "SLI ALERT" line to the job summary when crossed but does NOT fail the test. Restores regression-signal value the loose 500ms ceiling lost. - targetP99Ms (500ms / 5ms): unchanged hard-failure budget. A 20x regression now produces actionable signal in the PR summary tab without forcing a flake; a 100x+ regression still fails. 2. Reusable helper at internal/sli/emit.go. PublishObservation handles the GITHUB_STEP_SUMMARY env-var gate + append/close + best-effort error handling. Next perf-budget test in the repo can use it without duplicating the boilerplate (or the G304 nosec). bench_test now imports it; the inline os.OpenFile block in bench_test is gone. 3. Unit tests for the helper. internal/sli/emit_test.go pins three behaviors: no-op when env unset, append when env points at a writable file, silent when path is unwritable. 100% coverage. Also fixes the gosec G304 finding that broke verify-lint on the prior commit (false positive: GITHUB_STEP_SUMMARY is a runner-controlled writable path, not attacker-controlled input). The annotation now documents the rationale. Verified locally per `feedback_test_plan_before_pr` memory: - make lint: 0 issues - make vet: clean - go test -race -count=3 on both SLI tests: PASS - go test -race -cover ./internal/sli/: 100% statements, 3/3 PASS - go test -race ./components/receivers/kernelevents/: full package PASS Signed-off-by: Tri Lam <trilamsr@gmail.com>
Drops comments that don't earn six-months-cold-reader weight: the two-tier-signal explanation collapses from 14 lines to 4, the trim + guard block from 14 to 3, the falsifying test docstring from 19 to 4, the sli package doc from 10 to 5. Sli emit_test docstrings drop boilerplate intent that the test name already conveys. The rationale that remains is what code can't tell a reader on its own: WHY no t.Parallel, why two tiers, why the count/latencies guard, why the slice-trim is load-bearing, why GITHUB_STEP_SUMMARY. No semantic change. Lint + vet + tests clean locally. Signed-off-by: Tri Lam <trilamsr@gmail.com>
5 tasks
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
…83) ## Summary Flake-pattern audit follow-up to PR #76 + #78. Two assertions in `tools/failure-inject/cpusteal/cpusteal_test.go` match the same shape we fixed in `TestReceiver_SLIBudget` and `TestReceiver_SetDegraded`: hard absolute upper bound on observed timing, calibrated to fast-runner expectations. | Before | After | What changed | |---|---|---| | `require.Less(elapsed, 500ms)` for 100ms request | `require.Less(elapsed, 2s)` | Hang sentinel, not perf bound — busy-loop scheduler delay under contention can run a 100ms request to 300-400ms | | `require.Less(elapsed, 250ms)` for cancel response | `require.Less(elapsed, 2s)` | Same — context-cancellation latency varies by an order of magnitude under contention | The lower-bound assertion on `TestRun_HonorsDuration` (`elapsed >= 95ms`) still pins the real contract (busy-loop runs for the requested time). The upper bounds only catch "never returned." This matches the lesson landed in `AGENTS.md` via PR #81 — *match perf-budget assertions by the invariant only*. ## Test plan - [x] Local: `go test -race -count=3 -v ./tools/failure-inject/cpusteal/` — all 4 tests PASS each iteration. - [x] `make lint` clean. - [x] `make vet` clean. - [x] Audit completeness verified: broader grep sweep (`require.Less.*Millisecond`, `assert.Less.*Millisecond`, `elapsed > N*time.X`, `WithinDuration`, `Budget` callsites, `isRaceBuild` callsites) found no other instances of the same shape outside the kernelevents SLI test we already covered. - [ ] CI on this PR. ## Rollback Single Edit to restore the original numeric bounds. No dependents; the bounds are local to two test functions. ```release-notes NONE — test stability only. Relaxes two absolute-time assertions in cpusteal's test to hang sentinels rather than performance bounds, matching the pattern landed in PR #76 and #78. No production behavior change. ``` Signed-off-by: Tri Lam <trilamsr@gmail.com>
3 tasks
trilamsr
added a commit
that referenced
this pull request
May 19, 2026
## Summary Three observations from the recent session that didn't fit the structured surfaces (already used for the load-bearing AGENTS.md entries in PR #81 and the agent-internal notes in PR #82). Each captured via the `learn-from-mistakes` flow and lands in its existing topic note. **`.claude/notes/automation.md`** — *Memory captures rationale; hooks enforce.* The pre-PR checklist personal memory landed mid-session was followed by a lint failure shipped to CI within the hour. The same gap closed reliably by the `PreToolUse` hook installed shortly after. For any "always do X before Y" pattern, prefer the hook; the memory documents *why* the hook exists. **`docs/notes/ci.md`** — *Frame CI / perf projections as ranges, not single numbers.* PR #72's 155s wall-time projection vs 242s actual cost an investigation round (later landed in PR #77) because the projection's setup-go-cache amortization assumption was unverified. Either verify assumptions empirically before publishing the number, or frame as a range. **`.claude/notes/review-patterns.md`** — *Self-rate work, then write criteria for the next grade up.* Forces articulation of measurable improvements rather than free-form "anything else?". PR #76's B+ → A → A+ came from two iterations of this exact pattern; each iteration closed real structural gaps. A fourth lesson — "fix existing tools before proposing new ones" — was captured to personal memory (no PR, lives in `~/.claude/projects/.../memory/`), not the repo, because it's a judgment heuristic about my own decision-making rather than a repo-resident convention. ## Test plan - [x] `make doc-check` clean (banned-phrase lint, link resolution, all gates). - [x] `learn-from-mistakes` format check: banned vocabulary absent, no first-person AI phrasing, no AI attribution, all three entries carry `Anchor:` lines pointing at concrete PRs. - [ ] CI on this PR exercises `doc-check` + `pr-lint`. ## Rollback Each entry is a self-contained `### title` + body + `Anchor:` block at the top of its file. No dependents elsewhere; reverting is a single Edit per file. ```release-notes NONE — documentation only. Three meta-lessons from a recent session retrospective land in their existing topic notes (`automation.md`, `ci.md`, `review-patterns.md`). No runtime behavior change. ``` Signed-off-by: Tri Lam <trilamsr@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Deflakes
TestReceiver_SLIBudgetincomponents/receivers/kernelevents/bench_test.go(observed a 539ms spike under-racefrom GH Actions runner contention; local baseline is 10–18ms). Ships four substantive changes:t.Parallel(). Measuring latency under cross-test contention measures contention, not the receiver.alertP99Ms(100ms race / 0.5ms non-race) emits anSLI ALERTline to the job summary without failing;targetP99Ms(500ms / 5ms) is the hard ceiling. Restores regression signal value the loose 500ms ceiling lost — a 20× regression now produces actionable summary signal even though the test still passes.TestSLIBudget_WarmupDiscardIsLoadBearinginjects synthetic cold-start spikes and asserts the trim collapses p99 by ≥10×. Mutation-resistant: if a future refactor drops the discard, this test fails.Plus three pieces of supporting infrastructure:
internal/sli/.PublishObservationhandles theGITHUB_STEP_SUMMARYenv-var gate + append/close + best-effort error handling. 100% test coverage. Next perf-budget test in the repo can reuse it.GITHUB_STEP_SUMMARY.make coverage-checkrunsgo test -racewithout-v, sot.Logfis hidden in CI. The helper surfaces p99 in the PR job-summary tab and accumulates across PRs — the empirical history the budget comment conditions tightening on.len(latencies) > warmupEventsguard now documents the invariant it protects (sink.count and len(latencies) can diverge if a future receiver emits records withoutObservedTimestamp).FOLLOWUPS updated: SLI-visibility entry closed (this PR ships it); three replacements opened — tighten the 500ms ceiling once CI history accumulates, switch to baseline-relative SLI assertion, repo-wide flake-pattern lint gate.
Empirical validation
-race): p99 9.9 – 12.6 ms.GOMAXPROCS=2: p99 9.3 – 10.9 ms.GOMAXPROCS=2 + -count=10+ full kernelevents package: p99 9.9 – 17.0 ms (median 13.4 ms).verify-testgreen,race-stability (count=10)green five consecutive times (50+ SLI assertions under stress).TestSLIBudget_WarmupDiscardIsLoadBearing: 3 runs under-racePASS. Asserts p99(untrimmed) ≥ 1s, p99(trimmed) < 100ms, ratio ≥ 10×.internal/slipackage: 3 unit tests, 100% statement coverage.