[ci] kernelevents SLI test: warmup + alert tier + sli emit helper by trilamsr · Pull Request #76 · TraceCoreAI/tracecore

trilamsr · 2026-05-19T06:15:00Z

Summary

Deflakes TestReceiver_SLIBudget in components/receivers/kernelevents/bench_test.go (observed a 539ms spike under -race from GH Actions runner contention; local baseline is 10–18ms). Ships four substantive changes:

Warmup phase. First 500 events of a 5500-event fixture have their latencies discarded before p99 is computed over the remaining 5000-event measurement window. Absorbs GC, race-detector metadata, and scheduler cold-start.
Remove t.Parallel(). Measuring latency under cross-test contention measures contention, not the receiver.
Graduated p99 signal levels. alertP99Ms (100ms race / 0.5ms non-race) emits an SLI ALERT line to the job summary without failing; targetP99Ms (500ms / 5ms) is the hard ceiling. Restores regression signal value the loose 500ms ceiling lost — a 20× regression now produces actionable summary signal even though the test still passes.
Falsifying test for the trim. TestSLIBudget_WarmupDiscardIsLoadBearing injects synthetic cold-start spikes and asserts the trim collapses p99 by ≥10×. Mutation-resistant: if a future refactor drops the discard, this test fails.

Plus three pieces of supporting infrastructure:

Reusable SLI emit helper at internal/sli/. PublishObservation handles the GITHUB_STEP_SUMMARY env-var gate + append/close + best-effort error handling. 100% test coverage. Next perf-budget test in the repo can reuse it.
CI p99 visibility via GITHUB_STEP_SUMMARY. make coverage-check runs go test -race without -v, so t.Logf is hidden in CI. The helper surfaces p99 in the PR job-summary tab and accumulates across PRs — the empirical history the budget comment conditions tightening on.
Defensive-guard rationale. The len(latencies) > warmupEvents guard now documents the invariant it protects (sink.count and len(latencies) can diverge if a future receiver emits records without ObservedTimestamp).

FOLLOWUPS updated: SLI-visibility entry closed (this PR ships it); three replacements opened — tighten the 500ms ceiling once CI history accumulates, switch to baseline-relative SLI assertion, repo-wide flake-pattern lint gate.

Empirical validation

Local 5 isolated runs (-race): p99 9.9 – 12.6 ms.
Local 10 runs under GOMAXPROCS=2: p99 9.3 – 10.9 ms.
Local 10 runs under GOMAXPROCS=2 + -count=10 + full kernelevents package: p99 9.9 – 17.0 ms (median 13.4 ms).
CI on this PR: verify-test green, race-stability (count=10) green five consecutive times (50+ SLI assertions under stress).
New TestSLIBudget_WarmupDiscardIsLoadBearing: 3 runs under -race PASS. Asserts p99(untrimmed) ≥ 1s, p99(trimmed) < 100ms, ratio ≥ 10×.
internal/sli package: 3 unit tests, 100% statement coverage.

Stabilizes the kernelevents SLI test (`TestReceiver_SLIBudget`) with a 500-event warmup phase, serial execution, and a graduated `SLI ALERT` signal tier emitted to `GITHUB_STEP_SUMMARY` so CI runs surface a visible p99 history per PR. Extracts the summary-publishing helper into `internal/sli` for reuse by future perf-budget tests. No change to the operator-facing 5 ms non-race SLO or the 500 ms race-mode flake guard.

PR #73 caught a flake: TestReceiver_SLIBudget asserted p99 emit latency < 500ms under -race, observed 539ms. Local baseline is 10-18ms; the spike was GitHub Actions runner contention, not a receiver regression. The aggregator bypass fix (PR #74) means this flake will now correctly block subsequent PRs.

Three changes:

1. Warmup phase. The first 500 events feed a cold pipeline (GC not yet stabilized, race-detector metadata warming, scheduler still bouncing). Those latencies are now discarded before p99 is computed over the 5000-event measurement window. Locally, p99 tightens to a 9.9-12.6ms range (was 10-18ms).

2. Remove t.Parallel(). The test specifically measures latency-under-load on the receiver hot path. Running it concurrently with other parallel tests in the same binary measures goroutine-pool contention, not the receiver. Wall-time cost: ~200ms of serial execution within the kernelevents package.

3. Fix the misleading comment. The previous comment claimed the budget was tuned against -count=10 contention from the race-stability workflow. That workflow is `make test-extras-race`, not `make coverage-check`. The bound was historically chased upward to absorb GH Actions runner noise, which is what actually happens under `coverage-check`. The comment now names the real noise source and the path forward (tighten only after sustained sub-250ms CI observations).

Local validation: 5 isolated runs under -race produced p99 9.9 - 12.6 ms. Within the full kernelevents package (-race ./...): 15.3 ms. Both comfortably within the 500ms budget.

Signed-off-by: Tri Lam <trilamsr@gmail.com>

…ve value (#78)

## Summary

`TestReceiver_SetDegraded` asserted the `degraded_seconds_total` counter matched `0\.0[0-9]+`, an implicit upper bound of `[0.0, 0.1)`. The test's stated intent (per its comment, "Value must be >0") is just "counter advanced past zero"; the upper bound is a brittle artifact.

Caught on PR #76's rerun, where a slow GitHub Actions runner observed 0.126s of degraded time: well past the regex's implicit 0.1s ceiling, but well within the test's actual intent. Same flake shape as the SLI deflake landed in PR #76: a CI-timing assertion calibrated to fast-runner expectations failing on a slow runner.

**Fix:** regex becomes `\d+\.[0-9]*[1-9]`, which matches any value with at least one non-zero digit after the decimal point. Accepts `0.05`, `0.126`, `1.5`, etc.; correctly rejects `0`, `0.0`, `0.000`.

## Test plan

- [x] 10 isolated runs of `TestReceiver_SetDegraded` under `-race`: all PASS.
- [x] 20-iteration stress run under `GOMAXPROCS=2` + full `internal/selftelemetry` package: all PASS (2.554s).
- [x] Other `require.Regexp` calls in the file inspected (lines 72, 75, 92, 110, 154): all assert exact integer counter values (`2`, `1`, `8`, etc.) or histogram counts; none have the same upper-bound-on-zero shape. No further changes needed.
- [ ] CI: `verify-test` passes on this PR.
- [ ] Once this lands, re-run PR #76's CI to clear the path for the SLI deflake.

```release-notes
Stabilizes the `internal/selftelemetry` degraded-counter test by relaxing its regex to accept any positive counter value (the upper bound of 0.1s was an artifact, not load-bearing). Reduces CI flakes from slow GitHub Actions runners.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
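The relaxed pattern can be checked against the example values quoted in the commit message. This is a standalone sketch using Go's standard `regexp` package; the variable name `positiveCounter` is hypothetical, and the real test applies the pattern through `require.Regexp` to output that is not shown here.

```go
package main

import (
	"fmt"
	"regexp"
)

// positiveCounter is the relaxed pattern quoted in the commit message:
// digits, a dot, then a fractional part ending in a non-zero digit.
var positiveCounter = regexp.MustCompile(`\d+\.[0-9]*[1-9]`)

func main() {
	// Values the commit message says are accepted, then rejected.
	for _, v := range []string{"0.05", "0.126", "1.5", "0", "0.0", "0.000"} {
		fmt.Printf("%-6s matches=%v\n", v, positiveCounter.MatchString(v))
	}
}
```

One consequence of the pattern as written: a bare value whose fractional digits are all zero, such as `1.0`, does not match. Whether the counter can ever render that way depends on the metric formatting, which the PR does not show.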

…d-deflake

Drops the historical bullet trail (50ms → 250ms → 500ms across 2026-04/05) and the 'PR #73' reference. Per the user-memory rule on comments: don't reference specific PRs or one-off incidents that rot as the codebase evolves; the budget rationale (CI flake guard, not regression bound) is what matters going forward. Comment goes from 10 lines to 6 lines. Mechanical cleanup only, no behavior change. Signed-off-by: Tri Lam <trilamsr@gmail.com>

Surfaced during PR #76's review cycle. Both items arose from the session that landed the SLI and SetDegraded deflakes; neither is in scope for this PR but both warrant tracking.

1. Flake-pattern audit. Two flakes of the same shape (timing-sensitive regex assertions calibrated to fast-runner timings) landed in one session. A grep sweep would find any sibling cases before they bite future PRs.

2. SLI p99 visibility. The deflaked test's comment instructs tightening only after sustained sub-250ms CI history, but `make coverage-check` runs without `-v`, so the p99 line is suppressed. A future maintainer has no operable mechanism to observe the threshold. Two paths recorded.

Signed-off-by: Tri Lam <trilamsr@gmail.com>

Drives PR #76 from B+ to A+ within the same PR via three additions:

1. Emit p99 + throughput + heap to GITHUB_STEP_SUMMARY when the env var is set. `make coverage-check` runs without `-v`, so the `t.Logf("SLI: ...")` line was hidden in CI. The summary surfaces in the PR job-summary tab and accumulates across PRs: the empirical history the budget comment conditions tightening on. Best-effort: errors ignored so the test never fails on a missing summary file or permission quirk.

2. New test `TestSLIBudget_WarmupDiscardIsLoadBearing`. Constructs a fabricated latency stream of warmupEvents 1s spikes followed by targetEvents 10ms steady-state samples, then asserts:
   - Untrimmed p99 is ≥ 1s (sanity).
   - After applying the same slice trim the SLI test uses, p99 collapses to < 100ms.
   - The trim collapses p99 by ≥10×.

   Falsifies the mechanism: if a future refactor changes the slice semantics or removes the discard, this test fails.

3. Defensive guard `if len(latencies) > warmupEvents` now carries a one-line rationale explaining why it isn't redundant with the prior `Eventually` waiter: they track different invariants (`sink.count` vs `len(sink.latencies)`) that can diverge if the receiver ever emits records without `ObservedTimestamp` set.

FOLLOWUPS: SLI-visibility entry marked closed; three replacements opened (tighten the 500ms ceiling once CI history accumulates; switch to baseline-relative SLI assertion; repo-wide flake-pattern lint gate).

Local: 3 isolated runs of both tests under -race PASS. `make vet` clean, `make doc-check` clean.

Signed-off-by: Tri Lam <trilamsr@gmail.com>

…nal/sli

Three changes drive PR #76 toward A+:

1. Graduated p99 signal levels. The test now defines two ceilings:
   - alertP99Ms (100ms under -race; 0.5ms non-race): emits a "SLI ALERT" line to the job summary when crossed but does NOT fail the test. Restores regression-signal value the loose 500ms ceiling lost.
   - targetP99Ms (500ms / 5ms): unchanged hard-failure budget.

   A 20x regression now produces actionable signal in the PR summary tab without forcing a flake; a 100x+ regression still fails.

2. Reusable helper at internal/sli/emit.go. PublishObservation handles the GITHUB_STEP_SUMMARY env-var gate + append/close + best-effort error handling. Next perf-budget test in the repo can use it without duplicating the boilerplate (or the G304 nosec). bench_test now imports it; the inline os.OpenFile block in bench_test is gone.

3. Unit tests for the helper. internal/sli/emit_test.go pins three behaviors: no-op when env unset, append when env points at a writable file, silent when path is unwritable. 100% coverage.

Also fixes the gosec G304 finding that broke verify-lint on the prior commit (false positive: GITHUB_STEP_SUMMARY is a runner-controlled writable path, not attacker-controlled input). The annotation now documents the rationale.

Verified locally per `feedback_test_plan_before_pr` memory:
- make lint: 0 issues
- make vet: clean
- go test -race -count=3 on both SLI tests: PASS
- go test -race -cover ./internal/sli/: 100% statements, 3/3 PASS
- go test -race ./components/receivers/kernelevents/: full package PASS

Signed-off-by: Tri Lam <trilamsr@gmail.com>

Drops comments that don't earn six-months-cold-reader weight: the two-tier-signal explanation collapses from 14 lines to 4, the trim + guard block from 14 to 3, the falsifying test docstring from 19 to 4, the sli package doc from 10 to 5. Sli emit_test docstrings drop boilerplate intent that the test name already conveys. The rationale that remains is what code can't tell a reader on its own: WHY no t.Parallel, why two tiers, why the count/latencies guard, why the slice-trim is load-bearing, why GITHUB_STEP_SUMMARY. No semantic change. Lint + vet + tests clean locally. Signed-off-by: Tri Lam <trilamsr@gmail.com>

…83)

## Summary

Flake-pattern audit follow-up to PR #76 + #78. Two assertions in `tools/failure-inject/cpusteal/cpusteal_test.go` match the same shape we fixed in `TestReceiver_SLIBudget` and `TestReceiver_SetDegraded`: hard absolute upper bound on observed timing, calibrated to fast-runner expectations.

| Before | After | What changed |
|---|---|---|
| `require.Less(elapsed, 500ms)` for 100ms request | `require.Less(elapsed, 2s)` | Hang sentinel, not perf bound: busy-loop scheduler delay under contention can run a 100ms request to 300-400ms |
| `require.Less(elapsed, 250ms)` for cancel response | `require.Less(elapsed, 2s)` | Same: context-cancellation latency varies by an order of magnitude under contention |

The lower-bound assertion on `TestRun_HonorsDuration` (`elapsed >= 95ms`) still pins the real contract (busy-loop runs for the requested time). The upper bounds only catch "never returned." This matches the lesson landed in `AGENTS.md` via PR #81: *match perf-budget assertions by the invariant only*.

## Test plan

- [x] Local: `go test -race -count=3 -v ./tools/failure-inject/cpusteal/`: all 4 tests PASS each iteration.
- [x] `make lint` clean.
- [x] `make vet` clean.
- [x] Audit completeness verified: broader grep sweep (`require.Less.*Millisecond`, `assert.Less.*Millisecond`, `elapsed > N*time.X`, `WithinDuration`, `Budget` callsites, `isRaceBuild` callsites) found no other instances of the same shape outside the kernelevents SLI test we already covered.
- [ ] CI on this PR.

## Rollback

Single Edit to restore the original numeric bounds. No dependents; the bounds are local to two test functions.

```release-notes
NONE: test stability only. Relaxes two absolute-time assertions in cpusteal's test to hang sentinels rather than performance bounds, matching the pattern landed in PR #76 and #78. No production behavior change.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>

## Summary

Three observations from the recent session that didn't fit the structured surfaces (already used for the load-bearing AGENTS.md entries in PR #81 and the agent-internal notes in PR #82). Each captured via the `learn-from-mistakes` flow and lands in its existing topic note.

**`.claude/notes/automation.md`**: *Memory captures rationale; hooks enforce.* The pre-PR checklist personal memory landed mid-session was followed by a lint failure shipped to CI within the hour. The same gap closed reliably by the `PreToolUse` hook installed shortly after. For any "always do X before Y" pattern, prefer the hook; the memory documents *why* the hook exists.

**`docs/notes/ci.md`**: *Frame CI / perf projections as ranges, not single numbers.* PR #72's 155s wall-time projection vs 242s actual cost an investigation round (later landed in PR #77) because the projection's setup-go-cache amortization assumption was unverified. Either verify assumptions empirically before publishing the number, or frame as a range.

**`.claude/notes/review-patterns.md`**: *Self-rate work, then write criteria for the next grade up.* Forces articulation of measurable improvements rather than free-form "anything else?". PR #76's B+ → A → A+ came from two iterations of this exact pattern; each iteration closed real structural gaps.

A fourth lesson, "fix existing tools before proposing new ones", was captured to personal memory (no PR, lives in `~/.claude/projects/.../memory/`), not the repo, because it's a judgment heuristic about my own decision-making rather than a repo-resident convention.

## Test plan

- [x] `make doc-check` clean (banned-phrase lint, link resolution, all gates).
- [x] `learn-from-mistakes` format check: banned vocabulary absent, no first-person AI phrasing, no AI attribution, all three entries carry `Anchor:` lines pointing at concrete PRs.
- [ ] CI on this PR exercises `doc-check` + `pr-lint`.

## Rollback

Each entry is a self-contained `### title` + body + `Anchor:` block at the top of its file. No dependents elsewhere; reverting is a single Edit per file.

```release-notes
NONE: documentation only. Three meta-lessons from a recent session retrospective land in their existing topic notes (`automation.md`, `ci.md`, `review-patterns.md`). No runtime behavior change.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>

trilamsr mentioned this pull request May 19, 2026

[ci] selftelemetry: relax degraded-counter regex to accept any positive value #78

Merged

5 tasks

trilamsr added 6 commits May 18, 2026 23:51

Merge remote-tracking branch 'origin/main' into ci/sli-test-warmup-an…

038d814

…d-deflake

trilamsr changed the title ~~[ci] kernelevents SLI test: warmup phase + remove t.Parallel~~ [ci] kernelevents SLI test: warmup + alert tier + sli emit helper May 19, 2026

trilamsr merged commit 869450b into main May 19, 2026
17 checks passed

trilamsr deleted the ci/sli-test-warmup-and-deflake branch May 19, 2026 07:48

trilamsr mentioned this pull request May 19, 2026

[ci] cpusteal_test: relax hang-sentinel bounds (flake-pattern audit) #83

Merged

5 tasks

trilamsr mentioned this pull request May 19, 2026

[docs] session retrospective: three meta-lessons captured #84

Merged

3 tasks

trilamsr merged 7 commits into main from ci/sli-test-warmup-and-deflake
