[ci] cpusteal_test: relax hang-sentinel bounds (flake-pattern audit)#83

Merged: trilamsr merged 1 commit into main from ci/audit-flake-regex-pattern on May 19, 2026

@trilamsr

Summary

Flake-pattern audit follow-up to PR #76 + #78. Two assertions in tools/failure-inject/cpusteal/cpusteal_test.go match the same shape we fixed in TestReceiver_SLIBudget and TestReceiver_SetDegraded: hard absolute upper bound on observed timing, calibrated to fast-runner expectations.

| Before | After | What changed |
|---|---|---|
| require.Less(elapsed, 500ms) for 100ms request | require.Less(elapsed, 2s) | Hang sentinel, not perf bound: busy-loop scheduler delay under contention can run a 100ms request to 300-400ms |
| require.Less(elapsed, 250ms) for cancel response | require.Less(elapsed, 2s) | Same: context-cancellation latency varies by an order of magnitude under contention |

The lower-bound assertion on TestRun_HonorsDuration (elapsed >= 95ms) still pins the real contract (busy-loop runs for the requested time). The upper bounds only catch "never returned." This matches the lesson landed in AGENTS.md via PR #81: match perf-budget assertions by the invariant only.

Test plan

  • Local: go test -race -count=3 -v ./tools/failure-inject/cpusteal/ — all 4 tests PASS each iteration.
  • make lint clean.
  • make vet clean.
  • Audit completeness verified: broader grep sweep (require.Less.*Millisecond, assert.Less.*Millisecond, elapsed > N*time.X, WithinDuration, Budget callsites, isRaceBuild callsites) found no other instances of the same shape outside the kernelevents SLI test we already covered.
  • CI on this PR.
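The sweep itself was done with grep. As a rough illustration of what it looks for, the sketch below flags lines matching a subset of the listed shapes. The combined pattern and the `flaggedLines` helper are assumptions made for this example, not the exact expressions used in the audit.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// suspect covers a subset of the shapes listed in the sweep:
// require.Less / assert.Less against a Millisecond bound, and WithinDuration.
// The exact audit expressions are not recorded here; this is illustrative.
var suspect = regexp.MustCompile(`(require|assert)\.Less.*Millisecond|WithinDuration`)

// flaggedLines returns the 1-based numbers of lines in src that match suspect.
func flaggedLines(src string) []int {
	var hits []int
	for i, line := range strings.Split(src, "\n") {
		if suspect.MatchString(line) {
			hits = append(hits, i+1)
		}
	}
	return hits
}

func main() {
	src := "elapsed := time.Since(start)\n" +
		"require.Less(t, elapsed, 500*time.Millisecond)\n" +
		"require.GreaterOrEqual(t, elapsed, 95*time.Millisecond)\n"
	fmt.Println(flaggedLines(src)) // prints [2]
}
```

Only the upper-bound assertion is flagged; the lower-bound `require.GreaterOrEqual` line is the kind of invariant check the audit leaves alone. A match is a candidate for review, not automatically a flake.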

Rollback

A single edit restores the original numeric bounds. No dependents; the bounds are local to two test functions.

NONE — test stability only. Relaxes two absolute-time assertions in cpusteal's test to hang sentinels rather than performance bounds, matching the pattern landed in PR #76 and #78. No production behavior change.

Caught by the flake-pattern audit (FOLLOWUPS entry post-PR #76).
`require.Less(elapsed, 500ms)` for a 100ms request and
`require.Less(elapsed, 250ms)` for a context-cancel were both
calibrated to fast-runner expectations — same shape as the SLI and
SetDegraded flakes already fixed this session. Under GH Actions
runner contention, scheduler delays on a busy-loop or
context-cancellation latency can exceed those bounds without any
regression in the receiver under test.

Relaxed both upper bounds to 2s as hang sentinels rather than perf
bounds. The lower-bound assertion on `TestRun_HonorsDuration`
(`elapsed >= 95ms`) still pins the real contract (busy-loop ran for
the requested time); the upper bound just catches "never returned".
Same fix shape applied to `TestRun_HonorsContextCancellation`.

Local: 3 isolated runs under -race, all 4 cpusteal tests PASS.
`make lint` clean, `make vet` clean.

Anchor for the audit: `AGENTS.md` lesson "Match perf-budget
assertions by the invariant only" (PR #81); FOLLOWUPS § "CI flake
hygiene".

Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr enabled auto-merge (squash) May 19, 2026 08:23
@trilamsr trilamsr merged commit 7d39606 into main May 19, 2026
12 checks passed
@trilamsr trilamsr deleted the ci/audit-flake-regex-pattern branch May 19, 2026 08:27