Skip to content

[ci] selftelemetry: relax degraded-counter regex to accept any positive value#78

Merged
trilamsr merged 1 commit into
mainfrom
ci/selftelemetry-setdegraded-deflake
May 19, 2026
Merged

[ci] selftelemetry: relax degraded-counter regex to accept any positive value#78
trilamsr merged 1 commit into
mainfrom
ci/selftelemetry-setdegraded-deflake

Conversation

@trilamsr

Copy link
Copy Markdown
Contributor

Summary

TestReceiver_SetDegraded asserted the degraded_seconds_total counter matched 0\.0[0-9]+ — an implicit upper bound of [0.0, 0.1). The test's stated intent (per its comment, "Value must be >0") is just "counter advanced past zero"; the upper bound is brittle artifact.

Caught on PR #76's rerun, where a slow GitHub Actions runner observed 0.126s of degraded time — well past the regex's implicit 0.1s ceiling, but well within the test's actual intent. Same flake shape as the SLI deflake landed in PR #76: a CI-timing assertion calibrated to fast-runner expectations failing on a slow runner.

Fix: regex becomes \d+\.[0-9]*[1-9] — match any value with at least one non-zero digit after the decimal point. Accepts 0.05, 0.126, 1.5, etc.; correctly rejects 0, 0.0, 0.000.

Test plan

  • 10 isolated runs of TestReceiver_SetDegraded under -race: all PASS.
  • 20-iteration stress run under GOMAXPROCS=2 + full internal/selftelemetry package: all PASS (2.554s).
  • Other require.Regexp calls in the file inspected (lines 72, 75, 92, 110, 154): all assert exact integer counter values (2, 1, 8, etc.) or histogram counts — none have the same upper-bound-on-zero shape. No further changes needed.
  • CI: verify-test passes on this PR.
  • Once this lands, re-run PR [ci] kernelevents SLI test: warmup + alert tier + sli emit helper #76's CI to clear the path for the SLI deflake.
Stabilizes the `internal/selftelemetry` degraded-counter test by relaxing its regex to accept any positive counter value (the upper bound of 0.1s was an artifact, not load-bearing). Reduces CI flakes from slow GitHub Actions runners.

…ve value

TestReceiver_SetDegraded asserted the degraded_seconds_total
counter matched `0\.0[0-9]+` — an upper-bounded shape requiring the
value to fall in `[0.0, 0.1)`. The test's stated intent (per its
comment "Value must be >0 (>= ~0.05s)") is just "counter advanced
past zero", so the upper bound is brittle artifact, not load-bearing.

Caught on PR #76's rerun, where a slow GH Actions runner observed
0.126s of degraded time — well past the regex's implicit 0.1s
ceiling, but well within the test's actual intent. Same flake shape
as the SLI deflake landed in PR #76: a CI-timing assertion calibrated
to fast-runner expectations failing on a slow runner.

Fix: regex becomes `\d+\.[0-9]*[1-9]` — match any value with at
least one non-zero digit after the decimal point. Accepts 0.05,
0.126, 1.5, etc.; correctly rejects 0, 0.0, 0.000.

Verified locally: 10 isolated runs PASS; 20 iterations under
GOMAXPROCS=2 + full package PASS.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr merged commit 06b9ddf into main May 19, 2026
10 checks passed
@trilamsr trilamsr deleted the ci/selftelemetry-setdegraded-deflake branch May 19, 2026 06:51
trilamsr added a commit that referenced this pull request May 19, 2026
…83)

## Summary

Flake-pattern audit follow-up to PR #76 + #78. Two assertions in
`tools/failure-inject/cpusteal/cpusteal_test.go` match the same shape we
fixed in `TestReceiver_SLIBudget` and `TestReceiver_SetDegraded`: hard
absolute upper bound on observed timing, calibrated to fast-runner
expectations.

| Before | After | What changed |
|---|---|---|
| `require.Less(elapsed, 500ms)` for 100ms request |
`require.Less(elapsed, 2s)` | Hang sentinel, not perf bound — busy-loop
scheduler delay under contention can run a 100ms request to 300-400ms |
| `require.Less(elapsed, 250ms)` for cancel response |
`require.Less(elapsed, 2s)` | Same — context-cancellation latency varies
by an order of magnitude under contention |

The lower-bound assertion on `TestRun_HonorsDuration` (`elapsed >=
95ms`) still pins the real contract (busy-loop runs for the requested
time). The upper bounds only catch "never returned." This matches the
lesson landed in `AGENTS.md` via PR #81 — *match perf-budget assertions
by the invariant only*.

## Test plan

- [x] Local: `go test -race -count=3 -v
./tools/failure-inject/cpusteal/` — all 4 tests PASS each iteration.
- [x] `make lint` clean.
- [x] `make vet` clean.
- [x] Audit completeness verified: broader grep sweep
(`require.Less.*Millisecond`, `assert.Less.*Millisecond`, `elapsed >
N*time.X`, `WithinDuration`, `Budget` callsites, `isRaceBuild`
callsites) found no other instances of the same shape outside the
kernelevents SLI test we already covered.
- [ ] CI on this PR.

## Rollback

Single Edit to restore the original numeric bounds. No dependents; the
bounds are local to two test functions.

```release-notes
NONE — test stability only. Relaxes two absolute-time assertions in cpusteal's test to hang sentinels rather than performance bounds, matching the pattern landed in PR #76 and #78. No production behavior change.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant