[ci] selftelemetry: relax degraded-counter regex to accept any positive value by trilamsr · Pull Request #78 · TraceCoreAI/tracecore

trilamsr · 2026-05-19T06:46:20Z

Summary

TestReceiver_SetDegraded asserted that the degraded_seconds_total counter matched `0\.0[0-9]+`, which implicitly restricts the value to [0.0, 0.1). The test's stated intent (per its comment, "Value must be >0") is just "counter advanced past zero"; the upper bound is a brittle artifact.

Caught on PR #76's rerun, where a slow GitHub Actions runner observed 0.126s of degraded time — well past the regex's implicit 0.1s ceiling, but well within the test's actual intent. Same flake shape as the SLI deflake landed in PR #76: a CI-timing assertion calibrated to fast-runner expectations failing on a slow runner.

Fix: regex becomes \d+\.[0-9]*[1-9] — match any value with at least one non-zero digit after the decimal point. Accepts 0.05, 0.126, 1.5, etc.; correctly rejects 0, 0.0, 0.000.
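As a rough illustration of the difference between the two patterns, the sketch below runs both against the sample values listed above. The helper names (`oldAccepts`, `newAccepts`) and the bare-value inputs are assumptions made for this example; the real test matches against rendered metrics output, not a bare number.

```go
package main

import (
	"fmt"
	"regexp"
)

// The two patterns quoted in this PR, compiled unanchored.
// Matching them against bare values is a simplification for this sketch.
var (
	oldPattern = regexp.MustCompile(`0\.0[0-9]+`)
	newPattern = regexp.MustCompile(`\d+\.[0-9]*[1-9]`)
)

// oldAccepts reports whether the pre-fix pattern matches v.
func oldAccepts(v string) bool { return oldPattern.MatchString(v) }

// newAccepts reports whether the post-fix pattern matches v.
func newAccepts(v string) bool { return newPattern.MatchString(v) }

func main() {
	// Sample values taken from the PR description.
	for _, v := range []string{"0.05", "0.126", "1.5", "0", "0.0", "0.000"} {
		fmt.Printf("%-6s old=%-5t new=%t\n", v, oldAccepts(v), newAccepts(v))
	}
}
```

The value 0.126 observed on the slow runner is the interesting case: the old pattern rejects it, while the new one accepts it.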

Test plan

- 10 isolated runs of TestReceiver_SetDegraded under -race: all PASS.
- 20-iteration stress run under GOMAXPROCS=2 + full internal/selftelemetry package: all PASS (2.554s).
- Other require.Regexp calls in the file inspected (lines 72, 75, 92, 110, 154): all assert exact integer counter values (2, 1, 8, etc.) or histogram counts; none have the same upper-bound-on-zero shape. No further changes needed.
- CI: verify-test passes on this PR.
- Once this lands, re-run CI on PR #76 ([ci] kernelevents SLI test: warmup + alert tier + sli emit helper) to clear the path for the SLI deflake.

Stabilizes the `internal/selftelemetry` degraded-counter test by relaxing its regex to accept any positive counter value (the upper bound of 0.1s was an artifact, not load-bearing). Reduces CI flakes from slow GitHub Actions runners.

…ve value

TestReceiver_SetDegraded asserted the degraded_seconds_total counter matched `0\.0[0-9]+`, an upper-bounded shape requiring the value to fall in `[0.0, 0.1)`. The test's stated intent (per its comment "Value must be >0 (>= ~0.05s)") is just "counter advanced past zero", so the upper bound is a brittle artifact, not load-bearing.

Caught on PR #76's rerun, where a slow GH Actions runner observed 0.126s of degraded time: well past the regex's implicit 0.1s ceiling, but well within the test's actual intent. Same flake shape as the SLI deflake landed in PR #76: a CI-timing assertion calibrated to fast-runner expectations failing on a slow runner.

Fix: regex becomes `\d+\.[0-9]*[1-9]`, which matches any value with at least one non-zero digit after the decimal point. Accepts 0.05, 0.126, 1.5, etc.; correctly rejects 0, 0.0, 0.000.

Verified locally: 10 isolated runs PASS; 20 iterations under GOMAXPROCS=2 + full package PASS.

Signed-off-by: Tri Lam <trilamsr@gmail.com>

…83)

## Summary

Flake-pattern audit follow-up to PR #76 + #78. Two assertions in `tools/failure-inject/cpusteal/cpusteal_test.go` match the same shape we fixed in `TestReceiver_SLIBudget` and `TestReceiver_SetDegraded`: hard absolute upper bound on observed timing, calibrated to fast-runner expectations.

| Before | After | What changed |
|---|---|---|
| `require.Less(elapsed, 500ms)` for 100ms request | `require.Less(elapsed, 2s)` | Hang sentinel, not perf bound: busy-loop scheduler delay under contention can run a 100ms request to 300-400ms |
| `require.Less(elapsed, 250ms)` for cancel response | `require.Less(elapsed, 2s)` | Same: context-cancellation latency varies by an order of magnitude under contention |

The lower-bound assertion on `TestRun_HonorsDuration` (`elapsed >= 95ms`) still pins the real contract (busy-loop runs for the requested time). The upper bounds only catch "never returned."

This matches the lesson landed in `AGENTS.md` via PR #81: *match perf-budget assertions by the invariant only*.

## Test plan

- [x] Local: `go test -race -count=3 -v ./tools/failure-inject/cpusteal/`: all 4 tests PASS each iteration.
- [x] `make lint` clean.
- [x] `make vet` clean.
- [x] Audit completeness verified: broader grep sweep (`require.Less.*Millisecond`, `assert.Less.*Millisecond`, `elapsed > N*time.X`, `WithinDuration`, `Budget` callsites, `isRaceBuild` callsites) found no other instances of the same shape outside the kernelevents SLI test we already covered.
- [ ] CI on this PR.

## Rollback

Single Edit to restore the original numeric bounds. No dependents; the bounds are local to two test functions.

```release-notes
NONE: test stability only. Relaxes two absolute-time assertions in cpusteal's test to hang sentinels rather than performance bounds, matching the pattern landed in PR #76 and #78. No production behavior change.
```

Signed-off-by: Tri Lam <trilamsr@gmail.com>
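The "lower bound pins the contract, upper bound is only a hang sentinel" pattern described in that commit can be sketched with the standard library alone. Everything here is an assumption for illustration: `busyLoop` is a hypothetical stand-in for the cpusteal workload and `checkElapsed` is not a function from the repository. Only the 100ms request, the 95ms lower bound, and the 2s sentinel come from the text above.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

const (
	requested    = 100 * time.Millisecond // duration asked of the busy loop
	lowerBound   = 95 * time.Millisecond  // real contract: ran about as long as requested
	hangSentinel = 2 * time.Second        // only catches "never returned"
)

// busyLoop is a hypothetical stand-in for the workload under test: it spins
// until d has elapsed or ctx is cancelled.
func busyLoop(ctx context.Context, d time.Duration) {
	deadline := time.Now().Add(d)
	for time.Now().Before(deadline) {
		if ctx.Err() != nil {
			return
		}
	}
}

// checkElapsed enforces the invariant only: not early, and not hung.
// It deliberately places no tight performance bound on elapsed.
func checkElapsed(elapsed time.Duration) error {
	if elapsed < lowerBound {
		return fmt.Errorf("returned early: %v < %v", elapsed, lowerBound)
	}
	if elapsed >= hangSentinel {
		return fmt.Errorf("possible hang: %v >= %v", elapsed, hangSentinel)
	}
	return nil
}

func main() {
	start := time.Now()
	busyLoop(context.Background(), requested)
	elapsed := time.Since(start)

	if err := checkElapsed(elapsed); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok: elapsed within invariant bounds")
}
```

With bounds shaped this way, a contended runner that stretches the 100ms request to 300-400ms still passes, while a call that never returns is still caught.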

trilamsr merged commit 06b9ddf into main May 19, 2026
10 checks passed

trilamsr deleted the ci/selftelemetry-setdegraded-deflake branch May 19, 2026 06:51

trilamsr mentioned this pull request May 19, 2026

[ci] cpusteal_test: relax hang-sentinel bounds (flake-pattern audit) #83

Merged

5 tasks

trilamsr merged 1 commit into main from ci/selftelemetry-setdegraded-deflake
