Skip to content

test(lifecycle): harden TOCTOU concurrent-Add race window (kernelevents+k8sevents)#210

Merged
trilamsr merged 1 commit into
mainfrom
pr-toctou-hardening-backport
May 31, 2026
Merged

test(lifecycle): harden TOCTOU concurrent-Add race window (kernelevents+k8sevents)#210
trilamsr merged 1 commit into
mainfrom
pr-toctou-hardening-backport

Conversation

@trilamsr

Copy link
Copy Markdown
Contributor

Summary

Backport containerstdout's hardened TOCTOU race-window test pattern (#197, #209) to kernelevents and k8sevents Lifecycle_ConcurrentAddDuringShutdown_NoPanic tests. The pre-hardened tests have a brittle race window that collapses on fast schedulers — Shutdown wins universally, every Add no-ops via the closed-guard, and the test passes vacuously while never exercising the TOCTOU path it claims to cover. On slightly slower (or differently-loaded) CI machines, the same brittleness flips and produces intermittent failures, blocking unrelated PRs.

Root cause

The lifecycle's mutex guard around (closed-check, wg.Add(1)) is correct in production code. The test, however, releases all 50 adders + the shutdowner from a single gate, then relies on runtime.Gosched() alone to interleave them. On fast schedulers the shutdowner reaches lc.Shutdown(...) before any adder reaches lc.Add(...), so:

  • Branch (a) "Add wins, registers under WaitGroup" never fires
  • Branch (b) "Shutdown wins, Add no-ops" fires for all 50 adders
  • Test passes but tests nothing meaningful
  • On flip side: race window straddle changes between CI runs, surfacing as flake

Containerstdout's hardened equivalent inserts a shutdownGate channel that holds the shutdowner back until 50µs after release fires, deterministically straddling adders-in-flight with the Shutdown call. This PR ports that pattern verbatim.

Changes

  • components/receivers/kernelevents/lifecycle_test.go: add shutdownGate chan + 50µs sleep + comment explaining intent
  • components/receivers/k8sevents/lifecycle_test.go: same

Production code: unchanged.

Verification

for i in $(seq 1 10); do go test -race -count=1 -run ConcurrentAddDuringShutdown ./components/receivers/kernelevents/...; done
# 10/10 PASS

for i in $(seq 1 10); do go test -race -count=1 -run ConcurrentAddDuringShutdown ./components/receivers/k8sevents/...; done
# 10/10 PASS

Branch-coverage check across 10 verbose iterations (registered count <50 == branch (b) exercised):

  • kernelevents: registered counts 44, 47, 49, 50, 50, 47, 44, 50, 50, 50 → branch (b) hit 6/10 iterations
  • k8sevents: registered counts 49, 49, 50, 50, 50, 48, 50, 44, 48, 50 → branch (b) hit 7/10 iterations

Both branches deterministically exercised. The existing registeredCount == 0 guard in the test prevents the inverse-vacuous regression (all-no-op).

Full repo: make check clean, go test -race ./... green.

Motivation

Unblocks four in-flight PRs hitting this flake on CI:

Reference

Containerstdout's port: #197 (original finding + fix), #209 (additional hardening). Same pattern applied here with no behavioral divergence.

Test plan

NONE

Port containerstdout's shutdownGate + 50µs scheduling-window pattern
(PRs #197, #209) to kernelevents and k8sevents lifecycle tests. On
fast schedulers, Shutdown trivially wins the race, forcing every Add
into the closed-guard no-op branch (b) and never exercising branch
(a) — which makes the test pass vacuously, then flake when CI
machines hit different scheduling. Gating Shutdown behind a 50µs
window after release straddles the TOCTOU race so both branches
land across stress iterations. Verified 10/10 with -race on both
packages; branch (b) exercised 6/10 (kernelevents) and 7/10
(k8sevents) iterations.

Unblocks PRs #203, #204, #205, #207 which were hitting the same
flake on CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) May 31, 2026 08:12
@trilamsr trilamsr merged commit 46e6f20 into main May 31, 2026
13 checks passed
@trilamsr trilamsr deleted the pr-toctou-hardening-backport branch May 31, 2026 08:19
trilamsr added a commit that referenced this pull request May 31, 2026
## Summary

Reconcile the four pivot-tracking docs
(`docs/rfcs/0013-distro-first-pivot.md`, `CHANGELOG.md`,
`MILESTONES.md`, `docs/migration/v0.1-to-v0.2.md`) with the wave-3
(PR-B1-shape sibling ports) and wave-4 (PR-B2-shape upstream-only ports
+ PR-F.1 + PR-J + PR-L + PR-N) landings. Pure doc sweep — no code or
config touched.

## What changed

### `docs/rfcs/0013-distro-first-pivot.md` §migration

PR sequence rows updated with PR-number citations and landed markers:

- **PR-A2** (landed, #189, 2026-05-30)
- **PR-B2** (landed, #201) — also enumerates sibling-receiver follow-ups
under PR-B2 to dispel the slug collision with #188's PR-B2-labelled dcgm
port: stdoutexporter (#202), pyspy (#203), kernelevents (#208),
containerstdout (#209)
- **PR-F.1** (landed) — fleshed-out delete list
(`internal/{selftelemetry,telemetry}` + `components/receivers/dcgm/` +
`pkg/dcgm/` + one orphan clockreceiver integration test)
- **PR-F.2** re-scoped — now deletes the whole
`internal/{componentstatus,pipeline,pipelinebuilder,consumer,fanout,runtime/lifecycle}`
bundle in one cut once the last three pipeline+consumer-importing
receivers land (#204 k8sevents, #205 clockreceiver, #207 otlphttp). Per
the import-graph state — `internal/componentstatus`'s only non-test
consumer is `internal/pipeline`, so they delete together
- **PR-G** (landed, #182), **PR-H** (landed, #183)
- **PR-I.1a** (in flight — scaffold agent), **PR-I.1b** (pre-staged;
gate satisfied by #201)
- **PR-J** (landed, #195) — kept existing marker
- **PR-K.1** (in flight — separate agent landing)
- **PR-L** (landed, skeleton #179 + body #191) — flagged as living
document
- **PR-N** (landed, #200) — shipped at v0.1.0 ahead of v0.3.0 as a
doc-only update at `docs/migration/v0.2-to-v0.3.md`

### `CHANGELOG.md` [Unreleased]

- Restructured the pivot wave list as **four waves** (was three). Wave 3
enumerates PR-B1-shape sibling ports + support infra (#180-#194/#196).
Wave 4 enumerates PR-B2-shape upstream-only ports + PR-J (#195) + PR-F.1
(#206) + PR-N (#200) + lint/TOCTOU hardening (#198/#210).
- Tightened the PR-F.2 deferred note to point at the three open ports
(#204/#205/#207) as the gate.

### `MILESTONES.md`

- **M1** (pipeline runtime) — status row now cites PR-A2 (#189), PR-F.1
(#206), PR-F.2 gate (#204/#205/#207), PR-E (#180), retains
`internal/config/` (still load-bearing for `tracecore validate`).
- **M2** (self-telemetry) — status row now cites PR-F.1 (#206); flags
`internal/componentstatus` as travelling with `internal/pipeline` in
PR-F.2.
- **M8** (DCGM receiver) — status flipped to *landed-and-replaced*:
cites PR-F.1 (#206) deletion + PR-J (#195)
`docs/integrations/prometheus-scrape.md` recipe. Notes the inert chart
toggle retention until PR-K.3.

### `docs/migration/v0.1-to-v0.2.md`

- §`internal/*` package deletion (PR-F) status flips from "not yet open"
to "PR-F.1 landed (#206), PR-F.2 gated on three open ports".
- Open-items checklist expanded from 5 to 13 entries — tracks every PR
letter the migration guide cares about (A2 / E / F.1 / F.2 / I.1a-c / J
/ K.1-3 / L / N) with PR numbers and links.

## Why now

Tracking docs accumulated drift across wave-3 + wave-4 because every
sibling-port PR (and the support-infra PRs around them) updated the
bottom of `CHANGELOG.md` but did not always touch the upstream
sequencing section in RFC-0013. Per memory rule `[Keeping this document
current]`: status drift is a review blocker. This PR is the consolidated
catch-up; future port PRs include their RFC-row flip in-PR.

## What this PR does NOT change

- No code, no config, no YAML, no chart — only the four tracking docs.
- No new doc gates added; existing gates pass.
- No PRs other than the four named docs are modified.

## Test plan

- [x] `bash scripts/doc-check.sh` clean (33 test refs, 528 links
resolve, comment-noise diff gate clean vs `origin/main`, all 13 gates
green).
- [x] Pre-commit hook (`commitlint` 72-char subject limit + DCO +
AI-trailer gates) passed.
- [x] Pre-push hook (`make ci-fast` equivalent: `golangci-lint`, `go
vet`, `go mod verify`, `no-autoupdate-check`, `doc-check.sh`) passed on
second attempt after `git fetch origin main` populated the worktree's
`origin/main` ref — first push failed because the worktree previously
tracked the (gone) `pr-a2-ocb-main-swap` branch, so `doc-check.sh`'s
comment-noise diff-scope gate exited 128 on the missing ref. Root cause
fixed by the fetch; not a workaround.
- [ ] CI green on this branch.

```release-notes
NONE
```

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant