[m19] chaos.yml pattern-pod-evicted row + real-world replay slot#134

Merged
trilamsr merged 2 commits into main from worktree-m19-chaos-replay
May 21, 2026

Conversation

@trilamsr trilamsr commented May 20, 2026

Contributor

Summary

Closes two M19 carry-forwards and trims a stale third:

  • chaos.yml pattern-pod-evicted job (M19 carry-forward #4). New job runs the hermetic go test ./internal/synthesis/... -race replay-corpus path. The pod-evict CLI half is covered by a pod-evict --reason=DiskPressure SHA pin in tools/failure-inject/testdata/golden.sha256; the existing harness-determinism Golden-SHA-Pin loop now verifies pod-evict alongside xid. Job timeout set to 5min (observed runtime 1m48s; ~3× headroom).
  • _real_world/ replay-corpus slot (M19 carry-forward #1). internal/synthesis/replay/pod_evicted/_real_world/ is the contribution slot for anonymized production captures. The replay loader's underscore-group convention (per _negative/) means new fixtures plug in without code changes; README.md documents the contribution checklist and ships an inline fixture template (manifest / events / node_conditions / golden JSON shapes). TestPodEvictedReplay_RealWorldGroupLoaderSafe mutation-verifies that the empty group walks safely. The group is empty today; populating it is the remaining piece of carry-forward #1.
  • MILESTONES bookkeeping. Rubric L409 flips ⧗ → ☑ (slot exists); carry-forward list trimmed from 5 → 3. Drops the chaos-row line and the stale detector-overhead-bench line (rubric L418 was already ☑). Carry-forward (1) annotates the real_world_*/ → _real_world/ rename and its reason (loader-convention parity with _negative/).
  • paths: filter gains internal/synthesis/** so detector / replay-corpus changes retrigger Chaos on PRs.

Carry-forwards still open: detection-latency ≤5s p95 (needs live cluster); --filler=/tmp real-kubelet eviction (needs --allow-cluster-write kube-apiserver seam); real-world fixture contributions.

Root-cause framing

The two closed carry-forwards were missing-infrastructure items, not workarounds:

  • The replay-corpus loader already supported underscore-prefixed groups (_negative/). A second slot for real-world captures was a directory-and-README-and-loader-test away; no code change to the loader itself was needed.
  • The chaos workflow already pinned the pod-evict CLI's two-run determinism (lines 77-81). The matrix-of-patterns rule asked for an explicit named pattern row, which lands here.

Explicitly skipped: per-job path-filter trim of internal/synthesis/** from harness-determinism. Would need dorny/paths-filter (added dep) or workflow split (scope creep). The ~2min of redundant CI serves two distinct gates (CLI bytes vs detector verdict) and is justified per PRINCIPLES §13 operator-first.

No new workarounds.

Release notes

NONE

Test plan

  • go test -race -count=1 ./internal/synthesis/... ./tools/failure-inject/... passes locally.
  • make doc-check clean (em-dash diff-gate, banned-phrase lint, alert-check, release-doc-parity).
  • Mutation-verified SHA gate: failure-inject --seed=1 pod-evict --reason=DiskPressure produces a different SHA than the pinned one; both pattern-pod-evicted and harness-determinism Golden-SHA-Pin loop fail closed on argv drift.
  • Mutation-verified loader test: dropping a manifest-less subdir under _real_world/ makes TestPodEvictedReplay_RealWorldGroupLoaderSafe fail with the offending-path citation; restored.
  • Replay loader + new test pass with _real_world/ group present and empty.
  • Chaos workflow green on prior push (15/15 checks; pattern-pod-evicted (M19) 1m48s).

Closes two M19 carry-forwards (chaos matrix row, anonymized
real-world fixture slot) and tightens MILESTONES bookkeeping.

- chaos.yml gains a `pattern-pod-evicted` job that runs the
  hermetic `go test ./internal/synthesis/...` replay-corpus path
  under `-race`. The pod-evict CLI half is covered by a new SHA
  pin in `tools/failure-inject/testdata/golden.sha256` — the
  existing `harness-determinism` Golden-SHA-Pin loop now
  verifies pod-evict alongside xid. Mutation-verified: changing
  `--seed=1` flips the SHA and the gate fails.
- `internal/synthesis/replay/pod_evicted/_real_world/` is the
  contribution slot for anonymized production captures. The
  replay loader's underscore-group convention (per `_negative/`)
  means new fixtures plug in without code changes; README.md
  documents the contribution checklist. Empty group today;
  contributions are the remaining M19 carry-forward.
- MILESTONES rubric L409 flips ⧗ → ☑; carry-forward list
  trimmed from 5 → 3 (drops the chaos-row line and rewrites
  the real-world slot to point at the now-present `_real_world/`
  group). Carry-forward (3) detector-overhead bench was already
  ☑ on rubric L418 and is removed from the list as stale.
- chaos.yml `paths:` filter gains `internal/synthesis/**` so
  detector / replay-corpus changes retrigger the workflow on
  PRs.

Carry-forwards still open: detection-latency ≤5s p95 (needs
live cluster); `--filler=/tmp` real-kubelet eviction (needs
`--allow-cluster-write` kube-apiserver seam); real-world
fixture contributions.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr force-pushed the worktree-m19-chaos-replay branch from 8f6eef8 to 5ec7e92 Compare May 20, 2026 22:48
Five small additions on top of the M19 chaos-replay PR after
self-review surfaced gaps the single-pass review missed:

- `TestPodEvictedReplay_RealWorldGroupLoaderSafe` asserts the
  `_real_world/` group walks empty and contributes 0 fixtures
  (canonical + 3 negatives = 4 total). Mutation-verified by
  dropping a manifest-less subdir and watching the test fail
  with the path citation; restored. Catches contributor
  partial-fixture drops before the rubric harness sees them.
- `_real_world/README.md` gains an inline fixture template
  (manifest / events / node_conditions / golden JSON shapes with
  placeholder values). Reduces contribution friction from
  "reverse-engineer from canonical/" to "copy template, edit
  placeholders". Template lives in the README rather than as a
  fixture dir because every subdir of `_real_world/` is loaded;
  a directory-based template would either run as a test or
  require a loader special-case neither rule-of-three justifies.
- `pattern-pod-evicted` job timeout tightened from 10min to 5min
  (observed runtime 1m48s; ~3× headroom). Fails fast on flake
  rather than soaking the chaos-job boilerplate ceiling.
- MILESTONES carry-forward (1) clause documents the
  `real_world_*/` → `_real_world/` rename and its reason
  (loader-convention parity with the existing `_negative/`
  group). Original carry-forward text named the un-prefixed
  shape; the loader's underscore-group walk is the constraint.
- Four em-dashes in the new README replaced with semicolons /
  full stops to satisfy `make doc-check`'s diff-scope em-dash
  gate. The diff-gate compares vs origin/main and the em-dashes
  were introduced by this PR, so they are this PR's punctuation
  to fix.

Explicitly skipped: per-job path filtering trim of
`internal/synthesis/**` from `harness-determinism` (CONCERN
flagged during review). Would need either `dorny/paths-filter`
(added dependency) or a workflow split (scope creep). The
~2min of redundant CI serves two distinct gates (CLI bytes vs
detector verdict) and is justified per PRINCIPLES §13
operator-first. Revisit if the redundancy pattern recurs.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
@trilamsr trilamsr merged commit 389b48b into main May 21, 2026
16 checks passed
@trilamsr trilamsr deleted the worktree-m19-chaos-replay branch May 21, 2026 03:01
trilamsr added a commit that referenced this pull request May 21, 2026
#147)

## Summary

Single-PR bundle of 10 low-risk follow-up actions. Each row was
anchor-verified on `main` before editing; no production behavior change.
Diff is 10 files, +91/-37, dominated by markdown.

**Breakdown:**
- 3 strikes (anchor shipped, row was stale)
- 1 test-only struct add (k8sevents `NodeWatchErrors`)
- 1 bash test add (`no-autoupdate-check` hit-line format lock)
- 5 doc-only clarifications / partial-ship / audit notes

## Items applied

### Strikes — anchor on `main` confirms shipped

1. **M3.md L188 `make doc-check`.** `scripts/doc-check.sh` header reads
"verify every Test\*/Fuzz\*/Benchmark\* name referenced in docs"; wired
into `make doc-check` and `make ci`.
2. **M8.md L103 `docs/HARDWARE-TESTING.md` libdcgm + nv-hostengine
setup.** File exists (28 hits for libdcgm/dcgm/nv-hostengine); covers
Ubuntu 22.04 driver / `libdcgm-dev` / `nv-hostengine` provisioning,
x86_64 + aarch64-SBSA build matrix, and the `//go:build dcgm,hardware`
distinction. Doc shipped ahead of cgo client to unblock GPU-less
contributors.
3. **M19.md L18 `nodeWatchErrCount` not in SnapshotCounters.** Closed by
item 4 below, which adds the `NodeWatchErrors` field symmetrically.

### Test-only struct add

4. **components/receivers/k8sevents/export_test.go.** Added
`NodeWatchErrors int64` field on `CountersForTest`; `SnapshotCounters`
now reads `rr.nodeWatchErrCount.Load()` symmetrically with
`rr.watchErrCount.Load()`. Both call sites are keyed-init inside the
same file; no external positional callers to break (grep confirmed: only
2 hits, both in `export_test.go`).
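
A rough sketch of the symmetric snapshot described in item 4, under stated assumptions: the `receiver` type is a stand-in that models only the two counters named above, `WatchErrors` is an assumed field name, and the `SnapshotCounters` signature is guessed. Only `NodeWatchErrors`, `CountersForTest`, and the two `Load()` calls come from the text.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// receiver is a stand-in type; only the two error counters named in
// the text are modeled.
type receiver struct {
	watchErrCount     atomic.Int64
	nodeWatchErrCount atomic.Int64
}

// CountersForTest is the test-only snapshot. WatchErrors is an assumed
// field name; NodeWatchErrors is the field the item adds.
type CountersForTest struct {
	WatchErrors     int64
	NodeWatchErrors int64
}

// SnapshotCounters reads both counters the same way. The keyed
// initialization means adding a field cannot silently shift values,
// which is why positional callers were the only thing to check for.
func SnapshotCounters(rr *receiver) CountersForTest {
	return CountersForTest{
		WatchErrors:     rr.watchErrCount.Load(),
		NodeWatchErrors: rr.nodeWatchErrCount.Load(),
	}
}

func main() {
	rr := &receiver{}
	rr.nodeWatchErrCount.Add(2)
	fmt.Printf("%+v\n", SnapshotCounters(rr)) // {WatchErrors:0 NodeWatchErrors:2}
}
```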

### Bash test add (M23 grep-gate format lock)

5. **scripts/no-autoupdate-check_test.sh "hit-line-format-stable".** New
assertion that runs the gate against the hyphenated-go-update fixture,
captures stdout (existing tests discard it), and asserts at least one
line matches `^[^:]+:[0-9]+:`. Locks the parseable hit-line shape
*before* the first automation consumer (CI summary, dashboard, Slack
notifier) wires up — a cosmetic tweak to the gate's message body now
fails CI instead of silently breaking downstream parsers. M23.md row
struck.
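
The format lock in item 5 is a bash assertion; the same check can be illustrated in Go as a sketch. The regular expression is the one quoted above, while the `hasHitLine` helper and the sample lines are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// hitLine matches the "path:line:" prefix quoted in the item above.
var hitLine = regexp.MustCompile(`^[^:]+:[0-9]+:`)

// hasHitLine reports whether at least one line of the captured stdout
// has the parseable hit-line shape.
func hasHitLine(stdout string) bool {
	for _, line := range strings.Split(stdout, "\n") {
		if hitLine.MatchString(line) {
			return true
		}
	}
	return false
}

func main() {
	// Both sample lines are invented for illustration.
	fmt.Println(hasHitLine("scripts/example.sh:12: hypothetical hit message")) // true
	fmt.Println(hasHitLine("hit in scripts/example.sh at line 12"))            // false
}
```

The second sample carries the same information but in a reworded shape, which is the kind of cosmetic change the assertion is meant to catch.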

### Doc-only clarifications

6. **M15.md L185 falsifying-check backfill.** Anchored the
"/var/lib/tracecore/ subdir governance" row's grep-falsifying-check to
RFC-0010 §Proposal — `docs/rfcs/0010-containerstdout-receiver-scope.md`
L177/L217/L274/L393/L407 already carry the convention ("M15 owns
`/var/lib/tracecore/container_stdout/`. Future siblings reserve their
own subdirectories."). Row marked `[x]`.

7. **M15.md L192 + RFC-0010 §Pod-attribution forward-pointer.** Appended
one-line cross-reference at RFC-0010 L158 → `docs/followups/M15.md`
"Cross-receiver rank-label reconciliation" so the deferred audit trail
is discoverable from the RFC. Row marked `[x]`.

8. **M8.md L30 `tracecore debug dump` partial-ship.**
`cmd/tracecore/debug.go::runDebugDump` already writes version + revision
+ branch + build date + Go runtime stats + registered components +
redacted config to `tracecore debug dump > diagnostic.txt`. Remaining
gap is "last N samples" — needs receiver-side ring buffer (M2
carry-forward). Row kept open with partial-ship line +
remaining-trigger.

9. **M3.md L153 SUPPLY-CHAIN-IDENTITY.md scope clarification.** Added
one sentence noting the consolidation is a copy-and-deduplicate pass
against existing `release.yml` comment blocks (cosign-sign-blob,
gh-attestation-sign), not net-new authoring — so the next reader sees
the actual scope of work, not a misleading "30-min write" estimate that
implies green-field.

10. **otlphttp.md L182 workflow paths audit + M14.md L88 test pointer.**
- **otlphttp**: inlined audit findings (2026-05-20). `chart.yml` and
`install-bench.yml` are substrate-aware (include `cmd/tracecore/**`,
`internal/**`); `kernelevents-integration.yml` and
`pyspy-integration.yml` cover only `components/receivers/<name>/**` +
`internal/runtime/lifecycle/**` — a `cmd/tracecore` factory wiring or
`internal/pipeline` contract change can land without re-running these
integration jobs. `chaos.yml` covers `tools/failure-inject/**` +
`internal/synthesis/**` only (indirect coupling, acceptable). Remaining:
6-line YAML edit per integration workflow.
- **M14**: added inline pointer from the multi-retry slow-write fixture
row to the existing single-retry baseline at
`components/receivers/kineto/shutdown_test.go::TestIngest_RetryOnTruncated`
so the future author has the test-shape anchor.
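
The coverage gap from the otlphttp audit can be illustrated with a simplified matcher. This sketch handles only filters of the form `dir/**` via prefix matching; real GitHub Actions `paths:` filters accept richer globs. The `covers` helper, the `example` receiver name standing in for `<name>`, and the changed file path are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// covers reports whether a changed path falls under any filter.
// Only the "dir/**" form is handled; other patterns are ignored.
func covers(filters []string, changed string) bool {
	for _, f := range filters {
		if !strings.HasSuffix(f, "/**") {
			continue
		}
		prefix := strings.TrimSuffix(f, "**")
		if strings.HasPrefix(changed, prefix) {
			return true
		}
	}
	return false
}

func main() {
	// Filter shape of the receiver integration workflows, with a
	// placeholder receiver name.
	integration := []string{
		"components/receivers/example/**",
		"internal/runtime/lifecycle/**",
	}
	// Filter shape of the substrate-aware workflows.
	substrate := []string{"cmd/tracecore/**", "internal/**"}

	changed := "internal/pipeline/contract.go" // hypothetical file
	fmt.Println(covers(integration, changed)) // false
	fmt.Println(covers(substrate, changed))   // true
}
```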

## Files changed

| File | LOC | Kind |
|---|---|---|
| `components/receivers/k8sevents/export_test.go` | +2 | test struct field |
| `scripts/no-autoupdate-check_test.sh` | +20 | bash test add |
| `docs/rfcs/0010-containerstdout-receiver-scope.md` | +1/-1 | inline cross-ref |
| `docs/followups/M3.md` | +9/-5 | strike + scope clarification |
| `docs/followups/M8.md` | +16/-5 | strike + partial-ship |
| `docs/followups/M14.md` | +1/-1 | test pointer |
| `docs/followups/M15.md` | +15/-8 | 2 strikes |
| `docs/followups/M19.md` | +5/-9 | strike (anchored to test add) |
| `docs/followups/M23.md` | +9/-7 | strike |
| `docs/followups/otlphttp.md` | +13/-1 | audit findings inline |

## Test plan

- [x] `go test ./components/receivers/k8sevents/...` green.
- [x] `bash scripts/no-autoupdate-check_test.sh` 10/10 assertions pass
(added "hit-line-format-stable" — the 10th).
- [x] `bash scripts/doc-check.sh` green (437 markdown links resolve,
em-dash + en-dash diff gate clean, comment-noise diff gate clean).
- [x] Pre-commit hook ran full `make check` + `make ci` (all package
tests cached/passing).
- [ ] CI green on this branch.

## Release notes

```release-notes
NONE
```

## Sequencing

Builds on `main` after PRs #132 (shard split), #133 (RUNBOOK +
chart-appversion), #142 (opportunistic curation), #134 (chaos.yml row),
#143 (cross-shard audit). Independent of currently-open PRs #144 (m6
integration recipes) and #145 (m3 GHCR image publish).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trilamsr pushed a commit that referenced this pull request May 21, 2026
Sync feature branch with main per the merge-not-rebase policy
documented in CONTRIBUTING.md (commit ddf86f7).

Main moved 6 PRs ahead during this branch's lifetime:
- PR #143 (followups sweep)
- PR #134 (chaos.yml pattern-pod-evicted)
- PR #142 (follow-up curation)
- PR #144 (M6 integration recipes)
- PR #146 (kineto MaxEvents stub)
- PR #147 (followups bundle)

Conflicts expected in CHANGELOG.md and docs/followups/M3.md
(both additive).

# Conflicts:
#	CHANGELOG.md