[m19] chaos.yml pattern-pod-evicted row + real-world replay slot #134
Merged
Closes two M19 carry-forwards (chaos matrix row, anonymized real-world fixture slot) and tightens MILESTONES bookkeeping.

- chaos.yml gains a `pattern-pod-evicted` job that runs the hermetic `go test ./internal/synthesis/...` replay-corpus path under `-race`. The pod-evict CLI half is covered by a new SHA pin in `tools/failure-inject/testdata/golden.sha256`; the existing `harness-determinism` Golden-SHA-Pin loop now verifies pod-evict alongside xid. Mutation-verified: changing `--seed=1` flips the SHA and the gate fails.
- `internal/synthesis/replay/pod_evicted/_real_world/` is the contribution slot for anonymized production captures. The replay loader's underscore-group convention (per `_negative/`) means new fixtures plug in without code changes; README.md documents the contribution checklist. Empty group today; contributions are the remaining M19 carry-forward.
- MILESTONES rubric L409 flips ⧗ → ☑; carry-forward list trimmed from 5 → 3 (drops the chaos-row line and rewrites the real-world slot to point at the now-present `_real_world/` group). Carry-forward (3) detector-overhead bench was already ☑ on rubric L418 and is removed from the list as stale.
- chaos.yml `paths:` filter gains `internal/synthesis/**` so detector / replay-corpus changes retrigger the workflow on PRs.

Carry-forwards still open: detection-latency ≤5s p95 (needs live cluster); `--filler=/tmp` real-kubelet eviction (needs `--allow-cluster-write` kube-apiserver seam); real-world fixture contributions.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
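The Golden-SHA-Pin check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the repository's harness: the two-column `<sha256>  <name>` pin layout, the helper names (`digest`, `parsePins`, `checkPin`), and the sample output bytes are all assumptions made for the example.

```go
package main

import (
	"bufio"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// digest returns the hex-encoded SHA-256 of output.
func digest(output []byte) string {
	sum := sha256.Sum256(output)
	return hex.EncodeToString(sum[:])
}

// parsePins reads "<hex-sha256>  <name>" lines (sha256sum style).
// ASSUMPTION: the real golden.sha256 layout may differ.
func parsePins(golden string) map[string]string {
	pins := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(golden))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 {
			pins[fields[1]] = fields[0]
		}
	}
	return pins
}

// checkPin fails closed: a missing pin and a digest mismatch are both errors.
func checkPin(pins map[string]string, name string, output []byte) error {
	want, ok := pins[name]
	if !ok {
		return fmt.Errorf("no pin for %q", name)
	}
	if got := digest(output); got != want {
		return fmt.Errorf("%s: sha256 %s, pinned %s", name, got, want)
	}
	return nil
}

func main() {
	// Invented stand-in for deterministic CLI output.
	stable := []byte("pod-evict seed=1\n")
	pins := parsePins(digest(stable) + "  pod-evict\n")

	fmt.Println(checkPin(pins, "pod-evict", stable) == nil)
	// Any byte-level drift (for example a different seed) trips the gate.
	fmt.Println(checkPin(pins, "pod-evict", []byte("pod-evict seed=2\n")) != nil)
}
```

Treating a missing pin as a failure keeps the gate closed when a new scenario is added without a matching pin.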
8f6eef8 to
5ec7e92
Five small additions on top of the M19 chaos-replay PR after self-review surfaced gaps the single-pass review missed:

- `TestPodEvictedReplay_RealWorldGroupLoaderSafe` asserts the `_real_world/` group walks empty and contributes 0 fixtures (canonical + 3 negatives = 4 total). Mutation-verified by dropping a manifest-less subdir and watching the test fail with the path citation; restored. Catches contributor partial-fixture drops before the rubric harness sees them.
- `_real_world/README.md` gains an inline fixture template (manifest / events / node_conditions / golden JSON shapes with placeholder values). Reduces contribution friction from "reverse-engineer from canonical/" to "copy template, edit placeholders". Template lives in the README rather than as a fixture dir because every subdir of `_real_world/` is loaded; a directory-based template would either run as a test or require a loader special-case, neither of which rule-of-three justifies.
- `pattern-pod-evicted` job timeout tightened from 10min to 5min (observed runtime 1m48s; ~3× headroom). Fails fast on flake rather than soaking the chaos-job boilerplate ceiling.
- MILESTONES carry-forward (1) clause documents the `real_world_*/` → `_real_world/` rename and its reason (loader-convention parity with the existing `_negative/` group). Original carry-forward text named the un-prefixed shape; the loader's underscore-group walk is the constraint.
- Four em-dashes in the new README replaced with semicolons / full stops to satisfy `make doc-check`'s diff-scope em-dash gate. The diff-gate compares vs origin/main and the em-dashes were introduced by this PR, so they are this PR's punctuation to fix.

Explicitly skipped: per-job path filtering trim of `internal/synthesis/**` from `harness-determinism` (CONCERN flagged during review). Would need either `dorny/paths-filter` (added dependency) or a workflow split (scope creep). The ~2min of redundant CI serves two distinct gates (CLI bytes vs detector verdict) and is justified per PRINCIPLES §13 operator-first. Revisit if the redundancy pattern recurs.

Signed-off-by: Tri Lam <trilamsr@gmail.com>
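The underscore-group walk and the loader-safety check described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the project's replay loader: the `manifest.json` file name, the helper names (`loadFixtures`, `requireManifest`, `countDemoCorpus`), and the fixture directory names are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// ASSUMPTION: hypothetical manifest file name; the real corpus layout may differ.
const manifestName = "manifest.json"

// requireManifest returns an error citing the path when dir has no manifest.
func requireManifest(dir string) error {
	if _, err := os.Stat(filepath.Join(dir, manifestName)); err != nil {
		return fmt.Errorf("fixture %s: missing %s: %w", dir, manifestName, err)
	}
	return nil
}

// loadFixtures treats "_"-prefixed child directories as groups whose
// subdirectories are fixtures; any other child directory is itself a fixture.
func loadFixtures(root string) ([]string, error) {
	entries, err := os.ReadDir(root)
	if err != nil {
		return nil, err
	}
	var fixtures []string
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		dir := filepath.Join(root, e.Name())
		candidates := []string{dir}
		if strings.HasPrefix(e.Name(), "_") {
			children, err := os.ReadDir(dir)
			if err != nil {
				return nil, err
			}
			candidates = nil
			for _, c := range children {
				if c.IsDir() { // loose files such as README.md are skipped
					candidates = append(candidates, filepath.Join(dir, c.Name()))
				}
			}
		}
		for _, c := range candidates {
			if err := requireManifest(c); err != nil {
				return nil, err
			}
			fixtures = append(fixtures, c)
		}
	}
	return fixtures, nil
}

// countDemoCorpus builds a throwaway corpus (canonical + 3 negatives + an
// empty _real_world/ holding only a README) and returns the fixture count.
// With partial=true it first drops a manifest-less subdir into _real_world/.
func countDemoCorpus(partial bool) (int, error) {
	root, err := os.MkdirTemp("", "replay-corpus")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(root)
	for _, f := range []string{"canonical", "_negative/a", "_negative/b", "_negative/c"} {
		dir := filepath.Join(root, filepath.FromSlash(f))
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return 0, err
		}
		if err := os.WriteFile(filepath.Join(dir, manifestName), []byte("{}"), 0o644); err != nil {
			return 0, err
		}
	}
	realWorld := filepath.Join(root, "_real_world")
	if err := os.MkdirAll(realWorld, 0o755); err != nil {
		return 0, err
	}
	if err := os.WriteFile(filepath.Join(realWorld, "README.md"), []byte("template\n"), 0o644); err != nil {
		return 0, err
	}
	if partial {
		if err := os.Mkdir(filepath.Join(realWorld, "capture-1"), 0o755); err != nil {
			return 0, err
		}
	}
	fixtures, err := loadFixtures(root)
	return len(fixtures), err
}

func main() {
	n, err := countDemoCorpus(false)
	fmt.Println(n, err)
	_, err = countDemoCorpus(true)
	fmt.Println(err != nil)
}
```

Because every subdirectory of a group is loaded, a template directory would be picked up as a fixture in this sketch too, which is consistent with the commit's choice to keep the template inline in the README.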
5 tasks
trilamsr added a commit that referenced this pull request May 21, 2026
#147)

## Summary

Single-PR bundle of 10 low-risk follow-up actions. Each row was anchor-verified on `main` before editing; no production behavior change. Diff is 10 files, +91/-37, dominated by markdown.

**Breakdown:**

- 3 strikes (anchor shipped, row was stale)
- 1 test-only struct add (k8sevents `NodeWatchErrors`)
- 1 bash test add (`no-autoupdate-check` hit-line format lock)
- 5 doc-only clarifications / partial-ship / audit notes

## Items applied

### Strikes: anchor on `main` confirms shipped

1. **M3.md L188 `make doc-check`.** `scripts/doc-check.sh` header reads "verify every Test\*/Fuzz\*/Benchmark\* name referenced in docs"; wired into `make doc-check` and `make ci`.
2. **M8.md L103 `docs/HARDWARE-TESTING.md` libdcgm + nv-hostengine setup.** File exists (28 hits for libdcgm/dcgm/nv-hostengine); covers Ubuntu 22.04 driver / `libdcgm-dev` / `nv-hostengine` provisioning, x86_64 + aarch64-SBSA build matrix, and the `//go:build dcgm,hardware` distinction. Doc shipped ahead of cgo client to unblock GPU-less contributors.
3. **M19.md L18 `nodeWatchErrCount` not in SnapshotCounters.** Closed by item 4 below, which adds the `NodeWatchErrors` field symmetrically.

### Test-only struct add

4. **components/receivers/k8sevents/export_test.go.** Added `NodeWatchErrors int64` field on `CountersForTest`; `SnapshotCounters` now reads `rr.nodeWatchErrCount.Load()` symmetrically with `rr.watchErrCount.Load()`. Both call sites are keyed-init inside the same file; no external positional callers to break (grep confirmed: only 2 hits, both in `export_test.go`).

### Bash test add (M23 grep-gate format lock)

5. **scripts/no-autoupdate-check_test.sh "hit-line-format-stable".** New assertion that runs the gate against the hyphenated-go-update fixture, captures stdout (existing tests discard it), and asserts at least one line matches `^[^:]+:[0-9]+:`. Locks the parseable hit-line shape *before* the first automation consumer (CI summary, dashboard, Slack notifier) wires up, so a cosmetic tweak to the gate's message body now fails CI instead of silently breaking downstream parsers. M23.md row struck.

### Doc-only clarifications

6. **M15.md L185 falsifying-check backfill.** Anchored the "/var/lib/tracecore/ subdir governance" row's grep-falsifying-check to RFC-0010 §Proposal: `docs/rfcs/0010-containerstdout-receiver-scope.md` L177/L217/L274/L393/L407 already carry the convention ("M15 owns `/var/lib/tracecore/container_stdout/`. Future siblings reserve their own subdirectories."). Row marked `[x]`.
7. **M15.md L192 + RFC-0010 §Pod-attribution forward-pointer.** Appended one-line cross-reference at RFC-0010 L158 → `docs/followups/M15.md` "Cross-receiver rank-label reconciliation" so the deferred audit trail is discoverable from the RFC. Row marked `[x]`.
8. **M8.md L30 `tracecore debug dump` partial-ship.** `cmd/tracecore/debug.go::runDebugDump` already writes version + revision + branch + build date + Go runtime stats + registered components + redacted config to `tracecore debug dump > diagnostic.txt`. Remaining gap is "last N samples", which needs a receiver-side ring buffer (M2 carry-forward). Row kept open with partial-ship line + remaining-trigger.
9. **M3.md L153 SUPPLY-CHAIN-IDENTITY.md scope clarification.** Added one sentence noting the consolidation is a copy-and-deduplicate pass against existing `release.yml` comment blocks (cosign-sign-blob, gh-attestation-sign), not net-new authoring, so the next reader sees the actual scope of work rather than a misleading "30-min write" estimate that implies green-field.
10. **otlphttp.md L182 workflow paths audit + M14.md L88 test pointer.**
    - **otlphttp**: inlined audit findings (2026-05-20). `chart.yml` and `install-bench.yml` are substrate-aware (include `cmd/tracecore/**`, `internal/**`); `kernelevents-integration.yml` and `pyspy-integration.yml` cover only `components/receivers/<name>/**` + `internal/runtime/lifecycle/**`, so a `cmd/tracecore` factory wiring or `internal/pipeline` contract change can land without re-running these integration jobs. `chaos.yml` covers `tools/failure-inject/**` + `internal/synthesis/**` only (indirect coupling, acceptable). Remaining: 6-line YAML edit per integration workflow.
    - **M14**: added inline pointer from the multi-retry slow-write fixture row to the existing single-retry baseline at `components/receivers/kineto/shutdown_test.go::TestIngest_RetryOnTruncated` so the future author has the test-shape anchor.

## Files changed

| File | LOC | Kind |
|---|---|---|
| `components/receivers/k8sevents/export_test.go` | +2 | test struct field |
| `scripts/no-autoupdate-check_test.sh` | +20 | bash test add |
| `docs/rfcs/0010-containerstdout-receiver-scope.md` | +1/-1 | inline cross-ref |
| `docs/followups/M3.md` | +9/-5 | strike + scope clarification |
| `docs/followups/M8.md` | +16/-5 | strike + partial-ship |
| `docs/followups/M14.md` | +1/-1 | test pointer |
| `docs/followups/M15.md` | +15/-8 | 2 strikes |
| `docs/followups/M19.md` | +5/-9 | strike (anchored to test add) |
| `docs/followups/M23.md` | +9/-7 | strike |
| `docs/followups/otlphttp.md` | +13/-1 | audit findings inline |

## Test plan

- [x] `go test ./components/receivers/k8sevents/...` green.
- [x] `bash scripts/no-autoupdate-check_test.sh` 10/10 assertions pass (added "hit-line-format-stable" as the 10th).
- [x] `bash scripts/doc-check.sh` green (437 markdown links resolve, em-dash + en-dash diff gate clean, comment-noise diff gate clean).
- [x] Pre-commit hook ran full `make check` + `make ci` (all package tests cached/passing).
- [ ] CI green on this branch.

## Release notes

```release-notes
NONE
```

## Sequencing

Builds on `main` after PRs #132 (shard split), #133 (RUNBOOK + chart-appversion), #142 (opportunistic curation), #134 (chaos.yml row), #143 (cross-shard audit). Independent of currently-open PRs #144 (m6 integration recipes) and #145 (m3 GHCR image publish).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
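The hit-line format lock from item 5 of the commit message above can be illustrated in Go. The repository's assertion is a bash test; this sketch only reuses the quoted pattern `^[^:]+:[0-9]+:`, and the helper name `hasParseableHit` and the sample gate output are invented for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// hitLine is the pattern quoted in the commit message: "<path>:<line>:" at
// the start of a line.
var hitLine = regexp.MustCompile(`^[^:]+:[0-9]+:`)

// hasParseableHit reports whether at least one line of stdout matches hitLine.
func hasParseableHit(stdout string) bool {
	sc := bufio.NewScanner(strings.NewReader(stdout))
	for sc.Scan() {
		if hitLine.MatchString(sc.Text()) {
			return true
		}
	}
	return false
}

func main() {
	// ASSUMPTION: both sample outputs are invented for illustration.
	parseable := "scripts/fixture.sh:12: go-update call found\n1 hit\n"
	reworded := "found go-update call in scripts/fixture.sh (line 12)\n"

	fmt.Println(hasParseableHit(parseable)) // true
	fmt.Println(hasParseableHit(reworded))  // false
}
```

A downstream parser that splits on the first two colons keeps working for as long as this check passes, which is the property the format lock protects.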
trilamsr pushed a commit that referenced this pull request May 21, 2026
Sync feature branch with main per the merge-not-rebase policy documented in CONTRIBUTING.md (commit ddf86f7). Main moved 6 PRs ahead during this branch's lifetime:

- PR #143 (followups sweep)
- PR #134 (chaos.yml pattern-pod-evicted)
- PR #142 (follow-up curation)
- PR #144 (M6 integration recipes)
- PR #146 (kineto MaxEvents stub)
- PR #147 (followups bundle)

Conflicts expected in CHANGELOG.md and docs/followups/M3.md (both additive).

# Conflicts:
#	CHANGELOG.md
Summary
Closes two M19 carry-forwards and trims a stale third:
- `chaos.yml` `pattern-pod-evicted` job (M19 carry-forward #4). New job runs the hermetic `go test ./internal/synthesis/... -race` replay-corpus path. The pod-evict CLI half is covered by a `pod-evict --reason=DiskPressure` SHA pin in `tools/failure-inject/testdata/golden.sha256`; the existing `harness-determinism` Golden-SHA-Pin loop now verifies pod-evict alongside xid. Job timeout set to 5min (observed runtime 1m48s; ~3× headroom).
- `_real_world/` replay-corpus slot (M19 carry-forward #1). `internal/synthesis/replay/pod_evicted/_real_world/` is the contribution slot for anonymized production captures. The replay loader's underscore-group convention (per `_negative/`) means new fixtures plug in without code changes; `README.md` documents the contribution checklist and ships an inline fixture template (manifest / events / node_conditions / golden JSON shapes). `TestPodEvictedReplay_RealWorldGroupLoaderSafe` mutation-verifies the empty group walks safely. Empty group today; populating it is the remaining piece of carry-forward #1.
- […] `real_world_*/` → `_real_world/` rename and its reason (loader-convention parity with `_negative/`).
- […] `paths:` filter gains `internal/synthesis/**` so detector / replay-corpus changes retrigger Chaos on PRs.

Carry-forwards still open: detection-latency ≤5s p95 (needs live cluster); `--filler=/tmp` real-kubelet eviction (needs `--allow-cluster-write` kube-apiserver seam); real-world fixture contributions.

Root-cause framing
The two closed carry-forwards were missing-infrastructure items, not workarounds:

- […] `_negative/`). A second slot for real-world captures was a directory-and-README-and-loader-test away; no code change to the loader itself was needed.

Explicitly skipped: per-job path-filter trim of `internal/synthesis/**` from `harness-determinism`. Would need `dorny/paths-filter` (added dep) or workflow split (scope creep). The ~2min of redundant CI serves two distinct gates (CLI bytes vs detector verdict) and is justified per PRINCIPLES §13 operator-first.

No new workarounds.
Release notes
Test plan
- `go test -race -count=1 ./internal/synthesis/... ./tools/failure-inject/...` passes locally.
- `make doc-check` clean (em-dash diff-gate, banned-phrase lint, alert-check, release-doc-parity).
- […] `failure-inject --seed=1 pod-evict --reason=DiskPressure` produces a different SHA than the pinned one; both `pattern-pod-evicted` and the `harness-determinism` Golden-SHA-Pin loop fail closed on argv drift.
- […] `_real_world/` makes `TestPodEvictedReplay_RealWorldGroupLoaderSafe` fail with the offending-path citation; restored.
- […] `_real_world/` group present and empty.
- […] `pattern-pod-evicted (M19)` […] 1m48s).