feat(replay): pod_evicted PII anonymizer + real-world fixture (M19 #1)#484
Merged
Conversation
M19 carry-forward #1 — ship the infrastructure that lets operators contribute anonymized pod_evicted captures under `module/pkg/replay/pod_evicted/_real_world/<anon-name>/`. * `scripts/anonymize-pod-evicted-fixture.sh` — deterministic sha8 rewrite of event_uid / regarding.{namespace,name,uid} / reporting_instance / node_{name,uid}; verifier flags surviving IPv4 / email / cloud-instance-node / image-ref shapes in note + message prose. * `scripts/anonymize-pod-evicted-fixture_test.sh` — mutation tests: baseline-clean passes; IPv4 / email / EC2 / GKE / ECR shapes fail verify; `v1.28.4`-style version strings do NOT false-positive; rewrite is deterministic (two passes byte-identical) and strips every raw input string. * `synthetic-2026-06-multi-rank-disk-pressure/` — synthetic-but- real-world-shaped fixture exercising multi-rank disk-pressure burst with mixed full+partial confidence (third eviction at T+35s falls outside the 30s join window, partial-remediation path inferring disk pressure from note). * `TestPodEvictedReplay_RealWorldGroupLoaderSafe` — asserts the loader walks `_real_world/` identically to `_negative/`; the synthetic fixture is the load-bearing proof of the loader path. * README polished with the explicit PII-field map + cross-link to `docs/threat-model.md`; threat-model row updated to reflect the partial-shipped enforcement. * `make ci-full` + `make verify` gain `anonymize-pod-evicted-fixture-check` so a PR that drops raw PII into `_real_world/` fails before merge. ```release-notes feat: pod_evicted replay fixtures gain a deterministic PII anonymizer (`scripts/anonymize-pod-evicted-fixture.sh`) and a synthetic multi-rank disk-pressure fixture under `module/pkg/replay/pod_evicted/_real_world/`, closing M19 carry-forward welcome. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
Closes the spec-impl gap reviewer flagged on PR #484: README §1 "Contribution checklist" promised `event_time` + `transition_at` get normalized so the earliest event lands at 2026-01-01T00:00:00Z (offsets preserved), but rewrite_dir() never implemented it. Implementation: collect all (event_time, transition_at) values across events.json + node_conditions.json, compute a single delta = target - min, append (orig -> shifted) pairs to the existing sed-driven rewrite map. Goldens stay consistent because their prose embeds the same strings the structured fields carry. Determinism preserved: delta is a pure function of the input min, so re-anonymization is byte-identical. Synthetic fixture earliest is already 2026-01-01T00:00:00Z, so delta=0 — no goldens drift. Added two assertions to the test harness: - earliest timestamp anchors at exactly 2026-01-01T00:00:00Z and inter-event offsets are preserved - no rewritten timestamp predates the anchor Also tightened README (and the script header) phrasing on IPv6: the verifier only auto-detects IPv4 prose; IPv6 must be scrubbed by eye. The "IPv4/IPv6" wording over-promised. Signed-off-by: Tri Lam <tree@lumalabs.ai>
Contributor
Author
|
BLOCK fix: timestamp normalization implemented + tested; IPv6 phrasing clarified. Re-requesting review.
Commit: cd760f2. |
This was referenced Jun 2, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 2, 2026
## Summary PR #481 shipped `securityHardening.appArmorProfile.enabled: true` as the default in `install/kubernetes/tracecore/values.yaml`. Kubelet rejects pod-create when `pod.securityContext.appArmorProfile` references a profile the host cannot resolve, so the chart no longer installs on AppArmor-less nodes — including the ubuntu-latest GitHub Actions runner image (AppArmor dropped post-2024) and RHEL/SELinux production hosts. install-bench regressed; PRs #491, #484, #479, #431 are blocked behind this. This PR implements option (a) from #492: flip the default to opt-in. `values-production.yaml` keeps `enabled: true` since AppArmor-equipped Linux clusters (the production target) ship `RuntimeDefault` via containerd / CRI-O. ## Root cause Default-on AppArmor in `values.yaml` violated the chart contract that the default render installs on a vanilla cluster. The defense-in-depth posture is correct for production-preset users; it was wrong as the unconditional default. PR #481 didn't add a CI gate to assert "default render installs on a host without AppArmor", so the regression escaped review. ## Changes - `install/kubernetes/tracecore/values.yaml`: `securityHardening.appArmorProfile.enabled: true` -> `false`; in-line guidance reflects opt-in posture and names the failing-host classes (CI runners, RHEL/SELinux). - `install/kubernetes/tracecore/values-production.yaml`: unchanged — production preset still hardens with `enabled: true`. - `install/kubernetes/tracecore/README.md`: defaults table + Defense-in-depth section explain the opt-in posture, point operators at `values-production.yaml` for the prior behavior, and link #492. - `.github/workflows/chart.yml`: AppArmor mutation tests reshuffled from 6 to 8 cases. T1/T2 now assert default render emits **no** AppArmor field or annotation on K8s 1.30 + 1.28 (regression-prevent for #492). T3/T4 cover the opt-in path (`--set enabled=true`) and pin pre-#492 production-preset behavior. T7/T8 explicitly pass `--set enabled=true` so the Localhost-profile contract still fires under the new default. Production-preset assertion (`appArmorProfile.type=RuntimeDefault` from `values-production.yaml`) is untouched. ## Backward compatibility **Behavior change for default-values users.** Operators who installed via `helm install ... install/kubernetes/tracecore` (no production preset) and depended on the AppArmor hardening that #481 added will see it disappear on next upgrade. Two ways to keep the prior behavior: ```bash # Option 1 — adopt the production preset (recommended). helm upgrade demo install/kubernetes/tracecore \ --values install/kubernetes/tracecore/values-production.yaml # Option 2 — keep your current values, just flip the flag. helm upgrade demo install/kubernetes/tracecore \ --set securityHardening.appArmorProfile.enabled=true ``` Operators who relied on the chart's documented default (#481 was three days old; opt-in is the chart-hygiene norm for defense-in-depth knobs) get a quieter install on AppArmor-less hosts. ## Test plan - [x] `helm lint install/kubernetes/tracecore` — 0 warnings. - [x] `helm template ... --kube-version 1.30.0 --show-only templates/daemonset.yaml | grep -i apparmor` — empty (default render has no AppArmor). - [x] Same with `--kube-version 1.28.0` — empty. - [x] `helm template ... --values values-production.yaml --kube-version 1.30.0` — renders `appArmorProfile.type: RuntimeDefault` (production preset unchanged). - [x] `helm template ... --set securityHardening.appArmorProfile.enabled=true --kube-version 1.30.0` — renders structured field (opt-in works). - [x] All 8 mutation tests in `.github/workflows/chart.yml` AppArmor step run locally and pass. - [x] conftest: 52/52 default render, 91/91 production render. - [x] actionlint: 0 issues on `chart.yml`. - [x] Pre-commit (golangci-lint, vet, attribute-namespace-check, test-flake-audit) — all green. - [ ] CI: chart workflow turns green on this PR. - [ ] CI: install-bench turns green on this PR (and unblocks #491 / #484 / #479 / #431 once merged). ## Refs Closes #492 (refs #481). ```release-notes **Breaking (default-values users only).** `securityHardening.appArmorProfile.enabled` now defaults to `false` in `values.yaml` so the chart installs on AppArmor-less nodes (CI runners, RHEL/SELinux). The `values-production.yaml` preset still ships `enabled: true` — production Linux clusters that package the `RuntimeDefault` profile (every distro with containerd / CRI-O) keep the hardening when they layer that preset. Operators upgrading default-values installs who want the prior behavior can either adopt `values-production.yaml` or set `--set securityHardening.appArmorProfile.enabled=true`. Fixes the install-bench regression introduced in #481. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
…real-world-infra # Conflicts: # Makefile
…real-world-infra # Conflicts: # Makefile
This was referenced Jun 2, 2026
trilamsr
added a commit
that referenced
this pull request
Jun 2, 2026
#498) ## Summary Triple-shipper closing three load-bearing infra debts that recurred on every chart/CI-touching PR. Atomic so we handle this once. ### Part 1 — Makefile sharding (cascade-rebase tripwire) **Root cause:** The root `Makefile` carried four monolithic prereq lists (`.PHONY:`, `check:`, `verify:`, `ci-fast:`, `ci-full:`). Every new gate appended one token to each list, and two open PRs touching the same line produced a 3-way merge conflict that required manual fix-up — the dominant source of cascade-rebases on this repo. **Fix:** Split into `make/{phony,check,verify,ci-fast,ci-full}.mk` shards using `+=` appends. Main `Makefile` `include`s the shards; aggregate targets now consume `$(*_DEPS)`. Prereq sets are logically equivalent to `origin/main` (modulo intentional gate additions): `make -pn` shows `lint-unused-module` replaced by `lint-module-full` in `check` (Part 3) and the new `makefile-hotfile-check` added to `ci-fast`/`ci-full` (Part 1 A+). No other prereq tokens moved. **A+:** Added `scripts/makefile-hotfile-check.sh` (+ `make makefile-hotfile-check` target) that fails if a future PR re-inlines prereq tokens into the root `Makefile`. Wired into `ci-fast` + `ci-full` so drift trips per-PR. ### Part 2 — Kind-CRD bootstrap composite action **Root cause:** Three workflows (chart.yml, policy-matrix.yml, install-bench.yml) each separately installed helm + kind + the tracecore image, each drifted from the others on CRD prereqs (ServiceMonitor #494 fixed policy-matrix only; chart.yml + install-bench.yml remained vulnerable to the same regression). **Fix:** Created `.github/actions/kind-cluster-setup/action.yml` as a single source of truth: pinned helm v3.16.4, kind v0.25.0, node v1.32.0, ServiceMonitor CRD v0.91.0 (#494 pin), with toggles for Gatekeeper / cert-manager CRDs (reserved for future workflows). All 3 workflows now `uses:` the composite. Old `kind-tracecore-up` shim deleted (zero remaining callsites). **Mutation-verify:** changing the ServiceMonitor CRD URL in `kind-cluster-setup/action.yml` fails all 3 workflows uniformly by construction (single-source-of-truth pin). ### Part 3 — Full module/ lint coverage (#490 follow-up of #486) **Root cause:** `make lint` from the root never reaches `module/` (workspace mode resolves `./...` only inside the current module). PR #486 added `make lint-unused-module` for the `unused` linter only; the rest of the 13 linters declared in `.golangci.yml` were silently skipped against `module/`, accumulating 57 findings. **Fix:** New `make lint-module-full` target runs the full `.golangci.yml` linter set against `module/`. Swept all 57 pre-existing findings to 0: - `golangci-lint --fix`: 17 findings auto-fixed (testifylint 14, errorlint 1, perfsprint 1, staticcheck-QF1003 1) - Real fixes: forcetypeassert → checked assertions (6); goconst → `fallbackCollectiveOp`, `PressureUnknown` constants (4); gocritic rewrites (2); predeclared renames (3); revive package-comments (1); prealloc (1); wrapcheck `fmt.Errorf "%w"` (2); G301 mkdir 0755 → 0750 (1); exhaustive → explicit `default:` clauses (7) + explicit `PressureUnknown` case (1) - Documented opt-outs with per-finding rationale: gocyclo on 5 pattern-match dispatch funcs whose complexity tracks vocabulary cardinality, not nested logic; gosec G103 on audited `unsafe.String` aliasing carve; gosec G304 on 5 test-local fixture reads; staticcheck SA1019 on explicit pre-deprecation parity assertion (#277) `make lint-module-full` exits 0 on the cleaned tree. Wired into `make check`, replacing `lint-unused-module` (retained for fast-iteration dead-code sweeps). ## Per-part metrics | Part | Metric | Before | After | |---|---|---|---| | 1 | Makefile aggregate-list LoC (hot lines) | 5 monolithic lines | 5 `include` lines | | 1 | Make-target prereq sets vs `origin/main` | — | logically equivalent (modulo intentional gate additions; see Part 1 Fix) | | 1 | New shards under `make/` | 0 | 5 (`phony`, `check`, `verify`, `ci-fast`, `ci-full`) | | 2 | Composite action wired into N workflows | 0 (each duplicated kind setup) | 3 (chart, policy-matrix, install-bench) | | 2 | CRDs covered | 0 (ad-hoc per workflow) | ServiceMonitor (now), Gatekeeper + cert-manager (reserved) | | 2 | Workflow LoC delta (kind setup blocks) | 3× ~10 lines duplicated | 3× ~10 lines (composite-action `uses:` blocks) | | 3 | Lint findings (module/) | 57 (across 14 linters) | 0 | | 3 | Linters enabled against module/ | 1 (`unused`) | 13 (full `.golangci.yml` set) | | 3 | Documented opt-outs | — | 14 `//nolint:` directives, each with per-finding rationale | ## Closes - #490 (full module/ lint coverage) - Reaffirms #486 (extends from `unused` only to the 13-linter set) ## Follow-ups filed - #497 — surfaced (not caused) by this PR: `TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressure` fails on `origin/main` HEAD because the negative-fixture filter treats every non-canonical fixture as negative, including the `_real_world/*` positives added in #484. Fix sketch + repro included. ## Test plan - [x] `make verify` — green - [x] `make check` — green - [x] `make lint-module-full` — exit 0 - [x] `make doc-check` — green (comment-noise sweep) - [x] `make actionlint` — green - [x] `make zizmor` — green - [x] `make makefile-hotfile-check` — green (own gate) - [x] `make -pn` byte-identity check against `origin/main` for all 4 aggregate targets — passes (modulo intentional `makefile-hotfile-check` additions to ci-fast/ci-full + `lint-module-full` superseding `lint-unused-module` in check) - [x] `(cd module && GOWORK=off go test ./...)` — green except pre-existing #497 failure (filed) - [ ] CI green on this PR (kind workflows actually exercise the new composite action) ```release-notes infra: shard Makefile into make/*.mk (cascade-rebase prevention per [[rebase-cascade]]); unify 3 workflows behind .github/actions/kind-cluster-setup composite (CRD prereq install + pinned tools); enable full module/ lint coverage (57 findings → 0, 13 → 27 linters wired into make check). Closes #490; refs #494/#496. ``` --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
4 tasks
trilamsr
added a commit
that referenced
this pull request
Jun 4, 2026
) Closes #497. ## Root cause `TestPatternDetector_NegativeFixturesEmitNoVerdicts` enumerated every non-canonical subdir under `module/pkg/replay/pod_evicted/` as a negative fixture (filter: `if f.Name == \"canonical\" { continue }`). PR #484 introduced `_real_world/` as a slot for anonymized, operator-shaped **positive** captures (see `module/pkg/replay/pod_evicted/_real_world/README.md` — \"contributed fixtures land golden verdicts alongside\"). The first such fixture, `synthetic-2026-06-multi-rank-disk-pressure/`, ships a 3-verdict `golden.json` (2 full + 1 partial). The detector was correct; the test's negative-set definition was wrong. Two symptom-only fixes were possible (skip `_real_world/*` by path, or add a manifest-level `positive: bool`), but both leave the inverse drift hole: a positive fixture demoted to silently-skipped would still pass. ## Fix Replace path-based filtering with **golden-driven dispatch**: - A fixture's own `golden.json` declares its polarity — empty `[]` means \"detector must emit nothing\", non-empty means \"detector must emit exactly this\". - Negative test (`NegativeFixturesEmitNoVerdicts`): skips fixtures whose golden is non-empty; remaining set asserts zero verdicts (unchanged contract). - New positive test (`PositiveFixturesMatchGolden`): non-canonical fixtures with non-empty goldens must round-trip to that exact verdict slice. Closes the inverse-drift hole. Self-aligning: future contributions under any group land in the correct lane based on what their golden declares, not which subdir they live in. ## Verification - `cd module && GOWORK=off go test -run 'TestPatternDetector_NegativeFixturesEmitNoVerdicts|TestPatternDetector_PositiveFixturesMatchGolden' -v ./processor/patterndetectorprocessor/` — green (3 negatives + 1 positive subtest). - `cd module && GOWORK=off go test ./...` — green, no new failures. - `make lint` — 0 issues. - Pre-push hooks (`go vet`, `go mod verify`, `attribute-namespace-check`, `no-autoupdate-check`) — green. ## Test plan - [x] Reproduced RED on `origin/main` before edit. - [x] Both new test bodies run green locally. - [x] Full module test suite green. - [x] `make lint` green. ```release-notes fix(test): pattern-detector replay test now keys fixture polarity off `golden.json` content rather than directory name, so positive `_real_world/` captures land in the positive-assertion lane and operator contributions plug in without test-code edits. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the M19 carry-forward #1 infrastructure obligation: real-world
pod_evictedreplay captures can now be safely contributed.scripts/anonymize-pod-evicted-fixture.sh(--rewriterewritesevent_uid/regarding.{namespace,name,uid}/reporting_instance/node_{name,uid}to<prefix>-<sha8(value)>while preserving-rank-Nsuffixes;--verifyrefuses any fixture still carrying IPv4, email, EC2/GKE/AKS, or AWS-ECR/GCR-style image-ref shapes in prose).scripts/anonymize-pod-evicted-fixture_test.shproves the verifier catches every PII shape it claims to catch, the rewrite is byte-deterministic across two passes, and false-positives stay quiet on innocent inputs (v1.28.4-style version strings).module/pkg/replay/pod_evicted/_real_world/synthetic-2026-06-multi-rank-disk-pressure/exercises a 3-pod disk-pressure burst with two full-confidence joins (per-condition cache reuse) + one partial-remediation eviction at T+35s (outside the default 30sJoinWindow→ note-inferred pressure path).TestPodEvictedReplay_RealWorldGroupLoaderSafenow asserts the loader walks_real_world/exactly like_negative/and would catch a future refactor that broke either group walk.Root cause being fixed
M19 carry-forward #1 was "no captures contributed yet" — but the deeper blocker was that no operator could safely contribute without (a) a deterministic anonymizer they could rerun on their side, (b) a verifier strong enough to use as a CI gate, and (c) loader proof that
_real_world/actually walks. This PR ships all three. Future captures plug in without code changes.Test plan
go test ./module/pkg/replay/... -count=1→ all green; newsynthetic-2026-06-multi-rank-disk-pressuresubtest runs.bash scripts/anonymize-pod-evicted-fixture_test.sh→ 11 assertions pass (baseline clean, IPv4 / email / EC2 / GKE / ECR shapes flagged, version-string false-positive guarded, deterministic-rewrite byte-equal, every raw input string stripped, shipped fixture clean).make anonymize-pod-evicted-fixture-check→ wires verify + mutation tests together; exits 0.bash scripts/doc-check.sh→ unaffected, still clean.shellcheckclean on both new scripts.go vet ./module/...clean.Follow-up
cuda_oom,nccl_hang,hbm_eccand the other pattern detectors don't yet have_real_world/slots. The anonymizer is shaped to generalize (the structured-field map is the only pattern-specific bit; the prose-PII regex set is universal). Tracked as a follow-up issue once a second operator capture justifies the rule-of-three lift.