feat(replay): pod_evicted PII anonymizer + real-world fixture (M19 #1) by trilamsr · Pull Request #484 · TraceCoreAI/tracecore

trilamsr · 2026-06-02T04:58:26Z

Summary

Closes the M19 carry-forward #1 infrastructure obligation: real-world pod_evicted replay captures can now be safely contributed.

Deterministic PII anonymizer: scripts/anonymize-pod-evicted-fixture.sh (--rewrite rewrites event_uid / regarding.{namespace,name,uid} / reporting_instance / node_{name,uid} to <prefix>-<sha8(value)> while preserving -rank-N suffixes; --verify refuses any fixture still carrying IPv4, email, EC2/GKE/AKS, or AWS-ECR/GCR-style image-ref shapes in prose).
Mutation tests: scripts/anonymize-pod-evicted-fixture_test.sh proves the verifier catches every PII shape it claims to catch, the rewrite is byte-deterministic across two passes, and false-positives stay quiet on innocent inputs (v1.28.4-style version strings).
Synthetic real-world-shaped fixture: module/pkg/replay/pod_evicted/_real_world/synthetic-2026-06-multi-rank-disk-pressure/ exercises a 3-pod disk-pressure burst with two full-confidence joins (per-condition cache reuse) + one partial-remediation eviction at T+35s (outside the default 30s JoinWindow → note-inferred pressure path).
Loader-symmetry test: TestPodEvictedReplay_RealWorldGroupLoaderSafe now asserts the loader walks _real_world/ exactly like _negative/ and would catch a future refactor that broke either group walk.
Threat-model + MILESTONES updated: the §7 audit row references the anonymizer; the M19 carry-forward bullet reflects what's shipped vs still pending (operator captures).

Root cause being fixed

M19 carry-forward #1 was "no captures contributed yet" — but the deeper blocker was that no operator could safely contribute without (a) a deterministic anonymizer they could rerun on their side, (b) a verifier strong enough to use as a CI gate, and (c) loader proof that _real_world/ actually walks. This PR ships all three. Future captures plug in without code changes.

Test plan

go test ./module/pkg/replay/... -count=1 → all green; new synthetic-2026-06-multi-rank-disk-pressure subtest runs.
bash scripts/anonymize-pod-evicted-fixture_test.sh → 11 assertions pass (baseline clean, IPv4 / email / EC2 / GKE / ECR shapes flagged, version-string false-positive guarded, deterministic-rewrite byte-equal, every raw input string stripped, shipped fixture clean).
make anonymize-pod-evicted-fixture-check → wires verify + mutation tests together; exits 0.
bash scripts/doc-check.sh → unaffected, still clean.
shellcheck clean on both new scripts.
go vet ./module/... clean.

Follow-up

cuda_oom, nccl_hang, hbm_ecc and the other pattern detectors don't yet have _real_world/ slots. The anonymizer is shaped to generalize (the structured-field map is the only pattern-specific bit; the prose-PII regex set is universal). Tracked as a follow-up issue once a second operator capture justifies the rule-of-three lift.

feat: pod_evicted replay fixtures gain a deterministic PII anonymizer
(`scripts/anonymize-pod-evicted-fixture.sh`) and a synthetic
multi-rank disk-pressure fixture under
`module/pkg/replay/pod_evicted/_real_world/`, closing M19
carry-forward #1's infrastructure obligation. Operator-contributed
captures still welcome.

M19 carry-forward #1 — ship the infrastructure that lets operators contribute anonymized pod_evicted captures under `module/pkg/replay/pod_evicted/_real_world/<anon-name>/`. * `scripts/anonymize-pod-evicted-fixture.sh` — deterministic sha8 rewrite of event_uid / regarding.{namespace,name,uid} / reporting_instance / node_{name,uid}; verifier flags surviving IPv4 / email / cloud-instance-node / image-ref shapes in note + message prose. * `scripts/anonymize-pod-evicted-fixture_test.sh` — mutation tests: baseline-clean passes; IPv4 / email / EC2 / GKE / ECR shapes fail verify; `v1.28.4`-style version strings do NOT false-positive; rewrite is deterministic (two passes byte-identical) and strips every raw input string. * `synthetic-2026-06-multi-rank-disk-pressure/` — synthetic-but- real-world-shaped fixture exercising multi-rank disk-pressure burst with mixed full+partial confidence (third eviction at T+35s falls outside the 30s join window, partial-remediation path inferring disk pressure from note). * `TestPodEvictedReplay_RealWorldGroupLoaderSafe` — asserts the loader walks `_real_world/` identically to `_negative/`; the synthetic fixture is the load-bearing proof of the loader path. * README polished with the explicit PII-field map + cross-link to `docs/threat-model.md`; threat-model row updated to reflect the partial-shipped enforcement. * `make ci-full` + `make verify` gain `anonymize-pod-evicted-fixture-check` so a PR that drops raw PII into `_real_world/` fails before merge. ```release-notes feat: pod_evicted replay fixtures gain a deterministic PII anonymizer (`scripts/anonymize-pod-evicted-fixture.sh`) and a synthetic multi-rank disk-pressure fixture under `module/pkg/replay/pod_evicted/_real_world/`, closing M19 carry-forward welcome. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>

Closes the spec-impl gap reviewer flagged on PR #484: README §1 "Contribution checklist" promised `event_time` + `transition_at` get normalized so the earliest event lands at 2026-01-01T00:00:00Z (offsets preserved), but rewrite_dir() never implemented it. Implementation: collect all (event_time, transition_at) values across events.json + node_conditions.json, compute a single delta = target - min, append (orig -> shifted) pairs to the existing sed-driven rewrite map. Goldens stay consistent because their prose embeds the same strings the structured fields carry. Determinism preserved: delta is a pure function of the input min, so re-anonymization is byte-identical. Synthetic fixture earliest is already 2026-01-01T00:00:00Z, so delta=0 — no goldens drift. Added two assertions to the test harness: - earliest timestamp anchors at exactly 2026-01-01T00:00:00Z and inter-event offsets are preserved - no rewritten timestamp predates the anchor Also tightened README (and the script header) phrasing on IPv6: the verifier only auto-detects IPv4 prose; IPv6 must be scrubbed by eye. The "IPv4/IPv6" wording over-promised. Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr · 2026-06-02T05:29:09Z

BLOCK fix: timestamp normalization implemented + tested; IPv6 phrasing clarified. Re-requesting review.

rewrite_dir() now collects every event_time + transition_at, computes a single delta = target - min, and appends (orig -> shifted) pairs into the existing sed-driven rewrite map. Determinism preserved (pure function of input min). Goldens stay consistent because their prose embeds the structured strings.
Two new assertions in scripts/anonymize-pod-evicted-fixture_test.sh: anchor-at-2026-01-01T00:00:00Z + inter-event-offsets-preserved, and no-pre-anchor-leak.
README + script header tightened: verifier only auto-detects IPv4 prose; IPv6 scrubbed by eye.
Verification: bash scripts/anonymize-pod-evicted-fixture_test.sh 12/12, make anonymize-pod-evicted-fixture-check exit 0, go test -race -count=1 ./module/pkg/replay/... ok.

Commit: cd760f2.

## Summary PR #481 shipped `securityHardening.appArmorProfile.enabled: true` as the default in `install/kubernetes/tracecore/values.yaml`. Kubelet rejects pod-create when `pod.securityContext.appArmorProfile` references a profile the host cannot resolve, so the chart no longer installs on AppArmor-less nodes — including the ubuntu-latest GitHub Actions runner image (AppArmor dropped post-2024) and RHEL/SELinux production hosts. install-bench regressed; PRs #491, #484, #479, #431 are blocked behind this. This PR implements option (a) from #492: flip the default to opt-in. `values-production.yaml` keeps `enabled: true` since AppArmor-equipped Linux clusters (the production target) ship `RuntimeDefault` via containerd / CRI-O. ## Root cause Default-on AppArmor in `values.yaml` violated the chart contract that the default render installs on a vanilla cluster. The defense-in-depth posture is correct for production-preset users; it was wrong as the unconditional default. PR #481 didn't add a CI gate to assert "default render installs on a host without AppArmor", so the regression escaped review. ## Changes - `install/kubernetes/tracecore/values.yaml`: `securityHardening.appArmorProfile.enabled: true` -> `false`; in-line guidance reflects opt-in posture and names the failing-host classes (CI runners, RHEL/SELinux). - `install/kubernetes/tracecore/values-production.yaml`: unchanged — production preset still hardens with `enabled: true`. - `install/kubernetes/tracecore/README.md`: defaults table + Defense-in-depth section explain the opt-in posture, point operators at `values-production.yaml` for the prior behavior, and link #492. - `.github/workflows/chart.yml`: AppArmor mutation tests reshuffled from 6 to 8 cases. T1/T2 now assert default render emits **no** AppArmor field or annotation on K8s 1.30 + 1.28 (regression-prevent for #492). T3/T4 cover the opt-in path (`--set enabled=true`) and pin pre-#492 production-preset behavior. T7/T8 explicitly pass `--set enabled=true` so the Localhost-profile contract still fires under the new default. Production-preset assertion (`appArmorProfile.type=RuntimeDefault` from `values-production.yaml`) is untouched. ## Backward compatibility **Behavior change for default-values users.** Operators who installed via `helm install ... install/kubernetes/tracecore` (no production preset) and depended on the AppArmor hardening that #481 added will see it disappear on next upgrade. Two ways to keep the prior behavior: ```bash # Option 1 — adopt the production preset (recommended). helm upgrade demo install/kubernetes/tracecore \ --values install/kubernetes/tracecore/values-production.yaml # Option 2 — keep your current values, just flip the flag. helm upgrade demo install/kubernetes/tracecore \ --set securityHardening.appArmorProfile.enabled=true ``` Operators who relied on the chart's documented default (#481 was three days old; opt-in is the chart-hygiene norm for defense-in-depth knobs) get a quieter install on AppArmor-less hosts. ## Test plan - [x] `helm lint install/kubernetes/tracecore` — 0 warnings. - [x] `helm template ... --kube-version 1.30.0 --show-only templates/daemonset.yaml | grep -i apparmor` — empty (default render has no AppArmor). - [x] Same with `--kube-version 1.28.0` — empty. - [x] `helm template ... --values values-production.yaml --kube-version 1.30.0` — renders `appArmorProfile.type: RuntimeDefault` (production preset unchanged). - [x] `helm template ... --set securityHardening.appArmorProfile.enabled=true --kube-version 1.30.0` — renders structured field (opt-in works). - [x] All 8 mutation tests in `.github/workflows/chart.yml` AppArmor step run locally and pass. - [x] conftest: 52/52 default render, 91/91 production render. - [x] actionlint: 0 issues on `chart.yml`. - [x] Pre-commit (golangci-lint, vet, attribute-namespace-check, test-flake-audit) — all green. - [ ] CI: chart workflow turns green on this PR. - [ ] CI: install-bench turns green on this PR (and unblocks #491 / #484 / #479 / #431 once merged). ## Refs Closes #492 (refs #481). ```release-notes **Breaking (default-values users only).** `securityHardening.appArmorProfile.enabled` now defaults to `false` in `values.yaml` so the chart installs on AppArmor-less nodes (CI runners, RHEL/SELinux). The `values-production.yaml` preset still ships `enabled: true` — production Linux clusters that package the `RuntimeDefault` profile (every distro with containerd / CRI-O) keep the hardening when they layer that preset. Operators upgrading default-values installs who want the prior behavior can either adopt `values-production.yaml` or set `--set securityHardening.appArmorProfile.enabled=true`. Fixes the install-bench regression introduced in #481. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>

…real-world-infra # Conflicts: # Makefile

#498) ## Summary Triple-shipper closing three load-bearing infra debts that recurred on every chart/CI-touching PR. Atomic so we handle this once. ### Part 1 — Makefile sharding (cascade-rebase tripwire) **Root cause:** The root `Makefile` carried four monolithic prereq lists (`.PHONY:`, `check:`, `verify:`, `ci-fast:`, `ci-full:`). Every new gate appended one token to each list, and two open PRs touching the same line produced a 3-way merge conflict that required manual fix-up — the dominant source of cascade-rebases on this repo. **Fix:** Split into `make/{phony,check,verify,ci-fast,ci-full}.mk` shards using `+=` appends. Main `Makefile` `include`s the shards; aggregate targets now consume `$(*_DEPS)`. Prereq sets are logically equivalent to `origin/main` (modulo intentional gate additions): `make -pn` shows `lint-unused-module` replaced by `lint-module-full` in `check` (Part 3) and the new `makefile-hotfile-check` added to `ci-fast`/`ci-full` (Part 1 A+). No other prereq tokens moved. **A+:** Added `scripts/makefile-hotfile-check.sh` (+ `make makefile-hotfile-check` target) that fails if a future PR re-inlines prereq tokens into the root `Makefile`. Wired into `ci-fast` + `ci-full` so drift trips per-PR. ### Part 2 — Kind-CRD bootstrap composite action **Root cause:** Three workflows (chart.yml, policy-matrix.yml, install-bench.yml) each separately installed helm + kind + the tracecore image, each drifted from the others on CRD prereqs (ServiceMonitor #494 fixed policy-matrix only; chart.yml + install-bench.yml remained vulnerable to the same regression). **Fix:** Created `.github/actions/kind-cluster-setup/action.yml` as a single source of truth: pinned helm v3.16.4, kind v0.25.0, node v1.32.0, ServiceMonitor CRD v0.91.0 (#494 pin), with toggles for Gatekeeper / cert-manager CRDs (reserved for future workflows). All 3 workflows now `uses:` the composite. Old `kind-tracecore-up` shim deleted (zero remaining callsites). **Mutation-verify:** changing the ServiceMonitor CRD URL in `kind-cluster-setup/action.yml` fails all 3 workflows uniformly by construction (single-source-of-truth pin). ### Part 3 — Full module/ lint coverage (#490 follow-up of #486) **Root cause:** `make lint` from the root never reaches `module/` (workspace mode resolves `./...` only inside the current module). PR #486 added `make lint-unused-module` for the `unused` linter only; the rest of the 13 linters declared in `.golangci.yml` were silently skipped against `module/`, accumulating 57 findings. **Fix:** New `make lint-module-full` target runs the full `.golangci.yml` linter set against `module/`. Swept all 57 pre-existing findings to 0: - `golangci-lint --fix`: 17 findings auto-fixed (testifylint 14, errorlint 1, perfsprint 1, staticcheck-QF1003 1) - Real fixes: forcetypeassert → checked assertions (6); goconst → `fallbackCollectiveOp`, `PressureUnknown` constants (4); gocritic rewrites (2); predeclared renames (3); revive package-comments (1); prealloc (1); wrapcheck `fmt.Errorf "%w"` (2); G301 mkdir 0755 → 0750 (1); exhaustive → explicit `default:` clauses (7) + explicit `PressureUnknown` case (1) - Documented opt-outs with per-finding rationale: gocyclo on 5 pattern-match dispatch funcs whose complexity tracks vocabulary cardinality, not nested logic; gosec G103 on audited `unsafe.String` aliasing carve; gosec G304 on 5 test-local fixture reads; staticcheck SA1019 on explicit pre-deprecation parity assertion (#277) `make lint-module-full` exits 0 on the cleaned tree. Wired into `make check`, replacing `lint-unused-module` (retained for fast-iteration dead-code sweeps). ## Per-part metrics | Part | Metric | Before | After | |---|---|---|---| | 1 | Makefile aggregate-list LoC (hot lines) | 5 monolithic lines | 5 `include` lines | | 1 | Make-target prereq sets vs `origin/main` | — | logically equivalent (modulo intentional gate additions; see Part 1 Fix) | | 1 | New shards under `make/` | 0 | 5 (`phony`, `check`, `verify`, `ci-fast`, `ci-full`) | | 2 | Composite action wired into N workflows | 0 (each duplicated kind setup) | 3 (chart, policy-matrix, install-bench) | | 2 | CRDs covered | 0 (ad-hoc per workflow) | ServiceMonitor (now), Gatekeeper + cert-manager (reserved) | | 2 | Workflow LoC delta (kind setup blocks) | 3× ~10 lines duplicated | 3× ~10 lines (composite-action `uses:` blocks) | | 3 | Lint findings (module/) | 57 (across 14 linters) | 0 | | 3 | Linters enabled against module/ | 1 (`unused`) | 13 (full `.golangci.yml` set) | | 3 | Documented opt-outs | — | 14 `//nolint:` directives, each with per-finding rationale | ## Closes - #490 (full module/ lint coverage) - Reaffirms #486 (extends from `unused` only to the 13-linter set) ## Follow-ups filed - #497 — surfaced (not caused) by this PR: `TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressure` fails on `origin/main` HEAD because the negative-fixture filter treats every non-canonical fixture as negative, including the `_real_world/*` positives added in #484. Fix sketch + repro included. ## Test plan - [x] `make verify` — green - [x] `make check` — green - [x] `make lint-module-full` — exit 0 - [x] `make doc-check` — green (comment-noise sweep) - [x] `make actionlint` — green - [x] `make zizmor` — green - [x] `make makefile-hotfile-check` — green (own gate) - [x] `make -pn` byte-identity check against `origin/main` for all 4 aggregate targets — passes (modulo intentional `makefile-hotfile-check` additions to ci-fast/ci-full + `lint-module-full` superseding `lint-unused-module` in check) - [x] `(cd module && GOWORK=off go test ./...)` — green except pre-existing #497 failure (filed) - [ ] CI green on this PR (kind workflows actually exercise the new composite action) ```release-notes infra: shard Makefile into make/*.mk (cascade-rebase prevention per [[rebase-cascade]]); unify 3 workflows behind .github/actions/kind-cluster-setup composite (CRD prereq install + pinned tools); enable full module/ lint coverage (57 findings → 0, 13 → 27 linters wired into make check). Closes #490; refs #494/#496. ``` --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>

) Closes #497. ## Root cause `TestPatternDetector_NegativeFixturesEmitNoVerdicts` enumerated every non-canonical subdir under `module/pkg/replay/pod_evicted/` as a negative fixture (filter: `if f.Name == \"canonical\" { continue }`). PR #484 introduced `_real_world/` as a slot for anonymized, operator-shaped **positive** captures (see `module/pkg/replay/pod_evicted/_real_world/README.md` — \"contributed fixtures land golden verdicts alongside\"). The first such fixture, `synthetic-2026-06-multi-rank-disk-pressure/`, ships a 3-verdict `golden.json` (2 full + 1 partial). The detector was correct; the test's negative-set definition was wrong. Two symptom-only fixes were possible (skip `_real_world/*` by path, or add a manifest-level `positive: bool`), but both leave the inverse drift hole: a positive fixture demoted to silently-skipped would still pass. ## Fix Replace path-based filtering with **golden-driven dispatch**: - A fixture's own `golden.json` declares its polarity — empty `[]` means \"detector must emit nothing\", non-empty means \"detector must emit exactly this\". - Negative test (`NegativeFixturesEmitNoVerdicts`): skips fixtures whose golden is non-empty; remaining set asserts zero verdicts (unchanged contract). - New positive test (`PositiveFixturesMatchGolden`): non-canonical fixtures with non-empty goldens must round-trip to that exact verdict slice. Closes the inverse-drift hole. Self-aligning: future contributions under any group land in the correct lane based on what their golden declares, not which subdir they live in. ## Verification - `cd module && GOWORK=off go test -run 'TestPatternDetector_NegativeFixturesEmitNoVerdicts|TestPatternDetector_PositiveFixturesMatchGolden' -v ./processor/patterndetectorprocessor/` — green (3 negatives + 1 positive subtest). - `cd module && GOWORK=off go test ./...` — green, no new failures. - `make lint` — 0 issues. - Pre-push hooks (`go vet`, `go mod verify`, `attribute-namespace-check`, `no-autoupdate-check`) — green. ## Test plan - [x] Reproduced RED on `origin/main` before edit. - [x] Both new test bodies run green locally. - [x] Full module test suite green. - [x] `make lint` green. ```release-notes fix(test): pattern-detector replay test now keys fixture polarity off `golden.json` content rather than directory name, so positive `_real_world/` captures land in the positive-assertion lane and operator contributions plug in without test-code edits. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr mentioned this pull request Jun 2, 2026

Sibling-detector _real_world/ infra: generalize PR #484's anonymizer (or land per-pattern slots on first capture) #485

Open

trilamsr enabled auto-merge (squash) June 2, 2026 05:35

This was referenced Jun 2, 2026

regression(chart): #481 AppArmor default-on breaks install-bench on AppArmor-less hosts #492

Closed

fix(chart): flip AppArmor default to opt-in (#492) (refs #481) #493

Merged

trilamsr added 2 commits June 1, 2026 23:34

Merge remote-tracking branch 'origin/main' into feat/m19-pod-evicted-…

b929cb6

…real-world-infra # Conflicts: # Makefile

Merge remote-tracking branch 'origin/main' into feat/m19-pod-evicted-…

c4583a5

…real-world-infra # Conflicts: # Makefile

trilamsr merged commit f8b5419 into main Jun 2, 2026
27 checks passed

trilamsr deleted the feat/m19-pod-evicted-real-world-infra branch June 2, 2026 06:59

This was referenced Jun 2, 2026

test: synthetic-2026-06-multi-rank-disk-pressure fixture mis-labelled as negative #497

Closed

chore(infra): Makefile shards + kind-CRD bootstrap + module/ full-lint #498

Merged

trilamsr mentioned this pull request Jun 4, 2026

test(patterndetector): key fixture polarity off golden.json (#497) #514

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(replay): pod_evicted PII anonymizer + real-world fixture (M19 #1)#484

feat(replay): pod_evicted PII anonymizer + real-world fixture (M19 #1)#484
trilamsr merged 4 commits into
mainfrom
feat/m19-pod-evicted-real-world-infra

trilamsr commented Jun 2, 2026

Uh oh!

trilamsr commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

trilamsr commented Jun 2, 2026

Summary

Root cause being fixed

Test plan

Follow-up

Uh oh!

trilamsr commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant