Skip to content

feat(replay): pod_evicted PII anonymizer + real-world fixture (M19 #1)#484

Merged
trilamsr merged 4 commits into
mainfrom
feat/m19-pod-evicted-real-world-infra
Jun 2, 2026
Merged

feat(replay): pod_evicted PII anonymizer + real-world fixture (M19 #1)#484
trilamsr merged 4 commits into
mainfrom
feat/m19-pod-evicted-real-world-infra

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes the M19 carry-forward #1 infrastructure obligation: real-world pod_evicted replay captures can now be safely contributed.

  • Deterministic PII anonymizer: scripts/anonymize-pod-evicted-fixture.sh (--rewrite rewrites event_uid / regarding.{namespace,name,uid} / reporting_instance / node_{name,uid} to <prefix>-<sha8(value)> while preserving -rank-N suffixes; --verify refuses any fixture still carrying IPv4, email, EC2/GKE/AKS, or AWS-ECR/GCR-style image-ref shapes in prose).
  • Mutation tests: scripts/anonymize-pod-evicted-fixture_test.sh proves the verifier catches every PII shape it claims to catch, the rewrite is byte-deterministic across two passes, and false-positives stay quiet on innocent inputs (v1.28.4-style version strings).
  • Synthetic real-world-shaped fixture: module/pkg/replay/pod_evicted/_real_world/synthetic-2026-06-multi-rank-disk-pressure/ exercises a 3-pod disk-pressure burst with two full-confidence joins (per-condition cache reuse) + one partial-remediation eviction at T+35s (outside the default 30s JoinWindow → note-inferred pressure path).
  • Loader-symmetry test: TestPodEvictedReplay_RealWorldGroupLoaderSafe now asserts the loader walks _real_world/ exactly like _negative/ and would catch a future refactor that broke either group walk.
  • Threat-model + MILESTONES updated: the §7 audit row references the anonymizer; the M19 carry-forward bullet reflects what's shipped vs still pending (operator captures).

Root cause being fixed

M19 carry-forward #1 was "no captures contributed yet" — but the deeper blocker was that no operator could safely contribute without (a) a deterministic anonymizer they could rerun on their side, (b) a verifier strong enough to use as a CI gate, and (c) loader proof that _real_world/ actually walks. This PR ships all three. Future captures plug in without code changes.

Test plan

  • go test ./module/pkg/replay/... -count=1 → all green; new synthetic-2026-06-multi-rank-disk-pressure subtest runs.
  • bash scripts/anonymize-pod-evicted-fixture_test.sh → 11 assertions pass (baseline clean, IPv4 / email / EC2 / GKE / ECR shapes flagged, version-string false-positive guarded, deterministic-rewrite byte-equal, every raw input string stripped, shipped fixture clean).
  • make anonymize-pod-evicted-fixture-check → wires verify + mutation tests together; exits 0.
  • bash scripts/doc-check.sh → unaffected, still clean.
  • shellcheck clean on both new scripts.
  • go vet ./module/... clean.

Follow-up

  • cuda_oom, nccl_hang, hbm_ecc and the other pattern detectors don't yet have _real_world/ slots. The anonymizer is shaped to generalize (the structured-field map is the only pattern-specific bit; the prose-PII regex set is universal). Tracked as a follow-up issue once a second operator capture justifies the rule-of-three lift.
feat: pod_evicted replay fixtures gain a deterministic PII anonymizer
(`scripts/anonymize-pod-evicted-fixture.sh`) and a synthetic
multi-rank disk-pressure fixture under
`module/pkg/replay/pod_evicted/_real_world/`, closing M19
carry-forward #1's infrastructure obligation. Operator-contributed
captures still welcome.

M19 carry-forward #1 — ship the infrastructure that lets operators
contribute anonymized pod_evicted captures under
`module/pkg/replay/pod_evicted/_real_world/<anon-name>/`.

* `scripts/anonymize-pod-evicted-fixture.sh` — deterministic sha8
  rewrite of event_uid / regarding.{namespace,name,uid} /
  reporting_instance / node_{name,uid}; verifier flags surviving
  IPv4 / email / cloud-instance-node / image-ref shapes in note +
  message prose.
* `scripts/anonymize-pod-evicted-fixture_test.sh` — mutation tests:
  baseline-clean passes; IPv4 / email / EC2 / GKE / ECR shapes
  fail verify; `v1.28.4`-style version strings do NOT false-positive;
  rewrite is deterministic (two passes byte-identical) and strips
  every raw input string.
* `synthetic-2026-06-multi-rank-disk-pressure/` — synthetic-but-
  real-world-shaped fixture exercising multi-rank disk-pressure
  burst with mixed full+partial confidence (third eviction at T+35s
  falls outside the 30s join window, partial-remediation path
  inferring disk pressure from note).
* `TestPodEvictedReplay_RealWorldGroupLoaderSafe` — asserts the
  loader walks `_real_world/` identically to `_negative/`; the
  synthetic fixture is the load-bearing proof of the loader path.
* README polished with the explicit PII-field map + cross-link to
  `docs/threat-model.md`; threat-model row updated to reflect the
  partial-shipped enforcement.
* `make ci-full` + `make verify` gain
  `anonymize-pod-evicted-fixture-check` so a PR that drops raw PII
  into `_real_world/` fails before merge.

```release-notes
feat: pod_evicted replay fixtures gain a deterministic PII anonymizer
(`scripts/anonymize-pod-evicted-fixture.sh`) and a synthetic
multi-rank disk-pressure fixture under
`module/pkg/replay/pod_evicted/_real_world/`, closing M19 carry-forward
welcome.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Closes the spec-impl gap reviewer flagged on PR #484: README §1
"Contribution checklist" promised `event_time` + `transition_at`
get normalized so the earliest event lands at 2026-01-01T00:00:00Z
(offsets preserved), but rewrite_dir() never implemented it.

Implementation: collect all (event_time, transition_at) values
across events.json + node_conditions.json, compute a single delta
= target - min, append (orig -> shifted) pairs to the existing
sed-driven rewrite map. Goldens stay consistent because their
prose embeds the same strings the structured fields carry.

Determinism preserved: delta is a pure function of the input min,
so re-anonymization is byte-identical. Synthetic fixture earliest
is already 2026-01-01T00:00:00Z, so delta=0 — no goldens drift.

Added two assertions to the test harness:
- earliest timestamp anchors at exactly 2026-01-01T00:00:00Z and
  inter-event offsets are preserved
- no rewritten timestamp predates the anchor

Also tightened README (and the script header) phrasing on IPv6:
the verifier only auto-detects IPv4 prose; IPv6 must be scrubbed
by eye. The "IPv4/IPv6" wording over-promised.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

BLOCK fix: timestamp normalization implemented + tested; IPv6 phrasing clarified. Re-requesting review.

  • rewrite_dir() now collects every event_time + transition_at, computes a single delta = target - min, and appends (orig -> shifted) pairs into the existing sed-driven rewrite map. Determinism preserved (pure function of input min). Goldens stay consistent because their prose embeds the structured strings.
  • Two new assertions in scripts/anonymize-pod-evicted-fixture_test.sh: anchor-at-2026-01-01T00:00:00Z + inter-event-offsets-preserved, and no-pre-anchor-leak.
  • README + script header tightened: verifier only auto-detects IPv4 prose; IPv6 scrubbed by eye.
  • Verification: bash scripts/anonymize-pod-evicted-fixture_test.sh 12/12, make anonymize-pod-evicted-fixture-check exit 0, go test -race -count=1 ./module/pkg/replay/... ok.

Commit: cd760f2.

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 05:35
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

PR #481 shipped `securityHardening.appArmorProfile.enabled: true` as the
default in `install/kubernetes/tracecore/values.yaml`. Kubelet rejects
pod-create when `pod.securityContext.appArmorProfile` references a
profile the host cannot resolve, so the chart no longer installs on
AppArmor-less nodes — including the ubuntu-latest GitHub Actions runner
image (AppArmor dropped post-2024) and RHEL/SELinux production hosts.
install-bench regressed; PRs #491, #484, #479, #431 are blocked behind
this.

This PR implements option (a) from #492: flip the default to opt-in.
`values-production.yaml` keeps `enabled: true` since AppArmor-equipped
Linux clusters (the production target) ship `RuntimeDefault` via
containerd / CRI-O.

## Root cause

Default-on AppArmor in `values.yaml` violated the chart contract that
the default render installs on a vanilla cluster. The defense-in-depth
posture is correct for production-preset users; it was wrong as the
unconditional default. PR #481 didn't add a CI gate to assert "default
render installs on a host without AppArmor", so the regression escaped
review.

## Changes

- `install/kubernetes/tracecore/values.yaml`:
`securityHardening.appArmorProfile.enabled: true` -> `false`; in-line
guidance reflects opt-in posture and names the failing-host classes (CI
runners, RHEL/SELinux).
- `install/kubernetes/tracecore/values-production.yaml`: unchanged —
production preset still hardens with `enabled: true`.
- `install/kubernetes/tracecore/README.md`: defaults table +
Defense-in-depth section explain the opt-in posture, point operators at
`values-production.yaml` for the prior behavior, and link #492.
- `.github/workflows/chart.yml`: AppArmor mutation tests reshuffled from
6 to 8 cases. T1/T2 now assert default render emits **no** AppArmor
field or annotation on K8s 1.30 + 1.28 (regression-prevent for #492).
T3/T4 cover the opt-in path (`--set enabled=true`) and pin pre-#492
production-preset behavior. T7/T8 explicitly pass `--set enabled=true`
so the Localhost-profile contract still fires under the new default.
Production-preset assertion (`appArmorProfile.type=RuntimeDefault` from
`values-production.yaml`) is untouched.

## Backward compatibility

**Behavior change for default-values users.** Operators who installed
via `helm install ... install/kubernetes/tracecore` (no production
preset) and depended on the AppArmor hardening that #481 added will see
it disappear on next upgrade. Two ways to keep the prior behavior:

```bash
# Option 1 — adopt the production preset (recommended).
helm upgrade demo install/kubernetes/tracecore \
  --values install/kubernetes/tracecore/values-production.yaml

# Option 2 — keep your current values, just flip the flag.
helm upgrade demo install/kubernetes/tracecore \
  --set securityHardening.appArmorProfile.enabled=true
```

Operators who relied on the chart's documented default (#481 was three
days old; opt-in is the chart-hygiene norm for defense-in-depth knobs)
get a quieter install on AppArmor-less hosts.

## Test plan

- [x] `helm lint install/kubernetes/tracecore` — 0 warnings.
- [x] `helm template ... --kube-version 1.30.0 --show-only
templates/daemonset.yaml | grep -i apparmor` — empty (default render has
no AppArmor).
- [x] Same with `--kube-version 1.28.0` — empty.
- [x] `helm template ... --values values-production.yaml --kube-version
1.30.0` — renders `appArmorProfile.type: RuntimeDefault` (production
preset unchanged).
- [x] `helm template ... --set
securityHardening.appArmorProfile.enabled=true --kube-version 1.30.0` —
renders structured field (opt-in works).
- [x] All 8 mutation tests in `.github/workflows/chart.yml` AppArmor
step run locally and pass.
- [x] conftest: 52/52 default render, 91/91 production render.
- [x] actionlint: 0 issues on `chart.yml`.
- [x] Pre-commit (golangci-lint, vet, attribute-namespace-check,
test-flake-audit) — all green.
- [ ] CI: chart workflow turns green on this PR.
- [ ] CI: install-bench turns green on this PR (and unblocks #491 / #484
/ #479 / #431 once merged).

## Refs

Closes #492 (refs #481).

```release-notes
**Breaking (default-values users only).** `securityHardening.appArmorProfile.enabled` now defaults to `false` in `values.yaml` so the chart installs on AppArmor-less nodes (CI runners, RHEL/SELinux). The `values-production.yaml` preset still ships `enabled: true` — production Linux clusters that package the `RuntimeDefault` profile (every distro with containerd / CRI-O) keep the hardening when they layer that preset. Operators upgrading default-values installs who want the prior behavior can either adopt `values-production.yaml` or set `--set securityHardening.appArmorProfile.enabled=true`. Fixes the install-bench regression introduced in #481.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr trilamsr merged commit f8b5419 into main Jun 2, 2026
27 checks passed
@trilamsr trilamsr deleted the feat/m19-pod-evicted-real-world-infra branch June 2, 2026 06:59
trilamsr added a commit that referenced this pull request Jun 2, 2026
#498)

## Summary

Triple-shipper closing three load-bearing infra debts that recurred on
every chart/CI-touching PR. Atomic so we handle this once.

### Part 1 — Makefile sharding (cascade-rebase tripwire)

**Root cause:** The root `Makefile` carried four monolithic prereq lists
(`.PHONY:`, `check:`, `verify:`, `ci-fast:`, `ci-full:`). Every new gate
appended one token to each list, and two open PRs touching the same line
produced a 3-way merge conflict that required manual fix-up — the
dominant source of cascade-rebases on this repo.

**Fix:** Split into `make/{phony,check,verify,ci-fast,ci-full}.mk`
shards using `+=` appends. Main `Makefile` `include`s the shards;
aggregate targets now consume `$(*_DEPS)`. Prereq sets are logically
equivalent to `origin/main` (modulo intentional gate additions): `make
-pn` shows `lint-unused-module` replaced by `lint-module-full` in
`check` (Part 3) and the new `makefile-hotfile-check` added to
`ci-fast`/`ci-full` (Part 1 A+). No other prereq tokens moved.

**A+:** Added `scripts/makefile-hotfile-check.sh` (+ `make
makefile-hotfile-check` target) that fails if a future PR re-inlines
prereq tokens into the root `Makefile`. Wired into `ci-fast` + `ci-full`
so drift trips per-PR.

### Part 2 — Kind-CRD bootstrap composite action

**Root cause:** Three workflows (chart.yml, policy-matrix.yml,
install-bench.yml) each separately installed helm + kind + the tracecore
image, each drifted from the others on CRD prereqs (ServiceMonitor #494
fixed policy-matrix only; chart.yml + install-bench.yml remained
vulnerable to the same regression).

**Fix:** Created `.github/actions/kind-cluster-setup/action.yml` as a
single source of truth: pinned helm v3.16.4, kind v0.25.0, node v1.32.0,
ServiceMonitor CRD v0.91.0 (#494 pin), with toggles for Gatekeeper /
cert-manager CRDs (reserved for future workflows). All 3 workflows now
`uses:` the composite. Old `kind-tracecore-up` shim deleted (zero
remaining callsites).

**Mutation-verify:** changing the ServiceMonitor CRD URL in
`kind-cluster-setup/action.yml` fails all 3 workflows uniformly by
construction (single-source-of-truth pin).

### Part 3 — Full module/ lint coverage (#490 follow-up of #486)

**Root cause:** `make lint` from the root never reaches `module/`
(workspace mode resolves `./...` only inside the current module). PR
#486 added `make lint-unused-module` for the `unused` linter only; the
rest of the 13 linters declared in `.golangci.yml` were silently skipped
against `module/`, accumulating 57 findings.

**Fix:** New `make lint-module-full` target runs the full
`.golangci.yml` linter set against `module/`. Swept all 57 pre-existing
findings to 0:
- `golangci-lint --fix`: 17 findings auto-fixed (testifylint 14,
errorlint 1, perfsprint 1, staticcheck-QF1003 1)
- Real fixes: forcetypeassert → checked assertions (6); goconst →
`fallbackCollectiveOp`, `PressureUnknown` constants (4); gocritic
rewrites (2); predeclared renames (3); revive package-comments (1);
prealloc (1); wrapcheck `fmt.Errorf "%w"` (2); G301 mkdir 0755 → 0750
(1); exhaustive → explicit `default:` clauses (7) + explicit
`PressureUnknown` case (1)
- Documented opt-outs with per-finding rationale: gocyclo on 5
pattern-match dispatch funcs whose complexity tracks vocabulary
cardinality, not nested logic; gosec G103 on audited `unsafe.String`
aliasing carve; gosec G304 on 5 test-local fixture reads; staticcheck
SA1019 on explicit pre-deprecation parity assertion (#277)

`make lint-module-full` exits 0 on the cleaned tree. Wired into `make
check`, replacing `lint-unused-module` (retained for fast-iteration
dead-code sweeps).

## Per-part metrics

| Part | Metric | Before | After |
|---|---|---|---|
| 1 | Makefile aggregate-list LoC (hot lines) | 5 monolithic lines | 5
`include` lines |
| 1 | Make-target prereq sets vs `origin/main` | — | logically
equivalent (modulo intentional gate additions; see Part 1 Fix) |
| 1 | New shards under `make/` | 0 | 5 (`phony`, `check`, `verify`,
`ci-fast`, `ci-full`) |
| 2 | Composite action wired into N workflows | 0 (each duplicated kind
setup) | 3 (chart, policy-matrix, install-bench) |
| 2 | CRDs covered | 0 (ad-hoc per workflow) | ServiceMonitor (now),
Gatekeeper + cert-manager (reserved) |
| 2 | Workflow LoC delta (kind setup blocks) | 3× ~10 lines duplicated |
3× ~10 lines (composite-action `uses:` blocks) |
| 3 | Lint findings (module/) | 57 (across 14 linters) | 0 |
| 3 | Linters enabled against module/ | 1 (`unused`) | 13 (full
`.golangci.yml` set) |
| 3 | Documented opt-outs | — | 14 `//nolint:` directives, each with
per-finding rationale |

## Closes

- #490 (full module/ lint coverage)
- Reaffirms #486 (extends from `unused` only to the 13-linter set)

## Follow-ups filed

- #497 — surfaced (not caused) by this PR:
`TestPatternDetector_NegativeFixturesEmitNoVerdicts/synthetic-2026-06-multi-rank-disk-pressure`
fails on `origin/main` HEAD because the negative-fixture filter treats
every non-canonical fixture as negative, including the `_real_world/*`
positives added in #484. Fix sketch + repro included.

## Test plan

- [x] `make verify` — green
- [x] `make check` — green
- [x] `make lint-module-full` — exit 0
- [x] `make doc-check` — green (comment-noise sweep)
- [x] `make actionlint` — green
- [x] `make zizmor` — green
- [x] `make makefile-hotfile-check` — green (own gate)
- [x] `make -pn` byte-identity check against `origin/main` for all 4
aggregate targets — passes (modulo intentional `makefile-hotfile-check`
additions to ci-fast/ci-full + `lint-module-full` superseding
`lint-unused-module` in check)
- [x] `(cd module && GOWORK=off go test ./...)` — green except
pre-existing #497 failure (filed)
- [ ] CI green on this PR (kind workflows actually exercise the new
composite action)

```release-notes
infra: shard Makefile into make/*.mk (cascade-rebase prevention per [[rebase-cascade]]); unify 3 workflows behind .github/actions/kind-cluster-setup composite (CRD prereq install + pinned tools); enable full module/ lint coverage (57 findings → 0, 13 → 27 linters wired into make check). Closes #490; refs #494/#496.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 4, 2026
)

Closes #497.

## Root cause

`TestPatternDetector_NegativeFixturesEmitNoVerdicts` enumerated every
non-canonical subdir under `module/pkg/replay/pod_evicted/` as a
negative fixture (filter: `if f.Name == \"canonical\" { continue }`).
PR #484 introduced `_real_world/` as a slot for anonymized,
operator-shaped **positive** captures (see
`module/pkg/replay/pod_evicted/_real_world/README.md` — \"contributed
fixtures land golden verdicts alongside\"). The first such fixture,
`synthetic-2026-06-multi-rank-disk-pressure/`, ships a 3-verdict
`golden.json` (2 full + 1 partial). The detector was correct; the
test's negative-set definition was wrong.

Two symptom-only fixes were possible (skip `_real_world/*` by path,
or add a manifest-level `positive: bool`), but both leave the inverse
drift hole: a positive fixture demoted to silently-skipped would
still pass.

## Fix

Replace path-based filtering with **golden-driven dispatch**:

- A fixture's own `golden.json` declares its polarity — empty `[]`
  means \"detector must emit nothing\", non-empty means \"detector
  must emit exactly this\".
- Negative test (`NegativeFixturesEmitNoVerdicts`): skips fixtures
  whose golden is non-empty; remaining set asserts zero verdicts
  (unchanged contract).
- New positive test (`PositiveFixturesMatchGolden`): non-canonical
  fixtures with non-empty goldens must round-trip to that exact
  verdict slice. Closes the inverse-drift hole.

Self-aligning: future contributions under any group land in the
correct lane based on what their golden declares, not which subdir
they live in.

## Verification

- `cd module && GOWORK=off go test -run
'TestPatternDetector_NegativeFixturesEmitNoVerdicts|TestPatternDetector_PositiveFixturesMatchGolden'
-v ./processor/patterndetectorprocessor/` — green (3 negatives + 1
positive subtest).
- `cd module && GOWORK=off go test ./...` — green, no new failures.
- `make lint` — 0 issues.
- Pre-push hooks (`go vet`, `go mod verify`,
`attribute-namespace-check`, `no-autoupdate-check`) — green.

## Test plan

- [x] Reproduced RED on `origin/main` before edit.
- [x] Both new test bodies run green locally.
- [x] Full module test suite green.
- [x] `make lint` green.

```release-notes
fix(test): pattern-detector replay test now keys fixture polarity off `golden.json` content rather than directory name, so positive `_real_world/` captures land in the positive-assertion lane and operator contributions plug in without test-code edits.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant