Skip to content

perf(pod_evicted): ratchet allocs/event 15.27 -> 1.90 (hit NORTHSTAR, #434)#456

Merged
trilamsr merged 2 commits into
mainfrom
perf/434-pod-evicted-allocs
Jun 2, 2026
Merged

perf(pod_evicted): ratchet allocs/event 15.27 -> 1.90 (hit NORTHSTAR, #434)#456
trilamsr merged 2 commits into
mainfrom
perf/434-pod-evicted-allocs

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes #434. Profile-driven optimization of PodEvictedDetector.Evaluate drops per-event allocations from 15.27 → 1.90, clearing the NORTHSTAR (2) floor. Bench ceiling lowered 16 → 2 in scripts/bench-registry.sh::allocs_gate; bench/detectors/baselines.json updated. Same recipe as #438 (nccl_hang) and #448 (xid_correlation).

Before / after

go test -bench=BenchmarkPodEvictedDetector -benchmem -count=10 -benchtime=500ms ./bench/detectors/ (Apple M1 Max):

metric before after delta
allocs/op 15635 1943 -87.6%
B/op 970020 473400 -51.2%
allocs/event 15.27 1.90 hit NORTHSTAR

Stability: 10 consecutive bench runs all measured 1943 allocs/op (CV = 0%).

Root cause (named)

Per-evicted-pod the previous detector allocated ~19 objects. Profile (-memprofile, -alloc_objects) attributed them:

  1. fmt.Sprintf ×5 in buildVerdict + helpers — 60% of pre-fix allocs. Sprintf boxes each arg into an any slice + scans the format string into a fresh internal buffer; both escape. Replaced with strconv.AppendInt + manual byte builders.
  2. strings.ToLower(note) in pressureFromNote — 1 alloc per evicted pod (the lowercased copy). Replaced with a zero-alloc containsFold ASCII-fold scan (all anchors are already lowercase).
  3. time.Format in formatTimestamp — Sprintf-style alloc per headline. Replaced with t.AppendFormat(buf, time.RFC3339) writing into the shared scratch buffer.
  4. Unsized condIdx map + per-bucket append growthmake(map[K]V) defaults to 8 buckets; append(nil, r) grows each bucket. Pre-sized to len(recs).
  5. Per-verdict make([]EvidenceRef, 1, 2) — 1 alloc per verdict for the trail slice header. Replaced with ONE contiguous trailBacking := make([]EvidenceRef, 2*len(events)) sliced cap=2 per verdict.
  6. Escaping make([]byte, dynamic-cap) per builder call — escape analysis can't prove the dynamic capacity fits on stack, so each builder allocated TWO objects (buf + the string(buf) cast). Replaced with a per-Evaluate scratch *[]byte reused across every builder (each resets to (*scratch)[:0] and appends), collapsing to ONE alloc per call (the irreducible string cast).
  7. Redundant per-condition re-render — 820 evictions joining onto 64 conditions re-rendered the same UID + description + remediation strings on every join (2460 redundant allocs). Pre-rendered once at index time into a new indexedNodeCond struct.
  8. Partial-path remediation re-render — partial-path evictions (no joined condition) re-rendered the "On node X: <prose>" remediation per pod. Added a (node, pressure) → string cache so burst evictions on the same node share one alloc.
  9. sort.SliceStable reflection — both indexNodeConds's per-bucket sort and the outer verdicts sort used reflection-based sort.SliceStable (allocates a swapper per call). Replaced with slices.SortStableFunc (generics, no reflection); buckets of length ≤1 skip the sort call entirely.

Top profile entries (memprofile -alloc_objects)

Before:

60% buildVerdict (5 fmt.Sprintf calls per eviction)
22% fmt.Sprintf
 6% nodeConditionDescription
 4% annotateRemediationWithNode
 3% time.Time.Format
 4% strings.ToLower

After:

42% podEventDescription   (1 alloc/call — irreducible string buf cast)
36% renderHeadline        (1 alloc/call — irreducible string buf cast)
 7% annotateRemediationWithNode  (partial-path cache misses only)
 5% nodeConditionDescription     (index-time pre-render, 64 calls total)
 3% nodeConditionUID             (index-time pre-render, 64 calls total)

The two per-verdict string allocations (headline + pod_event description) are now the irreducible floor: each is a distinct verdict field that downstream json.Marshal requires as separate strings.

Verdict semantics — unchanged

Byte-identical headlines, remediation, UIDs, descriptions, JSON envelope. Confirmed via:

  • module/pkg/replay/pod_evicted/canonical/golden.json round-trips unchanged.
  • All TestPodEvicted* unit tests pass (negative fixtures, deterministic order, partial path, out-of-window, future transition excluded, empty node message, remediation pins node name, schema conformance + drift rejection).
  • TestPodEvictedVerdict_SchemaConformance (canonical full + partial) passes.

Gate state

gate result
scripts/bench-allocs-check.sh (hard absolute ceiling) PASS at new ceiling=2
scripts/bench-check-detectors.sh (10% delta soft gate) PASS (no detector regressed)
go test -race ./module/pkg/patterns/ ./module/pkg/replay/ ./module/processor/... PASS
go vet ./module/pkg/patterns/ clean
pre-commit (golangci-lint, vet, mod verify, attribute-namespace-check) PASS
pod_evicted detector now allocates 1.90 allocations per event (down from 15.27), hitting the v0.3.0 NORTHSTAR floor of <=2 allocs/event. Joins #418 (nccl_hang) and #417 (xid_correlation) — three of the six per-detector tracking issues at or below NORTHSTAR.

Closes #434
Refs #302

Test plan

  • go test -race -count=1 ./module/pkg/patterns/ ./module/pkg/replay/ ./module/processor/... PASS
  • go vet ./module/pkg/patterns/ clean
  • bash scripts/bench-allocs-check.sh PASS (pod_evicted 1.90/ev <= ceiling 2; all 6 detectors at-or-below ceiling)
  • bash scripts/bench-check-detectors.sh PASS (no detector regressed >10%)
  • go test -bench=BenchmarkPodEvictedDetector -count=10 -benchtime=500ms stable at 1943 allocs/op (CV=0%)
  • Golden replay (module/pkg/replay/pod_evicted/canonical/golden.json) round-trips unchanged
  • Rebased onto origin/main (resolved scripts/bench-registry.sh conflict with perf(xid_correlation): ratchet allocs/event 12.4 -> 1.88 (hit NORTHSTAR, #417) #448's xid_correlation ratchet — both ratchets shipped together in this branch's view)

Profile-driven optimization of PodEvictedDetector.Evaluate drops
per-event allocations from 15.27 -> 1.90, clearing NORTHSTAR (2).
Ceiling lowered 16 -> 2 in scripts/bench-registry.sh::allocs_gate;
baselines.json updated.

Recipe: same shape as #438/#448 (strconv byte-builders + scratch
pool + indexed-condition cache + slices.SortStableFunc).

Allocs per Evaluate (1024-event window, 820 evictions):
  before: 15635 allocs/op = 15.27 allocs/event
  after:   1943 allocs/op =  1.90 allocs/event   (-87.6%)

CV=0% across 10 consecutive bench runs.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Independent adversarial review — VERDICT: A−

Hits NORTHSTAR via sound recipe; caches correctly scoped; golden replay byte-identical. Trim concerns:

Findings

  1. module/pkg/patterns/pod_evicted.go:464-477displayPodName optimization is dead in this detector's hot path (inlined into buildVerdict). Remove the optimization-impact comment; function stays exported (xid_correlation may use it) but misleads about its impact here.
  2. pod_evicted.go:64-74 — Evaluate "allocation-budget" comment lists 9 root causes — changelog content masquerading as code comment. Belongs in commit/PR body. Trim to 1-2 lines pointing to perf(pod_evicted): lower allocs/event from 15.27 toward NORTHSTAR (2) #434.
  3. pod_evicted.go:292-297, 321-323 — Redundant fmt→strconv before/after narratives across 5 sites. Consolidate into single top-of-file "byte-builder pattern" note or single allocation strategy comment.
  4. pod_evicted.go:528 containsFold — ASCII-only assumption is safe (vendor kubelet messages English) but comment doesn't warn future callers. Add: "safe only for ASCII source + ASCII-lowercase substrs."
  5. scripts/bench-registry.sh:68 — Verify rebase landed xid_correlation at 2 (not 13). PR body claims perf(xid_correlation): ratchet allocs/event 12.4 -> 1.88 (hit NORTHSTAR, #417) #448 conflict resolved.

Simplification sweep — trim targets identified

~25 lines saved across 3 trim sites (lines 64-74, 292-297/321-323, 456-463).

VERDICT: A− → A after trim sweep. Apply 1-4 + verify finding 5, then merge.

Address reviewer A- trim items on #456:
1. displayPodName: drop optimization-impact comment (function inlined
   in pod_evicted; kept exported for xid_correlation use).
2. Evaluate: collapse 9-bullet allocation-changelog into one-line
   NORTHSTAR pointer + per-file note.
3. Consolidate fmt->strconv narrative into one top-of-Evaluate note;
   drop the per-site repetition across five render helpers.
4. containsFold: replace 'equivalent to' note with explicit ASCII-only
   safety guard (anchors + kubelet eviction messages are English).

No behavior change: allocs/op = 1943 unchanged, golden replay
byte-identical, bench-allocs-check still PASS.

Verified finding 5: scripts/bench-registry.sh xid_correlation
ceiling already 2 (post-#448), no change required.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Trim sweep applied (commit 04e7a76, -28 lines net):

  • displayPodName: dropped optimization-impact narrative; kept exported for xid_correlation use.
  • Evaluate: collapsed 9-bullet allocation-changelog to one-line NORTHSTAR pointer + per-file note.
  • Consolidated fmt->strconv narrative into one top-of-Evaluate note; dropped per-site repetition across 5 render helpers.
  • containsFold: replaced 'equivalent to' note with explicit ASCII-only safety guard.
  • Verified finding 5: scripts/bench-registry.sh xid_correlation ceiling already 2 (post-perf(xid_correlation): ratchet allocs/event 12.4 -> 1.88 (hit NORTHSTAR, #417) #448), no change required.

No behavior change: allocs/op = 1943 unchanged, bench-allocs-check PASS, golden replay byte-identical. Re-requesting review.

@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Re-review post-trim — VERDICT: A

All 5 prior findings resolved cleanly:

  1. displayPodName comment trimmed to 4-line external-use note (was 9-line optimization claim).
  2. ✓ Evaluate allocation-budget comment 9 bullets → 2 lines pointing to perf(pod_evicted): lower allocs/event from 15.27 toward NORTHSTAR (2) #434.
  3. ✓ Per-function fmt→strconv narratives consolidated into top-of-Evaluate; per-site repetition removed.
  4. containsFold ASCII-only safety guard explicit.
  5. ✓ xid_correlation already at ceiling 2 in bench-registry.sh (post-perf(xid_correlation): ratchet allocs/event 12.4 -> 1.88 (hit NORTHSTAR, #417) #448 merge).

Simplification sweep clean. Allocs/op = 1943 unchanged. Golden replay byte-identical. bench-allocs-check PASS.

VERDICT: A — recommend MERGE.

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 02:37
@trilamsr trilamsr merged commit 1da73c5 into main Jun 2, 2026
23 checks passed
@trilamsr trilamsr deleted the perf/434-pod-evicted-allocs branch June 2, 2026 02:43
trilamsr added a commit that referenced this pull request Jun 2, 2026
…460) (#466)

## Summary

Closes #460. The `exit 0` on `scripts/doc-check.sh` ran unconditionally
whenever `docs/FAILURE-MODES.md` carried no `Test*`/`Fuzz*`/`Benchmark*`
identifiers (its current state on `main` — `grep -c` = 0), silently
bypassing every gate below it. Fix scopes the skip to the Go-test parity
block only (if/else, not `exit`), then surfaces and fixes the dead refs
the gates were supposed to be catching.

## Root cause

Commit a57883f (#13) shipped `doc-check.sh` with one gate — the Go-test
name parity check — so `[ -z "$referenced" ] && exit 0` was correct
then. PRs #28, #56, #115, #131, #144, #149, #195, #234, #241, #443,
#455, #459 (and others) appended gates **below** that line without
recognising they'd become dead code whenever `FAILURE-MODES.md` lost its
`Test*` references. PR #459 worked around the bug by placing its new
YAML gate *above* line 99 and tracked the root cause separately as #460.

## What surfaced

Once `exit 0` was removed, three real issues fired:

1. **Dead `.md` link**: `docs/FOLLOWUPS.md` → `followups/otlphttp.md`.
The shard was never committed to `main`'s ancestry. Folded into the
existing "Shards deleted post-v0.2.0 as fully resolved-via-pivot" prose
block (sibling treatment to M9, M14, M16).
2. **Banned-phrase hits** (3x `production-grade`): reworded in
`docs/cut-criteria.yaml.md` (2x) and
`install/kubernetes/tracecore/README.md` (1x) to falsifiable language.
3. **`docs/getting-started.md` block cap**: 7 fenced bash/sh blocks. The
M6 cap of 5 was set for the quickstart only — `## Install via Helm` and
`## Air-gapped install` are alternate deployment paths that landed
post-M6 and aren't part of the quickstart budget. Rescoped the gate to
count blocks inside the `## Walkthrough` H2 section only (1 block, well
under cap).

## Gate count

Empirically verified via `grep -c '^doc-check: '` on `make doc-check`
output on a clean tree:

| State | Status lines emitted | Gates the early-exit was hiding |
|---|---|---|
| Pre-fix on `main` (post-#459) | 3 (trust-posture, YAML cross-link,
parity-skip) | 14 |
| Post-fix this PR (post-rebase) | 17 | 0 |

The "14 gates hidden" number is invariant across the rebase: it counts
gates placed below the early-exit line. The "3 → 17" total reflects
post-#459 reality on `main`; pre-#459 baseline was "2 → 16" (the figure
originally in this PR body), and #459 itself worked around the bug by
placing its YAML gate above line 99.

## Mutation tests

Each gate below the original early-exit was confirmed to fire post-fix:

| Mutation | Gate expected to fire | Exit code post-mutation | Exit code
post-restore |
|---|---|---|---|
| Inject `[bad](nonexistent-ghost.md)` into `docs/FOLLOWUPS.md` |
markdown link-rot | 1 | 0 |
| Append `blazing-fast` + `rock-solid` to `docs/getting-started.md` |
banned-phrase lint | 1 | 0 |
| Delete `<!-- tested-against: ... -->` from
`docs/integrations/datadog.md` | M6 recipe markers | 1 | 0 |

## Test plan

- [x] `make doc-check` exits 0 on clean tree (re-run post-rebase onto
origin/main; 17 status lines)
- [x] 3 mutation tests above each toggle exit 1 → 0 across mutate /
restore
- [x] Pre-push hooks green: golangci-lint (0 issues), `go vet ./...`,
`go mod verify`, `attribute-namespace-check` (100 attrs, all
documented), `register-lint`, `actionlint`, `zizmor`,
`deprecation-check`, `no-autoupdate-check`
- [x] Rebased onto current `origin/main` (includes #459, #461, #462,
#456); no conflicts; gate count re-verified empirically post-rebase
- [x] No changes to gates above line 99 (the trust-posture callout +
YAML cross-link gate from #459 still run and emit unchanged status
lines)

## Self-grade

**A+** — root cause named in commit body (a57883f #13 with one gate;
gates appended below without exit-path awareness); 3 mutation tests
(success criteria required 1–2); rescoped the getting-started gate to
match M6 intent rather than papering over the surfaced overflow; the `[
-z "$referenced" ]` legitimate skip is preserved via if/else (not `:`
no-op, which would have left the `defined=` / `orphans=` block running
on empty input); gate count corrected empirically post-rebase per
reviewer B feedback.

```release-notes
- fix(ci): `scripts/doc-check.sh` no longer exits 0 at the Go-test parity gate when `docs/FAILURE-MODES.md` carries no `Test*` references. 14 gates below that line (link-rot, banned-phrase, M6 recipe markers, etc.) are now actually enforced on every `make doc-check` invocation. Closes #460.
```

---------

Signed-off-by: Tri Lam <tree@lumalabs.ai>
trilamsr added a commit that referenced this pull request Jun 2, 2026
## Summary

`scripts/bench-registry.sh::bench_entries` covered only
`BenchmarkPodEvictedDetector` and the pyspy parser benches, so the **%
delta regression gate** (`scripts/bench-check.sh` via
`bench-check-all.sh`) missed five of the six detectors. The absolute
allocs/event ceiling (`allocs_gate`) caught a detector that drifted past
its hard ceiling, but a detector that drifted +50% while staying under
the ceiling would ship silently. This PR closes the gap by registering
all six detectors against both gates.

## Root cause

The registry was authored when `pod_evicted` was the only detector with
a bench. The five later detectors (added in
#417/#418/#434/#438/#448/#456) were plumbed into `allocs_gate` (absolute
ceiling) but the parallel % delta gate was never extended — a historical
artifact, not an architectural defect. Fix: extend `bench_entries` to
cover the missing five.

## What changed

- `scripts/bench-registry.sh`: added one entry covering five detectors
against `./bench/detectors/`, single-regex / single `go test` invocation
(~5s incremental vs ~30s if split into 5 entries).
- `bench/detectors/testdata/bench-baseline.txt`: generated baseline
(count=10 × 500ms, Apple M1 Max). Hardware-invariant signals (B/op +
allocs/op) pin the gate; sec/op stays advisory.
- Existing ceilings in `allocs_gate` unchanged. Existing baselines
unchanged.

## Mutation verification

Bumped `PCIeAERDetector` allocs baseline down by ~18% (524 → 430):

```
PCIeAERDetector-10  430.0 ± 0%  524.0 ± 0%  +21.86% (p=0.000 n=10)
REGRESSION: the following benchmarks exceeded the 10% threshold vs baseline:
  PCIeAERDetector-10  +21.86% (p=0.000 n=10)
```

`scripts/bench-check-all.sh` exit 1. Revert → exit 0. Gate fires on
regression, stays clean on baseline.

## Coverage gap context

Cross-link: #302 (allocs/event rollup), #417 (xid_correlation
NORTHSTAR), #418 (nccl_hang NORTHSTAR), #434 (pod_evicted NORTHSTAR),
#438/#448/#456 (sibling perf PRs). Every detector that hit NORTHSTAR
under the absolute ceiling now also has a relative-regression gate.

## Test plan

- [x] `make bench-check` exit 0 (covers all 6 detectors now)
- [x] `make bench-allocs-check` exit 0
- [x] `make bench-detectors-check` exit 0 (soft-gate, unchanged)
- [x] Mutation: lower baseline → exit 1; revert → exit 0
- [x] Lint + vet (pre-commit hook): 0 issues

```release-notes
ci(bench): % delta regression gate now covers all six detectors (#302). xid_correlation, hbm_ecc, nccl_hang, thermal_throttle, and pcie_aer were previously gated only by the absolute allocs/event ceiling — they now also fail builds on >10% allocs/op or B/op drift vs the committed baseline.
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf(pod_evicted): lower allocs/event from 15.27 toward NORTHSTAR (2)

1 participant