perf(nccl_hang): ratchet allocs/event 3.99 -> 1.70 (hit NORTHSTAR, #418)#438
Conversation
Profile-driven optimization of NCCLHangDetector.Evaluate. Three layers
of fix, each measured under bench/detectors/ count=10 × 500ms:
1. fmt.Sprintf → strconv-based byte builders for the four hot strings
(per-evidence UID + description, per-cohort headline + remediation).
Per-Sprintf interface-boxing alloc + format-scan alloc were ~70% of
the per-event count under the bench fixture; replacing them collapses
each render to a single string(buf) allocation.
2. Pre-size the (pg,collective,rank) latest map to len(records) so the
runtime's grow-by-doubling path doesn't fire mid-loop.
3. Eliminate the per-cohort map+slice growth entirely: stream survivors
into a single contiguous []NCCLFRRecord, sort once by
(pg, collective, rank), then walk contiguous runs. Drops the
per-record cohorts[k] = append(...) alloc storm (was ~500 allocs/call
residual after step 1) and lets buildNCCLHangVerdict skip its own
per-cohort sort since the cohort hand-off is already rank-sorted.
Before / after (Apple M1 Max, count=10):
before: 4088 allocs/op 727096 B/op 3.99 allocs/event
after: 1743 allocs/op 521605 B/op 1.70 allocs/event
Top alloc_objects before / after:
before: ncclHangEvidenceDescription 38%
fmt.Sprintf 33%
ncclHangHeadline 8%
Evaluate (maps) 5%
...
after: ncclHangEvidenceDescription 56% (~1 alloc/rank — string buf cast)
ncclHangEvidenceUID 29% (~1 alloc/rank — string buf cast)
buildNCCLHangVerdict ranks+trail ~3% each (2 allocs/cohort)
remaining < 5%
The two per-rank string allocations (UID, description) are now the
irreducible floor — they're distinct fields on EvidenceRef and the
JSON-marshal contract requires distinct strings.
Verdict semantics unchanged. Golden replay
(module/pkg/replay/nccl_hang/canonical/golden.json) round-trips
byte-identical headlines, remediation, UIDs, and descriptions. Schema
+ envelope tests pass.
Ceiling in scripts/bench-registry.sh::allocs_gate lowered 5 → 2
(NORTHSTAR floor); baselines.json updated. Bench-check-detectors.sh
soft gate + bench-allocs-check.sh hard gate both PASS.
Audit: pod_evicted shows the same fmt.Sprintf + sort.SliceStable +
unsized-map shape (currently 15.27 allocs/event); filed #434 to apply
the same recipe. xid_correlation is the sibling case tracked by #417
(separate lane per task scope).
Closes #418
Refs #302
Refs #434
Signed-off-by: Tri Lam <tree@lumalabs.ai>
Adversarial Review: PR #438B/A/A+ CriteriaB (ship, known gaps): Perf optimization that clearly hits NORTHSTAR numerically (1.70 allocs/event vs ceiling 2), with golden replay unchanged and all tests passing. Acceptable with minor trim. A (tight, no debt): Same as B, plus dead code removed and capacity comments document assumptions. A+ (exemplary): Same as A, plus substantive evidence for type-hoisting claim via independent pprof run showing pre-call type allocation. Findings
Simplification SweepTrim targets: Remove type (dead code, see finding #1). Verdict: Clean post-trim. Byte-builder pattern is appropriate (format-specific, 4 uses, <3 would inline). Helper functions and earn keep (DRY, documented backward-compat for ). Comments are concise, no noise. VERDICT: A− (close to A)Justification: ✓ Perf win is legit: -57% allocs, 3.99→1.70 allocs/event, hits NORTHSTAR numerically. Bench run CV=0% across 10 samples is solid. ✓ Verdict semantics preserved: Golden replay round-trips unchanged; all existing tests pass. ✓ Sort determinism: Non-stable global sort safe because rank uniqueness ensures deterministic order within cohorts. Final verdicts use SliceStable. ✓ Capacity analysis: Buffer under-provisioning is safe (append handles overflow), acceptable if int64 values < 10^6 (typical in process-group/collective ID context). ✗ One trim required: is dead code and should be deleted. ✗ One documentation gap: capacity comment should document int64-range assumption or increase 96→136. Post-trim, this is A-tier: solid perf work, clear root-cause investigation, no correctness gaps. The dead code and capacity comment are lightweight cleanup, not design issues. |
1. Delete ncclcohortKey — dead type, never instantiated since the cohort logic moved to the contiguous-run walk in Evaluate. 2. Bump ncclHangEvidenceDescription byte-builder pre-size 96 → 136 to cover worst-case int64 formatting (pg_id + collective_seq_id can each hit 19 bytes). Costs <40 bytes per evidence line; eliminates the append-grow failure mode when ids exceed 10^6. 3. Trim nccllatestKey type-hoisting comment from the speculative -gcflags-inlining rationale to plain 'avoid per-call typedef cost'. Bench allocs/op unchanged at 1743; ceiling 2 holds. Replay golden byte-identical. Signed-off-by: Tri Lam <tree@lumalabs.ai>
|
Reviewer trim applied: deleted ncclcohortKey dead type; increased byte-builder capacity 96 → 136 for worst-case int64. Bench allocs unchanged at 1743. Re-requesting review. |
Review: Final state post-trim (b1cd965)Prior Findings — All Resolved ✓Finding 1: ncclcohortKey dead type
Finding 2: Buffer cap 96 → 136 with math
Finding 3: nccllatestKey comment trim
Simplification SweepClean. No new simplification targets. All three changes are surgical and justified. Perf & Correctness
Verdict: A+Minimal, correct, thoroughly documented. Ready to merge. |
…AR, #417) (#448) ## Summary Lowers `BenchmarkXidCorrelationDetector` allocs/event from **12.40 → 1.88** (a 6.6x reduction), hitting the NORTHSTAR ≤2 target. Ratchets the absolute-ceiling gate from 13 → 2 and refreshes the baseline. Follows the recipe established by PR #438 (nccl_hang: 3.99 → 1.70) and extends it with a composite-buffer + `unsafe.String` aliasing pass, which is load-bearing here because xid_correlation emits one verdict per evicted pod (896 verdicts / 1024-event bench window) vs nccl_hang's ~64 verdicts — per-verdict allocation cost dominates allocs/event. ## Root cause Per-verdict the detector built five variable-length prose strings (`Headline`, `Remediation`, xid-evidence `UID`, pod-evidence `Description`, plus `EvictedPod` from `displayPodName`'s `ns + "/" + name` concat). Four of those used `fmt.Sprintf`, each costing one interface-boxing allocation per integer/string argument plus one result-string allocation. At 896 verdicts × ~11 allocs/verdict + a growing `verdicts` slice + a default-sized `indexXidsByNode` map, the measured floor was 12699 allocs/op. ## Approach 1. **Composite-buffer carve** (the load-bearing change). `buildXidCorrelationVerdict` now pre-computes the rendered length of every variable-length string, allocates ONE `[]byte` of exactly that total size, appends each field into a known offset range, then carves immutable string headers out via `unsafe.String(&buf[lo], hi-lo)`. The buffer is finalized before any string is carved, so the aliasing is sound by construction. Per-verdict allocation floor drops from 6 to 2 (the composite buffer + the 2-element `EvidenceTrail` backing array). 2. **Pre-size join structures**. `indexXidsByNode` map is sized to `len(xids)`; the `verdicts` slice is sized to `len(events)`. 3. **strconv-based int rendering** via stack-allocated `[20]byte` scratch arrays + `strconv.AppendInt` (no `fmt.Sprintf`, no `strconv.Itoa`'s allocation). 4. **Strings.Builder fallback** for `xidEvidenceUID` and `xidEvidenceDescription` (shared with hbm_ecc.go). The `unsafe.String` aliasing trick is documented inline; the invariant ("buffer is immutable post-carve") is enforced by construction in `buildXidCorrelationVerdict` (no further `append` after the carve block). ## Before / after Bench: `go test -bench=BenchmarkXidCorrelationDetector -benchmem -count=5 -benchtime=500ms ./bench/detectors/` on M1 Max. | Metric | Before | After | Delta | |---|---:|---:|---:| | allocs/op | 12699 | 1928 | **−85%** | | B/op | 907 KiB | 597 KiB | −34% | | allocs/event | 12.40 | **1.88** | hits NORTHSTAR (≤2) | Profile entries fixed (per `pprof -sample_index=alloc_objects`): - `xidCorrelationHeadline` `fmt.Sprintf` → composite-buffer carve - `xidCorrelationRemediation` `fmt.Sprintf` → composite-buffer carve - `xidEvidenceUID` `fmt.Sprintf` → composite-buffer carve - pod-event `Description` `fmt.Sprintf` → composite-buffer carve - `displayPodName` `ns + "/" + name` concat → folded into composite buffer - `indexXidsByNode` map → pre-sized to `len(xids)` - `verdicts` slice grow chain → pre-sized to `len(events)` Residual per-verdict allocations (the two unavoidable ones at NORTHSTAR floor): 1. Composite prose buffer (1 alloc, ~660 B holding all 5 strings). 2. `EvidenceTrail` 2-element backing array (1 alloc, escapes on the returned verdict). ## Semantics Golden-replay byte-identity preserved: `module/pkg/replay/xid_correlation/canonical/golden.json` passes unchanged. The `unsafe.String`-carved strings alias an immutable buffer; JSON serialization compares content equality, not pointer identity. ## Files changed - `module/pkg/patterns/xid_correlation.go` — composite-buffer rewrite of `buildXidCorrelationVerdict`; pre-sized join maps; dead Sprintf-era renderers removed. - `bench/detectors/baselines.json` — refreshed BenchmarkXidCorrelationDetector entry to the new floor. - `scripts/bench-registry.sh` — ceiling 13 → 2; comment updated to record the ratchet and NORTHSTAR-hit. ## Test plan - [x] `go test -race ./pkg/patterns/... ./pkg/replay/... ./processor/... -count=1` - [x] `go test -bench=BenchmarkXidCorrelationDetector -benchmem -count=5 ./bench/detectors/` — stable at 1928 allocs/op - [x] `./scripts/bench-allocs-check.sh` — every detector at-or-below ceiling - [x] `./scripts/bench-check-detectors.sh` — within 10% of baseline - [x] `go vet ./...`, `golangci-lint run ./...` — clean (pre-commit hook) - [x] `module/pkg/replay/xid_correlation/canonical/golden.json` — byte-identical replay ## Sibling work - #418 / PR #438 (nccl_hang): merged, established the strconv-builder recipe. - #434 (pod_evicted, 15.27/ev): remains open; same composite-buffer technique applies if its per-evicted-pod shape allows. ```release-notes perf: xid_correlation detector allocations dropped from 12.4 to 1.88 per event (hits the ≤2 NORTHSTAR target). Verdict output is byte-identical. ``` Closes #417 Signed-off-by: Tri Lam <tree@lumalabs.ai>
…434) (#456) ## Summary Closes #434. Profile-driven optimization of `PodEvictedDetector.Evaluate` drops per-event allocations from **15.27 → 1.90**, clearing the **NORTHSTAR (2)** floor. Bench ceiling lowered 16 → 2 in `scripts/bench-registry.sh::allocs_gate`; `bench/detectors/baselines.json` updated. Same recipe as #438 (nccl_hang) and #448 (xid_correlation). ## Before / after `go test -bench=BenchmarkPodEvictedDetector -benchmem -count=10 -benchtime=500ms ./bench/detectors/` (Apple M1 Max): | metric | before | after | delta | |---|---|---|---| | allocs/op | 15635 | 1943 | **-87.6%** | | B/op | 970020 | 473400 | -51.2% | | allocs/event | 15.27 | **1.90** | hit NORTHSTAR | Stability: 10 consecutive bench runs all measured 1943 allocs/op (CV = 0%). ## Root cause (named) Per-evicted-pod the previous detector allocated ~19 objects. Profile (`-memprofile`, `-alloc_objects`) attributed them: 1. **`fmt.Sprintf` ×5 in `buildVerdict` + helpers** — 60% of pre-fix allocs. Sprintf boxes each arg into an `any` slice + scans the format string into a fresh internal buffer; both escape. Replaced with `strconv.AppendInt` + manual byte builders. 2. **`strings.ToLower(note)` in `pressureFromNote`** — 1 alloc per evicted pod (the lowercased copy). Replaced with a zero-alloc `containsFold` ASCII-fold scan (all anchors are already lowercase). 3. **`time.Format` in `formatTimestamp`** — Sprintf-style alloc per headline. Replaced with `t.AppendFormat(buf, time.RFC3339)` writing into the shared scratch buffer. 4. **Unsized `condIdx` map + per-bucket `append` growth** — `make(map[K]V)` defaults to 8 buckets; `append(nil, r)` grows each bucket. Pre-sized to `len(recs)`. 5. **Per-verdict `make([]EvidenceRef, 1, 2)`** — 1 alloc per verdict for the trail slice header. Replaced with ONE contiguous `trailBacking := make([]EvidenceRef, 2*len(events))` sliced cap=2 per verdict. 6. **Escaping `make([]byte, dynamic-cap)` per builder call** — escape analysis can't prove the dynamic capacity fits on stack, so each builder allocated TWO objects (buf + the `string(buf)` cast). Replaced with a per-Evaluate `scratch *[]byte` reused across every builder (each resets to `(*scratch)[:0]` and appends), collapsing to ONE alloc per call (the irreducible string cast). 7. **Redundant per-condition re-render** — 820 evictions joining onto 64 conditions re-rendered the same UID + description + remediation strings on every join (2460 redundant allocs). Pre-rendered once at index time into a new `indexedNodeCond` struct. 8. **Partial-path remediation re-render** — partial-path evictions (no joined condition) re-rendered the `"On node X: <prose>"` remediation per pod. Added a `(node, pressure) → string` cache so burst evictions on the same node share one alloc. 9. **`sort.SliceStable` reflection** — both `indexNodeConds`'s per-bucket sort and the outer verdicts sort used reflection-based `sort.SliceStable` (allocates a swapper per call). Replaced with `slices.SortStableFunc` (generics, no reflection); buckets of length ≤1 skip the sort call entirely. ## Top profile entries (memprofile -alloc_objects) **Before:** ``` 60% buildVerdict (5 fmt.Sprintf calls per eviction) 22% fmt.Sprintf 6% nodeConditionDescription 4% annotateRemediationWithNode 3% time.Time.Format 4% strings.ToLower ``` **After:** ``` 42% podEventDescription (1 alloc/call — irreducible string buf cast) 36% renderHeadline (1 alloc/call — irreducible string buf cast) 7% annotateRemediationWithNode (partial-path cache misses only) 5% nodeConditionDescription (index-time pre-render, 64 calls total) 3% nodeConditionUID (index-time pre-render, 64 calls total) ``` The two per-verdict string allocations (headline + pod_event description) are now the irreducible floor: each is a distinct verdict field that downstream `json.Marshal` requires as separate strings. ## Verdict semantics — unchanged Byte-identical headlines, remediation, UIDs, descriptions, JSON envelope. Confirmed via: - `module/pkg/replay/pod_evicted/canonical/golden.json` round-trips unchanged. - All `TestPodEvicted*` unit tests pass (negative fixtures, deterministic order, partial path, out-of-window, future transition excluded, empty node message, remediation pins node name, schema conformance + drift rejection). - `TestPodEvictedVerdict_SchemaConformance` (canonical full + partial) passes. ## Gate state | gate | result | |---|---| | `scripts/bench-allocs-check.sh` (hard absolute ceiling) | PASS at new ceiling=2 | | `scripts/bench-check-detectors.sh` (10% delta soft gate) | PASS (no detector regressed) | | `go test -race ./module/pkg/patterns/ ./module/pkg/replay/ ./module/processor/...` | PASS | | `go vet ./module/pkg/patterns/` | clean | | pre-commit (golangci-lint, vet, mod verify, attribute-namespace-check) | PASS | ```release-notes pod_evicted detector now allocates 1.90 allocations per event (down from 15.27), hitting the v0.3.0 NORTHSTAR floor of <=2 allocs/event. Joins #418 (nccl_hang) and #417 (xid_correlation) — three of the six per-detector tracking issues at or below NORTHSTAR. ``` Closes #434 Refs #302 ## Test plan - [x] `go test -race -count=1 ./module/pkg/patterns/ ./module/pkg/replay/ ./module/processor/...` PASS - [x] `go vet ./module/pkg/patterns/` clean - [x] `bash scripts/bench-allocs-check.sh` PASS (pod_evicted 1.90/ev <= ceiling 2; all 6 detectors at-or-below ceiling) - [x] `bash scripts/bench-check-detectors.sh` PASS (no detector regressed >10%) - [x] `go test -bench=BenchmarkPodEvictedDetector -count=10 -benchtime=500ms` stable at 1943 allocs/op (CV=0%) - [x] Golden replay (`module/pkg/replay/pod_evicted/canonical/golden.json`) round-trips unchanged - [x] Rebased onto origin/main (resolved scripts/bench-registry.sh conflict with #448's xid_correlation ratchet — both ratchets shipped together in this branch's view) --------- Signed-off-by: Tri Lam <tree@lumalabs.ai>
## Summary `scripts/bench-registry.sh::bench_entries` covered only `BenchmarkPodEvictedDetector` and the pyspy parser benches, so the **% delta regression gate** (`scripts/bench-check.sh` via `bench-check-all.sh`) missed five of the six detectors. The absolute allocs/event ceiling (`allocs_gate`) caught a detector that drifted past its hard ceiling, but a detector that drifted +50% while staying under the ceiling would ship silently. This PR closes the gap by registering all six detectors against both gates. ## Root cause The registry was authored when `pod_evicted` was the only detector with a bench. The five later detectors (added in #417/#418/#434/#438/#448/#456) were plumbed into `allocs_gate` (absolute ceiling) but the parallel % delta gate was never extended — a historical artifact, not an architectural defect. Fix: extend `bench_entries` to cover the missing five. ## What changed - `scripts/bench-registry.sh`: added one entry covering five detectors against `./bench/detectors/`, single-regex / single `go test` invocation (~5s incremental vs ~30s if split into 5 entries). - `bench/detectors/testdata/bench-baseline.txt`: generated baseline (count=10 × 500ms, Apple M1 Max). Hardware-invariant signals (B/op + allocs/op) pin the gate; sec/op stays advisory. - Existing ceilings in `allocs_gate` unchanged. Existing baselines unchanged. ## Mutation verification Bumped `PCIeAERDetector` allocs baseline down by ~18% (524 → 430): ``` PCIeAERDetector-10 430.0 ± 0% 524.0 ± 0% +21.86% (p=0.000 n=10) REGRESSION: the following benchmarks exceeded the 10% threshold vs baseline: PCIeAERDetector-10 +21.86% (p=0.000 n=10) ``` `scripts/bench-check-all.sh` exit 1. Revert → exit 0. Gate fires on regression, stays clean on baseline. ## Coverage gap context Cross-link: #302 (allocs/event rollup), #417 (xid_correlation NORTHSTAR), #418 (nccl_hang NORTHSTAR), #434 (pod_evicted NORTHSTAR), #438/#448/#456 (sibling perf PRs). Every detector that hit NORTHSTAR under the absolute ceiling now also has a relative-regression gate. ## Test plan - [x] `make bench-check` exit 0 (covers all 6 detectors now) - [x] `make bench-allocs-check` exit 0 - [x] `make bench-detectors-check` exit 0 (soft-gate, unchanged) - [x] Mutation: lower baseline → exit 1; revert → exit 0 - [x] Lint + vet (pre-commit hook): 0 issues ```release-notes ci(bench): % delta regression gate now covers all six detectors (#302). xid_correlation, hbm_ecc, nccl_hang, thermal_throttle, and pcie_aer were previously gated only by the absolute allocs/event ceiling — they now also fail builds on >10% allocs/op or B/op drift vs the committed baseline. ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
Summary
Closes #418. Profile-driven optimization of
NCCLHangDetector.Evaluatedrops per-event allocations from 3.99 → 1.70, clearing the NORTHSTAR (2) floor. Bench ceiling lowered 5 → 2 inscripts/bench-registry.sh::allocs_gate; baselines.json updated.Before / after
go test -bench=BenchmarkNCCLHangDetector -benchmem -count=10 -benchtime=500ms ./bench/detectors/(Apple M1 Max):Stability: 10 consecutive bench runs all measured 1743 allocs/op (CV = 0%).
Root cause (named)
Three contributors, each measured under
-memprofile:fmt.Sprintfin the per-evidence render loop — ~70% of pre-fix allocs. Sprintf boxes each argument into ananyslice + scans the format string into an internal buffer; both escape, both alloc. Four hot strings (UID + description per evidence, headline + remediation per cohort) × ~64 cohorts × 8 ranks/cohort dominated the count. Replaced withstrconv.AppendInt+ manual byte builders sharing a single[]byteper render. Output strings are byte-identical (golden replay unchanged).Unsized
latestmap growth —make(map[K]V)defaults to ~8 buckets and grows by doubling. For 1024 records → ~7 grow events. Pre-sized tolen(records).Per-cohort
cohorts[k] = append(...)slice growth — ~500 residual allocs/call after step 1. Each per-cohort slice starts nil and grows incrementally as records funnel in. Replaced with: stream survivors into one contiguous slice, sort by(pg, collective, rank)once, walk contiguous runs. Removes the cohort map entirely AND letsbuildNCCLHangVerdictskip its own per-cohort sort (reflectlite.Swapperwas another ~10% chunk).Top profile entries (memprofile -alloc_objects)
Before:
After:
The two per-rank string allocations are now the irreducible floor: they're distinct
EvidenceReffields and the JSON-marshal contract requires distinct strings.Verdict semantics — unchanged
Byte-identical headlines, remediation, UIDs, descriptions. Confirmed via:
module/pkg/replay/nccl_hang/canonical/golden.jsonround-trips unchanged.TestNCCLHang*unit tests pass (positive, negative, edge straggler, solo rank, threshold configurable, deterministic order, cross-collective, later-completed-supersedes, schema conformance, schema drift battery).TestCanonicalShippedFixtures_PatternIDsMatchDetectorConsts) passes.Sibling-detector audit
Same
fmt.Sprintf+sort.SliceStable+ unsized-map shape lives in:pod_evicted.go(15.27 allocs/event, ceiling 16) — filed perf(pod_evicted): lower allocs/event from 15.27 toward NORTHSTAR (2) #434 with the worked-example recipe.xid_correlation.go(12.4 allocs/event) — already tracked by perf(xid_correlation): lower allocs/event from 12.4 toward NORTHSTAR (2) #417 (separate lane per task scope, not touched here).Gate state
scripts/bench-allocs-check.sh(hard absolute ceiling)scripts/bench-check-detectors.sh(10% delta soft gate)go test ./module/...go vet ./module/pkg/patterns/Closes #418
Refs #302
Refs #434
Test plan
go test -count=1 ./module/pkg/patterns/ ./module/pkg/replay/...PASSgo test -count=1 ./module/...PASSgo vet ./module/pkg/patterns/cleanbash scripts/bench-allocs-check.shPASS (nccl_hang 1.70/ev ≤ ceiling 2)bash scripts/bench-check-detectors.shPASS (no detector regressed >10%)go test -bench=BenchmarkNCCLHangDetector -count=10 -benchtime=500msstable at 1743 allocs/op (CV=0%)module/pkg/replay/nccl_hang/canonical/golden.json) round-trips unchanged