Tracks per-detector alloc optimization toward NORTHSTAR (2 allocs/event) for
nccl_hang. Current ceiling in scripts/bench-registry.sh::allocs_gate is
5 allocs/event (measured 3.99 allocs/event, i.e. 4088 allocs/op @ window=1024). The
target is ≤2 to converge on NORTHSTAR alongside the other detectors.
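As a quick check of the numbers above, the sketch below converts allocs/op into allocs/event and compares the result against the ceiling and the target. The helper name `allocsPerEvent` is hypothetical, and the sketch assumes one benchmark op processes exactly one window of 1024 events.

```go
package main

import "fmt"

// allocsPerEvent converts an allocs/op figure into allocs/event for a
// benchmark whose single op processes `window` events (hypothetical helper).
func allocsPerEvent(allocsPerOp, window int) float64 {
	return float64(allocsPerOp) / float64(window)
}

func main() {
	const (
		ceiling   = 5.0 // current allocs_gate ceiling, allocs/event
		northstar = 2.0 // NORTHSTAR target, allocs/event
	)
	got := allocsPerEvent(4088, 1024) // figures quoted in this issue
	fmt.Printf("%.2f allocs/event\n", got)
	fmt.Println("under ceiling:", got <= ceiling)
	fmt.Println("meets NORTHSTAR:", got <= northstar)
	// Allocs/op budget implied by NORTHSTAR at this window size.
	fmt.Println("allocs/op budget:", int(northstar*1024))
}
```

At window=1024 the target works out to a budget of 2048 allocs/op, so roughly half of the current 4088 would have to go.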
What the bench measures
./bench/detectors/BenchmarkNCCLHangDetector — 1024 NCCL FR records across
8 process groups × 128 ranks; half the collectives "hang" (no completed
record past threshold). See bench/detectors/detectors_bench_test.go.
Hot-path candidates (pre-investigation)
- Per-pg cohort sort allocates a slice per cohort (~8 process groups).
- Per-hang verdict allocation + evidence-trail rendering.
- Map keyed by (pgID, collectiveSeqID) for cohort grouping.
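If profiling confirms the sort slice and the grouping map as contributors, one possible shape for the fix is sketched below. It is not the detector's actual code: the types `cohortKey`, `record`, and `grouper` are hypothetical, and it assumes Go 1.21+ for `clear`, `slices`, and `cmp`. The sketch sorts the whole window once into a reused buffer instead of allocating a slice per cohort, and it keys the map with a comparable struct rather than a formatted string.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// cohortKey is a hypothetical comparable key for one collective in one
// process group. A struct key avoids formatting a string per record.
type cohortKey struct {
	pgID            uint32
	collectiveSeqID uint64
}

// record is a hypothetical stand-in for one NCCL FR record.
type record struct {
	key  cohortKey
	rank int
}

// grouper owns scratch storage that is reused across windows.
type grouper struct {
	sizes  map[cohortKey]int // number of records per cohort
	sorted []record          // sort buffer, truncated and refilled per window
}

func newGrouper() *grouper {
	return &grouper{sizes: make(map[cohortKey]int)}
}

// group counts records per cohort and returns the window sorted by
// (pgID, collectiveSeqID, rank). The returned slice aliases the reused
// buffer and is only valid until the next call.
func (g *grouper) group(window []record) []record {
	clear(g.sizes)                             // drop entries, keep the map itself
	g.sorted = append(g.sorted[:0], window...) // reuse capacity from earlier windows
	for _, r := range g.sorted {
		g.sizes[r.key]++
	}
	slices.SortFunc(g.sorted, func(a, b record) int {
		if c := cmp.Compare(a.key.pgID, b.key.pgID); c != 0 {
			return c
		}
		if c := cmp.Compare(a.key.collectiveSeqID, b.key.collectiveSeqID); c != 0 {
			return c
		}
		return cmp.Compare(a.rank, b.rank)
	})
	return g.sorted
}

func main() {
	g := newGrouper()
	sorted := g.group([]record{
		{cohortKey{1, 7}, 2},
		{cohortKey{0, 3}, 1},
		{cohortKey{1, 7}, 0},
	})
	fmt.Println(len(g.sizes), len(sorted))
}
```

After sorting, each cohort is a contiguous run in the buffer, so per-cohort work can walk sub-slices of it without copying. Whether this actually moves the allocs/event number has to be confirmed by re-running the benchmark.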
Definition of done
- Lower the ceiling in scripts/bench-registry.sh::allocs_gate to the new
  floor (the absolute gate prevents creep back upward).
- Update bench/detectors/baselines.json for the soft-gate trend tracking.
- Profile-driven (go test -bench=BenchmarkNCCLHangDetector -memprofile):
  fix the largest contributor, re-measure, repeat.
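For the re-measure step, a quick allocation count between full benchmark runs can be taken with `testing.AllocsPerRun` from the standard library. The sketch below is illustrative only: `processWindow` is a hypothetical stand-in for one detector pass, and `measure` is a hypothetical helper, neither taken from this repo.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level so the allocation in processWindow escapes to the heap.
var sink []int

// processWindow is a hypothetical stand-in for one detector pass over a
// window of events; it deliberately allocates one scratch slice per call.
func processWindow(window int) {
	sink = make([]int, window)
}

// measure returns the average allocations per event for fn, where each
// call to fn handles `window` events.
func measure(runs, window int, fn func()) float64 {
	return testing.AllocsPerRun(runs, fn) / float64(window)
}

func main() {
	const window = 1024
	perEvent := measure(100, window, func() { processWindow(window) })
	fmt.Printf("%.4f allocs/event\n", perEvent)
}
```

This only counts allocations and does not attribute them to call sites, so it would complement the memory profile and not replace it.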
Why now (priority signal)
Measured allocs are ~2× NORTHSTAR (3.99 vs. a target of 2 allocs/event).
That is closer to the bar than xid_correlation and pod_evicted, so this is
likely a smaller optimization to land (sort buffer reuse + cohort-map pooling).
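Cohort-map pooling could look roughly like the sketch below, which uses `sync.Pool` from the standard library. The names `cohortKey`, `cohortMapPool`, and `incompleteCohorts` are hypothetical, and so is the rule that a cohort with fewer records than expected counts as incomplete. It assumes Go 1.21+ for `clear`.

```go
package main

import (
	"fmt"
	"sync"
)

// cohortKey is a hypothetical comparable key for one collective in one
// process group.
type cohortKey struct {
	pgID            uint32
	collectiveSeqID uint64
}

// cohortMapPool hands out previously used maps so that a window does not
// have to allocate a fresh map when a pooled one is available.
var cohortMapPool = sync.Pool{
	New: func() any { return make(map[cohortKey]int) },
}

// incompleteCohorts returns how many cohorts have fewer than `expected`
// records in this window (hypothetical hang criterion).
func incompleteCohorts(keys []cohortKey, expected int) int {
	m := cohortMapPool.Get().(map[cohortKey]int)
	for _, k := range keys {
		m[k]++
	}
	n := 0
	for _, count := range m {
		if count < expected {
			n++
		}
	}
	clear(m) // return the map empty so the next user starts clean
	cohortMapPool.Put(m)
	return n
}

func main() {
	keys := []cohortKey{{0, 1}, {0, 1}, {0, 2}}
	fmt.Println(incompleteCohorts(keys, 2))
}
```

If the detector is a long-lived value used from a single goroutine, holding the map as a field and clearing it per window may be simpler than a pool. `sync.Pool` mainly pays off when detector passes run concurrently or the detector keeps no state of its own.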
Rolls up to #302.