
perf(nccl_hang): lower allocs/event from 3.99 toward NORTHSTAR (2) #418

Description

@trilamsr

Tracks per-detector alloc optimization toward NORTHSTAR (2 allocs/event) for
nccl_hang. The current ceiling in scripts/bench-registry.sh::allocs_gate is
5 allocs/event (measured: 3.99 allocs/event, i.e. 4088 allocs/op @ window=1024).
The target is ≤2, to converge on NORTHSTAR alongside the other detectors.

What the bench measures

./bench/detectors/BenchmarkNCCLHangDetector — 1024 NCCL FR records across
8 process groups × 128 ranks; half the collectives "hang" (no completed
record past threshold). See bench/detectors/detectors_bench_test.go.
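
A minimal sketch of how the allocs/event figure follows from the benchmark numbers quoted in this issue. It assumes one benchmark op processes the full 1024-record window; `allocsPerEvent` is a hypothetical helper for illustration, not a function from the repo.

```go
package main

import "fmt"

// allocsPerEvent converts an allocs/op figure into allocs/event,
// ASSUMING one benchmark op processes exactly `window` events.
// Hypothetical helper, not part of the repo.
func allocsPerEvent(allocsPerOp, window int) float64 {
	return float64(allocsPerOp) / float64(window)
}

func main() {
	const groups, ranksPerGroup = 8, 128
	window := groups * ranksPerGroup // 1024 records per op

	// 4088 allocs/op is the measurement quoted in this issue.
	fmt.Printf("window=%d -> %.2f allocs/event\n", window, allocsPerEvent(4088, window))
}
```

Under the same assumption, the NORTHSTAR target of 2 allocs/event corresponds to at most 2048 allocs/op at window=1024.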

Hot-path candidates (pre-investigation)

  • Per-pg cohort sort allocates a slice per cohort (~8 process groups).
  • Per-hang verdict allocation + evidence-trail rendering.
  • Map keyed by (pgID, collectiveSeqID) for cohort grouping.

Definition of done

  • Lower the ceiling in scripts/bench-registry.sh::allocs_gate to the new
    floor (the absolute gate prevents creep back upward).
  • Update bench/detectors/baselines.json for the soft-gate trend tracking.
  • Work profile-driven (go test -bench=BenchmarkNCCLHangDetector
    -memprofile <file>): fix the largest contributor, re-measure, repeat.
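
For quick re-measurement between profile runs, the standard library's `testing.AllocsPerRun` can compare two variants of a small code path. The sketch below contrasts a fresh allocation with a reused buffer; `fresh` and `reuser` are hypothetical examples and are not taken from the detector.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level so the slice made in fresh escapes to the heap.
var sink []int

// fresh allocates a new slice on every call.
func fresh(n int) { sink = make([]int, n) }

// reuser grows its buffer only when the requested size exceeds capacity.
type reuser struct{ buf []int }

func (r *reuser) reuse(n int) {
	if cap(r.buf) < n {
		r.buf = make([]int, n)
	}
	r.buf = r.buf[:n]
}

func main() {
	r := &reuser{}
	// AllocsPerRun does one warm-up call, then averages over the given runs.
	fmt.Println("fresh: ", testing.AllocsPerRun(100, func() { fresh(128) }))
	fmt.Println("reused:", testing.AllocsPerRun(100, func() { r.reuse(128) }))
}
```

This complements the heap profile rather than replacing it: the profile identifies the largest contributor, and a targeted measurement like this confirms that a fix removed it.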

Why now (priority signal)

nccl_hang is ~2× over NORTHSTAR (3.99 vs. 2 allocs/event). It is closer to
the bar than xid_correlation and pod_evicted, so this is likely a smaller
optimization to land (sort buffer reuse + cohort-map pooling).
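
A minimal sketch of what cohort-map pooling could look like with `sync.Pool`, assuming the grouping map can be cleared and handed back after each window. `cohortKey`, `cohortMaps`, and `countCohorts` are hypothetical names; the real detector's map value type is not shown in this issue. Requires Go 1.21+ for `clear`.

```go
package main

import (
	"fmt"
	"sync"
)

// cohortKey mirrors the (pgID, collectiveSeqID) key described above.
type cohortKey struct{ pgID, seqID uint32 }

// cohortMaps hands out maps that keep their grown buckets between uses.
var cohortMaps = sync.Pool{
	New: func() any { return make(map[cohortKey]int, 64) },
}

// countCohorts counts distinct cohorts in one window using a pooled map.
func countCohorts(keys []cohortKey) int {
	m := cohortMaps.Get().(map[cohortKey]int)
	defer func() {
		clear(m) // drop entries but keep the map's storage for reuse
		cohortMaps.Put(m)
	}()
	for _, k := range keys {
		m[k]++
	}
	return len(m)
}

func main() {
	window := []cohortKey{{0, 3}, {1, 7}, {0, 3}, {1, 8}}
	fmt.Println("cohorts:", countCohorts(window))
}
```

If the detector processes windows from a single goroutine, a plain struct field holding the map would avoid the pool entirely; `sync.Pool` mainly pays off when several goroutines share the detector.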

Rolls up to #302.

Labels: enhancement
