Skip to content

feat(failure-inject): nccl-hang wraps fr_parser (M11 cf #1/#2)#474

Merged
trilamsr merged 1 commit into
mainfrom
feat/m11-failure-inject-nccl-hang
Jun 2, 2026
Merged

feat(failure-inject): nccl-hang wraps fr_parser (M11 cf #1/#2)#474
trilamsr merged 1 commit into
mainfrom
feat/m11-failure-inject-nccl-hang

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replace the ErrPending stub at tools/failure-inject/ncclhang/ with a deterministic wrapper over module/pkg/nccl/fr_parser.Synthesize. Output is one of the canonical M11 hang fixtures (nccl-2.29.x-hang / nccl-2.30.x-hang), selected by --seed mod 2; bytes round-trip through frparser.Parse and a re-synthesize is byte-identical — closes M4b carry-forward ci(deps): bump the gh-actions group with 5 updates #1.
  • Pin the new SHA in tools/failure-inject/testdata/golden.sha256 so chaos.yml's harness-determinism job (matrix linux/amd64 + linux/arm64) replays the same argv on both arches and enforces cross-arch SHA equality — closes M4b carry-forward Bump the gh-actions group across 1 directory with 4 updates #2.
  • Flip ⧗ → ☑ on the two M4b functional rubrics (round-trip, safe-opcodes) and the M4b determinism non-functional rubric, plus the M11 synthetic-fixture-generator rubric. Remove the failure-inject nccl-hang follow-up from docs/followups/M4b.md and from M11's carry-forward list.

Root cause

M4b shipped at v0.1 with the nccl-hang subcommand stubbed (ErrPending, exit 70) because pkg/nccl/fr_parser/synthesize.go was still pending under M11. M11 landed the synthesizer plus the canonical hang fixtures (fixture229Hang, fixture230Hang) in module/pkg/nccl/fr_parser/. The CLI shim was carry-forward — this PR is the wiring.

What's in the diff

  • tools/failure-inject/ncclhang/ncclhang.goOptions{Seed uint64}; Run selects a hang variant by Seed % len(hangVariants), calls FixtureSpec.Bytes() (which delegates to frparser.Synthesize), writes to w. ErrPending deleted; ctx.Err() honoured before any write.
  • tools/failure-inject/main.go — pass Options{Seed: *c.flagSeed} through to ncclhang.Run; drop the errors.Is(err, ncclhang.ErrPending) → exit 70 branch.
  • tools/failure-inject/ncclhang/ncclhang_test.go — RED → GREEN: TestRun_RoundTrip (synthesize → parse → re-synthesize byte-identical), TestRun_SeedDeterminism (same seed → same bytes, 4 seeds), TestRun_SafeOpcodesOnly (delegates to frparser.Parse as the safe-opcode oracle — a naive byte scan false-positives on opcode bytes inside SHORT_BINUNICODE string literals), TestRun_CtxCancelled.
  • tools/failure-inject/main_test.go — replace TestRun_NCCLHangReturnsNotImplemented with TestRun_NCCLHangRoundTrip + TestRun_NCCLHangSeedDeterminism so the contract is pinned through the actual argv path too.
  • tools/failure-inject/testdata/golden.sha256 — add failure-inject --seed=0 nccl-hang → e6f49920…. The existing TestRun_GoldenSHA loop in main_test.go and the Golden SHA pin step in chaos.yml pick it up automatically.
  • docs/MILESTONES.md — flip §M4b rubrics ⧗ → ☑ (round-trip, safe-opcodes, cross-arch determinism) and §M11 synthetic-fixture rubric; trim carry-forward list.
  • docs/followups/M4b.md — mark the nccl-hang entry closed with the wiring-PR pointer.
  • tools/failure-inject/README.md — add a nccl-hang section; remove nccl-hang from carve-outs (now only pod-evict --allow-cluster-write carves).
  • module/receiver/ncclfrreceiver/README.md — replace stale tracecore failure-inject invocation with the actual go run ./tools/failure-inject path.

Test plan

  • go test -race -count=1 ./tools/failure-inject/... — green (4 packages).
  • (cd module && go test -race -count=1 ./pkg/nccl/fr_parser/...) — green (no semantic change here, gate against accidental drift).
  • go build ./... && (cd module && go build ./...) — clean.
  • Pre-commit gates: golangci-lint, go vet, go mod verify, attribute-namespace-check — all 0 issues.
  • End-to-end determinism: failure-inject --seed=0 nccl-hang | sha256sum reproduces the pinned SHA (e6f49920…) twice in a row.
  • Seed variance: --seed=1 produces a distinct SHA (2788a726…); --seed=42 (42 mod 2 = 0) matches --seed=0 per the documented modulo mapping.
  • failure-inject nccl-hang --help documents --seed and --out and the round-trip-through-fr_parser purpose.

Self-grade

A+: round-trip green, determinism golden-SHA pinned, safe-opcode set verified via parser oracle, cross-arch SHA equality wired into existing chaos.yml matrix, MILESTONES.md flipped on four ⧗ rubrics, M4b.md follow-up closed with a pointer, doc drift swept.

tools(failure-inject): `nccl-hang` subcommand now produces parseable byte-deterministic NCCL FlightRecorder bytes via `pkg/nccl/fr_parser` (was a stub returning `ErrPending`). `--seed` flag selects variant + deterministic synthesis; cross-arch SHA enforced in `chaos.yml` (linux/amd64 + linux/arm64). Closes M4b carry-forward #1 + #2.

Replace the ErrPending stub at tools/failure-inject/ncclhang/ with a
deterministic wrapper over module/pkg/nccl/fr_parser. Output is one of
the canonical M11 hang fixtures (nccl-2.29.x-hang / nccl-2.30.x-hang)
selected by --seed mod 2; the bytes round-trip through frparser.Parse
and a re-synthesize is byte-identical (M4b carry-forward #1, MILESTONES
§M4b).

Pinned the new SHA in testdata/golden.sha256 so chaos.yml's
harness-determinism job (matrix linux/amd64 + linux/arm64) enforces
cross-arch equality on every push — that closes M4b carry-forward #2 too.

Removed Options{}, ErrPending, the exit-70 carve-out test, and the
M4b.md follow-up entry; flipped four ⧗ → ☑ in MILESTONES.md across §M4b
and §M11.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Grade: A

PR #474 ships M4b carry-forward #1/#2 closure: CLI wraps frparser.synthesize, emits canonical hang-dump pickle fixtures, determinism pinned on both linux/amd64 + linux/arm64 via golden SHA matrix. Four rubrics flipped ⧗→☑ across M4b + M11; all closures are legitimate.

Findings

tools/failure-inject/main_test.go:58: 🔵 nit: --seed flag help text is minimal ("Deterministic PRNG seed.") and doesn't explain the hang-variant modulo mapping. The mapping is well-documented in README + code, so this is comment clarity only — not a functional defect. Consider: "Selects which NCCL hang fixture variant [0=2.29.x, 1=2.30.x]; seed % variant_count." or point to README.

Everything else checks: Seed modulo distribution is Fair (2 variants via idx = Seed % 2; tests cover seeds 0/1/42); parser-as-oracle for TestRun_SafeOpcodesOnly is smart (avoids false-positives in string literals); golden SHA e6f49920... pinned and replayed on both arches; round-trip + determinism + safe-opcodes + cancellation tests all land; zero semantic drift in pkg/nccl/fr_parser/; README sweep is load-bearing.

Simplification sweep

Doc comment at line 3 of ncclhang.go names the contract twice ("byte-deterministic for a fixed Options" at L6, then "two calls with equal Options" at L8). Rephrase L6-L9 to one stronger statement and lose the repetition. Similar: MILESTONES.md M4b status line uses both "wrapped over" and "per the carry-forward closure" — first is clearer. Cut the second.

VERDICT

A — Ship. Carry-forward closures legitimate, test coverage strong, no blockers. Fix the nit before or after merge (comment flavor only).

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 04:00
@trilamsr trilamsr merged commit d5f39ec into main Jun 2, 2026
27 of 28 checks passed
@trilamsr trilamsr deleted the feat/m11-failure-inject-nccl-hang branch June 2, 2026 05:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant