feat(failure-inject): nccl-hang wraps fr_parser (M11 cf #1/#2) #474
Replace the ErrPending stub at tools/failure-inject/ncclhang/ with a deterministic wrapper over module/pkg/nccl/fr_parser. Output is one of the canonical M11 hang fixtures (nccl-2.29.x-hang / nccl-2.30.x-hang) selected by --seed mod 2; the bytes round-trip through frparser.Parse and a re-synthesize is byte-identical (M4b carry-forward #1, MILESTONES §M4b).

Pinned the new SHA in testdata/golden.sha256 so chaos.yml's harness-determinism job (matrix linux/amd64 + linux/arm64) enforces cross-arch equality on every push, which closes M4b carry-forward #2 too.

Removed Options{}, ErrPending, the exit-70 carve-out test, and the M4b.md follow-up entry; flipped four ⧗ → ☑ in MILESTONES.md across §M4b and §M11.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Grade: A
PR #474 ships M4b carry-forward #1/#2 closure: CLI wraps
Findings
tools/failure-inject/main_test.go:58: 🔵 nit:
Everything else checks: Seed modulo distribution is Fair (2 variants via
Simplification sweep
Doc comment at line 3 of ncclhang.go names the contract twice ("byte-deterministic for a fixed Options" at L6, then "two calls with equal Options" at L8). Rephrase L6-L9 to one stronger statement and lose the repetition. Similar: MILESTONES.md M4b status line uses both "wrapped over" and "per the carry-forward closure"; the first is clearer. Cut the second.
VERDICT
A: Ship. Carry-forward closures legitimate, test coverage strong, no blockers. Fix the nit before or after merge (comment flavor only).
Summary
- Replace the `ErrPending` stub at `tools/failure-inject/ncclhang/` with a deterministic wrapper over `module/pkg/nccl/fr_parser.Synthesize`. Output is one of the canonical M11 hang fixtures (`nccl-2.29.x-hang` / `nccl-2.30.x-hang`), selected by `--seed mod 2`; the bytes round-trip through `frparser.Parse` and a re-synthesize is byte-identical. Closes M4b carry-forward #1.
- Pin the new SHA in `tools/failure-inject/testdata/golden.sha256` so `chaos.yml`'s `harness-determinism` job (matrix `linux/amd64` + `linux/arm64`) replays the same argv on both arches and enforces cross-arch SHA equality. Closes M4b carry-forward #2.
- Drop the `failure-inject nccl-hang` follow-up from `docs/followups/M4b.md` and from M11's carry-forward list.

Root cause
M4b shipped at v0.1 with the `nccl-hang` subcommand stubbed (`ErrPending`, exit 70) because `pkg/nccl/fr_parser/synthesize.go` was still pending under M11. M11 landed the synthesizer plus the canonical hang fixtures (`fixture229Hang`, `fixture230Hang`) in `module/pkg/nccl/fr_parser/`. The CLI shim was carried forward; this PR is the wiring.

What's in the diff
- `tools/failure-inject/ncclhang/ncclhang.go`: `Options{Seed uint64}`; `Run` selects a hang variant by `Seed % len(hangVariants)`, calls `FixtureSpec.Bytes()` (which delegates to `frparser.Synthesize`), and writes to `w`. `ErrPending` deleted; `ctx.Err()` honoured before any write.
- `tools/failure-inject/main.go`: pass `Options{Seed: *c.flagSeed}` through to `ncclhang.Run`; drop the `errors.Is(err, ncclhang.ErrPending) → exit 70` branch.
- `tools/failure-inject/ncclhang/ncclhang_test.go`: RED → GREEN: `TestRun_RoundTrip` (synthesize → parse → re-synthesize byte-identical), `TestRun_SeedDeterminism` (same seed → same bytes, 4 seeds), `TestRun_SafeOpcodesOnly` (delegates to `frparser.Parse` as the safe-opcode oracle, because a naive byte scan false-positives on opcode bytes inside `SHORT_BINUNICODE` string literals), `TestRun_CtxCancelled`.
- `tools/failure-inject/main_test.go`: replace `TestRun_NCCLHangReturnsNotImplemented` with `TestRun_NCCLHangRoundTrip` + `TestRun_NCCLHangSeedDeterminism` so the contract is pinned through the actual argv path too.
- `tools/failure-inject/testdata/golden.sha256`: add `failure-inject --seed=0 nccl-hang → e6f49920…`. The existing `TestRun_GoldenSHA` loop in `main_test.go` and the `Golden SHA pin` step in `chaos.yml` pick it up automatically.
- `docs/MILESTONES.md`: flip §M4b rubrics ⧗ → ☑ (round-trip, safe-opcodes, cross-arch determinism) and the §M11 synthetic-fixture rubric; trim the carry-forward list.
- `docs/followups/M4b.md`: mark the `nccl-hang` entry closed with the wiring-PR pointer.
- `tools/failure-inject/README.md`: add an `nccl-hang` section; remove `nccl-hang` from carve-outs (now only `pod-evict --allow-cluster-write` carves).
- `module/receiver/ncclfrreceiver/README.md`: replace the stale `tracecore failure-inject` invocation with the actual `go run ./tools/failure-inject` path.

Test plan
- `go test -race -count=1 ./tools/failure-inject/...`: green (4 packages).
- `(cd module && go test -race -count=1 ./pkg/nccl/fr_parser/...)`: green (no semantic change here; gate against accidental drift).
- `go build ./... && (cd module && go build ./...)`: clean.
- `golangci-lint`, `go vet`, `go mod verify`, `attribute-namespace-check`: all 0 issues.
- `failure-inject --seed=0 nccl-hang | sha256sum` reproduces the pinned SHA (`e6f49920…`) twice in a row.
- `--seed=1` produces a distinct SHA (`2788a726…`); `--seed=42` (42 mod 2 = 0) matches `--seed=0` per the documented modulo mapping.
- `failure-inject nccl-hang --help` documents `--seed` and `--out` and the round-trip-through-`fr_parser` purpose.

Self-grade
A+: round-trip green, determinism golden-SHA pinned, safe-opcode set verified via parser oracle, cross-arch SHA equality wired into the existing `chaos.yml` matrix, `MILESTONES.md` flipped on four ⧗ rubrics, `M4b.md` follow-up closed with a pointer, doc drift swept.