Commit a279960
fix: DDP deadlock when no valid loss positions on a rank
When a rank's batch has no valid loss positions (e.g., all tokens fall in
Block 0, which is excluded from the loss), the loss was a detached zero
tensor with no connection to the dflash_module parameters. DDP then waited
forever for gradient sync on those parameters → NCCL ALLREDUCE timeout.
Fix: use logits.sum() * 0.0 as zero loss, which maintains the
computation graph through dflash_module parameters so DDP can sync
zero gradients properly.
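The pattern can be sketched as follows. This is a minimal illustration, not the actual training code: the compute_loss helper, the loss_mask argument, and the tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def compute_loss(logits, labels, loss_mask):
    """Hypothetical loss helper. loss_mask marks positions that
    contribute to the loss; it can be all-False on a rank (e.g. every
    token falls in an excluded block)."""
    if loss_mask.any():
        return F.cross_entropy(logits[loss_mask], labels[loss_mask])
    # logits.sum() * 0.0 is numerically zero but keeps the autograd
    # graph connected to the module's parameters, so DDP's per-parameter
    # backward hooks still fire and this rank joins the gradient
    # allreduce with zero gradients instead of hanging the collective.
    return logits.sum() * 0.0
```

Note that a detached alternative such as torch.zeros((), requires_grad=True) would not work: it is a leaf tensor with no path to the module's parameters, so their gradients would never be produced and DDP's allreduce would still stall.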
Also revert to super().forward() for training (matching the EAGLE pattern),
and add --ddp_find_unused_parameters True and --ddp_timeout 300.
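With HF-Trainer-style argument parsing, the two new flags would be passed at launch roughly as below. The launcher, script name, and process count are placeholders, not taken from the commit; only the two DDP flags are.

```shell
# Hypothetical launch line; only the two --ddp_* flags come from the commit.
torchrun --nproc_per_node 8 train.py \
  --ddp_find_unused_parameters True \
  --ddp_timeout 300   # seconds; fail fast instead of hanging for the 1800 s default
```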
Root cause analysis: rank 4 completed ALLREDUCE #272 and proceeded to
ALLGATHER #273, while other ranks were stuck at ALLREDUCE #272. This
indicated rank 4 had a different backward graph (no gradients for
dflash_module on that rank).
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
1 parent: 2c42363 · commit a279960
1 file changed: +3 −1 lines changed (at lines 526–528)