Commit a279960
fix: DDP deadlock when no valid loss positions on a rank
When a rank's batch has no valid loss positions (e.g., all tokens fall in
Block 0, which is excluded from the loss), the loss was a detached zero
tensor with no connection to the dflash_module parameters. DDP then waited
forever for gradient sync on those parameters → NCCL ALLREDUCE timeout.
Fix: use logits.sum() * 0.0 as zero loss, which maintains the
computation graph through dflash_module parameters so DDP can sync
zero gradients properly.
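The pattern can be sketched as follows. This is a minimal illustration, not the actual training code: the compute_loss helper, the loss_mask argument, and the tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def compute_loss(logits, labels, loss_mask):
    """Hypothetical loss helper. loss_mask marks positions that
    contribute to the loss; it can be all-False on a rank (e.g. every
    token falls in an excluded block)."""
    if loss_mask.any():
        return F.cross_entropy(logits[loss_mask], labels[loss_mask])
    # logits.sum() * 0.0 is numerically zero but keeps the autograd
    # graph connected to the module's parameters, so DDP's per-parameter
    # backward hooks still fire and this rank joins the gradient
    # allreduce with zero gradients instead of hanging the collective.
    return logits.sum() * 0.0
```

Note that a detached alternative such as torch.zeros((), requires_grad=True) would not work: it is a leaf tensor with no path to the module's parameters, so their gradients would never be produced and DDP's allreduce would still stall.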
Also revert to super().forward() for training (matching the EAGLE pattern),
and add --ddp_find_unused_parameters True and --ddp_timeout 300.
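With HF-Trainer-style argument parsing, the two new flags would be passed at launch roughly as below. The launcher, script name, and process count are placeholders, not taken from the commit; only the two DDP flags are.

```shell
# Hypothetical launch line; only the two --ddp_* flags come from the commit.
torchrun --nproc_per_node 8 train.py \
  --ddp_find_unused_parameters True \
  --ddp_timeout 300   # seconds; fail fast instead of hanging for the 1800 s default
```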
Root cause analysis: rank 4 completed ALLREDUCE #272 and proceeded to
ALLGATHER #273, while other ranks were stuck at ALLREDUCE #272. This
indicated rank 4 had a different backward graph (no gradients for
dflash_module on that rank).
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
1 parent: 2c42363 · commit a279960
1 file changed: +3 −1 lines changed (at lines 526–528)