fix(pt/pd): fix eta computation #4886
📝 Walkthrough

Adjusted the ETA calculation in log_loss_valid for two training modules to use a dynamic divisor based on min(disp_freq, display_step_id - start_step). No public APIs changed.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
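For context, a minimal sketch — not the project's code — of why the dynamic divisor matters; the names num_steps, display_step_id, start_step, disp_freq, and train_time mirror those in the diffs below:

```python
def eta_seconds(num_steps, display_step_id, start_step, disp_freq, train_time):
    # train_time covers only the steps timed in the current display window,
    # so the divisor must be the number of steps actually timed, which is
    # smaller than disp_freq when the window is partial (e.g., after a restart).
    steps_timed = min(disp_freq, display_step_id - start_step)
    return int((num_steps - display_step_id) / steps_timed * train_time)

# Restarting at step 990 with disp_freq=100: the first display at step 1000
# timed only 10 steps. Dividing by disp_freq would report ~50 s instead of ~500 s.
print(eta_seconds(2000, 1000, 990, 100, train_time=5.0))  # 500
```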
Actionable comments posted: 0
🧹 Nitpick comments (2)
deepmd/pd/train/training.py (1)
921-924: ETA fix is correct; add a defensive guard and a tiny readability improvement

Using min(disp_freq, display_step_id - start_step) correctly accounts for partial display intervals and stabilizes the ETA early in training and near the end. To be extra safe against misconfiguration (e.g., disp_freq accidentally set to 0) and to improve readability, compute the interval once and guard it to be at least 1.
Apply this diff:
```diff
-                eta = int(
-                    (self.num_steps - display_step_id)
-                    / min(self.disp_freq, display_step_id - self.start_step)
-                    * train_time
-                )
+                interval = max(1, min(self.disp_freq, display_step_id - self.start_step))
+                eta = int((self.num_steps - display_step_id) / interval * train_time)
```

Additional notes:
- Consider asserting disp_freq > 0 at config parse time to prevent a modulo-by-zero in the display condition and future regressions (a standalone illustration of the guard follows these notes).
- Optional: align average training-time accounting with PT’s approach (track timed_steps and add min(disp_freq, display_step_id - start_step) each time) to avoid skew in the last, shorter interval.
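A standalone illustration of the suggested guard, assuming the same variable names as the diff:

```python
# Hypothetical edge case motivating the max(1, ...) guard: if disp_freq were
# misconfigured to 0 (or the window were empty), the raw divisor would be 0
# and the ETA division would raise ZeroDivisionError.
disp_freq, display_step_id, start_step = 0, 1000, 1000
raw = min(disp_freq, display_step_id - start_step)               # 0
interval = max(1, min(disp_freq, display_step_id - start_step))  # clamped to 1
assert raw == 0 and interval == 1
```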
deepmd/pt/train/training.py (1)
1004-1007: ETA denominator fix looks good; guard the interval and improve readability

This change fixes ETA spikes when the first/last display window is shorter than disp_freq. For robustness and clarity, compute a guarded interval once and reuse it.
Apply this diff:
```diff
-                eta = int(
-                    (self.num_steps - display_step_id)
-                    / min(self.disp_freq, display_step_id - self.start_step)
-                    * train_time
-                )
+                interval = max(1, min(self.disp_freq, display_step_id - self.start_step))
+                eta = int((self.num_steps - display_step_id) / interval * train_time)
```

Notes:
- You already maintain timed_steps consistently with the same min(...) logic below; this keeps ETA and average-time metrics conceptually aligned across PT/PD.
- As a separate hardening step, consider validating disp_freq > 0 at config load to avoid a modulo-by-zero in the display condition; a minimal sketch of such a check follows.
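A minimal sketch of such a config-time check; the function name and the "disp_freq" key location are assumptions for illustration, not the project's actual API:

```python
def validate_training_config(config: dict) -> None:
    # Reject a non-positive disp_freq early, before it can reach the
    # `step % disp_freq` display condition or the ETA divisor.
    disp_freq = config.get("disp_freq", 100)
    if not isinstance(disp_freq, int) or disp_freq <= 0:
        raise ValueError(
            f"disp_freq must be a positive integer, got {disp_freq!r}"
        )
```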
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- deepmd/pd/train/training.py (1 hunks)
- deepmd/pt/train/training.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
- GitHub Check: Test Python (6, 3.9)
- GitHub Check: Test Python (5, 3.9)
- GitHub Check: Test Python (1, 3.12)
- GitHub Check: Test Python (1, 3.9)
- GitHub Check: Test Python (4, 3.9)
- GitHub Check: Test Python (6, 3.12)
- GitHub Check: Test Python (2, 3.9)
- GitHub Check: Test Python (3, 3.12)
- GitHub Check: Test Python (4, 3.12)
- GitHub Check: Test Python (5, 3.12)
- GitHub Check: Test Python (3, 3.9)
- GitHub Check: Test Python (2, 3.12)
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp310-manylinux_aarch64
- GitHub Check: Build wheels for cp311-win_amd64
- GitHub Check: Build wheels for cp311-macosx_arm64
- GitHub Check: Build wheels for cp311-manylinux_x86_64
- GitHub Check: Build wheels for cp311-macosx_x86_64
- GitHub Check: Analyze (python)
- GitHub Check: Analyze (c-cpp)
- GitHub Check: Build C++ (cpu, cpu)
- GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
- GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
- GitHub Check: Build C++ (clang, clang)
- GitHub Check: Build C++ (cuda, cuda)
- GitHub Check: Build C++ (rocm, rocm)
- GitHub Check: Test C++ (false)
- GitHub Check: Test C++ (true)
- GitHub Check: Build C++ (cuda120, cuda)
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
fix eta computation code