-
Notifications
You must be signed in to change notification settings - Fork 630
[forge] allow torchforges to set checkpoint base folder #2131
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| # Logging needs to happen after distributed initialized | ||
| if job_config.job.print_config: | ||
| dict_config = OmegaConf.to_container(job_config, resolve=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me understand what's the benefit of introducing extra dependency OmegaConf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
torchforge uses yaml & omegaconf to manage the configs. for some weird reason, when job_config is passed to the engine, it becomes an omegaconf object instead of a dataclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be careful introducing this dependency. We haven't but should set up some test in torchtitan for this engine, and we don't want to be forced to have this dependency.
If it's for logging purpose, it can be done in forge?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, for now what forge doing is
class ForgeSFTRecipe(ForgeActor, ForgeEngine):
...
...
def __init__(self, config: DictConfig):
job_config = ForgeJobConfig().to_dict()
# Hack to deal with literal types from titan
job_config = OmegaConf.merge(job_config, config)
super().__init__(job_config)
it make sense on the other side to convert job_config back to dataclass
|
|
||
| # Logging needs to happen after distributed initialized | ||
| if job_config.job.print_config: | ||
| dict_config = OmegaConf.to_container(job_config, resolve=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's be careful introducing this dependency. We haven't but should set up some test in torchtitan for this engine, and we don't want to be forced to have this dependency.
If it's for logging purpose, it can be done in forge?
caaacae to
33fd4f9
Compare
|
sry for the confusion. I realized it's annoying to do this convert. I will move job config logging parts to Forge. this pr is only about set up the checkpoint's |
| logger = logging.getLogger(__name__) | ||
| logger.setLevel(logging.INFO) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
tianyu-l
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
this PR
allowing Torchforge to decide to print / log the configs