
Conversation

@rakkit (Contributor) commented Dec 9, 2025

This PR:

  1. allows Torchforge to decide where to put the checkpoint, wandb logs, etc., instead of the "current" folder
  2. allows Torchforge to decide whether to print / log the configs

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Dec 9, 2025

# Logging needs to happen after distributed initialized
if job_config.job.print_config:
    dict_config = OmegaConf.to_container(job_config, resolve=True)
Contributor

Can you help me understand the benefit of introducing the extra dependency OmegaConf?

Contributor Author (@rakkit) Dec 9, 2025

torchforge uses YAML & OmegaConf to manage the configs. For some weird reason, when job_config is passed to the engine, it becomes an OmegaConf object instead of a dataclass.
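
For illustration, roughly what this ends up looking like (the config shape below is made up, not the real titan JobConfig):

    from omegaconf import OmegaConf

    # Hypothetical config shape, for illustration only. A DictConfig still
    # supports attribute-style access, so engine code mostly works either way.
    job_config = OmegaConf.create(
        {"job": {"print_config": True, "dump_folder": "./outputs"}}
    )

    if job_config.job.print_config:
        # to_container(resolve=True) yields a plain dict that is easy to log.
        print(OmegaConf.to_container(job_config, resolve=True))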

Contributor

Let's be careful about introducing this dependency. We haven't yet, but we should set up some tests in torchtitan for this engine, and we don't want to be forced to carry this dependency.

If it's for logging purposes, can it be done in forge?
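
For example, if job_config stayed a plain dataclass on the titan side, it could be dumped for logging without any OmegaConf dependency (a sketch; log_config is a hypothetical helper):

    import dataclasses
    import json

    # Sketch, assuming job_config is a plain (possibly nested) dataclass.
    def log_config(job_config) -> None:
        print(json.dumps(dataclasses.asdict(job_config), indent=2, default=str))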

Contributor Author (@rakkit)

Good point. For now, what forge is doing is:

    class ForgeSFTRecipe(ForgeActor, ForgeEngine):
        ...

        def __init__(self, config: DictConfig):
            job_config = ForgeJobConfig().to_dict()
            # Hack to deal with literal types from titan
            job_config = OmegaConf.merge(job_config, config)
            super().__init__(job_config)
        

It would make sense on the other side to convert job_config back to a dataclass.
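
Roughly, that conversion could look like this (a sketch using a simplified, hypothetical stand-in for the titan JobConfig dataclass):

    from dataclasses import dataclass, field
    from omegaconf import OmegaConf

    # Hypothetical, simplified stand-in for torchtitan's JobConfig dataclass.
    @dataclass
    class Job:
        print_config: bool = False
        dump_folder: str = "./outputs"

    @dataclass
    class JobConfig:
        job: Job = field(default_factory=Job)

    def to_dataclass(cfg) -> JobConfig:
        # Merge the incoming DictConfig onto a structured (dataclass-backed)
        # schema, then materialize the actual dataclass instance.
        schema = OmegaConf.structured(JobConfig)
        merged = OmegaConf.merge(schema, cfg)
        return OmegaConf.to_object(merged)

    incoming = OmegaConf.create({"job": {"print_config": True}})
    job_config = to_dataclass(incoming)
    assert isinstance(job_config, JobConfig) and job_config.job.print_config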



@rakkit force-pushed the fix-forge-ckpt-base-folder branch from caaacae to 33fd4f9 on December 9, 2025 at 19:18
@rakkit changed the title from "[forge] allow torchforges to set checkpoint base folder and print configs" to "[forge] allow torchforges to set checkpoint base folder" Dec 9, 2025
@rakkit (Contributor Author) commented Dec 9, 2025

Sorry for the confusion. I realized it's annoying to do this conversion, so I will move the job config logging parts to Forge.

This PR is now only about setting up the checkpoint's base_folder.
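
For a sense of the intent (a sketch; the field names below are placeholders, not necessarily the exact ones this PR adds): the Forge side decides where checkpoints, wandb logs, etc. land, instead of defaulting to the engine's current folder.

    from omegaconf import OmegaConf

    # Illustrative only: placeholder field names, not the exact ones in this PR.
    engine_defaults = OmegaConf.create({"checkpoint": {"base_folder": "."}})
    forge_overrides = OmegaConf.create(
        {"checkpoint": {"base_folder": "/data/forge_runs/exp_001"}}
    )

    job_config = OmegaConf.merge(engine_defaults, forge_overrides)
    print(job_config.checkpoint.base_folder)  # /data/forge_runs/exp_001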

Comment on lines 26 to 27
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
Contributor

should remove?

Contributor Author (@rakkit)

fixed.

@tianyu-l (Contributor) left a comment

sgtm

@tianyu-l merged commit f3f2e8f into pytorch:main Dec 9, 2025
4 checks passed