This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Download of recovery checkpoints should only happen on rank 0 in distributed training #473

Description

@ant0nsc

Example job: melanibe_private_cxr_modules_1619425230_983432d1. All 8 ranks try to download the recovery checkpoints, and the download then times out. Most likely, the download is only necessary on rank 0; the other ranks could wait until it has finished.
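
A minimal sketch of the rank-0-only download proposed above, not the repository's actual code. It assumes that the launcher exports the global rank in the RANK environment variable (as torchrun does), that all ranks can see the same target folder, and that the folder is fresh for each run. The names get_global_rank, download_on_rank_zero, download_fn and the .download_complete marker file are hypothetical and exist only for illustration.

```python
import os
import time
from pathlib import Path
from typing import Callable


def get_global_rank() -> int:
    """Return the global rank from the RANK environment variable, or 0 if unset.

    Assumption: the launcher sets RANK (torchrun does); single-process runs
    without the variable are treated as rank 0.
    """
    return int(os.environ.get("RANK", "0"))


def download_on_rank_zero(
    download_fn: Callable[[Path], None],
    target_dir: Path,
    rank: int,
    timeout_seconds: float = 600.0,
    poll_seconds: float = 1.0,
) -> Path:
    """Run download_fn on rank 0 only; all other ranks wait for a marker file.

    Assumptions: target_dir is visible to every rank (same node or shared
    storage) and does not contain a marker left over from an earlier run.
    """
    marker = target_dir / ".download_complete"  # hypothetical marker name
    if rank == 0:
        target_dir.mkdir(parents=True, exist_ok=True)
        download_fn(target_dir)
        # Written only after the download returned, so waiting ranks never
        # see a partially downloaded checkpoint folder.
        marker.touch()
    else:
        deadline = time.monotonic() + timeout_seconds
        while not marker.exists():
            if time.monotonic() > deadline:
                raise TimeoutError(
                    f"Rank {rank}: no checkpoints from rank 0 after {timeout_seconds}s"
                )
            time.sleep(poll_seconds)
    return target_dir
```

Each rank would call download_on_rank_zero(download_fn, target_dir, rank=get_global_rank()) before loading the checkpoint. If the process group is already initialised at that point, a torch.distributed.barrier() after the rank-0 download would be a simpler way to make the other ranks wait than polling for a marker file. On multi-node jobs without shared storage, the same pattern would need to key on the local rank of each node instead.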

@melanibe

AB#4083

Metadata

    Labels

    bug: Something isn't working
