Skip to content
This repository was archived by the owner on Mar 21, 2024. It is now read-only.
This repository was archived by the owner on Mar 21, 2024. It is now read-only.

Multi-node training jobs for LightningContainer models can get stuck at inference time #493

Description

@ant0nsc

It appears that using any of the PyTorch Lightning metrics in the test_step can cause multi-node jobs to hang indefinitely. They appear to try to synchronize to the other GPUs, but those are terminated already.

AB#4121

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions