DDP training tries to save sharded checkpoint on the last step #664

@ananyahjha93

🐛 Describe the bug

2024-07-17T04:30:51.202069810Z 2024-07-16 21:30:51.201	jupiter-cs-aus-121.reviz.ai2.in:0	olmo.train:1268	INFO	Saving final checkpoint...
2024-07-17T04:30:52.220928528Z 2024-07-16 21:30:52.219	jupiter-cs-aus-121.reviz.ai2.in:5	olmo.util:163	CRITICAL	Uncaught AssertionError: TorchLegacyShardedCheckpointer is being called to save a model where `distributed_strategy` is not FSDP.

With DDP, when the final step count is divisible by save_interval_unsharded, the unsharded checkpoint has already been saved on that step. The condition for saving the final checkpoint then falls through to the sharded checkpoint saver, TorchLegacyShardedCheckpointer, which raises the assertion above because `distributed_strategy` is not FSDP. The culprit is this if condition: https://github.com/allenai/OLMo/blob/main/olmo/train.py#L1256
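
For context, here is a minimal sketch of the control flow around that line. All names below (`TrainerState`, `save_unsharded`, `save_sharded`, `last_unsharded_step`) are illustrative assumptions for this report, not OLMo's actual API:

```python
# Hypothetical sketch of the final-checkpoint branch around olmo/train.py#L1256.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainerState:
    step: int                           # current (final) step
    last_unsharded_step: Optional[int]  # step of the last unsharded save
    save_interval_unsharded: Optional[int]
    distributed_strategy: str           # "ddp" or "fsdp"

def save_unsharded(step: int) -> None:
    print(f"saving unsharded checkpoint at step {step}")

def save_sharded(step: int) -> None:
    # Under DDP this path reaches TorchLegacyShardedCheckpointer, which
    # asserts that distributed_strategy is FSDP -> the crash logged above.
    print(f"saving sharded checkpoint at step {step}")

def save_final_checkpoint(state: TrainerState) -> None:
    already_saved = state.last_unsharded_step == state.step
    if state.save_interval_unsharded is not None and not already_saved:
        save_unsharded(state.step)
    else:
        # Bug: with DDP and step % save_interval_unsharded == 0 we land
        # here and hand the model to the sharded checkpointer, even though
        # this branch should also check distributed_strategy.
        save_sharded(state.step)

# Repro of the reported case: DDP, final step divisible by the unsharded interval.
save_final_checkpoint(TrainerState(step=1000, last_unsharded_step=1000,
                                   save_interval_unsharded=500,
                                   distributed_strategy="ddp"))
```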

Versions

NA
