Race condition when creating checkpoint directories causes training failures #1657

@rj42

Problem
We've encountered a race condition in checkpoint directory creation during DAPO training. It crashes the process with the following error:

RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist.

Environment

Error logs
My annotations are marked with <--.

Tue May 20 21:16:06 2025[1,0]:RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist. <-- failed (controller)
Tue May 20 21:16:06 2025[1,0]:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving model to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025[1,0]:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving checkpoint to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025[1,0]:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving extra_state to /workspace/ckpts/global_step_20/actor/extra_state_world_size_32_rank_26.pt [repeated 31x across cluster]

Possible cause
self.actor_rollout_wg.save_checkpoint appears to create the folder asynchronously on the workers, so the directory may not exist yet when the controller tries to save its own data into it.
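
To illustrate the suspected ordering problem, here is a minimal, deterministic simulation. It is not the project's code: controller_save, simulate_race, and the file name data.pt are hypothetical names chosen for this sketch, and the error message only mirrors the one in the log above. A background thread stands in for the workers and creates the directory only after the controller has already tried to write.

```python
import os
import tempfile
import threading
from typing import Tuple


def controller_save(path: str) -> None:
    """Hypothetical writer that refuses to save if the parent directory is missing."""
    parent = os.path.dirname(path)
    if not os.path.isdir(parent):
        # Message shaped like the one in the issue's log.
        raise RuntimeError(f"Parent directory {parent} does not exist.")
    with open(path, "w") as f:
        f.write("state")


def simulate_race(root: str) -> Tuple[bool, bool]:
    """Return (controller_failed_before_dir_existed, save_succeeded_afterwards)."""
    step_dir = os.path.join(root, "global_step_20")
    target = os.path.join(step_dir, "data.pt")  # hypothetical controller-side file
    may_create = threading.Event()

    def worker() -> None:
        # The "worker" creates the directory tree only once it is released,
        # which models directory creation finishing late.
        may_create.wait()
        os.makedirs(os.path.join(step_dir, "actor"), exist_ok=True)

    t = threading.Thread(target=worker)
    t.start()

    try:
        controller_save(target)  # runs before the worker has created anything
        failed_early = False
    except RuntimeError:
        failed_early = True

    may_create.set()  # let the worker create the directory
    t.join()
    controller_save(target)  # now the parent exists, so this succeeds
    return failed_early, os.path.exists(target)


with tempfile.TemporaryDirectory() as root:
    print(simulate_race(root))
```

In the real system the ordering is not forced like this. It depends on timing, which would explain why the failure is intermittent.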

Proposed Solution
Create the checkpoint directory explicitly before saving, rather than relying on the asynchronous save to have created it in time.
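
A minimal sketch of that idea, assuming the step directory follows the global_step_<N> layout seen in the log. The helper name ensure_checkpoint_dir is hypothetical and is not taken from the linked PR. os.makedirs with exist_ok=True is idempotent, so it is safe even if a worker creates the same directory concurrently.

```python
import os
import tempfile


def ensure_checkpoint_dir(root: str, global_step: int) -> str:
    """Create <root>/global_step_<N> if it is missing and return its path."""
    step_dir = os.path.join(root, f"global_step_{global_step}")
    # exist_ok=True avoids an error when another process got there first.
    os.makedirs(step_dir, exist_ok=True)
    return step_dir


# A temporary directory stands in for /workspace/ckpts in this example.
with tempfile.TemporaryDirectory() as root:
    step_dir = ensure_checkpoint_dir(root, 20)
    # Calling it again is harmless and returns the same path.
    assert ensure_checkpoint_dir(root, 20) == step_dir
    print(os.path.isdir(step_dir))
```

Calling a helper like this on the controller right before it writes means the controller no longer depends on when the workers create the directory.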

PR
link
