Description
Problem
We've encountered a race condition when creating checkpoint directories during DaPo training that causes the process to crash with the following error:
RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist.
Environment
- Model: Qwen2.5-32B
- Training Method: DaPo (vanilla)
- Parallelism: FSDP
- Script used: https://github.com/volcengine/verl/blob/main/recipe/dapo/run_dapo_early_qwen2.5_32b.sh
- Environment: Standard from verl/trainer/runtime_env
Error logs
See my comments marked with <--.
Tue May 20 21:16:06 2025[1,0]:RuntimeError: Parent directory /workspace/ckpts/global_step_20 does not exist. <-- failed (controller)
Tue May 20 21:16:06 2025[1,0]:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving model to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025[1,0]:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving checkpoint to /workspace/ckpts/global_step_20/actor/model_world_size_32_rank_26.pt [repeated 31x across cluster]
Tue May 20 21:16:06 2025[1,0]:(WorkerDict pid=11558, ip=2a02:6b8:c43:5030:0:4457:48a7:6002) [rank-26]: Saving extra_state to /workspace/ckpts/global_step_20/actor/extra_state_world_size_32_rank_26.pt [repeated 31x across cluster]
Possible cause
self.actor_rollout_wg.save_checkpoint creates the folder asynchronously, but the creation does not always finish before the controller tries to save its own data into that folder.
Proposed Solution
Explicitly create the directory on the controller before saving.
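A minimal sketch of the idea, not the actual patch: the helper name ensure_checkpoint_dir and its parameters are hypothetical, and the only assumption taken from the error message is the global_step_<N> directory layout. The controller would call something like this before writing anything into the step folder.

```python
import os


def ensure_checkpoint_dir(root: str, global_step: int) -> str:
    """Create <root>/global_step_<N> if it is missing and return its path.

    Hypothetical helper for illustration; not part of verl.
    """
    step_dir = os.path.join(root, f"global_step_{global_step}")
    # exist_ok=True makes the call idempotent: it does not raise if a worker
    # has already created the directory or creates it at the same time.
    os.makedirs(step_dir, exist_ok=True)
    return step_dir
```

With this in place, the controller no longer depends on a worker having created the folder first. On a shared or network filesystem, how quickly a directory created on one node becomes visible on another may still vary, so this sketch only removes the ordering dependency on the controller side.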
PR
link