Segfault when checkpointing.save_period is not a multiple of grpo.val_period + megatron backend #834

@ashors1

Description

Describe the bug
With GRPO and the megatron backend, saving a checkpoint in a step where validation has not run first causes a failure. The following segfault occurs in the step immediately after the checkpoint is saved:

▶ Generating responses for batch of size 4...
(MegatronPolicyWorker[rank=0] pid=4182) GPU Memory before optimizer offload: 0.02GB allocated, 0.06GB reserved
(MegatronPolicyWorker[rank=0] pid=4182) GPU Memory after optimizer offload: 0.02GB allocated, 0.06GB reserved
(VllmGenerationWorker pid=3269) INFO 08-04 16:39:12 [executor_base.py:226] It took 0.196385 seconds to wake up tags ['weights'].
[Refit] Split 226 keys into 1 groups
(MegatronPolicyWorker[rank=0] pid=4182) [f676f732498b:4182 :0:4182] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3121dc10)
(MegatronPolicyWorker[rank=0] pid=4182) ==== backtrace (tid:   4182) ====
(MegatronPolicyWorker[rank=0] pid=4182)  0  /opt/hpcx/nccl_rdma_sharp_plugin/lib/../../ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7faf44156774]
(MegatronPolicyWorker[rank=0] pid=4182)  1  /opt/hpcx/nccl_rdma_sharp_plugin/lib/../../ucx/lib/libucs.so.0(+0x3796a) [0x7faf4415696a]
(MegatronPolicyWorker[rank=0] pid=4182)  2  /opt/hpcx/nccl_rdma_sharp_plugin/lib/../../ucx/lib/libucs.so.0(+0x37ba8) [0x7faf44156ba8]
(MegatronPolicyWorker[rank=0] pid=4182)  3  /usr/local/cuda/compat/lib.real/libcuda.so.1(+0x276180) [0x7fb0603fb180]
...

Note that the following combinations run without error:

  • SFT + megatron backend
  • GRPO + hf backend

Steps/Code to reproduce bug

uv run examples/run_grpo_math.py --config examples/configs/grpo_math_1B_megatron.yaml policy.model_name=Qwen/Qwen3-0.6B grpo.num_prompts_per_step=2 grpo.num_generations_per_prompt=2 policy.train_global_batch_size=4 checkpointing.enabled=True checkpointing.save_period=1
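
To make the trigger condition from the title concrete, here is a minimal sketch that lists the steps in which a checkpoint would be saved without validation running in the same step. It assumes both events fire when the 1-based step number is divisible by their period; that scheduling rule, the function name, and the `val_period=10` value in the usage lines are illustrative assumptions, not taken from the NeMo RL source or from the config used above.

```python
def steps_saving_without_validation(
    save_period: int, val_period: int, total_steps: int
) -> list[int]:
    """List 1-based steps where a checkpoint is saved but validation does not run.

    Assumption (not confirmed by the issue): an event with period p fires
    on every step s where s % p == 0.
    """
    if save_period <= 0 or val_period <= 0:
        raise ValueError("periods must be positive integers")
    return [
        step
        for step in range(1, total_steps + 1)
        if step % save_period == 0 and step % val_period != 0
    ]


# Hypothetical values: save every step, validate every 10 steps.
print(steps_saving_without_validation(save_period=1, val_period=10, total_steps=5))
# save_period is a multiple of val_period, so every save coincides with validation.
print(steps_saving_without_validation(save_period=10, val_period=5, total_steps=20))
```

Under this assumed rule, the result is empty exactly when save_period is a multiple of val_period, which matches the condition named in the issue title. With `checkpointing.save_period=1` as in the reproduction command, any val_period greater than 1 produces a save without validation at step 1.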

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)
