Closed
Labels: bug (Something isn't working)
Description
Describe the bug
With GRPO and the megatron backend, a run crashes if a checkpoint is saved without validation having been run first. The following segfault occurs in the step immediately after the checkpoint is saved:
▶ Generating responses for batch of size 4...
(MegatronPolicyWorker[rank=0] pid=4182) GPU Memory before optimizer offload: 0.02GB allocated, 0.06GB reserved
(MegatronPolicyWorker[rank=0] pid=4182) GPU Memory after optimizer offload: 0.02GB allocated, 0.06GB reserved
(VllmGenerationWorker pid=3269) INFO 08-04 16:39:12 [executor_base.py:226] It took 0.196385 seconds to wake up tags ['weights'].
[Refit] Split 226 keys into 1 groups
(MegatronPolicyWorker[rank=0] pid=4182) [f676f732498b:4182 :0:4182] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3121dc10)
(MegatronPolicyWorker[rank=0] pid=4182) ==== backtrace (tid: 4182) ====
(MegatronPolicyWorker[rank=0] pid=4182) 0 /opt/hpcx/nccl_rdma_sharp_plugin/lib/../../ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7faf44156774]
(MegatronPolicyWorker[rank=0] pid=4182) 1 /opt/hpcx/nccl_rdma_sharp_plugin/lib/../../ucx/lib/libucs.so.0(+0x3796a) [0x7faf4415696a]
(MegatronPolicyWorker[rank=0] pid=4182) 2 /opt/hpcx/nccl_rdma_sharp_plugin/lib/../../ucx/lib/libucs.so.0(+0x37ba8) [0x7faf44156ba8]
(MegatronPolicyWorker[rank=0] pid=4182) 3 /usr/local/cuda/compat/lib.real/libcuda.so.1(+0x276180) [0x7fb0603fb180]
...
Note that the following combinations run without error:
- SFT + megatron backend
- GRPO + hf backend
Steps/Code to reproduce bug
uv run examples/run_grpo_math.py --config examples/configs/grpo_math_1B_megatron.yaml policy.model_name=Qwen/Qwen3-0.6B grpo.num_prompts_per_step=2 grpo.num_generations_per_prompt=2 policy.train_global_batch_size=4 checkpointing.enabled=True checkpointing.save_period=1
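For anyone bisecting which override matters, the one-line command above can be assembled from a dictionary so that individual settings are easy to toggle. This is only a minimal sketch: the helper name `build_repro_command` is hypothetical and not part of the repository, and the script, config path, and override values are copied verbatim from the command above.

```python
import shlex

# Overrides copied verbatim from the reproduction command in this issue.
OVERRIDES = {
    "policy.model_name": "Qwen/Qwen3-0.6B",
    "grpo.num_prompts_per_step": 2,
    "grpo.num_generations_per_prompt": 2,
    "policy.train_global_batch_size": 4,
    "checkpointing.enabled": True,
    "checkpointing.save_period": 1,
}


def build_repro_command(overrides=None):
    """Return the repro command as an argv list (hypothetical helper)."""
    if overrides is None:
        overrides = OVERRIDES
    argv = [
        "uv", "run", "examples/run_grpo_math.py",
        "--config", "examples/configs/grpo_math_1B_megatron.yaml",
    ]
    # Each override is passed as a key=value argument, in insertion order.
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


if __name__ == "__main__":
    # Print a shell-safe string that can be pasted into a terminal.
    print(shlex.join(build_repro_command()))
```

With these values, 2 prompts per step times 2 generations per prompt gives 4 samples, which agrees with both `policy.train_global_batch_size=4` and the "batch of size 4" line in the log.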