Error when using SGLang rollout #1049

@yujianll

Hi, I get this error when using SGLang for rollout.

(WorkerDict pid=2199833) [2025-04-12 21:11:01 TP0] Scheduler hit an exception: Traceback (most recent call last):
(WorkerDict pid=2199833)   File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1999, in run_scheduler_process
(WorkerDict pid=2199833)     scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
(WorkerDict pid=2199833)   File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 249, in __init__
(WorkerDict pid=2199833)     self.tp_worker = TpWorkerClass(
(WorkerDict pid=2199833)   File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
(WorkerDict pid=2199833)     self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
(WorkerDict pid=2199833)   File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 74, in __init__
(WorkerDict pid=2199833)     self.model_runner = ModelRunner(
(WorkerDict pid=2199833)   File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 174, in __init__
(WorkerDict pid=2199833)     min_per_gpu_memory = self.init_torch_distributed()
(WorkerDict pid=2199833)   File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 316, in init_torch_distributed
(WorkerDict pid=2199833)     before_avail_memory = get_available_gpu_memory(self.device, self.gpu_id)
(WorkerDict pid=2199833)   File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/utils.py", line 277, in get_available_gpu_memory
(WorkerDict pid=2199833)     free_gpu_memory, _ = torch.cuda.mem_get_info(gpu_id)
(WorkerDict pid=2199833)   File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/torch/cuda/memory.py", line 712, in mem_get_info
(WorkerDict pid=2199833)     return torch.cuda.cudart().cudaMemGetInfo(device)
(WorkerDict pid=2199833) RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
(WorkerDict pid=2199833) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
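
For context, the call that fails is essentially a free-memory query on the worker's GPU via torch.cuda.mem_get_info. A minimal standalone check along these lines (the device index 0 is just a placeholder for the gpu_id the TP worker is actually assigned) can be used to verify whether the device is reachable from the same environment:

import torch

device_id = 0  # placeholder; the SGLang scheduler passes the gpu_id assigned to the TP worker
torch.cuda.set_device(device_id)
free_bytes, total_bytes = torch.cuda.mem_get_info(device_id)
print(f"GPU {device_id}: {free_bytes / 1e9:.2f} GB free of {total_bytes / 1e9:.2f} GB total")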

I am running the script provided in examples/grpo_trainer/run_qwen2-7b_seq_balance.sh, with the rollout backend changed from vLLM to SGLang:

set -x

export VLLM_ATTENTION_BACKEND=XFORMERS

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=data/gsm8k/train.parquet \
    data.val_files=data/gsm8k/test.parquet \
    data.train_batch_size=1024 \
    data.max_prompt_length=512 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16000 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=16000 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_grpo_example_gsm8k' \
    trainer.experiment_name='qwen2_7b_function_rm_kl1e-3' \
    trainer.val_before_train=False \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=5 \
    trainer.total_epochs=15 $@

However, the same script runs fine on the same machine when using vLLM, and SGLang alone works fine in the same environment when launched with python -m sglang.launch_server. What could be the possible reasons for this error? Thanks!
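
For reference, the standalone launch that works is along these lines (the model path and port here are illustrative, not the exact command I ran):

python -m sglang.launch_server --model-path Qwen/Qwen2-7B-Instruct --tp 2 --port 30000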
