Hi, I get this error when using SGLang for rollout.
(WorkerDict pid=2199833) [2025-04-12 21:11:01 TP0] Scheduler hit an exception: Traceback (most recent call last):
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1999, in run_scheduler_process
(WorkerDict pid=2199833) scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 249, in __init__
(WorkerDict pid=2199833) self.tp_worker = TpWorkerClass(
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
(WorkerDict pid=2199833) self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 74, in __init__
(WorkerDict pid=2199833) self.model_runner = ModelRunner(
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 174, in __init__
(WorkerDict pid=2199833) min_per_gpu_memory = self.init_torch_distributed()
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 316, in init_torch_distributed
(WorkerDict pid=2199833) before_avail_memory = get_available_gpu_memory(self.device, self.gpu_id)
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/utils.py", line 277, in get_available_gpu_memory
(WorkerDict pid=2199833) free_gpu_memory, _ = torch.cuda.mem_get_info(gpu_id)
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/torch/cuda/memory.py", line 712, in mem_get_info
(WorkerDict pid=2199833) return torch.cuda.cudart().cudaMemGetInfo(device)
(WorkerDict pid=2199833) RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
(WorkerDict pid=2199833) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
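For reference, the failing call reduces to a free-memory query on the device index the scheduler is given (get_available_gpu_memory in sglang/srt/utils.py, which calls torch.cuda.mem_get_info). As a minimal sketch of that probe run standalone (gpu_id=0 is my assumption for a process that only sees one GPU via CUDA_VISIBLE_DEVICES):

import os
import torch

# Show which devices this process can actually see.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.device_count() =", torch.cuda.device_count())

gpu_id = 0  # assumed index of the single visible GPU in this process
free_bytes, total_bytes = torch.cuda.mem_get_info(gpu_id)
print(f"free = {free_bytes / 1024**3:.2f} GiB, total = {total_bytes / 1024**3:.2f} GiB")

If this probe fails the same way outside of verl, the problem is likely GPU visibility or compute mode on the node rather than the rollout integration itself.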
I am running the simple script provided in examples/grpo_trainer/run_qwen2-7b_seq_balance.sh, with vllm changed to sglang:
set -x
export VLLM_ATTENTION_BACKEND=XFORMERS
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=data/gsm8k/train.parquet \
data.val_files=data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16000 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=sglang \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=16000 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm_kl1e-3' \
trainer.val_before_train=False \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
However, the same script runs well on the same machine when using vllm, and SGLang alone works well in the same environment with python -m sglang.launch_server. I wonder what the possible reasons for this error could be. Thanks!
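For what it's worth, since the (WorkerDict pid=...) prefix suggests the scheduler is constructed inside a Ray worker, a rough way to isolate whether this is Ray-related would be to run the same memory query inside a Ray task with one GPU assigned. This is only a sketch of my own, not verl code; it assumes Ray restricts CUDA_VISIBLE_DEVICES to the assigned GPU, so index 0 is valid inside the task:

import os
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
def probe():
    # Ray normally sets CUDA_VISIBLE_DEVICES to the GPU it assigned to this task.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    return visible, free_bytes, total_bytes

print(ray.get(probe.remote()))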