Hi, I get this error when using SGLang for rollout.
(WorkerDict pid=2199833) [2025-04-12 21:11:01 TP0] Scheduler hit an exception: Traceback (most recent call last):
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1999, in run_scheduler_process
(WorkerDict pid=2199833) scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 249, in __init__
(WorkerDict pid=2199833) self.tp_worker = TpWorkerClass(
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 63, in __init__
(WorkerDict pid=2199833) self.worker = TpModelWorker(server_args, gpu_id, tp_rank, dp_rank, nccl_port)
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 74, in __init__
(WorkerDict pid=2199833) self.model_runner = ModelRunner(
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 174, in __init__
(WorkerDict pid=2199833) min_per_gpu_memory = self.init_torch_distributed()
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 316, in init_torch_distributed
(WorkerDict pid=2199833) before_avail_memory = get_available_gpu_memory(self.device, self.gpu_id)
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/sglang/srt/utils.py", line 277, in get_available_gpu_memory
(WorkerDict pid=2199833) free_gpu_memory, _ = torch.cuda.mem_get_info(gpu_id)
(WorkerDict pid=2199833) File "/proj/long-multi/midi/miniforge3/envs/exp_t/lib/python3.10/site-packages/torch/cuda/memory.py", line 712, in mem_get_info
(WorkerDict pid=2199833) return torch.cuda.cudart().cudaMemGetInfo(device)
(WorkerDict pid=2199833) RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
(WorkerDict pid=2199833) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
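For reference, the failing call reduces to a free-memory query on the device index the scheduler is given (get_available_gpu_memory in sglang/srt/utils.py, which calls torch.cuda.mem_get_info). As a minimal sketch of that probe run standalone (gpu_id=0 is my assumption for a process that only sees one GPU via CUDA_VISIBLE_DEVICES):

import os
import torch

# Show which devices this process can actually see.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.device_count() =", torch.cuda.device_count())

gpu_id = 0  # assumed index of the single visible GPU in this process
free_bytes, total_bytes = torch.cuda.mem_get_info(gpu_id)
print(f"free = {free_bytes / 1024**3:.2f} GiB, total = {total_bytes / 1024**3:.2f} GiB")

If this probe fails the same way outside of verl, the problem is likely GPU visibility or compute mode on the node rather than the rollout integration itself.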
I am running the simple script provided in examples/grpo_trainer/run_qwen2-7b_seq_balance.sh, with vllm changed to sglang:
set -x
export VLLM_ATTENTION_BACKEND=XFORMERS
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=data/gsm8k/train.parquet \
data.val_files=data/gsm8k/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=512 \
data.max_response_length=1024 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=Qwen/Qwen2-7B-Instruct \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.ppo_mini_batch_size=256 \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16000 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
actor_rollout_ref.rollout.name=sglang \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.rollout.n=5 \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=16000 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.experiment_name='qwen2_7b_function_rm_kl1e-3' \
trainer.val_before_train=False \
trainer.n_gpus_per_node=8 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=5 \
trainer.total_epochs=15 $@
However, the same script runs well on the same machine when using vllm, and SGLang alone works well in the same environment with python -m sglang.launch_server. I wonder what the possible reasons for this error could be. Thanks!
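For what it's worth, since the (WorkerDict pid=...) prefix suggests the scheduler is constructed inside a Ray worker, a rough way to isolate whether this is Ray-related would be to run the same memory query inside a Ray task with one GPU assigned. This is only a sketch of my own, not verl code; it assumes Ray restricts CUDA_VISIBLE_DEVICES to the assigned GPU, so index 0 is valid inside the task:

import os
import ray
import torch

ray.init()

@ray.remote(num_gpus=1)
def probe():
    # Ray normally sets CUDA_VISIBLE_DEVICES to the GPU it assigned to this task.
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    return visible, free_bytes, total_bytes

print(ray.get(probe.remote()))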