
[GRPOTrainer] RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! #2851

@YunGe0414

Reproduction

I am running into problems using GRPOTrainer in TRL. The environment has two GPUs.

DeepSpeed config (deepspeed_zero3.yaml):

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Training config (grpo-qwen-2.5.yaml):

model_name_or_path: xx
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bf16: true
tf32: true
output_dir: xx

# Dataset arguments
dataset_id_or_path: xx
# Lora Arguments
# No LoRA is used here

# Training arguments
max_steps: 450
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 5.0e-7 
lr_scheduler_type: cosine
warmup_ratio: 0.03
# GRPO specific parameters
beta: 0.001
max_prompt_length: 256
max_completion_length: 1024
num_generations: 8
use_vllm: true
# vllm_device: "cuda:1"
vllm_gpu_memory_utilization: 0.6

# Logging arguments
logging_strategy: steps
logging_steps: 2
report_to:
- tensorboard
save_strategy: "steps"
save_steps: 100
seed: 42

# Hugging Face Hub 
push_to_hub: true
hub_strategy: every_save
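
For context, scripts/run_r1_grpo.py is essentially a standard GRPOTrainer setup. Below is a minimal sketch of what it does, assuming the usual TRL 0.14 API; the model id, dataset, and reward function are placeholders, and only the hyperparameters mirror the config above.

# Minimal sketch of the training setup, assuming the standard TRL 0.14 API.
# Model id, dataset, and reward function are placeholders; only the
# hyperparameters mirror the YAML config above.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def dummy_reward(completions, **kwargs):
    # Hypothetical reward: score completions by length.
    return [float(len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset with a "prompt" column

training_args = GRPOConfig(
    output_dir="grpo-debug",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    max_steps=450,
    learning_rate=5.0e-7,
    bf16=True,
    beta=0.001,
    max_prompt_length=256,
    max_completion_length=1024,
    num_generations=8,
    use_vllm=True,
    vllm_gpu_memory_utilization=0.6,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # stands in for the real model_name_or_path ("xx")
    reward_funcs=dummy_reward,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()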

Launch command:

accelerate launch --num_processes 1 --config_file deepspeed_zero3.yaml scripts/run_r1_grpo.py --config grpo-qwen-2.5.yaml

outputs:

[rank0]:   File "xx/lib/python3.11/site-packages/torch/nn/functional.py", line 2551, in embedding
[rank0]:     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
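
For reference, the error itself just means the input ids reach the embedding layer on a different GPU than its weight. A standalone illustration of the same PyTorch error (not the actual TRL call site):

# Standalone illustration of the same PyTorch error, outside TRL/vLLM:
# the embedding weight lives on cuda:0 while the input ids arrive on cuda:1.
import torch

emb = torch.nn.Embedding(10, 4).to("cuda:0")    # weight on cuda:0
ids = torch.tensor([1, 2, 3], device="cuda:1")  # input ids on cuda:1
emb(ids)  # RuntimeError: Expected all tensors to be on the same device, ...

So it looks like the training model's weights and the tensors coming back from generation end up on different GPUs, presumably related to vLLM being placed on the second GPU.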

When I changed use_vllm from true to false, a different error appeared.

outputs:

[rank1]:   File "xx/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 206, in apply_rotary_pos_emb
[rank1]:     q_embed = (q * cos) + (rotate_half(q) * sin)
[rank1]:                ~~^~~~~
[rank1]: RuntimeError: The size of tensor a (1034) must match the size of tensor b (1035) at non-singleton dimension 2
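
The shape error means the query states and the rotary cos/sin tensors disagree by one position. A standalone illustration with made-up shapes (only the mismatching dimension mirrors the traceback):

# Standalone illustration of the same broadcasting error: the sequence
# dimension of q (1034) is one shorter than that of cos (1035), so the
# elementwise multiply in apply_rotary_pos_emb cannot broadcast.
import torch

q   = torch.randn(1, 2, 1034, 64)   # (batch, num_heads, seq_len, head_dim)
cos = torch.randn(1, 1, 1035, 64)   # rotary embeddings for one extra position
q * cos  # RuntimeError: The size of tensor a (1034) must match the size of tensor b (1035) at non-singleton dimension 2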

System Info

  • Python version: 3.11.8
  • PyTorch version: 2.5.1+cu118
  • CUDA device(s): NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB
  • Transformers version: 4.46.3
  • Accelerate version: 1.2.1
  • Accelerate config: not found
  • Datasets version: 3.2.0
  • HF Hub version: 0.27.1
  • TRL version: 0.14.0
  • bitsandbytes version: 0.42.0
  • DeepSpeed version: 0.15.0
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.59.7
  • PEFT version: 0.14.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
  • Any traceback provided is complete

Labels

🏋 GRPO (Related to GRPO), 🐛 bug (Something isn't working)
