Reproduction
I am running into problems using GRPOTrainer in trl. The environment has two GPUs.
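For context, the training script (scripts/run_r1_grpo.py) follows the usual GRPOTrainer pattern, roughly as in the minimal sketch below; the dataset, reward function, and model name are placeholders, since the actual script is not reproduced here.
# Minimal sketch of the GRPOTrainer setup (placeholders only, not the actual run_r1_grpo.py)
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Placeholder reward: score completions by length only
    return [-abs(1024 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

training_args = GRPOConfig(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5.0e-7,
    beta=0.001,
    max_prompt_length=256,
    max_completion_length=1024,
    num_generations=8,
    use_vllm=True,
    vllm_gpu_memory_utilization=0.6,
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder; the real model path is redacted (xx) in the config below
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()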
DeepSpeed config
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
deepspeed_multinode_launcher: standard
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: true
zero3_save_16bit_model: true
zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
training config
model_name_or_path: xx
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bf16: true
tf32: true
output_dir: xx
# Dataset arguments
dataset_id_or_path: xx
# Lora Arguments
# No LoRA is used here
# Training arguments
max_steps: 450
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
learning_rate: 5.0e-7
lr_scheduler_type: cosine
warmup_ratio: 0.03
# GRPO specific parameters
beta: 0.001
max_prompt_length: 256
max_completion_length: 1024
num_generations: 8
use_vllm: true
# vllm_device: "cuda:1"
vllm_gpu_memory_utilization: 0.6
# Logging arguments
logging_strategy: steps
logging_steps: 2
report_to:
- tensorboard
save_strategy: "steps"
save_steps: 100
seed: 42
# Hugging Face Hub
push_to_hub: true
hub_strategy: every_save
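The training YAML above is passed to the script through the --config flag and parsed with TRL's TrlParser, roughly as sketched below (an assumption about the script layout; the ScriptArguments dataclass holding dataset_id_or_path is hypothetical):
# Sketch of how the YAML config is consumed (assumed script layout, not the actual run_r1_grpo.py)
from dataclasses import dataclass
from typing import Optional

from trl import GRPOConfig, ModelConfig, TrlParser

@dataclass
class ScriptArguments:
    # Hypothetical dataclass for the non-standard field in the YAML above
    dataset_id_or_path: Optional[str] = None

# --config grpo-qwen-2.5.yaml fills the dataclass fields from the YAML above
parser = TrlParser((ScriptArguments, ModelConfig, GRPOConfig))
script_args, model_args, training_args = parser.parse_args_and_config()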
shell
accelerate launch --num_processes 1 --config_file deepspeed_zero3.yaml scripts/run_r1_grpo.py --config grpo-qwen-2.5.yaml
outputs:
[rank0]: File "xx/lib/python3.11/site-packages/torch/nn/functional.py", line 2551, in embedding
[rank0]: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
When I changed use_vllm: true to false, the following error appeared instead.
outputs:
[rank1]: File "xx/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 206, in apply_rotary_pos_emb
[rank1]: q_embed = (q * cos) + (rotate_half(q) * sin)
[rank1]: ~~^~~~~
[rank1]: RuntimeError: The size of tensor a (1034) must match the size of tensor b (1035) at non-singleton dimension 2
System Info
- Python version: 3.11.8
- PyTorch version: 2.5.1+cu118
- CUDA device(s): NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB
- Transformers version: 4.46.3
- Accelerate version: 1.2.1
- Accelerate config: not found
- Datasets version: 3.2.0
- HF Hub version: 0.27.1
- TRL version: 0.14.0
- bitsandbytes version: 0.42.0
- DeepSpeed version: 0.15.0
- Diffusers version: not installed
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.59.7
- PEFT version: 0.14.0
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
- Any traceback provided is complete