Reproduction
I am running into problems using GRPOTrainer in trl. The environment has two GPUs.
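For context, the training script (scripts/run_r1_grpo.py) follows the usual GRPOTrainer pattern, roughly as in the minimal sketch below; the dataset, reward function, and model name are placeholders, since the actual script is not reproduced here.
# Minimal sketch of the GRPOTrainer setup (placeholders only, not the actual run_r1_grpo.py)
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Placeholder reward: score completions by length only
    return [-abs(1024 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder dataset

training_args = GRPOConfig(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5.0e-7,
    beta=0.001,
    max_prompt_length=256,
    max_completion_length=1024,
    num_generations=8,
    use_vllm=True,
    vllm_gpu_memory_utilization=0.6,
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder; the real model path is redacted (xx) in the config below
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()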
DeepSpeed config
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
deepspeed_multinode_launcher: standard
offload_optimizer_device: none
offload_param_device: none
zero3_init_flag: true
zero3_save_16bit_model: true
zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
training config
model_name_or_path: xx
model_revision: main
torch_dtype: bfloat16
attn_implementation: flash_attention_2
bf16: true
tf32: true
output_dir: xx
# Dataset arguments
dataset_id_or_path: xx
# Lora Arguments
# No LoRA is used here
# Training arguments
max_steps: 450
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
learning_rate: 5.0e-7
lr_scheduler_type: cosine
warmup_ratio: 0.03
# GRPO specific parameters
beta: 0.001
max_prompt_length: 256
max_completion_length: 1024
num_generations: 8
use_vllm: true
# vllm_device: "cuda:1"
vllm_gpu_memory_utilization: 0.6
# Logging arguments
logging_strategy: steps
logging_steps: 2
report_to:
- tensorboard
save_strategy: "steps"
save_steps: 100
seed: 42
# Hugging Face Hub
push_to_hub: true
hub_strategy: every_save
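The training YAML above is passed to the script through the --config flag and parsed with TRL's TrlParser, roughly as sketched below (an assumption about the script layout; the ScriptArguments dataclass holding dataset_id_or_path is hypothetical):
# Sketch of how the YAML config is consumed (assumed script layout, not the actual run_r1_grpo.py)
from dataclasses import dataclass
from typing import Optional

from trl import GRPOConfig, ModelConfig, TrlParser

@dataclass
class ScriptArguments:
    # Hypothetical dataclass for the non-standard field in the YAML above
    dataset_id_or_path: Optional[str] = None

# --config grpo-qwen-2.5.yaml fills the dataclass fields from the YAML above
parser = TrlParser((ScriptArguments, ModelConfig, GRPOConfig))
script_args, model_args, training_args = parser.parse_args_and_config()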
shell
accelerate launch --num_processes 1 --config_file deepspeed_zero3.yaml scripts/run_r1_grpo.py --config grpo-qwen-2.5.yaml
outputs:
[rank0]: File "xx/lib/python3.11/site-packages/torch/nn/functional.py", line 2551, in embedding
[rank0]: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
When I changed use_vllm: true to false, the following error appeared instead.
outputs:
[rank1]: File "xx/lib/python3.11/site-packages/transformers/models/qwen2/modeling_qwen2.py", line 206, in apply_rotary_pos_emb
[rank1]: q_embed = (q * cos) + (rotate_half(q) * sin)
[rank1]: ~~^~~~~
[rank1]: RuntimeError: The size of tensor a (1034) must match the size of tensor b (1035) at non-singleton dimension 2
System Info
- Python version: 3.11.8
- PyTorch version: 2.5.1+cu118
- CUDA device(s): NVIDIA A100-SXM4-80GB, NVIDIA A100-SXM4-80GB
- Transformers version: 4.46.3
- Accelerate version: 1.2.1
- Accelerate config: not found
- Datasets version: 3.2.0
- HF Hub version: 0.27.1
- TRL version: 0.14.0
- bitsandbytes version: 0.42.0
- DeepSpeed version: 0.15.0
- Diffusers version: not installed
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.59.7
- PEFT version: 0.14.0
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks (no screenshots; more on code blocks)
- Any traceback provided is complete