Labels: ⚡ accelerate (Related to accelerate) · 🏋 SFT (Related to SFT) · 🐛 bug (Something isn't working)
Description
In TRL v0.16.0, trl/trl/trainer/sft_trainer.py line 505 (commit 23a635e)

self.accelerator.gather_for_metrics(torch.tensor(inputs["position_ids"].size(1))).sum().item()

creates a new tensor on the CPU via torch.tensor(inputs["position_ids"].size(1)), while self.accelerator.gather_for_metrics expects the tensor to be on the device the Accelerator is configured for (here, the GPU). This device mismatch causes a runtime error during GPU-accelerated training; a possible fix is sketched after the traceback below.
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/nvme0n1p1/.wang/AiInfraFactory_v3/factory/trainer.py", line 120, in <module>
[rank0]: trainer.train()
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank0]: return inner_training_loop(
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3718, in training_step
[rank0]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]: File "/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 505, in compute_loss
[rank0]: self.accelerator.gather_for_metrics(torch.tensor(inputs["position_ids"].size(1))).sum().item()
[rank0]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2613, in gather_for_metrics
[rank0]: data = self.gather(input_data)
[rank0]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2569, in gather
[rank0]: return gather(tensor)
[rank0]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 379, in wrapper
[rank0]: raise DistributedOperationException(
[rank0]: accelerate.utils.operations.DistributedOperationException: One or more of the tensors passed to accelerate.utils.operations.gather were not on the cpu while the `Accelerator` is configured for cuda. Please move it to the cuda before calling accelerate.utils.operations.gather.
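A minimal sketch of one possible workaround, assuming the intent of line 505 is simply to sum the per-rank sequence lengths: create the scalar tensor directly on the accelerator's device before gathering. The variable names below are illustrative and this is not the actual TRL patch.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Stand-in for inputs["position_ids"].size(1): the per-rank sequence length.
seq_len = 4096

# Failing pattern: torch.tensor(...) defaults to CPU, which mismatches a
# CUDA-configured Accelerator inside gather_for_metrics.
# total = accelerator.gather_for_metrics(torch.tensor(seq_len)).sum().item()

# Possible fix: build the tensor on the accelerator's device before gathering.
seq_len_tensor = torch.tensor(seq_len, device=accelerator.device)
total_seq_len = accelerator.gather_for_metrics(seq_len_tensor).sum().item()
print(total_seq_len)
```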