Labels: ⚡ accelerate (Related to accelerate) · 🏋 SFT (Related to SFT) · 🐛 bug (Something isn't working)
Description
In TRL v0.16.0, trl/trl/trainer/sft_trainer.py line 505 (commit 23a635e)

self.accelerator.gather_for_metrics(torch.tensor(inputs["position_ids"].size(1))).sum().item()

creates a new tensor on the CPU via torch.tensor(inputs["position_ids"].size(1)), while self.accelerator.gather_for_metrics expects the tensor to be on the device the Accelerator is configured for (here, the GPU). This device mismatch causes a runtime error during GPU-accelerated training; a possible fix is sketched after the traceback below.
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/nvme0n1p1/.wang/AiInfraFactory_v3/factory/trainer.py", line 120, in <module>
[rank0]: trainer.train()
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank0]: return inner_training_loop(
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 3718, in training_step
[rank0]: loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]: File "/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 505, in compute_loss
[rank0]: self.accelerator.gather_for_metrics(torch.tensor(inputs["position_ids"].size(1))).sum().item()
[rank0]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2613, in gather_for_metrics
[rank0]: data = self.gather(input_data)
[rank0]: File "/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py", line 2569, in gather
[rank0]: return gather(tensor)
[rank0]: File "/usr/local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 379, in wrapper
[rank0]: raise DistributedOperationException(
[rank0]: accelerate.utils.operations.DistributedOperationException: One or more of the tensors passed to accelerate.utils.operations.gather were not on the cpu while the `Accelerator` is configured for cuda. Please move it to the cuda before calling accelerate.utils.operations.gather.
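A minimal sketch of one possible workaround, assuming the intent of line 505 is simply to sum the per-rank sequence lengths: create the scalar tensor directly on the accelerator's device before gathering. The variable names below are illustrative and this is not the actual TRL patch.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Stand-in for inputs["position_ids"].size(1): the per-rank sequence length.
seq_len = 4096

# Failing pattern: torch.tensor(...) defaults to CPU, which mismatches a
# CUDA-configured Accelerator inside gather_for_metrics.
# total = accelerator.gather_for_metrics(torch.tensor(seq_len)).sum().item()

# Possible fix: build the tensor on the accelerator's device before gathering.
seq_len_tensor = torch.tensor(seq_len, device=accelerator.device)
total_seq_len = accelerator.gather_for_metrics(seq_len_tensor).sum().item()
print(total_seq_len)
```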