### What happened?
I attempted to add an end-to-end (e2e) test for the `train` API. While running in a CPU environment, the trainer started training, but the job eventually failed with the following error:
```
Thank you for using `train` API for LLMs fine-tuning. This feature is in alpha stage Kubeflow community is looking for your feedback. Please share your experience via #kubeflow-training Slack channel or Kubeflow Training Operator GitHub.
Traceback (most recent call last):
  File "/app/hf_llm_training.py", line 167, in <module>
    train_args = TrainingArguments(**json.loads(args.training_parameters))
  File "<string>", line 123, in __init__
  File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1528, in __post_init__
    and (self.device.type != "cuda")
  File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1995, in device
    return self._setup_devices
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 56, in __get__
    cached = self.fget(obj)
  File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1931, in _setup_devices
    self.distributed_state = PartialState(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 275, in __init__
    self.set_device()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 786, in set_device
    device_module.set_device(self.device)
AttributeError: module 'torch.cpu' has no attribute 'set_device'. Did you mean: '_device'?
```
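A hedged reading of this failure: `accelerate`'s `PartialState.set_device` calls `set_device` on whatever backend module matches the device type, but the `torch.cpu` module in this image's PyTorch build does not define `set_device`. A minimal torch-free sketch of the defensive pattern (the class and function names below are illustrative, not accelerate's actual code):

```python
# Sketch of why the AttributeError fires: the caller invokes
# device_module.set_device() unconditionally, but a CPU-only backend
# module may not provide that attribute at all.

class OldTorchCpu:
    """Stand-in for torch.cpu on an older PyTorch build (hypothetical)."""
    # intentionally no set_device defined


def safe_set_device(device_module, device):
    """Call set_device only if the backend module actually provides it."""
    setter = getattr(device_module, "set_device", None)
    if setter is not None:
        setter(device)
        return True
    return False  # CPU-only backend: nothing to do, no crash


print(safe_set_device(OldTorchCpu(), "cpu"))  # False: skipped safely
```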
To resolve this issue, I updated the base image of the trainer to `FROM nvcr.io/nvidia/pytorch:24.06-py3`, which fixed the problem. However, a new error then occurred:
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/app/hf_llm_training.py", line 178, in <module>
[rank0]:     train_model(model, transformer_type, train_data, eval_data, tokenizer, train_args)
[rank0]:   File "/app/hf_llm_training.py", line 138, in train_model
[rank0]:     trainer.train()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1624, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1961, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2902, in training_step
[rank0]:     loss = self.compute_loss(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2925, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1618, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1436, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 296, in forward
[rank0]:     return self.get_base_model()(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/bert/modeling_bert.py", line 1599, in forward
[rank0]:     loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/loss.py", line 1185, in forward
[rank0]:     return F.cross_entropy(input, target, weight=self.weight,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 3103, in cross_entropy
[rank0]:     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
[rank0]: IndexError: Target 4 is out of bounds.
E0813 22:17:04.876000 281472870619232 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 17) of binary: /usr/bin/python
```
### What did you expect to happen?
The training using the `train` API is expected to complete successfully.
### Environment
Kubernetes version:

```bash
$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0
```

Training Operator version:

```bash
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:latest
```

Training Operator Python SDK version:

```bash
$ pip show kubeflow-training
Name: kubeflow-training
Version: 1.8.0
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: hejinchi@cn.ibm.com
License: Apache License Version 2.0
Location: /opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by:
```
### Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.