### What happened?
I attempted to add an end-to-end (e2e) test for the `train` API. While running in a CPU environment, the trainer started training, but the job eventually failed with the following error:
```
Thank you for using `train` API for LLMs fine-tuning. This feature is in alpha stage Kubeflow community is looking for your feedback. Please share your experience via #kubeflow-training Slack channel or Kubeflow Training Operator GitHub.
Traceback (most recent call last):
  File "/app/hf_llm_training.py", line 167, in <module>
    train_args = TrainingArguments(**json.loads(args.training_parameters))
  File "<string>", line 123, in __init__
  File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1528, in __post_init__
    and (self.device.type != "cuda")
  File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1995, in device
    return self._setup_devices
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py", line 56, in __get__
    cached = self.fget(obj)
  File "/usr/local/lib/python3.10/dist-packages/transformers/training_args.py", line 1931, in _setup_devices
    self.distributed_state = PartialState(
  File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 275, in __init__
    self.set_device()
  File "/usr/local/lib/python3.10/dist-packages/accelerate/state.py", line 786, in set_device
    device_module.set_device(self.device)
AttributeError: module 'torch.cpu' has no attribute 'set_device'. Did you mean: '_device'?
```
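A hedged reading of this failure: `accelerate`'s `PartialState.set_device` calls `set_device` on whatever backend module matches the device type, but the `torch.cpu` module in this image's PyTorch build does not define `set_device`. A minimal torch-free sketch of the defensive pattern (the class and function names below are illustrative, not accelerate's actual code):

```python
# Sketch of why the AttributeError fires: the caller invokes
# device_module.set_device() unconditionally, but a CPU-only backend
# module may not provide that attribute at all.

class OldTorchCpu:
    """Stand-in for torch.cpu on an older PyTorch build (hypothetical)."""
    # intentionally no set_device defined


def safe_set_device(device_module, device):
    """Call set_device only if the backend module actually provides it."""
    setter = getattr(device_module, "set_device", None)
    if setter is not None:
        setter(device)
        return True
    return False  # CPU-only backend: nothing to do, no crash


print(safe_set_device(OldTorchCpu(), "cpu"))  # False: skipped safely
```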
To resolve this issue, I updated the base image of the trainer to `FROM nvcr.io/nvidia/pytorch:24.06-py3`, which fixed the problem. However, a new error then occurred:
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/app/hf_llm_training.py", line 178, in <module>
[rank0]:     train_model(model, transformer_type, train_data, eval_data, tokenizer, train_args)
[rank0]:   File "/app/hf_llm_training.py", line 138, in train_model
[rank0]:     trainer.train()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1624, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1961, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2902, in training_step
[rank0]:     loss = self.compute_loss(model, inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2925, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1618, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1436, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/peft/peft_model.py", line 296, in forward
[rank0]:     return self.get_base_model()(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/bert/modeling_bert.py", line 1599, in forward
[rank0]:     loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/loss.py", line 1185, in forward
[rank0]:     return F.cross_entropy(input, target, weight=self.weight,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 3103, in cross_entropy
[rank0]:     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
[rank0]: IndexError: Target 4 is out of bounds.
E0813 22:17:04.876000 281472870619232 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 17) of binary: /usr/bin/python
```
### What did you expect to happen?
The training using the `train` API is expected to complete successfully.
### Environment
Kubernetes version:

```bash
$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0
```

Training Operator version:

```bash
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:latest
```

Training Operator Python SDK version:

```bash
$ pip show kubeflow-training
Name: kubeflow-training
Version: 1.8.0
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: hejinchi@cn.ibm.com
License: Apache License Version 2.0
Location: /opt/homebrew/anaconda3/envs/katib-llm-test/lib/python3.12/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by:
```
### Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.