Support GPU offloading for lab train on Linux #520
Working here for me too, also with an RTX 3090. Do we have any idea what will happen with a GPU with insufficient VRAM? (I.e., a 12 GB card of some sort -- I have one of those lying around too.)
I can provoke a memory error by running
cli/train/linux_train.py (outdated)

```
@@ -202,7 +255,7 @@ def model_generate(user):
    per_device_train_batch_size=per_device_train_batch_size,
    bf16=True,
```
I am training on an Nvidia H100. When I tried using `lab train --device cuda --4-bit-quant`, the inference step would fail with the stack trace below. When I changed these lines as follows, it seemed to help:

```python
fp16=args.use_bitsandbytes,
bf16=not args.use_bitsandbytes,
```
Stack trace for reference:
```
LINUX_TRAIN.PY: RUNNING INFERENCE ON THE OUTPUT MODEL
Traceback (most recent call last):
File "/workspace/venv/lib/python3.10/site-packages/cli/train/linux_train.py", line 308, in <module>
main()
File "/workspace/venv/lib/python3.10/site-packages/cli/train/linux_train.py", line 286, in main
model_generate(d["user"]).split(response_template.strip())[-1].strip()
File "/workspace/venv/lib/python3.10/site-packages/cli/train/linux_train.py", line 201, in model_generate
outputs = model.generate(
File "/workspace/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
return self.sample(
File "/workspace/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2696, in sample
outputs = self(
File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1170, in forward
logits = self.lm_head(hidden_states)
File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16
```
I'm not really sure what the implications are of changing those dtypes, but I just tried to make it match the `bnb_4bit_compute_dtype`.
You aren't the first person to run into this problem with an NVidia card. I haven't been able to reproduce it with AMD ROCm.
I ran into this issue with an AMD GPU. Doing some Googling, I found this: huggingface/peft#1515 (comment). Adding that context manager to `model_generate()` like this:
```python
def model_generate(user):
    text = create_prompt(user=user)
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(args.device)
    with torch.cuda.amp.autocast():
        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=256,
            pad_token_id=tokenizer.eos_token_id,
            temperature=0.7,
            top_p=0.9,
            stopping_criteria=stopping_criteria,
            do_sample=True,
        )
    return tokenizer.batch_decode([o[:-1] for o in outputs])[0]
```
Seemed to make everything happy.
16-bit floats seem to be much slower on AMD ROCm (from 4.5 it/s with 4-bit quantization to 2.5 it/s). Does this change work for you?

```python
fp16 = args.use_bitsandbytes and torch.version.cuda is not None

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=args.num_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    fp16=fp16,
    bf16=not fp16,
    ...
)
```
By the way, I have created a container file for ROCm with all dependencies (torch, llama-cpp, bitsandbytes) pre-installed.
For those of us with less powerful GPUs, is there some way to offload some of the work to the GPU based on available memory while the rest runs on the CPU? Let's say I have an 8 GB or 12 GB Nvidia card but would like some level of speedup compared to running only on the CPU?
I now have training working with 4-bit quantization. On my Radeon 7900 XT it's less than half the performance of non-quantized training (2.5 it/s vs. 6 to 6.5 it/s). The next problem is the converter: I don't know how to address it. Therefore I'm going to hide the 4-bit quantization option and let somebody else address the converter.
A 12 GB card may be sufficient once 4-bit quantization is working correctly. Until then you need a card with at least 17 GiB of memory -- at least for AMD ROCm. I don't have access to NVidia hardware and cannot test whether CUDA supports shared memory.
Had not tried it, but just did, and it blew up on me with a lengthy backtrace, which I've ascertained is because I hadn't yet pip installed bitsandbytes. With that now installed, it does seem to be making some progress. It did spit out this table at me:
And here's nvidia-smi output, if it's of any interest:
I'll give that a look if this falls down and goes boom.
Signed-off-by: Christian Heimes <cheimes@redhat.com>
I believe everything that was expected to work did, but then I hit #579.
Tried it, fails, memory allocation error. Looks like 12 GB isn't sufficient without the --4-bit-quant option.
Thanks @jarodwilson - it's good to know the 4-bit quantization works for you, and yes, we'll have to get #579 solved for sure. I don't think it will be too hard to fix, although it may involve a bit of loss during the conversion process.
FWIW, I'm holding off merges from main until main is fixed. The tip of main is currently broken on Linux, see #673. I have submitted fixes for the bug and CI.
At the end of last week, I used a version of this along with the notebook from #617 on Google Colab and it worked great. Kudos! 👏
I tried it on an A100 40GB on an OpenShift AI pod. It worked great!
see inline.
Just took a quick look to see how the defaults look (mostly for Mac). I would think defaulting to GPU, if found, would be nice; example inline comment. Is cpu the more appropriate default? I think we already default to using the GPU for generate, so why not use it for train?
"--device", | ||
type=TORCH_DEVICE, | ||
show_default=True, | ||
default="cpu", |
Not sure, but I'd probably default to auto-detect mps or cuda. I'm used to using libraries that do that for me, just not sure if there's something different about our use case. Why not?
I'm used to something like this:
https://github.com/UKPLab/sentence-transformers/blob/e6af66fcb3acf09c95695f436cc7a5bb0320fdd4/sentence_transformers/util.py#L595
For now training requires a recent GPU with lots of free VRAM and support for fp16 or bfloat16. I can just about fit the training data into 20 GiB VRAM. Others had OOM with 24 GiB VRAM. I would have to detect whether you have sufficient VRAM, whether you run on WSL2 (which can use USM), whether your card has a sufficient CUDA level / HIP level, and so on.

For now I'm sticking to the Zen of Python "Explicit is better than implicit" and keeping the default as CPU. Somebody else can work on auto-detection after my PR has landed.
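For reference, auto-detection along the lines of the sentence-transformers helper linked above might look roughly like the sketch below. This is an illustration only, not part of this PR; the function name and fallback order are assumptions.

```python
import torch


def detect_default_device() -> str:
    """Pick a default torch device string (hypothetical helper)."""
    if torch.cuda.is_available():
        # True for both NVidia CUDA and AMD ROCm builds of PyTorch.
        return "cuda"
    if torch.backends.mps.is_available():
        # Apple Metal Performance Shaders on macOS.
        return "mps"
    return "cpu"
```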
@markstur Could you please file tickets for auto-detection, `mps`, and `npu` device support, so we have a place to discuss your ideas and don't forget them? The discussion on this PR is already long and hard to follow.
I think this flag should only accept options that we explicitly support:
This will put guard-rails around the backends we reasonably support so we can roll out future stuff as CI improves for different backends.
Could also add mutual exclusion between `--device=mps` and `--4-bit-quant`, and other exclusivities (4-bit might not be mutually exclusive forever; I just know that we can't use bitsandbytes for torch on Metal).
The input values for the `--device` option are more complex than just `cuda` or `cpu`. Users can also supply `cuda:1` for the second GPU card in their system. The custom click type `TorchDeviceParam` performs validation and refuses unsupported values. I could add an additional check and only accept devices of type `cpu` and `cuda` for now.
```shell
$ lab train --device invalid
Usage: lab train [OPTIONS]
Try 'lab train --help' for help.

Error: Invalid value for '--device': Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: invalid
```
`--4-bit-quant` is handled at the beginning of the function:

```python
if four_bit_quant and device.type != "cuda":
    raise click.ClickException("--4-bit-quant requires CUDA device")
```

```shell
$ lab train --4-bit-quant
Error: --4-bit-quant requires CUDA device
```
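For illustration, a restricted device parameter type could be sketched roughly like this. It is not the PR's actual `TorchDeviceParam` implementation; the class name and error wording are made up.

```python
import click
import torch


class RestrictedTorchDevice(click.ParamType):
    """Click parameter type that only accepts cpu and cuda[:N] devices (sketch)."""

    name = "torch_device"
    supported_types = ("cpu", "cuda")

    def convert(self, value, param, ctx):
        if isinstance(value, torch.device):
            return value
        try:
            device = torch.device(value)
        except RuntimeError as exc:
            # torch.device() rejects unknown device strings with a RuntimeError.
            self.fail(str(exc), param, ctx)
        if device.type not in self.supported_types:
            self.fail(
                f"Only {', '.join(self.supported_types)} devices are currently supported.",
                param,
                ctx,
            )
        return device
```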
I ran it on a CentOS 8 Stream box with the proprietary 550 Nvidia drivers and a Tesla DC GPU.
@jjasghar According to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities bfloat16 requires CUDA compute capability >= 8.0.
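For anyone checking their own card, the compute capability can be queried from PyTorch. A quick sketch (assumes a CUDA-enabled build of torch):

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    # bfloat16 needs compute capability >= 8.0 (Ampere or newer);
    # Turing-class cards report 7.5, for example.
    print(f"compute capability {major}.{minor}, bf16 usable: {major >= 8}")
```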
To follow up here, with the help of @tiran I made these changes:

```diff
index 6b427f2..f4b2f79 100644
--- a/cli/train/linux_train.py
+++ b/cli/train/linux_train.py
@@ -242,8 +242,8 @@ def linux_train(
         output_dir=output_dir,
         num_train_epochs=num_epochs,
         per_device_train_batch_size=per_device_train_batch_size,
-        fp16=use_fp16,
-        bf16=not use_fp16,
+        fp16=False,
+        bf16=False,
         # use_ipex=True, # TODO CPU test this possible optimization
         use_cpu=model.device.type == "cpu",
         save_strategy="epoch",
@@ -253,8 +253,8 @@ def linux_train(
         # https://stackoverflow.com/a/75793317
         # torch_compile=True,
         # fp16=False, # fp16 increases memory consumption 1.5x
-        # gradient_accumulation_steps=8,
-        # gradient_checkpointing=True,
+        gradient_accumulation_steps=8,
+        gradient_checkpointing=True,
         # eval_accumulation_steps=1,
         # per_device_eval_batch_size=1,
     )
```

And I'm able to train now.
A couple of nits, specifically logging and some comment questions.
"--device", | ||
type=TORCH_DEVICE, | ||
show_default=True, | ||
default="cpu", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this flag should only accept options that we explicitly support:
This will put guard-rails around the backends we reasonably support so we can roll out future stuff as CI improves for different backends.
"--device", | ||
type=TORCH_DEVICE, | ||
show_default=True, | ||
default="cpu", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also add mutual exclusion to --device=mps
and --4-bit-quant
and other exclusivities (4-bit might not be mutually exclusive forever, I just know that we can't use bitsandbytes for torch on metal.
```python
def _gib(size: int) -> str:
    return "{:.1f} GiB".format(size / 1024**3)


for idx in range(torch.cuda.device_count()):
```
does this API behave differently for CUDA vs. ROCm devices?
If not (i.e. cuda will be positive for ROCm as well), then we should note it.
PyTorch treats ROCm devices like a CUDA device. I have mentioned this in updated docs and it's explained in PyTorch docs, too.
There are differences between ROCm and CUDA, but the high level APIs work the same.
The docstring of the function is `Report CUDA/ROCm device properties`. That should be enough, shouldn't it?
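For illustration, a report built around the `_gib` helper above could look roughly like this sketch; it is not the PR's exact function. On ROCm builds of PyTorch, `torch.version.hip` is set instead of `torch.version.cuda`, but the `torch.cuda` calls behave the same.

```python
import torch


def _gib(size: int) -> str:
    return "{:.1f} GiB".format(size / 1024**3)


def report_cuda_devices() -> None:
    """Report CUDA/ROCm device properties (illustrative sketch)."""
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    print(f"{backend} devices: {torch.cuda.device_count()}")
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        free, total = torch.cuda.mem_get_info(idx)
        print(f"  cuda:{idx}: {props.name}, {_gib(free)} free of {_gib(total)}")
```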
> **Note:** This is explicitly forcing the build to use the ROCm compilers and prefix path for dependency resolution in the CMake build. This works around an issue in the CMake and ROCm version in Fedora 39 and below and is fixed in Fedora 40. With Fedora 40's ROCm packages, use `CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DAMDGPU_TARGETS=gfx1100"` instead.

Once that package is installed, recompile `lab` with `pip3 install .`. You also need to tell `HIP` which GPU to use - you can find this out via `rocminfo`, although it is typically GPU 0. To set which device is visible to HIP, we'll set `export HIP_VISIBLE_DEVICES=0` for GPU 0. You may also have to set `HSA_OVERRIDE_GFX_VERSION` to override ROCm's GFX version detection, for example `export HSA_OVERRIDE_GFX_VERSION=10.3.0` to force an unsupported `gfx1032` card to use the supported `gfx1030` version. The environment variable `AMD_LOG_LEVEL` enables debug logging of ROCm libraries, for example `AMD_LOG_LEVEL=3` to print API calls to stderr.
Could recommend the flag `pip3 install . --force-reinstall --no-cache` as a more forceful route. I've had to do this before.
That would be wrong, because it would override the ROCm builds of PyTorch and llama-cpp-python with CUDA builds plus 4 GB of CUDA libs. `--force-reinstall` ignores all installed packages and overrides them with new copies. Please trust me on this: I have been packaging Python software since before setuptools was first released in 2004.

Actually, all recommendations for `--force-reinstall --no-cache` should be replaced by better instructions. I'll file a ticket tomorrow.
++ Okay cool, I'll defer to your expertise on that then. It's worked for me for some things, but it's probably like adding stuff to a path manually: it works, but it's bad practice.
pip isn't easy to tame and some options are confusing. For example, you need to know that some parts of pip do not normalize the package name `llama-cpp-python` to its canonical form `llama_cpp_python`. Or that the option `--no-binary=llama_cpp_python` only affects downloads but not the local wheel cache.
I recommend:

1. Remove the `venv` and start with a fresh `venv`.
2. Clear the local wheel cache with `pip cache purge` or `pip cache remove llama_cpp_python` (sic!). `pip cache purge` only purges the wheel cache for locally built packages, but keeps the HTTP cache with downloaded wheels and sdists (also a very confusing fact). This way you don't download GBs of packages over and over again.
3. Build and install all dependencies that need special care:
   - build and install `llama-cpp-python`
   - download and install PyTorch for your GPU flavor
4. Finally, install the remaining dependencies from `requirements.txt` or `pyproject.toml`. pip recognizes that `llama-cpp-python` and `torch` are already present and won't override them -- unless there is a version conflict.

These four steps give you the smallest virtual env (no CUDA libs if you build for ROCm or CPU) and don't download the same package from PyPI several times.
requirements.txt (outdated)

```
@@ -20,3 +20,5 @@ torch>=2.2.1,<3.0.0
peft
datasets
trl
# 4-bit quantization does not work, yet
```
Doesn't work on Mac, but should work on CUDA, no?
Yes and no. You can train with 4-bit-quant on CUDA and ROCm, but you cannot do anything with the result of the training. The rest of our tool chain cannot deal with quantized models. See the ticket reference next to the `--4-bit-quant` option.
It does work on Mac but not on others, right? The comment is confusing.
The linked ticket has more information. I have changed the comment to make it less confusing.
tl;dr it's not usable under Linux, yet.
```shell
$ lab train --device invalid
Usage: lab train [OPTIONS]
Try 'lab train --help' for help.

Error: Invalid value for '--device': Only 'cpu', 'cuda', cuda with device index ('cuda:0') are currently supported.
```

```shell
$ lab train --4-bit-quant
Usage: lab train [OPTIONS]
Try 'lab train --help' for help.

Error: --4-bit-quant option requires --device=cuda
```

Signed-off-by: Christian Heimes <cheimes@redhat.com>
@JamesKunstle My latest commit addresses three remarks.
Signed-off-by: Christian Heimes <cheimes@redhat.com>
The old named tuple is gone and the code now lives in `cli.lab`. Signed-off-by: Christian Heimes <cheimes@redhat.com>
Why does #520 (comment) require setting both
It's complicated... I mentioned before that we have to expose several additional parameters for training on Linux to tune the training for the hardware. The current settings work for CUDA compute level >= 8.0 and AMD ROCm >= 10.3 (recent workstation and consumer GPUs with sufficient memory). @jjasghar has a GPU with CUDA compute level 7.5. I strongly recommend keeping the current settings and looking into different settings in a future PR.
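To make the trade-off concrete, hardware-dependent selection of these settings could look roughly like the sketch below. This is an illustration of the idea, not code from this PR; the helper name, output path, and thresholds are assumptions.

```python
import torch
from transformers import TrainingArguments


def precision_flags(device: torch.device) -> dict:
    """Pick mixed-precision flags for the detected hardware (hypothetical helper)."""
    if device.type != "cuda":
        return {"fp16": False, "bf16": False}
    major, _minor = torch.cuda.get_device_capability(device.index or 0)
    if major >= 8:
        # Ampere or newer: bfloat16 is supported.
        return {"fp16": False, "bf16": True}
    # Older cards such as compute level 7.5: fall back to full precision.
    return {"fp16": False, "bf16": False}


training_args = TrainingArguments(
    output_dir="training_results",        # assumed output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # simulate a larger effective batch
    gradient_checkpointing=True,          # trade compute for lower memory use
    **precision_flags(torch.device("cuda")),
)
```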
`lab train` now supports NVidia CUDA and AMD ROCm devices to speed up training. GPU acceleration is enabled with `lab train --device cuda`. The `--device` argument takes the same values as the `torch.device()` API. Values like `cuda:0` for the first CUDA card work, too. The default value is `cpu`. Other possible values are ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone.

GPU acceleration requires a lot of GPU memory. On AMD GPUs, memory consumption peaks at about 17 GiB. Training on GPU has been successfully tested on