Support GPU offloading for lab train on Linux #520
Working here for me too, also with an RTX 3090. Do we have any idea what will happen with a GPU with insufficient VRAM? (I.e., a 12 GB card of some sort -- I have one of those lying around too.)
I can provoke a memory error by running
cli/train/linux_train.py (outdated)

```
@@ -202,7 +255,7 @@ def model_generate(user):
    per_device_train_batch_size=per_device_train_batch_size,
    bf16=True,
```
I am training on an Nvidia H100. When I tried using `lab train --device cuda --4-bit-quant`, the inference step would fail with the stack trace below. When I changed these lines as follows, it seemed to help:

```python
fp16=args.use_bitsandbytes,
bf16=not args.use_bitsandbytes,
```
Stack trace for reference:
```
LINUX_TRAIN.PY: RUNNING INFERENCE ON THE OUTPUT MODEL
Traceback (most recent call last):
File "/workspace/venv/lib/python3.10/site-packages/cli/train/linux_train.py", line 308, in <module>
main()
File "/workspace/venv/lib/python3.10/site-packages/cli/train/linux_train.py", line 286, in main
model_generate(d["user"]).split(response_template.strip())[-1].strip()
File "/workspace/venv/lib/python3.10/site-packages/cli/train/linux_train.py", line 201, in model_generate
outputs = model.generate(
File "/workspace/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
return self.sample(
File "/workspace/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2696, in sample
outputs = self(
File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1170, in forward
logits = self.lm_head(hidden_states)
File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16
```
I'm not really sure what the implications are of changing those dtypes, but I just tried to make it match the `bnb_4bit_compute_dtype`.
You aren't the first person to run into this problem with an NVidia card. I haven't been able to reproduce it with AMD ROCm.
I ran into this issue with an AMD GPU. Doing some Googling, I found this: huggingface/peft#1515 (comment). Adding that context manager to `model_generate()` like this:
```python
def model_generate(user):
    text = create_prompt(user=user)
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(args.device)
    with torch.cuda.amp.autocast():
        outputs = model.generate(
            input_ids=input_ids,
            max_new_tokens=256,
            pad_token_id=tokenizer.eos_token_id,
            temperature=0.7,
            top_p=0.9,
            stopping_criteria=stopping_criteria,
            do_sample=True,
        )
    return tokenizer.batch_decode([o[:-1] for o in outputs])[0]
```
Seemed to make everything happy.
16-bit floats seem to be much slower on AMD ROCm (from 4.5 it/s with 4-bit quantization to 2.5 it/s). Does this change work for you?

```python
fp16 = args.use_bitsandbytes and torch.version.cuda is not None

training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=args.num_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    fp16=fp16,
    bf16=not fp16,
    ...
)
```
By the way, I have created a container file for ROCm with all dependencies (torch, llama-cpp, bitsandbytes) pre-installed.
For those of us with less powerful GPUs, is there some way to offload some of the work to the GPU based on available memory while the rest runs on the CPU? Let's say I have an 8 GB or 12 GB Nvidia card but would like some level of speedup compared to running only on the CPU?
I now have training working with 4-bit quantization. On my Radeon 7900 XT it's less than half the performance of non-quantized training (2.5 it/s vs. 6 to 6.5 it/s). The next problem is the converter: I don't know how to address it. Therefore I'm going to hide the 4-bit quantization option and let somebody else address the converter.
A 12 GB card may be sufficient once 4-bit quantization is working correctly. Until then you need a card with at least 17 GiB of memory -- at least for AMD ROCm. I don't have access to NVidia hardware and cannot test whether CUDA supports shared memory.
Had not tried it, but just did, and it blew up on me with a lengthy backtrace, which I've ascertained is because I hadn't yet pip installed bitsandbytes. With that now installed, it does seem to be making some progress. It did spit out this table at me:
And here's nvidia-smi output, if it's of any interest:
I'll give that a look if this falls down and goes boom.
Signed-off-by: Christian Heimes <cheimes@redhat.com>
I believe everything that was expected to work did, but then I hit #579.
Tried it, fails, memory allocation error. Looks like 12 GB isn't sufficient without the --4-bit-quant option.
Thanks @jarodwilson - it's good to know the 4-bit quantization works for you, and yes, we'll have to get #579 solved for sure. I don't think it will be too hard to fix, although it may involve a bit of loss during the conversion process.
FWIW, I'm holding off merges from main until main is fixed. The tip of main is currently broken on Linux, see #673. I have submitted fixes for the bug and CI.
At the end of last week, I used a version of this along with the notebook from #617 on Google Colab and it worked great. Kudos! 👏
I tried it on an A100 40GB on an OpenShift AI pod. It worked great!
see inline.
Just took a quick look to see how the defaults look (mostly for Mac). I would think defaulting to GPU, if found, would be nice; example inline comment. Is cpu the more appropriate default? I think we already default to using the GPU for generate, so why not use it for train?
"--device", | ||
type=TORCH_DEVICE, | ||
show_default=True, | ||
default="cpu", |
Not sure, but I'd probably default to auto-detect mps or cuda. I'm used to using libraries that do that for me, just not sure if there's something different about our use case. Why not?
I'm used to something like this:
https://github.com/UKPLab/sentence-transformers/blob/e6af66fcb3acf09c95695f436cc7a5bb0320fdd4/sentence_transformers/util.py#L595
For now training requires a recent GPU with lots of free VRAM and support for fp16 or bfloat16. I can just about fit the training data into 20 GiB VRAM. Others had OOM with 24 GiB VRAM. I would have to detect whether you have sufficient VRAM, whether you run on WSL2 (which can use USM), whether your card has a sufficient CUDA level / HIP level, and so on.

For now I'm sticking to the Zen of Python "Explicit is better than implicit" and keeping the default as CPU. Somebody else can work on auto-detection after my PR has landed.
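For reference, auto-detection along the lines of the sentence-transformers helper linked above might look roughly like the sketch below. This is an illustration only, not part of this PR; the function name and fallback order are assumptions.

```python
import torch


def detect_default_device() -> str:
    """Pick a default torch device string (hypothetical helper)."""
    if torch.cuda.is_available():
        # True for both NVidia CUDA and AMD ROCm builds of PyTorch.
        return "cuda"
    if torch.backends.mps.is_available():
        # Apple Metal Performance Shaders on macOS.
        return "mps"
    return "cpu"
```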
@markstur Could you please file tickets for auto-detection, `mps`, and `npu` device support, so we have a place to discuss your ideas and don't forget them? The discussion on this PR is already long and hard to follow.
I think this flag should only accept options that we explicitly support:
This will put guard-rails around the backends we reasonably support so we can roll out future stuff as CI improves for different backends.
Could also add mutual exclusion between `--device=mps` and `--4-bit-quant`, and other exclusivities (4-bit might not be mutually exclusive forever; I just know that we can't use bitsandbytes for torch on Metal).
The input values for the `--device` option are more complex than just `cuda` or `cpu`. Users can also supply `cuda:1` for the second GPU card in their system. The custom click type `TorchDeviceParam` performs validation and refuses unsupported values. I could add an additional check and only accept devices of type `cpu` and `cuda` for now.
```shell
$ lab train --device invalid
Usage: lab train [OPTIONS]
Try 'lab train --help' for help.

Error: Invalid value for '--device': Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: invalid
```
`--4-bit-quant` is handled at the beginning of the function:

```python
if four_bit_quant and device.type != "cuda":
    raise click.ClickException("--4-bit-quant requires CUDA device")
```

```shell
$ lab train --4-bit-quant
Error: --4-bit-quant requires CUDA device
```
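For illustration, a restricted device parameter type could be sketched roughly like this. It is not the PR's actual `TorchDeviceParam` implementation; the class name and error wording are made up.

```python
import click
import torch


class RestrictedTorchDevice(click.ParamType):
    """Click parameter type that only accepts cpu and cuda[:N] devices (sketch)."""

    name = "torch_device"
    supported_types = ("cpu", "cuda")

    def convert(self, value, param, ctx):
        if isinstance(value, torch.device):
            return value
        try:
            device = torch.device(value)
        except RuntimeError as exc:
            # torch.device() rejects unknown device strings with a RuntimeError.
            self.fail(str(exc), param, ctx)
        if device.type not in self.supported_types:
            self.fail(
                f"Only {', '.join(self.supported_types)} devices are currently supported.",
                param,
                ctx,
            )
        return device
```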
I ran it on a CentOS 8 Stream box with the proprietary 550 Nvidia drivers and a Tesla DC GPU.
@jjasghar According to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities bfloat16 requires CUDA compute capability >= 8.0.
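For anyone checking their own card, the compute capability can be queried from PyTorch. A quick sketch (assumes a CUDA-enabled build of torch):

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    # bfloat16 needs compute capability >= 8.0 (Ampere or newer);
    # Turing-class cards report 7.5, for example.
    print(f"compute capability {major}.{minor}, bf16 usable: {major >= 8}")
```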
To follow up here, with the help of @tiran I made these changes:

```diff
index 6b427f2..f4b2f79 100644
--- a/cli/train/linux_train.py
+++ b/cli/train/linux_train.py
@@ -242,8 +242,8 @@ def linux_train(
         output_dir=output_dir,
         num_train_epochs=num_epochs,
         per_device_train_batch_size=per_device_train_batch_size,
-        fp16=use_fp16,
-        bf16=not use_fp16,
+        fp16=False,
+        bf16=False,
         # use_ipex=True, # TODO CPU test this possible optimization
         use_cpu=model.device.type == "cpu",
         save_strategy="epoch",
@@ -253,8 +253,8 @@ def linux_train(
         # https://stackoverflow.com/a/75793317
         # torch_compile=True,
         # fp16=False, # fp16 increases memory consumption 1.5x
-        # gradient_accumulation_steps=8,
-        # gradient_checkpointing=True,
+        gradient_accumulation_steps=8,
+        gradient_checkpointing=True,
         # eval_accumulation_steps=1,
         # per_device_eval_batch_size=1,
     )
```

And I'm able to train now.
A couple of nits, specifically logging and some comment questions.
"--device", | ||
type=TORCH_DEVICE, | ||
show_default=True, | ||
default="cpu", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this flag should only accept options that we explicitly support:
This will put guard-rails around the backends we reasonably support so we can roll out future stuff as CI improves for different backends.
"--device", | ||
type=TORCH_DEVICE, | ||
show_default=True, | ||
default="cpu", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also add mutual exclusion to --device=mps
and --4-bit-quant
and other exclusivities (4-bit might not be mutually exclusive forever, I just know that we can't use bitsandbytes for torch on metal.
```python
def _gib(size: int) -> str:
    return "{:.1f} GiB".format(size / 1024**3)


for idx in range(torch.cuda.device_count()):
```
does this API behave differently for CUDA vs. ROCm devices?
If not (i.e. cuda will be positive for ROCm as well), then we should note it.
PyTorch treats ROCm devices like a CUDA device. I have mentioned this in updated docs and it's explained in PyTorch docs, too.
There are differences between ROCm and CUDA, but the high level APIs work the same.
The docstring of the function is `Report CUDA/ROCm device properties`. That should be enough, shouldn't it?
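For illustration, a report built around the `_gib` helper above could look roughly like this sketch; it is not the PR's exact function. On ROCm builds of PyTorch, `torch.version.hip` is set instead of `torch.version.cuda`, but the `torch.cuda` calls behave the same.

```python
import torch


def _gib(size: int) -> str:
    return "{:.1f} GiB".format(size / 1024**3)


def report_cuda_devices() -> None:
    """Report CUDA/ROCm device properties (illustrative sketch)."""
    backend = "ROCm/HIP" if torch.version.hip else "CUDA"
    print(f"{backend} devices: {torch.cuda.device_count()}")
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        free, total = torch.cuda.mem_get_info(idx)
        print(f"  cuda:{idx}: {props.name}, {_gib(free)} free of {_gib(total)}")
```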
> **Note:** This is explicitly forcing the build to use the ROCm compilers and prefix path for dependency resolution in the CMake build. This works around an issue in the CMake and ROCm version in Fedora 39 and below and is fixed in Fedora 40. With Fedora 40's ROCm packages, use `CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DAMDGPU_TARGETS=gfx1100"` instead.

Once that package is installed, recompile `lab` with `pip3 install .`. You also need to tell `HIP` which GPU to use - you can find this out via `rocminfo`, although it is typically GPU 0. To set which device is visible to HIP, we'll set `export HIP_VISIBLE_DEVICES=0` for GPU 0. You may also have to set `HSA_OVERRIDE_GFX_VERSION` to override ROCm's GFX version detection, for example `export HSA_OVERRIDE_GFX_VERSION=10.3.0` to force an unsupported `gfx1032` card to use the supported `gfx1030` version. The environment variable `AMD_LOG_LEVEL` enables debug logging of ROCm libraries, for example `AMD_LOG_LEVEL=3` to print API calls to stderr.
Could recommend the flag `pip3 install . --force-reinstall --no-cache` as a more forceful route. I've had to do this before.
That would be wrong, because it would override the ROCm builds of PyTorch and llama-cpp-python with CUDA builds plus 4 GB of CUDA libs. `--force-reinstall` ignores all installed packages and overrides them with new copies. Please trust me on this: I have been packaging Python software since before setuptools was first released in 2004.

Actually, all recommendations for `--force-reinstall --no-cache` should be replaced by better instructions. I'll file a ticket tomorrow.
++ Okay cool, I'll defer to your expertise on that then. It's worked for me for some things, but it's probably like adding stuff to a path manually: it works, but it's bad practice.
pip isn't easy to tame and some options are confusing. For example, you need to know that some parts of pip do not normalize the package name `llama-cpp-python` to its canonical form `llama_cpp_python`. Or that the option `--no-binary=llama_cpp_python` only affects downloads but not the local wheel cache.
I recommend:

1. Remove the `venv` and start with a fresh `venv`.
2. Clear the local wheel cache with `pip cache purge` or `pip cache remove llama_cpp_python` (sic!). `pip cache purge` only purges the wheel cache for locally built packages, but keeps the HTTP cache with downloaded wheels and sdists (also a very confusing fact). This way you don't download GBs of packages over and over again.
3. Build and install all dependencies that need special care:
   - build and install `llama-cpp-python`
   - download and install PyTorch for your GPU flavor
4. Finally, install the remaining dependencies from `requirements.txt` or `pyproject.toml`. pip recognizes that `llama-cpp-python` and `torch` are already present and won't override them -- unless there is a version conflict.

These four steps give you the smallest virtual env (no CUDA libs if you build for ROCm or CPU) and don't download the same package from PyPI several times.
requirements.txt (outdated)

```
@@ -20,3 +20,5 @@ torch>=2.2.1,<3.0.0
peft
datasets
trl
# 4-bit quantization does not work, yet
```
Doesn't work on Mac, but should work on CUDA, no?
Yes and no. You can train with 4-bit-quant on CUDA and ROCm, but you cannot do anything with the result of the training. The rest of our tool chain cannot deal with quantized models. See the ticket reference next to the `--4-bit-quant` option.
It does work on Mac but not on others, right? The comment is confusing.
The linked ticket has more information. I have changed the comment to make it less confusing.
tl;dr it's not usable under Linux, yet.
```shell
$ lab train --device invalid
Usage: lab train [OPTIONS]
Try 'lab train --help' for help.

Error: Invalid value for '--device': Only 'cpu', 'cuda', cuda with device index ('cuda:0') are currently supported.
```

```shell
$ lab train --4-bit-quant
Usage: lab train [OPTIONS]
Try 'lab train --help' for help.

Error: --4-bit-quant option requires --device=cuda
```

Signed-off-by: Christian Heimes <cheimes@redhat.com>
@JamesKunstle My latest commit addresses three remarks.
Signed-off-by: Christian Heimes <cheimes@redhat.com>
The old named tuple is gone and the code now lives in `cli.lab`. Signed-off-by: Christian Heimes <cheimes@redhat.com>
Why does #520 (comment) require setting both
It's complicated... I mentioned before that we have to expose several additional parameters for training on Linux to tune the training for the hardware. The current settings work for CUDA compute level >= 8.0 and AMD ROCm >= 10.3 (recent workstation and consumer GPUs with sufficient memory). @jjasghar has a GPU with CUDA compute level 7.5. I strongly recommend keeping the current settings and looking into different settings in a future PR.
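To make the trade-off concrete, hardware-dependent selection of these settings could look roughly like the sketch below. This is an illustration of the idea, not code from this PR; the helper name, output path, and thresholds are assumptions.

```python
import torch
from transformers import TrainingArguments


def precision_flags(device: torch.device) -> dict:
    """Pick mixed-precision flags for the detected hardware (hypothetical helper)."""
    if device.type != "cuda":
        return {"fp16": False, "bf16": False}
    major, _minor = torch.cuda.get_device_capability(device.index or 0)
    if major >= 8:
        # Ampere or newer: bfloat16 is supported.
        return {"fp16": False, "bf16": True}
    # Older cards such as compute level 7.5: fall back to full precision.
    return {"fp16": False, "bf16": False}


training_args = TrainingArguments(
    output_dir="training_results",        # assumed output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # simulate a larger effective batch
    gradient_checkpointing=True,          # trade compute for lower memory use
    **precision_flags(torch.device("cuda")),
)
```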
`lab train` now supports NVidia CUDA and AMD ROCm devices to speed up training. GPU acceleration is enabled with `lab train --device cuda`. The `--device` argument takes the same values as the `torch.device()` API. Values like `cuda:0` for the first CUDA card work, too. The default value is `cpu`. Other possible values are ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone.

GPU acceleration requires a lot of GPU memory. On AMD GPUs, memory consumption peaks at about 17 GiB. Training on GPU has been successfully tested on