
Conversation

@tiran tiran (Contributor) commented Mar 11, 2024

lab train now supports NVidia CUDA and AMD ROCm devices to speed up training.
GPU acceleration is enabled with lab train --device cuda.

The --device argument accepts the same values as the torch.device() API.
Values like cuda:0 for the first CUDA card work, too. The default value is
cpu. Other possible values are ipu, xpu, mkldnn, opengl, opencl,
ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia,
privateuseone.
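For illustration, a minimal sketch (an assumption, not the PR's actual implementation) of how such a device string can be validated and resolved with the torch.device() API:

```python
# Hedged sketch: validate a --device string the same way torch.device() does.
import torch

def resolve_device(spec: str = "cpu") -> torch.device:
    device = torch.device(spec)  # raises RuntimeError for unknown device types
    if device.type == "cuda" and not torch.cuda.is_available():
        raise RuntimeError(f"'{spec}' requested, but no CUDA/ROCm device is available")
    return device

print(resolve_device("cuda:0"))  # e.g. device(type='cuda', index=0) on a GPU machine
```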

GPU acceleration requires a lot of GPU memory. On AMD GPUs, memory
consumption peaks at about 17 GiB. Training on GPU has been successfully
tested on:

  • NVidia GeForce RTX 3090 (24 GiB), Fedora 39, PyTorch 2.2.1 CUDA 12.1
  • Radeon RX 7900 XT (20 GiB), Fedora 39, PyTorch 2.2.1+rocm5.7
  • Radeon RX 7900 XTX (24 GiB), Fedora 39, PyTorch 2.2.1+rocm5.7

@jaideepr97 (Member)

@mbestavros

@tiran tiran force-pushed the linux-train-gpu branch 2 times, most recently from 734dc48 to 325e791 on March 12, 2024 07:21
@tiran tiran changed the title from "[WIP] Support GPU offloading for lab train on Linux" to "Support GPU offloading for lab train on Linux" on Mar 12, 2024
@tiran tiran marked this pull request as ready for review on March 12, 2024 07:23
@tiran tiran force-pushed the linux-train-gpu branch 2 times, most recently from 927d74e to 334a37f on March 12, 2024 11:01
@jarodwilson

Working here for me too, also with an RTX 3090. Do we have any idea what will happen with a GPU with insufficient VRAM (e.g., a 12GB card of some sort -- I have one of those lying about too)?

@tiran tiran (Contributor Author) commented Mar 12, 2024

I can provoke a memory error by running lab serve, lab chat, and lab train in parallel. Training fails at a random place with this error message:

torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 19.98 GiB of which 0 bytes is free. Of the allocated memory 14.53 GiB is allocated by PyTorch, and 9.25 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
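Before starting training it can help to check how much VRAM is actually free, since lab serve and lab chat keep memory allocated. A minimal sketch (not part of this PR) using PyTorch's allocator query, which works for both CUDA and ROCm builds:

```python
# Hedged sketch: report free/total VRAM per visible GPU before training.
import torch

if torch.cuda.is_available():
    for idx in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(idx)
        print(f"cuda:{idx}: {free / 1024**3:.1f} GiB free of {total / 1024**3:.1f} GiB")
else:
    print("No CUDA/ROCm device visible")
```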

@@ -202,7 +255,7 @@ def model_generate(user):
per_device_train_batch_size=per_device_train_batch_size,
bf16=True,

I am training on an Nvidia H100. When I tried using lab train --device cuda --4-bit-quant, the inference step would fail with the stack trace below. When I changed this line to the following, it seemed to help:

        fp16=args.use_bitsandbytes,
        bf16=not args.use_bitsandbytes,

Stack trace for reference:

LINUX_TRAIN.PY: RUNNING INFERENCE ON THE OUTPUT MODEL
Traceback (most recent call last):
  File "/workspace/venv/lib/python3.10/site-packages/cli/train/linux_train.py", line 308, in <module>
    main()
  File "/workspace/venv/lib/python3.10/site-packages/cli/train/linux_train.py", line 286, in main
    model_generate(d["user"]).split(response_template.strip())[-1].strip()
  File "/workspace/venv/lib/python3.10/site-packages/cli/train/linux_train.py", line 201, in model_generate
    outputs = model.generate(
  File "/workspace/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/workspace/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2696, in sample
    outputs = self(
  File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/workspace/venv/lib/python3.10/site-packages/transformers/models/mistral/modeling_mistral.py", line 1170, in forward
    logits = self.lm_head(hidden_states)
  File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/workspace/venv/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: float != c10::BFloat16


I'm not really sure what the implications are of changing those dtypes but I just tried to make it match the bnb_4bit_compute_dtype.

tiran (Contributor Author)

You aren't the first person to run into this problem with an NVidia card. I haven't been able to reproduce it with AMD ROCm.

@cheesesashimi cheesesashimi (Contributor) commented Mar 12, 2024

I ran into this issue with an AMD GPU. Doing some Googling, I found this: huggingface/peft#1515 (comment).

Adding that context manager to model_generate() like this:

    def model_generate(user):
        text = create_prompt(user=user)

        input_ids = tokenizer(text, return_tensors="pt").input_ids.to(args.device)
        with torch.cuda.amp.autocast():
            outputs = model.generate(
                input_ids=input_ids,
                max_new_tokens=256,
                pad_token_id=tokenizer.eos_token_id,
                temperature=0.7,
                top_p=0.9,
                stopping_criteria=stopping_criteria,
                do_sample=True,
            )
            return tokenizer.batch_decode([o[:-1] for o in outputs])[0]

Seemed to make everything happy.

tiran (Contributor Author)

16-bit floats seem to be much slower on AMD ROCm (from 4.5 it/s with 4-bit quantization down to 2.5 it/s).

Does this change work for you?

    fp16 = args.use_bitsandbytes and torch.version.cuda is not None
    training_arguments = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=args.num_epochs,
        per_device_train_batch_size=per_device_train_batch_size,
        fp16=fp16,
        bf16=not fp16,
        ...
    )

@tiran tiran (Contributor Author) commented Mar 12, 2024

By the way, I have created a Containerfile for ROCm with all dependencies (torch, llama-cpp, bitsandbytes) pre-installed:
https://gitlab.cee.redhat.com/cheimes/instructlab-rocm

@bbrowning (Contributor)

For those of us with less powerful GPUs, is there some way to offload some of the work to the GPU based on available memory, while the rest takes place on the CPU? Let's say I have an 8GB or 12GB Nvidia card but would like some level of speedup compared to running only on CPU.

@tiran tiran (Contributor Author) commented Mar 13, 2024

I now have training working with 4-bit quantization. On my Radeon 7900 XT it's less than half the performance of non-quantized training (2.5 it/s vs. 6 to 6.5 it/s). The next problem is llamacpp_convert_to_gguf.py. The converter does not support quantized models yet. KeyError: 'U8' is just a symptom of a bigger issue. For example, a quantized model has blk.0.attn_norm.weight while the code expects blk.0.attn_q.weight. It also has a lot of unexpected tensors:

Unexpected tensor name: model.layers.0.mlp.down_proj.weight.nested_absmax - skipping
Unexpected tensor name: model.layers.0.mlp.down_proj.weight.nested_quant_map - skipping
Unexpected tensor name: model.layers.0.mlp.down_proj.weight.quant_map - skipping
Unexpected tensor name: model.layers.0.mlp.gate_proj.weight.nested_absmax - skipping
Unexpected tensor name: model.layers.0.mlp.gate_proj.weight.nested_quant_map - skipping
Unexpected tensor name: model.layers.0.mlp.gate_proj.weight.quant_map - skipping
Unexpected tensor name: model.layers.0.mlp.up_proj.weight.nested_absmax - skipping
Unexpected tensor name: model.layers.0.mlp.up_proj.weight.nested_quant_map - skipping
Unexpected tensor name: model.layers.0.mlp.up_proj.weight.quant_map - skipping

I don't know how to address the problem. Therefore I'm going to hide the 4-bit quantization option and let somebody else address the converter.

@tiran tiran (Contributor Author) commented Mar 13, 2024

> For those of us with less powerful GPUs, is there some way to offload some of the work to the GPU based on available memory, while the rest takes place on the CPU? Let's say I have an 8GB or 12GB Nvidia card but would like some level of speedup compared to running only on CPU.

A 12 GB card may be sufficient once 4-bit quantization is working correctly. Until then you need a card with at least 17 GiB of memory -- at least for AMD ROCm. I don't have access to NVidia hardware and cannot test whether CUDA supports shared memory.

@jarodwilson

> Just to confirm, lab train --device cuda fails spectacularly on a GeForce RTX 4070 SUPER (12GB VRAM) under Fedora 39. The lab generate step was actually faster than on my 3090, I believe, but lack of fail-over to system memory like you get within the WSL2 environment definitely makes it harder to utilize directly under Linux.

> @jarodwilson Did you try with the 4-bit-quant argument to enable quantization? In my test setups just using lab train --device cuda --4-bit-quant may work for you, assuming you have at least 10GB of free GPU memory.

Had not tried it, but just did, and it blew up on me with a lengthy backtrace, which I've ascertained is because I hadn't yet pip installed bitsandbytes. With that now installed, it does seem to be making some progress.

It did spit out this table at me:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   4446 MiB |   4446 MiB |  18175 MiB |  13729 MiB |
|       from large pool |   4340 MiB |   4340 MiB |  17988 MiB |  13648 MiB |
|       from small pool |    106 MiB |    107 MiB |    187 MiB |     81 MiB |
|---------------------------------------------------------------------------|
| Active memory         |   4446 MiB |   4446 MiB |  18175 MiB |  13729 MiB |
|       from large pool |   4340 MiB |   4340 MiB |  17988 MiB |  13648 MiB |
|       from small pool |    106 MiB |    107 MiB |    187 MiB |     81 MiB |
|---------------------------------------------------------------------------|
| Requested memory      |   4446 MiB |   4446 MiB |  18175 MiB |  13729 MiB |
|       from large pool |   4340 MiB |   4340 MiB |  17988 MiB |  13648 MiB |
|       from small pool |    106 MiB |    107 MiB |    187 MiB |     81 MiB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   4508 MiB |   4508 MiB |   8256 MiB |   3748 MiB |
|       from large pool |   4400 MiB |   4400 MiB |   8140 MiB |   3740 MiB |
|       from small pool |    108 MiB |    108 MiB |    116 MiB |      8 MiB |
|---------------------------------------------------------------------------|
| Non-releasable memory |      0 B   |      0 B   |      0 B   |      0 B   |
|       from large pool |      0 B   |      0 B   |      0 B   |      0 B   |
|       from small pool |      0 B   |      0 B   |      0 B   |      0 B   |
|---------------------------------------------------------------------------|
| Allocations           |    1284    |    1508    |    3396    |    2112    |
|       from large pool |     290    |     290    |     610    |     320    |
|       from small pool |     994    |    1218    |    2786    |    1792    |
|---------------------------------------------------------------------------|
| Active allocs         |    1284    |    1508    |    3396    |    2112    |
|       from large pool |     290    |     290    |     610    |     320    |
|       from small pool |     994    |    1218    |    2786    |    1792    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize allocations  |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Oversize GPU segments |       0    |       0    |       0    |       0    |
|===========================================================================|

And here's nvidia-smi output, if it's of any interest:

$ nvidia-smi 
Mon Mar 18 16:24:05 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:65:00.0  On |                  N/A |
| 34%   62C    P2            196W /  220W |   11107MiB /  12282MiB |     89%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2532      G   /usr/bin/gnome-shell                          333MiB |
|    0   N/A  N/A      3233      G   /usr/bin/Xwayland                             461MiB |
|    0   N/A  N/A      3333      G   /usr/libexec/xdg-desktop-portal-gnome           5MiB |
|    0   N/A  N/A      3530      G   ...AAAAAAAACAAAAAAAAAA= --shared-files          7MiB |
|    0   N/A  N/A      3944      G   ...yOnDemand --variations-seed-version         74MiB |
|    0   N/A  N/A      4187      G   /usr/lib64/firefox/firefox                    309MiB |
|    0   N/A  N/A     40724      G   ...sion,SpareRendererForSitePerProcess         49MiB |
|    0   N/A  N/A    229479      C   ...ts/instruct-lab/venv/bin/python3.11       9636MiB |
+-----------------------------------------------------------------------------------------+

> If that doesn't, you can check out the gradient changes I showed in #520 (comment) - enabling gradient accumulation should definitely get you going under the memory limit.

I'll give that a look if this falls down and goes boom.

Signed-off-by: Christian Heimes <cheimes@redhat.com>
@tiran tiran force-pushed the linux-train-gpu branch from 31bdd51 to 80a0f45 on March 18, 2024 21:00
@jarodwilson

> Just to confirm, lab train --device cuda fails spectacularly on a GeForce RTX 4070 SUPER (12GB VRAM) under Fedora 39. The lab generate step was actually faster than on my 3090, I believe, but lack of fail-over to system memory like you get within the WSL2 environment definitely makes it harder to utilize directly under Linux.

> @jarodwilson Did you try with the 4-bit-quant argument to enable quantization? In my test setups just using lab train --device cuda --4-bit-quant may work for you, assuming you have at least 10GB of free GPU memory.

> Had not tried it, but just did, and it blew up on me with a lengthy backtrace, which I've ascertained is because I hadn't yet pip installed bitsandbytes. With that now installed, it does seem to be making some progress.

I believe everything that was expected to work did, but then I hit #579.

> If that doesn't, you can check out the gradient changes I showed in #520 (comment) - enabling gradient accumulation should definitely get you going under the memory limit.

> I'll give that a look if this falls down and goes boom.

Tried it, fails, memory allocation error. Looks like 12GB isn't sufficient w/o the --4-bit-quant option.

@bbrowning (Contributor)

Thanks @jarodwilson - it's good to know the 4-bit quantization works for you, and yes, we'll have to get #579 solved for sure. I don't think it will be too hard to fix, although it may involve a bit of loss during the conversion process.

@tiran tiran (Contributor Author) commented Mar 19, 2024

FWIW, I'm holding off merges from main until main is fixed. The tip of main is currently broken on Linux, see #673. I have submitted fixes for the bug and CI.

@grdryn grdryn (Contributor) commented Mar 19, 2024

At the end of last week, I used a version of this along with the notebook from #617 on Google Colab and it worked great. Kudos! 👏

@tiran tiran force-pushed the linux-train-gpu branch from bce1a0d to 657487e on March 20, 2024 07:33
@tiran tiran force-pushed the linux-train-gpu branch from 657487e to bbcfe7e on March 20, 2024 07:49
@tuhinsharma121

I tried it on an A100 40GB on an OpenShift AI pod. It worked great!

@xukai92 xukai92 added this to the March 21 Musts milestone Mar 20, 2024
@markstur markstur (Member) left a comment

See inline. Just took a quick look to see how the default looks (mostly for Mac). I would think defaulting to GPU, if found, would be nice; example inline comment. Is cpu a more appropriate default? I think we already default to using the GPU for generate, so why not use it for train?

"--device",
type=TORCH_DEVICE,
show_default=True,
default="cpu",
markstur (Member)

Not sure, but I'd probably default to auto-detect mps or cuda. I'm used to using libraries that do that for me, just not sure if there's something different about our use case. Why not?

@tiran tiran (Contributor Author) commented Mar 20, 2024

For now training requires a recent GPU with lots of free VRAM and support for fp16 or bfloat16. I can just about fit the training data into 20 GiB of VRAM. Others had OOM with 24 GiB of VRAM. I would have to detect whether you have sufficient VRAM, whether you run on WSL2 (which can use USM), whether your card has a sufficient CUDA level / HIP level, and so on.

For now I'm sticking to the Zen of Python -- "Explicit is better than implicit" -- and keeping the default at cpu. Somebody else can work on auto-detection after my PR has landed.
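For reference, a minimal sketch of what such auto-detection might look like (an assumption, not part of this PR; the preference order is illustrative):

```python
# Hedged sketch: pick a default torch device when the user did not specify one.
import torch

def autodetect_device() -> torch.device:
    if torch.cuda.is_available():          # covers both CUDA and ROCm builds
        return torch.device("cuda", 0)
    if torch.backends.mps.is_available():  # Apple Metal
        return torch.device("mps")
    return torch.device("cpu")
```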

tiran (Contributor Author)

@markstur Could you please file tickets for auto-detection, mps, and npu device support, so we have a place to discuss your ideas and don't forget them? The discussion on this PR is already long and hard to follow.

JamesKunstle (Contributor)

I think this flag should only accept options that we explicitly support:

finite options for click flag

This will put guard-rails around the backends we reasonably support so we can roll out future stuff as CI improves for different backends.

JamesKunstle (Contributor)

Could also add mutual exclusion to --device=mps and --4-bit-quant and other exclusivities (4-bit might not be mutually exclusive forever, I just know that we can't use bitsandbytes for torch on Metal).

tiran (Contributor Author)

The input values for the --device option are more complex than just cuda or cpu. Users can also supply cuda:1 for the second GPU card in their system. The custom click type TorchDeviceParam performs validation and refuses unsupported values. I could add an additional check and only accept devices of type cpu and cuda for now.

$ lab train --device invalid
Usage: lab train [OPTIONS]
Try 'lab train --help' for help.

Error: Invalid value for '--device': Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: invalid

--4-bit-quant is handled at the beginning of the function:

    if four_bit_quant and device.type != "cuda":
        raise click.ClickException("--4-bit-quant requires CUDA device")
$ lab train --4-bit-quant
Error: --4-bit-quant requires CUDA device
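For illustration, a minimal sketch of such a restricted click parameter type (the names TorchDeviceParam and TORCH_DEVICE appear in this PR, but the body below is an assumption, not the actual implementation):

```python
# Hedged sketch: a click parameter type that only accepts cpu/cuda torch devices.
import click
import torch

class TorchDeviceParam(click.ParamType):
    name = "deviceinfo"
    supported_types = ("cpu", "cuda")

    def convert(self, value, param, ctx) -> torch.device:
        try:
            device = torch.device(value)
        except RuntimeError as e:
            self.fail(str(e), param, ctx)
        if device.type not in self.supported_types:
            self.fail(
                f"device type '{device.type}' is not supported, "
                f"expected one of {', '.join(self.supported_types)}",
                param,
                ctx,
            )
        return device

TORCH_DEVICE = TorchDeviceParam()
```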

@jjasghar (Member)

LINUX_TRAIN.PY: Using device 'TorchDeviceInfo(type='cuda', index=0, device_map={'': 0})'
  NVidia CUDA version: 12.1
  AMD ROCm HIP version: n/a
  cuda:0 is 'Tesla T4' (14.5 GiB of 14.6 GiB free, capability: 7.5)

I ran it on a CentOS 8 Stream box with the proprietary 550 Nvidia drivers and a Tesla DC GPU.

lab train --device cuda failed with:

ValueError: Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0

@tiran tiran (Contributor Author) commented Mar 20, 2024

@jjasghar According to https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities bfloat16 requires CUDA compute capability >= 8.0.
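A quick local check -- a minimal sketch, assuming a CUDA build of PyTorch -- is to query the device's compute capability and fall back to fp16 when bf16 is unsupported:

```python
# Hedged sketch: enable bf16 only when the GPU's compute capability supports it.
import torch

if torch.cuda.is_available() and torch.version.cuda is not None:
    major, minor = torch.cuda.get_device_capability(0)
    use_bf16 = major >= 8            # Ampere (8.0) or newer
    use_fp16 = not use_bf16          # older cards such as the Tesla T4 (7.5)
    print(f"compute capability {major}.{minor}: bf16={use_bf16}, fp16={use_fp16}")
```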

@jjasghar (Member)

To follow up here, with the help of @tiran I made these changes:

index 6b427f2..f4b2f79 100644
--- a/cli/train/linux_train.py
+++ b/cli/train/linux_train.py
@@ -242,8 +242,8 @@ def linux_train(
         output_dir=output_dir,
         num_train_epochs=num_epochs,
         per_device_train_batch_size=per_device_train_batch_size,
-        fp16=use_fp16,
-        bf16=not use_fp16,
+        fp16=False,
+        bf16=False,
         # use_ipex=True, # TODO CPU test this possible optimization
         use_cpu=model.device.type == "cpu",
         save_strategy="epoch",
@@ -253,8 +253,8 @@ def linux_train(
         # https://stackoverflow.com/a/75793317
         # torch_compile=True,
         # fp16=False,  # fp16 increases memory consumption 1.5x
-        # gradient_accumulation_steps=8,
-        # gradient_checkpointing=True,
+        gradient_accumulation_steps=8,
+        gradient_checkpointing=True,
         # eval_accumulation_steps=1,
         # per_device_eval_batch_size=1,
     )

And I'm able to train now with cuda via lab train --device cuda

@JamesKunstle JamesKunstle (Contributor) left a comment

couple of nits, specifically logging and some comment questions

"--device",
type=TORCH_DEVICE,
show_default=True,
default="cpu",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this flag should only accept options that we explicitly support:

finite options for click flag

This will put guard-rails around the backends we reasonably support so we can roll out future stuff as CI improves for different backends.

"--device",
type=TORCH_DEVICE,
show_default=True,
default="cpu",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also add mutual exclusion to --device=mps and --4-bit-quant and other exclusivities (4-bit might not be mutually exclusive forever, I just know that we can't use bitsandbytes for torch on metal.

def _gib(size: int) -> str:
    return "{:.1f} GiB".format(size / 1024**3)

for idx in range(torch.cuda.device_count()):
JamesKunstle (Contributor)

does this API behave differently for CUDA vs. ROCm devices?


If not (cuda will be positive for ROCm as well), then we should note it.

tiran (Contributor Author)

PyTorch treats ROCm devices like CUDA devices. I have mentioned this in the updated docs and it's explained in the PyTorch docs, too.

There are differences between ROCm and CUDA, but the high-level APIs work the same.

tiran (Contributor Author)

The docstring of the function is "Report CUDA/ROCm device properties". That should be enough, shouldn't it?
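For reference, a minimal sketch of such a report (an illustration of how it could look, not necessarily the PR's exact code). PyTorch exposes the same torch.cuda API for both backends and distinguishes them via torch.version.cuda and torch.version.hip:

```python
# Hedged sketch: report CUDA/ROCm device properties via the shared torch.cuda API.
import torch

def _gib(size: int) -> str:
    return "{:.1f} GiB".format(size / 1024**3)

print(f"  NVidia CUDA version: {torch.version.cuda or 'n/a'}")
print(f"  AMD ROCm HIP version: {torch.version.hip or 'n/a'}")
for idx in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(idx)
    free, total = torch.cuda.mem_get_info(idx)
    print(
        f"  cuda:{idx} is '{props.name}' ({_gib(free)} of {_gib(total)} free, "
        f"capability: {props.major}.{props.minor})"
    )
```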

**Note:** This is explicitly forcing the build to use the ROCm compilers and prefix path for dependency resolution in the CMake build. This works around an issue in the CMake and ROCm version in Fedora 39 and below and may be fixed in F40.
> **Note:** This is explicitly forcing the build to use the ROCm compilers and prefix path for dependency resolution in the CMake build. This works around an issue in the CMake and ROCm version in Fedora 39 and below and is fixed in Fedora 40. With Fedora 40's ROCm packages, use `CMAKE_ARGS="-DLLAMA_HIPBLAS=on -DCMAKE_C_COMPILER=/usr/bin/clang -DCMAKE_CXX_COMPILER=/usr/bin/clang++ -DAMDGPU_TARGETS=gfx1100"` instead.

Once that package is installed, recompile `lab` with `pip3 install .`. You also need to tell `HIP` which GPU to use - you can find this out via `rocminfo`, although it is typically GPU 0. To set which device is visible to HIP, we'll set `export HIP_VISIBLE_DEVICES=0` for GPU 0. You may also have to set `HSA_OVERRIDE_GFX_VERSION` to override ROCm's GFX version detection, for example `export HSA_OVERRIDE_GFX_VERSION=10.3.0` to force an unsupported `gfx1032` card to use the supported `gfx1030` version. The environment variable `AMD_LOG_LEVEL` enables debug logging of ROCm libraries, for example `AMD_LOG_LEVEL=3` to print API calls to stderr.
Contributor

Could recommend the flags pip3 install . --force-reinstall --no-cache as a more forceful route. I've had to do this before.

tiran (Contributor Author)

That would be wrong, because it would override the ROCm builds of PyTorch and llama-cpp-python with CUDA builds plus 4GB of CUDA libs. --force-reinstall ignores all installed packages and overrides them with new copies. Please trust me on this: I have been packaging Python software since before setuptools was first released in 2004.

Actually, all recommendations for --force-reinstall --no-cache should be replaced by better instructions. I'll file a ticket tomorrow.

Contributor

++ Okay cool, I'll defer to your expertise on that then. It's worked for me for some things, but it's probably like adding stuff to a path manually -- it works, but it's bad practice.

@tiran tiran (Contributor Author) Mar 21, 2024

pip isn't easy to tame and some options are confusing. For example, you need to know that some parts of pip do not normalize the package name llama-cpp-python to its canonical form llama_cpp_python, or that the option --no-binary=llama_cpp_python only affects downloads but not the local wheel cache.

I recommend:

  1. Remove the venv and start with a fresh venv
  2. Clear the local wheel cache with pip cache purge or pip cache remove llama_cpp_python (sic!). pip cache purge only purges the wheel cache for locally built packages, but keeps the HTTP cache with downloaded wheels and sdists (also a very confusing fact). This way you don't download GBs of packages over and over again.
  3. Build and install all dependencies that need special care
    • build and install llama-cpp-python
    • download and install PyTorch for your GPU flavor
  4. Finally install the remaining dependencies from requirements.txt or pyproject.toml. pip recognizes that llama-cpp-python and torch are already present and won't override them -- unless there is a version conflict.

These four steps give you the smallest virtual env (no CUDA libs if you build for ROCm or CPU) and don't download the same package from PyPI several times.

requirements.txt Outdated
@@ -20,3 +20,5 @@ torch>=2.2.1,<3.0.0
peft
datasets
trl
# 4-bit quantization does not work, yet
Contributor

Doesn't work on Mac, but should work on CUDA, no?

tiran (Contributor Author)

Yes and no. You can train with --4-bit-quant on CUDA and ROCm, but you cannot do anything with the result of the training. The rest of our tool chain cannot deal with quantized models. See the ticket reference next to the --4-bit-quant option.

Member

It does work on Mac but not on others, right? The comment is confusing.

tiran (Contributor Author)

The linked ticket has more information. I have changed the comment to make it less confusing.

tl;dr it's not usable under Linux, yet.

```shell
$ lab train --device invalid
Usage: lab train [OPTIONS]
Try 'lab train --help' for help.

Error: Invalid value for '--device': Only 'cpu', 'cuda', cuda with device index ('cuda:0') are currently supported.
```

```shell
$ lab train --4-bit-quant
Usage: lab train [OPTIONS]
Try 'lab train --help' for help.

Error: --4-bit-quant option requires --device=cuda
```

Signed-off-by: Christian Heimes <cheimes@redhat.com>
@tiran tiran (Contributor Author) commented Mar 20, 2024

@JamesKunstle My latest commit addresses three remarks.


tiran added 3 commits March 21, 2024 12:04
Signed-off-by: Christian Heimes <cheimes@redhat.com>
The old named tuple is gone and the code now lives in `cli.lab`.

Signed-off-by: Christian Heimes <cheimes@redhat.com>
@tiran tiran force-pushed the linux-train-gpu branch from 0aaafd9 to 6915ca3 on March 21, 2024 12:06
@xukai92 xukai92 (Member) commented Mar 21, 2024

Why does #520 (comment) require setting both fp16 and bf16 to False? Or is only bf16=False actually needed? I'm asking because the current code always sets one of them to True, which doesn't fit that "fix".

@tiran tiran (Contributor Author) commented Mar 21, 2024

> Why does #520 (comment) require setting both fp16 and bf16 to False? Or is only bf16=False actually needed? I'm asking because the current code always sets one of them to True, which doesn't fit that "fix".

It's complicated...

I mentioned before that we have to expose several additional parameters for training on Linux to tune training for the hardware. The current settings work for CUDA compute level >= 8.0 and AMD ROCm >= 10.3 (recent workstation and consumer GPUs with sufficient memory). @jjasghar has a GPU with CUDA compute level 7.5. fp16=False and bf16=False may break training for users with consumer GPUs. The current settings have been tested successfully by nearly a dozen users.

I strongly recommend keeping the current settings and looking into different settings in a future PR.

@xukai92 xukai92 merged commit 951999a into instructlab:main Mar 21, 2024