Add distributed Axolotl and TRL example #2703
Conversation
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID
What is the HUB_MODEL_ID environment variable? How is it different in this context from MODEL_ID? And why does the Axolotl example use HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B?
@peterschmidt85 HUB_MODEL_ID is the repository ID the model will be pushed to on the Hugging Face Hub (format: username/repo-name). I will remove the assignment HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B and only use HUB_MODEL_ID, as in the TRL example.
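To illustrate the distinction, a minimal sketch of the task's env section (the MODEL_ID value is from the example; the HUB_MODEL_ID comment value is a hypothetical placeholder):

```yaml
env:
  # Base model to fine-tune, pulled from the Hugging Face Hub
  - MODEL_ID=meta-llama/Llama-3.1-8B
  # Target repository the fine-tuned model is pushed to; left without a
  # value so it is taken from the caller's environment
  # (e.g. your-username/llama-3.1-8b-sft)
  - HUB_MODEL_ID
```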
!!! Note
    We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.
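For reference, pointing the task at the NGC PyTorch container looks roughly like this in the dstack YAML (the tag shown is a hypothetical example; pick a current one from the NGC catalog):

```yaml
type: task
# The NGC PyTorch image ships the RDMA/InfiniBand userspace libraries,
# including the NCCL network plugin; the tag below is illustrative only
image: nvcr.io/nvidia/pytorch:24.01-py3
```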
Do you know which specific drivers are missing in dstack's default Docker image?
cc @un-def
@peterschmidt85 As far as I remember, dstack's default Docker image was producing this error: libnccl-net.so was not found.
```
lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Using internal network plugin.
lambda-cluster-node-001:1651:1651 [5] NCCL INFO cudaDriverVersion 12080
lambda-cluster-node-001:1651:1651 [5] NCCL INFO Bootstrap : Using eno1:172.26.135.50<0>
lambda-cluster-node-001:1651:1651 [5] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
```
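A quick way to check whether a given image ships the NCCL network plugin is to look for the shared object on the loader path, e.g. as a throwaway task command (a sketch for debugging, not part of the example):

```yaml
commands:
  # Prints the plugin's path if present; otherwise reports it missing
  - ldconfig -p | grep libnccl-net || echo "libnccl-net.so not found"
```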
# Commands of the task
commands:
  - pip install transformers
Why use separate pip install commands instead of a single pip install command with multiple packages?
What about using uv pip install in the examples, since we now recommend uv?
I will update with uv pip install.
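A combined install as suggested might look like this in the task's commands (the package list is illustrative; the example installs trl from source separately):

```yaml
commands:
  # One resolver run instead of several sequential pip invocations
  - uv pip install transformers accelerate
```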
- git clone https://github.com/huggingface/trl
- cd trl
- pip install .
- accelerate launch
For multi-line commands such as accelerate launch, should we use the `- |` syntax?
@peterschmidt85 Yes, we can use `- |` like below for every multi-line command.
```yaml
- |
  accelerate launch \
    --config_file=examples/accelerate_configs/fsdp1.yaml \
    --main_process_ip=$DSTACK_MASTER_NODE_IP \
    --main_process_port=8008 \
    --machine_rank=$DSTACK_NODE_RANK \
    --num_processes=$DSTACK_GPUS_NUM \
    --num_machines=$DSTACK_NODES_NUM \
    trl/scripts/sft.py \
    --model_name $MODEL_ID \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --report_to wandb
```
This would make copying a multi-line command into a shell very easy during debugging.
<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
```
We need to set the environment variables whose values aren't configured in the YAML (HF_TOKEN, WANDB_API_KEY, etc.).
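In the dstack YAML, listing a variable without a value (as the example already does for WANDB_API_KEY and HUB_MODEL_ID) forwards it from the environment where `dstack apply` runs, so secrets stay out of the config:

```yaml
env:
  # Values are read from the shell running `dstack apply`
  - HF_TOKEN
  - WANDB_API_KEY
```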
[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl).
!!! info "What's next?"
    1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
Add a link to the Clusters guide too
@peterschmidt85 I have added a link to the Clusters guide in the Create Fleet section, as below:
"For more details on how to use clusters with dstack, check the Clusters guide."
Therefore I did not add it to the What's next? section.
Also:
- Remove the multi-node example from Fine-tuning | TRL
- Add a link to Distributed training | TRL from Fine-tuning | TRL
- Add a link to Distributed training | Axolotl from Fine-tuning | Axolotl
…ty and consistency
…ingle node training
No description provided.