Conversation

Bihan
Collaborator

@Bihan Bihan commented May 27, 2025

No description provided.

@Bihan Bihan requested a review from peterschmidt85 May 27, 2025 12:37
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID
Contributor
What is the HUB_MODEL_ID environment variable? How is it different in this context from MODEL_ID? Why does the Axolotl example use HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B?

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 HUB_MODEL_ID is the repository ID where the model will be pushed on the Hugging Face Hub (format: username/repo-name).

I will remove the assignment HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B and only use HUB_MODEL_ID, as in the TRL example.
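A sketch of how the env section would then look, with MODEL_ID set explicitly and HUB_MODEL_ID passed through from the caller's environment (the username/repo value shown in the comment is purely illustrative):

```yaml
env:
  - ACCELERATE_LOG_LEVEL=info
  - MODEL_ID=meta-llama/Llama-3.1-8B  # base model to fine-tune
  - HUB_MODEL_ID                      # Hub repo to push to, e.g. <username>/<repo-name>,
                                      # read from the environment running dstack apply
```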



!!! Note
We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.
Contributor

Do you know which specific drivers are missing in dstack's default Docker image?

cc @un-def

Collaborator Author

@peterschmidt85 As far as I remember, dstack's default Docker image was issuing this error: libnccl-net.so was not found.

lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Using internal network plugin.
lambda-cluster-node-001:1651:1651 [5] NCCL INFO cudaDriverVersion 12080
lambda-cluster-node-001:1651:1651 [5] NCCL INFO Bootstrap : Using eno1:172.26.135.50<0>
lambda-cluster-node-001:1651:1651 [5] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
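The behavior in the log can be reproduced in miniature: NCCL tries to dlopen the external net plugin and, when the shared object is missing, falls back to its internal network plugin. A minimal Python analogue using ctypes (the plugin filename is the only assumption; this is a sketch of the probe, not NCCL's actual loader):

```python
import ctypes

def load_net_plugin(name: str = "libnccl-net.so"):
    """Mimic NCCL's plugin probe: dlopen the external net plugin,
    return None when the shared object cannot be found."""
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None

# On an image without the plugin library, this takes the fallback branch,
# matching the "Using internal network plugin" log line above.
plugin = load_net_plugin()
print("external net plugin" if plugin else "internal network plugin (fallback)")
```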


# Commands of the task
commands:
- pip install transformers
Contributor

Why use separate pip install commands instead of a single pip install command with multiple packages?

Collaborator

What about using uv pip install in examples since we now recommend uv?

Collaborator Author

I will update the examples with uv pip install.
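The combined form in the task's commands section would look like the sketch below (the package list is illustrative; the actual example installs trl from source):

```yaml
commands:
  # a single uv pip install with multiple packages,
  # instead of separate pip install calls
  - uv pip install transformers accelerate
```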

- git clone https://github.com/huggingface/trl
- cd trl
- pip install .
- accelerate launch
Contributor

For multi-line commands such as accelerate launch, should we use the - | syntax?

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 Yes, we can use - | as below for all multi-line commands.

 - |
   accelerate launch \
     --config_file=examples/accelerate_configs/fsdp1.yaml \
     --main_process_ip=$DSTACK_MASTER_NODE_IP \
     --main_process_port=8008 \
     --machine_rank=$DSTACK_NODE_RANK \
     --num_processes=$DSTACK_GPUS_NUM \
     --num_machines=$DSTACK_NODES_NUM \
     trl/scripts/sft.py \
     --model_name $MODEL_ID \
     --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
     --dataset_text_field="text" \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \
     --gradient_accumulation_steps 4 \
     --learning_rate 2e-4 \
     --report_to wandb 

This also makes copying multi-line commands into a shell very easy during debugging.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
```

</div>
Contributor

We need to set the environment variables whose values aren't configured in the YAML (HF_TOKEN, WANDB_API_KEY, etc.).
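In the dstack configuration, listing a variable name without a value declares that it must be supplied by the environment running dstack apply, so the example's env section can stay as a sketch like this:

```yaml
env:
  - HF_TOKEN        # taken from the shell running dstack apply
  - WANDB_API_KEY   # same: declared in YAML, value set by the caller
```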

[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl).

!!! info "What's next?"
1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
Contributor

Add a link to the Clusters guide too

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 I have added a link to the Clusters guide in the Create fleet section, as below:
"For more details on how to use clusters with dstack, check the Clusters guide."

Therefore I did not add it to the What's next? section.

Contributor

@peterschmidt85 peterschmidt85 left a comment

Also:
- Remove the multi-node example from Fine-tuning | TRL
- Add a link to Distributed training | TRL from Fine-tuning | TRL
- Add a link to Distributed training | Axolotl from Fine-tuning | Axolotl

@Bihan Bihan merged commit 36cb5aa into dstackai:master May 29, 2025
25 checks passed