Conversation

Bihan
Collaborator

@Bihan Bihan commented May 27, 2025

No description provided.

@Bihan Bihan requested a review from peterschmidt85 May 27, 2025 12:37
- ACCELERATE_LOG_LEVEL=info
- WANDB_API_KEY
- MODEL_ID=meta-llama/Llama-3.1-8B
- HUB_MODEL_ID
Contributor
What is the HUB_MODEL_ID environment variable? How is it different in this context from MODEL_ID? Why does the Axolotl example use HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B?

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 HUB_MODEL_ID is the repository ID where the model will be pushed on the Hugging Face Hub (format: username/repo-name).

I will remove the assignment HUB_MODEL_ID=meta-llama/Meta-Llama-3-70B and only use HUB_MODEL_ID, as in the TRL example.
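A sketch of how the env section would then look, with MODEL_ID set explicitly and HUB_MODEL_ID passed through from the caller's environment (the username/repo value shown in the comment is purely illustrative):

```yaml
env:
  - ACCELERATE_LOG_LEVEL=info
  - MODEL_ID=meta-llama/Llama-3.1-8B  # base model to fine-tune
  - HUB_MODEL_ID                      # Hub repo to push to, e.g. <username>/<repo-name>,
                                      # read from the environment running dstack apply
```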



!!! Note
We are using the NGC container because it includes the necessary libraries and packages for RDMA and InfiniBand support.
Contributor

Do you know which specific drivers are missing in dstack's default Docker image?

cc @un-def

Collaborator Author

@peterschmidt85 As far as I remember, dstack's default Docker image was issuing this error: libnccl-net.so was not found.

lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
lambda-cluster-node-001:1652:1652 [6] NCCL INFO NET/Plugin: Using internal network plugin.
lambda-cluster-node-001:1651:1651 [5] NCCL INFO cudaDriverVersion 12080
lambda-cluster-node-001:1651:1651 [5] NCCL INFO Bootstrap : Using eno1:172.26.135.50<0>
lambda-cluster-node-001:1651:1651 [5] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
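The behavior in the log can be reproduced in miniature: NCCL tries to dlopen the external net plugin and, when the shared object is missing, falls back to its internal network plugin. A minimal Python analogue using ctypes (the plugin filename is the only assumption; this is a sketch of the probe, not NCCL's actual loader):

```python
import ctypes

def load_net_plugin(name: str = "libnccl-net.so"):
    """Mimic NCCL's plugin probe: dlopen the external net plugin,
    return None when the shared object cannot be found."""
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None

# On an image without the plugin library, this takes the fallback branch,
# matching the "Using internal network plugin" log line above.
plugin = load_net_plugin()
print("external net plugin" if plugin else "internal network plugin (fallback)")
```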


# Commands of the task
commands:
- pip install transformers
Contributor

Why use separate pip install commands instead of a single pip install command with multiple packages?

Collaborator

What about using uv pip install in examples since we now recommend uv?

Collaborator Author

I will update the examples with uv pip install.
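The combined form in the task's commands section would look like the sketch below (the package list is illustrative; the actual example installs trl from source):

```yaml
commands:
  # a single uv pip install with multiple packages,
  # instead of separate pip install calls
  - uv pip install transformers accelerate
```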

- git clone https://github.com/huggingface/trl
- cd trl
- pip install .
- accelerate launch
Contributor

For multi-line commands such as accelerate launch, should we use the - | syntax?

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 Yes, we can use - | as below for all multi-line commands.

 - |
   accelerate launch \
     --config_file=examples/accelerate_configs/fsdp1.yaml \
     --main_process_ip=$DSTACK_MASTER_NODE_IP \
     --main_process_port=8008 \
     --machine_rank=$DSTACK_NODE_RANK \
     --num_processes=$DSTACK_GPUS_NUM \
     --num_machines=$DSTACK_NODES_NUM \
     trl/scripts/sft.py \
     --model_name $MODEL_ID \
     --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
     --dataset_text_field="text" \
     --per_device_train_batch_size 1 \
     --per_device_eval_batch_size 1 \
     --gradient_accumulation_steps 4 \
     --learning_rate 2e-4 \
     --report_to wandb 

This also makes copying multi-line commands into a shell very easy during debugging.

<div class="termy">

```shell
$ dstack apply -f examples/distributed-training/trl/fsdp.dstack.yml
```

</div>
Contributor

We need to set the environment variables whose values aren't configured in the YAML (HF_TOKEN, WANDB_API_KEY, etc.).
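In the dstack configuration, listing a variable name without a value declares that it must be supplied by the environment running dstack apply, so the example's env section can stay as a sketch like this:

```yaml
env:
  - HF_TOKEN        # taken from the shell running dstack apply
  - WANDB_API_KEY   # same: declared in YAML, value set by the caller
```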

[`examples/distributed-training/trl` :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/distributed-training/trl).

!!! info "What's next?"
1. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
Contributor

Add a link to the Clusters guide too

Collaborator Author

@Bihan Bihan May 28, 2025

@peterschmidt85 I have added a link to the Clusters guide in the Create fleet section, as below:
"For more details on how to use clusters with dstack, check the Clusters guide."

Therefore I did not add it to the What's next? section.

Contributor

@peterschmidt85 peterschmidt85 left a comment

Also:
- Remove the multi-node example from Fine-tuning | TRL
- Add a link to Distributed training | TRL from Fine-tuning | TRL
- Add a link to Distributed training | Axolotl from Fine-tuning | Axolotl

@Bihan Bihan merged commit 36cb5aa into dstackai:master May 29, 2025
25 checks passed