[Docs] Added `Clusters` guide #2646

peterschmidt85 · 2025-05-15T21:47:13Z

No description provided.

docs/docs/concepts/fleets.md

jvstme · 2025-05-16T08:22:15Z

docs/docs/concepts/fleets.md

-    If the `aws` backend config has `public_ips: false` set, `dstack` enables the maximum number of interfaces supported by the instance.
-    Otherwise, if instances have public IPs, only one EFA interface is enabled per instance due to AWS limitations.
+    When you create a cloud fleet with `aws`, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
+    Note, EFA requires the `public_ips` to set to `false` in the `aws` backend configuration.


Not quite. According to the previous version of this section, EFA is also used with `public_ips: true`, except that only one EFA interface is attached.

There is no point in using EFA without multiple interfaces, I guess.
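For reference, the rule from the removed text can be written down as a small function. This is only a sketch of the behavior as described in this thread, not `dstack`'s actual implementation; the function name and the `max_supported` parameter are hypothetical.

```python
def efa_interface_count(public_ips: bool, max_supported: int) -> int:
    """Number of EFA interfaces enabled per instance, per the removed docs text.

    `max_supported` stands for the maximum number of EFA interfaces the
    instance type supports (0 if the type has no EFA support). Hypothetical helper.
    """
    if max_supported <= 0:
        return 0  # instance type does not support EFA at all
    if public_ips:
        return 1  # with public IPs, only one EFA interface is enabled
    return max_supported  # public_ips: false enables the maximum number


print(efa_interface_count(public_ips=True, max_supported=4))
print(efa_interface_count(public_ips=False, max_supported=4))
```

So `public_ips: true` does not disable EFA in this model; it only caps the interface count at one.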

docs/docs/concepts/fleets.md

jvstme · 2025-05-16T08:32:53Z

docs/docs/guides/clusters.md

+suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
+`dstack` starts worker nodes and runs the task container on each worker node.


Is this actually the case? I don't think we guarantee the order in which containers start.

@r4victor, please comment on this.

I'm pretty sure we don't provide any guarantees.

For example, here the master node started 40 seconds after a non-master node. To demonstrate this, I pre-created a fleet and removed the Docker image from one of the nodes so that the job assigned to that node takes longer to start.

type: task
nodes: 2
commands:
  - "echo Node rank: $DSTACK_NODE_RANK"
  - date --iso-8601=ns

> dstack logs chatty-swan-1 --job 0
Node rank: 0
2025-05-16T10:27:35,731006010-04:00

> dstack logs chatty-swan-1 --job 1
Node rank: 1
2025-05-16T10:26:55,453938960-04:00
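As a quick check of the 40-second figure, the two timestamps from the logs above can be subtracted directly. This is a minimal Python sketch; the helper name `parse_ns_timestamp` is made up for illustration, and it assumes the `date --iso-8601=ns` format shown here (a comma, a nine-digit fraction, then a UTC offset).

```python
from datetime import datetime


def parse_ns_timestamp(stamp):
    """Split a `date --iso-8601=ns` stamp into (second-resolution datetime, nanoseconds)."""
    base, rest = stamp.split(",")
    # The fraction is exactly nine digits; the remainder is the UTC offset.
    nanos, offset = rest[:9], rest[9:]
    whole = datetime.strptime(base + offset, "%Y-%m-%dT%H:%M:%S%z")
    return whole, int(nanos)


# Timestamps copied from the logs of job 0 (rank 0) and job 1 (rank 1).
rank0_time, rank0_ns = parse_ns_timestamp("2025-05-16T10:27:35,731006010-04:00")
rank1_time, rank1_ns = parse_ns_timestamp("2025-05-16T10:26:55,453938960-04:00")

gap_seconds = (rank0_time - rank1_time).total_seconds() + (rank0_ns - rank1_ns) / 1e9
print(f"Rank 0 started {gap_seconds:.3f} s after rank 1")
```

The gap comes out at roughly 40.28 seconds, with rank 0 being the later one.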

We've thought about making the order configurable, but currently it is expected to be random. And this is a good default, as it prevents any GPU time from being wasted.
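Since the start order is not guaranteed, a task that needs the master to be reachable first has to wait for it on its own. One way to do this is sketched below in Python, under several assumptions: `DSTACK_MASTER_NODE_IP` is taken to be the variable holding the master's address (it is not mentioned in this thread), and `RENDEZVOUS_PORT` with its default of 29500 is a hypothetical name for whatever port the master process listens on.

```python
import os
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Master is not listening yet (or not reachable); retry shortly.
            time.sleep(interval)
    return False


def main():
    rank = int(os.environ.get("DSTACK_NODE_RANK", "0"))
    # Assumed variable name for the master's address.
    master = os.environ.get("DSTACK_MASTER_NODE_IP", "127.0.0.1")
    # Hypothetical variable; use the port your master process actually opens.
    port = int(os.environ.get("RENDEZVOUS_PORT", "29500"))

    if rank != 0 and not wait_for_port(master, port):
        raise SystemExit(f"master {master}:{port} did not come up in time")
    print(f"node {rank}: ready to start")


if __name__ == "__main__":
    main()
```

Only non-master nodes wait here, so the sketch works whichever node starts first. Many distributed launchers already retry the rendezvous themselves, in which case no extra waiting is needed.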

jvstme · 2025-05-16T08:39:10Z

docs/docs/guides/clusters.md

+=== "AWS"
+    When you create a cloud fleet with `aws`, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
+
+    !!! info "Backend configuration"    
+        Note, EFA requires the `public_ips` to set to `false` in the `aws` backend configuration.
+        Refer to the [EFA](../../blog/posts/efa.md) example for more details.
+
+=== "GCP"
+    When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.
+
+    !!! info "Backend configuration"    
+        Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured  in the `gcp` backend configuration.
+        Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and 
+        [A3 Mega](../../examples/clusters/a3high/index.md) examples for more details.
+
+=== "Nebius"
+    When you create a cloud fleet with `nebius`, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.


My comments from fleets.md are also relevant here.

Maybe leave a link to fleets.md instead of duplicating the details for each backend?

docs/docs/guides/clusters.md

Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>

[Docs] Added Clusters guide

332611f

peterschmidt85 requested review from r4victor and jvstme May 15, 2025 21:47

jvstme approved these changes May 16, 2025


peterschmidt85 and others added 4 commits May 16, 2025 11:00

Update docs/docs/concepts/fleets.md

0132afb

Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>

Update docs/docs/concepts/fleets.md

a235336

Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>

Update docs/docs/concepts/fleets.md

be0bec6

Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>

[Docs] Minor PR review feedback fixes

8632ded

peterschmidt85 merged commit 20de4c7 into master May 16, 2025
23 checks passed

peterschmidt85 deleted the clusters-guide branch May 16, 2025 12:50
