peterschmidt85
Contributor

No description provided.

peterschmidt85 requested review from r4victor and jvstme on May 15, 2025 21:47
If the `aws` backend config has `public_ips: false` set, `dstack` enables the maximum number of interfaces supported by the instance.
Otherwise, if instances have public IPs, only one EFA interface is enabled per instance due to AWS limitations.
When you create a cloud fleet with `aws`, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
Note, EFA requires `public_ips` to be set to `false` in the `aws` backend configuration.
Collaborator

Not quite. According to the previous version of this section, EFA is also used with `public_ips: true`, except that only one EFA interface is attached.

Contributor Author

There is no point in using EFA without multiple interfaces, I guess.
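
For reference, the rule from the previous version of the section (all supported interfaces with `public_ips: false`, a single interface otherwise) can be sketched as below. This is only an illustration of the documented behavior, not `dstack`'s actual code: the function name `efa_interface_count` and the `max_supported` parameter are hypothetical.

```python
def efa_interface_count(public_ips: bool, max_supported: int) -> int:
    """Hypothetical helper: how many EFA interfaces would be enabled per instance.

    `max_supported` is assumed to be the maximum number of EFA interfaces
    the instance type supports (0 if the instance type has no EFA support).
    """
    if max_supported <= 0:
        # Instance type does not support EFA at all.
        return 0
    if public_ips:
        # With public IPs, only one EFA interface is enabled (AWS limitation).
        return 1
    # Without public IPs, all supported interfaces are enabled.
    return max_supported
```

Under this sketch, an instance type with several supported interfaces would still get EFA when public IPs are enabled, just with a single interface.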

Comment on lines 49 to 50
suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
`dstack` starts worker nodes and runs the task container on each worker node.
Collaborator

Is this actually the case? I don't think we guarantee the order in which containers start.

Contributor Author

@r4victor please comment on this

Collaborator

I'm pretty sure we don't provide any guarantees.

For example, here the master node started 40 seconds after a non-master node. To demonstrate this, I pre-created a fleet and removed the Docker image from one of the nodes so that the job assigned to that node takes longer to start.

type: task
nodes: 2
commands:
- "echo Node rank: $DSTACK_NODE_RANK"
- date --iso-8601=ns

> dstack logs chatty-swan-1 --job 0
Node rank: 0
2025-05-16T10:27:35,731006010-04:00
> dstack logs chatty-swan-1 --job 1
Node rank: 1
2025-05-16T10:26:55,453938960-04:00

We've thought about making the order configurable, but currently it is expected to be random. And this is a good default, as it prevents any GPU time from being wasted.
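
The 40-second gap can be checked directly from the two timestamps in the logs above. The snippet below is a small sketch in Python; the helper names are made up for illustration, and it assumes the `date --iso-8601=ns` format shown (comma as the decimal separator, nine fractional digits), truncating nanoseconds to microseconds because `datetime` has no nanosecond precision.

```python
import re
from datetime import datetime

# Timestamps copied from the `dstack logs` output above.
MASTER = "2025-05-16T10:27:35,731006010-04:00"  # job 0, node rank 0
WORKER = "2025-05-16T10:26:55,453938960-04:00"  # job 1, node rank 1


def parse_iso_ns(stamp: str) -> datetime:
    """Parse `date --iso-8601=ns` output, truncating to microseconds."""
    match = re.fullmatch(r"(.+),(\d{9})([+-]\d{2}:\d{2})", stamp)
    if match is None:
        raise ValueError(f"unexpected timestamp format: {stamp!r}")
    head, nanos, offset = match.groups()
    # Keep the first six fractional digits so fromisoformat accepts the value.
    return datetime.fromisoformat(f"{head}.{nanos[:6]}{offset}")


def startup_gap_seconds(later: str, earlier: str) -> float:
    """Seconds between two timestamps (positive if `later` is after `earlier`)."""
    return (parse_iso_ns(later) - parse_iso_ns(earlier)).total_seconds()


if __name__ == "__main__":
    print(f"{startup_gap_seconds(MASTER, WORKER):.3f}")  # prints 40.277
```

A positive result means the rank 0 (master) job printed its timestamp after the rank 1 job, which matches the observation that no start order is guaranteed.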

Comment on lines 20 to 36
=== "AWS"
    When you create a cloud fleet with `aws`, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.

    !!! info "Backend configuration"
        Note, EFA requires `public_ips` to be set to `false` in the `aws` backend configuration.
        Refer to the [EFA](../../blog/posts/efa.md) example for more details.

=== "GCP"
    When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.

    !!! info "Backend configuration"
        Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration.
        Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and
        [A3 High](../../examples/clusters/a3high/index.md) examples for more details.

=== "Nebius"
    When you create a cloud fleet with `nebius`, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
Collaborator

My comments from fleets.md are also relevant here.

Maybe leave a link to fleets.md instead of duplicating the details for each backend?

peterschmidt85 and others added 4 commits May 16, 2025 11:00
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>
peterschmidt85 merged commit 20de4c7 into master on May 16, 2025
23 checks passed
peterschmidt85 deleted the clusters-guide branch on May 16, 2025 12:50