[Docs] Added Clusters guide #2646
Conversation
docs/docs/concepts/fleets.md (Outdated)
If the `aws` backend config has `public_ips: false` set, `dstack` enables the maximum number of interfaces supported by the instance.
Otherwise, if instances have public IPs, only one EFA interface is enabled per instance due to AWS limitations.
When you create a cloud fleet with `aws`, [Elastic Fabric Adapter networking :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} is automatically configured if it’s supported for the corresponding instance type.
Note, EFA requires `public_ips` to be set to `false` in the `aws` backend configuration.
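For illustration, a backend entry in the `dstack` server config with public IPs disabled might look roughly like the sketch below. This is an assumption-laden example: the project name and credential settings are placeholders, and the exact schema should be checked against the server configuration reference.

```yaml
projects:
- name: main          # placeholder project name
  backends:
  - type: aws
    creds:
      type: default   # use default AWS credential chain
    # Disable public IPs so that multiple EFA interfaces can be attached
    public_ips: false
```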
Not quite. According to the previous version of this section, EFA is also used for `public_ips: true`, except only one EFA interface is attached.
There is no point in using EFA without multiple interfaces, I guess.
docs/docs/guides/clusters.md (Outdated)
suitable fleet is available, then starts the master node and runs the task container on it. Once the master is up,
`dstack` starts worker nodes and runs the task container on each worker node.
Is this actually the case? I don't think we guarantee the order in which containers start
@r4victor please comment on this
I'm pretty sure we don't provide any guarantees.
For example, here the master node started 40 seconds after a non-master node. To demonstrate this, I pre-created a fleet and removed the Docker image from one of the nodes so that the job assigned to that node takes longer to start.
```yaml
type: task
nodes: 2
commands:
  - "echo Node rank: $DSTACK_NODE_RANK"
  - date --iso-8601=ns
```

```shell
> dstack logs chatty-swan-1 --job 0
Node rank: 0
2025-05-16T10:27:35,731006010-04:00

> dstack logs chatty-swan-1 --job 1
Node rank: 1
2025-05-16T10:26:55,453938960-04:00
```
We've thought about making the order configurable, but currently it is expected to be random. And this is a good default, as it prevents any GPU time from being wasted.
docs/docs/guides/clusters.md (Outdated)
=== "AWS"
    When you create a cloud fleet with `aws`, [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.

    !!! info "Backend configuration"
        Note, EFA requires `public_ips` to be set to `false` in the `aws` backend configuration.
        Refer to the [EFA](../../blog/posts/efa.md) example for more details.

=== "GCP"
    When you create a cloud fleet with `gcp`, for the A3 Mega and A3 High instance types, [GPUDirect-TCPXO and GPUDirect-TCPX :material-arrow-top-right-thin:{ .external }](https://cloud.google.com/kubernetes-engine/docs/how-to/gpu-bandwidth-gpudirect-tcpx-autopilot){:target="_blank"} networking is automatically configured.

    !!! info "Backend configuration"
        Note, GPUDirect-TCPXO and GPUDirect-TCPX require `extra_vpcs` to be configured in the `gcp` backend configuration.
        Refer to the [A3 Mega](../../examples/clusters/a3mega/index.md) and
        [A3 High](../../examples/clusters/a3high/index.md) examples for more details.

=== "Nebius"
    When you create a cloud fleet with `nebius`, [InfiniBand :material-arrow-top-right-thin:{ .external }](https://docs.nebius.com/compute/clusters/gpu){:target="_blank"} networking is automatically configured if it’s supported for the corresponding instance type.
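As a rough sketch of the GCP case (the project ID and VPC names below are placeholders, and the exact schema should be verified against the linked A3 Mega / A3 High examples), the `gcp` backend configuration with `extra_vpcs` might look like:

```yaml
projects:
- name: main                 # placeholder project name
  backends:
  - type: gcp
    project_id: my-gcp-project   # placeholder GCP project ID
    creds:
      type: default
    # Extra VPC networks used as data paths for GPUDirect-TCPXO / GPUDirect-TCPX
    extra_vpcs:
    - dstack-data-net-1      # placeholder VPC name
    - dstack-data-net-2      # placeholder VPC name
```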
My comments from fleets.md are also relevant here. Maybe leave a link to fleets.md instead of duplicating the details for each backend?
Co-authored-by: jvstme <36324149+jvstme@users.noreply.github.com>