Skip to content

Conversation

angeloskath
Copy link
Member

Initial version of a custom ring reduce backend. It can be used with Ethernet as well but the main focus is on thunderbolt because:

  1. MPI ring reduce is not a very good fit and doesn't saturate the TB interface.
  2. it makes it much easier to add more cables for even more bandwidth

The comparison is as follows for a 4-way all reduce using 4 M2 Ultras:
all_reduce

The bandwidth is computed as total bytes * 3/4 * 2 / time the 3/4 is due to the ring reduce actually sending less than 2*B bytes on the wire.

In terms of time per all reduce
all_reduce_time

which means that for 2GB all reduce we go from 1.3 seconds over Ethernet and 0.45s with MPI to 0.27s with our reduce.

Things to note:

  1. mx.distributed.init() takes a backend argument and initializes the corresponding backend. If no argument is provided it tries to initialize in order ring > mpi. The reason is that ring will fail if MLX_HOSTFILE and MLX_RANK is not provided while mpi may succeed and "hide" the rest of the backends.
  2. mx.distributed.init() called for a 2nd time without an argument returns the previously initialized backend. This allows the rest of the code to remain unchanged and agnostic of the communication group.
  3. Small edits to the Threadpool to make it resizable.
  4. The ring backend can be improved/simplified so take this as a 1st version mostly defining the API.

Finally, I am in the process of preparing another PR with changes in the docs and possibly a launcher for this backend so people don't have to write their own shell scripts to launch with this backend.

Copy link
Member

@awni awni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice, I didn't read the actual ring.cpp yet but I think the API changes look great.

One high-level question, is there anyway to test this back-end in CI?

@angeloskath
Copy link
Member Author

Yeah 100%. I should have added that actually. Will do.

It doesn't actually need thunderbolt connections any socket will do. I regularly test for correctness using the loop back interface which will be the CI test as well.

Copy link
Member

@awni awni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!! This is good to go right?

@angeloskath
Copy link
Member Author

Yeah it is good to go. I am writing docs for it in a different PR and there will be more PRs improving the performance and adding features (these should be limited in ring.cpp). I want to apply your comments and then we can merge.

@angeloskath angeloskath merged commit ccb61d7 into main Jan 28, 2025
5 checks passed
@angeloskath angeloskath deleted the ring-backend branch January 28, 2025 06:15
@stemann stemann mentioned this pull request Jan 28, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants