
Add multi-device execution support in ONNX #6641


Merged
merged 11 commits into from
Mar 4, 2025

Conversation

kevinch-nv
Contributor

Co-authored with @gramalingam

Description

Updates ONNX specification to include fields to describe multi-device inference execution.

Motivation and Context

The recent trend toward increasingly large models has spurred interest in distributed inference. A key performance bottleneck for inference with these large models has been the memory limits of GPUs and other accelerators, as well as communication bandwidth. Thus, efficient distributed inference typically requires parallelizing the computation across multiple devices while taking memory and bandwidth into account.

Our goal is to extend ONNX so that it can serve as a representation of a parallelized model. This is driven by the current state-of-the-art techniques used for distributed inference (e.g., see GSPMD: General and Scalable Parallelization for ML Computation Graphs). In particular, two techniques of interest are tensor parallelism and pipelining. In tensor parallelism (also known as horizontal parallelism or operator parallelism), the computation of a single operator (node) in the graph is parallelized across multiple devices by sharding its inputs. In pipeline parallelism, different subgraphs are assigned to different devices.
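The two techniques can be sketched in plain Python (this is purely illustrative of the math, not the ONNX API or this PR's proto fields): tensor parallelism shards one operator's computation across devices, while pipeline parallelism would instead assign whole subgraphs to devices.

```python
# Illustrative sketch of tensor parallelism: a MatMul's weight matrix is
# column-sharded across two hypothetical "devices"; each device computes
# a partial result, and the shards are concatenated along the column
# axis (an all-gather in a real runtime). Plain Python lists stand in
# for tensors so the sketch runs anywhere.

def matmul(a, b):
    # a: m x k, b: k x n, naive matrix multiply
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def shard_columns(b, num_devices):
    # Split b's columns into num_devices contiguous shards.
    step = len(b[0]) // num_devices
    return [[row[d * step:(d + 1) * step] for row in b]
            for d in range(num_devices)]

def tensor_parallel_matmul(a, b, num_devices=2):
    # Each "device" multiplies against its own column shard of b.
    partials = [matmul(a, shard) for shard in shard_columns(b, num_devices)]
    # Concatenate the partial outputs row by row.
    return [sum((p[i] for p in partials), []) for i in range(len(a))]

a = [[1, 2], [3, 4]]
b = [[5, 6, 7, 8], [9, 10, 11, 12]]
# The sharded computation matches the single-device result.
assert tensor_parallel_matmul(a, b) == matmul(a, b)
```

The fields added by this PR describe exactly this kind of placement and sharding metadata so that a runtime can distribute the graph without guessing.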

@kevinch-nv kevinch-nv requested a review from a team as a code owner January 16, 2025 00:54
@justinchuby justinchuby added this to the 1.18 milestone Jan 16, 2025
@justinchuby
Member

justinchuby commented Feb 20, 2025

The newly added proto classes should be imported in `from onnx.onnx_pb import (` and exposed in `__all__`


codecov bot commented Feb 20, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 56.45%. Comparing base (c403175) to head (ac6edb0).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6641   +/-   ##
=======================================
  Coverage   56.45%   56.45%           
=======================================
  Files         509      509           
  Lines       32515    32515           
  Branches     3057     3057           
=======================================
  Hits        18356    18356           
  Misses      13334    13334           
  Partials      825      825           

☔ View full report in Codecov by Sentry.

@kevinch-nv
Contributor Author

@justinchuby @gramalingam do you have any insight on the failing checks? Perhaps I need to also add these proto definitions in a few other files?

@justinchuby
Member

The proto files need to be auto-generated and updated with

```shell
python onnx/gen_proto.py -l
python onnx/gen_proto.py -l --ml
```

@justinchuby
Member

The file entry `docs/proposals/images/composing_broadcast_axes.png` needs to be added to https://github.com/onnx/onnx/blob/main/REUSE.toml

@kevinch-nv kevinch-nv force-pushed the multidevice-draft branch 2 times, most recently from ad2f94e to 9d1450a Compare February 27, 2025 18:30
@gramalingam
Contributor

I see there is a failed check. Maybe run the proto-generation scripts again (after recent changes)?

Signed-off-by: Kevin Chen <kevinch@nvidia.com>
Member

@justinchuby justinchuby left a comment


proto changes lgtm. Thanks!

@github-project-automation github-project-automation bot moved this from In progress to Reviewer approved in PR Tracker Mar 3, 2025
Signed-off-by: Kevin Chen <kevinch@nvidia.com>
@gramalingam
Contributor

gramalingam commented Mar 4, 2025

Hi @kevinch-nv: I think the recent changes you made to `onnx.in.proto`, such as the line `repeated NodeDeviceConfigurationProto device_configurations = 10;`, are not reflected in the other generated proto files. You may need to run the proto-generation script and commit those files. Thanks! (This is causing the CI failure.)

@gramalingam
Contributor


Sorry, one more suggested change first: can you add a comment line here to document this update to the proto?

Signed-off-by: Kevin Chen <kevinch@nvidia.com>
Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
@justinchuby justinchuby enabled auto-merge March 4, 2025 19:52
@justinchuby justinchuby added the `release notes` label (Important changes to call out in release notes) Mar 4, 2025
@justinchuby justinchuby added this pull request to the merge queue Mar 4, 2025
Merged via the queue into onnx:main with commit 51092aa Mar 4, 2025
52 checks passed
@github-project-automation github-project-automation bot moved this from Reviewer approved to Done in PR Tracker Mar 4, 2025
@lutzroeder
Member

lutzroeder commented Mar 7, 2025

@kevinch-nv @justinchuby can you share a sample .onnx file using this feature?

Labels
module: spec, release notes (Important changes to call out in release notes)
Projects
Status: Done
5 participants