Add multi-device execution support in ONNX #6641
Conversation
The newly added proto classes should be imported in `__all__` (line 84 in 5973bd9).
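For concreteness, a minimal sketch of what that edit to `onnx/__init__.py` might look like. The class names below are assumed from this PR's proto changes and are not confirmed exports; adjust to the actual generated classes:

```python
# Sketch of the suggested edit to onnx/__init__.py: re-export the new
# multi-device proto classes so they become part of the public API.
# Class names are assumptions taken from this PR's proto changes.
from onnx.onnx_pb import (
    NodeDeviceConfigurationProto,
    ShardingSpecProto,
    ShardedDimProto,
    SimpleShardedDimProto,
)

__all__ = [
    # ...existing exports...
    "NodeDeviceConfigurationProto",
    "ShardingSpecProto",
    "ShardedDimProto",
    "SimpleShardedDimProto",
]
```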
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##             main    #6641   +/-  ##
=======================================
  Coverage   56.45%   56.45%
=======================================
  Files         509      509
  Lines       32515    32515
  Branches     3057     3057
=======================================
  Hits        18356    18356
  Misses      13334    13334
  Partials      825      825
```

☔ View full report in Codecov by Sentry.
(force-pushed from b62e289 to 759081a)
@justinchuby @gramalingam do you have any insight on the failing checks? Perhaps I need to also add these proto definitions in a few other files?
The proto files need to be auto-generated and updated with `python onnx/gen_proto.py -l`.
The file entry `docs/proposals/images/composing_broadcast_axes.png` needs to be added to https://github.com/onnx/onnx/blob/main/REUSE.toml.
(force-pushed from ad2f94e to 9d1450a)
I see there is a failed check. Maybe run the proto-generation scripts again (after recent changes)?
(force-pushed from 9d1450a to 77c457f)
proto changes lgtm. Thanks!
(force-pushed from 2469f58 to fb114fb)
Hi @kevinch-nv: I think the recent changes you made to …
Sorry, one more suggested change first: can you add a comment line here to document this update to the proto?
@kevinch-nv @justinchuby can you share a sample …
Co-authored with @gramalingam
Description
Updates the ONNX specification to include fields that describe multi-device inference execution.
Motivation and Context
The recent trend toward increasingly large models has spurred interest in distributed inference. A key performance bottleneck for inference with these large models has been the memory limits of GPUs and other accelerators, as well as communication bandwidth. Efficient distributed inference therefore typically requires parallelizing the computation across multiple devices while taking memory and bandwidth into account.
Our goal is to extend ONNX so that it can serve as a representation of a parallelized model. This is driven by the current state-of-the-art techniques used for distributed inference (e.g., see GSPMD: General and Scalable Parallelization for ML Computation Graphs). In particular, two techniques of interest are tensor parallelism and pipelining. In tensor parallelism (also known as horizontal or operator parallelism), the computation of a single operator (node) in the graph is parallelized across multiple devices by sharding its inputs. In pipeline parallelism, different subgraphs are assigned to different devices. A sketch of how these annotations might look follows.
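Below is a hypothetical sketch of annotating a node for 2-way tensor parallelism with the new fields. The message and field names (`NodeDeviceConfigurationProto`, `ShardingSpecProto`, `device_configurations`, `pipeline_stage`, `simple_sharding`) are read off this PR's proto changes and should be treated as assumptions until the spec is finalized; the configuration id `tp_2gpu` is invented for illustration.

```python
# Hypothetical sketch: annotate a MatMul for 2-way tensor parallelism
# using the proto messages added in this PR. Names are assumptions from
# the PR's proto diff, not a finalized API.
import onnx
from onnx import helper

# A single MatMul whose second input (the weight W) is sharded by column.
node = helper.make_node("MatMul", inputs=["X", "W"], outputs=["Y"])

config = onnx.NodeDeviceConfigurationProto()
config.configuration_id = "tp_2gpu"  # invented configuration name
config.pipeline_stage = 0            # this example runs entirely in stage 0

# Shard W along axis 1 into two equal pieces, one per device.
spec = config.sharding_spec.add()
spec.tensor_name = "W"
spec.device.extend([0, 1])
sharded_dim = spec.sharded_dim.add()
sharded_dim.axis = 1
simple = sharded_dim.simple_sharding.add()
simple.num_shards = 2

node.device_configurations.append(config)
```

Under this scheme, a backend that recognizes configuration `tp_2gpu` would split `W` column-wise across devices 0 and 1 and run the two partial MatMuls in parallel; pipeline parallelism would instead assign different `pipeline_stage` values to different nodes.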