Add FLOAT4E2M1 data type #6318
Conversation
Signed-off-by: Yuan Yao <yuanyao@nvidia.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##             main    #6318      +/-   ##
==========================================
+ Coverage   56.95%   57.26%   +0.30%
==========================================
  Files         506      507       +1
  Lines       30467    31354     +887
  Branches     4592     4679      +87
==========================================
+ Hits        17353    17954     +601
- Misses      12285    12550     +265
- Partials      829      850      +21

View full report in Codecov by Sentry.
@yuanyao-nv I just noticed - the test data shouldn’t change in this PR?
This was auto-generated by update_doc.sh. I've seen cases where some seemingly irrelevant pb files get updated by the script. Do you understand why?
The pb files can differ at the byte level when generated on different operating systems. I think that's what's happening here.
I see. I wonder if it's possible to let the pipeline auto-generate these test files instead and reduce the number of files changed in each PR.
### Description
- Add FLOAT4E2M1 as a new data type to the proto, along with the relevant helper functions and tests.
- This PR splits out the portion of onnx#6283 relevant to data type updates to reduce that PR's size.

### Motivation and Context
Narrow-precision data types with sub-byte bit widths are becoming solutions to the rising cost, performance, and deployment challenges of LLMs. ONNX already has INT4/UINT4. FP4 is another commonly used narrow-precision data type for compressing both the weights and activations of LLMs. For example, [OCP](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) MXFP4 uses E2M1 as its element type. Similar to INT4/UINT4, FP4 weights/inputs are expected to be packed.

Signed-off-by: Yuan Yao <yuanyao@nvidia.com>
Signed-off-by: Andreas Fehlner <fehlner@arcor.de>
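For reference, E2M1 stores each element in 4 bits: 1 sign bit, 2 exponent bits (bias 1), and 1 mantissa bit. A minimal sketch of decoding and of the two-elements-per-byte packing (the helper names here are hypothetical, not part of the ONNX API, and the low-nibble-first ordering mirrors the INT4/UINT4 packing convention as I understand it, so verify against the spec):

```python
def fp4_e2m1_to_float(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign, 2 exponent (bias 1), 1 mantissa bit."""
    sign = -1.0 if code & 0x8 else 1.0
    exp = (code >> 1) & 0x3
    man = code & 0x1
    if exp == 0:
        mag = man * 0.5                               # subnormal: 0.0 or 0.5
    else:
        mag = (1.0 + 0.5 * man) * 2.0 ** (exp - 1)    # normal values
    return sign * mag

# The eight non-negative values representable in E2M1:
vals = [fp4_e2m1_to_float(c) for c in range(8)]
# vals == [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def pack_fp4(codes):
    """Pack two 4-bit codes per byte, first element in the low nibble
    (assumed to follow the INT4/UINT4 packing convention)."""
    out = bytearray()
    for i in range(0, len(codes), 2):
        lo = codes[i] & 0xF
        hi = (codes[i + 1] & 0xF) if i + 1 < len(codes) else 0
        out.append(lo | (hi << 4))
    return bytes(out)
```

Note that the dynamic range tops out at ±6.0, which is why FP4 is typically paired with a per-block scale (as in MXFP4).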
### Description
- FLOAT4E2M1 has been added to the proto in #6318.
- This PR adds FLOAT4E2M1 support for QuantizeLinear, DequantizeLinear, Cast, and CastLike (opset 23).
- It also adds support to non-compute ops: Constant, ConstantOfShape, Identity, Reshape, Shape, Size, If, Loop, Scan, Flatten, Pad, Squeeze, Unsqueeze, and Transpose (opset 23).

Similar to INT4/UINT4, FP4 weights/inputs are expected to be packed.

---------

Signed-off-by: Yuan Yao (yuanyao) <yuanyao@nvidia.com>
Signed-off-by: Yuan Yao <yuanyao@nvidia.com>
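To make the quantization semantics concrete, here is a hypothetical scalar sketch of what QuantizeLinear/DequantizeLinear amount to for this element type (the helper names are illustrative; ties are broken toward the smaller magnitude here, whereas the operator spec may use round-half-to-even, so consult the opset 23 definitions):

```python
# The eight magnitudes representable in E2M1.
E2M1_MAGS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_e2m1(x: float, scale: float) -> float:
    """Scale, saturate to the max finite value (6.0), and round to the
    nearest representable magnitude (tie-break here: smaller magnitude)."""
    v = x / scale
    sign = -1.0 if v < 0 else 1.0
    mag = min(abs(v), 6.0)
    q = min(E2M1_MAGS, key=lambda m: (abs(m - mag), m))
    return sign * q

def dequantize_e2m1(q: float, scale: float) -> float:
    """Dequantization is just multiplication by the scale."""
    return q * scale
```

For example, `quantize_e2m1(2.6, 1.0)` rounds to 3.0, and values beyond the representable range saturate rather than overflow.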
### Description
onnx/onnx#6318 and onnx/onnx#6283 added FP4 support to ONNX. This change introduces the FP4 type in ORT and adds type support to one relevant operator (`Cast`) as a proof of concept for integrating the type into ORT. More op support will be added on an as-needed basis.

This change took inspiration from the following PRs: #14731 #22228 #20362

Some notes:
1. Only the `tensor` type gets FP4 support initially. Secondary types like `seq(tensor)`, `sparse_tensor`, and `optional` do not (so as not to introduce unnecessary bloat to the framework without a solid use case).
2. Flatbuffer-related files receive no updates in this PR.

### Motivation and Context
Be able to run FP4 models with ORT.
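As a sanity check of what a `Cast` pair to and from FLOAT4E2M1 does to values, a small table-driven sketch (a nearest-value search; real kernels use bit manipulation, and the saturating/rounding behaviour assumed here should be verified against the Cast spec):

```python
# All 16 E2M1 codes decoded to floats (codes 8..15 are the negated values).
E2M1_TABLE = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
              -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def cast_f32_to_fp4(x: float) -> int:
    """Saturate to [-6, 6] and pick the nearest code (assumed behaviour)."""
    x = max(-6.0, min(6.0, x))
    return min(range(16), key=lambda c: abs(E2M1_TABLE[c] - x))

def cast_fp4_to_f32(code: int) -> float:
    """Widening cast is exact: every E2M1 value is representable in float32."""
    return E2M1_TABLE[code]

# The round trip is lossy: 2.7 rounds to the nearest representable value, 3.0.
roundtrip = cast_fp4_to_f32(cast_f32_to_fp4(2.7))
```

This illustrates why `Cast` alone is a useful proof of concept: it exercises both the narrowing (rounding/saturating) and widening directions of the type.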