
Conversation

@yuanyao-nv yuanyao-nv commented Aug 24, 2024

Description

- Add FLOAT4E2M1 as a new data type to proto as well as relevant helper functions and tests.
- This PR splits out the portion of #6283 relevant to data type updates to reduce the PR's size.

Motivation and Context

Narrow-precision data types with sub-byte bit widths are emerging as solutions to the rising cost, performance, and deployment challenges of LLMs. ONNX already has INT4/UINT4. FP4 is another commonly used narrow-precision data type for compressing both the weights and activations of LLMs. For example, OCP MXFP4 uses E2M1 as its element type.

Similar to INT4/UINT4, FP4 weights/inputs are expected to be packed.
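For orientation, E2M1 packs a sign bit, two exponent bits, and one mantissa bit into 4 bits, giving the representable magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}. Below is a minimal, illustrative sketch (not the ONNX implementation) of decoding an E2M1 code and packing two 4-bit elements per byte; the even-index-in-the-low-nibble layout is an assumption made for this example.

```python
import numpy as np

def e2m1_to_float(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 0x1 else 1.0
    exp = (code >> 1) & 0x3
    man = code & 0x1
    if exp == 0:                        # subnormal range: no implicit leading 1
        return sign * 0.5 * man         # 0.0 or 0.5
    return sign * (2.0 ** (exp - 1)) * (1.0 + 0.5 * man)

# The eight non-negative E2M1 values: 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0
print([e2m1_to_float(c) for c in range(8)])

def pack_fp4(codes: np.ndarray) -> np.ndarray:
    """Pack 4-bit codes two per byte (even index in the low nibble, by assumption)."""
    codes = codes.astype(np.uint8)
    if codes.size % 2:                  # pad odd-length input with a zero nibble
        codes = np.append(codes, np.uint8(0))
    return (codes[0::2] & 0x0F) | ((codes[1::2] & 0x0F) << 4)
```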

Signed-off-by: Yuan Yao <yuanyao@nvidia.com>
@yuanyao-nv yuanyao-nv requested review from a team as code owners August 24, 2024 00:43
@yuanyao-nv yuanyao-nv changed the title from Add FP4E2M1 data type to Add FLOAT4E2M1 data type Aug 24, 2024

codecov bot commented Aug 24, 2024

Codecov Report

Attention: Patch coverage is 77.77778% with 22 lines in your changes missing coverage. Please review.

Project coverage is 57.26%. Comparing base (83194ed) to head (74e11de).
Report is 465 commits behind head on main.

Files with missing lines Patch % Lines
onnx/numpy_helper.py 58.62% 10 Missing and 2 partials ⚠️
onnx/reference/ops/op_cast.py 33.33% 7 Missing and 1 partial ⚠️
onnx/reference/op_run.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6318      +/-   ##
==========================================
+ Coverage   56.95%   57.26%   +0.30%     
==========================================
  Files         506      507       +1     
  Lines       30467    31354     +887     
  Branches     4592     4679      +87     
==========================================
+ Hits        17353    17954     +601     
- Misses      12285    12550     +265     
- Partials      829      850      +21     


@justinchuby justinchuby added the run release CIs label Aug 24, 2024
@justinchuby justinchuby reopened this Aug 24, 2024
@yuanyao-nv yuanyao-nv added this pull request to the merge queue Aug 24, 2024
Merged via the queue into onnx:main with commit c057d17 Aug 24, 2024
105 checks passed
@yuanyao-nv yuanyao-nv deleted the dev-fp4-type-only branch August 24, 2024 03:31
Member

@yuanyao-nv I just noticed - the test data shouldn’t change in this PR?

Contributor Author

This was auto generated by update_doc.sh. I've seen cases where some seemingly irrelevant pb files get updated by the script. Do you understand why?

Member

The pb files can differ at the byte level when generated on different operating systems. I think that’s what’s happening here.

Contributor Author

I see. I wonder if it's possible to let the pipeline auto-generate these test files instead and reduce the number of files changed in each PR.

lutzroeder added a commit to lutzroeder/netron that referenced this pull request Aug 24, 2024
lutzroeder added a commit to lutzroeder/netron that referenced this pull request Aug 24, 2024
andife pushed a commit to andife/onnx that referenced this pull request Aug 26, 2024
### Description
- Add FLOAT4E2M1 as a new data type to proto as well as relevant helper
functions and tests.
- This PR splits out the portion of
onnx#6283 relevant to data type updates to
reduce the PR's size.

### Motivation and Context
Narrow precision data types with sub-byte bit widths are becoming
solutions to the rising cost, performance, and deployment challenges of
LLMs. ONNX already has INT4/UINT4. FP4 is another commonly used
narrow-precision data type for compressing both the weights and
activations of LLMs. For example
[OCP](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)
MXFP4 uses E2M1 as element type.

Similar to INT4/UINT4, FP4 weights/inputs are expected to be packed.

Signed-off-by: Yuan Yao <yuanyao@nvidia.com>
Signed-off-by: Andreas Fehlner <fehlner@arcor.de>
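A minimal sketch of how the new enum surfaces through the existing helper API, assuming an onnx build that already includes this PR; nothing beyond make_tensor_value_info is exercised here.

```python
from onnx import TensorProto, helper

# The new element type is just another TensorProto.DataType value; existing
# helpers such as make_tensor_value_info accept it unchanged.
x = helper.make_tensor_value_info("x", TensorProto.FLOAT4E2M1, [4, 4])
print(x.type.tensor_type.elem_type == TensorProto.FLOAT4E2M1)  # True
```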
github-merge-queue bot pushed a commit that referenced this pull request Aug 27, 2024
### Description
- FLOAT4E2M1 has been added to proto in
#6318
- This PR adds FLOAT4E2M1 support for QuantizeLinear, DequantizeLinear,
Cast, CastLike (opset 23).
- Also add support to non-compute ops: Constant, ConstantOfShape,
Identity, Reshape, Shape, Size, If, Loop, Scan, Flatten, Pad, Squeeze,
Unsqueeze, Transpose (opset 23).

Similar to INT4/UINT4, FP4 weights/inputs are expected to be packed.

---------

Signed-off-by: Yuan Yao (yuanyao) <yuanyao@nvidia.com>
Signed-off-by: Yuan Yao <yuanyao@nvidia.com>
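A hedged sketch of what that opset-23 support enables, assuming an onnx release containing both PRs: a FLOAT -> FLOAT4E2M1 -> FLOAT round-trip Cast model, checked and run with the reference evaluator. Exact rounding and saturation behavior is defined by the operator spec, not by this example.

```python
import numpy as np
import onnx
from onnx import TensorProto, helper
from onnx.reference import ReferenceEvaluator

nodes = [
    helper.make_node("Cast", ["x"], ["x_fp4"], to=TensorProto.FLOAT4E2M1),
    helper.make_node("Cast", ["x_fp4"], ["y"], to=TensorProto.FLOAT),
]
graph = helper.make_graph(
    nodes,
    "fp4_roundtrip",
    [helper.make_tensor_value_info("x", TensorProto.FLOAT, [8])],
    [helper.make_tensor_value_info("y", TensorProto.FLOAT, [8])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 23)])
onnx.checker.check_model(model)

# Each input is mapped onto the E2M1 grid {0, 0.5, 1, 1.5, 2, 3, 4, 6} (plus signs).
x = np.array([0.0, 0.4, 0.7, 1.2, 2.5, 3.9, 5.0, 7.0], dtype=np.float32)
print(ReferenceEvaluator(model).run(None, {"x": x})[0])
```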
linshokaku pushed a commit to linshokaku/onnx that referenced this pull request Oct 2, 2024
### Description
- Add FLOAT4E2M1 as a new data type to proto as well as relevant helper
functions and tests.
- This PR splits out the portion of
onnx#6283 relevant to data type updates to
reduce the PR's size.

### Motivation and Context
Narrow precision data types with sub-byte bit widths are becoming
solutions to the rising cost, performance, and deployment challenges of
LLMs. ONNX already has INT4/UINT4. FP4 is another commonly used
narrow-precision data type for compressing both the weights and
activations of LLMs. For example
[OCP](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)
MXFP4 uses E2M1 as element type.

Similar to INT4/UINT4, FP4 weights/inputs are expected to be packed.

Signed-off-by: Yuan Yao <yuanyao@nvidia.com>
Signed-off-by: Linsho Kaku <linsho@preferred.jp>
linshokaku pushed a commit to linshokaku/onnx that referenced this pull request Oct 2, 2024
### Description
- FLOAT4E2M1 has been added to proto in
onnx#6318
- This PR adds FLOAT4E2M1 support for QuantizeLinear, DequantizeLinear,
Cast, CastLike (opset 23).
- Also add support to non-compute ops: Constant, ConstantOfShape,
Identity, Reshape, Shape, Size, If, Loop, Scan, Flatten, Pad, Squeeze,
Unsqueeze, Transpose (opset 23).

Similar to INT4/UINT4, FP4 weights/inputs are expected to be packed.

---------

Signed-off-by: Yuan Yao (yuanyao) <yuanyao@nvidia.com>
Signed-off-by: Yuan Yao <yuanyao@nvidia.com>
Signed-off-by: Linsho Kaku <linsho@preferred.jp>
hariharans29 added a commit to microsoft/onnxruntime that referenced this pull request Sep 3, 2025
### Description
onnx/onnx#6318 and
onnx/onnx#6283 added FP4 support to ONNX. This
change introduces the FP4 type in ORT and adds type support to one
relevant operator (`Cast`) as a proof-of-concept for the type
integration into ORT. More op support will be added on a need-basis.

This change took inspiration from the following PRs:

#14731
#22228
#20362 

Some notes:

1) Only `tensor` type gets support for FP4 initially. Secondary types
like `seq(tensor)`, `sparse_tensor`, `optional` do not get support (so
as to not introduce unnecessary bloat to the framework without a solid
use-case)

2) Flatbuffer related files receive no updates in this PR

### Motivation and Context
Be able to run FP4 models with ORT
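A speculative sketch of exercising that proof-of-concept, assuming an onnxruntime build that includes this change and supports opset 23; the round-trip Cast graph mirrors the reference-evaluator example above and is not taken from the PR's tests.

```python
import numpy as np
import onnxruntime as ort
from onnx import TensorProto, helper

graph = helper.make_graph(
    [
        helper.make_node("Cast", ["x"], ["x_fp4"], to=TensorProto.FLOAT4E2M1),
        helper.make_node("Cast", ["x_fp4"], ["y"], to=TensorProto.FLOAT),
    ],
    "fp4_roundtrip",
    [helper.make_tensor_value_info("x", TensorProto.FLOAT, [4])],
    [helper.make_tensor_value_info("y", TensorProto.FLOAT, [4])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 23)])

sess = ort.InferenceSession(model.SerializeToString(), providers=["CPUExecutionProvider"])
(y,) = sess.run(None, {"x": np.array([0.4, 0.7, 2.5, 7.0], dtype=np.float32)})
print(y)  # inputs snapped onto the E2M1 grid by the FP4 round trip
```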
Labels: module: spec, run release CIs
Project status: Done
3 participants