
Add Attention Op to ONNX Opset 23 #6501


Merged
merged 32 commits into onnx:main on Mar 1, 2025

Conversation

Contributor

@shubhambhokare1 shubhambhokare1 commented Oct 28, 2024

Description

Add the following key LLM op to the ONNX standard: Attention.
This standardized attention operator should cover:

  • Self and Cross Attentions
  • Multi-Head Attention (MHA)
  • Group-Query Attention (GQA)
  • Multi-Query Attention (MQA)
  • No-bias and Causal Mask attentions

Motivation and Context

Standardize operators that are showing up in key LLM models.
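
For context, a minimal numpy sketch of the computation this op is meant to standardize, assuming 4D Q/K/V of shape (batch, num_heads, seq_len, head_size); the function name and argument set are illustrative, not the reference implementation added in this PR:

import numpy as np

def sdpa_sketch(Q, K, V, *, is_causal=False, scale=None):
    # Q: (batch, q_num_heads, q_seq_len, head_size)
    # K, V: (batch, kv_num_heads, kv_seq_len, head_size)
    q_heads, kv_heads = Q.shape[1], K.shape[1]
    if q_heads != kv_heads:
        # GQA/MQA: repeat each K/V head so every query head has a matching key/value head
        reps = q_heads // kv_heads
        K = np.repeat(K, reps, axis=1)
        V = np.repeat(V, reps, axis=1)

    if scale is None:
        scale = 1.0 / np.sqrt(Q.shape[-1])

    # Attention scores: (batch, heads, q_seq_len, kv_seq_len)
    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) * scale

    if is_causal:
        # Lower-triangular mask; assumes q and kv sequence lengths match here
        q_len, kv_len = scores.shape[-2], scores.shape[-1]
        causal = np.tril(np.ones((q_len, kv_len), dtype=bool))
        scores = np.where(causal, scores, -np.inf)

    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.matmul(weights, V)

Self-attention vs. cross-attention is just a matter of where K and V come from (the same sequence as Q, or a different one); MHA, GQA, and MQA differ only in the ratio of query heads to key/value heads.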

@shubhambhokare1 shubhambhokare1 requested a review from a team as a code owner October 28, 2024 20:49

codecov bot commented Oct 28, 2024

Codecov Report

Attention: Patch coverage is 0% with 469 lines in your changes missing coverage. Please review.

Project coverage is 56.45%. Comparing base (3d5acaf) to head (bdc31a3).
Report is 143 commits behind head on main.

Files with missing lines | Patch % | Lines
onnx/backend/test/case/node/attention.py | 0.00% | 469 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6501      +/-   ##
==========================================
- Coverage   57.13%   56.45%   -0.68%     
==========================================
  Files         507      509       +2     
  Lines       31927    32515     +588     
  Branches     3040     3057      +17     
==========================================
+ Hits        18240    18356     +116     
- Misses      12864    13334     +470     
- Partials      823      825       +2     

☔ View full report in Codecov by Sentry.

@WilliamTambellini

+1

@yuanyao-nv
Contributor

Some comments and questions:

  1. I think the name of the op should be Scaled... rather than Scalar...?
  2. Should kv-caching happen inside SDPA as you have proposed here? My understanding is it should be outside, since the input to SDPA is already the projected QKV.
  3. In some cases the attention mask can show up more generally as a sequence of pointwise ops, such as in the grok model where it is essentially a scale+tanh operation. Is it possible, and should we strive, to make the SDPA definition more general? (A small sketch of such a softcap follows this list.)
  4. In many cases more granular control over the precision of each operation is desired. In fact most backends will have highly tuned precision combinations for the sequence of ops in attention. Should the spec be more flexible to accommodate that?
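
On point 3, the scale+tanh pattern can be expressed as a soft cap on the pre-softmax scores. A minimal numpy sketch, assuming the cap is applied right after the QK matmul (illustrative only, not the spec text of this PR):

import numpy as np

def apply_softcap(scores, softcap):
    # Soft-cap the pre-softmax attention scores: softcap * tanh(scores / softcap)
    # bounds the logits to (-softcap, softcap); with the cap disabled, the
    # scores pass through unchanged.
    if softcap is None or softcap == 0:
        return scores
    return softcap * np.tanh(scores / softcap)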

@gramalingam
Contributor

Minor nit about the name: maybe "Attention" would be a better choice. It is a shorter name, and in any case the op covers multiple variants like MHA/GQA/SDPA.

@justinchuby justinchuby changed the title Add Attention Op (ScalarDotProductAttention) to ONNX Opset 23 Add Attention Op (ScaledDotProductAttention) to ONNX Opset 23 Jan 4, 2025
@justinchuby justinchuby added this to the 1.18 milestone Jan 8, 2025
@shubhambhokare1 shubhambhokare1 requested a review from a team as a code owner January 23, 2025 17:27
@shubhambhokare1 shubhambhokare1 changed the title Add Attention Op (ScaledDotProductAttention) to ONNX Opset 23 Add Attention Op to ONNX Opset 23 Jan 23, 2025
from onnx.reference.ops.op_asinh import Asinh
from onnx.reference.ops.op_atan import Atan
from onnx.reference.ops.op_atanh import Atanh
from onnx.reference.ops.op_attention import Attention

Check notice — Code scanning / CodeQL: Unused import
Import of 'AttributeHasValue' is not used.
Import of 'Attention' is not used.

@github-project-automation github-project-automation bot moved this from In progress to Reviewer approved in PR Tracker Feb 28, 2025
@gramalingam gramalingam added this pull request to the merge queue Feb 28, 2025
Merged via the queue into onnx:main with commit d9b1e4f Mar 1, 2025
39 of 41 checks passed
@github-project-automation github-project-automation bot moved this from Reviewer approved to Done in PR Tracker Mar 1, 2025
@justinchuby justinchuby added the release notes label Mar 3, 2025

head_size_q = int(hidden_size_q / q_num_heads)
new_shape_q = [batch_size, q_num_heads, Q.shape[1], head_size_q]
Q = np.reshape(Q, new_shape_q)
Contributor


When Q has a 3D shape (batch_size, q_sequence_length, q_hidden_size), it cannot be reshaped directly to [batch_size, q_num_heads, q_sequence_length, head_size_q]. It needs a reshape first and then a transpose.


Agree with @tianleiwu. When Q is 3D, [b, seq_len, num_head * head_size], it should first be reshaped to [b, seq_len, num_head, head_size] and then transposed to [b, num_head, seq_len, head_size].
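
A minimal numpy sketch of the reshape-then-transpose both reviewers describe, assuming a 3D Q of shape (batch_size, seq_len, hidden_size); the helper name is illustrative:

import numpy as np

def split_heads(Q, q_num_heads):
    # (batch_size, seq_len, hidden_size) -> (batch_size, num_heads, seq_len, head_size)
    batch_size, seq_len, hidden_size = Q.shape
    head_size = hidden_size // q_num_heads
    # First split the hidden dimension into (num_heads, head_size) ...
    Q = np.reshape(Q, (batch_size, seq_len, q_num_heads, head_size))
    # ... then move the head axis in front of the sequence axis.
    return np.transpose(Q, (0, 2, 1, 3))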

softcap=None,
qk_matmul_output_mode=None,
) -> np.ndarray:
assert len(Q.shape) == len(K.shape) == len(V.shape)
Contributor

@tianleiwu tianleiwu Mar 19, 2025


If we require Q, K, and V (and the output) to have the same rank, it is better to add that to the operator spec.

Labels
release notes (Important changes to call out in release notes) · topic: operator (Issues related to ONNX operators)
Projects
Status: Done