[quant] Quantizable MultiheadAttention #49866
Conversation
This is still work in progress, and the tests will be amended into this PR.
💊 CI failures summary and remediations as of commit a689cc1 (more details on the Dr. CI page).
- Adds `torch.nn.quantizable.MultiheadAttention`.

The quantizable version can serve as a fully equivalent replacement for the `torch.nn.MultiheadAttention` module. The main difference is that it allows the linear units to be observed after the `prepare` step in the quantization flow.

Note: the `from_observed` method (called during `convert`) removes the `bias_k` and `bias_v` parameters and resets them as plain attributes. This is done to avoid the error raised when assigning a quantized tensor to a `torch.nn.Parameter`.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_custom_module_multi_head_attention
```

Differential Revision: [D25706179](https://our.internmc.facebook.com/intern/diff/D25706179)
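For readers unfamiliar with the eager-mode custom-module flow this description refers to, here is a minimal sketch of how a model containing `nn.MultiheadAttention` might be prepared, calibrated, and converted. The custom-config keys follow the eager-mode `torch.quantization.prepare`/`convert` API; the exact class mappings and the choice of `default_qconfig` are assumptions for illustration, not code taken from this PR.

```python
import torch
import torch.nn as nn
import torch.nn.quantizable as nnqa

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)

    def forward(self, q, k, v):
        out, _ = self.mha(q, k, v)
        return out

float_model = TinyModel().eval()
float_model.qconfig = torch.quantization.default_qconfig

# `prepare` swaps the float MHA for the observed (quantizable) version,
# so the linear units inside it can be observed (mapping is an assumption).
observed = torch.quantization.prepare(
    float_model,
    prepare_custom_config_dict={
        "float_to_observed_custom_module_class": {
            nn.MultiheadAttention: nnqa.MultiheadAttention,
        }
    },
)

# Calibrate with representative data so the observers record activation ranges.
q = k = v = torch.randn(4, 1, 8)  # (seq_len, batch, embed_dim)
observed(q, k, v)

# `convert` calls `from_observed`, which replaces the internal projections with
# quantized Linear modules (the observed->quantized mapping here is also an
# assumption, not taken from this PR).
quantized = torch.quantization.convert(
    observed,
    convert_custom_config_dict={
        "observed_to_quantized_custom_module_class": {
            nnqa.MultiheadAttention: nnqa.MultiheadAttention,
        }
    },
)
print(quantized)
```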
Chatted offline; requesting changes to either reuse code instead of copy-pasting, or get alignment from PyTorch core that copy-pasting is the right way to go. If we do decide to copy, can we document the reasoning in the PR summary and the code?
lg, thanks for addressing the comments!
```
import torch
```
nit on the filename: is `activation.py` the best filename for it? I would have expected something like `multiheadattention.py`, etc.
MHA is a type of activation by convention -- it's more of a head (imho), but PyTorch does not have a "head" subgroup for the layers, so it is put in the activation group.
```
self.dequant_v = torch.quantization.DeQuantStub()

def _get_name(self):
    return 'QuantizableMultiheadAttention'
```
Is this going to return `QuantizableMultiheadAttention` in both the fp32-with-observers state and the quantized state? If so, would it be possible to refactor the code so that the final quantized module prints out `QuantizedMultiheadAttention`, to help with debugging?
It is not as straightforward. This would require a reflection design pattern, with the function changing when `from_observed` is called. I would not recommend doing it, as this is a power-user feature. Alternatively, one can check whether the layer weights are quantized and change the name accordingly, but I think that might be error-prone -- I'll think about it...
It would be good to fix this in the long term; the current state will be confusing to people who use this module without reading the code.
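As a rough sketch of the alternative mentioned above (checking whether the layer weights are already quantized and changing the name accordingly), something like the toy module below could work. The `linear_Q` attribute name and the use of a quantized `Linear` as the marker are assumptions for illustration, not code from this PR.

```python
import torch
import torch.nn as nn
import torch.nn.quantized as nnq

class _NamingSketch(nn.Module):
    """Toy stand-in for the quantizable MHA, used only to show the naming idea."""
    def __init__(self):
        super().__init__()
        # `linear_Q` is an assumed attribute name for the query projection.
        self.linear_Q = nn.Linear(8, 8)

    def _get_name(self):
        # Report a different name once the projection has been swapped for a
        # quantized Linear during `convert`.
        if isinstance(self.linear_Q, nnq.Linear):
            return 'QuantizedMultiheadAttention'
        return 'QuantizableMultiheadAttention'

print(_NamingSketch())  # repr uses _get_name(), so this prints QuantizableMultiheadAttention(...)
```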
```
q = self.q_scaling_product.mul_scalar(q, scaling)

if attn_mask is not None:
```
sg
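For context on the line quoted above: routing the query scaling through a `FloatFunctional` gives the scalar multiply a module identity, so the quantization flow can treat it like any other module and swap in its quantized counterpart during `convert`. A minimal standalone sketch (not the PR's exact code; the module and parameter names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.quantized as nnq

class ScaledQuery(nn.Module):
    """Toy module showing the FloatFunctional pattern used for query scaling."""
    def __init__(self, scaling: float):
        super().__init__()
        self.scaling = scaling
        # FloatFunctional wraps the scalar multiply in a module, which is what
        # lets the quantization flow convert it later.
        self.q_scaling_product = nnq.FloatFunctional()

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        return self.q_scaling_product.mul_scalar(q, self.scaling)

q = torch.randn(4, 1, 8)
print(ScaledQuery(0.125)(q).shape)
```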
```
for bias, add_bias_kv, add_zero_attn in itertools.product(
        Bias, Add_bias_kv, Add_zero_attn):
    min_power = 20
    max_mse = 4
```
Can we have a tighter check on the quantized module vs. the expected output? Perhaps a reference op that performs exactly the same numerics with fp32 operations could be added.
I already implemented the `dequantize` method inside the module that creates a dequantized reference -- the numerics don't exceed 20 dB.
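For reference, the kind of check being discussed might look like the sketch below: compare the converted module's output against a dequantized fp32 reference and require a minimum signal-to-quantization-noise ratio. The 20 dB threshold mirrors the `min_power = 20` constant in the quoted test; reading that constant as a minimum SQNR is my assumption.

```python
import torch

def sqnr_db(reference: torch.Tensor, test: torch.Tensor) -> float:
    """Signal-to-quantization-noise ratio between reference and test tensors, in dB."""
    noise = reference - test
    power_ratio = reference.pow(2).mean() / noise.pow(2).mean().clamp_min(1e-12)
    return 10.0 * torch.log10(power_ratio).item()

# Synthetic tensors standing in for the fp32 reference and the quantized
# module's dequantized output:
ref = torch.randn(10, 8)
out = ref + 0.01 * torch.randn(10, 8)
assert sqnr_db(ref, out) >= 20.0
```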
hi z-a-f, how can I dynamically quantize a wav2vec model now?
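(Not a maintainer reply, just a general sketch.) Eager-mode dynamic quantization is usually applied to a loaded model's `Linear` layers with `torch.quantization.quantize_dynamic`; the tiny model below is a placeholder standing in for a wav2vec checkpoint loaded elsewhere.

```python
import torch
import torch.nn as nn

# Placeholder model; in practice this would be the loaded wav2vec model.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4)).eval()

# Dynamically quantize the Linear layers to qint8 weights.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized_model)
```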
This pull request has been merged in b8584b8.
Summary: Pull Request resolved: pytorch#49866

- Adds `torch.nn.quantizable.MultiheadAttention`. The quantizable version can serve as a fully equivalent replacement for the `torch.nn.MultiheadAttention` module. The main difference is that it allows the linear units to be observed after the `prepare` step in the quantization flow. Note: the `from_observed` method (called during `convert`) removes the `bias_k` and `bias_v` parameters and resets them as plain attributes. This is done to avoid the error raised when assigning a quantized tensor to a `torch.nn.Parameter`. (Note: this ignores all push blocking failures!)

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_custom_module_multi_head_attention
```

Imported from OSS
Reviewed By: vkuzo
Differential Revision: D25706179
fbshipit-source-id: e27ab641d8d1eccc64cf9e44343459331f89eea4
Stack from ghstack:

- Adds `torch.nn.quantizable.MultiheadAttention`.

The quantizable version can serve as a fully equivalent replacement for the `torch.nn.MultiheadAttention` module. The main difference is that it allows the linear units to be observed after the `prepare` step in the quantization flow.

Note: the `from_observed` method (called during `convert`) removes the `bias_k` and `bias_v` parameters and resets them as plain attributes. This is done to avoid the error raised when assigning a quantized tensor to a `torch.nn.Parameter`.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_custom_module_multi_head_attention
```

Differential Revision: D25706179