[quant] Quantizable MultiheadAttention #49866
Conversation
This is still work in progress, and the tests will be amended into this PR.
💊 CI failures summary and remediations as of commit a689cc1 (more details on the Dr. CI page).
- Adds `torch.nn.quantizable.MultiheadAttention`.

The quantizable version can serve as a fully equivalent replacement for the `torch.nn.MultiheadAttention` module. The main difference is that it allows the linear units to be observed after the `prepare` step in the quantization flow.

Note: the `from_observed` method (called during `convert`) removes the `bias_k` and `bias_v` parameters and resets them as plain attributes. This is done to avoid the error raised when assigning a quantized tensor to a `torch.nn.Parameter`.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_custom_module_multi_head_attention
```

Differential Revision: [D25706179](https://our.internmc.facebook.com/intern/diff/D25706179)
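For readers unfamiliar with the eager-mode custom-module flow this description refers to, here is a minimal sketch of how a model containing `nn.MultiheadAttention` might be prepared, calibrated, and converted. The custom-config keys follow the eager-mode `torch.quantization.prepare`/`convert` API; the exact class mappings and the choice of `default_qconfig` are assumptions for illustration, not code taken from this PR.

```python
import torch
import torch.nn as nn
import torch.nn.quantizable as nnqa

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)

    def forward(self, q, k, v):
        out, _ = self.mha(q, k, v)
        return out

float_model = TinyModel().eval()
float_model.qconfig = torch.quantization.default_qconfig

# `prepare` swaps the float MHA for the observed (quantizable) version,
# so the linear units inside it can be observed (mapping is an assumption).
observed = torch.quantization.prepare(
    float_model,
    prepare_custom_config_dict={
        "float_to_observed_custom_module_class": {
            nn.MultiheadAttention: nnqa.MultiheadAttention,
        }
    },
)

# Calibrate with representative data so the observers record activation ranges.
q = k = v = torch.randn(4, 1, 8)  # (seq_len, batch, embed_dim)
observed(q, k, v)

# `convert` calls `from_observed`, which replaces the internal projections with
# quantized Linear modules (the observed->quantized mapping here is also an
# assumption, not taken from this PR).
quantized = torch.quantization.convert(
    observed,
    convert_custom_config_dict={
        "observed_to_quantized_custom_module_class": {
            nnqa.MultiheadAttention: nnqa.MultiheadAttention,
        }
    },
)
print(quantized)
```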
Chatted offline; requesting changes to either reuse code instead of copy-pasting, or get alignment from PyTorch core that copy-pasting is the right way to go. If we do decide to copy, can we document the reasoning in the PR summary and the code?
lg, thanks for addressing the comments!
```
import torch
```
nit on the filename: is `activation.py` the best filename for it? I would have expected something like `multiheadattention.py`, etc.
MHA is a type of activation by convention -- it's more of a head (imho), but PyTorch does not have a "head" subgroup for the layers, so it is put in the activation group.
```
self.dequant_v = torch.quantization.DeQuantStub()

def _get_name(self):
    return 'QuantizableMultiheadAttention'
```
Is this going to return `QuantizableMultiheadAttention` in both the fp32-with-observers state and the quantized state? If so, would it be possible to refactor the code so that the final quantized module prints out `QuantizedMultiheadAttention`, to help with debugging?
It is not as straightforward. This would require a reflection design pattern, with the function changing when `from_observed` is called. I would not recommend doing it, as this is a power-user feature. Alternatively, one can check whether the layer weights are quantized and change the name accordingly, but I think that might be error-prone -- I'll think about it...
It would be good to fix this in the long term; the current state will be confusing to people who use this module without reading the code.
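As a rough sketch of the alternative mentioned above (checking whether the layer weights are already quantized and changing the name accordingly), something like the toy module below could work. The `linear_Q` attribute name and the use of a quantized `Linear` as the marker are assumptions for illustration, not code from this PR.

```python
import torch
import torch.nn as nn
import torch.nn.quantized as nnq

class _NamingSketch(nn.Module):
    """Toy stand-in for the quantizable MHA, used only to show the naming idea."""
    def __init__(self):
        super().__init__()
        # `linear_Q` is an assumed attribute name for the query projection.
        self.linear_Q = nn.Linear(8, 8)

    def _get_name(self):
        # Report a different name once the projection has been swapped for a
        # quantized Linear during `convert`.
        if isinstance(self.linear_Q, nnq.Linear):
            return 'QuantizedMultiheadAttention'
        return 'QuantizableMultiheadAttention'

print(_NamingSketch())  # repr uses _get_name(), so this prints QuantizableMultiheadAttention(...)
```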
```
q = self.q_scaling_product.mul_scalar(q, scaling)

if attn_mask is not None:
```
sg
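For context on the line quoted above: routing the query scaling through a `FloatFunctional` gives the scalar multiply a module identity, so the quantization flow can treat it like any other module and swap in its quantized counterpart during `convert`. A minimal standalone sketch (not the PR's exact code; the module and parameter names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.quantized as nnq

class ScaledQuery(nn.Module):
    """Toy module showing the FloatFunctional pattern used for query scaling."""
    def __init__(self, scaling: float):
        super().__init__()
        self.scaling = scaling
        # FloatFunctional wraps the scalar multiply in a module, which is what
        # lets the quantization flow convert it later.
        self.q_scaling_product = nnq.FloatFunctional()

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        return self.q_scaling_product.mul_scalar(q, self.scaling)

q = torch.randn(4, 1, 8)
print(ScaledQuery(0.125)(q).shape)
```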
```
for bias, add_bias_kv, add_zero_attn in itertools.product(
        Bias, Add_bias_kv, Add_zero_attn):
    min_power = 20
    max_mse = 4
```
Can we have a tighter check on the quantized module vs. the expected output? Perhaps a reference op that performs exactly the same numerics with fp32 operations could be added.
I already implemented the `dequantize` method inside the module that creates a dequantized reference -- the numerics don't exceed 20 dB.
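For reference, the kind of check being discussed might look like the sketch below: compare the converted module's output against a dequantized fp32 reference and require a minimum signal-to-quantization-noise ratio. The 20 dB threshold mirrors the `min_power = 20` constant in the quoted test; reading that constant as a minimum SQNR is my assumption.

```python
import torch

def sqnr_db(reference: torch.Tensor, test: torch.Tensor) -> float:
    """Signal-to-quantization-noise ratio between reference and test tensors, in dB."""
    noise = reference - test
    power_ratio = reference.pow(2).mean() / noise.pow(2).mean().clamp_min(1e-12)
    return 10.0 * torch.log10(power_ratio).item()

# Synthetic tensors standing in for the fp32 reference and the quantized
# module's dequantized output:
ref = torch.randn(10, 8)
out = ref + 0.01 * torch.randn(10, 8)
assert sqnr_db(ref, out) >= 20.0
```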
hi z-a-f, how can I dynamically quantize a wav2vec model now?
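(Not a maintainer reply, just a general sketch.) Eager-mode dynamic quantization is usually applied to a loaded model's `Linear` layers with `torch.quantization.quantize_dynamic`; the tiny model below is a placeholder standing in for a wav2vec checkpoint loaded elsewhere.

```python
import torch
import torch.nn as nn

# Placeholder model; in practice this would be the loaded wav2vec model.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 4)).eval()

# Dynamically quantize the Linear layers to qint8 weights.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized_model)
```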
This pull request has been merged in b8584b8.
Summary: Pull Request resolved: pytorch#49866

- Adds `torch.nn.quantizable.MultiheadAttention`. The quantizable version can serve as a fully equivalent replacement for the `torch.nn.MultiheadAttention` module. The main difference is that it allows the linear units to be observed after the `prepare` step in the quantization flow. Note: the `from_observed` method (called during `convert`) removes the `bias_k` and `bias_v` parameters and resets them as plain attributes. This is done to avoid the error raised when assigning a quantized tensor to a `torch.nn.Parameter`. (Note: this ignores all push blocking failures!)

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_custom_module_multi_head_attention
```

Imported from OSS
Reviewed By: vkuzo
Differential Revision: D25706179
fbshipit-source-id: e27ab641d8d1eccc64cf9e44343459331f89eea4
Stack from ghstack:

- Adds `torch.nn.quantizable.MultiheadAttention`.

The quantizable version can serve as a fully equivalent replacement for the `torch.nn.MultiheadAttention` module. The main difference is that it allows the linear units to be observed after the `prepare` step in the quantization flow.

Note: the `from_observed` method (called during `convert`) removes the `bias_k` and `bias_v` parameters and resets them as plain attributes. This is done to avoid the error raised when assigning a quantized tensor to a `torch.nn.Parameter`.

Test Plan:
```
python test/test_quantization.py TestQuantizedOps.test_custom_module_multi_head_attention
```

Differential Revision: D25706179