[quant] Quantizable MultiheadAttention #49866


Closed
wants to merge 29 commits into from

Conversation

@z-a-f commented Dec 26, 2020

Stack from ghstack:

  • Adds the `torch.nn.quantizable.MultiheadAttention` module

The quantizable version can serve as a fully equivalent, drop-in replacement for the `torch.nn.MultiheadAttention` module.
The main difference is that it allows the linear units to be observed after the `prepare` step in the quantization flow.

Note: The `from_observed` method (called during `convert`) removes the `bias_k` and `bias_v` parameters and re-registers them as plain attributes.
This is done to avoid the error raised when assigning a quantized tensor to a `torch.nn.Parameter`.
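For illustration, a minimal sketch of what that re-registration might look like inside `from_observed`; the `observed`, `scale`, and `zero_point` names are assumptions for the sketch, not the actual implementation:

```python
# Sketch only: re-register bias_k as a plain attribute so a quantized tensor
# can be assigned to it (quantized tensors cannot be wrapped in nn.Parameter).
bias_k = observed.bias_k
if bias_k is not None:
    del observed.bias_k                     # drops the nn.Parameter registration
    observed.bias_k = torch.quantize_per_tensor(
        bias_k.detach(), scale, zero_point, torch.quint8)
```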

Test Plan:

```
python test/test_quantization.py TestQuantizedOps.test_custom_module_multi_head_attention
```

Differential Revision: D25706179

This is still a work in progress, and the tests will be added to this PR.

[ghstack-poisoned]
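For readers unfamiliar with the custom-module flow this plugs into, here is a minimal, hedged sketch of how the module is intended to be used; the wrapper model, the config-dict keys, and the calibration shapes are assumptions for illustration rather than part of this PR:

```python
import torch
import torch.nn as nn

class AttnWrapper(nn.Module):
    """Toy wrapper so attention inputs/outputs pass through quant stubs."""
    def __init__(self, embed_dim=8, num_heads=2):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()
        self.mha = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, x):
        x = self.quant(x)
        out, _ = self.mha(x, x, x)
        return self.dequant(out)

model = AttnWrapper().eval()
model.qconfig = torch.quantization.default_qconfig

# Swap the float MHA for the observable (quantizable) version at prepare time.
prepared = torch.quantization.prepare(
    model,
    prepare_custom_config_dict={
        "float_to_observed_custom_module_class": {
            nn.MultiheadAttention: torch.nn.quantizable.MultiheadAttention,
        }
    },
)

prepared(torch.randn(4, 1, 8))   # calibration pass

# convert() triggers from_observed(), which handles bias_k / bias_v as
# described above. Depending on the release, a matching
# "observed_to_quantized_custom_module_class" entry may also be needed.
quantized = torch.quantization.convert(prepared)
```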
@facebook-github-bot
facebook-github-bot commented Dec 26, 2020

💊 CI failures summary and remediations

As of commit a689cc1 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your pull requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

z-a-f pushed a commit that referenced this pull request Dec 26, 2020
ghstack-source-id: 6d5a107
Pull Request resolved: #49866
Zafar added 2 commits December 26, 2020 18:48
z-a-f pushed a commit that referenced this pull request Dec 27, 2020
ghstack-source-id: 279f20f
Pull Request resolved: #49866
z-a-f pushed a commit that referenced this pull request Dec 28, 2020
ghstack-source-id: 713f510
Pull Request resolved: #49866
z-a-f pushed a commit that referenced this pull request Dec 30, 2020
ghstack-source-id: 5ec8371
Pull Request resolved: #49866
@vkuzo left a comment

Chatted offline; requesting changes to either reuse code instead of copy-pasting, or to get alignment from PyTorch core that copy-pasta is the right way to go. If we do decide to copy, can we document the reasoning in the PR summary and in the code?

Zafar added 2 commits January 26, 2021 13:34
@z-a-f requested review from raghuramank100 and vkuzo January 26, 2021 22:04
@vkuzo left a comment


lg, thanks for addressing the comments!

@@ -0,0 +1,439 @@
import torch
Contributor
Nit on the filename: is activation.py the best filename for it? I would have expected something like multiheadattention.py.

Author
MHA is a type of activation by convention -- it's more of a head (imho), but PyTorch does not have a "head" subgroup for its layers, so it is put in the activation group.

        self.dequant_v = torch.quantization.DeQuantStub()

    def _get_name(self):
        return 'QuantizableMultiheadAttention'
Contributor
Is this going to return QuantizableMultiheadAttention in both the fp32-with-observers state and the quantized state? If so, would it be possible to refactor the code so that the final quantized module prints QuantizedMultiheadAttention, to help with debugging?

Author
It is not as straightforward as it sounds. This would require a reflection-style design pattern that changes the class when from_observed is called, and I would not recommend doing it, as this is a power feature. Alternatively, one could check whether the layer weights are quantized and change the name accordingly, but I think that might be error prone -- I'll think about it...

Contributor
It would be good to fix this in the long term; the current state will be confusing to people who use this without reading the code.
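One possible shape for the weight-inspection alternative mentioned above, as a sketch only; using `out_proj` as the attribute to inspect is an assumption about the module internals:

```python
import torch.nn.quantized as nnq

def _get_name(self):
    # Hypothetical: report a different name once the projections have been
    # converted to quantized Linear modules.
    if isinstance(self.out_proj, nnq.Linear):
        return 'QuantizedMultiheadAttention'
    return 'QuantizableMultiheadAttention'
```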


        q = self.q_scaling_product.mul_scalar(q, scaling)

        if attn_mask is not None:
Contributor
sg

for bias, add_bias_kv, add_zero_attn in itertools.product(
        Bias, Add_bias_kv, Add_zero_attn):
    min_power = 20
    max_mse = 4
Contributor
Can we have a tighter check of the quantized module against the expected output? Perhaps a reference op that performs exactly the same numerics with fp32 operations could be used.

Author
I already implemented the dequantize method inside the module, which creates a dequantized reference -- the numerics don't exceed 20 dB.
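For reference, a minimal sketch of the kind of SQNR check being discussed; the 20 dB threshold mirrors min_power in the test snippet above, and the tensor names are illustrative:

```python
import torch

def snr_db(reference, test):
    # Signal-to-quantization-noise ratio, in dB, between a float reference
    # and the dequantized output of the quantized module.
    noise_power = (reference - test).pow(2).mean()
    signal_power = reference.pow(2).mean()
    return 10 * torch.log10(signal_power / noise_power)

# e.g. assert snr_db(fp32_out, quantized_out.dequantize()) >= 20
```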

Hi z-a-f,
How can I dynamically quantize a wav2vec model?

@z-a-f requested a review from raghuramank100 January 27, 2021 03:07
Zafar added 7 commits February 3, 2021 11:49
@facebook-github-bot
This pull request has been merged in b8584b8.

@facebook-github-bot facebook-github-bot deleted the gh/z-a-f/93/head branch February 21, 2021 15:17
xsacha pushed a commit to xsacha/pytorch that referenced this pull request Mar 31, 2021
Summary:
Pull Request resolved: pytorch#49866

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D25706179

fbshipit-source-id: e27ab641d8d1eccc64cf9e44343459331f89eea4
jasperzhong pushed a commit to jasperzhong/swift that referenced this pull request Nov 25, 2021
ghstack-source-id: db53795
Pull Request resolved: pytorch/pytorch#49866