Conversation

jerryzh168
Contributor

Summary:
Previously, by default we did not generate quantized weights; that is, we had fp32 weights, with
`fp32 weight -> q -> dq -> linear -> ...` in the quantized model.

After this PR, we'll produce a graph with int8 weights by default after `convert_pt2e`:
`int8 weight -> dq -> linear -> ...`

We'll remove the `fold_quantize` flag in the next PR.

Test Plan: CI

Differential Revision: D51730862
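
For context, a minimal sketch of the pt2e quantization flow this description refers to (the toy module, example inputs, and the XNNPACK quantizer choice are illustrative assumptions, and the capture API shown is the one used in this release timeframe):

```python
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 8)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(1, 16),)
m = capture_pre_autograd_graph(M().eval(), example_inputs)

quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)
m(*example_inputs)  # calibration

# With this PR the converted graph stores an int8 weight constant feeding a
# dequantize op (`int8 weight -> dq -> linear`) instead of keeping the fp32
# weight with a quantize/dequantize pair (`fp32 weight -> q -> dq -> linear`).
m = convert_pt2e(m)
m.print_readable()  # the weight get_attr now holds int8 values
```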


pytorch-bot bot commented Jan 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118605

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4489b36 with merge base 2eefbc0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D51730862

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Jan 30, 2024
@pytorchmergebot
Collaborator

Merge failed

Reason: Comment with id 1917893715 not found

Details for Dev Infra team: raised by workflow job

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Jan 31, 2024
Summary:
This is a follow up to pytorch/pytorch#118605 to remove `fold_quantize` flag from
`convert_pt2e`

Differential Revision: D53247301
jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Jan 31, 2024
Summary:
X-link: pytorch/pytorch#118701

This is a follow up to pytorch/pytorch#118605 to remove `fold_quantize` flag from
`convert_pt2e`

Differential Revision: D53247301
@sdasgup3

sdasgup3 commented Feb 1, 2024

@jerryzh168 I am wondering about the possibility of keeping the flag. This would allow the backends/pattern matchers to consume an unaltered reference quantized model (with q/dq for both weights and activations) and fold the quantized weights on an as-needed basis.

Moreover, if you have a strong use case for "always" folding the constant, then the flag can very well default to true.

jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Feb 1, 2024
Summary:

X-link: pytorch/pytorch#118701

This is a follow up to pytorch/pytorch#118605 to remove `fold_quantize` flag from
`convert_pt2e`

Reviewed By: andrewor14

Differential Revision: D53247301
pytorch-bot bot pushed a commit that referenced this pull request Feb 1, 2024
Summary:
X-link: pytorch/executorch#1766


This is a follow up to #118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Reviewed By: andrewor14

Differential Revision: D53247301
@jerryzh168
Contributor Author

jerryzh168 commented Feb 2, 2024

> @jerryzh168 I am wondering about the possibility of keeping the flag. This would allow the backends/pattern matchers to consume an unaltered reference quantized model (with q/dq for both weights and activations) and fold the quantized weights on an as-needed basis.
>
> Moreover, if you have a strong use case for "always" folding the constant, then the flag can very well default to true.

The use cases we are seeing mostly benefit from this folding, so we are turning it on by default. The pattern should only concern the dequantize op for the weight, I think. If you have a backend that doesn't want to fold the weight, it shouldn't be too hard to revert this change as well? Please let me know if there are any problems with reverting the folding in your code.

@kimishpatel
Contributor

> @jerryzh168 I am wondering about the possibility of keeping the flag. This would allow the backends/pattern matchers to consume an unaltered reference quantized model (with q/dq for both weights and activations) and fold the quantized weights on an as-needed basis.
> Moreover, if you have a strong use case for "always" folding the constant, then the flag can very well default to true.
>
> The use cases we are seeing mostly benefit from this folding, so we are turning it on by default. The pattern should only concern the dequantize op for the weight, I think. If you have a backend that doesn't want to fold the weight, it shouldn't be too hard to revert this change as well? Please let me know if there are any problems with reverting the folding in your code.

@sdasgup3 I would also like to understand the use case you may have. One of the major reasons to turn on quantize folding by default is to be able to produce a quantized model that actually has quantized weights and is thus smaller in size. I am curious whether this does not work in your backends because you have front-ends that produce a q-dq representation?

@qihqi
Contributor

qihqi commented Feb 5, 2024

Hi @jerryzh168, I understand the desire to turn on `fold_quantize` by default, but this PR does more than that: it not only turns on that behavior by default but also removes the possibility of turning it off.

Folding in constants is not reversible after the fact.

> it shouldn't be too hard to revert this change as well? Please let me know if there are any problems with reverting the folding in your code.

Could you also clarify what you mean by "reverting in your code"? We depend on the upstream (official) PyTorch release and do not maintain a separate copy of the PyTorch code.

To clarify the ask:
We are OK if the default behavior changes; we just need the choice of using the old behavior.

@jerryzh168
Contributor Author

@qihqi Sorry, by "reverting the folding" I mean finding the dq op for the weight and adding back the quantize op for the weight:

```python
for n in model.graph.nodes:
    if n.target == torch.ops.quantized_decomposed.dequantize_per_tensor.default:
        arg = n.args[0]
        if arg.op == "get_attr":
            # change this arg to a corresponding quantize op
            ...
```

Is this possible to do?
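
For completeness, a hypothetical sketch of that reverting pass (my own helper, not an existing API; it assumes per-tensor quantized weights folded into top-level attributes and reconstructs an fp32 weight lossily from the int8 values):

```python
import torch

def unfold_weight_quantize(model: torch.fx.GraphModule) -> torch.fx.GraphModule:
    dq = torch.ops.quantized_decomposed.dequantize_per_tensor.default
    q = torch.ops.quantized_decomposed.quantize_per_tensor.default
    for n in model.graph.nodes:
        if n.target == dq and isinstance(n.args[0], torch.fx.Node) and n.args[0].op == "get_attr":
            weight_node = n.args[0]
            scale, zero_point, qmin, qmax, dtype = n.args[1:6]
            int8_w = getattr(model, weight_node.target)  # assumes a top-level attribute name
            # Recover an fp32 weight (lossy: the original fp32 values are gone after folding)
            # and store it back under the same attribute name.
            setattr(model, weight_node.target, (int8_w.to(torch.float32) - zero_point) * scale)
            # Re-insert the quantize op so the graph reads `fp32 weight -> q -> dq` again.
            with model.graph.inserting_before(n):
                q_node = model.graph.call_function(
                    q, (weight_node, scale, zero_point, qmin, qmax, dtype)
                )
            n.replace_input_with(weight_node, q_node)
    model.graph.lint()
    model.recompile()
    return model
```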

@paulinesho

> @qihqi Sorry, by "reverting the folding" I mean finding the dq op for the weight and adding back the quantize op for the weight:
>
> ```python
> for n in model.graph.nodes:
>     if n.target == torch.ops.quantized_decomposed.dequantize_per_tensor.default:
>         arg = n.args[0]
>         if arg.op == "get_attr":
>             # change this arg to a corresponding quantize op
>             ...
> ```
>
> Is this possible to do?

Possibly; it needs to be tested across the board. But even if it works well, I have a few concerns with this approach, for the following reasons:

1. I believe there's value in a true reference quantized representation (Q/DQ format) as a stable standard with minimal breaking changes, but this new default sneaks in a framework-enforced optimization (actual compression) rather than deferring it to the backend.
2. From a user API perspective, this introduces the need to postprocess a PT2E-converted graph outside of the API, as the snippet above shows. I believe there currently isn't a convenient hook (though one could be added later).
3. Undoing an optimization downstream is not as clean a design as not applying the optimization in the first place.
4. Can we support sub-byte quantization cleanly this way? We could do some kind of packing in the framework, but if a backend wants to pack in a different manner, it would have to unpack the framework packing first. We could continue to store the values in 8-bit containers, but that seems too specific a design choice IMO.

I support having an option for a true reference representation (fully floating point + Q/DQs) that gives the backend the flexibility to transform the model as needed, though that option doesn't need to be the default.

@kimishpatel
Contributor

Let me respond:

> I believe there's value in a true reference quantized representation (Q/DQ format) as a stable standard with minimal breaking changes, but this new default sneaks in a framework-enforced optimization (actual compression) rather than deferring it to the backend.
>
> From a user API perspective, this introduces the need to postprocess a PT2E-converted graph outside of the API, as the snippet above shows. I believe there currently isn't a convenient hook (though one could be added later).

I think this is a question of the representation chosen here. I am not 100% sold on whether one is better than the other (although another reason you mentioned below is interesting and I will respond to it). The primary motivation for us was to have the actual quantized model serialized with quantized dtypes supported by the framework, in this case 8-bit dtypes.

Now, I do agree with you that we should at least provide an option to have the Q/DQ representation rather than force this on the downstream stack, and to that end we should expose the option.

> Undoing an optimization downstream is not as clean a design as not applying the optimization in the first place.

Agree.

> Can we support sub-byte quantization cleanly this way? We could do some kind of packing in the framework, but if a backend wants to pack in a different manner, it would have to unpack the framework packing first. We could continue to store the values in 8-bit containers, but that seems too specific a design choice IMO.

I think this is the most interesting question. My opinion is that here we try to decouple the serialized representation from the underlying tensor dtypes supported by the framework. If, say, int4 were a natively supported dtype, then we could have a 4-bit quantized model with DQ on the weights and thus choose the DQ-only representation. At the moment, however, an int4 dtype does not exist, so we have to either 1) use int8 and waste space or 2) use packing. I am, somewhat reluctantly, favoring 1 instead of having to introduce pack/unpack into the graph representation. At this point you might ask: then why not just have the Q/DQ representation? My stance is that the DQ-based IR is the choice we made. Because PyTorch does not natively have a 4-bit dtype, 4-bit quantized models can't be represented using just DQ and 4-bit storage; but if we do add a native 4-bit dtype tomorrow, we should be able to do that. I definitely sense disagreement on this, understandably, but I do feel, cringingly, that this is an OK compromise.

Either way, having the option of a Q/DQ representation sounds reasonable to me.
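
As a purely illustrative aside on the packing option discussed above (these helpers are my own sketch, not a PyTorch API), this is roughly what packing two int4 values per byte looks like, and why a backend with its own layout would first have to undo it:

```python
import torch

def pack_int4(x: torch.Tensor) -> torch.Tensor:
    """Pack an even number of int4 values (held in an int8 tensor, range [-8, 7]) two per byte."""
    assert x.dtype == torch.int8 and x.numel() % 2 == 0
    nibbles = (x.to(torch.int16) & 0xF).view(-1, 2)  # two's-complement nibbles
    return (nibbles[:, 0] | (nibbles[:, 1] << 4)).to(torch.uint8)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: recover the int4 values into an int8 container."""
    lo = (packed & 0xF).to(torch.int8)
    hi = ((packed >> 4) & 0xF).to(torch.int8)
    out = torch.stack([lo, hi], dim=1).reshape(-1)
    return torch.where(out >= 8, out - 16, out)  # sign-extend each nibble

w = torch.randint(-8, 8, (64,), dtype=torch.int8)  # int4 values in int8 containers: 64 bytes
packed = pack_int4(w)                              # packed storage: 32 bytes
assert torch.equal(unpack_int4(packed), w)
```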

@jerryzh168
Contributor Author

Thanks @paulinesho and @kimishpatel for the discussion. I actually raised similar points around packing when discussing this issue with @kimishpatel. I think for now it makes sense to keep the flag but turn on `fold_quantize=True` by default, since I didn't see much downside to keeping the flag; the only concern might be that the model produced by the flow may not directly work on backends that expect different flags.

I'll update my PR #118701 to do this.

@paulinesho

Thanks for the prompt decision! I appreciate the quick turnaround time.

> I think this is a question of the representation chosen here.

Agreed. I think the only strong requirement would be API and representation stability.

> Because PyTorch does not natively have a 4-bit dtype, 4-bit quantized models can't be represented using just DQ and 4-bit storage; but if we do add a native 4-bit dtype tomorrow, we should be able to do that.

I do agree that sub-byte storage would be ideal, but perhaps rather than expecting every framework to do that consistently, the second-best option for a backend would be a representation that's stable and generic enough that sharing a downstream pipeline could be possible.

facebook-github-bot pushed a commit to pytorch/executorch that referenced this pull request Feb 7, 2024
Summary:
Pull Request resolved: #1766

X-link: pytorch/pytorch#118701

This is a follow up to pytorch/pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Reviewed By: andrewor14, digantdesai

Differential Revision: D53247301

fbshipit-source-id: 5b2dbbc76487a8779f30c483b5ff4f054ba1ae8c
pytorchmergebot pushed a commit that referenced this pull request Feb 7, 2024
Summary:
This is a follow up to #118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Differential Revision: D53247301

BC Breaking Note:

The `fold_quantize` flag of `convert_pt2e` now defaults to True, so the quantize op on the weight is folded by default and users will see a model size reduction by default after pt2e quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)

non_folded_model = convert_pt2e(model)
```

2.3
```
folded_model = convert_pt2e(model)

non_folded_model = convert_pt2e(model, fold_quantize=False)
```

Pull Request resolved: #118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Feb 8, 2024
Summary: This is a follow up to pytorch/pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Differential Revision: D53550237
jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Feb 8, 2024
Summary:
X-link: pytorch/pytorch#119425


This is a follow up to pytorch/pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Reviewed By: digantdesai

Differential Revision: D53550237
jerryzh168 added a commit to jerryzh168/pytorch that referenced this pull request Feb 8, 2024
…2e` (pytorch#119425)

Summary:

X-link: pytorch/executorch#1882

This is a follow up to pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Test Plan: CI

Reviewed By: digantdesai

Differential Revision: D53550237
jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Feb 8, 2024
Summary:
X-link: pytorch/pytorch#119425


This is a follow up to pytorch/pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Reviewed By: digantdesai

Differential Revision: D53550237
jerryzh168 added a commit to jerryzh168/pytorch that referenced this pull request Feb 8, 2024
…2e` (pytorch#119425)

Summary:

X-link: pytorch/executorch#1882

This is a follow up to pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Test Plan: CI

Reviewed By: digantdesai

Differential Revision: D53550237
pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Summary:
Previously, by default we did not generate quantized weights; that is, we had fp32 weights, with
`fp32 weight -> q -> dq -> linear -> ...` in the quantized model.

After this PR, we'll produce a graph with int8 weights by default after `convert_pt2e`:
`int8 weight -> dq -> linear -> ...`

We'll remove the `fold_quantize` flag in the next PR.

Test Plan: CI

Differential Revision: D51730862

Pull Request resolved: #118605
Approved by: https://github.com/andrewor14
pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Summary:
This is a follow up to #118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Differential Revision: D53247301

BC Breaking Note:

The `fold_quantize` flag of `convert_pt2e` now defaults to True, so the quantize op on the weight is folded by default and users will see a model size reduction by default after pt2e quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)

non_folded_model = convert_pt2e(model)
```

2.3
```
folded_model = convert_pt2e(model)

non_folded_model = convert_pt2e(model, fold_quantize=False)
```

Pull Request resolved: #118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
facebook-github-bot pushed a commit to pytorch/executorch that referenced this pull request Feb 9, 2024
Summary:
X-link: pytorch/pytorch#119425

Pull Request resolved: #1882

This is a follow up to pytorch/pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Reviewed By: digantdesai

Differential Revision: D53550237

fbshipit-source-id: fad66e5d860113d4510c605d86b8045eef4b449c
pytorchmergebot pushed a commit that referenced this pull request Feb 9, 2024
…2e` (#119425)

Summary: This is a follow up to #118605 to set `fold_quantize` flag to True in `convert_pt2e`

Test Plan: CI

Differential Revision: D53550237

Pull Request resolved: #119425
Approved by: https://github.com/andrewor14
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
Summary:
This is a follow up to #118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Differential Revision: D53247301

BC Breaking Note:

The `fold_quantize` flag of `convert_pt2e` now defaults to True, so the quantize op on the weight is folded by default and users will see a model size reduction by default after pt2e quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)

non_folded_model = convert_pt2e(model)
```

2.3
```
folded_model = convert_pt2e(model)

non_folded_model = convert_pt2e(model, fold_quantize=False)
```

Pull Request resolved: #118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
…2e` (#119425)

Summary: This is a follow up to #118605 to set `fold_quantize` flag to True in `convert_pt2e`

Test Plan: CI

Differential Revision: D53550237

Pull Request resolved: #119425
Approved by: https://github.com/andrewor14