[quant][bc-breaking] Turn on fold_quantize by default #118605
Conversation
Summary: Previously, we did not generate a quantized weight by default; the quantized model kept an fp32 weight, giving `fp32 weight -> q -> dq -> linear -> ...`. After this PR, `convert_pt2e` produces a graph with an int8 weight by default: `int8 weight -> dq -> linear -> ...`. We'll remove the `fold_quantize` flag in the next PR. Test Plan: CI Differential Revision: D51730862
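For context, here is a minimal sketch of the pt2e quantization flow this change affects. The toy model, the XNNPACKQuantizer choice, and the capture API are illustrative only, and the exact module paths have moved between PyTorch releases:

```python
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 8)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(2, 16),)
m = capture_pre_autograd_graph(M().eval(), example_inputs)

quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)
m(*example_inputs)   # calibration
m = convert_pt2e(m)  # with fold_quantize on by default, the weight is stored as
                     # int8, i.e. `int8 weight -> dq -> linear -> ...`
m.print_readable()   # inspect the resulting graph
```

Before this change, the same flow left the weight in fp32 with an explicit q/dq pair in front of the linear; with folding on by default, the serialized weight is int8.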
Summary: This is a follow-up to pytorch/pytorch#118605 to remove the `fold_quantize` flag from `convert_pt2e`. Differential Revision: D53247301
@jerryzh168 I am wondering about the possibility of keeping the flag. This would allow backends/pattern matchers to consume an unaltered reference quantized model (with q/dq for both weights and activations) and fold the quantized weights on an as-needed basis. Moreover, if you have a strong use case for "always" folding the constant, the flag can very well default to `True`.
The use cases we are seeing mostly benefit from this folding, so we are turning it on by default; the pattern should only concern
@sdasgup3 I would also like to understand the use case you may have. One of the major reasons to turn on fold_quantize by default is to be able to produce a quantized model that actually has quantized weights and is thus smaller in size. I am curious whether this does not work for your backends because you have front-ends that produce a q-dq representation?
Hi @jerryzh168, I understand the desire to turn on fold_quantize by default, but this PR does more than that: it not only turns that behavior on by default, it also removes the possibility of turning it off. Folding in constants is non-reversible after the fact.
Would you also clarify what you mean by "reverting in your code"? We depend on the upstream (official) PyTorch release and do not maintain a separate copy of the PyTorch code. To clarify the ask:
@qihqi Sorry, by reverting the folding I mean finding the dq op for the weight and adding back the quantize op for the weight:
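This is not the snippet originally posted here, but a rough sketch of what such a reversal could look like on the converted GraphModule, assuming per-tensor weight quantization and a folded int8 constant reachable as a flat `get_attr`; per-channel weights would need analogous handling:

```python
import torch

def unfold_weight_quantize(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    """Reverse fold_quantize for per-tensor weights:
    `int8 weight -> dq -> ...` becomes `fp32 weight -> q -> dq -> ...`."""
    dq = torch.ops.quantized_decomposed.dequantize_per_tensor.default
    q = torch.ops.quantized_decomposed.quantize_per_tensor.default
    for node in gm.graph.nodes:
        if node.op != "call_function" or node.target != dq:
            continue
        weight_node = node.args[0]
        if not isinstance(weight_node, torch.fx.Node) or weight_node.op != "get_attr":
            continue  # only handle folded constant weights
        int8_weight = getattr(gm, weight_node.target)  # assumes a flat attribute name
        if int8_weight.dtype != torch.int8:
            continue
        scale, zero_point, qmin, qmax, dtype = node.args[1:6]
        # Recover an fp32 weight by dequantizing the folded int8 constant ...
        fp32_weight = (int8_weight.to(torch.float32) - zero_point) * scale
        setattr(gm, weight_node.target, torch.nn.Parameter(fp32_weight, requires_grad=False))
        # ... and re-insert the quantize op between the fp32 weight and the existing dq.
        with gm.graph.inserting_before(node):
            q_node = gm.graph.call_function(
                q, (weight_node, scale, zero_point, qmin, qmax, dtype)
            )
        node.replace_input_with(weight_node, q_node)
    gm.graph.lint()
    gm.recompile()
    return gm
```

Modulo floating-point rounding, the re-inserted quantize followed by the existing dequantize reproduces the folded int8 values, so a backend that wants the full reference `fp32 weight -> q -> dq` pattern could in principle recover it after `convert_pt2e`.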
Is this possible to do?
Possibly; it needs to be tested across the board. But I have a few concerns with this approach, even if it works, for the following reasons:
I support an option for a true reference representation (fully floating point + Q/DQs) that gives the backend the flexibility to transform the model as needed, though that option doesn't need to be the default.
Let me respond:
I think this is a question of the representation chosen here. I am not 100% sold on whether one is better than the other (although another reason you mention below is interesting, and I will respond to it). The primary motivation for us was to have the quantized model actually serialized with quantized datatypes supported by the framework, in this case 8-bit dtypes. Now, I do agree with you that we should at least provide an option for the Q/DQ representation rather than force this on the downstream stack, so we should expose the option.
Agree.
I think this is the most interesting question. My opinion is that here we try to decouple the serialized representation from the underlying tensor dtypes supported by the framework. If, say, int4 were a natively supported dtype, then we could have a 4-bit quantized model with DQ on the weights, and thus choose a DQ-only representation. At the moment, however, the int4 dtype does not exist, so we have to either 1) use int8 and waste space, or 2) use packing. I am, somewhat reluctantly, favoring 1, instead of having to introduce pack/unpack into the graph representation. At this point you could ask why not just have a Q/DQ representation. My stance is that the DQ-based IR is the choice we made: because PyTorch natively does not have a 4-bit dtype, 4-bit quantized models can't be represented using just DQ and 4-bit storage, but if we do add a native 4-bit dtype tomorrow, we should be able to do that. I definitely sense disagreement on this, understandably, but I do feel, cringingly, that this is an OK compromise. Either way, having the option of a Q/DQ representation sounds reasonable to me.
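To make the storage trade-off concrete, here is a small illustrative sketch (the tensor shape and values are made up) comparing option 1 (one 4-bit value per int8 element) and option 2 (two 4-bit values packed per byte):

```python
import torch

# A hypothetical 4-bit quantized weight, with each value in [-8, 7].
w_int4 = torch.randint(-8, 8, (1024, 1024), dtype=torch.int8)

# Option 1: store each 4-bit value in an int8 element -- simple, no pack/unpack
# ops in the graph, but twice the bytes actually needed.
option1_bytes = w_int4.numel() * w_int4.element_size()  # 1 MiB

# Option 2: pack two 4-bit values per uint8 byte -- half the storage, but the
# graph would need explicit pack/unpack ops around the dequantize.
nibbles = (w_int4 & 0xF).to(torch.uint8)               # two's-complement low nibbles
w_packed = nibbles[:, 0::2] | (nibbles[:, 1::2] << 4)  # two nibbles per byte
option2_bytes = w_packed.numel() * w_packed.element_size()  # 0.5 MiB

print(option1_bytes, option2_bytes)
```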
Thanks @paulinesho and @kimishpatel for the discussion. I actually raised similar points about packing before when discussing this issue with @kimishpatel. I think for now it makes sense to keep the flag but turn on fold_quantize=True by default, since I don't see much downside to keeping the flag; the only concern might be that a model produced by the flow may not directly work on different backends that expect different flags. I'll update my PR #118701 to do this.
Thanks for the prompt decision! I appreciate the quick turnaround time.
Agreed. I think the only strong requirement would be API and representation stability.
I do agree that sub-byte storage would be ideal, but rather than expecting every framework to do that consistently, perhaps the second-best option for a backend would be a representation that's stable and generic enough that sharing a downstream pipeline would be possible.
Summary: This is a follow-up to #118605 to remove the `fold_quantize` flag from `convert_pt2e`.

Test Plan: CI

Differential Revision: D53247301

BC Breaking Note: the `fold_quantize` flag of `convert_pt2e` is now set to True by default, so the quantize op on the weight is folded by default and users will see a model size reduction by default after pt2e quantization.

2.2:
```
folded_model = convert_pt2e(model, fold_quantize=True)
non_folded_model = convert_pt2e(model)
```

2.3:
```
folded_model = convert_pt2e(model)
non_folded_model = convert_pt2e(model, fold_quantize=False)
```

Pull Request resolved: #118701 Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
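As a quick, informal way to see the size reduction described in the note above, one could compare the serialized state dicts of the two converted models; `folded_model` and `non_folded_model` below refer to the results of the snippet in the note:

```python
import io
import torch

def serialized_size_bytes(m: torch.nn.Module) -> int:
    # Serialize just the parameters/buffers to measure weight storage.
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

# folded_model stores int8 weights, non_folded_model keeps fp32 weights plus
# quantize ops in the graph, so the folded one should be roughly 4x smaller.
print(serialized_size_bytes(folded_model), serialized_size_bytes(non_folded_model))
```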
Summary: Previously, we did not generate a quantized weight by default; the quantized model kept an fp32 weight, giving `fp32 weight -> q -> dq -> linear -> ...`. After this PR, `convert_pt2e` produces a graph with an int8 weight by default: `int8 weight -> dq -> linear -> ...`. We'll remove the `fold_quantize` flag in the next PR. Test Plan: CI Differential Revision: D51730862 Pull Request resolved: #118605 Approved by: https://github.com/andrewor14
Summary: This is a follow-up to #118605 to set the `fold_quantize` flag to True in `convert_pt2e`. Test Plan: CI Differential Revision: D53550237 Pull Request resolved: #119425 Approved by: https://github.com/andrewor14