Conversation

jerryzh168
Contributor

Summary:
Previously, by default we did not generate quantized weights; that is, we had fp32 weights, with
`fp32 weight -> q -> dq -> linear -> ...` in the quantized model.

After this PR, we'll produce a graph with int8 weights by default after `convert_pt2e`:
`int8 weight -> dq -> linear -> ...`

We'll remove the `fold_quantize` flag in the next PR.

Test Plan: CI

Differential Revision: D51730862
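
For context, a minimal sketch of the pt2e quantization flow this description refers to (the toy module, example inputs, and the XNNPACK quantizer choice are illustrative assumptions, and the capture API shown is the one used in this release timeframe):

```python
import torch
from torch._export import capture_pre_autograd_graph
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 8)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(1, 16),)
m = capture_pre_autograd_graph(M().eval(), example_inputs)

quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m = prepare_pt2e(m, quantizer)
m(*example_inputs)  # calibration

# With this PR the converted graph stores an int8 weight constant feeding a
# dequantize op (`int8 weight -> dq -> linear`) instead of keeping the fp32
# weight with a quantize/dequantize pair (`fp32 weight -> q -> dq -> linear`).
m = convert_pt2e(m)
m.print_readable()  # the weight get_attr now holds int8 values
```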


pytorch-bot bot commented Jan 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118605

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4489b36 with merge base 2eefbc0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D51730862

@pytorch-bot pytorch-bot bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label on Jan 30, 2024
@pytorchmergebot
Collaborator

Merge failed

Reason: Comment with id 1917893715 not found

Details for Dev Infra team: raised by workflow job

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Jan 31, 2024
Summary:
This is a follow up to pytorch/pytorch#118605 to remove `fold_quantize` flag from
`convert_pt2e`

Differential Revision: D53247301
jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Jan 31, 2024
Summary:
X-link: pytorch/pytorch#118701

This is a follow up to pytorch/pytorch#118605 to remove `fold_quantize` flag from
`convert_pt2e`

Differential Revision: D53247301
@sdasgup3

sdasgup3 commented Feb 1, 2024

@jerryzh168 I am wondering about the possibility of keeping the flag. This would allow the backends/pattern matchers to consume an unaltered reference quantized model (with q/dq for both weights and activations) and fold the quantized weights on an as-needed basis.

Moreover, if you have a strong use case for "always" folding the constant, then the flag can very well default to true.

jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Feb 1, 2024
Summary:

X-link: pytorch/pytorch#118701

This is a follow up to pytorch/pytorch#118605 to remove `fold_quantize` flag from
`convert_pt2e`

Reviewed By: andrewor14

Differential Revision: D53247301
pytorch-bot bot pushed a commit that referenced this pull request Feb 1, 2024
Summary:
X-link: pytorch/executorch#1766


This is a follow up to #118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Reviewed By: andrewor14

Differential Revision: D53247301
@jerryzh168
Contributor Author

jerryzh168 commented Feb 2, 2024

> @jerryzh168 I am wondering about the possibility of keeping the flag. This would allow the backends/pattern matchers to consume an unaltered reference quantized model (with q/dq for both weights and activations) and fold the quantized weights on an as-needed basis.
>
> Moreover, if you have a strong use case for "always" folding the constant, then the flag can very well default to true.

The use cases we are seeing mostly benefit from this folding, so we are turning it on by default. The pattern should only concern the dequantize op for the weight, I think. If you have a backend that doesn't want to fold the weight, it shouldn't be too hard to revert this change as well? Please let me know if there are any problems with reverting the folding in your code.

@kimishpatel
Contributor

> @jerryzh168 I am wondering about the possibility of keeping the flag. This would allow the backends/pattern matchers to consume an unaltered reference quantized model (with q/dq for both weights and activations) and fold the quantized weights on an as-needed basis.
> Moreover, if you have a strong use case for "always" folding the constant, then the flag can very well default to true.
>
> The use cases we are seeing mostly benefit from this folding, so we are turning it on by default. The pattern should only concern the dequantize op for the weight, I think. If you have a backend that doesn't want to fold the weight, it shouldn't be too hard to revert this change as well? Please let me know if there are any problems with reverting the folding in your code.

@sdasgup3 I would also like to understand the use case you may have. One of the major reasons to turn on quantize folding by default is to be able to produce a quantized model that actually has quantized weights and is thus smaller in size. I am curious whether this does not work in your backends because you have front-ends that produce a q-dq representation?

@qihqi
Contributor

qihqi commented Feb 5, 2024

Hi @jerryzh168, I understand the desire to turn on `fold_quantize` by default, but this PR does more than that: it not only turns on that behavior by default but also removes the possibility of turning it off.

Folding in constants is not reversible after the fact.

> it shouldn't be too hard to revert this change as well? Please let me know if there are any problems with reverting the folding in your code.

Could you also clarify what you mean by "reverting in your code"? We depend on the upstream (official) PyTorch release and do not maintain a separate copy of the PyTorch code.

To clarify the ask:
We are OK if the default behavior changes; we just need the choice of using the old behavior.

@jerryzh168
Contributor Author

@qihqi Sorry, by "reverting the folding" I mean finding the dq op for the weight and adding back the quantize op for the weight:

```python
for n in model.graph.nodes:
    if n.target == torch.ops.quantized_decomposed.dequantize_per_tensor.default:
        arg = n.args[0]
        if arg.op == "get_attr":
            # change this arg to a corresponding quantize op
            ...
```

Is this possible to do?
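
For completeness, a hypothetical sketch of that reverting pass (my own helper, not an existing API; it assumes per-tensor quantized weights folded into top-level attributes and reconstructs an fp32 weight lossily from the int8 values):

```python
import torch

def unfold_weight_quantize(model: torch.fx.GraphModule) -> torch.fx.GraphModule:
    dq = torch.ops.quantized_decomposed.dequantize_per_tensor.default
    q = torch.ops.quantized_decomposed.quantize_per_tensor.default
    for n in model.graph.nodes:
        if n.target == dq and isinstance(n.args[0], torch.fx.Node) and n.args[0].op == "get_attr":
            weight_node = n.args[0]
            scale, zero_point, qmin, qmax, dtype = n.args[1:6]
            int8_w = getattr(model, weight_node.target)  # assumes a top-level attribute name
            # Recover an fp32 weight (lossy: the original fp32 values are gone after folding)
            # and store it back under the same attribute name.
            setattr(model, weight_node.target, (int8_w.to(torch.float32) - zero_point) * scale)
            # Re-insert the quantize op so the graph reads `fp32 weight -> q -> dq` again.
            with model.graph.inserting_before(n):
                q_node = model.graph.call_function(
                    q, (weight_node, scale, zero_point, qmin, qmax, dtype)
                )
            n.replace_input_with(weight_node, q_node)
    model.graph.lint()
    model.recompile()
    return model
```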

@paulinesho

> @qihqi Sorry, by "reverting the folding" I mean finding the dq op for the weight and adding back the quantize op for the weight:
>
> ```python
> for n in model.graph.nodes:
>     if n.target == torch.ops.quantized_decomposed.dequantize_per_tensor.default:
>         arg = n.args[0]
>         if arg.op == "get_attr":
>             # change this arg to a corresponding quantize op
>             ...
> ```
>
> Is this possible to do?

Possibly; it needs to be tested across the board. But even if it works well, I have a few concerns with this approach, for the following reasons:

1. I believe there's value in a true reference quantized representation (Q/DQ format) as a stable standard with minimal breaking changes, but this new default sneaks in a framework-enforced optimization (actual compression) rather than deferring it to the backend.
2. From a user API perspective, this introduces the need to postprocess a PT2E-converted graph outside of the API, as the snippet above shows. I believe there currently isn't a convenient hook (though one could be added later).
3. Undoing an optimization downstream is not as clean a design as not applying the optimization in the first place.
4. Can we support sub-byte quantization cleanly this way? We could do some kind of packing in the framework, but if a backend wants to pack in a different manner, it would have to unpack the framework packing first. We could continue to store the values in 8-bit containers, but that seems too specific a design choice IMO.

I support having an option for a true reference representation (fully floating point + Q/DQs) that gives the backend the flexibility to transform the model as needed, though that option doesn't need to be the default.

@kimishpatel
Contributor

Let me respond:

> I believe there's value in a true reference quantized representation (Q/DQ format) as a stable standard with minimal breaking changes, but this new default sneaks in a framework-enforced optimization (actual compression) rather than deferring it to the backend.
>
> From a user API perspective, this introduces the need to postprocess a PT2E-converted graph outside of the API, as the snippet above shows. I believe there currently isn't a convenient hook (though one could be added later).

I think this is a question of the representation chosen here. I am not 100% sold on whether one is better than the other (although another reason you mentioned below is interesting and I will respond to it). The primary motivation for us was to have the actual quantized model serialized with quantized dtypes supported by the framework, in this case 8-bit dtypes.

Now, I do agree with you that we should at least provide an option to have the Q/DQ representation rather than force this on the downstream stack, and to that end we should expose the option.

> Undoing an optimization downstream is not as clean a design as not applying the optimization in the first place.

Agree.

> Can we support sub-byte quantization cleanly this way? We could do some kind of packing in the framework, but if a backend wants to pack in a different manner, it would have to unpack the framework packing first. We could continue to store the values in 8-bit containers, but that seems too specific a design choice IMO.

I think this is the most interesting question. My opinion is that here we try to decouple the serialized representation from the underlying tensor dtypes supported by the framework. If, say, int4 were a natively supported dtype, then we could have a 4-bit quantized model with DQ on the weights and thus choose the DQ-only representation. At the moment, however, an int4 dtype does not exist, so we have to either 1) use int8 and waste space or 2) use packing. I am, somewhat reluctantly, favoring 1 instead of having to introduce pack/unpack into the graph representation. At this point you might ask: then why not just have the Q/DQ representation? My stance is that the DQ-based IR is the choice we made. Because PyTorch does not natively have a 4-bit dtype, 4-bit quantized models can't be represented using just DQ and 4-bit storage; but if we do add a native 4-bit dtype tomorrow, we should be able to do that. I definitely sense disagreement on this, understandably, but I do feel, cringingly, that this is an OK compromise.

Either way, having the option of a Q/DQ representation sounds reasonable to me.
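
As a purely illustrative aside on the packing option discussed above (these helpers are my own sketch, not a PyTorch API), this is roughly what packing two int4 values per byte looks like, and why a backend with its own layout would first have to undo it:

```python
import torch

def pack_int4(x: torch.Tensor) -> torch.Tensor:
    """Pack an even number of int4 values (held in an int8 tensor, range [-8, 7]) two per byte."""
    assert x.dtype == torch.int8 and x.numel() % 2 == 0
    nibbles = (x.to(torch.int16) & 0xF).view(-1, 2)  # two's-complement nibbles
    return (nibbles[:, 0] | (nibbles[:, 1] << 4)).to(torch.uint8)

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Inverse of pack_int4: recover the int4 values into an int8 container."""
    lo = (packed & 0xF).to(torch.int8)
    hi = ((packed >> 4) & 0xF).to(torch.int8)
    out = torch.stack([lo, hi], dim=1).reshape(-1)
    return torch.where(out >= 8, out - 16, out)  # sign-extend each nibble

w = torch.randint(-8, 8, (64,), dtype=torch.int8)  # int4 values in int8 containers: 64 bytes
packed = pack_int4(w)                              # packed storage: 32 bytes
assert torch.equal(unpack_int4(packed), w)
```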

@jerryzh168
Contributor Author

Thanks @paulinesho and @kimishpatel for the discussion. I actually raised similar points around packing when discussing this issue with @kimishpatel. I think for now it makes sense to keep the flag but turn on `fold_quantize=True` by default, since I didn't see much downside to keeping the flag; the only concern might be that the model produced by the flow may not directly work on backends that expect different flags.

I'll update my PR #118701 to do this.

@paulinesho

Thanks for the prompt decision! I appreciate the quick turnaround time.

> I think this is a question of the representation chosen here.

Agreed. I think the only strong requirement would be API and representation stability.

> Because PyTorch does not natively have a 4-bit dtype, 4-bit quantized models can't be represented using just DQ and 4-bit storage; but if we do add a native 4-bit dtype tomorrow, we should be able to do that.

I do agree that sub-byte storage would be ideal, but perhaps rather than expecting every framework to do that consistently, the second-best option for a backend would be a representation that's stable and generic enough that sharing a downstream pipeline could be possible.

facebook-github-bot pushed a commit to pytorch/executorch that referenced this pull request Feb 7, 2024
Summary:
Pull Request resolved: #1766

X-link: pytorch/pytorch#118701

This is a follow up to pytorch/pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Reviewed By: andrewor14, digantdesai

Differential Revision: D53247301

fbshipit-source-id: 5b2dbbc76487a8779f30c483b5ff4f054ba1ae8c
pytorchmergebot pushed a commit that referenced this pull request Feb 7, 2024
Summary:
This is a follow up to #118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Differential Revision: D53247301

BC Breaking Note:

The `fold_quantize` flag of `convert_pt2e` now defaults to True, so the quantize op on the weight is folded by default and users will see a model size reduction by default after pt2e quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)

non_folded_model = convert_pt2e(model)
```

2.3
```
folded_model = convert_pt2e(model)

non_folded_model = convert_pt2e(model, fold_quantize=False)
```

Pull Request resolved: #118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Feb 8, 2024
Summary: This is a follow up to pytorch/pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Differential Revision: D53550237
jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Feb 8, 2024
Summary:
X-link: pytorch/pytorch#119425


This is a follow up to pytorch/pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Reviewed By: digantdesai

Differential Revision: D53550237
jerryzh168 added a commit to jerryzh168/pytorch that referenced this pull request Feb 8, 2024
…2e` (pytorch#119425)

Summary:

X-link: pytorch/executorch#1882

This is a follow up to pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Test Plan: CI

Reviewed By: digantdesai

Differential Revision: D53550237
jerryzh168 added a commit to jerryzh168/executorch-1 that referenced this pull request Feb 8, 2024
Summary:
X-link: pytorch/pytorch#119425


This is a follow up to pytorch/pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Reviewed By: digantdesai

Differential Revision: D53550237
jerryzh168 added a commit to jerryzh168/pytorch that referenced this pull request Feb 8, 2024
…2e` (pytorch#119425)

Summary:

X-link: pytorch/executorch#1882

This is a follow up to pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Test Plan: CI

Reviewed By: digantdesai

Differential Revision: D53550237
pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Summary:
Previously, by default we did not generate quantized weights; that is, we had fp32 weights, with
`fp32 weight -> q -> dq -> linear -> ...` in the quantized model.

After this PR, we'll produce a graph with int8 weights by default after `convert_pt2e`:
`int8 weight -> dq -> linear -> ...`

We'll remove the `fold_quantize` flag in the next PR.

Test Plan: CI

Differential Revision: D51730862

Pull Request resolved: #118605
Approved by: https://github.com/andrewor14
pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Summary:
This is a follow up to #118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Differential Revision: D53247301

BC Breaking Note:

The `fold_quantize` flag of `convert_pt2e` now defaults to True, so the quantize op on the weight is folded by default and users will see a model size reduction by default after pt2e quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)

non_folded_model = convert_pt2e(model)
```

2.3
```
folded_model = convert_pt2e(model)

non_folded_model = convert_pt2e(model, fold_quantize=False)
```

Pull Request resolved: #118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
facebook-github-bot pushed a commit to pytorch/executorch that referenced this pull request Feb 9, 2024
Summary:
X-link: pytorch/pytorch#119425

Pull Request resolved: #1882

This is a follow up to pytorch/pytorch#118605 to set `fold_quantize` flag to True in `convert_pt2e`

Reviewed By: digantdesai

Differential Revision: D53550237

fbshipit-source-id: fad66e5d860113d4510c605d86b8045eef4b449c
pytorchmergebot pushed a commit that referenced this pull request Feb 9, 2024
…2e` (#119425)

Summary: This is a follow up to #118605 to set `fold_quantize` flag to True in `convert_pt2e`

Test Plan: CI

Differential Revision: D53550237

Pull Request resolved: #119425
Approved by: https://github.com/andrewor14
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
Summary:
This is a follow up to #118605 to remove `fold_quantize` flag from
`convert_pt2e`

Test Plan: CI

Differential Revision: D53247301

BC Breaking Note:

The `fold_quantize` flag of `convert_pt2e` now defaults to True, so the quantize op on the weight is folded by default and users will see a model size reduction by default after pt2e quantization.
2.2
```
folded_model = convert_pt2e(model, fold_quantize=True)

non_folded_model = convert_pt2e(model)
```

2.3
```
folded_model = convert_pt2e(model)

non_folded_model = convert_pt2e(model, fold_quantize=False)
```

Pull Request resolved: #118701
Approved by: https://github.com/andrewor14, https://github.com/leslie-fang-intel
clee2000 pushed a commit that referenced this pull request Feb 14, 2024
…2e` (#119425)

Summary: This is a follow up to #118605 to set `fold_quantize` flag to True in `convert_pt2e`

Test Plan: CI

Differential Revision: D53550237

Pull Request resolved: #119425
Approved by: https://github.com/andrewor14