Improper use of quantization API for MHA should fail fast #58969

@bhosmer

Description

Currently, direct quantization of a torch.nn.MultiheadAttention module is not supported by torch.quantization.quantize_dynamic(); a conversion must be performed first (reference). However, an erroneous attempt to quantize directly does not produce an error. Instead, certain layers silently fail to be quantized, yet the returned model is still callable, with no overt sign that anything has gone wrong. See #58727 for full details.
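
A minimal sketch of the misuse (module names and shapes here are illustrative, and behavior is as observed around the time of this issue):

```python
import torch
import torch.nn as nn

# Minimal model wrapping a MultiheadAttention submodule
# (dimensions are illustrative).
class AttnModel(nn.Module):
    def __init__(self, embed_dim=16, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out

model = AttnModel()

# Improper use: dynamically quantizing the model directly, without the
# required conversion step. The call succeeds and returns a callable
# model, but the MHA's internal projection layers are silently left
# unquantized; there is no error and no warning.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear, nn.MultiheadAttention}, dtype=torch.qint8
)

x = torch.randn(5, 2, 16)  # (seq_len, batch, embed_dim)
print(quantized(x).shape)  # runs fine in float; nothing signals the failure
```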

A recent cleanup of torch.nn rationalized module names in a way that allowed the problematic quantization to go forward, introducing a downstream error when scripting such an improperly quantized module. The error is a vanilla TorchScript type error caused by a naming collision, and it carries no clues that would lead one back to the original problem without nontrivial sleuthing. Notably, the error was reported by users, which means the quantization API is in fact being inadvertently misused in this way. The repro used to track down the issue is here, derived from a bug report.
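
For context, the downstream symptom looked roughly like this (a sketch; it depends on the temporary state of torch.nn described above, and the exact error text varies by version):

```python
# Continuing the sketch above: during the window where the MHA output
# projection was a plain nn.Linear, quantize_dynamic swapped it for a
# dynamically quantized Linear. Scripting the result then failed with
# a generic TorchScript type error about conflicting module types,
# with nothing pointing back to the unsupported quantization call.
scripted = torch.jit.script(quantized)
```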

In the short term we've reinstated the name difference that defeats quantization when the API is used improperly, restoring the original behavior. A proper fix, though, would be for quantize_dynamic (or something else in the quantization API) to give the caller a clear signal that an unsupported model has been specified.
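
One possible shape for that signal, as a hedged sketch: a validation pass over the model before conversion. The helper name and the unsupported-type list below are hypothetical, not part of the quantization API.

```python
import torch.nn as nn

# Hypothetical guard: module types known to require conversion before
# dynamic quantization (list is illustrative, not exhaustive).
_UNSUPPORTED_DIRECT_DYNAMIC = (nn.MultiheadAttention,)

def assert_dynamically_quantizable(model: nn.Module) -> None:
    """Fail fast if the model contains submodules that quantize_dynamic
    would silently skip (hypothetical helper, not in torch)."""
    for name, mod in model.named_modules():
        if isinstance(mod, _UNSUPPORTED_DIRECT_DYNAMIC):
            raise TypeError(
                f"Submodule '{name}' ({type(mod).__name__}) cannot be "
                "quantized directly by quantize_dynamic; convert it first."
            )
```

Called at the top of quantize_dynamic (or by the user before quantizing), a check like this would turn the silent failure into an immediate, actionable error.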

cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo
