[torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass #16756
**Triton compile issue (resolved).** The code was failing with a Triton compilation error (weird).
The offending line:
Repro steps (even without this PR, no torch.compile, nothing):
**Triton memory issue (resolved).** Repro steps:
Works without attention fusion:
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @zou3519, could you also help review the torch.compile pass part? Thanks.
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
- cleanup backends to release llms
- increase gpu_model_utilization

Signed-off-by: Luka Govedič <lgovedic@redhat.com>
LGTM
That's awesome, thanks!
This PR implements the fusion of fp8 quantization onto attention, described in #16220. It performs this fusion using a new `AttnFusionPass`, which uses the pattern matcher and only performs the fusion if the backend supports it. The pass is currently off by default, pending more robust V1 support and performance measurement.

This PR also makes the following changes:
- `output_scale` added as a parameter to `unified_attention_with_output`. During the `torch.compile` fusion pass, we do not have access to the scale, just the graph node corresponding to it, so we cannot simply set the scale on the layer object.
- `fused_output_quant_supported` added on `AttentionImpl`. This method tells the fusion pass that it is safe to fuse the output quantization onto attention. It is opt-in, so fusion will only be performed if the backend impl supports it.
- Fusion support added to the `ROCmFlashAttentionImpl` attention backend. This is the motivating case for this pass, as AMD is adding support for fused attention quantization to the Triton kernel in #12591 ([Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel).
- `PostGradPassManager` now accepts the forward context as a parameter, so it can be passed to passes that need it (like `AttnFusionPass`); since #16155 ([Feature] support sequence parallelism using compilation pass) we pass `vllm_config` to passes anyway. We cannot currently pass the whole compilation config, as that would create a cycle when the manager adds itself to `CompilationConfig.inductor_config`.
- Improved `NoOpEliminationPass`: we now also replace a chain of reshapes with the last one, i.e. `t.view(*args1).view(*args2).view(*args3)` -> `t.view(*args3)`. This is needed for correct pattern matching, and it is always good to simplify the graph either way.
- `lazy_format_graph_pass` inside `VllmInductorPass` makes sure the graph gets printed when debugging with `depyf`.
- Added tests for `AttnFusionPass`, using `LLM` instances instead of silly models to avoid redoing metadata setup in test code.

While this PR only adds fusion support on ROCm, it makes it easy to add support for other backends once their attention kernels support fused quantization of the output. This includes V1, although we will either need to use full cudagraphs or address the piecewise problem described in #16220. Additionally, support for other quantization schemes can be added with minor additions to the pass (matching the appropriate quant ops and saving quant metadata into the attention layer).
This PR depends on #12591, #15734, #16431 and #17139. All of them have been merged to main.
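As a side note, the reshape-chain simplification added to `NoOpEliminationPass` can be sketched in isolation. The toy `torch.fx` pass below (assumed names, not the vLLM implementation) rebases each `view` onto the original tensor, leaving intermediate views dead so `t.view(a).view(b).view(c)` collapses to `t.view(c)`.

```python
import torch
import torch.fx as fx

class ChainedViews(torch.nn.Module):
    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return t.view(4, 4).view(2, 8).view(16)

def collapse_view_chains(gm: fx.GraphModule) -> fx.GraphModule:
    for node in list(gm.graph.nodes):
        if node.op == "call_method" and node.target == "view":
            # walk back through any chain of views to the original tensor
            src = node.args[0]
            while (isinstance(src, fx.Node) and src.op == "call_method"
                   and src.target == "view"):
                src = src.args[0]
            node.args = (src,) + node.args[1:]
    gm.graph.eliminate_dead_code()  # inner views now have no users
    gm.recompile()
    return gm

gm = collapse_view_chains(fx.symbolic_trace(ChainedViews()))
views = [n for n in gm.graph.nodes
         if n.op == "call_method" and n.target == "view"]
# only the final .view(16) should survive
```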
Perf numbers below: an improvement on decode (ITL) and a reduction on prefill (TTFT). Not going to invest more here, as this uses a deprecated V0 prefill Triton kernel; will revisit performance with support for other backends.
Baseline:
`VLLM_USE_V1=0 vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV -O '{"pass_config":{"enable_attn_fusion": false}}'`
With attention fusion:
`VLLM_USE_V1=0 vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV -O '{"pass_config":{"enable_attn_fusion": true}}'`