@ProExpertProg ProExpertProg commented Apr 17, 2025

This PR implements the fusion of fp8 quantization onto attention, described in #16220. It performs this fusion using a new AttnFusionPass, which uses the pattern matcher and only performs the fusion if the backend supports it. It is currently off by default, pending more robust V1 support and performance measurement.
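As a toy illustration of the rewrite the pass performs (this is not vLLM's actual implementation; the helper names and placeholder math below are invented), an attention call followed by a static fp8 quantization is replaced by a single attention call that receives the scale via the new output_scale parameter:

```python
# Toy sketch of the rewrite performed by the fusion pass; the math and
# helper names here are invented placeholders, not vLLM's real kernels.

FP8_MAX = 448.0  # e4m3 finite max

def quant_fp8(x, scale):
    """Toy per-tensor static quantization: divide by scale, clamp to fp8 range."""
    return [max(-FP8_MAX, min(FP8_MAX, v / scale)) for v in x]

def attention(q, k, v, output_scale=None):
    """Stand-in attention kernel; a real backend fuses quant into its epilogue."""
    out = [qi * ki * vi for qi, ki, vi in zip(q, k, v)]  # placeholder math
    if output_scale is not None:
        out = quant_fp8(out, output_scale)  # fused output quantization
    return out

q, k, v, scale = [1.0, 2.0], [0.5, 0.25], [2.0, 4.0], 0.01
unfused = quant_fp8(attention(q, k, v), scale)  # attn -> separate quant op
fused = attention(q, k, v, output_scale=scale)  # single fused call
assert fused == unfused
```

The fused form removes one elementwise kernel launch and one round-trip of the unquantized output through memory, which is where the decode-time win comes from.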

This PR also makes the following changes:

  • output_scale added as a parameter to unified_attention_with_output. During the torch.compile fusion pass, we do not have access to the scale, just the graph node corresponding to it. Hence, we cannot just set the scale on the layer object.
  • A new method fused_output_quant_supported on the AttentionImpl. This method tells the fusion pass that it is safe to fuse the output quantization onto attention. It is opt-in, so fusion will only be performed if the backend impl supports it.
  • Support for attention + quant fusion to the ROCmFlashAttentionImpl attention backend. This is the motivating case for this pass, as AMD is adding support for fused attention quantization to the Triton kernel in [Kernel][Triton][FP8] Adding fp8 and variable length sequence support to Triton FAv2 kernel #12591.
  • PostGradPassManager now accepts the forward context as a parameter, so it can be passed to passes that need it (like AttnFusionPass). We cannot currently pass the whole compilation config, as that would create a cycle when the manager adds itself to CompilationConfig.inductor_config. (Since #16155, which added sequence parallelism support via a compilation pass, we pass vllm_config to passes anyway.)
  • Additional case to NoOpEliminationPass. We now also replace a chain of reshapes with the last one: t.view(*args1).view(*args2).view(*args3) -> t.view(*args3). This is needed for correct pattern matching. Either way it's always good to simplify the graph.
  • Calling lazy_format_graph_pass inside VllmInductorPass makes sure the graph gets printed when debugging with depyf.
  • Test for AttnFusionPass. Using LLM instances instead of silly models to avoid redoing metadata setup in test code.
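
The reshape-chain simplification in NoOpEliminationPass can be sketched as follows, using an invented list-of-nodes representation for illustration (the real pass walks a torch.fx graph): consecutive view nodes collapse to the last one, since only the final shape matters.

```python
# Hedged sketch of the NoOpEliminationPass addition described above:
# t.view(*args1).view(*args2).view(*args3) -> t.view(*args3).
# The (op, args) node representation is invented for illustration.

def collapse_view_chains(nodes):
    out = []
    for op, args in nodes:
        if op == "view" and out and out[-1][0] == "view":
            out[-1] = ("view", args)  # later view supersedes the earlier one
        else:
            out.append((op, args))
    return out

graph = [("input", None),
         ("view", (2, 8)), ("view", (4, 4)), ("view", (16,)),
         ("matmul", "w")]
print(collapse_view_chains(graph))
# [('input', None), ('view', (16,)), ('matmul', 'w')]
```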

While this PR only adds fusion support on ROCm, it makes it easy to add support for other backends once their attention kernels add support for fused quantization of output. This includes V1, although we'll either need to use full cudagraphs or address the piecewise problem as described in #16220. Additionally, support for other quantization schemes can be added as well with minor additions to the pass (matching appropriate quant ops and saving quant metadata into the attention layer).
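To make the opt-in contract concrete, here is an illustrative sketch: the base impl declines fusion by default, and a backend overrides the check when its kernel supports fused output quant. The signature and string arguments are assumptions for illustration; only the method name fused_output_quant_supported comes from this PR.

```python
# Hypothetical sketch of the opt-in mechanism; signatures are invented.

class AttentionImpl:
    def fused_output_quant_supported(self, dtype, static, group_shape):
        return False  # safe default: never fuse unless the backend opts in

class ROCmFlashAttentionImpl(AttentionImpl):
    def fused_output_quant_supported(self, dtype, static, group_shape):
        # hypothetical capability check: static per-tensor fp8 only
        return dtype == "fp8" and static and group_shape == "per_tensor"

assert not AttentionImpl().fused_output_quant_supported("fp8", True, "per_tensor")
assert ROCmFlashAttentionImpl().fused_output_quant_supported("fp8", True, "per_tensor")
```

Because the default is False, adding the pass cannot silently miscompile a backend that has no fused kernel; each backend must explicitly declare support.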

This PR depends on #12591, #15734, #16431 and #17139. All of them have been merged to main.

Perf numbers below show an improvement on decode (ITL) and a regression on prefill (TTFT). I'm not going to invest in the prefill path, as it uses the deprecated V0 prefill Triton kernel; will revisit performance once other backends are supported.

VLLM_USE_V1=0 vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV -O '{"pass_config":{"enable_attn_fusion": false}}':

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.74|±  |0.0441|
|     |       |strict-match    |     5|exact_match|↑  | 0.70|±  |0.0461|

VLLM_USE_V1=0 vllm serve amd/Llama-3.1-8B-Instruct-FP8-KV -O '{"pass_config":{"enable_attn_fusion": true}}':

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.74|±  |0.0441|
|     |       |strict-match    |     5|exact_match|↑  | 0.68|±  |0.0469|


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added v1 tpu Related to Google TPUs labels Apr 17, 2025
@ProExpertProg ProExpertProg force-pushed the luka/fusion-attention-fp8 branch from d6b46c4 to d9d415d Compare April 17, 2025 04:59
ProExpertProg commented Apr 17, 2025

Triton compile issue resolved

The code was initially failing with a strange Triton compilation error:

loc("/home/luka/git/vllm/vllm/attention/ops/triton_flash_attention.py":863:57): error: operand #1 does not dominate this use

The offending line:

                        v_descale_ptrs = v_descale_ptr + off_h_k

Repro steps (this reproduces even without this PR and without torch.compile):

VLLM_USE_V1=0 python examples/offline_inference/basic/generate.py --model amd/Llama-3.1-8B-Instruct-FP8-KV --kv-cache-dtype fp8

ProExpertProg commented Apr 17, 2025

Memory issue resolved

There is a Triton memory issue when attention fusion is enabled.

Repro steps:

VLLM_USE_V1=0 python examples/offline_inference/basic/generate.py --compilation-config="{'debug_dump_path':'debug-amd','level':3,'pass_config':{'enable_attn_fusion':True}}" --model amd/Llama-3.1-8B-Instruct-FP8-KV --kv-cache-dtype fp8

Works without attention fusion:

VLLM_USE_V1=0 python examples/offline_inference/basic/generate.py --compilation-config="{'debug_dump_path':'debug-amd','level':3,'pass_config':{'enable_attn_fusion':False}}" --model amd/Llama-3.1-8B-Instruct-FP8-KV --kv-cache-dtype fp8


mergify bot commented Apr 17, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ProExpertProg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 17, 2025
@houseroad houseroad requested a review from zou3519 April 18, 2025 05:36
@houseroad

Hi @zou3519, could you also help review the torch.compile pass part? Thanks.

@hongxiayang hongxiayang added the rocm Related to AMD ROCm label Apr 18, 2025
@ProExpertProg ProExpertProg force-pushed the luka/fusion-attention-fp8 branch from ca19be3 to fc60dcc Compare April 25, 2025 04:28
@mergify mergify bot removed the needs-rebase label Apr 25, 2025
@ProExpertProg ProExpertProg force-pushed the luka/fusion-attention-fp8 branch from 130f5a8 to 030f5ce Compare April 25, 2025 05:44
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
- cleanup backends to release llms
- increase gpu_model_utilization

@ProExpertProg ProExpertProg force-pushed the luka/fusion-attention-fp8 branch from 66152d1 to 98de2f9 Compare June 11, 2025 16:59
@zou3519 zou3519 left a comment

LGTM

@gshtras gshtras left a comment

That's awesome, thanks!

@simon-mo simon-mo merged commit f98548b into vllm-project:main Jun 12, 2025
94 of 96 checks passed
@ProExpertProg ProExpertProg deleted the luka/fusion-attention-fp8 branch June 12, 2025 15:34