[AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger #17331
Conversation
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
I think generally we prefer CLI flags and config over environment variables. Am I missing any context for why this PR is needed?
Is this just for output scaling? So it decouples that from the kvcache dtype? Either way, could you add more details to the description of the PR?
I think it's so any FP8 model will work instead of just those with FP8 KV cache.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
@ProExpertProg Please take another look, also had to fix
vllm/config.py (Outdated)
@@ -4363,7 +4367,8 @@ def set_current_vllm_config(vllm_config: VllmConfig, check_compile=False):
             " if you want it to be supported.",
             vllm_config.model_config.model)
     finally:
-        _current_vllm_config = old_vllm_config
+        if was_raised:
+            _current_vllm_config = old_vllm_config
Why is this necessary? Shouldn't we always restore the current config?
Because _current_vllm_config is always getting overwritten with old_vllm_config when set_current_vllm_config is called, whether there was an exception or not.
Yeah I understand that. But why would we not want to restore old config if there was no exception?
Ohhh I see now, I think you might be using this incorrectly. set_vllm_config is meant to be used as a context manager:
with set_vllm_config(...):
...
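For context, here is a minimal sketch of the context-manager pattern being described, using simplified stand-ins rather than vLLM's actual implementation. Because the finally block restores the previous value unconditionally, the was_raised guard in the diff above should not be needed:

```python
from contextlib import contextmanager

_current_vllm_config = None  # module-level "current config" slot


@contextmanager
def set_current_vllm_config(vllm_config):
    global _current_vllm_config
    old_vllm_config = _current_vllm_config
    try:
        _current_vllm_config = vllm_config
        yield
    finally:
        # Restore unconditionally so a temporary config never leaks out of
        # the `with` block, whether or not an exception was raised.
        _current_vllm_config = old_vllm_config


# Intended usage: the override is only visible inside the block.
with set_current_vllm_config({"example": True}):
    pass  # code here sees the temporary config
```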
@@ -766,9 +767,15 @@ def forward(
                 query.dtype,
                 seq_lens,
                 make_attn_mask=causal_mask)  # type: ignore

+        vllm_config = get_current_vllm_config()
We shouldn't be reading config in the forward method. Instead it should be read during init
rocm_flash_attn doesn't seem to have any other access to the VllmConfig object. Is there another way for it to get access to the value it needs?
Sorry, I meant the backend/impl's __init__ (mentioned in the meeting).
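A rough sketch of what this asks for, with illustrative class and field names (the actual backend class and the final flag name may differ): the config is read once in __init__, while the model is being built under set_current_vllm_config, and forward() only uses the cached value.

```python
from vllm.config import get_current_vllm_config


class ROCmAttentionImplSketch:
    def __init__(self):
        # Construction happens inside `with set_current_vllm_config(...)`,
        # so the global lookup is valid here and runs exactly once.
        self.vllm_config = get_current_vllm_config()

    def forward(self, query, key, value):
        # No config lookup on the hot path; only the cached value is used.
        dtype_override = self.vllm_config.model_config.override_attention_dtype
        ...
```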
vllm/config.py (Outdated)
@@ -397,6 +397,8 @@ class ModelConfig:
     available.\n
     - "vllm" will use the vLLM model implementation.\n
     - "transformers" will use the Transformers model implementation."""
+    use_fp8_scales: bool = True
I think this needs a better name. One idea is override-attention-dtype, and then it's specified as fp8 on the CLI/in the config.
What do you mean by "and then it's specified as fp8 on the CLI/in the config"?
It's a string property that specifies the datatype (so not limited to fp8) - explained in the meeting.
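A hypothetical usage sketch of the string-valued flag through the offline Python API; the exact plumbing from engine arguments into ModelConfig is an assumption here, and the model name is just a placeholder.

```python
from vllm import LLM

llm = LLM(
    model="<some-fp8-quantized-model>",  # placeholder, not a real checkpoint
    override_attention_dtype="fp8",      # string-valued, so other dtypes could be supported later
)
```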
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
@ProExpertProg Please take another look, I was able to remove the changes to set_current_vllm_config after adding the call to get_current_vllm_config in init. I renamed the use_fp8_scales flag in favor of the override flag.
This looks great and is much cleaner. My only remaining concern is that we should really warn the user if the flag is ignored. If somebody specifies --override-attention-dtype=fp8 on NVIDIA or when not using the ROCmFlash backend, we should print a warning saying the flag is not actually doing anything.
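One way such a warning could be wired up, sketched as a standalone helper; the helper name and where it would be called from are assumptions, not part of this PR.

```python
import logging

logger = logging.getLogger(__name__)


def warn_if_override_ignored(override_attention_dtype, using_rocm_flash_backend):
    """Warn when --override-attention-dtype is set but the selected attention
    backend will never read it (e.g. on NVIDIA or a non-ROCmFlash backend)."""
    if override_attention_dtype is not None and not using_rocm_flash_backend:
        logger.warning(
            "override_attention_dtype=%s was specified but the current "
            "attention backend does not use it; the flag has no effect.",
            override_attention_dtype)
```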
vllm/config.py (Outdated)
@@ -407,6 +407,8 @@ class ModelConfig:
     available.\n
     - "vllm" will use the vLLM model implementation.\n
     - "transformers" will use the Transformers model implementation."""
+    override_attention_dtype: str = "fp8"
This should be None by default:

-    override_attention_dtype: str = "fp8"
+    override_attention_dtype: Optional[str] = None
@@ -580,6 +581,7 @@ def __init__(
             logger.debug("Using naive (SDPA) attention in ROCmBackend")

         self.aiter_kv_scales_initialized = False
+        self.vllm_config = get_current_vllm_config()
No need to save the whole config, just do self.force_fp8_attention = ...
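The suggestion, sketched as a small helper that could be called from __init__; the config field name follows the discussion above, and the merged code may differ.

```python
from vllm.config import get_current_vllm_config


def should_force_fp8_attention() -> bool:
    """Derive the flag once at init time; store the result on the impl as
    self.force_fp8_attention instead of keeping the whole VllmConfig."""
    model_config = get_current_vllm_config().model_config
    return (model_config is not None
            and model_config.override_attention_dtype == "fp8")
```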
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
         use_fp8_scales = (layer._q_scale and layer._k_scale
                           and layer._v_scale and layer._prob_scale
-                          and self.kv_cache_dtype == "fp8")
+                          and self.force_fp8_attention)
Should we check here if the KV cache is in fp8 already?
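One possible answer to that question, sketched as a helper that keeps the existing KV-cache check and folds in the new override; whether the merged code combines the two conditions this way is an assumption.

```python
def should_use_fp8_scales(layer, kv_cache_dtype: str,
                          force_fp8_attention: bool) -> bool:
    # All per-layer scales must be present, and fp8 must be requested either
    # via an fp8 KV cache or via the override flag.
    return bool(layer._q_scale and layer._k_scale
                and layer._v_scale and layer._prob_scale
                and (kv_cache_dtype == "fp8" or force_fp8_attention))
```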
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
LGTM! Sorry for the delay
This adds an override flag for the attention dtype in VllmConfig instead of using the kv_cache_dtype flag as the trigger, so any FP8 model will work, not just those with an FP8 KV cache.