
Conversation

gshtras
Collaborator

@gshtras gshtras commented Jul 18, 2025

Fix for the regression introduced in #20330.

Before that change, certain configs relied on undefined behavior on ROCm (using the warpSize compiler builtin in host code); after it, they started crashing due to the value mismatch.

On ROCm, the same compiled binary can run on platforms with different warp sizes (64 on Instinct, 32 on Radeon), so the WARP_SIZE used in host code can't be made constexpr; it has to be queried from the driver instead.

This PR addresses that issue. It also fixes the cuda_compat include paths to be relative, so that the un-hipified version is not pulled in when the hipify script is used (ROCm).
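
For illustration, a minimal sketch of how a cuda_compat-style header could split the definition; the macro layout and the __AMDGCN_WAVEFRONT_SIZE__ / get_warp_size() names here are assumptions for the sketch, not the PR's verbatim code:

// Hypothetical sketch, not the PR's actual code. On CUDA the warp size stays
// a plain compile-time define. On ROCm only device code, compiled once per
// target architecture, can rely on a compile-time value; host code must ask
// the driver, since one binary may run on Instinct (64) or Radeon (32).
#ifndef USE_ROCM
  #define WARP_SIZE 32
#else
  #ifdef __HIP_DEVICE_COMPILE__
    // Device pass: the per-target wavefront size is a genuine constant here.
    #define WARP_SIZE __AMDGCN_WAVEFRONT_SIZE__
  #else
    // Host pass: defer to a runtime driver query, e.g. a get_warp_size()
    // helper like the one discussed in the review below.
    #define WARP_SIZE get_warp_size()
  #endif
#endif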

Affected configs include:

VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 python benchmarks/benchmark_throughput.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --dtype float16 --num-prompts 500 --input-len 2048 --output-len 2048 -tp 8 -O '{"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fusion":true},"full_cuda_graph":true,"custom_ops":["+silu_and_mul"]}' --no-enable-prefix-caching
VLLM_V1_USE_PREFILL_DECODE_ATTENTION=1 python benchmarks/benchmark_latency.py --model mistralai/Mixtral-8x22B-v0.1 --dtype float16 --batch-size 16 --input-len 1024 --output-len 1024 -tp 8 -O '{"pass_config":{"enable_noop":true,"enable_fusion":true},"full_cuda_graph":true}' --num-iters-warmup 1 --num-iters 3

gshtras added 2 commits July 18, 2025 12:06
Using cuda_compat to define the WARP_SIZE once
On ROCm the constant compile-time warp size cannot be used on the host side that can be shared for multiple architectures with different values

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras gshtras force-pushed the host_side_warp_size branch from 3bf10f0 to 88827d2 on July 18, 2025 at 21:26
@mergify mergify bot added the rocm (Related to AMD ROCm) label Jul 18, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a critical bug related to warp_size usage on ROCm by querying the warp size from the driver at runtime. The changes also include fixing paths to cuda_compat to be relative. A potential data race condition has been identified in the caching mechanism of the get_warp_size function, and a thread-safe solution has been proposed.

Comment on lines 8 to 27
static __host__ int get_warp_size() {
#if defined(USE_ROCM)
  static bool is_cached = false;
  static int result;

  if (!is_cached) {
    int device_id;
    cudaDeviceProp deviceProp;
    cudaGetDevice(&device_id);
    cudaGetDeviceProperties(&deviceProp, device_id);

    result = deviceProp.warpSize;
    is_cached = true;
  }

  return result;
#else
  return 32;
#endif
}
Contributor

critical

The current implementation of the get_warp_size function on the host side uses a static boolean is_cached and a static variable result to cache the warp size. This caching mechanism is susceptible to a data race if multiple threads call get_warp_size simultaneously before the value is cached. This can lead to incorrect warp size values or potential crashes. To fix this, use a thread-safe initialization of the static variable using a lambda to ensure that the warp size is initialized only once in a thread-safe manner.

  static __host__ int get_warp_size() {
#if defined(USE_ROCM)
    // C++11 guarantees that static local variable initialization is thread-safe.
    // This avoids a data race on is_cached and result.
    static const int result = [] {
      int device_id;
      cudaDeviceProp deviceProp;
      cudaGetDevice(&device_id);
      cudaGetDeviceProperties(&deviceProp, device_id);
      return deviceProp.warpSize;
    }();
    return result;
#else
    return 32;
#endif
  }
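
As a standalone illustration of the C++11 guarantee this suggestion relies on (the function names below are made up for the demo): initialization of a function-local static is serialized by the runtime, so it runs exactly once even when many threads race to call the function first.

#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for the expensive driver query; the message prints once no matter
// how many threads race into cached_value() below.
static int expensive_query() {
  std::puts("initialized exactly once");
  return 64;
}

static int cached_value() {
  // C++11 "magic statics": this initialization is thread-safe.
  static const int value = expensive_query();
  return value;
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i) {
    threads.emplace_back([] { (void)cached_value(); });
  }
  for (auto& t : threads) {
    t.join();
  }
  std::printf("value = %d\n", cached_value());
  return 0;
}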

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras gshtras added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jul 22, 2025
Contributor

@SageMoore SageMoore left a comment

Looks reasonable, @gshtras

@vllm-bot vllm-bot merged commit 90eeea8 into vllm-project:main Jul 24, 2025
96 of 97 checks passed
DW934 pushed a commit to DW934/vllm that referenced this pull request Jul 28, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: 董巍 <dongwei@U-5XFVDYCF-2058.local>
gshtras added a commit to ROCm/vllm that referenced this pull request Jul 31, 2025
Commits included:

Using cuda_compat to define the WARP_SIZE once

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

On ROCm the constant compile-time warp size cannot be used on the host side that can be shared for multiple architectures with different values

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

Formatting

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

Refactor to use cuda_compat, and not the un-hipified version

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

Leaving CUDA side as just a simple define

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
wenscarl pushed a commit to wenscarl/vllm that referenced this pull request Aug 4, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: shuw <shuw@nvidia.com>
x22x22 pushed a commit to x22x22/vllm that referenced this pull request Aug 5, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: x22x22 <wadeking@qq.com>
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Aug 6, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
npanpaliya pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Aug 6, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
heyselbi pushed a commit to heyselbi/vllm that referenced this pull request Aug 8, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
jinzhen-lin pushed a commit to jinzhen-lin/vllm that referenced this pull request Aug 9, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
taneem-ibrahim pushed a commit to taneem-ibrahim/vllm that referenced this pull request Aug 14, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
BoyuanFeng pushed a commit to BoyuanFeng/vllm that referenced this pull request Aug 14, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Boyuan Feng <boyuan@meta.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>