-
-
Notifications
You must be signed in to change notification settings - Fork 10k
[Bugfix][ROCm] Fix for warp_size uses on host #21205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…st side that can be shared for multiple architectures with different values Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
3bf10f0
to
88827d2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses a critical bug related to warp_size
usage on ROCm by querying the warp size from the driver at runtime. The changes also include fixing paths to cuda_compat to be relative. A potential data race condition has been identified in the caching mechanism of the get_warp_size
function, and a thread-safe solution has been proposed.
csrc/cuda_compat.h
Outdated
static __host__ int get_warp_size() { | ||
#if defined(USE_ROCM) | ||
static bool is_cached = false; | ||
static int result; | ||
|
||
if (!is_cached) { | ||
int device_id; | ||
cudaDeviceProp deviceProp; | ||
cudaGetDevice(&device_id); | ||
cudaGetDeviceProperties(&deviceProp, device_id); | ||
|
||
result = deviceProp.warpSize; | ||
is_cached = true; | ||
} | ||
|
||
return result; | ||
#else | ||
return 32; | ||
#endif | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation of the get_warp_size
function on the host side uses a static boolean is_cached
and a static variable result
to cache the warp size. This caching mechanism is susceptible to a data race if multiple threads call get_warp_size
simultaneously before the value is cached. This can lead to incorrect warp size values or potential crashes. To fix this, use a thread-safe initialization of the static variable using a lambda to ensure that the warp size is initialized only once in a thread-safe manner.
static __host__ int get_warp_size() {
#if defined(USE_ROCM)
// C++11 guarantees that static local variable initialization is thread-safe.
// This avoids a data race on is_cached and result.
static const int result = [] {
int device_id;
cudaDeviceProp deviceProp;
cudaGetDevice(&device_id);
cudaGetDeviceProperties(&deviceProp, device_id);
return deviceProp.warpSize;
}();
return result;
#else
return 32;
#endif
}
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable, @gshtras
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: 董巍 <dongwei@U-5XFVDYCF-2058.local>
Commits included: Using cuda_compat to defint the WARP_SIZE once Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> On ROCm toe constant compile time warp size can not be used on the host side that can be shared for multiple architectures with different values Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Formatting Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Refactor to use cuda_compat, and not the unhippified version Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Leaving CUDA side as just a simple define Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Commits included: Using cuda_compat to defint the WARP_SIZE once On ROCm toe constant compile time warp size can not be used on the host side that can be shared for multiple architectures with different values Formatting Refactor to use cuda_compat, and not the unhippified version Leaving CUDA side as just a simple define Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: x22x22 <wadeking@qq.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Fix for the regression added in #20330
Before that change certain configs relied on the undefined behavior on ROCm (using warpSize compiler builtin on host), and after they started crashing due to the value mismatch.
On ROCm the same compiled image can be used on different platforms with different warp sizes (Instinct, Radeon), therefore the WARP_SIZE that is used in the host code can't be made constexpr, but rather needs to be queried from the driver.
This PR addresses this issue, as well as fixing the paths to cuda_compat to be relative, to not include the un-hippified version in case hipify script is being used (ROCm)
Affected configs include: