[CPU] Update custom ops for the CPU backend #20255
Conversation
Summary of Changes
Hello @bigPYJ1151, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request updates the CPU backend with several optimizations, including improvements to AllReduce, GEMM, and Grouped-GEMM operations. It also adds support for the AVX512-BF16 and AVX512VNNI instruction sets, along with corresponding tests and documentation updates. The changes aim to improve the performance and efficiency of the CPU backend, especially for small batch sizes and MoE layers.
Highlights
- CPU Backend Optimization: Optimized shared-memory based AllReduce for CPU backend.
- GEMM and Grouped-GEMM Adaptation: Adapted small-batch optimized GEMM and Grouped-GEMM from sglang-cpu for linear and MoE layers.
- AVX512-BF16 and AVX512VNNI Support: Added support for AVX512-BF16 and AVX512VNNI instruction sets, including detection and compilation flags.
- Test Updates: Enabled corresponding tests for the new CPU backend features.
- Buildkite Configuration: Increased the timeout for CPU tests in Buildkite to 1.5 hours.
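As a rough usage sketch (not part of the PR text itself), the new kernel path appears to be gated by the `VLLM_CPU_SGL_KERNEL` environment variable together with AMX and bfloat16 checks, as quoted in the review excerpts further down. The snippet below only illustrates that gating; the exact behavior in the merged code may differ.

```python
# Hedged sketch: opting in to the sglang-derived small-batch CPU kernels.
# VLLM_CPU_SGL_KERNEL and the AMX check are taken from the diff excerpts
# below; the rest of this snippet is illustrative only.
import os

os.environ["VLLM_CPU_SGL_KERNEL"] = "1"  # enable the SGL GEMM / Grouped-GEMM path

import torch

# The kernels are only selected when the CPU supports AMX tiles and the
# model weights are bfloat16; otherwise the backend keeps its existing ops.
amx_ok = torch._C._cpu._is_amx_tile_supported()
print(f"AMX tile support: {amx_ok}")
```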
Code Review
This pull request introduces optimizations for CPU inference by integrating custom kernels from sglang for small-batch GEMM and MoE operations, and refactors the shared memory communication layer. The changes are substantial and well-executed. My review includes suggestions for removing debug code, improving code comments for the new synchronization logic, and adding user-friendly checks for the new kernel requirements. Overall, this is a great enhancement for vLLM's CPU backend.
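(The first code excerpt below shows the AVX-512 debug print helpers that the "removing debug code" suggestion refers to.)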
```cpp
inline void print_16x32i(const __m512i x) {
  int32_t a[16];
  _mm512_storeu_si512((__m512i *)a, x);

  for (int i = 0; i < 16; i++) {
    std::cout << a[i] << " ";
  }
  std::cout << std::endl;
}

inline void print_16x32(const __m512 x) {
  float a[16];
  _mm512_storeu_ps((__m512 *)a, x);

  for (int i = 0; i < 16; i++) {
    std::cout << a[i] << " ";
  }
  std::cout << std::endl;
}

inline void print_32x8u(const __m256i x) {
  uint8_t a[32];
  _mm256_storeu_si256((__m256i *)a, x);

  for (int i = 0; i < 32; ++i) {
    std::cout << int32_t(a[i]) << " ";
  }
  std::cout << std::endl;
}
```
```python
if (envs.VLLM_CPU_SGL_KERNEL
        and torch._C._cpu._is_amx_tile_supported()
        and dtype == torch.bfloat16):
```
The SGL kernels have alignment requirements for the weight tensor dimensions. Add checks here for the MoE layer weights (`w13_weight` and `w2_weight`) to ensure their dimensions are divisible by the required tile sizes (e.g., 32). This would provide a more user-friendly warning if the model is not compatible, rather than failing with a `TORCH_CHECK` from the C++ code.
Suggested change:
```python
if (envs.VLLM_CPU_SGL_KERNEL
        and torch._C._cpu._is_amx_tile_supported()
        and dtype == torch.bfloat16):
    _, N2, K = layer.w13_weight.shape
    N = N2 // 2
    if K % 32 != 0 or N % 32 != 0:
        logger.warning(
            "CPU SGL kernels for MoE require K and N dims to be multiples of 32. "
            "Disabling SGL kernels.")
        layer.cpu_fused_moe = cpu_fused_moe.IPEXFusedMOE(layer)
        return
```
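To make the constraint concrete, here is a small self-contained check (the helper name and the example shapes are hypothetical, not from the PR) for whether a model's `w13_weight` satisfies the 32-element alignment described above; `w2_weight` could be validated the same way.

```python
import torch

def sgl_moe_w13_aligned(w13_weight: torch.Tensor, tile: int = 32) -> bool:
    """Hypothetical helper: True if the packed gate/up projection weight
    has N and K dimensions that are multiples of `tile`."""
    _, n2, k = w13_weight.shape  # (num_experts, 2 * intermediate_size, hidden_size)
    n = n2 // 2
    return k % tile == 0 and n % tile == 0

# Example with dummy shapes: 4 experts, hidden size 1024, intermediate size 2048.
w13 = torch.empty(4, 2 * 2048, 1024, dtype=torch.bfloat16)
print(sgl_moe_w13_aligned(w13))  # True: 1024 and 2048 are both multiples of 32
```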
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Hi @DarkLight1337 @Isotr0py, would you please help review this PR? Thanks :) There is a lot of newly added code, but most of it is CPU-specific. The changes to vLLM core components are limited, and both the fast checks and the CPU test passed.
Purpose
Update the CPU backend custom ops, including:
- optimized shared-memory based AllReduce
- small-batch optimized GEMM and Grouped-GEMM (adapted from sglang-cpu) for linear and MoE layers
- AVX512-BF16 and AVX512VNNI support
Test Plan
Test Result
(Optional) Documentation Update