[CPU] add optimizations for INT8 and FP8 DeepSeek #6769

chunyuan-w · 2025-05-30T08:02:43Z

Motivation

Call the following kernels in DeepSeek to optimize the INT8 and FP8 paths:
INT8 linear: int8_scaled_mm_with_quant
FP8 linear: fp8_scaled_mm_cpu
INT8 and FP8 shared_expert: shared_expert_cpu
INT8 and FP8 MoE: fused_experts_cpu

For bmm, FP8 weights are not currently supported, so the weight is converted to BF16 (on CPU only).
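
A minimal sketch of what such a conversion could look like, using plain Python lists as stand-ins for tensors. The function name `dequantize_blockwise`, the tile-shaped scale layout, and the block size are assumptions for illustration; the PR description only states that the weight is converted to BF16.

```python
from typing import List

Matrix = List[List[float]]


def dequantize_blockwise(weight: Matrix, scales: Matrix, block_size: int) -> Matrix:
    """Multiply each weight element by the scale of the tile it falls in.

    weight: decoded low-precision values (stand-in for an FP8 tensor).
    scales: one scale per block_size x block_size tile (hypothetical layout).
    """
    out: Matrix = []
    for i, row in enumerate(weight):
        scale_row = scales[i // block_size]
        out.append([v * scale_row[j // block_size] for j, v in enumerate(row)])
    return out


w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
s = [[0.5, 2.0]]  # one row of tiles, two tiles of width 2
w_wide = dequantize_blockwise(w, s, block_size=2)
```

With a real tensor library the same idea would be a cast to the wider dtype followed by a broadcast multiply with the scale.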

gemini-code-assist

Hello @chunyuan-w, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team! Gemini here, providing a summary of this pull request.

This PR introduces CPU-specific optimizations for the DeepSeek model, focusing on INT8 and FP8 quantization. The primary goal is to leverage Intel AMX instructions for accelerated matrix multiplication operations on CPU, specifically within linear layers and the Mixture-of-Experts (MoE) layers (both fused experts and shared experts). The changes involve adding new execution paths for CPU devices that utilize dedicated CPU kernels for these quantized operations, alongside necessary weight processing and checks for CPU capabilities.

Highlights

CPU Optimization for Quantized Layers: Adds support for running INT8 (W8A8) and FP8 (W8A16) quantized linear and MoE layers on CPU, specifically targeting Intel AMX acceleration.
Integration of CPU Kernels: Introduces calls to new CPU kernels (fused_experts_cpu, fp8_scaled_mm_cpu, int8_scaled_mm_with_quant, shared_expert_cpu) within the forward passes of quantized layers when running on CPU with AMX support.
DeepSeekV2 MoE CPU Path: Adds a dedicated forward_cpu method to the DeepseekV2MoE layer to handle the execution flow when using the new CPU-optimized quantized layers, including processing both the selected experts and the shared expert.
Weight Processing for CPU: Includes logic in the weight processing steps for quantized layers to handle weights specifically for the CPU backend, ensuring compatibility with the new kernels.
FP8 Attention Handling on CPU: Adds specific post-loading handling for FP8 attention weights (w_kc, w_vc) in the DeepSeekV2 model when running on CPU with AMX, converting them to bfloat16 and applying scale.
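
The dispatch pattern described in these highlights can be sketched roughly as follows. `ToyQuantLinear` and its methods are hypothetical stand-ins, not classes from this PR; the sketch only shows the idea of deciding once whether the optimized CPU path applies and falling back to a native implementation otherwise.

```python
from typing import List, Tuple


class ToyQuantLinear:
    """Toy layer that records, once, whether the optimized CPU path applies."""

    def __init__(self, device: str, has_amx: bool) -> None:
        # Decided at weight-processing time in this sketch, not on every call.
        self.use_amx_path = device == "cpu" and has_amx

    def apply(self, x: List[float], scale: float) -> Tuple[str, List[float]]:
        if self.use_amx_path:
            return "amx", self._optimized(x, scale)
        return "native", self._native(x, scale)

    @staticmethod
    def _optimized(x: List[float], scale: float) -> List[float]:
        # Stand-in for a call into a dedicated CPU kernel.
        return [v * scale for v in x]

    @staticmethod
    def _native(x: List[float], scale: float) -> List[float]:
        # Stand-in for the framework's default implementation.
        out = []
        for v in x:
            out.append(v * scale)
        return out


path, y = ToyQuantLinear("cpu", has_amx=True).apply([1.0, 2.0], 0.5)
```

Both branches must produce the same numerical result; only the execution path differs.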

Changelog


python/sglang/srt/layers/moe/fused_moe_triton/layer.py
- Added a forward_cpu method (L239-L271) to UnquantizedFusedMoEMethod to handle CPU execution.
- The forward_cpu method checks for layer.use_intel_amx_backend and calls torch.ops.sgl_kernel.fused_experts_cpu if enabled, otherwise falls back to the native implementation.
python/sglang/srt/layers/quantization/fp8.py
- Imported _process_weight_after_loading and cpu_has_amx_support (L67-L68).
- Added CPU device and AMX support check in Fp8LinearMethod.process_weights_after_loading (L326-L331) to call _process_weight_after_loading.
- Added CPU/AMX check in Fp8LinearMethod.apply (L428-L437) for block quantization, calling torch.ops.sgl_kernel.fp8_scaled_mm_cpu.
- Added CPU device and AMX support check in Fp8MoEMethod.process_weights_after_loading (L760-L765) to call _process_weight_after_loading for MoE weights.
- Added CPU/AMX check in Fp8MoEMethod.apply (L989-L1005) to call torch.ops.sgl_kernel.fused_experts_cpu with FP8 flags.
python/sglang/srt/layers/quantization/moe_wna16.py
- Modified the condition in is_moe_wna16_compatible (L134) to check whether every element of capability_tuple is None, rather than whether the tuple itself is None.
python/sglang/srt/layers/quantization/w8a8_int8.py
- Imported _process_weight_after_loading and cpu_has_amx_support (L15-L16).
- Added CPU device and AMX support check in W8A8Int8LinearMethod.process_weights_after_loading (L80-L85) to call _process_weight_after_loading.
- Added CPU/AMX check in W8A8Int8LinearMethod.apply (L127-L135) to call torch.ops.sgl_kernel.int8_scaled_mm_with_quant.
- Added CPU device and AMX support check in W8A8Int8MoEMethod.process_weights_after_loading (L231-L236) to call _process_weight_after_loading for MoE weights.
- Added CPU/AMX check in W8A8Int8MoEMethod.apply (L282-L298) to call torch.ops.sgl_kernel.fused_experts_cpu with INT8 flags.
python/sglang/srt/models/deepseek_v2.py
- Imported cpu_has_amx_support (L94).
- Added shared_experts_is_int8, shared_experts_is_fp8, and shared_experts_weight_block_size attributes (L265-L267).
- Initialized the new shared expert quantization attributes based on weight dtype and quant config (L285-L300).
- Modified the main forward method (L346-L353) to call a new forward_cpu method under specific conditions (no DeepEP, no shared experts fusion, AMX backend enabled).
- Added the forward_cpu method (L371-L427) for the DeepSeekV2 MoE layer, which calls self.experts and then torch.ops.sgl_kernel.shared_expert_cpu for the shared expert part, passing quantization details.
- Added specific handling in post_load_weights (L1908-L1918) for FP8 attention weights (w_kc, w_vc) on CPU with AMX, converting them to bfloat16 and applying scale.
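
To illustrate why the moe_wna16.py condition matters, here is a small sketch. It assumes that the capability helper returns a tuple such as (None, None) on devices that report no compute capability, and that the capability is encoded as major * 10 + minor with -1 as the "unknown" value; both are assumptions for illustration, and the function names are hypothetical.

```python
from typing import Optional, Tuple

Capability = Tuple[Optional[int], Optional[int]]


def capability_code_old(capability_tuple: Optional[Capability]) -> int:
    """Old-style check: only guards against the tuple itself being None."""
    if capability_tuple is None:
        return -1
    major, minor = capability_tuple
    return major * 10 + minor  # raises TypeError when major is None


def capability_code_new(capability_tuple: Capability) -> int:
    """New-style check: treats a tuple made up only of None as 'no capability'."""
    if all(part is None for part in capability_tuple):
        return -1
    major, minor = capability_tuple
    return major * 10 + minor


cpu_like: Capability = (None, None)
try:
    capability_code_old(cpu_like)
    old_check_failed = False
except TypeError:
    old_check_failed = True

new_result = capability_code_new(cpu_like)
```

Under these assumptions the old check crashes on a CPU-only host, while the new one returns the sentinel value.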

CPU gets new speed,
INT8 and FP8 indeed,
AMX helps it fly.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

gemini-code-assist

Code Review

This PR adds valuable CPU optimizations for DeepSeek models using Intel AMX for INT8 and FP8. The changes are generally well structured, introducing new dispatch paths for AMX-enabled CPUs.

Key changes include:

Addition of torch.ops.sgl_kernel.*_cpu calls for AMX-optimized operations in linear and MoE layers.
Handling of weight processing for CPU backends via _process_weight_after_loading.
Detection of quantized shared experts in DeepseekV2MoE to correctly pass parameters to the shared_expert_cpu kernel.

I've identified one potentially high-severity issue related to in-place operations in DeepseekV2MoE.forward_cpu that needs to be addressed. There's also a TODO comment that should be clarified. The checklist in the PR description is also incomplete, which is typical for work-in-progress but should be completed before merging.

I am unable to approve this pull request myself. Please ensure that the identified issues are addressed and that other reviewers approve this code before merging.

Summary of Findings

Potential In-place Modification Issue: In DeepseekV2MoE.forward_cpu, hidden_states might be modified in-place by self.experts() and then reused, potentially leading to incorrect inputs for shared_expert_cpu. A code suggestion has been provided to address this.
TODO Comment: A TODO comment regarding reshaping hidden_states in DeepseekV2MoE.forward_cpu should be clarified or resolved.
Clarity on Unquantized AMX Path: Clarification requested on whether the AMX path in UnquantizedFusedMoEMethod.forward_cpu is strictly for unquantized data, with INT8/FP8 AMX paths handled elsewhere.
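
The first finding can be illustrated with a small sketch that uses Python lists in place of tensors. All function names here are hypothetical, and whether the real experts call actually writes into its input is exactly the open question raised in the review; the sketch only shows what goes wrong if it does, and one possible way to guard against it.

```python
from typing import List


def experts_inplace(hidden: List[float]) -> List[float]:
    """Stand-in for a routed-experts call that overwrites its input buffer."""
    for i in range(len(hidden)):
        hidden[i] = hidden[i] * 2.0
    return hidden


def shared_expert(hidden: List[float]) -> List[float]:
    """Stand-in for the shared-expert call; does not modify its input."""
    return [v + 1.0 for v in hidden]


def forward_buggy(hidden: List[float]) -> List[float]:
    routed = experts_inplace(hidden)  # mutates `hidden`
    shared = shared_expert(hidden)    # sees the mutated values
    return [r + s for r, s in zip(routed, shared)]


def forward_safe(hidden: List[float]) -> List[float]:
    original = list(hidden)           # copy, analogous to cloning a tensor
    routed = experts_inplace(hidden)
    shared = shared_expert(original)  # sees the untouched input
    return [r + s for r, s in zip(routed, shared)]


buggy = forward_buggy([1.0, 2.0])
safe = forward_safe([1.0, 2.0])
```

Computing the shared-expert output before the routed experts would avoid the copy, at the cost of fixing the call order.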

Merge Readiness

This pull request makes significant strides in CPU optimization for DeepSeek models. However, there is a high-severity issue concerning potential in-place modification of tensors in DeepseekV2MoE.forward_cpu that must be addressed before merging to ensure correctness. Additionally, a medium-severity TODO comment should be resolved, and a point of clarification on the AMX path for unquantized data would be beneficial. The PR checklist should also be completed.

I am unable to approve this pull request myself. I recommend fixing the identified high-severity issue and considering the other points. After these changes, further review and approval by other maintainers will be necessary before merging.