
Conversation

@mickqian (Collaborator) commented on Jun 22, 2025

Motivation

This PR fixes bugs in calculating the weighted hidden_states from expert outputs in apply_shuffle_mul_sum, where the experts' output was wrongly truncated, leading to NaNs in the outputs.

apply_shuffle_mul_sum itself will be fixed in the future, if that kernel shows a non-negligible speed advantage over the shuffle + mul + sum approach used here (sketched below).
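
For reference, a minimal PyTorch sketch of that shuffle + mul + sum fallback. The names (c2 for the permuted expert output, c_map for the inverse permutation, weighted_sum_fallback itself) are hypothetical and not the identifiers used in the actual cutlass_fused_experts_fp8 code:

```python
import torch

def weighted_sum_fallback(
    c2: torch.Tensor,            # (m * topk, k) expert outputs, in permuted order
    c_map: torch.Tensor,         # (m * topk,) indices restoring original token order
    topk_weights: torch.Tensor,  # (m, topk) routing weights per token
) -> torch.Tensor:
    m, topk = topk_weights.shape
    k = c2.shape[1]
    # Un-shuffle rows back into token order, then group the topk expert
    # outputs belonging to each token.
    gathered = c2[c_map].view(m, topk, k)
    # Weight each expert's contribution and reduce over the topk axis;
    # the weights broadcast over the hidden dimension, so no rows are truncated.
    return (gathered * topk_weights.unsqueeze(-1)).sum(dim=1)
```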

$ SGL_ENABLE_JIT_DEEPGEMM=false SGLANG_CUTLASS_MOE=true python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3-0324 --host 0.0.0.0 -tp=8 --dp=8 --enable-dp-attention

$ python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
gsm8k results (8-shot, 1319 questions):

Weights layout         Accuracy  Throughput
transpose weights      0.817     1875.9 tok/s
no-transpose weights   0.818     2092.073 tok/s

Modifications

Checklist


@gemini-code-assist bot left a comment


Summary of Changes

Hello @mickqian, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've addressed a critical bug in the Cutlass Mixture of Experts (MoE) implementation that was causing NaN values in the output. This was primarily due to incorrect handling of expert outputs and their associated weights, specifically an issue with truncation. The fix involves adjusting the expected tensor layouts for MoE weights and directly implementing the weighted sum logic, bypassing a previously problematic helper function. This ensures numerical stability and correctness for FP8 MoE operations.

Highlights

  • MoE Weight Tensor Layout Correction: I've updated the cutlass_fused_experts_fp8 function to expect the w1_q and w2_q weight tensors in a transposed layout, specifically (E, n*2, k) for w1_q and (E, k, n) for w2_q. Docstrings and internal assertions were updated to validate the new shapes (see the sketch after this list).
  • Direct Weighted Sum Calculation for Expert Outputs: I've replaced the call to apply_shuffle_mul_sum with explicit PyTorch operations (shuffle, reshape, element-wise multiplication with topk_weights, and sum) inside cutlass_fused_experts_fp8. This directly fixes the NaN outputs by aggregating expert results without the truncation issue.
  • Optimized Weight Handling in FP8 Quantization: I've removed the redundant transpose(1, 2) operations when passing weights from the FP8 quantization layer to cutlass_fused_experts_fp8. This matches the kernel's new expected tensor layout and avoids unnecessary transpositions.
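
To make the expected layouts concrete, here is a small illustrative sketch; the sizes E, n, k are made up, and the assertions mirror the kind of shape checks described above rather than the kernel's exact code:

```python
import torch

# Hypothetical sizes for illustration only: E experts, hidden size k,
# intermediate size n (names follow the shapes quoted above).
E, n, k = 8, 2048, 7168

# Transposed layouts now expected by cutlass_fused_experts_fp8.
w1_q = torch.empty(E, n * 2, k, dtype=torch.float8_e4m3fn)
w2_q = torch.empty(E, k, n, dtype=torch.float8_e4m3fn)

assert w1_q.shape == (E, 2 * n, k), "w1_q must be laid out as (E, n*2, k)"
assert w2_q.shape == (E, k, n), "w2_q must be laid out as (E, k, n)"
```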

@mickqian force-pushed the blackwell-cutlass-moe branch from 3a9567c to 4e688c0 on June 22, 2025, 16:55

@gemini-code-assist bot left a comment


Code Review

The pull request addresses a bug in the calculation of the weighted hidden states from the experts' output in apply_shuffle_mul_sum, which caused incorrect truncation and NaNs in the outputs. The fix modifies the shape assertions and the final computation of the result in cutlass_fused_experts_fp8, and additionally removes the transpose operations in fp8.py before calling cutlass_fused_experts_fp8. The changes seem reasonable, but it is crucial to verify that the weight shapes are correct after the transpose operations are removed.

@mickqian mentioned this pull request on Jun 23, 2025
@mickqian added the bug label on Jul 6, 2025
@mickqian force-pushed the blackwell-cutlass-moe branch from 3927e91 to e7c9d35 on July 11, 2025, 12:29