[Feature] Integrate DeepEP into SGLang #4232
Conversation
Force-pushed from aba1c69 to 3223a15
@liz-badada Could you rebase onto the latest main?
Done.
Hi liz-badada, I want to reproduce the code locally. Thank you.
Hi, please make sure you have already set up GDRCopy on the host. Some information:
Great job! Is there any data on throughput/latency benchmark results?
Thank you, and thanks to the amazing open-source community! However, while working with your repository, I encountered the following CUDA out-of-memory error. I've tried multiple approaches without success and would appreciate any suggestions you might have.
Hi liz-badada, please help me check some system information. Thanks a lot.
Hi Xiaofei-fei, try to set
Thank you for your reply. I'm using an H20 GPU with 96 GB of VRAM. Below are the solutions I've tried:
However, the problem remains unresolved.
Please check here: sys_info.zip
For 1 node, could you try a smaller value like '--mem-fraction-static 0.8'? Also, please use V3 instead of R1.
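To see why lowering the flag can help, here is a back-of-the-envelope sketch. This is illustrative arithmetic only, not SGLang internals; the `static_pool_gib` helper is hypothetical, and the server's real memory accounting differs in detail.

```python
# Illustrative arithmetic (not SGLang internals): --mem-fraction-static
# roughly caps the fraction of GPU memory reserved for the static pool
# (model weights + KV cache), leaving the remainder as headroom for
# activations and other runtime allocations.

def static_pool_gib(total_gib: float, mem_fraction_static: float) -> float:
    """Size of the static (weights + KV cache) pool in GiB."""
    return total_gib * mem_fraction_static

# On a 96 GiB H20, lowering the fraction from 0.9 to 0.8 frees
# roughly 9.6 GiB of extra headroom, which can be enough to avoid
# startup out-of-memory errors.
print(static_pool_gib(96, 0.9))  # ~86.4 GiB static pool
print(static_pool_gib(96, 0.8))  # ~76.8 GiB static pool
```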
Has anyone encountered the following error? When I was running tp=16 dp=16, there was an error:
Some frame error info:
My exec command:
I have tried without "--enable-dp-attention" or "--enable-deepep-moe", but the error persists. I also checked the idea in issue 1479, but I installed sglang from source, so it may not apply to this situation. I had encountered this error before; it disappeared when I merged the main branch, so I didn't delve into it, but this time it is blocking. 😭 Has anyone encountered the same error?
So far, are other --tp/--dp combinations supported? For example, --dp 2 --tp 8, which is actually DP2 with Group TP4, would need a group all_reduce before the dispatch operation, right?
I tried several tp & dp combinations, as shown in the following table:
Every "Error" is:
"otherError" is:
I couldn't understand why this happens.
@Huixxi We recommend dp=tp at the moment. We are going to implement reduce_scatter to adapt DeepEP to a broader range of scenarios.
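For readers unfamiliar with the collective mentioned above, here is a minimal pure-Python sketch of reduce-scatter semantics. There is no real NCCL communication here; the function name and data are illustrative only.

```python
# Pure-Python illustration of reduce-scatter (no real communication):
# each rank contributes a full-length vector; afterwards rank r holds
# only shard r of the element-wise sum. Compared with all-reduce, where
# every rank ends up with the full summed vector, each rank keeps just
# 1/world_size of the data -- the shard a grouped-TP layout needs
# before an expert-parallel dispatch.

def reduce_scatter(inputs: list[list[float]]) -> list[list[float]]:
    world = len(inputs)
    n = len(inputs[0])
    assert n % world == 0, "vector length must divide evenly across ranks"
    shard = n // world
    # Element-wise sum across all ranks (the "reduce" step).
    total = [sum(rank[i] for rank in inputs) for i in range(n)]
    # Each rank keeps only its contiguous shard (the "scatter" step).
    return [total[r * shard:(r + 1) * shard] for r in range(world)]

# 2 ranks, 4 elements: rank 0 gets the first half of the sum, rank 1 the second.
out = reduce_scatter([[1, 2, 3, 4], [10, 20, 30, 40]])
print(out)  # [[11, 22], [33, 44]]
```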
@liz-badada I'm confused about the performance table. It shows that the throughput of DP Attn + DeepEP is lower than the original DP+EP, and Input Throughput (tok/s) == Output Throughput.
No. As I commented above, the token permutation mechanism has not yet been optimized. We've encountered some issues with the permute triton kernel, which has necessitated a temporary fallback to PyTorch's native permute function. Additionally, the low-latency dispatch for decoding remains disabled. These limitations indicate that significant optimization work is still needed to reach our target performance levels.
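As background on what "token permutation" refers to here, the following is a minimal pure-Python sketch. It is illustrative only, not the actual SGLang/DeepEP kernel (which operates on GPU tensors); the function and variable names are made up for this example.

```python
# Minimal sketch of MoE token permutation: tokens are reordered so that
# all tokens routed to the same expert are contiguous, letting each
# expert run a single dense batch. The inverse permutation restores the
# original token order after the experts have run.

def permute_by_expert(expert_ids: list[int]) -> tuple[list[int], list[int]]:
    # Stable sort of token indices by expert id -> forward permutation.
    perm = sorted(range(len(expert_ids)), key=lambda t: expert_ids[t])
    # Invert it so we can undo the reordering later.
    inv = [0] * len(perm)
    for new_pos, old_pos in enumerate(perm):
        inv[old_pos] = new_pos
    return perm, inv

tokens = ["t0", "t1", "t2", "t3"]
experts = [2, 0, 2, 1]           # token -> assigned expert
perm, inv = permute_by_expert(experts)
grouped = [tokens[i] for i in perm]
print(grouped)                   # ['t1', 't3', 't0', 't2']
restored = [grouped[inv[i]] for i in range(len(tokens))]
print(restored)                  # ['t0', 't1', 't2', 't3']
```

The performance-sensitive part in practice is not computing `perm` but applying it to large hidden-state tensors, which is why a slow permute kernel shows up directly in end-to-end throughput.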
Yes, and FYI: the profiling file shows far too many HtoD and DtoD operators, 10x to 100x more than EP MoE, which has fewer than 100 HtoD + DtoD ops in total.
@liz-badada When do you plan to integrate the low_latency mode?
Low latency is WIP.
Update: after PR Optimize Permute Kernel in DeepEP #4643, the number of HtoD and DtoD operations has decreased to the same order of magnitude as EP MoE, just slightly more.
May I ask how to deploy on 4 nodes with TP/DP/EP? Should we adopt the following commands with
Hi, please check this: #4836
Thanks for your reply!
Is the data in the table incorrect?
Hi @CSEEduanyu, these perf data are pretty out of date; they are from several weeks ago. I suggest using the main branch to do the benchmarking.
What I want to ask is why the throughput of DeepEP is worse than EP. Could it be that the data in the table is reversed for the two?
Thank you very much for your contribution! However, I have two questions regarding the implementation of DP+EP: (1) When deploying with 16DP+16EP, why do we have to set both TP Size and DP Size to 16?
Motivation
Integrate DeepEP into the SGLang framework. Still WIP, but you can use '--enable-dp-attention --enable-deepep-moe' to trigger DeepEP intranode / internode. Please follow the install guide for the NVSHMEM dependency; a Dockerfile.deepep based on the SGLang image is also provided.
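For reference, a hypothetical launch sketch. Only the '--enable-dp-attention' and '--enable-deepep-moe' flags come from this PR; the model path, parallel sizes, and port below are placeholders to adapt to your own setup (this is a configuration example, not a tested command).

```shell
# Hypothetical single-node launch; adjust model path, sizes, and port.
# dp=tp is the recommended combination at this stage of the integration.
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 --dp 8 \
  --enable-dp-attention \
  --enable-deepep-moe \
  --trust-remote-code \
  --port 30000
```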
Co-author: @xutizhou
Note:
Single node:
ChatCompletion(id='c72b3edaf08f4145a53d497428c534d9', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='1. France - Capital: Paris \n2. Japan - Capital: Tokyo \n3. Brazil - Capital: Brasília', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=1)], created=1741594490, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=38, prompt_tokens=17, total_tokens=55, completion_tokens_details=None, prompt_tokens_details=None))
Multi node:
ChatCompletion(id='72a8328a7ca14e98b2c10604dfbee7ee', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Sure! Here are three countries and their capitals:\n\n1. France - Paris \n2. Japan - Tokyo \n3. Brazil - Brasília', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=1)], created=1741605820, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=35, prompt_tokens=17, total_tokens=52, completion_tokens_details=None, prompt_tokens_details=None))
Performance (Current performance is below expectations as token permutation is not yet optimized. Due to some bugs in the permute triton kernel, we have temporarily fallen back to using PyTorch's native permute function):
Modifications
Checklist