@yosoyjay yosoyjay commented Feb 12, 2025

Motivation

Add support for virtualized MI300X devices (MI300X_VF) in the ROCm container images. See the discussion in #3219.

Modifications

Add a RUN command to Dockerfile.rocm that finds all config files for MI300X and copies each one into the same directory, changing only the string "MI300X" in the file name to "MI300X_VF", so that devices in virtualized environments are supported.
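To make the copy step concrete, here is a minimal dry run of the same find/xargs/sed pipeline against a throwaway directory. The directory layout and the two config file names are made up for illustration; only the pipeline itself mirrors the RUN command in this PR.

```shell
#!/bin/sh
# Hypothetical dry run: demo_dir and the file names below are illustrative only.
set -eu
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/configs"
touch "$demo_dir/configs/E=8,N=4096,device_name=AMD_Instinct_MI300X.json"
touch "$demo_dir/configs/E=8,N=4096,device_name=NVIDIA_H100.json"

# Find files whose name contains MI300X, then copy each one to a sibling whose
# path has the first "MI300X" replaced with "MI300X_VF". The original is kept.
find "$demo_dir/configs" -type f -name '*MI300X*' \
  | xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}

# List the result: the MI300X config now has an MI300X_VF copy next to it.
ls "$demo_dir/configs"
```

Note that sed rewrites only the first "MI300X" in the full path, so this sketch assumes no parent directory name contains that string.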

Checklist

  • Format your code according to the Code Formatting with Pre-Commit.
  • [-] Add unit tests as outlined in the Running Unit Tests.
  • [-] Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
  • Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results (https://docs.sglang.ai/references/accuracy_evaluation.html).
  • [-] For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.

Benchmark on 8 virtualized MI300Xs

Invoked with python -m sglang.bench_one_batch --model deepseek-ai/DeepSeek-R1 --tp 8 --batch 32 --input-len 256 --output-len 32 --trust-remote-code in a container built from Dockerfile.rocm.

Before:

Benchmark ...
Prefill. latency: 0.81376 s, throughput:  10066.82 token/s
Decode.  latency: 0.10565 s, throughput:    302.89 token/s
Decode.  latency: 0.10617 s, throughput:    301.40 token/s
Decode.  latency: 0.10633 s, throughput:    300.96 token/s
Decode.  latency: 0.10639 s, throughput:    300.77 token/s
Decode.  latency: 0.10635 s, throughput:    300.88 token/s
Decode.  median latency: 0.10636 s, median throughput:    300.88 token/s
Total. latency:  4.110 s, throughput:   2242.37 token/s

After:

Benchmark ...
Prefill. latency: 0.82810 s, throughput:   9892.55 token/s
Decode.  latency: 0.04025 s, throughput:    794.98 token/s
Decode.  latency: 0.04025 s, throughput:    794.95 token/s
Decode.  latency: 0.04087 s, throughput:    783.05 token/s
Decode.  latency: 0.04208 s, throughput:    760.47 token/s
Decode.  latency: 0.04239 s, throughput:    754.95 token/s
Decode.  median latency: 0.04493 s, median throughput:    712.22 token/s
Total. latency:  2.201 s, throughput:   4186.69 token/s
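
For a quick read of the two runs above, the speedup can be computed from the reported median decode throughput and total throughput. This is only arithmetic on the numbers already shown, not an additional measurement.

```shell
#!/bin/sh
# Ratios of the "After" to "Before" throughput figures reported above.
decode_speedup=$(awk 'BEGIN { printf "%.2f", 712.22 / 300.88 }')
total_speedup=$(awk 'BEGIN { printf "%.2f", 4186.69 / 2242.37 }')
echo "decode median throughput: ${decode_speedup}x"   # prints 2.37x
echo "total throughput: ${total_speedup}x"            # prints 1.87x
```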

# Copy config files to support MI300X in virtualized environments (MI300X_VF). Symlinks will not be created in image build.
RUN find /sgl-workspace/sglang/python/sglang/srt/layers/quantization/configs/ \
/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/configs/ \
-type f -name '*MI300X*' | xargs -I {} sh -c 'vf_config=$(echo "$1" | sed "s/MI300X/MI300X_VF/"); cp "$1" "$vf_config"' -- {}
Tested this on an 8×MI300X machine in Azure: the configs are now actually picked up, and single-sequence generation throughput jumped from 17 tok/s to 28 tok/s!

@tot0

tot0 commented Feb 14, 2025

@HaiShaw @zhyncs This fix is very important for users of MI300X on platforms virtualizing the GPUs.

@zhyncs zhyncs merged commit 6ce6eab into sgl-project:main Feb 14, 2025
1 check failed