Checklist
- 1. If the issue you raised is not a feature but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 2. Please use English, otherwise it will be closed.
Motivation
Research Questions
- Explore the trade-offs between scaling up to chips with more memory (H200) and scaling out the parallel inference world size on GPUs with less HBM (H100) (see [Efficiently Scaling Transformer Inference](https://arxiv.org/abs/2211.05102)). The goal is to minimize price per generated token at scale.
- How can we leverage the H200's extra HBM for efficient KV cache management? Test long context windows; a rough capacity estimate is sketched after this list.
- Measure the implications of the faster GPU memory bandwidth for parallel inference.
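As a first-order sanity check on the KV cache question, here is a back-of-the-envelope capacity estimate. It assumes Llama 3.1 70B (80 layers, 8 KV heads, head dimension 128) with an FP16 KV cache and roughly 140 GB of FP16 weights; these are rough assumptions, not measurements:

$$
\text{KV bytes per token} = 2 \cdot n_\text{layers} \cdot n_\text{kv heads} \cdot d_\text{head} \cdot 2\ \text{bytes} = 2 \cdot 80 \cdot 8 \cdot 128 \cdot 2 \approx 320\ \text{KB}
$$

With TP=4, 4 x H200 leaves about 4 · 141 − 140 ≈ 424 GB of HBM for KV cache versus 4 · 80 − 140 ≈ 180 GB on 4 x H100, i.e. on the order of 1.3M versus 0.55M cacheable tokens before subtracting activations and runtime overhead. This is the headroom that longer context windows and larger radix caches would draw on.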
Models of Interest
- Llama 3.3 70B
- Llama 3.1 405B
- DeepSeek models: testing the data parallel attention for MLA introduced in the latest SGLang v0.4 (see the launch sketch below).
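For the DeepSeek runs, a launch along these lines should exercise the new MLA data parallel attention path (the model path and flag set are assumptions to be checked against the v0.4 release notes):

```bash
# Serve a DeepSeek MLA model on 8 GPUs with data parallel attention
# (--enable-dp-attention is the SGLang v0.4 flag for this feature)
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V2.5 \
  --tp 8 \
  --enable-dp-attention \
  --trust-remote-code
```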
Preliminary Results
Following the setup of the official SGLang benchmark scripts.
Environment Configuration
Using the latest Docker image `lmsysorg/sglang:latest`, which ships SGLang v0.4.
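The server inside the image can be launched roughly as follows (model path, TP degree, and mounts are illustrative, adapted from the SGLang README):

```bash
# Start the SGLang server for Llama 3.1 70B on 4 GPUs (tensor parallel)
docker run --gpus all --shm-size 32g --ipc=host \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-70B-Instruct \
    --tp 4 --host 0.0.0.0 --port 30000
```

Full `collect_env` report from inside the container: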
```
Collecting environment information...
PyTorch version: 2.5.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-124-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H200
GPU 1: NVIDIA H200
GPU 2: NVIDIA H200
GPU 3: NVIDIA H200
GPU 4: NVIDIA H200
GPU 5: NVIDIA H200
GPU 6: NVIDIA H200
GPU 7: NVIDIA H200
Nvidia driver version: 550.127.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 52 bits physical, 57 bits virtual
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 1
Core(s) per socket: 96
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9654 96-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 1479.783
CPU max MHz: 3707.8120
CPU min MHz: 1500.0000
BogoMIPS: 4799.99
Virtualization: AMD-V
L1d cache: 6 MiB
L1i cache: 6 MiB
L2 cache: 192 MiB
L3 cache: 768 MiB
NUMA node0 CPU(s): 0-95
NUMA node1 CPU(s): 96-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
```
Online benchmark results
Llama 3.1 70B Instruct 4 x H200 141GB
RPS | Num Prompts | Engine | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) |
---|---|---|---|---|---|---|
4 | 1200 | SGLang | 3005.24 | 65.72 | 18.47 | 15.94 |
8 | 2400 | SGLang | 4064.98 | 73.70 | 24.02 | 17.75 |
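These online numbers come from the serving benchmark; an invocation of the following shape should reproduce the RPS = 4 row (the dataset and length flags are assumptions, since the exact ones are not recorded here):

```bash
# Online serving benchmark against the running server at 4 requests/s
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 1200 \
  --request-rate 4
```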
Offline benchmark results
Llama 3.1 70B Instruct 4 x H200 141GB
RPS | Num Prompts | Engine | Request throughput (req/s) | Output token throughput (tok/s) | Tensor Parallel |
---|---|---|---|---|---|
inf | 5000 | SGLang | 25.14 | 4885.17 | 4 |
Llama 3.1 70B Instruct 8 x H200 141GB
RPS | Num Prompts | Engine | Request throughput (req/s) | Output token throughput (tok/s) | Tensor Parallel |
---|---|---|---|---|---|
inf | 5000 | SGLang | 37.96 | 7376.03 | 8 |
Llama 3.1 405B Instruct 8 x H200 141GB
RPS | Num Prompts | Engine | Request throughput (req/s) | Output token throughput (tok/s) | Tensor Parallel |
---|---|---|---|---|---|
inf | 5000 | SGLang | 9.16 | 1779.16 | 8 |
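RPS = inf means the request rate is unbounded, so all prompts are submitted at once; a sketch of the corresponding invocation (dataset flags again assumed):

```bash
# Offline throughput: submit all 5000 prompts immediately
python3 -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 5000 \
  --request-rate inf
```

For the 405B rows, the server is relaunched with `--model-path meta-llama/Llama-3.1-405B-Instruct --tp 8`.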
Q: Where should we place this benchmarking information, in existing docs or create a new one? @merrymercy @zhyncs
Related resources
Hopper GPU HW specs comparison: H100 & H200
Technical Specification | H100 SXM | H200 SXM |
---|---|---|
BFLOAT16 Tensor Core | 989.5 TFLOPS | 989.5 TFLOPS |
FP16 Tensor Core | 989.5 TFLOPS | 989.5 TFLOPS |
FP8 Tensor Core | 1979 TFLOPS | 1979 TFLOPS |
INT8 Tensor Core | 1979 TOPS | 1979 TOPS |
GPU Memory | 80 GB | 141 GB |
GPU Memory Bandwidth | 3.35 TB/s | 4.8 TB/s |
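One implication of this table: at small batch sizes decode is typically memory-bandwidth bound, so the spec sheet suggests an upper bound of roughly 4.8 / 3.35 ≈ 1.43x per-GPU decode speedup for H200 over H100 at equal parallelism, while compute-bound prefill should change little given the identical Tensor Core throughput.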