Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
A potential memory leak has been identified related to the handling and broadcasting of multimodal data in distributed setups. The issue seems to originate from the interaction between the `BaseMultiModalProcessor` logic and the `broadcast_pyobj` utility function used by the `Scheduler`.
Code Locations:
- Relevant processing logic: `kwargs["device"] = "cuda"`
- Serialization/Broadcasting utility: `sglang/python/sglang/srt/utils.py`, line 907 in b5be569: `serialized_data = bytes(tensor_data.cpu().numpy())`
The suspected cause lies within the `Scheduler`'s `recv_requests` -> `broadcast_pyobj` workflow.
- Data (containing tensors, `pixel_values`) is serialized on non-zero ranks (`rank != 0`) using `serialized_data = bytes(tensor_data.cpu().numpy())` within `broadcast_pyobj` (ref: `utils.py#L907`).
- When this serialized data is received and deserialized by other worker ranks (`rank > 0`), the tensors within seem to be incorrectly assigned to device `cuda:0`.
- It is expected that these tensors should be placed on the receiving rank's corresponding device (e.g., `cuda:rank`).
- This apparent misallocation to `cuda:0` on all receiving ranks leads to memory accumulating incorrectly, causing a leak.
- The severity of the memory leak appears to increase as the total number of ranks (`world_size`) increases.
- It is also hypothesized that the total amount of leaked memory may be proportional to the number/size of images (or tensor data) being broadcast through this mechanism. This requires further investigation and testing to confirm.
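To help confirm or rule out the device hypothesis above, a small diagnostic can list the device of every tensor found inside a deserialized object. This is a minimal sketch, not part of sglang: `iter_tensor_devices` is a hypothetical helper, and "tensor-like" is duck-typed (anything with `.device` and `.shape`) so the snippet does not depend on PyTorch being importable.

```python
from typing import Any, Iterator, Optional, Set, Tuple


def iter_tensor_devices(
    obj: Any, path: str = "obj", _seen: Optional[Set[int]] = None
) -> Iterator[Tuple[str, str]]:
    """Yield (path, device) for every tensor-like leaf found inside obj.

    Hypothetical helper: "tensor-like" means any object exposing both
    .device and .shape, which covers torch.Tensor without importing torch.
    """
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:  # guard against reference cycles
        return
    _seen.add(id(obj))

    if hasattr(obj, "device") and hasattr(obj, "shape"):
        yield path, str(obj.device)
    elif isinstance(obj, dict):
        for key, value in obj.items():
            yield from iter_tensor_devices(value, f"{path}[{key!r}]", _seen)
    elif isinstance(obj, (list, tuple)):
        for i, value in enumerate(obj):
            yield from iter_tensor_devices(value, f"{path}[{i}]", _seen)
    elif hasattr(obj, "__dict__"):
        # Plain Python objects (e.g. request containers): walk their attributes.
        for name, value in vars(obj).items():
            yield from iter_tensor_devices(value, f"{path}.{name}", _seen)
```

Logging the result on each rank right after the broadcast returns (the exact call site is an assumption) would show whether ranks greater than 0 really hold tensors on `cuda:0`.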
Reproduction
https://docs.sglang.ai/backend/openai_api_vision.html
When using the Qwen2.5-VL model and following the configuration guidelines provided on the sglang official website, a memory leak can be observed after sending a single request.
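One way to make the observation measurable is to snapshot per-GPU memory before and after the single request and compare. The sketch below shells out to `nvidia-smi` with its CSV query output; the helper names (`parse_memory_used`, `gpu_memory_used`, `memory_delta`) are hypothetical and only meant as an illustration.

```python
import subprocess
from typing import Dict


def parse_memory_used(csv_text: str) -> Dict[int, int]:
    """Parse 'index, memory.used' CSV rows (MiB, no header, no units)."""
    usage: Dict[int, int] = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def gpu_memory_used() -> Dict[int, int]:
    """Return {gpu_index: used_MiB} as reported by nvidia-smi."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_memory_used(out)


def memory_delta(before: Dict[int, int], after: Dict[int, int]) -> Dict[int, int]:
    """Per-GPU change in used memory (MiB) between two snapshots."""
    return {gpu: after[gpu] - before.get(gpu, 0) for gpu in after}
```

Calling `gpu_memory_used()` once before and once after the request, then passing both snapshots to `memory_delta`, gives the per-GPU change. If the hypothesis holds, growth would be expected to concentrate on GPU 0.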
Environment
Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 565.57.01
PyTorch: 2.5.1+cu124
sglang: 0.4.5.post1
sgl_kernel: 0.0.9.post1
flashinfer: Module Not Found
triton: 3.1.0
transformers: 4.51.1
torchao: 0.10.0
numpy: 2.2.4
aiohttp: 3.11.16
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.30.2
interegular: 0.3.3
modelscope: 1.25.0
orjson: 3.10.16
outlines: 0.1.11
packaging: 24.2
psutil: 7.0.0
pydantic: 2.11.3
multipart: Module Not Found
zmq: Module Not Found
uvicorn: 0.34.1
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.17
openai: 1.75.0
tiktoken: 0.9.0
anthropic: 0.49.0
litellm: 1.66.1
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX NODE SYS SYS 0-89 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX NODE SYS SYS 0-89 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS NODE PIX SYS SYS 0-89 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS NODE PIX SYS SYS 0-89 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS PIX NODE 90-179 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS PIX NODE 90-179 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS NODE PIX 90-179 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS NODE PIX 90-179 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS
NIC1 PIX PIX NODE NODE SYS SYS SYS SYS SYS X NODE SYS SYS
NIC2 NODE NODE PIX PIX SYS SYS SYS SYS SYS NODE X SYS SYS
NIC3 SYS SYS SYS SYS PIX PIX NODE NODE SYS SYS SYS X NODE
NIC4 SYS SYS SYS SYS NODE NODE PIX PIX SYS SYS SYS NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
Hypervisor vendor: KVM
ulimit soft: 1048576