Skip to content

[Bug] Incorrect Memory Allocation on CUDA:0 by Non-Zero CUDA Processes in TP/DP #5732

@yhyang201

Description

@yhyang201

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

A potential memory leak has been identified related to the handling and broadcasting of multimodal data in distributed setups. The issue seems to originate from the interaction between the BaseMultiModalProcessor logic and the broadcast_pyobj utility function used by the Scheduler.

Image

Code Locations:

  1. Relevant processing logic:
  2. Serialization/Broadcasting utility:
    serialized_data = bytes(tensor_data.cpu().numpy())

The suspected cause lies within the Scheduler's recv_requests -> broadcast_pyobj workflow.

  1. Data (containing tensors, pixel_values ) is serialized on non-zero ranks (rank != 0) using serialized_data = bytes(tensor_data.cpu().numpy()) within broadcast_pyobj (ref: utils.py#L907).
  2. When this serialized data is received and deserialized by other worker ranks (rank > 0), the tensors within seem to be incorrectly assigned to device cuda:0.
  3. It is expected that these tensors should be placed on the receiving rank's corresponding device (e.g., cuda:rank).
  4. This apparent misallocation to cuda:0 on all receiving ranks leads to memory accumulating incorrectly, causing a leak.
  • The severity of the memory leak appears to increase as the total number of ranks (world_size) increases.

  • It is also hypothesized that the total amount of leaked memory may be proportional to the number/size of images (or tensor data) being broadcasted through this mechanism. This requires further investigation and testing to confirm.

Reproduction

https://docs.sglang.ai/backend/openai_api_vision.html

When using the Qwen2.5-VL model and following the configuration guidelines provided on the sglang official website, a memory leak can be observed after sending a single request.

Environment

Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 565.57.01
PyTorch: 2.5.1+cu124
sglang: 0.4.5.post1
sgl_kernel: 0.0.9.post1
flashinfer: Module Not Found
triton: 3.1.0
transformers: 4.51.1
torchao: 0.10.0
numpy: 2.2.4
aiohttp: 3.11.16
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.30.2
interegular: 0.3.3
modelscope: 1.25.0
orjson: 3.10.16
outlines: 0.1.11
packaging: 24.2
psutil: 7.0.0
pydantic: 2.11.3
multipart: Module Not Found
zmq: Module Not Found
uvicorn: 0.34.1
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.17
openai: 1.75.0
tiktoken: 0.9.0
anthropic: 0.49.0
litellm: 1.66.1
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX NODE SYS SYS 0-89 0N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX NODE SYS SYS 0-89 0N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS NODE PIX SYS SYS 0-89 0N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS NODE PIX SYS SYS 0-89 0N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS PIX NODE 90-179 1N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS PIX NODE 90-179 1N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS NODE PIX 90-179 1N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS NODE PIX 90-179 1N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS
NIC1 PIX PIX NODE NODE SYS SYS SYS SYS SYS X NODE SYS SYS
NIC2 NODE NODE PIX PIX SYS SYS SYS SYS SYS NODE X SYS SYS
NIC3 SYS SYS SYS SYS PIX PIX NODE NODE SYS SYS SYS X NODE
NIC4 SYS SYS SYS SYS NODE NODE PIX PIX SYS SYS SYS NODE X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4

Hypervisor vendor: KVM
ulimit soft: 1048576

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions