
[Docs] Document how to configure shared memory for multi GPU deployments #4259

@jsuchome

Description

This is a copy of sgl-project/sgl-project.github.io#5. I did not realize the documentation content is generated, so it seems more likely that the request belongs here.

The documentation states

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2

is a way to enable multi-GPU tensor parallelism. However, the processes serving the model need to communicate with each other, which usually requires a shared memory setup. If shared memory is not configured properly, one might run into errors like:

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.cpp:81, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
Error while creating shared memory segment /dev/shm/nccl-vzIpS6 (size 9637888)

when running the sglang server.

This error means that the available shared memory is too small.

When running in Docker containers, the shared memory size can be increased with the --shm-size flag (see vLLM's documentation at https://docs.vllm.ai/en/latest/deployment/docker.html).
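For example, a Docker invocation could look like the following (a minimal sketch; the lmsysorg/sglang image name, the 16g size, and the port/cache mounts are assumptions to adapt to your setup):

    # Allocate 16 GiB of shared memory for NCCL's inter-process communication
    docker run --gpus all \
        --shm-size 16g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path meta-llama/Meta-Llama-3-8B-Instruct \
            --tp 2 --host 0.0.0.0 --port 30000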

When running in Kubernetes, the default shared memory size (typically 64Mi, inherited from the container runtime) may not be enough for your containers, so you might need to configure a larger one. A common way to do this is to mount /dev/shm as a memory-backed emptyDir volume with an appropriate sizeLimit, like this:

    spec:
      containers:
      - command:
        ... < your usual container setup > ...
        volumeMounts:
        - mountPath: /dev/shm
          name: shared
      volumes:
      - emptyDir:
          medium: Memory
          sizeLimit: 1Gi
        name: shared
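Once the pod is running, you can verify the mount from inside the container (a sketch; <sglang-pod> is a placeholder for your pod name, and on recent Kubernetes versions the reported size of the tmpfs mount should match the sizeLimit):

    # Show the size and current usage of /dev/shm inside the pod
    kubectl exec -it <sglang-pod> -- df -h /dev/shm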

I have found that the vLLM project recommends 20Gi as a default value for the shared memory size; see vllm-project/production-stack#44 and their Helm chart value at https://github.com/vllm-project/production-stack/pull/105/files#diff-7d931e53fe7db67b34609c58ca5e5e2788002e7f99657cc2879c7957112dd908R130

However, I'm not sure where that number comes from. I was testing on a node with 2 NVIDIA L40 GPUs running the DeepSeek-R1-Distill-Qwen-32B model, and 1Gi of shared memory seemed to be enough.
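To pick a reasonable value for your own deployment, one option is to watch the actual shared memory consumption while the server is handling requests and then add some headroom (a sketch using standard Linux tools, run inside the container):

    # One-off snapshot of /dev/shm usage
    df -h /dev/shm

    # Refresh every second while sending inference requests to the server
    watch -n 1 df -h /dev/shm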
