Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if your issue lacks environment info and a minimal reproducible demo, it will be difficult for us to reproduce and resolve it, which reduces the likelihood of receiving feedback.
- 4. If your issue is a question rather than a bug, please open a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
I am trying to serve gemma-3-27b-it on RTX 5090 GPUs using the sglang blackwell image, but the server crashes on startup with this error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.11/importlib/metadata/__init__.py", line 563, in from_name
return next(cls.discover(name=name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/utils.py", line 684, in assert_pkg_version
installed_version = version(pkg)
^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/importlib/metadata/__init__.py", line 1008, in version
return distribution(distribution_name).version
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/importlib/metadata/__init__.py", line 981, in distribution
return Distribution.from_name(distribution_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/importlib/metadata/__init__.py", line 565, in from_name
raise PackageNotFoundError(name)
importlib.metadata.PackageNotFoundError: No package metadata was found for flashinfer_python
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 14, in <module>
launch_server(server_args)
File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 726, in launch_server
tokenizer_manager, scheduler_info = _launch_subprocesses(server_args=server_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 513, in _launch_subprocesses
_set_envs_and_config(server_args)
File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 464, in _set_envs_and_config
assert_pkg_version(
File "/sgl-workspace/sglang/python/sglang/srt/utils.py", line 691, in assert_pkg_version
raise Exception(
Exception: flashinfer_python with minimum required version 0.2.5 is not installed. Please uninstall the old version and reinstall the latest version by following the instructions at https://docs.flashinfer.ai/installation.html.
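As a side note, the failing check is easy to reproduce in isolation: the traceback shows that assert_pkg_version resolves the package through importlib.metadata. Here is a minimal sketch of the same lookup (package name and version floor taken from the error message above), run with the container's interpreter:

import importlib.metadata

PKG = "flashinfer_python"
MIN_VERSION = "0.2.5"  # minimum required version, per the error message

try:
    installed = importlib.metadata.version(PKG)
    print(f"{PKG} {installed} is installed (need >= {MIN_VERSION})")
except importlib.metadata.PackageNotFoundError:
    # This is the branch the blackwell image hits: no package metadata is
    # found at all, which assert_pkg_version turns into the Exception above.
    print(f"{PKG} is not visible to this interpreter")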
Reproduction
The model is downloaded from Hugging Face.
Here is the Docker Compose configuration used to run it:
services:
  generation_gemma_3_27b_sglang:
    image: lmsysorg/sglang:blackwell
    container_name: generation-gemma-3-27b-sglang
    volumes:
      - ./models/google--gemma-3-27b-it:/models/google--gemma-3-27b-it
      - ./models/torchinductor_cache:/models/torchinductor_cache
    # restart: always
    network_mode: host # required by RDMA
    privileged: true # required by RDMA
    # Or you can publish only port 30000:
    # ports:
    #   - 30000:30000
    environment:
      - TORCHINDUCTOR_CACHE_DIR=/models/torchinductor_cache
    entrypoint: python3 -m sglang.launch_server
    command: >
      --model-path /models/google--gemma-3-27b-it
      --host 0.0.0.0
      --context-length 8192
      --port 30000
      --random-seed 0
      --log-requests-level 2
      --enable-metrics
      --max-running-requests 4
      --show-time-cost
      --dtype float16
      --stream-interval 2
      --served-model-name "gemma-3-27b"
      --tp 4
      --attention-backend flashinfer
    # --enable-torch-compile
    # --tokenizer-mode auto
    # --enable-mixed-chunk
    # --chat-template /models/CohereForAI--aya-expanse-8b/chat_template.json
    ulimits:
      memlock: -1
      stack: 67108864
    ipc: host
    # healthcheck:
    #   test: ["CMD-SHELL", "curl -f http://localhost:30000/health || exit 1"]
    #   retries: 3
    #   interval: 1h
    #   timeout: 1m
    #   start_period: 2m
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [GPU]
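With this file saved as docker-compose.yml, docker compose up generation_gemma_3_27b_sglang reproduces the crash: the server raises the exception above during startup and the container exits immediately. To inspect the image despite the crash, the entrypoint can be overridden, e.g. docker compose run --rm --entrypoint bash generation_gemma_3_27b_sglang (Compose v2 syntax).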
Environment
python3 -m sglang.check_env
/home/ubuntu-user/miniconda3/envs/default/lib/python3.12/site-packages/torch/cuda/__init__.py:287: UserWarning:
NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
If you want to use the NVIDIA GeForce RTX 5090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(
Python: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 18:56:27) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GeForce RTX 5090
GPU 0,1,2,3 Compute Capability: 12.0
CUDA_HOME: /usr
NVCC: Cuda compilation tools, release 12.2, V12.2.140
CUDA Driver Version: 570.144
PyTorch: 2.7.0+cu126
sglang: 0.4.6.post2
sgl_kernel: Module Not Found
flashinfer_python: Module Not Found
triton: 3.3.0
transformers: Module Not Found
torchao: Module Not Found
numpy: 2.2.5
aiohttp: 3.11.18
fastapi: 0.115.12
hf_transfer: Module Not Found
huggingface_hub: 0.31.1
interegular: Module Not Found
modelscope: Module Not Found
orjson: Module Not Found
outlines: Module Not Found
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.4
python-multipart: Module Not Found
pyzmq: 26.4.0
uvicorn: Module Not Found
uvloop: Module Not Found
vllm: Module Not Found
xgrammar: Module Not Found
openai: Module Not Found
tiktoken: Module Not Found
anthropic: Module Not Found
litellm: Module Not Found
decord: Module Not Found
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    SYS     SYS     0-7,16-23       0               N/A
GPU1    NODE     X      SYS     SYS     0-7,16-23       0               N/A
GPU2    SYS     SYS      X      NODE    8-15,24-31      1               N/A
GPU3    SYS     SYS     NODE     X      8-15,24-31      1               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
ulimit soft: 1073741816
(Note: I ran check_env in a conda environment on the host machine, because the error occurs inside the Docker container and the container exits immediately, so I cannot run the command there.)
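The same report can presumably be produced from inside the image by overriding the entrypoint, e.g. docker compose run --rm --entrypoint python3 generation_gemma_3_27b_sglang -m sglang.check_env; the numbers above therefore reflect my host conda environment rather than the container.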