Conversation

@Edwardf0t1 (Collaborator) commented Dec 21, 2024

Motivation

As discussed in our sync meeting @merrymercy @Ying1123 , we aim to contribute to SGLang by integrating NVIDIA's TensorRT Model Optimizer (ModelOpt) with optimized and quantized models, fostering collaboration to enhance the open-source inference ecosystem.

Modifications

This PR serves as an initial step toward adding support for ModelOpt quantized models in SGLang, starting with FP8 LLaMA 3.1 model inference. A basic test can be executed using the script provided below.

```python
import sglang as sgl


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # Load the pre-quantized FP8 checkpoint; quantization="modelopt" selects
    # the ModelOpt quantization path added in this PR.
    llm = sgl.Engine(model_path="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == "__main__":
    main()
```
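Besides the offline Engine API above, the same pre-quantized checkpoint can be served over HTTP through SGLang's launch_server entry point. A hedged sketch: the flags mirror the Engine arguments in the script above, and the port is an arbitrary choice.

```shell
# Sketch: serve the pre-quantized FP8 checkpoint over HTTP.
# --quantization modelopt mirrors the quantization= argument in the
# Engine example; --port 30000 is an arbitrary choice.
python -m sglang.launch_server \
  --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
  --quantization modelopt \
  --port 30000
```

Running this requires a CUDA GPU and the model weights, so it is shown here only as a CLI usage sketch.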

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Dec 21, 2024

@Edwardf0t1 Please help resolve the conflicts

@Edwardf0t1 (Collaborator, PR author) commented

> @Edwardf0t1 Please help resolve the conflicts

Done

@zhyncs zhyncs added high priority quant LLM Quantization and removed await-response labels Dec 31, 2024
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/enable-modelopt-fp8 branch from 84aee77 to e847fac Compare December 31, 2024 18:11
@merrymercy (Contributor) left a review comment

Also, please fix the CI errors.

```diff
@@ -23,7 +23,7 @@ runtime_common = ["aiohttp", "decord", "fastapi",
                   "psutil", "pydantic", "python-multipart",
                   "pyzmq>=25.1.2", "torchao>=0.7.0", "uvicorn", "uvloop",
                   "xgrammar>=0.1.6"]
-srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10"]
+srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10", "nvidia-modelopt"]
```
A Contributor replied

Can we make this an optional dependency?

@Edwardf0t1 (Collaborator, PR author) replied

IIUC this is already under [project.optional-dependencies]. Also see this comment from @zhyncs:
#2535 (comment)

@merrymercy (Contributor) commented Jan 6, 2025

Most people will just run pip install sglang[all]. Can we avoid specifying nvidia-modelopt here and instead ask people to install it manually when they want to use the model?

@Edwardf0t1 (Collaborator, PR author) replied

LGTM. Actually, nvidia/Llama-3.1-8B-Instruct-FP8 is pre-quantized, meaning that a deployment-only workflow doesn't need modelopt, so it makes sense to remove it for now.
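For users who do want to quantize checkpoints themselves, ModelOpt could instead be exposed as its own named extra rather than being bundled into srt. A hypothetical pyproject.toml sketch; the extra name modelopt is an assumption, not something this PR adds:

```toml
[project.optional-dependencies]
# Hypothetical extra: installed only via `pip install "sglang[modelopt]"`,
# so a plain `pip install sglang[all]` is unaffected unless "all" lists it.
modelopt = ["nvidia-modelopt"]
```

This keeps the default install lean while still giving quantization users a one-line install path.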

@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/enable-modelopt-fp8 branch from 687ae9b to 4cecb9c Compare January 3, 2025 00:38
@Edwardf0t1 (Collaborator, PR author) commented

Hi @merrymercy, I left a comment on your recently merged PR: I found it could cause issues in my test when running llm = sgl.Engine(model_path="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt").

@merrymercy (Contributor) commented Jan 3, 2025

I see. What is the correct value of GLOO_SOCKET_IFNAME in your environment?

@Edwardf0t1 (Collaborator, PR author) replied

> I see. What is the correct value of GLOO_SOCKET_IFNAME in your environment?

I can use the ens8np0 or enp2s0 interface for GLOO_SOCKET_IFNAME, depending on the system.
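A minimal sketch of pinning Gloo to a specific network interface before launching, assuming the interface names mentioned above; check your own system's interfaces first (e.g. with ip -o link show):

```shell
# Pin Gloo's socket binding to one interface before starting SGLang.
# ens8np0 is taken from the comment above; substitute your own interface.
export GLOO_SOCKET_IFNAME=ens8np0
echo "GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME}"
```

The variable must be set in the environment of the process that initializes the Gloo process group, so export it before launching the engine or server.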

@merrymercy (Contributor) commented

Fixed by 3a22a30

@zhyncs zhyncs mentioned this pull request Jan 5, 2025
@merrymercy merrymercy merged commit 287427e into sgl-project:main Jan 6, 2025
15 checks passed
@merrymercy (Contributor) commented

@Edwardf0t1 Thanks. It is merged.

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025