Conversation

@Edwardf0t1 (Collaborator) commented Dec 21, 2024

Motivation

As discussed in our sync meeting @merrymercy @Ying1123 , we aim to contribute to SGLang by integrating NVIDIA's TensorRT Model Optimizer (ModelOpt) with optimized and quantized models, fostering collaboration to enhance the open-source inference ecosystem.

Modifications

This PR serves as an initial step toward adding support for ModelOpt quantized models in SGLang, starting with FP8 LLaMA 3.1 model inference. A basic test can be executed using the script provided below.

```python
import sglang as sgl


def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    # Load the pre-quantized FP8 checkpoint; quantization="modelopt" selects
    # the ModelOpt quantization path added in this PR.
    llm = sgl.Engine(model_path="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")


if __name__ == "__main__":
    main()
```
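Besides the offline Engine API above, the same pre-quantized checkpoint can be served over HTTP through SGLang's launch_server entry point. A hedged sketch: the flags mirror the Engine arguments in the script above, and the port is an arbitrary choice.

```shell
# Sketch: serve the pre-quantized FP8 checkpoint over HTTP.
# --quantization modelopt mirrors the quantization= argument in the
# Engine example; --port 30000 is an arbitrary choice.
python -m sglang.launch_server \
  --model-path nvidia/Llama-3.1-8B-Instruct-FP8 \
  --quantization modelopt \
  --port 30000
```

Running this requires a CUDA GPU and the model weights, so it is shown here only as a CLI usage sketch.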

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member) commented Dec 21, 2024

@Edwardf0t1 Please help resolve the conflicts

@Edwardf0t1 (Collaborator, PR author) commented

> @Edwardf0t1 Please help resolve the conflicts

Done

@zhyncs zhyncs added high priority quant LLM Quantization and removed await-response labels Dec 31, 2024
@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/enable-modelopt-fp8 branch from 84aee77 to e847fac Compare December 31, 2024 18:11
@merrymercy (Contributor) left a review comment

Also, please fix the CI errors.

```diff
@@ -23,7 +23,7 @@ runtime_common = ["aiohttp", "decord", "fastapi",
                   "psutil", "pydantic", "python-multipart",
                   "pyzmq>=25.1.2", "torchao>=0.7.0", "uvicorn", "uvloop",
                   "xgrammar>=0.1.6"]
-srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10"]
+srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10", "nvidia-modelopt"]
```
A Contributor replied

Can we make this an optional dependency?

@Edwardf0t1 (Collaborator, PR author) replied

IIUC this is already under [project.optional-dependencies]. Also see this comment from @zhyncs:
#2535 (comment)

@merrymercy (Contributor) commented Jan 6, 2025

Most people will just run pip install sglang[all]. Can we avoid specifying nvidia-modelopt here and instead ask people to install it manually when they want to use the model?

@Edwardf0t1 (Collaborator, PR author) replied

LGTM. Actually, nvidia/Llama-3.1-8B-Instruct-FP8 is pre-quantized, meaning that a deployment-only workflow doesn't need modelopt, so it makes sense to remove it for now.
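For users who do want to quantize checkpoints themselves, ModelOpt could instead be exposed as its own named extra rather than being bundled into srt. A hypothetical pyproject.toml sketch; the extra name modelopt is an assumption, not something this PR adds:

```toml
[project.optional-dependencies]
# Hypothetical extra: installed only via `pip install "sglang[modelopt]"`,
# so a plain `pip install sglang[all]` is unaffected unless "all" lists it.
modelopt = ["nvidia-modelopt"]
```

This keeps the default install lean while still giving quantization users a one-line install path.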

@Edwardf0t1 Edwardf0t1 force-pushed the zhiyu/enable-modelopt-fp8 branch from 687ae9b to 4cecb9c Compare January 3, 2025 00:38
@Edwardf0t1 (Collaborator, PR author) commented

Hi @merrymercy, I left a comment on your recently merged PR: I found it could cause issues in my test when running llm = sgl.Engine(model_path="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt").

@merrymercy (Contributor) commented Jan 3, 2025

I see. What is the correct value of GLOO_SOCKET_IFNAME in your environment?

@Edwardf0t1 (Collaborator, PR author) replied

> I see. What is the correct value of GLOO_SOCKET_IFNAME in your environment?

I can use the ens8np0 or enp2s0 interface for GLOO_SOCKET_IFNAME, depending on the system.
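A minimal sketch of pinning Gloo to a specific network interface before launching, assuming the interface names mentioned above; check your own system's interfaces first (e.g. with ip -o link show):

```shell
# Pin Gloo's socket binding to one interface before starting SGLang.
# ens8np0 is taken from the comment above; substitute your own interface.
export GLOO_SOCKET_IFNAME=ens8np0
echo "GLOO_SOCKET_IFNAME=${GLOO_SOCKET_IFNAME}"
```

The variable must be set in the environment of the process that initializes the Gloo process group, so export it before launching the engine or server.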

@merrymercy (Contributor) commented

Fixed by 3a22a30

@zhyncs zhyncs mentioned this pull request Jan 5, 2025
@merrymercy merrymercy merged commit 287427e into sgl-project:main Jan 6, 2025
15 checks passed
@merrymercy (Contributor) commented

@Edwardf0t1 Thanks. It is merged.

timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025