
Conversation


@Alcanderian Alcanderian commented Mar 13, 2025

Motivation

Fix issue #4366

Modifications

Replace all torch.no_grad() with torch.inference_mode()
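As an illustration of the change (a minimal sketch with hypothetical function names, not the actual SGLang call sites):

```python
import torch

# Before: torch.no_grad() disables gradient tracking, but autograd still
# performs view/version-counter bookkeeping on the tensors involved.
@torch.no_grad()
def forward_no_grad(model, batch):
    return model(batch)

# After: torch.inference_mode() additionally skips that bookkeeping; tensors
# created inside cannot be used in autograd later, which is fine for serving.
@torch.inference_mode()
def forward_inference(model, batch):
    return model(batch)
```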

Benchmark on H100

Conclusion: There is essentially no performance difference; the slight improvement after the change is likely within run-to-run fluctuation.

With CUDA Graph

Command: `python3 -m sglang.bench_one_batch --model-path Qwen/Qwen2.5-7B-Instruct --batch 32 --input-len 256 --output-len 32`

  • Before
Benchmark ...
Prefill. latency: 0.16847 s, throughput:  48626.79 token/s
Decode.  latency: 0.01420 s, throughput:   2253.15 token/s
Decode.  latency: 0.00777 s, throughput:   4117.87 token/s
Decode.  latency: 0.00763 s, throughput:   4194.57 token/s
Decode.  latency: 0.00760 s, throughput:   4209.70 token/s
Decode.  latency: 0.00758 s, throughput:   4223.74 token/s
Decode.  median latency: 0.00750 s, median throughput:   4269.28 token/s
Total. latency:  0.408 s, throughput:  22582.28 token/s
  • After
Benchmark ...
Prefill. latency: 0.16690 s, throughput:  49082.68 token/s
Decode.  latency: 0.00823 s, throughput:   3889.69 token/s
Decode.  latency: 0.00776 s, throughput:   4124.83 token/s
Decode.  latency: 0.00774 s, throughput:   4133.72 token/s
Decode.  latency: 0.00762 s, throughput:   4197.32 token/s
Decode.  latency: 0.00757 s, throughput:   4224.54 token/s
Decode.  median latency: 0.00749 s, median throughput:   4269.69 token/s
Total. latency:  0.401 s, throughput:  23004.95 token/s

Without CUDA Graph

Command: `python3 -m sglang.bench_one_batch --model-path Qwen/Qwen2.5-7B-Instruct --batch 32 --input-len 256 --output-len 32 --disable-cuda-graph`

  • Before
Benchmark ...
Prefill. latency: 0.16517 s, throughput:  49595.89 token/s
Decode.  latency: 0.01936 s, throughput:   1653.23 token/s
Decode.  latency: 0.01764 s, throughput:   1814.42 token/s
Decode.  latency: 0.01754 s, throughput:   1823.91 token/s
Decode.  latency: 0.01753 s, throughput:   1825.55 token/s
Decode.  latency: 0.01740 s, throughput:   1839.56 token/s
Decode.  median latency: 0.01740 s, median throughput:   1839.05 token/s
Total. latency:  0.707 s, throughput:  13027.06 token/s
  • After
Benchmark ...
Prefill. latency: 0.16725 s, throughput:  48981.85 token/s
Decode.  latency: 0.01692 s, throughput:   1890.79 token/s
Decode.  latency: 0.01692 s, throughput:   1890.98 token/s
Decode.  latency: 0.01685 s, throughput:   1899.65 token/s
Decode.  latency: 0.01678 s, throughput:   1907.42 token/s
Decode.  latency: 0.01678 s, throughput:   1907.56 token/s
Decode.  median latency: 0.01658 s, median throughput:   1930.19 token/s
Total. latency:  0.684 s, throughput:  13483.45 token/s

Checklist

@zhyncs zhyncs marked this pull request as ready for review March 13, 2025 08:48

zhyncs commented Mar 13, 2025

@Alcanderian Some unit tests failed, could you help fix that? Thanks!

@Alcanderian
Collaborator Author

Referring to the error in unit-test-backend-2-gpu:

File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
    work.wait()
RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See 

It seems that the gloo backend has issues with inference mode in the current version of torch.

Ref: pytorch/pytorch#126032
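For reference, the underlying restriction can be reproduced in a few lines (a standalone repro of the error class, not the actual gloo all_gather path):

```python
import torch

with torch.inference_mode():
    t = torch.zeros(4)  # t is created as an "inference tensor"

# Any in-place update outside InferenceMode then raises:
# RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.
t.add_(1)
```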

I am going to fix it by reintroducing torch.no_grad() for event_loop_overlap and event_loop_normal, since the critical issue is that the CPU engine is not compatible with torch.inference_mode().
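Roughly, the workaround would look like this (a simplified sketch; the real scheduler methods and their bodies differ):

```python
import torch

class Scheduler:
    """Simplified stand-in for the SGLang scheduler (illustrative only)."""

    # Keep the scheduler event loops under no_grad() so tensors touched by
    # gloo collectives stay ordinary tensors rather than inference tensors.
    @torch.no_grad()
    def event_loop_normal(self):
        ...

    @torch.no_grad()
    def event_loop_overlap(self):
        ...
```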

@Alcanderian
Collaborator Author

It looks like most of the issues have been resolved, but there are still some accuracy issues. How should we approach this situation? @zhyncs

Further work: create a smart_inference_mode decorator that automatically switches to torch.no_grad() when using the CPU backend (sketched below).
Additional suggestion: replace all torch.concat with torch.cat (torch.concat is just an alias for torch.cat).
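One possible shape for such a decorator (a sketch assuming a simple boolean switch; not the final implementation):

```python
from functools import wraps

import torch


def smart_inference_mode(use_inference_mode: bool = True):
    """Return a decorator that runs the wrapped function under
    torch.inference_mode() when enabled, falling back to torch.no_grad()
    otherwise (e.g. for the CPU backend)."""

    def decorator(func):
        ctx = torch.inference_mode if use_inference_mode else torch.no_grad

        @wraps(func)
        def wrapper(*args, **kwargs):
            with ctx():
                return func(*args, **kwargs)

        return wrapper

    return decorator
```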


zhyncs commented Mar 16, 2025

gentle ping @Alcanderian three gold bro, let's go :)

@zhyncs zhyncs merged commit 0212d2e into sgl-project:main Mar 17, 2025
2 of 18 checks passed
@@ -127,6 +128,63 @@ def is_cuda_available():
return is_cuda()


_ENABLE_TORCH_INFERENCE_MODE = os.getenv(
"SGLANG_ENABLE_TORCH_INFERENCE_MODE", "false"
).lower() in ("true", "1")
Contributor

Use this function instead:

def get_bool_env_var(name: str, default: str = "false") -> bool:
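Applying the suggestion, the new flag would reduce to something like this (a sketch; get_bool_env_var is assumed to live in the same utils module as the diff above):

```python
_ENABLE_TORCH_INFERENCE_MODE = get_bool_env_var(
    "SGLANG_ENABLE_TORCH_INFERENCE_MODE", "false"
)
```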

@Alcanderian Alcanderian deleted the ljx/dev/fix-no-grad-bug branch March 21, 2025 11:58