
Conversation


@Alcanderian Alcanderian commented Mar 13, 2025

Motivation

Fix issue #4366

Modifications

Replace all torch.no_grad() with torch.inference_mode()
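As an illustration of the change (a minimal sketch with hypothetical function names, not the actual SGLang call sites):

```python
import torch

# Before: torch.no_grad() disables gradient tracking, but autograd still
# performs view/version-counter bookkeeping on the tensors involved.
@torch.no_grad()
def forward_no_grad(model, batch):
    return model(batch)

# After: torch.inference_mode() additionally skips that bookkeeping; tensors
# created inside cannot be used in autograd later, which is fine for serving.
@torch.inference_mode()
def forward_inference(model, batch):
    return model(batch)
```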

Benchmark on H100

Conclusion: There is essentially no performance difference; the slight improvement after the change is likely within run-to-run fluctuation.

With CUDA Graph

Command: `python3 -m sglang.bench_one_batch --model-path Qwen/Qwen2.5-7B-Instruct --batch 32 --input-len 256 --output-len 32`

  • Before
Benchmark ...
Prefill. latency: 0.16847 s, throughput:  48626.79 token/s
Decode.  latency: 0.01420 s, throughput:   2253.15 token/s
Decode.  latency: 0.00777 s, throughput:   4117.87 token/s
Decode.  latency: 0.00763 s, throughput:   4194.57 token/s
Decode.  latency: 0.00760 s, throughput:   4209.70 token/s
Decode.  latency: 0.00758 s, throughput:   4223.74 token/s
Decode.  median latency: 0.00750 s, median throughput:   4269.28 token/s
Total. latency:  0.408 s, throughput:  22582.28 token/s
  • After
Benchmark ...
Prefill. latency: 0.16690 s, throughput:  49082.68 token/s
Decode.  latency: 0.00823 s, throughput:   3889.69 token/s
Decode.  latency: 0.00776 s, throughput:   4124.83 token/s
Decode.  latency: 0.00774 s, throughput:   4133.72 token/s
Decode.  latency: 0.00762 s, throughput:   4197.32 token/s
Decode.  latency: 0.00757 s, throughput:   4224.54 token/s
Decode.  median latency: 0.00749 s, median throughput:   4269.69 token/s
Total. latency:  0.401 s, throughput:  23004.95 token/s

Without CUDA Graph

Command: `python3 -m sglang.bench_one_batch --model-path Qwen/Qwen2.5-7B-Instruct --batch 32 --input-len 256 --output-len 32 --disable-cuda-graph`

  • Before
Benchmark ...
Prefill. latency: 0.16517 s, throughput:  49595.89 token/s
Decode.  latency: 0.01936 s, throughput:   1653.23 token/s
Decode.  latency: 0.01764 s, throughput:   1814.42 token/s
Decode.  latency: 0.01754 s, throughput:   1823.91 token/s
Decode.  latency: 0.01753 s, throughput:   1825.55 token/s
Decode.  latency: 0.01740 s, throughput:   1839.56 token/s
Decode.  median latency: 0.01740 s, median throughput:   1839.05 token/s
Total. latency:  0.707 s, throughput:  13027.06 token/s
  • After
Benchmark ...
Prefill. latency: 0.16725 s, throughput:  48981.85 token/s
Decode.  latency: 0.01692 s, throughput:   1890.79 token/s
Decode.  latency: 0.01692 s, throughput:   1890.98 token/s
Decode.  latency: 0.01685 s, throughput:   1899.65 token/s
Decode.  latency: 0.01678 s, throughput:   1907.42 token/s
Decode.  latency: 0.01678 s, throughput:   1907.56 token/s
Decode.  median latency: 0.01658 s, median throughput:   1930.19 token/s
Total. latency:  0.684 s, throughput:  13483.45 token/s

Checklist

@zhyncs zhyncs marked this pull request as ready for review March 13, 2025 08:48

zhyncs commented Mar 13, 2025

@Alcanderian Some unit tests failed, could you help fix that? Thanks!

@Alcanderian
Collaborator Author

Referring to the error in unit-test-backend-2-gpu:

File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 3440, in all_gather_into_tensor
    work.wait()
RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See 

It seems that the gloo backend has issues with inference mode in the current version of torch.

Ref: pytorch/pytorch#126032
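For reference, the underlying restriction can be reproduced in a few lines (a standalone repro of the error class, not the actual gloo all_gather path):

```python
import torch

with torch.inference_mode():
    t = torch.zeros(4)  # t is created as an "inference tensor"

# Any in-place update outside InferenceMode then raises:
# RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.
t.add_(1)
```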

I am going to fix it by reintroducing torch.no_grad() for event_loop_overlap and event_loop_normal, since the critical issue is that the CPU engine is not compatible with torch.inference_mode().
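Roughly, the workaround would look like this (a simplified sketch; the real scheduler methods and their bodies differ):

```python
import torch

class Scheduler:
    """Simplified stand-in for the SGLang scheduler (illustrative only)."""

    # Keep the scheduler event loops under no_grad() so tensors touched by
    # gloo collectives stay ordinary tensors rather than inference tensors.
    @torch.no_grad()
    def event_loop_normal(self):
        ...

    @torch.no_grad()
    def event_loop_overlap(self):
        ...
```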

@Alcanderian
Collaborator Author

It looks like most of the issues have been resolved, but there are still some accuracy issues. How should we approach this situation? @zhyncs

Further work: create a smart_inference_mode decorator that automatically switches to torch.no_grad() when using the CPU backend (sketched below).
Additional suggestion: replace all torch.concat with torch.cat (torch.concat is just an alias for torch.cat).
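One possible shape for such a decorator (a sketch assuming a simple boolean switch; not the final implementation):

```python
from functools import wraps

import torch


def smart_inference_mode(use_inference_mode: bool = True):
    """Return a decorator that runs the wrapped function under
    torch.inference_mode() when enabled, falling back to torch.no_grad()
    otherwise (e.g. for the CPU backend)."""

    def decorator(func):
        ctx = torch.inference_mode if use_inference_mode else torch.no_grad

        @wraps(func)
        def wrapper(*args, **kwargs):
            with ctx():
                return func(*args, **kwargs)

        return wrapper

    return decorator
```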


zhyncs commented Mar 16, 2025

gentle ping @Alcanderian three gold bro, let's go :)

@zhyncs zhyncs merged commit 0212d2e into sgl-project:main Mar 17, 2025
2 of 18 checks passed
@@ -127,6 +128,63 @@ def is_cuda_available():
return is_cuda()


_ENABLE_TORCH_INFERENCE_MODE = os.getenv(
"SGLANG_ENABLE_TORCH_INFERENCE_MODE", "false"
).lower() in ("true", "1")
Contributor

Use this function instead:

def get_bool_env_var(name: str, default: str = "false") -> bool:
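Applying the suggestion, the new flag would reduce to something like this (a sketch; get_bool_env_var is assumed to live in the same utils module as the diff above):

```python
_ENABLE_TORCH_INFERENCE_MODE = get_bool_env_var(
    "SGLANG_ENABLE_TORCH_INFERENCE_MODE", "false"
)
```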

@Alcanderian Alcanderian deleted the ljx/dev/fix-no-grad-bug branch March 21, 2025 11:58