
Conversation

@mickqian (Collaborator) commented Feb 18, 2025

Motivation

Enforce an upper-bound limit on the size of the VisionAttention attention-mask cache.

This change was originally included in #3203 and has been moved here.
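For context: without an upper bound, the mask cache can keep one entry per distinct request shape and grow without limit. Below is a minimal, illustrative sketch of a bounded, LRU-style mask cache; the class name, method names, the `max_entries` parameter, and the assumption that entries are keyed by sequence length are all hypothetical and not taken from the SGLang implementation.

```python
from collections import OrderedDict

import torch


class BoundedMaskCache:
    """Illustrative LRU-style cache that caps how many attention masks are stored."""

    def __init__(self, max_entries: int = 8):
        self.max_entries = max_entries
        self._cache: "OrderedDict[tuple, torch.Tensor]" = OrderedDict()

    def get_or_create(self, seq_len: int, device: torch.device) -> torch.Tensor:
        # Hypothetical key: sequence length plus device; the real cache key may differ.
        key = (seq_len, str(device))
        if key in self._cache:
            # Mark the entry as recently used so it is evicted last.
            self._cache.move_to_end(key)
            return self._cache[key]
        # Build a simple causal mask as a stand-in for the real vision attention mask.
        mask = torch.full((seq_len, seq_len), float("-inf"), device=device).triu(1)
        self._cache[key] = mask
        if len(self._cache) > self.max_entries:
            # Evict the least recently used mask to keep memory bounded.
            self._cache.popitem(last=False)
        return mask
```

With a cap like `max_entries=8`, memory stays bounded even when requests arrive with many distinct image or sequence shapes, instead of accumulating one mask per shape indefinitely.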

Modifications

Checklist

@mickqian (Collaborator, Author) commented Feb 18, 2025

ref #3651

@yizhang2077 self-assigned this Feb 19, 2025
@yizhang2077 self-requested a review Feb 19, 2025 06:26
@yizhang2077 (Collaborator) left a comment


LGTM. Please run a benchmark such as MMMU to verify accuracy, and build an OOM case to check whether this PR has solved the OOM problem. Thanks! cc @zhaochenyang20

@mickqian changed the title from "fix: apply cache size limit for VisionAttention" to "fix: apply cache size limit of attention mask for VisionAttention" on Feb 19, 2025
@zhaochenyang20 (Collaborator) commented

@yizhang2077 Will merge it after the CI.

@zhyncs merged commit 99c1b9d into sgl-project:main on Feb 19, 2025
17 of 19 checks passed
@Lzhang-hub (Contributor) commented

@mickqian I used the latest version to run the Qwen2.5-VL-7B model with the following command:

python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --port 8080  --chat-template qwen2-vl --chunked-prefill-size -1 --disable-radix-cache --mm-attention-backend fa3 --attention-backend fa3  --enable-torch-compile --cuda-graph-bs 80 --torch-compile-max-bs 80

then benchmarked the server with concurrency=80; after running for some time, the server hit an OOM error.
