
Conversation

@mickqian (Collaborator) commented Feb 18, 2025

Motivation

Enforce an upper-bound limit on the size of the VisionAttention attention-mask cache.

This change was originally included in #3203 and has been moved here.
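For context: without an upper bound, the mask cache can keep one entry per distinct request shape and grow without limit. Below is a minimal, illustrative sketch of a bounded, LRU-style mask cache; the class name, method names, the `max_entries` parameter, and the assumption that entries are keyed by sequence length are all hypothetical and not taken from the SGLang implementation.

```python
from collections import OrderedDict

import torch


class BoundedMaskCache:
    """Illustrative LRU-style cache that caps how many attention masks are stored."""

    def __init__(self, max_entries: int = 8):
        self.max_entries = max_entries
        self._cache: "OrderedDict[tuple, torch.Tensor]" = OrderedDict()

    def get_or_create(self, seq_len: int, device: torch.device) -> torch.Tensor:
        # Hypothetical key: sequence length plus device; the real cache key may differ.
        key = (seq_len, str(device))
        if key in self._cache:
            # Mark the entry as recently used so it is evicted last.
            self._cache.move_to_end(key)
            return self._cache[key]
        # Build a simple causal mask as a stand-in for the real vision attention mask.
        mask = torch.full((seq_len, seq_len), float("-inf"), device=device).triu(1)
        self._cache[key] = mask
        if len(self._cache) > self.max_entries:
            # Evict the least recently used mask to keep memory bounded.
            self._cache.popitem(last=False)
        return mask
```

With a cap like `max_entries=8`, memory stays bounded even when requests arrive with many distinct image or sequence shapes, instead of accumulating one mask per shape indefinitely.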

Modifications

Checklist

@mickqian (Collaborator, Author) commented Feb 18, 2025

ref #3651

@yizhang2077 self-assigned this Feb 19, 2025
@yizhang2077 self-requested a review Feb 19, 2025 06:26
@yizhang2077 (Collaborator) left a comment


LGTM. Please run a benchmark such as MMMU to verify accuracy, and build an OOM case to check whether this PR has solved the OOM problem. Thanks! cc @zhaochenyang20

@mickqian changed the title from "fix: apply cache size limit for VisionAttention" to "fix: apply cache size limit of attention mask for VisionAttention" on Feb 19, 2025
@zhaochenyang20 (Collaborator) commented

@yizhang2077 Will merge it after the CI.

@zhyncs merged commit 99c1b9d into sgl-project:main on Feb 19, 2025
17 of 19 checks passed
@Lzhang-hub (Contributor) commented

@mickqian I used the latest version to run the Qwen2.5-VL-7B model with the following command:

python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --host 0.0.0.0 --port 8080  --chat-template qwen2-vl --chunked-prefill-size -1 --disable-radix-cache --mm-attention-backend fa3 --attention-backend fa3  --enable-torch-compile --cuda-graph-bs 80 --torch-compile-max-bs 80

then benchmarked the server with concurrency=80; after running for some time, the server hit an OOM error.
