[PERF] Speed up of prepare_inputs / mrope #17617
Conversation
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@centml.ai>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Thanks for the PR! The change makes sense to me and I left a comment - cc @imkero
```diff
@@ -1404,6 +1405,7 @@ def get_next_input_positions(
         ]

     @staticmethod
     @functools.lru_cache(maxsize=1024)
```
Is there a particular reason why we set this to 1024?
> Is there a particular reason why we set this to 1024?

I guess this is expected to be large enough to catch as many output mrope positions as possible.
> Is there a particular reason why we set this to 1024?
>
> I guess this is expected to be large enough to catch as many output mrope positions as possible.

In fact we are caching a lot of 3x1 tensors like `[[32],[32],[32]]`, `[[33],[33],[33]]`. I will take a closer look at whether there is a better way to do this.
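To illustrate the point about many tiny cached entries, here is a minimal pure-Python stand-in (tuples instead of `torch` tensors; names are illustrative, not the vLLM code) showing how single-token decode steps fill an `lru_cache` with one distinct 3x1 entry per position:

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def next_input_positions(delta: int, context_len: int, seq_len: int):
    # Stand-in for the cached helper: the same position row repeated
    # for the 3 M-RoPE dims (temporal, height, width).
    row = tuple(range(delta + context_len, delta + seq_len))
    return (row, row, row)


# Each decode step advances context_len by one token, so every call
# uses a new cache key and stores another tiny 3x1 entry.
for step in range(32, 40):
    next_input_positions(0, step, step + 1)

info = next_input_positions.cache_info()
print(info.currsize, info.hits, info.misses)  # 8 distinct entries, no hits yet
```

Only when the exact `(delta, context_len, seq_len)` triple recurs (e.g. many sequences decoding at the same position) does the cache actually pay off.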
@ywang96 LGTM for those changes. And maybe please take a look at #16881? It optimizes the rest (new-request) part of MRoPE position preparation, and I will take …

Thanks @vadiklyutiy for reporting this with a benchmark.
I feel that those changes should not have much effect on the E2E time cost, as each op costs <0.1 ms... even x1000 that would bring only a 0.1 s difference. To make the benchmark reproducible across different runs, I suggest setting … (those settings require some modifications to fib).
Sorry, I mean 0.1 ms * 1000 = 0.1 s in total; maybe I'm still missing something... I am doing a more precise & reproducible benchmark with those suggestions applied.
Behaviour with the same image for 1000 requests and with different images for 1000 requests is a bit different. So, different images were used intentionally in this diff.
Alright, I forgot to multiply by the number of output tokens. Just benchmarking piecewise for 200,000 iterations.
Agree, this makes the results more stable.
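As a side note on stability: for sub-microsecond ops, a common pattern (illustrative, not from this PR) is to use `timeit.repeat` and take the minimum across runs, which suppresses scheduler and warm-up noise better than a single `timeit.timeit` call:

```python
import timeit

# Repeat the measurement several times and keep the best (least noisy) run.
times = timeit.repeat(lambda: sum(range(100)), number=200_000, repeat=5)
best = min(times)
print(f"best of 5: {best:.3f}s for 200k iterations")
```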
If numpy+numba speeds it up …
a simple piecewise benchmark script here:

```python
import timeit
from functools import lru_cache

import numba
import numpy as np
import torch


def mrope_get_next_input_positions_tensor(
    mrope_position_delta: int,
    context_len: int,
    seq_len: int,
) -> torch.Tensor:
    return torch.arange(
        mrope_position_delta + context_len,
        mrope_position_delta + seq_len,
    ).expand(3, -1)


@lru_cache(1024)
def mrope_get_next_input_positions_tensor_lru(
    mrope_position_delta: int,
    context_len: int,
    seq_len: int,
) -> torch.Tensor:
    return torch.arange(
        mrope_position_delta + context_len,
        mrope_position_delta + seq_len,
    ).expand(3, -1)


@numba.jit(nopython=True)
def mrope_assign_next_input_positions(
    out: np.ndarray,
    out_offset: int,
    mrope_position_delta: int,
    context_len: int,
    num_new_tokens: int,
):
    for dim in range(3):
        for idx in range(num_new_tokens):
            out[dim,
                out_offset + idx] = mrope_position_delta + context_len + idx


out = torch.empty(3, 1000, dtype=torch.int64)
out_np = out.numpy()  # they share the underlying data


def run_torch():
    out_offset = 5
    mrope_position_delta = 100
    context_len = 20
    seq_len = 21
    positions = mrope_get_next_input_positions_tensor(
        mrope_position_delta, context_len, seq_len)
    out[:, out_offset:out_offset + (seq_len - context_len)] = positions


def run_torch_lru():
    out_offset = 5
    mrope_position_delta = 100
    context_len = 20
    seq_len = 21
    positions = mrope_get_next_input_positions_tensor_lru(
        mrope_position_delta, context_len, seq_len)
    out[:, out_offset:out_offset + (seq_len - context_len)] = positions


def run_np():
    out_offset = 5
    mrope_position_delta = 100
    context_len = 20
    seq_len = 21
    mrope_assign_next_input_positions(
        out_np, out_offset, mrope_position_delta, context_len,
        seq_len - context_len)


# Warm up: JIT-compile the numba function and populate the lru cache.
run_torch()
run_torch_lru()
run_np()

r1 = timeit.timeit(run_torch, number=200000)
print(f"run_torch: {r1:.3f}s")
r2 = timeit.timeit(run_np, number=200000)
print(f"run_numba: {r2:.3f}s")
r3 = timeit.timeit(run_torch_lru, number=200000)
print(f"run_torch_lru: {r3:.3f}s")
```

And the result is here:
I have confirmed some speedup with this PR's changes, and it seems consistent with the piecewise benchmark result above (torch vs torch lru cached).
I tested with an H100; reqs/s on the H100 is around 30. So on an A10 the model itself takes longer.
Yes, I think for CPU overhead, focusing on the …
This pull request has merge conflicts that must be resolved before it can be merged.
Speed up `_prepare_inputs` for models with M-RoPE by caching constant tensors in the M-RoPE implementation.

A huge part of `_prepare_inputs` for Qwen2.5-VL-3B is work for M-RoPE. 1/3 of that time is spent creating small constant CPU tensors in `MRotaryEmbedding.get_next_input_positions_tensor()`. This PR adds a cache for these tensors to speed that up.

Performance results

Tested with Qwen2.5-VL-3B.

`_prepare_inputs` itself sped up by 35%.

E2E

For the workload, https://github.com/CentML/flexible-inference-bench was used. The command sends 1000 requests, at 50 reqs per second, with one 512x512 image per request.

Before: 31.93 reqs/s
After: 32.54 reqs/s
Speed up: 1.9%
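The core idea of the change can be sketched in plain Python (tuples stand in for the real `torch` tensors; the names here are illustrative, not the vLLM API). When many sequences decode at similar positions, repeated `(delta, context_len, seq_len)` keys hit the cache instead of rebuilding the position block:

```python
import timeit
from functools import lru_cache


def make_positions(delta: int, context_len: int, seq_len: int):
    # Uncached: build the 3 x (seq_len - context_len) position block each call.
    row = tuple(range(delta + context_len, delta + seq_len))
    return (row, row, row)


@lru_cache(maxsize=1024)
def make_positions_cached(delta: int, context_len: int, seq_len: int):
    # Cached: identical argument triples reuse the previously built block.
    row = tuple(range(delta + context_len, delta + seq_len))
    return (row, row, row)


# Decode-like workload: single-token steps over a recurring set of keys.
keys = [(100, c, c + 1) for c in range(20, 120)]

t_plain = timeit.timeit(lambda: [make_positions(*k) for k in keys], number=2000)
t_cached = timeit.timeit(lambda: [make_positions_cached(*k) for k in keys],
                         number=2000)
print(f"uncached: {t_plain:.3f}s  cached: {t_cached:.3f}s")
```

One caveat that applies to the real change as well: every cache hit returns the same object, so cached tensors must be treated as read-only by their consumers.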