[Misc][Benchmarking] Add variable request-rate ("ramp-up") to the benchmarking client. #19423
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Summary of Changes
Hello @dtransposed, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request enhances the benchmarking tools by adding support for ramping up the request rate during a benchmark run. This feature allows users to simulate increasing load scenarios, providing better insights into the system's performance under stress. The implementation includes new command-line arguments, dynamic request rate calculation, argument validation, and reporting of ramp-up details in the results.
Highlights
- Ramp-Up Request Rate: Introduces the ability to ramp up the request rate over the duration of a benchmark run in `benchmark_serving.py` and `vllm/benchmarks/serve.py`. This is useful for stress testing and finding the maximum throughput.
- New Arguments: Adds command-line arguments `--ramp-up-strategy` (`linear` or `exponential`), `--ramp-up-start-rps`, and `--ramp-up-end-rps` to control the ramp-up behavior.
- Dynamic Request Rate: Modifies the `get_request` async generator to calculate the current request rate dynamically based on the chosen ramp-up strategy and the number of requests already sent (see the sketch after this list).
- Argument Validation: Adds validation logic to ensure the ramp-up arguments are used correctly (e.g., `--request-rate` is not combined with ramp-up, and start/end RPS values are provided and valid).
- Output Reporting: Includes the ramp-up configuration and timestamped RPS change events in the benchmark's JSON output results.
- Documentation: Updates `benchmarks/README.md` to document the new ramp-up feature and its usage.
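To make the mechanism concrete, here is a minimal sketch of how a ramp-up rate could be computed inside an async request generator. It is illustrative only: the parameter names mirror the new CLI flags, while the interpolation formulas, the progress bookkeeping, and the Poisson-style pacing are assumptions rather than the PR's exact implementation.

```python
import asyncio

import numpy as np


# Minimal, hypothetical sketch of a ramp-up-aware request generator.
# `requests` is any sequence of benchmark requests; the rate is interpolated
# between the start and end RPS based on how many requests have been sent.
async def get_request(requests, ramp_up_strategy: str,
                      ramp_up_start_rps: float, ramp_up_end_rps: float):
    total = len(requests)
    for i, request in enumerate(requests):
        # Fraction of the benchmark completed so far, by requests sent.
        progress = i / max(total - 1, 1)
        if ramp_up_strategy == "linear":
            # Linear interpolation from start RPS to end RPS.
            rate = ramp_up_start_rps + progress * (ramp_up_end_rps - ramp_up_start_rps)
        else:
            # Exponential ramp; requires ramp_up_start_rps > 0.
            rate = ramp_up_start_rps * (ramp_up_end_rps / ramp_up_start_rps) ** progress
        yield request, rate
        # Poisson-style inter-arrival gap at the current rate.
        await asyncio.sleep(np.random.exponential(1.0 / rate))
```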
Code Review
This pull request introduces a ramp-up request rate feature to the benchmarking scripts. It allows users to specify a linear or exponential increase in requests per second (RPS) over the duration of a benchmark run. The changes include updates to the `get_request` and `benchmark` functions, argument parsing, validation for the new parameters, and documentation in the README. My suggestions focus on improving the clarity of the help text for the new CLI arguments and on code readability.
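For reference, below is a rough sketch of the kind of argument validation described above. The specific checks and error messages are assumptions (including the `float("inf")` default assumed for `--request-rate`), not the PR's exact code.

```python
import argparse


def validate_ramp_up_args(args: argparse.Namespace) -> None:
    """Illustrative validation of the ramp-up CLI arguments."""
    if args.ramp_up_strategy is None:
        return  # No ramp-up requested; nothing to validate.
    if args.request_rate != float("inf"):
        raise ValueError(
            "--request-rate cannot be combined with a ramp-up strategy; "
            "use --ramp-up-start-rps/--ramp-up-end-rps instead.")
    if args.ramp_up_start_rps is None or args.ramp_up_end_rps is None:
        raise ValueError(
            "--ramp-up-start-rps and --ramp-up-end-rps are required "
            "when --ramp-up-strategy is set.")
    if args.ramp_up_start_rps <= 0 or args.ramp_up_end_rps <= 0:
        raise ValueError("Ramp-up RPS values must be positive.")
    if args.ramp_up_start_rps > args.ramp_up_end_rps:
        raise ValueError(
            "--ramp-up-start-rps must not exceed --ramp-up-end-rps.")
```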
Thanks for the contribution! I think a ramp-up mode makes sense but I have left some comments. Please take a look!
Force-pushed from 125eacc to bc79e5c.
@ywang96 thanks for the swift review. Addressed all your comments.
Some final comments, otherwise LGTM.
This pull request has merge conflicts that must be resolved before it can be merged.
@ywang96 CI/CD seems to be stuck.
@ywang96 kind reminder.
@dtransposed Sorry for the late response - I've enabled CI!
@ywang96 looks like the failing tests are unrelated, WDYT?
Yeah, though we typically don't force merge. I've updated this branch with upstream main, so the CI failures should be resolved!
@ywang96 this is sadly not the case :(
Looks like HF was down - I retried the tests, so hopefully they all go through...
Purpose
The goal of this PR is to let the benchmark model more realistic request traffic.
The existing request-traffic scenarios (unthrottled traffic, constant request rate, burstiness) are useful, but with this "ramp-up" feature we can additionally increase the request rate gradually over the course of the benchmark.
This is especially useful in real-world scenarios where we want to find out how far we can push the request rate while staying under some predetermined latency budget.
Test
Judging by the Grafana metrics, I can see an exponentially increasing load on my service. In the resulting `.json` file, I additionally record when each RPS value was reached, so we can later match RPS with, e.g., E2E latency values in Grafana and determine the maximum RPS for a given latency budget.
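As an illustration of how those timestamped RPS events could be consumed, here is a hypothetical post-processing snippet. The JSON field names (`ramp_up_events`, `timestamp`, `rps`) and the latency-lookup callback are assumptions, not the benchmark's actual output schema.

```python
import json


def max_rps_under_budget(results_path: str, latency_at, budget_s: float):
    """Return the highest recorded RPS whose E2E latency stays under budget.

    `latency_at` is a caller-supplied lookup (timestamp -> E2E latency in
    seconds), e.g. backed by a Grafana/Prometheus query.
    """
    with open(results_path) as f:
        results = json.load(f)
    best_rps = None
    for event in results.get("ramp_up_events", []):
        # Pair each recorded RPS change with the externally measured latency.
        if latency_at(event["timestamp"]) <= budget_s:
            best_rps = max(best_rps or 0.0, event["rps"])
    return best_rps
```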