[Perf][Frontend] Cached resolution for resolving chat templates #20065
Conversation
Summary of Changes
Hello @ilyal-cerebras, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly improves the performance of chat template resolution. By introducing a caching mechanism for the `_detect_content_format` function, it eliminates the overhead of repeatedly loading and parsing Jinja templates, which is particularly beneficial in high-throughput scenarios where the same templates are frequently requested. This optimization has demonstrated a notable reduction in processing delays during benchmarks.
Highlights
- Performance Optimization: Applied `functools.lru_cache` with a `maxsize` of 32 to the `_detect_content_format` function in `vllm/entrypoints/chat_utils.py`. This change prevents redundant loading and parsing of Jinja templates for frequently used chat templates, addressing a reported 6-8 second delay in throughput benchmarks.
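The caching approach described above can be sketched as follows. This is a simplified stand-in, not vLLM's actual `_detect_content_format` implementation: the detection logic here is hypothetical, and only the `functools.lru_cache` wrapping mirrors the change in the PR.

```python
# Sketch of the change: wrap a pure template-inspection function in
# functools.lru_cache so repeated requests with the same chat template
# skip re-parsing. The body below is a hypothetical stand-in for the
# real Jinja-based detection in vllm/entrypoints/chat_utils.py.
from functools import lru_cache


@lru_cache(maxsize=32)
def detect_content_format(chat_template: str) -> str:
    # Stand-in heuristic: a real implementation would parse the Jinja
    # template and inspect how message content is accessed.
    return "openai" if "content[" in chat_template else "string"


# The first call computes; identical later calls are cache hits.
detect_content_format("{{ messages }}")
detect_content_format("{{ messages }}")
print(detect_content_format.cache_info())  # hits=1, misses=1
```

Because `lru_cache` keys on the argument values, this only works if the template string passed in is identical across requests, which is the common case when clients do not override the default chat template.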
Code Review
The pull request introduces an `lru_cache` to improve the performance of the `_detect_content_format` function, which is called frequently during prompt processing. The cache should reduce the delay caused by repeatedly loading and iterating over Jinja templates. The change is straightforward and well justified by the performance improvement reported in the pull request description.
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Signed-off-by: Ilya Lavrenov <ilya.lavrenov@cerebras.net>
Force-pushed from 4e6b6b9 to 9cb99ce
Hi @simon-mo
This seems neat. Thanks for doing this.
…roject#20065) Signed-off-by: Ilya Lavrenov <ilya.lavrenov@cerebras.net>
…roject#20065) Signed-off-by: Ilya Lavrenov <ilya.lavrenov@cerebras.net> Signed-off-by: Will Eaton <weaton@redhat.com>
…roject#20065) Signed-off-by: Ilya Lavrenov <ilya.lavrenov@cerebras.net> Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.

Purpose
This PR is intended to fix performance for repetitive calls of `resolve_chat_template_content_format`, which is currently called on each prompt to `v1/chat/completions`.

Internally, this function uses `_detect_content_format`, which loads the Jinja template and iterates over it. It turns out that in throughput benchmark mode, when we send 1000 requests to the OpenAI server, it internally calls this function 1000 times, resulting in a 6-8 second delay before prompts actually start being processed.

As a solution, we can simply cache invocations of the `_detect_content_format` function, since chat templates rarely change from request to request. In the simple case, when users don't override the chat template and the default one is used, we call `_detect_content_format` only once.

Test Plan
Not required
Test Result
6-8 seconds improvement on tput benchmark with 1000 prompts on meta-llama/Llama-3.1-8B-Instruct
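To make the reported delay concrete, here is a hypothetical micro-benchmark (not vLLM code): a stand-in "parse" function with an artificial 10 ms cost is called once per simulated request, which is roughly the per-request cost pattern this PR removes. The function name and the 10 ms figure are illustrative assumptions.

```python
# Hypothetical micro-benchmark: a stand-in for per-request template
# detection with an artificial 10 ms parse cost. With lru_cache, only
# the first of 1000 identical calls pays that cost.
import time
from functools import lru_cache


@lru_cache(maxsize=32)
def parse_template(src: str) -> int:
    time.sleep(0.01)  # stand-in for loading and iterating a Jinja template
    return len(src)


start = time.perf_counter()
for _ in range(1000):  # one call per simulated request
    parse_template("{{ messages }}")
elapsed = time.perf_counter() - start

# Uncached, 1000 calls would cost roughly 10 s; cached, only the first
# call actually runs the parse body.
print(f"{elapsed:.3f}s", parse_template.cache_info())
```

This mirrors the benchmark observation in the PR description: the repeated cost scales with the number of requests, while the cached version pays it once per distinct template.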
(Optional) Documentation Update