
Conversation

@yaochengji (Collaborator) commented Jun 25, 2025

Purpose

Fix a bug in head_size padding for TPU. Previously, the number of KV cache blocks (block_num) was only adjusted inside get_kv_cache_shape; that is not enough, since the corresponding value in KVCacheConfig needs to be updated as well.
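For illustration, here is a minimal sketch of the idea (the helper name adjust_tpu_kv_cache_bytes and the TPU_HEAD_SIZE_ALIGNMENT constant are assumptions for this example, not the exact code in determine_available_memory): when head_size gets padded, the usable KV cache byte budget handed to KVCacheConfig is scaled down by head_size / padded_head_size so the padded cache still fits in device memory.

```python
import logging

logger = logging.getLogger(__name__)

# Assumed TPU-friendly alignment for head_size; the real constant may differ.
TPU_HEAD_SIZE_ALIGNMENT = 128


def adjust_tpu_kv_cache_bytes(tpu_kv_cache_bytes: int, head_size: int) -> int:
    """Shrink the usable KV cache bytes when head_size has to be padded."""
    padded_head_size = ((head_size + TPU_HEAD_SIZE_ALIGNMENT - 1)
                        // TPU_HEAD_SIZE_ALIGNMENT * TPU_HEAD_SIZE_ALIGNMENT)
    if padded_head_size != head_size:
        logger.warning("head_size %d is padded to %d on TPU; reducing the "
                       "usable KV cache memory accordingly.", head_size,
                       padded_head_size)
        # Each cache entry grows by padded_head_size / head_size, so the byte
        # budget must shrink by the inverse ratio to stay within device memory.
        tpu_kv_cache_bytes = int(tpu_kv_cache_bytes * head_size /
                                 padded_head_size)
    return tpu_kv_cache_bytes
```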

Test Plan

vllm serve microsoft/Phi-3-mini-128k-instruct --seed 42 --disable-log-requests --gpu-memory-utilization 0.95 --max-num-batched-tokens 2048 --max-num-seqs 128 --tensor-parallel-size 1 --max-model-len 2048 --no-enable-prefix-caching
python3 ./benchmarks/benchmark_serving.py --model microsoft/Phi-3-mini-128k-instruct --dataset-name sonnet --dataset-path benchmarks/sonnet_4x.txt --sonnet-input-len 1800 --sonnet-output-len 128 --ignore_eos

Test Result

Passed. (Without this patch, the server crashes.)

Signed-off-by: Chengji Yao <chengjiyao@google.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @yaochengji, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug in vLLM's TPU backend where the KV cache memory allocation did not correctly account for head size padding. Previously, this oversight could lead to out-of-memory (OOM) errors. The changes ensure that the usable memory for the KV cache is appropriately adjusted based on the padded head size, thereby preventing crashes and improving stability on TPU.

Highlights

  • TPU KV Cache Memory Bugfix: Fixed a critical bug in vLLM's TPU backend where the KV cache memory allocation did not correctly account for head_size padding. This oversight previously led to out-of-memory (OOM) errors, which are now prevented by ensuring sufficient memory is reserved.
  • Refactored KV Cache Shape Calculation: The get_kv_cache_shape function in vllm/v1/attention/backends/pallas.py was simplified. It now solely returns the padded head size and no longer attempts to adjust num_blocks or log warnings, as this logic was insufficient and has been relocated (see the sketch after this list).
  • Corrected KV Cache Memory Sizing: In vllm/v1/worker/tpu_worker.py, the determine_available_memory function now explicitly calculates the padded_head_size and proportionally reduces the tpu_kv_cache_bytes (usable memory for KV cache) if padding occurs. This ensures sufficient memory is reserved to accommodate the padded head size, preventing OOMs. The warning for head size padding is also now issued from this location.
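As a rough sketch of what the simplified get_kv_cache_shape might look like after this refactor (the alignment constant and the exact cache layout below are assumptions for illustration, not a verbatim copy of pallas.py):

```python
# Assumed alignment and layout; pallas.py may use different constants/ordering.
TPU_HEAD_SIZE_ALIGNMENT = 128


def get_kv_cache_shape(num_blocks: int, block_size: int, num_kv_heads: int,
                       head_size: int) -> tuple[int, int, int, int]:
    # Only pad head_size here; num_blocks is left untouched, because the
    # memory accounting now happens in determine_available_memory instead.
    padded_head_size = ((head_size + TPU_HEAD_SIZE_ALIGNMENT - 1)
                        // TPU_HEAD_SIZE_ALIGNMENT * TPU_HEAD_SIZE_ALIGNMENT)
    return (num_blocks, block_size, num_kv_heads * 2, padded_head_size)
```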
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the v1 and tpu (Related to Google TPUs) labels Jun 25, 2025
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

The pull request addresses a bug related to KV cache padding on TPUs. The changes involve modifying the get_kv_cache_shape function in vllm/v1/attention/backends/pallas.py and adjusting the usable memory size calculation in vllm/v1/worker/tpu_worker.py to account for padded head sizes. The changes look correct and necessary for the proper functioning of the code on TPUs.

@mgoin mgoin (Member) left a comment

LGTM, thanks for the fix

@mgoin mgoin enabled auto-merge (squash) June 25, 2025 17:03
@github-actions github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jun 25, 2025
@vanbasten23 (Collaborator) commented:

@yaochengji, the PR description mentions "we still need to modify the value in KVCacheConfig as well.", but I don't see it changed in the PR. Do you mean modifying determine_available_memory to account for the padding in head_size?

@yaochengji (Collaborator, Author) replied:

Yes.

@mgoin mgoin merged commit 2cc2069 into vllm-project:main Jun 25, 2025
81 checks passed
m-misiura pushed a commit to m-misiura/vllm that referenced this pull request Jun 26, 2025
Signed-off-by: Chengji Yao <chengjiyao@google.com>
gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Jun 26, 2025
Signed-off-by: Chengji Yao <chengjiyao@google.com>
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 30, 2025
Signed-off-by: Chengji Yao <chengjiyao@google.com>
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Signed-off-by: Will Eaton <weaton@redhat.com>
wseaton pushed a commit to wseaton/vllm that referenced this pull request Jun 30, 2025
Signed-off-by: Chengji Yao <chengjiyao@google.com>
wwl2755-google pushed a commit to wwl2755-google/vllm that referenced this pull request Jul 1, 2025
Signed-off-by: Chengji Yao <chengjiyao@google.com>
avigny pushed a commit to avigny/vllm that referenced this pull request Jul 31, 2025
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
googlercolin pushed a commit to googlercolin/vllm that referenced this pull request Aug 29, 2025
Signed-off-by: Chengji Yao <chengjiyao@google.com>
Labels
ready (ONLY add when PR is ready to merge/full CI is needed), tpu (Related to Google TPUs), v1
3 participants