-
-
Notifications
You must be signed in to change notification settings - Fork 10k
[BugFix] Fix ray import error mem cleanup bug #21381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BugFix] Fix ray import error mem cleanup bug #21381
Conversation
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively resolves a memory leak caused by holding a reference to an ImportError
exception object, which in turn held a reference to a stack frame, preventing garbage collection. The fix, which involves storing the string representation of the exception instead of the object itself, is a clean and correct approach to break this reference cycle.
The necessary adjustments to the raise
statements that use this error information have been handled correctly by removing the from
clause and incorporating the error message into the new exception's text. The changes are well-contained and directly address the reported bug. Overall, this is a solid improvement to the codebase's stability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @joerunde @tjohnson31415!
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
@joerunde could you rebase 🙏 |
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: qizixi <qizixi@meta.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: avigny <47987522+avigny@users.noreply.github.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: shuw <shuw@nvidia.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: x22x22 <wadeking@qq.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Jinzhen Lin <linjinzhen@hotmail.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Paul Pak <paulpak58@gmail.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Diego-Castan <diego.castan@ibm.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Signed-off-by: Joe Runde <Joseph.Runde@ibm.com> Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
This is a duplicate of #21149 since @tjohnson31415 is now OOO for a while
Purpose
Fix a bug that prevents an LLM() instance from being garbage collected after deletion. This could result in unexpected memory usage or an EADDRINUSE error when attempting to create a second LLM() in the same process.
Test Plan
This is a simple script to reproduce the problem:
If Ray is installed, then I get BEFORE refcount 1 and the weakllm loses its referent. If Ray is not installed, I instead get:
Test Result
After applying the fix from this PR, the refcount is only 1 even with ray uninstalled, so the llm resources get garbage collected as desired.