Skip to content

Conversation

RunDevelopment
Copy link
Member

I noticed that reloading the frontend would sometimes cause the host server to not respond anymore even though the worker was fine. The cause of this issue was a resource leak with our _session client session. Turns out, everything an aiohttp.ClientSession returns is a resource that must be disposed of. We were not doing that in a few methods, and aiohttp doesn't do auto-clean up apparently, so after a few leaks the entire backend host server would become unresponsive as _session would always timeout.

Leaks were hard to reproduce btw. I had to do 5 leaky requests to the worker's /nodes in parallel for to consistently trigger the bug.

The fix was to properly manage resource everywhere. I also improved wait_for_ready to not spam the worker with requests by (1) remembering whether we successfully connected to it before and (2) making processing concurrent calls to wait_for_ready sequentially which benefits from (1).

Copy link
Member

@joeyballentine joeyballentine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this might fix the weird stuff I was seeing the past few days

@joeyballentine joeyballentine merged commit 5b4fe11 into chaiNNer-org:main Mar 12, 2024
@RunDevelopment RunDevelopment deleted the fix-backend-resource-leak branch March 12, 2024 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants