-
Notifications
You must be signed in to change notification settings - Fork 441
Don't fail if multiprocessing is already set to spawn; fix issues from lack of cleanup #1050
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I'm hitting this issue on #1016, but this change doesn't resolve it.
|
Somebody is setting the MP default to |
Python multiprocessing does not allow th change the default start method when another code path has already configured a default start method. Even `get_start_method` or `get_context` sets a default context. I was running into this exception when doing training with some hardware accelerator frameworks like Habana. ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method raise RuntimeError('context has already been set') RuntimeError: context has already been set ``` Signed-off-by: Christian Heimes <cheimes@redhat.com>
Signed-off-by: Christian Heimes <cheimes@redhat.com>
I have updated the PR to set the start method even earlier. |
``` src/instructlab/lab.py:34:25: W1309: Using an f-string that does not have any interpolated variables (f-string-without-interpolation) src/instructlab/lab.py:779:12: W0621: Redefining name 'e' from outer scope (line 32) (redefined-outer-name) ``` Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Christian Heimes <cheimes@redhat.com> Signed-off-by: Russell Bryant <rbryant@redhat.com>
The functional test failure in the 3.9 test job looks related to the change |
Signed-off-by: Christian Heimes <cheimes@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
I pulled this into #1016. The e2e job passed, but the functional test job failed on one instance (but passed the others). The error does look related, though.
|
I marked this as a P0 blocker since it seems to break the core workflow on Linux, though the scope of the impact isn't totally clear. |
Signed-off-by: Christian Heimes <cheimes@redhat.com>
looks we hit that same error again
|
Apparently a context manager is not good enough to clean up a pool. One has to call Now most test runs are passing. One is having trouble with leaked resources:
|
The tqdm docs say closing is "highly recommended" but not required if you update() it to 100%, which we do. It still seems like a good idea and may fix a "leaked semaphore" error we're seeing in CI. Signed-off-by: Russell Bryant <rbryant@redhat.com>
Some searching around indicated that |
dang, the 3.9 job hit the issue -- the python version that hits it is not consistent. It was 3.11 last time.
|
We create a Queue to allow the child serve process to send expcetions back to the parent, but neither process explicitly closed the queue. It's a good thing to do and may be related to a "leaked semaphore" error we are seeing in some CI jobs. Signed-off-by: Russell Bryant <rbryant@redhat.com>
next up, the multiprocessing Queue() we used between chat and generate with the automatically spawned |
tests all green now! Edit: I told the tests workflow to run again, just to get one more check Edit 2: green through a 2nd pass, as well |
There was a todo to try switching back to the default on Linux. We've fixed some other bugs here, so let's see if switching back is good now. Signed-off-by: Russell Bryant <rbryant@redhat.com>
Ship it! 👍 |
I'm good with it, too. I'll see if I can find someone else to review it since we both have changes included. |
server With the help of instructlab#1050 we have a working implementation of multiprocessing. We can then add again the 'spawn' method when running the temporary server. Signed-off-by: Sébastien Han <seb@redhat.com>
Changes
Description of your changes:
Python multiprocessing does not allow th change the default start method when another code path has already configured a default start method. Even
get_start_method
orget_context
sets a default context.I was running into this exception when doing training with some hardware accelerator frameworks like Habana.
After fixing the issue above, we were still seeing multiprocessing-related bugs exposed via CI. We went through and made several other changes ensuring that resources were cleaned up appropriately. Those fixes are included, as well.