Don't fail if multiprocessing is already set to spawn; fix issues from lack of cleanup #1050

tiran · 2024-04-30T15:48:49Z

Changes

Description of your changes:

Python multiprocessing does not allow th change the default start method when another code path has already configured a default start method. Even get_start_method or get_context sets a default context.

I was running into this exception when doing training with some hardware accelerator frameworks like Habana.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

After fixing the issue above, we were still seeing multiprocessing-related bugs exposed via CI. We went through and made several other changes ensuring that resources were cleaned up appropriately. Those fixes are included, as well.

russellb · 2024-05-01T06:39:25Z

I'm hitting this issue on #1016, but this change doesn't resolve it.

Traceback (most recent call last):
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/train/linux_train.py", line 30, in <module>
    torch.multiprocessing.set_start_method(DEFAULT_MULTIPROCESSING_START_METHOD)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/instructlab/instructlab/venv/bin/ilab", line 8, in <module>
    sys.exit(cli())
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/lab.py", line 951, in train
    from .train.linux_train import linux_train
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/train/linux_train.py", line 34, in <module>
    raise ValueError(
ValueError: multiprocessing start method already set to fork.

tiran · 2024-05-01T07:04:55Z

Somebody is setting the MP default to fork before we have a change to set it to spawn. How annoying...

Python multiprocessing does not allow th change the default start method when another code path has already configured a default start method. Even `get_start_method` or `get_context` sets a default context. I was running into this exception when doing training with some hardware accelerator frameworks like Habana. ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method raise RuntimeError('context has already been set') RuntimeError: context has already been set ``` Signed-off-by: Christian Heimes <cheimes@redhat.com>

Signed-off-by: Christian Heimes <cheimes@redhat.com>

tiran · 2024-05-01T07:10:39Z

I have updated the PR to set the start method even earlier.

``` src/instructlab/lab.py:34:25: W1309: Using an f-string that does not have any interpolated variables (f-string-without-interpolation) src/instructlab/lab.py:779:12: W0621: Redefining name 'e' from outer scope (line 32) (redefined-outer-name) ``` Signed-off-by: Russell Bryant <rbryant@redhat.com>

Signed-off-by: Christian Heimes <cheimes@redhat.com> Signed-off-by: Russell Bryant <rbryant@redhat.com>

russellb · 2024-05-01T08:09:06Z

The functional test failure in the 3.9 test job looks related to the change

https://github.com/instructlab/instructlab/actions/runs/8906864876/job/24459709765?pr=1050#step:7:817

Signed-off-by: Christian Heimes <cheimes@redhat.com>

Signed-off-by: Russell Bryant <rbryant@redhat.com>

russellb · 2024-05-01T13:28:24Z

I pulled this into #1016. The e2e job passed, but the functional test job failed on one instance (but passed the others). The error does look related, though.

https://github.com/instructlab/instructlab/actions/runs/8909614961/job/24467281678?pr=1016#step:7:804

  +2024-05-01 13:11:21 ./scripts/functional-tests.sh:1: test_generate(): cleanup 124
  /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  Error occurred in function: test_generate with exit code: 124
    warnings.warn('resource_tracker: There appear to be %d '

russellb · 2024-05-01T14:22:35Z

I marked this as a P0 blocker since it seems to break the core workflow on Linux, though the scope of the impact isn't totally clear.

src/instructlab/server.py

Signed-off-by: Christian Heimes <cheimes@redhat.com>

russellb · 2024-05-02T16:30:22Z

looks we hit that same error again

  +2024-05-02 16:13:18 ./scripts/functional-tests.sh:1: test_generate(): cleanup 124
  Error occurred in function: test_generate with exit code: 124
  /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '

https://github.com/instructlab/instructlab/actions/runs/8926843905/job/24518839487?pr=1050#step:7:781

tiran · 2024-05-02T16:32:16Z

Apparently a context manager is not good enough to clean up a pool. One has to call pool.join(), too. The API has sharp edges...

Now most test runs are passing. One is having trouble with leaked resources:

/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown

The tqdm docs say closing is "highly recommended" but not required if you update() it to 100%, which we do. It still seems like a good idea and may fix a "leaked semaphore" error we're seeing in CI. Signed-off-by: Russell Bryant <rbryant@redhat.com>

russellb · 2024-05-02T17:25:46Z

Some searching around indicated that tqdm could be the source of our problem. We didn't explicitly close() the progress bar, so let's see if that was the culprit.

russellb · 2024-05-02T17:33:20Z

dang, the 3.9 job hit the issue -- the python version that hits it is not consistent. It was 3.11 last time.

  +2024-05-02 17:31:32 ./scripts/functional-tests.sh:1: test_generate(): cleanup 124
  /opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
  +2024-05-02 17:31:32 ./scripts/functional-tests.sh:25: cleanup(): set +e
  Error occurred in function: test_generate with exit code: 124

We create a Queue to allow the child serve process to send expcetions back to the parent, but neither process explicitly closed the queue. It's a good thing to do and may be related to a "leaked semaphore" error we are seeing in some CI jobs. Signed-off-by: Russell Bryant <rbryant@redhat.com>

russellb · 2024-05-02T17:55:59Z

next up, the multiprocessing Queue() we used between chat and generate with the automatically spawned serve process was not being closed and joined. Let's see if that's it!

russellb · 2024-05-02T18:05:31Z

tests all green now!

Edit: I told the tests workflow to run again, just to get one more check

Edit 2: green through a 2nd pass, as well

src/instructlab/server.py

There was a todo to try switching back to the default on Linux. We've fixed some other bugs here, so let's see if switching back is good now. Signed-off-by: Russell Bryant <rbryant@redhat.com>

tiran · 2024-05-02T19:52:50Z

Ship it! 👍

russellb · 2024-05-02T20:26:34Z

Ship it! 👍

I'm good with it, too. I'll see if I can find someone else to review it since we both have changes included.

server With the help of instructlab#1050 we have a working implementation of multiprocessing. We can then add again the 'spawn' method when running the temporary server. Signed-off-by: Sébastien Han <seb@redhat.com>

tiran mentioned this pull request May 1, 2024

Introduce an end-to-end CI job on Linux #1016

Merged

3 tasks

tiran added 2 commits May 1, 2024 09:09

Set MP start method even earlier

916c982

Signed-off-by: Christian Heimes <cheimes@redhat.com>

tiran force-pushed the mp-start-method branch from a7c7fdc to 916c982 Compare May 1, 2024 07:10

russellb pushed a commit to stefwalter/instructlab that referenced this pull request May 1, 2024

TEST: combined changes from PR instructlab#1050

bb145c5

Signed-off-by: Christian Heimes <cheimes@redhat.com> Signed-off-by: Russell Bryant <rbryant@redhat.com>

Force spawn MP context, don't use daemonic process

f5e3ede

Signed-off-by: Christian Heimes <cheimes@redhat.com>

tiran force-pushed the mp-start-method branch from 5bcf7e9 to f5e3ede Compare May 1, 2024 11:41

Remove an unused import

df5e0af

Signed-off-by: Russell Bryant <rbryant@redhat.com>

leseb reviewed May 2, 2024

View reviewed changes

src/instructlab/server.py Outdated Show resolved Hide resolved

multiprocessing Pool needs pool.join

88effce

Signed-off-by: Christian Heimes <cheimes@redhat.com>

russellb force-pushed the mp-start-method branch from fe89fdd to 6b16c37 Compare May 2, 2024 17:54

tiran commented May 2, 2024

View reviewed changes

src/instructlab/server.py Outdated Show resolved Hide resolved

serve: Attempt to use default multiprocessing method

78c9fb2

There was a todo to try switching back to the default on Linux. We've fixed some other bugs here, so let's see if switching back is good now. Signed-off-by: Russell Bryant <rbryant@redhat.com>

russellb force-pushed the mp-start-method branch from 6b8b5a3 to 78c9fb2 Compare May 2, 2024 19:12

russellb changed the title ~~Don't fail if multiprocessing is already set to spawn~~ Don't fail if multiprocessing is already set to spawn; fix issues from lack of cleanup May 2, 2024

leseb mentioned this pull request May 3, 2024

wip: re-use spawn for temp server #979

Closed

russellb added this to the Release - 5/3 milestone May 3, 2024

cdoern requested review from alimaredia and cdoern May 3, 2024 14:17

russellb mentioned this pull request May 3, 2024

ilab generate fails when automatically connecting to server hosting model #1062

Closed

alimaredia approved these changes May 3, 2024

View reviewed changes

hickeyma approved these changes May 3, 2024

View reviewed changes

hickeyma merged commit 776deb0 into instructlab:main May 3, 2024

leseb mentioned this pull request May 6, 2024

fix: re-add 'spawn' method for the multiprocessing task on the temp server #1118

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Don't fail if multiprocessing is already set to spawn; fix issues from lack of cleanup #1050

Don't fail if multiprocessing is already set to spawn; fix issues from lack of cleanup #1050

Uh oh!

tiran commented Apr 30, 2024 •

edited by russellb

Loading

Uh oh!

russellb commented May 1, 2024

Uh oh!

tiran commented May 1, 2024

Uh oh!

tiran commented May 1, 2024

Uh oh!

russellb commented May 1, 2024

Uh oh!

russellb commented May 1, 2024 •

edited

Loading

Uh oh!

russellb commented May 1, 2024

Uh oh!

Uh oh!

russellb commented May 2, 2024

Uh oh!

tiran commented May 2, 2024

Uh oh!

russellb commented May 2, 2024

Uh oh!

russellb commented May 2, 2024

Uh oh!

russellb commented May 2, 2024

Uh oh!

russellb commented May 2, 2024 •

edited

Loading

Uh oh!

Uh oh!

tiran commented May 2, 2024

Uh oh!

russellb commented May 2, 2024

Uh oh!

Uh oh!

Don't fail if multiprocessing is already set to spawn; fix issues from lack of cleanup #1050

Don't fail if multiprocessing is already set to spawn; fix issues from lack of cleanup #1050

Uh oh!

Conversation

tiran commented Apr 30, 2024 • edited by russellb Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes

Uh oh!

russellb commented May 1, 2024

Uh oh!

tiran commented May 1, 2024

Uh oh!

tiran commented May 1, 2024

Uh oh!

russellb commented May 1, 2024

Uh oh!

russellb commented May 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

russellb commented May 1, 2024

Uh oh!

Uh oh!

russellb commented May 2, 2024

Uh oh!

tiran commented May 2, 2024

Uh oh!

russellb commented May 2, 2024

Uh oh!

russellb commented May 2, 2024

Uh oh!

russellb commented May 2, 2024

Uh oh!

russellb commented May 2, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

tiran commented May 2, 2024

Uh oh!

russellb commented May 2, 2024

Uh oh!

Uh oh!

tiran commented Apr 30, 2024 •

edited by russellb

Loading

russellb commented May 1, 2024 •

edited

Loading

russellb commented May 2, 2024 •

edited

Loading