Skip to content

Conversation

tiran
Copy link
Contributor

@tiran tiran commented Apr 30, 2024

Changes

Description of your changes:

Python multiprocessing does not allow th change the default start method when another code path has already configured a default start method. Even get_start_method or get_context sets a default context.

I was running into this exception when doing training with some hardware accelerator frameworks like Habana.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

After fixing the issue above, we were still seeing multiprocessing-related bugs exposed via CI. We went through and made several other changes ensuring that resources were cleaned up appropriately. Those fixes are included, as well.

@russellb
Copy link
Member

russellb commented May 1, 2024

I'm hitting this issue on #1016, but this change doesn't resolve it.

Traceback (most recent call last):
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/train/linux_train.py", line 30, in <module>
    torch.multiprocessing.set_start_method(DEFAULT_MULTIPROCESSING_START_METHOD)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/instructlab/instructlab/venv/bin/ilab", line 8, in <module>
    sys.exit(cli())
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/lab.py", line 951, in train
    from .train.linux_train import linux_train
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/train/linux_train.py", line 34, in <module>
    raise ValueError(
ValueError: multiprocessing start method already set to fork.

@tiran
Copy link
Contributor Author

tiran commented May 1, 2024

Somebody is setting the MP default to fork before we have a change to set it to spawn. How annoying...

tiran added 2 commits May 1, 2024 09:09
Python multiprocessing does not allow th change the default start method
when another code path has already configured a default start method.
Even `get_start_method` or `get_context` sets a default context.

I was running into this exception when doing training with some hardware
accelerator frameworks like Habana.

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set
```

Signed-off-by: Christian Heimes <cheimes@redhat.com>
Signed-off-by: Christian Heimes <cheimes@redhat.com>
@tiran tiran force-pushed the mp-start-method branch from a7c7fdc to 916c982 Compare May 1, 2024 07:10
@tiran
Copy link
Contributor Author

tiran commented May 1, 2024

I have updated the PR to set the start method even earlier.

```
src/instructlab/lab.py:34:25: W1309: Using an f-string that does not have any interpolated variables (f-string-without-interpolation)
src/instructlab/lab.py:779:12: W0621: Redefining name 'e' from outer scope (line 32) (redefined-outer-name)
```

Signed-off-by: Russell Bryant <rbryant@redhat.com>
russellb pushed a commit to stefwalter/instructlab that referenced this pull request May 1, 2024
Signed-off-by: Christian Heimes <cheimes@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
@russellb
Copy link
Member

russellb commented May 1, 2024

The functional test failure in the 3.9 test job looks related to the change

https://github.com/instructlab/instructlab/actions/runs/8906864876/job/24459709765?pr=1050#step:7:817

Signed-off-by: Christian Heimes <cheimes@redhat.com>
@tiran tiran force-pushed the mp-start-method branch from 5bcf7e9 to f5e3ede Compare May 1, 2024 11:41
Signed-off-by: Russell Bryant <rbryant@redhat.com>
@russellb
Copy link
Member

russellb commented May 1, 2024

I pulled this into #1016. The e2e job passed, but the functional test job failed on one instance (but passed the others). The error does look related, though.

https://github.com/instructlab/instructlab/actions/runs/8909614961/job/24467281678?pr=1016#step:7:804

  +2024-05-01 13:11:21 ./scripts/functional-tests.sh:1: test_generate(): cleanup 124
  /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  Error occurred in function: test_generate with exit code: 124
    warnings.warn('resource_tracker: There appear to be %d '

@russellb
Copy link
Member

russellb commented May 1, 2024

I marked this as a P0 blocker since it seems to break the core workflow on Linux, though the scope of the impact isn't totally clear.

Signed-off-by: Christian Heimes <cheimes@redhat.com>
@russellb
Copy link
Member

russellb commented May 2, 2024

looks we hit that same error again

  +2024-05-02 16:13:18 ./scripts/functional-tests.sh:1: test_generate(): cleanup 124
  Error occurred in function: test_generate with exit code: 124
  /opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '

https://github.com/instructlab/instructlab/actions/runs/8926843905/job/24518839487?pr=1050#step:7:781

@tiran
Copy link
Contributor Author

tiran commented May 2, 2024

Apparently a context manager is not good enough to clean up a pool. One has to call pool.join(), too. The API has sharp edges...

Now most test runs are passing. One is having trouble with leaked resources:

/opt/hostedtoolcache/Python/3.11.9/x64/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown

The tqdm docs say closing is "highly recommended" but not required if
you update() it to 100%, which we do. It still seems like a good idea
and may fix a "leaked semaphore" error we're seeing in CI.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
@russellb
Copy link
Member

russellb commented May 2, 2024

Some searching around indicated that tqdm could be the source of our problem. We didn't explicitly close() the progress bar, so let's see if that was the culprit.

@russellb
Copy link
Member

russellb commented May 2, 2024

dang, the 3.9 job hit the issue -- the python version that hits it is not consistent. It was 3.11 last time.

  +2024-05-02 17:31:32 ./scripts/functional-tests.sh:1: test_generate(): cleanup 124
  /opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
    warnings.warn('resource_tracker: There appear to be %d '
  +2024-05-02 17:31:32 ./scripts/functional-tests.sh:25: cleanup(): set +e
  Error occurred in function: test_generate with exit code: 124

We create a Queue to allow the child serve process to send expcetions
back to the parent, but neither process explicitly closed the queue.
It's a good thing to do and may be related to a "leaked semaphore"
error we are seeing in some CI jobs.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
@russellb russellb force-pushed the mp-start-method branch from fe89fdd to 6b16c37 Compare May 2, 2024 17:54
@russellb
Copy link
Member

russellb commented May 2, 2024

next up, the multiprocessing Queue() we used between chat and generate with the automatically spawned serve process was not being closed and joined. Let's see if that's it!

@russellb
Copy link
Member

russellb commented May 2, 2024

tests all green now!

Edit: I told the tests workflow to run again, just to get one more check

Edit 2: green through a 2nd pass, as well

There was a todo to try switching back to the default on Linux. We've
fixed some other bugs here, so let's see if switching back is good
now.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
@russellb russellb force-pushed the mp-start-method branch from 6b8b5a3 to 78c9fb2 Compare May 2, 2024 19:12
@tiran
Copy link
Contributor Author

tiran commented May 2, 2024

Ship it! 👍

@russellb russellb changed the title Don't fail if multiprocessing is already set to spawn Don't fail if multiprocessing is already set to spawn; fix issues from lack of cleanup May 2, 2024
@russellb
Copy link
Member

russellb commented May 2, 2024

Ship it! 👍

I'm good with it, too. I'll see if I can find someone else to review it since we both have changes included.

@hickeyma hickeyma merged commit 776deb0 into instructlab:main May 3, 2024
leseb added a commit to leseb/instructlab that referenced this pull request May 6, 2024
server

With the help of instructlab#1050 we
have a working implementation of multiprocessing. We can then add again
the 'spawn' method when running the temporary server.

Signed-off-by: Sébastien Han <seb@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants