Introduce an end-to-end CI job on Linux #1016
Conversation
force-pushed from 4e25c98 to d7288f3
@lmilbaum @rhatdan This is an e2e test we'd run in https://github.com/containers/ai-lab-recipes/ ... related to containers/ai-lab-recipes#341
force-pushed from d7288f3 to c424940
This work needs #1018 to fix the container first. The work is incomplete right now, with lots of TODO statements indicating what remains to be done.
@russellb The Containerfile also needs this for CUDA: …
Is this ready for broader review and testing? Did a CI run complete successfully in GitHub?
I don’t have the CI job passing yet. I’ll update again later today.
force-pushed from e36c8f5 to 5f7a350
My last push included a rebase on top of a number of newer changes. Some outstanding issues: …
force-pushed from 42d4628 to 2d7f133
The end-to-end CI job that does init/download/generate/train on Linux is passing. This runs on the host OS using a GPU worker available via GitHub Actions. Next steps: …
force-pushed from 4cafab2 to f0a29d2
I applied the …
force-pushed from f0a29d2 to 4d0d838
src/instructlab/train/linux_train.py (outdated):

```python
try:
    torch.multiprocessing.set_start_method(DEFAULT_MULTIPROCESSING_START_METHOD)
except RuntimeError:
    pass
```
@tiran Do you have any advice on this one? I hit this reliably in this CI job.
The commit message here has the backtrace: 2865268
I have seen the issue before on HPUs and created #1050
Thanks! I replaced my change with the commit from your PR. Assuming the job still works, I'll approve that one.
It doesn't fix it here.
```
Traceback (most recent call last):
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/train/linux_train.py", line 30, in <module>
    torch.multiprocessing.set_start_method(DEFAULT_MULTIPROCESSING_START_METHOD)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/instructlab/instructlab/venv/bin/ilab", line 8, in <module>
    sys.exit(cli())
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/lab.py", line 951, in train
    from .train.linux_train import linux_train
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/train/linux_train.py", line 34, in <module>
    raise ValueError(
ValueError: multiprocessing start method already set to fork.
```
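For what it's worth, the collision can be reproduced with the stdlib `multiprocessing` module alone (which `torch.multiprocessing` wraps): once a start method is set, a second `set_start_method()` call raises `RuntimeError`. A hypothetical helper that tolerates an already-set context (not code from this PR) looks like:

```python
import multiprocessing as mp

def ensure_start_method(method: str) -> str:
    """Try to set the start method; if the context was already fixed,
    keep it and report what is actually in use.
    (Hypothetical helper, not code from this PR.)"""
    try:
        mp.set_start_method(method)
    except RuntimeError:
        # "context has already been set" -- the same error as the traceback above
        pass
    return mp.get_start_method()
```

The open question is whether silently keeping the existing method is acceptable for training, since a `fork` context can misbehave with CUDA.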
Note that the workflow is currently limited and stops before ensuring you can run …
A data point on costs: the last two successful runs took 22 minutes each. That costs us $1.54 per run ($0.07 per minute). If we run this only manually, I think the costs will stay under control. I also think we could speed it up; a lot of time is spent downloading and installing dependencies, and the workflow does no caching yet.
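The arithmetic behind that estimate is just minutes times the GPU runner's per-minute rate, both taken from the comment above:

```python
def run_cost(minutes: float, rate_per_minute: float = 0.07) -> float:
    """Estimate the cost of one CI run at the GPU runner's per-minute
    rate (rate taken from the comment above)."""
    return round(minutes * rate_per_minute, 2)
```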
For caching I recommend using (note that `actions/setup-python` expects a version string like `"3.11"`, not `py311`):

```yaml
- name: Setup Python 3.11
  uses: actions/setup-python@v5
  with:
    python-version: "3.11"
    cache: pip
    cache-dependency-path: |
      **/pyproject.toml
      **/requirements*.txt
- name: Cache huggingface
  uses: actions/cache@v4
  with:
    path: ~/.cache/huggingface
    # config contains DEFAULT_MODEL
    key: huggingface-${{ hashFiles('src/instructlab/config.py') }}
```
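That cache key ties the Hugging Face cache to the file naming `DEFAULT_MODEL`, so the cache is invalidated whenever the default model changes. Conceptually, `hashFiles()` is just a digest over the listed files; a rough Python analogue (illustrative only, the real `hashFiles()` algorithm differs in detail):

```python
import hashlib
from pathlib import Path

def cache_key(prefix: str, *paths: str) -> str:
    """Rough analogue of GitHub Actions' hashFiles(): a SHA-256 digest
    over the contents of the listed files, prefixed with a cache name.
    (Illustrative sketch, not GitHub's exact algorithm.)"""
    digest = hashlib.sha256()
    for p in sorted(paths):
        digest.update(Path(p).read_bytes())
    return f"{prefix}-{digest.hexdigest()}"
```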
Are you sure the PR uses a CUDA-accelerated build of llama-cpp-python? It downloads and builds llama-cpp-python two times; I think you are overriding the CUDA build with a CPU-only build. Our installation instructions have sharp edges. Please try this:

```shell
sed 's/\[.*\]//' requirements.txt > constraints.txt
python3 -m pip cache remove llama_cpp_python
python3 -m pip install --no-binary llama_cpp_python -c constraints.txt llama_cpp_python
```
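The `sed` expression simply drops any pip extras bracket (e.g. `[server]`) from each requirement line, since constraints files cannot carry extras. The equivalent transformation in Python (the example requirement names are illustrative):

```python
import re

def strip_extras(line: str) -> str:
    r"""Remove a pip extras bracket from a requirement line, mirroring
    sed 's/\[.*\]//'. (Illustrative; constraints files reject extras.)"""
    return re.sub(r"\[.*\]", "", line)
```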
force-pushed from 2865268 to be15e3b
I guess I'm not! I was going off of the output from …
force-pushed from a561bd2 to ab9d559
LGTM. I'd love it if we could incorporate this into the release workflow somehow: maybe an automatic trigger when we cut a new release branch for Y-streams, or for backport PRs to an existing release branch (to ensure Z-streams have no regressions).
This PR introduces a new job that runs in CI that performs a minimal configuration of the `ilab` workflow:

- `ilab init`
- `ilab download`
- `ilab serve`
- `ilab generate`
- `ilab train`

It runs on a GPU-enabled GitHub Actions runner. This is a Linux VM with a single Tesla T4 GPU.

Given the resources required to run this job, we do not propose that it runs automatically on every PR. Instead, it is a workflow that must be manually launched. When launching it manually, you can specify which branch or pull request to run it against.

Signed-off-by: Stef Walter <stefw@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
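Condensed, the job described in that commit message might look like the following workflow fragment (the runner label, `ilab` flags, and PID handling are illustrative assumptions, not the PR's actual YAML):

```yaml
name: e2e
on:
  workflow_dispatch:   # manual launch only, per the description above
jobs:
  e2e:
    runs-on: gpu-t4-runner   # hypothetical label for the Tesla T4 Linux VM
    steps:
      - uses: actions/checkout@v4
      - run: pip install .
      - run: ilab init --non-interactive   # flag assumed for CI use
      - run: ilab download
      - run: |
          ilab serve &
          echo $! > serve.pid
      - run: ilab generate
      - run: kill "$(cat serve.pid)"   # free GPU memory before training
      - run: ilab train
```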
@tiran pointed out that llama-cpp-python was being installed twice. My last approach was not ensuring that the CUDA-enabled build was set to the correct version. This change includes his suggestion on how to do that properly.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Free up GPU memory by killing `ilab serve` before running `ilab train`.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
This mirrors the caching approach used in `test.yml`.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
force-pushed from ab9d559 to 759ef3e
Signed-off-by: Russell Bryant <rbryant@redhat.com>
Yeah, that's a good idea. We could definitely trigger it on tag creation. For the moment, it's a matter of triggering it manually and using the input field that lets you specify which branch or tag to run against. You'd put in …
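If we do automate it later, a tag trigger would be a small addition to the workflow's `on:` block, along these lines (a sketch; the input name and tag pattern are assumptions):

```yaml
on:
  workflow_dispatch:
    inputs:
      ref:
        description: "Branch, tag, or PR ref to run against"
        required: false
  push:
    tags:
      - "v*"   # run automatically when a release tag is cut
```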
Depends on:
Related:
Follow-up work after this PR: