Conversation

stefwalter
Contributor

@stefwalter stefwalter commented Apr 26, 2024

This PR introduces a new job that runs in CI that performs a minimal
configuration of the ilab workflow:

  • ilab init
  • ilab download
  • ilab serve
  • ilab generate
  • ilab train

It runs on a GPU-enabled github actions runner. This is a Linux VM
with a single Tesla T4 GPU.

Given the resources required to run this job, we do not propose that
it runs automatically on every PR. Instead, it is a workflow that must
be manually launched. When launching it manually, you can specify
which branch or pull request to run it against.

Signed-off-by: Stef Walter stefw@redhat.com
Co-authored-by: Russell Bryant rbryant@redhat.com
Signed-off-by: Russell Bryant rbryant@redhat.com


Depends on:

Related:


Follow-up work after this PR:

  • Make another variant of this job that builds our instructlab container with cuda support and uses that to run the same workflow
  • Consider how and where to run end-to-end tests for other environments, file tracking issues for those
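The manually launched job described above maps to GitHub Actions' `workflow_dispatch` trigger with an input. As a rough sketch (job name, input name, and runner label here are illustrative assumptions, not the merged configuration):

```yaml
# Hypothetical sketch of a manually triggered e2e workflow.
name: e2e
on:
  workflow_dispatch:
    inputs:
      pr_or_branch:
        description: 'Pull request number or branch to run against'
        required: true
        default: 'main'

jobs:
  e2e:
    # GPU-enabled runner (Linux VM with a single Tesla T4)
    runs-on: gpu-runner
    steps:
      - uses: actions/checkout@v4
      - name: Run the minimal ilab workflow
        run: |
          ilab init
          ilab download
          # ilab serve runs in the background while generate talks to it
          ilab generate
          ilab train
```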

@stefwalter stefwalter marked this pull request as draft April 26, 2024 08:55
@stefwalter
Contributor Author

@rhatdan @jeremyeder @lmilbaum FYI

@stefwalter stefwalter force-pushed the container-testing branch 3 times, most recently from 4e25c98 to d7288f3 Compare April 26, 2024 10:14
@stefwalter
Contributor Author

@lmilbaum @rhatdan This is an e2e test we'd run in https://github.com/containers/ai-lab-recipes/ ... related to containers/ai-lab-recipes#341

@stefwalter
Contributor Author

This work needs #1018 to fix the container first.

The work is incomplete right now, with lots of TODO statements indicating what remains to be done.

@russellb russellb self-assigned this Apr 26, 2024
@stefwalter
Contributor Author

@russellb The Containerfile also needs this for CUDA:

-RUN dnf install -y libcudnn8 nvidia-driver-NVML
+RUN dnf install -y libcudnn8 nvidia-driver-NVML nvidia-driver-cuda-libs

@stefwalter
Contributor Author

Is this ready for broader review and testing? Did a CI run work successfully in GitHub?

@russellb
Member

Is this ready for broader review and testing? Did a CI run work successfully in GitHub?

I don’t have the CI job passing yet. I’ll update again later today.

@russellb russellb force-pushed the container-testing branch from e36c8f5 to 5f7a350 Compare April 30, 2024 11:44
@mergify mergify bot added CI/CD Affects CI/CD configuration testing Relates to testing labels Apr 30, 2024
@russellb
Member

My last push included a rebase to be on top of a bunch of newer changes. Some outstanding issues:

  • I defined an e2e CI job, but it doesn't work reliably. I don't think the github action runners have enough CPU to finish even the most minimal configuration. I'm going to switch to looking at running it on a GPU runner instead. The tradeoff will be that it won't run on every PR for now, since it's not free.
  • Eventually, we'll squash all of this, but that's for when it's ready for review and merge ... I'd like to have a working CI job ready for this repo first though

@russellb russellb force-pushed the container-testing branch 5 times, most recently from 42d4628 to 2d7f133 Compare April 30, 2024 14:18
@russellb
Member

The end-to-end CI job that does init/download/generate/train on Linux is passing. This is running on the host OS using a GPU worker available via github actions.

Next steps:

  • I'm going to clean up this branch by squashing most of it together and putting it in a form ready for review and merging this first workflow
  • Next I want to add a variant that instead of running it on the host, builds a cuda container using our containerfile and runs the same workflow using the container. I'd like to do that in a follow-up PR, though.

@russellb russellb force-pushed the container-testing branch from 4cafab2 to f0a29d2 Compare May 1, 2024 00:39
@russellb russellb added the hold In-progress PR. Tag should be removed before merge. label May 1, 2024
@russellb
Member

russellb commented May 1, 2024

I applied the hold label because:

  • This PR includes linux-train: Skip convert to gguf when --4-bit-quant is used #1058, so that should be reviewed first
  • There is a commit on the branch that must be removed before merging (the change to make the job run automatically against this PR). The intent is to only merge it as a manually triggered workflow.
  • I have a hack commit applied to avoid an error I'm hitting during ilab train, but I need to figure out the proper fix for it

@russellb russellb force-pushed the container-testing branch from f0a29d2 to 4d0d838 Compare May 1, 2024 00:45
@russellb russellb changed the title WIP basic workflow test for containers Introduce an end-to-end CI job on Linux May 1, 2024
try:
    torch.multiprocessing.set_start_method(DEFAULT_MULTIPROCESSING_START_METHOD)
except RuntimeError:
    pass
Member

@tiran Do you have any advice on this one? I hit this reliably in this CI job.

Member

The commit message here has the backtrace: 2865268

Contributor

I have seen the issue before on HPUs and created #1050

Member

Thanks! I replaced my change with the commit from your PR. Assuming the job still works, I'll approve that one.

Member

It doesn't fix it here.

Traceback (most recent call last):
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/train/linux_train.py", line 30, in <module>
    torch.multiprocessing.set_start_method(DEFAULT_MULTIPROCESSING_START_METHOD)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/instructlab/instructlab/venv/bin/ilab", line 8, in <module>
    sys.exit(cli())
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/lab.py", line 951, in train
    from .train.linux_train import linux_train
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/train/linux_train.py", line 34, in <module>
    raise ValueError(
ValueError: multiprocessing start method already set to fork.
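The guard being debated here can be sketched with the stdlib `multiprocessing` module, which `torch.multiprocessing` wraps (a minimal illustration of the pattern, not the project's actual fix; the helper name is made up):

```python
import multiprocessing

def set_start_method_once(method: str) -> str:
    # A second call to set_start_method() raises
    # RuntimeError("context has already been set"), which is the error
    # in the traceback above; swallow it and keep whichever start
    # method was set first.
    try:
        multiprocessing.set_start_method(method)
    except RuntimeError:
        pass
    return multiprocessing.get_start_method()

first = set_start_method_once("spawn")
second = set_start_method_once("fork")  # no-op: the context is already set
```

After both calls, `first` and `second` are both `"spawn"`: the second request is silently ignored rather than raising, which is the behavior the hack commit was approximating.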

@russellb russellb marked this pull request as ready for review May 1, 2024 00:56
@russellb
Member

russellb commented May 1, 2024

Note that the workflow is currently limited and stops before ensuring you can run ilab serve with a model after ilab train because of #579

@russellb
Member

russellb commented May 1, 2024

A data point on costs. The last two successful runs took 22 minutes. That costs us $1.54 each time ($0.07 per minute). If we run this only manually, I think the costs will stay under control.

I also think we could speed this up. A lot of time is spent on downloading and installing stuff. No caching is done in the workflow, yet.

@tiran
Contributor

tiran commented May 1, 2024

For caching, I recommend using setup-python with pip caching, plus actions/cache for the Hugging Face models. We use the same technique in test.yaml. You could also cache the APT packages with something like https://github.com/marketplace/actions/cache-apt-packages

      - name: Setup Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip
          cache-dependency-path: |
            **/pyproject.toml
            **/requirements*.txt

      - name: Cache huggingface
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface
          # config contains DEFAULT_MODEL
          key: huggingface-${{ hashFiles('src/instructlab/config.py') }}

@tiran
Contributor

tiran commented May 1, 2024

Are you sure the PR uses a CUDA-accelerated build of llama-cpp-python? It downloads and builds llama-cpp-python twice:

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.68.tar.gz (42.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.5/42.5 MB 170.8 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
 Collecting llama_cpp_python[server]==0.2.55
  Downloading llama_cpp_python-0.2.55.tar.gz (36.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.8/36.8 MB 172.1 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'

I think you are overriding the CUDA build with a CPU-only build. Our installation instructions have sharp edges.

Please try this:

sed 's/\[.*\]//' requirements.txt > constraints.txt
python3 -m pip cache remove llama_cpp_python
python3 -m pip install --no-binary llama_cpp_python -c constraints.txt llama_cpp_python
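To make the extras-stripping step concrete, here is a hypothetical round trip on a one-line sample file (the file names are throwaway illustrations, not the repo's real requirements.txt):

```shell
# Sample pin carrying a "[server]" extra, which pip rejects in a
# constraints file.
printf 'llama_cpp_python[server]==0.2.55\n' > reqs-sample.txt
# Strip the bracketed extra, leaving a plain pinned requirement.
sed 's/\[.*\]//' reqs-sample.txt > constraints-sample.txt
cat constraints-sample.txt
# llama_cpp_python==0.2.55
```

Pinning the package through a constraints file this way keeps the later `--no-binary` source build on the same version that the requirements file asks for.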

@russellb russellb force-pushed the container-testing branch from 2865268 to be15e3b Compare May 1, 2024 06:17
@russellb
Member

russellb commented May 1, 2024

Are you sure the PR uses a CUDA-accelerated build of llama-cpp-python?

I guess I'm not! I was going off of the output from ilab train confirming that it was using the GPU, but I wasn't thinking that ilab serve might be different. Thank you.

Member

@nathan-weinberg nathan-weinberg left a comment

LGTM - I'd love it if we could incorporate this into the release workflow somehow - maybe having an automatic trigger when we cut a new release branch for Y-streams, or for backport PRs to an existing release branch (to ensure Z-streams have no regressions)

stefwalter and others added 4 commits May 3, 2024 14:34
This PR introduces a new job that runs in CI that performs a minimal
configuration of the `ilab` workflow:

- `ilab init`
- `ilab download`
- `ilab serve`
- `ilab generate`
- `ilab train`

It runs on a GPU-enabled github actions runner. This is a Linux VM
with a single Tesla T4 GPU.

Given the resources required to run this job, we do not propose that
it runs automatically on every PR. Instead, it is a workflow that must
be manually launched. When launching it manually, you can specify
which branch or pull request to run it against.

Signed-off-by: Stef Walter <stefw@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
@tiran pointed out that llama-cpp-python was being installed twice. My
last approach was not ensuring that the cuda enabled build was set to
the correct version. This change includes his suggestion on how to do
that properly.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Free up GPU memory by killing `ilab serve` before running `ilab
train`.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
This mirrors the caching approach used in `test.yml`.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
@russellb russellb force-pushed the container-testing branch from ab9d559 to 759ef3e Compare May 3, 2024 18:35
Signed-off-by: Russell Bryant <rbryant@redhat.com>
@mergify mergify bot added the documentation Improvements or additions to documentation label May 3, 2024
@russellb
Member

russellb commented May 3, 2024

LGTM - I'd love it if we could incorporate this into the release workflow somehow - maybe having an automatic trigger when we cut a new release branch for Y-streams, or for backport PRs to an existing release branch (to ensure Z-streams have no regressions)

Yeah, that's a good idea. We could definitely trigger it on tag creation. For the moment, that would mean triggering it manually, using the input field that lets you specify which branch or tag to run against. You'd put in release-v0.15 or v0.15.0 or whatever you want and it'll run for you.
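For reference, a tag-based trigger could later sit alongside the manual one; a config sketch (not part of this PR), using GitHub Actions' `on.push.tags` filter:

```yaml
# Hypothetical: run the e2e job manually or whenever a release tag is pushed.
on:
  workflow_dispatch:
    inputs:
      pr_or_branch:
        description: 'Branch, tag, or PR to run against'
  push:
    tags:
      - 'v*'
```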

@russellb russellb removed the hold In-progress PR. Tag should be removed before merge. label May 3, 2024
@mergify mergify bot merged commit d905a25 into instructlab:main May 3, 2024