Conversation

stefwalter
Contributor

@stefwalter stefwalter commented Apr 26, 2024

This PR introduces a new job that runs in CI that performs a minimal
configuration of the ilab workflow:

  • ilab init
  • ilab download
  • ilab serve
  • ilab generate
  • ilab train

It runs on a GPU-enabled github actions runner. This is a Linux VM
with a single Tesla T4 GPU.

Given the resources required to run this job, we do not propose that
it runs automatically on every PR. Instead, it is a workflow that must
be manually launched. When launching it manually, you can specify
which branch or pull request to run it against.

Signed-off-by: Stef Walter stefw@redhat.com
Co-authored-by: Russell Bryant rbryant@redhat.com
Signed-off-by: Russell Bryant rbryant@redhat.com


Depends on:

Related:


Follow-up work after this PR:

  • Make another variant of this job that builds our instructlab container with cuda support and uses that to run the same workflow
  • Consider how and where to run end-to-end tests for other environments, file tracking issues for those
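The manually launched job described above maps to GitHub Actions' `workflow_dispatch` trigger with an input. As a rough sketch (job name, input name, and runner label here are illustrative assumptions, not the merged configuration):

```yaml
# Hypothetical sketch of a manually triggered e2e workflow.
name: e2e
on:
  workflow_dispatch:
    inputs:
      pr_or_branch:
        description: 'Pull request number or branch to run against'
        required: true
        default: 'main'

jobs:
  e2e:
    # GPU-enabled runner (Linux VM with a single Tesla T4)
    runs-on: gpu-runner
    steps:
      - uses: actions/checkout@v4
      - name: Run the minimal ilab workflow
        run: |
          ilab init
          ilab download
          # ilab serve runs in the background while generate talks to it
          ilab generate
          ilab train
```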

@stefwalter stefwalter marked this pull request as draft April 26, 2024 08:55
@stefwalter
Contributor Author

@rhatdan @jeremyeder @lmilbaum FYI

@stefwalter stefwalter force-pushed the container-testing branch 3 times, most recently from 4e25c98 to d7288f3 Compare April 26, 2024 10:14
@stefwalter
Contributor Author

@lmilbaum @rhatdan This is an e2e test we'd run in https://github.com/containers/ai-lab-recipes/ ... related to containers/ai-lab-recipes#341

@stefwalter
Contributor Author

This work needs #1018 to fix the container first.

The work is incomplete right now, with lots of TODO statements indicating what remains to be done.

@russellb russellb self-assigned this Apr 26, 2024
@stefwalter
Contributor Author

@russellb The Containerfile also needs this for CUDA:

-RUN dnf install -y libcudnn8 nvidia-driver-NVML
+RUN dnf install -y libcudnn8 nvidia-driver-NVML nvidia-driver-cuda-libs

@stefwalter
Contributor Author

Is this ready for broader review and testing? Did a CI run work successfully in GitHub?

@russellb
Member

Is this ready for broader review and testing? Did a CI run work successfully in GitHub?

I don’t have the CI job passing yet. I’ll update again later today.

@russellb russellb force-pushed the container-testing branch from e36c8f5 to 5f7a350 Compare April 30, 2024 11:44
@mergify mergify bot added CI/CD Affects CI/CD configuration testing Relates to testing labels Apr 30, 2024
@russellb
Member

My last push included a rebase to be on top of a bunch of newer changes. Some outstanding issues:

  • I defined an e2e CI job, but it doesn't work reliably. I don't think the github action runners have enough CPU to finish even the most minimal configuration. I'm going to switch to looking at running it on a GPU runner instead. The tradeoff will be that it won't run on every PR for now, since it's not free.
  • Eventually, we'll squash all of this, but that's for when it's ready for review and merge ... I'd like to have a working CI job ready for this repo first though

@russellb russellb force-pushed the container-testing branch 5 times, most recently from 42d4628 to 2d7f133 Compare April 30, 2024 14:18
@russellb
Member

The end-to-end CI job that does init/download/generate/train on Linux is passing. This is running on the host OS using a GPU worker available via github actions.

Next steps:

  • I'm going to clean up this branch by squashing most of it together and putting it in a form ready for review and merging this first workflow
  • Next I want to add a variant that instead of running it on the host, builds a cuda container using our containerfile and runs the same workflow using the container. I'd like to do that in a follow-up PR, though.

@russellb russellb force-pushed the container-testing branch from 4cafab2 to f0a29d2 Compare May 1, 2024 00:39
@russellb russellb added the hold In-progress PR. Tag should be removed before merge. label May 1, 2024
@russellb
Member

russellb commented May 1, 2024

I applied the hold label because:

  • This PR includes linux-train: Skip convert to gguf when --4-bit-quant is used #1058, so that should be reviewed first
  • There is a commit on the branch that must be removed before merging (the change to make the job run automatically against this PR). The intent is to only merge it as a manually triggered workflow.
  • I have a hack commit applied to avoid an error I'm hitting during ilab train, but I need to figure out the proper fix for it

@russellb russellb force-pushed the container-testing branch from f0a29d2 to 4d0d838 Compare May 1, 2024 00:45
@russellb russellb changed the title WIP basic workflow test for containers Introduce an end-to-end CI job on Linux May 1, 2024
try:
    torch.multiprocessing.set_start_method(DEFAULT_MULTIPROCESSING_START_METHOD)
except RuntimeError:
    pass
Member

@tiran Do you have any advice on this one? I hit this reliably in this CI job.

Member

The commit message here has the backtrace: 2865268

Contributor

I have seen the issue before on HPUs and created #1050

Member

Thanks! I replaced my change with the commit from your PR. Assuming the job still works, I'll approve that one.

Member

It doesn't fix it here.

Traceback (most recent call last):
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/train/linux_train.py", line 30, in <module>
    torch.multiprocessing.set_start_method(DEFAULT_MULTIPROCESSING_START_METHOD)
  File "/usr/lib/python3.10/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/runner/work/instructlab/instructlab/venv/bin/ilab", line 8, in <module>
    sys.exit(cli())
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/lab.py", line 951, in train
    from .train.linux_train import linux_train
  File "/home/runner/work/instructlab/instructlab/venv/lib/python3.10/site-packages/instructlab/train/linux_train.py", line 34, in <module>
    raise ValueError(
ValueError: multiprocessing start method already set to fork.
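The guard being debated here can be sketched with the stdlib `multiprocessing` module, which `torch.multiprocessing` wraps (a minimal illustration of the pattern, not the project's actual fix; the helper name is made up):

```python
import multiprocessing

def set_start_method_once(method: str) -> str:
    # A second call to set_start_method() raises
    # RuntimeError("context has already been set"), which is the error
    # in the traceback above; swallow it and keep whichever start
    # method was set first.
    try:
        multiprocessing.set_start_method(method)
    except RuntimeError:
        pass
    return multiprocessing.get_start_method()

first = set_start_method_once("spawn")
second = set_start_method_once("fork")  # no-op: the context is already set
```

After both calls, `first` and `second` are both `"spawn"`: the second request is silently ignored rather than raising, which is the behavior the hack commit was approximating.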

@russellb russellb marked this pull request as ready for review May 1, 2024 00:56
@russellb
Member

russellb commented May 1, 2024

Note that the workflow is currently limited and stops before ensuring you can run ilab serve with a model after ilab train because of #579

@russellb
Member

russellb commented May 1, 2024

A data point on costs. The last two successful runs took 22 minutes. That costs us $1.54 each time ($0.07 per minute). If we run this only manually, I think the costs will stay under control.

I also think we could speed this up. A lot of time is spent on downloading and installing stuff. No caching is done in the workflow, yet.

@tiran
Contributor

tiran commented May 1, 2024

For caching, I recommend using setup-python with pip caching, plus actions/cache for the Hugging Face models. We use the same technique in test.yaml. You could also cache the APT packages with something like https://github.com/marketplace/actions/cache-apt-packages

      - name: Setup Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: pip
          cache-dependency-path: |
            **/pyproject.toml
            **/requirements*.txt

      - name: Cache huggingface
        uses: actions/cache@v4
        with:
          path: ~/.cache/huggingface
          # config contains DEFAULT_MODEL
          key: huggingface-${{ hashFiles('src/instructlab/config.py') }}

@tiran
Contributor

tiran commented May 1, 2024

Are you sure the PR uses a CUDA-accelerated build of llama-cpp-python? It downloads and builds llama-cpp-python twice:

Collecting llama-cpp-python
  Downloading llama_cpp_python-0.2.68.tar.gz (42.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 42.5/42.5 MB 170.8 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
 Collecting llama_cpp_python[server]==0.2.55
  Downloading llama_cpp_python-0.2.55.tar.gz (36.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.8/36.8 MB 172.1 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Installing backend dependencies: started
  Installing backend dependencies: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'

I think you are overriding the CUDA build with a CPU-only build. Our installation instructions have sharp edges.

Please try this:

sed 's/\[.*\]//' requirements.txt > constraints.txt
python3 -m pip cache remove llama_cpp_python
python3 -m pip install --no-binary llama_cpp_python -c constraints.txt llama_cpp_python
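To make the extras-stripping step concrete, here is a hypothetical round trip on a one-line sample file (the file names are throwaway illustrations, not the repo's real requirements.txt):

```shell
# Sample pin carrying a "[server]" extra, which pip rejects in a
# constraints file.
printf 'llama_cpp_python[server]==0.2.55\n' > reqs-sample.txt
# Strip the bracketed extra, leaving a plain pinned requirement.
sed 's/\[.*\]//' reqs-sample.txt > constraints-sample.txt
cat constraints-sample.txt
# llama_cpp_python==0.2.55
```

Pinning the package through a constraints file this way keeps the later `--no-binary` source build on the same version that the requirements file asks for.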

@russellb russellb force-pushed the container-testing branch from 2865268 to be15e3b Compare May 1, 2024 06:17
@russellb
Member

russellb commented May 1, 2024

Are you sure the PR uses a CUDA-accelerated build of llama-cpp-python?

I guess I'm not! I was going off of the output from ilab train confirming that it was using the GPU, but I wasn't thinking that ilab serve might be different. Thank you.

Member

@nathan-weinberg nathan-weinberg left a comment

LGTM - I'd love it if we could incorporate this into the release workflow somehow - maybe having an automatic trigger when we cut a new release branch for Y-streams, or for backport PRs to an existing release branch (to ensure Z-streams have no regressions)

stefwalter and others added 4 commits May 3, 2024 14:34
This PR introduces a new job that runs in CI that performs a minimal
configuration of the `ilab` workflow:

- `ilab init`
- `ilab download`
- `ilab serve`
- `ilab generate`
- `ilab train`

It runs on a GPU-enabled github actions runner. This is a Linux VM
with a single Tesla T4 GPU.

Given the resources required to run this job, we do not propose that
it runs automatically on every PR. Instead, it is a workflow that must
be manually launched. When launching it manually, you can specify
which branch or pull request to run it against.

Signed-off-by: Stef Walter <stefw@redhat.com>
Co-authored-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Russell Bryant <rbryant@redhat.com>
@tiran pointed out that llama-cpp-python was being installed twice. My
last approach was not ensuring that the cuda enabled build was set to
the correct version. This change includes his suggestion on how to do
that properly.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
Free up GPU memory by killing `ilab serve` before running `ilab
train`.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
This mirrors the caching approach used in `test.yml`.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
@russellb russellb force-pushed the container-testing branch from ab9d559 to 759ef3e Compare May 3, 2024 18:35
Signed-off-by: Russell Bryant <rbryant@redhat.com>
@mergify mergify bot added the documentation Improvements or additions to documentation label May 3, 2024
@russellb
Member

russellb commented May 3, 2024

LGTM - I'd love it if we could incorporate this into the release workflow somehow - maybe having an automatic trigger when we cut a new release branch for Y-streams, or for backport PRs to an existing release branch (to ensure Z-streams have no regressions)

Yeah, that's a good idea. We could definitely trigger it on tag creation. For the moment, that would mean triggering it manually, using the input field that lets you specify which branch or tag to run against. You'd put in release-v0.15 or v0.15.0 or whatever you want and it'll run for you.
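For reference, a tag-based trigger could later sit alongside the manual one; a config sketch (not part of this PR), using GitHub Actions' `on.push.tags` filter:

```yaml
# Hypothetical: run the e2e job manually or whenever a release tag is pushed.
on:
  workflow_dispatch:
    inputs:
      pr_or_branch:
        description: 'Branch, tag, or PR to run against'
  push:
    tags:
      - 'v*'
```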

@russellb russellb removed the hold In-progress PR. Tag should be removed before merge. label May 3, 2024
@mergify mergify bot merged commit d905a25 into instructlab:main May 3, 2024