CI: adding a GPU-enabled CI job

_We've touched on this in a few places: in a [community meeting](https://hackmd.io/pyhudrC5TgSwdJtpPldHgQ#Agenda-for-July-24th-2024), on a PR I can't find right now, and in in-person discussions. So here is a new issue with a concrete proposal._

We have experimental support for CuPy, PyTorch and JAX - all of which support CUDA (and in some cases ROCm and/or Intel XPUs), which we are not exercising in CI. As a result, we get issues like gh-21486 where someone finds locally that the test suite has regressed for a GPU-enabled array library. This is obviously not ideal, and we'd like a low-overhead and low-cost way to run GPU tests in CI on PRs that a reviewer thinks may impact GPU support.

Scikit-learn has such a CI job, see [the GHA workflow file for it](https://github.com/scikit-learn/scikit-learn/blob/9d39f57399d6f1f7d8e8d4351dbc3e9244b98d28/.github/workflows/cuda-ci.yml) and @betatim's [blog post describing the design](https://betatim.github.io/posts/github-action-with-gpu/). 

The other thing we needed for it was a credit card from NumFOCUS that we can attach to this repo (with a limit on spending level, as described in the blog post) - we have that as well now. So we're all set in principle to add this.

Suggested logistics and design:
- The SciPy core team should sign off on the spending limit (just like we did for Cirrus CI usage). I'd suggest a limit of either $50/month or $100/month, to be adjusted if needed once we have some experience with it.
- The label-based triggering that scikit-learn has, where adding the label to a PR does a single run and then the label gets removed again so new commits don't auto-run the job again, seems nice - so let's copy it.
- We can start with building SciPy in the GPU-enabled CI runner; if it becomes expensive we can move the SciPy wheel build to a separate (CPU-only, free) job, upload the wheel to the GHA cache, and then trigger a second job based on the first one to run the CUDA tests.
- The one change from the scikit-learn setup we should make is to use `pixi` rather than `conda-lock`. The latter makes sense for scikit-learn since they were already using it, but `pixi` will work better for us and we've mostly already got the setup worked out in https://github.com/rgommers/pixi-dev-scipystack. I don't want to propose adding that setup in the root of the repo, but only the single environment + lock file that we need to run this GPU job.
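
To make the label-based triggering concrete, here is a minimal sketch of the decision logic, written as plain Python rather than as an actual workflow file. It assumes the event dictionary follows the shape of GitHub's `pull_request` webhook payload (an `action` field, plus a `label` object with a `name` when the action is `labeled`). The label name `GPU CI` and both function names are hypothetical, picked only for illustration; this is not how scikit-learn's workflow is implemented.

```python
# Hypothetical label name; the proposal does not fix one yet.
TRIGGER_LABEL = "GPU CI"


def should_run_gpu_job(event: dict, trigger_label: str = TRIGGER_LABEL) -> bool:
    """Return True only for the event created by adding the trigger label."""
    if event.get("action") != "labeled":
        # New commits arrive as "synchronize" events, so they never start a run.
        return False
    return event.get("label", {}).get("name") == trigger_label


def labels_after_run(labels: list[str], trigger_label: str = TRIGGER_LABEL) -> list[str]:
    """Drop the trigger label so that another run needs a reviewer to re-add it."""
    return [name for name in labels if name != trigger_label]


if __name__ == "__main__":
    labeled = {"action": "labeled", "label": {"name": TRIGGER_LABEL}}
    pushed = {"action": "synchronize"}
    print(should_run_gpu_job(labeled))
    print(should_run_gpu_job(pushed))
    print(labels_after_run(["CI", TRIGGER_LABEL]))
```

The point of removing the label after each run is that every paid GPU run is an explicit reviewer action, which should keep spending predictable under the monthly limit.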

Thoughts?

CI: adding a GPU-enabled CI job #21740
