rgommers
Member

@rgommers rgommers commented Jan 7, 2025

Draft for now because not all tests pass yet. The failures are recent regressions in main, and I'd prefer those to be fixed in another PR first.

  • Runs tests for all array API-enabled functionality with PyTorch, JAX and CuPy in a single CI job (for now)
  • Builds on the GPU runner, because with ccache builds are very fast and the setup time is dominated by the ~3 GB worth of packages that are needed for GPU installs of PyTorch, JAX and CuPy in a single conda environment.
  • Uses Pixi to ensure we get a maintainable and robust environment.
    • Keeps `pixi.toml` inside `.github/workflows/` for now, because I don't (yet) want to cross the bridge of enabling Pixi for all SciPy contributors (there are still too many logistics to work out around how to do lock file updates).
    • Failures reproduce quite well locally. It should be a matter of running `cd .github/workflows` and then the same `pixi run test-cuda ...` command as the failing CI job step.
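
The local reproduction step in the last bullet could be wrapped in a small helper like the sketch below. It assumes Pixi is installed and that the task is named `test-cuda` as quoted above. The function names and the `extra_args` parameter are hypothetical, and the arguments elided by `...` are deliberately left for the caller to copy from the failing CI step.

```python
import shlex
import subprocess
from pathlib import Path

# Assumption: pixi.toml lives in .github/workflows/, as described above.
WORKFLOW_DIR = Path(".github") / "workflows"


def build_repro_command(extra_args=()):
    """Return the argv for the Pixi task plus any extra arguments.

    `test-cuda` is the task name mentioned above; `extra_args` stands in
    for the elided arguments and should be copied from the failing CI step.
    """
    return ["pixi", "run", "test-cuda", *extra_args]


def reproduce(repo_root=".", extra_args=(), dry_run=True):
    """Print the command and, unless dry_run is set, run it.

    Hypothetical helper: the command is run from the workflow directory,
    which is where the Pixi manifest is assumed to live.
    """
    cwd = Path(repo_root) / WORKFLOW_DIR
    cmd = build_repro_command(extra_args)
    print(f"cd {cwd} && {shlex.join(cmd)}")
    if dry_run:
        return None
    # check=False so the caller can inspect the exit code of the test run.
    return subprocess.run(cmd, cwd=cwd, check=False).returncode
```

Calling `reproduce()` with the default `dry_run=True` only prints the command, which is a cheap way to check the working directory before starting a long GPU test run.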

Closes gh-21740

@rgommers rgommers added enhancement A new feature or improvement CI Items related to the CI tools such as CircleCI, GitHub Actions or Azure array types Items related to array API support and input array validation (see gh-18286) labels Jan 7, 2025
@rgommers
Member Author

rgommers commented Jan 7, 2025

Note that the job itself won't run on this PR before merging. Here's a CI log that's less than an hour old.

@ev-br
Member

ev-br commented Jan 8, 2025

> Note that the job itself won't run on this PR before merging.

How about merging and iterating then? Unless you want to fix the regressions before merging.

@rgommers rgommers marked this pull request as draft January 8, 2025 09:12
@rgommers
Member Author

rgommers commented Jan 8, 2025

> How about merging and iterating then? Unless you want to fix the regressions before merging.

I do want the job to pass before merging, otherwise it will make CI on all PRs red.

Help is very welcome - most issues are recent regressions from PRs that weren't tested on GPU. I won't be able to do more before Friday at least. It may be a matter of adding some skips in signal and stats, at least to fix PyTorch GPU.
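
As an illustration of what "some skips" could mean in general, here is a minimal, standard-library-only sketch of skipping a test for one backend. This is not SciPy's own skip mechanism; the decorator `skip_backends`, the test `check_example`, and the reason string are all hypothetical.

```python
import functools
import unittest


def skip_backends(*backends, reason=""):
    """Hypothetical decorator: skip a test when it runs on a listed backend."""
    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(backend_name, *args, **kwargs):
            if backend_name in backends:
                # unittest.SkipTest is reported as a skip by unittest and pytest.
                raise unittest.SkipTest(f"{backend_name}: {reason}")
            return test_func(backend_name, *args, **kwargs)
        return wrapper
    return decorator


@skip_backends("torch", reason="hypothetical GPU regression, tracked separately")
def check_example(backend_name):
    # Placeholder body; a real test would exercise a function with arrays
    # created by the given backend.
    return f"ran on {backend_name}"
```

The point of skipping per backend rather than per test is that the same test keeps running on the backends where it still passes.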

@ev-br
Member

ev-br commented Jan 8, 2025

#22279 fixes all torch and jax.numpy failures I see locally.

@lucascolley lucascolley marked this pull request as ready for review January 8, 2025 21:13
@rgommers
Member Author

rgommers commented Jan 8, 2025

All tests pass after gh-22279, thanks again @ev-br and @lucascolley.

Test suite runtime is reasonable at the moment:

[image: screenshot of the test suite runtime]

If we enable much more functionality on GPU then we may end up splitting it, but for now it's easier to have it in a single job.

Still optimizing cache strategies a bit, then this should be ready.

@rgommers
Member Author

rgommers commented Jan 8, 2025

Caching performance after the changes I'm about to push is shown below; that's about as good as it's going to get:

[image: screenshot of the caching performance]

@rgommers
Member Author

rgommers commented Jan 8, 2025

Okay, ready for final review.

@rgommers
Member Author

rgommers commented Jan 8, 2025

See https://github.com/scipy/scipy/actions/workflows/gpu-ci.yml for more logs if you want to see what the changes do.

Member

@lucascolley lucascolley left a comment

Let's give this a go and follow up if anything goes red - thanks Ralf!

@lucascolley lucascolley merged commit 185233e into scipy:main Jan 8, 2025
37 of 38 checks passed
@j-bowhay j-bowhay added this to the 1.16.0 milestone Jan 8, 2025
Development

Successfully merging this pull request may close these issues.

CI: adding a GPU-enabled CI job
4 participants