Multi-GPU training with PyTorch lightning #3171

@vitkl

@macwiatrak and I are trying to figure out how to train a Pyro / scvi-tools model on multiple GPUs using PyTorch Lightning.

I tried PyTorch Lightning Trainer(strategy="horovod", accelerator="GPU", devices=2) with Pyro's HorovodOptimizer. However, this raises ValueError: Tensor is required to be contiguous, which doesn't indicate what to do next.
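For context on the error message, here is a minimal sketch of what "contiguous" means for a PyTorch tensor. It only illustrates the property the error refers to. It does not identify which tensor in the Pyro / Horovod setup triggers the error, and calling `.contiguous()` is not a confirmed fix for this issue. The tensor names are illustrative.

```python
import torch

# A freshly created tensor is laid out contiguously in memory.
x = torch.arange(6).reshape(2, 3)
assert x.is_contiguous()

# Transposing returns a view with swapped strides, which is not contiguous.
y = x.t()
assert not y.is_contiguous()

# .contiguous() copies the data into a contiguous layout with the same values.
z = y.contiguous()
assert z.is_contiguous()
assert torch.equal(y, z)
```

Views produced by operations such as transposing or slicing with a step are the usual source of non-contiguous tensors, so those would be the first places to check.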

Also, https://github.com/pyro-ppl/pyro/blob/dev/examples/svi_horovod.py fails for me on the LSF cluster because it cannot find certain environment variables.
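To narrow down which variables are missing, a small diagnostic along these lines could be run inside the LSF job before launching the example. This is a sketch under assumptions: the variable names listed are the rank and size variables commonly exported by Open MPI (`OMPI_*`) and by Horovod's Gloo launcher (`HOROVOD_*`). The example script may look for different ones, and `missing_env_vars` is a hypothetical helper, not part of Pyro or Horovod.

```python
import os

# Assumption: rank/size variables commonly set by Open MPI and by Horovod's
# Gloo launcher. The failing script may expect a different set.
CANDIDATE_VARS = (
    "OMPI_COMM_WORLD_RANK",
    "OMPI_COMM_WORLD_SIZE",
    "OMPI_COMM_WORLD_LOCAL_RANK",
    "HOROVOD_RANK",
    "HOROVOD_SIZE",
    "HOROVOD_LOCAL_RANK",
)


def missing_env_vars(names=CANDIDATE_VARS, environ=None):
    """Return the names from `names` that are not set in `environ`.

    `environ` defaults to the current process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if name not in env]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Not set in this job:", ", ".join(missing))
    else:
        print("All candidate launcher variables are set.")
```

Running this once under the job scheduler alone and once under the MPI or Horovod launcher would show whether the launcher is exporting the variables at all.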

It would be great to get some help figuring out what is needed to "natively" train Pyro models on multiple GPUs using PyTorch Lightning with Horovod or any other strategy.

We can use https://github.com/BayraktarLab/cell2location as a public test case that should have most of the properties relevant to our current and future projects.

@fritzo

Here is what @adamgayoso thinks about this in the scvi-tools + PyTorch Lightning context: scverse/scvi-tools#1226 (comment)
