
Conversation

stefwalter
Contributor

@stefwalter stefwalter commented Apr 26, 2024

The libcudnn8 package needs to be installed in the container, or
else we see errors like this one:

    File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 237, in <module>
        from torch._C import *  # noqa: F403
        ^^^^^^^^^^^^^^^^^^^^^^
    ImportError: libcudnn.so.8: cannot open shared object file: No such file or directory

And in order to find devices in torch:

    $ python3.11
    >>> import torch
    >>> torch.cuda.device_count()
    1

Without nvidia-driver-NVML installed, the above returns zero.
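
A minimal sketch of the kind of startup check this enables; the assert_cuda_ready helper name is mine and only illustrative, not part of the PR:

    # Sketch only: assumes torch with CUDA support is installed in the image.
    import torch

    def assert_cuda_ready() -> int:
        """Return the number of visible CUDA devices, failing loudly if none."""
        # torch.cuda.is_available() comes back False when the CUDA driver or
        # NVML libraries cannot be loaded inside the container.
        if not torch.cuda.is_available():
            raise RuntimeError("CUDA not available; is nvidia-driver-NVML installed?")
        count = torch.cuda.device_count()
        if count == 0:
            raise RuntimeError("No CUDA devices visible to torch")
        return count

    if __name__ == "__main__":
        print(f"CUDA devices: {assert_cuda_ready()}")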

@stefwalter
Contributor Author

This is needed by #1016

@stefwalter stefwalter force-pushed the cuda-container-fixes branch from d91b651 to 3f0f299 Compare April 26, 2024 14:50
@stefwalter stefwalter changed the title containers: The cuda container needs libcudnn8 containers: cuda container needs libcudnn8 and nvidia-driver-NVML Apr 26, 2024
@stefwalter stefwalter marked this pull request as ready for review April 26, 2024 15:04
@stefwalter
Contributor Author

Tested this on an AWS G5 instance, and it does the trick.

Contributor

@cdoern cdoern left a comment


going to let mergify do the merge on this one -- please hold off on merging

(tested locally and this builds)

@russellb
Member

@Mergifyio rebase

Contributor

mergify bot commented Apr 26, 2024

rebase

✅ Branch has been successfully rebased

@russellb russellb force-pushed the cuda-container-fixes branch from 3f0f299 to 8704229 Compare April 26, 2024 15:34
@mergify mergify bot merged commit 97986af into instructlab:main Apr 26, 2024
@stefwalter
Contributor Author

Groan, looks like I forgot one dependency:

-RUN dnf install -y libcudnn8 nvidia-driver-NVML
+RUN dnf install -y libcudnn8 nvidia-driver-NVML nvidia-driver-cuda-libs
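
As a quick sanity check inside the container, one can try to dlopen the runtime libraries these packages are expected to provide. The specific library names below are my assumption about what libcudnn8, nvidia-driver-NVML, and nvidia-driver-cuda-libs ship; the snippet is a sketch, not part of the PR:

    # Sketch only: probe the shared libraries the error messages refer to.
    import ctypes

    for lib in ("libcudnn.so.8", "libnvidia-ml.so.1", "libcuda.so.1"):
        try:
            ctypes.CDLL(lib)
            print(f"ok: {lib}")
        except OSError as exc:
            print(f"missing: {lib} ({exc})")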

russellb added a commit to russellb/instructlab that referenced this pull request Apr 26, 2024
This was called out by @stefwalter on instructlab#1018. He did that PR and
commented after it merged that this is needed as well.

Signed-off-by: Russell Bryant <rbryant@redhat.com>
@russellb
Member

Groan, looks like I forgot one dependency:

-RUN dnf install -y libcudnn8 nvidia-driver-NVML
+RUN dnf install -y libcudnn8 nvidia-driver-NVML nvidia-driver-cuda-libs

@stefwalter posted in #1023
