@leo9800 leo9800 commented May 23, 2025

After upgrading libnvidia-container and nvidia-container-toolkit to 1.17.7, NVIDIA GPU-enabled Docker containers fail to start, and a segfault is reported in dmesg. This has been reported at NVIDIA/nvidia-container-toolkit#1101.

I obtained a core dump from an affected system and pinpointed the crash with gdb:

$ gdb /usr/bin/nvidia-container-cli core.nvidia-containe.0.89d855bc81c143af89edec60406bca16.134900.1747905214000000
GNU gdb (GDB) 16.3
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/nvidia-container-cli...

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n]) y
Debuginfod has been enabled.
To make this setting permanent, add 'set debuginfod enabled on' to .gdbinit.
Reading symbols from /home/leo/.cache/debuginfod_client/1094c1d4a95fbb47c19c6a2b845588f2037ea6ba/debuginfo...

warning: File /usr/lib/libtirpc.so.3.0.0 doesn't match build-id from core-file during file-backed mapping processing
[New LWP 134900]

warning: Build-id of /usr/lib/libtirpc.so.3 does not match core file.

warning: Could not load shared library symbols for /usr/lib/libtirpc.so.3.
Do you need "set solib-search-path" or "set sysroot"?
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `/usr/bin/nvidia-container-cli --load-kmods '' --cuda-compat-mode=ldconfig --ldconfig=@/sbin/ldconfig --device=0 --compute --utility --pid=134894 /var/lib/docker/overlay2/49588aca3e10f127980e9b617088233834e49c7af6253e67aaf0bfd3f1a310cc/merged'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000688ca58d5398 in nvc_ldcache_update (ctx=0xf33c42c8290, cnt=0xf33c42c8840)
    at /usr/src/debug/libnvidia-container/libnvidia-container-1.17.7/src/nvc_ldcache.c:488
488	        if (*argv[0] == '@') {
(gdb) backtrace 
#0  0x0000688ca58d5398 in nvc_ldcache_update (ctx=0xf33c42c8290, cnt=0xf33c42c8840)
    at /usr/src/debug/libnvidia-container/libnvidia-container-1.17.7/src/nvc_ldcache.c:488
#1  0x00000f33bf2f5451 in configure_command (ctx=<optimized out>)
    at /usr/src/debug/libnvidia-container/libnvidia-container-1.17.7/src/cli/configure.c:469
#2  0x00000f33bf2f1353 in main (argc=10, argv=0x71357b76a3b8)
    at /usr/src/debug/libnvidia-container/libnvidia-container-1.17.7/src/cli/main.c:149
(gdb) frame 0
#0  0x0000688ca58d5398 in nvc_ldcache_update (ctx=0xf33c42c8290, cnt=0xf33c42c8840)
    at /usr/src/debug/libnvidia-container/libnvidia-container-1.17.7/src/nvc_ldcache.c:488
488	        if (*argv[0] == '@') {
(gdb) list
483	                argv = (char * []){cnt->cfg.ldconfig, "-f", "/etc/ld.so.conf", "-C", "/etc/ld.so.cache", cnt->cuda_compat_dir, cnt->cfg.libs_dir, cnt->cfg.libs32_dir, NULL};
484	        } else {
485	                argv = (char * []){cnt->cfg.ldconfig, "-f", "/etc/ld.so.conf", "-C", "/etc/ld.so.cache", cnt->cfg.libs_dir, cnt->cfg.libs32_dir, NULL};
486	        }
487	
488	        if (*argv[0] == '@') {
489	                /*
490	                 * We treat this path specially to be relative to the host filesystem.
491	                 * Force proc to be remounted since we're creating a PID namespace and fexecve depends on it.
492	                 */    
(gdb) print *argv
$1 = 0x0

so, as expected, dereferencing it fails:

(gdb) print *argv[0]
Cannot access memory at address 0x0

I therefore inspected the three commits after 1.17.6, up to and including 1.17.7.

The culprit code shown in gdb appears to have been introduced by 8ed5824 and merged by d26524a, so this patch partially reverts the related code. After applying this patch and rebuilding libnvidia-container 1.17.7 and nvidia-container-toolkit 1.17.7, everything works with Docker again.

Signed-off-by: Leo <i@hardrain980.com>

leo9800 commented May 23, 2025

It remains a mystery why

        if ((cnt->flags & OPT_CUDA_COMPAT_MODE_LDCONFIG) && (cnt->cuda_compat_dir != NULL)) {
                /*
                 * We include the cuda_compat_dir directory on the ldconfig
                 * command line. This ensures that the CUDA Forward compat
                 * libraries take precendence over the user-mode driver
                 * libraries in the standard library paths (libs_dir and
                 * libs32_dir).
                 * */
                log_info("prefering CUDA Forward Compatibility dir when running ldconfig");
                argv = (char * []){cnt->cfg.ldconfig, "-f", "/etc/ld.so.conf", "-C", "/etc/ld.so.cache", cnt->cuda_compat_dir, cnt->cfg.libs_dir, cnt->cfg.libs32_dir, NULL};
        } else {
                argv = (char * []){cnt->cfg.ldconfig, "-f", "/etc/ld.so.conf", "-C", "/etc/ld.so.cache", cnt->cfg.libs_dir, cnt->cfg.libs32_dir, NULL};
        }

causes a segfault by leaving *argv null when it is dereferenced, while

        argv = (char * []){cnt->cfg.ldconfig, "-f", "/etc/ld.so.conf", "-C", "/etc/ld.so.cache", cnt->cfg.libs_dir, cnt->cfg.libs32_dir, NULL};

does not.

The latter is in fact identical to the code in the else block. Interesting...

@raldone01

See #316 for the reasons. It's because of undefined behavior (UB) in C.
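
One plausible reading of that UB, assuming it is the one #316 refers to: in C, a compound literal written inside a block has automatic storage tied to that block, so an `argv` assigned inside the `if`/`else` bodies points at an array whose lifetime has ended by the time line 488 dereferences it. The single-assignment version works because its literal lives in the enclosing function scope. Below is a minimal sketch of one way to avoid the problem by filling a caller-owned array instead. `struct ld_paths`, `LD_ARGV_MAX`, and `build_ldconfig_argv` are hypothetical names for illustration, not libnvidia-container code and not necessarily the fix adopted in #316.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the cnt fields used when building argv. */
struct ld_paths {
        const char *ldconfig;
        const char *cuda_compat_dir; /* may be NULL */
        const char *libs_dir;
        const char *libs32_dir;
};

/* 5 fixed arguments + optional compat dir + 2 library dirs + NULL. */
#define LD_ARGV_MAX 9

/*
 * Fill a caller-owned array, so the storage outlives any inner block.
 * Returns the number of arguments written, not counting the NULL terminator.
 */
static size_t build_ldconfig_argv(const struct ld_paths *p, int prefer_compat,
                                  const char *out[LD_ARGV_MAX])
{
        size_t n = 0;

        assert(p != NULL && out != NULL);

        out[n++] = p->ldconfig;
        out[n++] = "-f";
        out[n++] = "/etc/ld.so.conf";
        out[n++] = "-C";
        out[n++] = "/etc/ld.so.cache";
        /* Listing the compat dir first gives it precedence over libs_dir. */
        if (prefer_compat && p->cuda_compat_dir != NULL)
                out[n++] = p->cuda_compat_dir;
        out[n++] = p->libs_dir;
        out[n++] = p->libs32_dir;
        out[n] = NULL;
        return n;
}
```

A caller would declare `const char *argv[LD_ARGV_MAX];` at function scope, call `build_ldconfig_argv(&paths, flag, argv)`, and could then safely test `argv[0][0] == '@'` afterwards, since the array belongs to the caller's scope rather than to a nested block.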

leo9800 commented May 24, 2025

#316 pinpoints the true cause of this issue and implements a better fix that does not roll back the new feature.

Prefer the approach suggested in #316.

Closing.

@leo9800 leo9800 closed this May 24, 2025