pytorch
6f92f10c - Use a leaky singleton for CublasHandlePool. (#60987)

Commit
3 years ago
Use a leaky singleton for CublasHandlePool. (#60987) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60987 We were seeing deadlocks as follows during shutdown: ``` Thread 1 (LWP 2432101): #0 0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6 #1 0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0 #2 0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1 #3 0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so #4 0x00007efc651aee03 in ?? () from /lib64/libcuda.so #5 0x00007efc64f76b84 in ?? () from /lib64/libcuda.so #6 0x00007efc64f77f5d in ?? () from /lib64/libcuda.so #7 0x00007efc64eac858 in ?? () from /lib64/libcuda.so #8 0x00007efc64eacfbc in ?? () from /lib64/libcuda.so #9 0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11 #20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11 #21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so #22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so #23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6 #24 0x00007efca4648c40 in exit () from /lib64/libc.so.6 #25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292 #26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636 #27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646 #28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457 #29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420 #30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907 #31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460 #32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495 #33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6 #34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103 ``` This was likely caused due to a static singleton that wasn't leaky. Following the guidance in https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2 to use a leaky singleton instead. ghstack-source-id: 132847448 Test Plan: Verified locally. Reviewed By: malfet Differential Revision: D29468866 fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
Author
Parents
Loading