pytorch
e3691be2 - Dump C++ stack traces of all threads for distributed tests. (#55003)

Commit
3 years ago
Dump C++ stack traces of all threads for distributed tests. (#55003) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55003 Using the `caffe2::setPrintStackTracesOnFatalSignal` utility in distributed tests to set a signal handler that dumps the state of all threads for all processes when it receives a FATAL signal. This would help in debugging tests further. I had to revert all the python faulthandler code since only one signal handler function is supported, so running python faulthandler with `setPrintStackTracesOnFatalSignal` doesn't work. Sample output: ``` SIGSEGV(11), PID: 3492872, Thread 3492872: [0] ???(0x7fa7b2d1d61b) in libcaffe2_caffe2_caffe2_cpu.so [1] ???(0x7fa7b2d1d3fb) in libcaffe2_caffe2_caffe2_cpu.so [2] ???(0x7fa7b2d1d33d) in libcaffe2_caffe2_caffe2_cpu.so [3] ???(0x7fa7b2d1d167) in libcaffe2_caffe2_caffe2_cpu.so [4] ???(0x7fa7ce683150) in libpthread.so.0 [5] ???(0x7fa7be2b233c) in libcaffe2__C_impl_cuda.so [6] ???(0x7fa7be2ce80c) in libcaffe2__C_impl_cuda.so [7] ???(0x7fa7be2a0512) in libcaffe2__C_impl_cuda.so [8] torch::distributed::rpc::TensorPipeAgent::send(torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, float, std::unordered_map<signed char, signed char, std::hash<signed char>, std::equal_to<signed char>, std::allocator<std::pair<signed char const, signed char> > > const&)+0x24f(0x7fa7be29f71f) in libcaffe2__C_impl_cuda.so [9] torch::distributed::autograd::sendMessageWithAutograd(torch::distributed::rpc::RpcAgent&, torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, bool, float, bool)+0x393(0x7fa7b602b203) in libcaffe2_libtorch.so [10] torch::distributed::rpc::pyRpcPythonUdf(torch::distributed::rpc::WorkerInfo const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, float, bool)+0x201(0x7fa7bd844971) in libcaffe2__C_impl_cuda.so ``` ghstack-source-id: 125630551 Test Plan: waitforbuildbot Reviewed By: SciPioneer Differential Revision: D27419714 fbshipit-source-id: 8aca9a14ef688004053d8798124d9c3a3fbe3489
Author
Parents
Loading