surface ncclUniqueId store broadcast error (#68597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68597
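Users were confused when NCCL communicator setup failed with nothing more than 'Socket Timeout'. This change surfaces a detailed error message when retrieving the ncclUniqueId from the store fails, naming the ranks involved, the store key, and the underlying exception. Context: https://fb.workplace.com/groups/319878845696681/posts/650320792652483/. Because we now rely on the store more often for desync timeout and slowness detection, we will also need a general wrapper that surfaces error messages for all store APIs.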
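A minimal sketch of what such a general wrapper might look like, assuming a helper that runs a store operation and rethrows any exception with caller-supplied context. `withStoreContext`, `FakeStore`, `describeGet`, and `fetchUniqueId` are hypothetical names used for illustration only. They are not part of c10d and not the code in this PR.
```
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a c10d API): run a store operation and, if it
// throws, rethrow with a message that says who was doing what.
template <typename Fn>
auto withStoreContext(const std::string& context, Fn&& op) -> decltype(op()) {
  try {
    return op();
  } catch (const std::exception& e) {
    throw std::runtime_error(context + " got exception: " + e.what());
  }
}

// Hypothetical stand-in for a store whose get() always times out.
struct FakeStore {
  std::string get(const std::string& /*key*/) {
    throw std::runtime_error("Socket Timeout");
  }
};

// Builds the context string, loosely modeled on the message in the test plan.
std::string describeGet(int rank, const std::string& key) {
  assert(rank >= 0);
  return "[" + std::to_string(rank) +
      "] is setting up NCCL communicator and retrieving ncclUniqueId from [0]"
      " via c10d key-value store by key '" + key + "', but store->get('" +
      key + "')";
}

// Example call site: the raw "Socket Timeout" is wrapped with rank and key.
std::string fetchUniqueId(FakeStore& store, int rank, const std::string& key) {
  return withStoreContext(
      describeGet(rank, key), [&]() { return store.get(key); });
}
```
With a helper of this shape, each store call site (get, set, wait, and so on) could supply its own context string instead of repeating the try/catch.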
Test Plan:
```
RuntimeError: [3] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got exception: Socket Timeout
Exception raised from recvBytes at caffe2/torch/csrc/distributed/c10d/Utils.hpp:595 (most recent call first):
# 0 c10::get_backtrace[abi:cxx11](unsigned long, unsigned long, bool)
# 1 std::_Function_handler<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > (), c10::(anonymous namespace)::GetFetchStackTrace()::$_0>::_M_invoke(std::_Any_data const&)
# 2 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
# 3 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*)
# 4 c10d::TCPStore::doWait(c10::ArrayRef<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::chrono::duration<long, std::ratio<1l, 1000l> >)
# 5 c10d::TCPStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 6 c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 7 c10d::PrefixStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 8 c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, c10d::OpType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int)
# 9 c10d::ProcessGroupNCCL::getNCCLComm(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool)
# 10 c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&)
# 11 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::ProcessGroup::Work, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup::Work> >, c10d::ProcessGroup, std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::ProcessGroup::WorkTraceback (most recent call last):
```
Reviewed By: rohan-varma, mingzhe09088
Differential Revision: D32533304
fbshipit-source-id: e471636ee0c5291215cb6cde659b10bee13b7d12