[NCCL] Provide additional information about NCCL error codes. (#45950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45950
A pain point when debugging training jobs that fail with NCCL errors has
been identifying the source of the error, since NCCL itself reports few
details (usually just "unhandled {system, cuda, internal} error").
In this PR, we add some basic debug information about what went wrong. The information was collected by grepping the NCCL codebase for the places where these errors are returned. For example, `ncclSystemError` is returned when system calls such as malloc or munmap fail.
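The idea can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `StandInNcclResult`, `describeNcclError`, and `buildErrorMessage` are hypothetical names, the enum is a stand-in for NCCL's `ncclResult_t` so the snippet compiles without `nccl.h`, and only the `ncclSystemError` wording is taken from the message shown in this summary.

```cpp
#include <string>

// Hypothetical stand-in for NCCL's ncclResult_t, so this sketch builds
// without nccl.h. The real enum has more values than listed here.
enum class StandInNcclResult {
  Success,
  UnhandledCudaError,
  SystemError,
  InternalError,
};

// Hypothetical helper: returns an extra detail line for a result code,
// or an empty string when there is nothing to add.
inline std::string describeNcclError(StandInNcclResult result) {
  switch (result) {
    case StandInNcclResult::UnhandledCudaError:
      // Illustrative wording, not taken from this PR's summary.
      return "ncclUnhandledCudaError: Call to CUDA function failed.";
    case StandInNcclResult::SystemError:
      // Wording taken from the example message in this summary.
      return "ncclSystemError: System call (socket, malloc, munmap, etc) failed.";
    case StandInNcclResult::InternalError:
      // Illustrative wording, not taken from this PR's summary.
      return "ncclInternalError: Internal check failed.";
    default:
      return "";
  }
}

// Hypothetical helper: appends the detail line, if any, to the base message.
inline std::string buildErrorMessage(const std::string& base,
                                     StandInNcclResult result) {
  const std::string detail = describeNcclError(result);
  return detail.empty() ? base : base + "\n" + detail;
}
```

Keeping the detail on its own trailing line leaves the existing first part of the message unchanged, so anything that already matches on it should keep working.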
Tested by forcing `result = ncclSystemError` in the macro. The new error
message looks like:
```
RuntimeError: NCCL error in:
caffe2/torch/lib/c10d/ProcessGroupNCCL.cpp:759, unhandled system error, NCCL
version 2.7.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
```
The last line is what we have added to the message.
In the future, we will also evaluate setting NCCL_DEBUG=WARN, which makes
NCCL provide more details about errors as well.
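For reference, one way a process could default that variable is sketched below. This is an assumption-laden illustration and not part of this PR: `defaultNcclDebugToWarn` is a hypothetical helper, it relies on POSIX `setenv`, and it assumes the call happens before NCCL reads its environment.

```cpp
#include <stdlib.h>
#include <string>

// Hypothetical helper (POSIX only): defaults NCCL_DEBUG to WARN without
// overriding a value the user has already exported. Returns the value in
// effect after the call.
inline std::string defaultNcclDebugToWarn() {
  // overwrite = 0 keeps any value already present in the environment.
  setenv("NCCL_DEBUG", "WARN", 0);
  const char* value = getenv("NCCL_DEBUG");
  return value ? std::string(value) : std::string();
}
```

Passing `0` as the overwrite flag means a user who has already set, say, `NCCL_DEBUG=INFO` keeps the more verbose setting.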
ghstack-source-id: 114219288
Test Plan: CI
Reviewed By: mingzhe09088
Differential Revision: D24155894
fbshipit-source-id: 10810ddf94d6f8cd4989ddb3436ddc702533e1e1