[c10d] Provide failure reason from ProcessGroup when aborting NCCL comm (#64241)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64241
When things go wrong, PG NCCL aborts NCCL communicators via `ncclCommAbort`. One issue is that the resulting error is often reported as `ncclSystemError` (see https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L176), even when that is not the true cause and the actual problem is that some prior work timed out, the communicator was aborted on another rank, etc.
This causes a lot of confusion when debugging jobs with a large number of processes, because the current message for `ncclSystemError` is not very informative: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L22
The fix here is to pass a string exception message from PG NCCL down to `NCCLUtils`, which will aim to raise that message as the actual issue instead of the confusing `ncclSystemError` message.
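A rough sketch of the idea, not the actual code in this PR: the error formatter takes an optional failure reason and prefers it over the generic text for the error code. `FakeNcclResult`, `genericErrorText`, `describeFailure`, and all message wording are hypothetical names and strings made up for illustration; they are not the real `NCCLUtils` or NCCL API.

```cpp
#include <string>

// Hypothetical stand-in for NCCL's result enum (NOT the real ncclResult_t).
enum class FakeNcclResult { Success, UnhandledCudaError, SystemError };

// Generic, code-based message. The wording here is invented for the sketch.
inline std::string genericErrorText(FakeNcclResult result) {
  switch (result) {
    case FakeNcclResult::Success:
      return "success";
    case FakeNcclResult::UnhandledCudaError:
      return "unhandled CUDA error";
    case FakeNcclResult::SystemError:
      return "unhandled system error";
  }
  return "unknown error";
}

// Prefer the failure reason recorded by the process group (for example a
// timeout or an abort on another rank) over the generic message. An empty
// reason means nothing was recorded, so fall back to the generic text.
inline std::string describeFailure(
    FakeNcclResult result,
    const std::string& failureReason = "") {
  if (!failureReason.empty()) {
    return "NCCL communicator was aborted. Original reason for failure was: " +
        failureReason;
  }
  return "NCCL error: " + genericErrorText(result);
}
```

With this shape, a caller that knows why it aborted the communicator (say, a watchdog that detected a timeout) can pass that reason along, and callers without extra context keep the previous behavior.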
Test Plan: CI
Reviewed By: pallab-zz, cbalioglu
Differential Revision: D30658855
fbshipit-source-id: 17661dbe0a1bb8cc5b87b637c47634b1f52f54e1