pytorch
e0e832c2 - [c10d] Provide failure reason from ProcessGroup when aborting NCCL comm (#64241)

Commit
4 years ago
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64241

When things go wrong, PG NCCL aborts NCCL communicators via `ncclCommAbort`. One issue is that the error is often set to `ncclSystemError` (see https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L176) even when that is not the true cause: the actual problem may be that some prior work timed out, that the communicator was aborted on another rank, etc. This causes a lot of confusion when debugging jobs with a large number of processes, because the current message for `ncclSystemError` is not very informative: https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/c10d/NCCLUtils.hpp#L22

The fix here is to pass a string exception message from PG NCCL down to `NCCLUtils`, which will aim to raise that message as the actual issue instead of the confusing `ncclSystemError` message.

Test Plan: CI

Reviewed By: pallab-zz, cbalioglu

Differential Revision: D30658855

fbshipit-source-id: 17661dbe0a1bb8cc5b87b637c47634b1f52f54e1
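The idea of preferring a process-group-supplied reason over a generic error-code message can be sketched as follows. This is a minimal, standalone illustration and not the code from the PR: `CommError`, `genericErrorDetail`, and `errorDetail` are hypothetical names, and the enum is a stand-in for `ncclResult_t` so that the sketch compiles without NCCL.

```cpp
#include <string>

// Hypothetical stand-in for ncclResult_t so the sketch builds without NCCL.
enum class CommError { Success, SystemError, InternalError };

// Generic text keyed only on the error code. This is the kind of message
// that is uninformative when the real cause lies elsewhere.
inline std::string genericErrorDetail(CommError error) {
  switch (error) {
    case CommError::Success:
      return "no error";
    case CommError::SystemError:
      return "system error: a system or external library call failed";
    case CommError::InternalError:
      return "internal error";
  }
  return "unknown error";
}

// Prefer the failure reason recorded by the process group (for example a
// timeout, or an abort triggered on another rank). An empty string means
// that no reason was recorded, so fall back to the generic text.
inline std::string errorDetail(
    CommError error,
    const std::string& failureReason = "") {
  if (!failureReason.empty()) {
    return failureReason;
  }
  return genericErrorDetail(error);
}
```

With this shape, a call such as `errorDetail(CommError::SystemError, "Work timed out on rank 3")` surfaces the recorded reason, while `errorDetail(CommError::SystemError)` still yields the generic text when nothing was recorded.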