pytorch
9fbfaaa5 - [c10d] Add flag value for direct teardown without comm abort (#102599)

Commit
1 year ago
It was recently reported that `ncclCommAbort` itself may hang in some NCCL versions; see, for example, https://github.com/NVIDIA/nccl/issues/829. In that case, it may be desirable to tear down the program directly, without properly aborting the NCCL communicator, so that the user does not wait for hours before noticing a hang.

This PR adds a new value, 3, for the environment variable `NCCL_ASYNC_ERROR_HANDLING`. With this value, the comm abort is skipped and an error is thrown directly when an exception occurs (timeout, async error, etc.).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102599
Approved by: https://github.com/fegin
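As a minimal sketch of opting in to the new value, the snippet below builds an environment with `NCCL_ASYNC_ERROR_HANDLING=3` for launching a training process. The helper name `env_with_direct_teardown` and the constant `SKIP_COMM_ABORT` are hypothetical names chosen for illustration and are not part of PyTorch. It also assumes the variable has to be in place before the NCCL process group is created.

```python
import os

# Value added by this commit: skip the comm abort and throw the error directly.
# The constant name is illustrative only, not a PyTorch identifier.
SKIP_COMM_ABORT = "3"


def env_with_direct_teardown(base_env=None):
    """Return a copy of the environment with NCCL_ASYNC_ERROR_HANDLING set to 3.

    base_env defaults to the current process environment; the input mapping
    is never modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_ASYNC_ERROR_HANDLING"] = SKIP_COMM_ABORT
    return env


if __name__ == "__main__":
    env = env_with_direct_teardown()
    print(env["NCCL_ASYNC_ERROR_HANDLING"])
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the training job, or the variable can simply be exported in the shell before launch.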