pytorch
c4e0c927 - [c10d] Add a soft error handling mode (#84386)

Commit
3 years ago
Add a new value, `2`, for the `NCCL_ASYNC_ERROR_HANDLING` environment variable, standing for a "CleanUpOnly" error handling mode. Compared to `NCCL_ASYNC_ERROR_HANDLING=1`, the "CleanUpOnly" mode only aborts the collectives and NCCL communicators and does not tear down the process. Users will be able to query the state of the process group (in a later PR), abort the process group (in a later PR), and re-create the process group if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84386
Approved by: https://github.com/rohan-varma
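
For illustration only, the mapping from the environment variable to an error handling mode can be sketched in Python as below. This is not the c10d implementation: the enum and function names are hypothetical, the labels for values `0` and `1` are assumptions (only "CleanUpOnly" is named in this commit message), and falling back to no handling for unset or unrecognized values is likewise an assumption.

```python
import os
from enum import Enum


class ErrorHandlingMode(Enum):
    # Hypothetical labels for 0 and 1; only "CleanUpOnly" (2) is named in the commit message.
    NO_HANDLING = 0    # assumed: async error handling disabled
    TEAR_DOWN = 1      # abort collectives/communicators and tear down the process
    CLEAN_UP_ONLY = 2  # abort collectives/communicators, keep the process alive


def parse_error_handling_mode(environ=os.environ):
    """Read NCCL_ASYNC_ERROR_HANDLING from a mapping and return the matching mode.

    Unset or unrecognized values fall back to NO_HANDLING (an assumption of this sketch).
    """
    raw = environ.get("NCCL_ASYNC_ERROR_HANDLING", "0")
    try:
        return ErrorHandlingMode(int(raw))
    except ValueError:
        # Raised both for non-integer strings and for integers outside the enum.
        return ErrorHandlingMode.NO_HANDLING


if __name__ == "__main__":
    mode = parse_error_handling_mode({"NCCL_ASYNC_ERROR_HANDLING": "2"})
    print(mode.name)
```

In a real job the variable would typically be exported in the launch environment (for example `NCCL_ASYNC_ERROR_HANDLING=2`) before the NCCL process group is created.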