pytorch
0a231512 - Detect and handle NCCL errors appropriately in ProcessGroupNCCL. (#22907)

Commit
5 years ago
Detect and handle NCCL errors appropriately in ProcessGroupNCCL. (#22907)

Summary:
This change adds the following functionality:

1) WorkNCCL isCompleted, isSuccess methods check for NCCL errors and set the appropriate exception.
2) Added a watchdog thread to ProcessGroupNCCL which checks for errors in the cached communicators and removes them from the cache.
3) Use ncclCommAbort in NCCLComm destructor since ncclCommDestroy can block forever waiting for work.
4) Added a simulate_nccl_errors.py script to simulate NCCL errors.

https://github.com/pytorch/pytorch/issues/17882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/22907

Test Plan:
1) Run the simulate_nccl_errors.py to verify NCCL errors are caught.

Differential Revision: D16220638

fbshipit-source-id: fbc8881ea0c38a4d09a77045691e36557b7b0b25
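The watchdog behavior in item 2 can be illustrated with a small Python sketch. This is not the C++ code from the commit: `FakeComm`, `CommCache`, `check_for_error`, and `sweep` are hypothetical names invented for illustration, and aborting a communicator at eviction time is a simplifying assumption (the commit only states that the destructor uses ncclCommAbort). The sketch shows the general pattern of a background thread that polls cached communicators and evicts any that report an error.

```python
import threading


class FakeComm:
    """Hypothetical stand-in for a cached NCCL communicator."""

    def __init__(self):
        self.async_error = None  # set to a string to simulate a failure
        self.aborted = False

    def check_for_error(self):
        # Returns None when healthy, otherwise the simulated error.
        return self.async_error

    def abort(self):
        # Assumption: tear down immediately instead of waiting for
        # outstanding work, which is the motivation given for ncclCommAbort.
        self.aborted = True


class CommCache:
    """Hypothetical communicator cache with a polling watchdog thread."""

    def __init__(self, poll_interval=0.1):
        self._comms = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._poll_interval = poll_interval
        self._watchdog = threading.Thread(target=self._watch, daemon=True)
        self._watchdog.start()

    def put(self, key, comm):
        with self._lock:
            self._comms[key] = comm

    def get(self, key):
        with self._lock:
            return self._comms.get(key)

    def sweep(self):
        """One watchdog pass: evict and abort communicators with errors."""
        evicted = []
        with self._lock:
            bad_keys = [
                key for key, comm in self._comms.items()
                if comm.check_for_error() is not None
            ]
            for key in bad_keys:
                self._comms.pop(key).abort()
                evicted.append(key)
        return evicted

    def _watch(self):
        # Event.wait returns False on timeout, so the loop sweeps once per
        # interval and exits as soon as shutdown() sets the event.
        while not self._stop.wait(self._poll_interval):
            self.sweep()

    def shutdown(self):
        self._stop.set()
        self._watchdog.join()
```

In this sketch a caller that later asks the cache for an evicted key gets `None` and would have to create a fresh communicator, which is one plausible reason to remove errored communicators rather than leave them cached.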