Detect and handle NCCL errors appropriately in ProcessGroupNCCL. (#25012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/25012
Resubmitting https://github.com/pytorch/pytorch/pull/22907 with a build fix.
This change adds the following functionality:
1) The WorkNCCL isCompleted and isSuccess methods now check for NCCL errors and
set the appropriate exception.
2) A watchdog thread in ProcessGroupNCCL checks the cached communicators for
errors and removes any communicator that has an error from the cache.
3) The NCCLComm destructor now uses ncclCommAbort, since ncclCommDestroy can
block forever waiting for work.
4) A simulate_nccl_errors.py script simulates NCCL errors.
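
The watchdog behavior in items 2 and 3 can be sketched roughly as follows. This
is a minimal illustration in plain Python, not the C++ code in this change:
FakeComm, CommCache, check_for_error, abort, and sweep are hypothetical names
invented for the sketch, and the polling interval is an arbitrary assumption.

```python
import threading
import time


class FakeComm:
    """Hypothetical stand-in for a cached communicator (not a PyTorch API)."""

    def __init__(self, name):
        self.name = name
        self.error = None      # set to an exception to simulate an async error
        self.aborted = False

    def check_for_error(self):
        # Assumption: the real code would query NCCL for an asynchronous error.
        return self.error

    def abort(self):
        # Mirrors the idea of aborting instead of destroying, so that
        # teardown cannot block on outstanding work.
        self.aborted = True


class CommCache:
    """Hypothetical cache whose background thread evicts failed communicators."""

    def __init__(self, interval=0.01):
        self._comms = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._interval = interval
        self._thread = threading.Thread(target=self._watchdog, daemon=True)
        self._thread.start()

    def add(self, key, comm):
        with self._lock:
            self._comms[key] = comm

    def keys(self):
        with self._lock:
            return sorted(self._comms)

    def sweep(self):
        # Remove and abort every communicator that reports an error.
        with self._lock:
            failed = [k for k, c in self._comms.items()
                      if c.check_for_error() is not None]
            for key in failed:
                self._comms.pop(key).abort()

    def _watchdog(self):
        # Event.wait returns False on timeout, so this sweeps once per
        # interval until shutdown() sets the stop event.
        while not self._stop.wait(self._interval):
            self.sweep()

    def shutdown(self):
        self._stop.set()
        self._thread.join()
```

In this sketch the cache lock is held while sweeping, so a caller never
observes a communicator that has been aborted but is still cached.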
Related issue: https://github.com/pytorch/pytorch/issues/17882
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22907
Test Plan: Run simulate_nccl_errors.py to verify that NCCL errors are caught.
Differential Revision: D16958078
fbshipit-source-id: 662b0b8b8ee250e2b6d15bdfc9306d71c4f66219