Change default NCCL_ASYNC_ERROR_HANDLING to 3:SkipCleanUp (#110723)
Summary
Currently, when detecting a timeout/exception in the watchdog
workCleanupLoop, we call nccl APIs to abort all the active communicators
before finally re-raising the exception and killing the process. The
nccl APIs may hang, causing additional problems. Instead, just re-raise.
@kumpera proposed that changing this default should save us from a lot of commonly observed errors.
Note: there are other cuda/nccl api calls in our watchdog, which also could hang. This change is not a substitute for a deeper refactor.
Detail
The current default (NCCL_ASYNC_ERROR_HANDLING=1:TearDown) means the following:
SHOULD_TEAR_DOWN() evaluates to true
- This affects `ProcessGroupNCCL::WorkNCCL::handleException`
- handleException is called from two places:
- work.wait() -> synchronizeInternal() -> handleException()
- workCleanupLoop() -> handleException()
- When true, the exception is logged and rethrown
SHOULD_CLEAN_UP() evaluates to true
- This only impacts the workCleanupLoop()
- When true, it means all communicators will be aborted (ncclCommAbort())
upon work exception or timeout
The proposed new default is NCCL_ASYNC_ERROR_HANDLING=3:SkipCleanUp.
This only changes SHOULD_CLEAN_UP() to false, impacting workCleanupLoop() behavior.
Communicators will no longer be aborted, which should avoid a class of bugs where the watchdog hangs due to calling nccl APIs which may block/hang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110723
Approved by: https://github.com/fduwjj, https://github.com/xw285cornell