pytorch
733368a8 - Change default NCCL_ASYNC_ERROR_HANDLING to 3:SkipCleanUp (#110723)

1 year ago
Change default NCCL_ASYNC_ERROR_HANDLING to 3:SkipCleanUp (#110723)

Summary

Currently, when detecting a timeout/exception in the watchdog workCleanupLoop, we call NCCL APIs to abort all active communicators before finally re-raising the exception and killing the process. The NCCL APIs may themselves hang, causing additional problems. Instead, just re-raise. @kumpera proposed that changing this default should save us from many commonly observed errors.

Note: there are other CUDA/NCCL API calls in the watchdog which could also hang. This change is not a substitute for a deeper refactor.

Detail

The current default (NCCL_ASYNC_ERROR_HANDLING=1:TearDown) means the following:

SHOULD_TEAR_DOWN() evaluates to true:
- This affects `ProcessGroupNCCL::WorkNCCL::handleException`.
- handleException is called from two places:
  - work.wait() -> synchronizeInternal() -> handleException()
  - workCleanupLoop() -> handleException()
- When true, the exception is logged and rethrown.

SHOULD_CLEAN_UP() evaluates to true:
- This only impacts workCleanupLoop().
- When true, all communicators are aborted (ncclCommAbort()) upon a work exception or timeout.

The proposed new default is NCCL_ASYNC_ERROR_HANDLING=3:SkipCleanUp. This only changes SHOULD_CLEAN_UP() to false, affecting workCleanupLoop() behavior. Communicators will no longer be aborted, which should avoid a class of bugs where the watchdog hangs due to calling NCCL APIs that may block or hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110723
Approved by: https://github.com/fduwjj, https://github.com/xw285cornell