pytorch
48c47db8 - [NCCL] Add Environment Variable to guard Async Error Handling feature (#44163)

Commit
4 years ago
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44163

In this PR, we introduce a new environment variable (NCCL_ASYNC_ERROR_HANDLING), which guards the asynchronous error handling feature. We intend to eventually turn this feature on by default for all users, but this is a temporary solution so that the change in behavior from hanging to crashing does not suddenly become the default for users.

ghstack-source-id: 111637788

Test Plan: CI/Sandcastle. We will turn on this env var by default in torchelastic and HPC trainer soon.

Reviewed By: jiayisuse

Differential Revision: D23517895

fbshipit-source-id: e7cd244b2ddf2dc0800ff7df33c73a6f00b63dcc
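As an illustration of how a user might opt in to the guarded feature, here is a minimal sketch in Python. Only the variable name NCCL_ASYNC_ERROR_HANDLING comes from the commit message. The value "1" as the "on" setting and both helper functions are assumptions made for this example; they are not the parsing logic used inside PyTorch.

```python
import os

# Variable name taken from the commit message.
ENV_VAR = "NCCL_ASYNC_ERROR_HANDLING"


def enable_async_error_handling(env=None):
    """Hypothetical helper: opt in by setting the variable to "1".

    Assumption: "1" is the value that turns the feature on.
    """
    env = os.environ if env is None else env
    env[ENV_VAR] = "1"
    return env


def is_async_error_handling_enabled(env=None):
    """Hypothetical helper: treat "1" as on; unset or anything else as off."""
    env = os.environ if env is None else env
    return env.get(ENV_VAR, "0").strip() == "1"


if __name__ == "__main__":
    # The variable would presumably need to be set before the NCCL process
    # group is created, e.g. before a call such as
    # torch.distributed.init_process_group(backend="nccl").
    enable_async_error_handling()
    print(ENV_VAR, "enabled:", is_async_error_handling_enabled())
```

Setting the variable in the launching shell or job configuration instead of in Python would presumably work as well, since the feature is guarded purely by the environment.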