[NCCL] Add Environment Variable to guard Async Error Handling feature (#44163)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44163
This PR introduces a new environment variable, NCCL_ASYNC_ERROR_HANDLING,
which guards the asynchronous error handling feature. We intend to eventually
turn this feature on by default for all users. Until then, the guard is a
temporary measure so that the change in behavior (from hanging to crashing)
does not suddenly become the default for existing users.
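
As an illustration of how a user might opt in, here is a minimal sketch. The
variable name comes from this PR; the value "1" as the enabling setting and
both helper names are assumptions made for the example and are not part of
PyTorch.

```python
import os

# Name of the guard introduced by this PR.
ENV_VAR = "NCCL_ASYNC_ERROR_HANDLING"


def enable_async_error_handling(env=os.environ):
    """Hypothetical helper: opt in to async error handling.

    Assumes "1" enables the feature. Call this before the NCCL process
    group is created (e.g. before torch.distributed.init_process_group),
    since the variable is expected to be read at that point.
    """
    env[ENV_VAR] = "1"


def async_error_handling_requested(env=os.environ):
    """Hypothetical helper: report whether the opt-in is set.

    Unset is treated as disabled, matching the opt-in behavior
    described in this PR.
    """
    return env.get(ENV_VAR, "0") == "1"
```

Equivalently, the variable can be exported in the launching shell or job
configuration, so no code change is needed to opt in.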
ghstack-source-id: 111637788
Test Plan:
CI/Sandcastle. We will soon turn on this environment variable by default in
torchelastic and the HPC trainer.
Reviewed By: jiayisuse
Differential Revision: D23517895
fbshipit-source-id: e7cd244b2ddf2dc0800ff7df33c73a6f00b63dcc