Abort NCCL communicators before throwing operation timed out exception (#31128)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31128
When an operation times out because of an error that the NCCL communicators do not detect, ncclCommWatchdog cannot see the timeout and therefore cannot abort the ncclComms. So explicitly abort the ncclComms here before throwing the timed-out exception to the user. After this, ncclCommWatchdog can detect that the NCCL communicators are aborted and clean up devNCCLCommMap_ accordingly. If the timed-out exception is thrown without aborting the NCCL communicators here, it was observed that the CUDA GPU stays at 100% utilization and cannot run new events successfully.
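The ordering described above (abort first, then throw, then let the watchdog clean up) can be illustrated with a minimal Python sketch. This is not the code in this diff: FakeComm, wait_blocking, watchdog_cleanup, and comm_map are hypothetical stand-ins for the NCCL communicators, the blocking wait, ncclCommWatchdog, and devNCCLCommMap_, and the error message is illustrative only.

```python
import time


class FakeComm:
    """Hypothetical stand-in for an NCCL communicator; not a real API."""

    def __init__(self):
        self.aborted = False

    def abort(self):
        self.aborted = True


def wait_blocking(comms, is_completed, timeout_s, poll_s=0.001):
    """Poll until the work completes; on timeout, abort comms, then raise."""
    deadline = time.monotonic() + timeout_s
    while not is_completed():
        if time.monotonic() >= deadline:
            # Abort first so a watchdog can later observe the aborted state.
            for comm in comms:
                comm.abort()
            raise RuntimeError("Operation timed out!")
        time.sleep(poll_s)


def watchdog_cleanup(comm_map):
    """Drop cache entries whose communicators have all been aborted."""
    stale = [key for key, comms in comm_map.items()
             if all(c.aborted for c in comms)]
    for key in stale:
        del comm_map[key]


# Hypothetical cache: device key -> list of communicators.
comm_map = {"0": [FakeComm()]}
try:
    # The work never completes, so the wait is expected to time out.
    wait_blocking(comm_map["0"], lambda: False, timeout_s=0.01)
except RuntimeError:
    timed_out = True
else:
    timed_out = False

aborted_before_cleanup = all(c.aborted for c in comm_map["0"])
watchdog_cleanup(comm_map)
```

In this sketch the communicators are already marked as aborted by the time the caller sees the exception, which is what lets the later cleanup pass remove the stale cache entry.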
ghstack-source-id: 95528488
Test Plan: the newly revised test _test_nccl_errors_blocking passes with the changes in this diff; the revised test fails without the changes in this diff
Reviewed By: isunjin
Differential Revision: D18928607
fbshipit-source-id: be65a05ce4ff005f0c7fed36ae8e28903e8ffe2b