[PyTorch/NCCL] Fix async error handling (#45456)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45456
Remove completed work objects without holding the lock, to avoid deadlocking with the watchdog thread while the GPU is at 100% utilization.
SyncBatchNorm failure trace: P143879560
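A minimal C++ sketch of the pattern this fix follows (not the actual ProcessGroupNCCL code; `WorkNCCL`, `WorkQueue`, and `cleanupCompletedWork` are hypothetical stand-ins): completed work is spliced off the shared list while holding the lock, but its potentially blocking destruction runs only after the lock is released, so the watchdog thread is never stuck waiting on that cleanup.

```cpp
#include <list>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the real work class.
struct WorkNCCL {
  bool isCompleted() const { return completed_; }
  bool completed_ = false;
  // Destruction may synchronize with the GPU (e.g. waiting on CUDA events),
  // which can be slow when the device is at 100% utilization.
  ~WorkNCCL() { /* e.g. wait on recorded CUDA events */ }
};

class WorkQueue {
 public:
  void cleanupCompletedWork() {
    std::list<std::shared_ptr<WorkNCCL>> doneWork;
    {
      std::lock_guard<std::mutex> lock(workListMutex_);
      for (auto it = workList_.begin(); it != workList_.end();) {
        if ((*it)->isCompleted()) {
          // Keep the work alive past the critical section; erase it from
          // the shared list while we still hold the lock.
          doneWork.push_back(std::move(*it));
          it = workList_.erase(it);
        } else {
          ++it;
        }
      }
    }  // lock released here
    // Potentially slow destruction happens outside the critical section,
    // so the watchdog thread can still acquire workListMutex_.
    doneWork.clear();
  }

 private:
  std::mutex workListMutex_;
  std::list<std::shared_ptr<WorkNCCL>> workList_;
};
```

The key design point is that only the list manipulation happens under `workListMutex_`; anything that can block on the GPU is deferred until after the lock is dropped.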
Test Plan:
**Desync test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_spawn#binary.par -r test_DistributedDataParallel_desync
**SyncBatchNorm test:**
BACKEND=nccl WORLD_SIZE=3 NCCL_ASYNC_ERROR_HANDLING=1 ./buck-out/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_DistributedDataParallel_SyncBatchNorm_Diff_Input_Sizes_gradient
Reviewed By: osalpekar
Differential Revision: D23972071
fbshipit-source-id: f03d9637a6ec998d64dab1a062a81e0f3697275f