Catch cuda driver shutdown error in NCCLWatchdog (#106503)
There is a design flaw in NCCLWatchdog: it spawns threads that call into
the CUDA API, but by the time those calls run, the CUDA runtime may
already have been deinitialized, creating a shutdown race.
This is a known issue with widespread impact
(https://github.com/pytorch/pytorch/issues/90848).
I tested this fix against the repro command for https://github.com/pytorch/pytorch/issues/82632 by running `NCCL_DESYNC_DEBUG=1 CUDA_LAUNCH_BLOCKING=1 python test/distributed/test_c10d_nccl.py -k test_find_unused_parameters_kwarg_debug_detail`. Instead of crashing, the process now emits log messages containing the exception string for the CUDA driver shutdown error.
A partial fix was landed already, but it applied too narrowly:
https://github.com/pytorch/pytorch/commit/ec071a0815adfd180ba1ab103be2f7d2227f07cc
This PR copies the previous fix to one more call site, plugging another
hole. We probably need a more thorough review to either plug all the
remaining holes or redesign this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106503
Approved by: https://github.com/kwen2501