Move CUDA driver exit handling from helpers to threads (#111061)
The pattern here is that main may exit and tear down the CUDA driver
before the c10d watchdog-related threads have cleanly exited. If this
happens, those threads may still make CUDA API calls and raise an
exception about the CUDA driver being dead.
In the past we patched a few helper functions that call into CUDA
to specifically handle this driver-exiting message. However, we know
that this problem applies only to codepaths in our background threads,
so we should catch at that scope and not worry about fine-grained
catching at the helper granularity. (If a helper is used from the main
thread, we should NOT catch this exception: that is the application's
fault.)
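A minimal sketch of the thread-scope pattern described above, not the
actual c10d code. The names queryEventHelper, runBackgroundLoop,
ThreadExit, and kDriverShutdownMarker are hypothetical, and the matched
message text is an assumption rather than the exact driver error string.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Assumed marker text; the real driver error message may differ.
const std::string kDriverShutdownMarker = "driver shutting down";

// Hypothetical helper standing in for any function that calls into CUDA.
// It does no exception handling of its own, so a caller on the main
// thread still sees the error.
void queryEventHelper(bool driverAlive) {
  if (!driverAlive) {
    throw std::runtime_error("CUDA error: driver shutting down");
  }
}

enum class ThreadExit { Clean, DriverShutdown, OtherError };

// Runs `body` on a background thread and classifies how it ended.
// The driver-shutdown error is recognized once, at thread scope,
// instead of inside each helper. Only std::exception is handled here;
// a real thread would likely log or propagate the OtherError case.
ThreadExit runBackgroundLoop(const std::function<void()>& body) {
  ThreadExit result = ThreadExit::Clean;
  std::thread worker([&result, &body] {
    try {
      body();
    } catch (const std::exception& e) {
      const std::string msg = e.what();
      if (msg.find(kDriverShutdownMarker) != std::string::npos) {
        result = ThreadExit::DriverShutdown;  // expected during exit
      } else {
        result = ThreadExit::OtherError;
      }
    }
  });
  worker.join();  // join orders the write to `result` before the read
  return result;
}
```

With this shape, runBackgroundLoop([] { queryEventHelper(false); })
reports ThreadExit::DriverShutdown, while calling queryEventHelper(false)
directly from the main thread still throws.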
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111061
Approved by: https://github.com/malfet, https://github.com/fduwjj