pytorch
aec0f98e - Move cuda driver exit handling from helpers to threads (#111061)

Commit
1 year ago
The pattern here is that main may exit and tear down the CUDA driver before the c10d watchdog-related threads have exited cleanly. If this happens, the c10d threads may still make CUDA API calls and raise an exception about the CUDA driver being dead.

In the past we patched a few helper functions that call into CUDA to handle this driver-exit message specifically. However, the problem applies only to code paths in our background threads, so we should catch at that scope rather than doing fine-grained catching at helper granularity. If a helper is used from the main thread, we should NOT catch this exception: that is the application's fault.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111061
Approved by: https://github.com/malfet, https://github.com/fduwjj
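
The scoping idea described in this message can be sketched in a few lines. This is an illustrative Python sketch, not the actual c10d code (which is C++): `query_device`, `run_background`, `watchdog_exits_quietly`, and the `DRIVER_EXIT_MARKER` string are all hypothetical names chosen for the example. The helper does no catching of its own, and only the background thread's entry point swallows the driver-exit error.

```python
import threading

# Assumption: substring used to recognize the "driver is gone" error.
# The real message text may differ; this is illustrative only.
DRIVER_EXIT_MARKER = "driver shutting down"


def query_device(driver_alive: bool) -> str:
    """Hypothetical helper standing in for a CUDA call.

    It does not catch anything, so a caller on the main thread
    still sees the error.
    """
    if not driver_alive:
        raise RuntimeError("CUDA error: driver shutting down")
    return "ok"


def run_background(body, errors: list) -> None:
    """Thread entry point: the only place the driver-exit error is swallowed."""
    try:
        body()
    except RuntimeError as exc:
        if DRIVER_EXIT_MARKER in str(exc):
            return  # main already tore down the driver; exit quietly
        errors.append(exc)  # any other failure is still recorded


def watchdog_exits_quietly() -> list:
    """Run the helper on a background thread after the 'driver' is gone."""
    errors: list = []
    thread = threading.Thread(
        target=run_background,
        args=(lambda: query_device(False), errors),
    )
    thread.start()
    thread.join()
    return errors
```

Calling `watchdog_exits_quietly()` records no errors because the thread-level handler absorbs the driver-exit failure, while calling `query_device(False)` directly from the main thread still raises, which matches the intent that main-thread misuse is not masked.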