pytorch
aec0f98e - Move cuda driver exit handling from helpers to threads (#111061)

Commit

1 year ago

Move cuda driver exit handling from helpers to threads (#111061)

The pattern here is that main may exit and kill the CUDA driver before the c10d watchdog-related threads have cleanly exited. If this happens, c10d threads may still make CUDA API calls and raise an exception about the CUDA driver being dead.

In the past we've patched a few helper functions that call into CUDA to specifically handle this driver-exiting message. Instead, we know that this problem applies only to codepaths in our background threads, so we should catch at that scope and not worry about fine-grained catching at the helper granularity. (And if a helper is used from the main thread, we should NOT catch this exception: it's the application's fault.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111061
Approved by: https://github.com/malfet, https://github.com/fduwjj
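The scoping idea in the message can be illustrated with a minimal Python sketch. This is not the code changed by the commit (the actual change is in PyTorch's c10d sources); `DriverShutdownError`, `query_event`, `watchdog_loop`, and `run_watchdog` are hypothetical names invented for the example. It only shows the shape of the pattern: the helper never swallows the error, and the background thread's entry point is the single place that tolerates driver teardown.

```python
import threading


class DriverShutdownError(RuntimeError):
    """Hypothetical stand-in for a 'CUDA driver is shutting down' error."""


def query_event(driver_alive: bool) -> bool:
    # Hypothetical helper. It has no try/except, so a driver-teardown
    # error propagates to whoever called it (including the main thread).
    if not driver_alive:
        raise DriverShutdownError("driver shutting down")
    return True


def watchdog_loop(driver_alive: bool, log: list) -> None:
    # Background-thread entry point: the only scope that catches the
    # teardown error, because only here is it known to be benign.
    try:
        query_event(driver_alive)
        log.append("ok")
    except DriverShutdownError:
        log.append("exited quietly")


def run_watchdog(driver_alive: bool) -> list:
    # Run the watchdog body on a real thread and return what it logged.
    log: list = []
    worker = threading.Thread(target=watchdog_loop, args=(driver_alive, log))
    worker.start()
    worker.join()
    return log
```

Calling `query_event(False)` directly from the main thread still raises, which matches the message's point that a main-thread caller should see the exception, while `run_watchdog(False)` ends the thread without an unhandled error.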

Author

wconstab

Committer

pytorchmergebot

Parents

2f53085f
