pytorch
f8c559db - [resubmit] Providing more information while crashing process in async error handling (#47246)

Commit
4 years ago
[resubmit] Providing more information while crashing process in async error handling (#47246)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47246

We crash the process in NCCL Async Error Handling if the collective has been running for greater than some set timeout. This PR introduces more information about the rank and duration the collective ran.

ghstack-source-id: 116676182

Test Plan: Run desync tests and flow.

Reviewed By: pritamdamania87

Differential Revision: D24695126

fbshipit-source-id: 61ae46477065a1a451dc46fb29c3ac0073ca531b
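The behaviour described in the summary (a timeout check whose failure message names the rank and how long the collective ran) can be illustrated with a minimal Python sketch. This is not the code from the PR. `CollectiveWork`, `timeout_message`, and `check_or_abort` are hypothetical names, the message wording is invented for illustration, and the sketch raises an exception where the real handler crashes the process.

```python
import time
from dataclasses import dataclass


@dataclass
class CollectiveWork:
    """Hypothetical record of an in-flight collective (not a PyTorch type)."""
    rank: int          # rank of the process that launched the collective
    start_time: float  # seconds, as returned by time.monotonic()


def timeout_message(work, timeout_s, now):
    """Return a diagnostic string if `work` exceeded `timeout_s`, else None."""
    elapsed = now - work.start_time
    if elapsed <= timeout_s:
        return None
    # Include the rank and the elapsed duration so the failure is attributable.
    return (
        f"[Rank {work.rank}] collective ran for {elapsed:.1f} s, "
        f"exceeding the timeout of {timeout_s:.1f} s"
    )


def check_or_abort(work, timeout_s):
    """Raise with the diagnostic message if the collective has timed out.

    Assumption: raising RuntimeError stands in for crashing the process.
    """
    msg = timeout_message(work, timeout_s, time.monotonic())
    if msg is not None:
        raise RuntimeError(msg)
```

Passing `now` explicitly to `timeout_message` keeps the check deterministic and easy to test. A watchdog loop would call `check_or_abort` periodically for each outstanding collective.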