pytorch
7eb427e9 - Providing more information while crashing process in async error handling (#46274)

Commit
4 years ago
Providing more information while crashing process in async error handling (#46274)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46274

We crash the process in NCCL Async Error Handling if the collective has been running for greater than some set timeout. This PR logs more information about the rank and duration the collective ran before throwing an exception.

ghstack-source-id: 115614622

Test Plan:
Run desync tests and flow. Here are the Flow runs showing the right messages:
f225031389
f225032004

Reviewed By: jiayisuse

Differential Revision: D24200144

fbshipit-source-id: 02d48f13352aed40a4476768c123d5cebbedc8e0
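
For readers unfamiliar with the mechanism, here is a minimal Python sketch of the idea described in the summary: a watchdog compares how long a collective has been running against a timeout and, when the timeout is exceeded, raises an exception whose message includes the rank and the elapsed duration. This is not the code from the PR. `CollectiveWork`, `check_timeout`, and the message wording are hypothetical names chosen for illustration, and the actual change lives in PyTorch's NCCL process group implementation.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CollectiveWork:
    """Hypothetical record for one in-flight collective (not a PyTorch class)."""
    rank: int            # rank of the process that launched the collective
    start_time: float    # launch time in seconds, e.g. from time.monotonic()
    timeout_s: float     # allowed running time in seconds


def check_timeout(work: CollectiveWork, now: Optional[float] = None) -> None:
    """Raise if the collective has run longer than its timeout.

    The error message carries the rank and the elapsed duration, which is
    the kind of extra information the summary describes. The exact wording
    here is an assumption, not the message used by PyTorch.
    """
    now = time.monotonic() if now is None else now
    elapsed_ms = (now - work.start_time) * 1000.0
    timeout_ms = work.timeout_s * 1000.0
    if elapsed_ms > timeout_ms:
        raise RuntimeError(
            f"[Rank {work.rank}] collective timed out after running for "
            f"{elapsed_ms:.0f} ms (timeout: {timeout_ms:.0f} ms)"
        )
```

Passing `now` explicitly keeps the check deterministic in tests; a real watchdog would poll on a background thread and read the clock itself.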