Provide more information when crashing the process in NCCL async error handling (#46274)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46274
NCCL Async Error Handling crashes the process if a collective has been
running for longer than a set timeout. This PR logs more information,
including the rank and how long the collective had been running, before
throwing the exception.
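For illustration only, the timeout check described above can be sketched roughly as follows. This is not the code changed in this PR: the function name `check_collective_timeout`, its parameters, and the message wording are all hypothetical and chosen just to show the rank and elapsed time being reported when the timeout is exceeded.

```python
import time
from datetime import timedelta


def check_collective_timeout(rank, start, timeout, now=None):
    """Hypothetical helper: raise if a collective that began at `start`
    (seconds, monotonic clock) has run longer than `timeout`."""
    now = time.monotonic() if now is None else now
    elapsed_ms = int((now - start) * 1000)
    timeout_ms = int(timeout.total_seconds() * 1000)
    if elapsed_ms > timeout_ms:
        # Include the rank and the elapsed time so the failure can be
        # traced back to the process and collective that stalled.
        raise RuntimeError(
            f"[Rank {rank}] collective ran for {elapsed_ms} ms, "
            f"exceeding the timeout of {timeout_ms} ms"
        )
    return elapsed_ms
```

In this sketch a watchdog would call the helper periodically for each in-flight collective; the real logic would live in the NCCL process group rather than in a free function like this one.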
ghstack-source-id: 115614622
Test Plan:
Run the desync tests and Flow. Flow runs showing the expected messages: f225031389, f225032004
Reviewed By: jiayisuse
Differential Revision: D24200144
fbshipit-source-id: 02d48f13352aed40a4476768c123d5cebbedc8e0