Enable desync root cause analysis for NCCL (#68310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68310
Enable desync root cause analysis by recording the last footprint of collective calls. On timeout, we parse the store trace to determine the root cause of the desync. This feature is built on top of async error handling.
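For illustration only, the idea can be sketched in plain Python. This is not the ProcessGroupNCCL implementation: the `Footprint` record, the dict standing in for the store, and the helpers `record_start`, `record_end`, and `analyze_desync` are all hypothetical names invented for this sketch. It assumes each rank writes the sequence number and op name of its last collective, and that the trace is inspected once a timeout fires.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Footprint:
    """Hypothetical per-rank record: last collective started and finished."""
    last_started: int = -1
    last_finished: int = -1
    op: str = ""


def record_start(store: Dict[int, Footprint], rank: int, seq: int, op: str) -> None:
    # Called when a rank enqueues collective number `seq`.
    fp = store.setdefault(rank, Footprint())
    fp.last_started = seq
    fp.op = op


def record_end(store: Dict[int, Footprint], rank: int, seq: int) -> None:
    # Called when collective number `seq` completes on this rank.
    store[rank].last_finished = seq


def analyze_desync(store: Dict[int, Footprint], world_size: int) -> List[str]:
    """On timeout, compare footprints across ranks and describe the likely cause."""
    report: List[str] = []
    missing = [r for r in range(world_size) if r not in store]
    if missing:
        report.append(f"ranks {missing} recorded no collective")
    if not store:
        return report
    # The furthest collective any rank has started.
    furthest = max(fp.last_started for fp in store.values())
    lagging = sorted(r for r, fp in store.items() if fp.last_started < furthest)
    if lagging:
        report.append(f"ranks {lagging} never started collective #{furthest}")
    # Ranks at the same sequence number should be running the same op.
    ops = {fp.op for fp in store.values() if fp.last_started == furthest}
    if len(ops) > 1:
        report.append(f"mismatched collectives at #{furthest}: {sorted(ops)}")
    return report
```

In this sketch a typical desync shows up as a rank that never started the collective the others are waiting on, and a mismatched collective shows up as two different op names at the same sequence number.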
Test Plan:
Standalone test
* Typical desync - P467288969
* Mismatched collectives - P467288916
* Mismatched broadcast size - P467288873
DDP benchmark
* DDP benchmark desync - P467433483, P467520195
No perf regression:
* w/o this diff https://www.internalfb.com/intern/fblearner/details/308379789?tab=Outputs
* w/ this diff https://www.internalfb.com/intern/fblearner/details/308534088?tab=Outputs
Reviewed By: mingzhe09088
Differential Revision: D32348647
fbshipit-source-id: 43e7e96e3fa2be0ac66c1325bceb639b461a8b3a