pytorch
122f8648 - [PyTorch Distributed] Add debug hint for NCCL async system error (#73897)

Commit
2 years ago
[PyTorch Distributed] Add debug hint for NCCL async system error (#73897)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73897

Add a debug hint that an async system error can be caused by the unexpected exit of a remote process when it is not an actual network issue. For example, the exit of a remote process can cause a closed network connection error at a local process. The hint helps direct the debugging focus to the remote process.

Test Plan: unit tests

Reviewed By: pritamdamania87, rohan-varma

Differential Revision: D34702348

fbshipit-source-id: d19f9116e9efe5f6d76c0158a7a447616437ca69

(cherry picked from commit 005e74b7b6764ecd832b3410063285bff2411b56)
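
The idea in the summary can be sketched roughly as follows. This is only an illustrative Python sketch, not the code changed by this commit: the function name `describe_error`, the constant `ASYNC_SYSTEM_ERROR_HINT`, the error-kind strings, and the hint wording are all hypothetical and chosen for the example.

```python
# Hypothetical illustration of appending a debug hint to one error category.
# None of these names or strings are taken from PyTorch; they are assumptions.

ASYNC_SYSTEM_ERROR_HINT = (
    "an async system error can be caused by the unexpected exit of a remote "
    "process when it is not an actual network issue; check the remote process"
)


def describe_error(kind: str, detail: str) -> str:
    """Build an error message, adding the hint only for async system errors."""
    message = f"{kind}: {detail}"
    if kind == "async system error":
        # Point the reader at the remote process instead of the local network.
        message += f"\nHint: {ASYNC_SYSTEM_ERROR_HINT}"
    return message


if __name__ == "__main__":
    print(describe_error("async system error", "connection closed by peer"))
```

Under these assumptions, only the async system error category receives the extra line, so messages for other error kinds stay unchanged.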