pytorch
3ea59b68 - [c10d] Enhance broadcastUniqueNCCLID error reporting (#94752)

Commit
1 year ago
When this error is hit, it is usually because rank 0 encountered an error and crashed before setting the unique ID. However, many job scheduling tools do not report the rank 0 error clearly, so the user must look for it. Add a small log message reminding users to do so.

Differential Revision: [D43245190](https://our.internmc.facebook.com/intern/diff/D43245190/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43245190/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94752
Approved by: https://github.com/H-Huang
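The failure mode described above can be illustrated with a minimal Python sketch. It is not the c10d implementation, which is written in C++. `InMemoryStore`, `broadcast_unique_id`, the key name, and the wording of the message are all hypothetical and exist only for illustration. The sketch assumes rank 0 publishes the ID to a shared key-value store and every other rank reads it back, so a missing key on a non-zero rank is reported with a hint to inspect rank 0.

```python
class InMemoryStore:
    """Hypothetical stand-in for a distributed key-value store."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        # A real store would block until a timeout; here a missing key
        # fails immediately to keep the sketch self-contained.
        if key not in self._data:
            raise TimeoutError(f"timed out waiting for key '{key}'")
        return self._data[key]


def broadcast_unique_id(store, rank, unique_id=None, key="0"):
    """Rank 0 publishes the ID; all other ranks fetch it from the store."""
    if rank == 0:
        store.set(key, unique_id)
        return unique_id
    try:
        return store.get(key)
    except Exception as exc:
        # Wrap the low-level error with a reminder to look at rank 0,
        # where the root cause most likely occurred.
        raise RuntimeError(
            f"[rank {rank}] failed to retrieve the unique ID from the store "
            f"(key '{key}'): {exc}. This often means rank 0 hit an error and "
            "exited before publishing the ID; check rank 0's logs for the "
            "root cause."
        ) from exc
```

In this sketch, calling `broadcast_unique_id(store, 1)` before rank 0 has published anything raises a `RuntimeError` whose message points at rank 0, while the original exception stays attached as `__cause__` for debugging.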