pytorch
ab75d64e - Add ability to abort NCCL communicators from the store. (#32895)

Commit

4 years ago

Add ability to abort NCCL communicators from the store. (#32895)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32895

When a particular rank calls `ncclCommAbort` on a communicator, it is important to ensure all other ranks call `ncclCommAbort` on their respective communicators. If this is not done, the other ranks could get stuck, causing the GPU to spin with 100% utilization.

To alleviate this issue, whenever any rank calls `ncclCommAbort` we put the unique communicator id in the store. The NCCL watchdog thread then monitors the store and aborts any communicators found in the store as "aborted".

A few more general fixes in this PR:
1) Use std::shared_ptr for the store in PrefixStore. PrefixStore was using a reference to the store and when that reference went out of scope the store object it was holding onto was invalid. This caused a segfault in the watchdog thread.
2) Enhanced logging for the watchdog thread.

Test Plan: waitforbuildbot

Differential Revision: D19638159

fbshipit-source-id: 596cd87c9fe6d4aeaaab4cb7319cc37784d06eaa
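The abort propagation described in the summary can be illustrated with a small, self-contained Python sketch. This is not the code from the PR (which lives in the C++ ProcessGroupNCCL implementation); it is a toy model under stated assumptions. `InMemoryStore`, `FakeComm`, `abort_key`, `watchdog`, and `run_demo` are hypothetical names, the `aborted/<id>` key format is invented for illustration, and `FakeComm.abort()` merely stands in for `ncclCommAbort`.

```python
import threading
import time


def abort_key(comm_id):
    # Hypothetical key format; the actual key used in the PR is not shown here.
    return f"aborted/{comm_id}"


class InMemoryStore:
    """Hypothetical stand-in for the key-value store shared by all ranks."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def check(self, key):
        with self._lock:
            return key in self._data


class FakeComm:
    """Hypothetical communicator; abort() stands in for ncclCommAbort."""

    def __init__(self, comm_id, store):
        self.comm_id = comm_id
        self.store = store
        self.aborted = False

    def abort(self):
        # Abort locally, then publish the communicator id so that
        # watchdogs on other ranks can abort their side as well.
        self.aborted = True
        self.store.set(abort_key(self.comm_id), b"1")


def watchdog(comms, store, stop_event, poll_interval=0.01):
    """Poll the store and abort local communicators whose id is marked aborted."""
    while not stop_event.is_set():
        for comm in comms:
            if not comm.aborted and store.check(abort_key(comm.comm_id)):
                comm.abort()
        time.sleep(poll_interval)


def run_demo():
    store = InMemoryStore()
    rank0_comm = FakeComm("comm-A", store)   # rank 0's side of communicator A
    rank1_comm = FakeComm("comm-A", store)   # rank 1's side of communicator A
    unrelated = FakeComm("comm-B", store)    # a different communicator on rank 1

    stop = threading.Event()
    thread = threading.Thread(
        target=watchdog, args=([rank1_comm, unrelated], store, stop), daemon=True
    )
    thread.start()

    rank0_comm.abort()  # rank 0 hits an error and aborts

    # Wait (bounded) for rank 1's watchdog to notice the abort in the store.
    deadline = time.monotonic() + 2.0
    while not rank1_comm.aborted and time.monotonic() < deadline:
        time.sleep(0.01)

    stop.set()
    thread.join()
    return rank0_comm.aborted, rank1_comm.aborted, unrelated.aborted
```

In this sketch only communicators whose id appears in the store are aborted, so `comm-B` is left alone while both sides of `comm-A` end up aborted. Polling is used because the summary describes the watchdog as monitoring the store; the polling interval here is an arbitrary choice.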

Author

pritamdamania

Committer

facebook-github-bot

Parents

df1d68d5
