Add ability to abort NCCL communicators from the store. (#32895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32895
When a rank calls `ncclCommAbort` on a communicator, all other ranks must
also call `ncclCommAbort` on their respective communicators. Otherwise, the
other ranks could get stuck, causing their GPUs to spin at 100% utilization.
To alleviate this issue, whenever any rank calls `ncclCommAbort`, we put the
unique communicator id in the store. The NCCL watchdog thread then monitors the
store and aborts any local communicators whose ids the store marks as "aborted".
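
The propagation scheme described above can be sketched roughly as follows.
This is an illustrative model only, not the code in this PR: `InMemoryStore`,
`FakeComm`, `abortedKey`, `abortAndPublish`, and `watchdogPass` are
hypothetical names, the "aborted/" key layout is an assumption, and a boolean
flag stands in for the real `ncclCommAbort` call.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the shared key-value store (not the c10d Store API).
class InMemoryStore {
 public:
  void set(const std::string& key, const std::string& value) {
    std::lock_guard<std::mutex> lock(mutex_);
    data_[key] = value;
  }
  bool contains(const std::string& key) const {
    std::lock_guard<std::mutex> lock(mutex_);
    return data_.count(key) > 0;
  }

 private:
  mutable std::mutex mutex_;
  std::map<std::string, std::string> data_;
};

// Hypothetical stand-in for one rank's NCCL communicator.
struct FakeComm {
  explicit FakeComm(std::string commId) : id(std::move(commId)) {}
  std::string id;        // unique communicator id shared by all ranks
  bool aborted = false;  // set where real code would call ncclCommAbort
};

// Key layout is an assumption made for this sketch.
inline std::string abortedKey(const std::string& commId) {
  return "aborted/" + commId;
}

// A rank aborts its own communicator and publishes the id to the store.
inline void abortAndPublish(FakeComm& comm, InMemoryStore& store) {
  assert(!comm.id.empty());
  comm.aborted = true;
  store.set(abortedKey(comm.id), "1");
}

// One watchdog pass: abort every local communicator whose id is marked in
// the store. Returns the number of communicators newly aborted.
inline int watchdogPass(std::vector<FakeComm>& comms,
                        const InMemoryStore& store) {
  int newlyAborted = 0;
  for (auto& comm : comms) {
    if (!comm.aborted && store.contains(abortedKey(comm.id))) {
      comm.aborted = true;
      ++newlyAborted;
    }
  }
  return newlyAborted;
}
```

In this model a watchdog thread would call `watchdogPass` periodically, so a
rank that never observed the original failure still aborts its communicator
once the id shows up in the store.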
A few more general fixes in this PR:
1) Use std::shared_ptr for the store in PrefixStore. PrefixStore held a
reference to the store, and once the referenced store object went out of scope
the reference was left dangling. This caused a segfault in the watchdog
thread.
2) Enhanced logging for the watchdog thread.
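
For fix 1), the ownership change can be illustrated with a minimal sketch. The
types below (`Store`, `MapStore`, `PrefixedStore`, `makePrefixed`) are
hypothetical and much simpler than the real c10d classes; they only show why
holding a `std::shared_ptr` keeps the underlying store alive for as long as
the wrapper exists.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal store interface for this sketch.
class Store {
 public:
  virtual ~Store() = default;
  virtual void set(const std::string& key, const std::string& value) = 0;
  virtual std::string get(const std::string& key) = 0;
};

// Simple map-backed store; returns an empty string for missing keys.
class MapStore : public Store {
 public:
  void set(const std::string& key, const std::string& value) override {
    data_[key] = value;
  }
  std::string get(const std::string& key) override {
    auto it = data_.find(key);
    return it == data_.end() ? std::string() : it->second;
  }

 private:
  std::map<std::string, std::string> data_;
};

// Wrapper that prefixes every key and shares ownership of the wrapped store,
// so the wrapped store cannot be destroyed while the wrapper is alive.
class PrefixedStore : public Store {
 public:
  PrefixedStore(std::string prefix, std::shared_ptr<Store> store)
      : prefix_(std::move(prefix)), store_(std::move(store)) {}

  void set(const std::string& key, const std::string& value) override {
    store_->set(prefix_ + key, value);
  }
  std::string get(const std::string& key) override {
    return store_->get(prefix_ + key);
  }

 private:
  std::string prefix_;
  std::shared_ptr<Store> store_;  // a `Store&` member here could dangle
};

// The local handle `base` goes out of scope on return, but the wrapper's
// shared_ptr keeps the underlying store alive.
inline std::shared_ptr<Store> makePrefixed(const std::string& prefix) {
  auto base = std::make_shared<MapStore>();
  return std::make_shared<PrefixedStore>(prefix, base);
}
```

With a reference member instead of `store_`, the object returned by
`makePrefixed` would refer to a destroyed store, which is the kind of
lifetime bug a long-running background thread is likely to hit.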
Test Plan: waitforbuildbot
Differential Revision: D19638159
fbshipit-source-id: 596cd87c9fe6d4aeaaab4cb7319cc37784d06eaa