Enhance NCCL watchdog to actively abort communicators for timed-out ops. (#32338)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32338
Timed-out ops could linger indefinitely if the user never calls `wait()`
on them. To fix this, this PR introduces the following
functionality:
1. Keep track of all outstanding work in ProcessGroupNCCL.
2. Enhance NCCL watchdog to sweep through all outstanding work and perform the
   following operations:
   i. If the work has timed out, abort all communicators for that work and
      remove them from the cache.
   ii. If the communicators for the work receive an error, abort the
       communicators and remove them from the cache.
   iii. If the work has completed (successfully/unsuccessfully), remove it from
        the list of outstanding work.
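
The sweep described above can be sketched roughly as follows. This is an
illustrative Python model only, not the C++ code in ProcessGroupNCCL: the
names `FakeComm`, `FakeWork`, `sweep`, and `comm_cache` are hypothetical, and
treating aborted work as completed (so it leaves the outstanding list) is an
assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class FakeComm:
    """Hypothetical stand-in for an NCCL communicator."""
    has_error: bool = False
    aborted: bool = False

    def abort(self):
        self.aborted = True


@dataclass
class FakeWork:
    """Hypothetical stand-in for one outstanding collective op."""
    key: str            # cache key of the communicators this work uses
    comms: list         # communicators backing this work
    started_at: float   # start time, in seconds
    timeout_s: float    # allowed duration, in seconds
    completed: bool = False

    def timed_out(self, now):
        return not self.completed and now - self.started_at >= self.timeout_s


def sweep(outstanding, comm_cache, now):
    """Run one watchdog pass and return the work that is still outstanding."""
    remaining = []
    for work in outstanding:
        errored = any(comm.has_error for comm in work.comms)
        if work.timed_out(now) or errored:
            # Cases i and ii: abort the communicators and evict them from the cache.
            for comm in work.comms:
                comm.abort()
            comm_cache.pop(work.key, None)
            # Assumption of this sketch: aborted work counts as completed
            # (unsuccessfully), so it is dropped below.
            work.completed = True
        if not work.completed:
            remaining.append(work)
        # Case iii: completed work is not carried over to the next pass.
    return remaining
```

In this model the watchdog would call `sweep` periodically from a background
thread, so a timed-out op is cleaned up even if the user never calls `wait()`
on it.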
ghstack-source-id: 96895704
Test Plan: waitforbuildbot
Differential Revision: D19401625
fbshipit-source-id: 8f6f277ba2750a1e1aa03cdbc76e8c11862e7ce5