pytorch
f86d6c6a - Enhance NCCL watchdog to actively abort communicators for timed out ops. (#32338)

Commit
4 years ago
Enhance NCCL watchdog to actively abort communicators for timed out ops. (#32338)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/32338

Timed out ops could linger if the user never calls `wait()` on that op. To fix this, this PR introduces the following functionality:

1. Keep track of all outstanding work in ProcessGroupNCCL.
2. Enhance the NCCL watchdog to sweep through all outstanding work and perform the following operations:
  i. If the work has timed out, abort all communicators for that work and remove them from the cache.
  ii. If the communicators for the work receive an error, abort the communicators and remove them from the cache.
  iii. If the work has completed (successfully or unsuccessfully), remove it from the list of outstanding work.

ghstack-source-id: 96895704

Test Plan: waitforbuildbot

Differential Revision: D19401625

fbshipit-source-id: 8f6f277ba2750a1e1aa03cdbc76e8c11862e7ce5
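The sweep described in the commit message can be illustrated with a small Python simulation. This is only a sketch of the three rules, not the ProcessGroupNCCL implementation (which is C++): `FakeComm`, `FakeWork`, `sweep`, and the `failed` flag are hypothetical names introduced here, and it assumes that work whose communicators were aborted is treated as completed unsuccessfully.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FakeComm:
    """Hypothetical stand-in for an NCCL communicator."""
    name: str
    error: Optional[str] = None
    aborted: bool = False

    def abort(self) -> None:
        self.aborted = True


@dataclass
class FakeWork:
    """Hypothetical stand-in for one outstanding collective op."""
    comm_key: str          # key of this work's communicators in the cache
    started_at: float      # start time in seconds
    timeout_s: float       # allowed duration in seconds
    completed: bool = False
    failed: bool = False   # assumption: set when the watchdog aborts the work

    def timed_out(self, now: float) -> bool:
        return not self.completed and (now - self.started_at) > self.timeout_s


def sweep(outstanding: List[FakeWork],
          comm_cache: Dict[str, List[FakeComm]],
          now: float) -> List[FakeWork]:
    """Run one watchdog pass and return the work that is still outstanding."""
    still_outstanding = []
    for work in outstanding:
        comms = comm_cache.get(work.comm_key, [])
        errored = any(c.error is not None for c in comms)

        # Rules (i) and (ii): on timeout or communicator error,
        # abort the communicators and drop them from the cache.
        if not work.completed and (work.timed_out(now) or errored):
            for c in comms:
                c.abort()
            comm_cache.pop(work.comm_key, None)
            work.failed = True

        # Rule (iii): finished work (successful or not) leaves the list.
        if not (work.completed or work.failed):
            still_outstanding.append(work)
    return still_outstanding
```

In this sketch the caller never has to call anything like `wait()` for a timed out op to be cleaned up: a background thread could call `sweep` periodically with the current time, which is the point of having the watchdog track outstanding work itself.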