pytorch
d0672974 - [RFC] Add _abort method to ProcessGroupNCCL (#96017)

Commit
1 year ago
**Summary:** Currently the only way to destroy a process group is to call `dist.destroy_process_group`. However, this API does not guarantee destruction of the ProcessGroup object, since it only deletes the references held inside `distributed_c10d.py`. When the process group is used in multiple places, it is not feasible to hunt down all the references and delete them.

In particular, for NCCL, if a collective gets stuck, the only way to recover is to call `ncclCommAbort` on all the communicators. Currently there is no API for this. To address it, this PR adds an `_abort` method to ProcessGroupNCCL, which provides a guaranteed way to kill all NCCL communicators associated with a ProcessGroup.

**Test Plan:** Added a unit test to validate that this works as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96017
Approved by: https://github.com/wanchaol