f8f7b784 - [NCCL] Abort Errored and Timed Out NCCL Communicators from Watchdog Thread (#41052)

[NCCL] Abort Errored and Timed Out NCCL Communicators from Watchdog Thread (#41052)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41052

**This Commit:** The watchdog thread checks for errored or timed-out `WorkNCCL` objects and aborts all associated NCCL communicators. For now, these aborted communicators are also processed by the existing watchdog logic (they are added to abortedCommIds, and the aborted communicator ids are written to the store).

**This Stack:** The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang while waiting on an unresponsive worker. This stack detects such hangs and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with something like torchelastic.

ghstack-source-id: 111614313

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21943151

fbshipit-source-id: 337bfcb8af7542c451f1e4b3dcdfc5870bdec453
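To illustrate the mechanism described above, here is a minimal, self-contained C++ sketch of a watchdog thread that periodically scans outstanding work objects and aborts their communicators once one has errored or exceeded its timeout. The `FakeWork`, `FakeComm`, and `Watchdog` types are simplified stand-ins invented for illustration; they are not the actual `WorkNCCL`/`ProcessGroupNCCL` classes, and `FakeComm::abort()` only marks a flag where the real code would call `ncclCommAbort` and record the aborted communicator id.

```cpp
// Standalone sketch (simplified stand-ins, not the real ProcessGroupNCCL code)
// of a watchdog thread that aborts communicators for errored/timed-out work.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <list>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for an NCCL communicator handle.
struct FakeComm {
  std::atomic<bool> aborted{false};
  void abort() { aborted = true; }  // real code would call ncclCommAbort here
};

// Hypothetical stand-in for WorkNCCL: one enqueued collective operation.
struct FakeWork {
  Clock::time_point start = Clock::now();
  std::chrono::milliseconds timeout{5000};
  std::atomic<bool> errored{false};
  std::atomic<bool> completed{false};
  std::vector<std::shared_ptr<FakeComm>> comms;  // communicators used by this op

  bool timedOut() const { return Clock::now() - start > timeout; }
};

class Watchdog {
 public:
  Watchdog() : thread_(&Watchdog::run, this) {}
  ~Watchdog() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_all();
    thread_.join();
  }

  void enqueue(std::shared_ptr<FakeWork> w) {
    std::lock_guard<std::mutex> lk(mu_);
    pending_.push_back(std::move(w));
  }

 private:
  void run() {
    std::unique_lock<std::mutex> lk(mu_);
    while (!stop_) {
      // Wake up periodically, analogous to the watchdog's polling interval.
      cv_.wait_for(lk, std::chrono::milliseconds(100));
      for (auto it = pending_.begin(); it != pending_.end();) {
        auto& w = *it;
        if (w->errored || w->timedOut()) {
          // Abort every communicator associated with the bad work object so
          // blocked collectives return instead of hanging indefinitely.
          for (auto& c : w->comms) {
            c->abort();
          }
          std::cerr << "watchdog: aborted comms for errored/timed-out work\n";
          it = pending_.erase(it);
        } else if (w->completed) {
          it = pending_.erase(it);
        } else {
          ++it;
        }
      }
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  bool stop_ = false;
  std::list<std::shared_ptr<FakeWork>> pending_;
  std::thread thread_;
};

int main() {
  Watchdog wd;
  auto comm = std::make_shared<FakeComm>();
  auto work = std::make_shared<FakeWork>();
  work->timeout = std::chrono::milliseconds(200);  // force a quick timeout
  work->comms.push_back(comm);
  wd.enqueue(work);
  std::this_thread::sleep_for(std::chrono::milliseconds(500));
  std::cout << "comm aborted: " << comm->aborted << "\n";  // prints 1
  return 0;
}
```

In the sketch, aborting the communicator is what converts a silent hang into a visible failure; in the commit itself this is paired with the existing logic that records aborted communicator ids in the store so other ranks can react.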