afbf2f14 - [NCCL] WorkNCCL Helper Functions (#41053)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41053

**This Commit:**
Some minor refactoring: added a helper to check whether `WorkNCCL` objects have timed out, and added a new finish function to `ProcessGroupNCCL::WorkNCCL` that avoids notifying the condition variable and uses `lock_guard`. Also renamed the `timeoutCVMutex` mutex to be more descriptive.

**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang while waiting on an unresponsive worker. This stack detects such hangs and aborts timed-out collectives by throwing a user-visible exception, all with minimal perf regression. Training can then be restarted from a previous checkpoint with a tool such as torchelastic.

ghstack-source-id: 111614315

Test Plan: See D22054298 for verification of correctness and performance

Reviewed By: jiayisuse

Differential Revision: D21943520

fbshipit-source-id: b27ee329f0da6465857204ee9d87953ed6072cbb