[NCCL] Use cudaEventQuery to Poll for GPU operation errors (#41051)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41051
**This Commit:**
In the workCleanupThread, we handle completion and exception processing for workNCCL objects whose corresponding collective calls have either finished GPU execution or already thrown an exception. Failed GPU operations therefore surface as exceptions thrown from the workCleanupThread. This replaces the previous, lower-performance approach of enqueuing a callback on the CUDA stream to process failures.
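The polling itself relies on `cudaEventQuery`, which reports whether the GPU work behind a recorded event has finished without blocking the calling thread. Below is a minimal sketch of that idea; the helper name `finishedGpuExecution` and the error handling are illustrative assumptions, not the exact ProcessGroupNCCL code.

```cpp
// Sketch: non-blocking completion check for a CUDA event, as a cleanup
// thread might use it. Assumes the event was recorded on the collective's
// stream after the NCCL call was enqueued.
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

bool finishedGpuExecution(cudaEvent_t event) {
  cudaError_t err = cudaEventQuery(event);
  if (err == cudaSuccess) {
    return true;   // GPU work behind the event has completed
  }
  if (err == cudaErrorNotReady) {
    return false;  // still running; poll again on the next pass
  }
  // Any other status indicates a failed GPU operation; clear the sticky
  // error and throw so the cleanup thread can raise a user-visible exception.
  (void)cudaGetLastError();
  throw std::runtime_error(std::string("CUDA error: ") + cudaGetErrorString(err));
}
```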
**This Stack:**
The purpose of this stack is to fix the hanging behavior observed when using PyTorch DDP training with NCCL. In various situations (desynchronization, high GPU utilization, etc.), NCCL collectives may hang while waiting on an unresponsive worker. This stack detects such hangs and aborts timed-out collectives by throwing a user-visible exception, all with minimal performance regression. Training can then be restarted from a previous checkpoint with a tool like torchelastic.
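To make the detect-and-abort flow concrete, here is a minimal sketch of a cleanup loop under assumed names: `WorkNCCL`, its methods, and the timeout value are hypothetical stand-ins, not the actual ProcessGroupNCCL implementation. Completed work is retired; timed-out work is aborted and surfaced as an exception instead of hanging.

```cpp
#include <chrono>
#include <deque>
#include <iterator>
#include <mutex>
#include <stdexcept>
#include <thread>

struct WorkNCCL {
  // Hypothetical stand-ins for the real checks (e.g. cudaEventQuery-backed).
  bool isCompleted() const { return completed; }
  bool isTimedOut() const {
    return std::chrono::steady_clock::now() - start > timeout;
  }
  void abort() { /* would abort the underlying NCCL communicator */ }

  bool completed = false;
  std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
  std::chrono::milliseconds timeout{30000};
};

// Cleanup loop: retire finished collectives, abort and report timed-out ones.
void workCleanupLoop(std::deque<WorkNCCL>& pending, std::mutex& mu, const bool& stop) {
  while (!stop) {
    {
      std::lock_guard<std::mutex> lock(mu);
      for (auto it = pending.begin(); it != pending.end();) {
        if (it->isTimedOut()) {
          it->abort();
          throw std::runtime_error("NCCL collective timed out");
        }
        it = it->isCompleted() ? pending.erase(it) : std::next(it);
      }
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}
```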
ghstack-source-id: 111614319
Test Plan: See D22054298 for verification of correctness and performance
Reviewed By: jiayisuse
Differential Revision: D21938498
fbshipit-source-id: df598365031ff210afba57e0c7be865e3323ca07