[NCCL] Fix Hang in Async Error Handling due to Work logging (#46265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46265
tl;dr - we must remove tensor-related logging from the
`WorkNCCL::operator<<` function; otherwise printing the work objects tracked in
`workMetaList_` causes segfaults.
The Work objects we track in `workMetaList_` for the NCCL Async Error
Handling mechanism don't have any `outputs_`. As described in the `workEnqueue`
function, destructing the output tensors calls into `autograd_meta`, which
must happen in the user thread, but our system destructs work objects in the
`workCleanupThread`, which could lead to a deadlock. We avoid this problem by
not tracking the tensors in the work objects stored in `workMetaList_` (it's
called a work meta list because these work objects track only the metadata and
not the actual tensors). As a result, when `WorkNCCL::operator<<` tried to log
tensor shapes for these work objects from the watchdog thread, the async error
handling mechanism hung (in the desync test) or segfaulted (in the desync
flow). This PR removes the tensor-related logging from the `operator<<` function.
ghstack-source-id: 114192929
Test Plan: Verified that this fixes the desync test and desync flow.
Reviewed By: jiayisuse
Differential Revision: D24268204
fbshipit-source-id: 20ccb8800aa3d71a48bfa3cbb65e07ead42cd0dc