[NCCL] Fix Hang in Async Error Handling due to Work logging (#46265)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46265
tl;dr - we must remove tensor-related logging from the
`WorkNCCL::operator<<` function; otherwise printing the work objects tracked in
`workMetaList_` causes segfaults.
The Work objects we track in `workMetaList_` for the NCCL Async Error
Handling mechanism don't have any `outputs_`. As described in the `workEnqueue`
function, destructing the output tensors calls into `autograd_meta`, which
must happen in the user thread, but our system destructs work objects in the
`workCleanupThread`, which could lead to a deadlock. We avoid this problem by
not tracking the tensors in the work objects stored in `workMetaList_` (it's
called a work meta list because these work objects track only the metadata and
not the actual tensors). As a result, when `WorkNCCL::operator<<` tried to log
tensor shapes for these work objects from the watchdog thread, the async error
handling mechanism hung (in the desync test) or segfaulted (in the desync
flow). This PR removes the tensor-related logging from the `operator<<` function.
ghstack-source-id: 114192929
Test Plan: Verified that this fixes the desync test and desync flow.
Reviewed By: jiayisuse
Differential Revision: D24268204
fbshipit-source-id: 20ccb8800aa3d71a48bfa3cbb65e07ead42cd0dc