Fix deadlock for multi-output forward AD (#67995)
Summary:
This will hide some of the issues from https://github.com/pytorch/pytorch/issues/67367
This will at least allow us to run gradcheck for now until the above issue is fixed.
For more context, the deadlock happens when we (wrongfully) set a forward grad that itself has a forward grad of the same level.
In particular, when exiting the level at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L23, we take the `all_forward_levels_mutex_` lock and proceed to delete the level at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L29 (nothing else usually references this object, so it gets deleted as soon as it is removed from the vector). Note that, at this point, we still hold the lock!
In the level destructor at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L55, we delete the forward grad, which triggers the deletion of the grad Tensor and everything it holds (assuming nothing else references it).
But in the (bad) case where this Tensor itself has a forward grad for this level, destroying it makes its autograd meta clear its forward grads: https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.h#L124
While clearing, we access the level (to de-register this forward grad) via https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.h#L139
But this takes the `all_forward_levels_mutex_` lock again at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L39; since the current thread already holds that (non-recursive) lock, we deadlock.
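
To make the re-entrant locking concrete, here is a minimal, self-contained C++ sketch of the pattern, with made-up names (`Level`, `Tensor`, `exit_level`, and `all_levels_mutex` standing in for `all_forward_levels_mutex_`). This illustrates the lock pattern only, not the actual PyTorch code, and `exit_level_fixed` shows a generic remedy for this class of deadlock rather than necessarily the approach taken by this PR:

```cpp
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct Tensor;
struct Level;

// Global state standing in for the real level registry.
std::mutex all_levels_mutex;                 // ~ all_forward_levels_mutex_
std::vector<std::shared_ptr<Level>> all_levels;

struct Level {
  // Grad tensors registered at this level. The (implicit) destructor
  // drops them, mirroring the level destructor resetting its grads.
  std::vector<std::shared_ptr<Tensor>> grads;
};

struct Tensor {
  // The "bad" case: this grad tensor itself carries a forward grad for
  // the same level, so destroying it must de-register that grad.
  std::shared_ptr<Tensor> fw_grad;
  ~Tensor() {
    if (fw_grad) {
      // De-registration looks the level up through the global registry,
      // which takes all_levels_mutex again. std::mutex is not recursive,
      // so if the thread destroying us already holds it, this line
      // blocks forever.
      std::lock_guard<std::mutex> guard(all_levels_mutex);
      // ... find the level in all_levels and de-register fw_grad ...
    }
  }
};

// Deadlocks in the bad case: the level (and hence its grad tensors) is
// destroyed while the mutex is still held by this thread.
void exit_level() {
  std::lock_guard<std::mutex> guard(all_levels_mutex);
  all_levels.pop_back();  // runs ~Level -> ~Tensor -> relock attempt
}

// Generic remedy: move the last reference out of the critical section so
// the destructor cascade runs only after the mutex has been released.
void exit_level_fixed() {
  std::shared_ptr<Level> level;
  {
    std::lock_guard<std::mutex> guard(all_levels_mutex);
    level = std::move(all_levels.back());
    all_levels.pop_back();
  }
  // `level` (and any ~Tensor it triggers) is destroyed here, lock-free.
}
```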
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67995
Reviewed By: soulitzer
Differential Revision: D32250996
Pulled By: albanD
fbshipit-source-id: f6118117effd3114fa90dc8fe22865339445f70c