Fix deadlock for multi-output forward AD (#67995)
Summary:
This will hide some of the issues from https://github.com/pytorch/pytorch/issues/67367
This will at least allow us to run gradcheck for now until the above issue is fixed.
For more context, the deadlock happens when we (wrongfully) set a forward grad that itself has a forward grad of the same level.
In particular, when exiting the level at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L23, we take the `all_forward_levels_mutex_` lock and proceed to delete the level at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L29 (nothing else usually references this object, so it gets deleted as soon as it is removed from the vector). Note that, at this point, we still hold the lock!
In the level destructor at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L55, we delete the forward grad, which triggers the deletion of the grad Tensor and everything it holds (assuming nothing else references it).
But in the (bad) case where this Tensor itself has a forward grad for this level, destroying it makes its autograd meta clear its forward grads: https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.h#L124
While clearing, we access the level (to de-register this forward grad) via https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.h#L139
But this takes the `all_forward_levels_mutex_` lock again at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L39; since the current thread already holds that (non-recursive) lock, we deadlock.
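
To make the re-entrant locking concrete, here is a minimal, self-contained C++ sketch of the pattern, with made-up names (`Level`, `Tensor`, `exit_level`, and `all_levels_mutex` standing in for `all_forward_levels_mutex_`). This illustrates the lock pattern only, not the actual PyTorch code, and `exit_level_fixed` shows a generic remedy for this class of deadlock rather than necessarily the approach taken by this PR:

```cpp
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

struct Tensor;
struct Level;

// Global state standing in for the real level registry.
std::mutex all_levels_mutex;                 // ~ all_forward_levels_mutex_
std::vector<std::shared_ptr<Level>> all_levels;

struct Level {
  // Grad tensors registered at this level. The (implicit) destructor
  // drops them, mirroring the level destructor resetting its grads.
  std::vector<std::shared_ptr<Tensor>> grads;
};

struct Tensor {
  // The "bad" case: this grad tensor itself carries a forward grad for
  // the same level, so destroying it must de-register that grad.
  std::shared_ptr<Tensor> fw_grad;
  ~Tensor() {
    if (fw_grad) {
      // De-registration looks the level up through the global registry,
      // which takes all_levels_mutex again. std::mutex is not recursive,
      // so if the thread destroying us already holds it, this line
      // blocks forever.
      std::lock_guard<std::mutex> guard(all_levels_mutex);
      // ... find the level in all_levels and de-register fw_grad ...
    }
  }
};

// Deadlocks in the bad case: the level (and hence its grad tensors) is
// destroyed while the mutex is still held by this thread.
void exit_level() {
  std::lock_guard<std::mutex> guard(all_levels_mutex);
  all_levels.pop_back();  // runs ~Level -> ~Tensor -> relock attempt
}

// Generic remedy: move the last reference out of the critical section so
// the destructor cascade runs only after the mutex has been released.
void exit_level_fixed() {
  std::shared_ptr<Level> level;
  {
    std::lock_guard<std::mutex> guard(all_levels_mutex);
    level = std::move(all_levels.back());
    all_levels.pop_back();
  }
  // `level` (and any ~Tensor it triggers) is destroyed here, lock-free.
}
```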
Pull Request resolved: https://github.com/pytorch/pytorch/pull/67995
Reviewed By: soulitzer
Differential Revision: D32250996
Pulled By: albanD
fbshipit-source-id: f6118117effd3114fa90dc8fe22865339445f70c