Fix distributed autograd gradients synchronization (#57792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57792
There are two problems when using CUDA RPC with distributed autograd
and distributed optimizer:
1) In the local autograd engine, all autograd functions/nodes, including
AccumulateGrad, run their backward computation on the stream that was
used in the forward pass. Distributed autograd, however, skips the
AccumulateGrad autograd function/node and calls directly into
`AccumulateGrad::accumulateGrad`. As a result, gradients were
accumulated on the default stream instead of the forward stream. This
commit switches gradient accumulation onto the forward stream, matching
the forward behavior (see the first sketch below).
2) The distributed optimizer and distributed autograd backward are
separate RPC calls, and CUDA streams are not synchronized across
different RPC calls. As a result, the distributed optimizer might
consume gradients before they are ready. This commit uses CUDA events
to record the completion of gradient accumulation, and uses those
events to block the current streams when getGradients() is called
(see the second sketch below).
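A minimal sketch of the idea behind (1). This is illustrative only, not the actual engine code: the helper name `accumulateOnForwardStream` and its parameters are made up, but `c10::cuda::CUDAStreamGuard` is the standard way to make a given stream current for a scope.

```cpp
// Sketch: accumulate on the stream used for the forward pass instead of
// the default stream, so backward accumulation matches forward behavior.
#include <ATen/ATen.h>
#include <c10/cuda/CUDAGuard.h>
#include <c10/cuda/CUDAStream.h>

void accumulateOnForwardStream(
    at::Tensor& grad_sum,                     // running gradient (hypothetical)
    const at::Tensor& new_grad,               // gradient produced by backward
    const c10::cuda::CUDAStream& fwd_stream)  // stream recorded during forward
{
  // Make fwd_stream the current stream for this scope so the accumulation
  // below is queued on the forward stream rather than the default stream.
  c10::cuda::CUDAStreamGuard guard(fwd_stream);
  grad_sum.add_(new_grad);
}
```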
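And a minimal sketch of the idea behind (2), assuming a hypothetical per-parameter store (`GradSlot`, `markGradReady`, and `getGradient` are illustrative names, not the real context API): record an `at::cuda::CUDAEvent` on the stream that accumulated the gradient, and have the accessor block the caller's current stream on that event before handing the gradient out.

```cpp
// Sketch: use CUDA events to make gradients safe to consume from a
// different RPC call / stream than the one that produced them.
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>
#include <cstdint>
#include <unordered_map>
#include <utility>

struct GradSlot {
  at::Tensor grad;
  at::cuda::CUDAEvent ready;  // records completion of the accumulation
};

std::unordered_map<int64_t, GradSlot> grads_;  // hypothetical storage

// Called on the stream that accumulated the gradient.
void markGradReady(
    int64_t param_id,
    at::Tensor grad,
    const c10::cuda::CUDAStream& stream) {
  auto& slot = grads_[param_id];
  slot.grad = std::move(grad);
  slot.ready.record(stream);
}

// Called later, possibly from a different RPC thread/stream.
at::Tensor getGradient(int64_t param_id) {
  auto& slot = grads_.at(param_id);
  // Block the caller's current stream until accumulation has finished, so a
  // distributed optimizer running on this stream sees a complete gradient.
  slot.ready.block(c10::cuda::getCurrentCUDAStream());
  return slot.grad;
}
```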
Test Plan: Imported from OSS
Reviewed By: pritamdamania87
Differential Revision: D28274876
Pulled By: mrshenli
fbshipit-source-id: 22e607152324ae918084066cde8c5dbb418bba7c