fc55290e - Fix distributed autograd gradients synchronization (#57792)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57792

There are two problems when using CUDA RPC with distributed autograd and the distributed optimizer:

1) In the local autograd engine, all autograd functions/nodes, including AccumulateGrad, use the forward stream for backward computation. Distributed autograd, however, skips the AccumulateGrad autograd function/node and calls directly into `AccumulateGrad::accumulateGrad`. As a result, it accumulates gradients on the default stream instead of the forward stream. This commit changes that and uses the forward stream to accumulate gradients, matching forward behavior.

2) The distributed optimizer and the distributed autograd backward pass are separate RPC calls, and CUDA streams are not synchronized across different RPC calls. As a result, the distributed optimizer might consume gradients before they are ready. This commit uses CUDA events to record the completion of gradient computation, and uses those events to block the current streams when `getGradients()` is called.

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D28274876

Pulled By: mrshenli

fbshipit-source-id: 22e607152324ae918084066cde8c5dbb418bba7c
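The sketch below illustrates the CUDA-event pattern described in (2): the producer records an event after writing a gradient, and the consumer makes its current stream wait on that event before reading the gradient. This is a minimal illustration of the synchronization idea, not the actual distributed autograd implementation; the names `producer_stream`, `grad_ready`, and the toy gradient computation are assumptions for the example.

```python
import torch

device = torch.device("cuda")
producer_stream = torch.cuda.Stream(device=device)
grad_ready = torch.cuda.Event()

param = torch.randn(1024, device=device, requires_grad=True)

# "Backward" side: compute a gradient on its own stream and record completion.
with torch.cuda.stream(producer_stream):
    grad = 2.0 * param            # stands in for the real gradient computation
    grad_ready.record(producer_stream)

# "Optimizer" side (conceptually a separate RPC call): block the current
# stream on the recorded event before consuming the gradient, mirroring
# what getGradients() does with the stored events in this commit.
grad_ready.wait(torch.cuda.current_stream(device))
with torch.no_grad():
    param -= 0.1 * grad           # safe: the gradient is guaranteed ready
```

Without the `grad_ready.wait(...)` call, the update on the current stream could race with the gradient computation still running on `producer_stream`, which is the failure mode the commit fixes across RPC boundaries.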