pytorch
0b84f45f - Perform appropriate CUDA stream synchronization in distributed autograd. (#53769)

Perform appropriate CUDA stream synchronization in distributed autograd. (#53769)

Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/53769

The local autograd engine performs appropriate stream synchronization between autograd nodes in the graph, ensuring that a consumer's stream is synchronized with the producer's stream before the consumer executes. In distributed autograd, however, the SendRpcBackward function receives gradients over the wire, and TensorPipe uses its own pool of streams for this purpose. As a result, the tensors are received on TensorPipe's stream pool while SendRpcBackward runs on a different stream during the backward pass, and there is no logic to synchronize these streams. To fix this, I've enhanced DistEngine to synchronize these streams appropriately when it receives grads over the wire.

ghstack-source-id: 123607221

Test Plan:
1) Added a unit test which reproduces the issue.
2) waitforbuildbot.

Reviewed By: wanchaol, mrshenli

Differential Revision: D26955317

fbshipit-source-id: eace6d4f91d4006c9c16ede5ac16362ada052406
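The synchronization pattern the commit describes can be sketched with PyTorch's public stream API. This is a hedged illustration, not the actual DistEngine code: the function name `synchronize_grad_streams` and the stream variable names are hypothetical, but `Stream.wait_stream` and `Tensor.record_stream` are the real primitives used for cross-stream producer/consumer ordering.

```python
# Hypothetical sketch of the fix: make the backward-pass stream wait for
# the communication stream that received the gradient tensors, so
# SendRpcBackward never reads gradients before they are fully written.
import torch

def synchronize_grad_streams(grads, recv_stream, backward_stream):
    # wait_stream enqueues an event so that all work subsequently submitted
    # to backward_stream blocks until work already queued on recv_stream
    # (e.g. the gradient receive) has completed. This is asynchronous with
    # respect to the CPU.
    backward_stream.wait_stream(recv_stream)
    # record_stream tells the caching allocator the tensor is also in use
    # on backward_stream, preventing its memory from being reused while
    # that stream may still be reading it.
    for g in grads:
        g.record_stream(backward_stream)
    return grads

if torch.cuda.is_available():
    recv = torch.cuda.Stream()      # stands in for TensorPipe's pool stream
    bwd = torch.cuda.Stream()       # stands in for the autograd-engine stream
    with torch.cuda.stream(recv):
        grads = [torch.ones(4, device="cuda")]  # "received" gradient
    synchronize_grad_streams(grads, recv, bwd)
```

Without the `wait_stream` call, kernels on the backward stream could launch before the receive on the communication stream finishes, which is exactly the race the unit test in this commit reproduces.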