Run dist_autograd backward RPCs on appropriate CUDA streams. (#60606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60606
TensorPipe receives tensors over the wire on custom streams, and these
streams are passed to some RPC callbacks but not to `BACKWARD_AUTOGRAD_REQ`. As a
result, `BACKWARD_AUTOGRAD_REQ` ran on the default stream while still using
tensors that lived on the custom streams, which caused downstream autograd
operations to run on the wrong stream.
To fix this, I've passed the streams to `BACKWARD_AUTOGRAD_REQ` as well and
added an appropriate stream guard around the backward processing.
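
Below is a minimal sketch of the stream-guard pattern this change applies (not the exact PR diff): guard the backward-autograd handler with the streams TensorPipe used to receive the tensors, so any CUDA work it launches lands on those streams instead of the default stream. The function name and parameter are illustrative.

```cpp
#include <c10/core/Stream.h>
#include <c10/core/StreamGuard.h>
#include <vector>

// Hypothetical handler for BACKWARD_AUTOGRAD_REQ; `streams` are the
// per-device streams on which TensorPipe received the tensors.
void handleBackwardAutogradReq(const std::vector<c10::Stream>& streams) {
  // While this guard is alive, the current stream on each involved device
  // is the receive stream, so downstream autograd kernels are enqueued on
  // the correct stream rather than the default one.
  c10::MultiStreamGuard guard(streams);

  // ... run the dist_autograd backward work here; any CUDA ops launched in
  // this scope use the guarded streams ...
}
```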
Closes: https://github.com/pytorch/pytorch/issues/59793
ghstack-source-id: 132252069
Test Plan: Tested against the repro in https://github.com/pytorch/pytorch/issues/59793
Reviewed By: mrshenli
Differential Revision: D29347244
fbshipit-source-id: 8ff8b150763c970ab15c2cac8dccf56e66e9ef5d