Let child CUDAFuture wait for parent CUDAFuture's CUDAEvents (#51820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51820
If the child cannot extract tensors from the returned IValue, the
child CUDAFuture does not wait for anything. In that case, if
`wait()` was not called on the parent Future, the streams are not
synchronized, and the parent Future's CUDA ops may not have been
added to the streams yet.
This commit adds `markCompletedWithDataPtrs()` to `ivalue::Future`,
and RPC uses this API to pass the Message tensor dataPtrs to the
`PyObject` Future when marking it as completed.
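The idea can be illustrated with a toy model in plain Python. This is
only a sketch and not PyTorch's implementation: `ToyFuture`,
`extract_tensors`, `mark_completed`, `mark_completed_with_data_ptrs`,
and the string "events" are all hypothetical stand-ins for the real
CUDAFuture, its tensor extraction, and its CUDAEvents.

```python
def extract_tensors(value):
    # Hypothetical stand-in for tensor extraction: only lists are
    # "understood"; any other object is opaque and yields nothing.
    return list(value) if isinstance(value, list) else []


class ToyFuture:
    """Toy stand-in for a CUDA-aware future (not the real class)."""

    def __init__(self):
        self.value = None
        self.events = []  # stand-in for recorded CUDA events

    def mark_completed(self, value):
        # Events are recorded only for tensors found in the value.
        self.value = value
        self.events = [f"event:{t}" for t in extract_tensors(value)]

    def mark_completed_with_data_ptrs(self, value, data_ptrs):
        # The caller supplies the data pointers explicitly, so events
        # are recorded even when the value itself is opaque.
        self.value = value
        self.events = [f"event:{p}" for p in data_ptrs]

    def wait(self):
        # Return the events a caller's stream would synchronize on.
        return list(self.events)


opaque = object()  # models a value tensors cannot be extracted from

without_ptrs = ToyFuture()
without_ptrs.mark_completed(opaque)

with_ptrs = ToyFuture()
with_ptrs.mark_completed_with_data_ptrs(opaque, ["ptr0", "ptr1"])
```

In this model, `without_ptrs.wait()` has nothing to synchronize on,
while `with_ptrs.wait()` sees one event per supplied pointer.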
Test Plan: Imported from OSS
Reviewed By: pritamdamania87
Differential Revision: D26324068
Pulled By: mrshenli
fbshipit-source-id: 3d838754f6daabad5cd9fb8953e4360196d110bb