Let RRef getValue() synchronize CUDA streams (#56895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56895
PR #54932 fixed CUDA stream synchronization between RPC-created
OwnerRRef and UserRRef when `to_here()` is invoked. However, two
gaps remain:
1. The RRef value can be accessed on the owner directly through
`local_value`, which bypasses the fix in #54932.
2. When an RRef is created directly through the RRef ctor instead of
through RPC, the OwnerRRef cannot correctly record CUDA events.
This PR fixes 1 by making the current streams wait for the CUDA
events recorded by the RRef before `RRef::getValue()` returns the value.
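The ordering guarantee described above can be illustrated with a minimal,
pure-Python sketch. It is a toy model, not the actual C++ implementation:
`Stream`, `Event`, `ToyOwnerRRef`, `run_all`, and `demo` are hypothetical
names invented for this example, and streams are simulated as FIFO queues
rather than real CUDA streams.

```python
from collections import deque


class Event:
    """Toy stand-in for a CUDA event: done once the stream reaches it."""

    def __init__(self):
        self.done = False


class Stream:
    """Toy stand-in for a CUDA stream: a FIFO of operations run in order."""

    def __init__(self, name):
        self.name = name
        self.ops = deque()

    def launch(self, fn):
        self.ops.append(("run", fn))

    def record_event(self, event):
        self.ops.append(("record", event))

    def wait_event(self, event):
        self.ops.append(("wait", event))


def run_all(streams):
    """Drain all streams; a stream stalls while it waits on an unrecorded event."""
    progress = True
    while progress:
        progress = False
        for s in streams:
            while s.ops:
                kind, arg = s.ops[0]
                if kind == "wait" and not arg.done:
                    break  # blocked: let the other streams make progress
                s.ops.popleft()
                if kind == "run":
                    arg()
                elif kind == "record":
                    arg.done = True
                progress = True


class ToyOwnerRRef:
    """Hypothetical model of an OwnerRRef that holds a value plus recorded events."""

    def __init__(self, value, events):
        self.value = value
        self.events = events

    def get_value(self, current_stream, sync=True):
        # The idea of the fix: make the caller's current stream wait for
        # every recorded event before handing out the value.
        if sync:
            for ev in self.events:
                current_stream.wait_event(ev)
        return self.value


def demo(sync):
    producer, consumer = Stream("producer"), Stream("consumer")
    buf = {"data": None}  # stands in for a CUDA tensor being written
    seen = []
    ready = Event()
    producer.launch(lambda: buf.update(data=42))  # asynchronous "kernel"
    producer.record_event(ready)
    rref = ToyOwnerRRef(buf, [ready])
    value = rref.get_value(consumer, sync=sync)
    consumer.launch(lambda: seen.append(value["data"]))
    run_all([consumer, producer])  # consumer goes first to expose the race
    return seen[0]
```

In this toy model, `demo(sync=True)` reads the value only after the producer
has written it, while `demo(sync=False)` reads the buffer before the write
lands, which mirrors the gap left open when the value is accessed without
waiting on the recorded events.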
For 2, more discussion is needed to decide whether we should add
a `devices` argument to the RRef ctor, or whether the RRef ctor should
inspect the given values.
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D27992775
Pulled By: mrshenli
fbshipit-source-id: ed0e5bfbf715460208c85e46dd3317deef17f8fe