Make TP agent use streams from Future when sending response (#58428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58428
Until now, the TP agent expected the output of a remote function to be on the same streams as the inputs. In other words, it used the lazy stream context of the inputs to synchronize the output tensors. This was true in the most common case of a synchronous remote function. However it wasn't true for async functions, for fetching RRefs, ... The more generic way is to use the CUDA events held by the Future to perform this synchronization. (These events may be on the input streams, or they may not be!).
ghstack-source-id: 129567045
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28474982
fbshipit-source-id: c0034eb3f2a2ea525efb63a31b839bc086060e7e