[reland] Make TP agent use streams from Future when sending response (#59212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59212
Reland of https://github.com/pytorch/pytorch/pull/58428
Until now, the TP agent assumed that the output of a remote function was on the same streams as its inputs. In other words, it used the lazy stream context of the inputs to synchronize the output tensors. That assumption holds in the most common case, a synchronous remote function. However, it does not hold for async functions, for fetching RRefs, and similar cases. The more general approach is to use the CUDA events held by the Future to perform this synchronization. (These events may be on the input streams, or they may not be!)
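
To illustrate why synchronizing on the input streams is not enough, here is a minimal toy model in plain Python. It is only a sketch: `Stream`, `Event`, `Future`, `safe_to_send` and `demo` are hypothetical stand-ins invented for this example, not the TP agent's or PyTorch's actual classes, and a real CUDA stream would wait on an event rather than poll it as done here.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    """Toy stand-in for a CUDA stream: work finishes in submission order."""
    name: str
    submitted: int = 0  # kernels enqueued so far
    completed: int = 0  # kernels finished so far

    def launch(self) -> None:
        self.submitted += 1

    def run_all(self) -> None:
        self.completed = self.submitted


@dataclass
class Event:
    """Toy stand-in for a CUDA event: marks a point in one stream's queue."""
    stream: Stream
    marker: int

    @classmethod
    def record(cls, stream: Stream) -> "Event":
        return cls(stream, stream.submitted)

    def is_ready(self) -> bool:
        return self.stream.completed >= self.marker


@dataclass
class Future:
    """Toy future that carries the events guarding its value."""
    value: str
    events: list


def safe_to_send(events: list) -> bool:
    # The response may be sent only once every guarding event has completed.
    return all(event.is_ready() for event in events)


def demo() -> tuple:
    input_stream = Stream("input")
    side_stream = Stream("side")

    # The inputs were produced on input_stream, and that work has finished.
    input_stream.launch()
    input_event = Event.record(input_stream)
    input_stream.run_all()

    # An async function produces its output on a different stream.
    side_stream.launch()
    future = Future("output", [Event.record(side_stream)])

    # Old assumption: checking the input stream misses the pending side work.
    old_ready = safe_to_send([input_event])
    # New approach: the Future's own events track where the output lives.
    new_ready_before = safe_to_send(future.events)
    side_stream.run_all()
    new_ready_after = safe_to_send(future.events)
    return old_ready, new_ready_before, new_ready_after
```

In this model, the check against the input stream reports the output as ready while the side stream still has pending work, whereas the check against the events held by the Future only reports ready once that work has completed.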
ghstack-source-id: 130202842
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28623885
fbshipit-source-id: 29333bcb75d077ab801eac92017d0e381e8f5569