Fix CUDA sync when switching streams in RPC tests (#59297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59297
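When a tensor is used on a stream other than the one it was allocated on, PyTorch requires users to manually record that usage with the CUDA caching allocator (in Python, via `Tensor.record_stream`). The RPC tests weren't doing this when switching streams.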
Also, the explicit Event used to order the two streams can be replaced by a direct stream-to-stream wait, `s1.wait_stream(s2)`.
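
For illustration only, a minimal sketch of the pattern described above, not the actual test code from this PR. The helper name `add_one_on_side_stream` and the tensor shapes are hypothetical; the calls used (`torch.cuda.Stream.wait_stream`, `Tensor.record_stream`, `torch.cuda.stream`) are the public PyTorch APIs for stream ordering and for informing the caching allocator.

```python
import torch


def add_one_on_side_stream(x: torch.Tensor) -> torch.Tensor:
    """Compute x + 1 on a side stream and hand the result back to the current stream.

    Hypothetical helper for illustration; assumes x lives on a CUDA device.
    """
    current = torch.cuda.current_stream(x.device)
    side = torch.cuda.Stream(device=x.device)

    # The side stream must not read x before the current stream has produced it.
    # This direct wait stands in for recording an Event and waiting on it.
    side.wait_stream(current)

    with torch.cuda.stream(side):
        y = x + 1

    # x was allocated on `current` but read on `side`: tell the caching
    # allocator so the memory is not reused while `side` may still need it.
    x.record_stream(side)

    # Switch back: the current stream waits for the side stream's work.
    current.wait_stream(side)

    # y was allocated on `side` but will be consumed on `current`.
    y.record_stream(current)
    return y


if torch.cuda.is_available():
    x = torch.ones(4, device="cuda")
    y = add_one_on_side_stream(x)
    torch.cuda.synchronize()
    assert torch.equal(y.cpu(), torch.full((4,), 2.0))
```

Without the `record_stream` calls, the allocator only tracks the allocating stream and may hand a freed block to a new tensor while the other stream is still using it.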
ghstack-source-id: 130583777
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28832902
fbshipit-source-id: cd4f40ff811fa1b0042deedda2456e22f33b92bd