[FSDP2] Used stream APIs for CUDA event handling (#120231)
If we already have Python `Stream` objects, then calling `stream1.wait_stream(stream2)` is syntactic sugar for creating an `event: Event`, recording it in `stream2`, and calling `stream1.wait_event(event)`.
~~Getting a Python `Stream` object incurs some CPU overhead, so we prefer to not change other callsites where we do not already have the `Stream` objects.~~
Update: Calling `event.record()` with no stream specified calls `torch.cuda.current_stream()` internally, so it also constructs a Python `Stream` object and the CPU overhead should be identical either way.
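
For illustration only (not code from this PR), the equivalence described above can be sketched as follows. `wait_stream_via_event` and `demo` are hypothetical names introduced here; the helper only assumes stream objects that expose `record_event()` and `wait_event()`, as `torch.cuda.Stream` does.

```python
def wait_stream_via_event(waiter, producer):
    """Hypothetical helper: spell out `waiter.wait_stream(producer)` with events."""
    # Record a new event at the current tail of `producer`'s work queue.
    event = producer.record_event()
    # Work enqueued on `waiter` after this call waits for that event.
    waiter.wait_event(event)
    return event


def demo():
    """Exercise both forms on real CUDA streams; returns False without CUDA."""
    import torch  # lazy import so the helper above is usable without torch

    if not torch.cuda.is_available():
        return False
    producer = torch.cuda.Stream()
    waiter = torch.cuda.current_stream()
    with torch.cuda.stream(producer):
        x = torch.ones(1024, device="cuda") * 2  # enqueued on `producer`
    waiter.wait_stream(producer)             # sugar form
    wait_stream_via_event(waiter, producer)  # explicit event form
    y = x + 1  # enqueued on `waiter`, ordered after the producer's work
    torch.cuda.synchronize()
    return bool((y == 3).all())
```

Issuing both forms back to back in `demo` is redundant but harmless; it is done here only to show that they are interchangeable at a callsite that already holds both `Stream` objects.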
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120231
Approved by: https://github.com/yifuwang
ghstack dependencies: #118298, #119985