Make FutureNCCL record events in current stream (#48497)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48497
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
When we record the events that mark a "follow-up" future as complete (for a callback), we used to record them onto the dedicated stream. However, that stream is the current stream at that point, so we can simply record them onto the current stream instead. This introduces no functional difference. The reason for adding this layer of indirection is so that the dedicated stream is only referenced inside the `addCallback` method, which will later make it easier to change how that stream works.
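To illustrate why the two approaches are equivalent, here is a minimal, CUDA-free sketch. Everything in it (`Stream`, `Event`, `StreamGuard`, `getCurrentStream`, `runCallback`) is a hypothetical toy stand-in, not the actual c10 or FutureNCCL code; it only assumes that the callback path makes the dedicated stream current before the events are recorded.

```cpp
#include <cassert>

// Toy stand-ins for CUDA streams and events (NOT the real c10 types).
struct Stream { int id; };

struct Event {
  int recorded_on = -1;  // id of the stream this event was recorded on
  void record(const Stream& s) { recorded_on = s.id; }
};

// Thread-local "current stream", mimicking the notion of a current CUDA stream.
thread_local Stream g_current{0};

Stream getCurrentStream() { return g_current; }

// RAII guard: makes a stream current, restores the previous one on destruction.
class StreamGuard {
 public:
  explicit StreamGuard(Stream s) : prev_(g_current) { g_current = s; }
  ~StreamGuard() { g_current = prev_; }
  StreamGuard(const StreamGuard&) = delete;
  StreamGuard& operator=(const StreamGuard&) = delete;
 private:
  Stream prev_;
};

// Before: the recording code has to know about the dedicated stream.
void recordOnDedicated(Event& e, const Stream& dedicated) { e.record(dedicated); }

// After: the recording code only consults the current stream.
void recordOnCurrent(Event& e) { e.record(getCurrentStream()); }

// Hypothetical callback path: the only place that references the dedicated stream.
int runCallback(const Stream& dedicated, bool use_current) {
  StreamGuard guard(dedicated);  // the dedicated stream is now the current stream
  Event e;
  if (use_current) {
    recordOnCurrent(e);
  } else {
    recordOnDedicated(e, dedicated);
  }
  return e.recorded_on;
}
```

In this sketch, `runCallback(dedicated, true)` and `runCallback(dedicated, false)` record the event on the same stream, because the guard has already made the dedicated stream current. Only `runCallback` mentions the dedicated stream, which mirrors the goal of confining it to `addCallback`.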
ghstack-source-id: 118180035
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177553
fbshipit-source-id: c6373eddd34bd399df09fd4861915bf98fd50681