Use fresh stream from pool for each FutureNCCL callback (#48498)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48498
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
FutureNCCL has a dedicated CUDA stream that it sets as current when running callbacks. This stream is initialized by the ProcessGroupNCCL by extracting it from the global ATen pool.
In order to decouple FutureNCCL from that specific ProcessGroup and make it more generic, in this commit we make FutureNCCL extract a fresh stream from the ATen pool each time it needs one.
This introduces a functional change, because it removes the implicit synchronization and ordering between the callbacks of a same Future. In fact, such an ordering is hard to guarantee in the general case as, for example, a user could attach a new callback just after the future becomes completed, and thus that callback would be run inline, immediately, out-of-order wrt the other callbacks. (There are ways to "fix" this but they are complicated). NCCL got around this because its futures are already marked complete when they're returned, but in fact it could also run into issues if multiple threads were adding callbacks simultaneously.
Note that it remains still possible to enforce ordering between callbacks, but one must now do so explicitly. Namely, instead of this:
```
fut.then(cb1)
fut.then(cb2)
```
one must now do:
```
fut.then(cb1).then(cb2)
```
ghstack-source-id: 118180029
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25177559
fbshipit-source-id: 4d4e73ea7bda0ea65066548109b9ea6d5b465599