Drop FutureNCCL in favor of vanilla CUDAFuture (#49014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49014
We extracted a generic and reusable CUDAFuture class from FutureNCCL, but left FutureNCCL around as a subclass of CUDAFuture in order to deal with two peculiarities of ProcessGroupNCCL: the future is already completed when it is constructed, and its CUDA events are _shared_ with those of the WorkNCCL. This required some "hacks" in CUDAFuture itself (protected members, fields wrapped in shared_ptrs, ...).
My understanding is that creating CUDA events is a rather cheap operation. If so, we can afford to record the events _twice_ after each NCCL call, once for the WorkNCCL and once for the future. Doing so lets us use the CUDAFuture class directly and revert all of its hacks.
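As a rough illustration of the "record twice" idea, here is a minimal, CUDA-free sketch. Every name in it (MockEvent, MockStream, MockWork, MockFuture, collective) is a hypothetical stand-in invented for this example, not the actual ProcessGroupNCCL or CUDAFuture code. It only models ownership: each consumer gets its own event recorded at the same point in the stream, so nothing has to be shared through shared_ptrs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a CUDA event: it only remembers how much work
// had been enqueued on the stream at the moment it was recorded.
struct MockEvent {
  std::size_t recorded_at = 0;
};

// Hypothetical stand-in for a CUDA stream.
struct MockStream {
  std::size_t enqueued = 0;        // number of operations launched so far
  std::size_t events_created = 0;  // number of events recorded so far

  void launch() { ++enqueued; }

  MockEvent record() {
    ++events_created;
    return MockEvent{enqueued};
  }
};

// Each consumer owns its events by value; nothing is shared between them.
struct MockWork {
  std::vector<MockEvent> events;
};
struct MockFuture {
  std::vector<MockEvent> events;
};

struct Result {
  MockWork work;
  MockFuture future;
};

// After each (mock) collective call, record one event for the work object
// and a second, independent event for the future.
inline Result collective(std::vector<MockStream>& streams) {
  assert(!streams.empty());
  Result r;
  for (auto& s : streams) {
    s.launch();                             // stands in for the NCCL call
    r.work.events.push_back(s.record());    // first record: for the work
    r.future.events.push_back(s.record());  // second record: for the future
  }
  return r;
}
```

In this model both events observe the same point in the stream, so waiting on either is equivalent; the only cost is one extra event per stream, which is the trade-off the paragraph above argues is acceptable.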
ghstack-source-id: 118391217
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25355272
fbshipit-source-id: 3a2a0891724928221ff0f08600675d2f5990e674