Avoid using FutureNCCL before it's ready (#48561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48561
This commit is part of a stack that reworks FutureNCCL in order to extract a generic CUDA-aware Future subclass. The stack deliberately breaks up this transition into elementary changes, to make it easier to verify that the behavior is preserved (or to highlight how it gets changed).
---
WorkNCCL allows one to extract a FutureNCCL through getFuture(). There is one instance of this method being called by ProcessGroupNCCL itself, in order to attach a callback to it. This happened _before_ the work was actually launched; however, FutureNCCL _always_ invokes its callbacks immediately inline. The events that the FutureNCCL was using had not been recorded yet, so blocking on them was a no-op. Moreover, the function being called was installed by the generic ProcessGroup superclass, which is not CUDA-aware, and thus probably made no use of the CUDA events or streams.
https://github.com/pytorch/pytorch/blob/383abf1f0c1f74e0f471d47e505895d1b0e6bb20/torch/lib/c10d/ProcessGroup.cpp#L66
In short: I believe that creating a FutureNCCL and attaching a callback to it was equivalent to just invoking that function directly, without anything CUDA-specific. I am thus converting the code to do exactly that, in order to simplify it.
Note that, given the comment, I don't think this was the original intention of that code. It seems that the function was intended to be run once the work finished. However, I am not familiar with this code, and I don't want to introduce any functional changes.
ghstack-source-id: 118180037
Test Plan: Unit tests
Reviewed By: mrshenli
Differential Revision: D25210337
fbshipit-source-id: 54033c814ac77641cbbe79b4d01686dfc2b45495