[NCCL] DDP communication hook: getFuture() (#41596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41596
This PR revises the earlier design of the `convert_dist_work_to_future` API proposed in GH issue [#39272](https://github.com/pytorch/pytorch/issues/39272):
1. Whenever we create a `WorkNCCL` object, create a `Future` associated with `WorkNCCL` and store it with the object.
2. Add an API `c10::intrusive_ptr<c10::ivalue::Future> getFuture()` to `c10d::ProcessGroup::Work`.
3. In this first version, only NCCL supports the API; the default implementation throws UnsupportedOperation.
4. To mark the future associated with WorkNCCL completed, implement a `cudaStreamCallback` function.
Note: `cudaStreamAddCallback` is marked as deprecated. The alternative, `cudaLaunchHostFunc`, is supported only for CUDA > 10, and `cudaStreamAddCallback` may not be removed until a reasonable alternative is available, according to [this discussion](https://stackoverflow.com/questions/56448390/how-to-recover-from-cuda-errors-when-using-cudalaunchhostfunc-instead-of-cudastr).
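The four steps above can be illustrated with a minimal, pure-Python sketch. It is not the C++ implementation in this PR: `Work`, `WorkNCCL`, `get_future`, and `on_stream_callback` are hypothetical stand-ins, `concurrent.futures.Future` stands in for `c10::ivalue::Future`, and `NotImplementedError` stands in for UnsupportedOperation.

```python
from concurrent.futures import Future


class Work:
    """Hypothetical stand-in for c10d::ProcessGroup::Work."""

    def get_future(self) -> Future:
        # Step 3: the default implementation is unsupported.
        # NotImplementedError is an assumption standing in for UnsupportedOperation.
        raise NotImplementedError("get_future is not supported by this backend")


class WorkNCCL(Work):
    """Hypothetical stand-in for WorkNCCL."""

    def __init__(self, outputs):
        self._outputs = outputs
        # Step 1: the future is created together with the work object.
        self._future = Future()

    def get_future(self) -> Future:
        # Step 2: expose the stored future to the caller.
        return self._future

    def on_stream_callback(self):
        # Step 4: roughly what a CUDA stream callback would do once the
        # collective has finished on its stream (simulated here).
        self._future.set_result(self._outputs)


work = WorkNCCL(outputs=[1.0, 2.0])
fut = work.get_future()
pending_before = not fut.done()  # the future is not yet completed
work.on_stream_callback()        # simulate the stream reaching the callback
result = fut.result()
```

In this sketch the caller never touches the work object after obtaining the future; completion is driven entirely by the simulated callback.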
ghstack-source-id: 108409748
Test Plan:
Run the existing tests: `python test/distributed/test_c10d.py`.
Some additional tests:
`test_ddp_comm_hook_allreduce_hook_nccl`: This unit test verifies that a DDP communication hook that just calls allreduce gives the same result as the case where no hook is registered. Without the `then` callback, the `future_value` in the reducer is no longer a PyObject, and this unit test verifies that `future_value` is properly checked.
`test_ddp_comm_hook_allreduce_then_mult_ten_hook_nccl`: This unit test verifies whether a DDP communication hook that calls allreduce and then multiplies the result by ten gives the expected result.
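The property these two tests check can be sketched without a GPU or a process group. Everything below is a simulation under stated assumptions: `fake_allreduce` sums plain Python lists in place of a real NCCL allreduce, `then` is a hypothetical helper (since `concurrent.futures.Future` has no `then` method), and the hook signatures are simplified and do not match the real DDP hook signature.

```python
from concurrent.futures import Future


def then(fut, callback):
    """Hypothetical helper: chain `callback` onto `fut`, returning a new future."""
    chained = Future()

    def _run(done):
        try:
            chained.set_result(callback(done))
        except Exception as exc:  # propagate failures to the chained future
            chained.set_exception(exc)

    fut.add_done_callback(_run)
    return chained


def fake_allreduce(per_rank_grads):
    """Simulated sum-allreduce across ranks; returns a completed future."""
    fut = Future()
    fut.set_result([sum(vals) for vals in zip(*per_rank_grads)])
    return fut


def allreduce_hook(per_rank_grads):
    # Hook that only calls allreduce and returns its future.
    return fake_allreduce(per_rank_grads)


def allreduce_then_mult_ten_hook(per_rank_grads):
    # Hook that chains a callback multiplying the reduced values by ten.
    return then(fake_allreduce(per_rank_grads),
                lambda f: [10 * g for g in f.result()])


grads = [[1.0, 2.0], [3.0, 4.0]]               # two simulated ranks
no_hook = [sum(vals) for vals in zip(*grads)]  # baseline without any hook
plain = allreduce_hook(grads).result()
times_ten = allreduce_then_mult_ten_hook(grads).result()
```

Here `plain` should match `no_hook`, and `times_ten` should be ten times that baseline, mirroring what the two unit tests assert against real gradients.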
As of v10:
```
........................s.....s.....................................................s...............................
----------------------------------------------------------------------
Ran 116 tests
OK (skipped=3)
```
Performance was validated with `flow-cli` using a stacked diff in which `bucket.work` is completely replaced with `bucket.future_work` in `reducer`. See PR [#41840](https://github.com/pytorch/pytorch/pull/41840) [D22660198](https://www.internalfb.com/intern/diff/D22660198/).
Reviewed By: izdeby
Differential Revision: D22583690
fbshipit-source-id: 8c059745261d68d543eaf21a5700e64826e8d94a