Allow specifying a set of devices for CUDAFuture (#56515)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56515
In https://github.com/pytorch/pytorch/pull/56405 we finally found a solution for supporting RPC remote user functions that create/use CUDA tensors on devices not used by their arguments: we define a "bounding set" of devices when constructing the agent and allow all functions to freely use any of those devices.
We had the exact same problem with the callbacks of CUDAFuture, and in this PR I adopt the exact same solution: I allow specifying a set of devices when constructing a CUDAFuture, and then every callback is allowed to use any of those devices. (These devices are also propagated to child futures.)
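To illustrate the idea (not the actual CUDAFuture API — `DeviceBoundFuture`, `canRunCallback`, and `createChild` are hypothetical names for this sketch), here is a minimal standalone class showing a "bounding set" of device indices that constrains callbacks and is inherited by child futures:

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical sketch of the "bounding set" idea: the future is constructed
// with a fixed set of device indices, and callbacks (as well as child
// futures) may only use devices from that set.
class DeviceBoundFuture {
 public:
  explicit DeviceBoundFuture(std::set<int> devices)
      : devices_(std::move(devices)) {}

  // A callback declares which devices it will use; the future checks that
  // they all fall inside the bounding set.
  bool canRunCallback(const std::vector<int>& used) const {
    for (int d : used) {
      if (devices_.count(d) == 0) {
        return false;
      }
    }
    return true;
  }

  // Child futures inherit the same bounding set, so the constraint
  // propagates through callback chains.
  DeviceBoundFuture createChild() const {
    return DeviceBoundFuture(devices_);
  }

 private:
  std::set<int> devices_;
};
```

With a bounding set of `{0, 1}`, a callback using device 1 is allowed while one using device 2 is rejected, and a child future enforces the same set.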
I'm also making ProcessGroupNCCL pass these devices. I can't yet do the same for TensorPipeAgent until #56405 lands.
ghstack-source-id: 127261552
Test Plan: Added a test for this later in the stack.
Reviewed By: mrshenli
Differential Revision: D27861067
fbshipit-source-id: 8ab2c9d06a514c0407a7e96abc3704e8d5c5dc09