[RPC Framework] Supporting reading the input from the remote worker (#56943)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56943
If the module is placed on a CUDA device, then all the CPU tensors in `args` and `kwargs` will also be implicitly moved to that same CUDA device before running forward.
Currently the forward output still needs to be moved from the CUDA device back to CPU, until either:
1) the Process Group RPC backend is completely deprecated and the TensorPipe RPC backend is always used; or
2) a device map is explicitly provided to the TensorPipe RPC backend.
These steps will be done in a separate PR.
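The implicit move described above can be sketched as follows. This is a simplified illustration, not the actual RemoteModule implementation; the helper names (`_to_device`, `move_inputs_to_device`) are hypothetical:

```python
import torch

def _to_device(value, device):
    # Hypothetical helper: move CPU tensors to the target device,
    # pass every other value through unchanged.
    if isinstance(value, torch.Tensor) and value.device.type == "cpu":
        return value.to(device)
    return value

def move_inputs_to_device(args, kwargs, device):
    """Mimic the implicit move: every CPU tensor in args/kwargs
    is relocated to `device` before forward runs."""
    new_args = tuple(_to_device(a, device) for a in args)
    new_kwargs = {k: _to_device(v, device) for k, v in kwargs.items()}
    return new_args, new_kwargs

# On a GPU machine the module's device would be e.g. "cuda:0";
# fall back to "cpu" here so the sketch runs anywhere.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
args, kwargs = move_inputs_to_device(
    (torch.ones(2), 3), {"x": torch.zeros(2)}, device
)
print(args[0].device.type, kwargs["x"].device.type)
```

Non-tensor arguments (like the `3` above) are left untouched; only CPU tensors are relocated.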
Original PR issue: https://github.com/pytorch/pytorch/issues/51670
ghstack-source-id: 127457584
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test - test_load_di_parts (caffe2.torch.fb.training_toolkit.applications.sparse_nn.batch_distributed_inference.tests.batch_distributed_inference_test.BatchDistributedInferenceTest)'
Reviewed By: wanchaol
Differential Revision: D27934791
fbshipit-source-id: de27e27b905db83cc52800e63684fc6c942e9dc7