Add device guard around MPI operations (#22446)
Summary:
If the current CUDA device differs from the device that hosts the
tensor an operation works on, OpenMPI segfaults, as reported in
https://github.com/pytorch/pytorch/issues/21922. This change adds a device guard around every
operation to ensure the correct device is set before the MPI call runs.
Fixes https://github.com/pytorch/pytorch/issues/21922.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22446
Differential Revision: D16106823
Pulled By: pietern
fbshipit-source-id: 99d762eb3851c0a0e0b4fe81cf27c1c8d35596cc