[thread_pg] fix all_reduce to respect different cuda device (#107151)
The previous implementation only works on CPU: it does not account for each rank holding its data on a different device (e.g. a different CUDA device), so the reduction raises an error like the following:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
```
See the report at https://github.com/pytorch/pytorch/pull/105604#issuecomment-1675472670
This PR fixes the issue; the previously failing GPU tests now pass.
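
For reference, the core idea is to copy each rank's tensor onto a common device before accumulating, rather than assuming all operands are co-located. A minimal sketch of that technique for a sum reduction (the helper name `_all_reduce_sum` and the list-of-tensors interface are illustrative assumptions, not the actual patch):
```python
import torch

def _all_reduce_sum(tensors):
    # Sketch only: tensors[i] lives on rank i's device (cuda:0, cuda:1, ...).
    # Accumulate on the first tensor's device, explicitly moving each operand
    # there instead of assuming the operands already share a device.
    result = tensors[0].clone()
    for t in tensors[1:]:
        result += t.to(result.device)  # cross-device copy before the add
    # Write the reduced value back to every rank, on that rank's own device.
    for t in tensors:
        t.copy_(result.to(t.device))
```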
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107151
Approved by: https://github.com/kumpera