[thread_pg] fix reduce_scatter to respect different cuda device (#107152)
Same reason as the previous all_reduce PR; see the all_reduce PR
https://github.com/pytorch/pytorch/pull/107151 for context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107152
Approved by: https://github.com/kumpera
ghstack dependencies: #107151