Remove unnecessary copies in ProcessGroupGloo for multiple inputs allreduce (#43543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43543
Closes https://github.com/pytorch/pytorch/issues/14691. The explicit copy of the result is not needed in the multiple outputs case, because gloo's allreduce
already broadcasts the result tensor to all the outputs. See
https://github.com/facebookincubator/gloo/issues/152 and commit
https://github.com/facebookincubator/gloo/commit/9cabb5aaa4f02356bc8db05e5630cb550b3f5b5c
for more details. Came across this when debugging https://github.com/pytorch/pytorch/pull/42577.
This effectively reverts https://github.com/pytorch/pytorch/pull/14688 while still keeping the tests.
Tested by ensuring `test_allreduce_basics` in `test_c10d.py` still passes.
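As a rough illustration of the behavior described above, here is a toy pure-Python model of a multi-input sum allreduce. It is not the gloo or PyTorch API: `allreduce_multi` and the list-based "tensors" are hypothetical stand-ins, and the sketch only assumes what this summary states, namely that the collective itself writes the result into every output.

```python
# Toy model only: `allreduce_multi` is a hypothetical helper, not a gloo/PyTorch call.

def allreduce_multi(buffers_per_rank):
    """Sum all buffers across all ranks and write the result into each buffer in place."""
    all_buffers = [buf for rank_buffers in buffers_per_rank for buf in rank_buffers]
    length = len(all_buffers[0])
    total = [sum(buf[i] for buf in all_buffers) for i in range(length)]
    # The collective fills every output itself, so the caller has no reason
    # to copy the result from the first buffer into the remaining ones.
    for buf in all_buffers:
        buf[:] = total
    return total


ranks = [
    [[1.0, 2.0], [3.0, 4.0]],  # rank 0 holds two input buffers
    [[5.0, 6.0], [7.0, 8.0]],  # rank 1 holds two input buffers
]
allreduce_multi(ranks)
```

After the call, every buffer on every rank holds the same reduced values, with no copy step on the caller's side.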
ghstack-source-id: 110636498
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23173945
fbshipit-source-id: d1ae08f84b4ac9919c53080949b8fffcb2fe63a8