Avoid initializing unnecessary tensors in nccl.reduce (#39688)
Summary:
While working on https://github.com/pytorch/pytorch/issues/38911, I realized that `nccl.reduce` only needs a single output tensor, while our current implementation requires a list of output tensors. This, along with a TODO I fixed in reduce_add, should have some speed up for data parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39688
Differential Revision: D22034547
Pulled By: mrshenli
fbshipit-source-id: e74d54d673ebbb062474b1bb5cc93a095a3a5f6c