use allgatherv for sparse all reduce (#23917)
Summary:
Per https://github.com/pytorch/pytorch/issues/22226, the current sparse allreduce in ProcessGroupGloo pads the indices and values tensors to the maximum length across all processes and then performs a regular allgather (which requires equal sizes across processes, hence the padding). Instead, we can use allgatherv, which gathers variable-length inputs directly. This is mostly a memory-usage win when there is severe size imbalance between processes.
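
To illustrate the difference, here is a minimal single-process sketch of the two strategies. It uses hypothetical per-rank index lists and plain Python; it is not the actual ProcessGroupGloo implementation, which calls Gloo's C++ allgather/allgatherv primitives on the sparse tensor's indices and values.

```python
def allgather_padded(inputs, pad_value=0):
    """Old approach: pad every input to the maximum length, then allgather.

    Every rank ends up holding world_size * max_len elements, regardless
    of how small its peers' inputs were.
    """
    max_len = max(len(x) for x in inputs)
    padded = [x + [pad_value] * (max_len - len(x)) for x in inputs]
    output = [elem for x in padded for elem in x]  # what each rank receives
    return output, len(output)


def allgatherv(inputs):
    """New approach: gather variable-length inputs directly.

    Each rank holds only sum(len(x) for x in inputs) elements; no padding.
    """
    output = [elem for x in inputs for elem in x]
    return output, len(output)


if __name__ == "__main__":
    # Severe imbalance: one rank has 1000 nonzero indices, three have one each.
    per_rank_indices = [[7] * 1000, [3], [5], [9]]
    _, padded_size = allgather_padded(per_rank_indices)
    _, v_size = allgatherv(per_rank_indices)
    print(f"padded allgather buffer: {padded_size} elements")  # 4000
    print(f"allgatherv buffer:       {v_size} elements")       # 1003
```

With pad-to-max, the receive buffer grows as world_size * max_len; with allgatherv it grows only as the sum of the actual per-rank lengths, which is why imbalanced workloads benefit most.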
Closes https://github.com/pytorch/pytorch/issues/22226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23917
Test Plan:
buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_basics
buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_basics_cuda
buck run mode/dev-nosan caffe2/test:c10d -- test_c10d.ProcessGroupGlooTest.test_sparse_allreduce_checks
Differential Revision: D16664985
Pulled By: zhaojuanmao
fbshipit-source-id: e7d3c0770cbc09f9175b3027b527e95053724843