Add c10d::broadcast_coalesced and tests (#20234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20234
The differences with the existing function _dist_broadcast_coalesced
is that this one works for both CPU and CUDA tensors and that it has a
maximum number of in flight operations.
This should be the final change needed to have only a single version
of DistributedDataParallel that both supports CPU and CUDA models, or
even a mix of both.
See #17757 for more information.
Reviewed By: mrshenli
Differential Revision: D15228099
fbshipit-source-id: a2113ba6b09b68cb5328f49f4c1960031eb43c93