Refactor ProcessGroupNCCL collective primitives (#18820)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18820
ghimport-source-id: 220b2a3dd9d4d6d2e557e1802851f082c2dc6452
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18820 Refactor ProcessGroupNCCL collective primitives**
Planning to add reduce-scatter, but no room in my stomach for more
copypasta, so the existing collective primitives get refactored first.
Also rewrote the tensor list validation logic. The existing validation
was ill-suited to every case it was used for: it took a vector of input
tensors and a vector of output tensors, but only ever received either
two references to the same vector, or a bespoke singleton vector and a
vector of outputs (of which it ignored all but the first). In the first
case it performed redundant checks; in the second it skipped necessary
ones.
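To illustrate the split described above, here is a minimal sketch of one validator per call pattern, not the code from this PR. `TensorMeta`, `checkTensorList`, and `checkMatchesAll` are hypothetical names, and the specific checks (non-empty list, matching dtype and shape, one tensor per device) are assumptions about what such validation would cover.

```cpp
#include <cstdint>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a GPU tensor: only the metadata the checks need.
struct TensorMeta {
  int device;                   // device index
  std::string dtype;            // e.g. "float32"
  std::vector<int64_t> sizes;   // shape
};

// Shared comparison used by both validators below.
static bool sameTypeAndShape(const TensorMeta& a, const TensorMeta& b) {
  return a.dtype == b.dtype && a.sizes == b.sizes;
}

// Case 1: a single list, one tensor per device. There is no second list,
// so nothing is compared against itself.
void checkTensorList(const std::vector<TensorMeta>& tensors) {
  if (tensors.empty()) {
    throw std::invalid_argument("tensor list must be nonempty");
  }
  std::set<int> usedDevices;
  for (const auto& t : tensors) {
    if (!sameTypeAndShape(t, tensors.front())) {
      throw std::invalid_argument("tensors must share dtype and shape");
    }
    // insert() reports whether the device index was new.
    if (!usedDevices.insert(t.device).second) {
      throw std::invalid_argument("tensors must be on distinct devices");
    }
  }
}

// Case 2: one tensor against a list. Every entry of the list is checked,
// not only the first.
void checkMatchesAll(const TensorMeta& single,
                     const std::vector<TensorMeta>& others) {
  if (others.empty()) {
    throw std::invalid_argument("tensor list must be nonempty");
  }
  for (const auto& t : others) {
    if (!sameTypeAndShape(t, single)) {
      throw std::invalid_argument("list entry does not match the tensor");
    }
  }
}
```

With this shape, a caller that has one list calls `checkTensorList`, and a caller that has one tensor plus a list calls `checkMatchesAll`, so neither path builds a throwaway singleton vector.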
Reviewed By: mrshenli
Differential Revision: D14762369
fbshipit-source-id: dcf882ce1c5854333a9eb4424bfc18d9f4648ddf