Fix NCCL/Gloo process groups and DDP stream sync bug (#18465)
Summary:
DDP with the NCCL backend uses a [worker stream](https://github.com/pytorch/pytorch/blob/d3eb941ed96774efb8d89a0b20c9e49807ea85a7/torch/csrc/distributed/c10d/ddp.cpp#L142) to flatten grad batch
tensors, and passes the flattened tensor to [another stream](https://github.com/pytorch/pytorch/blob/d3eb941ed96774efb8d89a0b20c9e49807ea85a7/torch/lib/c10d/ProcessGroupNCCL.cpp#L379) to
run ncclAllReduce. The flattened tensor has to record the
ncclAllReduce stream; otherwise the caching allocator treats its memory
as free once the worker stream's work completes and may reuse it while
the allreduce is still in flight, so multiple streams end up accessing
the same memory space.
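
A minimal sketch of the record-stream pattern described above, assuming the PyTorch C++ CUDA APIs; the helper name and stream setup are illustrative, not the PR's actual code:

```cpp
#include <ATen/ATen.h>
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>

// Hypothetical helper: consume a tensor produced on the current (worker)
// stream from a separate side stream, recording the side stream so the
// caching allocator does not recycle the tensor's memory too early.
void allReduceOnSideStream(const at::Tensor& flattened) {
  // Side stream that will run the collective (stands in for the dedicated
  // ncclAllReduce stream inside ProcessGroupNCCL).
  c10::cuda::CUDAStream ncclStream = c10::cuda::getStreamFromPool(
      /*isHighPriority=*/false,
      static_cast<c10::DeviceIndex>(flattened.get_device()));

  // Order the side stream after the worker stream that produced `flattened`.
  at::cuda::CUDAEvent producedEvent;
  producedEvent.record(c10::cuda::getCurrentCUDAStream());
  producedEvent.block(ncclStream);

  // Tell the caching allocator that `flattened` is in use on ncclStream.
  // Without this call, the allocator considers the memory free as soon as
  // the worker stream's work finishes and may hand it to another allocation
  // while the collective is still reading it.
  c10::cuda::CUDACachingAllocator::recordStream(
      flattened.storage().data_ptr(), ncclStream);

  // ... launch ncclAllReduce on ncclStream here ...
}
```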
cc ppwwyyxx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18465
Differential Revision: D14613449
Pulled By: mrshenli
fbshipit-source-id: b62773732552d12cc87b7adeb6897e9e11753ea9