[C10d][NCCL] Refactor complex all_reduce and broadcast (#121045)
This PR is needed because the autograd engine + DDP call `all_reduce` directly from C++, bypassing the Python wrapper that handles complex tensors, so the changes must be made in C++. Without them, DDP training with complex parameters fails in the backward pass:
```
[rank0]: Traceback (most recent call last):
[rank0]: File "~/complex_ddp.py", line 72, in <module>
[rank0]: main()
[rank0]: File "~/complex_ddp.py", line 64, in main
[rank0]: loss.backward()
[rank0]: File "/home/usr/pytorch/torch/_tensor.py", line 525, in backward
[rank0]: torch.autograd.backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/__init__.py", line 267, in backward
[rank0]: _engine_run_backward(
[rank0]: File "/home/usr/pytorch/torch/autograd/graph.py", line 744, in _engine_run_backward
[rank0]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank0]: TypeError: Input tensor data type is not supported for NCCL process group: ComplexFloat
```
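For reference, a minimal sketch of the kind of script that hits this (illustrative only; not the actual `complex_ddp.py` from the traceback):
```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Assumes launch via torchrun, which sets the env:// rendezvous variables.
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Complex parameters make the gradient buckets that DDP all-reduces
    # during backward ComplexFloat, which NCCL rejected before this PR.
    model = nn.Linear(8, 8, dtype=torch.complex64, device="cuda")
    ddp = DDP(model, device_ids=[rank])

    x = torch.randn(4, 8, dtype=torch.complex64, device="cuda")
    loss = ddp(x).abs().sum()  # real-valued loss over complex outputs
    loss.backward()            # C++ engine -> DDP reducer -> NCCL all_reduce

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
Launched with e.g. `torchrun --nproc_per_node=2 complex_ddp.py`.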
I believe the same could be done for the rest of the ops to minimize Python overhead. What do you think, @kwen2501?
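For context, the Python-level complex handling being moved down into C++ looks roughly like the sketch below (`all_reduce_complex` is an illustrative name, not the actual wrapper):
```python
import torch
import torch.distributed as dist

def all_reduce_complex(tensor, op=dist.ReduceOp.SUM, group=None):
    # NCCL has no complex dtypes. view_as_real maps complex64 -> float32
    # with a trailing (real, imag) dimension and shares storage, so the
    # in-place all_reduce updates the original complex tensor. This is
    # only valid for elementwise ops like SUM/AVG; ops such as PRODUCT
    # or MAX have no meaningful interpretation on the real view.
    if tensor.is_complex():
        tensor = torch.view_as_real(tensor)
    dist.all_reduce(tensor, op=op, group=group)
```
Doing the equivalent inside the C++ process group means both the Python wrappers and direct C++ callers (like the DDP reducer) get the same behavior.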
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121045
Approved by: https://github.com/eqy, https://github.com/kwen2501