Add NCCL PreMul Sum to c10d `reduce` ops (#84243)
This is based on #81272, but this version conforms to the TorchScript compiler.
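For reviewers unfamiliar with the op, here is a rough pure-Python sketch of what a pre-multiplied sum computes: every rank's input is scaled by a factor, then the scaled inputs are summed elementwise. This is not the code in this PR. The function name `premul_sum` and the list-of-lists input are hypothetical, and a single scalar factor is assumed. As far as I know, the NCCL side corresponds to `ncclRedOpCreatePreMulSum`.

```python
from typing import List, Sequence


def premul_sum(rank_inputs: Sequence[Sequence[float]], factor: float) -> List[float]:
    """Reference semantics only: scale each rank's input by `factor`, then sum.

    `rank_inputs[r]` stands in for the tensor contributed by rank `r`
    (hypothetical representation, not the c10d API).
    """
    if not rank_inputs:
        raise ValueError("need at least one rank input")
    length = len(rank_inputs[0])
    if any(len(x) != length for x in rank_inputs):
        raise ValueError("all rank inputs must have the same length")
    # Multiply first, then reduce, which is what "PreMul" refers to.
    return [sum(factor * x[i] for x in rank_inputs) for i in range(length)]


if __name__ == "__main__":
    # Two ranks with factor 1 / world_size gives an elementwise average.
    print(premul_sum([[1.0, 2.0], [3.0, 4.0]], 0.5))
```

With the factor set to `1 / world_size`, the result is an average, which is a common reason to want this op.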
## TODO
- [ ] Update https://github.com/pytorch/pytorch/blob/abaf8112e6d6bed2a5d33dcbc1d46ed20b8e80de/torch/csrc/distributed/c10d/ProcessGroupUCC.cpp#L64-L73 to use `ReduceOp::RedOpType`. In my first try with `USE_SYSTEM_UCC=1`, this change wasn't necessary (I think) because of the `ReduceOp::RedOpType` operator. That said, I want to make it more explicit.
cc @ptrblck @kwen2501 @aazzolini
cc @zasdfgbnm for visibility to the TODO above
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84243
Approved by: https://github.com/kwen2501