Enable ncclAvg for reductions (#62303)
Summary:
[ncclAvg](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html?highlight=ncclavg#c.ncclAvg) is a new `ncclRedOp_t` that fuses a divide-by-world-size into ncclAllReduce, ncclReduce, or ncclReduceScatter. This PR adds support for it.
This PR and https://github.com/pytorch/pytorch/pull/62140 lay the foundation for DDP to allreduce and average grad tensors in place with a single NCCL call, without additional memory passes to flatten, average, or unflatten. I'll write the necessary DDP changes once this PR and https://github.com/pytorch/pytorch/pull/62140 land.
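
For illustration only, the semantics of the fused average can be sketched in pure Python. This is a simulation under stated assumptions, not the code in this PR: it makes no NCCL or PyTorch calls, and the helper names (`allreduce_sum`, `allreduce_avg`, `sum_then_divide`) are hypothetical. Each inner list stands in for one rank's grad buffer.

```python
# Pure-Python simulation of the reduction semantics; no NCCL, GPUs, or PyTorch.
# All function names here are hypothetical and exist only for this sketch.

def allreduce_sum(rank_buffers):
    """Every rank ends up with the elementwise sum across all ranks."""
    total = [sum(vals) for vals in zip(*rank_buffers)]
    return [list(total) for _ in rank_buffers]


def sum_then_divide(rank_buffers):
    """Unfused form: a sum allreduce followed by a separate divide pass."""
    world_size = len(rank_buffers)
    summed = allreduce_sum(rank_buffers)
    # The extra pass over each buffer is what a fused average avoids.
    return [[v / world_size for v in buf] for buf in summed]


def allreduce_avg(rank_buffers):
    """Fused form: every rank ends up with the elementwise mean across ranks."""
    world_size = len(rank_buffers)
    mean = [sum(vals) / world_size for vals in zip(*rank_buffers)]
    return [list(mean) for _ in rank_buffers]


# Four simulated ranks, each holding a two-element grad buffer.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
fused = allreduce_avg(grads)
unfused = sum_then_divide(grads)
print(fused[0], fused == unfused)
```

Both forms leave the same values on every rank. The difference is that the fused form produces them in one collective instead of a collective plus a separate divide pass.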
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62303
Reviewed By: soulitzer
Differential Revision: D30095246
Pulled By: rohan-varma
fbshipit-source-id: d3a3475345fafb0ab265c11d36db74d7c5613a0a