ProcessGroupGloo::reduce_scatter_tensor_coalesced (#118911)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around the all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](https://github.com/pytorch/pytorch/blob/b9e86bc93d0ad71c9a2f3a01c4a41ed5ee4b665f/torch/distributed/_functional_collectives_impl.py#L216-L219) and [this](https://github.com/pytorch/pytorch/blob/b9e86bc93d0ad71c9a2f3a01c4a41ed5ee4b665f/torch/distributed/_functional_collectives_impl.py#L262-L272)). For native funcol I ran into the same issues, but I'd rather just fix the coverage.
### This PR
We already have a fallback impl for `_reduce_scatter_base`, composed from all-reduce + scatter. The scatter was not necessary: it introduced extra communication and an extra sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following:
- Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]` (see the sketch after this list). This is still 2x the communication of a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now.
- By doing the above, the comm becomes async and we don't have to fail on `asyncOp=True`.
- The general logic is implemented in `reduce_scatter_tensor_coalesced`. `_reduce_scatter_base` just calls it with single input/output.
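For illustration, here is a minimal Python sketch of the equivalence this PR relies on. The helper name `reduce_scatter_via_allreduce` is hypothetical; the actual change lives in the C++ `ProcessGroupGloo` backend, this just shows the all-reduce + chunk trick at the Python API level:

```python
import torch
import torch.distributed as dist

def reduce_scatter_via_allreduce(output: torch.Tensor, input: torch.Tensor, group=None):
    # Hypothetical helper: emulate reduce_scatter_tensor by all-reducing the
    # full input, then keeping only this rank's chunk. This moves roughly 2x
    # the data of a true reduce-scatter, but needs no extra scatter step.
    world_size = dist.get_world_size(group)
    rank = dist.get_rank(group)
    buf = input.clone()                    # don't mutate the caller's tensor
    dist.all_reduce(buf, group=group)      # sum-reduce across all ranks
    output.copy_(buf.chunk(world_size)[rank])
    return output
```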
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118910