ProcessGroupGloo::reduce_scatter_tensor_coalesced (#118911)
### Motivation
Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around the all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](https://github.com/pytorch/pytorch/blob/b9e86bc93d0ad71c9a2f3a01c4a41ed5ee4b665f/torch/distributed/_functional_collectives_impl.py#L216-L219) and [this](https://github.com/pytorch/pytorch/blob/b9e86bc93d0ad71c9a2f3a01c4a41ed5ee4b665f/torch/distributed/_functional_collectives_impl.py#L262-L272)). For native funcol I ran into the same issues, but I'd rather just fix the coverage.
### This PR
We already have a fallback impl for `_reduce_scatter_base`, composed from all-reduce + scatter. The scatter was not necessary: it introduced extra communication and an extra sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following:
- Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]` (see the sketch after this list). This is still 2x the communication of a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now.
- By doing the above, the comm becomes async and we don't have to fail on `asyncOp=True`.
- The general logic is implemented in `reduce_scatter_tensor_coalesced`. `_reduce_scatter_base` just calls it with single input/output.
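For illustration, here is a minimal Python sketch of the equivalence this PR relies on. The helper name `reduce_scatter_via_allreduce` is hypothetical; the actual change lives in the C++ `ProcessGroupGloo` backend, this just shows the all-reduce + chunk trick at the Python API level:

```python
import torch
import torch.distributed as dist

def reduce_scatter_via_allreduce(output: torch.Tensor, input: torch.Tensor, group=None):
    # Hypothetical helper: emulate reduce_scatter_tensor by all-reducing the
    # full input, then keeping only this rank's chunk. This moves roughly 2x
    # the data of a true reduce-scatter, but needs no extra scatter step.
    world_size = dist.get_world_size(group)
    rank = dist.get_rank(group)
    buf = input.clone()                    # don't mutate the caller's tensor
    dist.all_reduce(buf, group=group)      # sum-reduce across all ranks
    output.copy_(buf.chunk(world_size)[rank])
    return output
```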
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118910