Implements allreduce_coalesced for ProcessGroupNCCL (#62140)
Summary:
Implements allreduce_coalesced for ProcessGroupNCCL as an NCCL group of allreduces on separate tensors, as proposed in https://github.com/pytorch/pytorch/issues/38995#issuecomment-882804595. In recent versions of NCCL, the performance of grouped comms has improved significantly. A group can execute with just one kernel, so a grouped comm on a set of unflattened tensors can be more performant than flattening + a single flat NCCL call.
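The call pattern looks roughly like the sketch below (hypothetical names, not the actual ProcessGroupNCCL code; `comm`, `stream`, and the dtype-mapping helper are assumed). Each tensor gets its own in-place `ncclAllReduce` inside one `ncclGroupStart`/`ncclGroupEnd` pair, which NCCL can fuse into a single kernel:

```cpp
// Sketch only: grouped allreduce over a vector of same-device tensors.
// All calls between ncclGroupStart() and ncclGroupEnd() are submitted
// together, so NCCL can fuse them into one kernel launch.
ncclGroupStart();
for (const auto& t : tensors) {
  ncclAllReduce(
      t.data_ptr(),        // sendbuff
      t.data_ptr(),        // recvbuff (in-place)
      t.numel(),           // element count
      toNcclDataType(t),   // assumed dtype-mapping helper
      ncclSum,
      comm,                // ncclComm_t for this device
      stream);             // CUDA stream the collective runs on
}
ncclGroupEnd();
```

This avoids the flatten/unflatten copies of the bucketing approach while still paying only one launch.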
The same approach can easily extend to broadcast_coalesced and reduce_coalesced.
I'm still not sure how (hypothetical) all_gather_coalesced and reduce_scatter_coalesced ops should be exposed or implemented, because we need to consider "_base" variants where the output or input tensor is pre-flattened. For example, https://github.com/pytorch/pytorch/issues/61781 effectively wants "allgather_base_coalesced".
I'm also not sure how the _multigpu variants should enter the picture. With the approach I've written here, ProcessGroupNCCL::allreduce accepts a vector of tensors that are either all on the same device (in which case it'll do an allreduce_coalesced) or all on different devices (in which case it'll do an allreduce_multigpu). In other words it can do _coalesced or _multigpu but not both at once.
For some reason GitHub won't let me add agolynski to the reviewers.
cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62140
Reviewed By: fduwjj
Differential Revision: D33781010
Pulled By: cbalioglu
fbshipit-source-id: f0c233da9ebae57d7ccecf6d8dc432d936d4d3ce
(cherry picked from commit e43cb81d300bd9e9926f6e01ae77f4accb12c258)