d6f22abb - [PyTorch Distributed] Fix batch_isend_irecv

Summary:
`batch_isend_irecv` previously only worked in two-rank cases; otherwise it would hang, e.g. pytorch/pytorch#73960. This diff extends `batch_isend_irecv` to support more than two ranks. The fix treats the operation more like a collective than a two-rank P2P when selecting the communicator, since more ranks can participate in the batch call than just "my" rank and "my" peer.

Rules:
- If `batch_isend_irecv` is the first collective call (including collectives and all-to-all) in the `group` given as the argument, then all ranks of the `group` are expected to participate in this call (a sketch of this case follows below).
- Otherwise, if it is not the first collective call in the `group` (i.e. the communicator has already been initialized), then batched P2P communication involving only a subset of the processes of the `group` is allowed.

Test Plan: Added p2p_tests.py testing the following patterns:
+ sendrecv_neighbor(input, output)      # Ring-like neighbor exchange
+ sendrecv_ripple(input, output)        # Exchange with growing distance (pytorch/pytorch#73960)
+ sendrecv_P2P(input, output)           # Single P2P operation
+ isendrecv_P2P(input, output)          # Single non-blocking P2P operation
+ isendrecv_P2P_batch(input, output, 0) # Batched P2P between only two ranks
+ isendrecv_P2P_batch(input, output, 1) # Batched P2P within a new group created for two ranks

Differential Revision: D35122664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/74701
Approved by: https://github.com/mingzhe09088, https://github.com/osalpekar
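For context, here is a minimal sketch of the ring-like neighbor-exchange pattern that this fix enables across more than two ranks. It is not the actual `p2p_tests.py` from the diff; the function name `sendrecv_neighbor` mirrors the pattern named in the Test Plan, and the script assumes a NCCL backend launched with `torchrun` (e.g. `torchrun --nproc_per_node=4 neighbor_exchange.py`). Since this is the first collective-style call in the default group, all ranks must participate, matching the first rule above.

```python
import os
import torch
import torch.distributed as dist


def sendrecv_neighbor(input_tensor, output_tensor):
    """Each rank sends to (rank + 1) % world_size and receives from (rank - 1) % world_size."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    send_op = dist.P2POp(dist.isend, input_tensor, (rank + 1) % world_size)
    recv_op = dist.P2POp(dist.irecv, output_tensor, (rank - 1) % world_size)
    # Batched P2P: before this fix, this pattern hung on NCCL for world_size > 2.
    reqs = dist.batch_isend_irecv([send_op, recv_op])
    for req in reqs:
        req.wait()


if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank = dist.get_rank()
    inp = torch.full((4,), float(rank), device="cuda")
    out = torch.empty(4, device="cuda")
    sendrecv_neighbor(inp, out)
    # Each rank should now hold the values sent by its left neighbor.
    print(f"rank {rank} received {out.tolist()}")
    dist.destroy_process_group()
```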