pytorch
24a084ed - [c10d] Fix async error in batch_isend_irecv (#82450)

Commit

2 years ago

[c10d] Fix async error in batch_isend_irecv (#82450)

Summary:
`batch_isend_irecv` previously required the use of `torch.cuda.synchronize` to avoid data races. This was because the ncclStreams were recorded in the returned ncclWork object _before_ the `_batch_p2p_manager` issued a ncclGroupEnd. Thus, `req.wait()` was effectively waiting on nothing, so later operators could read incorrect intermediate data.

This fix:
- keeps track of the ncclStreams to wait on, and records them in the work objects after the batch manager issues a ncclGroupEnd
- renames `_batch_p2p_manager` to `_coalescing_manager` for generality
- removes the explicit check for the NCCL backend inside `_batch_p2p_manager` in distributed_c10d.py and moves the manager start/end to ProcessGroup.hpp, so that it works transparently with all process groups

Test Plan:
Modified the unit test for `batch_isend_irecv` to check that received tensors are the same as expected tensors. Verified that the test fails before the change and passes after the change.

Differential Revision: D38100789
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82450
Approved by: https://github.com/kwen2501
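The ordering problem described in the summary can be illustrated with a small toy model in plain Python. This is only a sketch of the idea, not the c10d or NCCL implementation: `ToyStream`, `ToyWork`, `ToyCoalescer`, and `run` are hypothetical names invented for this illustration. It assumes that an event recorded on a stream covers only the operations already enqueued on it, and that grouped operations reach the stream only at group end.

```python
class ToyStream:
    """Hypothetical stand-in for a stream: an ordered list of enqueued ops."""

    def __init__(self):
        self.ops = []

    def record_event(self):
        # An event covers only the ops enqueued so far.
        return len(self.ops)


class ToyWork:
    """Hypothetical stand-in for a work handle returned to the caller."""

    def __init__(self, stream):
        self.stream = stream
        self.event = 0  # covers nothing until record() is called

    def record(self):
        self.event = self.stream.record_event()

    def wait(self):
        # Return the ops that are guaranteed complete once wait() returns.
        return self.stream.ops[: self.event]


class ToyCoalescer:
    """Hypothetical batch manager: ops reach the stream only at group end."""

    def __init__(self, record_after_group_end):
        self.record_after_group_end = record_after_group_end
        self.queued = []

    def p2p(self, stream, op):
        work = ToyWork(stream)
        self.queued.append((stream, op, work))
        if not self.record_after_group_end:
            # Old ordering: the event is recorded before anything is enqueued.
            work.record()
        return work

    def group_end(self):
        for stream, op, _ in self.queued:
            stream.ops.append(op)
        if self.record_after_group_end:
            # Fixed ordering: record only after the grouped ops are enqueued.
            for _, _, work in self.queued:
                work.record()
        self.queued.clear()


def run(record_after_group_end):
    stream = ToyStream()
    manager = ToyCoalescer(record_after_group_end)
    works = [manager.p2p(stream, "send"), manager.p2p(stream, "recv")]
    manager.group_end()
    return [work.wait() for work in works]


print(run(record_after_group_end=False))  # each wait() covers no ops
print(run(record_after_group_end=True))   # each wait() covers both ops
```

In this model, recording before group end leaves every `wait()` covering an empty set of operations, which mirrors why callers needed an explicit `torch.cuda.synchronize` before the fix.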

Author

aashaka

Committer

pytorchmergebot

Parents

88e43ca4
