pytorch
18cc8a92 - [ProcessGroupNCCL] Avoid recording stream for synchronous ops (#111431)

Commit

2 years ago

[ProcessGroupNCCL] Avoid recording stream for synchronous ops (#111431)

For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit is that the allocation/free of the tensors looks deterministic to the "current" stream, so the caching allocator can reuse the memory pool for this stream in a clever way.

To prevent the input/output tensors from being recycled by Python, we rely on the stashing mechanism in ProcessGroupNCCL (which can also be turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`).

This mechanism change is for libraries like FSDP, which use `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and which cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. Therefore, this change is limited to these two collectives for now.

Cc: @awgu @janeyx99 @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431
Approved by: https://github.com/H-Huang
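The decision described in the commit message can be illustrated with a small pure-Python toy model. This is only a sketch of the logic as the message states it, not the actual C++ code in ProcessGroupNCCL: `ToyWork`, `should_stash`, and `launch_collective` are hypothetical names, tensors are represented by arbitrary Python objects, and treating only the value `"1"` as enabling the environment variable is an assumption made for simplicity.

```python
from dataclasses import dataclass, field

# Per the commit message, the synchronous-op behavior is limited to these two collectives.
SYNC_STASH_OPS = {"all_gather_into_tensor", "reduce_scatter_tensor"}


@dataclass
class ToyWork:
    """Hypothetical stand-in for the work handle of a collective."""
    stashed: list = field(default_factory=list)   # references held until the op completes
    recorded: list = field(default_factory=list)  # tensors that would get a record-stream call


def should_stash(op: str, async_op: bool, env: dict) -> bool:
    """Return True if tensors should be stashed instead of stream-recorded."""
    # Explicit opt-in applies to any op (assumption: only "1" counts as enabled here).
    if env.get("TORCH_NCCL_AVOID_RECORD_STREAMS") == "1":
        return True
    # Synchronous use of the two listed collectives skips stream recording.
    return (not async_op) and op in SYNC_STASH_OPS


def launch_collective(op: str, tensors: list, async_op: bool, env: dict) -> ToyWork:
    """Model how input/output tensors are kept alive for one collective call."""
    work = ToyWork()
    if should_stash(op, async_op, env):
        # Stream ownership stays with the "current" stream; the handle keeps references.
        work.stashed.extend(tensors)
    else:
        # Otherwise the tensors would be recorded on the NCCL stream.
        work.recorded.extend(tensors)
    return work
```

In this model, a synchronous `all_gather_into_tensor` call stashes its tensors even when the environment variable is unset, while an asynchronous call to the same op still takes the record-stream path unless the variable is set.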

Author

kwen2501

Committer

pytorchmergebot

Parents

a7883ee4
