pytorch
ccccd0ef - [DataLoader] Share seed via Distributed Store to get rid of CUDA dependency (#79829)

Commit

2 years ago

[DataLoader] Share seed via Distributed Store to get rid of CUDA dependency (#79829)

Fixes #79828

In a distributed environment, before this PR, DataLoader would create a Tensor holding the shared seed on RANK 0 and send that Tensor to the other processes. However, when `NCCL` is used as the distributed backend, the Tensor must be moved to CUDA before it is broadcast from RANK 0 to the other RANKs. DataLoader did not move the Tensor to CUDA before sharing it via `NCCL`, which caused the issue above.

After offline discussion with @mrshenli, we think the distributed Store is a better solution, as the shared seed is just an integer value. With it, DataLoader no longer depends on NCCL or CUDA when sharing information between distributed processes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79829
Approved by: https://github.com/VitalyFedyunin, https://github.com/NivekT
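The idea described above can be sketched roughly as follows. This is not the code from the PR: `share_seed`, `InMemoryStore`, `simulate_ranks`, and the key name are hypothetical, and the in-memory store is only a stand-in that mimics the `set`/`get` behaviour of a key-value store such as `torch.distributed.Store`, with threads playing the role of ranks. The point it illustrates is that an integer seed can be published as a plain value, so no Tensor (and therefore no CUDA device) is involved.

```python
import random
import threading


class InMemoryStore:
    """Hypothetical stand-in for a distributed key-value store (set/get only)."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        # Store the value as bytes and wake up any waiting readers.
        with self._cond:
            self._data[key] = value.encode() if isinstance(value, str) else value
            self._cond.notify_all()

    def get(self, key, timeout=5.0):
        # Block until the key has been set, or raise after `timeout` seconds.
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._data, timeout=timeout):
                raise TimeoutError(f"key {key!r} was not set within {timeout}s")
            return self._data[key]


def share_seed(store, rank, key="shared_seed_example"):
    """Rank 0 draws a seed and publishes it; every rank reads the same value."""
    if rank == 0:
        seed = random.randrange(2**63)  # assumption: any non-negative integer works
        store.set(key, str(seed))
    # All ranks, including rank 0, read the seed back from the store.
    return int(store.get(key).decode())


def simulate_ranks(world_size):
    """Run `share_seed` on `world_size` threads that stand in for processes."""
    store = InMemoryStore()
    seeds = [None] * world_size

    def worker(rank):
        seeds[rank] = share_seed(store, rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seeds
```

Calling `simulate_ranks(4)` returns four identical integers. In a real job the store would be the one backing the process group rather than this in-memory substitute, and how that store is obtained is outside the scope of this sketch.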

Author

ejguan

Committer

pytorchmergebot

Parents

16f30b49
