pytorch
c85e47b3 - [BE][PT-D] Fix race on checkpoint file (#84881)

Commit
2 years ago
[BE][PT-D] Fix race on checkpoint file (#84881)

Without calling `dist.barrier()` before removing the checkpoint file, rank 0 may run ahead and delete the checkpoint file before nonzero ranks are able to load from the checkpoint. This PR adds a `dist.barrier()` to ensure all ranks can load the checkpoint before rank 0 deletes it.

For example, including the added `dist.barrier()`:
https://github.com/pytorch/pytorch/blob/037e8eefcf0b669430211b83d19aedf2185ed6fc/torch/testing/_internal/distributed/distributed_test.py#L5068-L5098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84881
Approved by: https://github.com/rohan-varma
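
The ordering this fix enforces can be sketched without a multi-process setup. The snippet below is an illustration only, not the code from `distributed_test.py`: threads stand in for ranks, `threading.Barrier.wait()` stands in for `dist.barrier()`, and the names `WORLD_SIZE`, `run_rank`, and `simulate` are hypothetical. It also assumes that rank 0 is the one that writes the checkpoint, which the commit message does not state.

```python
import os
import tempfile
import threading

WORLD_SIZE = 4  # hypothetical number of ranks, simulated here with threads


def run_rank(rank, path, barrier, results):
    # Assumption: rank 0 writes the checkpoint that every rank will load.
    if rank == 0:
        with open(path, "w") as f:
            f.write("model-state")
    barrier.wait()  # stand-in for dist.barrier(): the checkpoint is on disk

    # Every rank, including rank 0, loads the checkpoint.
    with open(path) as f:
        results[rank] = f.read()

    # Stand-in for the dist.barrier() this commit adds:
    # no rank proceeds until all ranks have finished loading.
    barrier.wait()

    # Only now is it safe for rank 0 to delete the file.
    if rank == 0:
        os.remove(path)


def simulate():
    fd, path = tempfile.mkstemp(suffix=".ckpt")
    os.close(fd)
    barrier = threading.Barrier(WORLD_SIZE)
    results = [None] * WORLD_SIZE
    threads = [
        threading.Thread(target=run_rank, args=(rank, path, barrier, results))
        for rank in range(WORLD_SIZE)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Return what each rank loaded and whether the file still exists.
    return results, os.path.exists(path)
```

Dropping the second `barrier.wait()` reintroduces the race described above: rank 0 can reach `os.remove` while another rank has not yet opened the file, so the failure would be intermittent rather than deterministic.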