[FSDP] Fix `full_optim_state_dict()` hang (#80712)
Fixes https://github.com/pytorch/pytorch/issues/80581.
Context:
https://github.com/pytorch/pytorch/blob/1f08c1d3d61d1baa43f7862c3c4489487c8635d3/torch/distributed/fsdp/_optim_utils.py#L152-L163
To-Do:
I do not understand why inserting this `torch.cuda.synchronize()` prevents the `.cpu()` call from hanging, and in particular why the `torch.cuda.synchronize()` must be called on **all ranks**. If it is called only on the saving rank (i.e. rank 0), the hang persists.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80712
Approved by: https://github.com/rohan-varma