pytorch
ab09f346 - [FSDP] Fix `full_optim_state_dict()` hang (#80712)

[FSDP] Fix `full_optim_state_dict()` hang (#80712)

Fixes https://github.com/pytorch/pytorch/issues/80581.

Context: https://github.com/pytorch/pytorch/blob/1f08c1d3d61d1baa43f7862c3c4489487c8635d3/torch/distributed/fsdp/_optim_utils.py#L152-L163

To-Do: I do not understand why inserting this `torch.cuda.synchronize()` prevents the `.cpu()` call from hanging, nor why, in particular, it must be called on **all ranks**. If it is only called on the saving rank (i.e. rank 0), the hang persists.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80712
Approved by: https://github.com/rohan-varma