[FSDP] Fix `full_optim_state_dict()` hang (#80712)
Fixes https://github.com/pytorch/pytorch/issues/80581.
Context:
https://github.com/pytorch/pytorch/blob/1f08c1d3d61d1baa43f7862c3c4489487c8635d3/torch/distributed/fsdp/_optim_utils.py#L152-L163
To-Do:
I do not understand why inserting this `torch.cuda.synchronize()` prevents the `.cpu()` call from hanging, and in particular why the `torch.cuda.synchronize()` must be called on **all ranks**. If it is called only on the saving rank (i.e. rank 0), the hang persists.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80712
Approved by: https://github.com/rohan-varma