[FSDP] Delay moving tensor to CPU until necessary for optim_state_dict() (#85761)
The optimizer state_dict currently moves tensors to CPU immediately after allgather(). For the sharded optimizer state_dict, however, this move is duplicated work, so we should wait until all the sharding is done before moving anything. This PR may slightly reduce the performance of the full optimizer state_dict, since it has to allocate more memory than it did without this PR, but the benchmark shows the additional memory allocation is light.
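
To illustrate the ordering change, here is a toy model in plain Python. It makes no PyTorch or FSDP calls, and it does not reproduce the code in this PR: `ToyTensor`, `CopyCounter`, `shard`, `eager` and `delayed` are hypothetical names introduced only for this sketch. It assumes one reading of the description, namely that copying the whole gathered tensor to CPU before sharding moves more data than copying only the local shard afterwards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyTensor:
    """Hypothetical stand-in for a tensor: a flat list plus a device tag."""
    data: List[float]
    device: str = "gpu"


class CopyCounter:
    """Counts how many elements are copied from the toy 'gpu' to 'cpu'."""

    def __init__(self) -> None:
        self.elements_copied = 0

    def to_cpu(self, t: ToyTensor) -> ToyTensor:
        if t.device == "cpu":
            return t  # already on CPU, nothing to copy
        self.elements_copied += len(t.data)
        return ToyTensor(list(t.data), "cpu")


def shard(t: ToyTensor, rank: int, world_size: int) -> ToyTensor:
    """Return this rank's contiguous slice (assumes even divisibility)."""
    size = len(t.data) // world_size
    return ToyTensor(t.data[rank * size:(rank + 1) * size], t.device)


def eager(gathered: ToyTensor, rank: int, world_size: int,
          counter: CopyCounter) -> ToyTensor:
    # Old ordering in this model: move everything to CPU, then shard.
    return shard(counter.to_cpu(gathered), rank, world_size)


def delayed(gathered: ToyTensor, rank: int, world_size: int,
            counter: CopyCounter) -> ToyTensor:
    # New ordering in this model: shard on device, then move only the shard.
    return counter.to_cpu(shard(gathered, rank, world_size))


# Pretend this is the result of an allgather across 4 ranks.
gathered = ToyTensor([float(i) for i in range(8)])
eager_counter, delayed_counter = CopyCounter(), CopyCounter()
a = eager(gathered, rank=1, world_size=4, counter=eager_counter)
b = delayed(gathered, rank=1, world_size=4, counter=delayed_counter)
print(eager_counter.elements_copied, delayed_counter.elements_copied)
```

Both orderings produce the same CPU-resident shard; only the amount of data copied differs. In the delayed ordering the gathered tensor also stays on the device for longer, which is consistent with the extra memory allocation noted above for the full optimizer state_dict.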
Differential Revision: [D39855912](https://our.internmc.facebook.com/intern/diff/D39855912/)
**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39855912/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85761
Approved by: https://github.com/rohan-varma