[FSDP] Optimizer states may be on CPU, copy them to GPU before gathering (#84708)
**Background**:
Optimizer states may not always be on GPUs. Examples include: 1) CPU offload is enabled, and 2) after the Lightning trainer's fit() has been called.
**What Does This PR Do?**
If states are not on GPUs, move them to GPUs before gathering the global states.
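The idea can be illustrated with a minimal sketch. This is not the code changed by this PR: the helper name `move_optim_state_to_device` and the sample state keys are hypothetical, chosen only to show tensors in a per-parameter optimizer state being copied to the compute device before any collective gathers them. The sketch falls back to CPU when no GPU is available so it stays runnable.

```python
import torch


def move_optim_state_to_device(state, device):
    """Return a shallow copy of a per-parameter optimizer state with
    every tensor value placed on `device` (hypothetical helper)."""
    moved = {}
    for name, value in state.items():
        if torch.is_tensor(value):
            # Tensor.to() returns the same tensor if it is already on `device`.
            moved[name] = value.to(device)
        else:
            # Non-tensor entries (e.g. an integer step count) are kept as-is.
            moved[name] = value
    return moved


# Assumption: use the GPU when present, otherwise stay on CPU.
target = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Example state as it might look with CPU offload enabled (illustrative keys).
cpu_state = {"step": 3, "exp_avg": torch.zeros(4), "exp_avg_sq": torch.ones(4)}
gpu_state = move_optim_state_to_device(cpu_state, target)
```

The original dictionary is left untouched, so the caller's state can remain on CPU after the gathered copy has been built.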
Differential Revision: [D39332300](https://our.internmc.facebook.com/intern/diff/D39332300/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84708
Approved by: https://github.com/awgu