[FSDP][optim_state_dict] Ensure correct devices for tensors when doing all_gather (#92992)
When doing `_all_gather_optim_state`, we need to ensure that `step` tensors are on CPU and all other tensors are on GPU. This PR adds the logic to enforce that device placement.
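
For illustration only, the placement rule can be sketched as follows. This is not the code from this PR: `place_optim_state` is a hypothetical helper, the state keys (`exp_avg`, `exp_avg_sq`) are assumed Adam-style names, and the sketch falls back to CPU when no GPU is available.

```python
import torch


def place_optim_state(state, compute_device):
    """Hypothetical helper: return a copy of one parameter's optimizer state
    with `step` on CPU and every other tensor on `compute_device`."""
    placed = {}
    for name, value in state.items():
        if not torch.is_tensor(value):
            # Non-tensor entries (e.g. a Python int step) pass through unchanged.
            placed[name] = value
        elif name == "step":
            # `step` tensors stay on CPU.
            placed[name] = value.cpu()
        else:
            # All other tensor state goes to the compute device.
            placed[name] = value.to(compute_device)
    return placed


# Assumption: use the GPU when present, otherwise run the sketch on CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
state = {
    "step": torch.tensor(10.0),
    "exp_avg": torch.zeros(4),
    "exp_avg_sq": torch.zeros(4),
}
placed = place_optim_state(state, device)
```

The helper returns a new dict rather than mutating `state` in place, so the caller's original tensors are left where they were.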
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92992
Approved by: https://github.com/fduwjj