[FSDP] Speed up first iter order check (part 2) (#96220)
For a tensor on GPU, moving it to CPU once and operating on it there is faster than moving it element by element from GPU to CPU. This follow-up applies the same change to `world_indices`.
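
A minimal sketch of the two access patterns, not the actual FSDP code: `indices` is a hypothetical stand-in for a tensor like `world_indices`, and the helper names are illustrative only. The script falls back to CPU when no GPU is available, in which case both paths are equivalent and only the shape of the code is shown.

```python
import torch

# Use the GPU when present; otherwise fall back to CPU so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the indices checked on the first iteration.
indices = torch.arange(8, device=device)


def read_elementwise(t: torch.Tensor) -> list:
    # Each .item() call on a GPU tensor issues its own device-to-host copy
    # and synchronization, so the cost grows with the number of elements.
    return [t[i].item() for i in range(t.numel())]


def read_bulk(t: torch.Tensor) -> list:
    # A single device-to-host copy, after which all work happens on CPU.
    return t.cpu().tolist()


elementwise = read_elementwise(indices)
bulk = read_bulk(indices)
```

Both helpers return the same values; the difference is only in how many device-to-host transfers are issued.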
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96220
Approved by: https://github.com/zhaojuanmao