[FSDP] Speed up first iter order check (#96146)
For a tensor on GPU, moving it to CPU once and operating on it there is faster than reading it element by element, since each element read triggers its own GPU-to-CPU copy. The relevant tensor in this case is `world_num_valid_indices`.
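
As a rough illustration of the difference (a minimal sketch, not the actual FSDP code: the helper names `all_equal_per_element` and `all_equal_single_copy` are hypothetical, and the tensor contents are made up), the two access patterns look like this:

```python
import torch

# Fall back to CPU so the sketch still runs on machines without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for `world_num_valid_indices`; the values are illustrative.
world_num_valid_indices = torch.tensor([3, 3, 3, 3], device=device)


def all_equal_per_element(t: torch.Tensor) -> bool:
    # Slow pattern: every `.item()` call on a GPU tensor is a separate
    # device-to-host copy with its own synchronization.
    first = t[0].item()
    return all(t[i].item() == first for i in range(t.numel()))


def all_equal_single_copy(t: torch.Tensor) -> bool:
    # Fast pattern: one device-to-host copy, then plain Python comparisons.
    values = t.cpu().tolist()
    return all(v == values[0] for v in values)


print(all_equal_per_element(world_num_valid_indices))
print(all_equal_single_copy(world_num_valid_indices))
```

Both helpers return the same result; on a GPU tensor the second one pays for a single transfer instead of one per element.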
This closes https://github.com/pytorch/pytorch/issues/95728.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96146
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma