[SPMD] Upstream iter_move_grads_and_optimizers (#98785)
This PR upstreams `iter_move_grads_and_optimizer` which delay some of the gradients and the corresponding optimizer to the next iteration. D44512863(credit to @lessw2020 ) is the internal implementation, which is only good for the old _SPMD expansion. This PR changes the implmentation to use the new APIs.
Differential Revision: [D44836486](https://our.internmc.facebook.com/intern/diff/D44836486/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98785
Approved by: https://github.com/mrshenli