[SPMD] Introduce schedule_comm_wait (#98578)
`schedule_comm_wait` schedules the `wait_tensor` ops as late as possible. Note that this optimization does not currently reorder the computation ops. For `foreach`-based optimizers, we observe that reordering the computation ops is required to achieve good performance.
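As a rough illustration of the idea (not the code in this PR), the sketch below models a graph as a flat list of ops and moves each wait op to just before its first consumer, leaving every other op in its original relative order. The `Op` type, the `delay_waits` helper, and the op names are all hypothetical, and treating "as late as possible" as "just before the first consumer" is an assumption made for the example.

```python
from typing import List, NamedTuple, Tuple


class Op(NamedTuple):
    # Hypothetical toy representation of a graph node, not a torch.fx Node.
    name: str
    kind: str  # "comm", "wait", or "compute"
    inputs: Tuple[str, ...]


def delay_waits(ops: List[Op]) -> List[Op]:
    """Move each wait op to just before its first consumer.

    Non-wait ops keep their relative order, mirroring the note that
    computation ops are not reordered. Assumes waits are only consumed
    by non-wait ops.
    """
    waits = [op for op in ops if op.kind == "wait"]
    scheduled = [op for op in ops if op.kind != "wait"]
    for wait in waits:
        # Index of the first op that reads the wait's output; if nothing
        # reads it, the wait goes to the end of the schedule.
        idx = next(
            (i for i, op in enumerate(scheduled) if wait.name in op.inputs),
            len(scheduled),
        )
        scheduled.insert(idx, wait)
    return scheduled


# Two collectives, each followed immediately by its wait in the input order.
ops = [
    Op("ar0", "comm", ("g0",)),
    Op("w0", "wait", ("ar0",)),
    Op("ar1", "comm", ("g1",)),
    Op("w1", "wait", ("ar1",)),
    Op("upd0", "compute", ("w0",)),
    Op("upd1", "compute", ("w1",)),
]

print([op.name for op in delay_waits(ops)])
# ['ar0', 'ar1', 'w0', 'upd0', 'w1', 'upd1']
```

In this toy schedule both collectives are issued before either wait, and `w1` no longer sits in front of `upd0`, while `upd0` and `upd1` stay in their original order.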
Differential Revision: [D44761487](https://our.internmc.facebook.com/intern/diff/D44761487/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98578
Approved by: https://github.com/mrshenli