fix #7188 (#7371)
When using DeepSpeed ZeRO-2 for my training task, the loss becomes 0 at the third step with a grad_norm of 1.414. The issue does not occur under ZeRO-3, and it matches the report in #7188. After a series of experiments, I identified the cause: there is a synchronization problem in the double ipg_buffer swapping, i.e., a buffer can be reused for the next step's gradients before the in-flight asynchronous reduction that reads it has completed. The issue was resolved after adding the missing synchronization.
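To illustrate the failure mode, here is a minimal, self-contained sketch (not DeepSpeed's actual code); the names `ipg_buffers` and `async_reduce` are hypothetical stand-ins that mimic the double-buffered independent-partition-gradient pattern, with a thread standing in for the asynchronous reduction kernel:

```python
# Illustrative sketch of the race in unsynchronized double buffering.
# With only two buffers, step 3 reuses the buffer from step 1; if the
# step-1 reduction is still in flight, it reads step-3 data instead.
import threading
import time

ipg_buffers = [[0.0] * 4, [0.0] * 4]  # two swap buffers, like ipg_buffer[0]/[1]
reduced = []

def async_reduce(buf):
    # Simulates a slow communication kernel that reads the buffer later.
    time.sleep(0.01)          # reduction still in flight...
    reduced.append(sum(buf))  # ...and reads whatever is in the buffer *now*

ipg_index = 0
for step in (1, 2, 3):
    buf = ipg_buffers[ipg_index]
    for i in range(len(buf)):
        buf[i] = float(step)  # "copy this step's gradients into the buffer"
    t = threading.Thread(target=async_reduce, args=(buf,))
    t.start()                 # launch the reduction asynchronously
    # BUG: no t.join() (no stream/comm synchronization) before the buffer
    # is swapped and eventually overwritten, so the step-1 reduction can
    # observe step-3 gradients -> corrupted reductions.
    ipg_index = 1 - ipg_index # swap to the other buffer

time.sleep(0.05)
print(reduced)  # step-1 result reflects step-3 data when the race hits
```

The fix corresponds to synchronizing (here, `t.join()`; in the real code, waiting on the reduction before the buffer is reused) so a buffer is never overwritten while a reduction is still reading it.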
Before the fix: [screenshot]

After the fix: [screenshot]

Signed-off-by: vinceliu <lpnpcs@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>