Fix overlap communication of ZeRO stage 1 and 2 (#5606)
`deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor`
only makes the reduction stream wait for the default stream. This is fine
when computation takes longer than communication, but when communication
takes longer, the ipg_buffer can be overwritten before the communication
that reads it has completed.

The simplest fix is to also make the default stream wait for the
reduction stream at the **same point**. For example, at point 1, the
`reduction stream` needs to wait for '2', so we add a wait_stream that
makes the `reduction stream` wait for the `default stream`. Likewise, the
`default stream` needs to wait for 'A', so we add a wait_stream that
makes the `default stream` wait for the `reduction stream` before 'B'.
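
The ordering argument above can be checked with a toy timeline model. This
is only a sketch in plain Python, not DeepSpeed code: `SimStream` and
`count_races` are hypothetical names, the durations are arbitrary, and the
model assumes two ipg buffers that are filled alternately. It counts how
often the default stream starts writing a buffer while the previous
reduction of that same buffer is still running.

```python
from dataclasses import dataclass


@dataclass
class SimStream:
    """Toy stand-in for a device stream: work runs in enqueue order."""
    ready: float = 0.0  # time at which all enqueued work has finished

    def enqueue(self, duration):
        # Schedule work after everything already queued on this stream.
        start = self.ready
        self.ready = start + duration
        return start, self.ready

    def wait_stream(self, other):
        # Later work on this stream starts only after the work already
        # enqueued on `other` has finished. The host is never blocked.
        self.ready = max(self.ready, other.ready)


def count_races(compute_time, comm_time, steps=6, two_way_wait=True):
    """Count buffer writes that start before the previous reduction
    of the same buffer has finished."""
    default, reduction = SimStream(), SimStream()
    reduce_end = {}  # buffer index -> end time of its latest reduction
    races = 0
    for step in range(steps):
        buf = step % 2  # assumption: two buffers used alternately
        write_start, _ = default.enqueue(compute_time)
        if write_start < reduce_end.get(buf, 0.0):
            races += 1  # buffer rewritten while still being reduced
        # Existing behavior: the reduction waits for the gradients.
        reduction.wait_stream(default)
        if two_way_wait:
            # The fix: computation waits for earlier reductions.
            default.wait_stream(reduction)
        _, reduce_end[buf] = reduction.enqueue(comm_time)
    return races
```

Under these assumptions, the one-way wait is race-free only when
`compute_time >= comm_time`, while the two-way wait is race-free in both
regimes. In PyTorch, the corresponding call on real streams is
`torch.cuda.Stream.wait_stream`.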

Compared with the modification in
https://github.com/microsoft/DeepSpeed/issues/5523, wait_stream does not
cause host synchronization.
Compared with the modification in
https://github.com/microsoft/DeepSpeed/issues/5545, this modification is
simpler and the logic is the same: each stream waits only for what it
needs to wait for.
---
With this modification, the losses of Qwen-1.5 with and without
overlap_comm are identical.

---
By contrast, without this modification there is an obvious gap with a
small sequence length, which means a short computation time.

Co-authored-by: gp513 <guopeng34@huawei.com>
Co-authored-by: CurryRice233 <nmeia@qq.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>