Fix ping-pong buffer index reset and remove redundant stream sync (#7805)
Fix #7804 and #7188
After investigating the code in
`deepspeed/runtime/zero/stage_1_and_2.py`, I have identified the root
cause of both issues. The communication overlap regression was
introduced in PR #7371
(https://github.com/deepspeedai/DeepSpeed/pull/7371). While the
additional two-stream synchronization in that PR fixes gradient
corruption, it effectively disables the overlapping behavior.
The underlying gradient corruption (which #7371 attempted to fix) was
actually introduced in PR #6993
(https://github.com/deepspeedai/DeepSpeed/pull/6993). In that PR,
`bucket.clear()` incorrectly resets the ping-pong buffer index to 0 at
the end of `reduce_ipg_grads`. This reset disrupts the buffer index
swapping mechanism within
`reduce_independent_p_g_buckets_and_remove_grads`.
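
The effect of the reset can be illustrated with a small toy model. This
is only a sketch, not DeepSpeed code: `ToyBucket` and `buffers_used` are
hypothetical names, and the ordering (clear at the end of the reduce
step, then swap) is an assumption based on the description above.

```python
class ToyBucket:
    """Hypothetical stand-in for a bucket with two ping-pong buffers."""

    def __init__(self, reset_index_on_clear):
        self.index = 0  # which of the two buffers is filled next
        self.reset_index_on_clear = reset_index_on_clear

    def clear(self):
        # Models the clear at the end of the reduce step. With the flag
        # set, it also resets the buffer index, as described above.
        if self.reset_index_on_clear:
            self.index = 0


def buffers_used(num_flushes, reset_index_on_clear):
    """Return the buffer index used by each successive reduction."""
    bucket = ToyBucket(reset_index_on_clear)
    used = []
    for _ in range(num_flushes):
        used.append(bucket.index)        # reduction launched on this buffer
        bucket.clear()                   # end of the reduce step
        bucket.index = 1 - bucket.index  # swap performed by the caller
    return used


buggy = buffers_used(4, reset_index_on_clear=True)
fixed = buffers_used(4, reset_index_on_clear=False)
print(buggy)  # [0, 1, 1, 1]
print(fixed)  # [0, 1, 0, 1]
```

Under these assumptions, the reset followed by the swap always lands on
buffer 1, so consecutive reductions reuse the same buffer while an
earlier asynchronous reduction on it may still be in flight. Without the
reset, the buffers alternate as intended.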
To fix this, this PR removes L121 in
`deepspeed/runtime/zero/stage_1_and_2.py` so that the buffer index is no
longer reset. It also removes the stream synchronization logic
introduced in #7371 to restore the `overlap_comm=True` functionality.
---------
Signed-off-by: szlent <metarufolds@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>