DeepSpeed
15ad92b4 - Fix ping-pong buffer index reset and removing redundant stream sync (#7805)

Commit
63 days ago
Fix ping-pong buffer index reset and removing redundant stream sync (#7805)

Fix #7804 and #7188

After investigating the code in `deepspeed/runtime/zero/stage_1_and_2.py`, I have identified the root cause. The regression regarding communication overlap was introduced in PR #7371 (https://github.com/deepspeedai/DeepSpeed/pull/7371). While the additional two-stream synchronization in that PR fixes gradient corruption, it effectively disables the overlapping behavior.

The underlying issue causing the gradient corruption (which #7371 attempted to fix) was actually introduced in PR #6993 (https://github.com/deepspeedai/DeepSpeed/pull/6993). In that PR, `bucket.clear()` incorrectly resets the ping-pong buffer index to 0 at the end of `reduce_ipg_grads`. This logic disrupts the buffer index swapping mechanism within `reduce_independent_p_g_buckets_and_remove_grads`.

To fix this, L121 in `deepspeed/runtime/zero/stage_1_and_2.py` should be removed to prevent resetting the buffer index. Additionally, the stream synchronization logic introduced in #7371 should be removed to restore the `overlap_comm=True` functionality.

---------

Signed-off-by: szlent <metarufolds@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
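The interaction between the index reset and the buffer swap described above can be illustrated with a small, self-contained simulation. This is only a sketch and not DeepSpeed code: `PingPongBucket`, `simulate`, and the `reset_index` flag are hypothetical names, and the model assumes that each flush starts an asynchronous reduce on the current buffer, calls `clear()`, and then swaps the index, with only the most recent reduce still in flight. Under those assumptions, resetting the index inside `clear()` makes later writes land in the buffer that is still being reduced.

```python
class PingPongBucket:
    """Hypothetical stand-in for a gradient bucket backed by two buffers."""

    def __init__(self):
        self.index = 0      # which of the two buffers receives new gradients
        self.pending = []   # gradients queued for the next reduction

    def clear(self, reset_index):
        self.pending.clear()
        if reset_index:
            # Models the behavior the commit message identifies as the bug.
            self.index = 0


def simulate(num_flushes, reset_index):
    """Count writes that target the buffer an in-flight reduce still reads."""
    bucket = PingPongBucket()
    in_flight = None  # buffer index used by the most recent async reduce
    collisions = 0
    for step in range(num_flushes):
        # New gradients are copied into the current buffer.
        if bucket.index == in_flight:
            collisions += 1
        bucket.pending.append(step)

        # Flush: the (simulated) async reduce starts on the current buffer,
        # the bucket is cleared, and the index is swapped (assumed order).
        in_flight = bucket.index
        bucket.clear(reset_index)
        bucket.index = 1 - bucket.index
    return collisions


print(simulate(4, reset_index=True))   # 2: writes reuse the in-flight buffer
print(simulate(4, reset_index=False))  # 0: buffers alternate 0, 1, 0, 1
```

In this toy model, the extra stream synchronization from #7371 would correspond to waiting for `in_flight` to finish before every write, which hides the collisions but also serializes communication and computation. Leaving the index untouched in `clear()` avoids the collisions without that wait.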