fix #7188 (#7371)

When using DeepSpeed ZeRO-2 for my training task, the loss drops to 0 at the third step with a grad_norm of 1.414; the issue does not occur with ZeRO-3. This matches issue #7188. After a series of experiments, I identified the cause: there is a synchronization problem when the two ipg_buffers are swapped (double buffering). The issue was resolved after adding the missing synchronization.

Before:
![image](https://github.com/user-attachments/assets/981d0829-e15f-4899-ae2c-4eca16ef138d)

After:
![image](https://github.com/user-attachments/assets/8b6b8403-d5df-4aa8-b573-195b9ee1fdfb)

Signed-off-by: vinceliu <lpnpcs@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
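To illustrate the class of bug being described, here is a minimal sketch of double buffering with an explicit synchronization point before a buffer is read back. All names (`DoubleIPGBuffer`, `accumulate`, `reduce_and_swap`) are hypothetical and the threads merely stand in for async device copies; this is not DeepSpeed's actual implementation, only a demonstration of why a swapped staging buffer must be synchronized before reuse.

```python
import threading
import time

class DoubleIPGBuffer:
    """Illustrative double-buffer manager (hypothetical, not DeepSpeed code).

    Gradients are copied asynchronously into the active buffer while the
    other buffer is free to be reduced; the key point is waiting for the
    in-flight copy before reading or reusing a buffer.
    """

    def __init__(self):
        self.buffers = [[], []]       # two gradient staging buffers
        self.pending = [None, None]   # in-flight async copy per buffer
        self.index = 0                # active buffer for incoming grads

    def accumulate(self, grads):
        """Launch a non-blocking copy of grads into the active buffer
        (stands in for an async copy issued on a side stream)."""
        idx = self.index

        def work():
            time.sleep(0.01)          # simulate copy latency
            self.buffers[idx].extend(grads)

        t = threading.Thread(target=work)
        t.start()
        self.pending[idx] = t

    def reduce_and_swap(self):
        """Reduce the active buffer, then swap to the other one.

        The join() is the crucial synchronization: without it, the reduce
        can observe a half-filled buffer, which is the kind of race the
        commit message describes when the two buffers are swapped.
        """
        idx = self.index
        if self.pending[idx] is not None:
            self.pending[idx].join()  # analogous to a stream synchronize
            self.pending[idx] = None
        total = sum(self.buffers[idx])
        self.buffers[idx].clear()
        self.index ^= 1               # hand off to the other buffer
        return total

buf = DoubleIPGBuffer()
buf.accumulate([1.0, 2.0])
print(buf.reduce_and_swap())  # 3.0
buf.accumulate([3.0, 4.0])
print(buf.reduce_and_swap())  # 7.0
```

If the `join()` is removed, `sum` can run before the background copy finishes and silently produce 0.0, mirroring a loss/grad_norm that collapses only intermittently.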