DeepSpeed
6b6d6418 - fix sequence parallel(Ulysses) grad scale for zero0 (#5555)

Commit
1 year ago
fix sequence parallel(Ulysses) grad scale for zero0 (#5555)

use dp_world_size for grad reduction, instead of seq_dp_world_size. Currently, for zero0, only sparse tensors use the correct world_size.

tiny model with sp=4 grad norm test:

grad_norm | step1 | step2 | step3 | step4 | step5 | step100
-- | -- | -- | -- | -- | -- | --
zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555
zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889
zero0(this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554
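The numbers in the grad norm test are consistent with the unpatched zero0 gradients being too small by a factor equal to the sequence-parallel degree (sp=4). The sketch below is only an illustration of that arithmetic, not the code in engine.py: it assumes that seq_dp_world_size equals dp_world_size times the sequence-parallel degree, and the helper names (`grad_norm`, `averaged_grads`), the toy gradient values, and `dp_world_size = 2` are hypothetical.

```python
import math

def grad_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))

def averaged_grads(summed_grads, divisor):
    """Turn a sum-reduced gradient into an average by dividing by a world size."""
    return [g / divisor for g in summed_grads]

# Hypothetical sizes: sp=4 matches the test in the commit message,
# dp_world_size=2 is made up for the example.
dp_world_size = 2
sp_world_size = 4
# Assumption: the combined group covers both data and sequence parallel ranks.
seq_dp_world_size = dp_world_size * sp_world_size

# Toy gradient after a sum reduction; the true averaged gradient is [3.0, 4.0].
summed = [3.0 * dp_world_size, 4.0 * dp_world_size]

correct = grad_norm(averaged_grads(summed, dp_world_size))        # divide by dp_world_size
too_small = grad_norm(averaged_grads(summed, seq_dp_world_size))  # divide by seq_dp_world_size
ratio = correct / too_small  # equals sp_world_size under the assumption above

# Values copied from the commit message table.
zero1 = [15.825, 16.646, 15.853, 16.159, 17.333, 15.555]
zero0_before_patch = [3.956, 4.161, 3.963, 4.040, 4.333, 3.889]

# Each unpatched zero0 norm is the zero1 norm divided by sp, up to rounding.
matches = all(
    abs(ref / sp_world_size - old) < 1e-3
    for ref, old in zip(zero1, zero0_before_patch)
)
print(correct, too_small, ratio, matches)
```

Under these assumptions, dividing by the larger world size shrinks every gradient, and therefore the gradient norm, by exactly the sequence-parallel degree, which is the gap between the zero1 and unpatched zero0 rows.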
  • deepspeed/runtime
    • File
      engine.py