Fix sequence parallel (Ulysses) grad scale for zero0 (#5555)
Use dp_world_size, instead of seq_dp_world_size, to scale gradients during grad reduction.
Currently, for zero0, only sparse tensors use the correct world_size.
Grad norm test on a tiny model with sp=4 (without this patch, zero0 grad norms are about 1/4 of the zero1 values):
grad_norm | step1 | step2 | step3 | step4 | step5 | step100
-- | -- | -- | -- | -- | -- | --
zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555
zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889
zero0(this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554
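
The factor of 4 between the unpatched zero0 row and the zero1 row matches sp=4. A minimal Python sketch of the arithmetic follows; it is not DeepSpeed code. The helper `reduced_grad_norm`, the `dp=2` setting, and the per-rank gradient values are hypothetical and chosen only for illustration. The sketch assumes gradients are summed over all dp*sp ranks and then divided by either seq_dp_world_size or dp_world_size.

```python
import math


def reduced_grad_norm(per_rank_grads, divisor):
    """Sum per-rank gradient vectors, divide by `divisor`, return the L2 norm."""
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    return math.sqrt(sum((g / divisor) ** 2 for g in summed))


# Hypothetical setup: dp is assumed; sp=4 matches the test in this message.
dp, sp = 2, 4
seq_dp_world_size = dp * sp
dp_world_size = dp

# Made-up per-rank gradients, one small vector per rank.
per_rank_grads = [[0.5 * (r + 1), -0.25 * (r + 1)] for r in range(seq_dp_world_size)]

norm_seq_dp = reduced_grad_norm(per_rank_grads, seq_dp_world_size)
norm_dp = reduced_grad_norm(per_rank_grads, dp_world_size)
ratio = norm_dp / norm_seq_dp  # the two divisors differ by a factor of sp

# Grad norms copied from the table above (zero1 vs. zero0 without the patch).
zero1 = [15.825, 16.646, 15.853, 16.159, 17.333, 15.555]
zero0_unpatched = [3.956, 4.161, 3.963, 4.040, 4.333, 3.889]
table_ratios = [a / b for a, b in zip(zero1, zero0_unpatched)]

print(ratio)
print([round(r, 3) for r in table_ratios])
```

Each ratio computed from the table is within rounding error of sp, which is consistent with the unpatched zero0 path dividing by seq_dp_world_size where dp_world_size was intended.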