Fix convergence issues in TP topologies caused by an incorrect grad_norm. (#5411)
Some users have observed that changing the TP topology during MoE training
can interfere with experiments, and reported similar issues:
https://github.com/microsoft/Megatron-DeepSpeed/issues/151
https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files
We found a grad_norm calculation error after enabling TP. The error
occurs because the flattened grad of a params group is used, and a group
can contain both non-TP and TP parameters, so a single attribute cannot
determine whether the flattened grad should contribute to the norm. The
current code logic assumes all params are non-TP, so only tp_rank 0's
grads participate in the grad_norm computation; the other TP ranks
contribute a grad_norm_sum of 0. Comparing TP=1 and TP=4, we measured a
grad_norm difference of approximately 2x (sqrt(4)), which matches the
issues linked above. This problem should also affect dense models.
bf16 avoids this problem because the params_group grad is not flattened
in that path.
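The effect can be reproduced with a small numerical sketch. This is a hypothetical NumPy illustration (invented shapes and variable names, not the actual Megatron-DeepSpeed code): one logical gradient is sharded across TP ranks, and the buggy norm counts only tp_rank 0's shard.

```python
import numpy as np

# Hypothetical illustration of the reported bug, assuming TP=4 and an
# evenly sharded gradient; not the actual Megatron-DeepSpeed implementation.
rng = np.random.default_rng(0)
tp_size = 4
full_grad = rng.standard_normal(4096)
shards = np.split(full_grad, tp_size)  # shard i lives on tp_rank i

# Correct grad_norm: every TP rank contributes its shard's sum of squares,
# which is reduced across the TP group before taking the sqrt.
correct_norm = np.sqrt(sum(np.sum(s ** 2) for s in shards))

# Buggy grad_norm: all params are treated as non-TP (replicated), so only
# tp_rank 0's shard is counted; the other ranks contribute a sum of 0.
buggy_norm = np.sqrt(np.sum(shards[0] ** 2))

# For statistically similar shards the ratio approaches sqrt(tp_size) = 2,
# matching the ~2x gap observed between the TP=1 and TP=4 runs.
print(correct_norm / buggy_norm)
```

For i.i.d. gradient entries, each shard carries roughly 1/tp_size of the total sum of squares, so the under-counted norm is smaller by about sqrt(tp_size), consistent with the measured gap.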
We tested the loss curves on a 1.3B model. The inconsistency gap should
grow as the TP size increases.
With this change: 1.3B, EP=4, TP=4 and TP=1, fp16, mbs=1, gbs=16

Without this change: 1.3B, EP=4, TP=4 and TP=1, fp16, mbs=1, gbs=16

---------
Co-authored-by: Conglong Li <conglong.li@gmail.com>