Support bf16_optimizer MoE expert parallel training and fix MoE EP grad_scale/grad_norm (#5259)
- bf16 MoE expert parallelism (EP) requires different parameter partitions, which affects the DP gradient allreduce, the ZeRO-1 parameter allgather, and the gradient-norm allreduce. Currently the bf16_optimizer does not partition these groups correctly; fix the partitioning so bf16 training is supported (see the partitioning sketch after this list).
- Fix the calculation of the MoE EP grad scale and grad norm for bf16 and fp16 (see the grad-norm sketch after this list).
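
A rough illustration of the partitioning issue (a minimal sketch, not the bf16_optimizer code; the `allreduce` tag and the process-group arguments are assumptions for this example):

```python
def partition_by_expert_parallelism(params, dp_group, expert_dp_group):
    """Return (params, process_group) pairs for dense and expert parameters.

    Dense params are replicated over the full data-parallel group, so their
    grads are reduced and their ZeRO-1 shards are allgathered over dp_group.
    Expert params are replicated only over the smaller expert data-parallel
    group, so reusing dp_group for them would mix gradients of different
    experts; they need their own partition over expert_dp_group.
    """
    # Assumed convention: expert params are tagged with allreduce=False.
    expert_params = [p for p in params if getattr(p, "allreduce", True) is False]
    dense_params = [p for p in params if getattr(p, "allreduce", True)]
    return [
        (dense_params, dp_group),
        (expert_params, expert_dp_group),
    ]
```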
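
And a hedged sketch of how the global grad norm can be assembled once the partitions above are correct. It assumes ZeRO-1 style ownership, i.e. every gradient element is owned by exactly one rank (dense grads sharded over the full DP group, each local expert's grads sharded over its expert-DP group), so a single SUM allreduce of the local squared norms does not double count any expert:

```python
import torch
import torch.distributed as dist

def global_grad_norm_sketch(owned_dense_grads, owned_expert_grads, dp_group):
    """Global L2 grad norm for MoE EP under the assumed partitioning."""
    device = (list(owned_dense_grads) + list(owned_expert_grads))[0].device
    local_sq = torch.zeros(1, device=device)
    # Accumulate squared norms of only the grad slices this rank owns.
    for g in list(owned_dense_grads) + list(owned_expert_grads):
        local_sq += g.detach().float().norm(2) ** 2
    # Each grad element is owned by exactly one rank, so one SUM over the
    # full data-parallel group counts every element exactly once.
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=dp_group)
    return torch.sqrt(local_sq).item()
```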
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>