DeepSpeed
e5dd5501 - support bf16_optimizer moe expert parallel training and moe EP grad_scale/grad_norm fix (#5259)

- bf16 MoE expert parallelism (EP) requires different parameter partitions, which affects the data-parallel gradient allreduce, the ZeRO-1 parameter allgather, and the gradient-norm allreduce. The bf16_optimizer previously did not partition these groups correctly; this change fixes the partitioning and supports bf16 EP training.
- Fix the calculation of the MoE EP grad scale and grad norm for bf16 and fp16.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
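The core issue can be illustrated with a small sketch. The example below is hypothetical and not DeepSpeed's actual implementation: the helper names (`global_grad_norm`, `_local_sq_norm`), their signatures, and the process-group arguments (`dp_group`, `ep_dp_group`) are assumptions for illustration. It shows why the gradient-norm reduction must use the expert data-parallel group for expert parameters rather than the full data-parallel group: each partial squared norm is all-reduced over the group that actually shards those parameters, and the results are combined before taking the square root.

```python
# Minimal sketch, not DeepSpeed's implementation: the helper names, signatures,
# and process-group layout below are illustrative assumptions only.
import torch
import torch.distributed as dist


def _local_sq_norm(params):
    # Accumulate squared L2 norms in fp32, regardless of the bf16/fp16 grad dtype.
    return sum(float(p.grad.float().norm(2)) ** 2 for p in params if p.grad is not None)


def global_grad_norm(non_expert_params, expert_params, dp_group, ep_dp_group, device="cpu"):
    """Combine partial grad norms, reducing each over the group that shards it.

    Assumed (hypothetical) topology: this rank owns a disjoint shard of the
    dense grads spread across dp_group, and a disjoint shard of its experts'
    grads spread across ep_dp_group. Use device="cuda" when the backend is NCCL.
    """
    dense_sq = torch.tensor([_local_sq_norm(non_expert_params)], dtype=torch.float32, device=device)
    expert_sq = torch.tensor([_local_sq_norm(expert_params)], dtype=torch.float32, device=device)

    # Each partial squared norm must be summed over the group that actually
    # shards those parameters; reducing expert grads over the full DP group
    # (or vice versa) miscounts contributions, which is the grad_norm/grad_scale
    # mismatch this commit addresses for bf16/fp16 EP training.
    dist.all_reduce(dense_sq, op=dist.ReduceOp.SUM, group=dp_group)
    dist.all_reduce(expert_sq, op=dist.ReduceOp.SUM, group=ep_dp_group)

    return torch.sqrt(dense_sq + expert_sq).item()
```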