Support bf16_optimizer MoE expert parallel training and fix MoE EP grad_scale/grad_norm (#5259)
- bf16 MoE expert parallelism (EP) requires different parameter partitions, which affects the DP gradient allreduce, the ZeRO-1 parameter allgather, and the gradient-norm allreduce. Currently the bf16_optimizer does not partition these groups correctly; fix the partitioning so bf16 training is supported (see the partitioning sketch after this list).
- Fix the calculation of the MoE EP grad scale and grad norm for bf16 and fp16 (see the grad-norm sketch after this list).
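
A rough illustration of the partitioning issue (a minimal sketch, not the bf16_optimizer code; the `allreduce` tag and the process-group arguments are assumptions for this example):

```python
def partition_by_expert_parallelism(params, dp_group, expert_dp_group):
    """Return (params, process_group) pairs for dense and expert parameters.

    Dense params are replicated over the full data-parallel group, so their
    grads are reduced and their ZeRO-1 shards are allgathered over dp_group.
    Expert params are replicated only over the smaller expert data-parallel
    group, so reusing dp_group for them would mix gradients of different
    experts; they need their own partition over expert_dp_group.
    """
    # Assumed convention: expert params are tagged with allreduce=False.
    expert_params = [p for p in params if getattr(p, "allreduce", True) is False]
    dense_params = [p for p in params if getattr(p, "allreduce", True)]
    return [
        (dense_params, dp_group),
        (expert_params, expert_dp_group),
    ]
```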
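
And a hedged sketch of how the global grad norm can be assembled once the partitions above are correct. It assumes ZeRO-1 style ownership, i.e. every gradient element is owned by exactly one rank (dense grads sharded over the full DP group, each local expert's grads sharded over its expert-DP group), so a single SUM allreduce of the local squared norms does not double count any expert:

```python
import torch
import torch.distributed as dist

def global_grad_norm_sketch(owned_dense_grads, owned_expert_grads, dp_group):
    """Global L2 grad norm for MoE EP under the assumed partitioning."""
    device = (list(owned_dense_grads) + list(owned_expert_grads))[0].device
    local_sq = torch.zeros(1, device=device)
    # Accumulate squared norms of only the grad slices this rank owns.
    for g in list(owned_dense_grads) + list(owned_expert_grads):
        local_sq += g.detach().float().norm(2) ** 2
    # Each grad element is owned by exactly one rank, so one SUM over the
    # full data-parallel group counts every element exactly once.
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM, group=dp_group)
    return torch.sqrt(local_sq).item()
```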
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>